This section gives an overview of the operations that can be performed on
a given set of metadata obtained particularly from data-rich objects such
as those obtained from curatedTCGAData
. There are several operations that
work with microRNA, methylation, mutation, and assays that have gene symbol
annotations.
CpGtoRanges
Using the methylation annotations in
IlluminaHumanMethylation450kanno.ilmn12.hg19
and the minfi
package, we
look up CpG probes and convert to genomic coordinates with CpGtoRanges
.
The function provides two assays, one with mapped probes and the other with
unmapped probes. Excluding unmapped probes can be done by setting the
unmapped
argument to FALSE
. This will run for both types of methylation
data (27k and 450k).
methcoad <- CpGtoRanges(coad)
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 535 sampleMap rows not in names(experiments)
mirToRanges
microRNA assays obtained from curatedTCGAData
have annotated sequences
that can be converted to genomic ranges using the mirbase.db
package.
The function looks up all sequences and converts them to (‘hg19’) ranges.
For those rows that cannot be found, an ‘unranged’ assay is introduced
in the resulting MultiAssayExperiment object.
mircoad <- mirToRanges(coad)
## Warning in (function (seqlevels, genome, new_style) : cannot switch some of
## hg19's seqlevels from UCSC to NCBI style
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 221 sampleMap rows not in names(experiments)
qreduceTCGA
The qreduceTCGA
function converts RaggedExperiment
mutation data objects
to RangedSummarizedExperiment
using org.Hs.eg.db
and the qreduceTCGA
utility function from RaggedExperiment
to summarize ‘silent’ and ‘non-silent’
mutations based on a ‘Variant_Classification’ metadata column in the original
object.
It uses ‘hg19’ transcript database (‘TxDb’) package internally to summarize
regions using qreduceAssay
. The current genome build (‘hg18’) in the data
must be translated to ‘hg19’.
In this example, we first set the appropriate build name in the mutation
dataset COAD_Mutation-20160128
according to the
NCBI website
and we then use seqlevelsStyle
to match the UCSC
style in the chain.
rag <- "COAD_Mutation-20160128"
# add the appropriate genome annotation
genome(coad[[rag]]) <- "NCBI36"
# change the style to UCSC
seqlevelsStyle(rowRanges(coad[[rag]])) <- "UCSC"
# inspect changes
seqlevels(rowRanges(coad[[rag]]))
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
## [19] "chr19" "chr20" "chr21" "chr22" "chrX" "chrY"
genome(coad[[rag]])
## chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11
## "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18"
## chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22
## "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18" "hg18"
## chrX chrY
## "hg18" "hg18"
Now we use liftOver
from rtracklayer
to translate ‘hg18’ builds
to ‘hg19’ using the downloaded chain file.
lifturl <-
"http://hgdownload.cse.ucsc.edu/goldenpath/hg18/liftOver/hg18ToHg19.over.chain.gz"
bfc <- BiocFileCache()
qfile <- bfcquery(bfc, "18to19chain", exact = TRUE)[["rpath"]]
cfile <-
if (length(qfile) && file.exists(qfile)) {
bfcquery(bfc, "18to19chain", exact = TRUE)[["rpath"]]
} else {
bfcadd(bfc, "18to19chain", lifturl)
}
chainfile <- file.path(tempdir(), gsub("\\.gz", "", basename(cfile)))
R.utils::gunzip(cfile, destname = chainfile, remove = FALSE)
chain <- suppressMessages(
rtracklayer::import.chain(chainfile)
)
ranges19 <- rtracklayer::liftOver(rowRanges(coad[[rag]]), chain)
The same can be done to convert hg19
to hg38
(the same build that the
Genomic Data Commons uses) with the corresponding chain file:
liftchain <-
"http://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/hg19ToHg38.over.chain.gz"
bfc <- BiocFileCache()
q38file <- bfcquery(bfc, "19to38chain", exact = TRUE)[["rpath"]]
c38file <-
if (length(q38file) && file.exists(q38file)) {
bfcquery(bfc, "19to38chain", exact = TRUE)[["rpath"]]
} else {
bfcadd(bfc, "19to38chain", liftchain)
}
cloc38 <- file.path(tempdir(), gsub("\\.gz", "", basename(c38file)))
R.utils::gunzip(c38file, destname = cloc38, remove = FALSE)
chain38 <- suppressMessages(
rtracklayer::import.chain(cloc38)
)
## then use the liftOver function using the 'chain38' object
## as above
ranges38 <- rtracklayer::liftOver(unlist(ranges19), chain38)
This will give us a list of ranges, each element corresponding to a single row
in the RaggedExperiment
. We remove rows that had no matches in the liftOver
process and replace the ranges in the original RaggedExperiment
with the
replacement method. Finally, we put the RaggedExperiment
object back into the
MultiAssayExperiment
.
re19 <- coad[[rag]][as.logical(lengths(ranges19))]
ranges19 <- unlist(ranges19)
genome(ranges19) <- "hg19"
rowRanges(re19) <- ranges19
# replacement
coad[["COAD_Mutation-20160128"]] <- re19
rowRanges(re19)
## GRanges object with 62523 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr20 1552407-1552408 +
## [2] chr1 161736152-161736153 +
## [3] chr7 100685895 +
## [4] chr7 103824453 +
## [5] chr7 104783644 +
## ... ... ... ...
## [62519] chr9 36369716 +
## [62520] chr9 37692640 +
## [62521] chr9 6007456 +
## [62522] chrX 123785782 +
## [62523] chrX 51487184 +
## -------
## seqinfo: 24 sequences from hg19 genome; no seqlengths
Now that we have matching builds, we can finally run the qreduceTCGA
function.
coad <- qreduceTCGA(coad, keep.assay = TRUE)
##
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some of
## hg19's seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## Warning in .normarg_seqlevelsStyle(value): more than one seqlevels style
## supplied, using the 1st one only
## Warning in .normarg_seqlevelsStyle(value): cannot switch some of hg19's
## seqlevels from UCSC to NCBI style
symbolsToRanges
In the cases where row annotations indicate gene symbols, the symbolsToRanges
utility function converts genes to genomic ranges and replaces existing
assays with RangedSummarizedExperiment
objects. Gene annotations are given
as ‘hg19’ genomic regions.
symbolsToRanges(coad)
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some of
## hg19's seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## Warning in (function (seqlevels, genome, new_style) : cannot switch some of
## hg19's seqlevels from UCSC to NCBI style
## 'select()' returned 1:1 mapping between keys and columns
## Warning: 'experiments' dropped; see 'metadata'
## harmonizing input:
## removing 363 sampleMap rows not in names(experiments)
## A MultiAssayExperiment object of 11 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 11:
## [1] COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 136 columns
## [2] COAD_miRNASeqGene-20160128: SummarizedExperiment with 705 rows and 221 columns
## [3] COAD_Mutation-20160128: RaggedExperiment with 62523 rows and 154 columns
## [4] COAD_Methylation_methyl27-20160128: SummarizedExperiment with 27578 rows and 202 columns
## [5] COAD_Methylation_methyl450-20160128: SummarizedExperiment with 485577 rows and 333 columns
## [6] COAD_Mutation-20160128_simplified: RangedSummarizedExperiment with 22914 rows and 154 columns
## [7] COAD_CNASeq-20160128_simplified: RangedSummarizedExperiment with 22914 rows and 136 columns
## [8] COAD_mRNAArray-20160128_ranged: RangedSummarizedExperiment with 14214 rows and 172 columns
## [9] COAD_mRNAArray-20160128_unranged: SummarizedExperiment with 3600 rows and 172 columns
## [10] COAD_RNASeq2GeneNorm-20160128_ranged: RangedSummarizedExperiment with 17132 rows and 191 columns
## [11] COAD_RNASeq2GeneNorm-20160128_unranged: SummarizedExperiment with 3369 rows and 191 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save data to flat files