curatedTCGAData
helpersMultiAssayExperiment
objects from curatedTCGAData
sampleTables
: what sample types are present in the data?splitAssays
: separate the data from different tissue typesgetSubtypeMap
: manually curated molecular subtypesgetClinicalNames
: key “level 4” clinical & pathological datamergeColData
: expanding the colData
of a MultiAssayExperiment
sessionInfo
The TCGAutils
package completes a suite of Bioconductor packages for
convenient access, integration, and analysis of The Cancer Genome Atlas.
It includes:
0. helpers for working with TCGA through the Bioconductor packages
MultiAssayExperiment (for coordinated representation and
manipulation of multi-omits experiments) and curatedTCGAData,
which provides unrestricted TCGA data as MultiAssayExperiment
objects,
0. helpers for importing TCGA data as from flat data structures such as
data.frame
or DataFrame
read from delimited data structures provided by
the Broad Institute’s Firehose, and
0. functions for interpreting TCGA barcodes and for mapping between
barcodes and Universally Unique Identifiers (UUIDs).
if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("TCGAutils")
Required packages for this vignette:
library(TCGAutils)
library(curatedTCGAData)
library(MultiAssayExperiment)
library(RTCGAToolbox)
curatedTCGAData
helpersFunctions such as getSubtypeMap
and getClinicalNames
provide information
on data inside a MultiAssayExperiment object downloaded from
curatedTCGAData. sampleTables
and splitAssays
support useful
operations on these MultiAssayExperiment
objects.
MultiAssayExperiment
objects from curatedTCGAData
For demonstration we download part of the Colon Adenocarcinoma (COAD) dataset
usingcuratedTCGAData
via ExperimentHub
. This command will download any
data type that starts with CN*
such as CNASeq
:
coad <- curatedTCGAData::curatedTCGAData(diseaseCode = "COAD",
assays = "CN*", dry.run = FALSE)
For a list of all available data types, use dry.run = FALSE
and an
asterisk *
as the assay input value:
curatedTCGAData("COAD", "*")
## COAD_CNASNP
## "COAD_CNASNP-20160128.rda"
## COAD_CNASeq
## "COAD_CNASeq-20160128.rda"
## COAD_CNVSNP
## "COAD_CNVSNP-20160128.rda"
## COAD_GISTIC_AllByGene
## "COAD_GISTIC_AllByGene-20160128.rda"
## COAD_GISTIC_ThresholdedByGene
## "COAD_GISTIC_ThresholdedByGene-20160128.rda"
## COAD_Methylation1
## "COAD_Methylation_methyl27-20160128.rda"
## COAD_Methylation2
## "COAD_Methylation_methyl450-20160128.rda"
## COAD_Mutation
## "COAD_Mutation-20160128.rda"
## COAD_RNASeq2GeneNorm
## "COAD_RNASeq2GeneNorm-20160128.rda"
## COAD_RNASeqGene
## "COAD_RNASeqGene-20160128.rda"
## COAD_RPPAArray
## "COAD_RPPAArray-20160128.rda"
## COAD_mRNAArray
## "COAD_mRNAArray-20160128.rda"
## COAD_miRNASeqGene
## "COAD_miRNASeqGene-20160128.rda"
sampleTables
: what sample types are present in the data?The sampleTables
function gives a tally of available
samples in the dataset based on the TCGA barcode information.
sampleTables(coad)
## $`COAD_CNASNP-20160128`
##
## 01 02 06 10 11
## 449 1 1 373 90
##
## $`COAD_CNASeq-20160128`
##
## 01 10 11
## 68 55 13
##
## $`COAD_CNVSNP-20160128`
##
## 01 02 06 10 11
## 449 1 1 373 90
For reference in interpreting the sample type codes, see the sampleTypes
table:
data("sampleTypes")
sampleTypes
## Code Definition Short.Letter.Code
## 1 01 Primary Solid Tumor TP
## 2 02 Recurrent Solid Tumor TR
## 3 03 Primary Blood Derived Cancer - Peripheral Blood TB
## 4 04 Recurrent Blood Derived Cancer - Bone Marrow TRBM
## 5 05 Additional - New Primary TAP
## 6 06 Metastatic TM
## 7 07 Additional Metastatic TAM
## 8 08 Human Tumor Original Cells THOC
## 9 09 Primary Blood Derived Cancer - Bone Marrow TBM
## 10 10 Blood Derived Normal NB
## 11 11 Solid Tissue Normal NT
## 12 12 Buccal Cell Normal NBC
## 13 13 EBV Immortalized Normal NEBV
## 14 14 Bone Marrow Normal NBM
## 15 15 sample type 15 15SH
## 16 16 sample type 16 16SH
## 17 20 Control Analyte CELLC
## 18 40 Recurrent Blood Derived Cancer - Peripheral Blood TRB
## 19 50 Cell Lines CELL
## 20 60 Primary Xenograft Tissue XP
## 21 61 Cell Line Derived Xenograft Tissue XCL
## 22 99 sample type 99 99SH
splitAssays
: separate the data from different tissue typesTCGA datasets include multiple -omics for solid tumors, adjacent normal
tissues, blood-derived cancers and normals, and other tissue types, which may
be mixed together in a single dataset. The MultiAssayExperiment
object
generated here has one patient per row of its colData
, but each patient may
have two or more -omics profiles by any assay, whether due to assaying of
different types of tissues or to technical replication. splitAssays
separates
profiles from different tissue types (such as tumor and adjacent normal) into
different assays of the MultiAssayExperiment
by taking a vector of sample
codes, and partitioning the current assays into assays with an appended sample
code:
(tnmae <- splitAssays(coad, c("01", "11")))
## Selecting 'Primary Solid Tumor' samples
## Selecting 'Solid Tissue Normal' samples
## A MultiAssayExperiment object of 6 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 6:
## [1] 01_COAD_CNASNP-20160128: RaggedExperiment with 459490 rows and 449 columns
## [2] 01_COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 68 columns
## [3] 01_COAD_CNVSNP-20160128: RaggedExperiment with 90534 rows and 449 columns
## [4] 11_COAD_CNASNP-20160128: RaggedExperiment with 459490 rows and 90 columns
## [5] 11_COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 13 columns
## [6] 11_COAD_CNVSNP-20160128: RaggedExperiment with 90534 rows and 90 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
The MultiAssayExperiment package then provides functionality to
merge replicate profiles for a single patient (mergeReplicates()
), which
would now be appropriate but would not have been appropriate before
splitting different tissue types into different assays, because that would
average measurements from tumors and normal tissues.
MultiAssayExperiment
also defines the MatchedAssayExperiment
class, which
eliminates any profiles not present across all assays and ensures identical
ordering of profiles (columns) in each assay. In this example, it will match
tumors to adjacent normals in subsequent assays:
(matchmae <- as(tnmae, "MatchedAssayExperiment"))
## A MatchedAssayExperiment object of 6 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 6:
## [1] 01_COAD_CNASNP-20160128: RaggedExperiment with 459490 rows and 12 columns
## [2] 01_COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 12 columns
## [3] 01_COAD_CNVSNP-20160128: RaggedExperiment with 90534 rows and 12 columns
## [4] 11_COAD_CNASNP-20160128: RaggedExperiment with 459490 rows and 12 columns
## [5] 11_COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 12 columns
## [6] 11_COAD_CNVSNP-20160128: RaggedExperiment with 90534 rows and 12 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
Only about 12 participants have both a matched tumor and solid normal sample.
getSubtypeMap
: manually curated molecular subtypesPer-tumor subtypes are saved in the metadata
of the colData
slot of MultiAssayExperiment
objects downloaded from curatedTCGAData
.
These subtypes were manually curated from the supplemental tables of all
primary TCGA publications:
getSubtypeMap(coad)
## COAD_annotations COAD_subtype
## 1 Patient_ID patient
## 2 msi MSI_status
## 3 methylation_subtypes methylation_subtype
## 4 mrna_subtypes expression_subtype
## 5 histological_subtypes histological_type
getClinicalNames
: key “level 4” clinical & pathological dataThe curatedTCGAData
colData
contain hundreds of columns, obtained from
merging all unrestricted levels of clinical, pathological, and biospecimen data.
This function provides the names of “level 4” clinical/pathological variables,
which are the only ones provided by most other TCGA analysis tools.
Users may then use these variable names for subsetting or analysis, and may
even want to subset the colData
to only these commonly used variables.
getClinicalNames("COAD")
## [1] "years_to_birth"
## [2] "vital_status"
## [3] "days_to_death"
## [4] "days_to_last_followup"
## [5] "tumor_tissue_site"
## [6] "pathologic_stage"
## [7] "pathology_T_stage"
## [8] "pathology_N_stage"
## [9] "pathology_M_stage"
## [10] "gender"
## [11] "date_of_initial_pathologic_diagnosis"
## [12] "days_to_last_known_alive"
## [13] "radiation_therapy"
## [14] "histological_type"
## [15] "residual_tumor"
## [16] "number_of_lymph_nodes"
## [17] "race"
## [18] "ethnicity"
Warning: some names may not exactly match the colData
names in the object
due to differences in variable types. These variables are kept separate and
differentiated with x
and y
. For example, vital_status
in this case
corresponds to two different variables obtained from the pipeline. One variable
is interger type and the other character:
class(colData(coad)[["vital_status.x"]])
## [1] "integer"
class(colData(coad)[["vital_status.y"]])
## [1] "character"
table(colData(coad)[["vital_status.x"]])
##
## 0 1
## 355 102
table(colData(coad)[["vital_status.y"]])
##
## DECEASED LIVING
## 22 179
Such conflicts should be inspected in this manner, and conflicts resolved by choosing the more complete variable, or by treating any conflicting values as unknown (“NA”).
mergeColData
: expanding the colData
of a MultiAssayExperiment
This function merges a data.frame
or DataFrame
into the
colData
of an existing MultiAssayExperiment
object. It will match
column names and row names to do a full merge of both data sets. This
convenience function can be used, for example, to add subtype information
available for a subset of patients to the colData
. Here is an example on an
empty MultiAssayExperiment
just to demonstrate its usage:
mergeColData(MultiAssayExperiment(), data.frame())
## A MultiAssayExperiment object of 0 listed
## experiments with no user-defined names and respective classes.
## Containing an ExperimentList class object of length 0:
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
A few functions in the package accept either files or classes such as
data.frame
and FirehoseGISTIC
as input and return standard Bioconductor
classes.
makeGRangesListFromExonFiles
%??% Could you use the GenomicDataCommons library to get an example file instead, or show the GDC command that would be used to obtain the file?
The GRangesList
class from the GenomicRanges package is
suitable for grouping GRanges
vectors as a list, such as for grouping exons
by gene. In this example we use a legacy exon quantification file from the
Genomic Data Commons. We then use makeGRangesListFromExonFiles
to create a
GRangesList
from vectors of file paths and names (where necessary). Some
adjustments have been made to the file name for cross-platform compatibility
with Windows operating system limitations.
## Load example file found in package
pkgDir <- system.file("extdata", package = "TCGAutils", mustWork = TRUE)
exonFile <- list.files(pkgDir, pattern = "cation\\.txt$", full.names = TRUE)
exonFile
## [1] "/tmp/RtmpeKC7y6/Rinst2cb93605bba3/TCGAutils/extdata/bt.exon_quantification.txt"
## We add the original file prefix to query for the UUID and get the
## TCGAbarcode
filePrefix <- "unc.edu.32741f9a-9fec-441f-96b4-e504e62c5362.1755371."
## Add actual file name manually
makeGRangesListFromExonFiles(exonFile,
fileNames = paste0(filePrefix, basename(exonFile)))
## Parsed with column specification:
## cols(
## exon = col_character(),
## raw_counts = col_double(),
## median_length_normalized = col_double(),
## RPKM = col_double()
## )
## GRangesList object of length 1:
## $TCGA-AA-3678-01A-01R-0905-07
## GRanges object with 100 ranges and 3 metadata columns:
## seqnames ranges strand | raw_counts median_length_normalized
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr1 11874-12227 + | 4 0.4929178
## [2] chr1 12595-12721 + | 2 0.3412699
## [3] chr1 12613-12721 + | 2 0.3981481
## [4] chr1 12646-12697 + | 2 0.372549
## [5] chr1 13221-14409 + | 39 0.6329966
## ... ... ... ... . ... ...
## [96] chr1 881782-881925 - | 179 1
## [97] chr1 883511-883612 - | 151 1
## [98] chr1 883870-883983 - | 155 1
## [99] chr1 886507-886618 - | 144 1
## [100] chr1 887380-887519 - | 158 1
## RPKM
## <numeric>
## [1] 0.322476823123937
## [2] 0.449436202306589
## [3] 0.523655024705842
## [4] 1.09766149409494
## [5] 0.936104924316458
## ... ...
## [96] 35.4758096772073
## [97] 42.2492061354581
## [98] 38.8032966772158
## [99] 36.6932556597451
## [100] 32.2085244124429
##
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Note GRangesList
objects must be converted to RaggedExperiment
class to incorporate them into a MultiAssayExperiment
.
makeGRangesListFromCopyNumber
Other processed, genomic range-based data from TCGA data can be imported using
makeGRangesListFromCopyNumber
. This tab-delimited data file of copy number
alterations from bladder urothelial carcinoma (BLCA) was obtained from the
Genomic Data Commons and is included in TCGAUtils
as an example:
grlFile <- system.file("extdata", "blca_cnaseq.txt", package = "TCGAutils")
grl <- read.table(grlFile)
head(grl)
## Sample Chromosome Start End Num_Probes
## 1 TCGA-BL-A0C8-01A-11D-A10R-02 14 70362113 73912204 NA
## 2 TCGA-BL-A0C8-01A-11D-A10R-02 9 115609546 131133898 NA
## 5 TCGA-BL-A13I-01A-11D-A13U-02 13 19020028 49129100 NA
## 6 TCGA-BL-A13I-01A-11D-A13U-02 1 10208 246409808 NA
## 9 TCGA-BL-A13J-01A-11D-A10R-02 23 3119586 5636448 NA
## 10 TCGA-BL-A13J-01A-11D-A10R-02 7 10127 35776912 NA
## Segment_Mean
## 1 -0.182879931
## 2 0.039675162
## 5 0.002085552
## 6 -0.014224752
## 9 0.877072555
## 10 0.113873871
makeGRangesListFromCopyNumber(grl, split.field = "Sample")
## GRangesList object of length 116:
## $TCGA-BL-A0C8-01A-11D-A10R-02
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 14 70362113-73912204 *
## [2] 9 115609546-131133898 *
##
## $TCGA-BL-A13I-01A-11D-A13U-02
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## [1] 13 19020028-49129100 *
## [2] 1 10208-246409808 *
##
## $TCGA-BL-A13J-01A-11D-A10R-02
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## [1] 23 3119586-5636448 *
## [2] 7 10127-35776912 *
##
## ...
## <113 more elements>
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
makeGRangesListFromCopyNumber(grl, split.field = "Sample",
keep.extra.columns = TRUE)
## GRangesList object of length 116:
## $TCGA-BL-A0C8-01A-11D-A10R-02
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | Num_Probes Segment_Mean
## <Rle> <IRanges> <Rle> | <logical> <numeric>
## [1] 14 70362113-73912204 * | <NA> -0.182879930738387
## [2] 9 115609546-131133898 * | <NA> 0.0396751622235396
##
## $TCGA-BL-A13I-01A-11D-A13U-02
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | Num_Probes Segment_Mean
## [1] 13 19020028-49129100 * | <NA> 0.00208555197637913
## [2] 1 10208-246409808 * | <NA> -0.0142247519016688
##
## $TCGA-BL-A13J-01A-11D-A10R-02
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | Num_Probes Segment_Mean
## [1] 23 3119586-5636448 * | <NA> 0.877072555244314
## [2] 7 10127-35776912 * | <NA> 0.113873871106118
##
## ...
## <113 more elements>
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
makeSummarizedExperimentFromGISTIC
This function is only used for converting the FirehoseGISTIC
class of the
RTCGAToolbox package. It allows the user to obtain thresholded
by gene data, probabilities and peak regions.
tempDIR <- tempdir()
co <- getFirehoseData("COAD", clinical = FALSE, GISTIC = TRUE,
destdir = tempDIR)
## gdac.broadinstitute.org_COAD-TP.CopyNumber_Gistic2.Level_4.2016012800.0.0.tar.gz
selectType(co, "GISTIC")
## Dataset:COAD
## FirehoseGISTIC object, dim: 24776 454
class(selectType(co, "GISTIC"))
## [1] "FirehoseGISTIC"
## attr(,"package")
## [1] "RTCGAToolbox"
makeSummarizedExperimentFromGISTIC(co, "Peaks")
## class: RangedSummarizedExperiment
## dim: 66 451
## metadata(0):
## assays(1): ''
## rownames(66): 23 24 ... 65 66
## rowData names(10): rowRanges Unique.Name ... Amplitude.Threshold
## type
## colnames(451): TCGA-3L-AA1B-01A-11D-A36W-01
## TCGA-4N-A93T-01A-11D-A36W-01 ... TCGA-T9-A92H-01A-11D-A36W-01
## TCGA-WS-AB45-01A-11D-A40O-01
## colData names(0):
The TCGA project has generated massive amounts of data. Some data can be
obtained with Universally Unique IDentifiers (UUID) and other
data with TCGA barcodes. The Genomic Data Commons provides a JSON API for
mapping between UUID and barcode, but it is difficult for many people to
understand. TCGAutils
makes simple functions available for two-way
translation between vectors of these identifiers.
Here we translate the first two TCGA barcodes of the previous copy-number alterations dataset to UUID:
(xbarcode <- head(colnames(coad)[["COAD_CNASeq-20160128"]], 4L))
## [1] "TCGA-A6-2671-01A-01D-1405-02" "TCGA-A6-2671-10A-01D-1405-02"
## [3] "TCGA-A6-2674-01A-02D-1167-02" "TCGA-A6-2674-10A-01D-1167-02"
barcodeToUUID(xbarcode)
## submitter_id case_id
## 1 TCGA-A6-2671 565e2726-4942-4726-89d3-c5e3797f7204
## 2 TCGA-A6-2674 57cdaa1c-4e94-4a28-ab3b-300c0457555f
Here we have a known case UUID that we want to translate into a TCGA barcode.
UUIDtoBarcode("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", id_type = "case_id")
## case_id submitter_id
## 1 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 TCGA-B0-5117
In cases where we want to translate a known file UUID to the associated TCGA
patient barcode, we can use UUIDtoBarcode
. Optional barcode information can
be included for sample, portion/analyte, and plate/center by specifying the
end_point
argument.
UUIDtoBarcode("0001801b-54b0-4551-8d7a-d66fb59429bf",
id_type = "file_id", end_point = "center")
## file_id
## 1 0001801b-54b0-4551-8d7a-d66fb59429bf
## cases.samples.portions.analytes.aliquots.submitter_id
## 1 TCGA-B0-5094-11A-01D-1421-08
We can also translate from file UUIDs to case UUIDs and vice versa as long as
we know the input type. We can use the case UUID from the previous example to
get the associated file UUIDs using UUIDtoUUID
. Note that this translation
is a one to many relationship, thus yielding a data.frame
of file UUIDs for a
single case UUID.
head(UUIDtoUUID("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", to_type = "file_id"))
## case_id files.file_id
## 1 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 d6625424-1503-4735-a7cc-2a3de606278f
## 2 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 bc62b7e8-1b54-438b-944b-3e0655cdf6ac
## 3 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 fad3f577-6e5e-4039-b35b-69269f42d488
## 4 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 48c342b0-e7a2-4a7b-8556-55bcd8ad9ea0
## 5 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 e8eb689b-c5e1-4a21-acec-5f2ac25dbf97
## 6 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 8f345db1-686b-4e72-9b90-704978cde526
One possible way to verify that file IDs are matching case UUIDS is to
browse to the Genomic Data Commons webpage with the specific file UUID.
Here we look at the first file UUID entry in the output data.frame
:
https://portal.gdc.cancer.gov/files/0ff55a5e-6058-4e0b-9641-e3cb375ff214
In the page we check that the case UUID matches the input.
Several functions exist for working with TCGA barcodes, the main function being
TCGAbarcode
. It takes a TCGA barcode and returns information about
participant, sample, and/or portion.
## Return participant barcodes
TCGAbarcode(xbarcode, participant = TRUE)
## [1] "TCGA-A6-2671" "TCGA-A6-2671" "TCGA-A6-2674" "TCGA-A6-2674"
## Just return samples
TCGAbarcode(xbarcode, participant = FALSE, sample = TRUE)
## [1] "01A" "10A" "01A" "10A"
## Include sample data as well
TCGAbarcode(xbarcode, participant = TRUE, sample = TRUE)
## [1] "TCGA-A6-2671-01A" "TCGA-A6-2671-10A" "TCGA-A6-2674-01A"
## [4] "TCGA-A6-2674-10A"
## Include portion and analyte data
TCGAbarcode(xbarcode, participant = TRUE, sample = TRUE, portion = TRUE)
## [1] "TCGA-A6-2671-01A-01D" "TCGA-A6-2671-10A-01D" "TCGA-A6-2674-01A-02D"
## [4] "TCGA-A6-2674-10A-01D"
Based on lookup table values, the user can select certain sample types from a vector of sample barcodes. Below we select “Primary Solid Tumors” from a vector of barcodes, returning a logical vector identifying the matching samples.
## Select primary solid tumors
TCGAsampleSelect(xbarcode, "01")
## Selecting 'Primary Solid Tumor' samples
## [1] TRUE FALSE TRUE FALSE
## Select blood derived normals
TCGAsampleSelect(xbarcode, "10")
## Selecting 'Blood Derived Normal' samples
## [1] FALSE TRUE FALSE TRUE
data.frame
representation of barcodeThe straightforward TCGAbiospec
function will take the information contained
in the TCGA barcode and display it in data.frame
format with appropriate
column names.
TCGAbiospec(xbarcode)
## submitter_id sample_definition sample vial portion analyte plate center
## 1 TCGA-A6-2671 Primary Solid Tumor 01 A 01 D 1405 02
## 2 TCGA-A6-2671 Blood Derived Normal 10 A 01 D 1405 02
## 3 TCGA-A6-2674 Primary Solid Tumor 01 A 02 D 1167 02
## 4 TCGA-A6-2674 Blood Derived Normal 10 A 01 D 1167 02
The TCGAutils
package provides several helper datasets for working with TCGA barcodes.
sampleTypes
As shown previously, the reference dataset sampleTypes
defines sample codes
and their sample types (see ?sampleTypes
for source url).
## Obtained previously
sampleCodes <- TCGAbarcode(xbarcode, participant = FALSE, sample = TRUE)
## Lookup table
head(sampleTypes)
## Code Definition Short.Letter.Code
## 1 01 Primary Solid Tumor TP
## 2 02 Recurrent Solid Tumor TR
## 3 03 Primary Blood Derived Cancer - Peripheral Blood TB
## 4 04 Recurrent Blood Derived Cancer - Bone Marrow TRBM
## 5 05 Additional - New Primary TAP
## 6 06 Metastatic TM
## Match codes found in the barcode to the lookup table
sampleTypes[match(unique(substr(sampleCodes, 1L, 2L)), sampleTypes[["Code"]]), ]
## Code Definition Short.Letter.Code
## 1 01 Primary Solid Tumor TP
## 10 10 Blood Derived Normal NB
Source: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes
clinicalNames
- Firehose pipeline clinical variablesclinicalNames
is a list of the level 4 variable names (the most commonly used
clinical and pathological variables, with follow-ups merged) from each
colData
datasets in curatedTCGAData
. Shipped curatedTCGAData
MultiAssayExperiment
objects merge additional levels 1-3 clinical,
pathological, and biospecimen data and contain many more variables than the ones
listed here.
data("clinicalNames")
clinicalNames
## CharacterList of length 33
## [["ACC"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["BLCA"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["BRCA"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["CESC"]] years_to_birth vital_status ... age_at_diagnosis clinical_stage
## [["CHOL"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["COAD"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["DLBC"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["ESCA"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["GBM"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["HNSC"]] years_to_birth vital_status days_to_death ... race ethnicity
## ...
## <23 more elements>
lengths(clinicalNames)
## ACC BLCA BRCA CESC CHOL COAD DLBC ESCA GBM HNSC KICH KIRC KIRP LAML LGG
## 16 18 17 48 16 18 11 19 12 19 18 19 19 9 12
## LIHC LUAD LUSC MESO OV PAAD PCPG PRAD READ SARC SKCM STAD TGCT THCA THYM
## 16 20 20 17 12 19 13 18 18 12 17 17 15 21 11
## UCEC UCS UVM
## 9 11 14
sessionInfo
sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] RTCGAToolbox_2.12.1 curatedTCGAData_1.4.3
## [3] MultiAssayExperiment_1.8.3 SummarizedExperiment_1.12.0
## [5] DelayedArray_0.8.0 BiocParallel_1.16.6
## [7] matrixStats_0.54.0 Biobase_2.42.0
## [9] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
## [11] IRanges_2.16.0 S4Vectors_0.20.1
## [13] BiocGenerics_0.28.0 TCGAutils_1.2.2
## [15] BiocStyle_2.10.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 bit64_0.9-7
## [3] progress_1.2.0 httr_1.4.0
## [5] GenomicDataCommons_1.6.0 tools_3.5.2
## [7] R6_2.3.0 DBI_1.0.0
## [9] lazyeval_0.2.1 tidyselect_0.2.5
## [11] prettyunits_1.0.2 bit_1.1-14
## [13] curl_3.3 compiler_3.5.2
## [15] rvest_0.3.2 xml2_1.2.0
## [17] rtracklayer_1.42.1 bookdown_0.9
## [19] readr_1.3.1 rappdirs_0.3.1
## [21] RCircos_1.2.0 stringr_1.4.0
## [23] digest_0.6.18 Rsamtools_1.34.1
## [25] rmarkdown_1.11 XVector_0.22.0
## [27] pkgconfig_2.0.2 htmltools_0.3.6
## [29] limma_3.38.3 rlang_0.3.1
## [31] RSQLite_2.1.1 shiny_1.2.0
## [33] bindr_0.1.1 jsonlite_1.6
## [35] dplyr_0.7.8 RCurl_1.95-4.11
## [37] magrittr_1.5 GenomeInfoDbData_1.2.0
## [39] Matrix_1.2-15 Rcpp_1.0.0
## [41] stringi_1.3.1 yaml_2.2.0
## [43] RaggedExperiment_1.6.0 RJSONIO_1.3-1.1
## [45] zlibbioc_1.28.0 AnnotationHub_2.14.3
## [47] grid_3.5.2 blob_1.1.1
## [49] promises_1.0.1 ExperimentHub_1.8.0
## [51] crayon_1.3.4 lattice_0.20-38
## [53] Biostrings_2.50.2 splines_3.5.2
## [55] GenomicFeatures_1.34.3 hms_0.4.2
## [57] knitr_1.21 pillar_1.3.1
## [59] biomaRt_2.38.0 XML_3.98-1.17
## [61] glue_1.3.0 evaluate_0.13
## [63] data.table_1.12.0 BiocManager_1.30.4
## [65] httpuv_1.4.5.1 purrr_0.3.0
## [67] assertthat_0.2.0 xfun_0.4
## [69] mime_0.6 xtable_1.8-3
## [71] later_0.8.0 survival_2.43-3
## [73] tibble_2.0.1 GenomicAlignments_1.18.1
## [75] AnnotationDbi_1.44.0 memoise_1.1.0
## [77] bindrcpp_0.2.2 interactiveDisplayBase_1.20.0