1 Overview
2 Installation
3 curatedTCGAData helpers
4 Importing TCGA text data files to Bioconductor classes
5 Translating and interpreting TCGA identifiers
6 Reference data
- 6.1 sampleTypes
- 6.2 clinicalNames - Firehose pipeline clinical variables
7 sessionInfo

1 Overview

The TCGAutils package completes a suite of Bioconductor packages for convenient access, integration, and analysis of The Cancer Genome Atlas. It includes: 0. helpers for working with TCGA through the Bioconductor packages MultiAssayExperiment (for coordinated representation and manipulation of multi-omits experiments) and curatedTCGAData, which provides unrestricted TCGA data as MultiAssayExperiment objects, 0. helpers for importing TCGA data as from flat data structures such as data.frame or DataFrame read from delimited data structures provided by the Broad Institute’s Firehose, and 0. functions for interpreting TCGA barcodes and for mapping between barcodes and Universally Unique Identifiers (UUIDs).

2 Installation

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("TCGAutils")

Required packages for this vignette:

library(TCGAutils)
library(curatedTCGAData)
library(MultiAssayExperiment)
library(RTCGAToolbox)

3 `curatedTCGAData` helpers

Functions such as getSubtypeMap and getClinicalNames provide information on data inside a MultiAssayExperiment object downloaded from curatedTCGAData. sampleTables and splitAssays support useful operations on these MultiAssayExperiment objects.

3.1 obtaining TCGA as `MultiAssayExperiment` objects from `curatedTCGAData`

For demonstration we download part of the Colon Adenocarcinoma (COAD) dataset usingcuratedTCGAData via ExperimentHub. This command will download any data type that starts with CN* such as CNASeq:

coad <- curatedTCGAData::curatedTCGAData(diseaseCode = "COAD",
    assays = "CN*", dry.run = FALSE)

For a list of all available data types, use dry.run = FALSE and an asterisk * as the assay input value:

curatedTCGAData("COAD", "*")

##                                  COAD_CNASNP 
##                   "COAD_CNASNP-20160128.rda" 
##                                  COAD_CNASeq 
##                   "COAD_CNASeq-20160128.rda" 
##                                  COAD_CNVSNP 
##                   "COAD_CNVSNP-20160128.rda" 
##                        COAD_GISTIC_AllByGene 
##         "COAD_GISTIC_AllByGene-20160128.rda" 
##                COAD_GISTIC_ThresholdedByGene 
## "COAD_GISTIC_ThresholdedByGene-20160128.rda" 
##                            COAD_Methylation1 
##     "COAD_Methylation_methyl27-20160128.rda" 
##                            COAD_Methylation2 
##    "COAD_Methylation_methyl450-20160128.rda" 
##                                COAD_Mutation 
##                 "COAD_Mutation-20160128.rda" 
##                         COAD_RNASeq2GeneNorm 
##          "COAD_RNASeq2GeneNorm-20160128.rda" 
##                              COAD_RNASeqGene 
##               "COAD_RNASeqGene-20160128.rda" 
##                               COAD_RPPAArray 
##                "COAD_RPPAArray-20160128.rda" 
##                               COAD_mRNAArray 
##                "COAD_mRNAArray-20160128.rda" 
##                            COAD_miRNASeqGene 
##             "COAD_miRNASeqGene-20160128.rda"

3.2 `sampleTables`: what sample types are present in the data?

The sampleTables function gives a tally of available samples in the dataset based on the TCGA barcode information.

sampleTables(coad)

## $`COAD_CNASNP-20160128`
## 
##  01  02  06  10  11 
## 449   1   1 373  90 
## 
## $`COAD_CNASeq-20160128`
## 
## 01 10 11 
## 68 55 13 
## 
## $`COAD_CNVSNP-20160128`
## 
##  01  02  06  10  11 
## 449   1   1 373  90

For reference in interpreting the sample type codes, see the sampleTypes table:

data("sampleTypes")
sampleTypes

##    Code                                        Definition Short.Letter.Code
## 1    01                               Primary Solid Tumor                TP
## 2    02                             Recurrent Solid Tumor                TR
## 3    03   Primary Blood Derived Cancer - Peripheral Blood                TB
## 4    04      Recurrent Blood Derived Cancer - Bone Marrow              TRBM
## 5    05                          Additional - New Primary               TAP
## 6    06                                        Metastatic                TM
## 7    07                             Additional Metastatic               TAM
## 8    08                        Human Tumor Original Cells              THOC
## 9    09        Primary Blood Derived Cancer - Bone Marrow               TBM
## 10   10                              Blood Derived Normal                NB
## 11   11                               Solid Tissue Normal                NT
## 12   12                                Buccal Cell Normal               NBC
## 13   13                           EBV Immortalized Normal              NEBV
## 14   14                                Bone Marrow Normal               NBM
## 15   15                                    sample type 15              15SH
## 16   16                                    sample type 16              16SH
## 17   20                                   Control Analyte             CELLC
## 18   40 Recurrent Blood Derived Cancer - Peripheral Blood               TRB
## 19   50                                        Cell Lines              CELL
## 20   60                          Primary Xenograft Tissue                XP
## 21   61                Cell Line Derived Xenograft Tissue               XCL
## 22   99                                    sample type 99              99SH

3.3 `splitAssays`: separate the data from different tissue types

TCGA datasets include multiple -omics for solid tumors, adjacent normal tissues, blood-derived cancers and normals, and other tissue types, which may be mixed together in a single dataset. The MultiAssayExperiment object generated here has one patient per row of its colData, but each patient may have two or more -omics profiles by any assay, whether due to assaying of different types of tissues or to technical replication. splitAssays separates profiles from different tissue types (such as tumor and adjacent normal) into different assays of the MultiAssayExperiment by taking a vector of sample codes, and partitioning the current assays into assays with an appended sample code:

(tnmae <- splitAssays(coad, c("01", "11")))

## Selecting 'Primary Solid Tumor' samples

## Selecting 'Solid Tissue Normal' samples

## A MultiAssayExperiment object of 6 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 6: 
##  [1] 01_COAD_CNASNP-20160128: RaggedExperiment with 459490 rows and 449 columns 
##  [2] 01_COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 68 columns 
##  [3] 01_COAD_CNVSNP-20160128: RaggedExperiment with 90534 rows and 449 columns 
##  [4] 11_COAD_CNASNP-20160128: RaggedExperiment with 459490 rows and 90 columns 
##  [5] 11_COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 13 columns 
##  [6] 11_COAD_CNVSNP-20160128: RaggedExperiment with 90534 rows and 90 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices

The MultiAssayExperiment package then provides functionality to merge replicate profiles for a single patient (mergeReplicates()), which would now be appropriate but would not have been appropriate before splitting different tissue types into different assays, because that would average measurements from tumors and normal tissues.

MultiAssayExperiment also defines the MatchedAssayExperiment class, which eliminates any profiles not present across all assays and ensures identical ordering of profiles (columns) in each assay. In this example, it will match tumors to adjacent normals in subsequent assays:

(matchmae <- as(tnmae, "MatchedAssayExperiment"))

## A MatchedAssayExperiment object of 6 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 6: 
##  [1] 01_COAD_CNASNP-20160128: RaggedExperiment with 459490 rows and 12 columns 
##  [2] 01_COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 12 columns 
##  [3] 01_COAD_CNVSNP-20160128: RaggedExperiment with 90534 rows and 12 columns 
##  [4] 11_COAD_CNASNP-20160128: RaggedExperiment with 459490 rows and 12 columns 
##  [5] 11_COAD_CNASeq-20160128: RaggedExperiment with 40530 rows and 12 columns 
##  [6] 11_COAD_CNVSNP-20160128: RaggedExperiment with 90534 rows and 12 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices

Only about 12 participants have both a matched tumor and solid normal sample.

3.4 `getSubtypeMap`: manually curated molecular subtypes

Per-tumor subtypes are saved in the metadata of the colData slot of MultiAssayExperiment objects downloaded from curatedTCGAData. These subtypes were manually curated from the supplemental tables of all primary TCGA publications:

getSubtypeMap(coad)

##        COAD_annotations        COAD_subtype
## 1            Patient_ID             patient
## 2                   msi          MSI_status
## 3  methylation_subtypes methylation_subtype
## 4         mrna_subtypes  expression_subtype
## 5 histological_subtypes   histological_type

3.5 `getClinicalNames`: key “level 4” clinical & pathological data

The curatedTCGAData colData contain hundreds of columns, obtained from merging all unrestricted levels of clinical, pathological, and biospecimen data. This function provides the names of “level 4” clinical/pathological variables, which are the only ones provided by most other TCGA analysis tools. Users may then use these variable names for subsetting or analysis, and may even want to subset the colData to only these commonly used variables.

getClinicalNames("COAD")

##  [1] "years_to_birth"                      
##  [2] "vital_status"                        
##  [3] "days_to_death"                       
##  [4] "days_to_last_followup"               
##  [5] "tumor_tissue_site"                   
##  [6] "pathologic_stage"                    
##  [7] "pathology_T_stage"                   
##  [8] "pathology_N_stage"                   
##  [9] "pathology_M_stage"                   
## [10] "gender"                              
## [11] "date_of_initial_pathologic_diagnosis"
## [12] "days_to_last_known_alive"            
## [13] "radiation_therapy"                   
## [14] "histological_type"                   
## [15] "residual_tumor"                      
## [16] "number_of_lymph_nodes"               
## [17] "race"                                
## [18] "ethnicity"

Warning: some names may not exactly match the colData names in the object due to differences in variable types. These variables are kept separate and differentiated with x and y. For example, vital_status in this case corresponds to two different variables obtained from the pipeline. One variable is interger type and the other character:

class(colData(coad)[["vital_status.x"]])

## [1] "integer"

class(colData(coad)[["vital_status.y"]])

## [1] "character"

table(colData(coad)[["vital_status.x"]])

## 
##   0   1 
## 355 102

table(colData(coad)[["vital_status.y"]])

## 
## DECEASED   LIVING 
##       22      179

Such conflicts should be inspected in this manner, and conflicts resolved by choosing the more complete variable, or by treating any conflicting values as unknown (“NA”).

3.6 `mergeColData`: expanding the `colData` of a `MultiAssayExperiment`

This function merges a data.frame or DataFrame into the colData of an existing MultiAssayExperiment object. It will match column names and row names to do a full merge of both data sets. This convenience function can be used, for example, to add subtype information available for a subset of patients to the colData. Here is an example on an empty MultiAssayExperiment just to demonstrate its usage:

mergeColData(MultiAssayExperiment(), data.frame())

## A MultiAssayExperiment object of 0 listed
##  experiments with no user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 0:  
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices

4 Importing TCGA text data files to Bioconductor classes

A few functions in the package accept either files or classes such as data.frame and FirehoseGISTIC as input and return standard Bioconductor classes.

4.1 `makeGRangesListFromExonFiles`

%??% Could you use the GenomicDataCommons library to get an example file instead, or show the GDC command that would be used to obtain the file?

The GRangesList class from the GenomicRanges package is suitable for grouping GRanges vectors as a list, such as for grouping exons by gene. In this example we use a legacy exon quantification file from the Genomic Data Commons. We then use makeGRangesListFromExonFiles to create a GRangesList from vectors of file paths and names (where necessary). Some adjustments have been made to the file name for cross-platform compatibility with Windows operating system limitations.

## Load example file found in package
pkgDir <- system.file("extdata", package = "TCGAutils", mustWork = TRUE)
exonFile <- list.files(pkgDir, pattern = "cation\\.txt$", full.names = TRUE)
exonFile

## [1] "/tmp/RtmpeKC7y6/Rinst2cb93605bba3/TCGAutils/extdata/bt.exon_quantification.txt"

## We add the original file prefix to query for the UUID and get the
## TCGAbarcode
filePrefix <- "unc.edu.32741f9a-9fec-441f-96b4-e504e62c5362.1755371."

## Add actual file name manually
makeGRangesListFromExonFiles(exonFile,
    fileNames = paste0(filePrefix, basename(exonFile)))

## Parsed with column specification:
## cols(
##   exon = col_character(),
##   raw_counts = col_double(),
##   median_length_normalized = col_double(),
##   RPKM = col_double()
## )

## GRangesList object of length 1:
## $TCGA-AA-3678-01A-01R-0905-07 
## GRanges object with 100 ranges and 3 metadata columns:
##         seqnames        ranges strand | raw_counts median_length_normalized
##            <Rle>     <IRanges>  <Rle> |  <numeric>                <numeric>
##     [1]     chr1   11874-12227      + |          4                0.4929178
##     [2]     chr1   12595-12721      + |          2                0.3412699
##     [3]     chr1   12613-12721      + |          2                0.3981481
##     [4]     chr1   12646-12697      + |          2                 0.372549
##     [5]     chr1   13221-14409      + |         39                0.6329966
##     ...      ...           ...    ... .        ...                      ...
##    [96]     chr1 881782-881925      - |        179                        1
##    [97]     chr1 883511-883612      - |        151                        1
##    [98]     chr1 883870-883983      - |        155                        1
##    [99]     chr1 886507-886618      - |        144                        1
##   [100]     chr1 887380-887519      - |        158                        1
##                      RPKM
##                 <numeric>
##     [1] 0.322476823123937
##     [2] 0.449436202306589
##     [3] 0.523655024705842
##     [4]  1.09766149409494
##     [5] 0.936104924316458
##     ...               ...
##    [96]  35.4758096772073
##    [97]  42.2492061354581
##    [98]  38.8032966772158
##    [99]  36.6932556597451
##   [100]  32.2085244124429
## 
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note GRangesList objects must be converted to RaggedExperiment class to incorporate them into a MultiAssayExperiment.

4.2 `makeGRangesListFromCopyNumber`

Other processed, genomic range-based data from TCGA data can be imported using makeGRangesListFromCopyNumber. This tab-delimited data file of copy number alterations from bladder urothelial carcinoma (BLCA) was obtained from the Genomic Data Commons and is included in TCGAUtils as an example:

grlFile <- system.file("extdata", "blca_cnaseq.txt", package = "TCGAutils")
grl <- read.table(grlFile)
head(grl)

##                          Sample Chromosome     Start       End Num_Probes
## 1  TCGA-BL-A0C8-01A-11D-A10R-02         14  70362113  73912204         NA
## 2  TCGA-BL-A0C8-01A-11D-A10R-02          9 115609546 131133898         NA
## 5  TCGA-BL-A13I-01A-11D-A13U-02         13  19020028  49129100         NA
## 6  TCGA-BL-A13I-01A-11D-A13U-02          1     10208 246409808         NA
## 9  TCGA-BL-A13J-01A-11D-A10R-02         23   3119586   5636448         NA
## 10 TCGA-BL-A13J-01A-11D-A10R-02          7     10127  35776912         NA
##    Segment_Mean
## 1  -0.182879931
## 2   0.039675162
## 5   0.002085552
## 6  -0.014224752
## 9   0.877072555
## 10  0.113873871

makeGRangesListFromCopyNumber(grl, split.field = "Sample")

## GRangesList object of length 116:
## $TCGA-BL-A0C8-01A-11D-A10R-02 
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames              ranges strand
##          <Rle>           <IRanges>  <Rle>
##   [1]       14   70362113-73912204      *
##   [2]        9 115609546-131133898      *
## 
## $TCGA-BL-A13I-01A-11D-A13U-02 
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames            ranges strand
##   [1]       13 19020028-49129100      *
##   [2]        1   10208-246409808      *
## 
## $TCGA-BL-A13J-01A-11D-A10R-02 
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames          ranges strand
##   [1]       23 3119586-5636448      *
##   [2]        7  10127-35776912      *
## 
## ...
## <113 more elements>
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths

makeGRangesListFromCopyNumber(grl, split.field = "Sample",
    keep.extra.columns = TRUE)

## GRangesList object of length 116:
## $TCGA-BL-A0C8-01A-11D-A10R-02 
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames              ranges strand | Num_Probes       Segment_Mean
##          <Rle>           <IRanges>  <Rle> |  <logical>          <numeric>
##   [1]       14   70362113-73912204      * |       <NA> -0.182879930738387
##   [2]        9 115609546-131133898      * |       <NA> 0.0396751622235396
## 
## $TCGA-BL-A13I-01A-11D-A13U-02 
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand | Num_Probes        Segment_Mean
##   [1]       13 19020028-49129100      * |       <NA> 0.00208555197637913
##   [2]        1   10208-246409808      * |       <NA> -0.0142247519016688
## 
## $TCGA-BL-A13J-01A-11D-A10R-02 
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames          ranges strand | Num_Probes      Segment_Mean
##   [1]       23 3119586-5636448      * |       <NA> 0.877072555244314
##   [2]        7  10127-35776912      * |       <NA> 0.113873871106118
## 
## ...
## <113 more elements>
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths

4.3 `makeSummarizedExperimentFromGISTIC`

This function is only used for converting the FirehoseGISTIC class of the RTCGAToolbox package. It allows the user to obtain thresholded by gene data, probabilities and peak regions.

tempDIR <- tempdir()
co <- getFirehoseData("COAD", clinical = FALSE, GISTIC = TRUE,
    destdir = tempDIR)

## gdac.broadinstitute.org_COAD-TP.CopyNumber_Gistic2.Level_4.2016012800.0.0.tar.gz

selectType(co, "GISTIC")

## Dataset:COAD

## FirehoseGISTIC object, dim: 24776    454

class(selectType(co, "GISTIC"))

## [1] "FirehoseGISTIC"
## attr(,"package")
## [1] "RTCGAToolbox"

makeSummarizedExperimentFromGISTIC(co, "Peaks")

## class: RangedSummarizedExperiment 
## dim: 66 451 
## metadata(0):
## assays(1): ''
## rownames(66): 23 24 ... 65 66
## rowData names(10): rowRanges Unique.Name ... Amplitude.Threshold
##   type
## colnames(451): TCGA-3L-AA1B-01A-11D-A36W-01
##   TCGA-4N-A93T-01A-11D-A36W-01 ... TCGA-T9-A92H-01A-11D-A36W-01
##   TCGA-WS-AB45-01A-11D-A40O-01
## colData names(0):

5 Translating and interpreting TCGA identifiers

5.1 Translation

The TCGA project has generated massive amounts of data. Some data can be obtained with Universally Unique IDentifiers (UUID) and other data with TCGA barcodes. The Genomic Data Commons provides a JSON API for mapping between UUID and barcode, but it is difficult for many people to understand. TCGAutils makes simple functions available for two-way translation between vectors of these identifiers.

5.1.1 TCGA barcode to UUID

Here we translate the first two TCGA barcodes of the previous copy-number alterations dataset to UUID:

(xbarcode <- head(colnames(coad)[["COAD_CNASeq-20160128"]], 4L))

## [1] "TCGA-A6-2671-01A-01D-1405-02" "TCGA-A6-2671-10A-01D-1405-02"
## [3] "TCGA-A6-2674-01A-02D-1167-02" "TCGA-A6-2674-10A-01D-1167-02"

barcodeToUUID(xbarcode)

##   submitter_id                              case_id
## 1 TCGA-A6-2671 565e2726-4942-4726-89d3-c5e3797f7204
## 2 TCGA-A6-2674 57cdaa1c-4e94-4a28-ab3b-300c0457555f

5.1.2 UUID to TCGA barcode

Here we have a known case UUID that we want to translate into a TCGA barcode.

UUIDtoBarcode("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", id_type = "case_id")

##                                case_id submitter_id
## 1 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 TCGA-B0-5117

In cases where we want to translate a known file UUID to the associated TCGA patient barcode, we can use UUIDtoBarcode. Optional barcode information can be included for sample, portion/analyte, and plate/center by specifying the end_point argument.

UUIDtoBarcode("0001801b-54b0-4551-8d7a-d66fb59429bf",
    id_type = "file_id", end_point = "center")

##                                file_id
## 1 0001801b-54b0-4551-8d7a-d66fb59429bf
##   cases.samples.portions.analytes.aliquots.submitter_id
## 1                          TCGA-B0-5094-11A-01D-1421-08

5.1.3 UUID to UUID

We can also translate from file UUIDs to case UUIDs and vice versa as long as we know the input type. We can use the case UUID from the previous example to get the associated file UUIDs using UUIDtoUUID. Note that this translation is a one to many relationship, thus yielding a data.frame of file UUIDs for a single case UUID.

head(UUIDtoUUID("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", to_type = "file_id"))

##                                case_id                        files.file_id
## 1 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 d6625424-1503-4735-a7cc-2a3de606278f
## 2 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 bc62b7e8-1b54-438b-944b-3e0655cdf6ac
## 3 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 fad3f577-6e5e-4039-b35b-69269f42d488
## 4 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 48c342b0-e7a2-4a7b-8556-55bcd8ad9ea0
## 5 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 e8eb689b-c5e1-4a21-acec-5f2ac25dbf97
## 6 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 8f345db1-686b-4e72-9b90-704978cde526

One possible way to verify that file IDs are matching case UUIDS is to browse to the Genomic Data Commons webpage with the specific file UUID. Here we look at the first file UUID entry in the output data.frame:

https://portal.gdc.cancer.gov/files/0ff55a5e-6058-4e0b-9641-e3cb375ff214

In the page we check that the case UUID matches the input.

5.2 Parsing TCGA barcodes

Several functions exist for working with TCGA barcodes, the main function being TCGAbarcode. It takes a TCGA barcode and returns information about participant, sample, and/or portion.

## Return participant barcodes
TCGAbarcode(xbarcode, participant = TRUE)

## [1] "TCGA-A6-2671" "TCGA-A6-2671" "TCGA-A6-2674" "TCGA-A6-2674"

## Just return samples
TCGAbarcode(xbarcode, participant = FALSE, sample = TRUE)

## [1] "01A" "10A" "01A" "10A"

## Include sample data as well
TCGAbarcode(xbarcode, participant = TRUE, sample = TRUE)

## [1] "TCGA-A6-2671-01A" "TCGA-A6-2671-10A" "TCGA-A6-2674-01A"
## [4] "TCGA-A6-2674-10A"

## Include portion and analyte data
TCGAbarcode(xbarcode, participant = TRUE, sample = TRUE, portion = TRUE)

## [1] "TCGA-A6-2671-01A-01D" "TCGA-A6-2671-10A-01D" "TCGA-A6-2674-01A-02D"
## [4] "TCGA-A6-2674-10A-01D"

5.3 Sample select

Based on lookup table values, the user can select certain sample types from a vector of sample barcodes. Below we select “Primary Solid Tumors” from a vector of barcodes, returning a logical vector identifying the matching samples.

## Select primary solid tumors
TCGAsampleSelect(xbarcode, "01")

## Selecting 'Primary Solid Tumor' samples

## [1]  TRUE FALSE  TRUE FALSE

## Select blood derived normals
TCGAsampleSelect(xbarcode, "10")

## Selecting 'Blood Derived Normal' samples

## [1] FALSE  TRUE FALSE  TRUE

5.4 `data.frame` representation of barcode

The straightforward TCGAbiospec function will take the information contained in the TCGA barcode and display it in data.frame format with appropriate column names.

TCGAbiospec(xbarcode)

##   submitter_id    sample_definition sample vial portion analyte plate center
## 1 TCGA-A6-2671  Primary Solid Tumor     01    A      01       D  1405     02
## 2 TCGA-A6-2671 Blood Derived Normal     10    A      01       D  1405     02
## 3 TCGA-A6-2674  Primary Solid Tumor     01    A      02       D  1167     02
## 4 TCGA-A6-2674 Blood Derived Normal     10    A      01       D  1167     02

6 Reference data

The TCGAutils package provides several helper datasets for working with TCGA barcodes.

6.1 `sampleTypes`

As shown previously, the reference dataset sampleTypes defines sample codes and their sample types (see ?sampleTypes for source url).

## Obtained previously
sampleCodes <- TCGAbarcode(xbarcode, participant = FALSE, sample = TRUE)

## Lookup table
head(sampleTypes)

##   Code                                      Definition Short.Letter.Code
## 1   01                             Primary Solid Tumor                TP
## 2   02                           Recurrent Solid Tumor                TR
## 3   03 Primary Blood Derived Cancer - Peripheral Blood                TB
## 4   04    Recurrent Blood Derived Cancer - Bone Marrow              TRBM
## 5   05                        Additional - New Primary               TAP
## 6   06                                      Metastatic                TM

## Match codes found in the barcode to the lookup table
sampleTypes[match(unique(substr(sampleCodes, 1L, 2L)), sampleTypes[["Code"]]), ]

##    Code           Definition Short.Letter.Code
## 1    01  Primary Solid Tumor                TP
## 10   10 Blood Derived Normal                NB

Source: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/sample-type-codes

6.2 `clinicalNames` - Firehose pipeline clinical variables

clinicalNames is a list of the level 4 variable names (the most commonly used clinical and pathological variables, with follow-ups merged) from each colData datasets in curatedTCGAData. Shipped curatedTCGAData MultiAssayExperiment objects merge additional levels 1-3 clinical, pathological, and biospecimen data and contain many more variables than the ones listed here.

data("clinicalNames")

clinicalNames

## CharacterList of length 33
## [["ACC"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["BLCA"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["BRCA"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["CESC"]] years_to_birth vital_status ... age_at_diagnosis clinical_stage
## [["CHOL"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["COAD"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["DLBC"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["ESCA"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["GBM"]] years_to_birth vital_status days_to_death ... race ethnicity
## [["HNSC"]] years_to_birth vital_status days_to_death ... race ethnicity
## ...
## <23 more elements>

lengths(clinicalNames)

##  ACC BLCA BRCA CESC CHOL COAD DLBC ESCA  GBM HNSC KICH KIRC KIRP LAML  LGG 
##   16   18   17   48   16   18   11   19   12   19   18   19   19    9   12 
## LIHC LUAD LUSC MESO   OV PAAD PCPG PRAD READ SARC SKCM STAD TGCT THCA THYM 
##   16   20   20   17   12   19   13   18   18   12   17   17   15   21   11 
## UCEC  UCS  UVM 
##    9   11   14

7 `sessionInfo`

sessionInfo()

## R version 3.5.2 (2018-12-20)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] RTCGAToolbox_2.12.1         curatedTCGAData_1.4.3      
##  [3] MultiAssayExperiment_1.8.3  SummarizedExperiment_1.12.0
##  [5] DelayedArray_0.8.0          BiocParallel_1.16.6        
##  [7] matrixStats_0.54.0          Biobase_2.42.0             
##  [9] GenomicRanges_1.34.0        GenomeInfoDb_1.18.2        
## [11] IRanges_2.16.0              S4Vectors_0.20.1           
## [13] BiocGenerics_0.28.0         TCGAutils_1.2.2            
## [15] BiocStyle_2.10.0           
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_0.9-7                  
##  [3] progress_1.2.0                httr_1.4.0                   
##  [5] GenomicDataCommons_1.6.0      tools_3.5.2                  
##  [7] R6_2.3.0                      DBI_1.0.0                    
##  [9] lazyeval_0.2.1                tidyselect_0.2.5             
## [11] prettyunits_1.0.2             bit_1.1-14                   
## [13] curl_3.3                      compiler_3.5.2               
## [15] rvest_0.3.2                   xml2_1.2.0                   
## [17] rtracklayer_1.42.1            bookdown_0.9                 
## [19] readr_1.3.1                   rappdirs_0.3.1               
## [21] RCircos_1.2.0                 stringr_1.4.0                
## [23] digest_0.6.18                 Rsamtools_1.34.1             
## [25] rmarkdown_1.11                XVector_0.22.0               
## [27] pkgconfig_2.0.2               htmltools_0.3.6              
## [29] limma_3.38.3                  rlang_0.3.1                  
## [31] RSQLite_2.1.1                 shiny_1.2.0                  
## [33] bindr_0.1.1                   jsonlite_1.6                 
## [35] dplyr_0.7.8                   RCurl_1.95-4.11              
## [37] magrittr_1.5                  GenomeInfoDbData_1.2.0       
## [39] Matrix_1.2-15                 Rcpp_1.0.0                   
## [41] stringi_1.3.1                 yaml_2.2.0                   
## [43] RaggedExperiment_1.6.0        RJSONIO_1.3-1.1              
## [45] zlibbioc_1.28.0               AnnotationHub_2.14.3         
## [47] grid_3.5.2                    blob_1.1.1                   
## [49] promises_1.0.1                ExperimentHub_1.8.0          
## [51] crayon_1.3.4                  lattice_0.20-38              
## [53] Biostrings_2.50.2             splines_3.5.2                
## [55] GenomicFeatures_1.34.3        hms_0.4.2                    
## [57] knitr_1.21                    pillar_1.3.1                 
## [59] biomaRt_2.38.0                XML_3.98-1.17                
## [61] glue_1.3.0                    evaluate_0.13                
## [63] data.table_1.12.0             BiocManager_1.30.4           
## [65] httpuv_1.4.5.1                purrr_0.3.0                  
## [67] assertthat_0.2.0              xfun_0.4                     
## [69] mime_0.6                      xtable_1.8-3                 
## [71] later_0.8.0                   survival_2.43-3              
## [73] tibble_2.0.1                  GenomicAlignments_1.18.1     
## [75] AnnotationDbi_1.44.0          memoise_1.1.0                
## [77] bindrcpp_0.2.2                interactiveDisplayBase_1.20.0

TCGAutils: Helper functions for working with TCGA datasets

February 13, 2019

Contents

1 Overview

2 Installation

3 `curatedTCGAData` helpers

3.1 obtaining TCGA as `MultiAssayExperiment` objects from `curatedTCGAData`

3.2 `sampleTables`: what sample types are present in the data?

3.3 `splitAssays`: separate the data from different tissue types

3.4 `getSubtypeMap`: manually curated molecular subtypes

3.5 `getClinicalNames`: key “level 4” clinical & pathological data

3.6 `mergeColData`: expanding the `colData` of a `MultiAssayExperiment`

4 Importing TCGA text data files to Bioconductor classes

4.1 `makeGRangesListFromExonFiles`

4.2 `makeGRangesListFromCopyNumber`

4.3 `makeSummarizedExperimentFromGISTIC`

5 Translating and interpreting TCGA identifiers

5.1 Translation

5.1.1 TCGA barcode to UUID

5.1.2 UUID to TCGA barcode

5.1.3 UUID to UUID

5.2 Parsing TCGA barcodes

5.3 Sample select

5.4 `data.frame` representation of barcode

6 Reference data

6.1 `sampleTypes`

6.2 `clinicalNames` - Firehose pipeline clinical variables

7 `sessionInfo`

TCGAutils: Helper functions for working with TCGA datasets

February 13, 2019

Contents

1 Overview

2 Installation

3 curatedTCGAData helpers

3.1 obtaining TCGA as MultiAssayExperiment objects from curatedTCGAData

3.2 sampleTables: what sample types are present in the data?

3.3 splitAssays: separate the data from different tissue types

3.4 getSubtypeMap: manually curated molecular subtypes

3.5 getClinicalNames: key “level 4” clinical & pathological data

3.6 mergeColData: expanding the colData of a MultiAssayExperiment

4 Importing TCGA text data files to Bioconductor classes

4.1 makeGRangesListFromExonFiles

4.2 makeGRangesListFromCopyNumber

4.3 makeSummarizedExperimentFromGISTIC

5 Translating and interpreting TCGA identifiers

5.1 Translation

5.1.1 TCGA barcode to UUID

5.1.2 UUID to TCGA barcode

5.1.3 UUID to UUID

5.2 Parsing TCGA barcodes

5.3 Sample select

5.4 data.frame representation of barcode

6 Reference data

6.1 sampleTypes

6.2 clinicalNames - Firehose pipeline clinical variables

7 sessionInfo

3 `curatedTCGAData` helpers

3.1 obtaining TCGA as `MultiAssayExperiment` objects from `curatedTCGAData`

3.2 `sampleTables`: what sample types are present in the data?

3.3 `splitAssays`: separate the data from different tissue types

3.4 `getSubtypeMap`: manually curated molecular subtypes

3.5 `getClinicalNames`: key “level 4” clinical & pathological data

3.6 `mergeColData`: expanding the `colData` of a `MultiAssayExperiment`

4.1 `makeGRangesListFromExonFiles`

4.2 `makeGRangesListFromCopyNumber`

4.3 `makeSummarizedExperimentFromGISTIC`

5.4 `data.frame` representation of barcode

6.1 `sampleTypes`

6.2 `clinicalNames` - Firehose pipeline clinical variables

7 `sessionInfo`