1 How to write recipes for new AnnotationHub resources

1.1 Overview of the process

This vignette is for those wishing to process online resources into AnnotationHub resources. There are four steps involved in creating resources.

1.2 Setup

This vignette is intended for users who are comfortable with R. Be sure to use Bioc-devel. Install the AnnotationHubData package using biocLite("AnnotationHubData").

1.3 Introducing AnnotationHubMetadata Objects

The AnnotationHubData package provides a place where we store code to processes online resources into AnnotationHub resources. AnnotationHubMetadata objects are used to describe an online resource, including a function (‘recipe’) that creates the resource.

The steps involved include writing a recipe are:

Write a function which takes the metadata about the resource and processes them into AnnotationHubMetadata objects
Optional step : Write an additional function specifying how original source files are pre-processed to R or other objects. Not all source files are pre-processed; one might pre-process data to make it easier or faster for the user to access the resource.
Optional step: Write a function specifying how a resource is to be read in to the user’s R session after being downloaded to a user’s local cache.

2 Step 1: Writing `AnnotationHubMetadata` Generating Functions

The following example function takes files from the latest release of inparanoid and processes them into a list of AnnotationHubMetadata objects (e.g., using Map()).

The function body can contain hard-coded information, but many recipes will ‘scrape’ internet resources, creating many AnnotationHub resources; in the example below, .inparanoidMetadataFromUrl() visits the inparanoid web site to find current data. The Recipe: field indicates the function that will perform optional step 2, transforming the original source file(s) to an object that is more convenient for user input. This function returns a list of AnnotationHubMetadata objects.

makeinparanoid8ToAHMs <-
   function(currentMetadata, justRunUnitTests=FALSE, BiocVersion=biocVersion())
{
    baseUrl <- 'http://inparanoid.sbc.su.se/download/current/Orthologs_other_formats'
    ## Make list of metadata in a helper function
    meta <- .inparanoidMetadataFromUrl(baseUrl)
    ## then make AnnotationHubMetadata objects.
    Map(AnnotationHubMetadata,
        Description=meta$description,
        Genome=meta$genome,
        SourceFile=meta$sourceFile,
        SourceUrl=meta$sourceUrl,
        SourceVersion=meta$sourceVersion,
        Species=meta$species,
        TaxonomyId=meta$taxonomyId,
        Title=meta$title,
        RDataPath=meta$rDataPath,
        MoreArgs=list(
          Coordinate_1_based = TRUE,
          DataProvider = baseUrl,
          Maintainer = "Marc Carlson <mcarlson@fhcrc.org>",
          RDataClass = "SQLiteFile",
          RDataDateAdded = Sys.time(),
          RDataVersion = "0.0.1",
          Recipe = "AnnotationHubData:::inparanoid8ToDbsRecipe",
          Tags = c("Inparanoid", "Gene", "Homology", "Annotation")))
}

Here is a listing of AnntotationHubMetadata arguments:

AnnotationHubRoot: ‘character(1)’ Absolute path to directory structure containing resources to be added to AnnotationHub.
SourceUrl: ‘character()’ URL where resource(s) can be found
SourceType: ‘character()’ which indicates what kind of resource was initially processed. The preference is to name the type of resource if it’s a single file type and to name where the resources came from if it is a compound resource. So Typical answers would be like: ‘BED’,‘FASTA’ or ‘Inparanoid’ etc.
SourceVersion: ‘character(1)’ Version of original file
SourceLastModifiedDate: ‘POSIXct()’ The date when the source was last modified. Leaving this blank should allow the values to be retrieved for you (if your sourceURL is valid).
SourceMd5: ‘character()’ md5 hash of original file
SourceSize: ‘numeric(1)’ Number of bytes in original file
DataProvider: ‘character(1)’ Where did this resource come from?
Title: ‘character(1)’ Title for this resource
Description: ‘character(1)’ Description of the resource
Species: ‘character(1)’ Species name
TaxonomyId: ‘character(1)’ NCBI code
Genome: ‘character(1)’ Name of genome build
Tags: ‘character()’ Free-form tags
Recipe: ‘character(1)’ Name of recipe function
RDataClass: ‘character(1)’ Class of derived object (e.g. ‘GRanges’)
RDataDateAdded: ‘POSIXct()’ Date added to AnnotationHub. Used to determine snapshots.
RDataPath: ‘character(1)’ file path to serialized form
Maintainer: ‘character(1)’ Maintainer name and email address, ‘A Maintainer ’
BiocVersion: ‘character(1)’ Under which resource was built
Coordinate_1_based: ‘logical(1)’ Do coordinates start with 1 or 0?
DispatchClass: ‘character(1)’ string used to indicate which code should be called by the client when the resource is downloaded. This is often the same as the RDataClass. But it is allowed to be a different value so that the client can do something different internally if required.
Location_Prefix: ‘character(1)’ The location prefix is the base path where the resource is coming from; the default is the Bioconductor AnnotationHub server.
Notes: ‘character()’ Notes about the resource.

3 Step 2: Recipes for Pre-Processing Resources

A (optional) recipe function transformed the source data into an object served by AnnotationHub to the user. It takes a single AnnotationHubMetadata object as an argument. Below is a recipe that generates an inparanoid database object from the metadata stored in it’s AnnotationHubMetadata object. Note that the recipe will be invoked on the hub, so should output data to the location specified in the input AnnotationHubMetadata object.

inparanoid8ToDbsRecipe <-
    function(ahm)
{
    require(AnnotationForge)
    inputFiles <- metadata(ahm)$SourceFile
    dbname <- makeInpDb(dir=file.path(inputFiles,""),
                        dataDir=tempdir())
    db <- loadDb(file=dbname)
    outputPath <- file.path(metadata(ahm)$AnnotationHubRoot,
                            metadata(ahm)$RDataPath)
    saveDb(db, file=outputPath)
    outputFile(ahm)
}

3.1 Note for Steps 1 and 2

While writing this function, care has to be taken for a couple of fields:

Case 1 (common) - The AnnotationHub resource is downloaded directly to the user cache without any pre-processing, then

SourceUrls specify the original resource locatoin = Location_Prefix + RDataPath
Recipe = NA_character_

Example -

SourceUrls = "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToRn5.over.chain.gz"
Location_Prefix = "http://hgdownload.cse.ucsc.edu/",
RDataPath = "goldenPath/hg38/liftOver/hg38ToRn5.over.chain.gz"
Recipe = NA_character_

Case 2 - The AnnotationHub resource requires pre-processing

SourceUrls should merely document the original location of the untouched file
Location_Prefix + RDataPath should be equal to the file path on the amazon machine where all pre-processed files are stored.
Recipe = helper function which tells us how to pre-process the original file

Example -

SourceUrls="http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/hg38ToRn5.over.chain.gz",
Location_Prefix = "http://s3.amazonaws.com/annotationhub/",
RDataPath="chainfile/dummy.Rda"

If this seems confusing, please note how in both of these cases, the sourceUrl needs to reflect the location that the resource will actually come from once when the client is in use.

4 Step 3: Post-Processing Resources

One can post-process a file when it is instantiated into AnnotationHub from the user’s cache. An example, would be a BED file, downloaded to the user’s cache but input into R as a GRanges using rtrackler::import. Implement this by defining a class that extends AnnotationHubResource and that implements a get1() method.

setClass("BEDFileResource", contains="AnnotationHubResource")

setMethod(".get1", "BEDFileResource",
    function(x, ...)
{
    .require("rtracklayer")
    yy <- getHub(x)
    dat <- rtracklayer::BEDFile(cache(yy))
    rtracklayer::import(dat, format="bed", genome=yy$genome, ...)
})

The class and method definition typically need to be added to AnnotationHub, and require coordination with us.

5 Step 4: Testing and Next Steps

At this point make sure that the AnnotationHubMetadata generating function produces a list of AnnotationHubMetadata objects and that the recipe (if needed) produces an appropriate output path. Contact us to add your recipe to the production hub.

6 Session Information

## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.0 BSgenome_1.40.0                  
##  [3] rtracklayer_1.32.0                VariantAnnotation_1.18.1         
##  [5] SummarizedExperiment_1.2.2        Rsamtools_1.24.0                 
##  [7] Biostrings_2.40.0                 XVector_0.12.0                   
##  [9] GenomicFeatures_1.24.2            AnnotationDbi_1.34.2             
## [11] Biobase_2.32.0                    GenomicRanges_1.24.0             
## [13] GenomeInfoDb_1.8.2                IRanges_2.6.0                    
## [15] S4Vectors_0.10.0                  AnnotationHub_2.4.2              
## [17] BiocGenerics_0.18.0               BiocStyle_2.0.2                  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.5                   BiocInstaller_1.22.2         
##  [3] formatR_1.4                   bitops_1.0-6                 
##  [5] tools_3.3.0                   zlibbioc_1.18.0              
##  [7] biomaRt_2.28.0                digest_0.6.9                 
##  [9] evaluate_0.9                  RSQLite_1.0.0                
## [11] shiny_0.13.2                  DBI_0.4-1                    
## [13] curl_0.9.7                    yaml_2.1.13                  
## [15] httr_1.1.0                    stringr_1.0.0                
## [17] knitr_1.13                    R6_2.1.2                     
## [19] BiocParallel_1.6.2            XML_3.98-1.4                 
## [21] rmarkdown_0.9.6               magrittr_1.5                 
## [23] GenomicAlignments_1.8.0       htmltools_0.3.5              
## [25] mime_0.4                      interactiveDisplayBase_1.10.3
## [27] xtable_1.8-2                  httpuv_1.3.3                 
## [29] stringi_1.0-1                 RCurl_1.95-4.8

How to create new AnnotationHub resources

Marc Carlson, Sonali Arora

Modified 11 November 2015; compiled 15 May 2016

Package version: AnnotationHub 2.4.2

Contents

1 How to write recipes for new AnnotationHub resources

1.1 Overview of the process

1.2 Setup

1.3 Introducing AnnotationHubMetadata Objects

2 Step 1: Writing `AnnotationHubMetadata` Generating Functions

3 Step 2: Recipes for Pre-Processing Resources

3.1 Note for Steps 1 and 2

4 Step 3: Post-Processing Resources

5 Step 4: Testing and Next Steps

6 Session Information

How to create new AnnotationHub resources

Marc Carlson, Sonali Arora

Modified 11 November 2015; compiled 15 May 2016

Package version: AnnotationHub 2.4.2

Contents

1 How to write recipes for new AnnotationHub resources

1.1 Overview of the process

1.2 Setup

1.3 Introducing AnnotationHubMetadata Objects

2 Step 1: Writing AnnotationHubMetadata Generating Functions

3 Step 2: Recipes for Pre-Processing Resources

3.1 Note for Steps 1 and 2

4 Step 3: Post-Processing Resources

5 Step 4: Testing and Next Steps

6 Session Information

2 Step 1: Writing `AnnotationHubMetadata` Generating Functions