nanotatoR: structural variant annotation and classification

Surajit Bhattacharya,Hayk Barsheghyan, Emmanuele C Delot and Eric Vilain

Introduction

Short-read sequencing (SRS) is the predominant technique of DNA sequencing used for clinical diagnosis. It utilizes flowcells covered with millions of surface-bound oligonucleotides that allow parallel sequencing of hundreds of millions of independent short reads. However, as the average sequencing read length is approximately 150 bp, large structural variants (SVs) and copy number variants (CNVs) are challenging to observe. This creates a diagnostic gap between the clinical phenotypes and the underlying genetic mechanisms in the field of biomedical sciences. Novel technologies such as optical genome mapping and long-read sequencing have partially addressed the issues of SV and CNV detection; however, the identification of pathogenic variants among thousands of called SVs/CNVs throughout the genome has proven to be challenging as the analytical pipelines available for single nucleotide variants are not applicable to SV analysis. Thus, we have built an R-based annotation package “nanotatoR” specifically for structural variants to provide a multitude of critical functional annotations for large genomic variations.

#Package Installation

nanotatoR is currently available from the GitHub repository. Installation method is as follows:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("nanotatoR", version = "3.8")
library("nanotatoR")
## 

The nanotatoR package is compatible with R versions ≥ 3.6.

#Package Functionalities

Given a list of structural variants with chromosome number, SV type, and variant start/end positions, nanotatoR can perform the following functions (both GRCh37/38 can be used):

##Structural Variant Frequency

Genomic variant frequencies are one of the most important filtration characteristics for identification of rare, possibly pathogenic, variants. NanotatoR uses the Database of Genomic Variants (DGV) and DECIPHER (DECIPHER), publicly available reference control database, to estimate structural variant frequencies in the general population. Compared with the traditional single nucleotide variant frequency calculations, frequency estimates for structural variants pose larger difficulty due to the much higher breakpoint variability between “same” structural variants. In order to provide accurate estimates of population frequencies, nanotatoR recognizes five categories of SVs: insertions, deletions, duplications, inversions and translocations (e.g. “gains” of DNA material are considered as “insertions/duplications” and “losses” of DNA material as “deletions”). In order for two SVs to be considered as “same”, nanotatoR, by default, checks whether they belong to the same category (e.g. deletion), have greater than 50% size similarity and the SV breakpoints (start and end positions) are within 10 kbp from each other. Currently the 50% size similarity cutoff is not applied for duplications, inversions and translocations as the sizes for these structural variants are not computed by most SV callers; however, a breakpoint cutoff of 50 kbp is applied in order to identify matching variants (note: duplication breakpoint cutoff is 10 kbp). Default breakpoint cutoffs and percent similarity determined by Bionano Genomics (see the references). In order for the natotatoR to estimate SV frequencies it requires the following input files: DECIPHER reference file(decipherpath) and “smap” file containing structural variant information generated via default Bionano Genomics Solve/Tools Pipelines for optical mapping-based SV calling (smappath). The default input parameters for SV breakpoints and percent similarities are as follows: insertions/duplications/deletions (win_indel) of 10,000 bases, inversions and translocations (win_inv_trans) of 50,000 bases and percentage similarity (perc_similarity) of 0.5 or 50%. The output from this function can be of 2 types: an R object (dataFrame) or a text file (Text).

hgpath=system.file("extdata", "GRCh37_hg19_variants_2016-05-15.txt", package="nanotatoR")
smappath=system.file("extdata", "GM24385_Ason_DLE1_VAP_trio5.smap", package="nanotatoR")
datDGV <- DGVfrequency (hgpath = hgpath, 
smap = smappath, EnzymeType = "SE",
win_indel_DGV = 10000, win_inv_trans_DGV = 50000,
input_fmt_SV = "Text",
perc_similarity_DGV = 0.5,returnMethod="dataFrame")
## [1] "Chromosome: 1"
datDGV[1,]
##       SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
## 1 GM24385_Ason         343         302            1            1    19362181
##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
## 1  19371069   103038897 103048124         -1 deletion      21      21     -1
##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx   Zygosity Genotype GenotypeGroup
## 1        3763      3764       19515     19516 homozygous        1           143
##   RawConfidence RawConfidenceLeft RawConfidenceRight RawConfidenceCenter SVsize
## 1          1.85           4609.43            1541.23                1.85  339.6
##   SVfreq orientation                      Sample           Algorithm Size
## 1  0.514         NA  GM24385_Ason_DLE1_assembly5 assembly_comparison  340
##   Present_in_._of_BNG_control_samples
## 1                                22.5
##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
## 1                                                     80.7
##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
## 1               not_applicable            -               COL11A1
##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_parents_assemblies
## 1                        293899                  -                        both
##   Found_in_parents_molecules Found_in_self_molecules Mother_molecule_count
## 1                       both                     yes                    17
##   Father_molecule_count Self_molecule_count DGV_Count DGV_Freq_Perc
## 1                    19                  31         0             0
decipherpath = system.file("extdata", "population_cnv.txt", package="nanotatoR")
smappath=system.file("extdata", "GM24385_Ason_DLE1_VAP_trio5.smap", package="nanotatoR")
datdecipher <- Decipherfrequency (decipherpath = decipherpath, 
smap = smappath, win_indel = 10000, 
perc_similarity = 0.5,returnMethod="dataFrame", EnzymeType = "SV",
input_fmt_SV = "Text")
## [1] "###Calculating Decipher Frequency###"
datdecipher[1,]
##       SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
## 1 GM24385_Ason         343         302            1            1    19362181
##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
## 1  19371069   103038897 103048124         -1 deletion      21      21     -1
##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx   Zygosity Genotype GenotypeGroup
## 1        3763      3764       19515     19516 homozygous        1           143
##   RawConfidence RawConfidenceLeft RawConfidenceRight RawConfidenceCenter SVsize
## 1          1.85           4609.43            1541.23                1.85  339.6
##   SVfreq orientation                      Sample           Algorithm Size
## 1  0.514         NA  GM24385_Ason_DLE1_assembly5 assembly_comparison  340
##   Present_in_._of_BNG_control_samples
## 1                                22.5
##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
## 1                                                     80.7
##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
## 1               not_applicable            -               COL11A1
##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_parents_assemblies
## 1                        293899                  -                        both
##   Found_in_parents_molecules Found_in_self_molecules Mother_molecule_count
## 1                       both                     yes                    17
##   Father_molecule_count Self_molecule_count DECIPHER_Frequency
## 1                    19                  31                  0

Other than DGV and DECIPHER, nanotatoR can also take as input a data set of variants created by Bionano Genomics comprising of 204 samples. Input parameters are similar to that of DECIPHER, with an addition of following parameters: confidence score threshold for insertion and deletion (indelconf; Default is 0.5), inversion (invconf; Default is 0.01) and translocation (transconf; Default is 0.1 (determined by Bionano Genomics). Note: The Bionano genomic reference database can be downloaded using the following command wget http://bnxinstall.com/solve/Solve3.3_10252018.tar.gz , followed by decompressing it using the tar -xvzf Solve3.3_10252018.tar.gz. The folder containing the database is in the config file which can be found in the folowing directory $PWD/Solve3.3_10252018/VariantAnnotation/10252018/config/. The reference files are named based on the variant type and reference genome. For example: current_ctrl_dup_hg19_anonymize.txt, would be the duplication reference variant file for hg19 reference genome.

    path <- system.file("extdata", "Bionano_config/", package = "nanotatoR")
    pattern <- "*_hg19_*"
    smapName="GM24385_Ason_DLE1_VAP_trio5.smap"
    smap = system.file("extdata", smapName, package="nanotatoR")
    datbndb <- BNDBfrequency(smap = smap, 
        buildBNInternalDB=TRUE, 
        input_fmt_SV = "Text",
        dbOutput="dataframe",
        BNDBpath = path, 
        BNDBpattern = pattern, 
        fname, outpath, 
        win_indel = 10000,
        win_inv_trans = 50000, 
        perc_similarity = 0.5, 
        indelconf = 0.5, 
        invconf = 0.01, 
        limsize = 1000,
        transconf = 0.1,
        returnMethod=c("dataFrame"), 
        EnzymeType = c("SE"))
## [1] "###Cohort Frequency Calculation###"
## [1] "###Building Database###"
## [1] "current_ctrl_dup_hg19_anonymize.txt"
## [1] 198  10
## [1] "current_ctrl_ins_del_hg19_anonymize.txt"
## [1] 200  10
## [1] "current_ctrl_inv_hg19_anonymize.txt"
## [1] 200  11
## [1] "current_ctrl_trans_hg19_anonymize.txt"
## [1] 199  10
## [1] "###Calculating BN Frequency###"
## [1] "Chrom:1"
    datbndb[1,]
##       SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
## 1 GM24385_Ason         343         302            1            1    19362181
##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
## 1  19371069   103038897 103048124         -1 deletion      21      21     -1
##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx   Zygosity Genotype GenotypeGroup
## 1        3763      3764       19515     19516 homozygous        1           143
##   RawConfidence RawConfidenceLeft RawConfidenceRight RawConfidenceCenter SVsize
## 1          1.85           4609.43            1541.23                1.85  339.6
##   SVfreq orientation                      Sample           Algorithm Size
## 1  0.514         NA  GM24385_Ason_DLE1_assembly5 assembly_comparison  340
##   Present_in_._of_BNG_control_samples
## 1                                22.5
##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
## 1                                                     80.7
##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
## 1               not_applicable            -               COL11A1
##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_parents_assemblies
## 1                        293899                  -                        both
##   Found_in_parents_molecules Found_in_self_molecules Mother_molecule_count
## 1                       both                     yes                    17
##   Father_molecule_count Self_molecule_count BNG_Freq_Perc_Filtered
## 1                    19                  31                      0
##   BNG_Freq_Perc_UnFiltered BNG_Homozygotes
## 1                        0               0

##Cohort Analysis

The cohort analysis is designed to provide internal variant frequency, and parental zygosity within the cohort. The function will first merge all of the available individual smaps to form an “internal reference” (buildSVInternalDB=TRUE) or use an existing file if the internal SV database is already available (buildSVInternalDB=FALSE). The function requires the paths of the query smap file, the merged internal database file, as well as the following parameters: confidence score threshold for insertion and deletion (indelconf; Default is 0.5), inversion (invconf; Default is 0.01) and translocation (transconf; Default is 0.1) (determined by Bionano Genomics). The output can be a (dataFrame) or a text file (Text).

smapName= "GM24385_Ason_DLE1_VAP_trio5.smap"
smap = system.file("extdata", smapName, package="nanotatoR")
indelconf = 0.5; invconf = 0.01;transconf = 0.1;input_fmt="Text";
datInf <- internalFrequencyTrio_Duo(smap = smap, 
    buildSVInternalDB=TRUE, win_indel=10000, 
    win_inv_trans=50000, 
    labelType = c("SE"),
    SE_path = system.file("extdata", "SoloFile/", package="nanotatoR"),
    SE_pattern = "*_DLE1_*",
    Samplecodes = system.file("extdata", "nanotatoR_sample_codes.csv", package="nanotatoR"),
    mergeKey = system.file("extdata", "nanotatoR_control_sample_codes.csv", package="nanotatoR"),
    mergedKeyoutpath = system.file("extdata", package="nanotatoR"), 
    mergedKeyFname = "Sample_index.csv", EnzymeType = "SE",
    indexfile = system.file("extdata", mergedKeyFname, package="nanotatoR"),
    perc_similarity=0.5, indelconf=0.5, invconf=0.01, 
    transconf=0.1, limsize=1000, win_indel_parents=5000,
    win_inv_trans_parents=40000,
    returnMethod="dataFrame", input_fmt_SV = "Text")
## [1] "###Calculating the Internal Frequency###"
## [1] "Sister not present"
## [1] "Brother not present"
## [1] "Sibling not present"
## [1] "Cousin not present"
## [1] "Husband not present"
## [1] "Wife not present"
## [1] "Son not present"
## [1] "Daughter not present"
## [1] "Maternal Grand Mother not present"
## [1] "Maternal Grand Father not present"
## [1] "Paternal Grand Mother not present"
## [1] "Paternal Grand Father not present"
## [1] "Maternal Uncle not present"
## [1] "Maternal Aunt not present"
## [1] "Paternal Uncle not present"
## [1] "Paternal Grand Father not present"
## [1] "Mother Absent"
## [1] "Father Absent"
## [1] "Sister not present"
## [1] "Brother not present"
## [1] "Sibling not present"
## [1] "Cousin not present"
## [1] "Husband not present"
## [1] "Wife not present"
## [1] "Son not present"
## [1] "Daughter not present"
## [1] "Maternal Grand Mother not present"
## [1] "Maternal Grand Father not present"
## [1] "Paternal Grand Mother not present"
## [1] "Paternal Grand Father not present"
## [1] "Maternal Uncle not present"
## [1] "Maternal Aunt not present"
## [1] "Paternal Uncle not present"
## [1] "Paternal Grand Father not present"
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] "GM24385_Ason"
## [1] "GM24143_Amother"
## [1] "GM24149_Afather"
## [1] "NA12878"
## [1] "Chrom:1"
datInf[1,]
##       SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
## 1 GM24385_Ason         343         302            1            1    19362181
##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
## 1  19371069   103038897 103048124         -1 deletion      21      21     -1
##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx   Zygosity Genotype GenotypeGroup
## 1        3763      3764       19515     19516 homozygous        1           143
##   RawConfidence RawConfidenceLeft RawConfidenceRight RawConfidenceCenter SVsize
## 1          1.85           4609.43            1541.23                1.85  339.6
##   SVfreq orientation                      Sample           Algorithm Size
## 1  0.514         NA  GM24385_Ason_DLE1_assembly5 assembly_comparison  340
##   Present_in_._of_BNG_control_samples
## 1                                22.5
##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
## 1                                                     80.7
##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
## 1               not_applicable            -               COL11A1
##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_parents_assemblies
## 1                        293899                  -                        both
##   Found_in_parents_molecules Found_in_self_molecules Mother_molecule_count
## 1                       both                     yes                    17
##   Father_molecule_count Self_molecule_count Internal_Freq_Perc_Filtered
## 1                    19                  31                           0
##   Internal_Freq_Perc_Unfiltered Internal_Homozygotes MotherZygosity
## 1                           100                    0     homozygous
##   FatherZygosity
## 1     homozygous

##Gene Overlap

Incorporates known gene and non-coding RNA genomic locations that overlap with or are near the identified structural variants. NanotatoR automatically determines the number of overlapping genes for a given structural variant, provides gene strandedness (+ or -) as well as the percent overlap of SVs with the gene. The function also provides gene names and corresponding distances from SVs for the nearest genes (user selected, default 3 on each side) that are upstream and downstream. The function requires an input BED file (inputfmt=BED) or Bionano compliant BED file (inputfmt=BNBED), which re-codes the X and Y chromosome notations to 23 and 24, respectively. The BED files are used to overlap structural variants of the query smap file (smap) with overlapping and non-overlapping upstream and downstream genes (n;default is 3). The output from this function can be of 2 types: an R object (dataFrame) or a text file (Text).

smapName="GM24385_Ason_DLE1_VAP_trio5.smap"
    smap = system.file("extdata", smapName, package="nanotatoR")
    bedFile <- system.file("extdata", "HomoSapienGRCH19_lift37.bed", package="nanotatoR")
    outpath <- system.file("extdata", package="nanotatoR")
    datcomp<-overlapnearestgeneSearch(smap = smap, 
        bed=bedFile, inputfmtBed = "bed", outpath, 
        n = 3, returnMethod_bedcomp = c("dataFrame"), 
        input_fmt_SV = "Text",
        EnzymeType = "SE", 
        bperrorindel = 3000, bperrorinvtrans = 50000)
## [1] "###Comparing SVs and Beds###"
## [1] "Genome doesnot have any Sex Chromosome"
## [1] "***Overlap Genes***"
## [1] "***NonOverlap Genes***"
    datcomp[1,]
##       SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
## 1 GM24385_Ason         343         302            1            1    19362181
##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
## 1  19371069   103038897 103048124         -1 deletion      21      21     -1
##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx   Zygosity Genotype GenotypeGroup
## 1        3763      3764       19515     19516 homozygous        1           143
##   RawConfidence RawConfidenceLeft RawConfidenceRight RawConfidenceCenter SVsize
## 1          1.85           4609.43            1541.23                1.85  339.6
##   SVfreq orientation                      Sample           Algorithm Size
## 1  0.514         NA  GM24385_Ason_DLE1_assembly5 assembly_comparison  340
##   Present_in_._of_BNG_control_samples
## 1                                22.5
##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
## 1                                                     80.7
##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
## 1               not_applicable            -               COL11A1
##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_parents_assemblies
## 1                        293899                  -                        both
##   Found_in_parents_molecules Found_in_self_molecules Mother_molecule_count
## 1                       both                     yes                    17
##   Father_molecule_count Self_molecule_count OverlapGenes_strand_perc
## 1                    19                  31                        -
##                                Upstream_nonOverlapGenes_dist_kb
## 1 AURKAIP1(-:101726.787);MXRA8(-:101747.828);DVL1(-:101765.241)
##   Downstream_nonOverlapGenes_dist_kb
## 1                                  -

##Entrez Extract

Generates a list of genes involved with the patient phenotype and overlaps it with gene names that span structural variants. The user provided, phenotypic key words are used to generate gene lists from selected databases such as ClinVar, OMIM, GTR and Gene Registry. The generated lists are used to prioritize structural variants that occur in genes known to be associated with patient’s phenotype.

The input to the function is a term, which can be provided as a single character input (method=“Single”), or a vector of terms (method=“Multiple”) or as a text file (method=“Text”). The output can be selected as a (dataFrame) or a text file (Text).

terms="CIRRHOSIS, FAMILIAL"
genes <- gene_list_generation(
    method_entrez = c("Single"), 
    term = terms,
    returnMethod=c("dataFrame"), 
    omimID = "OMIM:118980",
    omim = system.file("extdata", "mim2gene.txt", package="nanotatoR"), 
    clinvar = system.file("extdata", "localPDB/", package="nanotatoR"), 
    gtr = system.file("extdata", "gtrDatabase.txt", package="nanotatoR"),
 downloadClinvar = FALSE, downloadGTR = FALSE)
genes[1:10,]

##Expression Data Integration

The SVexpression_solo and SVexpression_duo_trio functions for solos and duos/trios respectively, provides the user with tools to integrate tissue specific gene expression values with SVs. The function takes as input a matrix of gene names and corresponding expression values for each sample (individual files can be merged by the RNAseqcombine function for duos/trios or RNAseqcombine_solo function for solos).

RNASeqDir = system.file("extdata", package="nanotatoR")
returnMethod="dataFrame"
datRNASeq <- RNAseqcombine_solo(RNASeqDir = RNASeqDir,
returnMethod = returnMethod)
## 'select()' returned 1:1 mapping between keys and columns
smapName="NA12878_DLE1_VAP_solo5.smap"
smap = system.file("extdata", smapName, package="nanotatoR")
smap = system.file("extdata", smapName, package="nanotatoR")
bedFile <- system.file("extdata", "HomoSapienGRCH19_lift37.bed", package="nanotatoR")
outpath <- system.file("extdata", package="nanotatoR")
datcomp<-overlapnearestgeneSearch(smap = smap, 
    bed=bedFile, inputfmtBed = "bed", outpath, 
    n = 3, returnMethod_bedcomp = c("dataFrame"), 
    input_fmt_SV = "Text",
    EnzymeType = "SE", 
    bperrorindel = 3000, 
    bperrorinvtrans = 50000)
## [1] "###Comparing SVs and Beds###"
## [1] "Genome doesnot have any Sex Chromosome"
## [1] "***Overlap Genes***"
## [1] "***NonOverlap Genes***"
datRNASeq1 <- SVexpression_solo (input_fmt_SV=c("dataFrame"),
    smapdata = datcomp,
    input_fmt_RNASeq=c("dataFrame"),
    RNASeqData = datRNASeq,
    outputfmt=c("datFrame"),
    pattern_Proband = "*_P_*", EnzymeType = c("SE"))
## [1] "###OverlapGenes###"
## [1] "###NonOverlapUPStreamGenes###"
## [1] "###NonOverlapDNStreamGenes###"
datRNASeq1[1,]
##   SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
## 1  NA12878         397        3451            1            1    16678749
##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
## 1  16683839   100343445 100348831         -1 deletion      24      24     -1
##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx     Zygosity Genotype
## 1        3260      3261       18927     18928 heterozygous        1
##   GenotypeGroup RawConfidence RawConfidenceLeft RawConfidenceRight
## 1            -1          2.96           4105.98            1852.86
##   RawConfidenceCenter SVsize SVfreq orientation                 Sample
## 1                2.96  296.1   0.51         NA  NA12878_DLE1_assembly5
##             Algorithm Size Present_in_._of_BNG_control_samples
## 1 assembly_comparison  296                                   0
##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
## 1                                                        0
##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
## 1               not_applicable          AGL              BC112312
##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_self_molecules
## 1                         85170                  -                     yes
##   Self_molecule_count OverlapGenes_strand_perc
## 1                  41                        -
##                             Upstream_nonOverlapGenes_dist_kb
## 1 AURKAIP1(-:99031.335);MXRA8(-:99052.376);DVL1(-:99069.789)
##   Downstream_nonOverlapGenes_dist_kb OverlapProbandEXP
## 1                                  -                 -
##         NonOverlapUPprobandEXP NonOverlapDNprobandEXP
## 1 AURKAIP1(-);MXRA8(-);DVL1(-)                      -

##Filter Variants

Provides easy to use, user-selected, filtration criteria to segregate variants into corresponding groups (such as de novo, inherited or occurring in the gene list). In this step the generated or available gene lists are appended to the smap file. The input file for this function can either be an (smap) or a dataframe. Both raw and nanotatoR-annotated smaps can serve as inputs. It has an option to take the input of the smap (input_fmt_svMap) and genelist (input_fmt_geneList) as a dataFrame or a text file, and produces an excel as the output.

smapName <- "GM24385_Ason_DLE1_VAP_trio5.smap"
outputFilename <- "GM24385_Ason_DLE1_VAP_trio5_out"
smappath <- system.file("extdata", smapName, package = "nanotatoR")
outpath <- system.file("extdata", smapName, package = "nanotatoR")
RZIPpath <- system.file("extdata", "zip.exe", package = "nanotatoR")
smap = system.file("extdata", smapName, package="nanotatoR")
bedFile <- system.file("extdata", "HomoSapienGRCH19_lift37.bed", package="nanotatoR")
outpath <- system.file("extdata", package="nanotatoR")
directoryName <- system.file("extdata", package="nanotatoR")
datcomp<-overlapnearestgeneSearch(smap = smap, 
    bed=bedFile, inputfmtBed = "bed", outpath, 
    n = 3, returnMethod_bedcomp = c("dataFrame"), 
    input_fmt_SV = "Text",
    EnzymeType = "SE", 
    bperrorindel = 3000, bperrorinvtrans = 50000)
## [1] "###Comparing SVs and Beds###"
## [1] "Genome doesnot have any Sex Chromosome"
## [1] "***Overlap Genes***"
## [1] "***NonOverlap Genes***"
hgpath=system.file("extdata", "GRCh37_hg19_variants_2016-05-15.txt", package="nanotatoR")
datDGV <- DGVfrequency (hgpath = hgpath, 
    smap_data = datcomp,
    win_indel_DGV = 10000, 
    input_fmt_SV = "dataFrame",EnzymeType = "SE",
    perc_similarity_DGV = 0.5,returnMethod="dataFrame")
## [1] "Chromosome: 1"
    indelconf = 0.5; invconf = 0.01;transconf = 0.1;
datInf <- internalFrequencyTrio_Duo(smapdata = datDGV, 
    buildSVInternalDB=TRUE, win_indel=10000, 
    win_inv_trans=50000, 
    labelType = c("SE"),
    EnzymeType = "SE",
    SE_path = system.file("extdata", "SoloFile/", package="nanotatoR"),
    SE_pattern = "*_DLE1_*", perc_similarity_parents =0.9,
    Samplecodes = system.file("extdata", "nanotatoR_sample_codes.csv", package="nanotatoR"),
    mergeKey = system.file("extdata", "nanotatoR_control_sample_codes.csv", package="nanotatoR"),
    mergedKeyoutpath = system.file("extdata", package="nanotatoR"), 
    mergedKeyFname = "Sample_index.csv",
    indexfile = system.file("extdata", mergedKeyFname, package="nanotatoR"),
    perc_similarity = 0.5, indelconf = 0.5, invconf = 0.01, 
    transconf = 0.1, limsize = 1000, win_indel_parents = 5000,
    win_inv_trans_parents=40000,
    returnMethod="dataFrame", input_fmt_SV = "dataFrame")
## [1] "###Calculating the Internal Frequency###"
## [1] "Sister not present"
## [1] "Brother not present"
## [1] "Sibling not present"
## [1] "Cousin not present"
## [1] "Husband not present"
## [1] "Wife not present"
## [1] "Son not present"
## [1] "Daughter not present"
## [1] "Maternal Grand Mother not present"
## [1] "Maternal Grand Father not present"
## [1] "Paternal Grand Mother not present"
## [1] "Paternal Grand Father not present"
## [1] "Maternal Uncle not present"
## [1] "Maternal Aunt not present"
## [1] "Paternal Uncle not present"
## [1] "Paternal Grand Father not present"
## [1] "Mother Absent"
## [1] "Father Absent"
## [1] "Sister not present"
## [1] "Brother not present"
## [1] "Sibling not present"
## [1] "Cousin not present"
## [1] "Husband not present"
## [1] "Wife not present"
## [1] "Son not present"
## [1] "Daughter not present"
## [1] "Maternal Grand Mother not present"
## [1] "Maternal Grand Father not present"
## [1] "Paternal Grand Mother not present"
## [1] "Paternal Grand Father not present"
## [1] "Maternal Uncle not present"
## [1] "Maternal Aunt not present"
## [1] "Paternal Uncle not present"
## [1] "Paternal Grand Father not present"
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] "GM24385_Ason"
## [1] "GM24143_Amother"
## [1] "GM24149_Afather"
## [1] "NA12878"
## [1] "Chrom:1"
path <- system.file("extdata", "Bionano_config/", package = "nanotatoR")
pattern <- "*_hg19_*"
datBNDB <- BNDBfrequency(smapdata = datInf, 
    buildBNInternalDB=TRUE, 
    input_fmt_SV = "dataFrame",
    dbOutput="dataframe",
    BNDBpath = path, 
    BNDBpattern = pattern, 
    fname, outpath, 
    win_indel = 10000,
    win_inv_trans = 50000, 
    perc_similarity = 0.5, 
    indelconf = 0.5, 
    invconf = 0.01, 
    limsize = 1000,
    transconf = 0.1,
    returnMethod=c("dataFrame"), 
    EnzymeType = c("SE"))
## [1] "###Cohort Frequency Calculation###"
## [1] "###Building Database###"
## [1] "current_ctrl_dup_hg19_anonymize.txt"
## [1] 198  10
## [1] "current_ctrl_ins_del_hg19_anonymize.txt"
## [1] 200  10
## [1] "current_ctrl_inv_hg19_anonymize.txt"
## [1] 200  11
## [1] "current_ctrl_trans_hg19_anonymize.txt"
## [1] 199  10
## [1] "###Calculating BN Frequency###"
## [1] "Chrom:1"
decipherpath = system.file("extdata", "population_cnv.txt", package="nanotatoR")
datdecipher <- Decipherfrequency (decipherpath = decipherpath, 
    smap_data = datBNDB, win_indel = 10000, 
    perc_similarity = 0.5,returnMethod="dataFrame", 
    input_fmt_SV = "dataFrame", EnzymeType = c("SE"))
## [1] "###Calculating Decipher Frequency###"
run_bionano_filter_SE_Trio (input_fmt_geneList = c("Text"),
    input_fmt_SV = c("dataFrame"),
    svData = datdecipher, 
    dat_geneList = dat_geneList,
    RZIPpath = RZIPpath, EnzymeType = c("SE"),
    outputType = c("csv"),
    primaryGenesPresent = FALSE, 
    fileprefix = "AnnotatedSamplesGM24385",
    outputFilename = outputFilename,
    directoryName = directoryName,
    outpath = outpath)
## [1] "fname <- paste(outputFilename, \"_DLE.xlsx\", sep = \"\")\n  write.xlsx(list_of_datasets, file = file.path(outpath, fname),keepNA=TRUE)"

##Main

The main function consecutively runs the available nanotatoR sub-functions by appending the outputs from each function. It takes as an input the smap file, DGV file, BED file, internal database file, phenotype term list as an input. It also takes in the output path and the filename for the final excel file. Individual, sub-function, input parameters are also available for user selections.

smapName="NA12878_DLE1_VAP_solo5.smap"
smap = system.file("extdata", smapName, package="nanotatoR")
bedFile <- system.file("extdata", "HomoSapienGRCH19_lift37.bed", package="nanotatoR")
hgpath=system.file("extdata", "GRCh37_hg19_variants_2016-05-15.txt", package="nanotatoR")
labelType = c("SE")
SE_path = system.file("extdata", "SoloFile/", package="nanotatoR")
SE_pattern = "*_DLE1_*"
Samplecodes = system.file("extdata", "nanotatoR_sample_codes.csv", package="nanotatoR")
mergeKey = system.file("extdata", "nanotatoR_control_sample_codes.csv", package="nanotatoR")
mergedKeyoutpath = system.file("extdata", package="nanotatoR")
mergedKeyFname = "Sample_index.csv"
RNASeqDir = system.file("extdata", "NA12878_P_Blood_S1.genes.results", package="nanotatoR")
path = system.file("extdata", "Bionano_config/", package = "nanotatoR")
pattern = "_hg19.txt"
outputFilename <- "NA12878_DLE1_VAP_solo5_out"
outpath <- system.file("extdata", smapName, package = "nanotatoR")
RZIPpath <- system.file("extdata", "zip.exe", package = "nanotatoR")
directoryName <- system.file("extdata", package="nanotatoR")
nanotatoR_main_Solo_SE(
    smap = smap, bed = bedFile, inputfmt = c("bed"), 
    n=3,
    buildBNInternalDB=TRUE,
    path = path , pattern = pattern, 
    buildSVInternalDB = TRUE,
    EnzymeType = c("SE"),
    labelType = c("SE"),
    SE_path = SE_path, SE_pattern = SE_pattern,
    win_indel_INF = 10000, win_inv_trans_INF = 50000, 
    perc_similarity_INF= 0.5, indelconf = 0.5, invconf = 0.01, 
    transconf = 0.1, 
    hgpath = hgpath, win_indel_DGV = 10000, win_inv_trans_DGV = 50000, 
    perc_similarity_DGV = 0.5,
    RNAseqcombo = TRUE,
    RNASeqDir = RNASeqDir, returnMethod = "dataFrame",
    pattern_Proband = "*_P_*",
    outpath = outpath,
    outputFilename = outputFilename, 
    termListPresent = FALSE,
    datGeneListPresent = FALSE,
    InternaldatabasePresent = FALSE,
    primaryGenesPresent = FALSE,
    fileprefix = "AnnotatedNA12878",
    directoryName = directoryName,
    outputType = c("csv"))
## [1] "####PipeLine Starts####"
## [1] "Time taken to run gene_list_generation is: -1.31130218505859e-05"
## [1] "###Comparing SVs and Beds###"
## [1] "Genome doesnot have any Sex Chromosome"
## [1] "***Overlap Genes***"
## [1] "***NonOverlap Genes***"
## [1] "Time taken to run compSmapbed is: -0.0345029830932617"
## [1] "Chromosome: 1"
## [1] "Time taken to run DGVfrequency is: -0.0355868339538574"
## [1] "###Calculating the Internal Frequency###"
## [1] "internalFrequency cannot work"
## [1] "Time taken to run internalFrequency is: -0.00505685806274414"
## [1] "###Cohort Frequency Calculation###"
## [1] "###Building Database###"
## [1] "BNDBfrequency cannot work"
## [1] "Time taken to run BNDBfrequency is: -0.000342607498168945"
## [1] "###Calculating Decipher Frequency###"
## [1] "Decipherfrequency cannot work"
## [1] "Time taken to run Decipherfrequency is: -0.000157833099365234"
## [1] "RNAseqcombo cannot work"
## [1] "###OverlapGenes###"
## [1] "###NonOverlapUPStreamGenes###"
## [1] "RNAseqcombo cannot work"
## [1] "Time taken to run SmapRNAseqquery is: -0.00619626045227051"
## [1] "run_bionano_filter_SE_solo cannot work"
## [1] "Time taken to run run_bionano_filter_SE_solo is: -0.000889062881469727"

#References

  1. Hayk Barseghyan, Wilson Tang, Richard T. Wang, Miguel Almalvez, Eva Segura, Matthew S. Bramble, Allen Lipson, Emilie D. Douine, Hane Lee, Emmanuèle C. Délot, Stanley F. Nelson and Eric Vilain.Next-generation mapping: a novel approach for detection of pathogenic structural variants with a potential utility in clinical diagnosis. Genome Medicine 2017 9:90. https://doi.org/10.1186/s13073-017-0479-0

  2. Winter, D. J. rentrez: an R package for the NCBI eUtils API The R Journal 2017 9(2):520-526

  3. Christopher Brown. hash: Full feature implementation of hash/associated arrays/dictionaries.2013. R package version 2.2.6. https://CRAN.R-project.org/package=hash

  4. Hadley Wickham. stringr: Simple, Consistent Wrappers for Common String Operations. 2018. R package version 1.3.1. https://CRAN.R-project.org/package=stringr

  5. Bionano Genomics. Theory Of Operation – Structural Variant Calling. https://bionanogenomics.com/wp-content/uploads/2018/04/30110-Bionano-Solve-Theory-of-Operation-Structural-Variant-Calling.pdf

  6. Bionano Genomics. Theory of Operation - Variant Annotation Pipeline. https://bionanogenomics.com/wp-content/uploads/2018/04/30190-Bionano-Solve-Theory-of-Operation-Variant-Annotation-Pipeline.pdf