The raw FASTQ read files were preprocessed with the dnaloop
(https://github.com/aryeelab/dnaloop) package, a PyPi
package that relies on samtools
, bedtools
, cutadapt
, and MACS2
to align reads, call anchor peaks, and summarize PETs per sample in the ChIA-PET experiment.
To use the loopsMake
function from a different preprocessing step, have files X.loop_counts.bedpe
,Y.loop_counts.bedpe
, Z.loop_counts.bedpe
in bed_dir
for samples = (X,Y,Z)
where the first lines should resemble:
1 10002272 10004045 10 120968807 120969483 . 1
1 10002272 10004045 10 99551498 99552470 . 1
1 10002272 10004045 1 10002272 10004045 . 17
where the first three columns specify the position (chr:start:stop) of the first anchor, the second three columns specify the position (chr:start:stop) of the secnd anchor, the 7th column is “.” and the 8th column is the number of paired-end reads support that particular PET.
We read in data from the preprocessing pipeline using the loopsMake
function. The example below uses sample data included in the diffloopdata
package. It contains interactions involving chromosome 1 from a set of six Cohesin ChIA-PET samples.\(^{1,2}\) You would normally set bed_dir
to your real data directory.
bed_dir <- system.file("extdata", "esc_jurkat", package="diffloopdata")
bed_dir
## [1] "/home/biocbuild/bbs-3.3-bioc/R/library/diffloopdata/extdata/esc_jurkat"
samples <- c("naive_esc_1", "naive_esc_2", "primed_esc_1", "primed_esc_2",
"jurkat_1", "jurkat_2")
Here, we use 1500 bp as a value to merge nearby anchors into a single peak. In the loopsMake
function, the default is zero and should be parameterized based on the resolution of the ChIA-PET Experiment.
full <- loopsMake(bed_dir, samples, 1500)
celltypes <- c("naive", "naive", "primed", "primed", "jurkat", "jurkat")
full <- updateLDGroups(full, celltypes)
head(full, 3)
## An object of class "loops"
## Slot "anchors":
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 1 [713970, 714336] *
## [2] 1 [804939, 805810] *
## [3] 1 [839766, 841153] *
## -------
## seqinfo: 51 sequences from an unspecified genome; no seqlengths
##
## Slot "interactions":
## left right
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
##
## Slot "counts":
## naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## [1,] 2 2 8 9 2 6
## [2,] 0 1 1 0 0 0
## [3,] 0 1 0 0 0 0
##
## Slot "colData":
## sizeFactor groups
## naive_esc_1 1 naive
## naive_esc_2 1 naive
## primed_esc_1 1 primed
## primed_esc_2 1 primed
## jurkat_1 1 jurkat
## jurkat_2 1 jurkat
##
## Slot "rowData":
## loopWidth
## 1 0
## 2 90604
## 3 125431
Alternatively, one could run full <- loopsMake(bed_dir)
and the samples
would be inferred from the directory contents. The default call to loopsMake
does not merge nearby anchors but instead keeps them distinct. If we want to group nearby anchors after the initial data read, one can run the mergeAnchors
command with the appropriate gap value.
We’ll now look at a PCA plot of the raw data:
pcaPlot(full) + geom_text(aes(label=samples))
From this initial PCA plot, it’s not clear how the samples separate based on their topologies, suggesting some quality control may be warranted.
First, we normalize the loops
object, which updates the sizeFactor
of each sample by the number of reads identical to the normalization technique employed in DESeq2
. We recommend computing the size factors on the original loops
object to generate the best normalization factors.
full <- calcLDSizeFactors(full)
head(full,3)
## An object of class "loops"
## Slot "anchors":
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 1 [713970, 714336] *
## [2] 1 [804939, 805810] *
## [3] 1 [839766, 841153] *
## -------
## seqinfo: 51 sequences from an unspecified genome; no seqlengths
##
## Slot "interactions":
## left right
## [1,] 1 1
## [2,] 1 2
## [3,] 1 3
##
## Slot "counts":
## naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## [1,] 2 2 8 9 2 6
## [2,] 0 1 1 0 0 0
## [3,] 0 1 0 0 0 0
##
## Slot "colData":
## sizeFactor groups
## naive_esc_1 0.9164864 naive
## naive_esc_2 1.3742819 naive
## primed_esc_1 0.8946324 primed
## primed_esc_2 1.3785983 primed
## jurkat_1 0.5986950 jurkat
## jurkat_2 1.1775918 jurkat
##
## Slot "rowData":
## loopWidth
## 1 0
## 2 90604
## 3 125431
Looking at the full dataset, we can see that a majority of the counts were within self-loops:
loopMetrics(full)
## naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## self 131766 195502 120366 189452 89371 192755
## unique 139058 196861 202804 249808 78828 167673
We define “valid” looping in a combined experiment as those supported by at least 3 replicates in 2 or more samples. We can generate valid loops by using the filterLoops
command
full_5kb <- filterLoops(full, width=5000, nreplicates=3, nsamples = 1)
dim(full_5kb)
## anchors interactions samples colData rowData
## 1 3321 3457 6 2 1
valid_full <- filterLoops(full, width=5000, nreplicates=3, nsamples = 2)
dim(valid_full)
## anchors interactions samples colData rowData
## 1 1508 1223 6 2 1
We can also see the proportion of loops that occur within the same chromosome (intrachromosomal) or have loops that bridge chromosomes (interchromosomal).
dim(intrachromosomal(valid_full))
## anchors interactions samples colData rowData
## 1 1506 1221 6 2 1
dim(interchromosomal(valid_full))
## anchors interactions samples colData rowData
## 1 3 2 6 2 1
Finally, we can determine the number of anchors supported by each sample as well as the loops with different thresholds of counts.
numLoops(valid_full,1:5)
## nloops naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1
## 1 1 641 948 813 1043 422
## 2 2 443 777 605 919 260
## 3 3 249 519 380 641 138
## 4 4 145 353 239 470 79
## 5 5 89 261 175 335 53
## jurkat_2
## 1 782
## 2 663
## 3 479
## 4 377
## 5 294
numAnchors(valid_full)
## naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## 1 546 222 332 143 896 423
pcaPlot(valid_full) + geom_text(aes(label=samples)) +
expand_limits(x = c(-20, 34))
Here, the PC plot is much better as the first principal component separates the stem cells and the jurkat samples while the second PC separates the two stem cell samples.
Here, we describe the most general association method first fitting the model then doing pairwise comparisons by our choice of parameterizing the loopTest function.
# Fit edgeR Models Pairwise between cell types
fit <- loopFit(valid_full)
## The coefficients of the fitted GLM object are:
## (Intercept) groupsnaive groupsprimed
jn_res <- loopTest(fit,coef=2)
jp_res <- loopTest(fit,coef=3)
np_res <- loopTest(fit,contrast=c(0,-1,1))
head(np_res)
## An object of class "loops"
## Slot "anchors":
## GRanges object with 6 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 1 [ 919114, 920102] *
## [2] 1 [ 967614, 969793] *
## [3] 1 [ 999024, 999947] *
## [4] 1 [1283357, 1285466] *
## [5] 1 [1306905, 1308004] *
## [6] 1 [1875093, 1876136] *
## -------
## seqinfo: 51 sequences from an unspecified genome; no seqlengths
##
## Slot "interactions":
## left right
## [1,] 1 3
## [2,] 2 3
## [3,] 4 5
## [4,] 6 8
## [5,] 7 8
## [6,] 9 10
##
## Slot "counts":
## naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## [1,] 0 1 1 3 4 5
## [2,] 0 2 3 2 1 3
## [3,] 4 3 0 3 1 1
## [4,] 1 3 3 3 0 4
## [5,] 16 14 4 7 3 6
## [6,] 3 4 3 14 3 13
##
## Slot "colData":
## sizeFactor groups
## naive_esc_1 0.9164864 naive
## naive_esc_2 1.3742819 naive
## primed_esc_1 0.8946324 primed
## primed_esc_2 1.3785983 primed
## jurkat_1 0.5986950 jurkat
## jurkat_2 1.1775918 jurkat
##
## Slot "rowData":
## loopWidth logFC logCPM F PValue FDR
## 1 78923 1.5469317 10.161615 1.1079860 0.29258948 0.8458779
## 2 29232 1.0051558 9.978743 0.5979946 0.43939401 0.9124109
## 3 21440 -1.3806368 10.044638 1.7261493 0.18898555 0.7721481
## 4 99664 0.3338366 10.150988 0.1070243 0.74357632 0.9833146
## 5 83277 -1.7004822 11.427281 8.5938853 0.00339378 0.2755610
## 6 185283 0.9849968 11.139303 1.9651851 0.16104446 0.7721481
We could also have computed pairwise associations by subsetting the data and using quickAssoc, which only allows loops objects with two group classes.
np <- valid_full[,1:4]
summary(np[1:3,])
## chr_1 start_1 end_1 chr_2 start_2 end_2 naive_esc_1 naive_esc_2
## 1 1 919114 920102 1 999024 999947 0 1
## 2 1 967614 969793 1 999024 999947 0 2
## 3 1 1283357 1285466 1 1306905 1308004 4 3
## primed_esc_1 primed_esc_2 loopWidth
## 1 1 3 78923
## 2 3 2 29232
## 3 0 3 21440
np_res2 <- quickAssoc(np)
Instead of determining the loops that distinguish these conditions from each other pairwise, this framework allows us to determine the loops that are significantly different between all of these conditions:
jnp_res <- loopTest(fit,coef=2:3)
head(jnp_res)
## An object of class "loops"
## Slot "anchors":
## GRanges object with 6 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 1 [ 919114, 920102] *
## [2] 1 [ 967614, 969793] *
## [3] 1 [ 999024, 999947] *
## [4] 1 [1283357, 1285466] *
## [5] 1 [1306905, 1308004] *
## [6] 1 [1875093, 1876136] *
## -------
## seqinfo: 51 sequences from an unspecified genome; no seqlengths
##
## Slot "interactions":
## left right
## [1,] 1 3
## [2,] 2 3
## [3,] 4 5
## [4,] 6 8
## [5,] 7 8
## [6,] 9 10
##
## Slot "counts":
## naive_esc_1 naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2
## [1,] 0 1 1 3 4 5
## [2,] 0 2 3 2 1 3
## [3,] 4 3 0 3 1 1
## [4,] 1 3 3 3 0 4
## [5,] 16 14 4 7 3 6
## [6,] 3 4 3 14 3 13
##
## Slot "colData":
## sizeFactor groups
## naive_esc_1 0.9164864 naive
## naive_esc_2 1.3742819 naive
## primed_esc_1 0.8946324 primed
## primed_esc_2 1.3785983 primed
## jurkat_1 0.5986950 jurkat
## jurkat_2 1.1775918 jurkat
##
## Slot "rowData":
## loopWidth logFC.groupsnaive logFC.groupsprimed logCPM F
## 1 78923 -3.0513552 -1.5044234 10.161615 3.29854703
## 2 29232 -1.0658210 -0.0606652 9.978743 0.38590748
## 3 21440 1.5542538 0.1736170 10.044638 1.23257697
## 4 99664 -0.1392212 0.1946153 10.150988 0.05492481
## 5 83277 1.5988669 -0.1016154 11.427281 5.57025931
## 6 185283 -1.2822134 -0.2972166 11.139303 1.72873888
## PValue FDR
## 1 0.037046364 0.21887779
## 2 0.679861009 0.82161069
## 3 0.291661009 0.53700916
## 4 0.946557097 0.96844419
## 5 0.003841779 0.06581858
## 6 0.177652683 0.43629051
What regions are significantly different between jurkat and naive? We can answer this by doing a sliding window test by specifying the window size and the step size in the slidingWindowTest
function.
sw_jn <- slidingWindowTest(jn_res,200000,200000)
head(sw_jn)
## chr start stop n FDR PValue
## 443 1 89519114 89719114 2 0.001010453 2.252961e-06
## 444 1 89719114 89919114 3 0.001010453 3.379441e-06
## 843 1 169519114 169719114 5 0.003292691 1.651852e-05
## 783 1 157519114 157719114 2 0.003365320 2.251050e-05
## 155 1 31919114 32119114 6 0.005379398 4.592769e-05
## 731 1 147119114 147319114 2 0.005379398 5.397389e-05
What gene bodies have significantly different looping between the cell lines? Whereas the slidingWindowTest
makes no prior assumptions about the regions to be tested, featureTest
requires a GRanges
object of coordinates to determine what regions will be tested for association.
chr1 <- getHumanGenes(c("1"))
ft_jnp <- featureTest(jnp_res,chr1)
head(ft_jnp,10)
## chr start stop n feature FDR PValue
## 1533 1 169514166 169586588 4 F5 6.611069e-07 1.969778e-09
## 1535 1 169662007 169854080 5 C1orf112 6.611069e-07 2.462223e-09
## 863 1 89524836 89597864 2 LRRC8B 1.157464e-06 6.466280e-09
## 865 1 89633140 89933250 5 RP11-302M6.4 1.736196e-06 1.616570e-08
## 866 1 89821014 89936611 5 LRRC8D 1.736196e-06 1.616570e-08
## 1378 1 157513377 157552520 2 FCRL5 5.504059e-06 6.149787e-08
## 1108 1 147175351 147243050 1 FMO5 2.686776e-04 3.502316e-06
## 432 1 31906421 31944856 1 PTP4A2 9.383969e-04 1.572732e-05
## 1617 1 183248237 183418602 2 NMNAT2 9.383969e-04 1.556837e-05
## 1512 1 167094045 167129165 1 DUSP27 1.951215e-03 3.633548e-05
How many differential loops are present pairwise between these conditions?
np_sig_res <- topLoops(np_res, FDR = 0.2)
dim(np_sig_res)
## anchors interactions samples colData rowData
## 1 14 7 6 2 6
dim(topLoops(jp_res, FDR = 0.2))
## anchors interactions samples colData rowData
## 1 295 169 6 2 6
dim(topLoops(jn_res, FDR = 0.2))
## anchors interactions samples colData rowData
## 1 373 208 6 2 6
summary(np_sig_res)
## chr_1 start_1 end_1 chr_2 start_2 end_2 naive_esc_1
## 1 1 15388841 15391302 1 15426303 15427898 10
## 2 1 113423172 113424209 1 113466536 113468099 5
## 3 1 120428104 120429676 1 120510345 120511491 4
## 4 1 161039764 161042346 1 161066522 161068559 12
## 5 1 183250284 183251529 1 183287447 183288171 6
## 6 1 201528472 201529494 1 201761758 201764758 1
## 7 1 202070278 202071285 1 202087247 202088661 5
## naive_esc_2 primed_esc_1 primed_esc_2 jurkat_1 jurkat_2 loopWidth
## 1 8 1 1 0 2 35002
## 2 6 0 0 0 0 42328
## 3 7 0 0 0 1 80670
## 4 24 4 7 2 6 24177
## 5 10 0 1 0 0 35919
## 6 0 8 8 2 6 232265
## 7 8 0 0 0 0 15963
## logFC logCPM F PValue FDR
## 1 -3.247621 10.556695 13.85304 0.0002006411 0.0817947
## 2 -5.530195 9.986114 11.69001 0.0006352147 0.1449510
## 3 -5.527957 10.043625 12.42061 0.0004298078 0.1314137
## 4 -1.911138 11.519839 11.47942 0.0007111252 0.1449510
## 5 -3.896584 10.316748 14.25452 0.0001622045 0.0817947
## 6 3.476419 10.662947 10.91372 0.0009637006 0.1683723
## 7 -5.764451 10.104400 14.98909 0.0001100176 0.0817947
At FDR < 0.2, only 7 loops are significantly different from the naive and primed stem cells whereas the jurkat differs between the primed stem cells by 169 significant loops and the naive by 208 significant loops on chromosome 1. These results are in line with the underlying biology of these cell types. Now we can visualize a PC plot using only loops that are significantly different between celltypes
pcaPlot(topLoops(jnp_res, FDR = 0.05)) + geom_text(aes(label=samples))
Here, we see a much tighter cluster of the cell types, which may be expected since the only variables introduced into the principal components are the loops that are statistically different between the cell types.
We can easily visualize the looping structure of a region specfied by a GRanges
object in each sample. We also afford the annotation of the region with the genes for a particular organism. Currently, mouse and human (default) are supported.
regA <- GRanges(c("1"),ranges=IRanges(c(36000000),c(36300000)))
full.plot <- loopPlot(jn_res, regA)
\(^1\)Hnisz, Denes, et al. “Activation of proto-oncogenes by disruption of chromosome neighborhoods.” Science (2016): aad9024.
\(^2\)Ji, Xiong, et al. “3D Chromosome Regulatory Landscape of Human Pluripotent Cells.” Cell stem cell (2015).