Domain Calling Pipelines

  Overview

The 4DN Hi-C data processing pipeline produces contacts lists and contact matrices. The contact matrices show signals of genomic compartments (Liberman-Aiden et al. 2009) and local regions of enriched intra-contacts. The local contact enrichments have been interpreted to represent the existence of topologically associating domains, or TADs (Dixon et al. 2012). Two domain calling workflows in the 4DN Data Portal utilize the contact matrices to report compartments and local contact enrichment regions. Eigenvector decomposition of the matrix can be used to identify active (A) and inactive (B) compartments. And dips in the insulation score along the diagonal of the matrix define boundaries between high intra-contact domains (Crane et al. 2015). To perform the domain calls, we utilize the cooltools library which implements algorithms that are commonly utilized in literature on contact matrices in the cooler format.

Lieberman-Aiden, E., Berkum, N., Williams, L. et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science 326, 289-293 (2009). https://doi.org/10.1126/science.1181369

Dixon, J., Selvaraj, S., Yue, F. et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012). https://doi.org/10.1038/nature11082

Crane, E., Bian, Q., McCord, R. et al. Condensin-driven remodelling of X chromosome topology during dosage compensation. Nature 523, 240–244 (2015). https://doi.org/10.1038/nature14450

  Insulation Scores and Boundaries

Methods

The workflow uses the cooltools software to call diamond insulation scores using a 100kb window size and either a 5kb or 10kb bin size, for 4-cutter or 6-cutter restriction enzyme respectively. A bin size of 5kb is also used for MNase and DNase based assays. Local minima of the chromosome-wide topographic prominence track for log2(insulation score) above a 0.2 threshold are defined as boundaries. Please note that insulation score calls are not provided on datasets with less than 100M filtered reads, since results on low resolution data sets are found to be less reproducible and reliable.

The insulation score and boundary caller workflow components are pre-installed in a publicly available Docker image (4dndcic/4dn-insulation-scores-and-boundaries-caller:v1) on Docker Hub. The source code for the Docker image and the workflow description in Common Workflow Language (CWL) can be found on 4DN-DCIC GitHub repo.

Boundary Score Assessment

In the absence of gold standard truth information, it is not possible to assign a statistical score to the boundary calls presented. Based on the following assessment, it has been deemed useful to qualify boundaries as weak (0.2<=prominence<0.5) and strong (prominence>=0.5) to create two separate boundary lists for when high sensitivity vs. high specificity is desired.

The Micro-C dataset 4DNES21D8SP8 and the Hi-C dataset 4DNES2M5JIGV were used as reference for assessing boundary thresholds. These datasets are based on 4DN Tier H1-ESC cells grown with standard protocols and constitute some of the highest resolution genome-wide 3C maps. Here, we vary the thresold on prominence score to define boundary sets and assess their overlap to CTCF peaks. Boundaries within 5kb of CTCF ChIP-seq peaks (obtained from ENCODE) are considered as overlapping a CTCF site.

The top plots represent the number of boundaries obtained after a minimum boundary strength score is chosen as threshold. The bottom plots represent the proportion of boundaries within 5kb distance from a CTCF region after a minimum boundary strength score is chosen as threshold. 0.2 and 0.5 were selected as the weak and strong boundary thresholds respectively.

To further assess the reliability of insulation score calls in data sets of different protocols and sequencing depths, we present a comparison of boundary calls between different data sets. We present the boundary calls with a score above the 0.2 from the Micro-C dataset- the dataset with the highest resolution - as the most reliable "true" set of boundaries. We compare that to two Hi-C datasets of different sequencing depths on the same cell line. Boundaries in two data sets are considered to overlap if they are within 10kb of each other.
The top plots represent the boundary count for Micro-C set, Hi-C set and their overlap (within a 10kb distance) and the bottom represents the proportion of the overlap set to the number of boundaries called. The left plots are the results for a dataset with 2.5 billion reads (4DNES2M5JIGV) and the right plots are the results for dataset with 415 million reads (4DNESRJ8KV4Q).

Comparison to Community Submissions

Below are the insulation scores and boundary calls produced by different members of the 4DN Network, along with the results produced by DCIC.

  Compartments

Methods

The workflow uses the cooltools software to call compartment signals in cis for each chromosome. A principal component analysis is performed on the matrix at 250kb resolution. The Eigenvector among the top three with highest correlation to GC content is selected and oriented such that positive values correspond to high GC.

The workflow components are pre-installed in a publicly available Docker image (4dndcic/4dn-compartments-caller:v1.2) on Docker Hub. The source code for the Docker image and the pipeline description in Common Workflow Language (CWL) can be found on 4DN-DCIC GitHub repo.

Comparison to Community Submissions

Below are the compartments signals calls produced by different members of the 4DN Network along with the results produced by DCIC.

  Comparison Across Cell Types

While insulation and compartment score tracks show cell-to-cell variability, interphase cells of various cell types share a significant portion of their chromosome conformations, due to the presence of constitutive heterochromatin, housekeeping genes, and other constitutive features. Therefore, genome-wide correlation for these tracks between cell types can constitute a quality control on the score calculations.

After the pipelines were run on the released data in the data portal (insulation scores were only run on mcool files with over 100M reads, see above), Spearman correlations were calculated relative to reference tracks. The deep Micro-C datasets for H1-ESC (4DNES21D8SP8) and HFFc6 (4DNESWST3UBH) were used as reference for insulation scores and compartments respectively. Below is the distribution of those correlations.





The datasets were divided into six categories: **Low Cis/Trans Ratio:** Datasets that have either a cis/trans ratio less than 50 or less than 40 for H1 cell lines. **CTCF & Cohesion Depletion:** Datasets where CTCF or Cohesin were depleted. **Cohesion Release Reduction:** Datasets where Cohesin release factors (WALP & PDS5A/B) were depleted. **Protocol Variations:** Datasets produced by major variations to the standard Hi-C protocol, such as skipping the ligation step. ** Cell Cycle Studies:** Datasets where different phases of the cell cycle were studied. **Regular:** Datasets that do not fall in any of the categories above. The table used to generate the plot above can be downloaded here.