Insulation & boundaries

Welcome to the contact insulation notebook!

Insulation is a simple concept, yet a powerful way to look at C data. Insulation is one aspect of locus-specific contact frequency at small genomic distances, and reflects the segmentation of the genome into domains.

Insulation can be computed with multiple methods. One of the most common methods involves using a diamond-window score to generate an insulation profile. To compute this profile, slide a diamond-shaped window along the genome, with one of the corners on the main diagonal of the matrix, and sum up the contacts within the window for each position.

Insulation profiles reveal that certain locations have lower scores, reflecting lowered contact frequencies between upstream and downstream loci. These positions are often referred to as boundaries, and are also obtained with multiple methods. Here we illustrate one thresholding method for determining boundaries from an insulation profile.

In this notebook we:

  • Calculate the insulation score genome-wide and display it alongside an interaction matrix

  • Call insulating boundaries

  • Filter insulating boundaries based on their strength

  • Calculate enrichment of CTCF/genes at boundaries

  • Repeat boundary filtering based on enrichmnent of CTCF, a known insulator protein in mammalian genomes

[1]:
# import standard python libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
[2]:
# Import python package for working with cooler files and tools for analysis
import cooler
import cooltools.lib.plotting
from cooltools import insulation

from packaging import version
if version.parse(cooltools.__version__) < version.parse('0.5.4'):
    raise AssertionError("tutorials rely on cooltools version 0.5.4 or higher,"+
                         "please check your cooltools version and update to the latest")
[3]:
# download test data
# this file is 145 Mb, and may take a few seconds to download
import cooltools
data_dir = './data/'
cool_file = cooltools.download_data("HFF_MicroC", cache=True, data_dir=data_dir)
print(cool_file)
./data/test.mcool

Calculating genome-wide contact insulation

Here we load the Hi-C data at 10 kbp resolution and calculate insulation score with 4 different window sizes

[4]:
resolution = 10000
clr = cooler.Cooler(f'{data_dir}test.mcool::resolutions/{resolution}')
windows = [3*resolution, 5*resolution, 10*resolution, 25*resolution]
insulation_table = insulation(clr, windows, verbose=True)
INFO:root:fallback to serial implementation.
INFO:root:Processing region chr2
INFO:root:Processing region chr17

This function returns a dataframe where rows correspond to genomic bins of the cooler.

The columns of this insulation dataframe report the insulation score, the number of valid (non-nan) pixels, whether the given bin is valid, the boundary prominence (strength) and whether locus is called as a boundary after thresholding, for each of the window sizes provided to the function.

Below we print the information returned for any window size, as well as the specific information for the largest window used:

[5]:
first_window_summary =insulation_table.columns[[ str(windows[-1]) in i for i in insulation_table.columns]]

insulation_table[['chrom','start','end','region','is_bad_bin']+list(first_window_summary)].iloc[1000:1005]
[5]:
chrom start end region is_bad_bin log2_insulation_score_250000 n_valid_pixels_250000 boundary_strength_250000 is_boundary_250000
1000 chr2 10000000 10010000 chr2 False 0.309791 622.0 NaN False
1001 chr2 10010000 10020000 chr2 False 0.226045 622.0 NaN False
1002 chr2 10020000 10030000 chr2 False 0.090809 622.0 NaN False
1003 chr2 10030000 10040000 chr2 False -0.101091 622.0 NaN False
1004 chr2 10040000 10050000 chr2 False -0.342858 622.0 NaN False
[6]:
# Functions to help with plotting
def pcolormesh_45deg(ax, matrix_c, start=0, resolution=1, *args, **kwargs):
    start_pos_vector = [start+resolution*i for i in range(len(matrix_c)+1)]
    import itertools
    n = matrix_c.shape[0]
    t = np.array([[1, 0.5], [-1, 0.5]])
    matrix_a = np.dot(np.array([(i[1], i[0])
                                for i in itertools.product(start_pos_vector[::-1],
                                                           start_pos_vector)]), t)
    x = matrix_a[:, 1].reshape(n + 1, n + 1)
    y = matrix_a[:, 0].reshape(n + 1, n + 1)
    im = ax.pcolormesh(x, y, np.flipud(matrix_c), *args, **kwargs)
    im.set_rasterized(True)
    return im

from matplotlib.ticker import EngFormatter
bp_formatter = EngFormatter('b')
def format_ticks(ax, x=True, y=True, rotate=True):
    if y:
        ax.yaxis.set_major_formatter(bp_formatter)
    if x:
        ax.xaxis.set_major_formatter(bp_formatter)
        ax.xaxis.tick_bottom()
    if rotate:
        ax.tick_params(axis='x',rotation=45)

Let’s see what the insulation track at the highest resolution looks like, next to a rotated Hi-C matrix.

[7]:
from matplotlib.colors import LogNorm
from mpl_toolkits.axes_grid1 import make_axes_locatable
import bioframe
plt.rcParams['font.size'] = 12

start = 10_500_000
end = start+ 90*windows[0]
region = ('chr2', start, end)
norm = LogNorm(vmax=0.1, vmin=0.001)
data = clr.matrix(balance=True).fetch(region)
f, ax = plt.subplots(figsize=(18, 6))
im = pcolormesh_45deg(ax, data, start=region[1], resolution=resolution, norm=norm, cmap='fall')
ax.set_aspect(0.5)
ax.set_ylim(0, 10*windows[0])
format_ticks(ax, rotate=False)
ax.xaxis.set_visible(False)

divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="1%", pad=0.1, aspect=6)
plt.colorbar(im, cax=cax)

insul_region = bioframe.select(insulation_table, region)

ins_ax = divider.append_axes("bottom", size="50%", pad=0., sharex=ax)
ins_ax.set_prop_cycle(plt.cycler("color", plt.cm.plasma(np.linspace(0,1,5))))
ins_ax.plot(insul_region[['start', 'end']].mean(axis=1),
            insul_region['log2_insulation_score_'+str(windows[0])],
            label=f'Window {windows[0]} bp')

ins_ax.legend(bbox_to_anchor=(0., -1), loc='lower left', ncol=4);

format_ticks(ins_ax, y=False, rotate=False)
ax.set_xlim(region[1], region[2])
[7]:
(10500000.0, 13200000.0)
../_images/notebooks_insulation_and_boundaries_11_1.png

And now let’s add the other window sizes.

[8]:
for res in windows[1:]:
    ins_ax.plot(insul_region[['start', 'end']].mean(axis=1), insul_region[f'log2_insulation_score_{res}'], label=f'Window {res} bp')
ins_ax.legend(bbox_to_anchor=(0., -1), loc='lower left', ncol=4);
f
[8]:
../_images/notebooks_insulation_and_boundaries_13_0.png

This really highlights how much the result is dependent on window size: smaller windows are sensitive to local structure, whereas large windows capture regions that insulate at larger scales.

Boundary calling

The insulation table also has annotations for valleys of the insulation score, which correspond to highly insulating regions, such as TAD boundaries. All potential boundaries have an assigned boundary_strength_ column. Additionally, this strength is thresholded to find regions that insulate particularly strongly, and this is recorded in the is_boundary_ columns.

Let’s repeat the previous plot and show where we found the boundaries:

[9]:
f, ax = plt.subplots(figsize=(20, 10))
im = pcolormesh_45deg(ax, data, start=region[1], resolution=resolution, norm=norm, cmap='fall')
ax.set_aspect(0.5)
ax.set_ylim(0, 10*windows[0])
format_ticks(ax, rotate=False)
ax.xaxis.set_visible(False)

divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="1%", pad=0.1, aspect=6)
plt.colorbar(im, cax=cax)

insul_region = bioframe.select(insulation_table, region)

ins_ax = divider.append_axes("bottom", size="50%", pad=0., sharex=ax)

ins_ax.plot(insul_region[['start', 'end']].mean(axis=1),
            insul_region[f'log2_insulation_score_{windows[0]}'], label=f'Window {windows[0]} bp')

boundaries = insul_region[~np.isnan(insul_region[f'boundary_strength_{windows[0]}'])]
weak_boundaries = boundaries[~boundaries[f'is_boundary_{windows[0]}']]
strong_boundaries = boundaries[boundaries[f'is_boundary_{windows[0]}']]
ins_ax.scatter(weak_boundaries[['start', 'end']].mean(axis=1),
            weak_boundaries[f'log2_insulation_score_{windows[0]}'], label='Weak boundaries')
ins_ax.scatter(strong_boundaries[['start', 'end']].mean(axis=1),
            strong_boundaries[f'log2_insulation_score_{windows[0]}'], label='Strong boundaries')

ins_ax.legend(bbox_to_anchor=(0., -1), loc='lower left', ncol=4);

format_ticks(ins_ax, y=False, rotate=False)
ax.set_xlim(region[1], region[2])

[9]:
(10500000.0, 13200000.0)
../_images/notebooks_insulation_and_boundaries_18_1.png

Calculating boundary strength

Let’s inspect the histogram of boundary strengths to show how we selected the strong boundaries.

First, boundary strength is calculated using the peak prominence, on the dips (or minima) of the insulation profile.

[10]:
histkwargs = dict(
    bins=10**np.linspace(-4,1,200),
    histtype='step',
    lw=2,
)

f, axs = plt.subplots(len(windows),1, sharex=True, figsize=(6,6), constrained_layout=True)
for i, (w, ax) in enumerate(zip(windows, axs)):
    ax.hist(
        insulation_table[f'boundary_strength_{w}'],
        **histkwargs
    )
    ax.text(0.02, 0.9,
             f'Window {w//1000}kb',
             ha='left',
             va='top',
             transform=ax.transAxes)

    ax.set(
        xscale='log',
        ylabel='# boundaries'
    )

axs[-1].set(xlabel='Boundary strength');
../_images/notebooks_insulation_and_boundaries_21_0.png

As a quick way to automatically threshold the histogram, we borrow the tresholding methods from the image analysis field. These include Li (default threshold="Li") or Otsu, as implemented in scikit-image. Otsu is more conservative, whereas Li is more permissive.

In practice these thresholds work well for a simple parameter-free method for mammalian interphase data, though should be double-checked for any individual dataset.

[11]:
from skimage.filters import threshold_li, threshold_otsu

f, axs = plt.subplots(len(windows), 1, sharex=True, figsize=(6,6), constrained_layout=True)
thresholds_li = {}
thresholds_otsu = {}
for i, (w, ax) in enumerate(zip(windows, axs)):
    ax.hist(
        insulation_table[f'boundary_strength_{w}'],
        **histkwargs
    )
    thresholds_li[w] = threshold_li(insulation_table[f'boundary_strength_{w}'].dropna().values)
    thresholds_otsu[w] = threshold_otsu(insulation_table[f'boundary_strength_{w}'].dropna().values)
    n_boundaries_li = (insulation_table[f'boundary_strength_{w}'].dropna()>=thresholds_li[w]).sum()
    n_boundaries_otsu = (insulation_table[f'boundary_strength_{w}'].dropna()>=thresholds_otsu[w]).sum()
    ax.axvline(thresholds_li[w], c='green')
    ax.axvline(thresholds_otsu[w], c='magenta')
    ax.text(0.01, 0.9,
             f'Window {w//1000}kb',
             ha='left',
             va='top',
             transform=ax.transAxes)
    ax.text(0.01, 0.7,
            f'{n_boundaries_otsu} boundaries (Otsu)',
            c='magenta',
            ha='left',
            va='top',
            transform=ax.transAxes)
    ax.text(0.01, 0.5,
            f'{n_boundaries_li} boundaries (Li)',
            c='green',
            ha='left',
            va='top',
            transform=ax.transAxes)

    ax.set(
        xscale='log',
        ylabel='# boundaries'
    )

axs[-1].set(xlabel='Boundary strength')
[11]:
[Text(0.5, 0, 'Boundary strength')]
../_images/notebooks_insulation_and_boundaries_23_1.png

CTCF enrichment at boundaries

TAD boundaries are frequently associated with CTCF binding.

To quantify this, we can aggregate the ChIP-Seq singal around the boundaries, and compare enrichment of CTCF with boundary strength using pybbi (https://github.com/nvictus/pybbi). We provide a test bigWig file with CTCF enrichment over input for the same cell type as the Micro-C data.

We use the bbi.stackup() method with no binning to extract an array of average values for all boundary regions with 1 kbp flanks.

[12]:
# Download test data. The file size is 592 Mb, so the download might take a while:
ctcf_fc_file = cooltools.download_data("HFF_CTCF_fc", cache=True, data_dir=data_dir)
[13]:
import bbi
[14]:
is_boundary = np.any([
        ~insulation_table[f'boundary_strength_{w}'].isnull()
        for w in windows],
    axis=0)
boundaries = insulation_table[is_boundary]
boundaries.head()
[14]:
chrom start end region is_bad_bin log2_insulation_score_30000 n_valid_pixels_30000 log2_insulation_score_50000 n_valid_pixels_50000 log2_insulation_score_100000 ... log2_insulation_score_250000 n_valid_pixels_250000 boundary_strength_30000 boundary_strength_50000 boundary_strength_250000 boundary_strength_100000 is_boundary_30000 is_boundary_50000 is_boundary_100000 is_boundary_250000
5 chr2 50000 60000 chr2 False 0.089080 6.0 0.059578 22.0 0.586104 ... 1.211581 122.0 NaN 0.156397 NaN NaN False False False False
6 chr2 60000 70000 chr2 False 0.036906 6.0 0.134037 22.0 0.547732 ... 1.161302 147.0 0.150452 NaN NaN NaN False False False False
7 chr2 70000 80000 chr2 False 0.062353 6.0 0.122444 22.0 0.479052 ... 1.092480 172.0 NaN 0.011593 NaN NaN False False False False
9 chr2 90000 100000 chr2 False 0.049426 6.0 0.198381 22.0 0.377645 ... 0.972715 222.0 0.029686 NaN NaN NaN False False False False
11 chr2 110000 120000 chr2 False 0.095762 6.0 0.190455 22.0 0.320182 ... 0.867080 272.0 NaN 0.024922 NaN NaN False False False False

5 rows × 21 columns

[15]:
# Calculate the average ChIP singal/input in the 3kb region around the boundary.
flank = 1000 # Length of flank to one side from the boundary, in basepairs
ctcf_chip_signal = bbi.stackup(
    data_dir+'/test_CTCF.bigWig',
    boundaries.chrom,
    boundaries.start-flank,
    boundaries.end+flank,
    bins=1).flatten()

Real boundaries are often enriched in CTCF binding, and this can be used as a guide for thresholding the boundary strength. However note a small population of strong boundaries without CTCF binding.

[16]:
w=windows[0]
f, ax = plt.subplots()
ax.loglog(
    boundaries[f'boundary_strength_{w}'],
    ctcf_chip_signal,
    'o',
    markersize=1,
    alpha=0.05
);
ax.set(
    xlim=(1e-4,1e1),
    ylim=(3e-2,3e1),
    xlabel='Boundary strength',
    ylabel='CTCF enrichment over input')

ax.axvline(thresholds_otsu[w], ls='--', color='magenta', label='Otsu threshold')
ax.axvline(thresholds_li[w], ls='--', color='green', label='Li threshold')
ax.legend()
[16]:
<matplotlib.legend.Legend at 0x1afd95dc0>
../_images/notebooks_insulation_and_boundaries_31_1.png

If we were interested specifically in boundaries with CTCF, we could threshold based on enrichment of CTCF ChIP-seq:

[17]:
histkwargs = dict(
    bins=10**np.linspace(-4,1,200),
    histtype='step',
    lw=2,
)
f, ax = plt.subplots()
ax.set(xscale='log', xlabel='Boundary strength')
ax.hist(
    boundaries[f'boundary_strength_{windows[0]}'][ctcf_chip_signal>=2],
    label='CTCF Chip/Input ≥ 2.0',
    **histkwargs
);
ax.hist(
    boundaries[f'boundary_strength_{windows[0]}'][ctcf_chip_signal<2],
    label='CTCF Chip/Input < 2.0',
    **histkwargs
);
ax.hist(
    boundaries[f'boundary_strength_{windows[0]}'],
    label='all boundaries',
    **histkwargs
);
ax.legend(loc='upper left')
[17]:
<matplotlib.legend.Legend at 0x1af14eac0>
../_images/notebooks_insulation_and_boundaries_33_1.png

1D pileup: CTCF enrichment at boundaries

Additionally, we can create 1D pileup plot of average CTCF enrichment.

First, create a collection of genomic regions of equal size, each centered at the position of the boundary.

Then, create a stackup of binned ChIP-Seq signal for these regions. It is based on the same test bigWig file with the CTCF ChIP-Seq log fold change over input.

Finally, create 1D pileup by averaging each stacked window.

[18]:
# Select the strict thresholded boundaries for one window size
top_boundaries = insulation_table[insulation_table[f'boundary_strength_{windows[1]}']>=thresholds_otsu[windows[1]]]
[19]:
# Create of the stackup, the flanks are +- 50 Kb, number of bins is 100 :
flank = 50000 # Length of flank to one side from the boundary, in basepairs
nbins = 100   # Number of bins to split the region
stackup = bbi.stackup(data_dir+'/test_CTCF.bigWig',
                      top_boundaries.chrom,
                      top_boundaries.start+resolution//2-flank,
                      top_boundaries.start+resolution//2+flank,
                      bins=nbins)
[20]:
f, ax = plt.subplots(figsize=[7,5])
ax.plot(np.nanmean(stackup, axis=0) )
ax.set(xticks=np.arange(0, nbins+1, 10),
       xticklabels=(np.arange(0, nbins+1, 10)-nbins//2)*flank*2/nbins/1000,
       xlabel='Distance from boundary, kbp',
       ylabel='CTCF ChIP-Seq mean fold change over input');
../_images/notebooks_insulation_and_boundaries_38_0.png

Using adjacent boundaries to create a table of TADs

Calling TADs from Hi-C data poses a challenge, in part because domain structures vary greatly in their size, intensity, and can be nested. The number of called TADs varies substantially from tool to tool, and can depend on tool-specific parameters (Forcato, 2017). Below, we show an example of how adjacent boundaries calculated with cooltools can specify a set of intervals that could be analyzed as TADs, using bioframe.merge().

[41]:
def extract_TADs(insulation_table, is_boundary_col, max_TAD_length = 3_000_000):
    tads = bioframe.merge(insulation_table[insulation_table[is_boundary_col] == False])
    return tads[ (tads["end"] - tads["start"]) <= MAX_TAD_LENGTH].reset_index(drop=True)[['chrom','start','end']]

TADs_table = extract_TADs(insulation_table, f'is_boundary_{windows[0]}')
TADs_table.head()
[46]:
TADs_table
[46]:
chrom start end
0 chr2 0 200000
1 chr2 210000 290000
2 chr2 300000 670000
3 chr2 680000 740000
4 chr2 750000 950000
... ... ... ...
1693 chr17 82460000 82640000
1694 chr17 82650000 82760000
1695 chr17 82770000 82960000
1696 chr17 82970000 83080000
1697 chr17 83090000 83257441

1698 rows × 3 columns

[50]:
# Visualizing the first 10 inter-boundary intervals (grey overay) v.s. Hi-C data

region = ('chr2', TADs_table.iloc[0].start, TADs_table.iloc[9].end)
norm = LogNorm(vmax=0.1, vmin=0.001)
data = clr.matrix(balance=True).fetch(region)

f, ax = plt.subplots(figsize=(18, 6))
im = pcolormesh_45deg(ax, data, start=region[1], resolution=resolution, norm=norm, cmap='fall')
ax.set_aspect(0.5)
ax.set_ylim(0, 13*windows[0])
format_ticks(ax, rotate=False)
ax.xaxis.set_visible(False)

divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="1%", pad=0.1, aspect=6)
plt.colorbar(im, cax=cax)

idx = 10
max_pos = TADs_table[:idx]['end'].max()/resolution
contact_matrix = np.zeros((int(max_pos), int(max_pos)))
contact_matrix[:] = np.nan
for _, row in TADs_table[:idx].iterrows():
    contact_matrix[int(row['start']/resolution):int(row['end']/resolution), int(row['start']/resolution):int(row['end']/resolution)] = 1
    contact_matrix[int(row['start']/resolution + 1):int(row['end']/resolution - 1), int(row['start']/resolution + 1):int(row['end']/resolution - 1)] = np.nan

im = pcolormesh_45deg(ax, contact_matrix, start=0, resolution=resolution, cmap='gray', vmax=1, vmin=-1, alpha=0.6)

plt.show()
../_images/notebooks_insulation_and_boundaries_43_0.png

This page was generated with nbsphinx from /home/docs/checkouts/readthedocs.org/user_builds/cooltools/checkouts/latest/docs/notebooks/insulation_and_boundaries.ipynb