{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Insulation & boundaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Welcome to the contact insulation notebook!\n", "\n", "Insulation is a simple concept, yet a powerful way to look at C data. Insulation is one aspect of locus-specific contact frequency at small genomic distances, and reflects the segmentation of the genome into domains.\n", "\n", "Insulation can be computed with multiple methods. One of the most common methods involves using a diamond-window score to generate an ***insulation profile***. To compute this profile, slide a diamond-shaped window along the genome, with one of the corners on the main diagonal of the matrix, and sum up the contacts within the window for each position.\n", "\n", "Insulation profiles reveal that certain locations have lower scores, reflecting lowered contact frequencies between upstream and downstream loci. These positions are often referred to as ***boundaries***, and are also obtained with multiple methods. 
Here we illustrate one thresholding method for determining boundaries from an insulation profile.\n", "\n", "In this notebook we:\n", "\n", "* Calculate the insulation score genome-wide and display it alongside an interaction matrix\n", "* Call insulating boundaries\n", "* Filter insulating boundaries based on their strength\n", "* Calculate enrichment of CTCF/genes at boundaries\n", "* Repeat boundary filtering based on enrichment of CTCF, a known insulator protein in mammalian genomes" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# import standard python libraries\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import python package for working with cooler files and tools for analysis\n", "import cooler\n", "import cooltools.lib.plotting\n", "from cooltools import insulation\n", "\n", "from packaging import version\n", "if version.parse(cooltools.__version__) < version.parse('0.5.4'):\n", " raise AssertionError(\"tutorials rely on cooltools version 0.5.4 or higher, \"+\n", " \"please check your cooltools version and update to the latest\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "./data/test.mcool\n" ] } ], "source": [ "# download test data\n", "# this file is 145 MB, and may take a few seconds to download\n", "import cooltools\n", "data_dir = './data/'\n", "cool_file = cooltools.download_data(\"HFF_MicroC\", cache=True, data_dir=data_dir) \n", "print(cool_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculating genome-wide contact insulation\n", "Here we load the Micro-C data at 10 kbp resolution and calculate the insulation score with four window sizes (30, 50, 100 and 250 kbp)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", 
"output_type": "stream", "text": [ "INFO:root:fallback to serial implementation.\n", "INFO:root:Processing region chr2\n", "INFO:root:Processing region chr17\n" ] } ], "source": [ "resolution = 10000 \n", "clr = cooler.Cooler(f'{data_dir}test.mcool::resolutions/{resolution}')\n", "windows = [3*resolution, 5*resolution, 10*resolution, 25*resolution]\n", "insulation_table = insulation(clr, windows, verbose=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function returns a dataframe where rows correspond to genomic bins of the cooler.\n", "\n", "For each of the window sizes provided to the function, the columns of this insulation dataframe report the insulation score, the number of valid (non-NaN) pixels, whether the given bin is valid, the boundary prominence (strength), and whether the locus is called as a boundary after thresholding.\n", "\n", "Below we print the columns shared by all window sizes, as well as the columns specific to the largest window used:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | chrom | start | end | region | is_bad_bin | log2_insulation_score_250000 | n_valid_pixels_250000 | boundary_strength_250000 | is_boundary_250000 |
|---|---|---|---|---|---|---|---|---|---|
| 1000 | chr2 | 10000000 | 10010000 | chr2 | False | 0.309791 | 622.0 | NaN | False |
| 1001 | chr2 | 10010000 | 10020000 | chr2 | False | 0.226045 | 622.0 | NaN | False |
| 1002 | chr2 | 10020000 | 10030000 | chr2 | False | 0.090809 | 622.0 | NaN | False |
| 1003 | chr2 | 10030000 | 10040000 | chr2 | False | -0.101091 | 622.0 | NaN | False |
| 1004 | chr2 | 10040000 | 10050000 | chr2 | False | -0.342858 | 622.0 | NaN | False |
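
The table above combines the columns shared by all windows with those specific to the 250 kbp window. As a minimal sketch of how such a per-window view could be selected with pandas, the example below uses a small hand-built stand-in table (two rows copied from the output above, plus a placeholder 30 kbp column). `select_window` is a hypothetical helper introduced for illustration; it is not part of cooltools, and this is not necessarily how the notebook produced the table.

```python
import numpy as np
import pandas as pd

# Stand-in for the insulation table: two rows copied from the output above.
# The 30000-window column is a NaN placeholder added only for illustration.
toy = pd.DataFrame({
    "chrom": ["chr2", "chr2"],
    "start": [10000000, 10010000],
    "end": [10010000, 10020000],
    "region": ["chr2", "chr2"],
    "is_bad_bin": [False, False],
    "log2_insulation_score_30000": [np.nan, np.nan],
    "log2_insulation_score_250000": [0.309791, 0.226045],
    "n_valid_pixels_250000": [622.0, 622.0],
    "boundary_strength_250000": [np.nan, np.nan],
    "is_boundary_250000": [False, False],
})

def select_window(table, window):
    """Hypothetical helper: keep the shared columns and those ending in _<window>."""
    shared = ["chrom", "start", "end", "region", "is_bad_bin"]
    # The leading underscore keeps "_50000" from matching "..._250000".
    per_window = [c for c in table.columns if c.endswith(f"_{window}")]
    return table[shared + per_window]

view = select_window(toy, 250000)
print(view.columns.tolist())
```

Matching on the `_<window>` suffix relies on the column naming visible in the output above, where every per-window column ends with the window size in base pairs.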

| | chrom | start | end | region | is_bad_bin | log2_insulation_score_30000 | n_valid_pixels_30000 | log2_insulation_score_50000 | n_valid_pixels_50000 | log2_insulation_score_100000 | ... | log2_insulation_score_250000 | n_valid_pixels_250000 | boundary_strength_30000 | boundary_strength_50000 | boundary_strength_250000 | boundary_strength_100000 | is_boundary_30000 | is_boundary_50000 | is_boundary_100000 | is_boundary_250000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | chr2 | 50000 | 60000 | chr2 | False | 0.089080 | 6.0 | 0.059578 | 22.0 | 0.586104 | ... | 1.211581 | 122.0 | NaN | 0.156397 | NaN | NaN | False | False | False | False |
| 6 | chr2 | 60000 | 70000 | chr2 | False | 0.036906 | 6.0 | 0.134037 | 22.0 | 0.547732 | ... | 1.161302 | 147.0 | 0.150452 | NaN | NaN | NaN | False | False | False | False |
| 7 | chr2 | 70000 | 80000 | chr2 | False | 0.062353 | 6.0 | 0.122444 | 22.0 | 0.479052 | ... | 1.092480 | 172.0 | NaN | 0.011593 | NaN | NaN | False | False | False | False |
| 9 | chr2 | 90000 | 100000 | chr2 | False | 0.049426 | 6.0 | 0.198381 | 22.0 | 0.377645 | ... | 0.972715 | 222.0 | 0.029686 | NaN | NaN | NaN | False | False | False | False |
| 11 | chr2 | 110000 | 120000 | chr2 | False | 0.095762 | 6.0 | 0.190455 | 22.0 | 0.320182 | ... | 0.867080 | 272.0 | NaN | 0.024922 | NaN | NaN | False | False | False | False |

5 rows × 21 columns

| | chrom | start | end |
|---|---|---|---|
| 0 | chr2 | 0 | 200000 |
| 1 | chr2 | 210000 | 290000 |
| 2 | chr2 | 300000 | 670000 |
| 3 | chr2 | 680000 | 740000 |
| 4 | chr2 | 750000 | 950000 |
| ... | ... | ... | ... |
| 1693 | chr17 | 82460000 | 82640000 |
| 1694 | chr17 | 82650000 | 82760000 |
| 1695 | chr17 | 82770000 | 82960000 |
| 1696 | chr17 | 82970000 | 83080000 |
| 1697 | chr17 | 83090000 | 83257441 |

1698 rows × 3 columns
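
To make the diamond-window score from the introduction concrete, here is a minimal NumPy sketch on a toy contact matrix with two domains. It only sums raw contacts in a square window whose corner touches the main diagonal; it is not the cooltools implementation, which the notebook calls through `insulation(...)` and which reports log2 scores. The function name `diamond_insulation` and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def diamond_insulation(matrix, window_bins):
    """Sum contacts between the window_bins bins upstream and downstream of each bin.

    For bin i, the window covers rows i-window_bins..i-1 and columns
    i+1..i+window_bins, i.e. a square that looks like a diamond when the
    matrix is rotated by 45 degrees. Bins too close to the edges stay NaN.
    """
    n = matrix.shape[0]
    score = np.full(n, np.nan)
    for i in range(window_bins, n - window_bins):
        block = matrix[i - window_bins:i, i + 1:i + 1 + window_bins]
        score[i] = np.nansum(block)
    return score

# Toy matrix: two 6-bin domains with strong contacts inside each domain (10)
# and weak contacts between them (1).
n, half = 12, 6
contacts = np.ones((n, n))
contacts[:half, :half] = 10.0
contacts[half:, half:] = 10.0

scores = diamond_insulation(contacts, window_bins=2)
print(scores)
print("lowest score at bin", int(np.nanargmin(scores)))
```

In this toy case the score dips at the bins flanking the domain border, which is the behaviour that boundary calling on a real insulation profile exploits.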