Command line interface

Welcome to the cooltools command line interface (CLI) notebook!

Cooltools features a paired python API & CLI that enables user-facing functions to be run from the command line.

[1]:
import os, subprocess
import pandas as pd
import bioframe
import cooltools
import cooler
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
plt.rcParams['font.size']=12

from packaging import version
if version.parse(cooltools.__version__) < version.parse('0.5.2'):
    raise AssertionError("tutorials rely on cooltools version 0.5.2 or higher,"+
                         "please check your cooltools version and update to the latest")

# We can use this function to display a file within the notebook
from IPython.display import Image


# download test data
# this file is 145 Mb, and may take a few seconds to download
cool_file = cooltools.download_data("HFF_MicroC", cache=True, data_dir='./data/')
print(cool_file)

# To use this variable in a bash call from jupyter just use $cool_file
./data/test.mcool
[2]:
%%bash
mkdir -p data
mkdir -p outputs
[3]:
# Note for the correct bash environment to be recognized from jupyter with the ! magic,
# jupyter notebook must be initialized from a conda environment with cooler and cooltools installed.

To see a list of CLI commands for cooltools, see the help:

[4]:
!cooltools -h
Usage: cooltools [OPTIONS] COMMAND [ARGS]...

  Type -h or --help after any subcommand for more information.

Options:
  -v, --verbose  Verbose logging
  -d, --debug    Post mortem debugging
  -V, --version  Show the version and exit.
  -h, --help     Show this message and exit.

Commands:
  coverage        Calculate the sums of cis and genome-wide contacts (aka...
  dots            Call dots on a Hi-C heatmap that are not larger than...
  eigs-cis        Perform eigen value decomposition on a cooler matrix to...
  eigs-trans      Perform eigen value decomposition on a cooler matrix to...
  expected-cis    Calculate expected Hi-C signal for cis regions of...
  expected-trans  Calculate expected Hi-C signal for trans regions of...
  genome          Utilities for binned genome assemblies.
  insulation      Calculate the diamond insulation scores and call...
  pileup          Perform retrieval of the snippets from .cool file.
  random-sample   Pick a random sample of contacts from a Hi-C map.
  saddle          Calculate saddle statistics and generate saddle plots...
  virtual4c       Generate virtual 4C profile from a contact map by...

Visualization

[5]:
!cooler show $cool_file::resolutions/1000000 'chr2' -o 'outputs/chr2.png'
Traceback (most recent call last):
  File "/Users/geofffudenberg/anaconda3/envs/open2c/bin/cooler", line 8, in <module>
    sys.exit(cli())
  File "/Users/geofffudenberg/anaconda3/envs/open2c/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/Users/geofffudenberg/anaconda3/envs/open2c/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/Users/geofffudenberg/anaconda3/envs/open2c/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/geofffudenberg/anaconda3/envs/open2c/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/geofffudenberg/anaconda3/envs/open2c/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/Users/geofffudenberg/anaconda3/envs/open2c/lib/python3.9/site-packages/cooler/cli/show.py", line 230, in show
    plt.gcf().canvas.set_window_title("Contact matrix".format())
AttributeError: 'FigureCanvasAgg' object has no attribute 'set_window_title'
[6]:
Image('outputs/chr2.png', width=400, height=400)
[6]:
../_images/notebooks_command_line_interface_8_0.png

Expected

Tables of expected counts, either in cis or trans, are key inputs for many downstream analyses in cooltools. For more details, see the contacts_vs_dist notebook.

Typically, we specify a view to define regions under analysis. Here we quickly create a view that specifies chromosome arms using tables of chromosome sizes and centromere positions.

[7]:
# create a view of hg38 chromosome arms using chromosome sizes and definition of centromeres
hg38_chromsizes = bioframe.fetch_chromsizes('hg38')
hg38_cens = bioframe.fetch_centromeres('hg38')
view_hg38 = bioframe.make_chromarms(hg38_chromsizes,  hg38_cens)

# select only those chromosomes available in cooler
clr = cooler.Cooler(f'{cool_file}::/resolutions/1000000')
view_hg38 = view_hg38[view_hg38.chrom.isin(clr.chromnames)].reset_index(drop=True)
view_hg38.to_csv("data/view_hg38.tsv", index=False, header=False, sep='\t')
[8]:
! cooltools expected-cis $cool_file::resolutions/100000 --nproc 6 -o 'outputs/test.expected.cis.100000.tsv' --view "data/view_hg38.tsv"

Note expected for the first two distances are not defined with default settings, due to masking of near-diagonals in the cooler.

[9]:
display(
    pd.read_table("outputs/test.expected.cis.100000.tsv")[0:5]
)
region1 region2 dist n_valid count.sum balanced.sum count.avg balanced.avg
0 chr2_p chr2_p 0 878 NaN NaN NaN NaN
1 chr2_p chr2_p 1 876 NaN NaN NaN NaN
2 chr2_p chr2_p 2 874 2738583.0 65.287351 3133.390160 0.074699
3 chr2_p chr2_p 3 872 1739972.0 41.011675 1995.380734 0.047032
4 chr2_p chr2_p 4 870 1184707.0 28.473626 1361.732184 0.032728

Compartments & saddles

Many contact maps display plaid patterns of interactions. For more detail see the compartments notebook.

Since the orientation of eigenvectors is determined up to a sign, we often use GC content to orient or “phase” eigenvectors. Before calculating comparments, this notebook generates a binned GC content profile for the relevant region.

[10]:
## fasta sequence is required for calculating binned profile of GC conent
if not os.path.isfile('./data/hg38.fa'):
    ## note downloading a ~1Gb file can take a minute
    subprocess.call('wget -P ./data --progress=bar:force:noscroll https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz', shell=True)
    subprocess.call('gunzip ./data/hg38.fa.gz', shell=True)

Bins can be fetched from the cooler, dropping any weights columns & keeping the header.

[11]:
!cooler dump --header -t bins $cool_file::resolutions/100000 | cut -f1-3 > outputs/bins.100000.tsv
[12]:
!cooltools genome gc outputs/bins.100000.tsv data/hg38.fa > outputs/gc.100000.tsv
[13]:
!cooltools eigs-cis -o outputs/test.eigs.100000 --view data/view_hg38.tsv --phasing-track outputs/gc.100000.tsv --n-eigs 1 $cool_file::resolutions/100000
/Users/geofffudenberg/anaconda3/envs/open2c/lib/python3.9/site-packages/cooltools/lib/checks.py:550: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.
  for name, group in track.groupby([track.columns[0]]):
[14]:
display(
    pd.read_table('outputs/test.eigs.100000.cis.vecs.tsv')[0:5]
)
chrom start end weight E1
0 chr2 0 100000 0.006754 -1.564658
1 chr2 100000 200000 0.006767 -1.747567
2 chr2 200000 300000 0.004638 -0.370827
3 chr2 300000 400000 0.006034 -1.326894
4 chr2 400000 500000 0.006153 -1.434981

Pairwise class averaging is a typical way to reveal preferences in contact frequencies between regions.

[15]:
!cooltools saddle --qrange 0.02 0.98 --fig png -o outputs/test.saddle.cis.100000 --view data/view_hg38.tsv $cool_file::resolutions/100000 outputs/test.eigs.100000.cis.vecs.tsv outputs/test.expected.cis.100000.tsv
/Users/geofffudenberg/anaconda3/envs/open2c/lib/python3.9/site-packages/cooltools/lib/checks.py:550: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.
  for name, group in track.groupby([track.columns[0]]):
/Users/geofffudenberg/anaconda3/envs/open2c/lib/python3.9/site-packages/cooltools/lib/checks.py:550: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.
  for name, group in track.groupby([track.columns[0]]):
/Users/geofffudenberg/anaconda3/envs/open2c/lib/python3.9/site-packages/cooltools/lib/checks.py:550: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.
  for name, group in track.groupby([track.columns[0]]):
/Users/geofffudenberg/anaconda3/envs/open2c/lib/python3.9/site-packages/cooltools/lib/checks.py:550: FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.
  for name, group in track.groupby([track.columns[0]]):
[16]:
saddle = np.load('outputs/test.saddle.cis.100000.saddledump.npz', allow_pickle=True)
[17]:
plt.figure(figsize=(6,6))
norm = LogNorm(    vmin=10**(-1), vmax=10**1)
im = plt.imshow(
    saddle['saddledata'],
    cmap='RdBu_r',
    norm = norm
);
plt.xlabel("saddle category")
plt.ylabel("saddle category")
plt.colorbar(im, label='obs/exp', pad=0.025, shrink=0.7);
../_images/notebooks_command_line_interface_26_0.png

Insulation & boundaries

A common strategy to summarize the near-diagonal structure of a contact map is by computing insulation scores. See the insulation notebook for more details.

The command below uses the Li method to threshold the insulation score for boundary calling, so the resulting table has both log2 insulation scores and whether a given region is a boundary. The boundary_strength_{window} column has a value for all local minima of the insulation profile, and the is_boundary_{window} column indicates whether this strength passed the threshold. Note that {window} indicates the chosen size of the sliding diamond, and the command below uses windows of size 100kb and 200kb.

[18]:
! cooltools insulation --threshold Li -o 'outputs/test.insulation.10000.tsv' --view "data/view_hg38.tsv" $cool_file::resolutions/10000 100000 200000
[19]:
display(
    pd.read_table('outputs/test.insulation.10000.tsv')[0:5]
)
chrom start end region is_bad_bin log2_insulation_score_100000 n_valid_pixels_100000 log2_insulation_score_200000 n_valid_pixels_200000 boundary_strength_100000 boundary_strength_200000 is_boundary_100000 is_boundary_200000
0 chr2 0 10000 chr2_p True NaN 0.0 NaN 0.0 NaN NaN False False
1 chr2 10000 20000 chr2_p False 0.692051 8.0 1.123245 18.0 NaN NaN False False
2 chr2 20000 30000 chr2_p False 0.760561 17.0 1.196643 37.0 NaN NaN False False
3 chr2 30000 40000 chr2_p False 0.766698 27.0 1.211748 57.0 NaN NaN False False
4 chr2 40000 50000 chr2_p False 0.674906 37.0 1.135037 77.0 NaN NaN False False

Dots & focal enrichment

Punctate pairwise peaks of enriched contact frequency are a prevalent feature of mammalian interphase contact maps. See the dots notebook for more details.

Since dots are evident at higher resolutions, we first calculate the 10kb expected, and we used multiple cores to speed up the calculation.

[20]:
! cooltools expected-cis --nproc 6 -o 'outputs/test.expected.cis.10000.tsv' --view "data/view_hg38.tsv" $cool_file::resolutions/10000
[21]:
! cooltools dots --nproc 6 -o 'outputs/test.dots.10000.tsv' --view "data/view_hg38.tsv" $cool_file::resolutions/10000 outputs/test.expected.cis.10000.tsv
INFO:root:Using recommended donut-based kernels with w=5, p=2 for binsize=10000
INFO:root: matrix 9314X9314 to be split into 256 tiles of 600X600.
INFO:root: tiles are padded (width=5) to enable convolution near the edges
INFO:root: matrix 14907X14907 to be split into 625 tiles of 600X600.
INFO:root: tiles are padded (width=5) to enable convolution near the edges
INFO:root: matrix 2472X2472 to be split into 25 tiles of 600X600.
INFO:root: tiles are padded (width=5) to enable convolution near the edges
INFO:root: matrix 5855X5855 to be split into 100 tiles of 600X600.
INFO:root: tiles are padded (width=5) to enable convolution near the edges
INFO:root:convolving 108 tiles to build histograms for lambda-bins
INFO:root:creating a Pool of 6 workers to tackle 108 tiles
INFO:root:Done building histograms in 17.489 sec ...
INFO:root:Determined thresholds for every lambda-bin ...
INFO:root:convolving 108 tiles to extract enriched pixels
INFO:root:creating a Pool of 6 workers to tackle 108 tiles
INFO:root:Done extracting enriched pixels in 18.971 sec ...
INFO:root:Begin post-processing of 11284 filtered pixels
INFO:root:preparing to extract needed q-values ...
INFO:root:clustering enriched pixels in region: chr17_p
INFO:root:detected 198 clusters of 3.69+/-3.15 size
INFO:root:clustering enriched pixels in region: chr17_q
INFO:root:detected 584 clusters of 3.87+/-3.44 size
INFO:root:clustering enriched pixels in region: chr2_p
INFO:root:detected 841 clusters of 3.82+/-3.56 size
INFO:root:clustering enriched pixels in region: chr2_q
INFO:root:detected 1364 clusters of 3.72+/-3.24 size
INFO:root:Clustering is complete
INFO:root:filtered 2548 out of 2987 centroids to reduce the number of false-positives
[22]:
display(
    pd.read_table('outputs/test.dots.10000.tsv')[0:5]
)
chrom1 start1 end1 chrom2 start2 end2 count la_exp.donut.value la_exp.vertical.value la_exp.horizontal.value ... la_exp.vertical.qval la_exp.horizontal.qval la_exp.lowleft.qval region cstart1 cstart2 c_label c_size region1 region2
0 chr17 12250000 12260000 chr17 12740000 12750000 186 72.485199 79.603089 72.129599 ... 2.210430e-22 2.032096e-22 1.506741e-33 chr17_p 1.220571e+07 1.273857e+07 0 7 chr17_p chr17_p
1 chr17 10640000 10650000 chr17 11980000 11990000 64 20.549081 23.891314 25.102296 ... 1.368050e-08 1.294897e-08 8.847705e-09 chr17_p 1.063000e+07 1.195500e+07 1 4 chr17_p chr17_p
2 chr17 9920000 9930000 chr17 10610000 10620000 444 40.054241 75.104470 41.227788 ... 3.310219e-171 7.071107e-246 2.515747e-246 chr17_p 9.920000e+06 1.061200e+07 2 4 chr17_p chr17_p
3 chr17 10640000 10650000 chr17 10840000 10850000 340 34.137182 46.658333 48.204555 ... 2.637572e-154 1.768044e-154 4.524004e-183 chr17_p 1.063667e+07 1.083889e+07 3 9 chr17_p chr17_p
4 chr17 12180000 12190000 chr17 13000000 13010000 215 28.235042 36.990608 38.266379 ... 7.841810e-80 1.045765e-79 7.967002e-80 chr17_p 1.218571e+07 1.299143e+07 4 7 chr17_p chr17_p

5 rows × 22 columns

Pileups & average features

A common method for quantifying contact maps is by creating pileup plots, which are averages over a set of “snippets”, or 2D windows, from the genome-wide map. See the pileups notebook for more details.

Below shows pileup for dots that were computed above.

[23]:
!cooltools pileup --features-format BEDPE --nproc 6 -o 'outputs/test.pileup.10000.npz' --view "data/view_hg38.tsv" --expected outputs/test.expected.cis.10000.tsv $cool_file::resolutions/10000  outputs/test.dots.10000.tsv
[24]:
## output is saved as an npz with keys 'pileup' and 'stack'
resolution = 10000
pile = np.load( f'outputs/test.pileup.{resolution}.npz')
[25]:
plt.figure(figsize=(6,6))
norm = LogNorm(    vmin=10**(-1), vmax=10**1)

im = plt.imshow(
    pile['pileup'],
    cmap='RdBu_r',
    norm = norm
);

plt.xticks(np.arange(0,25,5).astype(int),
           (np.arange(0,25,5).astype(int)-10)*resolution//1000,
)
plt.yticks(np.arange(0,25,5).astype(int),
           (np.arange(0,25,5).astype(int)-10)*resolution//1000,
)
plt.xlabel("offset from pileup center, kb")
plt.ylabel("offset from pileup center, kb")

plt.colorbar(im, label='obs/exp', pad=0.025, shrink=0.7);
../_images/notebooks_command_line_interface_40_0.png

This page was generated with nbsphinx from /home/docs/checkouts/readthedocs.org/user_builds/cooltools/checkouts/latest/docs/notebooks/command_line_interface.ipynb