API Reference
subpackages
- cooltools.lib package
- common
- numutils
COMED()
MAD()
adaptive_coarsegrain()
coarsen()
dist_to_mask()
fill_diag()
fill_inf()
fill_na()
fill_nainf()
get_diag()
get_eig()
get_finite()
get_kernel()
infer_mask2D()
interp_nan()
interpolate_bad_singletons()
is_symmetric()
normalize_score()
persistent_log_bins()
remove_good_singletons()
robust_gauss_filter()
set_diag()
slice_sorted()
smooth()
stochastic_sd()
weighted_groupby_mean()
zoom_array()
- peaks
- plotting
- schemas
cooltools.api.coverage module
- cooltools.api.coverage.coverage(clr, ignore_diags=None, chunksize=10000000, use_lock=False, clr_weight_name=None, store=False, store_prefix='cov', nproc=1, map_functor=<class 'map'>)[source]
Calculate the sums of cis and genome-wide contacts (aka coverage aka marginals) for a sparse Hi-C contact map in Cooler HDF5 format. Note that for raw coverage (i.e. clr_weight_name=None) the sum(tot_cov) from this function is two times the number of reads contributing to the cooler, as each side contributes to the coverage.
- Parameters
clr (cooler.Cooler) – Cooler object
ignore_diags (int, optional) – Drop elements occurring on the first
ignore_diags
diagonals of the matrix (including the main diagonal). If None, equals the number of diagonals ignored during IC balancing.chunksize (int, optional) – Split the contact matrix pixel records into equally sized chunks to save memory and/or parallelize. Default is 10^7
clr_weight_name (str) – Name of the weight column. Specify to calculate coverage of balanced cooler.
store (bool, optional) – If True, store the results in the input cooler file when finished. If clr_weight_name=None, also stores total cis counts in the cooler info. Default is False.
store_prefix (str, optional) – Name prefix of the columns of the bin table to save cis and total coverages. Will add suffixes _cis and _tot, as well as _raw in the default case or _clr_weight_name if specified.
nproc (int, optional) – How many processes to use for calculation. Ignored if map_functor is passed.
map_functor (callable, optional) – Map function to dispatch the matrix chunks to workers. If left unspecified, pool_decorator applies the following defaults: if nproc>1 this defaults to multiprocess.Pool; If nproc=1 this defaults the builtin map.
- Returns
cis_cov (1D array, whose shape is the number of bins in
h5
. Vector of bin sums in cis.)tot_cov (1D array, whose shape is the number of bins in
h5
. Vector of bin sums.)
cooltools.api.directionality module
- cooltools.api.directionality.directionality(clr, window_bp=100000, balance='weight', min_dist_bad_bin=2, ignore_diags=None, chromosomes=None)[source]
Calculate the diamond insulation scores and call insulating boundaries.
- Parameters
clr (cooler.Cooler) – A cooler with balanced Hi-C data.
window_bp (int) – The size of the sliding diamond window used to calculate the insulation score.
min_dist_bad_bin (int) – The minimal allowed distance to a bad bin. Do not calculate insulation scores for bins having a bad bin closer than this distance.
ignore_diags (int) – The number of diagonals to ignore. If None, equals the number of diagonals ignored during IC balancing.
- Returns
ins_table (pandas.DataFrame) – A table containing the insulation scores of the genomic bins and the insulating boundary strengths.
cooltools.api.dotfinder module
Collection of functions related to dot-calling
The main user-facing API function is:
dots(
clr,
expected,
expected_value_col="balanced.avg",
clr_weight_name="weight",
view_df=None,
kernels=None,
max_loci_separation=10_000_000,
max_nans_tolerated=1,
n_lambda_bins=40,
lambda_bin_fdr=0.1,
clustering_radius=20_000,
cluster_filtering=None,
tile_size=5_000_000,
nproc=1,
)
This function implements HiCCUPS-style dot calling, but enables user-specified modifications at multiple steps. The current implementation makes two passes over the input data, first to create a histogram of pixel enrichment values, and second to extract significantly enriched pixels.
The function starts with compatibility verifications
Recommendation or verification for kernels is done next. Custom kernels must satisfy properties including: square shape, equal sizes, odd sizes, zeros in the middle, etc. By default, HiCCUPS-style kernels are recommended based on the binsize.
Lambda bins are defined for multiple hypothesis testing separately for different value ranges of the locally adjusted expected. Currently, log-binned lambda-bins are hardcoded using a pre-defined BASE of 2^(1/3). n_lambda_bins controls the total number of bins. for the clr, expected and view of interest.
Genomic regions in the specified view`(all chromosomes by default) are split into smaller tiles of size `tile_size.
scoring_and_histogramming_step() is performed independently on the genomic tiles. In this step, locally adjusted expected is calculated using convolution kernels for each pixel in the tile. All surveyed pixels are histogrammed according to their adjusted expected and raw observed counts. Locally adjusted expected is not stored in memory.
Chunks of histograms are aggregated together and a modified BH-FDR procedure is applied to the result in determine_thresholds(). This returns thresholds for statistical significance in each lambda-bin (for observed counts), along with the adjusted p-values (q-values).
Calculated thresholds are used to extract statistically significant pixels in scoring_and_extraction_step(). Because locally adjusted expected is not stored in memory, it is re-caluclated during this step, which makes it computationally intensive. Locally adjusted expected values are required in order to apply different thresholds of significance depending on the lambda-bin.
Returned filtered pixels, or ‘dots’, are significantly enriched relative to their locally adjusted expecteds and thus have potential biological interest. Dots are further annotated with their genomic coordinates and q-values (adjusted p-values) for all applied kernels.
All further steps perform optional post-processing on called dots
enriched pixels that are within clustering_radius of each other are clustered together and the brightest one is selected as the representative position of a dot.
cluster-representatives along with “singletons” (enriched pixels that are not part of any cluster) can be subjected to further empirical enrichment filtering in cluster_filtering_hiccups(). This both requires clustered dots exceed prescribed enrichment thresholds relative to their local neighborhoods and that singletons pass an even more stringent q-value threshold.
- cooltools.api.dotfinder.adjusted_exp_name(kernel_name)
- cooltools.api.dotfinder.annotate_pixels_with_qvalues(pixels_df, qvalues, obs_raw_name='count')[source]
Add columns with the qvalues to a DataFrame of scored pixels
- Parameters
pixels_df (pandas.DataFrame) – a DataFrame with pixel coordinates that must have at least 2 columns named ‘bin1_id’ and ‘bin2_id’, where first is pixels’s row and the second is pixel’s column index.
qvalues (dict of DataFrames) – A dictionary with keys being kernel names and values DataFrames storing q-values for each observed count values in each lambda- bin. Colunms are Intervals defined by ‘ledges’ boundaries. Rows corresponding to a range of observed count values.
obs_raw_name (str) – Name of the column/field that carry number of counts per pixel, i.e. observed raw counts.
- Returns
pixels_qvalue_df (pandas.DataFrame) – DataFrame of pixels with additional columns la_exp.{k}.qval, storing q-values (adjusted p-values) corresponding to the count value of a pixel, its kernel, and a lambda-bin it belongs to.
- cooltools.api.dotfinder.clust_2D_pixels(pixels_df, threshold_cluster=2, bin1_id_name='bin1_id', bin2_id_name='bin2_id', clust_label_name='c_label', clust_size_name='c_size')[source]
Group significant pixels by proximity using Birch clustering. We use “n_clusters=None”, which implies no AgglomerativeClustering, and thus simply reporting “blobs” of pixels of radii <=”threshold_cluster” along with corresponding blob-centroids as well.
- Parameters
pixels_df (pandas.DataFrame) – a DataFrame with pixel coordinates that must have at least 2 columns named ‘bin1_id’ and ‘bin2_id’, where first is pixels’s row and the second is pixel’s column index.
threshold_cluster (int) – clustering radius for Birch clustering derived from ~40kb radius of clustering and bin size.
bin1_id_name (str) – Name of the 1st coordinate (row index) in ‘pixel_df’, by default ‘bin1_id’. ‘start1/end1’ could be usefull as well.
bin2_id_name (str) – Name of the 2nd coordinate (column index) in ‘pixel_df’, by default ‘bin2_id’. ‘start2/end2’ could be usefull as well.
clust_label_name (str) – Name of the cluster of pixels label. “c_label” by default.
clust_size_name (str) – Name of the cluster of pixels size. “c_size” by default.
- Returns
peak_tmp (pandas.DataFrame) – DataFrame with the following columns: [c+bin1_id_name, c+bin2_id_name, clust_label_name, clust_size_name] row/col (bin1/bin2) are coordinates of centroids, label and sizes are unique pixel-cluster labels and their corresponding sizes.
- cooltools.api.dotfinder.cluster_filtering_hiccups(centroids, obs_raw_name='count', enrichment_factor_vh=1.5, enrichment_factor_d_and_ll=1.75, enrichment_factor_d_or_ll=2.0, FDR_orphan_threshold=0.02)[source]
Centroids of enriched pixels can be filtered to further minimize the amount of false-positive dot-calls.
First, centroids are filtered on enrichment relative to the locally-adjusted expected for the “donut”, “lowleft”, “vertical”, and “horizontal” kernels. Additionally, singleton pixels (i.e. pixels that do not belong to a cluster) are filtered based on a combined q-values for all kernels. This empirical filtering approach was developed in Rao et al 2014 and results in a conservative dot-calls with the low rate of false-positive calls.
- Parameters
centroids (pd.DataFrame) – DataFrame that stores enriched and clustered pixels.
obs_raw_name (str) – name of the column with raw observed pixel counts
enrichment_factor_vh (float) – minimal enrichment factor for pixels relative to both “vertical” and “horizontal” kernel.
enrichment_factor_d_and_ll (float) – minimal enrichment factor for pixels relative to both “donut” and “lowleft” kernels.
enrichment_factor_d_or_ll (float) – minimal enrichment factor for pixels relative to either “donut” or” “lowleft” kenels.
FDR_orphan_threshold (float) – minimal combined q-value for singleton pixels.
- Returns
filtered_centroids (pd.DataFrame) – filtered dot-calls
- cooltools.api.dotfinder.clustering_step(scored_df, dots_clustering_radius, assigned_regions_name='region', obs_raw_name='count')[source]
Group together adjacent significant pixels into clusters after the lambda-binning multiple hypothesis testing by iterating over assigned regions and calling clust_2D_pixels.
- Parameters
scored_df (pandas.DataFrame) – DataFrame with enriched pixels that are ready to be clustered and are annotated with their genomic coordinates.
dots_clustering_radius (int) – Birch-clustering threshold.
assigned_regions_name (str | None) – Name of the column in scored_df to use for grouping pixels before clustering. When None, full chromosome clustering is done.
obs_raw_name (str) – name of the column with raw observed pixel counts
- Returns
centroids (pandas.DataFrame) – Pixels from ‘scored_df’ annotated with clustering information.
Notes
‘dots_clustering_radius’ in Birch clustering algorithm corresponds to a double the clustering radius in the “greedy”-clustering used in HiCCUPS
- cooltools.api.dotfinder.determine_thresholds(gw_hist, fdr)[source]
given a ‘gw_hist’ histogram of observed counts for each lambda-bin and for each kernel-type, and also given a FDR, calculate q-values for each observed count value in each lambda-bin for each kernel-type.
- Parameters
gw_hist_kernels (dict) – dictionary {kernel_name : 2D_hist}, where ‘2D_hist’ is a pd.DataFrame
fdr (float) – False Discovery Rate level
- Returns
threshold_df (dict) – each threshold_df[k] is a Series indexed by la_exp intervals (IntervalIndex) and it is all we need to extract “good” pixels from each chunk …
qvalues (dict) – A dictionary with keys being kernel names and values pandas.DataFrames storing q-values: each column corresponds to a lambda-bin, while rows correspond to observed pixels values.
- cooltools.api.dotfinder.dots(clr, expected, expected_value_col='balanced.avg', clr_weight_name='weight', view_df=None, kernels=None, max_loci_separation=10000000, max_nans_tolerated=1, n_lambda_bins=40, lambda_bin_fdr=0.1, clustering_radius=20000, cluster_filtering=None, tile_size=5000000, nproc=1)[source]
Call dots on a cooler {clr}, using {expected} defined in regions specified in {view_df}.
All convolution kernels specified in {kernels} will be all applied to the {clr}, and statistical testing will be performed separately for each kernel. A convolutional kernel is a small squared matrix (e.g. 7x7) of zeros and ones that defines a “mask” to extract local expected around each pixel. Since the enrichment is calculated relative to the central pixel, kernel width should be an odd number >=3.
- Parameters
clr (cooler.Cooler) – A cooler with balanced Hi-C data.
expected (DataFrame in expected format) – Diagonal summary statistics for each chromosome, and name of the column with the values of expected to use.
expected_value_col (str) – Name of the column in expected that holds the values of expected
clr_weight_name (str) – Name of the column in the clr.bins to use as balancing weights. Using raw unbalanced data is not supported for dot-calling.
view_df (viewframe) – Viewframe with genomic regions, at the moment the view has to match the view used for generating expected. If None, generate from the cooler.
kernels ({ str:np.ndarray } | None) – A dictionary of convolution kernels to be used for calculating locally adjusted expected. If None the default kernels from HiCCUPS are going to be recommended based on the resolution of the cooler.
max_loci_separation (int) – Miaximum loci separation for dot-calling, i.e., do not call dots for loci that are further than max_loci_separation basepair apart. default 10Mb.
max_nans_tolerated (int) – Maximum number of NaNs tolerated in a footprint of every used kernel Adjust with caution, as large max_nans_tolerated, might lead to artifacts in pixels scoring.
n_lambda_bins (int) – Number of log-spaced bins, where FDR-testing will be performed independently. TODO: generate lambda-bins on the fly based on the dynamic range of the data (i.e. maximum pixel count)
lambda_bin_fdr (float) – False discovery rate (FDR) for multiple hypothesis testing BH-FDR procedure, applied per lambda bin.
clustering_radius (None | int) – Cluster enriched pixels with a given radius. “Brightest” pixels in each group will be reported as the final dot-calls. If None, no clustering is performed.
cluster_filtering (bool) – whether to apply additional filtering to centroids after clustering, using cluster_filtering_hiccups()
tile_size (int) – Tile size for the Hi-C heatmap tiling. Typically on order of several mega-bases, and <= max_loci_separation. Controls tradeoff between memory consumption and speed of execution.
nproc (int) – Number of processes to use for multiprocessing.
- Returns
dots (pandas.DataFrame) – BEDPE-style dataFrame with genomic coordinates of called dots and additional annotations.
Notes
‘clustering_radius’ in Birch clustering algorithm corresponds to a double the clustering radius in the “greedy”-clustering used in HiCCUPS (to be tested).
TODO describe sequence of processing steps
- cooltools.api.dotfinder.extract_scored_pixels(scored_df, thresholds, ledges, obs_raw_name='count')[source]
Implementation of HiCCUPS-like lambda-binning statistical procedure. Use FDR thresholds for different “classes” of hypothesis (classified by their locally-adjusted expected (la_exp) scores), in order to extract “enriched” pixels.
- Parameters
scored_df (pd.DataFrame) – A table with the scoring information for a group of pixels.
thresholds (dict) – A dictionary {kernel_name : lambda_thresholds}, where ‘lambda_thresholds’ are pd.Series with FDR thresholds indexed by lambda-bin intervals
ledges (ndarray) – An ndarray with bin lambda-edges for groupping locally adjusted expecteds, i.e., classifying statistical hypothesis into lambda-bins. Left-most bin (-inf, 1], and right-most one (value,+inf].
obs_raw_name (str) – Name of the column/field with number of counts per pixel, i.e. observed raw counts.
- Returns
scored_df_slice (pandas.DataFrame) – Filtered DataFrame of pixels that satisfy thresholds.
- cooltools.api.dotfinder.generate_tiles_diag_band(clr, view_df, pad_size, tile_size, band_to_cover)[source]
A generator yielding corrdinates of heatmap tiles that are needed to cover the requested band_to_cover around diagonal. Each tile is “padded” with the pad of size ‘pad_size’ to allow for convolution near the boundary of a tile.
- Parameters
clr (cooler) – Cooler object to use to extract chromosome extents.
view_df (viewframe) – Viewframe with genomic regions to process, chrom, start, end, name.
pad_size (int) – Size of padding around each tile. Typically the outer size of the kernel.
tile_size (int) – Size of the heatmap tile.
band_to_cover (int) – Size of the diagonal band to be covered by the generated tiles. Typically correspond to the max_loci_separation for called dots.
- Returns
tile_coords (tuple) – Generator of tile coordinates, i.e. tuples of three: (region_name, tile_span_i, tile_span_j), where ‘tile_span_i/j’ each is a tuple of bin ids (bin_start, bin_end).
- cooltools.api.dotfinder.get_adjusted_expected_tile_some_nans(origin_ij, observed, expected, bal_weights, kernels)[source]
Get locally adjusted expected for a collection of local-filters (kernels).
Such locally adjusted expected, ‘Ek’ for a given kernel, can serve as a baseline for deciding whether a given pixel is enriched enough to call it a feature (dot-loop, flare, etc.) in a downstream analysis.
For every pixel of interest [i,j], locally adjusted expected is a product of a global expected in that pixel E_bal[i,j] and an enrichment of local environ- ment of the pixel, described with a given kernel:
KERNEL[i,j](O_bal) Ek_bal[i,j] = E_bal[i,j]* ------------------ KERNEL[i,j](E_bal)
where KERNEL[i,j](X) is a result of convolution between the kernel and a slice of matrix X centered around (i,j). See link below for details: https://en.wikipedia.org/wiki/Kernel_(image_processing)
Returned values for observed and all expecteds are rescaled back to raw-counts, for the sake of downstream statistical analysis, which is using Poisson test to decide is a given pixel is enriched. (comparison between balanced values using Poisson- test is intractable):
KERNEL[i,j](O_bal) Ek_raw[i,j] = E_raw[i,j]* ------------------ , KERNEL[i,j](E_bal)
where E_raw[i,j] is:
1 1 -------------- * -------------- * E_bal[i,j] bal_weights[i] bal_weights[j]
- Parameters
origin_ij ((int,int) tuple) – tuple of interegers that specify the location of an observed matrix slice. Measured in bins, not in nucleotides.
observed (numpy.ndarray) – square symmetrical dense-matrix that contains balanced observed O_bal
expected (numpy.ndarray) – square symmetrical dense-matrix that contains expected, calculated based on balanced observed: E_bal.
bal_weights (numpy.ndarray or (numpy.ndarray, numpy.ndarray)) – 1D vector used to turn raw observed into balanced observed for a slice of a matrix with the origin_ij on the diagonal; and a tuple/list of a couple of 1D arrays in case it is a slice with an arbitrary origin_ij.
kernels (dict of (str, numpy.ndarray)) – dictionary of kernels/masks to perform convolution of the heatmap. Kernels describe the local environment, and used to estimate baseline for finding enriched/prominent peaks. Peak must be enriched with respect to all local environments (all kernels), to be considered significant. Dictionay keys must contain names for each kernel. Note, scipy.ndimage.convolve first flips kernel and only then applies it to matrix.
- Returns
peaks_df (pandas.DataFrame) – DataFrame with the results of locally adjusted calculations for every kernel for a given slice of input matrix.
Notes
- Reported columns:
bin1_id - bin1_id index (row), adjusted to tile_start_i bin2_id - bin bin2_id index, adjusted to tile_start_j la_exp - locally adjusted expected (for each kernel) la_nan - number of NaNs around (each kernel’s footprint) exp.raw - global expected, rescaled to raw-counts obs.raw(counts) - observed values in raw-counts.
Depending on the intial tiling of the interaction matrix, concatened peaks_df may require “deduplication”, as some pixels can be evaluated in several tiles (e.g. near the tile edges). Default tilitng in the dots functions, should avoid this problem.
- cooltools.api.dotfinder.histogram_scored_pixels(scored_df, kernels, ledges, obs_raw_name='count')[source]
An attempt to implement HiCCUPS-like lambda-binning statistical procedure. This function aims at building up a histogram of locally adjusted expected scores for groups of characterized pixels.
Such histograms are subsequently used to compute FDR thresholds for different “classes” of hypothesis (classified by their locally-adjusted expected (la_exp)).
- Parameters
scored_df (pd.DataFrame) – A table with the scoring information for a group of pixels.
kernels (dict) – A dictionary with keys being kernels names and values being ndarrays representing those kernels.
ledges (ndarray) – An ndarray with bin lambda-edges for groupping locally adjusted expecteds, i.e., classifying statistical hypothesis into lambda-bins. Left-most bin (-inf, 1], and right-most one (value,+inf].
obs_raw_name (str) – Name of the column/field that carry number of counts per pixel, i.e. observed raw counts.
- Returns
hists (dict of pandas.DataFrame) – A dictionary of pandas.DataFrame with lambda/observed 2D histogram for every kernel-type.
Notes
returning histograms corresponding to the chunks of scored pixels.
- cooltools.api.dotfinder.is_compatible_kernels(kernels, binsize, max_nans_tolerated)[source]
- TODO implement checks for kernels:
matrices are of the same size
they should be squared (too restrictive ? maybe pad with 0 as needed)
dimensions are odd, to have a center pixel to refer to
they can be turned into int 1/0 ones (too restrictive ? allow weighted kernels ?)
the central pixel should be zero perhaps (unless weights are allowed 4sure)
maybe introduce an upper limit to the size - to avoid crazy long calculations
check relative to the binsize maybe ? what’s the criteria ?
- cooltools.api.dotfinder.nans_inkernel_name(kernel_name)
- cooltools.api.dotfinder.recommend_kernels(binsize)[source]
Return a recommended set of convolution kernels for dot-calling based on the resolution, or binsize, of the input data.
This function currently recommends the four kernels used in the HiCCUPS method: donut, horizontal, vertical, lowerleft. Kernels are recommended for resolutions near 5 kb, 10 kb, and 25 kb. Dots are not typically visible at lower resolutions (binsize >28kb) and the majority of datasets are too sparse for dot-calling at very high resolutions (<4kb). Given this, default kernels are not recommended for resolutions outside this range.
- Parameters
binsize (integer) – binsize of the provided cooler
- Returns
kernels ({str:ndarray}) – dictionary of convolution kernels as ndarrays, with their names as keys.
- cooltools.api.dotfinder.score_tile(tile_cij, clr, expected_indexed, expected_value_col, clr_weight_name, kernels, max_nans_tolerated, band_to_cover)[source]
The main working function that given a tile of a heatmap, applies kernels to perform convolution to calculate locally-adjusted expected and then calculates a p-value for every meaningfull pixel against these locally-adjusted expected (la_exp) values.
- Parameters
tile_cij (tuple) – Tuple of 3: region name, tile span row-wise, tile span column-wise: (region, tile_span_i, tile_span_j), where tile_span_i = (start_i, end_i), and tile_span_j = (start_j, end_j).
clr (cooler) – Cooler object to use to extract Hi-C heatmap data.
expected_indexed (pandas.DataFrame) – DataFrame with cis-expected, indexed with ‘region1’, ‘region2’, ‘dist’.
expected_value_col (str) – Name of a value column in expected DataFrame
clr_weight_name (str) – Name of a value column with balancing weights in a cooler.bins() DataFrame. Typically ‘weight’.
kernels (dict) – A dictionary with keys being kernels names and values being ndarrays representing those kernels.
max_nans_tolerated (int) – Number of NaNs tolerated in a footprint of every kernel.
band_to_cover (int) – Results would be stored only for pixels connecting loci closer than ‘band_to_cover’.
- Returns
res_df (pandas.DataFrame) – results: annotated pixels with calculated locally adjusted expected for every kernels, observed, precalculated pvalues, number of NaNs in footprint of every kernels, all of that in a form of an annotated pixels DataFrame for eligible pixels of a given tile.
- cooltools.api.dotfinder.scoring_and_extraction_step(clr, expected_indexed, expected_value_col, clr_weight_name, tiles, kernels, ledges, thresholds, max_nans_tolerated, loci_separation_bins, nproc, bin1_id_name='bin1_id', bin2_id_name='bin2_id', map_functor=<class 'map'>)[source]
This implements the 2nd step of the lambda-binning scoring procedure, extracting pixels that are FDR compliant.
In short, this combines scoring with with extraction into a single pipeline of per-chunk operations/transforms.
- cooltools.api.dotfinder.scoring_and_histogramming_step(clr, expected_indexed, expected_value_col, clr_weight_name, tiles, kernels, ledges, max_nans_tolerated, loci_separation_bins, nproc, map_functor=<class 'map'>)[source]
This implements the 1st step of the lambda-binning scoring procedure - histogramming.
In short, this pipes a scoring operation together with histogramming into a single pipeline of per-chunk operations/transforms.
- cooltools.api.dotfinder.tile_square_matrix(matrix_size, offset, tile_size, pad=0)[source]
Generate a stream of coordinates of tiles that cover a matrix of a given size. Matrix has to be square, on-digaonal one: e.g. corresponding to a chromosome or a chromosomal arm.
- Parameters
matrix_size (int) – Size of a squared matrix
offset (int) – Offset coordinates of generated tiles by ‘offset’
tile_size (int) – Requested size of the tiles. Tiles near the right and botoom edges could be rectangular and smaller then ‘tile_size’
pad (int) – Small padding around each tile to be included in the yielded coordinates.
- Yields
Pairs of indices/coordinates of every tile ((start_i, end_i), (start_j, end_j))
Notes
Generated tiles coordinates [start_i,end_i) , [start_i,end_i) can be used to fetch heatmap tiles from cooler: >>> clr.matrix()[start_i:end_i, start_j:end_j]
‘offset’ is useful when a given matrix is part of a larger matrix (a given chromosome or arm), and thus all coordinated needs to be offset to get absolute coordinates.
Tiles are non-overlapping (pad=0), but tiles near the right and bottom edges could be rectangular:
… *
cooltools.api.eigdecomp module
- cooltools.api.eigdecomp.cis_eig(A, n_eigs=3, phasing_track=None, ignore_diags=2, clip_percentile=0, sort_metric=None)[source]
Compute compartment eigenvector on a dense cis matrix.
Note that the amplitude of compartment eigenvectors is weighted by their corresponding eigenvalue
- Parameters
A (2D array) – balanced dense contact matrix
n_eigs (int) – number of eigenvectors to compute
phasing_track (1D array, optional) – if provided, eigenvectors are flipped to achieve a positive correlation with phasing_track.
ignore_diags (int) – the number of diagonals to ignore
clip_percentile (float) – if >0 and <100, clip pixels with diagonal-normalized values higher than the specified percentile of matrix-wide values.
sort_metric (str) – If provided, re-sort eigenvecs and eigvals in the order of decreasing correlation between phasing_track and eigenvector, using the specified measure of correlation. Possible values: ‘pearsonr’ - sort by decreasing Pearson correlation. ‘var_explained’ - sort by decreasing absolute amount of variation in eigvecs explained by phasing_track (i.e. R^2 * var(eigvec)) ‘MAD_explained’ - sort by decreasing absolute amount of Median Absolute Deviation from the median of eigvecs explained by phasing_track (i.e. COMED(eigvec, phasing_track) * MAD(eigvec)). ‘spearmanr’ - sort by decreasing Spearman correlation. This option is designed to report the most “biologically” informative eigenvectors first, and prevent eigenvector swapping caused by translocations. In reality, however, sometimes it shows poor performance and may lead to reporting of non-informative eigenvectors. Off by default.
- Returns
eigenvalues, eigenvectors
.. note:: ALWAYS check your EVs by eye. The first one occasionally does – not reflect the compartment structure, but instead describes chromosomal arms or translocation blowouts.
- cooltools.api.eigdecomp.eigs_cis(clr, phasing_track=None, view_df=None, n_eigs=3, clr_weight_name='weight', ignore_diags=None, clip_percentile=99.9, sort_metric=None, map=<class 'map'>)[source]
Compute compartment eigenvector for a given cooler clr in a number of symmetric intra chromosomal regions defined in view_df (cis-regions), or for each chromosome.
Note that the amplitude of compartment eigenvectors is weighted by their corresponding eigenvalue. Eigenvectors can be oriented by passing a binned phasing_track with the same resolution as the cooler.
- Parameters
clr (cooler) – cooler object to fetch data from
phasing_track (DataFrame) – binned track with the same resolution as cooler bins, the fourth column is used to phase the eigenvectors, flipping them to achieve a positive correlation.
view_df (iterable or DataFrame, optional) – if provided, eigenvectors are calculated for the regions of the view only, otherwise chromosome-wide eigenvectors are computed, for chromosomes specified in phasing_track.
n_eigs (int) – number of eigenvectors to compute
clr_weight_name (str) – name of the column with balancing weights to be used.
ignore_diags (int, optional) – the number of diagonals to ignore. Derived from cooler metadata if not specified.
clip_percentile (float) – if >0 and <100, clip pixels with diagonal-normalized values higher than the specified percentile of matrix-wide values.
sort_metric (str) – If provided, re-sort eigenvecs and eigvals in the order of decreasing correlation between phasing_track and eigenvector, using the specified measure of correlation. Possible values: ‘pearsonr’ - sort by decreasing Pearson correlation. ‘var_explained’ - sort by decreasing absolute amount of variation in eigvecs explained by phasing_track (i.e. R^2 * var(eigvec)) ‘MAD_explained’ - sort by decreasing absolute amount of Median Absolute Deviation from the median of eigvecs explained by phasing_track (i.e. COMED(eigvec, phasing_track) * MAD(eigvec)). ‘spearmanr’ - sort by decreasing Spearman correlation. This option is designed to report the most “biologically” informative eigenvectors first, and prevent eigenvector swapping caused by translocations. In reality, however, sometimes it shows poor performance and may lead to reporting of non-informative eigenvectors. Off by default.
map (callable, optional) – Map functor implementation.
- Returns
eigvals, eigvec_table -> DataFrames with eigenvalues for each region and
a table of eigenvectors filled in the bins table.
.. note:: ALWAYS check your EVs by eye. The first one occasionally does – not reflect the compartment structure, but instead describes chromosomal arms or translocation blowouts. Possible mitigations: employ view_df (e.g. arms) to avoid issues with chromosomal arms, consider blacklisting regions with translocations during balancing.
- cooltools.api.eigdecomp.eigs_trans(clr, phasing_track=None, n_eigs=3, partition=None, clr_weight_name='weight', sort_metric=None, **kwargs)[source]
- cooltools.api.eigdecomp.trans_eig(A, partition, n_eigs=3, perc_top=99.95, perc_bottom=1, phasing_track=None, sort_metric=False)[source]
Compute compartmentalization eigenvectors on trans contact data
- Parameters
A (2D array) – balanced whole genome contact matrix
partition (sequence of int) – bin offset of each contiguous region to treat separately (e.g., chromosomes or chromosome arms)
n_eigs (int) – number of eigenvectors to compute; default = 3
perc_top (float (percentile)) – filter - clip trans blowout contacts above this cutoff; default = 99.95
perc_bottom (float (percentile)) – filter - remove bins with trans coverage below this cutoff; default=1
phasing_track (1D array, optional) – if provided, eigenvectors are flipped to achieve a positive correlation with phasing_track.
sort_metric (str) – If provided, re-sort eigenvecs and eigvals in the order of decreasing correlation between phasing_track and eigenvector, using the specified measure of correlation. Possible values: ‘pearsonr’ - sort by decreasing Pearson correlation. ‘var_explained’ - sort by decreasing absolute amount of variation in eigvecs explained by phasing_track (i.e. R^2 * var(eigvec)) ‘MAD_explained’ - sort by decreasing absolute amount of Median Absolute Deviation from the median of eigvecs explained by phasing_track (i.e. COMED(eigvec, phasing_track) * MAD(eigvec)). ‘spearmanr’ - sort by decreasing Spearman correlation. This option is designed to report the most “biologically” informative eigenvectors first, and prevent eigenvector swapping caused by translocations. In reality, however, sometimes it shows poor performance and may lead to reporting of non-informative eigenvectors. Off by default.
- Returns
eigenvalues, eigenvectors
.. note:: ALWAYS check your EVs by eye. The first one occasionally does – not reflect the compartment structure, but instead describes chromosomal arms or translocation blowouts.
cooltools.api.expected module
- cooltools.api.expected.blocksum_pairwise(clr, view_df, transforms={}, clr_weight_name='weight', chunksize=1000000, map=<class 'map'>)[source]
Summary statistics on rectangular blocks of all (trans-)pairwise combinations of genomic regions in the view_df (aka trans-expected).
Note
This is a special case of asymmetric block-level summary stats, that can be calculated very efficiently. Regions in view_df are assigned to pixels only once and pixels falling into a given asymmetric block i != j are summed up.
- Parameters
clr (cooler.Cooler) – Cooler object
view_df (viewframe) – view_df of regions defining blocks for summary calculations, has to be sorted according to the order of chromosomes in clr.
transforms (dict of str -> callable, optional) – Transformations to apply to pixels. The result will be assigned to a temporary column with the name given by the key. Callables take one argument: the current chunk of the (annotated) pixel dataframe.
clr_weight_name (str) – name of the balancing weight column in cooler bin-table used to count “bad” pixels per block. Set to None not ot mask “bad” pixels (raw data only).
chunksize (int, optional) – Size of pixel table chunks to process
map (callable, optional) – Map functor implementation.
- Returns
DataFrame with entries for each blocks (region1, region2, n_valid, count.sum)
- cooltools.api.expected.combine_binned_expected(binned_exp, binned_exp_slope=None, Pc_name='balanced.avg', der_smooth_function_combined=<function <lambda>>, spread_funcs='logstd', spread_funcs_slope='std', minmax_drop_bins=2, concat_original=False)[source]
Combines by-region log-binned expected and slopes into genome-wide averages, handling small chromosomes and “corners” in an optimal fashion, robust to outliers. Calculates spread of by-chromosome P(s) and slopes, also in an optimal fashion.
- Parameters
binned_exp (dataframe) – binned expected as outputed by logbin_expected
binned_exp_slope (dataframe or None) – If provided, estimates spread of slopes. Is necessary if concat_original is True
Pc_name (str) – Name of the column with the probability of contacts. Defaults to “balanced.avg”.
der_smooth_function_combined (callable) – A smoothing function for calculating slopes on combined data
spread_funcs ("minmax", "std", "logstd" or a function (see below)) – A way to estimate the spread of the P(s) curves between regions. * “minmax” - use the minimum/maximum of by-region P(s) * “std” - use weighted standard deviation of P(s) curves (may produce negative results) * “logstd” (recommended) weighted standard deviation in logspace (as seen on the plot)
spread_funcs_slope ("minmax", "std" or a funciton) – Similar to spread_func, but for slopes rather than P(s)
concat_original (bool (default = False)) – Append original dataframe, and put combined under region “combined”
- Returns
scal, slope_df
Notes
This function does not calculate errorbars. The spread is not the deviation of the mean, and rather is representative of variability between chromosomes.
Calculating errorbars/spread
Take all by-region P(s)
For “minmax”, remove the last var_drop_last_bins bins for each region (by default two. They are most noisy and would inflate the spread for the last points). Min/max are most susceptible to this.
Groupby P(s) by region
Apply spread_funcs to the pd.GroupBy object. Options are: * minimum and maximum (“minmax”), * weighted standard deviation (“std”), * weighted standard deviation in logspace (“logstd”, default) or two custom functions We do not remove the last bins for “std” / “logstd” because we are doing weighted standard deviation. Therefore, noisy “ends” of regions would contribute very little to this.
Append them to the P(s) for the same bin.
As a result, by for minmax, we do not estimate spread for the last two bins. This is because there are often very few chromosomal arms there, and different arm measurements are noisy. For other methods, we do estimate the spread there, and noisy last bins are taken care of by the weighted standard deviation. However, the spread in the last bins may be noisy, and may become a 0 if only one region is contributing to the last pixel.
- cooltools.api.expected.count_all_pixels_per_block(x, y)[source]
Calculate total number of pixels in a rectangular block
- Parameters
x (int) – block width in pixels
y (int) – block height in pixels
- Returns
number_of_pixels (int) – total number of pixels in a block
- cooltools.api.expected.count_all_pixels_per_diag(n)[source]
Total number of pixels on each upper diagonal of a square matrix.
- Parameters
n (int) – total number of bins (dimension of square matrix)
- Returns
dcount (1D array of length n) – dcount[d] == total number of pixels on diagonal d
- cooltools.api.expected.count_bad_pixels_per_block(x, y, bad_bins_x, bad_bins_y)[source]
Calculate number of “bad” pixels per rectangular block of a contact map
- Parameters
x (int) – block width in pixels
y (int) – block height in pixels
bad_bins_x (int) – number of bad bins on x-side
bad_bins_y (int) – number of bad bins on y-side
- Returns
number_of_pixes (int) – number of “bad” pixels in a block
- cooltools.api.expected.count_bad_pixels_per_diag(n, bad_bins)[source]
Efficiently count the number of bad pixels on each upper diagonal of a matrix assuming a sequence of bad bins forms a “grid” of invalid pixels.
Each bad bin bifurcates into two a row and column of bad pixels, so an upper bound on number of bad pixels per diagonal is 2*k, where k is the number of bad bins. For a given diagonal, we need to subtract from this upper estimate the contribution from rows/columns reaching “out-of-bounds” and the contribution of the intersection points of bad rows with bad columns that get double counted.
o : bad bin * : bad pixel x : intersection bad pixel $ : out of bounds bad pixel $ $ $ *--------------------------+ * * * * | * * * * | ** * * | o****x*****x***********|$ * * * | * * * | * * * | o******x***********|$ * * | * * | * * | * * | * * | ** | o***********|$ * | * |
- Parameters
n (int) – total number of bins
bad_bins (1D array of int) – sorted array of bad bin indexes
- Returns
dcount (1D array of length n) – dcount[d] == number of bad pixels on diagonal d
- cooltools.api.expected.diagsum_from_array(A, counts=None, *, offset=0, ignore_diags=2, filter_counts=False, region_name=None)[source]
Calculates Open2C-formatted expected for a dense submatrix of a whole genome contact map.
- Parameters
A (2D array) – Normalized submatrix to calculate expected (
balanced.sum
).counts (2D array or None, optional) – Corresponding raw contacts to populate
count.sum
.offset (int or (int, int)) – i- and j- bin offsets of A relative to the parent matrix. If a single offset is provided it is applied to both axes.
ignore_diags (int, optional) – Number of initial diagonals to ignore.
filter_counts (bool, optional) – Apply the validity mask from balanced matrix to the raw one. Ignored when counts is None.
region_name (str or (str, str), optional) – A custom region name or pair of region names. If provided, region columns will be included in the output.
Notes
For regions that cross the main diagonal of the whole-genome contact map, the lower triangle “overhang” is ignored.
Examples
>>> A = clr.matrix()[:, :] # whole genome balanced >>> C = clr.matrix(balance=False)[:, :] # whole genome raw
Using only balanced data: >>> exp = diagsum_from_array(A)
Using balanced and raw counts: >>> exp1 = diagsum_from_array(A, C)
Using an off-diagonal submatrix >>> exp2 = diagsum_from_array(A[:50, 50:], offset=(0, 50))
- cooltools.api.expected.diagsum_pairwise(clr, view_df, transforms={}, clr_weight_name='weight', ignore_diags=2, chunksize=10000000, map=<class 'map'>)[source]
Intra-chromosomal diagonal summary statistics for asymmetric blocks of contact matrix defined as pairwise combinations of regions in “view_df.
Note
This is a special case of asymmetric diagonal summary statistic that is efficient and covers the most important practical case of inter-chromosomal arms “expected” calculation.
- Parameters
clr (cooler.Cooler) – Cooler object
view_df (viewframe) – view_df of regions for intra-chromosomal diagonal summation, has to be sorted according to the order of chromosomes in cooler.
transforms (dict of str -> callable, optional) – Transformations to apply to pixels. The result will be assigned to a temporary column with the name given by the key. Callables take one argument: the current chunk of the (annotated) pixel dataframe.
clr_weight_name (str) – name of the balancing weight vector used to count “bad” pixels per diagonal. Set to None not to mask “bad” pixels (raw data only).
chunksize (int, optional) – Size of pixel table chunks to process
map (callable, optional) – Map functor implementation.
- Returns
Dataframe of diagonal statistics for all intra-chromosomal blocks defined as
pairwise combinations of regions in the view
- cooltools.api.expected.diagsum_symm(clr, view_df, transforms={}, clr_weight_name='weight', ignore_diags=2, chunksize=10000000, map=<class 'map'>)[source]
Intra-chromosomal diagonal summary statistics.
- Parameters
clr (cooler.Cooler) – Cooler object
view_df (viewframe) – view_dfof regions for intra-chromosomal diagonal summation
transforms (dict of str -> callable, optional) – Transformations to apply to pixels. The result will be assigned to a temporary column with the name given by the key. Callables take one argument: the current chunk of the (annotated) pixel dataframe.
clr_weight_name (str) – name of the balancing weight vector used to count “bad” pixels per diagonal. Set to None not to mask “bad” pixels (raw data only).
chunksize (int, optional) – Size of pixel table chunks to process
ignore_diags (int, optional) – Number of intial diagonals to exclude from statistics
map (callable, optional) – Map functor implementation.
- Returns
Dataframe of diagonal statistics for all regions in the view
- cooltools.api.expected.expected_cis(clr, view_df=None, intra_only=True, smooth=True, aggregate_smoothed=True, smooth_sigma=0.1, clr_weight_name='weight', ignore_diags=2, chunksize=10000000, nproc=1, map_functor=<class 'map'>)[source]
Calculate average interaction frequencies as a function of genomic separation between pixels i.e. interaction decay with distance. Genomic separation aka “dist” is measured in the number of bins, and defined as an index of a diagonal on which pixels reside (bin1_id - bin2_id).
Average values are reported in the columns with names {}.avg, and they are calculated as a ratio between a corresponding sum {}.sum and the total number of “valid” pixels on the diagonal “n_valid”.
When balancing weights (clr_weight_name=None) are not applied to the data, there is no masking of bad bins performed.
- Parameters
clr (cooler.Cooler) – Cooler object
view_df (viewframe) – a collection of genomic intervals where expected is calculated otherwise expected is calculated for full chromosomes. view_df has to be sorted, when inter-regions expected is requested, i.e. intra_only is False.
intra_only (bool) – Return expected only for symmetric intra-regions defined by view_df, i.e. chromosomes, chromosomal-arms, intra-domains, etc. When False returns expected both for symmetric intra-regions and assymetric inter-regions.
smooth (bool) – Apply smoothing to cis-expected. Will be stored in an additional column
aggregate_smoothed (bool) – When smoothing, average over all regions, ignored without smoothing.
smooth_sigma (float) – Control smoothing with the standard deviation of the smoothing Gaussian kernel. Ignored without smoothing.
clr_weight_name (str or None) – Name of balancing weight column from the cooler to use. Use raw unbalanced data, when None.
ignore_diags (int, optional) – Number of intial diagonals to exclude results
chunksize (int, optional) – Size of pixel table chunks to process
nproc (int, optional) – How many processes to use for calculation. Ignored if map_functor is passed.
map_functor (callable, optional) – Map function to dispatch the matrix chunks to workers. If left unspecified, pool_decorator applies the following defaults: if nproc>1 this defaults to multiprocess.Pool; If nproc=1 this defaults the builtin map.
- Returns
DataFrame with summary statistic of every diagonal of every symmetric
or asymmetric block
Notes
When clr_weight_name=None, smooth=False, aggregate_smoothed=False, the minimum output DataFrame includes the following quantities (columns):
- dist:
Distance in bins.
- dist_bp:
Distance in basepairs.
- contact_freq:
The “most processed” contact frequency value. For example, if balanced & smoothing then this will return the balanced.avg.smooth.agg; if aggregated+smoothed, then balanced.avg.smooth.agg; if nothing then count.avg.
- n_total:
Number of total pixels at a given distance.
- n_valid:
Number of valid pixels (with non-NaN values after balancing) at a given distance.
- count.sum:
Sum up raw contact counts of all pixels at a given distance.
- count.avg:
The average raw contact count of pixels at a given distance. count.sum / n_total.
When clr_weigh_name is provided (by default, clr_weigh_name=”weight”), the following quantities (columns) will be added into the DataFrame:
- balanced.sum:
Sum up balanced contact values of valid pixels at a given distance. Returned if clr_weight_name is not None.
- balanced.avg:
The average balanced contact values of valid pixels at a given distance. balanced.sum / n_valid. Returned if clr_weight_name is not None.
When smooth=True, the following quantities (columns) will be added into the DataFrame:
- count.avg.smoothed:
Log-smoothed count.avg. Returned if smooth=True and clr_weight_name=None.
- balanced.avg.smoothed:
Log-smoothed balanced.avg. Returned if smooth=True and clr_weight_name is not None.
- When aggregate_smoothed=True, the following quantities (columns) will be added into the DataFrame:
- count.avg.smoothed.agg:
Aggregate Log-smoothed count.avg of all genome regions. Returned if smooth=True and aggregate_smoothed=True and clr_weight_name=None.
- balanced.avg.smoothed.agg:
Aggregate Log-smoothed balanced.avg of all genome regions. Returned if smooth=True and aggregate_smoothed=True and clr_weight_name is not None.
By default, clr_weight_name=”weight”, smooth=True, aggregate_smoothed=True, the output DataFrame includes all quantities (columns).
- cooltools.api.expected.expected_trans(clr, view_df=None, clr_weight_name='weight', chunksize=10000000, nproc=1, map_functor=<class 'map'>)[source]
Calculate average interaction frequencies for inter-chromosomal blocks defined as pairwise combinations of regions in view_df.
An expected level of interactions between disjoint chromosomes is calculated as a simple average, as there is no notion of genomic separation for a pair of chromosomes and contact matrix for these regions looks “flat”.
Average values are reported in the columns with names {}.avg, and they are calculated as a ratio between a corresponding sum {}.sum and the total number of “valid” pixels on the diagonal “n_valid”.
- Parameters
clr (cooler.Cooler) – Cooler object
view_df (viewframe) – a collection of genomic intervals where expected is calculated otherwise expected is calculated for full chromosomes, has to be sorted.
clr_weight_name (str or None) – Name of balancing weight column from the cooler to use. Use raw unbalanced data, when None.
chunksize (int, optional) – Size of pixel table chunks to process
nproc (int, optional) – How many processes to use for calculation. Ignored if map_functor is passed.
map_functor (callable, optional) – Map function to dispatch the matrix chunks to workers. If left unspecified, pool_decorator applies the following defaults: if nproc>1 this defaults to multiprocess.Pool; If nproc=1 this defaults the builtin map.
- Returns
DataFrame with summary statistic for every trans-blocks
region1, region2, n_valid, count.sum count.avg, etc
- cooltools.api.expected.genomewide_smooth_cvd(cvd, sigma_log10=0.1, window_sigma=5, points_per_sigma=10, cols=None, suffix='.smoothed')[source]
Smooth the contact-vs-distance curve aggregated across all regions in log-space.
- Parameters
cvd (pandas.DataFrame) – A dataframe with the expected values in the cooltools.expected format.
sigma_log10 (float, optional) – The standard deviation of the smoothing Gaussian kernel, applied over log10(diagonal), by default 0.1
window_sigma (int, optional) – Width of the smoothing window, expressed in sigmas, by default 5
points_per_sigma (int, optional) – If provided, smoothing is done only for points_per_sigma points per sigma and the rest of the values are interpolated (this results in a major speed-up). By default 10
cols (dict, optional) – If provided, use the specified column names instead of the standard ones. See DEFAULT_CVD_COLS variable for the format of this argument.
suffix (string, optional) – If provided, use the specified string as the suffix of the output column’s name
- Returns
cvd (pandas.DataFrame) – A cvd table with extra column for the log-smoothed contact frequencies (by default, “balanced.avg.smoothed.agg” if balanced, or “count.avg.smoothed.agg” if raw).
Notes
Parameters in “cols” will be used:
- dist:
Name of the column that stores distance values (by default, “dist”).
- n_pixels:
Name of the column that stores number of pixels (by default, “n_valid” if balanced, or “n_total” if raw).
- n_contacts:
Name of the column that stores the sum of contacts (by default, “balanced.sum” if balanced, or “count.sum” if raw).
- output_prefix:
Name prefix of the column that will store output value (by default, “balanced.avg” if balanced, or “count.avg” if raw).
- cooltools.api.expected.interpolate_expected(expected, binned_expected, columns=['balanced.avg'], kind='quadratic', by_region=True, extrapolate_small_s=False)[source]
Interpolates expected to match binned_expected. Basically, this function smoothes the original expected according to the logbinned expected. It could either use by-region expected (each region will have different expected) or use combined binned_expected (all regions will have the same expected after that)
Such a smoothed expected should be used to calculate observed/expected for downstream analysis.
- Parameters
expected (pd.DataFrame) – expected as returned by diagsum_symm
binned_expected (pd.DataFrame) – binned expected (combined or not)
columns (list[str] (optional)) – Columns to interpolate. Must be present in binned_expected, but not necessarily in expected.
kind (str (optional)) – Interpolation type, according to scipy.interpolate.interp1d
by_region (bool or str (optional)) – Whether to do interpolation by-region (default=True). False means use one expected for all regions (use entire table). If a region name is provided, expected for that region is used.
- cooltools.api.expected.lattice_pdist_frequencies(n, points)[source]
Distribution of pairwise 1D distances among a collection of distinct integers ranging from 0 to n-1.
- Parameters
n (int) – Size of the lattice on which the integer points reside.
points (sequence of int) – Arbitrary integers between 0 and n-1, inclusive, in any order but with no duplicates.
- Returns
h (1D array of length n) – h[d] counts the number of integer pairs that are exactly d units apart
Notes
This is done using a convolution via FFT. Thanks to Peter de Rivaz; see http://stackoverflow.com/questions/42423823/distribution-of-pairwise-distances-between-many-integers.
- cooltools.api.expected.logbin_expected(exp, summary_name='balanced.sum', bins_per_order_magnitude=10, bin_layout='fixed', smooth=<function <lambda>>, min_nvalid=200, min_count=50)[source]
Logarithmically bins expected as produced by diagsum_symm method.
- Parameters
exp (DataFrame) – DataFrame produced by diagsum_symm
summary_name (str, optional) – Name of the column of exp-DataFrame to use as a diagonal summary. Default is “balanced.sum”.
bins_per_order_magnitude (int, optional) – How many bins per order of magnitude. Default of 10 has a ratio of neighboring bins of about 1.25
bin_layout ("fixed", "longest_region", or array) –
“fixed” means that bins are exactly the same for different datasets, and only depend on bins_per_order_magnitude
”longest_region” means that the last bin will end at size of the longest region.
GOOD: the last bin will have as much data as possible. BAD: bin edges will end up different for different datasets, you can’t divide them by each other
array: provide your own bin edges. Can be of any size, and end at any value. Bins exceeding the size of the largest region will be simply ignored.
smooth (callable) – A smoothing function to be applied to log(P(s)) and log(x) before calculating P(s) slopes for by-region data
min_nvalid (int) – For each region, throw out bins (log-spaced) that have less than min_nvalid valid pixels This will ensure that each entree in Pc_by_region has at least n_valid valid pixels Don’t set it to zero, or it will introduce bugs. Setting it to 1 is OK, but not recommended.
min_count (int) – If counts are found in the data, then for each region, throw out bins (log-spaced) that have more than min_counts of counts.sum (raw Hi-C counts). This will ensure that each entree in Pc_by_region has at least min_count raw Hi-C reads
- Returns
Pc (DataFrame) – dataframe of contact probabilities and spread across regions
slope (ndarray) – slope of Pc(s) on a log-log plot and spread across regions
bins (ndarray) – an array of bin edges used for calculating P(s)
Notes
For main Pc and slope, the algorithm is the following
concatenate all the expected for all regions into a large dataframe.
create logarithmically-spaced bins of diagonals (or use provided)
pool together n_valid and balanced.sum for each region and for each bin
calculate the average diagonal for each bucket, weighted by n_valid
divide balanced.sum by n_valid after summing for each bucket (not before)
calculate the slope in log space (for each region)
X values are not midpoints of bins
In step 4, we calculate the average diag index weighted by n_valid. This seems counter-intuitive, but it actually is justified.
Let’s take the worst case scenario. Let there be a bin from 40MB to 44MB. Let there be a region that is exactly 41 MB long. The midpoint of the bin is at 42MB. But the only part of this region belonging to this bin is actually between 40MB and 41MB. Moreover, the “average” read in this little triangle of the heatmap is actually not coming even from 40.5 MB because the triangle is getting narrower towards 41MB. The center of mass of a triangle is 1/3 of the way up, or 40.33 MB. So an average read for this region in this bin is coming from 40.33.
Consider the previous bin, say, from 36MB to 40MB. The heatmap there is a trapezoid with a long side of 5MB, the short side of 1MB, and height of 4MB. The center of mass of this trapezoid is at 36 + 14/9 = 37.55MB, and not at 38MB. So the last bin center is definitely mis-assigned, and the second-to-last bin center is off by some 25%. This would lead to a 25% error of the P(s) slope estimated between the third-to-last and second-to-last bin.
In presence of missing bins, this all becomes more complex, but this kind of averaging should take care of everything. It follows a general principle: when averaging the y values with some weights, one needs to average the x values with the same weights. The y values here are being added together, so per-diag means are effectively averaged with the weight of n_valid. Therefore, the x values (diag) should be averaged with the same weights.
Other considerations
Steps #3 and #5 are important because the ratio of sums does not equal to the sum of ratios, and the former is more correct (the latter is more susceptible to noise). It is generally better to divide at the very end, rather than dividing things for each diagonal.
Here we divide at the end twice: first we divide balanced.sum by n_valid for each region, then we effectively multiply it back up and divide it for each bin when combining different regions (see weighted average in the next function).
Smoothing P(s) for the slope
For calcuating the slope, we apply smoothing to the P(s) to ensure the slope is not too noisy. There are several caveats here: the P(s) has to be smoothed in logspace, and both P and s have to be smoothed. It is discussed in detail here
https://gist.github.com/mimakaev/4becf1310ba6ee07f6b91e511c531e73
Examples
For example, see this gist: https://gist.github.com/mimakaev/e9117a7fcc318e7904702eba5b47d9e6
- cooltools.api.expected.make_block_table(clr, regions1, regions2, clr_weight_name='weight')[source]
Creates a table of total and valid pixels for a set of rectangular genomic blocks defined by regions1 and regions2. For every block calculate its “area” in pixels (“n_total”), and calculate number of “valid” pixels (“n_valid”). Valid pixels exclude “bad” pixels, which are inferred from the balancing weight column clr_weight_name.
When clr_weight_name is None, raw data is used, and no “bad” pixels are exclued.
- Parameters
clr (cooler.Cooler) – Input cooler
regions1 (viewframe-like dataframe) – viewframe-like dataframe, where repeated entries are allowed
regions2 (viewframe-like dataframe) – viewframe-like dataframe, where repeated entries are allowed
clr_weight_name (str) – name of the weight column in the cooler bins-table, used for masking bad pixels. When clr_weight_name is None, no bad pixels are masked.
- Returns
block_table (dict) – dictionary for blocks that are 0-indexed
- cooltools.api.expected.make_diag_table(bad_mask, span1, span2)[source]
Compute the total number of elements
n_total
and the number of bad elementsn_bad
per diagonal for a single contact area encompassingspan1
andspan2
on the same genomic scaffold (cis matrix).Follows the same principle as the algorithm for finding contact areas for computing scalings.
- Parameters
bad_mask (1D array of bool) – Mask of bad bins for the whole genomic scaffold containing the regions of interest.
span1 (pair of ints) – The bin spans (not genomic coordinates) of the two regions of interest.
span2 (pair of ints) – The bin spans (not genomic coordinates) of the two regions of interest.
- Returns
diags (pandas.DataFrame) – Table indexed by ‘diag’ with columns [‘n_total’, ‘n_bad’].
- cooltools.api.expected.make_diag_tables(clr, regions, regions2=None, clr_weight_name='weight')[source]
For every region infer diagonals that intersect this region and calculate the size of these intersections in pixels, both “total” and “n_valid”, where “n_valid” does not count “bad” pixels.
“Bad” pixels are inferred from the balancing weight column clr_weight_name. When clr_weight_name is None, raw data is used, and no “bad” pixels are exclued.
When regions2 are provided, all intersecting diagonals are reported for each rectangular and asymmetric block defined by combinations of matching elements of regions and regions2. Otherwise only regions-based symmetric square blocks are considered. Only intra-chromosomal regions are supported.
- Parameters
clr (cooler.Cooler) – Input cooler
regions (viewframe or viewframe-like dataframe) – viewframe without repeated entries or viewframe-like dataframe with repeated entries
regions2 (viewframe or viewframe-like dataframe) – viewframe without repeated entries or viewframe-like dataframe with repeated entries
clr_weight_name (str) – name of the weight column in the clr bin-table, Balancing weight is used to infer bad bins, set to None is masking bad bins is not desired for raw data.
- Returns
diag_tables (dict) – dictionary with DataFrames of relevant diagonals for every region.
- cooltools.api.expected.per_region_smooth_cvd(cvd, sigma_log10=0.1, window_sigma=5, points_per_sigma=10, cols=None, suffix='')[source]
Smooth the contact-vs-distance curve for each region in log-space.
- Parameters
cvd (pandas.DataFrame) – A dataframe with the expected values in the cooltools.expected format.
sigma_log10 (float, optional) – The standard deviation of the smoothing Gaussian kernel, applied over log10(diagonal), by default 0.1
window_sigma (int, optional) – Width of the smoothing window, expressed in sigmas, by default 5
points_per_sigma (int, optional) – If provided, smoothing is done only for points_per_sigma points per sigma and the rest of the values are interpolated (this results in a major speed-up). By default 10
cols (dict, optional) – If provided, use the specified column names instead of the standard ones. See DEFAULT_CVD_COLS variable for the format of this argument.
suffix (string, optional) – If provided, use the specified string as the suffix of the output column’s name
- Returns
cvd (pandas.DataFrame) – A cvd table with extra column for the log-smoothed contact frequencies (by default, “balanced.avg.smoothed” if balanced, or “count.avg.smoothed” if raw).
Notes
Parameters in “cols” will be used:
- region1:
Name of the column that stores region1’s locations (by default, “region1”).
- region2:
Name of the column that stores region2’s locations (by default, “region2”).
- dist:
Name of the column that stores distance values (by default, “dist”).
- n_pixels:
Name of the column that stores number of pixels (by default, “n_valid” if balanced, or “n_total” if raw).
- n_contacts:
Name of the column that stores the sum of contacts (by default, “balanced.sum” if balanced, or “count.sum” if raw).
- output_prefix:
Name prefix of the column that will store output value (by default, “balanced.avg” if balanced, or “count.avg” if raw).
cooltools.api.insulation module
- cooltools.api.insulation.calculate_insulation_score(clr, window_bp, view_df=None, ignore_diags=None, min_dist_bad_bin=0, is_bad_bin_key='is_bad_bin', append_raw_scores=False, chunksize=20000000, clr_weight_name='weight', verbose=False, nproc=1, map_functor=<class 'map'>)[source]
Calculate the diamond insulation scores for all bins in a cooler.
- Parameters
clr (cooler.Cooler) – A cooler with balanced Hi-C data.
window_bp (int or list of integers) – The size of the sliding diamond window used to calculate the insulation score. If a list is provided, then a insulation score if calculated for each value of window_bp.
view_df (bioframe.viewframe or None) – Viewframe for independent calculation of insulation scores for regions
ignore_diags (int | None) – The number of diagonals to ignore. If None, equals the number of diagonals ignored during IC balancing.
min_dist_bad_bin (int) – The minimal allowed distance to a bad bin to report insulation score. Fills bins that have a bad bin closer than this distance by nans.
is_bad_bin_key (str) – Name of the output column to store bad bins
append_raw_scores (bool) – If True, append columns with raw scores (sum_counts, sum_balanced, n_pixels) to the output table.
clr_weight_name (str or None) – Name of the column in the bin table with weight. Using unbalanced data with None will avoid masking “bad” pixels.
verbose (bool) – If True, report real-time progress.
nproc (int, optional) – How many processes to use for calculation. Ignored if map_functor is passed.
map_functor (callable, optional) – Map function to dispatch the matrix chunks to workers. If left unspecified, pool_decorator applies the following defaults: if nproc>1 this defaults to multiprocess.Pool; If nproc=1 this defaults the builtin map.
- Returns
ins_table (pandas.DataFrame) – A table containing the insulation scores of the genomic bins
- cooltools.api.insulation.find_boundaries(ins_table, min_frac_valid_pixels=0.66, min_dist_bad_bin=0, log2_ins_key='log2_insulation_score_{WINDOW}', n_valid_pixels_key='n_valid_pixels_{WINDOW}', is_bad_bin_key='is_bad_bin')[source]
Call insulating boundaries.
Find all local minima of the log2(insulation score) and calculate their chromosome-wide topographic prominence.
- Parameters
ins_table (pandas.DataFrame) – A bin table with columns containing log2(insulation score), annotation of regions (required), the number of valid pixels per diamond and (optionally) the mask of bad bins. Normally, this should be an output of calculate_insulation_score.
view_df (bioframe.viewframe or None) – Viewframe for independent boundary calls for regions
min_frac_valid_pixels (float) – The minimal fraction of valid pixels in a diamond to be used in boundary picking and prominence calculation.
min_dist_bad_bin (int) – The minimal allowed distance to a bad bin to be used in boundary picking. Ignore bins that have a bad bin closer than this distance.
log2_ins_key (str) – The names of the columns containing log2_insulation_score and the number of valid pixels per diamond. When a template containing {WINDOW} is provided, the calculation is repeated for all pairs of columns matching the template.
n_valid_pixels_key (str) – The names of the columns containing log2_insulation_score and the number of valid pixels per diamond. When a template containing {WINDOW} is provided, the calculation is repeated for all pairs of columns matching the template.
- Returns
ins_table (pandas.DataFrame) – A bin table with appended columns with boundary prominences.
- cooltools.api.insulation.get_n_pixels(bad_bin_mask, window=10, ignore_diags=2)[source]
Calculate the number of “good” pixels in a diamond at each bin.
- cooltools.api.insulation.insul_diamond(pixel_query, bins, window=10, ignore_diags=2, norm_by_median=True, clr_weight_name='weight')[source]
Calculates the insulation score of a Hi-C interaction matrix.
- Parameters
pixel_query (RangeQuery object <TODO:update description>) – A table of Hi-C interactions. Must follow the Cooler columnar format: bin1_id, bin2_id, count, balanced (optional)).
bins (pandas.DataFrame) – A table of bins, is used to determine the span of the matrix and the locations of bad bins.
window (int) – The width (in bins) of the diamond window to calculate the insulation score.
ignore_diags (int) – If > 0, the interactions at separations < ignore_diags are ignored when calculating the insulation score. Typically, a few first diagonals of the Hi-C map should be ignored due to contamination with Hi-C artifacts.
norm_by_median (bool) – If True, normalize the insulation score by its NaN-median.
clr_weight_name (str or None) – Name of balancing weight column from the cooler to use. Using raw unbalanced data is not supported for insulation.
- cooltools.api.insulation.insulation(clr, window_bp, view_df=None, ignore_diags=None, clr_weight_name='weight', min_frac_valid_pixels=0.66, min_dist_bad_bin=0, threshold='Li', append_raw_scores=False, chunksize=20000000, verbose=False, nproc=1)[source]
Find insulating boundaries in a contact map via the diamond insulation score.
For a given cooler, this function (a) calculates the diamond insulation score track, (b) detects all insulating boundaries, and (c) removes weak boundaries via an automated thresholding algorithm.
- Parameters
clr (cooler.Cooler) – A cooler with balanced Hi-C data.
window_bp (int or list of integers) – The size of the sliding diamond window used to calculate the insulation score. If a list is provided, then a insulation score if done for each value of window_bp.
view_df (bioframe.viewframe or None) – Viewframe for independent calculation of insulation scores for regions
ignore_diags (int | None) – The number of diagonals to ignore. If None, equals the number of diagonals ignored during IC balancing.
clr_weight_name (str) – Name of the column in the bin table with weight
min_frac_valid_pixels (float) – The minimal fraction of valid pixels in a diamond to be used in boundary picking and prominence calculation.
min_dist_bad_bin (int) – The minimal allowed distance to a bad bin to report insulation score. Fills bins that have a bad bin closer than this distance by nans.
threshold ("Li", "Otsu" or float) – Rule used to threshold the histogram of boundary strengths to exclude weak boundaries. “Li” or “Otsu” use corresponding methods from skimage.thresholding. Providing a float value will filter by a fixed threshold
append_raw_scores (bool) – If True, append columns with raw scores (sum_counts, sum_balanced, n_pixels) to the output table.
verbose (bool) – If True, report real-time progress.
nproc (int, optional) – How many processes to use for calculation
- Returns
ins_table (pandas.DataFrame) – A table containing the insulation scores of the genomic bins
cooltools.api.saddle module
- cooltools.api.saddle.digitize(track, n_bins, vrange=None, qrange=None, digitized_suffix='.d')[source]
Digitize genomic signal tracks into integers between 1 and n.
- Parameters
track (4-column DataFrame) – bedGraph-like dataframe with columns understood as (chrom,start,end,value).
n_bins (int) – number of bins for signal quantization.
vrange (tuple) – Low and high values used for binning track values. E.g. if `vrange`=(-0.05, 0.05), equal width bins would be generated between the value -0.05 and 0.05.
qrange (tuple) – Low and high values for quantile binning track values. E.g., if `qrange`=(0.02, 0.98) the lower bin would start at the 2nd percentile and the upper bin would end at the 98th percentile of the track value range. Low must be 0.0 or more, high must be 1.0 or less.
digitized_suffix (str) – suffix to append to the track value name in the fourth column.
- Returns
digitized (DataFrame) – New track dataframe (bedGraph-like) with digitized value column with name suffixed by ‘.d’ The digized column is returned as a categorical.
binedges (1D array (length n + 1)) – Bin edges used in quantization of track. For n bins, there are n + 1 edges. See encoding details in Notes.
Notes
The digital encoding is as follows:
1..n <-> values assigned to bins defined by vrange or qrange
0 <-> left outlier values
n+1 <-> right outlier values
-1 <-> missing data (NaNs)
- cooltools.api.saddle.saddle(clr, expected, track, contact_type, n_bins, vrange=None, qrange=None, view_df=None, clr_weight_name='weight', expected_value_col='balanced.avg', view_name_col='name', min_diag=3, max_diag=-1, trim_outliers=False, verbose=False, drop_track_na=False)[source]
Get a matrix of average interactions between genomic bin pairs as a function of a specified genomic track.
The provided genomic track is either: (a) digitized inside this function by passing ‘n_bins’, and one of ‘v_range’ or ‘q_range’ (b) passed as a pre-digitized track with a categorical value column as generated by get_digitized().
- Parameters
clr (cooler.Cooler) – Observed matrix.
expected (DataFrame in expected format) – Diagonal summary statistics for each chromosome, and name of the column with the values of expected to use.
contact_type (str) – If ‘cis’ then only cis interactions are used to build the matrix. If ‘trans’, only trans interactions are used.
track (DataFrame) – A track, i.e. BedGraph-like dataframe, which is digitized with the options n_bins, vrange and qrange. Can optionally be passed as a pre-digitized dataFrame with a categorical value column, as generated by get_digitzied(), also passing n_bins as None.
n_bins (int or None) – number of bins for signal quantization. If None, then track must be passed as a pre-digitized track.
vrange (tuple) – Low and high values used for binning track values. See get_digitized().
qrange (tuple) – Low and high values for quantile binning track values. Low must be 0.0 or more, high must be 1.0 or less. Only one of vrange or qrange can be passed. See get_digitzed().
view_df (viewframe) – Viewframe with genomic regions. If none, generate from track chromosomes.
clr_weight_name (str) – Name of the column in the clr.bins to use as balancing weights. Using raw unbalanced data is not supported for saddles.
expected_value_col (str) – Name of the column in expected used for normalizing.
view_name_col (str) – Name of column in view_df with region names.
min_diag (int) – Smallest diagonal to include in computation. Ignored with contact_type=trans.
max_diag (int) – Biggest diagonal to include in computation. Ignored with contact_type=trans.
trim_outliers (bool, optional) – Remove first and last row and column from the output matrix.
verbose (bool, optional) – If True then reports progress.
drop_track_na (bool, optional) – If True then drops NaNs in input track (as if they were missing), If False then counts NaNs as present in dataframe. In general, this only adds check form chromosomes that have all missing values, but does not affect the results.
- Returns
interaction_sum (2D array) – The matrix of summed interaction probability between two genomic bins given their values of the provided genomic track.
interaction_count (2D array) – The matrix of the number of genomic bin pairs that contributed to the corresponding pixel of
interaction_sum
.
- cooltools.api.saddle.saddle_strength(S, C)[source]
- Parameters
S (2D arrays, square, same shape) – Saddle sums and counts, respectively
C (2D arrays, square, same shape) – Saddle sums and counts, respectively
- Returns
1D array
Ratios of cumulative corner interaction scores, where the saddle data is
grouped over the AA+BB corners and AB+BA corners with increasing extent.
- cooltools.api.saddle.saddleplot(track, saddledata, n_bins, vrange=None, qrange=(0.0, 1.0), cmap='coolwarm', scale='log', vmin=0.5, vmax=2, color=None, title=None, xlabel=None, ylabel=None, clabel=None, fig=None, fig_kws=None, heatmap_kws=None, margin_kws=None, cbar_kws=None, subplot_spec=None)[source]
Generate a saddle plot.
- Parameters
track (pd.DataFrame) – See get_digitized() for details.
saddledata (2D array-like) – Saddle matrix produced by make_saddle. It will include 2 flanking rows/columns for outlier signal values, thus the shape should be (n+2, n+2).
cmap (str or matplotlib colormap) – Colormap to use for plotting the saddle heatmap
scale (str) – Color scaling to use for plotting the saddle heatmap: log or linear
vmin (float) – Value limits for coloring the saddle heatmap
vmax (float) – Value limits for coloring the saddle heatmap
color (matplotlib color value) – Face color for margin bar plots
fig (matplotlib Figure, optional) – Specified figure to plot on. A new figure is created if none is provided.
fig_kws (dict, optional) – Passed on to plt.Figure()
heatmap_kws (dict, optional) – Passed on to ax.imshow()
margin_kws (dict, optional) – Passed on to ax.bar() and ax.barh()
cbar_kws (dict, optional) – Passed on to plt.colorbar()
subplot_spec (GridSpec object) – Specify a subregion of a figure to using a GridSpec.
- Returns
Dictionary of axes objects.
cooltools.api.sample module
- cooltools.api.sample.sample(clr, out_clr_path, count=None, cis_count=None, frac=None, exact=False, chunksize=10000000, nproc=1, map_functor=<class 'map'>)[source]
Pick a random subset of contacts from a Hi-C map.
- Parameters
clr (cooler.Cooler or str) – A Cooler or a path/URI to a Cooler with input data.
out_clr_path (str) – A path/URI to the output.
count (int) – The target number of contacts in the sample. Mutually exclusive with cis_count and frac.
cis_count (int) – The target number of cis contacts in the sample. Mutually exclusive with count and frac.
frac (float) – The target sample size as a fraction of contacts in the original dataset. Mutually exclusive with count and cis_count.
exact (bool) – If True, the resulting sample size will exactly match the target value. Exact sampling will load the whole pixel table into memory! If False, binomial sampling will be used instead and the sample size will be randomly distributed around the target value.
chunksize (int) – The number of pixels loaded and processed per step of computation.
nproc (int, optional) – How many processes to use for calculation. Ignored if map_functor is passed.
map_functor (callable, optional) – Map function to dispatch the matrix chunks to workers. If left unspecified, pool_decorator applies the following defaults: if nproc>1 this defaults to multiprocess.Pool; If nproc=1 this defaults the builtin map.
cooltools.api.snipping module
Collection of classes and functions used for snipping and creation of pileups (averaging of multiple small 2D regions) The main user-facing function of this module is pileup, it performs pileups using snippers and other functions defined in the module. The concept is the following:
First, the provided features are annotated with the regions from a view (or simply whole chromosomes, if no view is provided). They are assigned to the region that contains it, or the one with the largest overlap.
Then the features are expanded using the flank argument, and aligned to the bins of the cooler
Depending on the requested operation (whether the normalization to expected is required), the appropriate snipper object is created
A snipper can select a particular region of a genome-wide matrix, meaning it stores its sparse representation in memory. This could be whole chromosomes or chromosome arms, for example
A snipper can snip a small area of a selected region, meaning it will extract and return a dense representation of this area
For each region present, it is first `select`ed, and then all features within it are `snip`ped, creating a stack: a 3D array containing all snippets for this region
For features that are not assigned to any region, an empty snippet is returned
All per-region stacks are then combined into one, which then can be averaged to create a single pileup
The order of snippets in the stack matches the order of features, this way the stack can also be used for analysis of any subsets of original features
This procedure achieves a good tradeoff between speed and RAM. Extracting each individual snippet directly from disk would be extremely slow due to slow IO. Extracting the whole chromosomes into dense matrices is not an option due to huge memory requirements. As a warning, deeply sequenced data can still require a substantial amount of RAM at high resolution even as a sparse matrix, but typically it’s not a problem.
- class cooltools.api.snipping.CoolerSnipper(clr, cooler_opts=None, view_df=None, min_diag=2)[source]
Bases:
object
- select(region1, region2)[source]
Select a portion of the cooler for snipping based on two regions in the view
In addition to returning the selected portion of the data, stores necessary information about it in the snipper object for future snipping
- Parameters
region1 (str) – Name of a region from the view
region2 (str) – Name of another region from the view.
- Returns
CSR matrix – Sparse matrix of the selected portion of the data from the cooler
- snip(matrix, region1, region2, tup)[source]
Extract a snippet from the matrix
Returns a NaN-filled array for out-of-bounds regions. Fills in NaNs based on the cooler weight, if using balanced data. Fills NaNs in all diagonals below min_diag
- Parameters
matrix (SCR matrix) – Output of the .select() method
region1 (str) – Name of a region from the view corresponding to the matrix
region2 (str) – Name of the other regions from the view corresponding to the matrix
tup (tuple) – (start1, end1, start2, end2) coordinates of the requested snippet in bp
- Returns
np.array – Requested snippet.
- class cooltools.api.snipping.ExpectedSnipper(clr, expected, view_df=None, min_diag=2, expected_value_col='balanced.avg')[source]
Bases:
object
- select(region1, region2)[source]
Select a portion of the expected matrix for snipping based on two regions in the view
In addition to returning the selected portion of the data, stores necessary information about it in the snipper object for future snipping
- Parameters
region1 (str) – Name of a region from the view
region2 (str) – Name of another region from the view.
- Returns
CSR matrix – Sparse matrix of the selected portion of the data from the cooler
- snip(exp, region1, region2, tup)[source]
Extract an expected snippet
Returns a NaN-filled array for out-of-bounds regions. Fills NaNs in all diagonals below min_diag
- Parameters
exp (SCR matrix) – Output of the .select() method
region1 (str) – Name of a region from the view corresponding to the matrix
region2 (str) – Name of the other regions from the view corresponding to the matrix
tup (tuple) – (start1, end1, start2, end2) coordinates of the requested snippet in bp
- Returns
np.array – Requested snippet.
- class cooltools.api.snipping.ObsExpSnipper(clr, expected, cooler_opts=None, view_df=None, min_diag=2, expected_value_col='balanced.avg')[source]
Bases:
object
- select(region1, region2)[source]
Select a portion of the cooler for snipping based on two regions in the view
In addition to returning the selected portion of the data, stores necessary information about it in the snipper object for future snipping
- Parameters
region1 (str) – Name of a region from the view
region2 (str) – Name of another region from the view.
- Returns
CSR matrix – Sparse matrix of the selected portion of the data from the cooler
- snip(matrix, region1, region2, tup)[source]
Extract an expected-normalised snippet from the matrix
Returns a NaN-filled array for out-of-bounds regions. Fills in NaNs based on the cooler weight, if using balanced data. Fills NaNs in all diagonals below min_diag
- Parameters
matrix (SCR matrix) – Output of the .select() method
region1 (str) – Name of a region from the view corresponding to the matrix
region2 (str) – Name of the other regions from the view corresponding to the matrix
tup (tuple) – (start1, end1, start2, end2) coordinates of the requested snippet in bp
- Returns
np.array – Requested snippet.
- cooltools.api.snipping.expand_align_features(features_df, flank, resolution, format='bed')[source]
Short summary.
- Parameters
features_df (pd.DataFrame) – Dataframe with feature coordinates.
flank (int) – Flank size to add to the central bin of each feature.
resolution (int) – Size of the bins to use.
format (str) – “bed” or “bedpe” format: has to have ‘chrom’, ‘start’, ‘end’ or ‘chrom1’, ‘start1’, ‘end1’, ‘chrom2’, ‘start2’, ‘end1’ columns, repectively.
- Returns
pd.DataFrame –
- DataFrame with features with new columns
”center”, “orig_start” “orig_end”
- or “center1”, “orig_start1”, “orig_end1”,
”center2”, “orig_start2”, “orig_rank_end2”, depending on format.
- cooltools.api.snipping.make_bin_aligned_windows(binsize, chroms, centers_bp, flank_bp=0, region_start_bp=0, ignore_index=False)[source]
Convert genomic loci into bin spans on a fixed bin-segmentation of a genomic region. Window limits are adjusted to align with bin edges.
- Parameters
binsize (int) – Bin size (resolution) in base pairs.
chroms (1D array-like) – Column of chromosome names.
centers_bp (1D or nx2 array-like) – If 1D, center points of each window. If 2D, the starts and ends.
flank_bp (int) – Distance in base pairs to extend windows on either side.
region_start_bp (int, optional) – If region is a subset of a chromosome, shift coordinates by this amount. Default is 0.
- Returns
DataFrame with columns – ‘chrom’ - chromosome ‘start’, ‘end’ - window limits in base pairs ‘lo’, ‘hi’ - window limits in bins
- cooltools.api.snipping.pileup(clr, features_df, view_df=None, expected_df=None, expected_value_col='balanced.avg', flank=100000, min_diag='auto', clr_weight_name='weight', nproc=1, map_functor=<class 'map'>)[source]
Pileup features over the cooler.
- Parameters
clr (cooler.Cooler) – Cooler with Hi-C data
features_df (pd.DataFrame) – Dataframe in bed or bedpe format: has to have ‘chrom’, ‘start’, ‘end’ or ‘chrom1’, ‘start1’, ‘end1’, ‘chrom2’, ‘start2’, ‘end2’ columns.
view_df (pd.DataFrame) – Dataframe with the genomic view for this operation (has to match the expected_df, if provided)
expected_df (pd.DataFrame) – Dataframe with the expected level of interactions at different genomic separations
expected_value_col (str) – Name of the column in expected used for normalizing.
flank (int) – How much to flank the center of the features by, in bp
min_diag (str or int) – All diagonals of the matrix below this value are ignored. ‘auto’ tries to extract the value used during the matrix balancing, if it fails defaults to 2
clr_weight_name (str) – Value of the column that contains the balancing weights
force (bool) – Allows start>end in the features (not implemented)
nproc (int, optional) – How many processes to use for calculation. Ignored if map_functor is passed.
map_functor (callable, optional) – Map function to dispatch the matrix chunks to workers. If left unspecified, pool_decorator applies the following defaults: if nproc>1 this defaults to multiprocess.Pool; If nproc=1 this defaults the builtin map.
- Returns
np.ndarray (a stackup of all snippets corresponding to the features, with shape)
(n, D, D), where n is the number of snippets and (D, D) is the shape of each
snippet
cooltools.api.virtual4c module
- cooltools.api.virtual4c.virtual4c(clr, viewpoint, clr_weight_name='weight', nproc=1, map_functor=<class 'map'>)[source]
Generate genome-wide contact profile for a given viewpoint.
Extract all contacts of a given viewpoint from a cooler file.
- Parameters
clr (cooler.Cooler) – A cooler with balanced Hi-C data.
viewpoint (tuple or str) – Coordinates of the viewpoint.
clr_weight_name (str) – Name of the column in the bin table with weight
nproc (int, optional) – How many processes to use for calculation. Ignored if map_functor is passed.
map_functor (callable, optional) – Map function to dispatch the matrix chunks to workers. If left unspecified, pool_decorator applies the following defaults: if nproc>1 this defaults to multiprocess.Pool; If nproc=1 this defaults the builtin map.
- Returns
v4C_table (pandas.DataFrame) – A table containing the interaction frequency of the viewpoint with the rest of the genome
Note
Note: this is a new (experimental) function, the interface or output might change in a future version.