orca_utils module

This module contains the utilities for Orca-based applications, including a class for structural variants and plotting utilities.

class orca_utils.GRange(chr, start, end, strand)

Bases: tuple

property chr

Alias for field number 0

property end

Alias for field number 2

property start

Alias for field number 1

property strand

Alias for field number 3

class orca_utils.LGRange(len, ref)

Bases: tuple

property len

Alias for field number 0

property ref

Alias for field number 1
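The two tuple types above can be pictured with the short sketch below. It rebuilds equivalent namedtuples with the standard library instead of importing them from orca_utils, and it assumes that a segment's length equals end minus start; the coordinate values are made up for illustration.

```python
from collections import namedtuple

# Stand-ins for the documented types; field order follows the field numbers above.
GRange = namedtuple("GRange", ["chr", "start", "end", "strand"])
LGRange = namedtuple("LGRange", ["len", "ref"])

# Illustrative values only. Assumption: len == end - start.
ref = GRange(chr="chr1", start=1000, end=5000, strand="+")
seg = LGRange(len=ref.end - ref.start, ref=ref)

print(seg.len)              # segment length
print(seg.ref.chr)          # reference chromosome of the segment
print(ref[1] == ref.start)  # fields are reachable by index or by name
```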

class orca_utils.StructuralChange2(chr_name, length)[source]

Bases: object

This class stores and manipulates structural changes for a single chromosome. It supports querying the mutated chromosome by coordinates and provides utilities for retrieving the corresponding reference genome segments.

The basic operations that StructuralChange2 supports are duplication, deletion, inversion, insertion, and concatenation. StructuralChange2 objects can be concatenated with the '+' operator, which joins two chromosomes. '+' can be combined with the other basic operations to create fused chromosomes.

These operations can be applied sequentially to introduce arbitrarily complex structural changes. Note, however, that the coordinates are updated after each operation to reflect the current state of the chromosome, so coordinates specified in a later operation must take into account the effects of all previous operations.
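The coordinate shift described above can be shown with a toy model. This sketch does not use StructuralChange2; it represents a chromosome as a plain list of (ref_start, ref_end) pieces, and the helper delete_sketch is hypothetical. Half-open intervals are assumed.

```python
def delete_sketch(pieces, start, end):
    """Remove [start, end), given in CURRENT coordinates, from a list of
    (ref_start, ref_end) reference pieces. Toy model, not the library code."""
    out, pos = [], 0
    for ref_start, ref_end in pieces:
        seg_len = ref_end - ref_start
        # Offsets inside this piece where the deleted interval begins and ends.
        cut_from = min(max(start - pos, 0), seg_len)
        cut_to = min(max(end - pos, 0), seg_len)
        if cut_from > 0:
            out.append((ref_start, ref_start + cut_from))
        if cut_to < seg_len:
            out.append((ref_start + cut_to, ref_end))
        pos += seg_len
    return out

chrom = [(0, 1000)]                     # unmodified chromosome of length 1000
chrom = delete_sketch(chrom, 100, 200)  # removes reference 100-200
chrom = delete_sketch(chrom, 300, 400)  # same-looking call, shifted target
print(chrom)
```

After the first deletion everything downstream moves 100 bases to the left, so the second call removes reference positions 400-500 rather than 300-400.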

Parameters
  • chr_name (str) – Name of the reference chromosome.

  • length (int) – The length of the reference chromosome.

segments

List of reference genome segments that constitute the (mutated) chromosome. Each element is an LGRange namedtuple holding a length and a GRange namedtuple (chr: str, start: int, end: int, strand: str).

Type

list(LGRange)

chr_name

Name of the chromosome

Type

str

coord_points

Stores N+1 key coordinates, where N is the number of segments. The key coordinates are 0, the segment junction positions, and the chromosome end coordinate. coord_points reflects the current state of the chromosome.

Type

list(int)
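The relationship between segment lengths and the N+1 key coordinates can be sketched as a running sum. This is an illustration with made-up lengths, not the class's own bookkeeping.

```python
from itertools import accumulate

# Illustrative lengths of N = 3 consecutive segments.
segment_lengths = [100, 200, 500]

# 0, then each junction position, then the chromosome end coordinate.
coord_points = [0] + list(accumulate(segment_lengths))
print(coord_points)
```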

delete(start, end)[source]

Delete a genomic region.

duplicate(start, end)[source]

Duplicate a genomic region.

insert(start, length, strand='+', name=None)[source]

Insert a genomic sequence with given length.

invert(start, end)[source]

Invert a genomic region.

query(start, end)[source]

Retrieve the segments in the reference genome that constitute the specified interval in the mutated genome.

query_ref(chr_name, start, end)[source]

Retrieve regions that correspond to the specified reference genome interval in the mutated genome.
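The idea behind query can be sketched with a binary search over the key coordinates. The function query_sketch below is hypothetical and works on plain tuples rather than on a StructuralChange2 object. It assumes half-open intervals, 0 <= start < end, and that an inverted segment (strand '-') is read from its reference end.

```python
from bisect import bisect_right
from itertools import accumulate

def query_sketch(segments, start, end):
    """segments: list of (chr, ref_start, ref_end, strand) in chromosome order.
    Returns the reference pieces that make up [start, end) of the mutated
    chromosome. Toy model, not the library implementation."""
    lengths = [e - s for _, s, e, _ in segments]
    coord_points = [0] + list(accumulate(lengths))
    pieces = []
    i = bisect_right(coord_points, start) - 1  # segment containing `start`
    while i < len(segments) and coord_points[i] < end:
        chrom, s, e, strand = segments[i]
        lo = max(start, coord_points[i]) - coord_points[i]    # offset in segment
        hi = min(end, coord_points[i + 1]) - coord_points[i]
        if strand == "+":
            pieces.append((chrom, s + lo, s + hi, strand))
        else:
            # Inverted segment: offsets count back from the reference end.
            pieces.append((chrom, e - hi, e - lo, strand))
        i += 1
    return pieces

segments = [("chr1", 0, 100, "+"), ("chr1", 200, 400, "-"), ("chr1", 500, 1000, "+")]
print(query_sketch(segments, 50, 150))
```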

orca_utils.coord_clip(pos, chrlen, binsize=128000, window_radius=16000000)[source]

Clip the coordinate so that the full window centered at the coordinate stays within the chromosome boundaries. coord_clip also tries to preserve the position of the coordinate relative to the grid specified by binsize whenever possible.

Parameters
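  • pos (int) – The coordinate to clip.

  • chrlen (int) – The length of the chromosome.

  • binsize (int, optional) – Default is 128000. The grid size. The position of the coordinate relative to this grid is preserved whenever possible.

  • window_radius (int, optional) – Default is 16000000. Half the size of the window centered at the coordinate.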

Returns

The clipped coordinate

Return type

int
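One way to read the description above is sketched below. The function clip_sketch is hypothetical and is not the orca_utils implementation. It assumes that "within chromosome boundaries" means the center lies between window_radius and chrlen - window_radius, and that "relative position to the grid" means the remainder modulo binsize.

```python
def clip_sketch(pos, chrlen, binsize=128000, window_radius=16000000):
    """Move pos into [window_radius, chrlen - window_radius] while keeping
    pos % binsize unchanged when such a position exists. Illustrative only."""
    lo, hi = window_radius, chrlen - window_radius
    if lo <= pos <= hi:
        return pos
    offset = pos % binsize
    if pos < lo:
        # Smallest coordinate >= lo with the same grid offset.
        cand = lo + (offset - lo) % binsize
        return cand if cand <= hi else lo
    # Largest coordinate <= hi with the same grid offset.
    cand = hi - (hi - offset) % binsize
    return cand if cand >= lo else hi

chrlen = 100_000_000  # made-up chromosome length
for p in (1_000, 50_000_000, 99_999_000):
    print(p, clip_sketch(p, chrlen))
```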

orca_utils.coord_round(x, gridsize=4000)[source]

Round coordinate to multiples of gridsize.

Parameters
  • x (int or numpy.ndarray) – Coordinates to round.

  • gridsize (int) – The gridsize to round by

Returns

The rounded coordinate

Return type

int
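Rounding to a grid can be sketched for scalar inputs as follows. The function coord_round_sketch is a stand-in, not the library function. It assumes round-to-nearest, and tie handling follows Python's built-in round(), which may differ from what orca_utils does.

```python
def coord_round_sketch(x, gridsize=4000):
    # Express x in grid units, round to the nearest unit, scale back.
    return int(round(x / gridsize)) * gridsize

print(coord_round_sketch(10_500))  # 2.625 grid units, rounds up to 3
print(coord_round_sketch(9_000))   # 2.25 grid units, rounds down to 2
```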

orca_utils.genomeplot(output, show_genes=False, show_tracks=False, show_coordinates=True, unscaled=False, file=None, cmap=None, unscaled_cmap=None, colorbar=True, maskpred=False, vmin=-1, vmax=2, model_labels=['H1-ESC', 'HFF'])[source]

Plot the multiscale prediction outputs for 32Mb output.

Parameters
  • output (dict) – The result dictionary to plot as returned by genomepredict.

  • show_genes (bool, optional) – Default is False. If True, plot the retrieved gene annotations corresponding to all windows used for the multiscale prediction.

  • show_tracks (bool, optional) – Default is False. If True, plot the retrieved chromatin tracks for CTCF, chromatin accessibility and histone marks for all windows used for the multiscale prediction.

  • show_coordinates (bool, optional) – Default is True. If True, annotate the generated plot with the genome coordinates.

  • unscaled (bool, optional) – Default is False. If True, plot the predictions and observations without normalizing by distance-based expectation.

  • file (str or None, optional) – Default is None. The output file prefix. No output file is generated if set to None.

  • cmap (str or None, optional) – Default is None. The colormap for plotting scaled interactions (log fold over distance-based background). If None, use colormaps.hnh_cmap_ext5.

  • unscaled_cmap (str or None, optional) – Default is None. The colormap for plotting unscaled interactions (log balanced contact score). If None, use colormaps.hnh_cmap_ext5.

  • colorbar (bool, optional) – Default is True. Whether to plot the colorbar.

  • maskpred (bool, optional) – Default is False. If True, the prediction heatmaps are masked at positions where the observed data have missing values when observed data are provided in output dict.

  • vmin (int, optional) – Default is -1. The lower bound value for the heatmap colormap.

  • vmax (int, optional) – Default is 2. The upper bound value for the heatmap colormap.

  • model_labels (list(str), optional) – Model labels for plotting. Default is ["H1-ESC", "HFF"].

Returns

Return type

None

orca_utils.genomeplot_256Mb(output, show_coordinates=True, unscaled=False, file=None, cmap=None, unscaled_cmap=None, colorbar=True, maskpred=True, vmin=-1, vmax=2, model_labels=['H1-ESC', 'HFF'])[source]

Plot the multiscale prediction outputs for 256Mb output.

Parameters
  • output (dict) – The result dictionary to plot as returned by genomepredict_256Mb.

  • show_coordinates (bool, optional) – Default is True. If True, annotate the generated plot with the genome coordinates.

  • unscaled (bool, optional) – Default is False. If True, plot the predictions and observations without normalizing by distance-based expectation.

  • file (str or None, optional) – Default is None. The output file prefix. No output file is generated if set to None.

  • cmap (str or None, optional) – Default is None. The colormap for plotting scaled interactions (log fold over distance-based background). If None, use colormaps.hnh_cmap_ext5.

  • unscaled_cmap (str or None, optional) – Default is None. The colormap for plotting unscaled interactions (log balanced contact score). If None, use colormaps.hnh_cmap_ext5.

  • colorbar (bool, optional) – Default is True. Whether to plot the colorbar.

  • maskpred (bool, optional) – Default is True. If True, the prediction heatmaps are masked at positions where the observed data have missing values when observed data are provided in output dict.

  • vmin (int, optional) – Default is -1. The lower bound value for the heatmap colormap.

  • vmax (int, optional) – Default is 2. The upper bound value for the heatmap colormap.

  • model_labels (list(str), optional) – Model labels for plotting. Default is ["H1-ESC", "HFF"].

Returns

Return type

None

orca_utils.process_anno(anno_scaled, base=0, window_radius=16000000)[source]

Process annotations to the format used by Orca plotting functions such as genomeplot and genomeplot_256Mb.

Parameters
  • anno_scaled (list(list(..))) – List of annotations. Each annotation can be a region specified by [start: int, end: int, info: str] or a position specified by [pos: int, info: str]. Acceptable info strings for a region currently include matplotlib color names. Acceptable info strings for a position are currently 'single' or 'double', which determine whether the annotation is drawn with a single or a double line.

  • base (int) – The starting position of the 32Mb (if window_radius is 16000000) or 256Mb (if window_radius is 128000000) region analyzed.

  • window_radius (int) – Half the size of the region analyzed. It must be either 16000000 (32Mb region) or 128000000 (256Mb region).

Returns

annotation – Processed annotations with coordinates transformed to relative coordinates in the range 0-1.

Return type

list
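The transformation to relative coordinates can be sketched as a linear rescaling of the analyzed window. The helper to_relative is hypothetical. It assumes the window spans base to base + 2 * window_radius, and it does not reproduce the exact structure of the list returned by process_anno.

```python
def to_relative(pos, base=0, window_radius=16000000):
    # Map a genomic position inside the analyzed window onto the 0-1 range.
    return (pos - base) / (2 * window_radius)

# Illustrative region annotation in the documented [start, end, info] form.
region = [8_000_000, 24_000_000, "red"]
relative = [to_relative(region[0]), to_relative(region[1]), region[2]]
print(relative)
```

For a 256Mb region, the same helper would be called with window_radius=128000000.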