selene_utils2 module¶

This module provides the selene-based utilities for training and using Orca sequence models for multiscale genome interaction prediction. This module contains code from selene.

class selene_utils2.Genomic2DFeatures(input_paths, features, shape, cg=False)[source]¶

Bases: selene_sdk.targets.target.Target

Stores one or multple datasets of Hi-C style 2D data in cooler format.

Parameters

input_paths (list(str) or str) – List of paths to the Cooler datasets or a path to a single Cooler dataset. For mcool files, the path should include the resolution. Please refer to cooler.Cooler documentation for support of mcool files.
features (list(str) or str) – The list of dataset names that should match the input_path.
shape (tuple(int, int)) – The shape of the output array (# of bins by # of bins).
cg (bool, optional) – If yes, adpative coarse-graining is applied to the output.

data¶

The list of Cooler objects for the cooler files.

Type: list(cooler.Cooler)

n_features¶

The number of cooler files.

Type: int

feature_index_dict¶

A dictionary mapping feature names (str) to indices (int), where the index is the position of the feature in features.

Type: dict

shape¶

The shape of the output array (# of bins by # of bins).

Type: tuple(int, int)

cg¶

Whether adpative coarse-graining is applied to the output.

Type: bool

get_feature_data(chrom, start, end, chrom2=None, start2=None, end2=None)[source]¶: Retrieve the feature data for some coordinate.

class selene_utils2.MemmapGenome(input_path, blacklist_regions=None, bases_order=None, init_unpicklable=False, memmapfile=None)[source]¶

Bases: selene_sdk.sequences.genome.Genome

Memmapped version of selene.sequence.Genome. Faster for sequence retrieval by storing all precomputed one-hot encodings in a memmapped file (~40G for human genome).

The memmapfile can be an exisiting memmapped file or a path where you want to create the memmapfile. If the specified memmapfile does not exist, it will be created the first time you call any method of MemmapGenome or if MemmapGenome is initialized with init_unpickable=True. Therefore the first call will take some time for the creation of memmapfile if it does not exist. Also, if memmapfile has not been created, be careful not to run multiple instances of MemmapGenome in parallel (such as with Dataloader), because as each process will try to create the file.

Parameters

input_path (str) – Path to an indexed FASTA file, that is, a *.fasta file with a corresponding *.fai file in the same directory. This file should contain the target organism’s genome sequence.
init_unpickleable (bool, optional) – Default is False. If False, delay part of initialization code to executed only when a relevant method is called. This enables the object to be pickled after instantiation. init_unpickleable should be False when used when multi-processing is needed e.g. DataLoader.
memmapfile (str or None, optional) – Specify the numpy.memmap file for storing the encoding of the genome. If memmapfile does not exist, it will be created when the encoding is requested for the first time.

genome¶

The FASTA file containing the genome sequence.

Type: pyfaidx.Fasta

chrs¶

The list of chromosome names.

Type: list(str)

len_chrs¶

A dictionary mapping the names of each chromosome in the file to the length of said chromosome.

Type: dict

get_encoding_from_coords(chrom, start, end, strand='+', pad=False)[source]¶

Gets the one-hot encoding of the genomic sequence at the queried coordinates.

Parameters

chrom (str) – The name of the chromosome or region, e.g. “chr1”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
strand ({'+', '-', '.'}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.
pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.

Returns

The \(L \times 4\) encoding of the sequence, where \(L = end - start\).

Return type

numpy.ndarray, dtype=numpy.float32

Raises

AssertionError – If it cannot retrieve encoding that matches the length L = end - start such as when end > chromosome length and pad=False

get_encoding_from_coords_check_unk(chrom, start, end, strand='+', pad=False)[source]¶

Gets the one-hot encoding of the genomic sequence at the queried coordinates and check whether the sequence contains unknown base(s).

Parameters

chrom (str) – The name of the chromosome or region, e.g. “chr1”.
start (int) – The 0-based start coordinate of the first position in the sequence.
end (int) – One past the 0-based last position in the sequence.
strand ({'+', '-', '.'}, optional) – Default is ‘+’. The strand the sequence is located on. ‘.’ is treated as ‘+’.
pad (bool, optional) – Default is False. Pad the output sequence with ‘N’ if start and/or end are out of bounds to return a sequence of length end - start.

Returns

tuple[0] is the \(L \times 4\) encoding of the sequence, where

\(L = end - start\). L = 0 for the NumPy array returned. * tuple[1] is the boolean value that indicates whether the sequence contains any unknown base(s) specified in self.UNK_BASE

Return type

tuple(numpy.ndarray, bool)

Raises

AssertionError – If it cannot retrieve encoding that matches the length L = end - start such as when end > chromosome length and pad=False

init()[source]¶

class selene_utils2.MultibinGenomicFeatures(input_path, features, bin_size, step_size, shape, mode='center')[source]¶

Bases: selene_sdk.targets.target.Target

Multibin version of selene.targets.GenomicFeatures Stores the dataset specifying features for genomic regions. Accepts a *.bed file with the following columns, in order:

[chrom, start, end, strand, feature]

start and end is 0-based as in bed file format.

Note that unlike selene_sdk.targets.GenomicFeatures which queries the tabix data file out-of-core, MultibinGenomicFeatures requires more memory as it loads the entire bed file in memory as a pyranges table for higher query speed.

Parameters

input_path (str) – Path to the bed file.
features (list(str)) – The non-redundant list of genomic features names. The output array will have the same feature order as specified in this list.
bin_size (int) – The length of the bin(s) in which we check for features
step_size (int) – The interval between two adjacent bins.
shape (tuple(int, int)) – The shape of the output array (n_features by n_bins).
mode (str, optional) – For mode==’any’, any overlap will get 1, and no overlap will get 0. For mode==’center’, only overlap with the center basepair of each bin will get 1, otherwise 0. For `mode==’proportion’, the proportion of overlap will be returned.

data¶

The data stored in PyRanges object.

Type: pyranges.PyRanges

n_features¶

The number of distinct features.

Type: int

feature_index_dict¶

A dictionary mapping feature names (str) to indices (int), where the index is the position of the feature in features.

Type: dict

index_feature_dict¶

A dictionary mapping indices (int) to feature names (str), where the index is the position of the feature in the input features.

Type: dict

bin_size¶

The length of the bin(s) in which we check for features

Type: int

step_size¶

The interval between two adjacent bins.

Type: int

shape¶

The shape of the output array (n_features by n_bins).

Type: tuple(int, int)

mode¶

For mode==’any’, any overlap will get assigned 1, and no overlap will

get assigned 0. - For mode==’center’, only overlap with the center basepair of each bin will get assigned 1, otherwise assigned 0. - For `mode==’proportion’, the proportion of overlap will be assigned.

Type: str

get_feature_data(chrom, start, end)[source]¶

For a genomic region specified, return a number of features by number of bins array for overlap of each genomic bin and each feature. How the overlap is quantified depends on the mode attribute specified during initialization.

For mode==’any’, any overlap will get assigned 1, and no overlap will get assigned 0. For mode==’center’, only overlap with the center basepair of each bin will get assigned 1, otherwise assigned 0. For `mode==’proportion’, the proportion of overlap will be assigned.

Parameters

chrom (str) – The name of the region (e.g. ‘1’, ‘2’, …, ‘X’, ‘Y’).
start (int) – The 0-based first position in the region.
end (int) – One past the 0-based last position in the region.

Returns

\(L \times N\) array, where \(L = \) and \(N =\) self.n_features.

Return type

numpy.ndarray

init()[source]¶

class selene_utils2.RandomPositionsSamplerHiC(reference_sequence, target, features, target_1d=None, background_cis_file=None, background_trans_file=None, seed=436, validation_holdout=['chr6', 'chr7'], test_holdout=['chr8', 'chr9'], sequence_length=1000000, max_seg_length=None, length_schedule=None, position_resolution=1, random_shift=0, random_strand=True, cross_chromosome=True, permute_segments=False, mode='train')[source]¶

Bases: selene_sdk.samplers.online_sampler.OnlineSampler

This sampler randomly selects a region in the genome and retrieves sequence and relevant Hi-C and optionally multibin genomic data from that region. This implementation is modified based on selene_sdk.samplers.RandomPositionSampler.

Parameters

reference_sequence (selene_sdk.sequences.Genome) – A genome to retrieve sequence from.
target (Genomic2DFeatures) – Genomic2DFeatures object that loads the cooler files.
features (list(str)) – List of names that correspond to the cooler files.
target_1d (MultibinGenomicFeatures or None, optional) – MultibinGenomicFeatures object that loads 1D genomic feature data.
background_cis_file (str or None, optional) – Path to the numpy file that stores the distance-based expected background balanced scores for cis-interactions. If specified with background_trans_file, the sampler will return corresponding background array that matches with the 2D feature retrieved.
background_trans_file (str or None, optional) – Path to the numpy file that stores the expected background balanced scores for trans-interactions. See doc for background_cis_file for more detail.
seed (int, optional) – Default is 436. Sets the random seed for sampling.
validation_holdout (list(str), optional) – Default is [‘chr6’, ‘chr7’]. Holdout can be regional or proportional. If regional, expects a list (e.g. [‘chrX’, ‘chrY’]). Regions must match those specified in the first column of the tabix-indexed BED file. If proportional, specify a percentage between (0.0, 1.0). Typically 0.10 or 0.20.
test_holdout (list(str), optional) – Default is [‘chr8’, ‘chr9’]. See documentation for validation_holdout for additional information.
sequence_length (int, optional) – Default is 1000000. Model is trained on sequences of size sequence_length where genomic features are retreived for the same regions as the sequences.
max_seg_length (int or None, optional) – Default is None. If specified and cross_chromosome is True, bound the maximum length of each sequence segment.
length_schedule (list(float, list(int, int)) or None, optional) – Default is None. If specified and cross_chromosome is True, decide the sequence segment length to sample according to the length schedule (before trimming to fit in the sequence length). The length schedule is in the format of [p, [min_len, max_len]], which means, with probability p, decide the length by randomly sampling an integer between min_len and max_len, and retrieve the maximal remaining length as default with probability 1-p.
position_resolution (int, optional) – Default is 1. Preprocess the sampled start position by start = start - start % position_resolution. Useful for binned data.
random_shift (int, optional) – Default is 0. Shift the coordinates to retrieve sequence by a random integer in the range of [-random_shift, random_shift).
random_strand (bool, optional) – Default is True. If True, randomly select the strand of the sequence, otherwise alway use the ‘+’ strand.
cross_chromosome (bool, optional) – Default is True. If True, allows sampling multiple segments of sequences and the corresponding features. The default is sampling the maximum length allowed by sequence_length, thus multiple segments will only be sampled if sequence_length is larger than the minimum chromosome length or when max_seg_length and length_schedule is specified to limit the sequence segment length.
permute_segments (bool, optional) – Default is False. If True, permute the order of segments when multiple segments are sampled.
mode ({'train', 'validate', 'test'}) – Default is ‘train’. The mode to run the sampler in.

reference_sequence¶

A genome to retrieve sequence from.

Type: selene_sdk.sequences.Genome

target¶

The selene_sdk.targets.Target object holding the features that we would like to predict.

Type: selene_sdk.targets.Target

target_1d¶

MultibinGenomicFeatures object that loads 1D genomic feature data.

Type: MultibinGenomicFeatures or None, optional

background_cis¶

One-dimensional numpy.ndarray that stores the distance-based expected background balanced scores for cis-interactions.

Type: numpy.ndarray

background_trans¶

The expected background balanced score for trans-interactions.

Type: float

bg¶

Whether the sample will retrieve background arrays.

Type: bool

validation_holdout¶

The samples to hold out for validating model performance. These can be “regional” or “proportional”. If regional, this is a list of region names (e.g. [‘chrX’, ‘chrY’]). These regions must match those specified in the first column of the tabix-indexed BED file. If proportional, this is the fraction of total samples that will be held out.

Type: list(str)

test_holdout¶

The samples to hold out for testing model performance. See the documentation for validation_holdout for more details.

Type: list(str)

sequence_length¶

Model is trained on sequences of size sequence_length where genomic features are retreived for the same regions as the sequences.

Type: int

max_seg_length¶

Default is None. If specified and cross_chromosome is True, bound the maximum length of each sequence segment.

Type: int or None

length_schedule¶

Default is None. If specified and cross_chromosome is True, decide the sequence segment length to sample according to the length schedule (before trimming to fit in the sequence length). The length schedule is in the format of [p, [min_len, max_len]], which means, with probability p, decide the length by randomly sampling an integer between min_len and max_len, and retrieve the maximal remaining length as default with probability 1-p.

Type: list(float, list(int, int)) or None

position_resolution¶

Default is 1. Preprocess the sampled start position by start = start - start % position_resolution. Useful for binned data.

Type: int

random_shift¶

Default is 0. Shift the coordinates to retrieve sequence by a random integer in the range of [-random_shift, random_shift).

Type: int

random_strand¶

Default is True. If True, randomly select the strand of the sequence, otherwise alway use the ‘+’ strand.

Type: bool

cross_chromosome¶

Default is True. If True, allows sampling multiple segments of sequences and the corresponding features. The default is sampling the maximum length allowed by sequence_length, thus multiple segments will only be sampled if sequence_length is larger than the minimum chromosome length or when max_seg_length and length_schedule is specified to limit the sequence segment length.

Type: bool

permute_segments¶

Default is False. If True, permute the order of segments when multiple segments are sampled.

Type: bool

modes¶

The list of modes that the sampler can be run in.

Type: list(str)

mode¶

The current mode that the sampler is running in. Must be one of the modes listed in modes.

Type: str

init()[source]¶

sample(batch_size=1, mode=None, coordinate_only=False)[source]¶

Randomly draws a mini-batch of examples and their corresponding labels.

Parameters

batch_size (int, optional) – Default is 1. The number of examples to include in the mini-batch.
mode (str, optional) – Default is None. The operating mode that the object should run in. If None, will use the current mode self.mode.
coordinate_only (bool, optional) – Default is False. If True, only return the coordinates.

Returns

sequences, targets, … – A tuple containing the numeric representation of the sequence examples, their corresponding 2D targets, and optionally 1D targets (if target_1d were specified) and background matrices (if background_cis_file and background_trans_file were specified). The shape of sequences will be \(B \times L \times N\), where \(B\) is batch_size, \(L\) is the sequence length, and \(N\) is the size of the sequence type’s alphabet. The shape of targets depends on target.shape. For example it will be \(B \times M \times M\), when \(M \times M\) is target.shape. The shape of 1D targets is \(B \times K \times F\), where \(K = \) and \(F =\) self.n_features. The shape of background matrices are the same as targets.

Return type

tuple(numpy.ndarray, numpy.ndarray, ..)

class selene_utils2.SampleIndices(indices, weights)¶

Bases: tuple

property indices¶: Alias for field number 0

property weights¶: Alias for field number 1