orca_predict module

This module provides functions for making various types of predictions with Orca models. It is the main module you need for interacting with Orca models.

To use any of the prediction functions, load_resources has to be called first to load the necessary resources.

The coordinates used in Orca are 0-based, inclusive for the start coordinate and exclusive for the end coordinate, consistent with Python conventions.
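As a minimal sketch of this convention, a 32Mb window around a center coordinate can be written as a half-open interval whose length is simply end minus start. The helper name window_from_center is hypothetical and not part of Orca.

```python
def window_from_center(center, radius=16_000_000):
    """Return a 0-based, half-open [start, end) window around `center`."""
    start = center - radius
    end = center + radius  # exclusive, as in Python slicing
    return start, end


start, end = window_from_center(100_000_000)
length = end - start  # no "+ 1" is needed for half-open intervals
print(start, end, length)
```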

orca_predict.genomepredict(sequence, mchr, mpos=-1, wpos=-1, models=['h1esc', 'hff'], targets=None, annotation=None, use_cuda=True, nan_thresh=1)

Multiscale prediction for a 32Mb sequence input. The function zooms into the specified position, generating a series of 32Mb, 16Mb, 8Mb, 4Mb, 2Mb and 1Mb predictions at increasing resolution (up to 4kb). It also processes information used only for plotting, including targets and annotation.

For larger sequences and interchromosomal predictions, use the 256Mb input with genomepredict_256Mb.
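The series of zoom windows described above can be illustrated with a small sketch. The helper zoom_windows is hypothetical, and its placement rule (center the next window on mpos, then shift it back inside the parent window) is an assumption made for illustration; Orca's own window placement may differ. The figure of 250 bins per level is inferred from the stated 4kb resolution at 1Mb.

```python
BINS_PER_LEVEL = 250  # assumption: 1Mb / 4kb = 250 bins per prediction


def zoom_windows(start, mpos, top_size=32_000_000, n_levels=6):
    """Return [start, end) windows that halve in size while containing mpos."""
    windows = []
    size, w_start = top_size, start
    for _ in range(n_levels):
        windows.append((w_start, w_start + size))
        half = size // 2
        # Center the next window on mpos, then clamp it inside the parent.
        child = mpos - half // 2
        child = max(w_start, min(child, w_start + size - half))
        w_start, size = child, half
    return windows


for w_start, w_end in zoom_windows(0, 16_000_000):
    # Print each window with its bin size in base pairs.
    print(w_start, w_end, (w_end - w_start) // BINS_PER_LEVEL)
```

Under these assumptions the bin size goes from 128kb at the 32Mb level down to 4kb at the 1Mb level.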

Parameters
  • sequence (numpy.ndarray) – One-hot sequence encoding of shape 1 x 4 x 32000000. The encoding can be generated with selene_sdk.Genome.sequence_to_encoding().

  • mchr (str) – Chromosome name. This is used for annotation purposes only.

  • mpos (int, optional) – The coordinate to zoom into for multiscale prediction.

  • wpos (int, optional) – The coordinate of the center position of the sequence, which is start position + 16000000.

  • models (list(torch.nn.Module or str), optional) – Models to use. Default is H1-ESC and HFF Orca models.

  • targets (list(numpy.ndarray), optional) – The observed balanced contact matrices from the 32Mb region. Used only for plotting when used with genomeplot. The length and order of the list of targets should match the models specified (default is H1-ESC and HFF Orca models). The dimensions of the arrays should be 8000 x 8000 (4kb resolution).

  • annotation (str or None, optional) – List of annotations for plotting. The annotation can be generated with orca_utils.process_anno; see its documentation for more details.

  • use_cuda (bool, optional) – Default is True. If False, use CPU.

  • nan_thresh (int, optional) – Default is 1. The threshold on the proportion of NaN values allowed during downsampling of the observed matrices. Only relevant for plotting. Lower-resolution observed matrix values are computed by averaging multiple bins into one. By default, missing values are allowed and the average is taken over the non-missing values only; values with more than the specified proportion of missing values are filled with NaN.
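The effect of nan_thresh can be sketched in one dimension. This is an illustration of the averaging rule described for nan_thresh, not Orca's implementation (which operates on 2D matrices); the function name downsample_mean and the sample values are made up.

```python
import math


def downsample_mean(values, factor, nan_thresh=1.0):
    """Average consecutive groups of `factor` values, ignoring NaNs.

    A group becomes NaN when its fraction of NaNs exceeds `nan_thresh`
    (or when it has no valid value at all).
    """
    out = []
    for i in range(0, len(values), factor):
        group = values[i:i + factor]
        valid = [v for v in group if not math.isnan(v)]
        nan_fraction = 1 - len(valid) / len(group)
        if not valid or nan_fraction > nan_thresh:
            out.append(math.nan)
        else:
            out.append(sum(valid) / len(valid))
    return out


row = [1.0, math.nan, 3.0, math.nan, math.nan, math.nan, 2.0, 4.0]
lenient = downsample_mean(row, 2)                  # default: tolerate NaNs
strict = downsample_mean(row, 2, nan_thresh=0.25)  # reject half-missing bins
print(lenient)
print(strict)
```

With the default threshold, a bin is NaN only when every contributing value is missing; a lower threshold trades coverage for reliability.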

Returns

output – Result dictionary that can be used as input for genomeplot. The dictionary has the following keys:

  • predictions : list(list(numpy.ndarray), list(numpy.ndarray))

    Multi-level predictions for H1-ESC and HFF cell types.

  • experiments : list(list(numpy.ndarray), list(numpy.ndarray))

    Observations for H1-ESC and HFF cell types that match the predictions. Exists only if targets is specified.

  • normmats : list(list(numpy.ndarray), list(numpy.ndarray))

    Background distance-based expected balanced contact matrices for H1-ESC and HFF cell types that match the predictions.

  • start_coords : list(int)

    Start coordinates for the prediction at each level.

  • end_coords : list(int)

    End coordinates for the prediction at each level.

  • chr : str

    The chromosome name.

  • annos : list(list(…))

    Annotation information, in the format output by orca_utils.process_anno. Exists only if annotation is specified.

Return type

dict
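A short sketch of how the returned dictionary might be inspected, using only the keys listed above. The dictionary here is a mock with made-up coordinates; a real one would come from genomepredict. The helper describe_levels is hypothetical.

```python
def describe_levels(output):
    """Summarize each prediction level as 'chr:start-end (size)'."""
    lines = []
    for start, end in zip(output["start_coords"], output["end_coords"]):
        size_mb = (end - start) / 1_000_000
        lines.append(f"{output['chr']}:{start}-{end} ({size_mb:g}Mb)")
    return lines


# Mock result with made-up coordinates, for illustration only.
mock_output = {
    "chr": "chr9",
    "start_coords": [0, 8_000_000, 12_000_000],
    "end_coords": [32_000_000, 24_000_000, 20_000_000],
}

for line in describe_levels(mock_output):
    print(line)

# "experiments" and "annos" are optional keys, so test for them first.
has_observations = "experiments" in mock_output
```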

orca_predict.genomepredict_256Mb(sequence, mchr, normmats, chrlen, mpos=-1, wpos=-1, models=['h1esc_256m', 'hff_256m'], targets=None, annotation=None, padding_chr=None, use_cuda=True, nan_thresh=1)

Multiscale prediction for a 256Mb sequence input. The function zooms into the specified position, generating a series of 256Mb, 128Mb, 64Mb, and 32Mb predictions at increasing resolution (up to 128kb). It also processes information used only for plotting, including targets and annotation.

This function accepts multichromosomal input sequences. It therefore needs an extra input, normmats, to encode the chromosomal information. See the documentation of the normmats argument for details.

Parameters
  • sequence (numpy.ndarray) – One-hot sequence encoding of shape 1 x 4 x 256000000. The encoding can be generated with selene_sdk.Genome.sequence_to_encoding().

  • mchr (str) – The chromosome name of the first chromosome included in the sequence. This is used for annotation purposes only.

  • normmats (list(numpy.ndarray)) – A list of distance-based background matrices for H1-ESC and HFF. The normmats list contains arrays with dimensions 8000 x 8000 (32kb resolution). Interchromosomal interactions are filled with the expected balanced contact score for interchromosomal interactions.

  • chrlen (int) – The coordinate of the end of the first chromosome in the input, which is the chromosome that will be zoomed into.

  • mpos (int, optional) – Default is -1. The coordinate to zoom into for multiscale prediction. If neither mpos nor wpos are specified, it zooms into the center of the input by default.

  • wpos (int, optional) – Default is -1. The coordinate of the center position of the sequence, which is start position + 128000000. If neither mpos nor wpos is specified, it zooms into the center of the input by default.

  • models (list(torch.nn.Module or str), optional) – Models to use. Default is H1-ESC(256Mb) and HFF(256Mb) Orca models.

  • targets (list(numpy.ndarray), optional) – The observed balanced contact matrices from the 256Mb sequence. Used only for plotting when used with genomeplot. The length and order of the list of targets should match the models specified (default is H1-ESC and HFF Orca models). The dimensions of the arrays should be 8000 x 8000 (32kb resolution).

  • annotation (str or None, optional) – Default is None. List of annotations for plotting. The annotation can be generated with orca_utils.process_anno; see its documentation for more details.

  • padding_chr (str or None, optional) – Default is None. Name of the padding chromosome after the first. Used for annotation only. TODO: be more flexible in the support for multiple chromosomes.

  • use_cuda (bool, optional) – Default is True. If False, use CPU.

  • nan_thresh (int, optional) – Default is 1. The threshold on the proportion of NaN values allowed during downsampling of the observed matrices. Only relevant for plotting. Lower-resolution observed matrix values are computed by averaging multiple bins into one. By default, missing values are allowed and the average is taken over the non-missing values only; values with more than the specified proportion of missing values are filled with NaN.
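To make the role of normmats for multichromosomal input more concrete, the sketch below builds a toy background matrix for two concatenated chromosomes: intrachromosomal entries depend on bin distance, and interchromosomal entries share one constant. The function build_background, the 4 x 4 size, and all numeric values are made up for illustration; real matrices are 8000 x 8000, and in that setting the boundary bin would presumably be derived from chrlen and the 32kb bin size.

```python
def build_background(n_bins, chr_boundary_bin, expected_by_distance, inter_value):
    """Fill an n_bins x n_bins background matrix for two concatenated chromosomes.

    Bins before `chr_boundary_bin` belong to the first chromosome, the rest
    to the padding chromosome.
    """
    matrix = []
    for i in range(n_bins):
        row = []
        for j in range(n_bins):
            same_chr = (i < chr_boundary_bin) == (j < chr_boundary_bin)
            if same_chr:
                row.append(expected_by_distance[abs(i - j)])
            else:
                row.append(inter_value)
        matrix.append(row)
    return matrix


toy = build_background(4, 2, [1.0, 0.5, 0.25, 0.125], inter_value=0.01)
for row in toy:
    print(row)
```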

Returns

output – Result dictionary that can be used as input for genomeplot. The dictionary has the following keys:

  • predictions : list(list(numpy.ndarray), list(numpy.ndarray))

    Multi-level predictions for H1-ESC and HFF cell types.

  • experiments : list(list(numpy.ndarray), list(numpy.ndarray))

    Observations for H1-ESC and HFF cell types that match the predictions. Exists only if targets is specified.

  • normmats : list(list(numpy.ndarray), list(numpy.ndarray))

    Background distance-based expected balanced contact matrices for H1-ESC and HFF cell types that match the predictions.

  • start_coords : list(int)

    Start coordinates for the prediction at each level.

  • end_coords : list(int)

    End coordinates for the prediction at each level.

  • chr : str

    The chromosome name.

  • annos : list(list(…))

    Annotation information, in the format output by orca_utils.process_anno. Exists only if annotation is specified.

Return type

dict

orca_predict.load_resources(models=['32M'], use_cuda=True, use_memmapgenome=True)

Load resources for Orca predictions, including the specified Orca models and the hg38 reference genome. It also creates Genomic2DFeatures objects for experimental micro-C datasets (for comparison with predictions). Loaded resources are accessible as global variables.

The global variables generated are listed below:

hg38 : selene_utils2.MemmapGenome or selene_sdk.sequences.Genome

If use_memmapgenome==True and the resource file for hg38 mmap exists, use MemmapGenome instead of Genome.

h1esc : orca_models.H1esc

1-32Mb Orca H1-ESC model

hff : orca_models.Hff

1-32Mb Orca HFF model

h1esc_256m : orca_models.H1esc_256M

32-256Mb Orca H1-ESC model

hff_256m : orca_models.Hff_256M

32-256Mb Orca HFF model

h1esc_1m : orca_models.H1esc_1M

1Mb Orca H1-ESC model

hff_1m : orca_models.Hff_1M

1Mb Orca HFF model

target_h1esc : selene_utils2.Genomic2DFeatures

Genomic2DFeatures object that loads H1-ESC micro-C dataset 4DNFI9GMP2J8 at 4kb resolution, used for comparison with 1-32Mb models.

target_hff : selene_utils2.Genomic2DFeatures

Genomic2DFeatures object that loads HFF micro-C dataset 4DNFI643OYP9 at 4kb resolution, used for comparison with 1-32Mb models.

target_h1esc_256m : selene_utils2.Genomic2DFeatures

Genomic2DFeatures object that loads H1-ESC micro-C dataset 4DNFI9GMP2J8 at 32kb resolution, used for comparison with 32-256Mb models.

target_hff_256m : selene_utils2.Genomic2DFeatures

Genomic2DFeatures object that loads HFF micro-C dataset 4DNFI643OYP9 at 32kb resolution, used for comparison with 32-256Mb models.

target_h1esc_1m : selene_utils2.Genomic2DFeatures

Genomic2DFeatures object that loads H1-ESC micro-C dataset 4DNFI9GMP2J8 at 1kb resolution, used for comparison with 1Mb models.

target_hff_1m : selene_utils2.Genomic2DFeatures

Genomic2DFeatures object that loads HFF micro-C dataset 4DNFI643OYP9 at 1kb resolution, used for comparison with 1Mb models.

target_available : bool

Indicates whether the micro-C dataset resource file is available.

Parameters
  • models (list(str)) – List of model types to load. Supported model types include “32M”, “256M”, and “1M”, corresponding to the 1-32Mb, 32-256Mb, and 1Mb models. Lowercase names are also accepted.

  • use_cuda (bool, optional) – Default is True. If True, loaded models are moved to GPU.

  • use_memmapgenome (bool, optional) – Default is True. If True and the resource file for hg38 mmap exists, use MemmapGenome instead of Genome.
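The mapping between model types and the global variables listed above can be sketched as follows. MODEL_GLOBALS, expected_globals, and load_orca are hypothetical helpers written for this example. load_orca assumes that the documented globals are reachable as attributes of the orca_predict module after loading, and it requires Orca and its resource files to be installed, so it is defined here but not called.

```python
MODEL_GLOBALS = {
    "32M": ("h1esc", "hff"),
    "256M": ("h1esc_256m", "hff_256m"),
    "1M": ("h1esc_1m", "hff_1m"),
}


def expected_globals(models):
    """Return the model globals documented for the given model types."""
    names = []
    for model_type in models:
        key = model_type.upper()  # lowercase spellings are also accepted
        if key not in MODEL_GLOBALS:
            raise ValueError(f"Unsupported model type: {model_type!r}")
        names.extend(MODEL_GLOBALS[key])
    return names


def load_orca(models=("32M",), use_cuda=False):
    """Load Orca resources and return the requested models by name."""
    import orca_predict  # imported lazily so the sketch parses without Orca

    orca_predict.load_resources(models=list(models), use_cuda=use_cuda)
    # Assumption: loaded resources are module-level attributes.
    return {name: getattr(orca_predict, name) for name in expected_globals(models)}


print(expected_globals(["32M", "256m"]))
```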

orca_predict.process_custom(region_list, ref_region_list, mpos, genome, ref_mpos_list=None, anno_list=None, ref_anno_list=None, custom_models=None, target=True, file=None, show_genes=True, show_tracks=False, window_radius=16000000, use_cuda=True)

Generate multiscale genome interaction predictions for a custom variant specified by an ordered list of genomic segments.

Parameters
  • region_list (list(list(..))) – List of segments that make up the alternative allele. Each segment is specified by a list (chr: str, start: int, end: int, strand: str), and segments are concatenated together in the given order. The total length should sum up to 32Mb. An example input is [[‘chr5’, 89411065, 89411065+16000000, ‘-’], [‘chr7’, 94378248, 94378248+16000000, ‘+’]].

  • ref_region_list (list(list(..))) – The reference regions to predict. These can be any reference regions with the length of the specified window size. Each reference region is specified with a list (chr: str, start: int, end: int, strand: str). The strand must be ‘+’. The intended use is predicting the genome interactions for each segment that constitutes the alternative allele within the native reference sequence context. An example input is [[‘chr5’, 89411065-16000000, 89411065+16000000, ‘+’], [‘chr7’, 94378248-16000000, 94378248+16000000, ‘+’]].

  • mpos (int) – The position to zoom into in the alternative allele. Note that mpos here specifies the relative position with respect to the start of the 32Mb sequence.

  • genome (selene_utils2.MemmapGenome or selene_sdk.sequences.Genome) – The reference genome object to extract sequence from.

  • ref_mpos_list (list(int) or None, optional) – Default is None. List of positions to zoom into for each of the reference regions specified in ref_region_list. If not specified, then zoom into the center of each region. Note that ref_mpos_list specifies the relative positions with respect to the start of the 32Mb sequence. For example, 16000000 means the center of the sequence.

  • custom_models (list(torch.nn.Module or str) or None, optional) – Models to use instead of the default H1-ESC and HFF Orca models. Default is None.

  • target (list(selene_utils2.Genomic2DFeatures or str) or bool, optional) – If specified as list, use this list of targets to retrieve experimental data (for plotting only). Default is True and will use micro-C data for H1-ESC and HFF cells (4DNFI9GMP2J8, 4DNFI643OYP9) that correspond to the default models.

  • file (str or None, optional) – Default is None. The output file prefix.

  • show_genes (bool, optional) – Default is True. If True, generate gene annotation visualization file in pdf format that matches the windows of multiscale predictions.

  • show_tracks (bool, optional) – Default is False. If True, generate chromatin tracks visualization file in pdf format that matches the windows of multiscale predictions.

  • window_radius (int, optional) – Default is 16000000. Currently only 16000000 (32Mb window) is accepted.

  • use_cuda (bool, optional) – Default is True. Use CPU if False.
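Before calling process_custom, it can help to check that a region_list satisfies the constraints stated above (valid strands, total length of 32Mb). The helper check_region_list below is a hypothetical pre-flight check, not part of Orca. It also returns the offsets of the junctions between segments, which are natural candidates for mpos because mpos is relative to the start of the 32Mb sequence.

```python
def check_region_list(region_list, total_length=32_000_000):
    """Validate [chr, start, end, strand] segments; return junction offsets."""
    offsets, position = [], 0
    for chrom, start, end, strand in region_list:
        if strand not in ("+", "-"):
            raise ValueError(f"Invalid strand {strand!r} for {chrom}:{start}-{end}")
        if end <= start:
            raise ValueError(f"Empty or reversed segment {chrom}:{start}-{end}")
        position += end - start  # half-open intervals: length is end - start
        offsets.append(position)
    if position != total_length:
        raise ValueError(f"Segments sum to {position}, expected {total_length}")
    return offsets[:-1]  # offsets where consecutive segments meet


region_list = [
    ["chr5", 89411065, 89411065 + 16000000, "-"],
    ["chr7", 94378248, 94378248 + 16000000, "+"],
]
junctions = check_region_list(region_list)
print(junctions)
```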

Returns

outputs_ref_l, outputs_ref_r, outputs_alt – Reference allele predictions zooming into the left boundary of the duplication, Reference allele predictions zooming into the right boundary of the duplication, Alternative allele predictions zooming into the duplication breakpoint. The returned results are in the format of dictionaries containing the prediction outputs and other retrieved information. These dictionaries can be directly used as input to genomeplot or genomeplot_256Mb. See documentation of genomepredict or genomepredict_256Mb for details of the dictionary content.

Return type

dict, dict, dict

orca_predict.process_del(mchr, mstart, mend, genome, cmap=None, file=None, custom_models=None, target=True, show_genes=True, show_tracks=False, window_radius=16000000, padding_chr='chr1', use_cuda=True)

Generate multiscale genome interaction predictions for a deletion variant.

Parameters
  • mchr (str) – The chromosome name of the first segment

  • mstart (int) – The start coordinate of the deletion.

  • mend (int) – The end coordinate of the deletion.

  • genome (selene_utils2.MemmapGenome or selene_sdk.sequences.Genome) – The reference genome object to extract sequence from

  • custom_models (list(torch.nn.Module or str) or None, optional) – Models to use instead of the default H1-ESC and HFF Orca models. Default is None.

  • target (list(selene_utils2.Genomic2DFeatures or str) or bool, optional) – If specified as list, use this list of targets to retrieve experimental data (for plotting only). Default is True and will use micro-C data for H1-ESC and HFF cells (4DNFI9GMP2J8, 4DNFI643OYP9) that correspond to the default models.

  • file (str or None, optional) – Default is None. The output file prefix.

  • show_genes (bool, optional) – Default is True. If True, generate gene annotation visualization file in pdf format that matches the windows of multiscale predictions.

  • show_tracks (bool, optional) – Default is False. If True, generate chromatin tracks visualization file in pdf format that matches the windows of multiscale predictions.

  • window_radius (int, optional) – Default is 16000000. The acceptable values are 16000000 which selects the 1-32Mb models or 128000000 which selects the 32-256Mb models.

  • padding_chr (str, optional) – Default is “chr1”. If window_radius is 128000000, padding is generally needed to fill the sequence to 256Mb. The padding sequence will be extracted from the padding_chr.

  • use_cuda (bool, optional) – Default is True. Use CPU if False.

Returns

outputs_ref_l, outputs_ref_r, outputs_alt – Reference allele predictions zooming into the left boundary of the deletion, reference allele predictions zooming into the right boundary of the deletion, and alternative allele predictions zooming into the deletion breakpoint. The returned results are dictionaries containing the prediction outputs and other retrieved information. These dictionaries can be directly used as input to genomeplot or genomeplot_256Mb. See documentation of genomepredict or genomepredict_256Mb for details of the dictionary content.

Return type

dict, dict, dict
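The relationship between the deletion coordinates and the alternative allele can be illustrated on a toy string. This sketch assumes mstart and mend follow the 0-based, half-open convention stated at the top of this module; apply_deletion is a hypothetical helper and not how Orca builds its sequences.

```python
def apply_deletion(ref, mstart, mend):
    """Delete the half-open interval [mstart, mend) from a sequence string."""
    alt = ref[:mstart] + ref[mend:]
    junction = mstart  # offset in `alt` where the two flanks meet
    return alt, junction


alt, junction = apply_deletion("ACGTACGT", 2, 5)
print(alt, junction)
```

In the reference allele the left and right boundaries are two distinct positions (mstart and mend), while in the alternative allele they collapse into a single junction, which matches the three returned predictions.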

orca_predict.process_dup(mchr, mstart, mend, genome, file=None, custom_models=None, target=True, show_genes=True, show_tracks=False, window_radius=16000000, padding_chr='chr1', use_cuda=True)

Generate multiscale genome interaction predictions for a duplication variant.

Parameters
  • mchr (str) – The chromosome name of the first segment

  • mstart (int) – The start coordinate of the duplication.

  • mend (int) – The end coordinate of the duplication.

  • genome (selene_utils2.MemmapGenome or selene_sdk.sequences.Genome) – The reference genome object to extract sequence from

  • custom_models (list(torch.nn.Module or str) or None, optional) – Models to use instead of the default H1-ESC and HFF Orca models. Default is None.

  • target (list(selene_utils2.Genomic2DFeatures or str) or bool, optional) – If specified as list, use this list of targets to retrieve experimental data (for plotting only). Default is True and will use micro-C data for H1-ESC and HFF cells (4DNFI9GMP2J8, 4DNFI643OYP9) that correspond to the default models.

  • file (str or None, optional) – Default is None. The output file prefix.

  • show_genes (bool, optional) – Default is True. If True, generate gene annotation visualization file in pdf format that matches the windows of multiscale predictions.

  • show_tracks (bool, optional) – Default is False. If True, generate chromatin tracks visualization file in pdf format that matches the windows of multiscale predictions.

  • window_radius (int, optional) – Default is 16000000. The acceptable values are 16000000 which selects the 1-32Mb models or 128000000 which selects the 32-256Mb models.

  • padding_chr (str, optional) – Default is “chr1”. If window_radius is 128000000, padding is generally needed to fill the sequence to 256Mb. The padding sequence will be extracted from the padding_chr.

  • use_cuda (bool, optional) – Default is True. Use CPU if False.

Returns

outputs_ref_l, outputs_ref_r, outputs_alt – Reference allele predictions zooming into the left boundary of the duplication, reference allele predictions zooming into the right boundary of the duplication, and alternative allele predictions zooming into the duplication breakpoint. The returned results are dictionaries containing the prediction outputs and other retrieved information. These dictionaries can be directly used as input to genomeplot or genomeplot_256Mb. See documentation of genomepredict or genomepredict_256Mb for details of the dictionary content.

Return type

dict, dict, dict

orca_predict.process_ins(mchr, mpos, ins_seq, genome, strand='+', file=None, custom_models=None, target=True, show_genes=True, show_tracks=False, window_radius=16000000, padding_chr='chr1', use_cuda=True)

Generate multiscale genome interaction predictions for an insertion variant that inserts the specified sequence to the insertion site.

Parameters
  • mchr (str) – The chromosome name of the first segment

  • mpos (int) – The insertion site coordinate.

  • ins_seq (str) – The inserted sequence in string format.

  • genome (selene_utils2.MemmapGenome or selene_sdk.sequences.Genome) – The reference genome object to extract sequence from

  • custom_models (list(torch.nn.Module or str) or None, optional) – Models to use instead of the default H1-ESC and HFF Orca models. Default is None.

  • target (list(selene_utils2.Genomic2DFeatures or str) or bool, optional) – If specified as list, use this list of targets to retrieve experimental data (for plotting only). Default is True and will use micro-C data for H1-ESC and HFF cells (4DNFI9GMP2J8, 4DNFI643OYP9) that correspond to the default models.

  • file (str or None, optional) – Default is None. The output file prefix.

  • show_genes (bool, optional) – Default is True. If True, generate gene annotation visualization file in pdf format that matches the windows of multiscale predictions.

  • show_tracks (bool, optional) – Default is False. If True, generate chromatin tracks visualization file in pdf format that matches the windows of multiscale predictions.

  • window_radius (int, optional) – Default is 16000000. The acceptable values are 16000000 which selects the 1-32Mb models or 128000000 which selects the 32-256Mb models.

  • padding_chr (str, optional) – Default is “chr1”. If window_radius is 128000000, padding is generally needed to fill the sequence to 256Mb. The padding sequence will be extracted from the padding_chr.

  • use_cuda (bool, optional) – Default is True. Use CPU if False.

Returns

outputs_ref, outputs_alt_l, outputs_alt_r – Reference allele predictions zooming into the insertion site, alternative allele predictions zooming into the left boundary of the inserted sequence, and alternative allele predictions zooming into the right boundary of the inserted sequence. The returned results are dictionaries containing the prediction outputs and other retrieved information. These dictionaries can be directly used as input to genomeplot or genomeplot_256Mb. See documentation of genomepredict or genomepredict_256Mb for details of the dictionary content.

Return type

dict, dict, dict
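A toy illustration of why an insertion yields one reference view and two alternative views: the single insertion site in the reference becomes two boundaries in the alternative allele. The helper apply_insertion is hypothetical, and the choice to insert immediately before the base at mpos is an assumption of this sketch.

```python
def apply_insertion(ref, mpos, ins_seq):
    """Insert `ins_seq` before position `mpos` (0-based) of a sequence string."""
    alt = ref[:mpos] + ins_seq + ref[mpos:]
    left_boundary = mpos                  # first base of the inserted sequence
    right_boundary = mpos + len(ins_seq)  # first base after the insertion
    return alt, left_boundary, right_boundary


alt, left, right = apply_insertion("ACGT", 2, "TTT")
print(alt, left, right)
```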

orca_predict.process_inv(mchr, mstart, mend, genome, file=None, custom_models=None, target=True, show_genes=True, show_tracks=False, window_radius=16000000, padding_chr='chr1', use_cuda=True)

Generate multiscale genome interaction predictions for an inversion variant.

Parameters
  • mchr (str) – The chromosome name of the first segment

  • mstart (int) – The start coordinate of the inversion.

  • mend (int) – The end coordinate of the inversion.

  • genome (selene_utils2.MemmapGenome or selene_sdk.sequences.Genome) – The reference genome object to extract sequence from

  • custom_models (list(torch.nn.Module or str) or None, optional) – Models to use instead of the default H1-ESC and HFF Orca models. Default is None.

  • target (list(selene_utils2.Genomic2DFeatures or str) or bool, optional) – If specified as list, use this list of targets to retrieve experimental data (for plotting only). Default is True and will use micro-C data for H1-ESC and HFF cells (4DNFI9GMP2J8, 4DNFI643OYP9) that correspond to the default models.

  • file (str or None, optional) – Default is None. The output file prefix.

  • show_genes (bool, optional) – Default is True. If True, generate gene annotation visualization file in pdf format that matches the windows of multiscale predictions.

  • show_tracks (bool, optional) – Default is False. If True, generate chromatin tracks visualization file in pdf format that matches the windows of multiscale predictions.

  • window_radius (int, optional) – Default is 16000000. The acceptable values are 16000000 which selects the 1-32Mb models or 128000000 which selects the 32-256Mb models.

  • padding_chr (str, optional) – Default is “chr1”. If window_radius is 128000000, padding is generally needed to fill the sequence to 256Mb. The padding sequence will be extracted from the padding_chr.

  • use_cuda (bool, optional) – Default is True. Use CPU if False.

Returns

outputs_ref_l, outputs_ref_r, outputs_alt_l, outputs_alt_r – Reference allele predictions zooming into the left boundary of the inversion, reference allele predictions zooming into the right boundary of the inversion, alternative allele predictions zooming into the left boundary of the inversion, and alternative allele predictions zooming into the right boundary of the inversion. The returned results are dictionaries containing the prediction outputs and other retrieved information. These dictionaries can be directly used as input to genomeplot or genomeplot_256Mb. See documentation of genomepredict or genomepredict_256Mb for details of the dictionary content.

Return type

dict, dict, dict, dict
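An inversion replaces a segment with its reverse complement, so both boundaries stay at the same offsets in the reference and alternative alleles, which is consistent with the four returned predictions. The sketch below shows this on a toy string; apply_inversion is a hypothetical helper, and it assumes mstart and mend form a 0-based, half-open interval.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def apply_inversion(ref, mstart, mend):
    """Replace ref[mstart:mend] with its reverse complement."""
    inverted = ref[mstart:mend].translate(COMPLEMENT)[::-1]
    return ref[:mstart] + inverted + ref[mend:]


alt = apply_inversion("AAACGTTT", 2, 5)
print(alt)

# Inverting the same interval twice restores the reference.
restored = apply_inversion(alt, 2, 5)
```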

orca_predict.process_region(mchr, mstart, mend, genome, file=None, custom_models=None, target=True, show_genes=True, show_tracks=False, window_radius=16000000, padding_chr='chr1', use_cuda=True)

Generate multiscale genome interaction predictions for the specified region.

Parameters
  • mchr (str) – The chromosome name of the first segment

  • mstart (int) – The start coordinate of the region.

  • mend (int) – The end coordinate of the region.

  • genome (selene_utils2.MemmapGenome or selene_sdk.sequences.Genome) – The reference genome object to extract sequence from

  • custom_models (list(torch.nn.Module or str) or None, optional) – Models to use instead of the default H1-ESC and HFF Orca models. Default is None.

  • target (list(selene_utils2.Genomic2DFeatures or str) or bool, optional) – If specified as list, use this list of targets to retrieve experimental data (for plotting only). Default is True and will use micro-C data for H1-ESC and HFF cells (4DNFI9GMP2J8, 4DNFI643OYP9) that correspond to the default models.

  • file (str or None, optional) – Default is None. The output file prefix.

  • show_genes (bool, optional) – Default is True. If True, generate gene annotation visualization file in pdf format that matches the windows of multiscale predictions.

  • show_tracks (bool, optional) – Default is False. If True, generate chromatin tracks visualization file in pdf format that matches the windows of multiscale predictions.

  • window_radius (int, optional) – Default is 16000000. The acceptable values are 16000000 which selects the 1-32Mb models or 128000000 which selects the 32-256Mb models.

  • padding_chr (str, optional) – Default is “chr1”. If window_radius is 128000000, padding is generally needed to fill the sequence to 256Mb. The padding sequence will be extracted from the padding_chr.

  • use_cuda (bool, optional) – Default is True. Use CPU if False.

Returns

outputs_ref_l, outputs_ref_r, outputs_alt – Reference allele predictions zooming into the left boundary of the duplication, Reference allele predictions zooming into the right boundary of the duplication, Alternative allele predictions zooming into the duplication breakpoint. The returned results are in the format of dictionaries containing the prediction outputs and other retrieved information. These dictionaries can be directly used as input to genomeplot or genomeplot_256Mb. See documentation of genomepredict or genomepredict_256Mb for details of the dictionary content.

Return type

dict, dict, dict

orca_predict.process_single_breakpoint(chr1, pos1, chr2, pos2, orientation1, orientation2, genome, custom_models=None, target=True, file=None, show_genes=True, show_tracks=False, window_radius=16000000, padding_chr='chr1', use_cuda=True)

Generate multiscale genome interaction predictions for a simple translocation event that connects two chromosomal breakpoints. Specifically, two breakpoint positions and the corresponding two orientations are needed. The orientations decide how the breakpoints are connected: the ‘+’ or ‘-’ sign indicates whether the left or right side of the breakpoint is used. For example, for the input (‘chr1’, 85691449, ‘chr5’, 89533745, ‘+’, ‘+’), the two plus signs indicate connecting chr1:0-85691449 with chr5:0-89533745.
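One way to picture the orientation rule is to translate a breakpoint pair into two [chr, start, end, strand] segments in the style of the region_list argument of process_custom. The sketch below is an illustrative geometric construction, not a confirmed description of Orca's internals: junction_segments is a hypothetical helper, a ‘-’ strand is taken to mean that the segment is reverse complemented, and no clipping at chromosome ends is attempted.

```python
def junction_segments(chr1, pos1, chr2, pos2, orientation1, orientation2,
                      radius=16_000_000):
    """Return two [chr, start, end, strand] segments joined at the breakpoints."""
    for orientation in (orientation1, orientation2):
        if orientation not in ("+", "-"):
            raise ValueError(f"Invalid orientation {orientation!r}")
    # The first segment lies left of the junction, so it must end at pos1.
    if orientation1 == "+":   # keep the left side of the breakpoint
        first = [chr1, pos1 - radius, pos1, "+"]
    else:                     # keep the right side, flipped to end at pos1
        first = [chr1, pos1, pos1 + radius, "-"]
    # The second segment lies right of the junction, so it must begin at pos2.
    if orientation2 == "-":   # keep the right side of the breakpoint
        second = [chr2, pos2, pos2 + radius, "+"]
    else:                     # keep the left side, flipped to begin at pos2
        second = [chr2, pos2 - radius, pos2, "-"]
    return [first, second]


segments = junction_segments("chr1", 85691449, "chr5", 89533745, "+", "+")
print(segments)
```

For the ‘+’, ‘+’ example, both segments cover the left side of their breakpoint, and the second is flipped so that the two breakpoints meet at the 16Mb offset of the 32Mb sequence.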

Parameters
  • chr1 (str) – The chromosome name of the first segment

  • pos1 (int) – The coordinate of the breakpoint on the first segment.

  • chr2 (str) – The chromosome name of the second segment.

  • pos2 (int) – The coordinate of the breakpoint on the second segment.

  • orientation1 (str) – Indicates which side of the breakpoint should be used for the first segment: ‘+’ indicates the left side and ‘-’ indicates the right side.

  • orientation2 (str) – Indicates which side of the breakpoint should be used for the second segment: ‘+’ indicates the left side and ‘-’ indicates the right side.

  • genome (selene_utils2.MemmapGenome or selene_sdk.sequences.Genome) – The reference genome object to extract sequence from

  • custom_models (list(torch.nn.Module or str) or None, optional) – Models to use instead of the default H1-ESC and HFF Orca models. Default is None.

  • target (list(selene_utils2.Genomic2DFeatures or str) or bool, optional) – If specified as list, use this list of targets to retrieve experimental data (for plotting only). Default is True and will use micro-C data for H1-ESC and HFF cells (4DNFI9GMP2J8, 4DNFI643OYP9) that correspond to the default models.

  • file (str or None, optional) – Default is None. The output file prefix.

  • show_genes (bool, optional) – Default is True. If True, generate gene annotation visualization file in pdf format that matches the windows of multiscale predictions.

  • show_tracks (bool, optional) – Default is False. If True, generate chromatin tracks visualization file in pdf format that matches the windows of multiscale predictions.

  • window_radius (int, optional) – Default is 16000000. The acceptable values are 16000000 which selects the 1-32Mb models or 128000000 which selects the 32-256Mb models.

  • padding_chr (str, optional) – Default is “chr1”. If window_radius is 128000000, padding is generally needed to fill the sequence to 256Mb. The padding sequence will be extracted from the padding_chr.

  • use_cuda (bool, optional) – Default is True. Use CPU if False.

Returns

outputs_ref_1, outputs_ref_2, outputs_alt – Reference allele predictions zooming into the chr1 breakpoint, reference allele predictions zooming into the chr2 breakpoint, and alternative allele predictions zooming into the junction. The returned results are dictionaries containing the prediction outputs and other retrieved information. These dictionaries can be directly used as input to genomeplot or genomeplot_256Mb. See documentation of genomepredict or genomepredict_256Mb for details of the dictionary content.

Return type

dict, dict, dict