Skip to content

OrthoSeq Generator API Documentation

The below contains documentation for all relevant functions available for import within the orthoseq_generator package.

Sequence Computations

orthoseq_generator.sequence_computations

SequencePairRegistry

SequencePairRegistry(
    length=7,
    fivep_ext="",
    threep_ext="",
    unwanted_substrings=None,
    apply_unwanted_to="core",
    seed=None,
    preselected_cores=None,
)

Stateful generator/registry for DNA sequence pairs.

It generates random core sequences of fixed length, forms the pair (seq, revcom(seq)), applies constraints, and assigns stable integer IDs.

If a generated pair has been seen before, it returns the previously assigned ID instead of creating a new one.

PARAMETER DESCRIPTION
length

Length of the core DNA sequence (without flanks).

TYPE: int DEFAULT: 7

fivep_ext

Optional 5′ flanking sequence prepended to each strand.

TYPE: str DEFAULT: ''

threep_ext

Optional 3′ flanking sequence appended to each strand.

TYPE: str DEFAULT: ''

unwanted_substrings

List of substrings that disqualify a sequence. Example: ["AAAA", "CCCC", "GGGG", "TTTT"].

TYPE: list[str] | None DEFAULT: None

apply_unwanted_to

Where to apply unwanted_substrings checks. - "core": apply only to the random core sequences - "full": apply to the full flanked sequences

TYPE: str DEFAULT: 'core'

seed

Optional RNG seed for reproducibility.

TYPE: int | None DEFAULT: None

preselected_cores

Optional iterable of core sequences to draw from instead of random generation. Sampling is without replacement in random order.

TYPE: iterable[str] | None DEFAULT: None

sample_pair

sample_pair(max_tries=10000)

Generates (or reuses) a random sequence pair and returns (pair_id, pair).

Behavior
  • If preselected_cores were provided, draws from that list (random, with replacement).
  • Draw random core sequences until constraints pass.
  • Convert to canonical (sorted) flanked pair.
  • If pair was seen: return existing ID.
  • Else: assign new ID, store, return it.
PARAMETER DESCRIPTION
max_tries

Maximum attempts before raising an error (prevents infinite loops).

TYPE: int DEFAULT: 10000

RETURNS DESCRIPTION
tuple[int, tuple[str, str]]

(pair_id, (seq, rc_seq)) where seq/rc_seq are flanked and sorted.

get_pair_by_id

get_pair_by_id(pair_id)

Returns the stored pair for a given ID.

PARAMETER DESCRIPTION
pair_id

Integer ID returned by sample_pair.

TYPE: int

RETURNS DESCRIPTION
tuple[str, str]

(seq, rc_seq) canonical sorted pair.

revcom

revcom(sequence)

Computes the reverse complement of a DNA sequence.

PARAMETER DESCRIPTION
sequence

Single DNA sequence as a string.

TYPE: str

RETURNS DESCRIPTION
str

Reverse complement of the input sequence as a string.

has_four_consecutive_bases

has_four_consecutive_bases(seq)

Returns True if the sequence contains four identical consecutive bases (e.g., "GGGG", "CCCC", "AAAA", "TTTT").

Notes

Additional sequence constraints (e.g., homopolymer runs of other lengths) can be added here as needed.

PARAMETER DESCRIPTION
seq

DNA sequence as a string.

TYPE: str

RETURNS DESCRIPTION
bool

True if any base appears four times in a row, False otherwise.

sorted_key

sorted_key(seq1, seq2)

Returns a tuple with the two input sequences sorted alphabetically.

Description

Ensures that (seq1, seq2) and (seq2, seq1) map to the same dictionary key.

PARAMETER DESCRIPTION
seq1

First DNA sequence.

TYPE: str

seq2

Second DNA sequence.

TYPE: str

RETURNS DESCRIPTION
tuple

Tuple of the two sequences in alphabetical order.

create_sequence_pairs_pool

create_sequence_pairs_pool(length=7, fivep_ext='', threep_ext='', avoid_gggg=True)

Generates a list of unique DNA sequence pairs (and their reverse complements) with optional flanking sequences.

Procedure
  1. Generate all possible core sequences of specified length.
  2. Compute each sequence's reverse complement and alphabetically sort the pair.
  3. If avoid_gggg is True, filter out any pair where either sequence contains four identical bases in a row.
  4. Prepend fivep_ext and append threep_ext to both members of each pair.
  5. Enumerate the resulting list, assigning a unique integer ID to each pair.
PARAMETER DESCRIPTION
length

Length of the core DNA sequences (without flanks).

TYPE: int DEFAULT: 7

fivep_ext

Optional 5′ flanking sequence prepended to each strand.

TYPE: str DEFAULT: ''

threep_ext

Optional 3′ flanking sequence appended to each strand.

TYPE: str DEFAULT: ''

avoid_gggg

If True, filters out pairs containing four identical consecutive bases.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
list of tuple

List of tuples [(index, (sequence, reverse_complement)), ...], where index is a unique ID and each tuple contains the complementary pair.

create_seqwalk_sequence_pairs_pool

create_seqwalk_sequence_pairs_pool(
    length=7,
    k=3,
    seed=None,
    fivep_ext="",
    threep_ext="",
    alphabet="ACGT",
    avoid_reverse_complements=True,
    gc_lims=None,
    prevented_patterns=None,
    verbose=True,
)

Generates sequence pairs from SeqWalk and converts them into this module's pair format.

This is a thin integration layer around seqwalk.design.max_size. SeqWalk designs a maximal library of core sequences for a chosen sequence-symmetry minimization (SSM) k value, optionally excluding reverse complements. The resulting core sequences are then converted into canonical (seq, revcom(seq)) pairs with optional flanks.

PARAMETER DESCRIPTION
length

Length of the core DNA sequences produced by SeqWalk.

TYPE: int DEFAULT: 7

k

Sequence symmetry minimization (SSM) k value passed to SeqWalk.

TYPE: int DEFAULT: 3

seed

Optional Python random seed for deterministic SeqWalk output.

TYPE: int | None DEFAULT: None

fivep_ext

Optional 5′ flank prepended to both strands.

TYPE: str DEFAULT: ''

threep_ext

Optional 3′ flank appended to both strands.

TYPE: str DEFAULT: ''

alphabet

Allowed DNA alphabet passed to SeqWalk.

TYPE: str DEFAULT: 'ACGT'

avoid_reverse_complements

If True, request an RC-free SeqWalk library.

TYPE: bool DEFAULT: True

gc_lims

Optional (min_gc, max_gc) tuple passed to SeqWalk.

TYPE: tuple[int, int] | None DEFAULT: None

prevented_patterns

Optional list of forbidden patterns passed to SeqWalk.

TYPE: list[str] | None DEFAULT: None

verbose

If True, allow SeqWalk to print progress information.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
list[tuple[int, tuple[str, str]]]

List of (index, (sequence, reverse_complement)) tuples.

nupack_compute_energy_precompute_library_fast

nupack_compute_energy_precompute_library_fast(
    seq1, seq2, type="total", Use_Library=None
)

Computes the Gibbs free energy of hybridization between two DNA sequences using NUPACK, with optional caching via a precompute library.

Notes
  • Uses a local cache to avoid redundant NUPACK calls when Use_Library=True. If the argument is None, the global setting in hf.USE_LIBRARY is used.
  • Energies are stored under a sorted key so (seq1, seq2) and (seq2, seq1) map identically. This function does not write back to disk; cache updates are handled by callers.
  • Called by multiprocessing; each worker loads its own cache copy once from file.
  • Does not write to the cache during multiprocessing to prevent conflicts.
  • All energies larger than -1 kcal/mol are mapped to -1 kcal/mol. 0 is used in other routines as an indicator that the energy has not been computed. -1 kcal/mol is already extremely weak (virtually no interaction).
  • Model parameters are fixed at 37°C, sodium=0.05 M, magnesium=0.025 M; change with a fresh cache.
PARAMETER DESCRIPTION
seq1

First DNA sequence.

TYPE: str

seq2

Second DNA sequence.

TYPE: str

type

Either 'total' (partition sum) or 'minimum' (MFE) calculation. The result of 'total' is what you would use to compute a binding constant.

TYPE: str DEFAULT: 'total'

Use_Library

If True, use and load the precompute cache; defaults to global setting.

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
tuple[float, float, float] | float

Tuple (energy, G_A, G_B) where energy is the association free energy (kcal/mol). For homodimers, G_B == G_A. If NUPACK returns no MFE or an exception occurs, the function returns -1.0 (scalar) instead.

compute_pair_energy_on

compute_pair_energy_on(i, seq, rc_seq)

Helper function for parallel computing of on-target energies.

PARAMETER DESCRIPTION
i

Sequence index.

TYPE: int

seq

DNA sequence.

TYPE: str

rc_seq

Reverse complement sequence.

TYPE: str

RETURNS DESCRIPTION
tuple[int, float, float, float]

Tuple (i, pair_energy, self_energy_seq, self_energy_rc_seq).

compute_ontarget_energies

compute_ontarget_energies(sequence_list)

Computes on-target Gibbs free energies for a list of sequence pairs using multiprocessing.

Notes
  • Uses ProcessPoolExecutor (with initializer=_init_worker) to parallelize calls to NUPACK via nupack_compute_energy_precompute_library_fast.
  • If hf.USE_LIBRARY is True, the initializer function (_init_worker) passes the library filename and flag to each worker so that nupack_compute_energy_precompute_library_fast can load its cache. After all parallel computations finish, this function saves the cache with the new energies.
  • Saves the updated cache atomically using DelayedKeyboardInterrupt to prevent corruption.
  • Prints progress and CPU core usage to the console.
PARAMETER DESCRIPTION
sequence_list

List of tuples, each containing a sequence and its reverse complement.

TYPE: list of tuple

RETURNS DESCRIPTION
tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Tuple of NumPy arrays (pair_energies, self_energies_seq, self_energies_rc_seq).

compute_pair_energy_off

compute_pair_energy_off(i, j, seq1, seq2)

Helper function for parallel computing of off-target energies.

PARAMETER DESCRIPTION
i

Index of the first sequence.

TYPE: int

j

Index of the second sequence.

TYPE: int

seq1

First DNA sequence.

TYPE: str

seq2

Second DNA sequence.

TYPE: str

RETURNS DESCRIPTION
tuple (int, int, float)

Tuple (i, j, energy) where energy is the computed Gibbs free energy.

compute_offtarget_energies

compute_offtarget_energies(sequence_pairs)

Computes off-target hybridization energies for all pairwise combinations of a given list of sequence pairs.

Procedure
  1. Extract handles and antihandles from sequence_pairs.
  2. Initialize three N×N energy matrices for:
  3. handle-handle interactions
  4. antihandle-antihandle interactions
  5. handle-antihandle interactions
  6. For each matrix, use ProcessPoolExecutor (via compute_pair_energy_off) to fill only the required entries:
  7. i ≥ j for the two symmetric matrices
  8. i ≠ j for the mixed handle-antihandle matrix
  9. If hf.USE_LIBRARY is True, the initializer function (_init_worker) passes the library filename and flag to each worker so that nupack_compute_energy_precompute_library_fast can load its cache. After all parallel computations finish, this function saves the cache with the new energies.
Notes
  • Off-target interactions are computed for:
    1) handle with handle
    2) antihandle with antihandle
    3) handle with antihandle
  • Symmetric matrices only compute the lower triangle (i ≥ j) to avoid redundancy.
  • Entries with no interaction or computation errors return -1.0 (mapped for any energy > -1.0).
    A value of 0 indicates the energy was skipped due to redundancy.
  • Uses DelayedKeyboardInterrupt to ensure atomic writes when saving the updated cache.
PARAMETER DESCRIPTION
sequence_pairs

List of (sequence, reverse_complement) tuples.

TYPE: list of tuple

RETURNS DESCRIPTION
dict

Dictionary containing three N×N numpy arrays with keys:
- 'handle_handle_energies'
- 'antihandle_handle_energies'
- 'antihandle_antihandle_energies'

select_subset

select_subset(sequence_pairs, max_size=200, timeout_s=20)

Selects a random subset of sequence pairs up to a specified maximum size.

This function supports two input types: 1) A precomputed pool: list of (index, (seq, rc_seq)) tuples. - If pool size > max_size: uses random.sample for efficiency. - Else: returns all pairs. 2) A generator/registry object that provides sample_pair(). - Repeatedly calls sample_pair() until max_size unique pairs are collected, or timeout_s is reached.

Notes
  • For list input: uses sampling rather than shuffling for performance.
  • For registry input: guarantees uniqueness by ID (not by sequence string), so repeated samples do not inflate the subset.
Timeout behavior

If timeout_s is reached while using a registry, the function returns the pairs found so far and prints: "Only X of requested Y found (timeout)."

PARAMETER DESCRIPTION
sequence_pairs

Either - list of (index, (seq, rc_seq)) tuples, or - an object with method sample_pair() -> (pair_id, (seq, rc_seq)).

TYPE: list | object

max_size

Maximum number of pairs to select.

TYPE: int DEFAULT: 200

timeout_s

Optional timeout in seconds (only used for registry input).

TYPE: float | None DEFAULT: 20

RETURNS DESCRIPTION
list of tuple

List of (seq, rc_seq) pairs selected.

crossreference_sequences

crossreference_sequences(
    new_pair, pool, offtarget_limit, max_pair_violations=0, Use_Library=None
)

Checks off-target interactions between a candidate sequence pair and a history pool.

Counts violations per pool pair, not per individual strand-strand interaction. A pool pair is counted as violating if any of the four pairwise comparisons between (seq, rc_seq) and (pool_seq, pool_rc) falls below offtarget_limit.

PARAMETER DESCRIPTION
new_pair

Candidate (seq, rc_seq) pair to test.

TYPE: tuple[str, str]

pool

Existing (seq, rc_seq) pairs to cross-reference against.

TYPE: list[tuple[str, str]]

offtarget_limit

Energy cutoff below which an off-target interaction is considered a violation.

TYPE: float

max_pair_violations

Maximum number of violating pool pairs allowed before the candidate is rejected.

TYPE: int DEFAULT: 0

Use_Library

Whether to use the precomputed energy library (overrides the global setting if not None).

TYPE: bool | None DEFAULT: None

RETURNS DESCRIPTION
tuple[bool, int]

Tuple (passed, nupack_calls) where passed is False if the number of violating pool pairs exceeds max_pair_violations, and nupack_calls is the number of direct energy computations performed during this cross-reference check.

select_subset_in_energy_range

select_subset_in_energy_range(
    sequence_pairs,
    energy_min=-inf,
    energy_max=inf,
    self_energy_min=-inf,
    max_size=inf,
    Use_Library=None,
    avoid_indices=None,
    timeout_s=None,
    history_pool=None,
    allowed_violations=0,
    offtarget_limit=None,
    max_nupack_calls=None,
    progress_every=None,
)

Selects a random subset of sequence pairs that pass on-target energy, self-energy, and optional cross-reference filters.

Supports two input types: 1) Precomputed list of (index, (seq, rc_seq)) tuples. 2) SequencePairRegistry-like object with sample_pair() method.

Notes
  • Uses random sampling without full shuffling.
  • Keeps returned sequence order aligned with returned indices list.
  • Can stop early due to timeout_s, max_nupack_calls, or candidate exhaustion.
  • If offtarget_limit is None, cross-reference filtering is skipped.
PARAMETER DESCRIPTION
sequence_pairs

List of (index, (seq, rc_seq)) tuples or registry with sample_pair().

TYPE: list | object

energy_min

Minimum acceptable on-target (association) energy.

TYPE: float DEFAULT: -inf

energy_max

Maximum acceptable on-target (association) energy.

TYPE: float DEFAULT: inf

self_energy_min

Minimum acceptable self-energy for each strand.

TYPE: float DEFAULT: -inf

max_size

Maximum number of pairs to return.

TYPE: int DEFAULT: inf

Use_Library

Whether to use the precomputed energy library (overrides global if not None).

TYPE: bool | None DEFAULT: None

avoid_indices

Indices to avoid when sampling.

TYPE: set | None DEFAULT: None

timeout_s

Optional wall-clock timeout in seconds; returns early if exceeded.

TYPE: float | None DEFAULT: None

history_pool

Optional list of accepted (seq, rc_seq) pairs to cross-reference against.

TYPE: list[tuple[str, str]] | None DEFAULT: None

allowed_violations

Maximum number of pool pairs allowed to violate offtarget_limit.

TYPE: int DEFAULT: 0

offtarget_limit

Optional off-target energy cutoff for cross-reference filtering.

TYPE: float | None DEFAULT: None

max_nupack_calls

Optional limit on direct NUPACK energy computations made inside this function.

TYPE: int | None DEFAULT: None

progress_every

Optional attempt interval for progress prints.

TYPE: int | None DEFAULT: None

RETURNS DESCRIPTION
tuple[list[tuple[str, str]], list[int], bool, int]

Tuple (subset, indices, stopped_early, nupack_calls) where subset is a list of (seq, rc_seq) pairs, indices are their corresponding global IDs, stopped_early indicates timeout or NUPACK-budget exit, and nupack_calls is the number of direct NUPACK computations made inside this function.

select_all_in_energy_range

select_all_in_energy_range(
    sequence_pairs, energy_min=-inf, energy_max=inf, Use_Library=None, avoid_ids=None
)

Selects all sequence pairs whose on-target energies fall within a given energy range.

Description

Iterates through every (global_index, (seq, rc_seq)) tuple, computes the on-target energy using nupack_compute_energy_precompute_library_fast, and collects those where energy_min <= energy <= energy_max, skipping any global_index values in avoid_ids. Note that the ID here refers to the global index in the original sequence-pair list.

Notes
  • If Use_Library is True, energies are fetched from or stored in the precompute cache.
  • Prints progress messages to the console.
PARAMETER DESCRIPTION
sequence_pairs

List of (global_index, (seq, rc_seq)) tuples.

TYPE: list of tuple

energy_min

Minimum allowed Gibbs free energy (inclusive).

TYPE: float DEFAULT: -inf

energy_max

Maximum allowed Gibbs free energy (inclusive).

TYPE: float DEFAULT: inf

Use_Library

Whether to use a precomputed energy library (overrides global if not None).

TYPE: bool | None DEFAULT: None

avoid_ids

Set of global indices to skip during selection.

TYPE: set | None DEFAULT: None

RETURNS DESCRIPTION
tuple (list of tuple, list of int)

Tuple (subset, selected_ids) where: - subset is a list of (seq, rc_seq) pairs within the energy range. - selected_ids is a list of their corresponding global indices.

compute_offtarget_fraction_below_limit

compute_offtarget_fraction_below_limit(off_energies, off_limit)

Computes the fraction of off-target energies that are below off_limit.

Notes
  • If off_energies is a dict of matrices, values are flattened and concatenated.
  • For dict input, zeros are excluded because they represent uncomputed entries.
PARAMETER DESCRIPTION
off_energies

Off-target energies as an array-like or dict of energy matrices.

TYPE: array - like | dict

off_limit

Threshold energy (kcal/mol).

TYPE: float

RETURNS DESCRIPTION
float

Fraction of values < off_limit in [0, 1]. Returns 0.0 if no values are available.

plot_on_off_target_histograms

plot_on_off_target_histograms(
    on_energies,
    off_energies,
    bins=80,
    output_path=None,
    show_plot=True,
    vlines=None,
    title=None,
    xlim=None,
)

Plots histograms comparing on-target and off-target Gibbs free energy distributions.

Notes
  • If off_energies is a dict, combines:
    • 'handle_handle_energies'
    • 'antihandle_handle_energies'
    • 'antihandle_antihandle_energies' into a single array, excluding zeros (uncomputed values).
  • Normalizes frequencies so that area under each histogram sums to 1.
  • Uses consistent bin edges across both distributions for direct comparison.
  • Saves the figure to output_path if provided, otherwise only displays it.
  • Prints summary statistics after plotting.
PARAMETER DESCRIPTION
on_energies

On-target energy values.

TYPE: array - like

off_energies

Off-target energies as an array-like or dict of energy matrices.

TYPE: array - like | dict

bins

Number of bins for histograms.

TYPE: int DEFAULT: 80

output_path

File path to save the plot; if None, the plot is only displayed.

TYPE: str | None DEFAULT: None

show_plot

Whether to call plt.show() to display the plot.

TYPE: bool DEFAULT: True

vlines

Optional dictionary of additional vertical lines to draw. Special keys: 'min_ontarget'.

TYPE: dict | None DEFAULT: None

title

Optional custom plot title. If None, a default title is used.

TYPE: str | None DEFAULT: None

xlim

Optional x-axis limits as (xmin, xmax). If None, limits are inferred from data.

TYPE: tuple[float, float] | None DEFAULT: None

RETURNS DESCRIPTION
dict

Dictionary of summary statistics: - 'min_on' : Minimum on-target energy - 'mean_on' : Mean of on-target energies
- 'std_on' : Standard deviation of on-target energies
- 'max_on' : Maximum on-target energy
- 'mean_off' : Mean of off-target energies
- 'std_off' : Standard deviation of off-target energies
- 'min_off' : Minimum off-target energy

plot_self_energy_histogram

plot_self_energy_histogram(self_energies, bins=30, output_path=None, show_plot=True)

Plots a histogram of self-energies (e.g., G_A and G_B combined).

Notes
  • Accepts a single array-like, a tuple/list of arrays (e.g., (G_A, G_B)), or a dict of arrays; all values are flattened and concatenated.
  • Uses the same visual style as plot_on_off_target_histograms.
  • Prints summary statistics after plotting.
PARAMETER DESCRIPTION
self_energies

Array-like, tuple/list of arrays, or dict of arrays.

TYPE: array - like | tuple / list | dict

bins

Number of bins for histogram.

TYPE: int DEFAULT: 30

output_path

File path to save the plot; if None, the plot is only displayed.

TYPE: str | None DEFAULT: None

show_plot

Whether to call plt.show() to display the plot.

TYPE: bool DEFAULT: True

Vertex Cover Algorithms

orthoseq_generator.vertex_cover_algorithms

min_ontarget module-attribute

min_ontarget = -10.4

Select sequences with on-target energy in desired range

subset, indices, _, _ = select_subset_in_energy_range( ontarget7mer, energy_min=min_ontarget, energy_max=max_ontarget, max_size=30, Use_Library=True, avoid_indices=set() )

Compute off-target energies for the subset

off_e_subset = compute_offtarget_energies(subset, Use_Library=False)

Build the off-target interaction graph

Edges = build_edges(off_e_subset, indices, offtarget_limit)

heuristic_vertex_cover_optimized2

heuristic_vertex_cover_optimized2(E, avoid_V=None, cleanup=True)

This function is the core of the sequence search algorithm. It’s a heuristic approach to solve the NP-hard minimum vertex cover problem.

Inspired by: - Joshi (2020), "Neighbourhood Evaluation Criteria for Vertex Cover Problem" - StackExchange discussion: https://cs.stackexchange.com/q/74546

Algorithm Outline
  1. Immediately add any self-edge vertices (u == v) to the cover.
  2. Build an adjacency list for all non-self edges.
  3. Track the degree (number of neighbors) for each vertex.
  4. While edges remain: a. Identify the vertex/vertices with maximum degree. b. Among those, select the vertex with the fewest neighbors that also share that max degree. c. Break ties randomly, preferring vertices in avoid_V. d. Add the selected vertex to the cover, remove it and its incident edges, and update degrees.
Notes
  • avoid_V contains vertices that should be removed when possible, but they can still be kept.
  • Self-edges are covered immediately.
  • Orphan vertices (degree zero) are naturally independent and never need removal.
PARAMETER DESCRIPTION
E

Set of edges (u, v). Vertices can be any hashable.

TYPE: iterable of tuple

avoid_V

Vertices you’d like to preferentially remove into the cover. They can still be kept, just less likely.

TYPE: (set, optional) DEFAULT: None

cleanup

If True, remove any redundant vertices from the final cover without uncovering any edges.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
set

A vertex cover (set of vertices touching every edge in E).

find_uncovered_edges

find_uncovered_edges(E, vertex_cover)

Finds edges that are not covered by the current vertex cover.

Description

Given a collection of edges E and a set vertex_cover of vertices, this function returns all edges which are not in the set. Technically, vertex_cover is not a full vertex cover of the graph but only a partial vertex cover.

PARAMETER DESCRIPTION
E

Collection of edges (u, v).

TYPE: iterable of tuple

vertex_cover

Set of vertices currently in the cover.

TYPE: set

RETURNS DESCRIPTION
set

Edges (u, v) from E for which neither u nor v is in vertex_cover. Self-edges (u == v) are included if u is not in the cover.

build_edges

build_edges(offtarget_dict, indices, energy_cutoff)

Builds a list of global index‐pair edges from off‐target energy matrices. (Global indices refer to the positions in the originally created sequence-pair list.)

Procedure
  1. Extract all (i, j) positions from each matrix where energy < energy_cutoff.
  2. Stack these positions together and sort each pair so (i, j) and (j, i) collapse to one.
  3. Remove duplicate pairs.
  4. Map local indices back to global sequence indices via the indices list.
PARAMETER DESCRIPTION
offtarget_dict

Dictionary containing three N×N numpy arrays under keys: - 'handle_handle_energies' - 'antihandle_handle_energies' - 'antihandle_antihandle_energies'

TYPE: dict

indices

List of global sequence indices corresponding to matrix rows/columns.

TYPE: list of int

energy_cutoff

Threshold below which an energy defines an edge.

TYPE: float

RETURNS DESCRIPTION
list of tuple

List of (i, j) tuples where each is a global‐index edge with off‐target energy < cutoff.

compute_pair_conflict_probability

compute_pair_conflict_probability(offtarget_dict, energy_cutoff)

Computes pair-level conflict probability using the same conflict rule as build_edges.

A pair (i, j) with i != j is counted as conflicting if at least one of the three off-target interaction matrices violates energy_cutoff, exactly as in build_edges.

PARAMETER DESCRIPTION
offtarget_dict

Dictionary containing the three off-target energy matrices.

TYPE: dict

energy_cutoff

Threshold below which an interaction defines a conflict.

TYPE: float

RETURNS DESCRIPTION
float

Fraction of conflicting unordered sequence-pair pairs in [0, 1]. Returns 0.0 if fewer than 2 sequence pairs are present.

select_vertices_to_remove

select_vertices_to_remove(vertex_cover, num_vertices_to_remove)

Selects a subset of vertices to remove from an existing vertex cover.

PARAMETER DESCRIPTION
vertex_cover

Current set of cover vertices.

TYPE: set

num_vertices_to_remove

Desired number of vertices to remove.

TYPE: int

RETURNS DESCRIPTION
set

Randomly chosen vertices to remove (size ≤ num_vertices_to_remove).

iterative_vertex_cover_multi

iterative_vertex_cover_multi(
    V,
    E,
    avoid_V=None,
    num_vertices_to_remove=150,
    max_iterations=200,
    limit=+inf,
    multistart=30,
    population_size=5,
    show_progress=False,
)

Attempts to find a small vertex cover via multiple randomized restarts and iterative refinement. Strategically calls heuristic_vertex_cover_optimized2

Algorithm Outline
  1. For each of multistart attempts: a. Compute an initial cover via the greedy heuristic. b. Initialize a population containing that cover. c. Repeat up to max_iterations:
    • For each cover in the population:
      • Remove num_vertices_to_remove random vertices (respecting avoid_V).
      • Find uncovered edges and re-cover via the heuristic.
      • If the new cover is smaller, reset the population to this cover.
      • If it’s the same size but unique, add it to the population.
    • Trim the population to population_size by random sampling.
    • Optionally print progress. d. If this attempt’s best cover is smaller than the global best, update it.
Notes

Because minimum vertex cover is NP-hard, this is a heuristic: it runs quickly but does not guarantee an optimal solution.

PARAMETER DESCRIPTION
V

All vertices in the graph (e.g., list or set of IDs). Note: V is only used for printing/monitoring; the graph is fully encoded by E.

TYPE: iterable

E

All edges (u, v) in global index space.

TYPE: iterable of tuple

avoid_V

Vertices to preferentially remove into the cover.

TYPE: (set, optional) DEFAULT: None

num_vertices_to_remove

Number of vertices to drop each iteration.

TYPE: int DEFAULT: 150

max_iterations

Max refine steps per restart.

TYPE: int DEFAULT: 200

limit

Target threshold for |V| - |cover|; stops early if reached.

TYPE: float DEFAULT: +inf

multistart

Number of independent greedy restarts.

TYPE: int DEFAULT: 30

population_size

Max number of equal-sized covers to retain each iteration.

TYPE: int DEFAULT: 5

show_progress

If True, prints status each iteration.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
tuple[set, list[list[int]]]

Tuple of (best_vertex_cover, trajectories), where trajectories is a list of per-multistart lists of independent set sizes over iterations.

evolutionary_vertex_cover

evolutionary_vertex_cover(
    sequence_pairs,
    offtarget_limit,
    max_ontarget,
    min_ontarget,
    self_energy_limit,
    subsetsize=200,
    generations=100,
    stop_event=None,
)

Dont use. It is working worse than Evolves an independent set of sequences from a set of candidate sequence pairs by iteratively removing high-energy (off-target) interactions via vertex-cover heuristics.
Implements a form of genetic “survivor selection” via repeated vertex-cover: new sequences are sampled each generation and those with strong off-target interactions are “removed” again. The history variable ensures previously promising sequences re-enter the sampling pool.

Procedure
  1. Initialize:
  2. non_cover_vertices: best independent set so far (sequences not in the cover).
  3. history: indices to avoid reselection, preserving diversity.
  4. For each of generations iterations: a. Check if stop_event is set. If so, break. b. Select a random subset of sequences whose on-target energies lie within [min_ontarget, max_ontarget], excluding those in history.
    c. Re-add any sequences from history to ensure good candidates are retained.
    d. Assert that there are no duplicate indices.
    e. Compute off-target energies for the subset.
    f. Build the off-target interaction graph (edges where energy < offtarget_limit).
    g. Apply the multi-start, iterative vertex-cover heuristic to find removed_vertices.
    h. Derive the new independent set: all selected indices minus removed_vertices.
    i. If this independent set is at least as large as the previous best:
    • Update non_cover_vertices.
    • Clear history if strictly larger.
      j. If its size ≥ 95% of the best, add its indices (deduplicated) to history.
      k. Print generation summary statistics.
  5. On user interrupt (Ctrl+C) or stop_event, exit gracefully and proceed to save the current best.
  6. After all generations or interruption, save the final independent set to a text file.
Notes
  • Catches KeyboardInterrupt to allow early exit: the best result so far is saved and plotted.
PARAMETER DESCRIPTION
sequence_pairs

List of (index, (seq, rc_seq)) tuples for candidate sequences.

TYPE: list of tuple

offtarget_limit

Energy threshold below which an off-target interaction defines an edge.

TYPE: float

max_ontarget

Upper bound for acceptable on-target energy.

TYPE: float

min_ontarget

Lower bound for acceptable on-target energy.

TYPE: float

self_energy_limit

Minimum acceptable self-energy for each strand.

TYPE: float

subsetsize

Number of sequences to sample per generation.

TYPE: int DEFAULT: 200

generations

Number of evolutionary iterations to perform.

TYPE: int DEFAULT: 100

stop_event

Optional threading.Event to stop the search.

TYPE: Event DEFAULT: None

RETURNS DESCRIPTION
list of tuple

Final list of (seq, rc_seq) pairs forming the best independent set.

Helper Functions

orthoseq_generator.helper_functions

DelayedKeyboardInterrupt

Context manager that delays KeyboardInterrupt (Ctrl+C) during critical operations.

This prevents corruption of the precomputed energy library by deferring interrupt handling until the protected block (e.g., file writes) completes.

Usage

with DelayedKeyboardInterrupt(): # perform critical operation, like saving files save_pickle_atomic(...)

Notes
  • On entering, replaces the SIGINT handler to queue the signal.
  • On exit, restores the original handler and re-raises if an interrupt was received.

set_nupack_params

set_nupack_params(material='dna', celsius=37, sodium=0.05, magnesium=0.025)

Updates global NUPACK parameters used for all energy computations.

Notes

These values are read by functions in sequence_computations when building a NUPACK Model. If you change parameters, you should also choose a new precompute library filename to avoid mixing incompatible energies.

PARAMETER DESCRIPTION
material

NUPACK material type (e.g., "dna").

TYPE: str DEFAULT: 'dna'

celsius

Temperature in Celsius.

TYPE: float DEFAULT: 37

sodium

Sodium concentration in M.

TYPE: float DEFAULT: 0.05

magnesium

Magnesium concentration in M.

TYPE: float DEFAULT: 0.025

RETURNS DESCRIPTION
None

None

choose_precompute_library

choose_precompute_library(filename)

Sets the name of the precomputed energy library file.

Notes

Updates the global variable used by other functions to locate the correct library.

PARAMETER DESCRIPTION
filename

Name of the pickle file where precomputed energies are or will be stored.

TYPE: str

RETURNS DESCRIPTION
None

None

save_pickle_atomic

save_pickle_atomic(data, filepath)

Saves a Python object to disk as a pickle file in a safe and atomic way.

Notes
  • Writes data to a temporary file (<filepath>.tmp) first, then atomically replaces the original file to avoid corruption if a crash occurs during writing.
  • Creates the target directory if it does not exist.
PARAMETER DESCRIPTION
data

Python object to save (typically a dictionary).

TYPE: any

filepath

Full path to the target pickle file.

TYPE: str

RETURNS DESCRIPTION
None

None

get_library_path

get_library_path()

Returns the full file path to the currently selected precomputed energy library.

Description

Constructs a path by combining the 'pre_computed_energies' folder with the filename set via choose_precompute_library(). If no filename has been set, defaults to 'test_lib.pkl'.

RETURNS DESCRIPTION
str

Full path to the pickle file containing the precomputed Gibbs free energy dictionary.

get_default_results_folder

get_default_results_folder()

Returns the default path to the 'noflank_results' folder where output files containing the generated sequence pairs are saved.

Description

The noflank_results directory is created automatically if it does not exist. The path is based on the current working directory from which the script was executed.

RETURNS DESCRIPTION
str

Absolute path to the 'noflank_results' directory.

save_sequence_pairs_to_txt

save_sequence_pairs_to_txt(sequence_pairs, filename=None)

Saves a list of DNA sequence pairs to a plain text file in the default noflank_results folder.

Description

Each line in the output file contains a sequence and its reverse complement, separated by a tab. If filename is not provided, an informative name is generated based on the number of sequences, sequence length, and current timestamp.

PARAMETER DESCRIPTION
sequence_pairs

List of (sequence, reverse_complement) tuples.

TYPE: list of tuple

filename

Optional custom file name. If None, a name is generated based on timestamp and sequence length.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
None

None

load_sequence_pairs_from_txt

load_sequence_pairs_from_txt(filename, use_default_results_folder=True)

Loads DNA sequence pairs from a plain text file in the default noflank_results folder.

Description

Reads a tab-separated text file where each line contains a sequence and its reverse complement. The file is located in the noflank_results directory returned by get_default_results_folder().

PARAMETER DESCRIPTION
filename

Name of the text file to load.

TYPE: str

use_default_results_folder

If True, interpret filename relative to the default noflank_results folder; otherwise treat it as an absolute or relative path.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
list of tuple

List of (sequence, reverse_complement) tuples loaded from the file.

RAISES DESCRIPTION
FileNotFoundError

If the specified file does not exist.