OrthoSeq Generator API Documentation¶

This page documents the relevant functions available for import from the orthoseq_generator package.

Sequence Computations¶

orthoseq_generator.sequence_computations ¶

SequencePairRegistry ¶

SequencePairRegistry(
    length=7,
    fivep_ext="",
    threep_ext="",
    unwanted_substrings=None,
    apply_unwanted_to="core",
    seed=None,
    preselected_cores=None,
)

Stateful generator/registry for DNA sequence pairs.

It generates random core sequences of fixed length, forms the pair (seq, revcom(seq)), applies constraints, and assigns stable integer IDs.

If a generated pair has been seen before, it returns the previously assigned ID instead of creating a new one.

PARAMETER	DESCRIPTION
`length`	Length of the core DNA sequence (without flanks). TYPE: `int` DEFAULT: `7`
`fivep_ext`	Optional 5′ flanking sequence prepended to each strand. TYPE: `str` DEFAULT: `''`
`threep_ext`	Optional 3′ flanking sequence appended to each strand. TYPE: `str` DEFAULT: `''`
`unwanted_substrings`	List of substrings that disqualify a sequence. Example: ["AAAA", "CCCC", "GGGG", "TTTT"]. TYPE: `list[str] \| None` DEFAULT: `None`
`apply_unwanted_to`	Where to apply unwanted_substrings checks. - "core": apply only to the random core sequences - "full": apply to the full flanked sequences TYPE: `str` DEFAULT: `'core'`
`seed`	Optional RNG seed for reproducibility. TYPE: `int \| None` DEFAULT: `None`
`preselected_cores`	Optional iterable of core sequences to draw from instead of random generation. Sampling is without replacement in random order. TYPE: `iterable[str] \| None` DEFAULT: `None`

sample_pair ¶

sample_pair(max_tries=10000)

Generates (or reuses) a random sequence pair and returns (pair_id, pair).

Behavior¶

If preselected_cores were provided, draws from that list (random, with replacement).
Draw random core sequences until constraints pass.
Convert to canonical (sorted) flanked pair.
If pair was seen: return existing ID.
Else: assign new ID, store, return it.

PARAMETER	DESCRIPTION
`max_tries`	Maximum attempts before raising an error (prevents infinite loops). TYPE: `int` DEFAULT: `10000`

RETURNS	DESCRIPTION
`tuple[int, tuple[str, str]]`	(pair_id, (seq, rc_seq)) where seq/rc_seq are flanked and sorted.
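
The generate, canonicalize, and reuse-ID behavior described above can be sketched in a few lines. This is a simplified illustration, not the package's source: `RegistrySketch` is a hypothetical name, flanks and `preselected_cores` are left out, and both the rule that a hit in either strand rejects a core and the `RuntimeError` are assumptions.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


class RegistrySketch:
    """Toy registry: random cores, canonical pairs, stable integer IDs."""

    def __init__(self, length=7, unwanted_substrings=None, seed=None):
        self.length = length
        self.unwanted = list(unwanted_substrings or [])
        self.rng = random.Random(seed)
        self.pair_to_id = {}
        self.id_to_pair = {}

    def sample_pair(self, max_tries=10000):
        for _ in range(max_tries):
            core = "".join(self.rng.choice("ACGT") for _ in range(self.length))
            rc = core.translate(_COMPLEMENT)[::-1]
            # Assumption: an unwanted substring in either strand rejects the core.
            if any(bad in core or bad in rc for bad in self.unwanted):
                continue
            pair = tuple(sorted((core, rc)))  # canonical (sorted) pair
            if pair not in self.pair_to_id:
                new_id = len(self.pair_to_id)
                self.pair_to_id[pair] = new_id
                self.id_to_pair[new_id] = pair
            return self.pair_to_id[pair], pair
        # Assumption: the real class may raise a different exception type.
        raise RuntimeError("no valid pair found within max_tries")

    def get_pair_by_id(self, pair_id):
        return self.id_to_pair[pair_id]
```

Because the dictionary key is the sorted pair, drawing a core or its reverse complement yields the same ID.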

get_pair_by_id ¶

get_pair_by_id(pair_id)

Returns the stored pair for a given ID.

PARAMETER	DESCRIPTION
`pair_id`	Integer ID returned by sample_pair. TYPE: `int`

RETURNS	DESCRIPTION
`tuple[str, str]`	(seq, rc_seq) canonical sorted pair.

revcom ¶

revcom(sequence)

Computes the reverse complement of a DNA sequence.

PARAMETER	DESCRIPTION
`sequence`	Single DNA sequence as a string. TYPE: `str`

RETURNS	DESCRIPTION
`str`	Reverse complement of the input sequence as a string.
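
As a minimal sketch of what a reverse complement is (assuming uppercase input restricted to A, C, G, T; `revcom_sketch` is an illustrative stand-in, not the package function):

```python
# Translation table mapping each base to its Watson-Crick complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcom_sketch(sequence: str) -> str:
    """Complement every base, then reverse the string."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(revcom_sketch("AACG"))  # prints CGTT
```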

has_four_consecutive_bases ¶

has_four_consecutive_bases(seq)

Returns True if the sequence contains four identical consecutive bases (e.g., "GGGG", "CCCC", "AAAA", "TTTT").

Notes¶

Additional sequence constraints (e.g., homopolymer runs of other lengths) can be added here as needed.

PARAMETER	DESCRIPTION
`seq`	DNA sequence as a string. TYPE: `str`

RETURNS	DESCRIPTION
`bool`	True if any base appears four times in a row, False otherwise.
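
One way to express this check is a regular expression with a backreference. The sketch below assumes uppercase DNA input and uses a hypothetical function name; the package may implement the test differently.

```python
import re

# One base followed by three more copies of the same base.
_RUN_OF_FOUR = re.compile(r"([ACGT])\1{3}")


def has_four_consecutive_bases_sketch(seq: str) -> bool:
    return _RUN_OF_FOUR.search(seq) is not None
```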

sorted_key ¶

sorted_key(seq1, seq2)

Returns a tuple with the two input sequences sorted alphabetically.

Description¶

Ensures that (seq1, seq2) and (seq2, seq1) map to the same dictionary key.

PARAMETER	DESCRIPTION
`seq1`	First DNA sequence. TYPE: `str`
`seq2`	Second DNA sequence. TYPE: `str`

RETURNS	DESCRIPTION
`tuple`	Tuple of the two sequences in alphabetical order.

create_sequence_pairs_pool ¶

create_sequence_pairs_pool(length=7, fivep_ext='', threep_ext='', avoid_gggg=True)

Generates a list of unique DNA sequence pairs (and their reverse complements) with optional flanking sequences.

Procedure¶

Generate all possible core sequences of specified length.
Compute each sequence's reverse complement and alphabetically sort the pair.
If avoid_gggg is True, filter out any pair where either sequence contains four identical bases in a row.
Prepend fivep_ext and append threep_ext to both members of each pair.
Enumerate the resulting list, assigning a unique integer ID to each pair.

PARAMETER	DESCRIPTION
`length`	Length of the core DNA sequences (without flanks). TYPE: `int` DEFAULT: `7`
`fivep_ext`	Optional 5′ flanking sequence prepended to each strand. TYPE: `str` DEFAULT: `''`
`threep_ext`	Optional 3′ flanking sequence appended to each strand. TYPE: `str` DEFAULT: `''`
`avoid_gggg`	If True, filters out pairs containing four identical consecutive bases. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`list of tuple`	List of tuples `[(index, (sequence, reverse_complement)), ...]`, where `index` is a unique ID and each tuple contains the complementary pair.
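
The procedure above can be sketched with the standard library alone. This is an illustration under assumptions, not the package's code: `pool_sketch` and its helpers are hypothetical names, and collapsing a core and its reverse complement into one entry is my reading of "unique" in the description.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def _revcom(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def _has_run_of_four(seq):
    return any(base * 4 in seq for base in "ACGT")


def pool_sketch(length=7, fivep_ext="", threep_ext="", avoid_gggg=True):
    unique_pairs = set()
    for bases in product("ACGT", repeat=length):
        core = "".join(bases)
        # Sorting makes a core and its reverse complement map to one pair.
        pair = tuple(sorted((core, _revcom(core))))
        if avoid_gggg and any(_has_run_of_four(s) for s in pair):
            continue
        unique_pairs.add(pair)
    # Flanks are added after filtering, to both members of each pair.
    flanked = [
        (fivep_ext + a + threep_ext, fivep_ext + b + threep_ext)
        for a, b in sorted(unique_pairs)
    ]
    return list(enumerate(flanked))
```

For `length=4` there are 256 cores, of which 16 are their own reverse complement, giving 136 unique pairs; the homopolymer filter removes two of them.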

create_seqwalk_sequence_pairs_pool ¶

create_seqwalk_sequence_pairs_pool(
    length=7,
    k=3,
    seed=None,
    fivep_ext="",
    threep_ext="",
    alphabet="ACGT",
    avoid_reverse_complements=True,
    gc_lims=None,
    prevented_patterns=None,
    verbose=True,
)

Generates sequence pairs from SeqWalk and converts them into this module's pair format.

This is a thin integration layer around seqwalk.design.max_size. SeqWalk designs a maximal library of core sequences for a chosen sequence-symmetry minimization (SSM) k value, optionally excluding reverse complements. The resulting core sequences are then converted into canonical (seq, revcom(seq)) pairs with optional flanks.

PARAMETER	DESCRIPTION
`length`	Length of the core DNA sequences produced by SeqWalk. TYPE: `int` DEFAULT: `7`
`k`	Sequence symmetry minimization (SSM) k value passed to SeqWalk. TYPE: `int` DEFAULT: `3`
`seed`	Optional Python random seed for deterministic SeqWalk output. TYPE: `int \| None` DEFAULT: `None`
`fivep_ext`	Optional 5′ flank prepended to both strands. TYPE: `str` DEFAULT: `''`
`threep_ext`	Optional 3′ flank appended to both strands. TYPE: `str` DEFAULT: `''`
`alphabet`	Allowed DNA alphabet passed to SeqWalk. TYPE: `str` DEFAULT: `'ACGT'`
`avoid_reverse_complements`	If True, request an RC-free SeqWalk library. TYPE: `bool` DEFAULT: `True`
`gc_lims`	Optional `(min_gc, max_gc)` tuple passed to SeqWalk. TYPE: `tuple[int, int] \| None` DEFAULT: `None`
`prevented_patterns`	Optional list of forbidden patterns passed to SeqWalk. TYPE: `list[str] \| None` DEFAULT: `None`
`verbose`	If True, allow SeqWalk to print progress information. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`list[tuple[int, tuple[str, str]]]`	List of `(index, (sequence, reverse_complement))` tuples.

nupack_compute_energy_precompute_library_fast ¶

nupack_compute_energy_precompute_library_fast(
    seq1, seq2, type="total", Use_Library=None
)

Computes the Gibbs free energy of hybridization between two DNA sequences using NUPACK, with optional caching via a precompute library.

Notes¶

Uses a local cache to avoid redundant NUPACK calls when Use_Library=True. If the argument is None, the global setting in hf.USE_LIBRARY is used.
Energies are stored under a sorted key so (seq1, seq2) and (seq2, seq1) map identically. This function does not write back to disk; cache updates are handled by callers.
Called by multiprocessing; each worker loads its own cache copy once from file.
Does not write to the cache during multiprocessing to prevent conflicts.
All energies greater than -1 kcal/mol are mapped to -1 kcal/mol, because other routines use 0 to mark an energy that has not been computed. An energy of -1 kcal/mol is already extremely weak (virtually no interaction).
Default model parameters are 37°C, sodium=0.05 M, magnesium=0.025 M; if you change them, start a fresh cache.

PARAMETER	DESCRIPTION
`seq1`	First DNA sequence. TYPE: `str`
`seq2`	Second DNA sequence. TYPE: `str`
`type`	Either 'total' (partition sum) or 'minimum' (MFE) calculation. The result of 'total' is what you would use to compute a binding constant. TYPE: `str` DEFAULT: `'total'`
`Use_Library`	If True, use and load the precompute cache; defaults to global setting. TYPE: `bool \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple[float, float, float] \| float`	Tuple `(energy, G_A, G_B)` where `energy` is the association free energy (kcal/mol). For homodimers, `G_B == G_A`. If NUPACK returns no MFE or an exception occurs, the function returns `-1.0` (scalar) instead.
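
The caching notes above (sorted key, read-only lookup, clamping to -1) can be illustrated without NUPACK. In this sketch `energy_fn` is a hypothetical stand-in for the real NUPACK call, and the function returns a single energy rather than the `(energy, G_A, G_B)` tuple.

```python
def sorted_key(seq1, seq2):
    """(seq1, seq2) and (seq2, seq1) map to the same key."""
    return tuple(sorted((seq1, seq2)))


def lookup_or_compute(seq1, seq2, cache, energy_fn):
    """Return a cached energy if present, else compute and clamp it.

    The cache is only read here; storing new values is left to the caller.
    """
    key = sorted_key(seq1, seq2)
    if key in cache:
        return cache[key]
    # Energies weaker than -1 kcal/mol are mapped to -1, so 0 stays free
    # as the "not computed" marker.
    return min(energy_fn(seq1, seq2), -1.0)


def toy_energy(a, b):
    # Toy model for illustration only: -0.25 kcal/mol per base of `a`.
    return -0.25 * len(a)
```

A caller would collect the returned energies and merge them into the cache once all workers have finished.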

compute_pair_energy_on ¶

compute_pair_energy_on(i, seq, rc_seq)

Helper function for parallel computing of on-target energies.

PARAMETER	DESCRIPTION
`i`	Sequence index. TYPE: `int`
`seq`	DNA sequence. TYPE: `str`
`rc_seq`	Reverse complement sequence. TYPE: `str`

RETURNS	DESCRIPTION
`tuple[int, float, float, float]`	Tuple `(i, pair_energy, self_energy_seq, self_energy_rc_seq)`.

compute_ontarget_energies ¶

compute_ontarget_energies(sequence_list)

Computes on-target Gibbs free energies for a list of sequence pairs using multiprocessing.

Notes¶

Uses ProcessPoolExecutor (with initializer=_init_worker) to parallelize calls to NUPACK via nupack_compute_energy_precompute_library_fast.
If hf.USE_LIBRARY is True, the initializer function (_init_worker) passes the library filename and flag to each worker so that nupack_compute_energy_precompute_library_fast can load its cache. After all parallel computations finish, this function saves the cache with the new energies.
Saves the updated cache atomically using DelayedKeyboardInterrupt to prevent corruption.
Prints progress and CPU core usage to the console.

PARAMETER	DESCRIPTION
`sequence_list`	List of tuples, each containing a sequence and its reverse complement. TYPE: `list of tuple`

RETURNS	DESCRIPTION
`tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]`	Tuple of NumPy arrays `(pair_energies, self_energies_seq, self_energies_rc_seq)`.

compute_pair_energy_off ¶

compute_pair_energy_off(i, j, seq1, seq2)

Helper function for parallel computing of off-target energies.

PARAMETER	DESCRIPTION
`i`	Index of the first sequence. TYPE: `int`
`j`	Index of the second sequence. TYPE: `int`
`seq1`	First DNA sequence. TYPE: `str`
`seq2`	Second DNA sequence. TYPE: `str`

RETURNS	DESCRIPTION
`tuple (int, int, float)`	Tuple `(i, j, energy)` where `energy` is the computed Gibbs free energy.

compute_offtarget_energies ¶

compute_offtarget_energies(sequence_pairs)

Computes off-target hybridization energies for all pairwise combinations of a given list of sequence pairs.

Procedure¶

Extract handles and antihandles from sequence_pairs.
Initialize three N×N energy matrices for:
handle-handle interactions
antihandle-antihandle interactions
handle-antihandle interactions
For each matrix, use ProcessPoolExecutor (via compute_pair_energy_off) to fill only the required entries:
i ≥ j for the two symmetric matrices
i ≠ j for the mixed handle-antihandle matrix
If hf.USE_LIBRARY is True, the initializer function (_init_worker) passes the library filename and flag to each worker so that nupack_compute_energy_precompute_library_fast can load its cache. After all parallel computations finish, this function saves the cache with the new energies.

Notes¶

Off-target interactions are computed for:
1) handle with handle
2) antihandle with antihandle
3) handle with antihandle
Symmetric matrices only compute the lower triangle (i ≥ j) to avoid redundancy.
Entries with no interaction or with computation errors are set to -1.0; any energy greater than -1.0 is also mapped to -1.0.
A value of 0 indicates the energy was skipped due to redundancy.
Uses DelayedKeyboardInterrupt to ensure atomic writes when saving the updated cache.

PARAMETER	DESCRIPTION
`sequence_pairs`	List of (sequence, reverse_complement) tuples. TYPE: `list of tuple`

RETURNS	DESCRIPTION
`dict`	Dictionary containing three N×N numpy arrays with keys: - 'handle_handle_energies' - 'antihandle_handle_energies' - 'antihandle_antihandle_energies'
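
The fill pattern of the three matrices can be sketched serially with nested lists. This is an illustration only: the real function uses NumPy arrays and a process pool, `energy_fn` is a hypothetical stand-in for the NUPACK call, and the row/column orientation of the mixed matrix is an assumption.

```python
def offtarget_matrices_sketch(sequence_pairs, energy_fn):
    handles = [pair[0] for pair in sequence_pairs]
    antihandles = [pair[1] for pair in sequence_pairs]
    n = len(sequence_pairs)
    hh = [[0.0] * n for _ in range(n)]
    aa = [[0.0] * n for _ in range(n)]
    ah = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i >= j:  # symmetric matrices: lower triangle only
                hh[i][j] = energy_fn(handles[i], handles[j])
                aa[i][j] = energy_fn(antihandles[i], antihandles[j])
            if i != j:  # mixed matrix: skip the on-target diagonal
                # Assumption: rows index antihandles, columns index handles.
                ah[i][j] = energy_fn(antihandles[i], handles[j])
    return {
        "handle_handle_energies": hh,
        "antihandle_handle_energies": ah,
        "antihandle_antihandle_energies": aa,
    }
```

Entries left at 0 are the ones that were skipped, which is why downstream routines exclude zeros.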

select_subset ¶

select_subset(sequence_pairs, max_size=200, timeout_s=20)

Selects a random subset of sequence pairs up to a specified maximum size.

This function supports two input types:
1) A precomputed pool: list of (index, (seq, rc_seq)) tuples.
- If pool size > max_size: uses random.sample for efficiency.
- Else: returns all pairs.
2) A generator/registry object that provides sample_pair().
- Repeatedly calls sample_pair() until max_size unique pairs are collected, or timeout_s is reached.

Notes¶

For list input: uses sampling rather than shuffling for performance.
For registry input: guarantees uniqueness by ID (not by sequence string), so repeated samples do not inflate the subset.

Timeout behavior¶

If timeout_s is reached while using a registry, the function returns the pairs found so far and prints: "Only X of requested Y found (timeout)."

PARAMETER	DESCRIPTION
`sequence_pairs`	Either - list of (index, (seq, rc_seq)) tuples, or - an object with method sample_pair() -> (pair_id, (seq, rc_seq)). TYPE: `list \| object`
`max_size`	Maximum number of pairs to select. TYPE: `int` DEFAULT: `200`
`timeout_s`	Optional timeout in seconds (only used for registry input). TYPE: `float \| None` DEFAULT: `20`

RETURNS	DESCRIPTION
`list of tuple`	List of (seq, rc_seq) pairs selected.
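
Both input paths can be sketched as follows. This is a simplified illustration with a hypothetical function name; the type check used to distinguish a list from a registry is an assumption.

```python
import random
import time


def select_subset_sketch(sequence_pairs, max_size=200, timeout_s=20):
    # Path 1: precomputed pool of (index, (seq, rc_seq)) tuples.
    if isinstance(sequence_pairs, list):
        if len(sequence_pairs) > max_size:
            chosen = random.sample(sequence_pairs, max_size)
        else:
            chosen = sequence_pairs
        return [pair for _, pair in chosen]

    # Path 2: registry-like object; uniqueness is tracked by pair ID.
    start = time.monotonic()
    found = {}
    while len(found) < max_size:
        if timeout_s is not None and time.monotonic() - start > timeout_s:
            print(f"Only {len(found)} of requested {max_size} found (timeout).")
            break
        pair_id, pair = sequence_pairs.sample_pair()
        found.setdefault(pair_id, pair)
    return list(found.values())
```

Keying the collected pairs by ID means a registry that keeps returning known pairs cannot inflate the subset; the timeout is what ends the loop in that case.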

crossreference_sequences ¶

crossreference_sequences(
    new_pair, pool, offtarget_limit, max_pair_violations=0, Use_Library=None
)

Checks off-target interactions between a candidate sequence pair and a history pool.

Counts violations per pool pair, not per individual strand-strand interaction. A pool pair is counted as violating if any of the four pairwise comparisons between (seq, rc_seq) and (pool_seq, pool_rc) falls below offtarget_limit.

PARAMETER	DESCRIPTION
`new_pair`	Candidate `(seq, rc_seq)` pair to test. TYPE: `tuple[str, str]`
`pool`	Existing `(seq, rc_seq)` pairs to cross-reference against. TYPE: `list[tuple[str, str]]`
`offtarget_limit`	Energy cutoff below which an off-target interaction is considered a violation. TYPE: `float`
`max_pair_violations`	Maximum number of violating pool pairs allowed before the candidate is rejected. TYPE: `int` DEFAULT: `0`
`Use_Library`	Whether to use the precomputed energy library (overrides the global setting if not None). TYPE: `bool \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple[bool, int]`	Tuple `(passed, nupack_calls)` where `passed` is False if the number of violating pool pairs exceeds `max_pair_violations`, and `nupack_calls` is the number of direct energy computations performed during this cross-reference check.
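
The per-pool-pair counting rule can be sketched like this. It is an illustration under assumptions: `energy_fn` is a hypothetical stand-in for the energy computation, every call to it is counted (the real function counts only direct NUPACK computations), and stopping as soon as the violation budget is exceeded is my choice rather than documented behavior.

```python
def crossreference_sketch(new_pair, pool, offtarget_limit, energy_fn,
                          max_pair_violations=0):
    violations = 0
    calls = 0
    for pool_pair in pool:
        violated = False
        # Four comparisons: each candidate strand against each pool strand.
        for a in new_pair:
            for b in pool_pair:
                calls += 1
                if energy_fn(a, b) < offtarget_limit:
                    violated = True
        if violated:
            # One violation per pool pair, however many comparisons failed.
            violations += 1
            if violations > max_pair_violations:
                return False, calls
    return True, calls
```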

select_subset_in_energy_range ¶

select_subset_in_energy_range(
    sequence_pairs,
    energy_min=-inf,
    energy_max=inf,
    self_energy_min=-inf,
    max_size=inf,
    Use_Library=None,
    avoid_indices=None,
    timeout_s=None,
    history_pool=None,
    allowed_violations=0,
    offtarget_limit=None,
    max_nupack_calls=None,
    progress_every=None,
)

Selects a random subset of sequence pairs that pass on-target energy, self-energy, and optional cross-reference filters.

Supports two input types:
1) Precomputed list of (index, (seq, rc_seq)) tuples.
2) SequencePairRegistry-like object with sample_pair() method.

Notes¶

Uses random sampling without full shuffling.
Keeps returned sequence order aligned with returned indices list.
Can stop early due to timeout_s, max_nupack_calls, or candidate exhaustion.
If offtarget_limit is None, cross-reference filtering is skipped.

PARAMETER	DESCRIPTION
`sequence_pairs`	List of (index, (seq, rc_seq)) tuples or registry with sample_pair(). TYPE: `list \| object`
`energy_min`	Minimum acceptable on-target (association) energy. TYPE: `float` DEFAULT: `-inf`
`energy_max`	Maximum acceptable on-target (association) energy. TYPE: `float` DEFAULT: `inf`
`self_energy_min`	Minimum acceptable self-energy for each strand. TYPE: `float` DEFAULT: `-inf`
`max_size`	Maximum number of pairs to return. TYPE: `int` DEFAULT: `inf`
`Use_Library`	Whether to use the precomputed energy library (overrides global if not None). TYPE: `bool \| None` DEFAULT: `None`
`avoid_indices`	Indices to avoid when sampling. TYPE: `set \| None` DEFAULT: `None`
`timeout_s`	Optional wall-clock timeout in seconds; returns early if exceeded. TYPE: `float \| None` DEFAULT: `None`
`history_pool`	Optional list of accepted `(seq, rc_seq)` pairs to cross-reference against. TYPE: `list[tuple[str, str]] \| None` DEFAULT: `None`
`allowed_violations`	Maximum number of pool pairs allowed to violate `offtarget_limit`. TYPE: `int` DEFAULT: `0`
`offtarget_limit`	Optional off-target energy cutoff for cross-reference filtering. TYPE: `float \| None` DEFAULT: `None`
`max_nupack_calls`	Optional limit on direct NUPACK energy computations made inside this function. TYPE: `int \| None` DEFAULT: `None`
`progress_every`	Optional attempt interval for progress prints. TYPE: `int \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple[list[tuple[str, str]], list[int], bool, int]`	Tuple `(subset, indices, stopped_early, nupack_calls)` where `subset` is a list of `(seq, rc_seq)` pairs, `indices` are their corresponding global IDs, `stopped_early` indicates timeout or NUPACK-budget exit, and `nupack_calls` is the number of direct NUPACK computations made inside this function.

select_all_in_energy_range ¶

select_all_in_energy_range(
    sequence_pairs, energy_min=-inf, energy_max=inf, Use_Library=None, avoid_ids=None
)

Selects all sequence pairs whose on-target energies fall within a given energy range.

Description¶

Iterates through every (global_index, (seq, rc_seq)) tuple, computes the on-target energy using nupack_compute_energy_precompute_library_fast, and collects those where energy_min <= energy <= energy_max, skipping any global_index values in avoid_ids. Note that the ID here refers to the global index in the original sequence-pair list.

Notes¶

If Use_Library is True, energies are fetched from or stored in the precompute cache.
Prints progress messages to the console.

PARAMETER	DESCRIPTION
`sequence_pairs`	List of `(global_index, (seq, rc_seq))` tuples. TYPE: `list of tuple`
`energy_min`	Minimum allowed Gibbs free energy (inclusive). TYPE: `float` DEFAULT: `-inf`
`energy_max`	Maximum allowed Gibbs free energy (inclusive). TYPE: `float` DEFAULT: `inf`
`Use_Library`	Whether to use a precomputed energy library (overrides global if not None). TYPE: `bool \| None` DEFAULT: `None`
`avoid_ids`	Set of global indices to skip during selection. TYPE: `set \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`tuple (list of tuple, list of int)`	Tuple `(subset, selected_ids)` where: - `subset` is a list of `(seq, rc_seq)` pairs within the energy range. - `selected_ids` is a list of their corresponding global indices.

compute_offtarget_fraction_below_limit ¶

compute_offtarget_fraction_below_limit(off_energies, off_limit)

Computes the fraction of off-target energies that are below off_limit.

Notes¶

If off_energies is a dict of matrices, values are flattened and concatenated.
For dict input, zeros are excluded because they represent uncomputed entries.

PARAMETER	DESCRIPTION
`off_energies`	Off-target energies as an array-like or dict of energy matrices. TYPE: `array-like \| dict`
`off_limit`	Threshold energy (kcal/mol). TYPE: `float`

RETURNS	DESCRIPTION
`float`	Fraction of values < off_limit in [0, 1]. Returns 0.0 if no values are available.
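
A pure-Python sketch of this computation follows. It assumes nested lists in place of NumPy arrays and uses a hypothetical function name; only the zero-exclusion rule for dict input and the 0.0 fallback are taken from the description above.

```python
def fraction_below_sketch(off_energies, off_limit):
    if isinstance(off_energies, dict):
        # Flatten all matrices; zeros mark entries that were never computed.
        values = [
            energy
            for matrix in off_energies.values()
            for row in matrix
            for energy in row
            if energy != 0
        ]
    else:
        values = list(off_energies)
    if not values:
        return 0.0
    return sum(1 for energy in values if energy < off_limit) / len(values)
```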

plot_on_off_target_histograms ¶

plot_on_off_target_histograms(
    on_energies,
    off_energies,
    bins=80,
    output_path=None,
    show_plot=True,
    vlines=None,
    title=None,
    xlim=None,
)

Plots histograms comparing on-target and off-target Gibbs free energy distributions.

Notes¶

If off_energies is a dict, combines:
- 'handle_handle_energies'
- 'antihandle_handle_energies'
- 'antihandle_antihandle_energies' into a single array, excluding zeros (uncomputed values).
Normalizes frequencies so that area under each histogram sums to 1.
Uses consistent bin edges across both distributions for direct comparison.
Saves the figure to output_path if provided, otherwise only displays it.
Prints summary statistics after plotting.

PARAMETER	DESCRIPTION
`on_energies`	On-target energy values. TYPE: `array-like`
`off_energies`	Off-target energies as an array-like or dict of energy matrices. TYPE: `array-like \| dict`
`bins`	Number of bins for histograms. TYPE: `int` DEFAULT: `80`
`output_path`	File path to save the plot; if None, the plot is only displayed. TYPE: `str \| None` DEFAULT: `None`
`show_plot`	Whether to call plt.show() to display the plot. TYPE: `bool` DEFAULT: `True`
`vlines`	Optional dictionary of additional vertical lines to draw. Special keys: 'min_ontarget'. TYPE: `dict \| None` DEFAULT: `None`
`title`	Optional custom plot title. If None, a default title is used. TYPE: `str \| None` DEFAULT: `None`
`xlim`	Optional x-axis limits as `(xmin, xmax)`. If None, limits are inferred from data. TYPE: `tuple[float, float] \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`dict`	Dictionary of summary statistics: - 'min_on' : Minimum on-target energy - 'mean_on' : Mean of on-target energies - 'std_on' : Standard deviation of on-target energies - 'max_on' : Maximum on-target energy - 'mean_off' : Mean of off-target energies - 'std_off' : Standard deviation of off-target energies - 'min_off' : Minimum off-target energy

plot_self_energy_histogram ¶

plot_self_energy_histogram(self_energies, bins=30, output_path=None, show_plot=True)

Plots a histogram of self-energies (e.g., G_A and G_B combined).

Notes¶

Accepts a single array-like, a tuple/list of arrays (e.g., (G_A, G_B)), or a dict of arrays; all values are flattened and concatenated.
Uses the same visual style as plot_on_off_target_histograms.
Prints summary statistics after plotting.

PARAMETER	DESCRIPTION
`self_energies`	Array-like, tuple/list of arrays, or dict of arrays. TYPE: `array-like \| tuple/list \| dict`
`bins`	Number of bins for histogram. TYPE: `int` DEFAULT: `30`
`output_path`	File path to save the plot; if None, the plot is only displayed. TYPE: `str \| None` DEFAULT: `None`
`show_plot`	Whether to call plt.show() to display the plot. TYPE: `bool` DEFAULT: `True`

Vertex Cover Algorithms¶

orthoseq_generator.vertex_cover_algorithms ¶

min_ontarget `module-attribute` ¶

min_ontarget = -10.4

Select sequences with on-target energy in desired range¶

subset, indices, _, _ = select_subset_in_energy_range(
    ontarget7mer,
    energy_min=min_ontarget,
    energy_max=max_ontarget,
    max_size=30,
    Use_Library=True,
    avoid_indices=set(),
)

Compute off-target energies for the subset¶

off_e_subset = compute_offtarget_energies(subset, Use_Library=False)

Build the off-target interaction graph¶

Edges = build_edges(off_e_subset, indices, offtarget_limit)

heuristic_vertex_cover_optimized2 ¶

heuristic_vertex_cover_optimized2(E, avoid_V=None, cleanup=True)

This function is the core of the sequence search algorithm. It’s a heuristic approach to solve the NP-hard minimum vertex cover problem.

Inspired by: - Joshi (2020), "Neighbourhood Evaluation Criteria for Vertex Cover Problem" - StackExchange discussion: https://cs.stackexchange.com/q/74546

Algorithm Outline¶

Immediately add any self-edge vertices (u == v) to the cover.
Build an adjacency list for all non-self edges.
Track the degree (number of neighbors) for each vertex.
While edges remain:
a. Identify the vertex/vertices with maximum degree.
b. Among those, select the vertex with the fewest neighbors that also share that max degree.
c. Break ties randomly, preferring vertices in avoid_V.
d. Add the selected vertex to the cover, remove it and its incident edges, and update degrees.

Notes¶

avoid_V contains vertices that should be removed when possible, but they can still be kept.
Self-edges are covered immediately.
Orphan vertices (degree zero) are naturally independent and never need removal.

PARAMETER	DESCRIPTION
`E`	Set of edges (u, v). Vertices can be any hashable. TYPE: `iterable of tuple`
`avoid_V`	Vertices that should preferentially be placed in the cover (that is, removed from the graph). They can still be kept, but this is less likely. TYPE: `(set, optional)` DEFAULT: `None`
`cleanup`	If True, remove any redundant vertices from the final cover without uncovering any edges. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`set`	A vertex cover (set of vertices touching every edge in E).
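
To make the greedy idea concrete, here is a deliberately simplified max-degree sketch. It is not the documented heuristic: the neighbor-based selection in step b, the `avoid_V` preference, and the `cleanup` pass are all omitted, and the function name is hypothetical.

```python
import random


def greedy_cover_sketch(E, seed=None):
    rng = random.Random(seed)
    E = list(E)
    # Self-edges can only be covered by their own vertex.
    cover = {u for u, v in E if u == v}
    adjacency = {}
    for u, v in E:
        if u == v or u in cover or v in cover:
            continue  # already covered
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    while any(adjacency.values()):
        max_degree = max(len(nbrs) for nbrs in adjacency.values())
        candidates = [v for v, nbrs in adjacency.items() if len(nbrs) == max_degree]
        pick = rng.choice(candidates)  # random tie-break
        cover.add(pick)
        # Remove the vertex and its incident edges, updating neighbor degrees.
        for neighbor in adjacency.pop(pick):
            adjacency[neighbor].discard(pick)
    return cover
```

Vertices that end up with no remaining neighbors are never picked, which matches the note that orphan vertices never need removal.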

find_uncovered_edges ¶

find_uncovered_edges(E, vertex_cover)

Finds edges that are not covered by the current vertex cover.

Description¶

Given a collection of edges E and a set vertex_cover of vertices, this function returns all edges that have neither endpoint in vertex_cover. Strictly speaking, vertex_cover may be only a partial cover of the graph rather than a full vertex cover.

PARAMETER	DESCRIPTION
`E`	Collection of edges (u, v). TYPE: `iterable of tuple`
`vertex_cover`	Set of vertices currently in the cover. TYPE: `set`

RETURNS	DESCRIPTION
`set`	Edges (u, v) from `E` for which neither u nor v is in `vertex_cover`. Self-edges (u == v) are included if u is not in the cover.
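
The rule in the Returns row reduces to a single set comprehension. A minimal sketch, with a hypothetical function name:

```python
def find_uncovered_edges_sketch(E, vertex_cover):
    """Edges whose endpoints are both outside the (partial) cover."""
    return {
        (u, v)
        for u, v in E
        if u not in vertex_cover and v not in vertex_cover
    }
```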

build_edges ¶

build_edges(offtarget_dict, indices, energy_cutoff)

Builds a list of global index‐pair edges from off‐target energy matrices. (Global indices refer to the positions in the originally created sequence-pair list.)

Procedure¶

Extract all (i, j) positions from each matrix where energy < energy_cutoff.
Stack these positions together and sort each pair so (i, j) and (j, i) collapse to one.
Remove duplicate pairs.
Map local indices back to global sequence indices via the indices list.

PARAMETER	DESCRIPTION
`offtarget_dict`	Dictionary containing three N×N numpy arrays under keys: - 'handle_handle_energies' - 'antihandle_handle_energies' - 'antihandle_antihandle_energies' TYPE: `dict`
`indices`	List of global sequence indices corresponding to matrix rows/columns. TYPE: `list of int`
`energy_cutoff`	Threshold below which an energy defines an edge. TYPE: `float`

RETURNS	DESCRIPTION
`list of tuple`	List of (i, j) tuples where each is a global‐index edge with off‐target energy < cutoff.
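
The four steps above can be sketched with nested lists instead of NumPy arrays. This is an illustration with a hypothetical function name; returning the edges in sorted order is my choice, not documented behavior.

```python
def build_edges_sketch(offtarget_dict, indices, energy_cutoff):
    edges = set()  # a set removes duplicate pairs
    for matrix in offtarget_dict.values():
        for i, row in enumerate(matrix):
            for j, energy in enumerate(row):
                if energy < energy_cutoff:
                    # Sort local indices so (i, j) and (j, i) collapse to one.
                    a, b = sorted((i, j))
                    # Map local matrix positions to global sequence indices.
                    edges.add((indices[a], indices[b]))
    return sorted(edges)
```

Uncomputed entries hold 0, which never falls below a negative cutoff, so they produce no edges.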

compute_pair_conflict_probability ¶

compute_pair_conflict_probability(offtarget_dict, energy_cutoff)

Computes pair-level conflict probability using the same conflict rule as build_edges.

A pair (i, j) with i != j is counted as conflicting if at least one of the three off-target interaction matrices violates energy_cutoff, exactly as in build_edges.

PARAMETER	DESCRIPTION
`offtarget_dict`	Dictionary containing the three off-target energy matrices. TYPE: `dict`
`energy_cutoff`	Threshold below which an interaction defines a conflict. TYPE: `float`

RETURNS	DESCRIPTION
`float`	Fraction of conflicting unordered sequence-pair pairs in [0, 1]. Returns 0.0 if fewer than 2 sequence pairs are present.
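
As a rough sketch of the pair-level rule, the following counts each unordered pair (i, j) once if any matrix entry at (i, j) or (j, i) falls below the cutoff. It assumes nested lists rather than NumPy arrays, and the function name is hypothetical.

```python
def pair_conflict_probability_sketch(offtarget_dict, energy_cutoff):
    matrices = list(offtarget_dict.values())
    n = len(matrices[0]) if matrices else 0
    if n < 2:
        return 0.0
    conflicts = 0
    for i in range(n):
        for j in range(i + 1, n):
            if any(
                m[i][j] < energy_cutoff or m[j][i] < energy_cutoff
                for m in matrices
            ):
                conflicts += 1
    # Divide by the number of unordered pairs, n choose 2.
    return conflicts / (n * (n - 1) / 2)
```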

select_vertices_to_remove ¶

select_vertices_to_remove(vertex_cover, num_vertices_to_remove)

Selects a subset of vertices to remove from an existing vertex cover.

PARAMETER	DESCRIPTION
`vertex_cover`	Current set of cover vertices. TYPE: `set`
`num_vertices_to_remove`	Desired number of vertices to remove. TYPE: `int`

RETURNS	DESCRIPTION
`set`	Randomly chosen vertices to remove (size ≤ num_vertices_to_remove).

iterative_vertex_cover_multi ¶

iterative_vertex_cover_multi(
    V,
    E,
    avoid_V=None,
    num_vertices_to_remove=150,
    max_iterations=200,
    limit=+inf,
    multistart=30,
    population_size=5,
    show_progress=False,
)

Attempts to find a small vertex cover via multiple randomized restarts and iterative refinement. It calls heuristic_vertex_cover_optimized2 to build the initial cover and to re-cover edges after each perturbation.

Algorithm Outline¶

For each of multistart attempts:
a. Compute an initial cover via the greedy heuristic.
b. Initialize a population containing that cover.
c. Repeat up to max_iterations:
- For each cover in the population:
  - Remove num_vertices_to_remove random vertices (respecting avoid_V).
  - Find uncovered edges and re-cover via the heuristic.
  - If the new cover is smaller, reset the population to this cover.
  - If it’s the same size but unique, add it to the population.
- Trim the population to population_size by random sampling.
- Optionally print progress.
d. If this attempt’s best cover is smaller than the global best, update it.

Notes¶

Because minimum vertex cover is NP-hard, this is a heuristic: it runs quickly but does not guarantee an optimal solution.

PARAMETER	DESCRIPTION
`V`	All vertices in the graph (e.g., list or set of IDs). Note: V is only used for printing/monitoring; the graph is fully encoded by E. TYPE: `iterable`
`E`	All edges (u, v) in global index space. TYPE: `iterable of tuple`
`avoid_V`	Vertices to preferentially remove into the cover. TYPE: `(set, optional)` DEFAULT: `None`
`num_vertices_to_remove`	Number of vertices to drop each iteration. TYPE: `int` DEFAULT: `150`
`max_iterations`	Max refine steps per restart. TYPE: `int` DEFAULT: `200`
`limit`	Target threshold for \|V\| - \|cover\|; stops early if reached. TYPE: `float` DEFAULT: `+inf`
`multistart`	Number of independent greedy restarts. TYPE: `int` DEFAULT: `30`
`population_size`	Max number of equal-sized covers to retain each iteration. TYPE: `int` DEFAULT: `5`
`show_progress`	If True, prints status each iteration. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`tuple[set, list[list[int]]]`	Tuple of (best_vertex_cover, trajectories), where trajectories is a list of per-multistart lists of independent set sizes over iterations.
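
The remove-and-re-cover loop at the heart of the outline can be sketched for a single start and a population of one. Everything here is a simplification under assumptions: `greedy_cover` is a plain max-degree stand-in for the package heuristic, and multistart, the population, `avoid_V`, and `limit` are omitted.

```python
import random


def greedy_cover(edges):
    """Stand-in heuristic: repeatedly take a vertex of maximum degree."""
    remaining = set(edges)
    cover = set()
    while remaining:
        degree = {}
        for u, v in remaining:
            degree[u] = degree.get(u, 0) + 1
            if v != u:
                degree[v] = degree.get(v, 0) + 1
        pick = max(degree, key=degree.get)
        cover.add(pick)
        remaining = {(u, v) for u, v in remaining if pick not in (u, v)}
    return cover


def refine_cover_sketch(E, num_vertices_to_remove=2, max_iterations=50, seed=0):
    rng = random.Random(seed)
    E = list(E)
    best = greedy_cover(E)
    for _ in range(max_iterations):
        k = min(num_vertices_to_remove, len(best))
        # sorted() gives rng.sample a sequence; vertices must be orderable here.
        trial = best - set(rng.sample(sorted(best), k))
        uncovered = [(u, v) for u, v in E if u not in trial and v not in trial]
        trial |= greedy_cover(uncovered)  # re-cover what the removal exposed
        if len(trial) < len(best):
            best = trial
    return best
```

Each trial is a valid cover by construction, so the loop can only keep or shrink the current best.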

evolutionary_vertex_cover ¶

evolutionary_vertex_cover(
    sequence_pairs,
    offtarget_limit,
    max_ontarget,
    min_ontarget,
    self_energy_limit,
    subsetsize=200,
    generations=100,
    stop_event=None,
)

Do not use. It performs worse than …
Evolves an independent set of sequences from a set of candidate sequence pairs by iteratively removing strong off-target interactions via vertex-cover heuristics.
Implements a form of genetic “survivor selection” via repeated vertex cover: new sequences are sampled each generation, and those with strong off-target interactions are removed again. The history variable ensures previously promising sequences re-enter the sampling pool.

Procedure¶

Initialize:
non_cover_vertices: best independent set so far (sequences not in the cover).
history: indices to avoid reselection, preserving diversity.
For each of generations iterations:
a. Check if stop_event is set. If so, break.
b. Select a random subset of sequences whose on-target energies lie within [min_ontarget, max_ontarget], excluding those in history.
c. Re-add any sequences from history to ensure good candidates are retained.
d. Assert that there are no duplicate indices.
e. Compute off-target energies for the subset.
f. Build the off-target interaction graph (edges where energy < offtarget_limit).
g. Apply the multi-start, iterative vertex-cover heuristic to find removed_vertices.
h. Derive the new independent set: all selected indices minus removed_vertices.
i. If this independent set is at least as large as the previous best:
- Update non_cover_vertices.
- Clear history if strictly larger.
j. If its size ≥ 95% of the best, add its indices (deduplicated) to history.
k. Print generation summary statistics.
On user interrupt (Ctrl+C) or stop_event, exit gracefully and proceed to save the current best.
After all generations or interruption, save the final independent set to a text file.

Notes¶

Catches KeyboardInterrupt to allow early exit: the best result so far is saved and plotted.

PARAMETER	DESCRIPTION
`sequence_pairs`	List of (index, (seq, rc_seq)) tuples for candidate sequences. TYPE: `list of tuple`
`offtarget_limit`	Energy threshold below which an off-target interaction defines an edge. TYPE: `float`
`max_ontarget`	Upper bound for acceptable on-target energy. TYPE: `float`
`min_ontarget`	Lower bound for acceptable on-target energy. TYPE: `float`
`self_energy_limit`	Minimum acceptable self-energy for each strand. TYPE: `float`
`subsetsize`	Number of sequences to sample per generation. TYPE: `int` DEFAULT: `200`
`generations`	Number of evolutionary iterations to perform. TYPE: `int` DEFAULT: `100`
`stop_event`	Optional threading.Event to stop the search. TYPE: `Event` DEFAULT: `None`

RETURNS	DESCRIPTION
`list of tuple`	Final list of (seq, rc_seq) pairs forming the best independent set.

Helper Functions¶

orthoseq_generator.helper_functions ¶

DelayedKeyboardInterrupt ¶

Context manager that delays KeyboardInterrupt (Ctrl+C) during critical operations.

This prevents corruption of the precomputed energy library by deferring interrupt handling until the protected block (e.g., file writes) completes.

Usage¶

with DelayedKeyboardInterrupt():
    # perform critical operation, like saving files
    save_pickle_atomic(...)

Notes¶

On entering, replaces the SIGINT handler to queue the signal.
On exit, restores the original handler and re-raises if an interrupt was received.
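
The behavior described in these notes follows a common signal-handling pattern, sketched below. This is an illustration only and not the package's code: the class name DelayedInterruptSketch is hypothetical, and the sketch assumes it is used from the main thread, since Python only allows installing signal handlers there.

```python
import signal


class DelayedInterruptSketch:
    """Hypothetical context manager that defers Ctrl+C until the block ends."""

    def __enter__(self):
        self._interrupted = False
        # Replace the SIGINT handler with one that only records the signal.
        self._old_handler = signal.signal(signal.SIGINT, self._record)
        return self

    def _record(self, signum, frame):
        # Queue the interrupt instead of raising KeyboardInterrupt right away.
        self._interrupted = True

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore the original handler, then re-raise if Ctrl+C was received.
        signal.signal(signal.SIGINT, self._old_handler)
        if self._interrupted:
            raise KeyboardInterrupt
        return False
```

With this pattern, an interrupt that arrives in the middle of the protected block does not abort the operation in progress; the KeyboardInterrupt surfaces only once the block has finished.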

set_nupack_params ¶

set_nupack_params(material='dna', celsius=37, sodium=0.05, magnesium=0.025)

Updates global NUPACK parameters used for all energy computations.

Notes¶

These values are read by functions in sequence_computations when building a NUPACK Model. If you change parameters, you should also choose a new precompute library filename to avoid mixing incompatible energies.

PARAMETER	DESCRIPTION
`material`	NUPACK material type (e.g., "dna"). TYPE: `str` DEFAULT: `'dna'`
`celsius`	Temperature in Celsius. TYPE: `float` DEFAULT: `37`
`sodium`	Sodium concentration in M. TYPE: `float` DEFAULT: `0.05`
`magnesium`	Magnesium concentration in M. TYPE: `float` DEFAULT: `0.025`

RETURNS	DESCRIPTION
`None`	None
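
One way to follow the advice about choosing a new library filename is to derive the name from the parameters themselves, so that any parameter change yields a different file. The helper below is a hypothetical convenience, not part of orthoseq_generator; its name, prefix, and naming scheme are assumptions made for illustration.

```python
def library_name_for(material="dna", celsius=37, sodium=0.05,
                     magnesium=0.025, prefix="energies"):
    """Hypothetical helper: build a library filename that encodes the
    NUPACK parameters, so different conditions never share one file."""
    return f"{prefix}_{material}_{celsius:g}C_Na{sodium:g}_Mg{magnesium:g}.pkl"


default_name = library_name_for()
warmer_name = library_name_for(celsius=45)
```

The resulting string could then be passed to choose_precompute_library() right after calling set_nupack_params() with the same values.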

choose_precompute_library ¶

choose_precompute_library(filename)

Sets the name of the precomputed energy library file.

Notes¶

Updates the global variable used by other functions to locate the correct library.

PARAMETER	DESCRIPTION
`filename`	Name of the pickle file where precomputed energies are or will be stored. TYPE: `str`

RETURNS	DESCRIPTION
`None`	None

save_pickle_atomic ¶

save_pickle_atomic(data, filepath)

Saves a Python object to disk as a pickle file in a safe and atomic way.

Notes¶

Writes data to a temporary file (<filepath>.tmp) first, then atomically replaces the original file to avoid corruption if a crash occurs during writing.
Creates the target directory if it does not exist.

PARAMETER	DESCRIPTION
`data`	Python object to save (typically a dictionary). TYPE: `any`
`filepath`	Full path to the target pickle file. TYPE: `str`

RETURNS	DESCRIPTION
`None`	None
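
The write-then-replace approach described in the notes can be sketched with the standard library as follows. This is a minimal illustration under stated assumptions, not the package's actual code: the function name save_pickle_atomic_sketch is hypothetical, and the explicit flush and fsync calls are an extra precaution that the original may or may not take.

```python
import os
import pickle


def save_pickle_atomic_sketch(data, filepath):
    """Hypothetical sketch: pickle to <filepath>.tmp, then swap it into place."""
    directory = os.path.dirname(filepath)
    if directory:
        # Create the target directory if it does not exist.
        os.makedirs(directory, exist_ok=True)
    tmp_path = filepath + ".tmp"
    with open(tmp_path, "wb") as handle:
        pickle.dump(data, handle)
        handle.flush()
        os.fsync(handle.fileno())  # push the bytes to disk before the swap
    # os.replace overwrites the destination in a single step, so readers see
    # either the old file or the complete new one.
    os.replace(tmp_path, filepath)
```

If the process dies while the temporary file is being written, the previous library file is left untouched.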

get_library_path ¶

get_library_path()

Returns the full file path to the currently selected precomputed energy library.

Description¶

Constructs a path by combining the 'pre_computed_energies' folder with the filename set via choose_precompute_library(). If no filename has been set, defaults to 'test_lib.pkl'.

RETURNS	DESCRIPTION
`str`	Full path to the pickle file containing the precomputed Gibbs free energy dictionary.
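
The interplay between choose_precompute_library() and get_library_path() can be pictured with the sketch below. The names choose_library_sketch and library_path_sketch, the module-level variable, and the base_dir argument are hypothetical; in particular, where the package anchors the 'pre_computed_energies' folder is not stated here, so the base directory is left as a parameter.

```python
import os

_selected_library = None  # hypothetical module-level state


def choose_library_sketch(filename):
    """Remember the library filename for later path lookups."""
    global _selected_library
    _selected_library = filename


def library_path_sketch(base_dir="."):
    """Join the energies folder with the chosen filename, or the default."""
    filename = _selected_library or "test_lib.pkl"
    return os.path.join(base_dir, "pre_computed_energies", filename)
```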

get_default_results_folder ¶

get_default_results_folder()

Returns the default path to the 'noflank_results' folder where output files containing the generated sequence pairs are saved.

Description¶

The noflank_results directory is created automatically if it does not exist. The path is based on the current working directory from which the script was executed.

RETURNS	DESCRIPTION
`str`	Absolute path to the 'noflank_results' directory.

save_sequence_pairs_to_txt ¶

save_sequence_pairs_to_txt(sequence_pairs, filename=None)

Saves a list of DNA sequence pairs to a plain text file in the default noflank_results folder.

Description¶

Each line in the output file contains a sequence and its reverse complement, separated by a tab. If filename is not provided, an informative name is generated based on the number of sequences, sequence length, and current timestamp.

PARAMETER	DESCRIPTION
`sequence_pairs`	List of (sequence, reverse_complement) tuples. TYPE: `list of tuple`
`filename`	Optional custom file name. If None, a name is generated from the number of sequences, the sequence length, and the current timestamp. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`None`	None

load_sequence_pairs_from_txt ¶

load_sequence_pairs_from_txt(filename, use_default_results_folder=True)

Loads DNA sequence pairs from a plain text file, by default from the noflank_results folder.

Description¶

Reads a tab-separated text file where each line contains a sequence and its reverse complement. When use_default_results_folder is True, the file is looked up in the noflank_results directory returned by get_default_results_folder(); otherwise filename is used as a path as given.

PARAMETER	DESCRIPTION
`filename`	Name of the text file to load. TYPE: `str`
`use_default_results_folder`	If True, interpret `filename` relative to the default noflank_results folder; otherwise treat it as an absolute or relative path. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`list of tuple`	List of (sequence, reverse_complement) tuples loaded from the file.

RAISES	DESCRIPTION
`FileNotFoundError`	If the specified file does not exist.
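
The tab-separated format shared by save_sequence_pairs_to_txt and load_sequence_pairs_from_txt can be sketched as a simple round trip. The function names below are hypothetical, the sketch takes an explicit path rather than resolving the noflank_results folder, and skipping blank lines is an assumption.

```python
def write_pairs_sketch(pairs, path):
    """Write one 'sequence<TAB>reverse_complement' line per pair."""
    with open(path, "w", encoding="utf-8") as handle:
        for seq, rc_seq in pairs:
            handle.write(f"{seq}\t{rc_seq}\n")


def read_pairs_sketch(path):
    """Read the pairs back; open() raises FileNotFoundError if path is missing."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are ignored
            seq, rc_seq = line.split("\t")
            pairs.append((seq, rc_seq))
    return pairs
```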
