OrthoSeq Generator API Documentation¶
The below contains documentation for all relevant functions available for import within the orthoseq_generator package.
Sequence Computations¶
orthoseq_generator.sequence_computations ¶
SequencePairRegistry ¶
SequencePairRegistry(
length=7,
fivep_ext="",
threep_ext="",
unwanted_substrings=None,
apply_unwanted_to="core",
seed=None,
preselected_cores=None,
)
Stateful generator/registry for DNA sequence pairs.
It generates random core sequences of fixed length, forms the pair (seq, revcom(seq)), applies constraints, and assigns stable integer IDs.
If a generated pair has been seen before, it returns the previously assigned ID instead of creating a new one.
| PARAMETER | DESCRIPTION |
|---|---|
length
|
Length of the core DNA sequence (without flanks).
TYPE:
|
fivep_ext
|
Optional 5′ flanking sequence prepended to each strand.
TYPE:
|
threep_ext
|
Optional 3′ flanking sequence appended to each strand.
TYPE:
|
unwanted_substrings
|
List of substrings that disqualify a sequence. Example: ["AAAA", "CCCC", "GGGG", "TTTT"].
TYPE:
|
apply_unwanted_to
|
Where to apply unwanted_substrings checks. - "core": apply only to the random core sequences - "full": apply to the full flanked sequences
TYPE:
|
seed
|
Optional RNG seed for reproducibility.
TYPE:
|
preselected_cores
|
Optional iterable of core sequences to draw from instead of random generation. Sampling is without replacement in random order.
TYPE:
|
sample_pair ¶
Generates (or reuses) a random sequence pair and returns (pair_id, pair).
Behavior¶
- If preselected_cores were provided, draws from that list (random, with replacement).
- Draw random core sequences until constraints pass.
- Convert to canonical (sorted) flanked pair.
- If pair was seen: return existing ID.
- Else: assign new ID, store, return it.
| PARAMETER | DESCRIPTION |
|---|---|
max_tries
|
Maximum attempts before raising an error (prevents infinite loops).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[int, tuple[str, str]]
|
(pair_id, (seq, rc_seq)) where seq/rc_seq are flanked and sorted. |
get_pair_by_id ¶
Returns the stored pair for a given ID.
| PARAMETER | DESCRIPTION |
|---|---|
pair_id
|
Integer ID returned by sample_pair.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[str, str]
|
(seq, rc_seq) canonical sorted pair. |
revcom ¶
Computes the reverse complement of a DNA sequence.
| PARAMETER | DESCRIPTION |
|---|---|
sequence
|
Single DNA sequence as a string.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
Reverse complement of the input sequence as a string. |
has_four_consecutive_bases ¶
Returns True if the sequence contains four identical consecutive bases (e.g., "GGGG", "CCCC", "AAAA", "TTTT").
Notes¶
Additional sequence constraints (e.g., homopolymer runs of other lengths) can be added here as needed.
| PARAMETER | DESCRIPTION |
|---|---|
seq
|
DNA sequence as a string.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
bool
|
True if any base appears four times in a row, False otherwise. |
sorted_key ¶
Returns a tuple with the two input sequences sorted alphabetically.
Description¶
Ensures that (seq1, seq2) and (seq2, seq1) map to the same dictionary key.
| PARAMETER | DESCRIPTION |
|---|---|
seq1
|
First DNA sequence.
TYPE:
|
seq2
|
Second DNA sequence.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple
|
Tuple of the two sequences in alphabetical order. |
create_sequence_pairs_pool ¶
Generates a list of unique DNA sequence pairs (and their reverse complements) with optional flanking sequences.
Procedure¶
- Generate all possible core sequences of specified
length. - Compute each sequence's reverse complement and alphabetically sort the pair.
- If
avoid_ggggis True, filter out any pair where either sequence contains four identical bases in a row. - Prepend
fivep_extand appendthreep_extto both members of each pair. - Enumerate the resulting list, assigning a unique integer ID to each pair.
| PARAMETER | DESCRIPTION |
|---|---|
length
|
Length of the core DNA sequences (without flanks).
TYPE:
|
fivep_ext
|
Optional 5′ flanking sequence prepended to each strand.
TYPE:
|
threep_ext
|
Optional 3′ flanking sequence appended to each strand.
TYPE:
|
avoid_gggg
|
If True, filters out pairs containing four identical consecutive bases.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list of tuple
|
List of tuples |
create_seqwalk_sequence_pairs_pool ¶
create_seqwalk_sequence_pairs_pool(
length=7,
k=3,
seed=None,
fivep_ext="",
threep_ext="",
alphabet="ACGT",
avoid_reverse_complements=True,
gc_lims=None,
prevented_patterns=None,
verbose=True,
)
Generates sequence pairs from SeqWalk and converts them into this module's pair format.
This is a thin integration layer around seqwalk.design.max_size.
SeqWalk designs a maximal library of core sequences for a chosen sequence-symmetry
minimization (SSM) k value, optionally excluding reverse complements. The
resulting core sequences are then converted into canonical
(seq, revcom(seq)) pairs with optional flanks.
| PARAMETER | DESCRIPTION |
|---|---|
length
|
Length of the core DNA sequences produced by SeqWalk.
TYPE:
|
k
|
Sequence symmetry minimization (SSM) k value passed to SeqWalk.
TYPE:
|
seed
|
Optional Python random seed for deterministic SeqWalk output.
TYPE:
|
fivep_ext
|
Optional 5′ flank prepended to both strands.
TYPE:
|
threep_ext
|
Optional 3′ flank appended to both strands.
TYPE:
|
alphabet
|
Allowed DNA alphabet passed to SeqWalk.
TYPE:
|
avoid_reverse_complements
|
If True, request an RC-free SeqWalk library.
TYPE:
|
gc_lims
|
Optional
TYPE:
|
prevented_patterns
|
Optional list of forbidden patterns passed to SeqWalk.
TYPE:
|
verbose
|
If True, allow SeqWalk to print progress information.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[tuple[int, tuple[str, str]]]
|
List of |
nupack_compute_energy_precompute_library_fast ¶
Computes the Gibbs free energy of hybridization between two DNA sequences using NUPACK, with optional caching via a precompute library.
Notes¶
- Uses a local cache to avoid redundant NUPACK calls when
Use_Library=True. If the argument is None, the global setting inhf.USE_LIBRARYis used. - Energies are stored under a sorted key so (seq1, seq2) and (seq2, seq1) map identically. This function does not write back to disk; cache updates are handled by callers.
- Called by multiprocessing; each worker loads its own cache copy once from file.
- Does not write to the cache during multiprocessing to prevent conflicts.
- All energies larger than -1 kcal/mol are mapped to -1 kcal/mol. 0 is used in other routines as an indicator that the energy has not been computed. -1 kcal/mol is already extremely weak (virtually no interaction).
- Model parameters are fixed at 37°C, sodium=0.05 M, magnesium=0.025 M; change with a fresh cache.
| PARAMETER | DESCRIPTION |
|---|---|
seq1
|
First DNA sequence.
TYPE:
|
seq2
|
Second DNA sequence.
TYPE:
|
type
|
Either 'total' (partition sum) or 'minimum' (MFE) calculation. The result of 'total' is what you would use to compute a binding constant.
TYPE:
|
Use_Library
|
If True, use and load the precompute cache; defaults to global setting.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[float, float, float] | float
|
Tuple |
compute_pair_energy_on ¶
Helper function for parallel computing of on-target energies.
| PARAMETER | DESCRIPTION |
|---|---|
i
|
Sequence index.
TYPE:
|
seq
|
DNA sequence.
TYPE:
|
rc_seq
|
Reverse complement sequence.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[int, float, float, float]
|
Tuple |
compute_ontarget_energies ¶
Computes on-target Gibbs free energies for a list of sequence pairs using multiprocessing.
Notes¶
- Uses
ProcessPoolExecutor(withinitializer=_init_worker) to parallelize calls to NUPACK vianupack_compute_energy_precompute_library_fast. - If
hf.USE_LIBRARYis True, the initializer function (_init_worker) passes the library filename and flag to each worker so thatnupack_compute_energy_precompute_library_fastcan load its cache. After all parallel computations finish, this function saves the cache with the new energies. - Saves the updated cache atomically using
DelayedKeyboardInterruptto prevent corruption. - Prints progress and CPU core usage to the console.
| PARAMETER | DESCRIPTION |
|---|---|
sequence_list
|
List of tuples, each containing a sequence and its reverse complement.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
|
Tuple of NumPy arrays |
compute_pair_energy_off ¶
Helper function for parallel computing of off-target energies.
| PARAMETER | DESCRIPTION |
|---|---|
i
|
Index of the first sequence.
TYPE:
|
j
|
Index of the second sequence.
TYPE:
|
seq1
|
First DNA sequence.
TYPE:
|
seq2
|
Second DNA sequence.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple (int, int, float)
|
Tuple |
compute_offtarget_energies ¶
Computes off-target hybridization energies for all pairwise combinations of a given list of sequence pairs.
Procedure¶
- Extract handles and antihandles from
sequence_pairs. - Initialize three N×N energy matrices for:
- handle-handle interactions
- antihandle-antihandle interactions
- handle-antihandle interactions
- For each matrix, use
ProcessPoolExecutor(viacompute_pair_energy_off) to fill only the required entries: - i ≥ j for the two symmetric matrices
- i ≠ j for the mixed handle-antihandle matrix
- If
hf.USE_LIBRARYis True, the initializer function (_init_worker) passes the library filename and flag to each worker so thatnupack_compute_energy_precompute_library_fastcan load its cache. After all parallel computations finish, this function saves the cache with the new energies.
Notes¶
- Off-target interactions are computed for:
1) handle with handle
2) antihandle with antihandle
3) handle with antihandle - Symmetric matrices only compute the lower triangle (i ≥ j) to avoid redundancy.
- Entries with no interaction or computation errors return -1.0 (mapped for any energy > -1.0).
A value of 0 indicates the energy was skipped due to redundancy. - Uses
DelayedKeyboardInterruptto ensure atomic writes when saving the updated cache.
| PARAMETER | DESCRIPTION |
|---|---|
sequence_pairs
|
List of (sequence, reverse_complement) tuples.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
dict
|
Dictionary containing three N×N numpy arrays with keys: |
select_subset ¶
Selects a random subset of sequence pairs up to a specified maximum size.
This function supports two input types: 1) A precomputed pool: list of (index, (seq, rc_seq)) tuples. - If pool size > max_size: uses random.sample for efficiency. - Else: returns all pairs. 2) A generator/registry object that provides sample_pair(). - Repeatedly calls sample_pair() until max_size unique pairs are collected, or timeout_s is reached.
Notes¶
- For list input: uses sampling rather than shuffling for performance.
- For registry input: guarantees uniqueness by ID (not by sequence string), so repeated samples do not inflate the subset.
Timeout behavior¶
If timeout_s is reached while using a registry, the function returns the pairs found so far and prints: "Only X of requested Y found (timeout)."
| PARAMETER | DESCRIPTION |
|---|---|
sequence_pairs
|
Either - list of (index, (seq, rc_seq)) tuples, or - an object with method sample_pair() -> (pair_id, (seq, rc_seq)).
TYPE:
|
max_size
|
Maximum number of pairs to select.
TYPE:
|
timeout_s
|
Optional timeout in seconds (only used for registry input).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list of tuple
|
List of (seq, rc_seq) pairs selected. |
crossreference_sequences ¶
crossreference_sequences(
new_pair, pool, offtarget_limit, max_pair_violations=0, Use_Library=None
)
Checks off-target interactions between a candidate sequence pair and a history pool.
Counts violations per pool pair, not per individual strand-strand
interaction. A pool pair is counted as violating if any of the four
pairwise comparisons between (seq, rc_seq) and (pool_seq, pool_rc)
falls below offtarget_limit.
| PARAMETER | DESCRIPTION |
|---|---|
new_pair
|
Candidate
TYPE:
|
pool
|
Existing
TYPE:
|
offtarget_limit
|
Energy cutoff below which an off-target interaction is considered a violation.
TYPE:
|
max_pair_violations
|
Maximum number of violating pool pairs allowed before the candidate is rejected.
TYPE:
|
Use_Library
|
Whether to use the precomputed energy library (overrides the global setting if not None).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[bool, int]
|
Tuple |
select_subset_in_energy_range ¶
select_subset_in_energy_range(
sequence_pairs,
energy_min=-inf,
energy_max=inf,
self_energy_min=-inf,
max_size=inf,
Use_Library=None,
avoid_indices=None,
timeout_s=None,
history_pool=None,
allowed_violations=0,
offtarget_limit=None,
max_nupack_calls=None,
progress_every=None,
)
Selects a random subset of sequence pairs that pass on-target energy, self-energy, and optional cross-reference filters.
Supports two input types: 1) Precomputed list of (index, (seq, rc_seq)) tuples. 2) SequencePairRegistry-like object with sample_pair() method.
Notes¶
- Uses random sampling without full shuffling.
- Keeps returned sequence order aligned with returned indices list.
- Can stop early due to
timeout_s,max_nupack_calls, or candidate exhaustion. - If
offtarget_limitis None, cross-reference filtering is skipped.
| PARAMETER | DESCRIPTION |
|---|---|
sequence_pairs
|
List of (index, (seq, rc_seq)) tuples or registry with sample_pair().
TYPE:
|
energy_min
|
Minimum acceptable on-target (association) energy.
TYPE:
|
energy_max
|
Maximum acceptable on-target (association) energy.
TYPE:
|
self_energy_min
|
Minimum acceptable self-energy for each strand.
TYPE:
|
max_size
|
Maximum number of pairs to return.
TYPE:
|
Use_Library
|
Whether to use the precomputed energy library (overrides global if not None).
TYPE:
|
avoid_indices
|
Indices to avoid when sampling.
TYPE:
|
timeout_s
|
Optional wall-clock timeout in seconds; returns early if exceeded.
TYPE:
|
history_pool
|
Optional list of accepted
TYPE:
|
allowed_violations
|
Maximum number of pool pairs allowed to violate
TYPE:
|
offtarget_limit
|
Optional off-target energy cutoff for cross-reference filtering.
TYPE:
|
max_nupack_calls
|
Optional limit on direct NUPACK energy computations made inside this function.
TYPE:
|
progress_every
|
Optional attempt interval for progress prints.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[list[tuple[str, str]], list[int], bool, int]
|
Tuple |
select_all_in_energy_range ¶
select_all_in_energy_range(
sequence_pairs, energy_min=-inf, energy_max=inf, Use_Library=None, avoid_ids=None
)
Selects all sequence pairs whose on-target energies fall within a given energy range.
Description¶
Iterates through every (global_index, (seq, rc_seq)) tuple, computes the on-target energy
using nupack_compute_energy_precompute_library_fast, and collects those where
energy_min <= energy <= energy_max, skipping any global_index values in avoid_ids.
Note that the ID here refers to the global index in the original sequence-pair list.
Notes¶
- If
Use_Libraryis True, energies are fetched from or stored in the precompute cache. - Prints progress messages to the console.
| PARAMETER | DESCRIPTION |
|---|---|
sequence_pairs
|
List of
TYPE:
|
energy_min
|
Minimum allowed Gibbs free energy (inclusive).
TYPE:
|
energy_max
|
Maximum allowed Gibbs free energy (inclusive).
TYPE:
|
Use_Library
|
Whether to use a precomputed energy library (overrides global if not None).
TYPE:
|
avoid_ids
|
Set of global indices to skip during selection.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple (list of tuple, list of int)
|
Tuple |
compute_offtarget_fraction_below_limit ¶
Computes the fraction of off-target energies that are below off_limit.
Notes¶
- If
off_energiesis a dict of matrices, values are flattened and concatenated. - For dict input, zeros are excluded because they represent uncomputed entries.
| PARAMETER | DESCRIPTION |
|---|---|
off_energies
|
Off-target energies as an array-like or dict of energy matrices.
TYPE:
|
off_limit
|
Threshold energy (kcal/mol).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
float
|
Fraction of values < off_limit in [0, 1]. Returns 0.0 if no values are available. |
plot_on_off_target_histograms ¶
plot_on_off_target_histograms(
on_energies,
off_energies,
bins=80,
output_path=None,
show_plot=True,
vlines=None,
title=None,
xlim=None,
)
Plots histograms comparing on-target and off-target Gibbs free energy distributions.
Notes¶
- If
off_energiesis a dict, combines:- 'handle_handle_energies'
- 'antihandle_handle_energies'
- 'antihandle_antihandle_energies' into a single array, excluding zeros (uncomputed values).
- Normalizes frequencies so that area under each histogram sums to 1.
- Uses consistent bin edges across both distributions for direct comparison.
- Saves the figure to
output_pathif provided, otherwise only displays it. - Prints summary statistics after plotting.
| PARAMETER | DESCRIPTION |
|---|---|
on_energies
|
On-target energy values.
TYPE:
|
off_energies
|
Off-target energies as an array-like or dict of energy matrices.
TYPE:
|
bins
|
Number of bins for histograms.
TYPE:
|
output_path
|
File path to save the plot; if None, the plot is only displayed.
TYPE:
|
show_plot
|
Whether to call plt.show() to display the plot.
TYPE:
|
vlines
|
Optional dictionary of additional vertical lines to draw. Special keys: 'min_ontarget'.
TYPE:
|
title
|
Optional custom plot title. If None, a default title is used.
TYPE:
|
xlim
|
Optional x-axis limits as
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
dict
|
Dictionary of summary statistics:
- 'min_on' : Minimum on-target energy
- 'mean_on' : Mean of on-target energies |
plot_self_energy_histogram ¶
Plots a histogram of self-energies (e.g., G_A and G_B combined).
Notes¶
- Accepts a single array-like, a tuple/list of arrays (e.g., (G_A, G_B)), or a dict of arrays; all values are flattened and concatenated.
- Uses the same visual style as plot_on_off_target_histograms.
- Prints summary statistics after plotting.
| PARAMETER | DESCRIPTION |
|---|---|
self_energies
|
Array-like, tuple/list of arrays, or dict of arrays.
TYPE:
|
bins
|
Number of bins for histogram.
TYPE:
|
output_path
|
File path to save the plot; if None, the plot is only displayed.
TYPE:
|
show_plot
|
Whether to call plt.show() to display the plot.
TYPE:
|
Vertex Cover Algorithms¶
orthoseq_generator.vertex_cover_algorithms ¶
min_ontarget
module-attribute
¶
Select sequences with on-target energy in desired range¶
subset, indices, _, _ = select_subset_in_energy_range( ontarget7mer, energy_min=min_ontarget, energy_max=max_ontarget, max_size=30, Use_Library=True, avoid_indices=set() )
Compute off-target energies for the subset¶
off_e_subset = compute_offtarget_energies(subset, Use_Library=False)
Build the off-target interaction graph¶
Edges = build_edges(off_e_subset, indices, offtarget_limit)
heuristic_vertex_cover_optimized2 ¶
This function is the core of the sequence search algorithm. It’s a heuristic approach to solve the NP-hard minimum vertex cover problem.
Inspired by: - Joshi (2020), "Neighbourhood Evaluation Criteria for Vertex Cover Problem" - StackExchange discussion: https://cs.stackexchange.com/q/74546
Algorithm Outline¶
- Immediately add any self-edge vertices (u == v) to the cover.
- Build an adjacency list for all non-self edges.
- Track the degree (number of neighbors) for each vertex.
- While edges remain:
a. Identify the vertex/vertices with maximum degree.
b. Among those, select the vertex with the fewest neighbors that also share that max degree.
c. Break ties randomly, preferring vertices in
avoid_V. d. Add the selected vertex to the cover, remove it and its incident edges, and update degrees.
Notes¶
avoid_Vcontains vertices that should be removed when possible, but they can still be kept.- Self-edges are covered immediately.
- Orphan vertices (degree zero) are naturally independent and never need removal.
| PARAMETER | DESCRIPTION |
|---|---|
E
|
Set of edges (u, v). Vertices can be any hashable.
TYPE:
|
avoid_V
|
Vertices you’d like to preferentially remove into the cover. They can still be kept, just less likely.
TYPE:
|
cleanup
|
If True, remove any redundant vertices from the final cover without uncovering any edges.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
set
|
A vertex cover (set of vertices touching every edge in E). |
find_uncovered_edges ¶
Finds edges that are not covered by the current vertex cover.
Description¶
Given a collection of edges E and a set vertex_cover of vertices, this function returns
all edges which are not in the set. Technically, vertex_cover is not a full
vertex cover of the graph but only a partial vertex cover.
| PARAMETER | DESCRIPTION |
|---|---|
E
|
Collection of edges (u, v).
TYPE:
|
vertex_cover
|
Set of vertices currently in the cover.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
set
|
Edges (u, v) from |
build_edges ¶
Builds a list of global index‐pair edges from off‐target energy matrices. (Global indices refer to the positions in the originally created sequence-pair list.)
Procedure¶
- Extract all (i, j) positions from each matrix where energy <
energy_cutoff. - Stack these positions together and sort each pair so (i, j) and (j, i) collapse to one.
- Remove duplicate pairs.
- Map local indices back to global sequence indices via the
indiceslist.
| PARAMETER | DESCRIPTION |
|---|---|
offtarget_dict
|
Dictionary containing three N×N numpy arrays under keys: - 'handle_handle_energies' - 'antihandle_handle_energies' - 'antihandle_antihandle_energies'
TYPE:
|
indices
|
List of global sequence indices corresponding to matrix rows/columns.
TYPE:
|
energy_cutoff
|
Threshold below which an energy defines an edge.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list of tuple
|
List of (i, j) tuples where each is a global‐index edge with off‐target energy < cutoff. |
compute_pair_conflict_probability ¶
Computes pair-level conflict probability using the same conflict rule as build_edges.
A pair (i, j) with i != j is counted as conflicting if at least one of the three
off-target interaction matrices violates energy_cutoff, exactly as in build_edges.
| PARAMETER | DESCRIPTION |
|---|---|
offtarget_dict
|
Dictionary containing the three off-target energy matrices.
TYPE:
|
energy_cutoff
|
Threshold below which an interaction defines a conflict.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
float
|
Fraction of conflicting unordered sequence-pair pairs in [0, 1]. Returns 0.0 if fewer than 2 sequence pairs are present. |
select_vertices_to_remove ¶
Selects a subset of vertices to remove from an existing vertex cover.
| PARAMETER | DESCRIPTION |
|---|---|
vertex_cover
|
Current set of cover vertices.
TYPE:
|
num_vertices_to_remove
|
Desired number of vertices to remove.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
set
|
Randomly chosen vertices to remove (size ≤ num_vertices_to_remove). |
iterative_vertex_cover_multi ¶
iterative_vertex_cover_multi(
V,
E,
avoid_V=None,
num_vertices_to_remove=150,
max_iterations=200,
limit=+inf,
multistart=30,
population_size=5,
show_progress=False,
)
Attempts to find a small vertex cover via multiple randomized restarts and iterative refinement. Strategically calls heuristic_vertex_cover_optimized2
Algorithm Outline¶
- For each of
multistartattempts: a. Compute an initial cover via the greedy heuristic. b. Initialize a population containing that cover. c. Repeat up tomax_iterations:- For each cover in the population:
- Remove
num_vertices_to_removerandom vertices (respectingavoid_V). - Find uncovered edges and re-cover via the heuristic.
- If the new cover is smaller, reset the population to this cover.
- If it’s the same size but unique, add it to the population.
- Remove
- Trim the population to
population_sizeby random sampling. - Optionally print progress. d. If this attempt’s best cover is smaller than the global best, update it.
- For each cover in the population:
Notes¶
Because minimum vertex cover is NP-hard, this is a heuristic: it runs quickly but does not guarantee an optimal solution.
| PARAMETER | DESCRIPTION |
|---|---|
V
|
All vertices in the graph (e.g., list or set of IDs). Note: V is only used for printing/monitoring; the graph is fully encoded by E.
TYPE:
|
E
|
All edges (u, v) in global index space.
TYPE:
|
avoid_V
|
Vertices to preferentially remove into the cover.
TYPE:
|
num_vertices_to_remove
|
Number of vertices to drop each iteration.
TYPE:
|
max_iterations
|
Max refine steps per restart.
TYPE:
|
limit
|
Target threshold for |V| - |cover|; stops early if reached.
TYPE:
|
multistart
|
Number of independent greedy restarts.
TYPE:
|
population_size
|
Max number of equal-sized covers to retain each iteration.
TYPE:
|
show_progress
|
If True, prints status each iteration.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
tuple[set, list[list[int]]]
|
Tuple of (best_vertex_cover, trajectories), where trajectories is a list of per-multistart lists of independent set sizes over iterations. |
evolutionary_vertex_cover ¶
evolutionary_vertex_cover(
sequence_pairs,
offtarget_limit,
max_ontarget,
min_ontarget,
self_energy_limit,
subsetsize=200,
generations=100,
stop_event=None,
)
Dont use. It is working worse than
Evolves an independent set of sequences from a set of candidate sequence pairs by
iteratively removing high-energy (off-target) interactions via vertex-cover heuristics.
Implements a form of genetic “survivor selection” via repeated vertex-cover:
new sequences are sampled each generation and those with strong off-target interactions
are “removed” again. The history variable ensures previously promising sequences
re-enter the sampling pool.
Procedure¶
- Initialize:
non_cover_vertices: best independent set so far (sequences not in the cover).history: indices to avoid reselection, preserving diversity.- For each of
generationsiterations: a. Check ifstop_eventis set. If so, break. b. Select a random subset of sequences whose on-target energies lie within [min_ontarget,max_ontarget], excluding those inhistory.
c. Re-add any sequences fromhistoryto ensure good candidates are retained.
d. Assert that there are no duplicate indices.
e. Compute off-target energies for the subset.
f. Build the off-target interaction graph (edges where energy <offtarget_limit).
g. Apply the multi-start, iterative vertex-cover heuristic to findremoved_vertices.
h. Derive the new independent set: all selected indices minusremoved_vertices.
i. If this independent set is at least as large as the previous best:- Update
non_cover_vertices. - Clear
historyif strictly larger.
j. If its size ≥ 95% of the best, add its indices (deduplicated) tohistory.
k. Print generation summary statistics.
- Update
- On user interrupt (Ctrl+C) or
stop_event, exit gracefully and proceed to save the current best. - After all generations or interruption, save the final independent set to a text file.
Notes¶
- Catches
KeyboardInterruptto allow early exit: the best result so far is saved and plotted.
| PARAMETER | DESCRIPTION |
|---|---|
sequence_pairs
|
List of (index, (seq, rc_seq)) tuples for candidate sequences.
TYPE:
|
offtarget_limit
|
Energy threshold below which an off-target interaction defines an edge.
TYPE:
|
max_ontarget
|
Upper bound for acceptable on-target energy.
TYPE:
|
min_ontarget
|
Lower bound for acceptable on-target energy.
TYPE:
|
self_energy_limit
|
Minimum acceptable self-energy for each strand.
TYPE:
|
subsetsize
|
Number of sequences to sample per generation.
TYPE:
|
generations
|
Number of evolutionary iterations to perform.
TYPE:
|
stop_event
|
Optional threading.Event to stop the search.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list of tuple
|
Final list of (seq, rc_seq) pairs forming the best independent set. |
Helper Functions¶
orthoseq_generator.helper_functions ¶
DelayedKeyboardInterrupt ¶
Context manager that delays KeyboardInterrupt (Ctrl+C) during critical operations.
This prevents corruption of the precomputed energy library by deferring interrupt handling until the protected block (e.g., file writes) completes.
Usage¶
with DelayedKeyboardInterrupt(): # perform critical operation, like saving files save_pickle_atomic(...)
Notes¶
- On entering, replaces the SIGINT handler to queue the signal.
- On exit, restores the original handler and re-raises if an interrupt was received.
set_nupack_params ¶
Updates global NUPACK parameters used for all energy computations.
Notes¶
These values are read by functions in sequence_computations when building
a NUPACK Model. If you change parameters, you should also choose a new
precompute library filename to avoid mixing incompatible energies.
| PARAMETER | DESCRIPTION |
|---|---|
material
|
NUPACK material type (e.g., "dna").
TYPE:
|
celsius
|
Temperature in Celsius.
TYPE:
|
sodium
|
Sodium concentration in M.
TYPE:
|
magnesium
|
Magnesium concentration in M.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
None
|
None |
choose_precompute_library ¶
Sets the name of the precomputed energy library file.
Notes¶
Updates the global variable used by other functions to locate the correct library.
| PARAMETER | DESCRIPTION |
|---|---|
filename
|
Name of the pickle file where precomputed energies are or will be stored.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
None
|
None |
save_pickle_atomic ¶
Saves a Python object to disk as a pickle file in a safe and atomic way.
Notes¶
- Writes data to a temporary file (
<filepath>.tmp) first, then atomically replaces the original file to avoid corruption if a crash occurs during writing. - Creates the target directory if it does not exist.
| PARAMETER | DESCRIPTION |
|---|---|
data
|
Python object to save (typically a dictionary).
TYPE:
|
filepath
|
Full path to the target pickle file.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
None
|
None |
get_library_path ¶
Returns the full file path to the currently selected precomputed energy library.
Description¶
Constructs a path by combining the 'pre_computed_energies' folder with the
filename set via choose_precompute_library(). If no filename has been set,
defaults to 'test_lib.pkl'.
| RETURNS | DESCRIPTION |
|---|---|
str
|
Full path to the pickle file containing the precomputed Gibbs free energy dictionary. |
get_default_results_folder ¶
Returns the default path to the 'noflank_results' folder where output files containing the generated sequence pairs are saved.
Description¶
The noflank_results directory is created automatically if it does not exist. The path is based on the current working directory from which the script was executed.
| RETURNS | DESCRIPTION |
|---|---|
str
|
Absolute path to the 'noflank_results' directory. |
save_sequence_pairs_to_txt ¶
Saves a list of DNA sequence pairs to a plain text file in the default noflank_results folder.
Description¶
Each line in the output file contains a sequence and its reverse complement,
separated by a tab. If filename is not provided, an informative name is
generated based on the number of sequences, sequence length, and current timestamp.
| PARAMETER | DESCRIPTION |
|---|---|
sequence_pairs
|
List of (sequence, reverse_complement) tuples.
TYPE:
|
filename
|
Optional custom file name. If None, a name is generated based on timestamp and sequence length.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
None
|
None |
load_sequence_pairs_from_txt ¶
Loads DNA sequence pairs from a plain text file in the default noflank_results folder.
Description¶
Reads a tab-separated text file where each line contains a sequence and its
reverse complement. The file is located in the noflank_results directory returned by
get_default_results_folder().
| PARAMETER | DESCRIPTION |
|---|---|
filename
|
Name of the text file to load.
TYPE:
|
use_default_results_folder
|
If True, interpret
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list of tuple
|
List of (sequence, reverse_complement) tuples loaded from the file. |
| RAISES | DESCRIPTION |
|---|---|
FileNotFoundError
|
If the specified file does not exist. |