Step Deep-Dives¶
This page explains the internal logic of every built-in pipeline step — what problem it solves, how it works algorithmically, and what data it reads from and writes to the SimulationContainer.
For parameter details and constructor signatures, see the API Reference and Parameter Reference.
SimulateSequence¶
Problem It Solves¶
Creates an immunoglobulin or TCR sequence from scratch by simulating V(D)J recombination with somatic hypermutation. This is the foundational step — every pipeline starts here.
Algorithm¶
-
Select alleles: Choose V, D (if applicable), J, and C alleles from the DataConfig. Selection is weighted by empirical gene usage frequencies, unless specific alleles are provided via
specific_v,specific_d,specific_j. -
Trim segments: For each segment, sample trim amounts from the DataConfig's trimming distributions:
- V: trimmed at the 3' end only
- D: trimmed independently at both 5' and 3' ends (validated so combined trims don't exceed allele length)
- J: trimmed at the 5' end only
-
Generate NP junctions: Create non-templated nucleotide regions between segments using a first-order Markov chain model:
- Sample the NP region length from
NP_lengths - Sample the first base from
NP_first_bases - Generate subsequent bases using position-specific transition probabilities from
NP_transitions - NP1: between V and D (or V and J for light chains)
- NP2: between D and J (heavy chains and TCR-beta only)
- Sample the NP region length from
-
Assemble: Concatenate trimmed-V + NP1 + trimmed-D + NP2 + trimmed-J into the complete sequence.
-
Validate: Run the FunctionalityValidator to check reading frame, stop codons, and anchor residues.
-
Retry (if
productive=True): If the sequence is non-functional, generate a completely new random sequence. Retry up to 25 times. If all fail, issue a warning and proceed. -
Apply mutations: Pass the assembled sequence to the mutation model (S5F or Uniform), which introduces substitutions at the sampled rate. If the mutation model has
productive=True, mutations that would create stop codons or destroy anchors are rejected. -
Re-validate: Run the validator again on the mutated sequence to update productivity flags.
Container Writes¶
| Field | Value |
|---|---|
sequence |
The mutated nucleotide sequence |
v_call, d_call, j_call, c_call |
Lists containing the selected allele(s) |
v_sequence_start/end, d_sequence_start/end, j_sequence_start/end |
Segment positions in the final sequence |
v_germline_start/end, d_germline_start/end, j_germline_start/end |
Positions within the original germline allele |
v_trim_5/3, d_trim_5/3, j_trim_5/3 |
Trim amounts applied |
junction_sequence_start/end |
CDR3 region boundaries |
mutations |
Dictionary of {position: "X>Y"} |
mutation_rate |
Fraction of mutated positions |
productive, stop_codon, vj_in_frame, note |
Validation results |
FixVPositionAfterTrimmingIndexAmbiguity¶
Problem It Solves¶
When the V segment is trimmed at its 3' end, random nucleotides are added in the NP1 junction. By chance, some of those NP nucleotides may be identical to the bases that were trimmed away. This makes the true V/NP boundary ambiguous — an alignment tool cannot tell where V ends and NP begins.
This step resolves the ambiguity by expanding the V boundary to absorb any matching bases.
Algorithm¶
- Extract the NP region immediately after the current V end (V_end to D_start, or V_end to J_start for light chains).
- Retrieve the trimmed-off portion of the V germline (the last
v_trim_3bases of the V reference). - Walk through the trimmed region character by character, comparing with the NP region:
- If the first NP base matches the last trimmed base, expand V by 1 position.
- Continue comparing until a mismatch is found or one sequence is exhausted.
- Update
v_sequence_end,v_germline_end, andv_trim_3to reflect the expansion.
Example¶
Before: V segment ends at position 280, v_trim_3 = 5
Trimmed V tail: ...ACGTC
NP region start: ACGT...
The first 4 NP bases (ACGT) match the trimmed tail.
After: V segment ends at position 284, v_trim_3 = 1
The 4 matching bases are now attributed to V, not NP.
Why this matters for benchmarking
If ground-truth V boundaries don't account for this ambiguity, an alignment tool that correctly identifies the overlap would appear to disagree with the ground truth — producing a false-negative evaluation.
FixDPositionAfterTrimmingIndexAmbiguity¶
Problem It Solves¶
Same concept as the V fix, but applied to both ends of the D segment. D segments are short (often 10–40 bp) and heavily trimmed, so NP nucleotides frequently match the trimmed D sequence on either side.
Algorithm¶
- Extract NP1 (V_end to D_start) and NP2 (D_end to J_start).
- Retrieve the trimmed portions:
- 5' side: first
d_trim_5bases of D reference - 3' side: last
d_trim_3bases of D reference
- 5' side: first
- For the 5' end: Walk backwards through the trimmed-5 region and NP1, comparing from the end. Each match moves D_start one position earlier.
- For the 3' end: Walk forward through the trimmed-3 region and NP2. Each match moves D_end one position later.
- Update
d_sequence_start/end,d_germline_start/end, andd_trim_5/3.
Only for heavy chains and TCR-beta
Light chains have no D segment. Skip this step for kappa and lambda chains.
FixJPositionAfterTrimmingIndexAmbiguity¶
Problem It Solves¶
Same concept applied to the J segment's 5' end. After J is trimmed at the 5' end, NP nucleotides before J may match the trimmed-away J sequence.
Algorithm¶
- Extract the NP region before J (D_end to J_start, or V_end to J_start for light chains).
- Retrieve the trimmed 5' portion of J (first
j_trim_5bases of J reference). - Walk backwards through both sequences comparing characters. Each match moves J_start one position earlier.
- Update
j_sequence_start,j_germline_start, andj_trim_5.
CorrectForVEndCut¶
Problem It Solves¶
After trimming, the remaining V sequence may be identical across multiple V alleles. For example, if two V alleles differ only in their last 10 bases and 12 bases are trimmed, both alleles produce the same observed sequence. The original V allele is no longer uniquely identifiable.
This step documents all V alleles that are indistinguishable given the observed trim amount.
Algorithm¶
- Load the V_3_TRIM_SIMILARITY_MAP from the DataConfig. This precomputed map stores, for each V allele and trim amount, the list of other V alleles that become identical.
- Look up the current V allele (
v_call[0]) and currentv_trim_3amount. - Retrieve all equivalent alleles from the map.
- Add them to
v_call, keeping the original allele first.
Container Writes¶
| Field | Change |
|---|---|
v_call |
Extended from [original] to [original, equivalent1, equivalent2, ...] |
This doesn't change the sequence
This step only updates metadata. The sequence itself is unchanged.
CorrectForDTrims¶
Problem It Solves¶
Same concept as CorrectForVEndCut, but for D alleles. Since D segments are trimmed from both ends, the similarity lookup uses a 2D key: the (d_trim_5, d_trim_3) tuple.
Algorithm¶
- Load the D_5_3_TRIM_SIMILARITY_MAP from the DataConfig.
- Look up the current D allele with both trim amounts as a tuple key.
- Add all equivalent alleles to
d_call.
Only for heavy chains and TCR-beta
Skip for light chains.
DistillMutationRate¶
Problem It Solves¶
The raw mutation dictionary from SimulateSequence includes mutations at every position — including NP junctions. But NP regions are non-templated random nucleotides with no germline reference, so "mutations" there are meaningless. This step removes them and recalculates a clean mutation rate.
Algorithm¶
- Identify NP region positions:
- For heavy chains:
[v_sequence_end, d_sequence_start)and[d_sequence_end, j_sequence_start) - For light chains:
[v_sequence_end, j_sequence_start)
- For heavy chains:
- Find all mutations whose positions fall within NP regions.
- Remove those mutations from the
mutationsdictionary. - Recalculate:
mutation_rate = len(remaining_mutations) / len(sequence)
Container Writes¶
| Field | Change |
|---|---|
mutations |
NP-region mutations removed |
mutation_rate |
Recalculated (lower or equal) |
Must run before artifact steps
Corruption, N-insertion, and indels modify the sequence without being biological mutations. If DistillMutationRate runs after them, positions will be shifted and the rate will be wrong.
CorruptSequenceBeginning¶
Problem It Solves¶
Real NGS data often has imperfect 5' ends due to PCR artifacts, primer degradation, or incomplete reverse transcription. This step simulates these artifacts by modifying the beginning of the sequence.
Algorithm¶
With probability probability (default 0.7), one of three corruption events is selected based on event_weights:
Event 1: Add Random Nucleotides¶
- Sample the number of bases to add from a Beta distribution scaled by
nucleotide_add_coefficient. - Generate a random nucleotide string of that length.
- Prepend it to the sequence.
- Shift all position annotations right by the added amount.
- Shift all mutation positions right.
Event 2: Remove Nucleotides¶
- Sample the number of bases to remove from a Beta distribution scaled by
nucleotide_remove_coefficient(capped at V region length). - Remove that many bases from the sequence start.
- Shift all position annotations left by the removed amount.
- Discard mutations that were in the removed region; shift the rest left.
- Look up V alleles that become equivalent given the now-larger effective V trim.
- Recalculate the mutation rate.
Event 3: Remove Then Add¶
- Perform the remove operation (as above).
- Sample a smaller number of bases to add (using
nucleotide_add_after_remove_coefficient). - Prepend random bases.
- Check if the added bases partially recreate the removed V sequence — if so, absorb the overlap back into V.
- Update equivalent V alleles based on the final effective trim.
After any corruption event, the step recalculates productivity (stop codons, reading frame, anchors) because the sequence may have changed in ways that affect these.
Container Writes¶
| Field | Change |
|---|---|
sequence |
Modified (bases added/removed at 5' end) |
corruption_event |
'add', 'remove', or 'remove_before_add' |
corruption_add_amount / corruption_remove_amount |
How many bases |
corruption_added_section / corruption_removed_section |
The actual bases |
| All position fields | Shifted left or right |
mutations |
Shifted; removed if in deleted region |
v_call |
Extended with equivalent alleles |
productive, stop_codon, vj_in_frame |
Recalculated |
EnforceSequenceLength¶
Problem It Solves¶
Sequencing platforms have fixed read lengths (e.g., Illumina MiSeq: ~576 bp for paired 2×300). Sequences longer than the read length need to be truncated.
Algorithm¶
If len(sequence) > max_length:
- Calculate
trim_amount = len(sequence) - max_length. - Remove
trim_amountbases from the 5' end (since sequencing reads from 5' to 3', 5' truncation simulates incomplete reads). - Shift all position annotations left by
trim_amount. - Remove mutations, Ns, and indels that fall in the trimmed region; shift the rest left.
- If trimming extends into the V region, adjust
v_germline_startto reflect the portion of V that was lost.
Container Writes¶
| Field | Change |
|---|---|
sequence |
Truncated to max_length |
| All position fields | Shifted left |
mutations, Ns, indels |
Entries removed or shifted |
v_germline_start |
Increased if V was trimmed |
InsertNs¶
Problem It Solves¶
Real sequencing data contains positions where the base caller could not determine the nucleotide with confidence. These positions are reported as 'N'. This step simulates that.
Algorithm¶
With probability probability (default 0.5):
- Calculate
num_replacements = int(len(sequence) * n_ratio). - Select that many random positions in the sequence.
- Replace each selected base with 'N'.
- Log each replacement in the
Nsdictionary as{position: "original_base > N"}. - Remove any mutations at N-replaced positions (since the original base is now unknown).
- V allele disambiguation: For N positions within the V region, map them back to germline positions and query the V_N_AMBIGUITY_CORRECTION_GRAPH. This graph identifies which V alleles become indistinguishable when specific discriminating positions are obscured by Ns. Add those alleles to
v_call. - Recalculate the mutation rate (counting Ns within the V-J region as additional uncertainty).
Container Writes¶
| Field | Change |
|---|---|
sequence |
Selected positions replaced with 'N' |
Ns |
Dictionary of N positions and original bases |
mutations |
Entries at N positions removed |
v_call |
Extended with N-ambiguous equivalent alleles |
mutation_rate |
Increased to account for N uncertainty |
InsertIndels¶
Problem It Solves¶
Sequencing and PCR errors can introduce small insertions and deletions. This step simulates those artifacts.
Algorithm¶
With probability probability (default 0.5):
- Find valid positions: All positions from 0 to
j_sequence_end, excluding:- NP junction regions (non-germline; indels there are not meaningful)
- Positions already occupied by Ns
- Positions already mutated
- Sample indel count: Uniformly from 1 to
max_indels. - For each indel (processed sequentially to handle shifting):
- Choose insertion or deletion based on
insertion_probability/deletion_probabilityweights. - Insertion: Pick a random nucleotide, insert at the position. Shift all downstream positions right by 1. Log as
"I < {base}". - Deletion: Record the deleted base, remove it from the sequence. Shift all downstream positions left by 1. Log as
"D > {base}". - Update mutation positions, N positions, and remaining indel positions to account for the shift.
- Choose insertion or deletion based on
- Recalculate productivity: Indels can introduce frameshifts that destroy the reading frame.
Container Writes¶
| Field | Change |
|---|---|
sequence |
Modified (bases inserted/deleted) |
indels |
Dictionary of {position: "I < X" or "D > X"} |
| All position fields | Shifted per indel |
mutations, Ns |
Positions shifted |
productive, stop_codon, vj_in_frame |
Recalculated |
ShortDValidation¶
Problem It Solves¶
D segments can be trimmed so aggressively that only a few bases remain. When the D region is very short (e.g., 1–4 bp), the D allele assignment is essentially random — many D alleles share such short subsequences. This step flags those cases.
Algorithm¶
- Calculate D length:
d_sequence_end - d_sequence_start. - If length <
short_d_length(default: 5 bp): replaced_callwith["Short-D"].
Container Writes¶
| Field | Change |
|---|---|
d_call |
Replaced with ["Short-D"] if D is too short |
Only for heavy chains and TCR-beta
Light chains have no D segment.
FilterTCRDJAmbiguities¶
Problem It Solves¶
In TCR-beta, there are biological constraints on which D and J families can appear together. After position corrections and allele equivalence expansion, the d_call list may contain D alleles from families that are incompatible with the selected J allele. This step removes those invalid combinations.
Algorithm¶
- Extract the family numbers from all J alleles in
j_call. - For each D allele in
d_call(except the original, which is always kept):- Extract its family number.
- If the family number is not compatible with any J family, remove it.
- Update
d_callwith the filtered list.
Container Writes¶
| Field | Change |
|---|---|
d_call |
Filtered to remove incompatible D-J family combinations |
TCR-beta only
This step is specific to T-cell receptor beta chains and should not be included in BCR pipelines.