Step Deep-Dives¶

This page explains the internal logic of every built-in pipeline step — what problem it solves, how it works algorithmically, and what data it reads from and writes to the SimulationContainer.

For parameter details and constructor signatures, see the API Reference and Parameter Reference.

SimulateSequence¶

Problem It Solves¶

Creates an immunoglobulin or TCR sequence from scratch by simulating V(D)J recombination with somatic hypermutation. This is the foundational step — every pipeline starts here.

Algorithm¶

Select alleles: Choose V, D (if applicable), J, and C alleles from the DataConfig. Selection is weighted by empirical gene usage frequencies, unless specific alleles are provided via specific_v, specific_d, specific_j.
Trim segments: For each segment, sample trim amounts from the DataConfig's trimming distributions:
- V: trimmed at the 3' end only
- D: trimmed independently at both 5' and 3' ends (validated so combined trims don't exceed allele length)
- J: trimmed at the 5' end only
Generate NP junctions: Create non-templated nucleotide regions between segments using a first-order Markov chain model:
- Sample the NP region length from NP_lengths
- Sample the first base from NP_first_bases
- Generate subsequent bases using position-specific transition probabilities from NP_transitions
- NP1: between V and D (or V and J for light chains)
- NP2: between D and J (heavy chains and TCR-beta only)
Assemble: Concatenate trimmed-V + NP1 + trimmed-D + NP2 + trimmed-J into the complete sequence.
Validate: Run the FunctionalityValidator to check reading frame, stop codons, and anchor residues.
Retry (if productive=True): If the sequence is non-functional, generate a completely new random sequence. Retry up to 25 times. If all fail, issue a warning and proceed.
Apply mutations: Pass the assembled sequence to the mutation model (S5F or Uniform), which introduces substitutions at the sampled rate. If the mutation model has productive=True, mutations that would create stop codons or destroy anchors are rejected.
Re-validate: Run the validator again on the mutated sequence to update productivity flags.

Container Writes¶

Field	Value
`sequence`	The mutated nucleotide sequence
`v_call`, `d_call`, `j_call`, `c_call`	Lists containing the selected allele(s)
`v_sequence_start/end`, `d_sequence_start/end`, `j_sequence_start/end`	Segment positions in the final sequence
`v_germline_start/end`, `d_germline_start/end`, `j_germline_start/end`	Positions within the original germline allele
`v_trim_5/3`, `d_trim_5/3`, `j_trim_5/3`	Trim amounts applied
`junction_sequence_start/end`	CDR3 region boundaries
`mutations`	Dictionary of `{position: "X>Y"}`
`mutation_rate`	Fraction of mutated positions
`productive`, `stop_codon`, `vj_in_frame`, `note`	Validation results

FixVPositionAfterTrimmingIndexAmbiguity¶

Problem It Solves¶

When the V segment is trimmed at its 3' end, random nucleotides are added in the NP1 junction. By chance, some of those NP nucleotides may be identical to the bases that were trimmed away. This makes the true V/NP boundary ambiguous — an alignment tool cannot tell where V ends and NP begins.

This step resolves the ambiguity by expanding the V boundary to absorb any matching bases.

Algorithm¶

Extract the NP region immediately after the current V end (V_end to D_start, or V_end to J_start for light chains).
Retrieve the trimmed-off portion of the V germline (the last v_trim_3 bases of the V reference).
Walk through the trimmed region character by character, comparing with the NP region:
- If the first NP base matches the last trimmed base, expand V by 1 position.
- Continue comparing until a mismatch is found or one sequence is exhausted.
Update v_sequence_end, v_germline_end, and v_trim_3 to reflect the expansion.

Example¶

Before:  V segment ends at position 280, v_trim_3 = 5
         Trimmed V tail:  ...ACGTC
         NP region start: ACGT...

         The first 4 NP bases (ACGT) match the trimmed tail.

After:   V segment ends at position 284, v_trim_3 = 1
         The 4 matching bases are now attributed to V, not NP.

Why this matters for benchmarking

If ground-truth V boundaries don't account for this ambiguity, an alignment tool that correctly identifies the overlap would appear to disagree with the ground truth — producing a false-negative evaluation.

FixDPositionAfterTrimmingIndexAmbiguity¶

Problem It Solves¶

Same concept as the V fix, but applied to both ends of the D segment. D segments are short (often 10–40 bp) and heavily trimmed, so NP nucleotides frequently match the trimmed D sequence on either side.

Algorithm¶

Extract NP1 (V_end to D_start) and NP2 (D_end to J_start).
Retrieve the trimmed portions:
- 5' side: first d_trim_5 bases of D reference
- 3' side: last d_trim_3 bases of D reference
For the 5' end: Walk backwards through the trimmed-5 region and NP1, comparing from the end. Each match moves D_start one position earlier.
For the 3' end: Walk forward through the trimmed-3 region and NP2. Each match moves D_end one position later.
Update d_sequence_start/end, d_germline_start/end, and d_trim_5/3.

Only for heavy chains and TCR-beta

Light chains have no D segment. Skip this step for kappa and lambda chains.

FixJPositionAfterTrimmingIndexAmbiguity¶

Problem It Solves¶

Same concept applied to the J segment's 5' end. After J is trimmed at the 5' end, NP nucleotides before J may match the trimmed-away J sequence.

Algorithm¶

Extract the NP region before J (D_end to J_start, or V_end to J_start for light chains).
Retrieve the trimmed 5' portion of J (first j_trim_5 bases of J reference).
Walk backwards through both sequences comparing characters. Each match moves J_start one position earlier.
Update j_sequence_start, j_germline_start, and j_trim_5.

CorrectForVEndCut¶

Problem It Solves¶

After trimming, the remaining V sequence may be identical across multiple V alleles. For example, if two V alleles differ only in their last 10 bases and 12 bases are trimmed, both alleles produce the same observed sequence. The original V allele is no longer uniquely identifiable.

This step documents all V alleles that are indistinguishable given the observed trim amount.

Algorithm¶

Load the V_3_TRIM_SIMILARITY_MAP from the DataConfig. This precomputed map stores, for each V allele and trim amount, the list of other V alleles that become identical.
Look up the current V allele (v_call[0]) and current v_trim_3 amount.
Retrieve all equivalent alleles from the map.
Add them to v_call, keeping the original allele first.

Container Writes¶

Field	Change
`v_call`	Extended from `[original]` to `[original, equivalent1, equivalent2, ...]`

This doesn't change the sequence

This step only updates metadata. The sequence itself is unchanged.

CorrectForDTrims¶

Problem It Solves¶

Same concept as CorrectForVEndCut, but for D alleles. Since D segments are trimmed from both ends, the similarity lookup uses a 2D key: the (d_trim_5, d_trim_3) tuple.

Algorithm¶

Load the D_5_3_TRIM_SIMILARITY_MAP from the DataConfig.
Look up the current D allele with both trim amounts as a tuple key.
Add all equivalent alleles to d_call.

Only for heavy chains and TCR-beta

Skip for light chains.

DistillMutationRate¶

Problem It Solves¶

The raw mutation dictionary from SimulateSequence includes mutations at every position — including NP junctions. But NP regions are non-templated random nucleotides with no germline reference, so "mutations" there are meaningless. This step removes them and recalculates a clean mutation rate.

Algorithm¶

Identify NP region positions:
- For heavy chains: [v_sequence_end, d_sequence_start) and [d_sequence_end, j_sequence_start)
- For light chains: [v_sequence_end, j_sequence_start)
Find all mutations whose positions fall within NP regions.
Remove those mutations from the mutations dictionary.
Recalculate: mutation_rate = len(remaining_mutations) / len(sequence)

Container Writes¶

Field	Change
`mutations`	NP-region mutations removed
`mutation_rate`	Recalculated (lower or equal)

Must run before artifact steps

Corruption, N-insertion, and indels modify the sequence without being biological mutations. If DistillMutationRate runs after them, positions will be shifted and the rate will be wrong.

CorruptSequenceBeginning¶

Problem It Solves¶

Real NGS data often has imperfect 5' ends due to PCR artifacts, primer degradation, or incomplete reverse transcription. This step simulates these artifacts by modifying the beginning of the sequence.

Algorithm¶

With probability probability (default 0.7), one of three corruption events is selected based on event_weights:

Event 1: Add Random Nucleotides¶

Sample the number of bases to add from a Beta distribution scaled by nucleotide_add_coefficient.
Generate a random nucleotide string of that length.
Prepend it to the sequence.
Shift all position annotations right by the added amount.
Shift all mutation positions right.

Event 2: Remove Nucleotides¶

Sample the number of bases to remove from a Beta distribution scaled by nucleotide_remove_coefficient (capped at V region length).
Remove that many bases from the sequence start.
Shift all position annotations left by the removed amount.
Discard mutations that were in the removed region; shift the rest left.
Look up V alleles that become equivalent given the now-larger effective V trim.
Recalculate the mutation rate.

Event 3: Remove Then Add¶

Perform the remove operation (as above).
Sample a smaller number of bases to add (using nucleotide_add_after_remove_coefficient).
Prepend random bases.
Check if the added bases partially recreate the removed V sequence — if so, absorb the overlap back into V.
Update equivalent V alleles based on the final effective trim.

After any corruption event, the step recalculates productivity (stop codons, reading frame, anchors) because the sequence may have changed in ways that affect these.

Container Writes¶

Field	Change
`sequence`	Modified (bases added/removed at 5' end)
`corruption_event`	`'add'`, `'remove'`, or `'remove_before_add'`
`corruption_add_amount` / `corruption_remove_amount`	How many bases
`corruption_added_section` / `corruption_removed_section`	The actual bases
All position fields	Shifted left or right
`mutations`	Shifted; removed if in deleted region
`v_call`	Extended with equivalent alleles
`productive`, `stop_codon`, `vj_in_frame`	Recalculated

EnforceSequenceLength¶

Problem It Solves¶

Sequencing platforms have fixed read lengths (e.g., Illumina MiSeq: ~576 bp for paired 2×300). Sequences longer than the read length need to be truncated.

Algorithm¶

If len(sequence) > max_length:

Calculate trim_amount = len(sequence) - max_length.
Remove trim_amount bases from the 5' end (since sequencing reads from 5' to 3', 5' truncation simulates incomplete reads).
Shift all position annotations left by trim_amount.
Remove mutations, Ns, and indels that fall in the trimmed region; shift the rest left.
If trimming extends into the V region, adjust v_germline_start to reflect the portion of V that was lost.

Container Writes¶

Field	Change
`sequence`	Truncated to `max_length`
All position fields	Shifted left
`mutations`, `Ns`, `indels`	Entries removed or shifted
`v_germline_start`	Increased if V was trimmed

InsertNs¶

Problem It Solves¶

Real sequencing data contains positions where the base caller could not determine the nucleotide with confidence. These positions are reported as 'N'. This step simulates that.

Algorithm¶

With probability probability (default 0.5):

Calculate num_replacements = int(len(sequence) * n_ratio).
Select that many random positions in the sequence.
Replace each selected base with 'N'.
Log each replacement in the Ns dictionary as {position: "original_base > N"}.
Remove any mutations at N-replaced positions (since the original base is now unknown).
V allele disambiguation: For N positions within the V region, map them back to germline positions and query the V_N_AMBIGUITY_CORRECTION_GRAPH. This graph identifies which V alleles become indistinguishable when specific discriminating positions are obscured by Ns. Add those alleles to v_call.
Recalculate the mutation rate (counting Ns within the V-J region as additional uncertainty).

Container Writes¶

Field	Change
`sequence`	Selected positions replaced with 'N'
`Ns`	Dictionary of N positions and original bases
`mutations`	Entries at N positions removed
`v_call`	Extended with N-ambiguous equivalent alleles
`mutation_rate`	Increased to account for N uncertainty

InsertIndels¶

Problem It Solves¶

Sequencing and PCR errors can introduce small insertions and deletions. This step simulates those artifacts.

Algorithm¶

With probability probability (default 0.5):

Find valid positions: All positions from 0 to j_sequence_end, excluding:
- NP junction regions (non-germline; indels there are not meaningful)
- Positions already occupied by Ns
- Positions already mutated
Sample indel count: Uniformly from 1 to max_indels.
For each indel (processed sequentially to handle shifting):
- Choose insertion or deletion based on insertion_probability / deletion_probability weights.
- Insertion: Pick a random nucleotide, insert at the position. Shift all downstream positions right by 1. Log as "I < {base}".
- Deletion: Record the deleted base, remove it from the sequence. Shift all downstream positions left by 1. Log as "D > {base}".
- Update mutation positions, N positions, and remaining indel positions to account for the shift.
Recalculate productivity: Indels can introduce frameshifts that destroy the reading frame.

Container Writes¶

Field	Change
`sequence`	Modified (bases inserted/deleted)
`indels`	Dictionary of `{position: "I < X" or "D > X"}`
All position fields	Shifted per indel
`mutations`, `Ns`	Positions shifted
`productive`, `stop_codon`, `vj_in_frame`	Recalculated

ShortDValidation¶

Problem It Solves¶

D segments can be trimmed so aggressively that only a few bases remain. When the D region is very short (e.g., 1–4 bp), the D allele assignment is essentially random — many D alleles share such short subsequences. This step flags those cases.

Algorithm¶

Calculate D length: d_sequence_end - d_sequence_start.
If length < short_d_length (default: 5 bp): replace d_call with ["Short-D"].

Container Writes¶

Field	Change
`d_call`	Replaced with `["Short-D"]` if D is too short

Only for heavy chains and TCR-beta

Light chains have no D segment.

FilterTCRDJAmbiguities¶

Problem It Solves¶

In TCR-beta, there are biological constraints on which D and J families can appear together. After position corrections and allele equivalence expansion, the d_call list may contain D alleles from families that are incompatible with the selected J allele. This step removes those invalid combinations.

Algorithm¶

Extract the family numbers from all J alleles in j_call.
For each D allele in d_call (except the original, which is always kept):
- Extract its family number.
- If the family number is not compatible with any J family, remove it.
Update d_call with the filtered list.

Container Writes¶

Field	Change
`d_call`	Filtered to remove incompatible D-J family combinations

TCR-beta only

This step is specific to T-cell receptor beta chains and should not be included in BCR pipelines.