Step-by-Step Tutorial: Building Your First Pipeline
This tutorial walks you through creating a GenAIRR simulation pipeline from scratch. Each section introduces one concept, explains why it matters, and shows the code.
Prerequisites: GenAIRR installed (pip install GenAIRR). See Installation & Quick Start if you haven't set up yet.
Step 1: Imports
| Import | What it is |
|---|---|
| Pipeline | The pipeline runner: takes a config and a list of steps and executes them in order |
| steps | Namespace containing all built-in augmentation steps |
| HUMAN_IGH_OGRDB | Pre-built DataConfig for human heavy chain (OGRDB germline set) |
| S5F | Context-dependent somatic hypermutation model |
Step 2: Choose a Mutation Model
A mutation model determines how nucleotide substitutions are introduced into the simulated sequence. This tutorial uses S5F(min_mutation_rate=0.02, max_mutation_rate=0.08).
What this does: For each simulated sequence, a mutation rate is sampled uniformly from [0.02, 0.08]. The S5F model then applies mutations at that rate, using 5-mer context-dependent substitution probabilities derived from empirical data.
Why S5F? Real somatic hypermutation is not uniform — certain motifs (like WRC/GYW hotspots) mutate at much higher rates. S5F captures this pattern. Use Uniform if you want a simpler null model:
from GenAIRR.mutation import Uniform
null_model = Uniform(min_mutation_rate=0.02, max_mutation_rate=0.08)
Choosing mutation rate ranges:
| Scenario | min_mutation_rate | max_mutation_rate |
|---|---|---|
| Naive B cells | 0.001 | 0.01 |
| Memory B cells | 0.02 | 0.08 |
| Plasma cells | 0.05 | 0.25 |
| Mixed repertoire | 0.003 | 0.25 |
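To make the rate sampling in this step concrete, here is a minimal standard-library sketch. It is not GenAIRR's S5F implementation: the function name apply_toy_shm, the 3x hotspot boost, and the WRC-only motif check are illustrative assumptions, whereas the real model uses empirically fitted 5-mer tables. The sketch draws one rate uniformly from the configured range and then picks mutation sites with weights that favor the hotspot motif.

```python
import random

# IUPAC codes used by the WRC hotspot motif: W = A/T, R = A/G.
W = set("AT")
R = set("AG")

def hotspot_weight(seq, i, boost=3.0):
    """Relative mutability of position i (toy rule, not the S5F table)."""
    if i >= 2 and seq[i] == "C" and seq[i - 1] in R and seq[i - 2] in W:
        return boost
    return 1.0

def apply_toy_shm(seq, min_rate, max_rate, rng):
    """Sample a rate uniformly, then mutate round(rate * len) distinct sites."""
    rate = rng.uniform(min_rate, max_rate)
    n_mut = round(rate * len(seq))
    weights = [hotspot_weight(seq, i) for i in range(len(seq))]
    positions = set()
    while len(positions) < n_mut:
        # Weighted draw; repeated picks are absorbed by the set.
        positions.add(rng.choices(range(len(seq)), weights=weights, k=1)[0])
    bases = list(seq)
    for pos in positions:
        # Substitute a different base so every chosen site really changes.
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return "".join(bases), rate, sorted(positions)

rng = random.Random(0)
germline = "AGCTAGCATGCAAGCTTACGGATCAGCTAGCTAACGTTAGC" * 5
mutated, rate, positions = apply_toy_shm(germline, 0.02, 0.08, rng)
print(f"sampled rate={rate:.3f}, mutations={len(positions)}")
```

Swapping hotspot_weight for a function that always returns 1.0 gives the uniform null model described above.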
Step 3: Create the Sequence Generation Step
The SimulateSequence step is always the first step in a pipeline. It performs V(D)J recombination and applies mutations.
Parameters:
- mutation_model: the model created in Step 2
- productive=True: restricts output to sequences that are in-frame and lack stop codons. Set to False to include non-productive rearrangements (roughly 2/3 of all rearrangements are non-productive in biology).
Productivity and performance
When productive=True, GenAIRR retries up to 25 times to find a valid rearrangement. At very high mutation rates (>0.2), this can slow generation. See Productive Sequences for details.
What happens internally:
- Selects a random V, D, and J allele from the config (weighted by empirical usage)
- Applies exonuclease trimming at segment junctions (from empirical distributions)
- Adds junctional nucleotides between segments (P- and N-additions)
- Concatenates segments into a complete sequence
- Applies somatic hypermutation at the sampled rate
- If productive=True, repeats (up to the retry limit) until the sequence is in-frame and free of stop codons
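The internal steps listed above can be pictured with a toy sketch. It uses only the standard library and is not GenAIRR's code: the mini germline strings, the trimming ranges, and the simplified productivity test (total length divisible by three, no stop codon in frame 0) are all assumptions made for illustration. The 25-attempt limit mirrors the retry count mentioned in the note above.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Hypothetical mini germline segments (not real alleles).
V = "ATGGCTCAGGTGCAGCTGGTGCAG"
D = "GGTATAGCAGCA"
J = "TACTGGGGCCAGGGAACC"

def random_bases(rng, max_len):
    """Random junction bases standing in for N-additions."""
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_len)))

def recombine(rng):
    """Trim segment ends, add junction bases, and concatenate."""
    v = V[: len(V) - rng.randint(0, 4)]                     # trim V 3' end
    d = D[rng.randint(0, 3): len(D) - rng.randint(0, 3)]    # trim both D ends
    j = J[rng.randint(0, 4):]                               # trim J 5' end
    return v + random_bases(rng, 5) + d + random_bases(rng, 5) + j

def is_productive(seq):
    """Simplified check: in-frame length and no stop codon in frame 0."""
    if len(seq) % 3 != 0:
        return False
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)

def simulate(rng, productive=True, max_attempts=25):
    """Retry recombination until the check passes or attempts run out."""
    for _ in range(max_attempts):
        seq = recombine(rng)
        if not productive or is_productive(seq):
            break
    return seq

rng = random.Random(1)
seq = simulate(rng)
print(seq, is_productive(seq))
```

Because the loop gives up after max_attempts, the last candidate is returned even if it failed the check, which is why the final line reports the productivity status instead of assuming it.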
Step 4: Add Position Correction Steps
After trimming, the exact boundaries between V, D, and J segments can become ambiguous — the same nucleotide sequence could correspond to multiple valid trim positions. These correction steps resolve that ambiguity to produce reliable ground-truth annotations.
fix_v = steps.FixVPositionAfterTrimmingIndexAmbiguity()
fix_d = steps.FixDPositionAfterTrimmingIndexAmbiguity()
fix_j = steps.FixJPositionAfterTrimmingIndexAmbiguity()
Why this matters: If you use simulated data to benchmark an alignment tool, the ground-truth segment boundaries must be unambiguous. Without these corrections, multiple trim positions could explain the same observed sequence, leading to inconsistent ground truth.
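A small worked example shows where the ambiguity comes from. The sketch below illustrates the general idea only and makes no claim about how the GenAIRR steps are implemented: when the first junction bases happen to match the germline bases that were trimmed off, the V segment could equally well be annotated as longer with a smaller trim. The helper name extend_v_end and the toy sequences are hypothetical.

```python
def extend_v_end(v_germline, v_trim_3, junction):
    """Count junction bases indistinguishable from the trimmed germline bases.

    v_germline: full germline V sequence
    v_trim_3:   number of bases removed from the V 3' end
    junction:   bases that follow the retained V bases in the final sequence
    Returns (extra, corrected_trim): how many junction bases match the
    germline, and the smallest trim consistent with the observed sequence.
    """
    trimmed_off = v_germline[len(v_germline) - v_trim_3:]
    extra = 0
    for germ_base, seen_base in zip(trimmed_off, junction):
        if germ_base != seen_base:
            break
        extra += 1
    return extra, v_trim_3 - extra

v_germline = "CAGGTGCAGCTGGTGA"
true_trim = 3                        # the simulator removed "TGA"
retained_v = v_germline[:-true_trim]
junction = "TGCC"                    # random bases; "TG" matches by chance
observed = retained_v + junction

extra, corrected_trim = extend_v_end(v_germline, true_trim, junction)
v_end = len(retained_v) + extra      # V end position after correction
print(extra, corrected_trim, v_end)  # prints: 2 1 15
```

An aligner looking at the observed sequence would see 15 germline-matching V bases, so a ground truth that reports a V end of 13 and a trim of 3 would disagree with any correct alignment.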
Light chains
For kappa and lambda light chains, omit FixDPositionAfterTrimmingIndexAmbiguity and CorrectForDTrims — light chains have no D segment.
Step 5: Add Biological Correction Steps
These steps apply further corrections based on biological constraints:
- CorrectForVEndCut(): adjusts the V-end position when the 3' end of the V segment was trimmed into the coding region
- CorrectForDTrims(): adjusts D segment boundaries after 5' and 3' trimming. Skip this for light chains.
Step 6: Record the Mutation Rate
DistillMutationRate computes and stores the final mutation rate in the SimulationContainer.
Step ordering
Place DistillMutationRate before any artifact steps (corruption, N-insertion, indels). Artifact steps modify the sequence in ways that are not biological mutations — measuring afterwards inflates the reported rate.
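The effect of the ordering can be checked with simple arithmetic. The sketch below is a toy calculation, not GenAIRR's bookkeeping: it assumes the rate is the fraction of positions that differ from the germline, and the sequences and positions are made up for the example.

```python
def mismatch_rate(reference, observed):
    """Fraction of aligned positions where the two sequences differ."""
    if len(reference) != len(observed):
        raise ValueError("sequences must have equal length")
    diffs = sum(1 for a, b in zip(reference, observed) if a != b)
    return diffs / len(reference)

germline = "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGC"   # 40 nt toy germline
# Two simulated somatic mutations (positions 5 and 20).
mutated = germline[:5] + "A" + germline[6:20] + "A" + germline[21:]
# An artifact step then masks three further positions with 'N'.
with_artifacts = mutated[:10] + "N" + mutated[11:30] + "NN" + mutated[32:]

rate_before = mismatch_rate(germline, mutated)
rate_after = mismatch_rate(germline, with_artifacts)
print(rate_before, rate_after)   # prints: 0.05 0.125
```

Two true mutations in 40 bases give 0.05; counting the three masked bases as well more than doubles the figure, which is the inflation the ordering rule avoids.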
Step 7: Add Sequencing Artifact Steps (Optional)
Real-world sequencing data contains imperfections. These steps simulate common NGS artifacts:
5' End Corruption
CorruptSequenceBeginning simulates 5' end degradation: the leading portion of the sequence may be truncated, extended with random nucleotides, or both. The event_weights tuple controls the probabilities of the three events, in this order: add random bases, remove bases, remove bases and then add random ones.
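The three-way choice can be sketched as follows. This is a standard-library illustration under stated assumptions, not the library's implementation: the function name corrupt_5prime, the max_len bound, and the example weights (0.4, 0.4, 0.2) are hypothetical; only the ordering of the weights follows the description above.

```python
import random

def corrupt_5prime(seq, event_weights, rng, max_len=10):
    """Apply one toy 5' corruption event chosen by weight.

    event_weights: relative weights for (add, remove, remove_then_add).
    Returns (corrupted_sequence, event_name).
    """
    event = rng.choices(["add", "remove", "remove_then_add"],
                        weights=event_weights, k=1)[0]
    if event in ("remove", "remove_then_add"):
        seq = seq[rng.randint(1, max_len):]        # drop leading bases
    if event in ("add", "remove_then_add"):
        n_added = rng.randint(1, max_len)
        prefix = "".join(rng.choice("ACGT") for _ in range(n_added))
        seq = prefix + seq                         # prepend random bases
    return seq, event

rng = random.Random(7)
original = "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGC"
for _ in range(3):
    corrupted, event = corrupt_5prime(original, (0.4, 0.4, 0.2), rng)
    print(event, corrupted)
```

In every case the 3' part of the read is untouched, so only the start of the V segment loses reliable sequence.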
Read Length Enforcement
EnforceSequenceLength truncates sequences exceeding max_length nucleotides (for example, max_length=400), simulating the fixed read-length limits of sequencing platforms.
Ambiguous Base Insertion
InsertNs(n_ratio=0.02, probability=0.5) acts on a sequence with probability 0.5 and, when it does, replaces about 2% of its bases with 'N' to simulate ambiguous base calls.
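The two numbers play different roles, which a short sketch makes visible. The code below is a standard-library approximation, not GenAIRR's implementation; the parameter names n_ratio and probability are borrowed from the InsertNs example in the Troubleshooting table, and the rounding rule is an assumption.

```python
import random

def insert_ns(seq, n_ratio, probability, rng):
    """With the given probability, replace round(n_ratio * len) bases with 'N'."""
    if rng.random() >= probability:
        return seq                       # sequence left untouched
    n_count = round(n_ratio * len(seq))
    bases = list(seq)
    for pos in rng.sample(range(len(seq)), n_count):
        bases[pos] = "N"
    return "".join(bases)

rng = random.Random(3)
seq = "ACGT" * 100                       # 400 nt toy read
masked = [insert_ns(seq, n_ratio=0.02, probability=0.5, rng=rng)
          for _ in range(1000)]
touched = sum(1 for s in masked if "N" in s)
print(f"{touched} of 1000 reads received N bases")
```

Here probability decides whether a read is affected at all, while n_ratio decides how many bases an affected read loses (8 of 400 in this example).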
Insertion/Deletion Errors
indels = steps.InsertIndels(
    probability=0.5,
    max_indels=5,
    insertion_probability=0.5,
    deletion_probability=0.5
)
Introduces random insertions and deletions to simulate PCR/sequencing errors.
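To see how these four parameters could interact, consider the toy version below. It reuses the parameter names from the InsertIndels call above but is otherwise an assumption: the real step may pick positions, counts, and inserted bases differently, and the function insert_indels is hypothetical.

```python
import random

def insert_indels(seq, probability, max_indels, insertion_probability,
                  deletion_probability, rng):
    """Toy indel step: returns (new_sequence, list of (kind, position))."""
    if len(seq) <= max_indels:
        raise ValueError("sequence too short for this toy example")
    events = []
    if rng.random() >= probability:
        return seq, events               # no indels for this sequence
    bases = list(seq)
    for _ in range(rng.randint(1, max_indels)):
        kind = rng.choices(
            ["insertion", "deletion"],
            weights=[insertion_probability, deletion_probability], k=1)[0]
        if kind == "insertion":
            pos = rng.randint(0, len(bases))   # may append at the end
            bases.insert(pos, rng.choice("ACGT"))
        else:
            pos = rng.randrange(len(bases))
            del bases[pos]
        events.append((kind, pos))
    return "".join(bases), events

rng = random.Random(11)
read = "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGC"
new_read, events = insert_indels(read, probability=0.5, max_indels=5,
                                 insertion_probability=0.5,
                                 deletion_probability=0.5, rng=rng)
print(len(read), len(new_read), events)
```

Each event shifts every downstream coordinate by one, which is why indels are applied after the position corrections and the mutation-rate measurement in the pipeline order used in this tutorial.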
Step 8: Assemble and Run the Pipeline
Combine all steps into a pipeline:
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        # Sequence generation
        steps.SimulateSequence(
            S5F(min_mutation_rate=0.02, max_mutation_rate=0.08),
            productive=True
        ),
        # Position corrections
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        # Biological corrections
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
        # Record mutation rate (before artifacts)
        steps.DistillMutationRate(),
        # Sequencing artifacts
        steps.CorruptSequenceBeginning(),
        steps.EnforceSequenceLength(),
        steps.InsertNs(),
        # Validation and errors
        steps.ShortDValidation(),
        steps.InsertIndels(),
    ]
)
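Execute the pipeline by calling pipeline.execute(), which returns a SimulationContainer.
Inspecting the Output
The SimulationContainer holds all simulation metadata. The snippet below assumes data is the dictionary obtained from the result with get_dict(), as in Step 9: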
# Sequence and alleles
print("Sequence:", data['sequence'][:60], "...")
print("V allele:", data['v_call'])
print("D allele:", data['d_call'])
print("J allele:", data['j_call'])
# Positions in the final sequence
print("V region:", data['v_sequence_start'], "-", data['v_sequence_end'])
print("D region:", data['d_sequence_start'], "-", data['d_sequence_end'])
print("J region:", data['j_sequence_start'], "-", data['j_sequence_end'])
# Mutation info
print("Mutation rate:", data['mutation_rate'])
print("Mutations:", data['mutations'])
print("Productive:", data['productive'])
Step 9: Generate a Dataset
To produce a batch of sequences:
import pandas as pd
sequences = []
for i in range(100):
    result = pipeline.execute()
    seq_dict = result.get_dict()
    seq_dict['id'] = f"seq_{i:04d}"
    sequences.append(seq_dict)
df = pd.DataFrame(sequences)
print(df[['id', 'v_call', 'd_call', 'j_call', 'mutation_rate']].head())
Export:
# CSV
df.to_csv('simulated_sequences.csv', index=False)
# FASTA
with open('simulated_sequences.fasta', 'w') as f:
    for _, row in df.iterrows():
        f.write(f">{row['id']}\n{row['sequence']}\n")
Variations
Light Chain Pipeline
Light chains lack D segments. Omit D-related steps:
from GenAIRR import HUMAN_IGK_OGRDB
kappa_pipeline = Pipeline(
    config=HUMAN_IGK_OGRDB,
    steps=[
        steps.SimulateSequence(
            S5F(min_mutation_rate=0.02, max_mutation_rate=0.08),
            productive=True
        ),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.DistillMutationRate(),
        steps.CorruptSequenceBeginning(),
        steps.EnforceSequenceLength(max_length=400),
        steps.InsertNs(),
        steps.InsertIndels(),
    ]
)
Specific Allele Selection
Force particular V, D, J alleles:
v_allele = HUMAN_IGH_OGRDB.v_alleles['IGHVF1-G1'][0]
d_allele = HUMAN_IGH_OGRDB.d_alleles['IGHD1-1'][0]
j_allele = HUMAN_IGH_OGRDB.j_alleles['IGHJ1'][0]
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        steps.SimulateSequence(
            S5F(min_mutation_rate=0.02, max_mutation_rate=0.08),
            productive=True,
            specific_v=v_allele,
            specific_d=d_allele,
            specific_j=j_allele
        ),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
    ]
)
Reproducible Output
from GenAIRR import set_seed
set_seed(42)
result = pipeline.execute()
# Same seed → same sequence every time
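The behaviour of seeding, including the "all sequences identical" pitfall listed under Troubleshooting, can be reproduced with Python's own random module. This sketch does not call GenAIRR and makes no claim about what set_seed does internally; toy_sequence is a hypothetical stand-in for one pipeline run.

```python
import random

def toy_sequence(length=30):
    """Stand-in for one pipeline run; uses the module-level generator."""
    return "".join(random.choice("ACGT") for _ in range(length))

# Seed once, then draw repeatedly: a reproducible but varied batch.
random.seed(42)
batch_a = [toy_sequence() for _ in range(3)]
random.seed(42)
batch_b = [toy_sequence() for _ in range(3)]

# Re-seed inside the loop: every draw restarts from the same state.
batch_c = []
for _ in range(3):
    random.seed(42)
    batch_c.append(toy_sequence())

print(batch_a == batch_b)     # same seed, same batch
print(len(set(batch_c)))      # distinct sequences when re-seeding each time
```

Seeding once before a batch gives a repeatable dataset; seeding before every call collapses the batch to copies of a single sequence.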
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| TypeError on step constructors | Using positional args | Use keyword arguments: InsertNs(n_ratio=0.02, probability=0.5) |
| All sequences identical | Seed set without reset | Call reset_seed() or use different seeds |
| KeyError on allele name | Wrong allele key format | Use OGRDB family keys (e.g., IGHVF1-G1), not IMGT names |
| Non-productive sequences | productive=False | Set productive=True in SimulateSequence |
Next Steps
- How the Pipeline Works — Understand the architecture in detail
- Parameter Reference — Complete parameter documentation
- Best Practices — Guidelines for realistic simulations
- Biological Context — The immunobiology behind the simulation