Installation & Quick Start
This guide covers installing GenAIRR and running your first simulation. By the end, you will have generated a realistic immunoglobulin heavy chain sequence with somatic hypermutation.
Installation
Install the latest stable release from PyPI:
pip install GenAIRR
GenAIRR requires Python 3.9+ and depends on NumPy, SciPy, and pandas.
Virtual environment recommended
Install GenAIRR inside a virtual environment (venv or conda) to avoid dependency conflicts with other packages.
To verify the installation, confirm that the package imports without error:
python -c "import GenAIRR"
Core Concepts in 60 Seconds
GenAIRR has four core building blocks:
| Component | Purpose |
|---|---|
| DataConfig | Holds germline allele sets, trimming distributions, and empirical data for a species/chain type |
| Pipeline | Executes an ordered list of steps against a config to produce a simulated sequence |
| Steps | Individual transformations — sequence generation, correction, artifact injection |
| SimulationContainer | The output object carrying the sequence, annotations, and metadata |
Their relationship:
DataConfig ──► Pipeline(config, steps) ──► SimulationContainer
                   │
                   ├── SimulateSequence
                   ├── Fix...Ambiguity
                   ├── CorrectFor...
                   └── InsertIndels, InsertNs, ...
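The relationship in the diagram can be pictured with a small stand-in written in plain Python. This is only a sketch of the general pattern (an ordered list of steps that each update a shared container), not GenAIRR's implementation; ToyContainer, ToyPipeline, and the two step functions are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyContainer:
    """Hypothetical stand-in for the output object: a sequence plus metadata."""
    sequence: str = ""
    metadata: Dict[str, object] = field(default_factory=dict)


# A step is any callable that receives the config and updates the container in place.
Step = Callable[[Dict[str, str], ToyContainer], None]


def assemble(config: Dict[str, str], container: ToyContainer) -> None:
    """Toy generation step: concatenate the configured V, D, and J segments."""
    container.sequence = config["v"] + config["d"] + config["j"]


def record_length(config: Dict[str, str], container: ToyContainer) -> None:
    """Toy measurement step: annotate the container with the sequence length."""
    container.metadata["length"] = len(container.sequence)


class ToyPipeline:
    """Runs steps in the order given; later steps see the effects of earlier ones."""

    def __init__(self, config: Dict[str, str], steps: List[Step]):
        self.config = config
        self.steps = steps

    def execute(self) -> ToyContainer:
        container = ToyContainer()
        for step in self.steps:
            step(self.config, container)
        return container


# Made-up segment strings, far shorter than real germline alleles.
toy_config = {"v": "CAGGTG", "d": "GGTAC", "j": "TGGGGC"}
toy_result = ToyPipeline(toy_config, [assemble, record_length]).execute()
print(toy_result.sequence, toy_result.metadata)
```

Swapping the order of the two steps in this toy would record a length of 0, which is why step ordering matters in a design like this.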
Your First Simulation
One-liner with simulate()
The fastest way to generate a sequence:
from GenAIRR import simulate, HUMAN_IGH_OGRDB, S5F
result = simulate(HUMAN_IGH_OGRDB, S5F(0.01, 0.05))
print(result.sequence)
simulate() creates a minimal pipeline internally — it runs SimulateSequence and FixVPositionAfterTrimmingIndexAmbiguity, then returns a SimulationContainer.
When to use simulate() vs Pipeline
Use simulate() for quick exploratory work. Switch to an explicit Pipeline when you need artifact simulation, custom step ordering, or measurement steps like DistillMutationRate.
Parameters:
- config: a DataConfig instance (e.g., HUMAN_IGH_OGRDB for human heavy chain)
- mutation_model: an S5F or Uniform instance specifying the mutation rate range
- productive: if True (default), only generates in-frame sequences without stop codons
- n: number of sequences to generate (default: 1; returns a list when n > 1)
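To make the productive criterion concrete, here is a deliberately simplified check in plain Python. It assumes the reading frame starts at the first base and treats "in-frame" as "length divisible by three"; GenAIRR's own check may well be defined differently (for example, relative to the junction), so the function name looks_productive and its logic are illustrative assumptions only.

```python
# The three stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_productive(sequence: str) -> bool:
    """Simplified check: length is a multiple of 3 and frame 0 has no stop codon."""
    seq = sequence.upper()
    if len(seq) % 3 != 0:
        return False  # a partial trailing codon means the sequence is out of frame
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)


print(looks_productive("ATGGCCTGG"))  # in frame, no stop codon
print(looks_productive("ATGTAAGCC"))  # in frame, but contains TAA
print(looks_productive("ATGGC"))      # length not divisible by 3
```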
Generating multiple sequences
results = simulate(HUMAN_IGH_OGRDB, S5F(0.01, 0.05), n=100)
print(f"Generated {len(results)} sequences")
Building an Explicit Pipeline
For full control, create a Pipeline with your choice of steps:
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F
pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        steps.SimulateSequence(
            S5F(min_mutation_rate=0.01, max_mutation_rate=0.05),
            productive=True
        ),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
        steps.DistillMutationRate(),
    ]
)
result = pipeline.execute()
data = result.get_dict()
Each call to pipeline.execute() produces one SimulationContainer, and get_dict() returns its annotations as a dictionary:
print(data['sequence'][:60]) # nucleotide sequence
print(data['v_call']) # V allele used
print(data['d_call']) # D allele used
print(data['j_call']) # J allele used
print(data['mutation_rate']) # fraction of mutated positions
print(data['mutations']) # dict of {position: "X>Y"} changes
print(data['productive']) # True if in-frame, no stop codons
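The mutations entry pairs positions with "X>Y" strings. The sketch below shows one way such a record could be used, under two assumptions that the text above does not confirm: positions are 0-based indices into the final sequence, and X is the original base while Y is the mutated one. The helper names and the toy data are hypothetical, and the sketch ignores indels, which would shift positions.

```python
from typing import Dict


def revert_mutations(sequence: str, mutations: Dict[int, str]) -> str:
    """Undo each recorded substitution to recover the pre-mutation sequence."""
    bases = list(sequence)
    for position, change in mutations.items():
        original, mutated = change.split(">")
        if bases[position] != mutated:
            # The record disagrees with the sequence; the indexing assumption is wrong.
            raise ValueError(f"position {position}: expected {mutated}, found {bases[position]}")
        bases[position] = original
    return "".join(bases)


def mutated_fraction(sequence: str, mutations: Dict[int, str]) -> float:
    """Fraction of positions that carry a recorded substitution."""
    return len(mutations) / len(sequence)


# Made-up example data, not GenAIRR output.
toy_sequence = "ACGTTGCA"
toy_mutations = {2: "A>G", 5: "C>G"}

print(revert_mutations(toy_sequence, toy_mutations))
print(mutated_fraction(toy_sequence, toy_mutations))
```

The consistency check inside revert_mutations is a cheap way to find out early whether the indexing assumption holds for real output.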
Available Data Configurations
GenAIRR ships with pre-built configs derived from the OGRDB and IMGT germline databases:
| Import name | Chain type | Species | Segments |
|---|---|---|---|
| HUMAN_IGH_OGRDB | Heavy (IGH) | Human | V, D, J |
| HUMAN_IGH_EXTENDED | Heavy (IGH) | Human | V, D, J (extended set) |
| HUMAN_IGK_OGRDB | Kappa light (IGK) | Human | V, J |
| HUMAN_IGL_OGRDB | Lambda light (IGL) | Human | V, J |
| HUMAN_TCRB_IMGT | TCR Beta (TRB) | Human | V, D, J |
Import them directly by name:
from GenAIRR import HUMAN_IGH_OGRDB, HUMAN_IGK_OGRDB, HUMAN_TCRB_IMGT
Available Mutation Models
S5F — Context-Dependent Mutation
The S5F model captures the context-dependent substitution patterns observed in real somatic hypermutation. Mutation probability at each position depends on the surrounding 5-mer motif.
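The idea of 5-mer context dependence can be sketched in a few lines of plain Python. The weights in TOY_MUTABILITY are invented for illustration and are not S5F values, and the function names are hypothetical; the sketch only shows how a per-position probability could be derived from the 5-mer centered on each base, not how GenAIRR implements the model.

```python
from typing import Dict, List

# Invented weights for two 5-mers; every other context falls back to DEFAULT_WEIGHT.
TOY_MUTABILITY: Dict[str, float] = {"AAGCT": 3.0, "TTACG": 0.5}
DEFAULT_WEIGHT = 1.0


def fivemer_contexts(sequence: str) -> List[str]:
    """Return the 5-mer centered on each position, padding the ends with N."""
    padded = "NN" + sequence.upper() + "NN"
    return [padded[i:i + 5] for i in range(len(sequence))]


def position_probabilities(sequence: str,
                           table: Dict[str, float] = TOY_MUTABILITY) -> List[float]:
    """Normalize per-context weights into a probability for each position."""
    weights = [table.get(context, DEFAULT_WEIGHT) for context in fivemer_contexts(sequence)]
    total = sum(weights)
    return [weight / total for weight in weights]


print(fivemer_contexts("AAGCT"))
print(position_probabilities("AAGCT"))
```

In this toy, the middle base of AAGCT is three times as likely to be picked as any other position, which is the kind of hotspot behavior a context-dependent model is meant to capture.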
Uniform — Position-Independent Mutation
Each position has equal probability of mutation, regardless of sequence context. Useful for null-model comparisons.
from GenAIRR.mutation import Uniform
model = Uniform(min_mutation_rate=0.01, max_mutation_rate=0.05)
Choosing mutation rates — the min_mutation_rate and max_mutation_rate define a range; each simulated sequence samples a rate uniformly from this range. Typical ranges:
| Cell type | Mutation rate range |
|---|---|
| Naive B cells | 0.001 – 0.01 |
| Memory B cells | 0.02 – 0.08 |
| Plasma cells | 0.05 – 0.25 |
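As a rough illustration of what sampling a rate uniformly from the range means, the following sketch draws rates from the memory B cell range with Python's standard random module and converts each to an approximate mutation count. The function name sample_mutation_rate and the sequence length of 350 nucleotides are assumptions made for this example; GenAIRR's internal sampling code is not shown in this guide.

```python
import random


def sample_mutation_rate(min_rate: float, max_rate: float, rng: random.Random) -> float:
    """Draw one per-sequence mutation rate uniformly from [min_rate, max_rate]."""
    return rng.uniform(min_rate, max_rate)


rng = random.Random(0)  # local generator, so the global random state is untouched
SEQUENCE_LENGTH = 350   # assumed length, chosen only for illustration

rates = [sample_mutation_rate(0.02, 0.08, rng) for _ in range(5)]
expected_counts = [round(rate * SEQUENCE_LENGTH) for rate in rates]

for rate, count in zip(rates, expected_counts):
    print(f"rate={rate:.4f} -> about {count} mutated positions")
```

With this range and length, every draw corresponds to between 7 and 28 mutated positions.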
Exporting Results
To pandas DataFrame
import pandas as pd
sequences = [pipeline.execute().get_dict() for _ in range(100)]
df = pd.DataFrame(sequences)
df.to_csv('simulated_sequences.csv', index=False)
To FASTA
with open('sequences.fasta', 'w') as f:
    for i, seq in enumerate(sequences):
        f.write(f">seq_{i:04d}\n{seq['sequence']}\n")
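A quick way to sanity-check an export like this is to read the file back and compare. The sketch below does a round trip in memory with io.StringIO so it runs without touching disk; write_fasta, read_fasta, and toy_records are hypothetical names, and the records are made-up dictionaries with the same 'sequence' key used above.

```python
import io
from typing import Dict, List, TextIO


def write_fasta(records: List[Dict[str, str]], handle: TextIO) -> None:
    """Write one header line and one sequence line per record."""
    for i, record in enumerate(records):
        handle.write(f">seq_{i:04d}\n{record['sequence']}\n")


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, joining wrapped sequence lines."""
    parsed: Dict[str, str] = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            parsed[current] = ""
        elif current is not None:
            parsed[current] += line
    return parsed


# Made-up records standing in for simulated output.
toy_records = [{"sequence": "ACGTACGT"}, {"sequence": "GGCCTTAA"}]

buffer = io.StringIO()
write_fasta(toy_records, buffer)
buffer.seek(0)  # rewind before reading the same buffer back
round_trip = read_fasta(buffer)
print(round_trip)
```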
Reproducibility
GenAIRR provides seed management for deterministic output:
from GenAIRR import set_seed, get_seed, reset_seed
set_seed(42)
result_a = pipeline.execute()
set_seed(42)
result_b = pipeline.execute()
assert result_a.sequence == result_b.sequence # identical
- set_seed(n): fix the global random state
- get_seed(): retrieve the current seed value
- reset_seed(): clear the seed, restoring non-deterministic behavior
Remember to reset
If you set a seed for a reproducibility check, call reset_seed() afterwards to restore non-deterministic behavior for production runs.
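One way to avoid forgetting the reset is to scope the seed to a block of code. The sketch below demonstrates that pattern with Python's standard random module only; it does not call GenAIRR, and temporary_seed and draw_bases are hypothetical helpers. Whether GenAIRR's random state is the same as the standard library's global state is not stated in this guide, so treat this as an analogy for the set-then-reset discipline rather than a drop-in wrapper.

```python
import random
from contextlib import contextmanager


@contextmanager
def temporary_seed(seed: int):
    """Seed the global generator inside the block, then restore the previous state."""
    saved_state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        # Runs even if the block raises, so the seed never leaks out.
        random.setstate(saved_state)


def draw_bases(n: int) -> str:
    """Toy stand-in for a simulation call: n random nucleotides."""
    return "".join(random.choice("ACGT") for _ in range(n))


with temporary_seed(42):
    first = draw_bases(20)
with temporary_seed(42):
    second = draw_bases(20)

print(first == second)
```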
Next Steps
- Step-by-Step Tutorial — Detailed walkthrough of building a full pipeline with explanations for each step
- How the Pipeline Works — Architecture deep-dive into DataConfig, Steps, and SimulationContainer
- Biological Context — The immunobiology behind GenAIRR's simulation model
- Parameter Reference — Complete parameter documentation for every step