Quick Start Guide to GenAIRR¶
Welcome to the Quick Start Guide for GenAIRR, a Python module designed for generating synthetic Adaptive Immune Receptor Repertoire (AIRR) sequences. This guide will walk you through the basic usage of GenAIRR, including setting up your environment, simulating heavy and light chain sequences, and customizing your simulations.
Installation¶
Before you begin, ensure that you have Python 3.x installed on your system. GenAIRR can be installed using pip:
pip install GenAIRR
import pandas as pd
Setting Up¶
To start using GenAIRR, import the core classes. The main entry point is the Pipeline class, which takes a data configuration and a list of processing steps.
from GenAIRR import (
Pipeline, steps,
S5F, Uniform,
HUMAN_IGH_OGRDB, HUMAN_IGK_OGRDB, HUMAN_IGL_OGRDB,
set_seed, get_seed, reset_seed,
)
# Built-in data configurations (pre-processed germline gene databases)
# HUMAN_IGH_OGRDB - Heavy chain (V, D, J, C segments)
# HUMAN_IGK_OGRDB - Kappa light chain (V, J, C segments)
# HUMAN_IGL_OGRDB - Lambda light chain (V, J, C segments)
Simulating Heavy Chain Sequences¶
Let's simulate a BCR heavy chain sequence using the Pipeline. Each pipeline takes a config (the germline database) and a list of steps to apply.
# Create the simulation pipeline
pipeline = Pipeline(
config=HUMAN_IGH_OGRDB,
steps=[
# 1. CORE SIMULATION: Generate a V(D)J recombined sequence with somatic hypermutation
steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True),
# 2. POSITION CORRECTIONS: Fix ambiguities from trimming during V(D)J recombination
steps.FixVPositionAfterTrimmingIndexAmbiguity(),
steps.FixDPositionAfterTrimmingIndexAmbiguity(),
steps.FixJPositionAfterTrimmingIndexAmbiguity(),
# 3. BIOLOGICAL CORRECTIONS: Model natural trimming processes
steps.CorrectForVEndCut(),
steps.CorrectForDTrims(),
# 4. MUTATION RATE: Calculate final mutation rate before adding artifacts
steps.DistillMutationRate(),
# 5. SEQUENCING ARTIFACTS: Model real-world sequencing issues
steps.CorruptSequenceBeginning(probability=0.7, event_weights=(0.4, 0.4, 0.2)),
steps.EnforceSequenceLength(max_length=576),
# 6. AMBIGUOUS BASES: Add 'N' calls (unreadable bases)
steps.InsertNs(n_ratio=0.02, probability=0.5),
# 7. QUALITY CONTROL: Flag sequences with very short D segments
steps.ShortDValidation(),
# 8. STRUCTURAL VARIANTS: Add insertions and deletions
steps.InsertIndels(probability=0.5, max_indels=5, insertion_probability=0.5, deletion_probability=0.5),
]
)
# Simulate a heavy chain sequence
heavy_sequence = pipeline.execute()
# Print the simulated heavy chain sequence with all metadata
print("Simulated Heavy Chain Sequence:", heavy_sequence.get_dict())
Understanding the Output¶
The simulation returns a SimulationContainer object with rich metadata. Let's examine the key fields:
result_dict = heavy_sequence.get_dict()
print("=== KEY OUTPUT FIELDS ===")
print(f"Sequence: {result_dict['sequence'][:50]}... (length: {len(result_dict['sequence'])})")
print(f"V allele used: {result_dict['v_call']}")
print(f"D allele used: {result_dict['d_call']}")
print(f"J allele used: {result_dict['j_call']}")
print(f"Is productive: {result_dict['productive']}")
print(f"Mutation rate: {result_dict['mutation_rate']:.3f}")
print(f"Number of mutations: {len(result_dict['mutations'])}")
print(f"Number of N bases: {len(result_dict['Ns'])}")
print(f"Number of indels: {len(result_dict['indels'])}")
print("\n=== SEQUENCE REGIONS ===")
print(f"V region: positions {result_dict['v_sequence_start']}-{result_dict['v_sequence_end']}")
print(f"D region: positions {result_dict['d_sequence_start']}-{result_dict['d_sequence_end']}")
print(f"J region: positions {result_dict['j_sequence_start']}-{result_dict['j_sequence_end']}")
print(f"Junction (CDR3): positions {result_dict['junction_sequence_start']}-{result_dict['junction_sequence_end']}")
Customizing Simulations¶
GenAIRR allows for extensive customization to closely mimic the natural diversity of immune sequences. Below is an example of how to customize mutation rates and indel simulations.
# Create a pipeline with custom parameters
custom_pipeline = Pipeline(
config=HUMAN_IGH_OGRDB,
steps=[
steps.SimulateSequence(S5F(min_mutation_rate=0.1, max_mutation_rate=0.5), productive=True),
steps.FixVPositionAfterTrimmingIndexAmbiguity(),
steps.FixDPositionAfterTrimmingIndexAmbiguity(),
steps.FixJPositionAfterTrimmingIndexAmbiguity(),
steps.CorrectForVEndCut(),
steps.CorrectForDTrims(),
steps.DistillMutationRate(),
steps.CorruptSequenceBeginning(probability=0.7, event_weights=(0.4, 0.4, 0.2)),
steps.EnforceSequenceLength(max_length=576),
steps.InsertNs(n_ratio=0.02, probability=0.5),
steps.ShortDValidation(),
steps.InsertIndels(probability=0.05, max_indels=15, insertion_probability=0.7, deletion_probability=0.3),
]
)
heavy_sequence = custom_pipeline.execute()
print("Customized Simulated Heavy Chain Sequence:", heavy_sequence.get_dict())
Understanding Parameter Impact¶
Let's see how different mutation rates affect the output by comparing three scenarios: naive, memory, and plasma cell-like sequences.
# 1. Naive B cell (very few mutations)
naive_pipeline = Pipeline(
config=HUMAN_IGH_OGRDB,
steps=[
steps.SimulateSequence(S5F(min_mutation_rate=0.001, max_mutation_rate=0.01), productive=True),
steps.FixVPositionAfterTrimmingIndexAmbiguity(),
steps.FixDPositionAfterTrimmingIndexAmbiguity(),
steps.FixJPositionAfterTrimmingIndexAmbiguity(),
steps.DistillMutationRate()
]
)
# 2. Memory B cell (moderate mutations)
memory_pipeline = Pipeline(
config=HUMAN_IGH_OGRDB,
steps=[
steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True),
steps.FixVPositionAfterTrimmingIndexAmbiguity(),
steps.FixDPositionAfterTrimmingIndexAmbiguity(),
steps.FixJPositionAfterTrimmingIndexAmbiguity(),
steps.DistillMutationRate()
]
)
# 3. Plasma cell (high mutations)
plasma_pipeline = Pipeline(
config=HUMAN_IGH_OGRDB,
steps=[
steps.SimulateSequence(S5F(min_mutation_rate=0.05, max_mutation_rate=0.25), productive=True),
steps.FixVPositionAfterTrimmingIndexAmbiguity(),
steps.FixDPositionAfterTrimmingIndexAmbiguity(),
steps.FixJPositionAfterTrimmingIndexAmbiguity(),
steps.DistillMutationRate()
]
)
naive_seq = naive_pipeline.execute().get_dict()
memory_seq = memory_pipeline.execute().get_dict()
plasma_seq = plasma_pipeline.execute().get_dict()
print("=== MUTATION RATE COMPARISON ===")
print(f"Naive B cell: {naive_seq['mutation_rate']:.3f} ({len(naive_seq['mutations'])} mutations)")
print(f"Memory B cell: {memory_seq['mutation_rate']:.3f} ({len(memory_seq['mutations'])} mutations)")
print(f"Plasma cell: {plasma_seq['mutation_rate']:.3f} ({len(plasma_seq['mutations'])} mutations)")
print(f"\nSequence lengths:")
print(f"Naive: {len(naive_seq['sequence'])} bp")
print(f"Memory: {len(memory_seq['sequence'])} bp")
print(f"Plasma: {len(plasma_seq['sequence'])} bp")
Generating Naive Sequences¶
A naive sequence refers to an antibody sequence that has not undergone somatic hypermutation. GenAIRR allows you to simulate such naive sequences using the HeavyChainSequence class.
from GenAIRR.sequence import HeavyChainSequence
# Create a naive heavy chain sequence
naive_heavy_sequence = HeavyChainSequence.create_random(HUMAN_IGH_OGRDB)
print("Naive Heavy Chain Sequence:", naive_heavy_sequence)
print('Ungapped Sequence:')
print(naive_heavy_sequence.ungapped_seq)
Applying Mutations¶
To mimic the natural diversity and evolution of immune sequences, GenAIRR supports the simulation of mutations through various models. Here, we demonstrate how to apply mutations to a naive sequence using the S5F and Uniform mutation models.
Using the S5F Mutation Model¶
The S5F model is a context-dependent mutation model that considers surrounding nucleotide context for mutation probabilities. It's particularly useful for simulating realistic somatic hypermutations.
# Initialize the S5F mutation model with custom mutation rates
s5f_model = S5F(min_mutation_rate=0.01, max_mutation_rate=0.05)
# Apply mutations to the naive sequence
s5f_mutated_sequence, mutations, mutation_rate = s5f_model.apply_mutation(naive_heavy_sequence)
print("S5F Mutated Heavy Chain Sequence:", s5f_mutated_sequence)
print("S5F Mutation Details:", mutations)
print("S5F Mutation Rate:", mutation_rate)
Using the Uniform Mutation Model¶
The Uniform mutation model applies mutations at a uniform rate across the sequence, providing a simpler alternative to the context-dependent models.
# Initialize the Uniform mutation model
uniform_model = Uniform(min_mutation_rate=0.01, max_mutation_rate=0.05)
# Apply mutations to the naive sequence
uniform_mutated_sequence, mutations, mutation_rate = uniform_model.apply_mutation(naive_heavy_sequence)
print("Uniform Mutated Heavy Chain Sequence:", uniform_mutated_sequence)
print("Uniform Mutation Details:", mutations)
print("Uniform Mutation Rate:", mutation_rate)
Light Chain Simulation¶
Heavy chains are just one part of the antibody structure. Let's also simulate light chains (kappa and lambda). Note that light chains lack a D segment.
# Kappa light chain pipeline
kappa_pipeline = Pipeline(
config=HUMAN_IGK_OGRDB,
steps=[
steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True),
steps.FixVPositionAfterTrimmingIndexAmbiguity(),
steps.FixJPositionAfterTrimmingIndexAmbiguity(), # No D position fix for light chains
steps.CorrectForVEndCut(),
# No CorrectForDTrims or ShortDValidation for light chains
steps.DistillMutationRate(),
steps.CorruptSequenceBeginning(probability=0.7, event_weights=(0.4, 0.4, 0.2)),
steps.EnforceSequenceLength(max_length=400),
steps.InsertNs(n_ratio=0.02, probability=0.5),
steps.InsertIndels(probability=0.3, max_indels=3),
]
)
# Lambda light chain pipeline
lambda_pipeline = Pipeline(
config=HUMAN_IGL_OGRDB,
steps=[
steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True),
steps.FixVPositionAfterTrimmingIndexAmbiguity(),
steps.FixJPositionAfterTrimmingIndexAmbiguity(),
steps.CorrectForVEndCut(),
steps.DistillMutationRate(),
steps.CorruptSequenceBeginning(probability=0.7, event_weights=(0.4, 0.4, 0.2)),
steps.EnforceSequenceLength(max_length=400),
steps.InsertNs(n_ratio=0.02, probability=0.5),
steps.InsertIndels(probability=0.3, max_indels=3),
]
)
# Heavy chain for comparison
heavy_simple = Pipeline(
config=HUMAN_IGH_OGRDB,
steps=[
steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True),
steps.FixVPositionAfterTrimmingIndexAmbiguity(),
steps.FixDPositionAfterTrimmingIndexAmbiguity(),
steps.FixJPositionAfterTrimmingIndexAmbiguity(),
steps.DistillMutationRate()
]
)
kappa_sequence = kappa_pipeline.execute().get_dict()
lambda_sequence = lambda_pipeline.execute().get_dict()
heavy_sequence_simple = heavy_simple.execute().get_dict()
# Comparison table
comparison_data = {
'Chain Type': ['Heavy', 'Kappa Light', 'Lambda Light'],
'Has D Segment': [True, False, False],
'Sequence Length': [
len(heavy_sequence_simple['sequence']),
len(kappa_sequence['sequence']),
len(lambda_sequence['sequence'])
],
'V Allele': [
heavy_sequence_simple['v_call'][0],
kappa_sequence['v_call'][0],
lambda_sequence['v_call'][0]
],
'J Allele': [
heavy_sequence_simple['j_call'][0],
kappa_sequence['j_call'][0],
lambda_sequence['j_call'][0]
],
'Mutation Rate': [
f"{heavy_sequence_simple['mutation_rate']:.3f}",
f"{kappa_sequence['mutation_rate']:.3f}",
f"{lambda_sequence['mutation_rate']:.3f}"
]
}
print(pd.DataFrame(comparison_data).to_string(index=False))
print(f"\nSequence structure:")
print(f" Heavy chain: V - D - J - C")
print(f" Light chain: V --- J - C (no D segment)")
Common Use Cases¶
Generating Many Sequences¶
Generate a batch of sequences and convert to a DataFrame for analysis.
num_sequences = 10
heavy_sequences = [pipeline.execute().get_dict() for _ in range(num_sequences)]
df = pd.DataFrame(heavy_sequences)
print(f"Generated {len(df)} sequences")
print(f"Columns: {list(df.columns)}")
df[['v_call', 'd_call', 'j_call', 'mutation_rate', 'productive']].head()
Generating a Specific Allele Combination Sequence¶
In some cases, you might want to simulate sequences with specific V, D, and J allele combinations.
# See what allele families are available
print("Available V allele families:", list(HUMAN_IGH_OGRDB.v_alleles.keys())[:5])
print("Available D allele families:", list(HUMAN_IGH_OGRDB.d_alleles.keys())[:5])
print("Available J allele families:", list(HUMAN_IGH_OGRDB.j_alleles.keys())[:5])
# Select specific alleles from the data config
specific_v = HUMAN_IGH_OGRDB.v_alleles['IGHVF1-G1'][0]
specific_d = HUMAN_IGH_OGRDB.d_alleles['IGHD1-1'][0]
specific_j = HUMAN_IGH_OGRDB.j_alleles['IGHJ1'][0]
print(f"\nSelected: V={specific_v.name}, D={specific_d.name}, J={specific_j.name}")
# Create pipeline with specific alleles
specific_pipeline = Pipeline(
config=HUMAN_IGH_OGRDB,
steps=[
steps.SimulateSequence(
S5F(min_mutation_rate=0.02, max_mutation_rate=0.08),
productive=True,
specific_v=specific_v,
specific_d=specific_d,
specific_j=specific_j
),
steps.FixVPositionAfterTrimmingIndexAmbiguity(),
steps.FixDPositionAfterTrimmingIndexAmbiguity(),
steps.FixJPositionAfterTrimmingIndexAmbiguity(),
steps.DistillMutationRate()
]
)
result = specific_pipeline.execute().get_dict()
print(f"\nGenerated sequence uses:")
print(f" V: {result['v_call']}")
print(f" D: {result['d_call']}")
print(f" J: {result['j_call']}")
print(f" Mutation rate: {result['mutation_rate']:.3f}")
Generating Naive vs. Mutated Sequence Pairs¶
Comparing naive and mutated versions of the same sequence can be useful for studying somatic hypermutation effects.
from GenAIRR.sequence import HeavyChainSequence
# Generate a naive sequence and then mutate it
sequence_object = HeavyChainSequence.create_random(HUMAN_IGH_OGRDB)
sequence_object.mutate(s5f_model)
print("Naive Sequence: ", sequence_object.ungapped_seq)
print("Mutated Sequence:", sequence_object.mutated_seq)
Reproducibility¶
Use set_seed() to get reproducible results across runs.
from GenAIRR import set_seed, simulate
# Set seed for reproducibility
set_seed(42)
result1 = simulate(HUMAN_IGH_OGRDB, S5F(min_mutation_rate=0.01, max_mutation_rate=0.05))
set_seed(42)
result2 = simulate(HUMAN_IGH_OGRDB, S5F(min_mutation_rate=0.01, max_mutation_rate=0.05))
print(f"Same seed produces same sequence: {result1.get_dict()['sequence'] == result2.get_dict()['sequence']}")
# Reset to random behavior
reset_seed()
Conclusion¶
This tutorial covered:
- Pipeline creation:
Pipeline(config=..., steps=[...]) - Mutation models: S5F (context-dependent) and Uniform
- Chain types: Heavy, kappa, and lambda light chains
- Custom alleles: Specifying V, D, J alleles
- Reproducibility: Using
set_seed()for consistent results - Batch generation: Generating multiple sequences
Next Steps¶
- Advanced Custom Generation - Custom allele selection and specific sequence generation
- Introduction to DataConfig - Deep dive into data configuration
- Parameter Reference - Detailed parameter explanations
- Best Practices - Guidelines for effective use