Frequently Asked Questions (FAQ)
Common questions about GenAIRR and their answers.
Getting Started
Q: What is GenAIRR used for?
A: GenAIRR simulates realistic immune receptor sequences (antibodies and T-cell receptors) with full ground-truth annotation. It is primarily used for:
- Benchmarking sequence alignment algorithms
- Training machine learning models on immune data
- Studying somatic hypermutation patterns
- Generating synthetic datasets for research
Q: Do I need biology knowledge to use GenAIRR?
New to immunology?
Start with the Biological Context page for a concise primer, then follow the Step-by-Step Tutorial.
A: A basic understanding helps, but it is not required. Start with:
1. The Step-by-Step Tutorial for hands-on learning
2. The Biological Context Guide for background
3. The default parameters, which are biologically reasonable
Q: Which Python version does GenAIRR support?
A: Python 3.9 or higher. Install with:
pip install GenAIRR
Basic Usage
Q: What's the minimum code to generate a sequence?
A: Just three lines using the convenience function:
from GenAIRR import simulate, HUMAN_IGH_OGRDB, S5F
result = simulate(HUMAN_IGH_OGRDB, S5F(0.003, 0.25))
print(result.sequence)
Or with a pipeline for more control:
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F

pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True)]
)
sequence = pipeline.execute()
print(sequence.sequence)
Q: What does "productive=True" mean?
A: It ensures the generated sequence is:
- In the correct reading frame
- Free of premature stop codons
- Potentially functional as an antibody
About 1/3 of natural V(D)J recombination events are productive.
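The first two conditions can be checked mechanically. The sketch below is a simplified illustration, not GenAIRR's own productivity test: the helper name looks_productive is hypothetical, and it assumes the reading frame starts at the first base of the sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_productive(sequence: str) -> bool:
    """Return True if the sequence is in frame and has no in-frame stop codon.

    Simplified check: assumes the reading frame starts at the first base.
    """
    seq = sequence.upper()
    if len(seq) % 3 != 0:  # length is not a multiple of 3, so the frame is broken
        return False
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    return not any(codon in STOP_CODONS for codon in codons)


print(looks_productive("ATGGCCAAA"))  # in frame, no stop codon
print(looks_productive("ATGTAAGCC"))  # contains the stop codon TAA
```

A real productivity call also depends on where the V and J segments sit relative to each other, which this sketch does not model.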
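Q: Why do I get an error when creating pipelines?
A: Make sure you pass the config to the Pipeline constructor through the config argument, for example Pipeline(config=HUMAN_IGH_OGRDB, steps=[...]), as in the pipeline example above.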
Parameters and Configuration
Q: What mutation rates should I use?
A: It depends on the cell type you're modeling:
- Naive B cells: 0.001-0.01 (0.1-1%)
- Memory B cells: 0.02-0.08 (2-8%)
- Plasma cells: 0.05-0.25 (5-25%)
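One way to keep these ranges in a single place is a small lookup table. This is only a sketch: the names MUTATION_RANGES and mutation_range are hypothetical and not part of GenAIRR; the numbers are the ones given in the answer above.

```python
# (min_mutation_rate, max_mutation_rate) per cell type, as listed in the answer above
MUTATION_RANGES = {
    "naive": (0.001, 0.01),
    "memory": (0.02, 0.08),
    "plasma": (0.05, 0.25),
}


def mutation_range(cell_type: str) -> tuple[float, float]:
    """Look up the suggested mutation-rate range for a cell type name."""
    key = cell_type.strip().lower()
    if key not in MUTATION_RANGES:
        raise ValueError(f"Unknown cell type: {cell_type!r}")
    return MUTATION_RANGES[key]


low, high = mutation_range("memory")
# The pair would then be passed on, e.g. S5F(min_mutation_rate=low, max_mutation_rate=high)
print(f"memory B cells: {low:.1%} to {high:.1%}")
```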
Q: What's the difference between S5F and Uniform mutation models?
A:
- S5F: Context-dependent, biologically realistic mutations
- Uniform: Simple random mutations at a specified rate
- Recommendation: Use S5F for research, Uniform for testing
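To make "simple random mutations at a specified rate" concrete, here is a minimal sketch of a uniform substitution process. It is an illustration written for this FAQ, not GenAIRR's Uniform class, and mutate_uniform is a hypothetical name. A context-dependent model such as S5F would instead weight each position by its neighbouring bases.

```python
import random


def mutate_uniform(sequence: str, rate: float, rng: random.Random) -> str:
    """Mutate each base independently with probability `rate`."""
    bases = "ACGT"
    out = []
    for base in sequence:
        if rng.random() < rate:
            # Substitute with a base different from the current one
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)
original = "ACGT" * 10
mutated = mutate_uniform(original, rate=0.1, rng=rng)
n_changed = sum(a != b for a, b in zip(original, mutated))
print(f"{n_changed} of {len(original)} positions changed")
```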
Q: Can I simulate light chains?
A: Yes! Use the appropriate data configuration:
from GenAIRR import Pipeline, steps, HUMAN_IGK_OGRDB, S5F
# or HUMAN_IGL_OGRDB for lambda light chain

pipeline = Pipeline(
    config=HUMAN_IGK_OGRDB,
    steps=[
        steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),  # No D-segment steps: light chains have no D gene
        steps.CorrectForVEndCut(),
        steps.DistillMutationRate(),
    ]
)
Pipeline Design
Q: Do I need all the pipeline steps?
A: No. Start minimal and add complexity:
- Minimal: Just SimulateSequence
- Basic: Add position fix steps
- Realistic: Add biological corrections
- Full: Add sequencing artifacts
Q: What order should pipeline steps be in?
A: Follow this general order:
1. SimulateSequence (always first)
2. Position fixes (FixVPositionAfterTrimmingIndexAmbiguity, etc.)
3. Biological corrections (CorrectForVEndCut, etc.)
4. Finalization (DistillMutationRate)
5. Sequencing artifacts (CorruptSequenceBeginning, EnforceSequenceLength, InsertNs)
6. Quality control (ShortDValidation)
7. Structural variants (InsertIndels)
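The ordering above can be turned into a quick sanity check on a planned list of step names. This is a sketch under stated assumptions: STAGE_OF_STEP and out_of_order are hypothetical helpers, they work on plain strings rather than GenAIRR step objects, and they only know the step names mentioned in this FAQ.

```python
# Stage numbers follow the numbered list above
STAGE_OF_STEP = {
    "SimulateSequence": 1,
    "FixVPositionAfterTrimmingIndexAmbiguity": 2,
    "FixJPositionAfterTrimmingIndexAmbiguity": 2,
    "CorrectForVEndCut": 3,
    "DistillMutationRate": 4,
    "CorruptSequenceBeginning": 5,
    "EnforceSequenceLength": 5,
    "InsertNs": 5,
    "ShortDValidation": 6,
    "InsertIndels": 7,
}


def out_of_order(step_names):
    """Return (earlier, later) pairs of neighbouring known steps that break the order."""
    known = [name for name in step_names if name in STAGE_OF_STEP]
    return [
        (a, b)
        for a, b in zip(known, known[1:])
        if STAGE_OF_STEP[a] > STAGE_OF_STEP[b]
    ]


plan = ["SimulateSequence", "DistillMutationRate", "CorrectForVEndCut"]
print(out_of_order(plan))
```

An empty result means the plan follows the general order; unknown step names are ignored rather than rejected.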
Q: Can I create custom pipeline steps?
A: Yes! Inherit from AugmentationStep and implement the apply method:
from GenAIRR.steps.StepBase import AugmentationStep

class MyCustomStep(AugmentationStep):
    def apply(self, container):
        # Your custom logic here
        container.sequence = container.sequence.upper()
Data and Output
Q: What data does GenAIRR output?
A: Each simulated sequence includes:
- DNA sequence string
- V, D, J allele names used
- Mutation positions and types
- Sequence region boundaries
- Quality metrics (productive, mutation rate, etc.)
Q: How do I export results to different formats?
A: With the simulated results collected in a list named sequences:
# Pandas DataFrame
import pandas as pd
df = pd.DataFrame([seq.get_dict() for seq in sequences])

# FASTA format
with open('output.fasta', 'w') as f:
    for i, seq in enumerate(sequences):
        f.write(f">seq_{i}\n{seq.sequence}\n")

# JSON format
import json
with open('output.json', 'w') as f:
    json.dump([seq.get_dict() for seq in sequences], f)
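If pandas is not available, the same dictionaries can be written as tab-separated text with the standard library. This sketch assumes each get_dict() result is a flat dictionary; write_tsv is a hypothetical helper and the records below are hand-written stand-ins with made-up keys.

```python
import csv
import io


def write_tsv(records, handle):
    """Write a list of flat dictionaries as tab-separated text."""
    # Union of keys in first-seen order, so records with missing fields still fit
    fieldnames = list(dict.fromkeys(key for rec in records for key in rec))
    writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)


# Stand-in records; real ones would come from seq.get_dict()
records = [
    {"sequence": "ACGT", "productive": True},
    {"sequence": "GGCC", "productive": False, "note": "example"},
]
buffer = io.StringIO()
write_tsv(records, buffer)
print(buffer.getvalue())
```

Passing an open file instead of the StringIO buffer writes the table to disk.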
Q: Can I use my own germline database?
A: Yes, but it requires creating a custom DataConfig. See the Custom Data Config guide.
Performance and Scaling
Q: How fast is GenAIRR?
A: Speed depends on complexity:
- Simple pipeline: ~100-1000 sequences/second
- Full pipeline: ~10-100 sequences/second
- With high mutation rates or productive=True: slower due to retries
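These figures vary with hardware and configuration, so it is worth measuring your own pipeline. A minimal timing sketch follows; sequences_per_second is a hypothetical helper, and the workload here is a stand-in lambda rather than a real GenAIRR call.

```python
import time


def sequences_per_second(generate_one, n=200):
    """Call generate_one() n times and return the observed calls per second."""
    start = time.perf_counter()
    for _ in range(n):
        generate_one()
    elapsed = time.perf_counter() - start
    return n / elapsed if elapsed > 0 else float("inf")


# Stand-in workload; with GenAIRR you would pass pipeline.execute instead
rate = sequences_per_second(lambda: "ACGT" * 100, n=1000)
print(f"{rate:,.0f} calls/second")
```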
Q: How do I generate large datasets efficiently?
A: Use batch processing:
def generate_batch(pipeline, batch_size=1000):
    return [pipeline.execute().get_dict() for _ in range(batch_size)]

# Generate 10,000 sequences in batches
all_sequences = []
for i in range(10):
    batch = generate_batch(pipeline, 1000)
    all_sequences.extend(batch)
    print(f"Generated {len(all_sequences)} sequences")
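For datasets too large to hold in memory, each batch can be written to disk as soon as it is produced. The sketch below is one possible approach, not a GenAIRR feature: write_in_batches is a hypothetical helper, it assumes each record is JSON-serializable, and the generator is a stand-in lambda.

```python
import json
import os
import tempfile


def write_in_batches(generate_one, total, batch_size, path):
    """Generate `total` records, writing each batch to a JSON Lines file."""
    written = 0
    with open(path, "w") as handle:
        while written < total:
            size = min(batch_size, total - written)
            batch = [generate_one() for _ in range(size)]
            for record in batch:
                handle.write(json.dumps(record) + "\n")
            written += size  # only one batch is held in memory at a time
    return written


# Stand-in generator; with GenAIRR this could be lambda: pipeline.execute().get_dict()
path = os.path.join(tempfile.mkdtemp(), "sequences.jsonl")
count = write_in_batches(lambda: {"sequence": "ACGT"}, total=2500, batch_size=1000, path=path)
print(count)
```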
Q: Why is generation slow when using productive=True?
A: The library regenerates sequences until finding productive ones. Solutions:
- Use lower mutation rates
- Accept some non-productive sequences (productive=False)
- Use pre-filtered germline databases
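The cost of retrying can be estimated with a simple model: if each attempt is productive with probability p, the expected number of attempts is 1/p. The simulation below is a sketch of that model only; p = 1/3 is an illustrative value, and the actual success rate in GenAIRR depends on the configuration and mutation rates.

```python
import random


def attempts_until_success(p, rng):
    """Count draws until one succeeds, each succeeding with probability p."""
    attempts = 1
    while rng.random() >= p:
        attempts += 1
    return attempts


rng = random.Random(42)
p = 1 / 3  # illustrative value only
trials = [attempts_until_success(p, rng) for _ in range(10_000)]
mean_attempts = sum(trials) / len(trials)
print(f"mean attempts: {mean_attempts:.2f} (theory: {1 / p:.2f})")
```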
Troubleshooting
Q: My sequences all look the same!
A: Check these:
- Mutation rates aren't zero: S5F(min_mutation_rate=0.01, max_mutation_rate=0.05) not S5F(0, 0)
- Using different alleles: Check if you're forcing specific alleles
- Random seed: Don't set a fixed seed for production use
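Before changing parameters, it can help to quantify how similar the output really is. A minimal sketch, assuming the sequences are available as plain strings; diversity_report is a hypothetical helper.

```python
from collections import Counter


def diversity_report(sequences):
    """Summarize how many distinct sequences a non-empty sample contains."""
    if not sequences:
        raise ValueError("Need at least one sequence")
    counts = Counter(sequences)
    _, most_common_n = counts.most_common(1)[0]
    return {
        "total": len(sequences),
        "unique": len(counts),
        "unique_fraction": len(counts) / len(sequences),
        "top_share": most_common_n / len(sequences),  # share of the most frequent sequence
    }


sample = ["ACGT", "ACGT", "ACGA", "TCGT"]
print(diversity_report(sample))
```

A unique_fraction close to zero, or a top_share close to one, suggests one of the causes listed above.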
Q: I'm getting very short sequences!
A: Adjust corruption and length parameters:
# Less aggressive corruption
steps.CorruptSequenceBeginning(probability=0.3, event_weights=(0.7, 0.3, 0))
steps.EnforceSequenceLength(max_length=400)
Q: How do I reproduce results?
A: Use GenAIRR's built-in seed management.
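The principle behind any seed mechanism is that a seeded random number generator replays the same draws. The sketch below demonstrates this with Python's standard random module only. It does not call GenAIRR's seed API, and random_dna is a hypothetical stand-in for a simulator.

```python
import random


def random_dna(length, rng):
    """Stand-in simulator: draw a random DNA string from the given generator."""
    return "".join(rng.choice("ACGT") for _ in range(length))


first = random_dna(30, random.Random(123))
second = random_dna(30, random.Random(123))  # same seed, same draws
third = random_dna(30, random.Random(456))   # different seed

print(first == second)
print(first == third)
```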
Advanced Usage
Q: Can I simulate paired heavy/light chains?
A: Not directly, but you can generate them separately using different pipelines:
from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, HUMAN_IGK_OGRDB, S5F
heavy_pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True)]
)
light_pipeline = Pipeline(
    config=HUMAN_IGK_OGRDB,
    steps=[steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True)]
)
heavy = heavy_pipeline.execute()
light = light_pipeline.execute()
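Once both chains are generated, they can be combined into paired records. This is a sketch of one simple convention, not a GenAIRR feature: pair_chains and the cell_id field are hypothetical, the pairing is by position only, and it does not model any biological heavy/light pairing preference.

```python
def pair_chains(heavy_records, light_records):
    """Combine heavy and light records position by position into paired entries."""
    if len(heavy_records) != len(light_records):
        raise ValueError("Need the same number of heavy and light chains")
    return [
        {"cell_id": f"cell_{i}", "heavy": heavy, "light": light}
        for i, (heavy, light) in enumerate(zip(heavy_records, light_records))
    ]


# Stand-in records; real ones could come from heavy.get_dict() and light.get_dict()
heavies = [{"sequence": "CAGGTG"}, {"sequence": "GAGGTG"}]
lights = [{"sequence": "GACATC"}, {"sequence": "GAAATT"}]
pairs = pair_chains(heavies, lights)
print(pairs[0]["cell_id"], pairs[0]["heavy"]["sequence"], pairs[0]["light"]["sequence"])
```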
Q: How do I model specific diseases or conditions?
A: Adjust parameters to reflect biology:
# Autoimmune (higher mutation)
steps.SimulateSequence(S5F(min_mutation_rate=0.05, max_mutation_rate=0.15), productive=True)
# Immunodeficiency (lower diversity - use specific alleles)
# common_allele is a placeholder for a V allele you select yourself
steps.SimulateSequence(S5F(min_mutation_rate=0.001, max_mutation_rate=0.02), productive=True, specific_v=common_allele)
Q: Can I add custom mutation patterns?
A: Yes, by creating custom mutation models. See the source code of S5F and Uniform classes as examples.
Getting Help
Q: Where can I find more examples?
A: Check these resources:
- Jupyter notebook tutorials
- GitHub repository
- Step-by-step tutorial
Q: How do I report bugs or request features?
A:
1. Check existing GitHub issues first
2. Create a minimal reproducible example
3. Include your Python version and GenAIRR version
4. Submit to the GitHub repository
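For step 3, the version details can be collected with the standard library. This is a minimal sketch: environment_report is a hypothetical helper, and it assumes the installed distribution is named GenAIRR, as in the pip install command.

```python
import platform
import sys
from importlib import metadata


def environment_report(package="GenAIRR"):
    """Collect version details worth pasting into a bug report."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        package_version = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        package: package_version,
    }


for name, value in environment_report().items():
    print(f"{name}: {value}")
```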
Q: Is there a community forum or chat?
A: Check the GitHub repository for current community resources and discussion channels.