Frequently Asked Questions (FAQ)

Common questions about GenAIRR and their answers.

Getting Started

Q: What is GenAIRR used for?

A: GenAIRR simulates realistic immune receptor sequences (antibodies and T-cell receptors) with full ground-truth annotation. It's primarily used for:

- Benchmarking sequence alignment algorithms
- Training machine learning models on immune data
- Studying somatic hypermutation patterns
- Generating synthetic datasets for research

Q: Do I need biology knowledge to use GenAIRR?

New to immunology?

Start with the Biological Context page for a concise primer, then follow the Step-by-Step Tutorial.

A: Basic understanding helps, but it's not required. Good starting points:

1. The Step-by-Step Tutorial for hands-on learning
2. The Biological Context Guide for background
3. The default parameters, which are biologically reasonable

Q: Which Python version does GenAIRR support?

A: Python 3.9 or higher. Install with:

pip install GenAIRR

Basic Usage

Q: What's the minimum code to generate a sequence?

A: Just 3 lines using the convenience function:

from GenAIRR import simulate, HUMAN_IGH_OGRDB, S5F

result = simulate(HUMAN_IGH_OGRDB, S5F(0.003, 0.25))
print(result.sequence)

Or with a pipeline for more control:

from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F

pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[steps.SimulateSequence(S5F(min_mutation_rate=0.003, max_mutation_rate=0.25), productive=True)]
)
sequence = pipeline.execute()
print(sequence.sequence)

Q: What does "productive=True" mean?

A: It ensures the generated sequence is:

- In the correct reading frame
- Free of premature stop codons
- Potentially functional as an antibody

About 1/3 of natural V(D)J recombination events are productive.
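
The first two criteria can be illustrated with a few lines of plain Python. This is a simplified sketch, not GenAIRR's own productivity check: it assumes the sequence starts in reading frame 0 and treats any in-frame stop codon as disqualifying, whereas a real check also considers the junction and conserved residues. The function names are hypothetical.

```python
# Simplified illustration only; GenAIRR's actual productivity logic may differ.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_in_frame(sequence: str) -> bool:
    """True when the sequence length is a whole number of codons."""
    return len(sequence) % 3 == 0


def has_stop_codon(sequence: str) -> bool:
    """True when any complete codon in reading frame 0 is a stop codon."""
    seq = sequence.upper()
    return any(seq[i:i + 3] in STOP_CODONS for i in range(0, len(seq) - 2, 3))


def looks_productive(sequence: str) -> bool:
    """Combine both checks: in frame and free of stop codons."""
    return is_in_frame(sequence) and not has_stop_codon(sequence)


print(looks_productive("ATGGCCTGG"))  # codons ATG GCC TGG
print(looks_productive("ATGTAAGCC"))  # contains the stop codon TAA
```

A sequence whose length is not a multiple of three, or that contains TAA, TAG, or TGA in frame, fails this toy check.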

Q: Why do I get an error when creating pipelines?

A: Make sure you pass the config to the Pipeline constructor:

pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[...]
)

Parameters and Configuration

Q: What mutation rates should I use?

A: It depends on the cell type you're modeling:

- Naive B cells: 0.001-0.01 (0.1-1%)
- Memory B cells: 0.02-0.08 (2-8%)
- Plasma cells: 0.05-0.25 (5-25%)
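
To get a feel for what these rates mean in practice, the ranges can be converted into an expected number of mutated positions for a given sequence length. The sketch below is plain arithmetic and does not call GenAIRR; the dictionary keys and the function name are hypothetical.

```python
# Ranges copied from the FAQ answer (fraction of mutated positions).
MUTATION_RATE_RANGES = {
    "naive": (0.001, 0.01),
    "memory": (0.02, 0.08),
    "plasma": (0.05, 0.25),
}


def expected_mutation_range(cell_type: str, sequence_length: int) -> tuple[float, float]:
    """Return the (low, high) expected mutation counts for a sequence."""
    low, high = MUTATION_RATE_RANGES[cell_type]
    return low * sequence_length, high * sequence_length


# For a 350-nt sequence, a memory B cell range corresponds to
# roughly 7 to 28 mutated positions.
low, high = expected_mutation_range("memory", 350)
print(f"{low:.0f} to {high:.0f} mutations")
```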

Q: What's the difference between S5F and Uniform mutation models?

A:

- S5F: Context-dependent, biologically realistic mutations
- Uniform: Simple random mutations at a specified rate
- Recommendation: Use S5F for research, Uniform for testing
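
The distinction can be sketched with two toy functions. Neither is GenAIRR's implementation: `mutate_uniform` gives every position the same mutation probability, while `mutate_context` scales that probability by a weight looked up from the surrounding 5-mer. The weight table, function names, and example motif are all hypothetical.

```python
import random

BASES = "ACGT"


def _substitute(base: str, rng: random.Random) -> str:
    """Pick a replacement base that differs from the original."""
    return rng.choice([b for b in BASES if b != base])


def mutate_uniform(sequence: str, rate: float, rng: random.Random) -> str:
    """Toy uniform model: every position mutates with probability `rate`."""
    return "".join(
        _substitute(base, rng) if rng.random() < rate else base
        for base in sequence
    )


def mutate_context(sequence: str, rate: float, weights: dict, rng: random.Random) -> str:
    """Toy context model: scale `rate` by a weight for the 5-mer centered
    on each position (default weight 1.0). The two bases at each end have
    no full 5-mer and are left untouched in this sketch."""
    out = list(sequence)
    for i in range(2, len(sequence) - 2):
        kmer = sequence[i - 2:i + 3]
        probability = min(1.0, rate * weights.get(kmer, 1.0))
        if rng.random() < probability:
            out[i] = _substitute(sequence[i], rng)
    return "".join(out)


rng = random.Random(0)
germline = "GGAGCTACGTTAGCTAAC"
hotspots = {"AGCTA": 10.0}  # hypothetical weight for an illustrative motif
print(mutate_uniform(germline, 0.05, rng))
print(mutate_context(germline, 0.05, hotspots, rng))
```

In the context version, positions inside a heavily weighted motif mutate far more often than their neighbors, which is the qualitative behavior a context-dependent model aims for.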

Q: Can I simulate light chains?

A: Yes! Use the appropriate data configuration:

from GenAIRR import Pipeline, steps, HUMAN_IGK_OGRDB, S5F
# or HUMAN_IGL_OGRDB for lambda light chain

pipeline = Pipeline(
    config=HUMAN_IGK_OGRDB,
    steps=[
        steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True),
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),  # No D segment steps
        steps.CorrectForVEndCut(),
        steps.DistillMutationRate(),
    ]
)

Pipeline Design

Q: Do I need all the pipeline steps?

A: No. Start minimal and add complexity:

- Minimal: Just SimulateSequence
- Basic: Add position fix steps
- Realistic: Add biological corrections
- Full: Add sequencing artifacts

Q: What order should pipeline steps be in?

A: Follow this general order:

1. SimulateSequence (always first)
2. Position fixes (FixVPositionAfterTrimmingIndexAmbiguity, etc.)
3. Biological corrections (CorrectForVEndCut, etc.)
4. Finalization (DistillMutationRate)
5. Sequencing artifacts (CorruptSequenceBeginning, EnforceSequenceLength, InsertNs)
6. Quality control (ShortDValidation)
7. Structural variants (InsertIndels)
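
As a quick sanity check, the recommended order can be encoded as a lookup table and compared against a list of step names. This helper is hypothetical and not part of GenAIRR; it works on plain strings, covers only the step names mentioned in this FAQ, and ignores any name it does not recognize.

```python
# Step names grouped by the stage they belong to, in recommended order.
RECOMMENDED_ORDER = [
    ["SimulateSequence"],
    ["FixVPositionAfterTrimmingIndexAmbiguity",
     "FixJPositionAfterTrimmingIndexAmbiguity"],
    ["CorrectForVEndCut"],
    ["DistillMutationRate"],
    ["CorruptSequenceBeginning", "EnforceSequenceLength", "InsertNs"],
    ["ShortDValidation"],
    ["InsertIndels"],
]

# Map each step name to the index of its stage.
RANK = {name: rank
        for rank, group in enumerate(RECOMMENDED_ORDER)
        for name in group}


def out_of_order(step_names):
    """Return (earlier, later) pairs of adjacent known steps whose stages
    are reversed relative to the recommended order."""
    known = [name for name in step_names if name in RANK]
    return [(a, b) for a, b in zip(known, known[1:]) if RANK[a] > RANK[b]]


planned = ["SimulateSequence", "InsertIndels", "DistillMutationRate"]
for earlier, later in out_of_order(planned):
    print(f"{earlier} is listed before {later}")
```

An empty result means the listed steps follow the general order; it says nothing about whether the steps themselves are configured correctly.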

Q: Can I create custom pipeline steps?

A: Yes! Inherit from AugmentationStep and implement the apply method:

from GenAIRR.steps.StepBase import AugmentationStep

class MyCustomStep(AugmentationStep):
    def apply(self, container):
        # Your custom logic here
        container.sequence = container.sequence.upper()

Data and Output

Q: What data does GenAIRR output?

A: Each simulated sequence includes:

- DNA sequence string
- V, D, J allele names used
- Mutation positions and types
- Sequence region boundaries
- Quality metrics (productive, mutation rate, etc.)

Q: How do I export results to different formats?

A:

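# `sequences` is a list of simulated results, e.g.
# sequences = [pipeline.execute() for _ in range(100)]

# Pandas DataFrame
import pandas as pd
df = pd.DataFrame([seq.get_dict() for seq in sequences])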

# FASTA format
with open('output.fasta', 'w') as f:
    for i, seq in enumerate(sequences):
        f.write(f">seq_{i}\n{seq.sequence}\n")

# JSON format
import json
with open('output.json', 'w') as f:
    json.dump([seq.get_dict() for seq in sequences], f)
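
If pandas is not available, the same list of dictionaries can be written as tab-separated text with the standard library. The sketch below assumes each record is a flat dictionary such as the one `seq.get_dict()` returns; the keys in the sample records are hypothetical stand-ins, not GenAIRR's actual field names.

```python
import csv
import io


def write_tsv(records, handle):
    """Write dictionaries as TSV; the header is the union of all keys in
    first-seen order, and missing values become empty cells."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t",
                            lineterminator="\n", restval="")
    writer.writeheader()
    writer.writerows(records)


# Stand-in records; real ones would come from seq.get_dict().
sample_records = [
    {"sequence": "ACGT", "productive": True},
    {"sequence": "GGCC", "v_call": "example_allele"},
]
buffer = io.StringIO()
write_tsv(sample_records, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("output.tsv", "w", newline="")` as the handle.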

Q: Can I use my own germline database?

A: Yes, but it requires creating a custom DataConfig. See the Custom Data Config guide.

Performance and Scaling

Q: How fast is GenAIRR?

A: Speed depends on complexity:

- Simple pipeline: ~100-1000 sequences/second
- Full pipeline: ~10-100 sequences/second
- With high mutation rates or productive=True: slower due to retries

Q: How do I generate large datasets efficiently?

A: Use batch processing:

def generate_batch(pipeline, batch_size=1000):
    return [pipeline.execute().get_dict() for _ in range(batch_size)]

# Generate 10,000 sequences in batches
all_sequences = []
for i in range(10):
    batch = generate_batch(pipeline, 1000)
    all_sequences.extend(batch)
    print(f"Generated {len(all_sequences)} sequences")
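
For datasets too large to hold in memory, one option is to write each batch to disk as it is produced rather than extending a list. The sketch below streams records to JSON Lines; `stream_to_jsonl` and `make_record` are hypothetical names, and in real use `make_record` could be something like `lambda: pipeline.execute().get_dict()`.

```python
import io
import json
from itertools import count


def stream_to_jsonl(make_record, total, handle, batch_size=1000):
    """Generate `total` records in batches and write each batch to
    `handle` as JSON Lines, so only one batch is in memory at a time."""
    written = 0
    while written < total:
        size = min(batch_size, total - written)
        batch = [make_record() for _ in range(size)]
        handle.write("".join(json.dumps(record) + "\n" for record in batch))
        written += size
    return written


# Demo with a stand-in record factory instead of a real pipeline.
_ids = count()
buffer = io.StringIO()
n_written = stream_to_jsonl(lambda: {"id": next(_ids)}, total=2500,
                            handle=buffer, batch_size=1000)
print(f"Wrote {n_written} records")
```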

Q: Why is generation slow when using productive=True?

A: The library regenerates sequences until finding productive ones. Solutions:

- Use lower mutation rates
- Accept some non-productive sequences (productive=False)
- Use pre-filtered germline databases
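
The cost of retrying can be estimated with a simple model. If each attempt independently succeeds with probability p, the number of attempts follows a geometric distribution with mean 1/p. The sketch below is a generic simulation, not GenAIRR code, and the value p = 1/3 is only an illustrative assumption borrowed from the figure quoted for natural recombination; the real success rate depends on your configuration.

```python
import random


def expected_attempts(p_success: float) -> float:
    """Mean of a geometric distribution: attempts needed per success."""
    return 1.0 / p_success


def attempts_until_success(p_success: float, rng: random.Random) -> int:
    """Simulate retrying until one attempt succeeds."""
    attempts = 1
    while rng.random() >= p_success:
        attempts += 1
    return attempts


def mean_attempts(p_success: float, trials: int, seed: int = 0) -> float:
    """Average number of attempts over many simulated runs."""
    rng = random.Random(seed)
    return sum(attempts_until_success(p_success, rng) for _ in range(trials)) / trials


p = 1 / 3  # illustrative assumption, not a measured GenAIRR value
print(f"analytic: {expected_attempts(p):.2f} attempts per productive sequence")
print(f"simulated: {mean_attempts(p, 20000):.2f}")
```

Under this model, halving the success probability doubles the expected work, which is why settings that produce more non-productive candidates slow generation noticeably.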

Troubleshooting

Q: My sequences all look the same!

A: Check these:

- Mutation rates aren't zero: use S5F(min_mutation_rate=0.01, max_mutation_rate=0.05), not S5F(0, 0)
- Allele selection: check whether you're forcing specific alleles
- Random seed: don't set a fixed seed for production use

Q: I'm getting very short sequences!

A: Adjust corruption and length parameters:

# Less aggressive corruption
steps.CorruptSequenceBeginning(probability=0.3, event_weights=(0.7, 0.3, 0))
steps.EnforceSequenceLength(max_length=400)

Q: How do I reproduce results?

A: Use GenAIRR's built-in seed management:

from GenAIRR import set_seed, get_seed, reset_seed

set_seed(42)
# Now generate sequences...

Advanced Usage

Q: Can I simulate paired heavy/light chains?

A: Not directly, but you can generate them separately using different pipelines:

from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, HUMAN_IGK_OGRDB, S5F

heavy_pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True)]
)

light_pipeline = Pipeline(
    config=HUMAN_IGK_OGRDB,
    steps=[steps.SimulateSequence(S5F(min_mutation_rate=0.02, max_mutation_rate=0.08), productive=True)]
)

heavy = heavy_pipeline.execute()
light = light_pipeline.execute()

Q: How do I model specific diseases or conditions?

A: Adjust parameters to reflect biology:

# Autoimmune (higher mutation)
steps.SimulateSequence(S5F(min_mutation_rate=0.05, max_mutation_rate=0.15), productive=True)

# Immunodeficiency (lower diversity - use specific alleles)
# `common_allele` is a placeholder for an allele you select beforehand
steps.SimulateSequence(S5F(min_mutation_rate=0.001, max_mutation_rate=0.02), productive=True, specific_v=common_allele)

Q: Can I add custom mutation patterns?

A: Yes, by creating custom mutation models. See the source code of S5F and Uniform classes as examples.

Getting Help

Q: Where can I find more examples?

A: Check these resources:

- Jupyter notebook tutorials
- GitHub repository
- Step-by-step tutorial

Q: How do I report bugs or request features?

A:

1. Check existing GitHub issues first
2. Create a minimal reproducible example
3. Include your Python version and GenAIRR version
4. Submit to the GitHub repository

Q: Is there a community forum or chat?

A: Check the GitHub repository for current community resources and discussion channels.