GenAIRR: Modular Ig Sequence Simulation

GenAIRR is a Python framework for simulating immunoglobulin (Ig) and adaptive immune receptor sequences. It provides a modular, pipeline-based architecture for generating realistic synthetic AIRR data — from naive B-cell sequences to fully mutated, sequencing-artifact-laden reads.

GenAIRR is designed for researchers and developers who need synthetic immune receptor data for benchmarking alignment tools, training machine learning models, or studying V(D)J recombination and somatic hypermutation.

from GenAIRR import simulate, HUMAN_IGH_OGRDB, S5F

# Generate a single simulated heavy chain sequence
result = simulate(HUMAN_IGH_OGRDB, S5F(0.01, 0.05))
print(result.sequence)

Key Features

Modular Pipeline Architecture
Build simulation workflows from composable steps. Each step modifies a SimulationContainer and can be added, removed, or reordered freely.
Biologically Realistic Mutation Models
Built-in S5F (context-dependent) and Uniform mutation models simulate somatic hypermutation at configurable rates.
Empirical Germline Data
Pre-built DataConfig objects contain V, D, and J allele sets, trimming distributions, and nucleotide addition patterns derived from real repertoire data (OGRDB).
Sequencing Artifact Simulation
Simulate real-world data imperfections — 5' truncation, N-base insertions, insertions/deletions, and read-length limits.
Ambiguity Resolution
Automatic correction steps resolve positional ambiguities introduced by trimming, ensuring accurate ground-truth annotations.
Reproducibility
Seed management (set_seed, get_seed, reset_seed) enables deterministic sequence generation.
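
The modular-pipeline and reproducibility ideas above can be illustrated with a small, self-contained sketch. It does not use GenAIRR itself: ToyContainer, generate, mutate, truncate_5prime, and run_pipeline are hypothetical names invented for this illustration. The mutation step is a plain uniform substitution model rather than S5F, and drawing one rate uniformly between the minimum and maximum is an assumption about how a rate range might be used.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyContainer:
    """Hypothetical stand-in for a simulation container (not GenAIRR's class)."""
    sequence: str = ""
    log: List[str] = field(default_factory=list)


# A step is any callable that edits the container in place using a shared RNG.
Step = Callable[[ToyContainer, random.Random], None]


def generate(length: int) -> Step:
    def step(c: ToyContainer, rng: random.Random) -> None:
        c.sequence = "".join(rng.choice("ACGT") for _ in range(length))
        c.log.append("generate")
    return step


def mutate(min_rate: float, max_rate: float) -> Step:
    def step(c: ToyContainer, rng: random.Random) -> None:
        # Assumption: one per-sequence rate is drawn from [min_rate, max_rate].
        rate = rng.uniform(min_rate, max_rate)
        bases = list(c.sequence)
        for i, base in enumerate(bases):
            if rng.random() < rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        c.sequence = "".join(bases)
        c.log.append("mutate")
    return step


def truncate_5prime(max_cut: int) -> Step:
    def step(c: ToyContainer, rng: random.Random) -> None:
        cut = rng.randint(0, max_cut)  # inclusive on both ends
        c.sequence = c.sequence[cut:]
        c.log.append("truncate_5prime")
    return step


def run_pipeline(steps: List[Step], seed: int) -> ToyContainer:
    rng = random.Random(seed)  # one seeded RNG drives every step
    container = ToyContainer()
    for step in steps:
        step(container, rng)
    return container


steps = [generate(60), mutate(0.01, 0.05), truncate_5prime(10)]
first = run_pipeline(steps, seed=42)
second = run_pipeline(steps, seed=42)
print(first.sequence == second.sequence)  # True: same seed and steps

# Steps are composable: dropping the truncation step keeps the full length.
shorter = run_pipeline(steps[:2], seed=42)
print(len(shorter.sequence))  # 60
```

Passing a single seeded random.Random through every step is one simple way to get deterministic output; GenAIRR's own set_seed, get_seed, and reset_seed functions may work differently.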

Supported Chains

Config            Chain                        Species
HUMAN_IGH_OGRDB   Heavy (IGH)                  Human
HUMAN_IGK_OGRDB   Kappa light (IGK)            Human
HUMAN_IGL_OGRDB   Lambda light (IGL)           Human
HUMAN_TRB_OGRDB   T-cell receptor beta (TRB)   Human

Installation

pip install GenAIRR

Requires Python 3.9+.


Quick Example: Full Pipeline

For complete control over the simulation process, build a pipeline with explicit steps:

from GenAIRR import Pipeline, steps, HUMAN_IGH_OGRDB, S5F

pipeline = Pipeline(
    config=HUMAN_IGH_OGRDB,
    steps=[
        # 1. Generate a mutated sequence
        steps.SimulateSequence(
            S5F(min_mutation_rate=0.01, max_mutation_rate=0.05),
            productive=True
        ),
        # 2. Resolve positional ambiguities
        steps.FixVPositionAfterTrimmingIndexAmbiguity(),
        steps.FixDPositionAfterTrimmingIndexAmbiguity(),
        steps.FixJPositionAfterTrimmingIndexAmbiguity(),
        # 3. Biological corrections
        steps.CorrectForVEndCut(),
        steps.CorrectForDTrims(),
        # 4. Record mutation rate
        steps.DistillMutationRate(),
        # 5. Simulate sequencing artifacts
        steps.CorruptSequenceBeginning(),
        steps.EnforceSequenceLength(),
        steps.InsertNs(),
        # 6. Quality variants
        steps.ShortDValidation(),
        steps.InsertIndels(),
    ]
)

result = pipeline.execute()
data = result.get_dict()
print(data['v_call'], data['mutation_rate'])
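
Once many results have been collected as dictionaries, they can be written out and summarized with the standard library alone. The sketch below is hedged: it does not call GenAIRR, the records list is hand-written placeholder data that reuses only the 'v_call' and 'mutation_rate' keys shown above, and write_tsv and mean_rate_by_v are hypothetical helper names.

```python
import csv
import io
from collections import defaultdict

# Placeholder records shaped like the dictionary printed above.
# The values are illustrative only, not simulator output.
records = [
    {"v_call": "IGHV1-2*02", "mutation_rate": 0.031},
    {"v_call": "IGHV3-23*01", "mutation_rate": 0.012},
    {"v_call": "IGHV1-2*02", "mutation_rate": 0.045},
]


def write_tsv(rows, handle, fieldnames=("v_call", "mutation_rate")):
    """Write the selected keys of each record as tab-separated text."""
    writer = csv.DictWriter(
        handle,
        fieldnames=list(fieldnames),
        delimiter="\t",
        extrasaction="ignore",  # skip any keys not listed in fieldnames
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)


def mean_rate_by_v(rows):
    """Average the mutation rate per V call."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["v_call"]].append(row["mutation_rate"])
    return {v: sum(rates) / len(rates) for v, rates in grouped.items()}


buffer = io.StringIO()
write_tsv(records, buffer)
print(buffer.getvalue())

means = mean_rate_by_v(records)
for v_call, rate in sorted(means.items()):
    print(f"{v_call}\t{rate:.3f}")
```

In practice the records list would be filled by calling the pipeline repeatedly and appending each result dictionary; an in-memory buffer is used here only so the example runs without touching the filesystem.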

Documentation Overview

The documentation is organized into five parts: Getting Started, User Guide, Tutorials (Jupyter Notebooks), Reference, and Advanced.

Citation

If you use GenAIRR in your research, please cite:

GenAIRR — Briefings in Bioinformatics, 2024. DOI: 10.1093/bib/bbae556