Skip to content

Parameter Reference

Detailed explanation of every configurable parameter in GenAIRR, organized by component.


Mutation Model Parameters

Both S5F and Uniform accept the same base parameters:

Parameter Type Default Range Description
min_mutation_rate float 0.0 0.0 – 1.0 Lower bound of the mutation rate range. Each simulated sequence samples a rate uniformly from [min, max].
max_mutation_rate float 0.0 0.0 – 1.0 Upper bound of the mutation rate range. Setting both to 0 produces unmutated sequences.
productive bool False When True, the mutation process avoids introducing stop codons and preserves CDR3 anchor residues (Cys/Trp).

S5F only:

Parameter Type Default Description
custom_model str None Path to a custom 5-mer substitution model file. If None, uses the built-in S5F targeting model.
Biological scenario min_mutation_rate max_mutation_rate
Unmutated (germline) 0.0 0.0
Naive B cells 0.001 0.01
Memory B cells 0.02 0.08
Plasma cells 0.05 0.25
Germinal center (active) 0.01 0.15
Mixed/broad repertoire 0.003 0.25
TCR (minimal SHM) 0.0 0.01

SimulateSequence Parameters

Parameter Type Default Description
mutation_model MutationModel required An S5F or Uniform instance that controls how mutations are applied.
productive bool False If True, the step retries until it generates an in-frame sequence without stop codons. Roughly 1/3 of random rearrangements are productive.
specific_v VAllele None Force a specific V allele instead of random weighted selection. Obtain from config.v_alleles['GENE_NAME'][index].
specific_d DAllele None Force a specific D allele. Only applicable for chains with D segments (heavy, TRB).
specific_j JAllele None Force a specific J allele.

Notes: - When specific_v/d/j are None, alleles are selected randomly using the empirical usage weights in the DataConfig. - You can specify any combination — e.g., fix V and J but randomize D.


CorruptSequenceBeginning Parameters

All parameters are keyword-only (must use param=value syntax).

Parameter Type Default Description
probability float 0.7 Probability that corruption is applied to a given sequence. Set to 0.0 to disable.
event_weights tuple[float, float, float] (0.4, 0.4, 0.2) Relative weights for three corruption event types: (1) add random nucleotides to the 5' end, (2) remove nucleotides from the 5' end, (3) remove then add. Weights are normalized internally; they do not need to sum to 1.
nucleotide_add_coefficient int 210 Shape parameter controlling the distribution of how many nucleotides are added. Higher values produce longer additions.
nucleotide_remove_coefficient int 310 Shape parameter controlling how many nucleotides are removed. Higher values produce more aggressive trimming.
nucleotide_add_after_remove_coefficient int 50 Shape parameter for the add-after-remove event. Typically smaller than the add coefficient.
random_sequence_add_probability float 1.0 Probability that added nucleotides are random (vs. matching the germline). 1.0 = always random; 0.0 = always germline-matching.

EnforceSequenceLength Parameters

Parameter Type Default Description
max_length int 576 Maximum allowed sequence length in nucleotides. Sequences longer than this are truncated from the 3' end. Choose based on your sequencing platform's read length.

Common values:

Platform Typical max_length
Illumina MiSeq (2x300 paired) 576
Illumina MiSeq (2x250 paired) 480
Illumina NextSeq (2x150) 280
PacBio / Nanopore 1000+ (or omit this step)

InsertNs Parameters

Parameter Type Default Description
n_ratio float 0.02 Fraction of sequence positions to replace with 'N'. A value of 0.02 means ~2% of bases become ambiguous.
probability float 0.5 Probability of applying any N-insertion to a given sequence. When the event fires, n_ratio fraction of positions are replaced.

InsertIndels Parameters

Parameter Type Default Description
probability float 0.5 Probability of applying indels to a given sequence.
max_indels int 5 Maximum number of indel events per sequence. The actual count is sampled uniformly from [1, max_indels].
insertion_probability float 0.5 Relative weight for insertion events.
deletion_probability float 0.5 Relative weight for deletion events. The ratio of insertion_probability to deletion_probability determines the insertion/deletion balance.

ShortDValidation Parameters

Parameter Type Default Description
short_d_length int 5 D segments shorter than this threshold (in nucleotides) are flagged as "short" and their annotations may be adjusted.

Position Correction Steps

The following steps take no parameters:

Step Applicable chains
FixVPositionAfterTrimmingIndexAmbiguity() All
FixDPositionAfterTrimmingIndexAmbiguity() Heavy, TRB
FixJPositionAfterTrimmingIndexAmbiguity() All
CorrectForVEndCut() All
CorrectForDTrims() Heavy, TRB
DistillMutationRate() All
FilterTCRDJAmbiguities() TRB only

Seed Management

Function Parameter Description
set_seed(seed) seed: int Sets the global random seed across all RNGs (Python random, NumPy). Enables deterministic output.
get_seed() Returns the current seed value, or None if not set.
reset_seed() Clears the seed, returning to non-deterministic behavior.

Pipeline Constructor

Parameter Type Default Description
steps List[AugmentationStep] required Ordered list of steps to execute.
config DataConfig None The germline data configuration. Passed to each step automatically. Required for all pipelines that include SimulateSequence.

simulate() Convenience Function

Parameter Type Default Description
config DataConfig required Germline data configuration.
mutation_model MutationModel required S5F or Uniform instance.
productive bool True Only generate productive sequences.
n int 1 Number of sequences. Returns a list when n > 1.