Advanced Guide to GenAIRR¶
Welcome to the advanced guide on using the GenAIRR Python package. In this notebook, we'll explore how to generate custom sequences and perform various manipulations using GenAIRR. This guide assumes familiarity with the basic functionalities of GenAIRR, as covered in the Quick Start Guide.
What You'll Learn¶
- Generating custom sequences with specific alleles
- Advanced manipulations of sequences
- Controlling allele selection in pipelines
Let's get started!
Loading the Built-in Heavy Chain Data Configuration¶
In this section, we load the built-in heavy chain data configuration object from the GenAIRR package. This object contains predefined configurations, reference alleles and metadata which is essential for working with heavy chain sequence data.
from GenAIRR import Pipeline, steps, S5F, Uniform, HUMAN_IGH_OGRDB
# Load the built-in heavy chain data config object
dataconfig = HUMAN_IGH_OGRDB
dataconfig
Extracting Allele Objects from the Data Configuration¶
The alleles from the reference file are stored as Allele objects within a dictionary, where each gene family is mapped to a list of corresponding alleles. In this section, we extract each Allele object from these lists, converting them into a more accessible list format. This step simplifies access but is optional -- if you know the specific allele you need, you can directly access it using key indexing.
# Extract alleles from the gene family dictionary into flat lists
v_alleles = [i for j in dataconfig.v_alleles for i in dataconfig.v_alleles[j]]
d_alleles = [i for j in dataconfig.d_alleles for i in dataconfig.d_alleles[j]]
j_alleles = [i for j in dataconfig.j_alleles for i in dataconfig.j_alleles[j]]
c_alleles = [i for j in dataconfig.c_alleles for i in dataconfig.c_alleles[j]]
# Example of the first Allele object in the V allele list
v_alleles[0]
Creating a Custom Naive Sequence¶
In this section, we create a custom naive sequence by passing a list of selected V, D, J, and C (Constant) alleles to the HeavyChainSequence class. This allows us to generate a sequence based on specific alleles of interest.
from GenAIRR.sequence import HeavyChainSequence
selected_v_allele = v_alleles[0]
selected_d_allele = d_alleles[0]
selected_j_allele = j_alleles[0]
selected_c_allele = c_alleles[0]
# Pass V, D, J, and C alleles to create a naive sequence
naive_sequence = HeavyChainSequence(
[selected_v_allele, selected_d_allele, selected_j_allele, selected_c_allele],
dataconfig
)
naive_sequence
Viewing the Naive Sequence¶
The HeavyChainSequence object now contains the resulting naive sequence information. This sequence is free from mutations, corruption, or N nucleotides, representing a pure recombined sequence.
print(naive_sequence.v_allele)
print(naive_sequence.d_allele)
print(naive_sequence.j_allele)
print(naive_sequence.c_allele)
print("V Allele 3' Trim Length = ", naive_sequence.v_trim_3)
print("V Allele 5' Trim Length = ", naive_sequence.v_trim_5)
print("D Allele 3' Trim Length = ", naive_sequence.d_trim_3)
print("D Allele 5' Trim Length = ", naive_sequence.d_trim_5)
print("J Allele 3' Trim Length = ", naive_sequence.j_trim_3)
print("J Allele 5' Trim Length = ", naive_sequence.j_trim_5)
print("C Allele 3' Trim Length = ", naive_sequence.c_trim_3)
print("C Allele 5' Trim Length = ", naive_sequence.c_trim_5)
print('Junction: ', naive_sequence.junction)
print('Is Functional:', naive_sequence.functional)
print('NP 1 Region Length: ', naive_sequence.NP1_length)
print('NP 1 Region Sequence: ', naive_sequence.NP1_region)
print('NP 2 Region Length: ', naive_sequence.NP2_length)
print('NP 2 Region Sequence: ', naive_sequence.NP2_region)
print('Sequence: ', naive_sequence.ungapped_seq)
Introducing Mutations to the Sequence¶
In this section, we introduce mutations into the heavy chain sequence. This allows us to simulate more realistic scenarios where sequences are not perfect, reflecting the natural variability found in biological systems.
# Apply mutations using the Uniform mutation model
mutation_model = Uniform(min_mutation_rate=0.03, max_mutation_rate=0.1)
naive_sequence.mutate(mutation_model)
print('Mutated Sequence: ', naive_sequence.mutated_seq, '\n')
print('Simulated Mutations: ', naive_sequence.mutations)
# The key is the position, the value shows original>mutated nucleotide
# e.g., 144:T>G means T at position 144 was changed to G
Note that rerunning naive_sequence.mutate(mutation_model) will keep generating new mutated sequences, without changing any of the naive sequence properties. Only the mutated_seq variable will change.
Controlling Allele Selection in Augmented Sequences¶
In many cases, especially in benchmarking settings, you might want to control which alleles the pipeline uses. The SimulateSequence step allows you to specify a particular V, D, J allele, or any combination of them as parameters. This forces the pipeline to use the specified alleles.
The specific_v, specific_d, and specific_j arguments are optional. If any of these parameters are not provided, the pipeline will randomly sample an allele for that gene from the DataConfig.
If none of the three alleles are specified, the method will generate a completely random sequence. However, if you provide specific V and J alleles and run the method in a loop, you'll obtain multiple sequences with varying D alleles (as they are randomly sampled). These sequences will share the same V and J alleles but will have different mutations, noise, and corruption levels.
specific_v = v_alleles[0]
# Create the simulation pipeline with a specific V allele
pipeline = Pipeline(
config=HUMAN_IGH_OGRDB,
steps=[
steps.SimulateSequence(
S5F(min_mutation_rate=0.003, max_mutation_rate=0.25),
productive=True,
specific_v=specific_v # Force this V allele
),
steps.FixVPositionAfterTrimmingIndexAmbiguity(),
steps.FixDPositionAfterTrimmingIndexAmbiguity(),
steps.FixJPositionAfterTrimmingIndexAmbiguity(),
steps.CorrectForVEndCut(),
steps.CorrectForDTrims(),
steps.DistillMutationRate(),
steps.CorruptSequenceBeginning(
probability=0.7,
event_weights=(0.4, 0.4, 0.2),
nucleotide_add_coefficient=210,
nucleotide_remove_coefficient=310,
nucleotide_add_after_remove_coefficient=50,
),
steps.EnforceSequenceLength(max_length=576),
steps.InsertNs(n_ratio=0.02, probability=0.5),
steps.ShortDValidation(),
steps.InsertIndels(probability=0.5, max_indels=5, insertion_probability=0.5, deletion_probability=0.5),
]
)
# Simulate a heavy chain sequence
heavy_sequence = pipeline.execute()
print("Simulated Heavy Chain Sequence:", heavy_sequence.get_dict())