Introduction to the DataConfig Object¶
The DataConfig class is a vital component of the GenAIRR package, designed to manage and organize the various configurations needed for immunoglobulin sequence generation, allele usage, trimming, and mutation simulations. This object serves as a centralized hub for storing and accessing all the essential data required during simulations and analyses.
Key Attributes of DataConfig¶
- family_use_dict: Manages the usage frequencies of gene families, helping to simulate realistic gene family distributions. (Currently not used; alleles are selected uniformly instead.)
- gene_use_dict: Similar to family_use_dict, but focuses on individual gene usage frequencies.
- trim_dicts: Contains information on how to trim gene segments (V, D, J) during sequence generation.
- NP_transitions & NP_first_bases: These dictionaries define the transition probabilities and initial base probabilities for NP regions (the N and P nucleotide additions at the junctions between gene segments), which are crucial for simulating realistic sequences.
- NP_lengths: Provides the distribution of NP region lengths, adding another layer of realism to the sequence generation process.
- v_alleles, d_alleles, j_alleles, c_alleles: These dictionaries store allele information for V, D, J, and C gene segments, respectively, organized by family.
- correction_maps: Maps used for correcting or adjusting sequences or simulation parameters, ensuring that generated sequences meet specific criteria.
- asc_tables: Stores allele similarity cluster (ASC) tables, which group alleles based on sequence similarity and other criteria, providing insights into allele relationships.
The DataConfig object is integral to ensuring that simulations and sequence analyses are conducted with accurate and relevant data. Throughout this notebook, we will explore how to utilize DataConfig to configure and manage data effectively for your specific research needs.
Note that GenAIRR relies on all of the attributes above being present and in the correct format. Keep this in mind if you modify an existing DataConfig or create a custom one.
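As a rough illustration of that requirement, the sketch below checks whether an object exposes every attribute named in the list above. The helper missing_attributes and the stand-in object are hypothetical and not part of GenAIRR; the check covers presence only, not format.

```python
from types import SimpleNamespace

# Attribute names taken from the list of key attributes above.
REQUIRED_ATTRIBUTES = (
    "family_use_dict", "gene_use_dict", "trim_dicts",
    "NP_transitions", "NP_first_bases", "NP_lengths",
    "v_alleles", "d_alleles", "j_alleles", "c_alleles",
    "correction_maps", "asc_tables",
)


def missing_attributes(config):
    """Return the required attribute names that `config` does not define.

    Hypothetical helper for illustration; it does not validate the
    contents or format of any attribute.
    """
    return [name for name in REQUIRED_ATTRIBUTES if not hasattr(config, name)]


# Stand-in object for illustration; a real DataConfig would be passed instead.
partial_config = SimpleNamespace(trim_dicts={}, NP_lengths={})
print(missing_attributes(partial_config))
```

An empty list means every expected attribute exists; any names returned point to what a custom configuration still needs.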
Let's begin by diving into the structure and examples of the DataConfig object!
from GenAIRR.data import HUMAN_IGH_OGRDB
from GenAIRR.dataconfig import DataConfig
# Use the built-in heavy chain data configuration
heavychain_dataconfig = HUMAN_IGH_OGRDB
Trim Dictionary (trim_dicts)¶
The Trim Dictionary is a multi-level dictionary housed within each DataConfig object. This structure organizes trimming information based on gene and side (e.g., 5' or 3'). The keys follow the format of gene_side, such as V_3 for the 3' end of the V gene.
For each gene_side key, there are sub-keys representing all the gene families available in the reference. Under each family sub-key there is a further level keyed by gene (for example, IGHVF1-G1). Each gene maps the possible trimming lengths for its alleles to the likelihood of each length being selected.
Modifying this dictionary within the DataConfig object allows you to control the trimming lengths applied to specific gene-side and family combinations during sequence generation.
heavychain_dataconfig.trim_dicts['V_3']['IGHVF1']
{'IGHVF1-G1': defaultdict(float,
{0: 0.2136996827285346,
1: 0.0882906999801705,
2: 0.2842987804878049,
3: 0.1517759766012294,
4: 0.1780314297045409,
5: 0.0283127602617489,
6: 0.0390206722189173,
7: 0.0038915328177672,
8: 0.0087621455482847,
9: 0.0017536684513186,
10: 0.00098527662105889,
11: 0.0007064247471743,
12: 4.9573666468372e-05,
13: 7.4360499702558e-05,
14: 3.09835415427325e-05,
15: 1.85901249256395e-05,
16: 4.33769581598255e-05,
17: 5.57703747769185e-05,
18: 4.33769581598255e-05,
19: 3.09835415427325e-05,
20: 6.1967083085465e-06,
21: 1.2393416617093e-05,
22: 2.4786833234186e-05,
23: 6.1967083085465e-06,
25: 6.1967083085465e-06,
28: 6.1967083085465e-06,
29: 6.1967083085465e-06,
30: 6.1967083085465e-06,
33: 6.1967083085465e-06,
38: 6.1967083085465e-06,
39: 6.1967083085465e-06,
48: 6.1967083085465e-06,
52: 6.1967083085465e-06,
66: 6.1967083085465e-06,
73: 6.1967083085465e-06,
76: 6.1967083085465e-06}),
'IGHVF1-G2': defaultdict(float,
{0: 0.2136996827285346,
1: 0.0882906999801705,
2: 0.2842987804878049,
3: 0.1517759766012294,
4: 0.1780314297045409,
5: 0.0283127602617489,
6: 0.0390206722189173,
7: 0.0038915328177672,
8: 0.0087621455482847,
9: 0.0017536684513186,
10: 0.00098527662105889,
11: 0.0007064247471743,
12: 4.9573666468372e-05,
13: 7.4360499702558e-05,
14: 3.09835415427325e-05,
15: 1.85901249256395e-05,
16: 4.33769581598255e-05,
17: 5.57703747769185e-05,
18: 4.33769581598255e-05,
19: 3.09835415427325e-05,
20: 6.1967083085465e-06,
21: 1.2393416617093e-05,
22: 2.4786833234186e-05,
23: 6.1967083085465e-06,
25: 6.1967083085465e-06,
28: 6.1967083085465e-06,
29: 6.1967083085465e-06,
30: 6.1967083085465e-06,
33: 6.1967083085465e-06,
38: 6.1967083085465e-06,
39: 6.1967083085465e-06,
48: 6.1967083085465e-06,
52: 6.1967083085465e-06,
66: 6.1967083085465e-06,
73: 6.1967083085465e-06,
76: 6.1967083085465e-06}),
'IGHVF1-G3': defaultdict(float,
{0: 0.2136996827285346,
1: 0.0882906999801705,
2: 0.2842987804878049,
3: 0.1517759766012294,
4: 0.1780314297045409,
5: 0.0283127602617489,
6: 0.0390206722189173,
7: 0.0038915328177672,
8: 0.0087621455482847,
9: 0.0017536684513186,
10: 0.00098527662105889,
11: 0.0007064247471743,
12: 4.9573666468372e-05,
13: 7.4360499702558e-05,
14: 3.09835415427325e-05,
15: 1.85901249256395e-05,
16: 4.33769581598255e-05,
17: 5.57703747769185e-05,
18: 4.33769581598255e-05,
19: 3.09835415427325e-05,
20: 6.1967083085465e-06,
21: 1.2393416617093e-05,
22: 2.4786833234186e-05,
23: 6.1967083085465e-06,
25: 6.1967083085465e-06,
28: 6.1967083085465e-06,
29: 6.1967083085465e-06,
30: 6.1967083085465e-06,
33: 6.1967083085465e-06,
38: 6.1967083085465e-06,
39: 6.1967083085465e-06,
48: 6.1967083085465e-06,
52: 6.1967083085465e-06,
66: 6.1967083085465e-06,
73: 6.1967083085465e-06,
76: 6.1967083085465e-06})}
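To make the nesting concrete, here is a minimal sketch of how a trimming length could be drawn from such a distribution, and how one entry could be overridden. It uses a small toy dictionary with made-up probabilities rather than the real configuration. The helpers sample_trim_length and set_fixed_trim are hypothetical, and whether GenAIRR samples in exactly this way internally is an assumption.

```python
import random

# Hypothetical miniature of trim_dicts:
# gene_side -> family -> gene -> {trim length: probability}.
toy_trim_dicts = {
    "V_3": {
        "IGHVF1": {
            "IGHVF1-G1": {0: 0.5, 1: 0.3, 2: 0.2},
        }
    }
}


def sample_trim_length(trim_dicts, gene_side, family, gene, rng=random):
    """Draw one trimming length, weighted by its stored probability."""
    distribution = trim_dicts[gene_side][family][gene]
    lengths = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(lengths, weights=weights, k=1)[0]


def set_fixed_trim(trim_dicts, gene_side, family, gene, length):
    """Replace a distribution so that `length` is the only possible outcome."""
    trim_dicts[gene_side][family][gene] = {length: 1.0}


rng = random.Random(0)
print(sample_trim_length(toy_trim_dicts, "V_3", "IGHVF1", "IGHVF1-G1", rng))

# Force "no trimming" at the 3' end of V for this gene.
set_fixed_trim(toy_trim_dicts, "V_3", "IGHVF1", "IGHVF1-G1", 0)
print(sample_trim_length(toy_trim_dicts, "V_3", "IGHVF1", "IGHVF1-G1", rng))  # always 0
```

The same pattern applied to heavychain_dataconfig.trim_dicts would change the distribution for one gene_side, family, and gene combination while leaving the others untouched.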
NP Region Generation Parameters¶
The DataConfig object contains three crucial components that guide the generation of NP regions during sequence simulation.
1. NP First Bases (NP_first_bases)¶
The first component is NP_first_bases, which is a multi-level dictionary. The top-level key represents the specific NP region of interest, either NP1 or NP2. In cases where there is no D allele, such as in light chains, only NP1 exists. The inner dictionary provides the probabilities of the NP region starting with each of the four nucleotides (A, T, C, G).
When simulating NP regions in GenAIRR, a first-order Markov chain is used. The NP_first_bases dictionary provides the initial state probabilities for this Markov chain. For example, when generating the NP1 region, the first nucleotide is sampled based on the weights (likelihoods) defined in dataconfig.NP_first_bases["NP1"].
2. Markov Chain Transition Matrix (NP_transitions)¶
The second component is the Markov chain transition matrix, stored in the NP_transitions dictionary. This is also a multi-level dictionary with several layers that define how the NP region evolves as nucleotides are added.
- Top-Level Key: Similar to NP_first_bases, the first key in NP_transitions specifies the NP region type (NP1 or NP2). For instance, dataconfig.NP_transitions['NP1'] retrieves the transition matrix used for generating the NP1 region.
- Second-Level Key: The next level in the dictionary corresponds to the position within the NP region. For example, if you are generating the 5th nucleotide in the sequence, you would use dataconfig.NP_transitions['NP1'][4] to access the relevant transition probabilities.
- Third-Level Key: At this level, the key corresponds to the current nucleotide observed at the specific position. If the 4th position in the generated NP region is a "T", you would query dataconfig.NP_transitions['NP1'][4]["T"]. This returns a distribution that allows you to sample the next nucleotide (5th in this case), continuing the process for the entire length of the NP region.
This loop repeats until the NP region reaches its predetermined length.
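The procedure described above can be sketched as a position-dependent first-order Markov chain. This is an illustration on toy parameters, not GenAIRR's own implementation. The function generate_np_region is hypothetical, and exactly how the position key lines up with the current versus the next base is an assumption stated in the code.

```python
import random

BASES = ("A", "C", "G", "T")


def generate_np_region(first_bases, transitions, length, rng=random):
    """Generate an NP sequence of `length` bases.

    Assumption (not confirmed by the source): transitions[position][base]
    is the distribution of the base that follows `base`, where `position`
    is the 0-based index of `base` within the NP region.
    """
    if length == 0:
        return ""
    # Initial state: sample the first base from the starting distribution.
    weights = [first_bases.get(b, 0.0) for b in BASES]
    current = rng.choices(BASES, weights=weights, k=1)[0]
    sequence = [current]
    # Each further base depends on the position and the previous base.
    for position in range(length - 1):
        next_dist = transitions[position][current]
        weights = [next_dist.get(b, 0.0) for b in BASES]
        current = rng.choices(BASES, weights=weights, k=1)[0]
        sequence.append(current)
    return "".join(sequence)


# Toy parameters (made-up values) in the same layout as
# NP_first_bases["NP1"] and NP_transitions["NP1"].
uniform = {b: 0.25 for b in BASES}
toy_first_bases = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
toy_transitions = {pos: {b: dict(uniform) for b in BASES} for pos in range(10)}

print(generate_np_region(toy_first_bases, toy_transitions, 6, random.Random(1)))
```

Passing a seeded random.Random instance keeps a run reproducible, which is convenient when comparing the effect of edited transition probabilities.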
3. NP Region Length Distribution (NP_lengths)¶
The third component is the NP region length distribution, stored in the NP_lengths dictionary. This is a two-level dictionary where the top-level key specifies the NP region (NP1 or NP2). The value for each key is a distribution of likelihoods over the possible lengths for that NP region.
This distribution defines the variety of lengths that can occur in the NP regions during simulation, allowing for more realistic sequence generation.
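As a small sketch of how such a length distribution can be used, the code below draws one length and computes the mean length implied by the probabilities. The values are made up and only mirror the layout of NP_lengths["NP1"]; the helper names are hypothetical.

```python
import random

# Toy distribution (made-up values) in the same layout as NP_lengths["NP1"].
toy_np_lengths = {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}


def sample_np_length(length_dist, rng=random):
    """Draw one NP region length, weighted by its probability."""
    lengths = list(length_dist)
    weights = [length_dist[n] for n in lengths]
    return rng.choices(lengths, weights=weights, k=1)[0]


def expected_np_length(length_dist):
    """Probability-weighted mean length.

    Divides by the total mass in case the values do not sum to exactly 1.
    """
    total = sum(length_dist.values())
    return sum(n * p for n, p in length_dist.items()) / total


print(sample_np_length(toy_np_lengths, random.Random(42)))
print(expected_np_length(toy_np_lengths))
```

Comparing the expected length before and after editing a distribution is a quick way to see how a change shifts the simulated NP regions.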
heavychain_dataconfig.NP_first_bases
{'NP1': {'A': 0.11170254294101757,
'C': 0.24612237873865697,
'G': 0.28458098488427797,
'T': 0.3575940934360475},
'NP2': {'A': 0.16727374243138382,
'C': 0.3712668132552802,
'G': 0.27509575963731403,
'T': 0.18636368467602188}}
heavychain_dataconfig.NP_transitions['NP1'][0]['T']
{'A': 0.16940029106419394,
'C': 0.39025137900329543,
'G': 0.23279850358164553,
'T': 0.2075498263508651}
heavychain_dataconfig.NP_lengths['NP1']
{0: 0.05265143785167222,
1: 0.04075110575704825,
2: 0.05608972515275564,
3: 0.07007809677350897,
4: 0.07858552952587226,
5: 0.07975324952254308,
6: 0.07758633439139435,
7: 0.07449997706055178,
8: 0.06841003801143569,
9: 0.05997446760685735,
10: 0.05486857812958784,
11: 0.04679214939837378,
12: 0.04113627082942587,
13: 0.034565642417839555,
14: 0.0284863084612022,
15: 0.024700616824390856,
16: 0.020714718727390977,
17: 0.01551780635633689,
18: 0.013459847504451441,
19: 0.010925362147151564,
20: 0.009099157910326753,
21: 0.007894751794817339,
22: 0.006256334057027708,
23: 0.005212570520841891,
24: 0.004889068681041339,
25: 0.003603860948874375,
26: 0.0031465202209132598,
27: 0.0026087014130785004,
28: 0.0022806267502488496,
29: 0.0018500272959719036,
30: 0.0017848341530929469,
31: 0.0014223901128996882,
32: 0.0012977542751098972,
33: 0.0008708713162783527,
34: 0.0013054161439008335,
35: 0.0007111989391616005,
36: 0.0005481891501906322,
37: 0.0006125581214754118,
38: 0.0005807192802702009,
39: 0.000407270072130761,
40: 0.0003821994072440937}