codonbias package
This package provides analysis tools for genomic sequences, focusing on protein coding regions, translation efficiency and synonymous mutations. These include implementations of popular models from the past four decades of codon usage study, such as:
Nucleotide and codon k-mer statistics (GC, GC3, CpG, etc.)
Frequency of Optimal Codons (FOP)
Relative Synonymous Codon Usage (RSCU)
- Codon Adaptation Index (CAI), including extensions:
Codon pair (and k-mers) adaptation
- Effective Number of Codons (ENC), including extensions:
Background correction
Improved estimation
Effective number of codon pairs (and k-mers) (ENcp)
tRNA Adaptation Index (tAI)
Codon Pair Bias (CPB/CPS)
Relative Codon Bias Score (RCBS)
Normalized Translational Efficiency (nTE)
Directional Codon Bias Score (DCBS)
Codon Usage Frequency Similarity (CUFS)
This package also includes tools for sequence optimization based on these codon usage models, and generators of random sequence permutations that can be used to compute empirical p-values and z-scores.
The package contains 6 submodules:
codonbias.stats: Classes for basepair / codon statistics.
codonbias.scores: Models / scores that operate on individual sequences independently.
codonbias.pairwise: Models / scores that operate on pairs of sequences.
codonbias.optimizers: Algorithms for score-based optimization of a sequence.
codonbias.random: Random sequence permutations for empirical z-scores and p-values.
codonbias.utils: Helper functions for the other submodules.
Submodules
codonbias.optimizers module
- class codonbias.optimizers.BalancedWeight(weights=None, model=None, higher_is_better=True, genetic_code=1)
Bases: WeightOptimizer
Optimizes the amino acid sequence by selecting synonymous codons with a probability proportional to their weight. This generates a balanced codon distribution, with more optimal codons appearing at higher frequencies.
- Parameters
weights (pd.Series, optional) – Codon weights, according to which optimization will encode the sequence, by default None
model (scores.ScalarScore, optional) – Codon model object with a weights property, by default None
higher_is_better (bool, optional) – Defines the direction of the weights for the optimization, by default True
genetic_code (int, optional) – NCBI genetic code ID, by default 1
- optimize(seq_aa)
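As a rough illustration of weight-proportional selection, the sketch below samples one synonymous codon per residue using only the standard library. It is not the package's implementation: `balanced_encode`, the hard-coded standard genetic code, and the plain dict of made-up weights are assumptions for this example.

```python
import random
from itertools import product

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def balanced_encode(seq_aa, weights, seed=0):
    """Hypothetical helper: sample each codon with probability proportional to its weight."""
    rng = random.Random(seed)
    synonymous = {}
    for codon, aa in CODON_TO_AA.items():
        synonymous.setdefault(aa, []).append(codon)
    encoded = []
    for aa in seq_aa:
        codons = synonymous[aa]
        # Assumes every codon has a positive weight in `weights`.
        codon_weights = [weights[c] for c in codons]
        encoded.append(rng.choices(codons, weights=codon_weights, k=1)[0])
    return "".join(encoded)


weights = {codon: 1.0 for codon in CODON_TO_AA}  # made-up weights
weights["GCC"] = 3.0  # GCC is drawn for alanine half of the time on average
dna = balanced_encode("MAAK", weights)
```

The output is random by design, but it always translates back to the input protein.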
- class codonbias.optimizers.MaxWeight(weights=None, model=None, higher_is_better=True, genetic_code=1)
Bases: WeightOptimizer
Optimizes the amino acid sequence by selecting synonymous codons with the highest weights.
- Parameters
weights (pd.Series, optional) – Codon weights, according to which optimization will encode the sequence, by default None
model (scores.ScalarScore, optional) – Codon model object with a weights property, by default None
higher_is_better (bool, optional) – Defines the direction of the weights for the optimization, by default True
genetic_code (int, optional) – NCBI genetic code ID, by default 1
- optimize(seq_aa)
- class codonbias.optimizers.MinWeight(weights=None, model=None, higher_is_better=True, genetic_code=1)
Bases: WeightOptimizer
Optimizes the amino acid sequence by selecting synonymous codons with the lowest weights.
- Parameters
weights (pd.Series, optional) – Codon weights, according to which optimization will encode the sequence, by default None
model (scores.ScalarScore, optional) – Codon model object with a weights property, by default None
higher_is_better (bool, optional) – Defines the direction of the weights for the optimization, by default True
genetic_code (int, optional) – NCBI genetic code ID, by default 1
- optimize(seq_aa)
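The deterministic choice described for MaxWeight and MinWeight can be sketched as follows. This is a standalone illustration, not the package's code: `extreme_encode` and the weights are hypothetical, and ties are broken by codon table order.

```python
from itertools import product

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def extreme_encode(seq_aa, weights, pick=max):
    """Hypothetical helper: choose the synonymous codon with the extreme weight."""
    encoded = []
    for aa in seq_aa:
        codons = [c for c, a in CODON_TO_AA.items() if a == aa]
        encoded.append(pick(codons, key=weights.__getitem__))
    return "".join(encoded)


weights = {codon: 1.0 for codon in CODON_TO_AA}  # made-up weights
weights.update({"GCC": 3.0, "GCA": 0.2, "AAG": 2.0, "AAA": 0.5})
high = extreme_encode("MAK", weights, pick=max)  # "ATGGCCAAG"
low = extreme_encode("MAK", weights, pick=min)   # "ATGGCAAAA"
```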
- class codonbias.optimizers.WeightOptimizer(weights=None, model=None, higher_is_better=True, genetic_code=1)
Bases: object
Abstract class for optimizers that use codon weights to choose between synonymous sequences.
- Parameters
weights (pd.Series, optional) – Codon weights, according to which optimization will encode the sequence, by default None
model (scores.ScalarScore, optional) – Codon model object with a weights property, by default None
higher_is_better (bool, optional) – Defines the direction of the weights for the optimization, by default True
genetic_code (int, optional) – NCBI genetic code ID, by default 1
- optimize(seq_aa)
codonbias.pairwise module
- class codonbias.pairwise.CodonUsageFrequency(synonymous=False, k_mer=1, genetic_code=1, ignore_stop=False, pseudocount=1, n_jobs=None)
Bases: PairwiseScore
Codon Usage Frequency Similarity (CUFS, Diament, Pinter & Tuller, Nature Communications, 2014).
This is a distance metric between pairs of sequences based on their distribution of codons. It employs a distance metric for probability distributions (Endres & Schindelin, 2003) that is based on KL divergence. The original implementation used pseudocount=0.
- Parameters
synonymous (bool, optional) – When True, synonymous codon frequencies are normalized to sum to 1 for each amino acid (synCUFS), by default False
k_mer (int, optional) – Determines the length of the codon k-mer to base statistics on, by default 1
genetic_code (int, optional) – NCBI genetic code ID, by default 1
ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default False
pseudocount (int, optional) – Pseudocount correction for normalized codon frequencies, by default 1
n_jobs (int, optional) – Number of processes to use for matrix computation, by default None
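To make the distance concrete, here is a minimal standard-library sketch of the Endres & Schindelin metric applied to codon frequencies. It assumes natural logarithms and frequencies over all 64 codons; `codon_frequencies` and `es_distance` are names invented for this example, and the package's own normalization may differ.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(seq, pseudocount=1):
    """Normalized frequencies over all 64 codons, with an additive pseudocount."""
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    total = sum(counts[c] + pseudocount for c in CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def es_distance(p, q):
    """Endres-Schindelin distance: square root of twice the Jensen-Shannon divergence."""
    total = 0.0
    for codon in CODONS:
        mid = p[codon] + q[codon]
        if p[codon] > 0:
            total += p[codon] * math.log(2 * p[codon] / mid)
        if q[codon] > 0:
            total += q[codon] * math.log(2 * q[codon] / mid)
    return math.sqrt(max(total, 0.0))  # guard against tiny negative rounding errors


p = codon_frequencies("ATGGCCGCCAAG")
q = codon_frequencies("ATGGCTGCAAAA")
d = es_distance(p, q)
```

Identical distributions give a distance of 0, and the value is bounded above by sqrt(2 ln 2) under these assumptions.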
- class codonbias.pairwise.PairwiseScore(n_jobs=None)
Bases: object
Abstract class for models that output a scalar for a pair of sequences, or a pairwise score matrix for a set of sequences. Inheriting classes may implement the computation of the score for a single pair in two steps: (1) a transformation of the sequence by _calc_weights(seq); and (2) a computation of the score by _calc_pair_score(w1, w2). The abstract class implements two wrapper methods that call the aforementioned internal implementations: get_score(seq1, seq2) and get_matrix(seqs). The latter assumes that the score is symmetric, and that the diagonal always contains zeros.
If a dedicated implementation for whole-matrix computation is provided in _calc_matrix(weights), get_matrix(seqs) will prefer it. This can be, for example, an efficient vectorized implementation of the computation.
- Parameters
n_jobs (int, optional) – Number of processes to use for matrix computation, by default None
- get_matrix(seqs, elementwise=False)
Computes the all pair score matrix for the given sequences.
- Parameters
seqs (iterable of str) – Set of DNA sequences.
elementwise (bool, optional) – When True matrix computation will be done element by element using multiple processes. This may be useful to decrease memory consumption, by default False
- Returns
Square matrix of scores for all pairs of the given sequences.
- Return type
numpy.array
- get_score(seq1, seq2)
Computes the score between the two given sequences.
- Parameters
seq1 (str) – DNA sequence.
seq2 (str) – DNA sequence.
- Returns
Score for seq1 and seq2.
- Return type
float
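The symmetry and zero-diagonal assumptions mean that only the upper triangle needs to be scored. The sketch below shows that idea in isolation; `score_matrix` and the toy `mismatch_fraction` score are hypothetical and unrelated to the package's internals.

```python
def score_matrix(items, pair_score):
    """Hypothetical helper: fill a symmetric matrix with a zero diagonal."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Each unordered pair is scored once and mirrored.
            matrix[i][j] = matrix[j][i] = pair_score(items[i], items[j])
    return matrix


def mismatch_fraction(a, b):
    """Toy pair score for equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


matrix = score_matrix(["ATGGCC", "ATGGCT", "ATGAAA"], mismatch_fraction)
```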
codonbias.random module
- class codonbias.random.IntraPosPermuter(property_func=<function translate>, n_samples=100, random_state=42, n_jobs=None, **kwargs)
Bases: Permuter
This permuter generates random sequences by shuffling the codons at each position across all sequences while preserving a defined property of the sequence. This null model can be used to return the shuffled sequences, or to estimate the z-score / p-value of weight vectors associated with the sequence.
The property (or properties) to be preserved by the permutation is defined using property_func. For example, the default property_func translates the sequence to amino acids, and therefore the permutation preserves the amino acid sequence. However, arbitrary properties may be defined. When n_samples equals zero, the permuter attempts to estimate the z-scores and p-values without actually permuting the sequences (very fast). This is especially useful and accurate for computing z-scores. While the resulting p-values are highly correlated with permutation results, they tend to be about 30% lower than permutation p-values on average, and up to 60% lower in the worst case.
- Parameters
property_func (function, optional) – Property-generating function that accepts a sequence as input and returns a pandas.DataFrame with property columns, by default utils.translate
n_samples (int, optional) – The number of permutations to generate for each sequence. When zero, the permuter attempts to estimate the z-scores and p-values without actually permuting the sequences, by default 100
random_state (int, optional) – Random seed for the permutation function, by default 42
n_jobs (int or None, optional) – Number of parallel processes to run. When set to None the permuter will use the number of available cores, by default None
kwargs – Parameters to be passed to the property_func.
See also
codonbias.random.Permuter: General-purpose permutation.
codonbias.random.IntraSeqPermuter: Within-sequence permutation.
- class codonbias.random.IntraSeqPermuter(property_func=<function translate>, n_samples=100, random_state=42, n_jobs=None, **kwargs)
Bases: Permuter
This permuter generates random sequences by shuffling the codons within each sequence while preserving a defined property of the sequence. This null model can be used to return the shuffled sequences, or to estimate the z-score / p-value of weight vectors associated with the sequence.
The property (or properties) to be preserved by the permutation is defined using property_func. For example, the default property_func translates the sequence to amino acids, and therefore the permutation preserves the amino acid sequence. However, arbitrary properties may be defined. When n_samples equals zero, the permuter attempts to estimate the z-scores and p-values without actually permuting the sequences (very fast). This is especially useful and accurate for computing z-scores. While the resulting p-values are highly correlated with permutation results, they tend to be about 30% lower than permutation p-values on average, and up to 60% lower in the worst case.
- Parameters
property_func (function, optional) – Property-generating function that accepts a sequence as input and returns a pandas.DataFrame with property columns, by default utils.translate
n_samples (int, optional) – The number of permutations to generate for each sequence. When zero, the permuter attempts to estimate the z-scores and p-values without actually permuting the sequences, by default 100
random_state (int, optional) – Random seed for the permutation function, by default 42
n_jobs (int or None, optional) – Number of parallel processes to run. When set to None the permuter will use the number of available cores, by default None
kwargs – Parameters to be passed to the property_func.
See also
codonbias.random.Permuter: General-purpose permutation.
codonbias.random.IntraPosPermuter: Positional permutation.
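A within-sequence shuffle that preserves the amino acid sequence can be sketched by permuting codons only among positions that encode the same amino acid. The code below is a standard-library illustration of that idea; `shuffle_synonymous` and `translate` are names chosen for this example and are not the package's functions.

```python
import random
from itertools import product

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq):
    return "".join(CODON_TO_AA[c] for c in split_codons(seq))


def shuffle_synonymous(seq, rng):
    """Shuffle codons among positions that share an amino acid."""
    codons = split_codons(seq)
    positions_by_aa = {}
    for pos, codon in enumerate(codons):
        positions_by_aa.setdefault(CODON_TO_AA[codon], []).append(pos)
    shuffled = list(codons)
    for positions in positions_by_aa.values():
        pool = [codons[p] for p in positions]
        rng.shuffle(pool)
        for p, codon in zip(positions, pool):
            shuffled[p] = codon
    return "".join(shuffled)


seq = "ATGGCCGCTGCAAAGAAA"
permuted = [shuffle_synonymous(seq, random.Random(seed)) for seed in range(5)]
```

Every permuted sequence keeps both the protein and the overall codon counts of the original.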
- class codonbias.random.Permuter(property_func=<function translate>, add_properties=[], n_samples=100, random_state=42, n_jobs=None, **kwargs)
Bases: object
This general-purpose permuter generates random sequences by shuffling codons within and between sequences while preserving a defined property of the sequence. This null model can be used to return the shuffled sequences, or to estimate the z-score / p-value of weight vectors associated with the sequence.
The property (or properties) to be preserved by the permutation is defined using property_func. For example, the default property_func translates the sequence to amino acids, and therefore the permutation preserves the amino acid sequence. However, arbitrary properties may be defined. When n_samples equals zero, the permuter attempts to estimate the z-scores and p-values without actually permuting the sequences (very fast). This is especially useful and accurate for computing z-scores. While the resulting p-values are highly correlated with permutation results, they tend to be about 30% lower than permutation p-values on average, and up to 60% lower in the worst case.
- Parameters
property_func (function, optional) – Property-generating function that accepts a sequence as input and returns a pandas.DataFrame with property columns, by default codonbias.utils.translate
n_samples (int, optional) – The number of permutations to generate for each sequence. When zero, the permuter attempts to estimate the z-scores and p-values without actually permuting the sequences, by default 100
random_state (int, optional) – Random seed for the permutation function, by default 42
n_jobs (int or None, optional) – Number of parallel processes to run. When set to None the permuter will use the number of available cores, by default None
kwargs – Parameters to be passed to the property_func.
See also
codonbias.random.IntraSeqPermuter: Within-sequence permutation.
codonbias.random.IntraPosPermuter: Positional permutation.
- get_permuted_seq(seqs, slice=None)
Computes n_samples permutations of the given sequences.
- Parameters
seqs (iterable of str) – DNA sequences.
slice (slice object, optional) – Optional slicing applied to all sequences prior to permutation, by default None
- Returns
Permuted sequences DataFrame with n_samples columns.
- Return type
pandas.DataFrame
- get_pval(vector, seqs, alternative='greater', slice=None, mapfunc=None, aggfunc=None, model_kws={})
Compute the p-value for each position in the vector using random permutations of the sequences. The parameter vector can be either a weights vector or a VectorScore model. If the latter is provided, the weights will be recomputed for each permuted sequence (slower), otherwise the weights vector itself will be permuted (faster).
- Parameters
vector (iterable or scores.VectorScore) – Weights to be permuted in order to compute the p-value, or a VectorScore model.
seqs (iterable of str) – DNA sequences.
slice (slice object, optional) – Optional slicing applied to all sequences and vectors prior to permutation, by default None
mapfunc (function, optional) – Optional map function to be applied to every vector, by default None
aggfunc (function, optional) – Optional agg function to aggregate all vectors, by default None
model_kws (dict, optional) – Optional keyword arguments to the VectorScore model’s get_vector function, by default {}
- Returns
P-values series with an entry for each input sequence that contains its p-values array.
- Return type
pandas.Series
- get_zscore(vector, seqs, slice=None, mapfunc=None, aggfunc=None, model_kws={})
Compute the z-score for each position in the vector using random permutations of the sequences. The parameter vector can be either a weights vector or a VectorScore model. If the latter is provided, the weights will be recomputed for each permuted sequence (slower), otherwise the weights vector itself will be permuted (faster).
- Parameters
vector (iterable or scores.VectorScore) – Weights to be permuted in order to compute the z-score, or a VectorScore model.
seqs (iterable of str) – DNA sequences.
slice (slice object, optional) – Optional slicing applied to all sequences and vectors prior to permutation, by default None
mapfunc (function, optional) – Optional map function to be applied to every vector, by default None
aggfunc (function, optional) – Optional agg function to aggregate all vectors, by default None
model_kws (dict, optional) – Optional keyword arguments to the VectorScore model’s get_vector function, by default {}
- Returns
Z-scores series with an entry for each input sequence that contains its z-scores array.
- Return type
pandas.Series
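Given an observed value and the values obtained from permuted sequences, empirical z-scores and p-values can be computed along these lines. This is a generic sketch under stated assumptions: the helper names are invented, the add-one p-value estimator is a common convention rather than a confirmed detail of this package, and "less" is assumed as the opposite of the documented `alternative='greater'`.

```python
import statistics


def empirical_zscore(observed, null_values):
    """Standardize the observed value against the permutation distribution."""
    mean = statistics.fmean(null_values)
    sd = statistics.pstdev(null_values)
    return (observed - mean) / sd if sd > 0 else float("nan")


def empirical_pvalue(observed, null_values, alternative="greater"):
    """One-sided permutation p-value with an add-one correction."""
    if alternative == "greater":
        extreme = sum(v >= observed for v in null_values)
    elif alternative == "less":  # assumed counterpart of "greater"
        extreme = sum(v <= observed for v in null_values)
    else:
        raise ValueError("alternative must be 'greater' or 'less'")
    # Add-one estimator: the p-value can never be exactly zero.
    return (extreme + 1) / (len(null_values) + 1)


null = [0.40, 0.45, 0.50, 0.55, 0.60]  # made-up scores of permuted sequences
z = empirical_zscore(0.70, null)
p = empirical_pvalue(0.70, null)
```

With only five permutations the smallest attainable p-value is 1/6, which is why larger n_samples values are needed for small p-values.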
codonbias.scores module
- class codonbias.scores.CodonAdaptationIndex(ref_seq, k_mer=1, genetic_code=1, ignore_stop=True, pseudocount=1)
Bases: ScalarScore, VectorScore
Codon Adaptation Index (CAI, Sharp & Li, NAR, 1987).
This model determines the level of optimality of codons based on their frequency in the given set of reference sequences ref_seq. For each amino acid, the most frequent synonymous codon receives a weight of 1, while other codons are weighted based on their relative frequency with respect to the most frequent synonymous codon. The returned vector for a sequence is an array with the weight of the corresponding codon in each position in the sequence. The score for a sequence is the geometric mean of these weights, and ranges from 0 (strong rare codon bias) to 1 (strong frequent codon bias).
This implementation extends the model to arbitrary codon k-mers using the k_mer parameter.
- Parameters
ref_seq (iterable of str) – Reference sequences for learning the codon frequencies.
k_mer (int, optional) – Determines the length of the k-mer to base statistics on, by default 1
genetic_code (int, optional) – NCBI genetic code ID, by default 1
ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default True
pseudocount (int, optional) – Pseudocount correction for normalized codon frequencies. This is effective when ref_seq contains few short sequences, by default 1
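The weighting scheme described above (relative frequency with respect to the most frequent synonymous codon, followed by a geometric mean) can be sketched for single codons as follows. This is a simplified standalone illustration, not the package's implementation: `cai_weights` and `cai` are hypothetical names, k-mers are not handled, and STOP codons are skipped.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def cai_weights(ref_seqs, pseudocount=1):
    """Weight of each codon relative to its most frequent synonym."""
    counts = Counter()
    for seq in ref_seqs:
        counts.update(split_codons(seq))
    weights = {}
    for aa in set(AMINO_ACIDS) - {"*"}:
        codons = [c for c, a in CODON_TO_AA.items() if a == aa]
        # A positive pseudocount keeps every weight above zero.
        top = max(counts[c] + pseudocount for c in codons)
        for c in codons:
            weights[c] = (counts[c] + pseudocount) / top
    return weights


def cai(seq, weights):
    """Geometric mean of the codon weights along the sequence."""
    logs = [math.log(weights[c]) for c in split_codons(seq) if c in weights]
    return math.exp(sum(logs) / len(logs))


w = cai_weights(["GCCGCCGCCGCT"])  # toy reference set
score = cai("GCCGCT", w)           # geometric mean of 1.0 and 0.5
```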
- class codonbias.scores.CodonPairBias(ref_seq, k_mer=2, genetic_code=1, ignore_stop=True, pseudocount=1)
Bases: ScalarScore, VectorScore, WeightScore
Codon Pair Bias (CPB/CPS, Coleman et al., Science, 2008).
This model is extended here to arbitrary codon k-mers. The model calculates the over- or under-representation of codon k-mers compared to a background distribution. Each k-mer receives a weight that is the log-ratio between its observed and expected probabilities. The returned vector for a sequence is an array with the weight of the corresponding k-mer in each position in the sequence. The score for a sequence is the mean of these weights, and ranges from a negative value (mostly under-represented pairs) to a positive value (mostly over-represented pairs).
- Parameters
ref_seq (iterable of str) – Reference sequences for learning the codon frequencies.
k_mer (int, optional) – Determines the length of the k-mer to base statistics on, by default 2
genetic_code (int, optional) – NCBI genetic code ID, by default 1
ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default True
pseudocount (int, optional) – Pseudocount correction for normalized codon frequencies. This is effective when ref_seq contains few short sequences, by default 1
- class codonbias.scores.EffectiveNumberOfCodons(k_mer=1, bg_correction=False, robust=True, pseudocount=1, mean='weighted', genetic_code=1)
Bases: ScalarScore, WeightScore
Effective Number of Codons (ENC, Wright, Gene, 1990).
This model measures the deviation of synonymous codon usage from uniformity based on a statistical model analogous to the effective number of alleles in genetics. The score for a sequence is the effective number of codons in use, and ranges from 20 (very strong bias: a single codon per amino acid) to 61 (uniform use of all codons). Thus, this score is expected to be negatively correlated with most other codon bias measures.
The model has also been extended to codon pairs by Alexaki et al. (JMB, 2019). The k_mer parameter can be used to calculate ENC for codon pairs as well as longer k-mers.
When bg_correction is True, a background correction procedure is performed as proposed by Novembre (MBE, 2002). This procedure estimates the background codon composition of each sequence using the independent probabilities of observing each of the 4 bases in the 3 codon positions. This implementation learns the nucleotide probabilities from the provided coding sequence. However, if the parameter background is given to get_score(), this background sequence will be used instead.
The parameters robust, pseudocount and mean introduce additional improvements to the estimation of the effective number as proposed by Sun, Yang & Xia (MBE, 2013). They are activated by default, and remove, for example, the strong dependency between ENC and sequence length.
- Parameters
k_mer (int, optional) – Extends the model to codon k-mers. For example, codon pairs, as suggested by Alexaki et al. (JMB, 2019), by default 1
bg_correction (bool, optional) – Background correction based on Novembre (MBE, 2002), by default False
robust (bool, optional) – Robust estimation of F values that is less sensitive to small counts. Proposed improvement by Sun, Yang & Xia (MBE, 2013), by default True
pseudocount (int, optional) – Pseudocounts added to codon statistics. Proposed improvement by Sun, Yang & Xia (MBE, 2013), by default 1
mean ({'weighted', 'unweighted'}, optional) – Weighted average of F across amino acids by their frequency. Proposed improvement by Sun, Yang & Xia (MBE, 2013), by default ‘weighted’
genetic_code (int, optional) – NCBI genetic code ID, by default 1
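For orientation, the original estimator of Wright (1990) for the standard genetic code is commonly written as below, where $n_a$ is the number of codons observed for amino acid $a$, $p_{a,i}$ is the relative frequency of its $i$-th synonymous codon, and $\bar{F}_d$ is the average of $F_a$ over amino acids with $d$ synonymous codons. The notation is chosen for this note, and the bg_correction, robust, pseudocount and mean options described above modify this basic form.

```latex
F_a = \frac{n_a \sum_{i} p_{a,i}^{2} - 1}{n_a - 1},
\qquad
\hat{N}_c = 2 + \frac{9}{\bar{F}_2} + \frac{1}{\bar{F}_3} + \frac{5}{\bar{F}_4} + \frac{3}{\bar{F}_6}
```

The two limits quoted above follow directly: when one codon is used per amino acid every $F_a = 1$ and $\hat{N}_c = 2 + 9 + 1 + 5 + 3 = 20$, while uniform usage gives $\bar{F}_d \to 1/d$ for long sequences and $\hat{N}_c = 2 + 18 + 3 + 20 + 18 = 61$.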
- class codonbias.scores.FrequencyOfOptimalCodons(ref_seq, thresh=0.95, genetic_code=1, ignore_stop=True, pseudocount=1)
Bases: ScalarScore, VectorScore
Frequency of Optimal Codons (FOP, Ikemura, J Mol Biol, 1981).
This model determines the optimal codons for each amino acid based on their frequency in the given set of reference sequences ref_seq. Multiple codons may be selected as optimal based on thresh. The score for a sequence is the fraction of codons in the sequence deemed optimal. The returned vector for a sequence is a binary array where optimal positions contain 1 and non-optimal ones contain 0.
- Parameters
ref_seq (iterable of str) – A set of reference DNA sequences for codon usage statistics.
thresh (float, optional) – Minimal ratio between the frequency of a codon and the most frequent one in order to be set as optimal, by default 0.95
genetic_code (int, optional) – NCBI genetic code ID, by default 1
ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default True
pseudocount (int, optional) – Pseudocount correction for normalized codon frequencies. This is effective when ref_seq contains few short sequences, by default 1
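The role of thresh can be illustrated with a small standard-library sketch: a codon is marked optimal when its (pseudocount-corrected) count is at least thresh times that of the most frequent synonym. The helper names are hypothetical, and the sketch counts every non-STOP codon, which may differ from the package's handling of single-codon amino acids.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def optimal_codons(ref_seqs, thresh=0.95, pseudocount=1):
    """Codons whose frequency is at least `thresh` times the top synonym's."""
    counts = Counter()
    for seq in ref_seqs:
        counts.update(split_codons(seq))
    optimal = set()
    for aa in set(AMINO_ACIDS) - {"*"}:
        codons = [c for c, a in CODON_TO_AA.items() if a == aa]
        top = max(counts[c] + pseudocount for c in codons)
        optimal.update(c for c in codons if (counts[c] + pseudocount) / top >= thresh)
    return optimal


def fop_vector(seq, optimal):
    """Binary vector: 1 for optimal codons, 0 otherwise (STOP codons skipped)."""
    return [int(c in optimal) for c in split_codons(seq) if CODON_TO_AA.get(c, "*") != "*"]


optimal = optimal_codons(["GCCGCCGCCGCT"])  # toy reference set
vector = fop_vector("GCCGCTGCC", optimal)   # [1, 0, 1]
score = sum(vector) / len(vector)           # 2 of 3 codons are optimal
```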
- class codonbias.scores.NormalizedTranslationalEfficiency(ref_seq, mRNA_counts, tGCN=None, url=None, genome_id=None, domain=None, prokaryote=False, s_values='dosReis', genetic_code=1)
Bases: ScalarScore, VectorScore
Normalized Translational Efficiency (nTE, Pechmann & Frydman, Nat. Struct. Mol. Biol., 2013).
This model computes a translational efficiency score that takes into account both supply (of tRNAs) and demand (codons being translated). Supply is computed based on the tRNA Adaptation Index (tAI), and demand is computed based on the sum of all codons in the genome weighted by their mRNA abundance (or ribosome occupancy, where available). Each codon receives a weight in [0, 1] that describes its translational efficiency. The returned vector for a sequence is an array with the weight of the corresponding codon in each position in the sequence. The score for a sequence is the geometric mean of these weights, and ranges from 0 (low efficiency) to 1 (high efficiency).
- Parameters
ref_seq (iterable of str) – Demand parameter: Will be used to count the codons across transcripts in a weighted sum
mRNA_counts (iterable of float) – Demand parameter: Will be used in the weighted sum of codons across transcripts
tGCN (pandas.DataFrame, optional) – Supply parameter: tRNA Gene Copy Numbers given as a DataFrame with the columns anti_codon, GCN, by default None
url (str, optional) – Supply parameter: URL of the relevant page on GtRNAdb, by default None
genome_id (str, optional) – Supply parameter: Genome ID of the organism, by default None
domain (str, optional) – Supply parameter: Taxonomic domain of the organism, by default None
prokaryote (bool, optional) – Supply parameter: Whether the organism is a prokaryote, by default False
s_values ({'dosReis', 'Tuller'}, optional) – Supply parameter: Coefficients of the tRNA-codon efficiency of coupling, by default ‘dosReis’
genetic_code (int, optional) – NCBI genetic code ID, by default 1
- class codonbias.scores.RelativeCodonBiasScore(directional=False, mean='geometric', genetic_code=1, ignore_stop=True, pseudocount=1)
Bases: ScalarScore, VectorScore, WeightScore
Relative Codon Bias Score (RCBS, Roymondal, Das & Sahoo, DNA Research, 2009).
This model measures the deviation of codon usage from a background distribution and computes for each codon the observed-to-expected ratio. The background distribution is estimated for each sequence separately, based on its nucleotide composition. The model’s null hypothesis is that the 3 codon positions are independently distributed according to the same nucleotide distribution. Thus, overrepresented codons are given higher weights while underrepresented codons are given lower weights. The score for a sequence is the geometric mean of codon ratios, minus 1. The returned vector for a sequence is an array with the ratio of the corresponding codon in each position in the sequence.
Sabi & Tuller (DNA Research, 2014) proposed a modified score based on these principles, termed the Directional Codon Bias Score (DCBS). In this model, underrepresented codons are given larger weights (rather than smaller weights), similarly to overrepresented codons. This model’s hypothesis is that biased sequences will typically include both highly overrepresented codons and underrepresented ones, and therefore both signals should contribute towards a higher (i.e., biased) score. This modification is activated by setting the directional parameter to True and the mean parameter to ‘arithmetic’.
- Parameters
directional (bool, optional) – When True will compute the modified version by Sabi & Tuller, by default False
mean ({'geometric', 'arithmetic'}, optional) – How to compute the score, by default ‘geometric’
genetic_code (int, optional) – NCBI genetic code ID, by default 1
ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default True
pseudocount (int, optional) – Pseudocount correction for normalized codon frequencies, by default 1
- class codonbias.scores.RelativeSynonymousCodonUsage(ref_seq=None, directional=False, mean='geometric', genetic_code=1, ignore_stop=True, pseudocount=1)
Bases: ScalarScore, VectorScore, WeightScore
Relative Synonymous Codon Usage (RSCU, Sharp & Li, NAR, 1986).
This model measures the deviation of synonymous codon usage from uniformity and returns for each codon the ratio between its observed frequency and its expected frequency if synonymous codons were chosen randomly (uniformly). Overrepresented codons will have a score > 1, while underrepresented codons will have a score < 1. get_weights() returns a vector of 61 RSCU ratios for each sequence. While not defined as part of the original Sharp & Li model, the get_vector() method returns an array with the ratio of the corresponding codon in each position in the sequence, and the get_score() method returns the geometric mean of the ratios for a sequence (minus 1), in a similar way to the Relative Codon Bias Score (RCBS). The directional parameter modifies RSCU similarly to the way the Directional Codon Bias Score (DCBS) modifies RCBS, by giving higher weights to both overrepresented and underrepresented codons.
- Parameters
ref_seq (iterable of str, optional) – When given, codon frequencies in the reference set will be used instead of the uniform codon distribution, by default None
directional (bool, optional) – When True will compute the modified version by Sabi & Tuller, by default False
mean ({'geometric', 'arithmetic'}, optional) – How to compute the score, by default ‘geometric’
genetic_code (int, optional) – NCBI genetic code ID, by default 1
ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default True
pseudocount (int, optional) – Pseudocount correction for normalized codon frequencies, by default 1
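The observed-to-expected ratio under uniform synonymous usage can be sketched as follows. This standalone example only covers the basic ratio (no ref_seq, directional or mean options), and `rscu` is a name chosen for illustration rather than a function of the package.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(seq, pseudocount=1):
    """Ratio of each codon's count to the count expected under uniform usage."""
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3))
    ratios = {}
    for aa in set(AMINO_ACIDS) - {"*"}:
        codons = [c for c, a in CODON_TO_AA.items() if a == aa]
        total = sum(counts[c] + pseudocount for c in codons)
        if total == 0:
            continue  # amino acid absent and no pseudocount: ratio undefined
        for c in codons:
            # Expected count under uniformity is total / len(codons).
            ratios[c] = (counts[c] + pseudocount) * len(codons) / total
    return ratios


ratios = rscu("GCCGCCGCCGCT")
# Alanine counts with pseudocount 1: GCC=4, GCT=2, GCA=1, GCG=1 (total 8),
# so the ratios are 2.0, 1.0, 0.5 and 0.5 respectively.
```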
- class codonbias.scores.ScalarScore
Bases: object
Abstract class for models that output a scalar per sequence. Inheriting classes may implement the computation of the score for a single sequence in the method _calc_score(seq). Parameters of the model may be initialized with the instance of the class.
- get_score(seq, slice=None, **kwargs)
Compute the score for a single, or multiple sequences. When slice is provided, all sequences will be sliced before computing the score.
- Parameters
seq (str or an iterable of str) – DNA sequence, or an iterable of ones.
slice (slice, optional) – Python slice object, by default None
- Returns
Score for each provided sequence.
- Return type
float or numpy.array
Examples
>>> EffectiveNumberOfCodons().get_score('ACGACGGAGGAG')
35.0
>>> EffectiveNumberOfCodons().get_score('ACGACGGAGGAG', slice=slice(6))
44.33333333333333
- class codonbias.scores.TrnaAdaptationIndex(tGCN=None, url=None, genome_id=None, domain=None, prokaryote=False, s_values='dosReis', genetic_code=1)
Bases: ScalarScore, VectorScore
tRNA Adaptation Index (tAI, dos Reis, Savva & Wernisch, NAR, 2004).
This model measures translational efficiency based on the availability of tRNAs (approximated by the gene copy number of each tRNA species), and the efficiency of coupling between tRNAs and codons (modeled via the set of s_values coefficients). Each codon receives a weight in [0, 1] that describes its translational efficiency. The returned vector for a sequence is an array with the weight of the corresponding codon in each position in the sequence. The score for a sequence is the geometric mean of these weights, and ranges from 0 (low efficiency) to 1 (high efficiency).
Gene copy numbers can be provided explicitly, or automatically downloaded from GtRNAdb.
The model was originally trained in S. cerevisiae and E. coli in order to maximize the correlation with mRNA levels measured via microarrays. The model was later refitted using protein abundance levels (Tuller et al., Genome Biology, 2011). The s_values parameter can be used to switch between these coefficients sets. When analyzing an organism that is a prokaryote, the prokaryote parameter should be set to True.
- Parameters
tGCN (pandas.DataFrame, optional) – tRNA Gene Copy Numbers given as a DataFrame with the columns anti_codon, GCN, by default None
url (str, optional) – URL of the relevant page on GtRNAdb, by default None
genome_id (str, optional) – Genome ID of the organism, by default None
domain (str, optional) – Taxonomic domain of the organism, by default None
prokaryote (bool, optional) – Whether the organism is a prokaryote, by default False
s_values ({'dosReis', 'Tuller'}, optional) – Coefficients of the tRNA-codon efficiency of coupling, by default ‘dosReis’
genetic_code (int, optional) – NCBI genetic code ID, by default 1
Notes
For species-specific optimization of the tAI model, see: Sabi & Tuller, DNA Research, 2014; the stAIcalc online calculator: https://tau-tai.azurewebsites.net/; and the gtAI package: https://github.com/AliYoussef96/gtAI.
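In the commonly cited formulation of dos Reis et al., the absolute adaptiveness of codon $i$ sums over the $n_i$ tRNA species that can read it, and the sequence score is a geometric mean over its $L_g$ codons $c_1, \dots, c_{L_g}$. The notation below is chosen for this note, and the special handling of codons with zero weight is omitted.

```latex
W_i = \sum_{j=1}^{n_i} \left(1 - s_{ij}\right) \mathrm{tGCN}_{ij},
\qquad
w_i = \frac{W_i}{\max_k W_k},
\qquad
\mathrm{tAI}(g) = \Bigl(\prod_{k=1}^{L_g} w_{c_k}\Bigr)^{1/L_g}
```

Here $s_{ij}$ are the coupling coefficients selected by s_values, and $\mathrm{tGCN}_{ij}$ are the gene copy numbers supplied through tGCN or retrieved from GtRNAdb.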
- class codonbias.scores.VectorScore
Bases: object
Abstract class for models that output a vector per sequence. For example, the output can be a score per position in the sequence. Inheriting classes may implement the computation of the score for a single sequence in the method _calc_vector(seq). Parameters of the model may be initialized with the instance of the class.
- get_vector(seq, slice=None, **kwargs)
Compute the score vector for a single, or multiple sequences. When slice is provided, all sequences will be sliced before computing the score.
- Parameters
seq (str or an iterable of str) – DNA sequence, or an iterable of ones.
slice (slice, optional) – Python slice object, by default None
- Returns
1D array for a single sequence, 1D array of 1D arrays for arbitrary sequences, or a matrix NxM for N sequences of length M.
- Return type
numpy.array, or numpy.array of numpy.array
- class codonbias.scores.WeightScore
Bases: object
Abstract class for models that output a weights vector per sequence. Inheriting classes may implement the computation of the score for a single sequence in the method _calc_seq_weights(seq). Parameters of the model may be initialized with the instance of the class.
- get_weights(seq, slice=None, **kwargs)
Compute the codon / amino acid weights for a single, or multiple sequences. When slice is provided, all sequences will be sliced before computing the score.
- Parameters
seq (str or an iterable of str) – DNA sequence, or an iterable of ones.
slice (slice, optional) – Python slice object, by default None
- Returns
N by C array with a weights vector for each of the N provided sequences.
- Return type
numpy.array
codonbias.stats module
- class codonbias.stats.BaseCounter(seqs=None, k_mer=1, step=1, frame=1, sum_seqs=True)
Bases: object
Nucleotide statistics for a single, or multiple DNA sequences. When the k_mer argument is provided, the counter will return dinucleotide (k_mer=2), trinucleotide (k_mer=3) statistics, etc.
- Parameters
seqs (str, or iterable of str, optional) – DNA sequence, or an iterable of ones, by default None
k_mer (int, optional) – Determines the length of the k-mer to base statistics on, by default 1
step (int, optional) – Determines the step size to take along the sequence, by default 1
frame (int, optional) – Determines the frame, or shift+1, from the beginning of the sequence, by default 1
sum_seqs (bool, optional) – Determines how multiple sequences will be handled. When True, their statistics will be summed; otherwise separate statistics will be kept in a table, by default True
Examples
Compute the GC3 content (GC in the third position of codons):
>>> nuc = BaseCounter(step=3, frame=3)
>>> freq = nuc.count(seq).get_table(normed=True)
>>> freq['G'] + freq['C']
Compute CpG content:
>>> nuc = BaseCounter(k_mer=2)
>>> freq = nuc.count(seq).get_table(normed=True)
>>> freq['CG']
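To make the roles of k_mer, step and frame concrete, the following standalone sketch counts k-mers with the standard library only. It is not the package's code: count_kmers is a hypothetical helper, and it applies no pseudocount, so its frequencies may differ slightly from get_table(normed=True) with the default pseudocount.

```python
from collections import Counter


def count_kmers(seq, k_mer=1, step=1, frame=1):
    """Count k-mers starting at 1-based position `frame`, advancing by `step`."""
    start = frame - 1
    return Counter(
        seq[i:i + k_mer]
        for i in range(start, len(seq) - k_mer + 1, step)
    )


seq = "ATGGCGAAACGTTGA"  # codons: ATG GCG AAA CGT TGA

# GC3: look only at the third position of every codon.
third_positions = count_kmers(seq, step=3, frame=3)
gc3 = (third_positions["G"] + third_positions["C"]) / sum(third_positions.values())

# CpG: count overlapping dinucleotides along the whole sequence.
dinucleotides = count_kmers(seq, k_mer=2)
cpg = dinucleotides["CG"] / sum(dinucleotides.values())
```

With this sequence, two of the five third-codon positions are G or C, and two of the fourteen overlapping dinucleotides are CG.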
- count(seqs)
Update the BaseCounter object with the base counts of the given sequence(s).
- Parameters
seqs (str, or iterable of str) – DNA sequence, or an iterable of ones.
- Returns
BaseCounter object (self) with updated counts
- Return type
BaseCounter
- get_table(normed=False, pseudocount=1)
Return base counts as a Series (for a single summary) or DataFrame (for multiple summaries, when sum_seqs is False), indexed by the nucleotide k-mer. Normalized frequencies (when normed=True) are corrected by default using pseudocounts.
- Parameters
normed (bool, optional) – Determines whether base counts will be normalized to sum to 1, by default False
pseudocount (int, optional) – Pseudocount correction for normalized base frequencies, by default 1
- Returns
Nucleotide k-mer counts (or frequencies) with k-mers as index, and counts as values.
- Return type
pandas.Series or pandas.DataFrame
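The effect of a pseudocount on normalized frequencies can be illustrated with a small sketch. It assumes the common additive scheme (add the pseudocount to every count, then divide by the new total); the text above does not spell out how the package applies the correction, and normalize_counts is a hypothetical helper working on a plain dict rather than a pandas object.

```python
def normalize_counts(counts, pseudocount=1):
    """Add `pseudocount` to every count, then scale the result to sum to 1."""
    corrected = {kmer: n + pseudocount for kmer, n in counts.items()}
    total = sum(corrected.values())
    return {kmer: n / total for kmer, n in corrected.items()}


counts = {"A": 6, "C": 0, "G": 3, "T": 1}
freq = normalize_counts(counts)        # pseudocount of 1 per nucleotide
raw = normalize_counts(counts, 0)      # no correction
```

Under this assumption, unobserved k-mers such as C above receive a small nonzero frequency, which avoids zeros when the frequencies are later log-transformed or used as denominators.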
- class codonbias.stats.CodonCounter(seqs=None, k_mer=1, sum_seqs=True, concat_index=True, genetic_code=1, ignore_stop=True)
Bases: object
Codon statistics for a single, or multiple DNA sequences. When the k_mer argument is provided, the counter will return codon pairs (k_mer=2), codon triplets (k_mer=3) statistics, etc.
- Parameters
seqs (str, or iterable of str, optional) – DNA sequence, or an iterable of ones, by default None
k_mer (int, optional) – Determines the length of the k-mer to base statistics on, by default 1
sum_seqs (bool, optional) – Determines how multiple sequences will be handled. When True, their statistics will be summed; otherwise separate statistics will be kept in a table, by default True
genetic_code (int, optional) – NCBI genetic code ID, by default 1
ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default True
- count(seqs)
Update the CodonCounter object with the codon counts of the given sequence(s).
- Parameters
seqs (str, or iterable of str) – DNA sequence, or an iterable of ones.
- Returns
CodonCounter object (self) with updated counts
- Return type
CodonCounter
- get_aa_table(normed=False, pseudocount=1, nonzero=False)
Return codon counts as a Series (for a single summary) or DataFrame (for multiple summaries, when sum_seqs is False), indexed by the codon and the encoded amino acid. Normalized frequencies (when normed=True) are corrected by default using pseudocounts.
- Parameters
normed (bool, optional) – Determines whether codon counts will be normalized to sum to 1 for each amino acid (a vector that sums to 20), by default False
pseudocount (int, optional) – Pseudocount correction for normalized codon frequencies, by default 1
- Returns
Codon counts (or frequencies) with amino acids and codons as index, and counts as values.
- Return type
pandas.Series or pandas.DataFrame
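The per-amino-acid normalization ("a vector that sums to 20") can be sketched without the package as follows. This is an assumption-laden illustration: aa_normalized_frequencies is a hypothetical function, it hard-codes the standard genetic code (NCBI table 1), discards STOP codons, and uses an additive pseudocount on every sense codon.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def aa_normalized_frequencies(seq, pseudocount=1):
    """Codon frequencies that sum to 1 within each amino acid (STOP codons dropped)."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    counts = Counter(c for c in codons if CODON_TO_AA.get(c, "*") != "*")
    # Additive pseudocount on every sense codon, observed or not.
    corrected = {c: counts[c] + pseudocount
                 for c, aa in CODON_TO_AA.items() if aa != "*"}
    totals = defaultdict(float)
    for codon, n in corrected.items():
        totals[CODON_TO_AA[codon]] += n
    return {(CODON_TO_AA[c], c): n / totals[CODON_TO_AA[c]]
            for c, n in corrected.items()}


freq = aa_normalized_frequencies("ATGGCCGCCGCTTAA")  # ATG GCC GCC GCT TAA
```

Each of the 20 amino acids contributes frequencies summing to 1, so the whole table sums to 20; single-codon amino acids such as methionine always receive a frequency of 1.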
- get_codon_table(normed=False, pseudocount=1, nonzero=False)
Return codon counts as a Series (for a single summary) or DataFrame (for multiple summaries, when sum_seqs is False). Normalized frequencies (when normed=True) are corrected by default using pseudocounts.
- Parameters
normed (bool, optional) – Determines whether codon counts will be normalized to sum to 1, by default False
pseudocount (int, optional) – Pseudocount correction for normalized codon frequencies, by default 1
- Returns
Codon counts (or frequencies) with codons as index, and counts as values.
- Return type
pandas.Series or pandas.DataFrame
codonbias.utils module
- class codonbias.utils.ReferenceSelector(score_object, seqs, higher_is_better=True)
Bases: object
A helper class for selecting reference sequences, based on models from the scores submodule.
- Parameters
score_object (codonbias.scores.ScalarScore) – Codon model with a get_score method.
seqs (iterable of str) – Iterable of DNA sequences.
higher_is_better (bool, optional) – Defines the direction of the codon score, by default True
- get_top_indices(top=0.2)
Returns the top sequence indices based on the given model.
- Parameters
top (float, optional) – Can be a positive integer or a float in (0, 1), by default 0.2
- Returns
Vector of sequence indices, sorted by the score.
- Return type
np.array
- get_top_seqs(top=0.2)
Returns the top sequences based on the given model.
- Parameters
top (float, optional) – Can be a positive integer or a float in (0, 1), by default 0.2
- Returns
List of DNA sequences, sorted by the score.
- Return type
list of str
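The selection logic can be sketched on a plain list of scores. The interpretation of top used here is an assumption, since the text only states the accepted types: an integer is read as a number of sequences and a float in (0, 1) as a fraction of them. The function top_indices is hypothetical and is not the package's implementation.

```python
def top_indices(scores, top=0.2, higher_is_better=True):
    """Indices of the best-scoring items, best first."""
    if isinstance(top, int):
        n = top                                   # assumed: absolute number of items
    else:
        n = max(1, round(top * len(scores)))      # assumed: fraction of all items
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    return order[:n]


scores = [0.1, 0.9, 0.5, 0.7, 0.3]
best_fraction = top_indices(scores, top=0.4)
best_one = top_indices(scores, top=1)
lowest_two = top_indices(scores, top=2, higher_is_better=False)
```

Indexing the original sequence list with the returned indices would then give the reference set, which is what get_top_seqs is described as returning.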
- codonbias.utils.fetch_GCN_from_GtRNAdb(url=None, genome=None, domain=None)
Download a tRNA gene copy number (GCN) table for an organism from GtRNAdb, given either the URL of the relevant page, or the genome ID and taxonomic domain of the organism. Note that this is an experimental function.
- Parameters
url (str, optional) – URL of the relevant page on GtRNAdb, by default None
genome (str, optional) – Genome ID of the organism, by default None
domain (str, optional) – Taxonomic domain of the organism, by default None
- Returns
tRNA gene copy numbers with the columns: anti_codon, GCN.
- Return type
pandas.DataFrame
Examples
>>> fetch_GCN_from_GtRNAdb(url='http://gtrnadb.ucsc.edu/genomes/eukaryota/Scere3/')
   anti_codon  GCN
10        AAC   14
35        AAT   13
17        ACG    6
13        AGA   11
....
>>> fetch_GCN_from_GtRNAdb(genome='Scere3', domain='eukaryota')
   anti_codon  GCN
10        AAC   14
35        AAT   13
17        ACG    6
13        AGA   11
....
- codonbias.utils.geomean(log_weights, counts)
Compute the geometric mean based on codon scores given in log_weights (weights in logarithmic scale), and codon counts given in counts.
- Parameters
log_weights (pandas.Series) – Codon scores in logarithmic scale, with codons as index and scores as values.
counts (pandas.Series) – Codon counts, with codons as index and counts as values.
- Returns
Geometric mean.
- Return type
float
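A count-weighted geometric mean over log-scale weights can be written as exp(sum(log_w * n) / sum(n)). The sketch below shows this formula on plain dicts; weighted_geomean is a hypothetical stand-in, and details such as how the package handles codons missing from either input are not covered.

```python
import math


def weighted_geomean(log_weights, counts):
    """Geometric mean of codon weights, each weighted by its count."""
    total = sum(counts.values())
    weighted_sum = sum(log_weights[codon] * n for codon, n in counts.items())
    return math.exp(weighted_sum / total)


# Illustrative weights only, not taken from any real reference set.
log_weights = {"GCC": math.log(1.0), "GCT": math.log(0.5), "GCA": math.log(0.25)}
counts = {"GCC": 2, "GCT": 1, "GCA": 1}
score = weighted_geomean(log_weights, counts)
```

Working in log scale turns the product of many small weights into a sum, which avoids numerical underflow on long sequences; here the result equals (1.0 * 1.0 * 0.5 * 0.25) ** (1 / 4).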
- codonbias.utils.greater_equal(x1, x2)
Modifies the corresponding numpy operator to preserve NaNs.
- codonbias.utils.less_equal(x1, x2)
Modifies the corresponding numpy operator to preserve NaNs.
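Ordinary comparisons involving NaN evaluate to False, which silently turns a missing value into a definite answer. One way to preserve NaNs is sketched below on plain lists; the function name and the choice of returning 1.0 / 0.0 / NaN are assumptions for illustration, not the package's actual return format.

```python
import math


def greater_equal_keep_nan(x1, x2):
    """Element-wise x1 >= x2 that yields NaN wherever either input is NaN."""
    return [
        math.nan if math.isnan(a) or math.isnan(b) else float(a >= b)
        for a, b in zip(x1, x2)
    ]


plain = [a >= b for a, b in zip([2.0, 1.0, math.nan], [1.0, 2.0, 1.0])]
kept = greater_equal_keep_nan([2.0, 1.0, math.nan], [1.0, 2.0, 1.0])
```

The plain comparison reports False for the NaN entry, whereas the NaN-preserving variant keeps it marked as missing; a less_equal counterpart would follow the same pattern with <=.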
- codonbias.utils.mean(weights, counts)
Compute the arithmetic mean based on codon scores given in weights, and codon counts given in counts.
- Parameters
weights (pandas.Series) – Codon scores, with codons as index and scores as values.
counts (pandas.Series) – Codon counts, with codons as index and counts as values.
- Returns
Arithmetic mean.
- Return type
float
- codonbias.utils.process_GtRNAdb_table(table)
Helper function to get a dataframe of tRNA anti-codon copy numbers from a single HTML table.
- Parameters
table (pandas.DataFrame) – The product of read_html().
- Returns
tRNA gene copy numbers with the columns: anti_codon, GCN.
- Return type
pandas.DataFrame
- codonbias.utils.rankdata(x)
Modifies the corresponding scipy function to preserve NaNs.
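What "preserving NaNs" could mean for ranking is sketched here in plain Python: valid entries are ranked among themselves using average ranks for ties (the default tie method of scipy.stats.rankdata), and NaN entries stay NaN. The helper rank_keep_nan is hypothetical and is not the package's code.

```python
import math


def rank_keep_nan(x):
    """1-based average ranks of the non-NaN values; NaN positions stay NaN."""
    valid = sorted((v, i) for i, v in enumerate(x) if not math.isnan(v))
    ranks = [math.nan] * len(x)
    start = 0
    while start < len(valid):
        # Extend `end` over the run of values tied with valid[start].
        end = start
        while end + 1 < len(valid) and valid[end + 1][0] == valid[start][0]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for _, original_index in valid[start:end + 1]:
            ranks[original_index] = average_rank
        start = end + 1
    return ranks


ranks = rank_keep_nan([0.3, math.nan, 0.1, 0.3])
```

In this example the two tied values share rank 2.5, the smallest value gets rank 1, and the missing entry does not take part in the ranking.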
- codonbias.utils.reverse_complement(seq)
The reverse complement of the given DNA sequence, such as the anti-codon that perfectly pairs with a codon.
- Parameters
seq (str) – Nucleotide sequence in {A,C,G,T}.
- Returns
The reverse complement sequence in {A,C,G,T}.
- Return type
str
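For reference, a reverse complement over the {A,C,G,T} alphabet can be computed with the standard library alone. This is a minimal sketch, not the package's implementation, and revcomp is a hypothetical name.

```python
# Translation table mapping each base to its Watson-Crick complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


anticodon = revcomp("GCT")  # anti-codon that perfectly pairs with the codon GCT
```

For the codon GCT this gives AGC, read 5' to 3' like the codon itself, and applying the function twice returns the original sequence.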
- codonbias.utils.translate(seq, return_str=False, genetic_code=1)
Translate a nucleotide sequence and return its amino acids.
- Parameters
seq (str) – DNA sequence.
genetic_code (int, optional) – NCBI genetic code ID, by default 1