codonbias package

This package provides analysis tools for genomic sequences, focusing on protein coding regions, translation efficiency and synonymous mutations. These include implementations of popular models from the past four decades of codon usage study, such as:

  • Frequency of Optimal Codons (FOP)

  • Relative Synonymous Codon Usage (RSCU)

  • Codon Adaptation Index (CAI)

  • Effective Number of Codons (ENC)

  • tRNA Adaptation Index (tAI)

  • Relative Codon Bias Score (RCBS)

  • Directional Codon Bias Score (DCBS)

  • Codon Usage Frequency Similarity (CUFS)

The package contains 4 submodules:

  • codonbias.stats: Classes for codon statistics.

  • codonbias.scores: Models / scores that operate on individual sequences independently.

  • codonbias.pairwise: Models / scores that operate on pairs of sequences.

  • codonbias.utils: Helper functions for the other submodules.

Submodules

codonbias.pairwise module

class codonbias.pairwise.CodonUsageFrequency(synonymous=False, genetic_code=1, ignore_stop=False, n_jobs=None)

Bases: PairwiseScore

Codon Usage Frequency Similarity (CUFS, Diament, Pinter & Tuller, Nat Commun, 2014). This is a distance metric between pairs of sequences based on their codon distributions. It employs a distance metric for probability distributions (Endres & Schindelin, 2003) that is based on KL divergence.

Parameters
  • synonymous (bool, optional) – When True, synonymous codon frequencies are normalized to sum to 1 for each amino acid (synCUFS), by default False

  • genetic_code (int, optional) – NCBI genetic code ID, by default 1

  • ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default False

  • n_jobs (int, optional) – Number of processes to use for matrix computation, by default None
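As a rough illustration of the distance described above, the following sketch builds codon frequency vectors for two sequences and applies the Endres & Schindelin metric (the square root of the summed KL-type terms of each distribution against their midpoint). It is not the package's implementation: the helper names codon_freqs and es_distance are hypothetical, natural logarithms are assumed, and a trailing partial codon is simply dropped.

```python
from collections import Counter
from math import log, sqrt


def codon_freqs(seq):
    """Relative frequency of each codon in seq (trailing partial codon dropped)."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def es_distance(p, q):
    """Endres & Schindelin distance between two discrete distributions."""
    d2 = 0.0
    for key in set(p) | set(q):
        pi, qi = p.get(key, 0.0), q.get(key, 0.0)
        mi = (pi + qi) / 2  # midpoint distribution
        if pi > 0:
            d2 += pi * log(pi / mi)
        if qi > 0:
            d2 += qi * log(qi / mi)
    return sqrt(max(d2, 0.0))  # guard against tiny negative rounding errors


p = codon_freqs("ACGACGGAGGAG")  # ACG: 0.5, GAG: 0.5
q = codon_freqs("ACGGAGGAGGAG")  # ACG: 0.25, GAG: 0.75
print(es_distance(p, q))
```

The distance is zero for identical distributions and symmetric in its arguments, which matches the properties that get_matrix(seqs) relies on.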

class codonbias.pairwise.PairwiseScore(n_jobs=None)

Bases: object

Abstract class for models that output a scalar for a pair of sequences, or a pairwise score matrix for a set of sequences. Inheriting classes may implement the computation of the score for a single pair in two steps: (1) a transformation of the sequence by _calc_weights(seq); and (2) a computation of the score by _calc_pair_score(w1, w2). The abstract class implements two wrapper methods that call the aforementioned internal implementations: get_score(seq1, seq2), get_matrix(seqs). The latter function assumes that the score is symmetric, and that the diagonal always contains zeros.

If a dedicated implementation of whole-matrix computation is provided in _calc_matrix(weights), the get_matrix(seqs) method will prefer it. This can be, for example, an efficient vectorized implementation of the computation.

Parameters

n_jobs (int, optional) – Number of processes to use for matrix computation, by default None

get_matrix(seqs, elementwise=False)

Computes the all-pairs score matrix for the given sequences.

Parameters
  • seqs (iterable of str) – Set of DNA sequences.

  • elementwise (bool, optional) – When True, the matrix will be computed element by element using multiple processes. This may be useful to decrease memory consumption, by default False

Returns

Square matrix of scores for all pairs of the given sequences.

Return type

numpy.array

get_score(seq1, seq2)

Computes the score between the two given sequences.

Parameters
  • seq1 (str) – DNA sequence.

  • seq2 (str) – DNA sequence.

Returns

Score for seq1 and seq2.

Return type

float
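The two-step design described for PairwiseScore can be illustrated with a standalone mock. This is a minimal sketch, not the package's code: ToyPairwise and GCDistance are hypothetical classes that only mirror the method names mentioned above, and the matrix is built from plain Python lists rather than a NumPy array.

```python
class ToyPairwise:
    """Hypothetical stand-in that mimics the two-step pattern described above."""

    def _calc_weights(self, seq):
        raise NotImplementedError

    def _calc_pair_score(self, w1, w2):
        raise NotImplementedError

    def get_score(self, seq1, seq2):
        return self._calc_pair_score(self._calc_weights(seq1),
                                     self._calc_weights(seq2))

    def get_matrix(self, seqs):
        weights = [self._calc_weights(s) for s in seqs]  # transform each sequence once
        n = len(weights)
        mat = [[0.0] * n for _ in range(n)]  # diagonal stays zero
        for i in range(n):
            for j in range(i + 1, n):
                # symmetric score: compute the upper triangle and mirror it
                mat[i][j] = mat[j][i] = self._calc_pair_score(weights[i], weights[j])
        return mat


class GCDistance(ToyPairwise):
    """Toy score: absolute difference in GC content."""

    def _calc_weights(self, seq):
        return sum(base in "GC" for base in seq) / len(seq)

    def _calc_pair_score(self, w1, w2):
        return abs(w1 - w2)


model = GCDistance()
print(model.get_score("GGCC", "AATT"))                # 1.0
print(model.get_matrix(["GGCC", "AATT", "GATC"]))
```

Computing the weights once per sequence and then filling only the upper triangle is what makes the symmetric, zero-diagonal assumption worthwhile for large sets of sequences.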

codonbias.scores module

class codonbias.scores.CodonAdaptationIndex(ref_seq, genetic_code=1, ignore_stop=True)

Bases: ScalarScore, VectorScore

Codon Adaptation Index (CAI, Sharp & Li, NAR, 1987). This model determines the level of optimality of codons based on their frequency in the given set of reference sequences ref_seq. For each amino acid, the most frequent synonymous codon receives a weight of 1, while other codons are weighted based on their relative frequency with respect to the most frequent synonymous codon. The returned vector for a sequence is an array with the weight of the corresponding codon in each position in the sequence. The score for a sequence is the geometric mean of these weights, and ranges from 0 (strong rare codon bias) to 1 (strong frequent codon bias).

Parameters
  • ref_seq (iterable of str) – Reference sequences for learning the codon frequencies.

  • genetic_code (int, optional) – NCBI genetic code ID, by default 1

  • ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default True
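The weighting scheme described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the package's implementation: cai_weights and cai_score are hypothetical helpers, the standard genetic code is hard-coded, STOP codons are dropped, and codons never seen in the reference set are skipped when scoring (a real implementation may treat them differently, for example with pseudocounts).

```python
from collections import Counter, defaultdict
from itertools import product
from math import exp, log

# Standard genetic code (NCBI table 1), codons ordered TCAG x TCAG x TCAG
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def cai_weights(ref_seqs):
    """Weight of each observed codon relative to its most frequent synonym."""
    counts = Counter(c for s in ref_seqs for c in split_codons(s)
                     if CODON_TO_AA.get(c, "*") != "*")  # drop STOP and unknown codons
    best = defaultdict(int)
    for codon, n in counts.items():
        aa = CODON_TO_AA[codon]
        best[aa] = max(best[aa], n)
    return {codon: n / best[CODON_TO_AA[codon]] for codon, n in counts.items()}


def cai_score(seq, weights):
    """Geometric mean of the weights over codons that have a weight."""
    logs = [log(weights[c]) for c in split_codons(seq) if c in weights]
    return exp(sum(logs) / len(logs))


w = cai_weights(["GAGGAGGAGGAA"])  # GAG x3, GAA x1 (both encode Glu)
print(w)                           # GAG gets weight 1, GAA gets 1/3
print(cai_score("GAGGAA", w))      # geometric mean of 1 and 1/3
```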

class codonbias.scores.EffectiveNumberOfCodons(genetic_code=1)

Bases: ScalarScore

Effective Number of Codons (ENC, Wright, Gene, 1990). This model measures the deviation of synonymous codon usage from uniformity based on a statistical model analogous to the effective number of alleles in genetics. The score for a sequence is the effective number of codons in use, and ranges from 20 (very strong bias: a single codon per amino acid) to 61 (uniform use of all codons). Thus, this score is expected to be negatively correlated with most other codon bias measures.

Parameters

genetic_code (int, optional) – NCBI genetic code ID, by default 1
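The stated range of 20 to 61 follows directly from Wright's formula for the standard genetic code, which combines the average homozygosity F of the 2-, 3-, 4- and 6-fold degenerate amino acid classes. The sketch below only reproduces that arithmetic; the function names are hypothetical, and it does not attempt to mirror how the package handles missing amino acids or other genetic codes.

```python
def homozygosity(counts):
    """Wright's F for one amino acid, from the counts of its synonymous codons."""
    n = sum(counts)
    return (n * sum((c / n) ** 2 for c in counts) - 1) / (n - 1)


def effective_number(f2, f3, f4, f6):
    """Combine the average F of the 2-, 3-, 4- and 6-fold degenerate classes.

    The constant 2 accounts for Met and Trp; the standard code has 9 two-fold,
    1 three-fold, 5 four-fold and 3 six-fold degenerate amino acids.
    """
    return 2 + 9 / f2 + 1 / f3 + 5 / f4 + 3 / f6


print(effective_number(1, 1, 1, 1))                  # one codon per amino acid
print(effective_number(1 / 2, 1 / 3, 1 / 4, 1 / 6))  # uniform usage, about 61
print(homozygosity([10, 0]), homozygosity([5, 5]))
```

With a single codon per amino acid every F equals 1 and the sum is 2 + 9 + 1 + 5 + 3 = 20; with uniform usage F approaches 1/k for a k-fold class, giving 2 + 18 + 3 + 20 + 18 = 61.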

class codonbias.scores.FrequencyOfOptimalCodons(ref_seq, thresh=0.95, genetic_code=1, ignore_stop=True)

Bases: ScalarScore, VectorScore

Frequency of Optimal Codons (FOP, Ikemura, J Mol Biol, 1981). This model determines the optimal codons for each amino acid based on their frequency in the given set of reference sequences ref_seq. Multiple codons may be selected as optimal based on thresh. The score for a sequence is the fraction of codons in the sequence deemed optimal. The returned vector for a sequence is a binary array where optimal positions contain 1 and non-optimal ones contain 0.

Parameters
  • ref_seq (iterable of str) – A set of reference DNA sequences for codon usage statistics.

  • thresh (float, optional) – Minimal ratio between the frequency of a codon and the most frequent one in order to be set as optimal, by default 0.95

  • genetic_code (int, optional) – NCBI genetic code ID, by default 1

  • ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default True
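A minimal sketch of the thresholding step, assuming the standard genetic code and that STOP codons are ignored. The helpers optimal_codons, fop_vector and fop_score are hypothetical names for illustration and are not part of the package.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered TCAG x TCAG x TCAG
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def optimal_codons(ref_seqs, thresh=0.95):
    """Codons whose count is at least thresh times that of their top synonym."""
    counts = Counter(c for s in ref_seqs for c in split_codons(s)
                     if CODON_TO_AA.get(c, "*") != "*")
    best = defaultdict(int)
    for codon, n in counts.items():
        aa = CODON_TO_AA[codon]
        best[aa] = max(best[aa], n)
    return {c for c, n in counts.items() if n / best[CODON_TO_AA[c]] >= thresh}


def fop_vector(seq, optimal):
    """Binary vector: 1 where the codon is optimal, 0 otherwise."""
    return [int(c in optimal) for c in split_codons(seq)]


def fop_score(seq, optimal):
    vec = fop_vector(seq, optimal)
    return sum(vec) / len(vec)


opt = optimal_codons(["GAGGAGGAGGAAACGACG"])  # GAG x3, GAA x1, ACG x2
print(sorted(opt))                            # ['ACG', 'GAG']
print(fop_vector("GAGGAAACG", opt))           # [1, 0, 1]
print(fop_score("GAGGAAACG", opt))
```

Lowering thresh admits more codons per amino acid; for instance, thresh=0.3 would also mark GAA as optimal in this toy reference.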

class codonbias.scores.RelativeCodonBiasScore(directional=False, mean='geometric', genetic_code=1, ignore_stop=True)

Bases: ScalarScore, VectorScore

Relative Codon Bias Score (RCBS, Roymondal, Das & Sahoo, DNA Research, 2009). This model measures the deviation of codon usage from a background distribution and computes for each codon the observed-to-expected ratio. The background distribution is estimated for each sequence separately, based on its nucleotide composition. The model’s null hypothesis is that the 3 codon positions are independently distributed according to the same nucleotide distribution. Thus, overrepresented codons are given higher weights while underrepresented codons are given lower weights. The score for a sequence is the geometric mean of codon ratios, minus 1. The returned vector for a sequence is an array with the ratio of the corresponding codon in each position in the sequence.

Sabi & Tuller (DNA Research, 2014) proposed a modified score based on these principles, termed the Directional Codon Bias Score (DCBS). In this model underrepresented codons are given larger weights (rather than smaller weights), similarly to overrepresented codons. This model’s hypothesis is that biased sequences will typically include both highly overrepresented codons and underrepresented ones, and therefore both signals should contribute towards a higher (i.e., biased) score. This modification is activated by setting the directional parameter to True and the mean parameter to ‘arithmetic’.

Parameters
  • directional (bool, optional) – When True will compute the modified version by Sabi & Tuller, by default False

  • mean ({'geometric', 'arithmetic'}, optional) – How to compute the score, by default ‘geometric’

  • genetic_code (int, optional) – NCBI genetic code ID, by default 1

  • ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default True
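The observed-to-expected ratio can be sketched as follows. This follows the wording above literally, using one nucleotide distribution estimated from the whole sequence for all three codon positions; that choice, the hypothetical function name rcbs_sketch, and the lack of STOP codon filtering are assumptions of this example, not a description of the package's internals.

```python
from collections import Counter
from math import exp, log


def split_codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def rcbs_sketch(seq):
    """Return (per-position ratio vector, geometric-mean score minus 1)."""
    codons = split_codons(seq)
    codon_freq = {c: n / len(codons) for c, n in Counter(codons).items()}
    coding = "".join(codons)
    nuc_freq = {b: n / len(coding) for b, n in Counter(coding).items()}
    # expected codon frequency if the three positions were independent draws
    ratio = {c: f / (nuc_freq[c[0]] * nuc_freq[c[1]] * nuc_freq[c[2]])
             for c, f in codon_freq.items()}
    vector = [ratio[c] for c in codons]
    score = exp(sum(log(r) for r in vector) / len(vector)) - 1
    return vector, score


print(rcbs_sketch("AAAAAA"))  # single nucleotide: observed equals expected
print(rcbs_sketch("ACGACG"))  # one codon, three equally frequent nucleotides
```

In the second call the only codon has observed frequency 1 and expected frequency (1/3)^3, so each ratio is about 27 and the score is about 26.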

class codonbias.scores.RelativeSynonymousCodonUsage(ref_seq=None, directional=False, mean='geometric', genetic_code=1, ignore_stop=True)

Bases: ScalarScore, VectorScore

Relative Synonymous Codon Usage (RSCU, Sharp & Li, NAR, 1986). This model measures the deviation of synonymous codon usage from uniformity and returns for each codon the ratio between its observed frequency and its expected frequency if synonymous codons were chosen randomly (uniformly). Overrepresented codons will have a score > 1, while underrepresented codons will have a score < 1. get_weights() returns a vector of 61 RSCU ratios for each sequence. While not defined as part of the original Sharp & Li model, the get_vector() method returns an array with the ratio of the corresponding codon in each position in the sequence, and the get_score() method returns the geometric mean of the ratios for a sequence (minus 1), in a similar way to the Relative Codon Bias Score (RCBS). The directional parameter modifies RSCU similarly to the way the Directional Codon Bias Score (DCBS) modifies RCBS, by giving higher weights to both overrepresented and underrepresented codons.

Parameters
  • ref_seq (iterable of str, optional) – When given, codon frequencies in the reference set will be used instead of the uniform codon distribution, by default None

  • directional (bool, optional) – When True will compute the modified version by Sabi & Tuller, by default False

  • mean ({'geometric', 'arithmetic'}, optional) – How to compute the score, by default ‘geometric’

  • genetic_code (int, optional) – NCBI genetic code ID, by default 1

  • ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default True

get_weights(seq)

Compute a vector of 61 RSCU codon weights (ratios) for each sequence in seq.

Parameters

seq (str, or iterable of str) – DNA sequence, or an iterable of ones.

Returns

RSCU weights for each codon, for each sequence.

Return type

pandas.Series or pandas.DataFrame
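The RSCU ratio itself is simple to compute by hand. The sketch below assumes the standard genetic code and a uniform expectation over synonymous codons; rscu_sketch is a hypothetical helper, it returns a plain dict rather than a pandas object, and amino acids that do not occur in the sequence are omitted instead of being filled in.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered TCAG x TCAG x TCAG
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def rscu_sketch(seq):
    """Observed count divided by the count expected under uniform synonym usage."""
    counts = Counter(c for c in split_codons(seq) if CODON_TO_AA.get(c, "*") != "*")
    family = defaultdict(list)  # amino acid -> all of its synonymous codons
    for codon, aa in CODON_TO_AA.items():
        family[aa].append(codon)
    out = {}
    for aa, codons in family.items():
        if aa == "*":
            continue
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the sequence
        for c in codons:
            out[c] = counts[c] * len(codons) / total
    return out


r = rscu_sketch("GAGGAGGAGGAA")  # Glu: GAG x3, GAA x1
print(r["GAG"], r["GAA"])        # 1.5 0.5
```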

class codonbias.scores.ScalarScore

Bases: object

Abstract class for models that output a scalar per sequence. Inheriting classes may implement the computation of the score for a single sequence in the method _calc_score(seq). Parameters of the model may be initialized with the instance of the class.

get_score(seq, slice=None, **kwargs)

Compute the score for a single, or multiple sequences. When slice is provided, all sequences will be sliced before computing the score.

Parameters
  • seq (str or an iterable of str) – DNA sequence, or an iterable of ones.

  • slice (slice, optional) – Python slice object, by default None

Returns

Score for a single sequence, or an array of scores for multiple sequences.

Return type

float or numpy.array

Examples

>>> from codonbias.scores import EffectiveNumberOfCodons
>>> EffectiveNumberOfCodons().get_score('ACGACGGAGGAG')
35.0
>>> EffectiveNumberOfCodons().get_score('ACGACGGAGGAG', slice=slice(6))
44.33333333333333

class codonbias.scores.TrnaAdaptationIndex(tGCN=None, url=None, genome_id=None, domain=None, prokaryote=False, s_values='dosReis', genetic_code=1)

Bases: ScalarScore, VectorScore

tRNA Adaptation Index (tAI, dos Reis, Savva & Wernisch, NAR, 2004). This model measures translational efficiency based on the availability of tRNAs (approximated by the gene copy number of each tRNA species), and the efficiency of coupling between tRNAs and codons (modeled via the set of s_values coefficients). Each codon receives a weight in [0, 1] that describes its translational efficiency. The returned vector for a sequence is an array with the weight of the corresponding codon in each position in the sequence. The score for a sequence is the geometric mean of these weights, and ranges from 0 (low efficiency) to 1 (high efficiency).

Gene copy numbers can be provided explicitly, or automatically downloaded from GtRNAdb.

The model was originally trained in S. cerevisiae and E. coli in order to maximize the correlation with mRNA levels measured via microarrays. The model was later refitted using protein abundance levels (Tuller et al., Genome Biology, 2011). The s_values parameter can be used to switch between these coefficient sets. When analyzing a prokaryote, the prokaryote parameter should be set to True.

Parameters
  • tGCN (pandas.DataFrame, optional) – tRNA Gene Copy Numbers given as a DataFrame with the columns anti_codon, GCN, by default None

  • url (str, optional) – URL of the relevant page on GtRNAdb, by default None

  • genome_id (str, optional) – Genome ID of the organism, by default None

  • domain (str, optional) – Taxonomic domain of the organism, by default None

  • prokaryote (bool, optional) – Whether the organism is a prokaryote, by default False

  • s_values ({'dosReis', 'Tuller'}, optional) – Coefficients of the tRNA-codon efficiency of coupling, by default ‘dosReis’

  • genetic_code (int, optional) – NCBI genetic code ID, by default 1

Notes

For species-specific optimization of the tAI model, see: Sabi & Tuller, DNA Research, 2014; the stAIcalc online calculator: https://tau-tai.azurewebsites.net/; and the gtAI package: https://github.com/AliYoussef96/gtAI.

class codonbias.scores.VectorScore

Bases: object

Abstract class for models that output a vector per sequence. For example, the output can be a score per position in the sequence. Inheriting classes may implement the computation of the score for a single sequence in the method _calc_vector(seq). Parameters of the model may be initialized with the instance of the class.

get_vector(seq, slice=None, **kwargs)

Compute the score vector for a single, or multiple sequences. When slice is provided, all sequences will be sliced before computing the score.

Parameters
  • seq (str or an iterable of str) – DNA sequence, or an iterable of ones.

  • slice (slice, optional) – Python slice object, by default None

Returns

1D array for a single sequence, 1D array of 1D arrays for multiple sequences of differing lengths, or an NxM matrix for N sequences of length M.

Return type

numpy.array, or numpy.array of numpy.array

codonbias.stats module

class codonbias.stats.CodonCounter(seqs, sum_seqs=True, genetic_code=1, ignore_stop=True)

Bases: object

Codon statistics for a single, or multiple DNA sequences.

Parameters
  • seqs (str, or iterable of str) – DNA sequence, or an iterable of ones.

  • sum_seqs (bool, optional) – Determines how multiple sequences will be handled. When True, their statistics will be summed, otherwise separate statistics will be kept in a table, by default True

  • genetic_code (int, optional) – NCBI genetic code ID, by default 1

  • ignore_stop (bool, optional) – Whether STOP codons will be discarded from the analysis, by default True

get_aa_table(normed=False, fillna=False)

Return codon counts as a Series (for a single summary) or DataFrame (for multiple summaries, when sum_seqs is False), indexed by the codon and the encoded amino acid.

Parameters
  • normed (bool, optional) – Determines whether codon counts will be normalized to sum to 1 for each amino acid (a vector that sums to 20), by default False

  • fillna (bool, optional) – When True, will fill NaNs according to a uniform distribution, by default False

Returns

Codon counts (or frequencies) with amino acids and codons as index, and counts as values.

Return type

pandas.Series or pandas.DataFrame

get_codon_table(normed=False, fillna=False)

Return codon counts as a Series (for a single summary) or DataFrame (for multiple summaries, when sum_seqs is False).

Parameters
  • normed (bool, optional) – Determines whether codon counts will be normalized to sum to 1, by default False

  • fillna (bool, optional) – When True, will fill NaNs according to a uniform distribution, by default False

Returns

Codon counts (or frequencies) with codons as index, and counts as values.

Return type

pandas.Series or pandas.DataFrame
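The kind of bookkeeping that CodonCounter performs can be approximated with the standard library. This is only a sketch of the idea: count_codons and normalize are hypothetical helpers, they return Counter and dict objects rather than pandas structures, and STOP codons are not filtered.

```python
from collections import Counter


def count_codons(seqs, sum_seqs=True):
    """Count codons in one or more sequences, summed or kept per sequence."""
    if isinstance(seqs, str):
        seqs = [seqs]
    tables = [Counter(s[i:i + 3] for i in range(0, len(s) - len(s) % 3, 3))
              for s in seqs]
    if sum_seqs:
        return sum(tables, Counter())  # one combined table
    return tables                      # one table per sequence


def normalize(table):
    """Scale counts so that they sum to 1."""
    total = sum(table.values())
    return {codon: n / total for codon, n in table.items()}


summed = count_codons(["ACGACG", "GAG"])
print(summed)             # ACG: 2, GAG: 1
print(normalize(summed))
print(count_codons(["ACGACG", "GAG"], sum_seqs=False))
```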

codonbias.utils module

codonbias.utils.fetch_GCN_from_GtRNAdb(url=None, genome=None, domain=None)

Download a tRNA gene copy number (GCN) table for an organism from GtRNAdb, given either the URL of the relevant page, or the genome ID and taxonomic domain of the organism. Note that this is an experimental function.

Parameters
  • url (str, optional) – URL of the relevant page on GtRNAdb, by default None

  • genome (str, optional) – Genome ID of the organism, by default None

  • domain (str, optional) – Taxonomic domain of the organism, by default None

Returns

tRNA gene copy numbers with the columns: anti_codon, GCN.

Return type

pandas.DataFrame

Examples

>>> from codonbias.utils import fetch_GCN_from_GtRNAdb
>>> fetch_GCN_from_GtRNAdb(url='http://gtrnadb.ucsc.edu/genomes/eukaryota/Scere3/')
   anti_codon  GCN
10        AAC   14
35        AAT   13
17        ACG    6
13        AGA   11
....
>>> fetch_GCN_from_GtRNAdb(genome='Scere3', domain='eukaryota')
   anti_codon  GCN
10        AAC   14
35        AAT   13
17        ACG    6
13        AGA   11
....

codonbias.utils.geomean(log_weights, counts)

Compute the geometric mean based on codon scores given in log_weights (weights in logarithmic scale), and codon counts given in counts.

Parameters
  • log_weights (pandas.Series) – Codon scores in logarithmic scale, with codons as index and scores as values.

  • counts (pandas.Series) – Codon counts, with codons as index and counts as values.

Returns

Geometric mean.

Return type

float

codonbias.utils.mean(weights, counts)

Compute the arithmetic mean based on codon scores given in weights, and codon counts given in counts.

Parameters
  • weights (pandas.Series) – Codon scores, with codons as index and scores as values.

  • counts (pandas.Series) – Codon counts, with codons as index and counts as values.

Returns

Arithmetic mean.

Return type

float
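Both helpers amount to count-weighted averages, one in log space and one in linear space. The sketch below shows the arithmetic with plain dicts standing in for the pandas.Series arguments; weighted_geomean and weighted_mean are hypothetical names, and natural logarithms are assumed for the log-scale weights.

```python
from math import exp, log


def weighted_geomean(log_weights, counts):
    """exp of the count-weighted average of log-scale codon scores."""
    total = sum(counts.values())
    return exp(sum(log_weights[c] * n for c, n in counts.items()) / total)


def weighted_mean(weights, counts):
    """Count-weighted arithmetic average of codon scores."""
    total = sum(counts.values())
    return sum(weights[c] * n for c, n in counts.items()) / total


weights = {"GAG": 1.0, "GAA": 0.25}
counts = {"GAG": 1, "GAA": 1}
log_weights = {c: log(w) for c, w in weights.items()}
print(weighted_geomean(log_weights, counts))  # close to 0.5
print(weighted_mean(weights, counts))         # 0.625
```

Working with log weights lets the geometric mean be computed as a sum rather than a long product, which avoids underflow on long sequences.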

codonbias.utils.process_GtRNAdb_table(table)

Helper function to get a dataframe of tRNA anti-codon copy numbers from a single HTML table.

Parameters

table (pandas.DataFrame) – The product of read_html().

Returns

tRNA gene copy numbers with the columns: anti_codon, GCN.

Return type

pandas.DataFrame

codonbias.utils.reverse_complement(seq)

The reverse complement of the given DNA sequence, such as the anti-codon that perfectly pairs with a codon.

Parameters

seq (str) – Nucleotide sequence in {A,C,G,T}.

Returns

The reverse complement sequence in {A,C,G,T}.

Return type

str
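For reference, a reverse complement over the {A,C,G,T} alphabet can be written in two steps with the standard library. This is a generic sketch under the hypothetical name revcomp, not the package's own code, and it leaves any other characters unchanged.

```python
# Translation table mapping each nucleotide to its complement
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


print(revcomp("ATG"))   # CAT
print(revcomp("GAAC"))  # GTTC
```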