Does antiretroviral treatment change HIV-1 codon usage patterns in its genes: a preliminary bioinformatics study

Background Codon usage bias has been described for various organisms and is thought to contribute to the regulation of numerous biological processes including viral infections. HIV-1 codon usage has been previously shown to be different from that of other viruses and man. It is evident that the antiretroviral drugs used to restrict HIV-1 replication also select for resistance variants. We wanted to test whether codon frequencies in HIV-1 sequences from treatment-experienced patients differ from those of treatment-naive individuals due to drug pressure affecting codon usage bias. Results We developed a JavaScript to determine the codon frequencies of aligned nucleotide sequences. Irrespective of subtypes, using HIV-1 pol sequences from 532 treatment-naive and 52 treatment-experienced individuals, we found that pol sequences from treatment-experienced patients had significantly increased AGA (arginine; p = 0.0002***) and GGU (glycine; p = 0.0001***), and decreased AGG (arginine; p = 0.0001***) codon frequencies. The same pattern was not observed when subtypes B and C sequences were analyzed separately. Additionally, irrespective of subtypes, using HIV-1 gag sequences from 524 treatment-naive and 54 treatment-experienced individuals, gag sequences from treatment-experienced patients had significantly increased CUA (leucine; p < 0.0001***), CAG (glutamine; p = 0.0006***), AUC (isoleucine; p < 0.0001***) and UCU (serine; p = 0.0005***), and decreased AUA (isoleucine; p = 0.0003***) and CAA (glutamine; p = 0.0006***) codon frequencies. Conclusion Using pol and gag genes derived from the same HIV-1 genome, we show that antiretroviral therapy changed certain HIV-1 codon frequencies in a subtype specific way.


Background
HIV-1 can be classified into various groups (i.e. M, N, O and P). Viruses from groups M and N originated from independent transmissions of simian immunodeficiency virus (SIV) from chimpanzees to humans, while viruses from groups O and P originated from gorillas to humans [1]. Group M of HIV-1 is the most common worldwide and is further divided into various subtypes (i.e. A-K).
Since the identification of HIV as the etiological agent of AIDS more than 30 years ago, antiretroviral therapy has evolved to include the use of combinations of inhibitors that target two or more processes in HIV replication (e.g. entry, reverse transcription, DNA integration, maturation) to reduce viral replication [2,3]. However, drug-resistant HIV mutants can often emerge during the course of therapy [4,5]. Resistant viruses also exist among antiretroviral treatment-naive patients as a result of the transmission of drug resistant HIV variants [6].
Codon usage bias is defined as the preference for particular codon(s) over others in synthesis of the same amino acid. It is well known that codon usage bias exists Open Access AIDS Research and Therapy *Correspondence: navastones@gmail.com 1 McGill University AIDS Centre, Lady Davis Institute for Medical Research, Jewish General Hospital, 3755, Ch. Cote-Ste-Catherine, Montréal, QC, Canada Full list of author information is available at the end of the article among different organisms [7][8][9]. Codon usage bias might have arisen in the course of evolution to protect an organism from pathogens bearing invasive foreign nucleic acids, such as viruses and transposable elements, and is thus sometimes considered an aspect of intrinsic immunity. The importance of codon usage bias in the immune response is illustrated by the activity of the interferon inducible schlafen family member 11 (SLFN11) protein [10] that selectively inhibits late stages of HIV-1 production in a codon usage-dependent manner [10]. SLFN11 binds to tRNA and thereby prevents tRNA pool changes that would otherwise be triggered by HIV infection [10]. By using sequences documented over a period of 23 years, it has been shown that the codons of HIV regulatory genes match closely with human codon preference patterns, with rev being the closest followed tat, nef and vpr respectively [11]. It has been speculated that codon preference patterns that are similar to those of the host might confer several beneficial characteristics to HIV-1, including the potential for the emergence of drug resistance [11,12].
Two hypotheses have been proposed to explain bias in codon usage. One of these involves the concept of translation efficiency, i.e. the genes of proteins that have to be expressed constitutively and/or in large quantities should have codon usage that is similar to that of the host cell, while the genes of proteins that have to be expressed under restrictive conditions and/or in small quantities might involve codon(s) that are not commonly used by the host cell. Re-engineering of the HIV-1 genome, such that its codons matched with the relative synonymous codon usage (RSCU) of humans, led to an increase in viral protein production [13].
The second hypothesis favours the notion that codon usage bias exists because of inherent genetic constraints (for e.g. GC contents) and associated mutation fixation probabilities, i.e. mutation biases [8]. These mutation fixation probabilities can be influenced by external factors such as the host immune system and antiretroviral drugs and this hypothesis is supported by a study that codons within parts of the HIV-1 env gene tend to match with human RSCU over the course of infection because of mutation pressure [14]. This led to the question whether antiretroviral therapy can change HIV codon frequencies significantly and ultimately the usage bias patterns. As a preliminary, to test our hypothesis, we have used HIV-1 pol and gag sequences from antiretroviral treatmentnaive and treatment-experienced patients, retrieved from the Los Alamos HIV database, to see whether there are any significant difference in codon frequencies that may have resulted from treatment. As env sequences are highly variable and can be greatly influenced by the host immune system, we considered only pol and gag sequences in the present study. Due to limited information, we did not take into consideration additional clinical parameters that may have influenced our results such as regimen use and timing of treatment initiation, among others.

Methods
Codon usage data for man and HIV-1 were retrieved from the codon usage database of the Kazusa DNA Research Institute, Japan (http://www.kazusa.or.jp/ codon/) (Fig. 1). Complete HIV-1 genome sequences (as nucleotides) from both antiretroviral treatment-naive and treatment-experienced patients were retrieved from the Los Alamos HIV database (http://www.hiv.lanl.gov/ components/sequence/HIV/search/search.html) as multiple sequence aligned FASTA file. These genome sequences were collected and deposited at different time points from various geographical regions by others. We strictly took annotated sequences to make sure that the viral sequences used in this study were isolated from antiretroviral treatment-naive and treated-experienced patients. We additionally restricted the database to provide only one sequence per patient to eliminate bias. We chose pol and gag genes for this study because they are relatively conserved in HIV-1 compared to env gene [15]. Moreover, majority of HIV drugs currently available in the market are targeted to the pol region. From the complete genome, pol gene and gag gene sequences were cut out using BioEdit© software V.7.2.5 (http://www.mbio. ncsu.edu/bioedit/bioedit.html). The HIV-1 HXB2 pol and gag genes were used as a reference sequence. Using the same software, pol protein (and gag protein) multiple sequence alignments (by implementing ClustalW) were performed separately. Sequences with additional stop codons and poor sequence quality (including one or more R, Y and other nucleotides) were removed from further analysis. Nucleotides encoding amino acids from W34 to S53 in pol gene and amino acids at positions 1, 110-127, 371-374, 378, 385, 464-470, 475-484 and 497-499 in gag gene were also removed from further analysis because of difficulty with the alignment (i.e. this region was found to be highly prone to insertion-deletion mutations). We covered 98% of amino acids in pol gene and 91% of amino acids in gag gene in this study. Later, sequences were toggled back from amino acids to nucleotides. A java script was developed that gave us the codon usage per amino acid in Excel format (https://drive. google.com/folderview?id=0Bw4LWIJCCBxwRmVEalN NWG9JY1E&usp=sharing). The data were imported into GraphPad Prism V.5. We performed non-parametric test (Mann-Whitney test; 95% CI; two-tailed) for each codon between pol gene sequences derived from treatmentnaive and treated individuals (same analysis repeated for gag gene sequences as well). We chose non-parametric tests over parametric tests for the entire study for two reasons: (1) we had fewer HIV-1 subtype C sequences from treatment-experienced patients, and (2) we did not have access to all relevant clinical parameters that would have assisted our statistical evaluations.

Human and HIV-1 codon usage are different
Two earlier studies showed that HIV has different codon usage patterns compared to other viruses including HTLV-1 [16,17], although, these previous works did not explain specific codon changes in detail. We compared codon usage in human and HIV-1 genomes (using data from Kazusa Codon Usage Database), and found that eight codons, i.e. UUA (leucine), CUA (leucine), AUA (isoleucine), GUA (valine), CAA (glutamine), AAU (asparagine), AGA (arginine) and GGA (glycine) were >twofold more common within the HIV-1 than in the human genome (Fig. 1, represented by *). UGG (tryptophan) was also overrepresented in HIV-1 compared to humans; however, given that UGG is the only codon for tryptophan, this observation simply indicates that this amino acid is more prevalent in HIV-1 than in human proteins (Fig. 1, represented by #). An earlier study also reported differences in codon usage patterns between HIV-1 and humans using HIV sequences obtained over 23 years [11].

Phylogeny and resistance analysis of studied sequences
First, we wanted to evaluate evolutionary relationships among the sequences used in this study. pol gene Fig. 1 Comparison of codon usage between human and HIV-1. Codon usage data for human and HIV-1 were obtained from the codon usage database of Kazusa (http://www.kazusa.or.jp/codon/). Eight codons, i.e. UUA and CUA (leucine), AUA (isoleucine), GUA (valine), CAA (glutamine), AAU (asparagine), AGA (arginine) and GGA (glycine) were found to be present at levels > twofold in the HIV-1 genome compared to the human genome (represented by *). Tryptophan which has only one codon (i.e. UGG) is represented by #. An increase in UGG in HIV-1 simply means that tryptophan is more prevalent in HIV-1 than in human proteins sequences from 532 treatment-naive and 52 treatmentexperienced HIV-1 samples were studied. For the construction of a phylogenetic tree, MEGA6 (http://www. megasoftware.net/) software was used [18]. The tree construction parameters included: Maximum Likelihood (for statistical analysis), Bootstrap method (for testing of phylogeny), 1000 (for number of Bootstrap replications), nucleotides (for substitution type), Tamura-Nei model (for model) while others were set to default parameters. From the phylogenetic tree, we found that the sequences formed distinct diverse clusters, thereby making their sequences ideal for further analysis (Fig. 2).
We also evaluated resistance mutations in treatmentnaive and treated-experienced HIV-1 samples and included all the resistance markers within the pol gene, as listed by the International Antiviral Society-USA 2014 [19]. In the case of the reverse transcriptase (RT) gene, resistance markers were found to be more prevalent in HIV-1 samples isolated from treatment-experienced patients compared with treatment-naive patients but the same trend was not seen with resistance markers within the protease and integrase genes (Fig. 3). Two reasons for this might be a lower degree of protease and integrase resistance in treatment-experienced patients due to small sample size or because most patients had been prescribed RT inhibitors but not protease inhibitors or integrase inhibitors. For the RT region, mutations at amino acid positions 41, 70, 184, 190, 210 and 215 were found >fourfold more frequently in treatment-experienced than in treatment-naive patients.

Certain HIV-1 codon frequencies in the pol gene are significantly different between treatment-naïve and -experienced patients
We investigated whether antiretroviral treatment influences HIV-1 codon frequency. Irrespective of HIV-1 subtype, we compared codon repartition within unique pol gene sequences of 532 treatment-naive and 52 treatment-experienced individuals with the following subtype distribution: B = 35.2, C = 38, AE = 9.4, others < 4% for treatment-naive and B = 53.9, BF = 9.6, C = 9.6, BC = 5.8 and others < 4% in treatment-experienced. Importantly, codon frequency was measured for each amino acid, thus excluding differences due to amino acid changes from this analysis. Of the eight above mentioned codons that were initially identified as being differentially used in humans vs HIV-1, one was significantly increased in sequences from treatment-experienced individuals, i.e. AGA (arginine) (p = 0.0002***) (Table 1). Additionally, GGU (glycine) was significantly increased (p = 0.0001***) in treatment-experienced compared to treatment-naive sequences. A different arginine codon, namely AGG, was significantly decreased (p = 0.0001***) in treatment-experienced sequences. Codons GCU and GCC of alanine, AAU and AAC of asparagine, GGG of glycine, CAU and CAC of histidine, AUU and AUC of isoleucine, CUG of leucine, CCG of proline and GUA of valine were also affected when we compared HIV-1 pol sequences from treatment-naive and treatment-experienced patients. While GCC, AAU, CAU, AUU and CUG were more prevalent in sequences from treatment-experienced individuals, GCU, AAC, GGG, CAC, AUC, CCG and GUA were decreased.

HIV-1 codon frequency change is subtype specific
To try to determine a role for viral subtype, we analysed 187 HIV-1 subtype B sequences from treatment-naive individuals and compared them with 28 HIV-1 subtype B sequences from treatment-experienced individuals. None of the codons differed significantly (i.e. *** or ** significance). Only AUA (isoleucine) and GUC (valine) trended towards higher prevalence in treatment-experienced sequences and with low significance (i.e. p = 0.0443* and 0.0201* respectively).
We also compared 202 HIV-1 subtype C sequences from treatment-naive with 5 HIV-1 subtype C sequences from treatment-experienced individuals. Phylogenetic analysis (Fig. 2b) and sequence geography information showed that the 4 out of 5 HIV-1 subtype C sequences from treatment-experienced persons were evolutionarily distinct from one another. Serine codons i.e. UCC, AGU and AGC significantly differed in sequences from treatment-experienced individuals (i.e. p = 0.0052**, 0.0033** and 0.0085** respectively) with UCC and AGU found to be diminished while AGC was increased. UUA for leucine, GCU and GCA for alanine, and AGA and AGG codons for arginine all differed with p values of 0.0276* (for both lysine codons), 0.0423* (UUA), 0.038* (GCU), 0.0267* (GCA), 0.0138* (AGA) and 0.0397* (AGG). Codons GCA, AGA, and AAA were more frequent in sequences from treatment-experienced persons while the four other codons were less frequent in sequences from treatment-experienced individuals. No significant changes were seen in regard to other codons.

The role of drug pressure, GC content and other factors?
The differences in codon frequency between treatmentnaive and treatment-experienced sequences could conceivably be influenced by the emergence of resistance substitutions. Irrespective of subtype, an increase in AGA codon usage in pol could potentially be related to the prevalence of K70R substitutions associated with stavudine-based or zidovudine-based therapy (Fig. 3). Lysine (K) is encoded by two codons: AAA and AAG, the former of which can give rise to the AGA (arginine) codon through a single A to G transition. K70R substitutions in reverse transcriptase could therefore result in an increase in the proportion of AGA codons. Similar explanations can be proposed for treatment-associated changes in AAU codons (asparagine) that might be related to K103N substitutions (AAG or AAA to AAU) (Fig. 3). On the other hand, irrespective of subtype, there was a significant decrease (i.e. p = 0.0001***) in AGG (arginine) in treatment-experienced individuals. Although AAG (lysine) can undergo a single A to G transition to give rise to AGG (arginine), this situation is not favoured, indicating that amino acid substitutions due to drug pressure may not be alone sufficient to influence codon frequency patterns. When considering only the genomic region encoding for RT, we found that codons AGA (arginine) and AAU (asparagine) were significantly increased in sequences from treatment-experienced patients (i.e. p = 0.0250* and p = 0.0040**, respectively). Additionally, we found that codon GGU (glycine) was significantly more frequent in sequences from treatmentexperienced individuals (i.e. p < 0.0001***). An increase in GGU codon might be attributed to the A98G substitution in RT.

Discussion
Except tryptophan, each amino acid has more than one codon that can be decoded by the amino acid containing t-RNA. Codon usage bias is a measure of codon use for each amino acid and should not be reflected in baseline differences in peptidic sequences. Codon usage bias is likely important for the modulation of translation processes. Using pol and gag gene sequences from treatment-naive and treatment-experienced patients, we have shown that antiretroviral therapy can modulate codon frequencies that might ultimately lead to usage biases (Table 1). Although this is an initial attempt at this type of work, it was limited by the availability and diversity of numbers of sequences available from treatment-experienced patients. A comparison of codon frequency differences between pol and gag in treatment-naive and treatment-experienced sequences showed that changes can occur at both