Skip to main content

Extensive purifying selection acting on synonymous sites in HIV-1 Group M sequences



Positive selection pressure acting on protein-coding sequences is usually inferred when the rate of nonsynonymous substitution is greater than the synonymous rate. However, purifying selection acting directly on the nucleotide sequence can lower the synonymous substitution rate. This could result in false inference of positive selection because when synonymous changes at some sites are under purifying selection, the average synonymous rate is an underestimate of the neutral rate of evolution. Even though HIV-1 coding sequences contain a number of regions that function at the nucleotide level, and are thus likely to be affected by purifying selection, studies of positive selection assume that synonymous substitutions can be used to estimate the neutral rate of evolution.


We modelled site-to-site variation in the synonymous substitution rate across coding regions of the HIV-1 genome. Synonymous substitution rates were found to vary significantly within and between genes. Surprisingly, regions of the genome that encode proteins in more than one frame had significantly higher synonymous substitution rates than regions coding in a single frame. We found evidence of strong purifying selection pressure affecting synonymous mutations in fourteen regions with known functions. These included an exonic splicing enhancer, the rev-responsive element, the poly-purine tract and a transcription factor binding site. A further five highly conserved regions were located within known functional domains. We also found four conserved regions located in env and vpu which have not been characterized previously.


We provide the coordinates of genomic regions with markedly lower synonymous substitution rates, which are putatively under the influence of strong purifying selection pressure at the nucleotide level as well as regions encoding proteins in more than one frame. These regions should be excluded from studies of positive selection acting on HIV-1 coding regions.


Several statistical models of codon evolution have been developed and applied to protein-coding sequences from viral and other pathogens [14]. The primary application of these models has been the detection of evidence of diversifying selection acting on protein coding DNA sequences. Within maximum likelihood or Bayesian frameworks these models can be used to identify specific sites at which adaptive mutations have occurred. In the context of virus infections this information can be especially useful for identifying immune escape and drug resistance mutations [3, 5, 6].

Positive selection is frequently inferred by comparing the rate of non-synonymous substitutions per non-synonymous site (dN) to the rate of synonymous substitutions per synonymous site (dS). The ratio of these two rates is often represented by the symbol ω. Under the assumption that synonymous substitutions are neutral and that the synonymous substitution rate therefore approximates the neutral rate of evolution, diversifying selection can be inferred when ω is greater than one. Several methods exist to determine whether there is evidence that ω is greater than one at a subset of sites in a protein-coding gene (i.e. the gene is evolving under diversifying selection) and to identify the sites within the gene at which diversifying selection occurs [3, 4, 79].

In many of the situations in which this strategy is applied, the assumption that synonymous substitutions are fixed at a constant rate and provide a good estimate of the neutral rate of evolution, may not hold. Kosakovsky Pond & Muse reported that coding sequences from a wide range of taxa, including HIV-1, show strong evidence of variation in the rate of synonymous substitution across coding regions [10]. There are two possible causes of synonymous rate variation. If synonymous substitutions are indeed neutral, variation in the mutation rate can cause the synonymous substitution rate to vary. In such a case, it is possible to include a varying synonymous substitution rate in the codon models of evolution and inference of positive selection from comparison of the local synonymous and nonsynonymous substitution rates remains feasible. However, if the variation in synonymous substitution rate is caused by selection acting to preserve functions that are encoded at the nucleotide level, even a comparison of local nonsynonymous and synonymous substitution rates cannot be used to infer positive selection because the synonymous substitution rate is no longer a valid proxy for the neutral rate of evolution and the standard approach of inferring the action of diversifying selection when ω > 1 is not valid.

Failure to model variation in synonymous substitution rate will result in an overall underestimate of the neutral rate of evolution. This undermines the validity of the inference of selection, because nonsynonymous substitution rates are compared against a rate which is no longer a good estimate of the neutral rate, and this is likely to result in inference of diversifying selection at a proportion of the sites that are actually evolving neutrally. Indeed, as the number of taxa increases, we expect an ever greater proportion of the neutral sites to be classified as diversifying selection sites in this scenario. Alternatively, if the synonymous substitution rate variation is modeled and selection inferred when dN is greater than the local dS rate then we expect a very high probability of false inference of selection at codons where the synonymous positions happen to be functionally important and conserved, and the nonsynonymous positions are neutral or experience less purifying selection. Thus, in general, in a codon-based method, analysis of selection is unreliable when there is purifying selection acting to preserve functions at the nucleotide level. An example where an elevated ω was attributable to purifying selection acting on synonymous sites was reported by Hurst & Pal [11].

Several examples of sequence motifs within protein-coding sequences that are expected to be under purifying selection at the nucleotide level are known in HIV-1, many of which are involved in regulating gene expression. Examples include the 3' long terminal repeat (LTR) region, part of which also encodes the Nef protein [12, 13]. In addition to conserved RNA secondary structures, the LTR contains several regulatory elements, some of which directly interact with cellular transcription factors (e.g. the Ets protein family) [14, 15]. The viral sequence also has an intragenic nuclease hypersensitive region involved in regulating gene expression (also referred to as HS7) in the pol gene [1618] and the rev-responsive element (RRE) in the env gene which interacts with the Rev protein to transport unspliced or partially spliced RNA from the nucleus to the cytoplasm of the infected cell [17, 1921]. Some functionally important regions of the RRE have previously been found to be conserved at the nucleotide sequence level, presumably the result of purifying selection pressure to preserve this function [22]. The inhibitory sequence elements (INS) in gag and the cis-repressive sequence (CRS) in pol are examples of negative regulators of transcription [2326]. If these functional sites are important for viral viability, then we expect them to be preserved by purifying selection.

While it represents a significant challenge for studies of selection acting on the HIV-1 amino acid sequence, the variability in the synonymous substitution rate may also provide useful information about previously unknown sequence motifs within the coding fraction of the HIV-1 genome that function at the nucleotide level. Although some variability can be explained by a variable mutation rate, the identification of regions of very high conservation that cannot be explained by selection acting on the amino acid sequence or by known motifs that function at the nucleotide level has the potential to highlight novel functions encoded in the HIV-1 genome.

Here we use an existing model of codon sequence evolution [10] to provide the first complete overview of site-to-site variation in synonymous substitution rate across the whole HIV-1 genome and identify selection pressures likely to be driving this variation. This model allows dN and dS to vary independently across sites, ensuring that the estimated dS values reflect selection pressure acting upon the nucleotide sequence and not at the amino acid level. It is worthwhile to distinguish between selective pressure acting at the nucleotide level, affecting both synonymous and nonsynonymous changes, and at the amino acid level, affecting nonsynonymous changes only. Unfortunately, quantifying the relative contributions of nucleotide and amino acid level effects on nonsynonymous changes is highly sensitive to model assumptions. We therefore restricted the analysis to synonymous changes and do not attempt to quantify the nucleotide-level selective pressure on nonsynonymous changes. We took into account recombination breakpoints in order to avoid biased estimates that can result from fitting phylogenetic models that do not take recombination into account [27, 28].

We report patterns of sequence conservation around nucleotide sequence motifs with known functions and identify additional conserved nucleotide elements that do not fall within any currently characterized functional motifs. Finally, we report the locations of all HIV-1 genome regions where we infer that purifying selection acting directly on the nucleotide sequence is likely to cause a substantial reduction in the synonymous substitution rate. These are provided with respect to the HXB2 reference strain, to enable other researchers to mask these regions from their analyses of positive selection acting on HIV-1 genes.


Sequence data

Nucleotide sequence alignments consisting of HIV-1 Group M subtype reference sequences were downloaded from the Los Alamos database for each gene of the HIV-1 genome [29]. Each alignment had at least one sequence (total ranging from 32 to 37) from each of the 11 non-recombinant HIV-1 group M subtypes A1, A2, B, C, D, F1, F2, G, H, J and K [Genbank: AB253421, AB253429, AF004885, AF005494, AF005496, AF061641, AF061642, AF067155, AF069670, AF075703, AF077336, AF082394, AF082395, AF084936, AF190127, AF190128, AF286237, AF286238, AF377956, AF484509, AJ249235, AJ249236, AJ249237, AJ249238, AJ249239, AY173951, AY253311, AY331295, AY371157, AY371158, AY423387, AY612637, AY772699, DQ676872, DQ853463, K03454, K03455, U46016, U51190, U52953, U88824, U88826]. All regions encoding amino acids in more than one frame were identified and regions judged by eye to be unreliably aligned, i.e., positions 6544–6595, 6700–6715, 7318–7375 of the env gene region, were excluded from the analysis. We used the HIV-1 genome map and sequence annotations available from the Los Alamos database to identify the regions of the genome that encode proteins in a single reading frame (see Table 1) [29].

Table 1 Summary of HIV-1 reference sequence data used

We identified recombination breakpoints in each alignment using the GARD (Genetic Algorithm for Recombination Detection) algorithm implemented in the HyPhy (Hypothesis testing using Phylogenies) package [30, 31]. Evidence of recombination was detected in all genes except tat, vpr and vpu. GARD outputs both an alignment showing the positions of recombination breakpoints and separate tree topologies for each of the sequence alignment segments bounded by these breakpoints.

Synonymous substitution rate estimation

Synonymous substitution rates were estimated using a version of the MG94 codon substitution model [1, 10]. We used the Dual Model, which allows dS to vary independently of dN and used three discrete categories for each rate. We ran the selected models using a HyPhy batch script for analysis of selection acting on recombining sequences which we developed previously [32]. This method uses separate tree topologies for each partition of the sequence alignment while keeping the rest of the model parameters fixed across all partitions. We used sliding window plots of mean dS values, calculated over three adjacent codons, to identify regions with low synonymous substitution rates.

We also analyzed an alignment of HIV-1 subtype C gag sequences [Genbank: DQ792982-DQ793045] described previously [33] to assess the impact of conservation acting on synonymous substitutions on inference of positive selection. We inferred positive selection using model M2a of Yang and colleagues [34], taking recombination into account [32].


We used HyPhy to generate simulated data under a neutral model with trees generated from the original alignments (or the tree from the largest un-recombined region for alignments where recombination was detected). The same sequence alignments used as input in the initial analysis were used and one hundred simulated datasets were generated for each alignment. Each simulated dataset was then analyzed using the Dual Model as described above. For each gene the minimum value of mean dS across all sliding windows of three adjacent codons, in all of the one hundred simulated datasets, was used as a conservative threshold to identify windows of reduced dS in the observed data. This stringent threshold and a less stringent one that included 95% of the values inferred from the simulated data are shown in the sliding window plots.

Functional analysis of a novel nucleotide sequence motif in env

JC53-bl and 293T cells were obtained from Dr George Shaw (University of Alabama, Birmingham, AL) and cultured as described previously [35]. M7-Luc cells (5.25.EGFP.Luc.M7) were kindly provided by Dr Nathaniel Landau through the AIDS Research and Reference Reagent Program, Division of AIDS, NIAID. M7-Luc cells were cultured in RPMI-1640 medium containing 2 mM L-glutamine, 25 mM HEPES, 10% heat-inactivated fetal bovine serum (FBS) and 50 ug/ml gentamicin (Sigma), supplemented with 10 μg/ml DEAE-Dextran for infectivity assays. An infectious molecular clone, p81, was obtained through the AIDS Research and Reference Reagent Program, Division of AIDS, NIAID, NIH: p81A-4 (Cat#11440) from Dr. Bruce Chesebro. Mutations were introduced into p81 using the Stratagene QuickChange XL kit. Infectious viral particles were produced by transfection of 293T cells using Fugene reagent (Roche BioSciences), and transfection output was assessed by determination of TCID50 in JC53-bl cells as described previously [35]. Equivalent numbers of TCID50 were used to infect M7-luc cells seeded into 96 well plates at a density of 7.5 × 105 cells/ml followed by a washout step12 hours post-infection. Replication was monitored by measurement of p24 production using the Vironostica HIV-1 antigen Microelisa system (Biomerieux) over 5 days.


Consistent with previous reports [5, 10], we found evidence of variation in synonymous substitution rates within and across HIV-1 genes (Figure 1). For all genes the Dual Model [10], which allows independent variation of dS and dN had a much better fit to the data than a model with constant dN and dS (referred to as the Constant model in Table 2) or than a model in which only dN varied across sites (the Nonsynonymous model in Table 2). The variance of dS gives an indication of the extent of site-to-site synonymous rate heterogeneity within the different genomic regions (Table 3). There was significant variation between genes (p-value = 2 × 10-7, from Levene's test) with the least site-to-site variation in dS observed in vpu and the most in vpr followed by nef and env (Figure 2).

Table 2 AIC model selection index to show how different models fit to the data.
Table 3 Regions of the HIV-1 sequence which should be considered for exclusion in positive selection analysis studies.
Figure 1
figure 1

HIV-1 genome-wide plot of mean nonsynonymous substitution rates. A 30 codon sliding window was used. Regions coding for proteins in more than one frame are shaded in pink. The frames that were used in each region are shown in grey rectangles, with frame 1 at the top and frame 3 at the bottom.

Figure 2
figure 2

Box-and-whisker plot showing variation of dS values per gene. The values are from non-overlapping regions of HIV-1 genes.

We found evidence of strong purifying selection acting directly on the nucleotide sequence at twenty-three sites across the HIV-1 genome (Figures 3, 4, 5 and 6). Fourteen of these regions (marked in black in Figures 3 and 4) coincided exactly with well characterized functional motifs while for another five (marked in green in Figure 3), we were able to identify possible functions based on the known functions of the sequence domains in which they were situated. We could not, however, find plausible explanations for high degrees of sequence conservation observed within a twelve-nucleotide region of the env gene and three other regions in vpu (marked in red in Figure 4). Sequence logos illustrating the conservation in each of these twenty-three significantly conserved regions are shown in Figure 5 (for those with known specific function) and Figure 6 (for those with predicted and unknown functions).

Figure 3
figure 3

Mean (blue) synonymous substitution rates observed across gag, pol , vif and vpr genes. Mean dS was calculated over sliding windows of three codons. Horizontal lines mark the most stringent (red) and less stringent (purple) significance thresholds. (a) dS across the gag gene. 'sl4'; the fourth stem loop of the encapsidation signal, 'INS1'; a motif within the first inhibitory sequence region, 'INS2'; a motif within the second inhibitory sequence region. (b) dS across the pol gene. 'crs'; start of the cis-repressive sequence, horizontal dotted line is the nuclease hypersensitive region and sites 'B', 'G', 'C' and 'D' are confirmed transcription factor binding sites known as site-B, GC-box, site-C and site-D respectively. 'p1'and 'p2'; conserved sites within nuclease hypersensitive region. "ese"; exonic splicing enhancer. (c) dS across the vif gene. (d) dS across the vpr gene. 'ssa3'; 3' splice acceptor site A3, 'rnase'; RNae-V1 cleavage site.

Figure 4
figure 4

Mean (blue) synonymous substitution rates observed across tat, vpu , env and nef genes. (a) dS across the tat gene. 'ess2'; exonic splicing silencer ESS2, 'ssa4b'; 3' splice acceptor site A4b. (b) dS across the vpu gene. 'n2', 'n3' and 'n4'; novel conserved sites. (c) dS across the env gene. "n1" is the novel conserved site. The black dotted horizontal lines indicate poorly aligned regions that were excluded from the analysis. "rre"; rev-responsive element, *; the 9 nucleotides (5' GACGGUACA 3') which bind to the Rev protein with highest affinity, "ss"; splice site region for the tat and rev 3' exons. (d) dS across the nef gene. "G-A"; G-to-A hypermutations (see Additional file 1), 'var'; highly variable region, 'ppt'; poly-purine tract, "c"; PPT integrase attachment site, 'nre'; start of the negative repressive sequence, 'ets'; Ets-1 transcription factor binding site.

Figure 5
figure 5

Sequence motifs for the highly conserved regions with known function. The fourteen regions with known specific functions found to be under strong purifying selection in HIV-1 genes. The range of coordinates on the HIV-1 genome for each motif is given in column 2. Numbers above each logo represent the degeneracy at each nucleotide site.

Figure 6
figure 6

Sequence motifs for the highly conserved regions with unknown specific function. The five regions with predicted functions and four regions with unknown functions found to be under strong purifying selection in HIV-1 genes. The range of coordinates on the HIV-1 genome for each motif is given in column 2. Numbers above each logo represent the degeneracy at each nucleotide site.

The fourteen regions with known functions included one region consisting of fifteen nucleotides following the gag start codon, positions 793–807 of HXB2. This forms the fourth stem loop (sl4, Figure 3a) of the dimerization/encapsidation signal. The encapsidation signal is a four stem-loop structure which stretches from the 5' LTR and interacts with the nucleo-capsid protein, promoting formation of genomic RNA and blocking the initiation of transcription [14, 36, 37].

In the pol gene, we found two highly conserved regions with known functions. One corresponded to the first three nucleotides (HXB2 coordinate positions 4092–4094) of the 260 nucleotide long cis-repressive sequence (CRS; Figure 3b). The cis-repressive sequence inhibits expression of structural protein mRNAs by preventing their transportation from the nucleus – a process that is reversed by the rev-responsive element (RRE) [2426]. The other was found at the 3' end of the intragenic nuclease hypersensitive domain (hs7 in Figure 3b). The latter motif, located between positions 4926 and 4937 and labeled "ese" in Figure 3b, is also located within HIV-1 exon 2 and the last six nucleotides, TGGAAA, of this conserved region form a known exonic splicing enhancer (ESE) of HIV mRNAs [3842].

Two regions in vpr, a 3' splice acceptor site [41] and an RNase-V1 cleavage site [43, 44] were also conserved (labeled ssa3 and rnase respectively in Figure 3d) at positions 5759–5777 and 5794–5805 respectively. An additional two highly conserved regions were observed within the tat gene, one containing an exonic splicing silencer, ESS2 between nucleotide positions 5855 and 5863 and the other at the 3' splice acceptor site A4b located at positions 5957 to 5968 (ess2 and ssa4b respectively in Figure 4a) [41].

Three of the highly conserved regions with known functions were in the env gene and included a nine-nucleotide long motif from position 7834 to 7842 within the RRE. The approximately two hundred nucleotides long RRE element within env, is known to interact with the Rev protein and facilitates the transport of late un-spliced and partially-spliced RNAs from the nucleus to the cytoplasm [1921]. Although the RRE is associated with a long stretch of sequence that forms a well characterized secondary structure with various conserved domains [22], only the nine nucleotides that bind Rev with highest affinity [21, 45] were sufficiently conserved to be detected using our conservative threshold (Figure 4c). Also conserved within the env gene were the two splice site regions at the end of the tat/rev exon, positions 8349–8354 and 8376–8378, usually referred to as 7a/7b, and 7 ("ss" in Figure 4c; [39, 41, 46, 47]

The last four of the significantly conserved regions with known function were in the nef gene and coincided with the poly-purine tract (PPT), integrase attachment site, negative regulatory element (NRE) and Ets-1 transcription factor binding site (Figure 4d) (with HXB2 coordinates 9066–9083, 9084–9091, 9183–9192 and 9391–9399 respectively). The PPT precedes the start of the LTR and is known to associate with the 3' LTR, serving as a primer for the initiation of HIV-1 plus strand DNA replication [12, 4850]. A previous detailed RT RNase-H binding analysis revealed that priming of the plus strand occurs specifically at the 3' end of the PPT, at the "GGGGGG" motif [5153]. The region adjacent to the 3' end of the PPT was also highly conserved. This region corresponds to the start of the 3'LTR and contains the cleavage site of the PPT by RNAse H as well as the start of integrase attachment region for the integration of the viral genome into the genome of the host [48, 5355]. Two codons in the central region of nef were highly variable, one dominated by G-to-A mutations in a sequence context consistent with APOBEC-induced hypermutations when compared to the Group M ancestral sequence [56] (see Additional file 1) and the second had a high rate of synonymous and nonsynonymous substitutions (labeled "G-A" and "var" in Figure 4d respectively).

One of the novel regions, the twelve nucleotide long motif in env, upstream of the RRE showed the highest degree of conservation of any region in the env gene ('n1' in Figure 4c; positions 7656–7667 of HXB2). Introduction of synonymous point mutations at positions 7, 9 and 12 in this nucleotide motif had no effect on virus output from transfected 293T cells (see Additional file 2(a)). No significant differences were observed between the wild type and mutated viruses with respect to their infectivity in M7-Luc cells (see Additional file 2(b)). Although the effect of mutations within this region is therefore not clear at present, future work making use of more sensitive competitive replication assays will determine whether these changes have an impact on viral fitness. Functional analyses of the three novel conserved regions in vpu, with HXB2 coordinates 6101–6106, 6143–6151 and 6167–6178 (n2, n3, n4 in Figure 4b) are being considered. The protein products of vpu and env are known to be produced from a bicistronic transcript [46] and the conserved regions in vpu may be involved in the control of translation.

In order to assess whether purifying selection is likely to cause false inference of positive selection we used standard methods to detect positive selection in a subtype C gag coding sequence alignment described previously [33]. Sites with ω significantly greater than one, implying positive selection, overlapped significantly with sites that had lower than average dS values (Fisher's exact test odds ratio = 4.4; p-value = 0.006; Additional file 3). This is consistent with a substantial proportion of the positive selection signals resulting from conservation of the synonymous sites rather than diversifying selection acting on the nonsynonymous sites.


This is the first study to provide a detailed analysis of site-to-site variation in the rate of synonymous substitutions across the HIV-1 genome. In the past, site-to-site variation in dS in HIV-1 has been investigated in a single gene [5, 10] and in another study a single overall synonymous substitution rate for the entire genome was determined for comparison to other viral lineages [57]. We modeled site-to-site synonymous rate variation using a similar approach to a previous study [10], in that case only one HIV-1 gene, vif, was considered and sites that encode proteins in multiple reading frames were included. As a consequence, it was not clear whether the observed site-to-site rate variation resulted from variation in the synonymous rate or from selection acting on nonsynonymous substitutions in another reading frame. Here we focused primarily on regions of the HIV genome that encode proteins in a single reading frame and explored functions of nucleotide sequences that have the largest influence on synonymous substitution rate variation.

Previous studies have demonstrated that recombination causes false inference of positive selection. Since recombination affects tree topologies used in fitting phylogenetic models, it is also likely to cause biased estimates of dS. The recent development of methods to account for recombination in selection analyses [32] permitted us to remove recombination as a source of bias in our estimates of synonymous substitution rates.

In addition to the fourteen conserved regions with known functions and the novel sites in env and vpu, five conserved sites without previously reported specific functions occurred within known functional domains. These include three short (3–6 bp) motifs in the inhibitory (INS) sequence regions of gag (HXB2 positions 898–903, 985–996 and 1309–1314, Figure 3a). Previous in-vitro analyses have shown that short motifs within the approximately two hundred bp long INS regions, are responsible for the actual inhibition of mRNA expression [23, 24, 26]. Although the three motifs we find within this region have not been specifically identified in-vitro, a computational study by Wolff et al (2003) showed that the INS sequences have several short functional motifs within them [26]. The conserved sites we identified within INS1 and INS2 could serve the same inhibitory function. Functional assays elucidating the role of these sites in the inhibition of mRNA expression could help to determine the precise mechanisms by which inhibition occurs and whether these sites also play a role. In another previous study which analyzed the RNA secondary structure of the 5' region of HIV-1, these two regions, labeled 'INS1' in Figure 3a, were found to be involved in conserved Watson-Crick base-pairing [58]

The last two of the five conserved regions with unidentified specific functions were within the pol HS7. HS7 spans five hundred nucleotides, between positions 4481 and 4982 of the HIV-1 genome, and has an LTR-like regulatory function [16, 17]. Previous studies revealed four domains towards the 3' end of this region (PU box, GC-box, site-C and site-D) that bind to specific transcription factors (TFs) and are also important for viral infectivity [16, 17]. In these studies, the Oct-1, Oct-2, PU.1, Sp1 and Sp3 transcription factors were found to bind to at least one of the four identified sites. The two conserved regions we identified are outside these specific identified functional domains, but one of them at positions 4767–4790 (labeled "p1" in Figure 3b) showed potential binding to Oct-1 using the MATCH tool from the TRANSFAC database [59, 60]. Potential association with a transcription factor was not observed for the three nucleotides (positions 4864–4866) labeled 'p2' and its adjacent sites.

Knowledge of regions of the genome that function at the nucleotide level is important for positive selection analysis. Conserved synonymous sites can cause false detection of positive selection and need to be either excluded from analyses or modeled appropriately. The danger is that some sites may be assumed to be evolving adaptively simply as a result of the purifying selection acting directly on the nucleotide sequence. We found evidence of this in subtype C gag sequences where positively selected sites significantly coincided with significantly low dS (Additional file 3). In addition, a study by Hurst & Pal (2001) also showed false detection of positive selection caused by purifying selection pressure acting on synonymous sites.

In many selection, studies the synonymous rate is assumed to be constant. However, negative selection acting on synonymous sites can potentially reduce gene-wide estimates of the synonymous rates below the neutral evolution rate. Comparison of site-specific nonsynonymous substitution rates against this underestimate of the neutral rate is likely to cause a proportion of the selectively neutral nonsynonymous sites to seem as though they are evolving adaptively. We have, however, used a very stringent cutoff to identify the twenty three regions within the HIV genome that have obviously reduced synonymous substitution rates. For a more conservative analysis of selection, all the significantly conserved sites, i.e., including those conserved at 95% confidence (p-value < 0.05, listed in Table 3) should be excluded from analyses along with sites that encode proteins in multiple frames. Surprisingly, we found that the rate of synonymous substitution, was higher, on average, in overlapping gene regions that encode proteins in more than one frame (p = 6 × 10-7 from Wilcoxon rank sum test; Figure 1); however, lower dS was observed within some genome regions that are translated in multiple reading frames. Analysis of the most diverse sequences within subtype B and C revealed more highly conserved sites across the RRE and INS1 regions at the subtype level (Additional file 4). These putatively functional domains could also be removed in more conservative studies of selection acting on HIV protein sequences.


We have analyzed sequence variation at the synonymous sites of non-overlapping regions of HIV-1 genes. We found substantial site-to-site variation in the rate of synonymous substitution with evidence of purifying selection pressure within functional domains such as the Rev-responsive element. The majority of conserved sites we identified are within functional regions that are well documented in the literature; however, the total number of sites that function at the nucleotide level is unknown and it is therefore difficult to assess the fraction of the known and novel functional sites that are detectable using our method. In addition to identifying putatively functional sites under purifying selection, these results contribute to the robustness of analyses of positive selection by identifying conserved synonymous sites that can cause false positive inference of selection. The sites presented in Table 3 and Figures 5 and 6 thus form a resource for future studies of selection pressures acting on HIV-1 genes.


  1. Muse SV, Gaut BS: A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol 1994, 11: 715-724.

    CAS  PubMed  Google Scholar 

  2. Goldman N, Yang Z: A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 1994, 11: 725-736.

    CAS  PubMed  Google Scholar 

  3. Nielsen R, Yang Z: Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 1998, 148: 929-936.

    PubMed Central  CAS  PubMed  Google Scholar 

  4. Yang Z, Nielsen R, Goldman N, Pedersen AM: Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 2000, 155: 431-449.

    PubMed Central  CAS  PubMed  Google Scholar 

  5. Lemey P, Kosakovsky Pond SL, Drummond AJ, Pybus OG, Shapiro B, Barroso H, Taveira N, Rambaut A: Synonymous substitution rates predict HIV disease progression as a result of underlying replication dynamics. PLoS Comput Biol 2007, 3: e29. 10.1371/journal.pcbi.0030029

    PubMed Central  Article  PubMed  Google Scholar 

  6. Seoighe C, Ketwaroo F, Pillay V, Scheffler K, Wood N, Duffet R, Zvelebil M, Martinson N, McIntyre J, Morris L, Hide W: A model of directional selection applied to the evolution of drug resistance in HIV-1. Mol Biol Evol 2007, 24: 1025-1031. 10.1093/molbev/msm021

    CAS  Article  PubMed  Google Scholar 

  7. Choisy M, Woelk CH, Guegan JF, Robertson DL: Comparative study of adaptive molecular evolution in different human immunodeficiency virus groups and subtypes. J Virol 2004, 78: 1962-1970. 10.1128/JVI.78.4.1962-1970.2004

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  8. de Oliveira T, Salemi M, Gordon M, Vandamme AM, van Rensburg EJ, Engelbrecht S, Coovadia HM, Cassol S: Mapping sites of positive selection and amino acid diversification in the HIV genome: an alternative approach to vaccine design? Genetics 2004, 167: 1047-1058. 10.1534/genetics.103.018135

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  9. Zanotto PM, Kallas EG, de Souza RF, Holmes EC: Genealogical evidence for positive selection in the nef gene of HIV-1. Genetics 1999, 153: 1077-1089.

    PubMed Central  CAS  PubMed  Google Scholar 

  10. Kosakovsky Pond S, Muse SV: Site-to-site variation of synonymous substitution rates. Mol Biol Evol 2005, 22: 2375-2385. 10.1093/molbev/msi232

    Article  Google Scholar 

  11. Hurst LD, Pal C: Evidence for purifying selection acting on silent sites in BRCA1 . TRENDS in Genetics 2001, 17: 62-65. 10.1016/S0168-9525(00)02173-9

    CAS  Article  PubMed  Google Scholar 

  12. Quinones-Mateu ME, Mas A, Lain dL, Soriano V, Alcami J, Lederman MM, Domingo E: LTR and tat variability of HIV-1 isolates from patients with divergent rates of disease progression. Virus Res 1998, 57: 11-20. 10.1016/S0168-1702(98)00082-3

    CAS  Article  PubMed  Google Scholar 

  13. Das AT, Klaver B, Berkhout B: The 5' and 3' TAR elements of human immunodeficiency virus exert effects at several points in the virus life cycle. J Virol 1998, 72: 9217-9223.

    PubMed Central  CAS  PubMed  Google Scholar 

  14. Wilkinson KA, Gorelick RJ, Vasa SM, Guex N, Rein A, Mathews DH, Giddings MC, Weeks KM: High-throughput SHAPE analysis reveals structures in HIV-1 genomic RNA strongly conserved across distinct biological states. PLoS Biol 2008, 6: e96. 10.1371/journal.pbio.0060096

    PubMed Central  Article  PubMed  Google Scholar 

  15. Pereira LA, Bentley K, Peeters A, Churchill MJ, Deacon NJ: A compilation of cellular transcription factor interactions with the HIV-1 LTR promoter. Nucleic Acids Res 2000, 28: 663-668. 10.1093/nar/28.3.663

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  16. Goffin V, Demonte D, Vanhulle C, de Walque S, de Launoit Y, Burny A, Collette Y, Van Lint C: Transcription factor binding sites in the pol gene intragenic regulatory region of HIV-1 are important for virus infectivity. Nucleic Acids Res 2005, 33: 4285-4310. 10.1093/nar/gki720

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  17. Van Lint C, Ghysdael J, Paras P Jr, Burny A, Verdin E: A transcriptional regulatory element is associated with a nuclease-hypersensitive site in the pol gene of human immunodeficiency virus type 1. J Virol 1994, 68: 2632-2648.

    PubMed Central  CAS  PubMed  Google Scholar 

  18. Verdin E, Becker N, Bex F, Droogmans L, Burny A: Identification and characterization of an enhancer in the coding region of the genome of human immunodeficiency virus type 1. Proc Natl Acad Sci 1990, 87: 4874-4878. 10.1073/pnas.87.12.4874

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  19. Hadzopoulou-Cladaras M, Felber BK, Cladaras C, Athanassopoulos A, Tse A, Pavlakis GN: The rev (trs/art) protein of human immunodeficiency virus type 1 affects viral mRNA and protein expression via a cis-acting sequence in the env region. J Virol 1989, 63: 1265-1274.

    PubMed Central  CAS  PubMed  Google Scholar 

  20. Renwick SB, Critchley AD, Adams CJ, Kelly SM, Price NC, Stockley PG: Probing the details of the HIV-1 Rev-Rev-responsive element interaction: effects of modified nucleotides on protein affinity and conformational changes during complex formation. Biochem J 1995,308(Pt 2):447-453.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  21. Peterson RD, Feigon J: Structural change in Rev responsive element RNA of HIV-1 on binding Rev peptide. J Mol Biol 1996, 264: 863-877. 10.1006/jmbi.1996.0683

    CAS  Article  PubMed  Google Scholar 

  22. Phuphuakrat A, Auewarakul P: Heterogeneity of HIV-1 Rev response element. AIDS Res Hum Retroviruses 2003, 19: 569-574. 10.1089/088922203322230932

    CAS  Article  PubMed  Google Scholar 

  23. Schwartz S, Campbell M, Nasioulas G, Harrison J, Felber BK, Pavlakis GN: Mutational inactivation of an inhibitory sequence in human immunodeficiency virus type 1 results in Rev-independent gag expression. J Virol 1992, 66: 7176-7182.

    PubMed Central  CAS  PubMed  Google Scholar 

  24. Schneider R, Campbell M, Nasioulas G, Felber BK, Pavlakis GN: Inactivation of the human immunodeficiency virus type 1 inhibitory elements allows Rev-independent expression of Gag and Gag/protease and particle formation. J Virol 1997, 71: 4892-4903.

    PubMed Central  CAS  PubMed  Google Scholar 

  25. Cochrane AW, Jones KS, Beidas S, Dillon PJ, Skalka AM, Rosen CA: Identification and characterization of intragenic sequences which repress human immunodeficiency virus structural gene expression. J Virol 1991, 65: 5305-5313.

    PubMed Central  CAS  PubMed  Google Scholar 

  26. Wolff H, Brack-Werner R, Neumann M, Werner T, Schneider R: Intergrated functional and bioinformatics approach for the identification and experimental verification of RNA signals: application to HIV-1 INS. Nucleic Acids Research 2003, 31: 2839-2851. 10.1093/nar/gkg390

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  27. Anisimova M, Nielsen R, Yang Z: Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics 2003, 164: 1229-1236.

    PubMed Central  CAS  PubMed  Google Scholar 

  28. Shriner D, Nickle DC, Jensen MA, Mullins JI: Potential impact of recombination on sitewise approaches for detecting positive natural selection. Genet Res 2003, 81: 115-121. 10.1017/S0016672303006128

    CAS  Article  PubMed  Google Scholar 

  29. Leitner T, Foley B, Hahn B, Marx P, McCutchan F, Mellors J, Wolinsky S, Korber B: HIV Sequence Compendium. Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, NM, LA-UR 2005, 06-0680.

    Google Scholar 

  30. Kosakovsky Pond SL, Frost SD, Muse SV: HyPhy: hypothesis testing using phylogenies. Bioinformatics 2005, 21: 676-679. 10.1093/bioinformatics/bti079

    Article  Google Scholar 

  31. Kosakovsky Pond SL, Posada D, Gravenor MB, Woelk CH, Frost S: Automated Phylogenetic Detection of Recombination Using a Genetic Algorithm. Mol Biol Evol 2006, 23: 1891-1901. 10.1093/molbev/msl051

    Article  PubMed  Google Scholar 

  32. Scheffler K, Martin DP, Seoighe C: Robust inference of positive selection from recombining coding sequences. Bioinformatics 2006, 22: 2493-2499. 10.1093/bioinformatics/btl427

    CAS  Article  PubMed  Google Scholar 

  33. Ngandu NG, Bredell H, Gray CM, Williamson C, Seoighe C: CTL response to HIV type 1 subtype C is poorly predicted by known epitope motifs. AIDS Res Hum Retroviruses 2007, 23: 1033-1041. 10.1089/aid.2007.0024

    CAS  Article  PubMed  Google Scholar 

  34. Yang Z, Wong WS, Nielsen R: Bayes empirical bayes inference of amino acid sites under positive selection. Mol Biol Evol 2005, 22: 1107-1118. 10.1093/molbev/msi097

    CAS  Article  PubMed  Google Scholar 

  35. Moore PL, Gray ES, Choge IA, Ranchobe N, Mlisana K, Abdool Karim SS, Williamson C, Morris L: The c3-v4 region is a major target of autologous neutralizing antibodies in human immunodeficiency virus type 1 subtype C infection. J Virol 2008, 82: 1860-1869. 10.1128/JVI.02187-07

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  36. Clever J, Sassetti C, Parslow TG: RNA secondary structure and binding sites for gag gene products in the 5' packaging signal of human immunodeficiency virus type 1. J Virol 1995, 69: 2101-2109.

    PubMed Central  CAS  PubMed  Google Scholar 

  37. Huthoff H, Das AT, Vink M, Klaver B, Zorgdrager F, Cornelissen M, Berkhout B: A human immunodeficiency virus type 1-infected individual with low viral load harbors a virus variant that exhibits an in vitro RNA dimerization defect. J Virol 2004, 78: 4907-4913. 10.1128/JVI.78.9.4907-4913.2004

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  38. Exline CM, Feng Z, Stoltzfus CM: Negative and positive mRNA splicing elements act competitively to regulate human immunodeficiency virus type 1 vif gene expression. J Virol 2008, 82: 3921-3931. 10.1128/JVI.01558-07

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  39. Schwartz S, Felber BK, Benko DM, Fenyo EM, Pavlakis GN: Cloning and functional analysis of multiply spliced mRNA species of human immunodeficiency virus type 1. J Virol 1990, 64: 2519-2529.

    PubMed Central  CAS  PubMed  Google Scholar 

  40. Madsen JM, Stoltzfus CM: A suboptimal 5' splice site downstream of HIV-1 splice site A1 is required for unspliced viral mRNA accumulation and efficient virus replication. Retrovirology 2006, 3: 10. 10.1186/1742-4690-3-10

    PubMed Central  Article  PubMed  Google Scholar 

  41. Kammler S, Otte M, Hauber I, Kjems J, Hauber J, Schaal H: The strength of the HIV-1 3' splice sites affects Rev function. Retrovirology 2006, 3: 89. 10.1186/1742-4690-3-89

    PubMed Central  Article  PubMed  Google Scholar 

  42. Krummheuer J, Lenz C, Kammler S, Scheid A, Schaal H: Influence of the small leader exons 2 and 3 on human immunodeficiency virus type 1 gene expression. Virology 2001, 286: 276-289. 10.1006/viro.2001.0974

    CAS  Article  PubMed  Google Scholar 

  43. Jacquenet S, Mereau A, Bilodeau PS, Damier L, Stoltzfus CM, Branlant C: A second exon splicing silencer within human immunodeficiency virus type 1 tat exon 2 represses splicing of Tat mRNA and binds protein hnRNP H. J Biol Chem 2001, 276: 40464-40475. 10.1074/jbc.M104070200

    CAS  Article  PubMed  Google Scholar 

  44. Jacquenet S, Ropers D, Bilodeau PS, Damier L, Mougin A, Stoltzfus CM, Branlant C: Conserved stem-loop structures in the HIV-1 RNA region containing the A3 3' splice site and its cis-regulatory element: possible involvement in RNA splicing. Nucleic Acids Res 2001, 29: 464-478. 10.1093/nar/29.2.464

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  45. Hung LW, Holbrook EL, Holbrook SR: The crystal structure of the Rev binding element of HIV-1 reveals novel base pairing and conformational variability. Proc Natl Acad Sci USA 2000, 97: 5107-5112. 10.1073/pnas.090588197

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  46. Schwartz S, Felber BK, Fenyo EM, Pavlakis GN: Env and Vpu proteins of human immunodeficiency virus type 1 are produced from multiple bicistronic mRNAs. J Virol 1990, 64: 5448-5456.

    PubMed Central  CAS  PubMed  Google Scholar 

  47. Swanson AK, Stoltzfus CM: Overlapping cis sites used for splicing of HIV-1 env/nef and rev mRNAs. J Biol Chem 1998, 273: 34551-34557. 10.1074/jbc.273.51.34551

    CAS  Article  PubMed  Google Scholar 

  48. Miles LR, Agresta BE, Khan MB, Tang S, Levin JG, Powell MD: Effect of polypurine tract (PPT) mutations on human immunodeficiency virus type 1 replication: a virus with a completely randomized PPT retains low infectivity. J Virol 2005, 79: 6859-6867. 10.1128/JVI.79.11.6859-6867.2005

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  49. Luo GX, Sharmeen L, Taylor J: Specificities involved in the initiation of retroviral plus-strand DNA. J Virol 1990, 64: 592-597.

    PubMed Central  CAS  PubMed  Google Scholar 

  50. Rausch JW, Le Grice SF: Purine analog substitution of the HIV-1 polypurine tract primer defines regions controlling initiation of plus-strand DNA synthesis. Nucleic Acids Res 2007, 35: 256-268. 10.1093/nar/gkl909

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  51. Powell MD, Levin JG: Sequence and structural determinants required for priming of plus-strand DNA synthesis by the human immunodeficiency virus type 1 polypurine tract. J Virol 1996, 70: 5288-5296.

    PubMed Central  CAS  PubMed  Google Scholar 

  52. Pullen KA, Rattray AJ, Champoux JJ: The sequence features important for plus strand priming by human immunodeficiency virus type 1 reverse transcriptase. J Biol Chem 1993, 268: 6221-6227.

    CAS  PubMed  Google Scholar 

  53. Rausch JW, Le Grice SF: 'Binding, bending and bonding': polypurine tract-primed initiation of plus-strand DNA synthesis in human immunodeficiency virus. Int J Biochem Cell Biol 2004, 36: 1752-1766. 10.1016/j.biocel.2004.02.016

    CAS  Article  PubMed  Google Scholar 

  54. Masuda T, Kuroda MJ, Harada S: Specific and independent recognition of U3 and U5 att sites by human immunodeficiency virus type 1 integrase in vivo. J Virol 1998, 72: 8396-8402.

    PubMed Central  CAS  PubMed  Google Scholar 

  55. Brown HE, Chen H, Engelman A: Structure-based mutagenesis of the human immunodeficiency virus type 1 DNA attachment site: effects on integration and cDNA synthesis. J Virol 1999, 73: 9011-9020.

    PubMed Central  CAS  PubMed  Google Scholar 

  56. Rose PP, Korber BT: Detecting hypermutations in viral sequences with an emphasis on G --> A hypermutation. Bioinformatics 2000, 16: 400-401. 10.1093/bioinformatics/16.4.400

    CAS  Article  PubMed  Google Scholar 

  57. Hanada K, Suzuki Y, Gojobori T: A large variation in the rates of synonymous substitution for RNA viruses and its relationship to a diversity of viral infection and transmission modes. Mol Biol Evol 2004, 21: 1074-1080. 10.1093/molbev/msh109

    CAS  Article  PubMed  Google Scholar 

  58. Paillart JC, Skripkin E, Ehresmann B, Ehresmann C, Marquet R: In vitro evidence for a long range pseudoknot in the 5'-untranslated and matrix coding regions of HIV-1 genomic RNA. J Biol Chem 2002, 277: 5995-6004. 10.1074/jbc.M108972200

    CAS  Article  PubMed  Google Scholar 

  59. Kel AE, Gossling E, Reuter I, Cheremushkin E, Kel-Margoulis OV, Wingender E: MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 2003, 31: 3576-3579. 10.1093/nar/gkg585

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  60. Matys V, Fricke E, Geffers R, Gossling E, Haubrock M, Hehl R, Hornischer K, Karas D, Kel AE, Kel-Margoulis OV, Kloos DU, Land S, Lewicki-Potapov B, Michael H, Munch R, Reuter I, Rotert S, Saxel H, Scheer M, Thiele S, Wingender E: TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res 2003, 31: 374-378. 10.1093/nar/gkg108

    PubMed Central  CAS  Article  PubMed  Google Scholar 

Download references


This study was funded by the South African National Bioinformatics Network. NKN was supported by a training grant under the Stanford-South Africa Biomedical Informatics Training Program which is supported by the Fogarty International Center, part of the National Institutes of Health (grant no. 5D43 TW006993).

Author information

Authors and Affiliations


Corresponding author

Correspondence to Cathal Seoighe.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

NKN carried out the analysis, interpreted the results and drafted the manuscript. KS provided HyPhy scripts, contributed to the methodology and participated in drafting the manuscript. PM carried out the biochemical fitness assays and participated in writing the manuscript. ZW and DM participated in validating the results and editing the manuscript. CS conceived and supervised the study and participated in writing the manuscript.

Electronic supplementary material


Additional file 1: G-A mutations in a variable region in the nef gene. Mutations observed in reference sequences in comparison to Group M ancestral sequence identified using the hypermut tool available in the Los Alamos database. The highly variable region (labeled "G-A" in Figure 3d) showed G-A mutations and is boxed in red. (PDF 803 KB)


Additional file 2: Functional analysis of a novel region in the env gene. (a) Production of p24 from transfected wildtype p81 and 3 mutants produced from synonymous mutations introduced in the previously uncharacterized conserved region in env. (b) Comparison of infectivity between wildtype and the mutants. (PDF 37 KB)


Additional file 3: Evidence of overlap between high omega at a codon and low dS at the synonymous sites. Positively selected sites at which a significantly low dS was observed at the synonymous sites. Positively selected sites are shown in blue vertical lines and sites with low dS are shaded in light blue. (PDF 87 KB)


Additional file 4: Highly conserved regions observed at the subtype-level. dS across subtypes B and C gag and env genes showing more conserved sites at the subtype sequence level within the INS regions in gag and RRE in env. (PDF 372 KB)

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article

Cite this article

Ngandu, N.K., Scheffler, K., Moore, P. et al. Extensive purifying selection acting on synonymous sites in HIV-1 Group M sequences. Virol J 5, 160 (2008).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Long Terminal Repeat
  • Synonymous Substitution
  • Nonsynonymous Substitution
  • Synonymous Site
  • Recombination Breakpoint