Different rates of (non-)synonymous mutations in astrovirus genes; correlation with gene function

Background: Complete genome sequences of the Astroviridae include human, non-human mammalian and avian species. A consensus topology of astroviruses has been derived from nucleotide substitutions in the full-length genomes and from non-synonymous nucleotide substitutions in each of the three ORFs. Analyses of synonymous substitutions displayed a loss of tree structure, suggesting either saturation of the substitution model or a deviant pattern of synonymous substitutions in certain virus species.
Results: We analyzed the complete Astroviridae family for the inference of adaptive molecular evolution at sites and in branches. High rates of synonymous mutations are observed among the non-human virus species. Deviant patterns of synonymous substitutions are found in the capsid structural genes. Purifying selection is a dominant force among all astrovirus genes and only a few codon sites showed values for the dN/dS ratio that may indicate site-specific molecular adaptation during virus evolution. One of these sites is the glycine residue of an RGD motif in ORF2 of human astrovirus serotype 1. RGD or similar integrin recognition motifs are present in nearly all astrovirus species.
Conclusion: Phylogenetic analysis directed by maximum likelihood approximation allows the inclusion of significantly more evolutionary history and thereby improves the estimation of dN and dS. Sites with enhanced values for dN/dS are more prominent at domains in charge of environmental communication (e.g. VP27 and domain 4 in ORF1a) than at domains dedicated to intrinsic virus functions (e.g. VP34 and ORF1b, the virus polymerase). Integrin recognition may play a key role in the attachment of astrovirus to its target cell.


Background
Human astrovirus has been recognized as the second most common cause of diarrhoea among children under 5 years old [1]. In animal and bird farms, an astrovirus infection is fatal for a considerable part of the livestock [2,3]. The family of Astroviridae is divided into two genera: Mamastrovirus (mammalian astroviruses) and Avastrovirus (avian astroviruses). The pathogen is a non-enveloped virion with a single-stranded, positive-sense RNA of approximately 6.8 kb in size [4]. The virus genome contains three open reading frames designated ORF1a (2.8 kb), ORF1b (1.6 kb) and ORF2 (2.4 kb). Translation of ORF1b depends on translation of ORF1a by a ribosomal frameshift mechanism [5]. The primary translation products are processed into the virus protease (ORF1a) and the virus polymerase (ORF1b). ORF2 encodes a structural protein that is processed intracellularly at its C-terminal part by caspases, which promotes genome packaging and virus particle release [6]. As part of the released virus particle, the ORF2 protein is processed further by trypsin to yield the mature capsid proteins VP34 and VP27/25, a process accompanied by a considerable increase of virus infectivity [7]. Consensus on the post-translational processing routes of the ORF1a, ORF1b and ORF2 polyproteins has not yet been attained [8,9].
An inventory of evolutionary relationships among astroviruses has been published, confirming the topology that is generally accepted for the astroviruses and pointing to a strong selection against non-synonymous substitutions [10]. Phylogenetic analyses based solely on the synonymous substitutions resulted in loss of tree structure due to the shortening of specific ancestral branches in trees of all ORFs. Avian, sheep, and human virus species appeared to be virtually equidistant in these trees. Such a loss of tree structure, or tree compression, may be interpreted as the result either of saturation of synonymous substitutions or of a peculiar pattern of synonymous substitutions that is typical for specific members of the astrovirus family. To address this issue, we analyze the astrovirus genes by means of nucleotide substitution models based on maximum likelihood approximation, because these models are better suited for the estimation of mutational rates in highly divergent genes than models relying on the Jukes-Cantor correction for multiple hits at the same site.
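The saturation problem mentioned above can be illustrated with the standard Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The sketch below is only an illustration of why the correction breaks down as p approaches 3/4; the function name and the example values of p are assumptions and are not taken from the astrovirus data set.

```python
import math


def jukes_cantor_distance(p):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - 4p/3).

    p is the observed proportion of differing sites. The correction is
    undefined for p >= 3/4, which is the saturation limit of the model.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75); the correction is saturated")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Illustrative values only: the corrected distance grows without bound
    # as the observed difference approaches 0.75.
    for p in (0.10, 0.50, 0.70, 0.74):
        print(f"p = {p:.2f}  ->  d = {jukes_cantor_distance(p):.3f}")
```

Small changes in p near the limit translate into very large changes in the corrected distance, which is one way in which deep branches can become unreliable under this correction.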
Phylogenetic analysis by maximum likelihood (PAML 3.14) [11] offers a set of sophisticated models to assess the extent of (non-)synonymous substitutions in genes. The PAML programs are thoroughly documented for the inference of sites and branches prone to molecular adaptation and are supported by validated statistics [12-14]. For instance, adaptive evolution is observed in the hemagglutinin gene of human influenza virus type A [15].
Applying PAML to the genes of the complete astrovirus family, we show that tree compression (shortening of ancestral branches) can be ascribed to deviant rates of synonymous mutation at discrete regions of ORF2 in certain astrovirus species. Tree compression is absent in ORF1a and ORF1b when analyzed by PAML, owing to its ability to include more evolutionary history. Sites that tend to escape from purifying selection correlate with protein domains dedicated to environmental communication rather than to replication and assembly of the virus. Finally, we propose that integrin recognition of ORF2 domains plays a key role in the process of cell binding by astrovirus.

(Non-) synonymous substitutions in astrovirus genes: branch models
The values for dN/dS of the branch model applied to the astrovirus ORFs are all far below 1 (Table 1), the value considered by convention as the lower limit for positive selection. In fact, dN/dS values do not exceed 0.16, except for ORF2 in the cat and human4 isolates (0.1904 and 0.2610, respectively). Purifying selection is clearly dominant in the branches specifying astrovirus species.
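A branch-model run of this kind is driven by a codeml control file. The sketch below renders one as text; the option names are standard codeml options, but every value and file name here is an assumption chosen for illustration, since the settings actually used are not stated in this section. In particular, `model = 1` (one dN/dS ratio per branch) is assumed to correspond to the branch model referred to above.

```python
def codeml_control(seqfile, treefile, outfile, model, nssites):
    """Render a codeml control file as text (all values are illustrative)."""
    options = [
        ("seqfile", seqfile),    # codon alignment
        ("treefile", treefile),  # user tree carrying the fixed topology
        ("outfile", outfile),
        ("noisy", 3),
        ("verbose", 1),
        ("runmode", 0),          # 0: evaluate the user-supplied tree
        ("seqtype", 1),          # 1: codon sequences
        ("CodonFreq", 2),        # 2: F3x4 codon frequencies
        ("model", model),        # 0: one ratio; 1: one ratio per branch
        ("NSsites", nssites),    # site classes, e.g. "0" or "0 1 2 7 8"
        ("icode", 0),            # 0: universal genetic code
        ("fix_kappa", 0),        # estimate kappa
        ("kappa", 2),            # initial kappa
        ("fix_omega", 0),        # estimate omega (dN/dS)
        ("omega", 0.4),          # initial omega
        ("cleandata", 0),        # keep columns with gaps or ambiguities
    ]
    width = max(len(key) for key, _ in options)
    return "\n".join(f"{key:>{width}} = {value}" for key, value in options) + "\n"


if __name__ == "__main__":
    # Hypothetical file names; a branch model run for one ORF.
    print(codeml_control("orf2.phy", "orf2.tree", "orf2_branch.out", 1, "0"))
```

The same helper could be reused for site models by passing `model=0` together with a list of site classes such as `"0 1 2 7 8"`.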
The individual values of dN and dS may be more easily interpreted in a tree format (Fig 1) than as raw data in a table. Unrooted dN and dS trees are part of PAML's output and have been constructed by pasting the dN and dS values from Table 1 as branch length onto the amino acid tree topology supplied to PAML as input tree file.
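Pasting branch lengths onto a fixed topology can be sketched as follows. This is a minimal illustration, not the procedure used to draw Fig 1: the nested-tuple tree representation, the taxon names and all branch lengths are placeholders and are not values from Table 1.

```python
def leaves(node):
    """Return the sorted tuple of leaf names below a node."""
    if isinstance(node, str):
        return (node,)
    return tuple(sorted(name for child in node for name in leaves(child)))


def to_newick(node, lengths, is_root=True):
    """Write a nested-tuple topology as Newick with the supplied branch lengths.

    lengths maps the sorted tuple of leaf names below each branch to the
    dN or dS value to be used as that branch's length.
    """
    if isinstance(node, str):
        label = node
    else:
        label = "(" + ",".join(to_newick(c, lengths, False) for c in node) + ")"
    if is_root:
        return label + ";"
    return f"{label}:{lengths[leaves(node)]}"


if __name__ == "__main__":
    # Placeholder topology and branch lengths (hypothetical values).
    topology = (("human", "cat"), "sheep", "turkey")
    branch_lengths = {
        ("human",): 0.1,
        ("cat",): 0.2,
        ("cat", "human"): 0.05,
        ("sheep",): 0.3,
        ("turkey",): 0.6,
    }
    print(to_newick(topology, branch_lengths))
```

Because the topology is fixed, the same function can be called once with dN values and once with dS values to obtain two trees that differ only in branch lengths.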
The dN trees of the three ORFs (Fig 1A) are in close agreement with the widely accepted astrovirus phylogeny [10,16-21]. The ORFs evolve independently, and mutational rates are lower in ORF1b than in ORF1a or ORF2, as indicated by the different scale bars. Also, the relative lengths of the individual branches mimic those in trees inferred by other means. For instance, the sequence of ORF2 in human astrovirus serotype 4 is known as the most distant among the human ORF2 sequences [19]. Apparently, the alignment attained by the multi-step MUSCLE procedure (see Materials & Methods) is sufficiently accurate and at least as good as more laborious ClustalW-based protocols, despite the sequence diversity typical for the astrovirus data set.
Synonymous rates of mutation in astrovirus genes differ from non-synonymous mutational rates by as much as one to two orders of magnitude, as indicated by the dN/dS ratio in Table 1 and the scale bars in Fig 1B. Nevertheless, dS trees of ORF1a and ORF1b based on synonymous substitutions are close to the corresponding trees derived from non-synonymous substitutions or amino acid replacements. Loss of tree structure [10], or tree compression caused by the virtual disappearance of ancestral branches, can hardly be observed, illustrating the power of the substitution model to incorporate multiple hits at the same site. In contrast, loss of tree structure and shortening of ancestral branches are observed in the dS tree of ORF2 and appear to be due to enhanced accumulation of synonymous substitutions in ORF2 sequences of the astroviruses of pig, sheep, mink and turkey compared to viruses in humans, cats and both avian nephritis viruses.
The ORF2 regions coding for the virus capsid proteins VP34 and VP27 may differ in the accumulation of synonymous substitutions. To test this, the ORF2 alignment was divided into two portions at the trypsin cleavage site. VP34 is slightly more conserved than VP27, but most ORF2 values are close to the average of the corresponding values for VP34 and VP27 (Table 2). Indeed, dN trees of VP34 and VP27 (not shown) are nicely deep-rooted and nearly identical to the dN tree of the complete ORF2 (Fig 1A). By contrast, dS values of sheep virus capsids increase from 36.5 in VP34 to 64.4 in VP27, while these values decrease from 53.3 to 12.3 in mink. In pig, the difference is even more dramatic: 60.6 in VP34 and only 4.8 in VP27. Both turkey isolates exchange their dS values of about 20 and 33 for VP34 and VP27, respectively. Trees with values for synonymous substitutions pasted as branch lengths onto the same topology show the species-specific differences in synonymous mutational rates between VP34 and VP27 more clearly (Fig 2). With respect to VP34, tree compression is due to high values of dS in pig, sheep, mink and turkey, similar to the results for ORF2. In the dS tree of VP27, however, the sheep virus and to a lesser extent both turkey viruses are responsible for tree compression. The viruses of pig and mink essentially conform to the human, feline and avian viruses. Apparently, the VP34 and VP27 domains tolerate considerable differences in synonymous mutational rates as estimated by means of PAML. Similar to ORF1a and ORF1b, the domains VP34 and VP27 of ORF2 display mutually independent patterns of molecular evolution.
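Dividing a codon alignment into two portions, as done above for VP34 and VP27, amounts to cutting every aligned sequence at the same codon column. The following is a minimal sketch under stated assumptions: the alignment is held as a dictionary of equal-length nucleotide strings, and the cleavage position is passed in as a hypothetical argument because its alignment coordinate is not given in this section.

```python
def split_codon_alignment(alignment, cleavage_codon):
    """Split a codon alignment into N-terminal and C-terminal parts.

    alignment maps sequence names to aligned nucleotide strings whose
    common length is a multiple of three. cleavage_codon is the number of
    codon columns assigned to the N-terminal part (a hypothetical value
    here; the real trypsin site coordinate must be supplied by the user).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    (length,) = lengths
    if length % 3 != 0:
        raise ValueError("alignment length must be a multiple of three")
    cut = 3 * cleavage_codon
    if not 0 < cut < length:
        raise ValueError("cleavage codon lies outside the alignment")
    n_part = {name: seq[:cut] for name, seq in alignment.items()}
    c_part = {name: seq[cut:] for name, seq in alignment.items()}
    return n_part, c_part


if __name__ == "__main__":
    # Toy alignment of four codons; the cut after codon 2 is arbitrary.
    toy = {"virusA": "ATGGCT---AAA", "virusB": "ATGGCAGGTAAG"}
    first, second = split_codon_alignment(toy, 2)
    print(first)
    print(second)
```

Cutting on codon boundaries keeps both portions in frame, so each can be analyzed with the same codon model and the same tree topology.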
Currently, it cannot be excluded that even PAML encounters saturation-related problems at high rates of mutation in a properly aligned set of sequences [22]. However, this may only be a partial explanation for the large variation of synonymous mutational rates in adjacent domains of specific astrovirus genes and genomes.

Site models: dN/dS values in relation to domain functions
Site models of CODEML allow for the estimation of dN/dS ratios at individual codon sites in an alignment of sequences. Because of the skewed astrovirus population (large distances between the few clades and overrepresentation of closely related human virus species), statistical support was taken only from the Bayes-Empirical-Bayes (BEB) output [14]. With respect to ORF2 of the human virus species, we confined the analysis to one virus representative per human serotype (except for H1). All three ORFs display sites that tend to escape from purifying selection (Table 3). Posterior mean values of dN/dS for BEB-selected sites are between 1 and 1.5 in all ORFs, with high standard errors and low posterior probabilities. Conventionally, a dN/dS value of 1 marks the transition from neutral evolution to weakly positive selection. BEB-selected sites (dN/dS ≥ 1) are not distributed randomly in ORF1a and ORF2 but appear to cluster. This is more easily observed when all sites are plotted as a string of local dN/dS values of the ORF's codons, superimposed on a map of virus functions embedded in the ORF as described in the literature and/or predicted by servers (Fig 3). In ORF1a (Fig 3A), clusters with relatively high dN/dS values are located around amino acid 620, at positions 775 and 777, and at 812-817. The 614-624 cluster maps near the N-terminus of the nsp1a-4 domain. The exact position of this N-terminus is still under debate, being either T568 or I655 [8,23].
A coiled coil, a nuclear localization signal and a death domain for the induction of cellular apoptosis are found in this region [24]. Also, this part of ORF1a displays enhanced amino acid variability and may be prone to O-glycosylation and phosphorylation. The BEB-selected sites between 775 and 814 mark the borders of the interspecies hypervariable region (760-838) that also harbors the peptide that is deleted by cell culture adaptation of the virus [25]. At the C-terminus, the ribosomal frameshift signal is decorated with a BEB-selected threonine residue next to a predicted motif for retention at the endoplasmic reticulum. As expected from the branch model analysis (see above), sites with dN/dS values less than 0.1 constitute a large majority in ORF1a.
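The clustering of selected sites along ORF1a can be made explicit by grouping positions that lie close together. The sketch below is an illustration only: the grouping rule and the `max_gap` window of 10 codons are assumptions, and the example positions are simply those quoted in the text for ORF1a, not a full list of BEB-selected sites.

```python
def cluster_sites(positions, max_gap=10):
    """Group codon positions into clusters.

    Two consecutive positions fall in the same cluster when they are at
    most max_gap codons apart (max_gap is an arbitrary, assumed window).
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


if __name__ == "__main__":
    # Positions quoted in the text for ORF1a (620, 775, 777 and 812-817).
    quoted = [620, 775, 777, 812, 813, 814, 815, 816, 817]
    for cluster in cluster_sites(quoted):
        print(f"{cluster[0]}-{cluster[-1]}: {len(cluster)} site(s)")
```

A wider or narrower window changes how the sites are grouped, so any such grouping should be read against the functional map rather than as a statistical test.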
The dN/dS distribution in ORF1b illustrates the power of purifying selection even more dramatically (Fig 3B). [...] posterior mean value of 1.384 for dN/dS, but with weak statistical support. The same holds for the leucine residue 29 (dN/dS = 1.068), between the frameshift stem-loop and the predicted furin-type cleavage site. All sites possibly prone to phosphorylation in ORF1b are subjected to strong purifying selection.
The dN/dS profile of the ORF2 polyprotein (Fig 3C) clearly shows the distinction between capsid functions dedicated to virus replication and assembly and those involved in environmental communication. The N-terminal VP34 protein is involved in the packaging of virus RNA. Replacement of the highly conserved threonine residue at position 227, for instance, abolishes the formation of virus particles due to loss of the ability to bind virus RNA [26]. Nearly all sites in mature VP34 are under strong purifying selection, indicating the involvement of VP34 in conserved virus functions. Mature VP25/27, encoded by the C-terminal part of ORF2, constitutes the virus' spike protein and carries the region of serotypic antibody recognition (580-606). Variation is beneficial for immune escape, and sequence homology decreases to levels that locally even hamper a proper alignment of the available data set. As a result, dN/dS values tend to increase towards neutral evolution. The limited data set prevents a statistical discrimination between sites with dN/dS values ≥ 1 due to weak but bona fide positive selection and those due merely to site-specific heterogeneity. However, clusters of sites with dN/dS values exceeding 1 can be observed. [...] Proteolysis by caspase of this region is the first step in the maturation process of the ORF2 polyprotein [6]. RNA packaging and cellular release cannot occur without cleavage by caspase, which in turn is probably activated by the death domain in ORF1a.

Cell attachment by integrin recognition
The glycine residue at position 573 in ORF2 is among the top three of all sites, with a dN/dS value of 1.441 ± 0.264 and a posterior probability of 0.901. Surprisingly, Gly573 is identified by PROSITE as the core residue of an RGD tripeptide, a recognition sequence for cellular attachment to an integrin-type cell surface receptor [27]. It seems a paradox that a site prone to adaptive molecular evolution also participates in an important replication function. However, integrin binding can also be fulfilled by similar motifs, for instance KGD, RHD, NGR and LDV, and may be enhanced by synergy sites like the tripeptide RNS [28]. RNS, KGD and RHD oligopeptides are absent from the collection of ORF2 sequences, but the integrin-binding motifs RGD, LDV and NGR are present in astrovirus ORF2 except for Turkey1 (Table 4). Most of these putative integrin-binding sequences are located in VP34 rather than in the spike protein VP25. The LDV tripeptide is located at position 183, and NGR is present (sometimes even in duplicate) at positions 17 and 51 of VP34 (H1Z25771 numbering).
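A scan for the tripeptides named above can be sketched as a plain substring search. This is not the PROSITE procedure used in the study; the function name is an assumption and the toy sequence is invented for illustration and is not an astrovirus ORF2 sequence.

```python
# Tripeptides named in the text: integrin-binding motifs and the RNS synergy site.
MOTIFS = ("RGD", "KGD", "RHD", "NGR", "LDV", "RNS")


def find_motifs(protein, motifs=MOTIFS):
    """Return {motif: [1-based start positions]} for motifs found in protein.

    Overlapping occurrences are reported; motifs that are absent are omitted.
    """
    hits = {}
    for motif in motifs:
        positions = []
        start = protein.find(motif)
        while start != -1:
            positions.append(start + 1)  # convert to 1-based numbering
            start = protein.find(motif, start + 1)
        if positions:
            hits[motif] = positions
    return hits


if __name__ == "__main__":
    # Invented toy sequence, used only to exercise the function.
    toy_sequence = "MANGRSSLDVTTRGDAA"
    for motif, positions in find_motifs(toy_sequence).items():
        print(motif, positions)
```

Reported positions refer to the first residue of each tripeptide, so the central glycine of an RGD hit lies one residue downstream of the reported start.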

Discussion
Previous research showed that avian astrovirus species displayed different topologies in trees based on either synonymous or non-synonymous substitutions, suggesting a deviant pattern of synonymous substitutions specifically in these species [10]. More specifically, tree compression due to shortening of ancestral branches caused loss of resolution among the non-human mammalian and the avian species in all ORFs, leaving solely the human serotypes properly resolved. Recently, we demonstrated a switch in the recent evolution of Astroviridae driving the synonymous codon usage in genes of specifically the non-human mammalian viruses towards the mean codon usage in genes of their hosts [21]. The present study employs phylogenetic analysis by maximum likelihood approximation to assess the extent of (non-)synonymous mutational rates at the expense of tree-building capacity. Fortunately, there is consensus in the literature on the phylogenetic topology of astrovirus species derived from amino acid replacements or non-synonymous nucleotide substitutions [10,16-21]. This provides the opportunity to decorate a tree carrying this consensus topology with the branch lengths estimated by PAML for synonymous or non-synonymous substitutions in astrovirus ORFs. By these means, we obtained standard-like trees without significant compression for astrovirus ORF1a and ORF1b, despite the large extent of synonymous substitutions in the non-human species. Apparently, substitution models subjected to maximum likelihood approximation tolerate considerably higher levels of mutational saturation than "classic" substitution models relying on the Jukes-Cantor correction for multiple hits at the same site, and hence allow the inclusion of significantly more evolutionary history during phylogenetic analysis. The result is an improved estimation of dN and dS.
With respect to astrovirus ORF2, the tree based on non-synonymous substitutions is very much standard-like, whereas the tree based on synonymous substitutions clearly suffers from compression due to the extended branch lengths of pig, sheep, mink and turkey astrovirus species. A bipartition of ORF2 into the two regions encoding the VP34 and VP27 capsid proteins shows that the branch extension of these species is observed in the VP34 domain, but confined to sheep and the turkeys in VP27.
In mink and particularly pig, enhanced rates of synonymous mutation are present in the VP34 domain, but absent in the VP27 domain of ORF2. At present, we cannot offer a proper explanation for this species- and domain-specific enhancement in the rates of synonymous mutation. The consistency of the dN trees with data in the literature [10,16-20] argues against an improper alignment of the astrovirus sequences. It is conceivable that the substitution model applied reaches its limit at a certain rate of mutation. However, VP34 carries the majority of elongated branch lengths, yet is slightly better conserved than its ORF2 counterpart VP27. The avian clade may offer a biological argument relevant to the problem. As shown above, turkey astroviruses consistently display branch length extension and avian nephritis viruses do not, indicating a possibly relevant difference between these species. Inspection of the sources of the sequences involved showed that the sequences of turkey species have been determined by RT-PCR of virus RNA extracted directly from stools and organs [18,29], whereas RNAs of avian nephritis virus have been prepared from cell-culture supernatants after three consecutive rounds of plaque purification before being subjected to RT-PCR amplification [30]. It has already been mentioned that adaptation of human astroviruses to grow in continuous cell lines induces a 45-nucleotide deletion near the 3'-end of ORF1a [25]. In coronavirus, mutations have been associated with isolation and passage in primate cell lines [31,32]. In conclusion, selection at the level of isolation and purification as well as mutation at the level of propagation may affect the difference between the rates of synonymous and non-synonymous substitution in astrovirus genes. Future research may address this issue.
The relationship between negative and positive selection is antagonistic. Random mutations that threaten intrinsic virus functionality are filtered out during cycles of virus replication and propagation and are subsequently removed from the virus population, leading to conservation of the sequences involved. Astrovirus protease, polymerase and VP34 are exemplary for this process of negative, or rather purifying, selection. Positive selection of substitutions, on the other hand, occurs in response to environmental changes and hence is also designated as molecular adaptation or adaptive molecular evolution. Sites prone to positive selection may therefore be expected at domains in charge of communicative functionality like host range and immune response. It is not surprising that sites belonging to the predicted serotypic epitope and to the putative RGD site for cell attachment in astrovirus ORF2, as well as to the two variable regions in astrovirus ORF1a, display dN/dS values indicative of weak positive selection. The two clusters with dN/dS values >1 at the borders of VP25 may mark adaptive responses to maintain the structure of the virus spike protein, allowing variability in the central part of VP25 that carries the serotypic epitope. The range of neutral evolution (0.3 < dN/dS < 1) is sparsely populated in astrovirus ORFs, indicating that during evolution astrovirus has reached equilibrium between purifying selection and molecular adaptation.
Our attempts to correlate positively selected sites with virus functions also resulted in the identification of an RGD recognition motif for cellular attachment present in ORF2. In all astrovirus species (except the Turkey1 isolate), an integrin recognition motif is found in ORF2. One of these (NGR) is located near the N-terminus. Bass and Qiu [7] have shown that the amino-terminal 70 amino acids of the astrovirus ORF2 polyprotein can be deleted without consequences for virus assembly. Remarkably, they demonstrated this property for human astrovirus serotype 1, the only species with all three integrin-binding motifs present in the sequence. Data on the process of cell entry by astrovirus have not been reported since 1992 [33]. Although experimental support is lacking, we propose that integrin recognition plays a key role in the attachment of astrovirus to its target cell.

Figure 4. Routing of astrovirus sequence data from GenBank to PAML input data files.