Comparison of synonymous codon usage between the two species within Cardiovirus

Background Cardioviruses are positive-strand RNA viruses in the Picornaviridae family that can cause enteric infection in rodents and have also been detected at lower frequencies in other mammals such as pigs and human beings. The Cardiovirus genus consists of two distinct species: Encephalomyocarditis virus (EMCV) and Theilovirus (ThV). There are many differences between the two species. In this study, the codon usage of EMCV and ThV was compared. Results The mean ENC values of EMCV and ThV are 54.86 and 51.08 respectively, both higher than 40. There are correlations between (C+G)12% and (C+G)3% for both EMCV and ThV (r = -0.736 and r = 0.986 respectively, P < 0.01). For ThV, (C+G)12% and (C+G)3% were each significantly correlated with axis f'1 and axis f'2, but this was not the case for EMCV. According to the RSCU values, the EMCV species tends to prefer U-, G- and C-ending codons, while the ThV species tends to use U- and A-ending codons. However, in both species AGA for Arg, AUU for Ile, UCU for Ser, and GGA for Gly were chosen preferentially. Correspondence analysis detected one major trend in the first axis (f'1), which accounted for 22.89% of the total variation, and another major trend in the second axis (f'2), which accounted for 17.64% of the total variation. The plots of strains of the same serotype fell in the same region of the coordinate plane. Conclusion The overall extent of codon usage bias in both EMCV and ThV is low. Mutational pressure is the main factor that determines the codon usage bias, but the (C+G) content plays a more important role in codon usage bias for ThV than for EMCV. The synonymous codon usage pattern in both EMCV and ThV genes is gene function and geography specific, but not host specific. Serotype may be one factor that affects the codon bias of ThV, and location has no significant effect on the variation of synonymous codon usage in these virus genes.


Background
Synonymous codon usage is biased, and the bias appears to differ between organisms [1,2]. Many factors are considered to contribute to this bias, such as the degree and timing of gene expression, codon-anticodon interactions, transcription and translation rate and fidelity, codon context, and global and local (C+G) content [3,4]. Understanding the extent and causes of biases in codon usage is essential to the understanding of viral evolution, particularly the interplay between viruses and the immune response [5]. More recent studies have revealed that patterns of codon usage bias and nucleotide composition within many cellular genomes are far more complex than previously imagined, and the factors shaping their evolution are still not entirely understood. In general, natural selection and/or mutation pressure for accurate and efficient translation in various organisms are the main reasons for this bias. In addition, compared with natural selection, mutation pressure plays an important role in the synonymous codon usage pattern of some RNA viruses [6][7][8][9][10].
Nevertheless, little information is available about the codon usage pattern of Cardiovirus genomes, including the relative synonymous codon usage (RSCU) and codon usage bias (CUB), in the course of their evolution. In this study, the key genetic determinants of codon usage in the Cardiovirus genus were examined.

The characteristics of synonymous codon usage in EMCV and ThV
In order to investigate the extent of codon usage bias in Cardiovirus, all RSCU values of the different codons in 39 Cardiovirus strains were calculated. As shown in Table 1, the EMCV strains tend to use U-, G- and C-ending codons, while the ThV strains tend to use U- and A-ending codons. The mean ENC (effective number of codons) values (Table 2) are 54.86 among EMCV strains and 51.08 among ThV strains, both higher than 40.

Compositional properties of coding sequences of both EMCV and ThV
As shown in Table 3, (C+G)% has a highly significant correlation with each of A3%, C3%, G3% and U3%. (C+G)3% has a highly significant correlation with each of A%, U%, C% and G% among the ThV strains but not among the EMCV strains. This indicates that (C+G)% and (C+G)3% may reflect more important characteristics of the codon usage pattern of ThV than of EMCV. The C+G content at the first and second codon positions ((C+G)12%) was then compared with that at synonymous third codon positions ((C+G)3%) for EMCV and ThV separately. A highly significant positive correlation is observed in ThV (r = 0.986, P < 0.01) (Figure 1A, Table 4). For EMCV, however, a highly significant negative correlation is observed (r = -0.736, P < 0.01) (Figure 1B, Table 4). The (C+G)12% and (C+G)3% of both EMCV and ThV were then compared with axis f'1 and axis f'2 respectively. The results (Figure 2) show significant correlations for ThV but not for EMCV. All these results imply that the codon bias of Cardiovirus, especially of ThV, can be explained mainly by an uneven base composition, in other words by mutation pressure rather than natural selection, and that the (C+G) content has a more significant effect for ThV than for EMCV.

Correspondence analysis (COA) for all the strains
To investigate the major trend in codon usage variation among Cardiovirus, COA was used for all 39 Cardiovirus complete coding regions selected for this study.
COA detected one major trend in the first axis (f'1), which accounts for 22.89% of the total variation, and another major trend in the second axis (f'2), which accounts for 17.64% of the total variation. The coordinate of the complete coding region of each gene is plotted in Figure 3 in the plane defined by the first and second principal axes. It is clear that the f'1 values of all EMCV strains are positive while those of the ThV strains are negative. The strains of the same serotype are plotted in the same region. Furthermore, the EMCV strains tend to cluster tightly while the different serotypes of ThV are dispersed. These findings imply that different serotypes may have different codon usage patterns. Interestingly, EMC-30 is plotted slightly apart from the other EMCV strains, but this does not indicate that location dramatically influences the codon usage pattern.

Qualitative evaluation of codon usage bias in EMCV and ThV
There was a seemingly random variation in RSCU between amino acids and gene groups. In each species, several synonymous codons showed a strong discrepancy in usage. For EMCV, in detail, the preferred codons were AGA for Arg, GGA for Gly, CAU for His, AUU for Ile, CCA for Pro, UCU for Ser and GUG for Val. There are some differences in the global pattern of codon usage between EMCV and ThV. However, in both species AGA for Arg, AUU for Ile, UCU for Ser, and GGA for Gly were chosen preferentially (Figure 4).

Discussion
Studies of synonymous codon usage in viruses can reveal much about viral genomes. In this study, we used RSCU, ENC, COA and GC3s to measure the synonymous codon usage bias in order to compare EMCV and ThV, the two species within Cardiovirus. The synonymous codon usage bias in the coding regions of both EMCV and ThV is low, as the mean ENC values are 54.86 and 51.08 respectively (higher than 40). This is in agreement with previous reports on some other RNA viruses, for example BVDV (mean ENC = 51.42), H5N1 (mean ENC = 50.91) and SARS-CoV (mean ENC = 48.99) [6,7,21]. A low codon usage bias may be advantageous for efficient replication in vertebrate host cells with potentially distinct codon preferences. However, there is a marked variation in codon usage pattern among the different ThV genes (S.D. = 6.41) compared with the EMCV genes (S.D. = 0.36). One explanation for this phenomenon is that ThV probably has four serotypes while EMCV has just one, and the serotype might affect the codon choice. A general mutational pressure, which affects the whole genome, would certainly account for the majority of the codon usage variation. In this study, the general association between codon usage bias and base composition suggests that mutational pressure, rather than natural selection, is the main factor. This is supported by the highly significant correlation between (C+G)12% and (C+G)3% (r = -0.736 for EMCV; r = 0.986 for ThV, P < 0.01), since the effects are present at all codon positions. The (C+G) content was another factor found to be strongly correlated with codon usage bias. The results indicate that the (C+G) content plays an important role in codon usage bias for ThV (Table 3), but not for EMCV.
The situation is more complex for EMCV, and more research on this species, for example on its nucleotide composition and gene structure, is needed to find the main factor behind its codon bias. Nevertheless, we still consider that mutational pressure, rather than natural selection, is one of the main factors responsible for the variation of synonymous codon usage among ORF coding sequences in the Cardiovirus genus.
Table 3 Summary of correlation analysis between the A, U, C, G contents and A3, U3, C3, G3 contents in all selected samples.
Generally, previous reports indicate that many viruses, including foot-and-mouth disease virus, influenza A virus subtype H5N1, severe acute respiratory syndrome coronavirus (SARS-CoV) and human bocavirus, preferentially use C- and G-ending codons [2,7,9,10]. In this study we found that the EMCV strains tended to use U-, G- and C-ending codons, while the ThV strains tended to use U- and A-ending codons. There was also a seemingly random variation in RSCU between amino acids and gene groups. This may be because using codons with different endings could be an advantage for both EMCV and ThV in replicating efficiently in host cells with potentially distinct codon preferences.
Serotype may be one factor in the codon bias of Cardiovirus, as Figure 3 shows. There was no evidence that location is a factor in codon bias, because EMC-30, which was isolated in the USA, was plotted slightly apart from the other EMCV strains that were also isolated in the USA.
Table 4 Analysis of correlation between the first two principal axes and nucleotide contents in samples.

Conclusion
The overall extent of codon usage bias in both EMCV and ThV is low (mean ENC = 54.86 and 51.08 respectively, both higher than 40). Mutational pressure, rather than natural selection, is the main factor that determines the codon usage bias. This is supported by the highly significant correlation between (C+G)12% and (C+G)3% (r = -0.736 for EMCV; r = 0.986 for ThV, P < 0.01), although the (C+G) content plays a more important role in codon usage bias for ThV than for EMCV. The synonymous codon usage pattern in both EMCV and ThV genes is gene function and geography specific, but not host specific. Serotype may be one factor that affects the codon bias of ThV, and location has no significant effect on the variation of synonymous codon usage in these virus genes.

Sequences
A total of 39 Cardiovirus genomes were used in this study, including 18 EMCV genomes and 21 ThV genomes. The coding sequences (CDS) of these viruses were randomly selected from NCBI GenBank (http://www.ncbi.nlm.nih.gov/Genbank/) in December 2010. The serial number (SN), GenBank accession number, genotype and other details are listed in Table 5.

Measures of relative synonymous codon usage
Relative synonymous codon usage (RSCU) values of each codon in each ORF were used to measure the synonymous codon usage. RSCU values are largely independent of amino acid composition and are particularly useful for comparing codon usage between genes, or sets of genes, that differ in size and amino acid composition [22]. The RSCU value of the ith codon for the jth amino acid was calculated as:
$$\mathrm{RSCU} = \frac{g_{ij}}{\sum_{j}^{n_i} g_{ij}} \, n_i$$
where $g_{ij}$ is the observed number of the ith codon for the jth amino acid, which has $n_i$ types of synonymous codons. An RSCU value close to 1.0 means that the codon is used about as often as expected if all synonymous codons were chosen equally and randomly. The RSCU values were obtained with the CodonW program.
The effective number of codons (ENC) was calculated to quantify the codon usage bias of an ORF [23]; it has been reported to be the best estimator of absolute synonymous codon usage bias [24]. The greater the extent of codon preference in a gene, the smaller the ENC value. In an extremely biased gene where only one codon is used for each amino acid, this value would be 20; if all codons are used equally, it would be 61; and if the ENC value is greater than 40, the codon usage bias is regarded as low [25]. The ENC values were obtained with the CodonW program.
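The two indices described above can be illustrated with a minimal Python sketch. It is not the CodonW implementation used in this study: the function names (`count_codons`, `rscu`, `enc`) are hypothetical, the standard genetic code is assumed, and the ENC calculation assumes the commonly used homozygosity-based formulation (2 + 9/F2 + 1/F3 + 5/F4 + 3/F6), with the simplifying assumption that amino acids giving a non-positive F value are skipped.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Synonymous families: 18 amino acids, 59 codons (Met, Trp and stops excluded).
FAMILIES = {}
for codon, aa in CODON_TABLE.items():
    if aa not in "*MW":
        FAMILIES.setdefault(aa, []).append(codon)


def count_codons(cds):
    """Count in-frame codons of a coding sequence (RNA or DNA alphabet)."""
    cds = cds.upper().replace("U", "T")
    codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def rscu(cds):
    """RSCU = observed count * family size / total count of the family."""
    counts = count_codons(cds)
    values = {}
    for family in FAMILIES.values():
        total = sum(counts[c] for c in family)
        for c in family:
            # Undefined (NaN) when the amino acid does not occur at all.
            values[c] = counts[c] * len(family) / total if total else float("nan")
    return values


def enc(cds):
    """Effective number of codons; NaN if a degeneracy class has no data."""
    counts = count_codons(cds)
    f_by_class = {2: [], 3: [], 4: [], 6: []}
    for family in FAMILIES.values():
        n = sum(counts[c] for c in family)
        if n > 1:
            sq = sum((counts[c] / n) ** 2 for c in family)
            f = (n * sq - 1) / (n - 1)  # codon homozygosity of this amino acid
            if f > 0:                   # assumption: skip non-positive values
                f_by_class[len(family)].append(f)
    mean_f = {k: sum(v) / len(v) for k, v in f_by_class.items() if v}
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2  # fallback when Ile is absent
    if len(mean_f) < 4:
        return float("nan")
    value = 2 + 9 / mean_f[2] + 1 / mean_f[3] + 5 / mean_f[4] + 3 / mean_f[6]
    return min(value, 61.0)  # ENC is capped at 61
```

For a toy sequence such as `GGAGGAGGAGGC`, `rscu` returns 3.0 for GGA and 1.0 for GGC, since Gly has four synonymous codons and GGA accounts for three of the four observed Gly codons. A sequence that uses a single codon for every amino acid gives an ENC of 20, the lower bound mentioned in the text.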

Composition analysis of coding region
In order to better understand the synonymous codon usage variation among different Cardiovirus isolates, the (C+G) content at the first and second codon positions [(C+G)12%] and that at the synonymous third position [(C+G)3%] were calculated with the CodonW program [26,27]. The values of the (C+G) content at the different positions were compared with the values of the other compositional contents.
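As an illustration of the two compositional measures, the following Python sketch computes (C+G)12% and (C+G)3% for a single coding sequence. It is a simplified stand-in for CodonW, and the function name `gc12_gc3s` is hypothetical. Two assumptions are made: stop codons are excluded from both measures, and the third-position value is taken only over codons that have synonymous alternatives (Met and Trp codons are excluded).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
NON_SYNONYMOUS = {"ATG", "TGG"}  # Met and Trp have no synonymous codons


def gc12_gc3s(cds):
    """Return ((C+G)12%, (C+G)3%) for one in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # Keep unambiguous sense codons only.
    sense = [c for c in codons if set(c) <= set("ACGT") and c not in STOP_CODONS]
    synonymous = [c for c in sense if c not in NON_SYNONYMOUS]
    if not sense or not synonymous:
        return float("nan"), float("nan")
    # G or C at the first and second positions of every sense codon.
    gc12 = sum(base in "GC" for c in sense for base in c[:2])
    # G or C at the third position of codons with synonymous alternatives.
    gc3s = sum(c[2] in "GC" for c in synonymous)
    return 100.0 * gc12 / (2 * len(sense)), 100.0 * gc3s / len(synonymous)
```

For the toy sequence `ATGGCCAAATAA`, the sense codons are ATG, GCC and AAA, so (C+G)12% is 2/6 (about 33.3%) and (C+G)3% is 1/2 (50%), because only GCC and AAA count at the synonymous third position.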

Correspondence analysis (COA)
Multivariate statistical analysis can be used to explore the relationships between variables and samples. In this study, correspondence analysis was used to investigate the major trend in codon usage variation among genes. The complete coding region of each gene was represented as a 59-dimensional vector, in which each dimension corresponds to the RSCU value of one sense codon (excluding the codons for Met and Trp and the termination codons) [28].
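The principle of the analysis can be sketched with NumPy as follows. This is a generic correspondence analysis by singular value decomposition, not the routine used in this study, and the function name `correspondence_analysis` is hypothetical. The input is assumed to be a non-negative matrix with one row per coding region and one column per codon; columns that sum to zero are dropped, which is an assumption of this sketch.

```python
import numpy as np


def correspondence_analysis(table):
    """Return row coordinates and the share of inertia explained by each axis."""
    X = np.asarray(table, dtype=float)
    X = X[:, X.sum(axis=0) > 0]           # drop codons that never occur
    P = X / X.sum()                        # correspondence matrix
    r = P.sum(axis=1)                      # row masses (genes)
    c = P.sum(axis=0)                      # column masses (codons)
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected) # standardized residuals
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    inertia = sv ** 2
    explained = inertia / inertia.sum()    # fraction of total variation per axis
    row_coords = (U * sv) / np.sqrt(r)[:, None]  # principal coordinates of genes
    return row_coords, explained
```

With a matrix of 39 rows and 59 columns, `explained[0]` and `explained[1]` would correspond to the fractions of total variation carried by the first and second axes, and the first two columns of `row_coords` would give the position of each gene in a plot of the two principal axes.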

Correlation analysis
Correlation analysis was used to identify the relationship between nucleotide composition and synonymous codon usage pattern [29]. This analysis was based on Spearman's rank correlation. All statistical analyses were carried out with the statistical software SPSS 11.5 for Windows.
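For readers without access to SPSS, Spearman's rank correlation coefficient can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names `average_ranks` and `spearman` are hypothetical, tied values receive their average rank, and no significance test (P value) is computed.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In the setting of this study, `x` and `y` would be, for example, the (C+G)12% and (C+G)3% values of the strains of one species. A coefficient near 1 indicates that the two quantities increase together, and a coefficient near -1 that one decreases as the other increases.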