Characterization of codon usage pattern in SARS-CoV-2

The outbreak of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), poses a significant threat to international health. The genetic traits and evolutionary processes of this novel coronavirus are not fully characterized, and their roles in viral pathogenesis remain largely unknown. To get a better picture of the codon architecture of this newly emerging coronavirus, we performed a bioinformatic analysis of publicly available nucleotide sequences of SARS-CoV-2, other human coronaviruses, and non-human coronaviruses from different hosts, taking a snapshot of the genome-wide codon usage pattern of SARS-CoV-2. We found that all over-represented codons end with A/U and that this newly emerging coronavirus has a relatively low codon usage bias, which is shaped by both mutation pressure and natural selection. Additionally, there is slight variation in the codon usage pattern among SARS-CoV-2 isolates from different geo-locations. Furthermore, the overall codon usage pattern of SARS-CoV-2 is generally similar to that of its phylogenetic relatives among non-human betacoronaviruses, such as RaTG13. Taken together, these analyses comprehensively characterize the codon usage pattern of SARS-CoV-2. The information from this research may not only offer new insights into the evolution of SARS-CoV-2 but also have potential value for developing coronavirus vaccines.


Phylogenetic analysis
A phylogenetic tree of the whole-genome sequences of coronaviruses was constructed using MEGA software version 6.0 (http://www.megasoftware.net) with the maximum likelihood algorithm, the Kimura 2-parameter model and 1000 bootstrap replicates.
RSCU analysis of the complete coding sequences of SARS-CoV-2 Wuhan-Hu-1 revealed that all the over-represented codons (RSCU value > 1.6) ended with A/U, whereas most of the under-represented codons (RSCU value < 0.6) ended with C/G (Supplementary Table 2). The highest RSCU value was that of AGA for arginine (R; 2.67) and the lowest was that of UCG for serine (S; 0.11). The heatmap analysis (Fig. 1b) further revealed that all human coronaviruses analyzed in this study shared the over-represented codons (UAA, GGU, GCU, UCU, GUU, CCU, ACU), with an average RSCU value > 2.0, whereas UCA was over-represented only in SARS-CoV-2 and SARS-CoV.
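The RSCU values discussed above can be reproduced in principle with a short script. The following is a minimal sketch, assuming the standard definition of RSCU (the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid) and the standard genetic code; the function names `rscu` and `over_represented` are illustrative and do not come from the study's own pipeline. Stop codons are treated as one synonymous family, which is an assumption made here because UAA appears among the over-represented codons.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for every codon whose amino acid occurs in cds."""
    seq = cds.upper().replace("T", "U")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group synonymous codons by the amino acid they encode.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from the sequence
        for c in family:
            # observed count / (family total / number of synonymous codons)
            values[c] = counts[c] * len(family) / total
    return values


def over_represented(values, threshold=1.6):
    """Codons above the over-representation threshold used in the text."""
    return sorted(c for c, v in values.items() if v > threshold)


toy = rscu("GCUGCUGCUGCA")   # toy sequence: four alanine codons
print(toy["GCU"], toy["GCA"])  # 3.0 1.0
```

An RSCU of 1.0 means the codon is used exactly as often as expected under uniform synonymous usage, which is why the thresholds of 1.6 and 0.6 mark over- and under-representation.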
The profiles of codon usage patterns among different genes of human coronaviruses were further analyzed (Figs. 1c and 2). For the spike (S) gene, all human coronaviruses analyzed in this study shared the over-represented codons (UCU, GCU, CUU, GUU, ACU), all of which ended with U, whereas two codons (CCA, ACA) were over-represented only in SARS-CoV-2. In addition, SARS-CoV-2 used neither CGA for arginine nor CCG for proline. For the envelope (E) gene, two codons (UAC, GCG) were over-represented only in SARS-CoV-2 and SARS-CoV. None of the human coronaviruses analyzed in this study used the two synonymous codons CGC and CGG for arginine, CCG for proline or UGA as a stop codon. SARS-CoV-2 and SARS-CoV used neither CAA for glutamine nor UAU for tyrosine, whereas they used GCG for alanine, AUC for isoleucine, and UCG and AGC for serine. For the membrane (M) gene, three codons (GUA, GAA, GGA) were over-represented only in SARS-CoV-2. For the nucleocapsid (N) gene, all human coronaviruses analyzed in this study shared the over-represented codons (GCU, ACU, CUU), all of which ended with U. The average RSCU values of GCU in the complete coding sequence, S gene, E gene, M gene and N gene of all human coronaviruses analyzed in this study were 2.22, 2.12, 1.79, 2.13 and 2.16, respectively. GCU for alanine was thus identified as a highly preferred codon among the human coronaviruses.
The genetic code is degenerate: every amino acid except methionine (Met, M) and tryptophan (Trp, W) is encoded by more than one synonymous codon, and the number of synonymous codons differs among amino acids. The overall amino acid usage of the human coronaviruses is shown in Supplementary Figure 2. Leucine and valine were the two most frequently used amino acids in all human coronaviruses analyzed in this study, with CUU and GUU as the preferred codons for leucine and valine, respectively (Fig. 3), whereas tryptophan, histidine and methionine were the three least used ones, which is consistent with a recent report [14].
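The degeneracy of the genetic code and the amino acid usage described above can be illustrated with a small script. This is a sketch under the assumption of the standard genetic code; `DEGENERACY` and `amino_acid_usage` are hypothetical names introduced here, not part of the study's workflow.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of synonymous codons per amino acid.
DEGENERACY = Counter(CODON_TABLE.values())


def amino_acid_usage(cds):
    """Return the relative frequency of each amino acid encoded by cds."""
    seq = cds.upper().replace("T", "U")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    residues = (CODON_TABLE[c] for c in codons if c in CODON_TABLE)
    counts = Counter(aa for aa in residues if aa != "*")  # ignore stop codons
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


print(DEGENERACY["M"], DEGENERACY["W"], DEGENERACY["L"])  # 1 1 6
print(amino_acid_usage("CUUCUUGUUAUG"))  # toy sequence: Leu, Leu, Val, Met
```

Methionine and tryptophan each have a single codon, so they carry no information about synonymous codon preference and are excluded from bias measures such as RSCU.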
To further estimate the degree of codon usage bias, the intrinsic codon bias index (ICDI), codon bias index (CBI) and effective number of codons (ENC) were calculated (Table 1). The ICDI value (0.144), CBI value (0.306) and ENC value (45.38) all indicated a relatively low codon usage bias.
We next attempted to determine the forces influencing the codon usage bias. Accumulating evidence suggests that codon usage bias is affected by many factors, and the two generally accepted major forces are mutation pressure and natural selection [16]. Other influential factors include gene expression level, gene length, GC content, GC content at the third codon position (GC3), RNA stability, hydrophilicity and hydrophobicity. When G or C occurs in a high or low proportion at the third position of the codon, mutational pressure is involved [17]. Supplementary Figure 1 clearly showed that both G3 and C3 were lower than A3 and U3, suggesting that mutational force contributes to the codon usage pattern. Moreover, all preferred codons were A/U-ending (Figs. 1b, c and 2), which further suggested that mutational force contributed to shaping codon usage in this virus. Furthermore, to better understand the relation between gene composition and codon usage bias, an ENC-GC3S plot (ENC plotted against G + C content at the third codon position) was constructed. When the codon usage pattern is affected only by GC3 resulting from mutation pressure, the expected ENC values should lie on the solid curved line. As shown in Supplementary Figure 3, all points clustered below the expected ENC curve, indicating that independent factors such as natural selection might also play a role in the codon usage bias of human coronaviruses.
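The expected ENC curve in an ENC-GC3S plot is commonly drawn from Wright's approximation, ENC = 2 + s + 29 / (s^2 + (1 - s)^2), where s is the GC3S value. The sketch below assumes that this is the curve used here, which the text does not state explicitly; the helper names `gc3s`, `expected_enc` and `below_curve` are illustrative. GC3S is computed over codons that have synonymous alternatives, so AUG, UGG and the stop codons are excluded.

```python
# Codons without synonymous alternatives (Met, Trp) and stop codons.
EXCLUDED = {"AUG", "UGG", "UAA", "UAG", "UGA"}


def gc3s(cds):
    """Fraction of G or C at the third position of synonymous codons."""
    seq = cds.upper().replace("T", "U")
    third = []
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in EXCLUDED or not set(codon) <= set("ACGU"):
            continue  # skip non-synonymous or ambiguous codons
        third.append(codon[2])
    if not third:
        raise ValueError("no synonymous codons found")
    return sum(base in "GC" for base in third) / len(third)


def expected_enc(s):
    """Expected ENC when codon usage is determined by GC3S alone."""
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


def below_curve(observed_enc, s):
    """True if a gene plots below the expected curve at GC3S = s."""
    return observed_enc < expected_enc(s)


print(gc3s("GCUGCCAUGUGG"))  # 0.5 (AUG and UGG are excluded)
print(expected_enc(0.5))     # 60.5
```

A point that falls on the curve is consistent with mutation pressure alone; a point below it, as reported for all genomes in Supplementary Figure 3, suggests that additional factors such as natural selection constrain codon choice.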
Apart from humans, many animal species can be infected by different types of coronaviruses. Previous studies suggest that some animals, such as bats, represent the original reservoir of several human-infecting coronaviruses [1]. To provide additional information on the evolution of SARS-CoV-2, we further compared the codon usage patterns of SARS-CoV-2 and non-human coronaviruses (Supplementary Table 1).
Phylogenetic analysis (Fig. 4a) showed that SARS-CoV-2 was most closely related to the recently reported Bat coronavirus RaTG13 [8]. Nucleotide composition analysis (Supplementary Figure 4) revealed that, similar to SARS-CoV-2 Wuhan-Hu-1, all the non-human coronaviruses analyzed in this study had the highest compositional value for U%, and nucleotide U occurred most frequently at the third position. The heatmap analysis (Fig. 4b) revealed that SARS-CoV-2 and all the non-human coronaviruses analyzed in this study shared the over-represented codons (GGU, UCU, CCU), all of which ended with U, and shared the under-represented codons (UCG, GGG, GCG, CCG, CGG, ACG, CGA), all of which ended with G except for CGA. The codon usage pattern of SARS-CoV-2 was generally highly similar to that of the betacoronaviruses, except for Bat coronavirus HKU4-1 and Bat coronavirus HKU5-1 (Fig. 4c, Supplementary Figures 5, 6, 7, 8). Moreover, the profiles of codon usage patterns among different genes of SARS-CoV-2 and non-human coronaviruses were further analyzed, as shown in Fig. 5 and Supplementary Figures 9, 10, 11, 12. We found similar codon usage patterns in SARS-CoV-2 and its phylogenetic relatives, such as RaTG13, Bat-SL-CoVZC45, Bat-SL-CoVZXC21, PCoV_GX-P1E and PCoV_GX-P4L, which may reflect the evolutionary relationship between SARS-CoV-2 and these non-human coronaviruses. These results are in accordance with the full-genome phylogenetic analysis (Fig. 4a). The overall amino acid usage of the non-human coronaviruses is shown in Supplementary Figure 13. Similar to SARS-CoV-2, leucine and valine were the two most frequently used amino acids in all non-human coronaviruses analyzed in this study, with CUU and GUU as the preferred codons for leucine and valine, respectively.
Furthermore, similar to SARS-CoV-2, all the non-human coronaviruses analyzed in this study exhibited relatively low codon usage bias according to the intrinsic codon bias index (ICDI), codon bias index (CBI) and effective number of codons (ENC) values, as shown in Supplementary Figure 14. Nucleotide composition analysis (Supplementary Figure 4) and the ENC-GC3S plot (Supplementary Figure 15) revealed that both mutational force and natural selection contribute to shaping codon usage in non-human coronaviruses.
Overall, in the present study we attempted to take a snapshot of the characteristics of the codon usage pattern in the novel coronavirus SARS-CoV-2. We found that all over-represented codons ended with A/U and that this novel coronavirus had a relatively low codon usage bias. Both mutation pressure and natural selection contributed to the bias. Additionally, the overall codon usage pattern of SARS-CoV-2 was generally similar to that of its phylogenetic relatives among non-human coronaviruses, such as RaTG13. Our findings are consistent with recent observations [11][12][13][14][15] and provide new insights into the characteristics of the codon usage pattern in coronaviruses. These results also have important implications for future work. Firstly, information on the genome-wide codon usage pattern of SARS-CoV-2 may offer new insights into the evolution of this newly emerging virus. As more SARS-CoV-2 genome data become available, the codon usage pattern of SARS-CoV-2 can be reevaluated more comprehensively to track evolutionary changes among isolates. In this regard, genome-wide codon usage patterns in 100 complete genome sequences of SARS-CoV-2 isolates from different geo-locations, including SARS-CoV-2 Wuhan-Hu-1, were analyzed herein. All information about the isolates can be found in Supplementary Table 3. The heatmap analysis (Supplementary Figure 16) revealed 12 preferred codons (including GGU, GCU, UAA, GUU, UCU, CCU and ACU) ending with A/U among all 100 isolates, and the average RSCU values of these over-represented codons ranged from 1.63 to 2.67 (Supplementary Figure 17). The highest RSCU value was that of AGA for arginine (R; 2.67) and the lowest was that of UCG for serine (S; 0.11). We noted that the overall codon usage pattern varied only slightly among the 100 tested SARS-CoV-2 isolates from different geo-locations, reflecting minimal evolutionary changes among them.
Additionally, in the comparison with other human coronaviruses and with non-human coronaviruses from different hosts, we found that the overall codon usage pattern of SARS-CoV-2 is generally similar to that of its phylogenetic relatives among non-human betacoronaviruses, such as RaTG13 (Fig. 5), which may reflect the evolutionary relationship between SARS-CoV-2 and these non-human coronaviruses.
Secondly, information on the genome-wide codon usage pattern of SARS-CoV-2 may have potential value for developing coronavirus vaccines to combat this pandemic disease. Knowledge of codon usage in SARS-CoV-2 may pave the way for strategies such as codon deoptimization [18][19][20], in which the least preferred codons are used to modify the SARS-CoV-2 genome and reduce virulence, for the development of a safe and effective vaccine. This strategy has several advantages. Deoptimized viruses could express an identical antigenic repertoire of T- and B-cell epitopes because they retain the intact wild-type amino acid sequence. Moreover, deoptimized viruses can replicate efficiently in vitro while being highly attenuated in vivo, which is important for vaccine production and safe implementation.
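The core idea of codon deoptimization, replacing each codon with the least preferred synonymous codon while leaving the encoded protein unchanged, can be sketched as follows. This is a deliberately simplified illustration and not the procedure of the cited studies: the function name `deoptimize` is hypothetical, and the usage values in `toy_usage` are made-up numbers for demonstration, not data from this work.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons enumerated in UCAG order ("*" marks a stop codon).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def deoptimize(cds, usage):
    """Swap each codon for the synonymous codon with the lowest usage value.

    usage maps codons to a usage measure such as RSCU. Codons whose
    synonymous family has no entry in usage are left unchanged.
    """
    seq = cds.upper().replace("T", "U")
    if len(seq) % 3:
        raise ValueError("sequence length must be a multiple of 3")
    recoded = []
    for i in range(0, len(seq), 3):
        codon = seq[i:i + 3]
        aa = CODON_TABLE[codon]  # KeyError on ambiguous bases
        family = [c for c, a in CODON_TABLE.items() if a == aa and c in usage]
        recoded.append(min(family, key=usage.get) if family else codon)
    return "".join(recoded)


# Made-up usage values for the alanine family, for illustration only.
toy_usage = {"GCU": 2.2, "GCC": 0.6, "GCA": 1.0, "GCG": 0.2}
print(deoptimize("AUGGCUGCA", toy_usage))  # AUGGCGGCG
```

Because only synonymous substitutions are made, the amino acid sequence, and hence the epitope repertoire mentioned above, is preserved by construction.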

Conclusions
Taken together, our results reveal that SARS-CoV-2 has a relatively low codon usage bias, which is shaped by both mutation pressure and natural selection. Additionally, there is slight variation in the codon usage pattern among SARS-CoV-2 isolates from different geo-locations. Furthermore, the overall codon usage pattern of SARS-CoV-2 is generally similar to that of its phylogenetic relatives among non-human betacoronaviruses, such as RaTG13. The information from this research may not only offer new insights into the evolution of human coronaviruses but also have potential value for developing coronavirus vaccines.