Mutation quantities among geographical areas
We first catalogued the mutations in order to estimate mutation incidence rates and to statistically identify essential mutations. A total of 6,394,483, 6,177,403, 5,841,477, and 895,738 E, M, N, and S sequences were studied, respectively.
According to the obtained data, 96.40% of the E amino acid (AA) sequences (AASs) exhibited no mutation (Fig. 2A). The corresponding values were 36.76%, 2.20%, and 2.11% for M, N, and S AASs, respectively. In the same order (E, M, N, S), 3.56%, 59.64%, 5.68%, and 26.86% of AASs had one mutation, and 0.02%, 2.80%, 7.11%, and 26.15% had two mutations.
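The per-protein percentages above come down to counting, for each sequence, how many positions differ from a reference and then tabulating those counts. A minimal sketch of that bookkeeping follows, assuming the sequences are already aligned to a reference of equal length. The function names and the 10-residue toy data are illustrative assumptions, not the pipeline used in this study.

```python
from collections import Counter


def count_mutations(reference: str, sequence: str) -> int:
    """Count positions where an aligned sequence differs from the reference."""
    if len(reference) != len(sequence):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for ref_aa, seq_aa in zip(reference, sequence) if ref_aa != seq_aa)


def mutation_count_distribution(reference, sequences):
    """Return {number of mutations: percentage of sequences with that number}."""
    counts = Counter(count_mutations(reference, seq) for seq in sequences)
    total = sum(counts.values())
    return {k: 100.0 * n / total for k, n in sorted(counts.items())}


# Toy data (hypothetical): four aligned 10-residue sequences.
reference = "MYSFVSEETG"
samples = ["MYSFVSEETG", "MYSFVSEEIG", "MYSFVSEETG", "MYSFVSEEIL"]

distribution = mutation_count_distribution(reference, samples)
print(distribution)
```

In this toy set, half of the sequences carry no mutation, one carries a single substitution and one carries two, so the three categories are reported as percentages of the four sequences.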
For the E protein, 77.72% of AASs from Africa and 95.72% of AASs from Asia displayed no mutation. In Europe, North America, South America and Oceania, the no-mutation rates were 96.10%, 97.44%, 97.30% and 96.86%, respectively. For the M protein, 49% of AASs from Africa exhibited no mutation (Fig. 2B), compared with 36.44%, 36.62%, 35.03%, 60.45% and 39.04% in Asia, Europe, North America, South America and Oceania, respectively. One M protein mutation was found in 47.35%, 61.19%, 60.27%, 60.36%, 37.87% and 56% of AASs from Africa, Asia, Europe, North America, South America and Oceania, respectively. Two M protein mutations were found in 1.43%, 1.9% and 2.13% of AASs from Africa, Asia and Europe, and in 4.01%, 1.33% and 2.35% of AASs from North America, South America and Oceania, respectively.
Furthermore, 4.53% of the African samples had no N protein mutation (Fig. 3A). Likewise, 2.98%, 1.80%, 2.36%, 1.14% and 4.31% of N AASs in Asia, Europe, North America, South America and Oceania, respectively, had no mutation. Africa displayed 19.35% of N AASs with one mutation; the one-mutation incidence rates in the other five areas were noticeably lower and, except for Oceania, similar to one another. The percentages of N AASs with two mutations were higher in Oceania (28.81%) and Africa (16.55%) than in Asia, Europe, North America and South America (7.07%, 4.8%, 9.53% and 4.95%, respectively). For the S protein, only 0.46% of AASs from South America displayed no mutation, and about 82% carried four or more mutations (Fig. 3B). Oceania had the highest no-mutation incidence rate for this protein, 8.45%. The one-mutation incidence rates among S AASs were 36.01%, 35.81%, 20.75%, 31.70%, 6.95% and 13.78% in Africa, Asia, Europe, North America, South America and Oceania, respectively. Four or more mutations were found in 81.96%, 35.92%, 24.07%, 17.14%, 16.99% and 2.52% of S AASs from South America, North America, Asia, Africa, Europe and Oceania, respectively. In Africa, Asia and Europe, AASs with one mutation were more prevalent than any other mutation-count category. In Oceania and the Americas, the most prevalent categories were two mutations and more than three mutations, respectively.
Next, we drew a heat map of mutations to examine their frequency overall and within each area. The regions with the most mutations relative to the total number of AASs were 7 to 14 AA (0.0018 frequency) in E, 66 to 88 AA (0.0279 frequency) in M, 164 to 205 AA (0.0294 frequency) in N and 508 to 635 AA (0.0079 frequency) in S (Figs. 4, 5). The regions with the second-highest mutation frequency were 56 to 63 AA (0.0006 frequency) in E, 1 to 22 AA (0.0010 frequency) in M, 205 to 246 AA (0.0201 frequency) in N and 1 to 127 AA (0.0048 frequency) in S.
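A regional frequency of this kind can be obtained by grouping mutation positions into windows along the protein and dividing each window's mutation count by the number of sequences. The sketch below assumes fixed-width windows and that simple normalization; the window width, the normalization and the toy positions are assumptions made for illustration, since the exact binning behind the heat maps is not restated here.

```python
from collections import Counter


def binned_mutation_frequency(mutation_positions, n_sequences, bin_width):
    """Group 1-based mutation positions into fixed-width windows.

    Returns {(window_start, window_end): mutations in window / n_sequences}.
    """
    bins = Counter((pos - 1) // bin_width for pos in mutation_positions)
    return {
        (b * bin_width + 1, (b + 1) * bin_width): n / n_sequences
        for b, n in sorted(bins.items())
    }


# Toy input (hypothetical): one entry per observed mutation across 100 sequences.
positions = [9, 9, 9, 21, 62, 71]
freqs = binned_mutation_frequency(positions, n_sequences=100, bin_width=10)

# The window with the highest frequency is the candidate "hot" region.
hottest = max(freqs, key=freqs.get)
print(hottest, freqs[hottest])
```

Choosing the window width trades resolution against noise: narrow windows isolate individual hot positions, whereas wide windows smooth over sparsely mutated stretches.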
The characteristics of mutations based on geographical areas
The locations of mutations in the protein structure and their frequencies were examined next to characterize the mutations further. As shown in Fig. 6A, T9I (0.0128) was the most frequent mutation in the E protein, followed by P71L (0.0068), V62F (0.0066), L21F/V (0.0017/0.0003) and V58F (0.0013). T9I was the most frequent mutation in Europe, Oceania, North America and South America, with frequencies of 0.0187, 0.0249, 0.0066 and 0.0049, respectively, whereas P71L was the most frequent mutation in Africa and Asia, with frequencies of 0.1643 and 0.0146, respectively. V62F was among the ten most frequent mutations in Asia (0.0118 frequency), Europe (0.0016 frequency) and North America (0.0024 frequency); in Africa (0.0011 frequency), Oceania (0.0004 frequency) and South America (0.0012 frequency) it ranked eighth, sixth and ninth, respectively.
Regarding the M protein, I82T (0.6015 frequency), D3G (0.0077 frequency), A63T (0.0073 frequency), Q19E (0.0072 frequency) and A2S (0.0033 frequency) were the five most frequent mutations, in that order (Fig. 6B). I82T was the most frequent mutation in all six areas. D3G, in contrast, was not the second most frequent mutation everywhere: in Asia and North America, F28L (0.020 frequency) and A81S (0.0083 frequency), respectively, held the second position. A63T was the third most frequent mutation in Africa and Europe, with frequencies of 0.0224 and 0.0091, respectively. The third most frequent mutations in Asia, Oceania, North America and South America were D3G (0.0048 frequency), Q19E (0.0259 frequency), S197N (0.0069 frequency) and R164H (0.004 frequency), respectively.
Analysis of the N AAS data showed that R203M/R203K, with frequencies of 0.6084/0.2489, ranked first among frequent mutations (Fig. 7A). Globally, D377Y (0.6134 frequency) ranked second, D63G (0.6002 frequency) third, G215C (0.5479 frequency) fourth and G204R/G204P (0.2352/0.0134 frequencies) fifth. In all continents except South America, the four most frequent mutations matched the global results, whereas the data from South America showed a different ordering. The frequency of R203M was higher than that of R203K in all continents except South America: the R203M/R203K frequencies were 0.4195/0.1965 in Africa, 0.6033/0.3052 in Asia, 0.6074/0.2826 in Europe, 0.6310/0.1776 in North America and 0.6143/0.3008 in Oceania, but 0.3570/0.5700 in South America. South America also differed from the other areas in its second and third most frequent mutations, which were G204R (0.5685 frequency) and P80R (0.4184 frequency), respectively.
For S AASs, D614G, with a worldwide frequency of 0.9756, ranked first among frequent mutations, followed by L18F (0.1680 frequency), A222V (0.1579 frequency), E484K (0.1454 frequency) and N501Y (0.1120 frequency) in second to fifth place (Fig. 7B). The most frequent mutation in S AASs was identical in all six geographical areas, although its frequency differed between them: the frequency of D614G in Africa, Asia, Europe, North America, South America and Oceania was 0.8884, 0.9579, 0.9743, 0.9835, 0.9959 and 0.9047, respectively. In Africa, P681R/P681H (0.0674/0.0396 frequencies) ranked second, Q677H/Q677K (0.0661/0.0163 frequencies) third, L452R (0.0728 frequency) fourth and S477N (0.0712 frequency) fifth. In Asia, E484K/E484Q (0.1112/0.0211 frequencies) ranked second, and P681R/P681H (0.1127/0.0150 frequencies), W152L (0.0897 frequency) and G769V (0.0903 frequency) ranked third to fifth. In Europe, A222V (0.4217 frequency) was the second most frequent mutation, and L18F (0.1702 frequency), S477N (0.0999 frequency) and S98F (0.0466 frequency) ranked third to fifth, respectively. In North America, E484K (0.1393 frequency) ranked second, L18F (0.1227 frequency) third, P681H/P681R (0.0849/0.0268 frequencies) fourth and L452R (0.1021 frequency) fifth, a different set of frequent mutations from that of South America. In South America, V1176F (0.8592 frequency), E484K (0.8268 frequency), N501Y (0.7745 frequency) and L18F (0.7742 frequency) were the next most frequent mutations, in that order. In Oceania, S477N (0.6864 frequency), P681R (0.0229 frequency), V1068F (0.0196 frequency) and N439K (0.0179 frequency) followed D614G, in that order.
Additional data are listed in Additional files 1, 2, 3 and 4. As a summary visualization, the locations of the three most frequent mutations are shown in Fig. 8.
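The per-area rankings reported in this subsection can be reproduced in principle by dividing, for each area, the number of sequences carrying a given mutation by the number of sequences from that area, and then sorting. The following is a minimal sketch under that assumption; the record layout, the function names and the toy records are hypothetical and are not taken from the study's data.

```python
from collections import Counter, defaultdict


def mutation_frequencies_by_area(records):
    """records: iterable of (area, set of mutation labels), one per sequence.

    Returns {area: {mutation: fraction of that area's sequences carrying it}}.
    """
    n_sequences = Counter()
    mutation_counts = defaultdict(Counter)
    for area, mutations in records:
        n_sequences[area] += 1
        mutation_counts[area].update(set(mutations))  # count each label once per sequence
    return {
        area: {m: c / n_sequences[area] for m, c in mutation_counts[area].items()}
        for area in n_sequences
    }


def top_mutations(frequencies, k):
    """Return the k most frequent (mutation, frequency) pairs, ties broken by name."""
    return sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy records (hypothetical), one tuple per sequence.
records = [
    ("Africa", {"P71L"}),
    ("Africa", {"P71L", "T9I"}),
    ("Africa", set()),
    ("Africa", {"T9I", "P71L"}),
    ("Europe", {"T9I"}),
    ("Europe", set()),
]

by_area = mutation_frequencies_by_area(records)
print(top_mutations(by_area["Africa"], 2))
```

Sequences without any mutation still count toward the denominator, so the frequencies are fractions of all sequences from an area rather than of mutated sequences only.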
Evolutionary trends of emergence and distribution of top ten mutations concerning the time and geographical areas
Identifying the trends of mutation emergence and spreading can lead to a more practical approach to help identify factors affecting drug and vaccine effectiveness. Figures 9 and 10 display the distribution pattern of the top ten mutations based on collection time, and Additional files 5, 6, 7 and 8 provide supplemental data.
The T9 mutation, the most frequent mutation in E AASs worldwide (0.0693 frequency rate), began to prevail in October 2021 and continued to do so until January 2022 (Fig. 9A). The prevalence of the P71 mutation increased in May 2020, decreased in August 2020 and increased again in September 2020; in March 2021, its frequency was 0.0257. Although the V62 mutation was present from the beginning of the pandemic, it increased from August and reached its maximum frequency (0.0058) in October 2021. In Africa, P71 emerged in August 2020, increased until April 2021, when it reached its highest frequency (0.3673), and then decreased until September 2021. The highest frequency of the P71 mutation in all other continents was almost identical to that in Africa. In Asia, the V62 mutation increased noticeably from August 2021 and reached its maximum frequency rate (0.0901) in October 2021. In South America, the V58 mutation frequency increased from the start of the pandemic to 0.0457 in May 2020; however, it declined from July 2020.
The most frequent mutation of M AASs worldwide, I82, had a notable frequency rate in January 2021 (0.1095). The second global peak of I82 prevalence started in May 2021 and reached the highest frequency rate (0.9969) in October 2021 (Fig. 9B). Except in South America, the evolutionary trends in the distribution of the I82 mutation followed almost identical patterns on all continents. Although I82 was detectable in South America from the start of the pandemic, it had a near-zero frequency rate before April 2021 and began dominating in July 2021. The Q19 mutation increased worldwide in October 2021, when the I82 mutation decreased.
The prevalence of mutations among N AASs fluctuated. The R203 mutation had a maximum frequency rate in January 2020 and began to increase from February 2020 until August 2020 (Fig. 10A). It began to dominate globally in November 2020 and grew to 0.9907 in January 2022. The evolutionary trends in all areas except South America showed a similar pattern of R203 prevalence. The data from South America showed an almost steady frequency rate for R203 from April 2020 to January 2022. The D63 and G215 mutations followed similar evolutionary trends, and both started increasing in April 2021 on all continents. The P80 mutation, which is common in and exclusive to South America, increased from November 2020 to June 2021 and then began to decline.
The D614G mutation began to rise worldwide in February 2020. Unlike in other regions, this mutation did not follow a consistent pattern in Africa and showed fluctuations. The L18 mutation increased from August 2020 and began to decrease from July 2021 worldwide (Fig. 10B). E484 and N501 showed a similar pattern globally. In contrast, the A222 mutation displayed a different trend: its prevalence increased from July 2020 to October 2020. The S477 mutation, one of Oceania’s top ten mutations, increased in May 2020 and decreased 3 months later (August 2020). South American results showed that, except for D614, mutations increased in November 2020.
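Temporal trends of this kind rest on grouping sequences by collection month and computing, within each month, the fraction of sequences mutated at a given position. A minimal sketch of that grouping is shown below; the record layout, the function name and the toy dates are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import date


def monthly_frequency(records, position):
    """records: iterable of (collection date, set of mutated 1-based positions).

    Returns {(year, month): fraction of that month's sequences mutated at `position`}.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for collected, mutated_positions in records:
        month = (collected.year, collected.month)
        totals[month] += 1
        if position in mutated_positions:
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}


# Toy records (hypothetical): collection date and mutated positions per sequence.
records = [
    (date(2021, 9, 3), set()),
    (date(2021, 9, 20), {9}),
    (date(2021, 10, 1), {9}),
    (date(2021, 10, 15), {9, 71}),
    (date(2021, 10, 30), {71}),
    (date(2021, 10, 31), {9}),
]

trend = monthly_frequency(records, position=9)
peak_month = max(trend, key=trend.get)
print(trend, peak_month)
```

Restricting the input records to one geographical area before calling the function gives the corresponding per-area trend.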
Protein–protein interaction (PPI) network presentation
The protein–protein interaction (PPI) network, with 57 nodes and 153 edges, presents the interactions between the SARS-CoV-2 E, M, N and S proteins and human proteins (Fig. 11; see Additional file 9). In the ranking analysis, Ras GTPase-activating protein-binding protein 1 (G3BP1) was identified as a highly ranked human gene (Fig. 12). Additional data are provided in Additional file 10. The network showed a linkage between the M protein cluster genes and E and N members, which are linked with the human gene A-kinase anchoring protein 8 like (AKAP8L), which acts as a bottleneck. In addition, Zinc Finger DHHC-Type Palmitoyltransferase 5 (ZDHHC5) and Golgin A7 (GOLGA7) were the human genes with the highest interaction with the S protein.
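To illustrate what a bottleneck node means in such a network, the sketch below scores nodes of a small toy graph by betweenness centrality, that is, by how many shortest paths between other nodes pass through them. Betweenness is used here only as a stand-in: the ranking method applied to the network in Fig. 12 is not restated in this section, and the node names and edges below are hypothetical rather than taken from the 57-node network.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def betweenness(graph):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted graph)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        order = []                              # nodes in order of BFS discovery
        preds = {v: [] for v in graph}          # predecessors on shortest paths
        n_paths = dict.fromkeys(graph, 0)       # number of shortest paths from source
        n_paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to the source.
        dependency = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair of endpoints was counted twice (once per direction).
    return {v: s / 2 for v, s in score.items()}


# Toy network (hypothetical): v* stand for viral proteins, h* for human proteins.
edges = [("v1", "h1"), ("v2", "h1"), ("v3", "h1"), ("h1", "h2"), ("h2", "v4")]
graph = build_graph(edges)
scores = betweenness(graph)
bottleneck = max(scores, key=scores.get)
print(bottleneck, scores[bottleneck], len(graph[bottleneck]))
```

In the toy graph, h1 lies on every shortest path between v1, v2, v3 and the rest of the network, so it receives the highest score. A node's degree, by contrast, only counts its direct interaction partners.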