NS4A protein as a marker of HCV history suggests that different HCV genotypes originally evolved from genotype 1b

Background The 9.6 kb long RNA genome of Hepatitis C virus (HCV) is under the control of RNA-dependent RNA polymerase, an error-prone enzyme, for its transcription and replication. A high rate of mutation has been found to be associated with RNA viruses like HCV. Based on genetic variability, HCV has been classified into 6 major genotypes and 11 subtypes. However, this classification system does not provide significant information about the origin of the virus, primarily because of the high mutation rate at the nucleotide level. The HCV genome codes for a single polyprotein of about 3011 amino acids, which is processed into structural and non-structural proteins inside the host cell by viral and cellular proteases. Results We have identified a conserved NS4A protein sequence for HCV genotype 3a reported from four different continents of the world, i.e. Europe, America, Australia and Asia. We investigated 346 sequences, compared the amino acid composition of the NS4A protein of different HCV genotypes through Multiple Sequence Alignment, and observed the amino acid substitutions C22, V29, V30, V38, Q46 and Q47 in the NS4A protein of genotype 1b. Furthermore, we observed C22 and V30 as more consistent members of the NS4A protein of genotype 1a. Similarly, we observed Q46 and Q47 in genotype 5; V29, V30, Q46 and Q47 in genotype 4; C22, Q46 and Q47 in genotype 6; C22, V38, Q46 and Q47 in genotype 3; and C22 in genotype 2 as more consistent members of the NS4A protein of these genotypes. Thus the amino acids that were introduced as substitutions in the NS4A protein of genotype 1 subtype 1b have been retained as consistent members of the NS4A protein of the other known genotypes.
Conclusion These observations indicate that the NS4A protein of different HCV genotypes originally evolved from the NS4A protein of genotype 1 subtype 1b, which in turn indicates that HCV genotype 1 subtype 1b established itself earlier in the human population and that all other known genotypes evolved later as a result of mutations in HCV genotype 1b. These results were further confirmed through phylogenetic analysis, by constructing a phylogenetic tree using the NS4A protein as a phylogenetic marker.


Introduction
Hepatitis C virus belongs to the Flaviviridae family of viruses, and its chronic infection has affected 350 million people worldwide [1]. HCV has a positive-sense single-stranded RNA genome of about 9.6 kb that has a single open reading frame and conserved untranslated regions (UTRs) at the 5' and 3' ends [2]. Within the host cell the polyprotein is processed into structural (Core, E1, E2 and P7) and nonstructural proteins (NS2, NS3, NS4A, NS4B, NS5A and NS5B). The nonstructural 5B (NS5B) protein is an RNA-dependent RNA polymerase that is responsible for viral genome replication [3]. The error-prone nature of this enzyme is responsible for the high mutation rate in HCV. Based on nucleotide sequence comparison of the 5'UTR, Core/E1 and NS5B regions, six major HCV genotypes (HCV-1 to HCV-6) have been described, each containing multiple subtypes (e.g., 1a, 1b, 1c etc). In terms of genetic variability, genotypes differ from each other by 31 to 33% and subtypes by 20 to 25% [4]. Though the HCV classification system has evolved considerably [5,6], it does not provide convincing information about the origin of the virus. Suzuki and Nei used amino acid sequences of hemagglutinin genes instead of nucleotide sequences in their work on the origin and evolution of influenza virus, and they reported that amino acid sequences provide more reliable information than nucleotide sequences for establishing evolutionary relationships when the sequence divergence is high [7]. During our protein BLAST analysis (http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins) of the NS4A gene (HCV genotype 3a) isolated from the Pakistani population, we observed a relatively conserved nature of the NS4A protein. Furthermore, we observed occasional amino acid substitutions in the NS4A protein sequences from genotype 3a.
The purpose of this study is to establish the identity of the parent HCV genotype that first established itself in the human population. We have analyzed amino acid sequences of the NS4A protein of all known Hepatitis C virus genotypes through Multiple Sequence Alignment (MSA) and by constructing a phylogenetic tree using CLC sequence viewer software. We used the NS4A protein for three reasons: first, its relatively conserved nature; second, the occasional amino acid substitutions that we observed; and third, the availability of a large number of sequences for this region in sequence databases from all over the world. We have used amino acid substitutions as a tool because it is logical to expect that when an amino acid substitution is introduced into the NS4A protein it will be retained in future progenies until mutated again. Because of the relatively conserved nature of the NS4A protein, some of these amino acid substitutions might travel a long distance across different HCV genotypes as HCV evolved. Following such substitutions across different HCV genotypes can provide valuable information about the evolution of the NS4A protein, and in turn about the evolution of HCV. A phylogenetic tree was constructed using the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) method to support our results. First, MSA was performed for 56 sequences from genotype 3 subtype 3a. After that, a single MSA was done for all 346 sequences. Then MSA was performed for 73 sequences from genotype 1 subtype 1b and 3 sequences from genotype 1 subtype 1c. Furthermore, MSA was performed for the 73 sequences from genotype 1 subtype 1b with 64 sequences from genotype 1 subtype 1a, 35 sequences from genotype 5, 37 sequences from genotype 4, 58 sequences from genotype 3 and 58 sequences from genotype 2, respectively. Finally, a single phylogenetic tree was constructed for all 346 sequences by the UPGMA method using CLC sequence software (http://www.clcbio.com/index.php?id=28).

NS4A protein HCV genotype 3a
A total of 56 different amino acid sequences reported from different parts of the world for the NS4A protein of genotype 3 subtype 3a were analyzed through Multiple Sequence Alignment. Of these 56 sequences, 41 had the same amino acid sequence, as shown in Figure 1, where dots show similarity and letters show amino acid substitutions relative to sequence 1 (PK/FG3). The PK/FG3 isolate used as the reference sequence was isolated from the local Pakistani population. The 41 sequences that share the same amino acid sequence for the NS4A protein of HCV genotype 3a have been reported from different parts of the world, i.e. Pakistan, France, United Kingdom, Switzerland, Germany, Belgium, Australia and United States of America, representing 4 different continents, i.e. Asia, Europe, Australia and North America. The amino acid substitutions F6, V13, I20, S22, E32, R32, R41 and R46 were observed in sequences 42-56 relative to sequence 1. These results indicate the relatively conserved nature of the NS4A protein at the genotype level and may help in performing evolutionary studies with HCV.
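The dot notation used in the alignment figures can be illustrated with a short Python sketch. This is not the CLC software's own routine; the function names and the toy sequences in the usage note are hypothetical, and the input is assumed to be already aligned (equal length, no realignment is done here).

```python
def dot_alignment(reference, sequences):
    """Render aligned sequences relative to a reference.

    A '.' marks a position identical to the reference; a letter marks
    a substitution, as in the dot-style alignment figures.
    """
    rows = []
    for seq in sequences:
        if len(seq) != len(reference):
            raise ValueError("sequences must be aligned to the same length")
        rows.append("".join("." if a == b else b for a, b in zip(reference, seq)))
    return rows


def list_substitutions(reference, seq):
    """Return substitutions as 'new residue + 1-based position', e.g. 'S22'."""
    if len(seq) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return [f"{b}{i}"
            for i, (a, b) in enumerate(zip(reference, seq), start=1)
            if a != b]
```

For example, with a toy reference `"ACDE"` and a toy variant `"AGDE"`, `dot_alignment` gives `".G.."` and `list_substitutions` gives `["G2"]`.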

Amino Acid sequence comparison of NS4A protein of different HCV genotypes
Multiple Sequence Alignment of the NS4A protein of HCV genotype 3a provided useful information about its conserved nature. These results indicated that both the conserved nature of, and the occasional amino acid substitutions in, the NS4A protein might provide useful information about the origin of HCV in humans. We therefore compared the amino acid composition of the NS4A protein of different HCV genotypes through Multiple Sequence Alignment. A single MSA was performed for all 346 sequences included in this study (data not shown), and amino acid substitutions were critically analyzed in all HCV genotypes. We observed amino acid substitutions in genotype 1b that were consistent members of the NS4A protein of different HCV genotypes. We therefore analyzed and compared sequences of genotype 1b with sequences from the different HCV genotypes and subtypes.
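One way to make the notion of a "consistent member" concrete is to tabulate, per genotype, how often each residue occurs at a given alignment position. The sketch below is an illustration only: the paper does not state a numeric cut-off, so the `threshold` parameter (default 0.5) is an assumption, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def residue_frequencies(records, position):
    """Per-genotype residue frequencies at a 1-based alignment position.

    records: iterable of (genotype, aligned_sequence) pairs.
    """
    counts = defaultdict(Counter)
    for genotype, seq in records:
        counts[genotype][seq[position - 1]] += 1
    return {
        genotype: {aa: n / sum(c.values()) for aa, n in c.items()}
        for genotype, c in counts.items()
    }


def consistent_in(records, residue, position, threshold=0.5):
    """Genotypes in which `residue` reaches `threshold` at `position`.

    The threshold is an assumed cut-off, not a value from the study.
    """
    freqs = residue_frequencies(records, position)
    return sorted(g for g, f in freqs.items() if f.get(residue, 0.0) >= threshold)
```

A residue that is rare in one genotype but reaches the cut-off in another would correspond to what the text describes as an occasional substitution in the former and a consistent member in the latter.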

NS4A protein HCV genotype 1b and 1c
A total of 72 sequences for the NS4A protein of HCV genotype 1 subtype 1b and 3 sequences for subtype 1c were compared through Multiple Sequence Alignment, as shown in Figure 2. The genotype 1b sequences included in this study were reported from France, Switzerland, United States of America, Japan, Germany, China, Sweden, Korea, Ireland, Australia and Russia, while the genotype 1c sequences were reported from Indonesia and India. Sequences 1 to 22 have the same amino acid sequence, with no amino acid substitution. These 22 sequences were reported from France, Switzerland, Japan and USA, indicating the relatively conserved nature of the NS4A protein.
Sequences 23 to 38 have 6 different single amino acid substitutions: C22, V30, R34, I37, V38 and Q46 (letters and numbers indicate the specific amino acids and their positions in the NS4A protein, respectively). Sequences 39 to 51 show double amino acid substitutions, in which the 6 single amino acid substitutions already observed are combined in pairs in different combinations. In sequences 52 and 53 another amino acid substitution, Q47, was introduced as the NS4A protein of genotype 1 subtype 1b evolved. The NS4A protein of genotype 1 subtype 1c closely resembles the NS4A protein of subtype 1b, as shown in Figure 2. Sequence 74 shows that the NS4A protein of genotype 1 subtype 1c evolved when T19 in the NS4A protein of genotype 1 subtype 1b was substituted by S19. G32 is another amino acid that we observed in subtype 1c sequences 74 and 75 but not in any of the 72 sequences of subtype 1b.

NS4A protein HCV genotype 1a
MSA was performed for 64 different sequences of the NS4A protein of genotype 1 subtype 1a with 72 sequences from genotype 1 subtype 1b, and the resulting alignment is shown in Figure 3; for convenience only one sequence for genotype 1b is shown. The genotype 1a sequences included in this study were reported from France, UK, Japan, USA, Australia, Switzerland, Singapore and Canada. We observed that C22 and V30, which were introduced as occasional amino acid substitutions in the NS4A protein of genotype 1b, are consistent members of the NS4A protein of genotype 1 subtype 1a. R34, I37, V38 and Q46, which emerged as single amino acid substitutions in the NS4A protein of genotype 1b, are also present in different sequences of genotype 1a. The S19 amino acid, which was also observed in genotype 1c sequences, is a consistent member of the genotype 1a NS4A protein. The overall similarity represented in the form of dots, the presence of C22 and V30 as consistent members, and the presence of the V29, R34, I37, V38 and Q46 amino acids, which originally emerged at the genotype 1b level, clearly indicate that the NS4A protein of genotype 1a evolved later than the NS4A protein of genotype 1b.

NS4A protein HCV genotype 5
MSA for 35 different sequences of the NS4A protein of genotype 5 was performed with 72 sequences from genotype 1 subtype 1b. The genotype 5 sequences included in this study were reported from France, Belgium, USA, South Africa, Algeria, UK and Spain. MSA results for the genotype 5 sequences are shown in Figure 4; for simplicity only one sequence from genotype 1b is shown. Comparative analysis of the genotype 1b and genotype 5 sequences (Figure 4) shows that L10, T20 and V24 of the NS4A protein of genotype 1b have been replaced by V10, V20 and A24, respectively, in the NS4A protein of genotype 5. Q46 and Q47, which were introduced as amino acid substitutions in genotype 1b sequences, have been retained as more consistent members in genotype 5 sequences. The R34 and I37 amino acids are also present in different sequences of genotypes 1b and 5. We propose that the NS4A protein of genotype 5 evolved when the V10, V20 and A24 amino acid substitutions were introduced into NS4A protein sequences of genotype 1b (sequences 52 to 58 in Figure 2).

NS4A protein HCV genotype 4
MSA was performed for 37 different sequences of the NS4A protein of genotype 4 with 72 sequences from genotype 1 subtype 1b. The genotype 4 sequences included in this study were reported from USA, Egypt, UK, Spain, France, Indonesia, Cameroon and Portugal. Some of the sequences for genotype 1b that were reported from African patients in Canada are also included in this study. MSA results are shown in Figure 5; for simplicity only one sequence from genotype 1b is shown. The V29, V30, Q46 and Q47 amino acids, which emerged as amino acid substitutions in NS4A protein sequences of genotype 1b, are present more consistently in the NS4A protein of genotype 4. The I37 amino acid can also be seen in some sequences. The Q34 amino acid was observed consistently in NS4A protein sequences of genotype 4 only. S19 and V20 are the other amino acids that are present more consistently in NS4A protein sequences of genotype 4 but not in the sequences that we observed for genotype 1b. Other amino acids occurring less frequently are also shown in Figure 5.

NS4A protein HCV genotype 6
Thirty amino acid sequences for the NS4A protein of genotype 6 were uploaded to the CLC software and MSA was performed with 72 sequences from genotype 1 subtype 1b. The genotype 6 sequences included in this study were reported from Hong Kong, UK, France, China, Japan, Thailand and Viet Nam. Results for this alignment are shown in Figure 6; for convenience only one sequence from genotype 1b is shown. The figure shows that C22, Q46 and Q47 are present as more consistent members of the NS4A protein sequences of genotype 6. These amino acids emerged as amino acid substitutions in the NS4A protein of genotype 1b. The V38 amino acid present in different sequences of genotype 6 also emerged in genotype 1b sequences. S19, V20, C26, T30, T31, T32 and I43 are the amino acids that are present in different sequences of genotype 6 but not in the 72 sequences we observed for genotype 1b. Some other amino acids shown in Figure 6 are also present in genotype 6 sequences, but they occur less consistently.

NS4A protein HCV genotype 3
MSA was performed for 58 sequences of the NS4A protein of genotype 3 and 72 sequences from genotype 1b. The genotype 3 sequences included in this study were reported from Pakistan, France, UK, Switzerland, Australia, USA, Germany, Belgium, Japan, Singapore, Denmark, Indonesia and India. Results for this alignment are shown in Figure 7; for convenience only one sequence for genotype 1b is shown. The C22, V38, Q46 and Q47 amino acids are frequent members of the NS4A protein sequences of genotype 3. These amino acids emerged as amino acid substitutions in the NS4A protein sequences of genotype 1b.
The presence of the S19 and G32 amino acids together in the same sequence was observed in sequences from genotypes 3 and 1c only. L6, V20, H28, E30, L37, K41 and Y48 are amino acids that we did not observe in our sequences for genotype 1b but that are frequent members of the NS4A protein sequences from genotype 3. Some other amino acid differences were also observed but are present less frequently, as shown in Figure 7.

NS4A protein HCV genotype 2
Fifty-eight sequences for the NS4A protein of genotype 2, reported from Japan, UK, USA, Indonesia and Viet Nam, were included in this study. MSA was performed for the 58 sequences from genotype 2 and 72 sequences from genotype 1b for the NS4A protein. Results are shown in Figure 8; for convenience only one sequence for genotype 1b is shown. C22 is the amino acid that appeared as an occasional substitution in the NS4A protein of genotype 1b but is a more frequent member of the NS4A protein sequences from genotype 2. K41 is a frequent member of genotype 2 and genotype 3 sequences. The NS4A protein sequences from genotype 2 differ the most from genotype 1b sequences in terms of amino acid composition, as indicated in Figure 8.

Phylogenetic Analysis
A phylogenetic tree was constructed for 346 sequences of the NS4A protein, representing the HCV genotypes known so far, using CLC sequence viewer software and the UPGMA method. The standard layout of the tree is shown in Figures 9, 10, 11 and 12 (a single phylogenetic tree was constructed, but for convenience it is shown in four figures, which should be read in continuation from Figure 9 to Figure 12). The UPGMA method assumes that evolution has occurred at a constant rate in the different lineages, which is why the root of the tree can also be estimated. For bootstrap analysis the default value of 100 was used. Bootstrap values are attached to each branch. Genotype 1b sequences occupy the root of the tree, and sequences from the individual genotypes are clustered together in the tree, which clearly demonstrates that the NS4A protein of different HCV genotypes originally evolved from the NS4A protein of genotype 1b.
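The UPGMA procedure can be sketched in a few lines of Python. This is a minimal illustration, not the CLC implementation: it assumes pre-aligned sequences, uses a simple p-distance (fraction of differing positions) as the distance measure, which is an assumption since the distance model used by the software is not stated, and omits branch lengths and bootstrapping. The function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(names, seqs):
    """Build a UPGMA tree as nested tuples of sequence names."""
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {
        frozenset((i, j)): p_distance(seqs[i], seqs[j])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Join the two closest clusters.
        pair = min(dist, key=dist.get)
        i, j = sorted(pair)
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted mean, i.e. the mean over all leaf pairs.
        for k in clusters:
            d_ik = dist.pop(frozenset((i, k)))
            d_jk = dist.pop(frozenset((j, k)))
            dist[frozenset((next_id, k))] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del dist[pair]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree
```

In this sketch the outermost tuple is the last join made; its position as the root follows directly from the constant-rate assumption described above.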

Discussion
The NS4A gene (Accession no. HM135518, isolate name PK/FG3) that we isolated from a Pakistani patient chronically infected with HCV genotype 3a, sequenced and reported to GenBank showed 100% identity in a protein BLAST search at NCBI with many sequences reported from the United Kingdom. This was a striking observation, as HCV is known for a high mutation rate, yet the NS4A proteins reported from Pakistani and UK populations show such high similarity at the amino acid level. These BLAST results prompted us to investigate the conserved nature of the NS4A protein across different regions of the world.
Our results in Figure 1 clearly show that Hepatitis C virus genotype 3a has spread to four different continents of the world but still retains the same amino acid sequence for the NS4A protein, despite the high mutation rate in the HCV genome. The relatively conserved nature of the NS4A protein indicates that the original NS4A protein, which was part of the HCV polyprotein when it first established itself in humans, might have been passed on in its dormant form to present-day HCV, and its sequence might have been reported to sequence databases. By comparing the amino acid composition of the NS4A protein of different HCV genotypes, the occasional amino acid substitutions that we observed might help us to investigate its identity. The conserved nature of the NS4A protein has two important implications. First, when amino acid substitutions are introduced into this protein, there is a considerable chance that they will be retained in future progenies. Second, some of these amino acid substitutions may travel a long distance across different HCV genotypes. Locating such amino acid substitutions and following them across different HCV genotypes might help us identify the genotypes that evolved earlier or later in HCV evolution. Our study suggests that C22, Q46 and Q47 are three very important amino acid substitutions that were introduced into the NS4A protein of genotype 1b early in HCV evolution. Amino acid composition analysis of the NS4A protein of different HCV genotypes shows that at least one of these three amino acids is a consistent member of the NS4A protein of all the other known HCV genotypes. C22 is a more consistent member of the NS4A protein sequences of genotype 1a, genotype 6, genotype 3 and genotype 2. The Q46 and Q47 amino acids are more consistent members of the NS4A protein sequences of genotype 5, genotype 4, genotype 6 and genotype 3. V29, V30 and V38 are the other three important amino acid substitutions introduced into the NS4A protein of genotype 1b. V30 is a consistent member of the NS4A protein sequences of genotype 1a, V29 and V30 are more consistent members of genotype 4 sequences, and V38 is a more consistent member of genotype 3 sequences.
Previous studies that were performed to understand HCV evolution and to classify different genotypes used nucleotide sequences [5,6,19,20]. We have used amino acid sequences in this study because sequence divergence in HCV is very high at the nucleotide level, owing to the error-prone nature of its polymerase. The study of the evolutionary history and origin of new subtypes of HCV needs a consistent system. We used amino acid substitutions in individual genotypes and subtypes of HCV for the study of origin and evolution. Suzuki and Nei used amino acid sequences to study the origin and evolution of influenza virus [7]. Furthermore, previous studies used the 5'UTR, Core/E1 and/or NS5B gene regions [6,19,21,22], whereas we have used the relatively conserved NS4A protein sequences, which can give a better picture of evolution. Previous studies used ClustalW for Multiple Sequence Alignment; we have used CLC software, which automatically arranges sequences on the basis of sequence similarity. Furthermore, the CLC software allows individual sequences to be moved up and down in the MSA file that is generated, so we can arrange sequences in different orders and look for different patterns of amino acid substitutions that may emerge.
Figure 9: Sequences of genotype 1b at the root, clustering with genotype 1a, genotype 6 and some sequences from genotype 3.
We have identified different amino acids as consistent members in different HCV genotypes that we did not observe in our NS4A protein sequences from genotype 1b. We believe that these amino acids were introduced later, as HCV evolved with time. The T19 and S32 amino acids in genotype 1b sequences have been replaced by S19 and G32, respectively, in genotype 1c sequences. T19 of genotype 1b sequences has been replaced by S19 in genotype 1a sequences. L10, T20 and V24 in genotype 1b sequences have been replaced by V10, V20 and A24, respectively, in genotype 5 sequences. Genotype 4 sequences have the S19, V20 and Q34 amino acids as more consistent members, while genotype 1b sequences have the T19, S32 and K34 amino acids. Genotype 6 and genotype 3 sequences also have the S19 and V20 amino acids, similar to genotype 4 sequences. T30 and T32 are also members of genotype 6 sequences, but these are less consistent members than the S19 and V20 amino acids. R28, I30, S32, V37, R41 and F48 in genotype 1b sequences have been replaced by H28, E30, G32, L37, K41 and Y48 in genotype 3 sequences. Genotype 2 shows the highest diversity from genotype 1b sequences in terms of amino acid composition, as indicated in Figure 8. The overall similarity of genotype 1b sequences with other genotypes, denoted by dots (Figure 2 to Figure 8), the occasional amino acid substitutions in genotype 1b and their presence as more consistent members in sequences of the other known genotypes, and the presence of the further substitutions just discussed show that the NS4A protein of the other HCV genotypes known so far originally evolved from the NS4A protein of genotype 1b.
To further confirm our results, phylogenetic analysis was performed by constructing a single phylogenetic tree using the UPGMA method, as shown in Figures 9, 10, 11 and 12. Many studies related to HCV classification and evolution have used the UPGMA method for constructing phylogenetic trees [23][24][25]. NS4A protein sequences from genotype 1b occupied the root of the phylogenetic tree. Sequences from individual genotypes were clustered together in the tree, which indicates that our constructed tree is in accordance with the current classification system, which is based on nucleotide sequence analysis of the 5'UTR, Core/E1 and NS5B gene regions. This also shows the importance of the NS4A protein as a phylogenetic marker of HCV history and of UPGMA as a relevant method for tree construction. Both the amino acid composition analysis and our phylogenetic tree indicate that genotype 2 differs more from genotype 1b than any other HCV genotype does. Based on the above-mentioned observations, it is now easy to generalize that HCV genotype 1b established itself earlier in humans and that all other known HCV genotypes evolved later as a result of mutations in genotype 1b. We propose that the following amino acid sequence (Figure 2, sequences 1 to 22) might have been the sequence of the NS4A protein that was part of the HCV polyprotein when it first infected humans.

S T W V L V G G V L A A L A A Y C L T T G S V V I V G R I I L S G K P A V I P D R E V L Y R E F D E M E E C
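The substitution notation used throughout (new residue followed by its 1-based position, e.g. C22) can be applied to the proposed sequence mechanically. The following Python sketch is illustrative only; the helper name `apply_substitutions` is hypothetical, and the sequence string is copied from the line above.

```python
# Proposed NS4A sequence from the text above, with the spaces removed.
PROPOSED_NS4A = (
    "S T W V L V G G V L A A L A A Y C L T T G S V V I V G R I I "
    "L S G K P A V I P D R E V L Y R E F D E M E E C"
).replace(" ", "")


def residue_at(seq, position):
    """Residue at a 1-based position."""
    return seq[position - 1]


def apply_substitutions(seq, substitutions):
    """Apply substitutions written as new residue + 1-based position."""
    residues = list(seq)
    for sub in substitutions:
        new_residue, position = sub[0], int(sub[1:])
        residues[position - 1] = new_residue
    return "".join(residues)


# Example: the three substitutions highlighted in the Discussion.
variant = apply_substitutions(PROPOSED_NS4A, ["C22", "Q46", "Q47"])
```

Here `residue_at(PROPOSED_NS4A, 22)` is `S`, while `residue_at(variant, 22)` is `C`; all other positions outside 22, 46 and 47 are unchanged.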
Some of the genotype 6 variants reported from Southeast Asia have 5'UTR sequences identical to those of genotypes 1b and 1a [26][27][28][29]. At the nucleotide level, the 5'UTR is the most conserved region in the HCV genome, and these reports support our results. A few of the HCV genomic sequences reported from Russia have structural genes similar to genotype 2 and non-structural genes similar to genotype 1b [30,31], which according to our findings is the parent HCV genotype. Another genomic sequence, reported from Peru, has structural genes similar to genotype 1a and non-structural genes similar to genotype 1b [32]. These sequences have been classified as recombinants because it is believed that they were generated as a result of recombination events between different HCV genotypes [30][31][32]. It is well documented that HCV targets structural genes like E1 and E2 for mutation to avoid immune responses [33,34]. There is a possibility that these recombinant genotypes evolved as a result of a much higher than normal mutation rate in the structural region and a lower mutation rate in the non-structural regions, and not as a result of recombination events. This much higher mutation rate could be due to high pressure on HCV from the immune system in certain individuals. Much work remains to be done to establish the facts regarding recombinant genotypes, and our discovery will have a role to play in that regard.

Conclusion
This work highlights the significance of the NS4A protein as a phylogenetic marker in studies related to the origin and evolution of HCV. Amino acid substitution and phylogenetic analysis of the NS4A protein sequences of different HCV genotypes show that the NS4A protein of the HCV genotypes known so far evolved from the NS4A protein of HCV genotype 1b. This implies that genotype 1b established itself earlier in humans and that all other known HCV genotypes evolved later as a result of mutations in HCV genotype 1b.