Prediction of prognostic biomarkers for Interferon-based therapy to Hepatitis C Virus patients: a metaanalysis of the NS5A protein in subtypes 1a, 1b, and 3a
© ElHefnawi et al; licensee BioMed Central Ltd. 2010
Received: 10 April 2010
Accepted: 15 June 2010
Published: 15 June 2010
Hepatitis C virus (HCV) is a worldwide health problem with no vaccine, and the only approved therapy is Interferon-based therapy plus Ribavirin. Predicting response to treatment has health and economic impacts and is a multi-factorial problem involving both host and viral factors (e.g., age, sex, ethnicity, pre-treatment viral load, and dynamics of the HCV non-structural protein NS5A quasispecies). We implement a novel approach for extracting features, including informative markers from mutations in the non-structural 5A protein (NS5A), specifically its Interferon sensitivity determining region (ISDR) and V3 region. We apply pattern recognition to the NS5A protein and its motifs to find biomarkers for response prediction using class association rules, and we compare the predictive power of the different features.
A total of 58 sequences from sustained responders and 94 from non-responders were downloaded from the HCV LANL database. Site-specific signatures for response prediction from the NS5A protein were extracted from the alignments. Class association rules were generated; for example, sustained response is associated with A2368T in subtype 1a (support 100% and confidence 52.19%), and in subtype 1b, response is associated with E2356G/D/K (support 76.3% and confidence 67.3%).
The V3 region was a more accurate biomarker than the ISDR region. Subtype-specific class association rules gave better support and confidence than profile hidden Markov model (HMM) scores, genetic distances, or number of variable sites, and would thus aid in the prediction of prognostic biomarkers and improve the accuracy of prognosis. Site-specific class association rules in the V3 region of the NS5A protein gave the best support and confidence.
Hepatitis C virus (HCV) is a positive-sense, single-stranded, enveloped RNA virus belonging to the Flaviviridae family. It causes a persistent infection in immune-competent individuals. Its major sequelae are chronic active hepatitis, liver fibrosis, cirrhosis, and hepatocellular carcinoma. It is a major concern for future world health and development, as it infects ~3% of the world population and has no vaccine. The only approved combined therapy of pegylated Interferon plus Ribavirin has limited success (80% for genotypes 2 & 3 and 50% for genotypes 1 & 4). Factors influencing response can be classified into viral factors, e.g. the baseline viral load, the genotype, and the viral quasispecies heterogeneity [3, 4], and host factors, which can be further divided into general parameters such as age, sex, contamination period, and liver fibrosis, and cellular factors including genetic polymorphisms in cellular immunological proteins.
NS5A is a multidomain phosphoprotein and an integral part of the virus replicase complex. It is involved in interactions with cellular proteins including cytokines, growth factors, oncoproteins, and signalling proteins (for reviews, see Macdonald and Harris, 2004, and Reyes, 2002 [8, 9]). NS5A also antagonizes numerous cellular pathways, including the antiviral interferon-α response pathway and the JAK-STAT pathway, as part of the counterattack mechanisms employed by the virus. Site-specific substitutions, higher genetic distances, and number of variable sites in the ISDR and V3 regions, as well as the dynamics of the NS5A quasispecies after 4 weeks of therapy, all showed correlation with favorable response to treatment [3, 11, 12]. This indicates the superiority of viral factors in determining the response outcome. Genetic markers from the virus proteins are important to consider in view of the immunological nature of hepatitis C virus disease and the many reports confirming the importance of virus-immune system interactions in determining response outcome. First, however, some general comments on bioinformatics and data mining are necessary.
Data mining has been defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Classification is a classic data mining task, with roots in machine learning. Associative classification aims to detect relationships between categorical variables in large datasets, which enables the identification of hidden patterns in large databases. Its goal is to discover a small set of rules in the database, called class association rules, that together form an accurate classifier. The accuracy of the rules is measured by their support (relative frequency of the body or head of the rule) and confidence (conditional probability of the body given the head of the rule). Several algorithms have been implemented for association rule mining, including the Apriori algorithm and the frequent itemset mining algorithm COFI.
Bioinformatics, as a subdiscipline of data mining, aims to improve our current knowledge and understanding of biological and molecular entities. Pattern recognition and the representation of motifs are fundamental problems in bioinformatics, including bioinformatics applied to disease. There is a need for methods that can find discriminative patterns between closely related sets of sequences that exhibit different phenotypes, such as virulence or drug resistance. It is important to capture very subtle variations that are discriminatively powerful and to leave out unimportant, statistically insignificant variations between the sets of sequences. Approaches for representing patterns from sequence data include regular expressions, position weight matrices, sequence logos, and profile hidden Markov models. All of these have been used in motif databases (e.g., Pfam).
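The two rule measures can be illustrated with a minimal Python sketch. It uses the common textbook convention (support as the fraction of all records that match both sides of the rule, confidence as the fraction of records matching the rule body that also have the stated outcome), which may differ from the exact convention behind the figures reported in this paper. The function name `rule_metrics` and the toy records are hypothetical and are not taken from the study data.

```python
from typing import Dict, List, Set, Tuple

# One record = (residue observed at each alignment position, response label).
Record = Tuple[Dict[int, str], str]


def rule_metrics(records: List[Record], position: int,
                 residues: Set[str], outcome: str) -> Tuple[float, float]:
    """Support and confidence of the rule
    'residue at `position` is in `residues`  ->  `outcome`'."""
    n_total = len(records)
    # Records matching the rule body (the residue condition).
    n_body = sum(1 for feats, _ in records if feats.get(position) in residues)
    # Records matching both the body and the outcome.
    n_both = sum(1 for feats, label in records
                 if feats.get(position) in residues and label == outcome)
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


# Hypothetical toy data: "R" = responder, "NR" = non-responder.
toy_records: List[Record] = [
    ({2356: "G"}, "R"),
    ({2356: "D"}, "R"),
    ({2356: "E"}, "NR"),
    ({2356: "G"}, "NR"),
    ({2356: "E"}, "NR"),
]

support, confidence = rule_metrics(toy_records, 2356, {"G", "D", "K"}, "R")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

In this toy set, three of five records carry G, D, or K at the position and two of those three are responders, so the rule is matched by two of five records overall.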
In silico approaches for motif identifications and representations have tremendously helped to guide in vitro and in vivo experiments. DNA and protein motifs that were discovered in silico could be verified as signatures for diagnosis, prognosis, and response to treatment for several pathogens and cancer.
In this work, we apply a novel bioinformatics approach for signature extraction, feature selection, and classification, mining NS5A sequences from the HCV LANL database for response biomarker prediction. Informative class association rules meeting a threshold of support and confidence were generated to improve prognosis prediction. Pattern and variability analyses were performed on the NS5A protein, and specifically on its most important motifs for IFN-therapy response, namely the ISDR and V3 regions. The rationale was that new molecular markers are needed to improve current criteria for IFN-therapy inclusion and prognostic prediction. A systematic comparison of the ISDR and V3 regions, and of the three studied subtypes (1a, 1b, and 3a), was also needed. Finally, the results of the applied techniques are compared.
Prognosis prediction will help in personalising the treatment of HCV patients, reducing the side-effects and high costs associated with IFN treatment, and in guiding therapy choice in view of the number of specifically targeted antiviral therapy (STAT-C) inhibitors that will be available soon. To our knowledge, pattern analysis and classification modelling have not previously been applied to the study of response to IFN-based treatment for HCV. Our workflow for finding markers of response to IFN is composed of sequence collection and sorting; multiple sequence alignment; informative site identification and feature selection using relative Shannon entropy, comparative sequence logos, and viral epidemiology signature pattern analysis (VESPA) for positional enumeration of amino acids in each group; generation of class association rules; and selection of the best set of rules.
Materials and Methods
Sequence Collection and Analysis
Table: Summary of sequence analysis and mean genetic distance. Columns: number of sequences; number of variable sites; mean genetic distance within group; mean genetic distance between groups.
Variability and Phylogeny Analysis
Tree reconstruction for each subtype and region was done using PROTDIST from the PHYLIP package and the MEGA 4.0 software. Genetic distances within and between groups were also calculated using the MEGA 4.0 program.
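The idea of a mean genetic distance within and between groups can be sketched as follows. This is only an illustration that assumes the simple p-distance (proportion of differing aligned sites); the distance model actually selected in MEGA 4.0 is not stated here, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations, product
from typing import List


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned, non-gap sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def mean_within(group: List[str]) -> float:
    """Mean pairwise distance over all pairs inside one group."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def mean_between(group_a: List[str], group_b: List[str]) -> float:
    """Mean distance over all cross-group pairs."""
    pairs = list(product(group_a, group_b))
    if not pairs:
        return 0.0
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical aligned fragments, not real NS5A data.
responders = ["ACDE", "ACDF", "ACEF"]
non_responders = ["ACDE", "ACDE"]

print(mean_within(responders))
print(mean_within(non_responders))
print(mean_between(responders, non_responders))
```

A higher within-group mean for one response group than for the other is the kind of signal the genetic distance comparison looks for.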
Pattern Discovery and Feature Selection
The most statistically significant differences between the responder and non-responder groups were detected using VESPA, available from the HCV database, which gave the most variable positions and their frequencies in responders and non-responders. Class association rules were generated from these tables. Relative Shannon entropy was calculated using the entropy tool available from the HCV LANL database. Statistically significant variations were identified with a threshold of P = 0.05.
The Two Sample Logo server was also used to identify and confirm significant variations between the two groups for each subtype, and statistical significance was assessed.
Profile HMMs for the responder and non-responder groups were built using the HMMBUILD program from the HMMER package [25, 26]. Class association rules were generated for the sites with statistically significant variations between the two groups in both the comparative sequence logo and the relative Shannon entropy analyses, and those whose support and confidence were above 50% were retained.
The association rules were tested on a 10% subset of the sequences. The HMM search tool available from the HMMER package was also used to score the test sequences against a profile HMM, and the prediction accuracy was noted. Threshold values for the genetic distance scores, HMM scores, and number of variable sites were inferred, and class association rules were generated from them.
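The per-position entropy comparison can be illustrated with a short sketch. It computes the plain Shannon entropy of each alignment column in two groups and takes the difference; it does not reproduce the HCV LANL entropy tool, and it omits the significance test behind the P = 0.05 threshold. Function names and toy sequences are hypothetical.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: List[str]) -> float:
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_difference(group_a: List[str], group_b: List[str]) -> List[float]:
    """Per-position entropy of group_a minus entropy of group_b.

    Both groups must be drawn from the same alignment (equal lengths).
    """
    length = len(group_a[0])
    diffs = []
    for i in range(length):
        col_a = [seq[i] for seq in group_a]
        col_b = [seq[i] for seq in group_b]
        diffs.append(column_entropy(col_a) - column_entropy(col_b))
    return diffs


# Hypothetical two-column alignments, not real NS5A data.
toy_responders = ["AG", "AD", "AG", "AD"]
toy_non_responders = ["AE", "AE", "AE", "AE"]

diffs = entropy_difference(toy_responders, toy_non_responders)
print(diffs)
```

Positions with a large absolute difference are candidates for rule generation; a permutation test over group labels would be one way to attach a P value, though that is an assumption rather than the method documented here.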
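Testing a site-specific rule on a held-out subset can be sketched as below. The text does not say how the 10% subset was chosen, so the random split, the fixed seed, and all function names here are assumptions, and the records are toy data rather than study sequences.

```python
import random
from typing import List, Set, Tuple

# One record = (residue at the rule's position, true response label).
SiteRecord = Tuple[str, str]


def split_holdout(records: List[SiteRecord], test_fraction: float = 0.1,
                  seed: int = 0) -> Tuple[List[SiteRecord], List[SiteRecord]]:
    """Shuffle and split records into (training, test) subsets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict(residue: str, responder_residues: Set[str]) -> str:
    """Apply a single site-specific rule: listed residues imply response."""
    return "R" if residue in responder_residues else "NR"


def accuracy(test_records: List[SiteRecord],
             responder_residues: Set[str]) -> float:
    """Fraction of held-out records whose label the rule predicts correctly."""
    correct = sum(predict(res, responder_residues) == label
                  for res, label in test_records)
    return correct / len(test_records)


# Hypothetical, perfectly separable toy data: G -> R, E -> NR.
toy = [("G", "R")] * 10 + [("E", "NR")] * 10
train, test = split_holdout(toy, test_fraction=0.1, seed=0)
print(len(train), len(test), accuracy(test, {"G", "D", "K"}))
```

With real data the rule would be derived from the training subset only, so that the held-out accuracy is not inflated by sequences that shaped the rule.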
Patients' Sequences and Variability Analysis
Patterns Discovery and Recognition
Comparative sequence logos confirm the results of VESPA and the relative Shannon entropy tool. The graphical motif representation enables quick identification of positions that clearly differ in letter height and can therefore be incorporated in the classifier.
In the V3 region, the following site-specific variations between responders and non-responders can be noted. There were four statistically significant sites (2356, 2358, 2374, and 2378) in the V3 region of subtype 1b, which were confirmed by filtering the results of both the relative Shannon entropy (Figure 2) and the comparative sequence logo (Figure 3). Similar analysis showed that there was no confirmed marker in subtype 3a and that there were four positions in subtype 1a (2365, 2367, 2376, and 2379). Position 2378 was significantly variable between responders and non-responders in subtypes 1b and 3a.
There were three statistically significant variations in the IRRDR region in each of subtypes 1a and 3a (2326, 2342, and 2349 in subtype 1a; 2332, 2348, and 2383 in subtype 3a).
For the whole of the NS5A protein, discriminative variations clustered in the IRRDR region and its flanking parts only.
No observable variations were present in other parts of the NS5A protein or in the 2'-5' OAS binding region.
Comparing genotypes 1 and 3, the number of variable sites, genetic distances, and statistically significant positions were lower in subtype 3a than in subtypes 1a and 1b. The higher variability in subtype 1b could also be attributed to the diverse countries from which the patients came.
Evaluation and Comparison of Different Biomarkers
The class association rules for each subtype were generated from the VESPA, relative Shannon entropy, and comparative sequence logo results, and their support and confidence were calculated. The most informative rules, with the highest support and confidence, are as follows. In the V3 region, sustained response is associated with E2356G/D/K in subtype 1b (support 76.3% and confidence 67.3%) and with A2368T in subtype 1a (support 100% and confidence 52.19%). In subtype 1b, non-response is associated with wild type 2378T (support 50% and confidence 69%). In the ISDR region, non-response in subtype 1a is associated with wild type 2248S (support 47.5% and confidence 95%).
Table: Summary comparison of the accuracy of different approaches used in the paper
Site-specific class association rules: wild type 2378T in NR, subtype 1b; A2368T in R, subtype 1a; E2356G/D in R, subtype 1b
Number of variable sites: three variable sites in R; six variable sites in R
Genetic distance: GD > 0.2 for the V3 region in R
Profile hidden Markov model: score > 45 for R
Our objective was to extract patterns that can discriminate between two phylogenetically close but functionally different sets of sequences. Our results show that variability is present in both groups; there were red lines and tall letters in both response groups (Figures 2 and 3). Accordingly, a measure that depends only on variability would not be efficient in separating responders from non-responders. This is also why the profile HMMs, as maximum entropy models, did not perform well.
The approach using class association rules extracted from the VESPA results (see additional file 1, tables S1 and S2) and confirmed by relative Shannon entropy calculations and comparative sequence logos can help increase the sensitivity and specificity of genetic biomarker discovery in general. These class association rules, which are position and amino acid specific, proved more appropriate and gave high support and confidence. The associative classification technique was chosen because it builds a more accurate and more easily interpretable set of rules than traditional classification approaches [28, 29]. Analysis of the genetic distance variations, VESPA, and relative Shannon entropy (Table 1, Figure 2, and Additional files 2 and 3) indicates the discriminative superiority of the V3 region over the ISDR region as a biomarker in the response to therapy problem. This was also confirmed by recent studies. Subtype 3a showed lower overall variability and more homogeneity in both regions, with no statistically significant variations, which is consistent with its higher response rate. We correlated specific residues in the V3 region whose support and confidence both exceeded 50%. The previous structural and functional analysis showed that the V3 region is 100% exposed and contains a hot loop region, which ranks it highly as a protein binding motif. These mutations could limit the efficacy of the interactions between the NS5A protein and host immune system proteins in the virus's counterattack mechanisms. Also, non-response was associated with specific amino acids in the V3 region, which could be potential binding sites for immune system proteins. Analysis of variability failed to accurately distinguish the response groups because such disordered protein regions are inherently variable and are little affected by amino acid substitutions.
All three methods, VESPA, Shannon entropy, and comparative sequence logos, agreed on the most important statistically significant variable positions between the two sets. An automated analysis pipeline that incorporates these methods for signature extraction would aid rapid sequence biomarker discovery in general. This can help physicians in drug type assessment, as has been done with HIV drug resistance.
We conclude that the IRRDR region is a better biomarker for therapy response than the ISDR region. Indicative biomarkers that showed significant variation between the two groups were extracted from subtypes 1a, 1b, and 3a using a combined bioinformatics approach for pattern analysis. Subtype 3a showed lower overall variability and more homogeneity in both regions, with no statistically significant variations, which is consistent with its higher response rate. Finally, comparing the results of pattern-based approaches with analysis of variability, it is evident that rule generation and pattern discovery methods are more reliable than noisy models (HMMs) and analysis of variability alone.
In conclusion, prognostic biomarkers have been extracted using this approach that would enhance prediction of response to IFN therapy in chronic hepatitis C patients.
We acknowledge all those who helped in this work including those who reviewed it and suggested modifications and improvements. Special thanks to Pr. Steve Polyak for his very fruitful discussions and help. Special thanks to Ali Khalifa, Mona Kamar, and Nafisa Hassan for their efforts and help through this paper.
1. Pavio N, Lai MM: The hepatitis C virus persistence: how to evade the immune system? J Biosci. 2003, 28 (3): 287-304. 10.1007/BF02970148.
2. Cohen J: The scientific challenge of hepatitis C. Science. 1999, 285 (5424): 26-30. 10.1126/science.285.5424.26.
3. Farci P, et al: Early changes in hepatitis C viral quasispecies during interferon therapy predict the therapeutic outcome. Proc Natl Acad Sci USA. 2002, 99 (5): 3081-6. 10.1073/pnas.052712599.
4. Wagner V, et al: Dynamics of hepatitis C virus quasispecies turnover during interferon-alpha treatment. Journal of Viral Hepatitis. 2003, 10: 413-422. 10.1046/j.1365-2893.2003.00457.x.
5. Mihm U, et al: Review article: predicting response in hepatitis C virus therapy. Aliment Pharmacol Ther. 2006, 23 (8): 1043-54. 10.1111/j.1365-2036.2006.02863.x.
6. El Hefnawi MM, et al: Natural genetic engineering of hepatitis C virus NS5A for immune system counterattack. Ann N Y Acad Sci. 2009, 1178: 173-85. 10.1111/j.1749-6632.2009.05003.x.
7. Pawlotsky JM: Hepatitis C virus (HCV) NS5A protein: role in HCV replication and resistance to interferon-alpha. J Viral Hepat. 1999, 6 (Suppl 1): 47-8. 10.1046/j.1365-2893.1999.00004.x.
8. Macdonald A, Harris M: Hepatitis C virus NS5A: tales of a promiscuous protein. J Gen Virol. 2004, 85 (Pt 9): 2485-502. 10.1099/vir.0.80204-0.
9. Reyes GR: The nonstructural NS5A protein of hepatitis C virus: an expanding, multifunctional role in enhancing hepatitis C virus pathogenesis. J Biomed Sci. 2002, 9 (3): 187-97. 10.1007/BF02256065.
10. Song J, et al: The NS5A protein of hepatitis C virus partially inhibits the antiviral activity of interferon. J Gen Virol. 1999, 80 (Pt 4): 879-86.
11. El-Shamy A, et al: Sequence variation in hepatitis C virus nonstructural protein 5A predicts clinical outcome of pegylated interferon/ribavirin combination therapy. Hepatology. 2008, 48 (1): 38-47. 10.1002/hep.22339.
12. Sarrazin C, et al: Hepatitis C virus nonstructural 5A protein and interferon resistance: a new model for testing the reliability of mutational analyses. J Virol. 2002, 76 (21): 11079-90. 10.1128/JVI.76.21.11079-11090.2002.
13. Wohnsland A, Hofmann WP, Sarrazin C: Viral determinants of resistance to treatment in patients with hepatitis C. Clin Microbiol Rev. 2007, 20 (1): 23-38. 10.1128/CMR.00010-06.
14. Baralis E, Garza P: A lazy approach to pruning classification rules. IEEE International Conference on Data Mining. 2002.
15. Finn RD, et al: The Pfam protein families database. Nucleic Acids Res. 2008, 36 (Database issue): D281-8.
16. HCV LANL database.
17. Kuiken C, et al: The Los Alamos hepatitis C sequence database. Bioinformatics. 2005, 21 (3): 379-84. 10.1093/bioinformatics/bth485.
18. Clamp M, et al: The Jalview Java alignment editor. Bioinformatics. 2004, 20 (3): 426-7. 10.1093/bioinformatics/btg430.
19. Hall TA: BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucl Acids Symp Ser. 1999, 41: 95-98.
20. Pei J, Grishin NV: MUMMALS: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res. 2006, 34 (16): 4364-74. 10.1093/nar/gkl514.
21. Felsenstein J: Phylogenies from molecular sequences: inference and reliability. Annu Rev Genet. 1988, 22: 521-65. 10.1146/annurev.ge.22.120188.002513.
22. Tamura K, et al: MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol. 2007, 24 (8): 1596-9. 10.1093/molbev/msm092.
23. Korber B, Myers G: Signature pattern analysis: a method for assessing viral sequence relatedness. AIDS Res Hum Retroviruses. 1992, 8 (9): 1549-60. 10.1089/aid.1992.8.1549.
24. Vacic V, Iakoucheva LM, Radivojac P: Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics. 2006, 22 (12): 1536-7. 10.1093/bioinformatics/btl151.
25. Eddy SR: Profile hidden Markov models. Bioinformatics. 1998, 14 (9): 755-63. 10.1093/bioinformatics/14.9.755.
26. Wistrand M, Sonnhammer EL: Improved profile HMM performance by assessment of critical algorithmic features in SAM and HMMER. BMC Bioinformatics. 2005, 6: 99. 10.1186/1471-2105-6-99.
27. Nousbaum J, et al: Prospective characterization of full-length hepatitis C virus NS5A quasispecies during induction and combination antiviral therapy. J Virol. 2000, 74 (19): 9028-38. 10.1128/JVI.74.19.9028-9038.2000.
28. Li W, Han J, Pei J: CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01 San Jose. 2001.
29. Liu B, Hsu W, Ma Y: Integrating classification and association rule mining. KDD New York. 1998.
30. Torres-Puente M, et al: Hepatitis C virus and the controversial role of the interferon sensitivity determining region in the response to interferon treatment. J Med Virol. 2008, 80 (2): 247-53. 10.1002/jmv.21060.
31. El-Hefnawi M, et al: An integrative in silico model of Hepatitis C Virus non structural 5a protein. BIOCOMP. 2009.
32. Wang D, et al: A comparison of three computational modelling methods for the prediction of virological response to combination HIV therapy. Artif Intell Med. 2009, 47 (1): 63-74. 10.1016/j.artmed.2009.05.002.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.