An earlier investigation of phylogenetic relations among 27 complete HCV genomes used maximum likelihood and careful determination of the appropriate nucleotide substitution model, and reported a star-like phylogeny among the six known HCV genotypes. The best substitution model was also found to be the most general. In the earlier study, the 5' UTR was found to have lower phylogenetic signal, a lower evolutionary rate, and greater phylogenetic noise than alternative regions of the HCV genome, including NS5B. Our observations concur with those previously reported. Methodological refinements in our approach include the use of information-based model selection criteria to determine the best nucleotide substitution model, a larger set of complete HCV genomes, the revised nomenclature for subtypes, and formal comparisons between alternative topologies for the purpose of subtype determination.
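As a minimal sketch of what information-based model selection involves, the following Python snippet ranks candidate substitution models by AIC and BIC. It assumes these two criteria; the text above does not name the specific criteria used. The log-likelihood values are invented for illustration and are not results from this study, and the names `CANDIDATES`, `aic`, `bic`, and `rank_models` are hypothetical.

```python
import math

# Hypothetical candidates: name -> (maximized log-likelihood, free parameters).
# The log-likelihoods are made up for illustration only. Parameter counts cover
# the substitution model alone; branch lengths are omitted on the assumption
# that every model is fitted to the same fixed topology, so they would add the
# same penalty to each model.
CANDIDATES = {
    "JC69": (-12000.0, 0),
    "HKY85": (-11500.0, 4),
    "GTR": (-11400.0, 8),
    "GTR+G": (-11000.0, 9),
}


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n_sites):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n_sites) - 2 * log_lik


def rank_models(candidates, n_sites):
    """Return model names ordered from best to worst under AIC and under BIC."""
    by_aic = sorted(candidates, key=lambda m: aic(*candidates[m]))
    by_bic = sorted(candidates, key=lambda m: bic(*candidates[m], n_sites))
    return by_aic, by_bic


if __name__ == "__main__":
    # n_sites is the alignment length, here an arbitrary illustrative value.
    by_aic, by_bic = rank_models(CANDIDATES, n_sites=1000)
    print("AIC ranking:", by_aic)
    print("BIC ranking:", by_bic)
```

Both criteria penalize additional parameters, so the most general model is selected only when its gain in likelihood outweighs the penalty.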
The tree from the Okamoto region of NS5B is a significantly better fit to the HCV whole-genome and polyprotein data than the 5' UTR tree, regardless of the optimality criterion used for phylogenetic inference. Trees obtained from the 5' UTR perform worse at grouping HCV subtypes into clades of the same genotype than do trees from the whole genome, the polyprotein, or the Okamoto region of NS5B. Discordant topologies of maximum-likelihood phylogenetic trees obtained from the 5' UTR and NS5B have been described for a subset of HCV genotypes [14, 15]. The inconsistent ordering of deeply rooted branches among trees from protein-coding regions indicates a basal polytomy whose resolution is contingent on the data available, which accords with the star-like phylogeny of all six known HCV genotypes reported previously [3, 5, 12, 16].
The same evolutionary model used here (GTR with a discrete-gamma distribution of rate variation) has been used previously for likelihood phylogenies of the hepatitis B virus and, with accommodation of invariant sites, for both HIV and HCV. Instantaneous substitution rates (normalized to the G-U rate) are greater among sites in the non-coding 5' UTR than in the regions that encode proteins, even though overall sequence conservation is greater in the UTR (Table 2). In particular, the instantaneous substitution rate between cytidine and uridine is much greater for the 5' UTR than for protein-coding regions. The accelerated C-U (or C-T for DNA sequences) substitution rate has previously been reported and discussed for protein-coding regions, though the rate is even greater for the non-coding terminus than for regions having codon usage constraints. Spontaneous deamination of cytosine to uracil may inflate the C-U substitution rate.
Conservation of single-stranded RNA secondary structure in both coding and non-coding regions of HCV has already been reported [15, 20–23]. The high C-U rate bias may additionally be explained by the formation of non-canonical base pairs between guanosine and uridine in single-stranded RNA molecules, which is consistent with selection to conserve secondary structure, because a mutation from cytidine to uridine is less disruptive to secondary structure formation than other point mutations. The bias may also be explained by the fact that all rates are rescaled such that the G-U rate is unity. A low G-U substitution rate thus inflates other rates. A mutation between G and U is disruptive to RNA secondary structure, because it eliminates the possibility of base pairing without a compensatory mutation elsewhere. Overall, the elevated C-U substitution rate seen for the 5' UTR probably results from several interacting factors.
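The effect of rescaling on the reported rates can be illustrated with a short Python sketch. The rate values below are hypothetical and chosen only to show the arithmetic; they are not the estimates in Table 2, and the function name `normalize_to_reference` is an assumption made for this example.

```python
def normalize_to_reference(rates, reference="G-U"):
    """Rescale relative substitution rates so the reference pair has rate 1."""
    ref = rates[reference]
    if ref <= 0:
        raise ValueError("reference rate must be positive")
    return {pair: rate / ref for pair, rate in rates.items()}


# Hypothetical unnormalized rates for the six nucleotide pairs.
RAW = {"A-C": 1.0, "A-G": 4.0, "A-U": 1.0, "C-G": 1.0, "C-U": 8.0, "G-U": 1.0}

# Same rates, except that the G-U rate is halved.
LOW_GU = {**RAW, "G-U": 0.5}

if __name__ == "__main__":
    print(normalize_to_reference(RAW)["C-U"])     # C-U relative to G-U
    print(normalize_to_reference(LOW_GU)["C-U"])  # doubles when G-U is halved
```

In this sketch the underlying C-U rate is identical in both cases, yet its normalized value doubles when the G-U rate is halved, which is the inflation described above.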
Though the same evolutionary model applies to the non-coding 5' UTR and the Okamoto region of NS5B, the two regions are subject to different constraints. While coding sequences have codon-usage constraints and selective pressure for amino-acid mutations to escape detection by the host immune system, the UTR must preserve long-range interactions with complementary nucleotides at the other terminus of the viral genome if cyclization of the genome is essential to viral replication [6, 20]. Because of these differences in selective regimes, it should not be surprising that the phylogenies of the two regions differ.
HCV diagnostic technologies include serologic (antibody-based) and genetic (sequence-based) techniques to detect infected samples [4, 6, 25]. Population screens, the most commonly deployed genetic HCV tests, benefit from low false-positive rates because they use the conserved 5' UTR as the target for PCR amplification. However, it is clear both from the results of this study and from previous investigations that the 5' UTR does not contain sufficient information to resolve subtypes [26–31]. Phylogenetic signal in protein-coding regions, such as NS5B, provides a useful alternative [12, 32], but few commercial assays exploit this information at present. The "gold standard" for subtype determination is direct sequencing, which has a lower reagent cost but requires more time than commercial assay kits [4, 25].
Subtype classification is further complicated by coinfection [30, 33, 34], recombination [35, 36], within-host evolution [37, 38], and compartmentalization of genotypes into different cell types. Diagnostic assays that are informed by the 5' UTR will be less able to accommodate these difficulties than methods that are able to resolve subtypes.