The evolution of human influenza A viruses from 1999 to 2006: A complete genome study

Background Knowledge about the complete genome constellation of seasonal influenza A viruses from different countries is valuable for monitoring and understanding of the evolution and migration of strains. Few complete genome sequences of influenza A viruses from Europe are publicly available at the present time and there have been few longitudinal genome studies of human influenza A viruses. We have studied the evolution of circulating human H3N2, H1N1 and H1N2 influenza A viruses from 1999 to 2006, we analysed 234 Danish human influenza A viruses and characterised 24 complete genomes. Results H3N2 was the prevalent strain in Denmark during the study period, but H1N1 dominated the 2000–2001 season. H1N2 viruses were first observed in Denmark in 2002–2003. After years of little genetic change in the H1N1 viruses the 2005–2006 season presented H1N1 of greater variability than before. This indicates that H1N1 viruses are evolving and that H1N1 soon is likely to be the prevalent strain again. Generally, the influenza A haemagglutinin (HA) of H3N2 viruses formed seasonal phylogenetic clusters. Different lineages co-circulating within the same season were also observed. The evolution has been stochastic, influenced by small "jumps" in genetic distance rather than constant drift, especially with the introduction of the Fujian-like viruses in 2002–2003. Also evolutionary stasis-periods were observed which might indicate well fit viruses. The evolution of H3N2 viruses have also been influenced by gene reassortments between lineages from different seasons. None of the influenza genes were influenced by strong positive selection pressure. The antigenic site B in H3N2 HA was the preferred site for genetic change during the study period probably because the site A has been masked by glycosylations. Substitutions at CTL-epitopes in the genes coding for the neuraminidase (NA), polymerase acidic protein (PA), matrix protein 1 (M1), non-structural protein 1 (NS1) and especially the nucleoprotein (NP) were observed. The N-linked glycosylation pattern varied during the study period and the H3N2 isolates from 2004 to 2006 were highly glycosylated with ten predicted sequons in HA, the highest amount of glycosylations observed in this study period. Conclusion The present study is the first to our knowledge to characterise the evolution of complete genomes of influenza A H3N2, H1N1 and H1N2 isolates from Europe over a time period of seven years from 1999 to 2006. More precise knowledge about the circulating strains may have implications for predicting the following season strains and thereby better matching the vaccine composition.


Background
Every year the influenza A virus causes human infection with varying severity depending on the host acquired immunity against the particular virus strain. Three to five million people experience severe illness and 0.25 to 0.5 million people die of influenza yearly worldwide (WHO EB111/10). The influenza virus evades host immunity by accumulation of point mutations (drift) in the major surface glycoproteins, haemagglutinin (HA) and neuraminidase (NA) or by reassortment of segments from different viruses co-infecting the same cell leading to a new stain with a HA (and NA) not seen in the population before (shift). In the worst case, shifts may cause pandemics. There have been three pandemics the last hundred years, the Spanish flu in 1918 (H1N1), the Asian flu in 1957 (H2N2) and the Hong Kong flu in 1968 (H3N2). It is believed that new pandemics emerge through shifts with strains from the avian reservoir, as was the case of the pandemics of 1957 and 1968, or by direct introduction of an avian strain into the human population as suggested for the 1918 pandemic [1]. At present only two of the 16 possible HA subtypes (H1 and H3), and two of the nine possible NA subtypes (N1 and N2) are circulating in man. H3N2 and H1N1 influenza A viruses have co-circulated in the human population since the re-emergence of H1N1 in 1977, increasing the possibility for genetic reassortments. The prevalence of the different subtype combinations may vary from season to season. The H3N2 has been the predominant influenza A strain during the last 20 years, with the exception of the 1988-1989 and 2000-2001 seasons where H1N1 infections dominated [2]. In the 2000-2001 season a new reassorted human strain, H1N2, emerged in Europe and became established in the autumn 2001 [3,4]. The new H1N2 subtype was covered by the 2002-2003 H1 and N2 trivalent vaccine components and because both H1 and N2 viruses had circulated the previous years some degree of herd immunity against the new strain was expected. The H1N2 viruses were not associated with severe influenza illness that season. In 2002, a new lineage A/Fujian/411/02(H3N2)-like emerged in Asia and caused significant outbreaks on every continent [5,6].
For the northern hemisphere the WHO issues the recommendation for strains to be included in the trivalent vaccine for the next season based on epidemiological data and antigenic and genetic analyses of circulating strains.
Until the recent release of over 1,800 complete influenza A genome sequences from the Influenza Genome Sequencing Project managed by US National Institute of Allergy and Infectious Diseases [7,8]

Prevalence of influenza A in Denmark from 1999 to 2006
The relative prevalence of influenza virus varies from season to season. Influenza A H3N2 was the dominating strain in Denmark during the last seven years, with the exception of the 2000-2001 season where the H1N1 viruses dominated, as can be seen in Figure 1.

Genetic evolution of influenza A H3N2 viruses
Based on phylogenetic analysis of the HA and NA nucleotide sequences from 1999 to 2006 (Figure 2), ten isolates representative for the phylogenetic clustering of sequences from each subtype in each season, as far as possible, were included in the final HA and NA tree ( Figure 2) and representatives were chosen for complete genome sequencing. Generally the H3N2 HA and NA genes formed seasonal phylogenetic clusters ( Figure 2). However, we observed that strains of different lineages and clusters cocirculated within the same season and that viruses had reassorted with viruses from previous seasons ( Figure 2).
The HA gene of the influenza H3N2 strains from the 1999-2000 season formed a phylogenetic subclade to A/ Moscow/10/99(H3N2) and A/Sydney/5/97(H3N2) (represented by A/Memphis/31/98) (Figure 2), located between A/Moscow/10/99 and A/Panama/2007/99 (not shown). The antigenicity of these strains was A/Moscow/ 10/99(H3N2)-like in a haemagglutination inhibition assay, and will therefore be referred to as Moscow-like throughout this report. The NA and the internal genes were all A/Moscow/10/99(H3N2)-like, with the exception of the matrix (M) gene that clustered as a subclade to the A/New York/55/01-like strains ( Figure 3).   Figure 1). Thirteen isolates from this season were available for sequencing, and all were of the H1N1 subtype. These sequences represented two different co-circulating lineages (Figure 4). Lineage I is A/Bayern/7/ 95(H1N1)-like and lineage II include the H1N1 strains of today and the A/New Caledonia/20/99(H1N1) vaccine reference strain (Figure 4). The phylogenetic trees of NA and the internal genes showed the same topology ( Figure  3 and 4). The lineage II strains are characterised by a deletion K130 in HA (K134 in H3 numbering) ( Table 1)

H1N2 viruses
* Amino acids in brackets indicate less than half but more than two substitutions at the given amino acid position within a season. A single amino acid change in one position is not shown. Amino acids separated by '/' indicate equal substitutions of either amino acid at the given position. Letters in upper case above an amino acid indicate the antigenic site location of the residue. In N1 the upper case letter ' P ' stands for phylogenetically important region (PIR) and the following letters indicate the actual PIR.    Based on the ratio of non-synonymous versus synonymous substitutions none of the influenza A genes were directly influenced by positive selection (dN/dS<1) (Table  3). However, as expected the HA1 region of both H3 and H1 viruses were more influenced by evolutionary pressure (Table 3). We applied FEL and SLAC maximum-likelihood methods to estimate individual positively selected sites in H3N2 HA and NA and added REL for smaller datasets in all genes (se methods section). The FEL method found one site in the H3 protein (n = 204), position 199 (p = 0.046) to be positively selected, while the more conservative SLAC analysis found none. No positive selected sites were predicted for the N2 genes (n = 166) and none in the internal genes (n = 15) estimated by FEL and SLAC. The REL analysis retrieved four sites in the M1 gene to be selected namely positions 208, 211, 218 and 219. No sites in HA (n = 27) and NA (n = 30) of the H1N1 viruses were directly positively selected with any of the three methods of analysis.  [26] and it was anticipated that there would be some extent of herd immunity in the population against this new reassortant.

Genetic evolution of influenza A
The phylogenetic trees of H3N2 HA and NA showed seasonal clusters but also co-circulating lineages within seasons. The introduction of the A/Fujian/411/02(H3N2) strains in 2002-2003 caused a "jump" in the evolution of both HA and NA genes. Many of the substitutions in HA introduced with the A/Fujian/411/02(H3N2)-like viruses have become fixed, probably reflecting a very fit virus. The genetic variation before the 2003-2004 season may have been more influenced by introduction of new viruses through viral migration than adaptive evolution of the genes. A constant rate of drift was not observed for HA but instead periods of change followed by stasis periods. The low dN/dS ratios (0.232 for HA and 0.247 for NA) also indicated that the influenza genes were not directly influenced by positive selection. Reassortments between co-circulating strains and viruses from previous seasons, and introduction of viruses from other parts of the world might play a larger role than natural selection for some seasons, as also observed by others [27,28].  [2,4,25,26]. The reassorted H1N2 viruses possessed only the HA from A/New Caldonia/20/ 99(H1N1)-like viruses and the rest of the genome from A/ Moscow/10/99(H3N2)-like viruses as also reported by others using partial sequences [25]. The H1N2 viruses have been introduced to Denmark from elsewhere and are not a local reassortant.  (Table 1) suggesting they may be important for viral escape from the host immune system and the overall fit of the virus.

Amino acid changes in the haemagglutinins
The A/Fujian/411/02-like(H3N2) viruses did not antigenically match the A/Moscow/10/99(H3N2) strain included in the 2002-2003 vaccine [29,36]. The HA substitution D144N was however not responsible for the antigenic drift of the Fujian-like viruses. It has been shown that only two amino acid changes, H155T and Q156H, specified the antigenic difference from Moscow-like to Fujian-like [37], both are located at antigenic site B. The T155 and H156 amino acids have been maintained in all Danish isolates after the introduction of the Fujian-like viruses. With the A/California/7/04(H3N2)-like viruses in 2004-2005, site 145 has changed from K to S in some isolates and to N in others. S145 has been found at this position before 1999 and N145K was the only cluster-difference substitution between isolates from the seasons 1987/ 1989 and 1992/1995 [38]. It was therefore suggested that substitutions at this site alone could have large antigenic effect. Antigenic site A, located in a loop, makes few contacts with the rest of the structure, therefore 144 and 145 may change drastically without influencing on the overall shape of the HA molecule.
Antigenic site A is supposed to be ideal for antibody binding and for amino acid replacements [33]. The preferred antigenic site in the change from A/Moscow/10/ 99(H3N2) to A/Fujian/411/02(H3N2)-like viruses was site B. This observation is in accordance with previously published data [39]. One region in HA, position 225 to 227, that influence antigenic site D, has changed drastically during the study period from GVS → DVS → DIP → NIP. The influence of antigenic site D was first apparent in the antigenic change from Fujian-like viruses to California-like viruses; however, the antigenic site B was still the preferred site for antigenic change. It has been proposed that a minimum of four substitutions in two or more antibody binding sites are required for an epidemically important strain [40].
The substitution H274Y is the only neuraminidase resistance mutation identified in H1N1 viruses to date [46,47]. Danish isolates did not possess any resistance mutations in the NAs. Neuraminidase inhibitory drugs are rarely used for influenza prophylaxis or treatment in Denmark.

Internal proteins
T-cell epitopes are more conserved than antibody epitopes. Fifteen per cent of the T-cell epitopes are conserved while only 2.7% of the antibody epitopes [48]. The reason for this higher degree of conservation is that 80% of the antibody epitopes are located in the variable glycoproteins HA and NA, while only 40% of the T-cell epitopes are found in these proteins [48]. Recent research has shown some degree of escape from CTL-mediated immunity in addition to escape from neutralizing antibodies [28]. In our dataset we found several substitutions in regions involved in protective T-cell response [48] in NA, PA, M1, NS1 and most in the NP protein. This is not unexpected because most T-cell epitopes are defined for the NP protein and this protein is the main target for the cytotoxic host immune response [49,50]. The extensive variations in the T-cell epitopes during 1999 to 2006 suggest that these regions and the antibody epitopes are working together for efficient escape from the host defence responses.
The M2 proteins from the Danish Wisconsin-like viruses in 2005-2006 possessed the substitution S31N, associated with resistance to matrix inhibitory drugs, like amantadine [22,44,45,51]. These types of drugs are not used for prophylaxis or treatment in Denmark. The S31N substitution is therefore not a local introduced resistance mutation. We cannot exclude that this substitution has arisen by chance, but it is more likely that the mutation has emerged as a resistance mutation in other countries like the USA [52] and Australia [53] where the prevalence of amantadine resistance is high. The resistance may also be related to the increased use of this drug in Asia during the SARS epidemic [21].
The changes G186V and V226I increased egg viral replica-tion of A/Fujian/411/02(H3N2) viruses so did the changes G186V and A196T for A/California/20/ 99(H3N2) viruses [56,57]. On the contrary, others have stated that the V226I change in combination with T155 and H156 do not result in viral recovery in eggs [37,54]. This might explain the delay in the 2006-2007 vaccine production for the northern hemisphere due to egg propagation difficulties with the A/Wisconsin/67/05(H3N2) strain. The influence on replication efficiency by the other substitutions observed at receptor binding sites should be investigated further.
We did not observe amino acids in the N2 NA protein that would decrease virus replication in eggs. The amino acids known to give good replication in eggs (Q119, K136 and Y347) [56] were all present in this dataset.

N-linked glycosylation pattern
Oligosaccharides at the surface proteins HA and NA might have greater impact on viral escape from the immune system than single amino acid changes in the antigenic sites. Oligosaccharides which are recognised as "self" by the host immune system may pose conformational changes in the molecule and mask antigenic sites, thereby prevent binding of host antibodies. The number of N-linked glycosylation sites in the H3 HA protein has increased during the years from only three attachment sites in 1968 [58,59] to ten predicted sites in the Danish isolates after 2004.
The A/Fujian/411/02(H3N2)-like stains from the 2002-2003 season gained a potential glycosylation site at position 144, thereby masking the supposed "key" site for antigenic change [33]. Substitutions at antigenic site B and the predicted N-glycosylation at position 144 in HA antigenic site A together with a stronger NA might have contributed to the increased infectivity of the reassorted Fujian-like viruses of the 2003-2004 season, causing an epidemic in Denmark. We have shown that the preferred antibody epitope for genetic change is antigenic site B for the Danish dataset also reflecting that site A is camouflaged by glycosylation. Thus the antigenic changes at a glycosylated site A may not play a major role in escape from the immune system as long as the glycosylation is present.
Six potential N-glycosylation sites have been conserved in the N2 NA Danish dataset from 1999 to 2003. The majority of isolates from 2003 to 2006 have lost the site at position 93 which is located in a CTL epitope (HLA-A*0201) region of NA [12]. The recent NAs may therefore be more easily recognised by the cytotoxic immune system. We found two predicted N-glycosylation sites (61 and 70) in the N2 NA stalk region in sequences from 1999-2006. Greater density of carbohydrate in the stalk region of NA might reflect a need for proteolytic protection. The two observed carbohydrates in the stalk have been reported for other isolates in other time periods and other continents [58,60]. The stalk region has therefore stayed unchanged and the two sequons seem to be conserved.

Sequence data
As expected a higher dN/dS ratio was observed for the surface glycoproteins, although none were directly influenced by positive selection. We observed that the HA1 region and the M2 protein have a slightly higher global dN/dS ratio than the other genes (Table 3). This is consistent with the findings of others [11,61]. The M2 protein is a membrane ion channel protein on the surface of the virus molecule, a higher dN/dS ratio for this protein compared to the internal proteins is expected. The ratio might, however, be biased because the M2 protein is spliced from M and the dS is suppressed for overlapping regions giving a higher dN/dS ratio [11].
Earlier findings support positive selection at sites involved in receptor and antigen binding [11,62]. Five of the 18 codons in HA proposed by Bush et al., [63] to be under positive selection were found to have changed in our HA dataset of H3N2 since 1999, namely: pos 145, 156, 186, 193 and 226. In our H3 dataset (n = 204) site 199 was influenced by positive selection as calculated by FEL analysis, but no positively selected sites were found applying the more conservative SLAC method. Position 199 may be involved in receptor binding and influence on the virus ability to grow in eggs [10]. In a similar study on a slightly larger dataset (n = 284) positions 220 and 229 were found to be positively selected [11]. Another study found positions 13 and 236 [62]; however, suggested positively selected sites may vary by the dataset applied, method used and the significance level selected for a site to be classified as positively selected. REL analysis identified four sites (208, 211, 218 and 219) in M1 under positive selection pressure. REL analysis tends to give better estimates on small datasets than SLAC and FEL. Sites 211, 218, and 219 were still selected when the bayes factor cut-off was increased from 50 to 200. Further analysis would be needed to determine if these sites actually are positively selected.

Conclusion
There is a need for complete genome analysis of European human influenza A viruses in order to gather a comprehensive picture of the evolution and migration of viruses.
Our results support the suggestion that the evolution of influenza A viruses is more complex than originally believed [28,62]. Local short term evolution of influenza virus is a stochastic process, also involving gene reassortments. It will be interesting to further investigate how viruses from other parts of Europe influence on the evolu-tion of Danish isolates when more full length sequences from Europe are made public.

RNA extraction and full-length one-step RT-PCR
Viral RNA was extracted from 140 µl of human nasal swab suspension or nasopharyngeal aspirate by QIAamp ® Viral RNA Mini Kit (QIAGEN, Germany) as described by the manufacturer or by an automated MagNA Pure LC Instrument applying the MagNa Pure LC Total Nucleic Acid Isolation Kit (Roche Diagnostics, Basel, Switzerland). The different gene segments were amplified by OneStep RT-PCR Kit (QIAGEN) as previously described [64], applying a two minute elongation step for all genes. The primers for RT-PCR were segment specific but subtype universal targeting the highly conserved noncoding RNA regions at the 5'-and 3'-end of each segment [65]. PCR products were purified using the GFX™ PCR DNA and Gel Band Purification Kit (Amersham Biosciences, Germany) prior to sequencing.

Sequencing and phylogenetic analyses
Purified PCR products were sequenced directly. Primer sequences are available upon request. The sequencing reaction was performed by ABI PRISM ® BigDye™ Terminators v3.1 Cycle Sequencing Kit (Applied Biosystems, USA) as described previously [66]. The sequences were developed on an automatic ABI PRISM ® 3130 genetic analyzer (Applied Biosystems) with 80 cm capillaries. Consensus sequences were generated in SeqScape ® Software v2.5 (Applied Biosystems). Sequence assembly, multiple alignment and alignment trimming were performed with the BioEdit software v.7.0.5 [67]. Distance based neighbor joining (NJ) phylogenetic trees and character based maximum parsimony (MP) trees were generated using the Molecular Evolutionary Genetics Analysis (MEGA) software v.3.1 [68]. Maximum likelihood trees were generated by the Phylogenetic Analysis Using Parsimony (PAUP 4.0) Software (Sinauer Associates, Inc.) [69] applying the HKY85 nucleotide model, allowing transitions and transversions to occur at different rates.

Sequence data
The between-seasons nucleotide distance means were computed as the arithmetic mean of all pair wise distances between two seasons in the inter-season comparisons using the MEGA v.3.1 software [68]. The global rate between dN and dS substitutions and the individual sitespecific selection pressure were measured by the likelihood based single likelihood ancestor counting (SLAC) method in Datamonkey (modified Suzuki-Gojobori method) [70,71]. For datasets over 100 sequences (H3 n = 204 and N2 n = 166) the HyPhy package [72] was applied. The estimations are likelihood-based, employing a codon model cross between HKY85 and MG94. To elucidate single, positively selected amino acids, the HA and NA datasets were analysed with SLAC and a two rate fixed effects likelihood (FEL) [73] using a likelihood approach with neighbor joining phylogenetic trees (HyPhy). Sites with dN>dS with a <0.05 significance level for likelihood ratio test (LRT) were implied as positively selected for the large HA and NA datasets. For the small datasets (genes coding for the internal proteins <15 and the H1N1 viruses <30) we additionally ran a random effects likelihood (REL) analysis using an empirical Bayes approach with NJ phylogenetic trees in Datamonkey [70]. This method is expected to calculate positively selected sites more accurately in small datasets. The accepted significance level for a positively selected site was set at <0.1 (two-tailed binominal distribution) for SLAC and FEL analyses and >50 bayes factor for REL.

Prediction of glycosylation sites
Potential N-linked glycosylation sites were predicted using nine artificial neural networks with the NetNGlyc 1.0 Server [74]. A threshold value of >0.5 average potential score was set to predict glycosylated sites. The N-Glycosite prediction tool at Los Alamos [75] was used to visualise the fraction of isolates possessing certain glycosylated sites along the sequence alignment.

Calculation of antigenic distance
The specific measure of antigenic distance between two strains of influenza were calculated as P epitope values by the method suggested by Muñoz, et al., [39]. The P epitope value was calculated as the number of mutations within an antibody antigenic site divided by the number of amino acids defining the site. It is assumed that an antigenic epitope which has the greatest percentage of mutations is dominant, because that epitope is influenced by the greatest selective pressure from the immune system. The P epitope distance is defined as the fractional change between the dominant antigenic epitopes of one strain compared to another. The P epitope Calculator [76,77] was applied for H3 sequences. Residues in antigenic epitopes were collected from several references [9,13,39,40,48,[78][79][80][81][82][83]83,84].