Identification of novel conserved functional motifs across most Influenza A viral strains
© ElHefnawi et al; licensee BioMed Central Ltd. 2011
Received: 6 November 2010
Accepted: 27 January 2011
Published: 27 January 2011
Influenza A virus poses a continuous threat to global public health. Design of novel universal drugs and vaccines requires a careful analysis of the genomes of different Influenza A strains from diverse hosts and subtypes. We performed a systematic in silico analysis of the viral segments of all available Influenza A strains and subtypes, grouped them by host, subtype, and year of isolation, and used multiple sequence alignments to extract conserved regions, motifs, and accessible regions for functional mapping and annotation.
Across all species and strains, 87 highly conserved regions (conservation percentage ≥ 90%) and 19 functional motifs (conservation percentage = 100%) were found in the PB2, PB1, PA, NP, M, and NS segments. The conservation percentage of these segments ranged between 94 - 98% in human strains (the most conserved), 85 - 93% in swine strains (the most variable), and 91 - 94% in avian strains. The most conserved segment was different in each host (PB1 for human strains, NS for avian strains, and M for swine strains). Target accessibility prediction yielded 324 accessible regions with a single-stranded probability > 0.5, of which 78 coincided with conserved regions. Some of the interesting annotations in these regions included sites for protein-protein interactions, the RNA binding groove, and the proton ion channel.
The influenza virus has evolved to adapt to its host through variations in the GC content and conservation percentage of the conserved regions. Nineteen universal conserved functional motifs were discovered, of which some were accessible regions with interesting biological functions. These regions will serve as a foundation for universal drug targets as well as universal vaccine design.
The influenza A virus is a major threat to world health and economy. The polymerase of this RNA virus lacks proofreading activity, which gives rise to considerable viral variability culminating in the 3 different types A, B and C, in addition to many subtypes based on variations in the hemagglutinin (HA) and the neuraminidase (NA) surface proteins. The influenza genome consists of 8 RNA segments and encodes 10 proteins including the internal structural proteins, nucleocapsid protein (NP), and the two matrix proteins (M1 & M2) [3, 4].
The surface proteins neuraminidase (NA) and hemagglutinin (HA) have been studied extensively, and the antigenic variations in these surface glycoproteins are used to subtype influenza A. Additionally, three of the influenza polypeptides are associated with RNA polymerase activity (PA, PB1, PB2). The RNA binding non-structural protein (NS) contributes to viral pathogenicity and plays a central role in the prevention of the interferon-mediated antiviral response [3, 4].
Genetic reassortment of the Influenza A virus within different hosts (including avian and swine), and antigenic shifts and drifts in the HA and NA proteins, are the cause of widespread pandemics in immunologically unfamiliar populations. These have resulted in serious outbreaks and pandemics, such as those of 1918, 1957, 1968, and 2009. This change in genetic and antigenic composition presents an ever-present challenge for the development of influenza vaccines and antiviral medications.
Bioinformatics has played a major role in several aspects of virology research; these include predicting viral RNA structure, the structural and functional analysis of viral proteins, and immunoinformatics to predict epitopes and reverse vaccinology. Such studies have assisted the development of biomarkers for the diagnosis, staging, and prognosis of viruses (for a review see ). Additionally, computer-aided drug designs have led to the identification and validation of drugs for many major viruses, such as HIV, influenza and HCV, helping the world face the challenges of such major viral diseases with a huge medical care burden [13, 14]. Molecular modelling studies have in addition provided mechanistic explanations for questions such as drug modes of action, virus-receptor interaction, and virus-host interactions. In these lines of research, conserved regions found in viruses, extrapolated from multiple sequence alignments of different strains, were essential in functional prediction through the identification of epitopes and motifs [15–17].
Several studies have addressed different aspects of the influenza virus, including its evolution, structure, and function, to delineate the molecular mechanisms of pathogenicity and continuous resistance to immune response. Several previous studies performed phylogenetic analysis and addressed the evolution of one or more Influenza A viral segments. Additionally, methodical analysis of the whole genome has identified co-occurrence of mutation networks and other properties, such as relative codon usage (rscu) and codon usage patterns (cup), as features of Influenza evolution. Motif prediction in the HA influenza genes and proteins has been previously conducted.
Our study is a comprehensive, systematic, comparative nucleotide genomic analysis that complements prior analyses. It utilizes complete influenza viral segments isolated from different hosts (humans, avians, swine, and a fourth group for all other hosts) that belong to different HA and NA subtypes and come from different geographic regions and years. The main theme of the current study is genome conservation among different strains. To this end, we used all available complete segment sequences from the NCBI's Influenza Virus Resource database to enable a reasonable comparative analysis between the three main hosts (human, swine, and avian) and to highlight regions that could serve as targets for universal drug and vaccine design. The need for high sequence conservation as a prerequisite of efficient siRNA design for the Influenza A virus has been highlighted previously. The identification of conserved regions in the influenza M gene has been previously reported.
In the current study, a meta-analysis of the Influenza A viral genome segments from different hosts, different subtypes, and different geographic regions is performed. Genomic conserved regions across all diverse strains and hosts are extracted by multiple sequence alignments and the conservation percentage is calculated. An analysis of the inter- and intra-host segmental genomic variability of Influenza A viral segments for human, avian, and swine hosts, and of the GC percentage of the segments in the different hosts, is also conducted. Completely conserved genomic functional motifs are identified and analysed through functional annotation. This work will not only provide understanding of the natural selection of the Influenza A virus, but will also serve as a foundation for gene therapy and for novel Influenza A universal drug and vaccine design targeting highly conserved regions with crucial functions. Moreover, the bioinformatics sequence analysis workflow presented and applied here could be used for research into the evolution of viruses and the design of universal drug targets.
Preprocessing and alignment of Influenza sequences.
Table 1. Number of sequences downloaded and utilized in this study for each of the influenza viral segments, together with the number of conserved regions, the longest conserved region, and the conserved region with the highest conservation percentage in each segment.
Segment; Longest conserved region in each segment; Conserved region with highest conservation percentage
PB2; Region 12 (nt 2165 to 2317); PB2-12 (positions 2165 to 2317)
PB1; Region 2 (nt 230 to 493); PB1-22 (positions 2012 to 2064)
PA; Region 7 (nt 690 to 677); PA-6 (positions 621 to 677)
NP; Region 1 (nt 62 to 161); NP-13 (positions 1447 to 1486)
M; Region 8 (nt 733 to 1000); M-6 (positions 599 to 663)
NS; Region 6 (nt 492 to 700); NS-1 (positions 59 to 137)
Grouping the sequences by the host from which they were isolated enabled an analysis of inter- and intra-species conservation and variability. A comparison of the inter- and intra-host alignments of the influenza segments using the Plotcon and Infoalign tools shows that swine strains are the most variable (similarity plots illustrated in figures 1 and 2). This result was expected, since the swine strains can mix with both avian and human influenza strains. The human strains are the most conserved except in segment PB2, where the avian strains are more conserved. The conservation percentage of the segments ranged between 94 - 98% in human strains (the most conserved), 91 - 94% in avian strains, and 85 - 93% in swine strains (the most variable). Intra-segmental comparisons, on the other hand, reveal that the PB1 segment is the most conserved in human strains (98.1%), followed by PB2, NP, PA, NS, and finally the M segment. For the avian host strains, the NS and M segments show the most conservation (94.5%), followed by the PB1, NP, PA, and finally the PB2 segment. In the swine strains, the M segment shows the most conservation, followed by the segments NP, PB2, PA, and finally PB1.
Table 2. The influenza genome segments conservation percentage and GC percentage in the different hosts
Table 3. Evolutionary highly conserved motifs in Influenza A virus
Columns: start position (consensus sequence); end position (consensus sequence); H5N1 start position; H5N1 end position; repeated positions on H5N1; mapping on conserved regions
Position entries: (2238 - 2243); (1243 - 1248); (1246 - 1251); (1624 - 1629); (1994 - 1999); (101 - 106); (368 - 373)
We found four motifs in the PB2 segment; motif 2 (GAAACG) is repeated twice in the H5N1 reference sequence, and motifs 2, 3, and 4 were previously identified as part of a conserved region involved in RNA packaging. Interestingly, motif 3 also partially overlaps the nuclear localization signal (NLS). In segment PB1, four motifs were found; motif 1 (ATGATG) is repeated five times and motif 3 (GAGATC) is repeated twice on the H5N1 reference sequence. In PA, two motifs were identified that overlap with RNA packaging annotations. Segment M contains six motifs; motif 4 is the longest (CTCACCGTGCCCAGTGA). In segment NS, three motifs were found, and motif 3 (AATGGA) is repeated three times on the H5N1 reference sequence.
Functional annotation of the conserved regions and motifs was also performed by mapping of the regions and motifs on the 3D structure. Structural mapping of these conserved regions on the available influenza domains from PDB revealed many interesting functions, explaining their selection for conservation.
Analysis of many conserved regions in PB2 and PA revealed that they are mostly on the surface and are involved in protein-protein interactions. The same applies to the NP protein. Interestingly, conserved regions 5, 6, and 7 together form the RNA binding groove (ElHefnawi et al., submitted).
There are three large conserved regions (Cr2, Cr6, Cr9) found on the NS1 protein (PDB id: 3F5T), which is expressed by segment 8 of the virus genome (represented in figure 4b). They lie mainly on the surface of the protein and may play important roles in the binding of different molecules and ligands that contribute to the promiscuity of NS1 in its immune counterattack mechanisms. Clefts found in these conserved regions could bind to different immune system components (Figure 4c). Cr6 contains four functional motifs with sequences of (AGGTAGA, AGGATGTCAA and three motifs of the sequence AATGGA). The immune system interception functions of the NS1 protein are quite similar to those of the NS5A protein of Hepatitis C virus, which was shown previously to have different immune system counterattack mechanisms. This is an interesting property of many viruses that deserves further analysis.
We also assessed accessible regions and mapped them to conserved regions to infer their potential use as drug targets. Accessibility is a critical factor; for example, at least half of the siRNA target region needs to be accessible, preferentially at the terminal ends. Therefore, the accessibility of the segments was calculated using the SFOLD server. We located 324 accessible regions on six segments and mapped them to conserved regions (Figure 3). The number of accessible regions that mapped to conserved regions was 10 in PB2, 24 in PB1, 16 in PA, 14 in NP, 7 in M, and 7 in NS. The accessible regions that overlap with functional motifs are presented in figure 3 and additional file 9.
This in silico study analyzed Influenza A virus genome segments available in the Influenza Virus Resource at NCBI and grouped them according to host, strain, and year to determine conserved regions across all species studied. The higher variability in the influenza sequences isolated from swine hosts suggests greater hazards in future pandemics. The higher GC percentage of Influenza sequences infecting avian hosts indicates adaptation to the higher host temperature. The evolution of the influenza virus is driven by adaptation mechanisms to its host. Highly conserved functional motifs and accessible regions were identified across all sequences: eighty-seven conserved regions, nineteen functional motifs, and many potentially accessible regions. These data on the Influenza A virus segments were utilized in the optimal design of universal therapeutic small interfering RNA molecules. The complete workflow, including the siRNA design and selection figure, will be presented in the next publication (ElHefnawi, submitted) and can help in other future drug and vaccine design.
Complete sequences for all segments of Influenza A virus were downloaded in groups using the advanced database search at the NCBI's Influenza Virus Resource. We utilized the entire nucleotide sequences, in addition to the coding sequences, for the following segments: PB2, PB1, PA, NP, M, NS, HA, and NA. We utilized approximately 30,000 influenza sequences for the eight segments. The number of sequences utilized from each segment is represented in Table 1.
To facilitate the analysis process we divided each segment based on the infected host as follows:
Swine strain sequences
H9 and Mixed strains
H10, H11, H12, H13, H14, H15, H16
H9, H7, H5
H1 strains were further subdivided, based on the year of isolation, into the following two subcategories: H1 strains isolated between 1918 and 2000, and H1 strains isolated between 2001 and 2007.
H3 strains were further subdivided, based on the year of isolation, into the following three subcategories: H3 strains isolated between 1968 and 1998, between 1999 and 2002, and between 2003 and 2007.
Miscellaneous: all other strains infecting species other than avian, human and swine.
The above categorization of the sequences facilitated the management of the data, allowed the identification of diversity in the sequences based on the host and year of isolation, and helped in the determination of conservation amongst strains. This categorization allowed us to conduct comparative mutational analysis in all segments, followed by the calculation of conservation percentage. Such subtype classification according to the immunological nature of strains, and identification of the similarity of structural proteins across strains, combined with sub-categorization at the nucleotide level, will facilitate drug design tasks such as siRNA data mining.
The program MUSCLE version 3.6 was used to align the primary sequence groups. The resulting group alignments were then aligned to each other by profile-profile alignment using the same MUSCLE 3.6 program.
First, alignments were performed on strains isolated from the same host, as discussed above, so that avian strains were aligned separately from human and swine strains. Second, human and swine strain sequences were aligned, the resulting file was aligned with the avian sequence file, and then with all other host strains. This order was followed because human and swine strains are generally more homologous to each other than to avian strains. For similar reasons, the avian strains were added before the other host species. Because it is based on phylogenetic distances, this order of alignment enhances the finding of conserved regions and facilitates the management of diversity in the sequences.
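The alignment order described above can be sketched as a small command builder. This is a minimal illustration and not the authors' scripts: the file names and the `out_prefix` parameter are hypothetical, and the command lines assume the `-in`/`-out` and `-profile -in1 -in2` options documented for MUSCLE 3.x. The function only builds the commands; it does not run MUSCLE.

```python
from typing import List


def muscle_commands(group_files: List[str], out_prefix: str = "merged") -> List[List[str]]:
    """Build MUSCLE 3.x command lines for a fixed, progressive alignment order.

    Each group file is aligned on its own first; the group alignments are then
    merged one after another by profile-profile alignment, in the order given.
    File names and out_prefix are illustrative assumptions.
    """
    commands: List[List[str]] = []
    aligned: List[str] = []

    # Step 1: align each host group separately.
    for path in group_files:
        out = path + ".aln"
        commands.append(["muscle", "-in", path, "-out", out])
        aligned.append(out)

    # Step 2: merge the group alignments in order (e.g. human, swine, avian, other).
    current = aligned[0]
    for step, nxt in enumerate(aligned[1:], start=1):
        out = f"{out_prefix}_{step}.aln"
        commands.append(["muscle", "-profile", "-in1", current, "-in2", nxt, "-out", out])
        current = out
    return commands


if __name__ == "__main__":
    for cmd in muscle_commands(["human.fa", "swine.fa", "avian.fa", "other.fa"]):
        print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run` on a machine where MUSCLE is installed; keeping command construction separate from execution makes the merge order easy to inspect.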
The BIOPERL modules were used for automating the analysis of the alignments using different tools from EMBOSS such as GeeCee, Logobar, Infoalign, Cons and Plotcon. Scripts were written for each of these tools and run under the Biolinux operating environment. These scripts are available upon request. The consensus sequence for each segment was calculated using the Cons tool from EMBOSS and submitted to Genbank.
Conservation and variability across the eight IAV segments in the different hosts were studied by plotting the conservation of the alignments using the Plotcon tool from EMBOSS. Additionally, the Infoalign tool from EMBOSS was used to calculate the conservation percentage of the segments in the different hosts in order to study inter-species and intra-host variability (Table 2). The GC % for each segment was also calculated using the GeeCee tool from EMBOSS, as shown in Table 2.
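To make the two quantities in Table 2 concrete, the sketch below computes a GC percentage and a simple conservation percentage. It is a simplified stand-in, not the GeeCee or Infoalign implementation: here conservation is assumed to be the share of alignment columns in which all sequences carry the same nucleotide, which may differ from how Infoalign reports it.

```python
from typing import List


def gc_percent(seq: str) -> float:
    """Percentage of G and C among the unambiguous nucleotides of a sequence."""
    bases = [b for b in seq.upper() if b in "ACGTU"]
    if not bases:
        return 0.0
    return 100.0 * sum(b in "GC" for b in bases) / len(bases)


def percent_identical_columns(alignment: List[str]) -> float:
    """Percentage of alignment columns that are identical in every sequence.

    Assumes all rows have the same length; columns containing a gap ('-')
    are counted as not conserved.
    """
    ncols = len(alignment[0])
    identical = 0
    for i in range(ncols):
        column = {row[i].upper() for row in alignment}
        if len(column) == 1 and "-" not in column:
            identical += 1
    return 100.0 * identical / ncols


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "ACGT", "ACGT"]  # toy alignment, not real data
    print(gc_percent("GGCCAATT"), percent_identical_columns(toy))
```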
Conserved nucleotide regions were extracted using the Bioedit program.
Mining for conserved sequences among the aligned sequences was performed by determining the entropy of regions at least 21 nucleotides in length with a maximum of 2 mismatches. We therefore defined a region as conserved if it contained 19 identical contiguous nucleotides in all strains, allowing up to 2 additional mismatched nucleotides (21 nucleotides in total).
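One way to read the criterion above is as a sliding-window scan over the alignment. The sketch below is an illustration under that assumption, not the Bioedit procedure: a 21-column window is accepted when at most 2 of its columns vary between sequences, and overlapping accepted windows are merged into one region. Coordinates are 0-based with an exclusive end.

```python
from typing import List, Tuple


def variable_columns(alignment: List[str]) -> List[bool]:
    """True for every alignment column in which the sequences differ."""
    return [len({row[i] for row in alignment}) > 1 for i in range(len(alignment[0]))]


def conserved_regions(alignment: List[str], window: int = 21,
                      max_mismatch: int = 2) -> List[Tuple[int, int]]:
    """Merge all windows with at most max_mismatch variable columns into regions."""
    var = variable_columns(alignment)
    n = len(var)
    regions: List[Tuple[int, int]] = []
    count = sum(var[:window])  # variable columns in the first window
    for start in range(n - window + 1):
        if start > 0:
            # Slide by one column: add the entering column, drop the leaving one.
            count += var[start + window - 1] - var[start - 1]
        if count <= max_mismatch:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    toy = ["A" * 30, "CCC" + "A" * 27]  # toy alignment with 3 variable columns
    print(conserved_regions(toy))
```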
Entropy calculation was followed by checking the number of mismatches in each of our identified conserved regions. The conserved regions were mapped to the 8 segments on the influenza virus as illustrated in figure 3 and additional file 7.
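The entropy referred to here is usually the Shannon entropy of each alignment column, which is 0 for a fully conserved column and rises with variability. The source does not give the exact formula or software settings used, so the following is a generic sketch with the logarithm taken to base 2.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column.upper())
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropy(alignment: List[str]) -> List[float]:
    """Per-column entropy for an alignment whose rows all have the same length."""
    ncols = len(alignment[0])
    return [column_entropy("".join(row[i] for row in alignment)) for i in range(ncols)]


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "ACCT", "ACGT"]  # toy alignment, not real data
    print([round(h, 3) for h in alignment_entropy(toy)])
```

Low-entropy stretches are candidates for conserved regions; the mismatch count per region can then be checked separately, as described above.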
Logo bars for all conserved regions were generated using the Logobar tool (additional file 8). The conservation percentage of every conserved region was calculated using Infoalign from EMBOSS and tabulated in additional file 7.
One-hundred-percent conserved motifs of a minimum length of 6 bp in all IAV segments were extracted using the BIOEDIT program. The motifs were mapped to the H5N1 reference genome and to the conserved regions (Table 3). The H5N1 avian flu reference sequence was also checked for other occurrences of these motifs. The perfect conservation of these motifs suggests biological significance and a potential role in the Influenza life cycle.
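The motif extraction and the search for repeated occurrences can be sketched as follows. This is an illustrative reimplementation, not the BIOEDIT procedure: a motif is taken to be a run of at least 6 gap-free columns that are identical in every aligned sequence, and occurrences on a reference sequence are found by exact string matching that allows overlaps. Positions are 0-based.

```python
from typing import List, Tuple


def perfect_motifs(alignment: List[str], min_len: int = 6) -> List[Tuple[int, str]]:
    """Return (start, sequence) for each run of >= min_len fully conserved columns."""
    ncols = len(alignment[0])
    motifs: List[Tuple[int, str]] = []
    run_start = None
    for i in range(ncols + 1):  # one extra step closes a run at the alignment end
        conserved = (
            i < ncols
            and alignment[0][i] != "-"
            and len({row[i] for row in alignment}) == 1
        )
        if conserved and run_start is None:
            run_start = i
        elif not conserved and run_start is not None:
            if i - run_start >= min_len:
                motifs.append((run_start, alignment[0][run_start:i]))
            run_start = None
    return motifs


def occurrences(reference: str, motif: str) -> List[int]:
    """All start positions of motif in reference, including overlapping hits."""
    hits: List[int] = []
    pos = reference.find(motif)
    while pos != -1:
        hits.append(pos)
        pos = reference.find(motif, pos + 1)
    return hits


if __name__ == "__main__":
    toy = ["ATGATGCC", "ATGATGAC"]  # toy alignment, not real data
    print(perfect_motifs(toy), occurrences("ATGATGATG", "ATGATG"))
```

Allowing overlapping hits matters for short repeats such as ATGATG, where two occurrences can start three nucleotides apart.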
Functional annotation of the conserved regions and functional motifs was performed by mapping them onto the PDB 3D protein structures of their segments and using the annotations available for these proteins from the PDBsum server. After downloading the relevant structure files, we highlighted the conserved regions on the structure to show their positions and configuration. We then used the annotations from PDBsum to link the regions with their associated functions. Annotation at the genome level was performed using Rfam in order to search for conserved regions in RNA structures with specific annotations.
The SFOLD tool was used to calculate the target accessibility of the Influenza segments using the consensus sequence for each segment calculated from the multiple sequence alignment. A region was considered accessible if the average single-stranded probability from Sfold was greater than 0.5 over at least 9 consecutive nucleotides. The results are tabulated in additional file 9, and the regions that map to conserved regions are highlighted in figure 3.
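The accessibility criterion and the mapping onto conserved regions can be illustrated with a short sketch. It assumes that per-nucleotide single-stranded probabilities have already been obtained (for example from Sfold output) as a plain list of floats, that the criterion means a mean probability above 0.5 within a 9-nucleotide window, and that overlapping windows are merged; these are reading assumptions, not details confirmed by the source. Coordinates are 0-based with an exclusive end.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int]


def accessible_regions(probs: Sequence[float], window: int = 9,
                       threshold: float = 0.5) -> List[Region]:
    """Merge all windows whose mean single-stranded probability exceeds threshold."""
    regions: List[Region] = []
    for start in range(len(probs) - window + 1):
        mean = sum(probs[start:start + window]) / window
        if mean > threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions


def overlaps(a: Region, b: Region) -> bool:
    """True if two half-open intervals share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def accessible_in_conserved(accessible: List[Region],
                            conserved: List[Region]) -> List[Region]:
    """Accessible regions that overlap at least one conserved region."""
    return [a for a in accessible if any(overlaps(a, c) for c in conserved)]


if __name__ == "__main__":
    toy_probs = [0.1] * 5 + [0.9] * 9 + [0.1] * 5  # toy values, not Sfold output
    acc = accessible_regions(toy_probs)
    print(acc, accessible_in_conserved(acc, [(10, 25)]))
```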
This work was partially funded by an American University in Cairo (AUC) Research Grant to RS and a Yousef-Jameel Science and Technology Research Centre (YJ-STRC) at AUC grant to SZ. We acknowledge the effort of the Information Technology Institute intake 30 Bioinformatics track graduate students who helped in the tabulation of the conservation percentages in hosts and segments.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.