Regions identity between the genome of vertebrates and non-retroviral families of insect viruses

Background Our understanding of the evolutionary history shared by viruses and animals is limited. The recent availability of many complete insect virus genomes and vertebrate genomes, together with the ability to screen these sequences, makes it possible to gain new insight into the evolutionary interaction between insect viruses and vertebrates. This study aims to determine whether regions of sequence identity exist between the genomes of insect viruses and vertebrates, to explain this phenomenon in terms of mobile genetic elements, and to investigate the evolutionary relationships among these short regions of identity across species. Results Some of the insect viruses studied contain variable numbers of short regions of sequence identity to vertebrate genomes, with nucleotide sequence lengths from 28 bp to 124 bp. These regions are located at multiple sites in the vertebrate genomes. The ontology of the animal genes containing identical regions involves several processes, including chromatin remodeling, regulation of apoptosis, signaling pathways, nervous system development and some enzyme-like catalysis. Phylogenetic analysis reveals that at least some short regions of sequence identity in vertebrate genomes are derived from the ancestors of insect viruses. Conclusion Short regions of sequence identity were found in vertebrates and insect viruses. These sequences played an important role not only in the long-term evolution of vertebrates but also in promoting the persistence of insect viruses. This typical win-win strategy may result from natural selection.


Background
The interaction between viruses and animals is profound and complex. Previous studies have greatly deepened our understanding of their long-term evolutionary history in terms of genome sequence. Viruses have a highly host-associated life cycle. As a result, they infect and occasionally integrate into the chromosomes of germ line cells and are then inherited vertically as host alleles [1,2]. A growing number of viral nucleotide sequences have been, and continue to be, found in their respective host species. These remnants of ancient viral infections are important both as unforeseen sources of genomic novelty in their hosts [1,3] and as molecular fossils that advance our knowledge of the evolutionary process between viruses and animals [4]. Some of these regions of sequence identity in host species have been linked to several pathways, including cell adhesion, Wnt signalling [5] and immunomodulation [6], as well as to mammalian reproduction [7].
However, most of these discoveries were addressed only from the perspective of virus-host interaction, which may narrow our view of the links between viruses and animals.
Here, in a broader sense, we aimed to identify possible regions of identity between the genomes of vertebrates and non-retroviral families of insect viruses, and the possible role(s) of the identical sequences in the evolution of the corresponding animal(s). We also report a phylogenetic analysis of these identical sequences. We show that at least some of the regions of sequence identity identified here in vertebrate chromosomes are likely to have come from insect viruses and to have been exapted during long-term evolution.

Results
We screened several hundred insect viruses, including DNA viruses and RNA viruses, against 21 vertebrates. Dozens of short regions of sequence identity were found between animals and viruses, including double-stranded DNA viruses and double-stranded RNA viruses (Table 1). In our study, more short regions of sequence identity were found to DNA viruses than to RNA viruses, as was also reported in a previous study [8]. Ranging from 28 bp to 124 bp, these regions of identity were found in both possible orientations in the respective animals. Most of these regions were found in intergenic regions of the genomes, and some were within introns. Occasionally, however, regions of identity were also found within gene-coding regions. For example, in the duck-billed platypus, sequence identity to Phthorimaea operculella granulovirus occurred within an exon encoding a protein similar to ubiquitin. Regions of sequence identity that had copied themselves and reinserted into the animal genome were also found. In addition, two distinct short regions of sequence identity to the same virus sometimes occurred in a single animal genome, suggesting that more than one distinct short region derived from a virus invaded and became fixed in that genome. For example, in the zebra finch, two distinct short regions of identity to Choristoneura occidentalis granulovirus were found within the genome [GenBank:NW_002197778.1], with respective E-values of 4e-23 and 1e-14.

The relationship between pseudo-genes and regions of sequence identity
The observation that a large number of the identified regions were located near or within pseudo-genes drew our attention and prompted us to investigate the relationship between the regions of sequence identity and pseudo-genes. To this end, we calculated the distance between the pseudo-genes and the end(s) of the regions of identity as described in Methods. Figure 1 shows the relationship between the distance from the ends of a short region of identity to the related pseudo-gene and the percentage of pseudo-genes within that distance. In our study, 7 out of 76 pseudo-genes harbor short regions of sequence identity. As a rough rule, most of the pseudo-genes lie within 1000 kb of the ends of the short regions of identity. Table 2 shows the important roles that genes containing regions of sequence identity play in the evolution of vertebrates, ranging from chromatin remodeling, the mitotic cell cycle, signaling pathways and gene switching to signal transduction, cell-cell adhesion and nervous system development.
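The distance-versus-percentage relationship summarized in Figure 1 can be illustrated with a small calculation. The following is a minimal Python sketch, not the authors' code: the function name `percent_within` and the toy distances are assumptions made purely for illustration and are not data from this study.

```python
from typing import List, Sequence


def percent_within(distances_bp: Sequence[int],
                   thresholds_bp: Sequence[int]) -> List[float]:
    """For each threshold, return the percentage of pseudo-genes whose
    distance to the nearest end of a region of identity is <= threshold."""
    if not distances_bp:
        raise ValueError("at least one distance is required")
    total = len(distances_bp)
    return [100.0 * sum(d <= t for d in distances_bp) / total
            for t in thresholds_bp]


# Hypothetical distances in bp (0 means the pseudo-gene harbors the region).
toy_distances = [0, 500, 20_000, 800_000, 1_500_000]
toy_thresholds = [1_000, 100_000, 1_000_000, 2_000_000]

for t, pct in zip(toy_thresholds, percent_within(toy_distances, toy_thresholds)):
    print(f"within {t:>9} bp: {pct:.0f}% of pseudo-genes")
```

Plotting the returned percentages against the thresholds would give a cumulative curve of the kind the figure describes.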

Phylogenetic analysis
The screen of vertebrate genomes unexpectedly unearthed short regions of sequence identity to insect viruses, leading us to speculate about the evolutionary relationships among these sequences. Phylogenetic comparisons of these regions of sequence identity were therefore performed as described in Methods.

Sequence identity to Adoxophyes orana NPV
Significant BLAST hits to Adoxophyes orana NPV were sequences from species including mammals, viruses, fungi and bacteria (Figure 2). Sequences from Oryctolagus cuniculus, Cafeteria roenbergensis virus BV-PW1, Penicillium chrysogenum Wisconsin 54-1255, Dictyostelium purpureum and Adoxophyes orana NPV formed a single group with robust bootstrap support (100%), suggesting that they are likely derived from the same lineage. Cafeteria roenbergensis virus has the largest genome of any described marine virus and infects a widespread marine phagocytic protist [9]. The argument that Cafeteria roenbergensis virus belongs to a fourth domain of life is supported by a recent study [10].

Sequence identity to Choristoneura occidentalis granulovirus
Sequences matching Choristoneura occidentalis granulovirus were all identified in insects (Figure 3). In the phylogeny, these short regions of identity grouped into two clades, the larger of which included the matches from insect genomes, suggesting that they are from the same ancestral lineage. The sequence derived from Choristoneura occidentalis granulovirus formed a single clade. It is hard to know whether the sequences from insects originated from a distinct Choristoneura occidentalis granulovirus lineage.

Sequence identity to Culex nigripalpus baculovirus
We identified highly significant matches to Culex nigripalpus baculovirus in the genomes of plants, mammals and insects (Figure 4). The phylogeny grouped mouse and Drosophila willistoni with Culex nigripalpus baculovirus with robust support (100%), suggesting that they are likely derived from the same exogenous lineage.

Sequence identity to Cydia pomonella granulovirus
Significant matches to Cydia pomonella granulovirus are short regions identified in a broad range of lineages, including chordates, fungi, insects, vertebrates, protozoa and plants (Figure 5). Curiously, Cyprinus carpio, Mus musculus, Theragra chalcogramma and some other species grouped together into a larger, well-supported clade with Cydia pomonella granulovirus, while mouse, Rattus, Schistosoma mansoni and Drosophila melanogaster as well as Candida albicans grouped into a smaller clade. Given that closely related species do not group into the same clade, the initial flow of nucleotide sequences from Cydia pomonella to the ancestor of Mus musculus at least postdated the split of Mus musculus and Rattus norvegicus, which occurred about 10 million years ago [11].

Sequence identity to Leucania separata
Matches to Leucania separata were sequences from different species, ranging from fungi, mammals, bacteria and protozoa to insects (Figure 6). Interestingly, sequences from mouse and Leucania separata formed a single group with robust bootstrap support (97%), suggesting that they are likely derived from the same ancestral lineage. The regions of sequence identity from Mus musculus, Rattus norvegicus, fungi and bacteria may derive from distinct Leucania separata lineages.

Discussion
To broaden our understanding of the interaction between viruses and animals, we searched the genomes of the 21 currently available vertebrates for regions of sequence identity to insect viruses, and unearthed numerous short regions of sequence identity in diverse animals. Chance matches were ruled out by performing reciprocal BLAST. With sequence lengths from 28 to 124 bp, most of them are non-functional; occasionally, however, some lie within exons. The mechanism by which nucleotide sequences flowed from ancestral insect viruses to vertebrates is still unclear. A possible explanation is transfer by mobile genetic elements such as viruses and phages as well as plasmids. An earlier study shows that viruses move between different biomes and that the total number of viruses largely exceeds the number of cells [12]. In our data, short regions of sequence identity to viruses are also found in bacteria; for example, in the case of Leucania separata, a short region of identity is found in Ajellomyces capsulatus. In addition, short regions of sequence identity in the genomes of bacteria and bacteriophages as well as humans were identified recently [13]. Further study is still warranted.
Figure 1 The relationship between regions of sequence identity and the rate of nearby pseudo-genes. Axes: the distance between the short regions of identity and nearby pseudogenes; the percent of pseudogenes.
The fate of most acquired nucleotide sequences in animal chromosomes has been deletion by homologous recombination [14]. However, the deletion rate decreases dramatically with age [14], so that eventually only a few fragments of these sequences became fixed in the genomes of germ line cells and were passed vertically from parent to offspring. These acquired sequences undoubtedly play a pivotal role in shaping vertebrate genomes. The products of genes containing the short regions of sequence identity are involved in chromatin remodeling, regulation of apoptosis, signaling pathways, nervous system development and some enzyme-like catalysis. On the one hand, these products take part in the formation of vertebrates and help to promote their evolution. On the other hand, they also play an important role in promoting virus persistence [5,15]. For the survival of a virus, the ideal is that infection does not harm the host and that the risk of host pathology is reduced in a long-term host [15]. From this perspective, the invasion of animal(s) by a virus, followed by fixation of its nucleotide sequences in the genomes of germ line cells and their vertical transmission, is a typical win-win strategy for both the survival of virus sequences and the long-term evolution of the animal(s).
No discussion of short regions of sequence identity would be complete without mentioning pseudo-genes. Pseudo-genes, known as non-functional, gene-like sequences resulting from a high mutation rate, are harbored by mammalian genomes [16]. Lacking functional promoters or other regulatory elements, a pseudo-gene is not transcribed [17,18]. Consistent with studies showing that a fixed viral insertion may decay into a pseudo-gene [1,17], in our study 7 out of 76 pseudo-genes harbor short regions of sequence identity. However, it is puzzling that dozens of pseudo-genes were located near the short regions of identity, at distances from several hundred base pairs to more than one million base pairs. As a rough rule, most of them are within 1 Mb. The reason why so many pseudo-genes are located nearby is not clear. It seems unlikely that the distribution of nearby pseudo-genes is due to chance. The fact that pseudo-genes tend to occur in gene families with environmental-response functions suggests that, instead of being dead, they may form a reservoir of diverse "extra parts" that can help an organism adapt to its surroundings [19]. An alternative explanation is that the short regions of sequence identity may act through an unknown regulatory mechanism in the formation of pseudo-genes. Note that in our study, in the western clawed frog, short regions of identity to Choristoneura occidentalis granulovirus were within an intron of a gene whose product is miscRNA. MiscRNA is short for miscellaneous RNA, a general term for a series of miscellaneous small RNAs. These RNAs serve a variety of functions, including some enzyme-like catalysis and the processing of RNA after it is formed. Some of these small RNAs may serve as switches. Others, acting through RNA interference (RNAi), silence genes by tagging their mRNA for destruction [20,21]. Perhaps some of these small RNAs serve as gene switches, turning genes on and off, or simply silence genes with the help of RNAi. Moreover, it is known that enhancers and other regulatory elements can be 1 Mb from the target gene [22]. The observation that most nearby pseudo-genes are within 1 Mb is consistent with this. Further study is needed to address this possibility.
Figure 4 Phylogenetic relationship of short regions of identity to Culex nigripalpus baculovirus.
Figure 5 Phylogenetic relationship of short regions of identity to Cydia pomonella granulovirus.
Fan and Li Virology Journal 2011, 8:511 http://www.virologyj.com/content/8/1/511
We have investigated the evolutionary radiation of some of the identified short regions of insect viruses and demonstrated a broad history of interaction between insect viruses and vertebrates. It is interesting to speculate that short regions of identity occur across a broad range of species. According to our data, at least some short regions of identity identified in vertebrates are derived from insect viruses. The initial gene flow from Cydia pomonella to the ancestor of Mus musculus at least postdated the divergence of Mus musculus and Rattus norvegicus about 10 million years ago. However, because of the limited sample, it is hard to know whether some regions of sequence identity in the insect viruses and in vertebrates share the same ancestral lineage. Since some viral sequences evolve more rapidly than animal sequences, this may mask the common ancestry of two nucleotide sequences that actually derive from the same ancestor [23].

Conclusions
Our study established that genetic material derived from insect viruses can flow to vertebrates and play a significant evolutionary role in the development of vertebrates and the survival of the viruses. This win-win strategy may be the result of natural selection.

Methods

Genome screening
The genomes of non-retroviral families of insect viruses were screened in silico against chromosome assemblies and whole-genome shotgun assemblies of 21 vertebrate species using BLASTn with the resources of NCBI. Insect virus sequences with high-level matches (i.e. E-value < 0.001) to vertebrate nucleotide sequences were collected. The matching animal sequences were then used as queries to screen the GenBank non-redundant (nr) database in a reciprocal BLASTn search. Significant matches to retroviruses and non-insect viruses were discarded, while the remaining matches were considered regions of identity to non-retroviral families of insect viruses.
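The two filtering steps described above (an E-value cutoff, then a reciprocal check that discards matches to retroviruses and non-insect viruses) might be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not the pipeline used in the study: it assumes the BLASTn results were saved in the standard 12-column tabular format, and the names `Hit`, `parse_tabular`, `significant` and `passes_reciprocal`, as well as all sequence identifiers and values in the toy input, are hypothetical.

```python
from typing import List, NamedTuple, Set


class Hit(NamedTuple):
    query: str
    subject: str
    pct_identity: float
    length: int
    evalue: float


def parse_tabular(text: str) -> List[Hit]:
    """Parse BLAST tabular output with the standard 12 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10])))
    return hits


def significant(hits: List[Hit], max_evalue: float = 1e-3) -> List[Hit]:
    """Keep hits below the E-value cutoff stated in the text (< 0.001)."""
    return [h for h in hits if h.evalue < max_evalue]


def passes_reciprocal(reciprocal_hits: List[Hit], excluded: Set[str]) -> bool:
    """True if no significant reciprocal hit is to an excluded subject
    (assumed here to be a set of retrovirus / non-insect-virus identifiers)."""
    return not any(h.subject in excluded for h in significant(reciprocal_hits))


# Toy input with made-up identifiers and values, for illustration only.
FORWARD = (
    "insect_virus_1\tvertebrate_contig_A\t100.00\t28\t0\t0\t1\t28\t501\t528\t2e-06\t52.8\n"
    "insect_virus_1\tvertebrate_contig_B\t91.30\t23\t2\t0\t5\t27\t90\t68\t0.52\t30.1\n"
)
RECIPROCAL = (
    "vertebrate_contig_A\tinsect_virus_1\t100.00\t28\t0\t0\t501\t528\t1\t28\t2e-06\t52.8\n"
)

kept = significant(parse_tabular(FORWARD))
print([h.subject for h in kept])
print(passes_reciprocal(parse_tabular(RECIPROCAL), {"some_retrovirus"}))
```

How the excluded set is assembled (for example from taxonomy annotations of the reciprocal hits) is not specified in the text and is left open here.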
Regions of identity were located precisely in the corresponding genome shotgun assemblies of vertebrates. If pseudo-genes were found near regions of identity (i.e. within 2000 kb of their 5' and/or 3' ends), the distance between each nearby pseudo-gene and the 5' and/or 3' end of the region of identity was calculated.
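The distance calculation can be illustrated with simple interval arithmetic. The sketch below is a hedged Python example rather than the authors' procedure: it assumes each region and pseudo-gene is given as a (start, end) coordinate pair on the same contig, treats overlap as a distance of 0, and uses hypothetical names (`gap_between`, `nearby_pseudogenes`) and made-up coordinates.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]
WINDOW_BP = 2_000_000  # the 2000 kb search window stated in the text


def gap_between(region: Interval, pseudogene: Interval) -> int:
    """Distance in bp between two intervals on the same contig;
    0 if they overlap (the pseudo-gene harbors or touches the region)."""
    r_start, r_end = sorted(region)
    p_start, p_end = sorted(pseudogene)
    if p_end < r_start:
        return r_start - p_end  # pseudo-gene lies before the region
    if p_start > r_end:
        return p_start - r_end  # pseudo-gene lies after the region
    return 0


def nearby_pseudogenes(region: Interval,
                       pseudogenes: Dict[str, Interval],
                       window: int = WINDOW_BP) -> List[Tuple[str, int]]:
    """Return (name, distance) for pseudo-genes within the window,
    nearest first."""
    found = []
    for name, interval in pseudogenes.items():
        distance = gap_between(region, interval)
        if distance <= window:
            found.append((name, distance))
    return sorted(found, key=lambda item: item[1])


# Made-up coordinates, for illustration only.
toy_region = (10_000, 10_050)
toy_pseudogenes = {
    "psA": (9_000, 9_500),
    "psB": (10_020, 10_900),
    "psC": (5_000_000, 5_001_000),
}
print(nearby_pseudogenes(toy_region, toy_pseudogenes))
```

Whether a pseudo-gene lies on the 5' or the 3' side depends on the strand of the region, which this sketch does not model.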

Phylogenetic analysis
To understand the distribution and possible origin of the regions of sequence identity, BLASTn was run with virus sequences as queries to screen the GenBank non-redundant (nr) database. Significant hits with over 95% identity and BLAST E-values of 1e-7 or lower were identified as regions of sequence identity, and representative sequences were extracted. These nucleotide sequences were aligned using the ClustalX program [24] and manually edited. Neighbor-Joining (NJ) phylogenies [25] were then constructed from the nucleotide sequence alignments with PHYLIP [26]. A consensus tree was calculated with the Consense program of the PHYLIP package. Support for the trees was evaluated with a total of 1,000 bootstrap replicates.
Figure 6 Phylogenetic relationship of short regions of identity to Leucania separata.
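The bootstrap step rests on resampling alignment columns with replacement before each tree is rebuilt. The following minimal Python sketch shows only that resampling idea; it is not the PHYLIP implementation and does not build trees. The function names, the fixed seed and the toy alignment are assumptions introduced for illustration.

```python
import random
from typing import Dict, List


def bootstrap_replicate(alignment: Dict[str, str],
                        rng: random.Random) -> Dict[str, str]:
    """Build one pseudo-alignment by sampling columns with replacement."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    n_cols = lengths.pop()
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same sampled columns are applied to every sequence, so each
    # resampled column is an intact column of the original alignment.
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment: Dict[str, str],
                         n: int = 1000,
                         seed: int = 0) -> List[Dict[str, str]]:
    """Generate n replicates (1,000 in the text) from a seeded generator."""
    rng = random.Random(seed)
    return [bootstrap_replicate(alignment, rng) for _ in range(n)]


# Made-up aligned sequences, for illustration only.
toy_alignment = {
    "virus": "ACGTAC-GTA",
    "mouse": "ACGTACAGTA",
    "rat":   "ACCTAC-GTT",
}
replicates = bootstrap_replicates(toy_alignment, n=1000, seed=1)
print(len(replicates), replicates[0])
```

In a full analysis, a tree would be inferred from each replicate and the fraction of replicates containing a given clade would be reported as its bootstrap support.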