Metagenomic analyses and genetic diversity of Tomato leaf curl Arusha virus affecting tomato plants in Kenya

Background Tomato production is threatened worldwide by the occurrence of begomoviruses which are associated with tomato leaf curl diseases. There is little information on the molecular properties of tomato begomoviruses in Kenya, hence we investigated the population and genetic diversity of begomoviruses associated with tomato leaf curl in Kenya. Methods Tomato leaf samples with virus-like symptoms were obtained from farmers’ field across the country in 2018 and Illumina sequencing undertaken to determine the genetic diversity of associated begomoviruses. Additionally, the occurrence of selection pressure and recombinant isolates within the population were also evaluated. Results Twelve complete begomovirus genomes were obtained from our samples with an average coverage of 99.9%. The sequences showed 95.7–99.7% identity among each other and 95.9–98.9% similarities with a Tomato leaf curl virus Arusha virus (ToLCArV) isolate from Tanzania. Analysis of amino acid sequences showed the highest identities in the regions coding for the coat protein gene (98.5–100%) within the isolates, and 97.1–100% identity with the C4 gene of ToLCArV. Phylogenetic algorithms clustered all Kenyan isolates in the same clades with ToLCArV, thus confirming the isolates to be a variant of the virus. There was no evidence of recombination within our isolates. Estimation of selection pressure within the virus population revealed the occurrence of negative or purifying selection in five out of the six coding regions of the sequences. Conclusions The begomovirus associated with tomato leaf curl diseases of tomato in Kenya is a variant of ToLCArV, possibly originating from Tanzania. There is low genetic diversity within the virus population and this information is useful in the development of appropriate management strategies for the disease in the country.


Methods
Sample collection and extraction of nucleic acids Field surveys and sampling were carried out between January and May 2018 in four major tomato growing regions in Kenya, with different agroecological and climatic conditions (Figure 1a). Tomato elds were randomly selected based on crop availability, with 30 plants randomly assessed per eld. From each eld, young trifoliate leaf samples (n=5) were obtained only from plants showing symptoms such as chlorosis, reduced leaf size, upward leaf curling, stunting and ower abscission (Figure 1b). A total of 240 leaf samples were collected from 48 elds, carried in paper bags and stored at -80 °C till further analysis. Samples from each eld were pooled prior to DNA extraction.
Extraction of total genomic DNA was performed as described [26]. Brie y, about 150 mg of leaf tissues were homogenized using a mortar and pestle with 1.5 ml of pre-warmed extraction buffer (2% cetyl trimethyl ammonium bromide w/v; 100 mM Tris-HCl, 1.4 M NaCl, 20 mM EDTA, pH 8.0 + 50 mg PVP + 0.2% v/v β -mercaptoethanol added just before use). The samples were transferred into 1.5 ml microtubes and incubated at 65ºC for 30 mins while mixing at 10 mins interval. The tubes were centrifuged at 10,000 rpm for 5 secs and supernatants (750 µl) were transferred into fresh microtubes. Chloroform and isoamyl alcohol (750 µl) in the ratio 24:1 was added to the tubes, mixed and centrifuged at 10,000 rpm for 15 mins. The aqueous layers were transferred into new microtubes and ice cold isopropanol (300 µl) were added and mixed by inverting the tube slowly. Tubes were incubated overnight at -20ºC and the nucleic acids were then pelleted by centrifugation at 10,000 rpm for 15 mins. The supernatants were discarded, pellets washed with 500 µl of 70% (v/v) ethanol and dried at room temperature. These were dissolved in 100 µL of Tris-EDTA buffer (10mM Tris-HCl [pH 8.0] + 1 mM EDTA), incubated at 37ºC for 30 mins and stored at -20ºC. A Nanodrop 2000 spectrophotometer (Thermo Fisher Scienti c, MA, USA) was used to determine the quality of the nucleic acids.

Library preparation and sequencing
The genomic DNA were quanti ed using a Qubit TM uorometer (Thermo Fisher Scienti c, MA, USA) and normalized to 2.5 ng/µl and used for library preparation. Libraries were prepared using Nextera DNA library preparation kit (Illumina, CA, USA) according to the manufacturer's instructions. Brie y, enzymatic fragmentation was carried out on normalized genomic DNA samples (20 µl) via addition of TD buffer (25 µl) and TDE (5 µl).
Mixtures were centrifuged (Hettich Centrifugen, D-78532, Germany) at 14,000 rpm at 20 o C for 1 min and transferred into microtubes. Tagmentation was carried out in a pre-programmed thermocycler at 55 o C lid and 55 o C incubation temperature, while holding at 10 o C. The tagmented DNA was barcoded using indexed adapters then cleaned with AMPure XP magnetic beads (Beckman Coulter, Inc. Indianapolis, IN) to remove shorter DNA fragments and other impurities. Library quality was con rmed with the Agilent Tape Station 2200 System (Agilent Technologies, Santa Clara, CA). All the 48 libraries were quanti ed using the Qubit TM uorometer (Thermo Fisher Scienti c Inc., Waltham, MA). The indexed DNA libraries of 48 biological samples were each normalized to a concentration of 4 nm before being pooled together. High-throughput sequencing was performed on an Illumina MiSeq System using 2 × 251 v2 kit and 12 pM of 1% PhiX v3 spike to create paired-end reads. Sequencing was performed at the facility of the Biosciences Eastern and Central Africa International Livestock Research Institute (BecA-ILRI) Hub, Nairobi, Kenya.

Sequence validation through Polymerase chain reaction (PCR) and Sanger sequencing
The assembled begomovirus genomes were validated using a polymerase chain reaction (PCR) step followed by Sanger sequencing of the ampli ed products. The Illumina assembled virus sequences were aligned together using ClustalW multiple sequence alignment program with default parameters as implemented in BioEdit v.7.2.3 [32]. A consensus sequence was obtained and used to design PCR primers ToLCV-Forward (5'-ATTGGCGATTTCCCAGGTATAG-3') and ToLCV-Reverse (5'-ACAATGTGGGCTAGGTCATTAG-3') using the Primer Express v3.0 software (Applied Biosystems, USA). Secondary structures, complementarity and dimer effects of the primers were also checked using the multiple primer analyzer software (Thermo Fisher Scienti c, MA, USA). Using PCR, these were tested on the genomic DNA from which the complete begomovirus genomes had been obtained via Illumina sequencing. The PCR products were ethanol-puri ed and quanti ed using a Nanodrop 2000 spectrophotometer (Thermo Fisher Scienti c, MA, USA) to determine purity levels. Amplicons were sequenced at Macrogen Europe and manually assembled using BioEdit. Consensus sequences were veri ed using BLASTN 2.9.0+ and comparisons were made with the complete begomovirus sequences assembled from Illumina reads.
Sequence alignment, distance matrix and evidence of recombination Complete sequences of monopartite begomoviruses found in tomato were retrieved from GenBank (Additional File 2: Table S1) and aligned with full virus contigs using ClustalW in BioEdit. Deduced amino acids from the ToLCV genomes were compared with GenBank isolates while sequence pairwise identities were performed using SDT v1.2 [33] with pairwise gap deletions. A scan for recombination signatures were performed on each protein-coding sequence data using the single breakpoint scanning (SBP) and genetic algorithm recombination detection (GARD) methods [34]. These two methods were implemented by the Datamonkey software [35]. Potential recombination events were further investigated using the default settings of the seven detection algorithms within RDP v 4.13 [36]. Putative recombination events, potential recombinants, and their parental sequences were deemed acceptable only when signals were identi ed by at least four detection methods, with strong levels of signi cance (P≤0.05).
Phylogeny, genetic diversity and population genetic analysis A phylogenetic tree was constructed using the maximum likelihood method based on Jukes-Cantor model in MEGA v.6.06 [37]. Bootstrap replicate values were set at 1,000 while a strain of Tomato leaf curl purple vein virus (KY196216) was selected as an outgroup. Genetic structure and diversity within ToLCV populations in Kenya were investigated to understand potential evolutionary dynamics that produce variations. Population structure parameters estimated included; average nucleotide diversity (π), haplotype diversity (Hd), number of polymorphic or segregating sites (S), the statistic estimate of population mutation based on the number of segregating sites (θ-W), total number of mutations (Eta), the average number of nucleotide differences between sequences (k) and the statistic estimate of population mutation based on the total number of mutations (θ-Eta). These were estimated using complete genome and protein coding sequences in DnaSP v5.10.01 [38].
The possible occurrences of selection pressure on individual genes and sites within the ToLCV populations were obtained using the single-likelihood ancestor counting (SLAC) method [39] in the HyPhy package [40] as implemented on the Datamonkey software [35] at http://www.datamonkey.org. The ratio of average number of nucleotide differences between the sequences per nonsynonymous site (d N ) to the average number of nucleotide differences between the sequences per synonymous site (d S ) were calculated as an indicator of natural selection. These were used to estimate the occurrence of positive and negative selection at typical begomovirus amino acid ORF sites: the movement protein (MP) or V1 protein, coat protein (CP) or V2 protein, replication protein (Rep) or C1 protein, transcription activator protein (TrAP) or C2 protein, Rep enhancer protein (REn) or C3 protein and the C4 protein. Depending on the dN/dS values, the selection pressure was considered negative or purifying (dN/dS < 1), neutral (dN/dS = 1), or diversifying or positive (dN/dS > 1) for data sets of each coding region. The DNAsp v5. 10.01 was used to calculate the Tajima's D, Fu and Li's F* and D*, and Fu's Fs to determine the deviation of ToLCV populations from neutrality assuming a constant population size, with zero recombination and migration [41]. A negative Tajima's D statistic indicates super uous low-frequency polymorphism triggered by background selection, genetic hitchhiking, or population expansions [42]. Conversely, positive values of Tajima's D statistic suggest minimal levels of low and high frequency polymorphisms, indicating a reduction in population size and/or balancing selection.

Results
Sequence data, de novo assembly and begomovirus PCR veri cation After mapping of sequence reads from leaf samples to the tomato reference genome, unmapped reads were subsequently assembled into contigs. The de novo assembly yielded several contigs, with the largest having sizes of >45 kb while N50 values ranged from 135-270 bp (Additional File 3: Table S2). After Kaiju analyses (see Materials and Methods), all assembled virus contigs were subjected to BLASTN 2.9.0+ searches. The results revealed twelve contig matches of lengths >2.7 kb from eleven samples with complete begomovirus genomes within the database (see Additional File 2: Table S1) while partial contigs matching other DNA viruses were also present (data not shown). Raw reads from these positive samples have been deposited at the SRA archive (Bioproject number PRJNA646848). Across all the samples, only monopartite begomoviruses with DNA-A-like sequences were recovered. The presence of beta-satellites was not evaluated in this study. However, a sample (Tom54) produced the full-length genome of a separate begomovirus, Chickpea chlorotic dwarf virus, which we recently described [43]. The PCR primers designed from the full begomovirus genomes produced the expected 530 bp amplicons from the genomic DNA of infected tomato plants. Sanger sequencing of the PCR products revealing 95.6-99.7% identity (data not shown) with the complete genomes assembled from the Illumina reads, thus con rming the accuracy of the nucleotides within the assembled virus genomes.

The begomoviruses in Kenyan tomato are a variant of ToLCArV
In all the samples, the full-length genomes of the begomoviruses varied from 2,760 to 2,765 bp (Table 1). These were subsequently deposited in GenBank database under the accession numbers MN894493 to MN894504. Sequence analyses showed that these genomes encoded the six ORFs (V1, V2, C1, C2, C3 and C4) that are typical of monopartite begomoviruses while the intergenic regions ranged from 245-250 nt. Pairwise alignments of begomoviruses (see Additional File 4: File 1) with pairwise deletion of gaps revealed the highest full genome similarity (95.9-98.9%) with an isolate of Tomato leaf curl Arusha virus (ToLCArV, GenBank accession EF194760) from Tanzania (Additional File 5: Table S3). This was followed by Tomato leaf curl Toliara virus (ToLCToV, GenBank accession AM701768) with 95.9-98.9 % identity and another isolate of Tomato leaf curl virus Arusha virus (ToLCArV, GenBank accession DQ519575) at 89.8-90.5% similarity. Furthermore, all isolates exhibited less than 80% pairwise sequence identity to other begomovirus sequences (Additional File 6: Figure S2). Based on the species demarcation criteria of the International Committee for the Taxonomy of Viruses set for begomoviruses at <91% nucleotide sequence identity [44], the Kenyan begomoviruses were considered as a variant of ToLCArV. Similar patterns were observed for deduced amino acids as the highest identity was observed with ToLCArV (GenBank accession EF194760) across all the six coding regions (93.  (Table 2).

Recombination analyses
Using the automated SBP and GARD tools within Datamonkey, recombination signals were found within the genomic regions of our ToLCArV-like populations (data not shown). However, further analyses of the isolates (see Additional File 8: File 2) using the programs implemented in the RDP4 software did not reveal signi cant recombination signals within our sequences. Conversely, two isolates Tom5a (MN894493) and Tom39 (MN894499) were identi ed as potential major and minor parental sequences for the signals detected in ToLCArV (DQ519575) and ToLCDiV (AM701765), respectively (Table 3).
Phylogenetic relationships and genetic diversity of Kenyan tomato begomoviruses A phylogenetic analysis was done using the full genome sequences of the 12 ToLCArV isolates from Kenya, together with TYLCV-like sequences and other tomato begomoviruses from GenBank. As expected, all TYLCV-like isolates (n=25) clustered separately from ToLCV-like sequences (n=46) with a clear geographical segregation (Figure 2). African ToLCV-like sequences (n=26) were separated from those of Asian origins (n=20) while isolates from Kenya formed a monophyletic cluster with isolates from Tanzania (ToLCArV, EF194760 and DQ519575) ( Figure 2). This nding strengthens the hypothesis that Kenyan ToLCArV-like isolates are closely related to ToLCArV from Tanzania, with both strains having a common ancestor.
Analyses of haplotype number and haplotype diversity, represented by 'h' and 'Hd', respectively revealed varying values among the 12 Kenyan ToLCArV-like sequences and also among other ToLCV-like sequences from GenBank, based on the six coding regions evaluated (Table 4). From the total ToLCV-like sequences (n=46), haplotypes number ranged from 43 in the MP region to 46, in the CP, Rep and whole genomes. Similarly, among the Kenyan ToLCArV-like isolates (n=12), 'h' values ranged from 9 (MP gene) to 12 (CP, Rep and complete genomes). Thus, across ToLCV-like sequences from the Genbank and the Kenyan ToLCArV-like sequences, each isolate represented a haplotype at both CP and Rep genes, revealing high genetic variation within the coding regions of each group. This therefore indicates that genetic variation was highest within the CP and Rep coding regions. Interestingly, Hd values were highest for the CP and REn gene and lowest for MP gene, both across ToLCV-like isolates obtained from GenBank and among the 12 Kenyan ToLCV-like sequences obtained in this study (Table 4). Furthermore, genetic distances for each gene-speci c data set were calculated, with highest π values obtained within the REn gene (0.2458) across the ToLCV-like isolates (n=46). The C4 gene and Rep gene recorded the lowest π values i.e. 0.21015 and 0.21165, respectively. Remarkably, the π value of the C4 gene within the 12 Kenyan ToLCArVlike isolates (0.00869) was more than half the π values of other coding regions, indicating that these coding regions were more variable than the C4 gene (Table 4). Collectively, these results show high genetic variability among the CP and Rep coding regions across both ToLCV groups, with C4 gene having the least variation across the isolates.
Tajima's D and estimation of selection pressure Tajima's D statistical test [45] was used to evaluate the nucleotide polymorphism occurring within each gene and on the complete genomes of Kenyan ToLCArV-like isolates and other ToLCV-like isolates. The Tajima's D, Fu and Li's D and Fu and Li's F statistic revealed negative values for the complete genome datasets which did not statistically deviate from zero (P > 0.10) ( Table 5). Within Kenyan isolates, similar trends were observed for gene-speci c datasets except the MP and CP genes which revealed positive values that are not signi cantly (P > 0.10) different. These results indicate an excess of low-frequency polymorphism caused by background selection, genetic hitchhiking, or population increases.
In order to understand the selection pressure acting on the different coding regions within our ToLCArV-like sequences, the ratios of nonsynonymous substitution per nonsynonymous site (dN) and synonymous substitutions per synonymous sites (dS) were calculated ( Table 5). The dN/dS ratio is an estimator of the evolutionary constraints imposed on a coding region with a value >1 considered as evidence for positive selection, values <1 show evidence of negative selection while values of 1 indicate neutral selection [46]. Across the Kenyan ToLCArV-like sequences, the dN/dS ratio was 0.2067 for the MP gene, 0.067 for the CP gene, 0.3986 for Rep gen, 0.2590 for REn Gene, 0.2908 for TrAP gene and 1.1491 for C4 gene (Table   5). Thus, contrasting patterns of evolution were obtained for the coding region datasets as all except the C4 gene had dN/dS ratio of <1. This indicates a negative or purifying selection among ve out of six coding regions. In addition, these results show that although the MP, CP, Rep, TrAP and REn coding regions are under strong purifying selection, the purifying selective pressure is not distributed uniformly across the genes. The protein encoded by the C4 gene appears to be selectively neutral. The dN/dS values for the CP gene had the lowest values, with other gene sets having at least more than thrice its dN/dS ratio (Table 5).

Discussion
Tomato production in Kenya is widespread and has been limited by the impact of the tomato leaf curl disease. Tomato yellow leaf curl virus has always been assumed to be the causal because of the typical yellow leaf curl symptoms commonly associated with tomato in Africa. Indeed, a tomato leaf curl-like virus infecting tomato in Kenya has previously been reported [47]. The paucity of information on viruses of high economic importance is compounded by the fact that only a few studies from Kenya have described the genomic properties of begomoviruses from cassava [48], sweet potato [49] and a non-cultivated weed host [50]. Using a metagenomics approach, we have described the occurrence of monopartite begomoviruses associated with the leaf curl disease of tomato in Kenya. Our results show that a genetically distinct begomovirus is associated with the disease in Kenya. Analyses of the complete genomes and coding regions of these begomoviruses, together with the failure to detect the presence of DNA-B component a rms that these virus populations were members of the Old World monopartite begomovirus species. Our ndings represent the rst comprehensive description of full begomovirus genomes from tomato in Kenya. This information is crucial for understanding the causal agents associated with the tomato leaf curl disease and its properties as a rst step towards appropriate robust disease management. The availability of full genome sequences will help to elucidate further the evolutionary behavior of the virus.
All the Kenyan ToLArV-like sequences obtained in this study, shared very high nucleotide and amino acid sequence similarities, indicating low intrapopulation genetic diversity. Similar conclusions have been reached in other studies on tomato begomoviruses [51,52]. Curiously, we observed that the nucleotide sequences of the 12 ToLCArV-like isolates shared high identities among themselves but shared lower sequence identities with other begomoviruses. This is likely as a result of the genetic bottleneck imposed through the method of begomovirus transmission by white ies [53]. Our study did not investigate virus occurrence within vectors. Nevertheless, the high genetic similarity within the population in our result could be due to a 'founder effect' arising from ecological and epidemiological factors such as vector or seed-mediated spread possibly from Tanzania. The derived amino acid sequences of the population in our results show homologous characteristic with other monopartite begomoviruses, indicating possible similar biological behaviors.
Results from sequence similarity indices, together with phylogenetic inferences, suggest that the ToLCArV-isolates associated with tomato leaf curl diseases in Kenya were likely of Tanzanian origin. The homogeneity of nucleotide and amino acids as well as phylogenetic inferences supports a single introduction of the tomato begomovirus into Kenya. There was no evidence of recombination occurring within our ToLCArV population. Intriguingly, ve algorithms detected recombination signals (P≤0.05) from a Tanzanian ToLCArV isolate (GenBank number DQ519575), identifying one of our Kenyan isolates (GenBank number MN894493) as a major parent (Table 3). This suggests that, although the properties of our isolates are just being characterized, they could be the parents that contributed to the emergence of ToLCArV which was rst described by [54]. Thus, it is possible that the Kenyan ToLCV population could pre-date the Tanzanian isolates which were then only reported earlier.
Since our analyses reveal clustering of isolates from geographically proximal countries, the dissemination of the ToLCArV-like isolates is likely to have occurred via virus-infected planting material or spread by cross-border spread of viruliferous white ies, leading to genetic similarity among these isolates. Although, our study did not investigate mode of virus transmission, evidence of seed transmission has recently been reported in other closely related begomovirus species from tomato [55] and other hosts [56,57]. Thus, further research is required to understand how speci c begomovirus species are spread across various borders in East Africa and to determine the epidemiological and ecological implications. Additionally, we propose studies to investigate the effect of white y-mediated transmission on begomovirus diversity in Kenya.
Interestingly, our results show that the begomovirus sequences from Kenya have discernible patterns of geographical structuring with other ToLCVlike isolates of African origin. This is in agreement with previous studies that have shown geographical structuring of African Old World begomovirus sub-populations into clear genetically distinct categories [58,59]. This suggests that these viruses perhaps came from a common ancestor that was introduced to the continent and speciation arose as they interacted with various hosts across different geographical locations. In this study, we determined the genetic diversity of ToLCArV-like sequences from Kenyan within tomato elds using coding regions and complete genome sequences. Over the years, tomato begomoviruses in Kenya have received little or no attention in previous studies [60]. Our current ndings will deepen the knowledge on genetic diversity of tomato begomoviruses, therefore allowing for better diagnostics and appropriate management options. Our results indicate that although there is low intra-speci c diversity among our isolates, the haplotype number and haplotype diversity analyses revealed varying homogenous levels within the coding regions. Thus, the non-coding regions could have contributed to the overall low diversity indices, similar to the observations of [61].
Our results show that varying natural selection pressures appear to be acting on the coding regions of the Kenya ToLCArV-like isolates, indicating independent coevolution of these genes. Our analyses of synonymous and nonsynonymous substitutions revealed that, except the C4 gene, all coding regions appear to be under strong negative or purifying selection to conserve its encoded amino acid sequence. This is in line with similar observations for other related tomato begomovirus species from the Old World [62] and New World [63]. The evolutionary constraints on these coding regions could be intended to preserve their biological functions which include virus replication, accumulation and delity to vector transmission. For example, the CP gene has been reported to mediate interactions between begomoviruses and their white y vectors [64]. Any alterations in the CP sequence could subsequently alter their virus-vector interactions or other associated biological functions [65]. This is probably why this phenomenon is more in the CP region with the lowest mean dN/dS values, indicating that it is undergoing a stronger purifying selection. Other studies have also indicated similar patterns within begomoviruses, with the CP gene having the strongest evolutionary constraint [66][67][68]. dN/dS ratios are estimators of evolutionary bottlenecks imposed on a coding region at intra-speci c levels. Because natural selection functions largely on these regions, synonymous and nonsynonymous mutations are usually under varying selective pressures and are xed at different rates within begomovirus genomes [69,70]. Thus, comparison of synonymous and nonsynonymous substitution rates can reveal the direction and strength of natural selection acting on virus proteins. Importantly, we found the C4 gene within the Kenyan isolates to be selectively neutral as its estimated dN/dS ratio (1.1491) suggests that neither purifying nor diversifying selection was ongoing. This neutral selection could be as a result of its divergent but crucial role in modulating disease severity, determination of host range, virus movement and suppression of host silencing mechanisms [71,72].

Conclusions
This study investigated the identity, full sequence properties, genetic diversity, population genetics and presence of recombinants within monopartite begomoviruses associated with leaf curl diseases of tomato in Kenya. Nucleotide and amino acid sequence analyses together with phylogenetic inferences identi ed the begomoviruses as variants of ToLCArV with origins from Tanzania. Genome analyses revealed low genetic diversity within the population with negative selection occurring within most of the coding regions. The information obtained in this research will assist in the design and implementation of quarantine plans to manage virus-host dynamics. Sequence information and genetic diversity data obtained in this study are also important for the development of rapid and robust detection tools towards the production of virus-free tomato seedlings for farmers. This will ultimately improve tomato production across the country for better food security.