Metagenomic analyses and genetic diversity of Tomato leaf curl virus complex affecting tomato plants in Kenya

Edith Khamonya Avedi (  e.aved@kephis.org ) Kenya Plant Health Inspectorate Service Adedapo Olutola Adediji (  adedapo.adediji@gmail.com ) University of Ibadan Dora Chao Kilalo University of Nairobi Florence Mmogi Olubayo University of Nairobi Isaac Macharia Kenya Plant Health Inspectorate Service Elijah Miinda Ateka Jomo Kenyatta University of Agriculture and Technology Eunice Magoma Machuka International Livestock Research Institute Josiah Musembi Mutuku International Livestock Research Institute

identi cation of causal organisms associated with virus diseases in crops, screening for speci c viruses when their presence is suspected, detection of asymptomatic or cryptic viruses, and the discovery of novel viruses and other microorganisms [27]. In this study, we used a metagenomics approach to investigate the occurrence of Tomato leaf curl viruses in tomato plants from farmers' elds in Kenya. The virus populations were further evaluated for their genetic diversity, evidence of recombination and occurrence of selection pressure.

Methods
Sample collection and extraction of nucleic acids Field surveys and sampling were carried out between January and May 2018 in four major tomato growing regions in Kenya, with different agro-ecological and climatic conditions ( Figure 1). Tomato elds were randomly selected based on crop availability, with 30 plants randomly assessed per eld. From each eld, young, trifoliate leaf samples (n=5) were obtained only from plants showing symptoms such as chlorosis, reduced leaf size, upward leaf curling, stunting and ower abscission ( Figure 2). A total of 515 leaf samples were obtained from 259 elds, carried in paper bags and stored at -80 C till further analysis. DNA extraction was done on pooled samples from each eld.
Extraction of total genomic DNA was performed as described [31]. Brie y, about 150 mg of leaf tissues were homogenized using a mortar and pestle with 1.5 ml of pre-warmed extraction buffer (2% cetyl trimethyl ammonium bromide w/v; 100 mM Tris-HCl, 1.4 M NaCl, 20 mM EDTA, pH 8.0 + 50 mg PVP + 0.2% v/v β -mercaptoethanol added just before use). The samples were transferred into 1.5 ml microtubes and incubated at 65ºC for 30 mins while mixing at 10 mins interval. The tubes were centrifuged at 10,000 rpm for 5 secs and supernatants (750 µl) were transferred into fresh microtubes.
Chloroform and isoamyl alcohol (750 µl) in the ratio 24:1 was added to the tubes, mixed and centrifuged at 10,000 rpm for 15 mins. The aqueous layers were transferred into new microtubes and ice cold isopropanol (300 µl) were added and mixed by inverting the tube slowly. Tubes were incubated overnight at -20ºC and the nucleic acids were pelleted by centrifugation at 10,000 rpm for 15 mins. The supernatants were discarded, pellets washed with 500 µl of 70% (v/v) ethanol and dried at room temperature. These were dissolved in 100 µL of Tris-EDTA buffer (10mM Tris-HCl [pH 8.0] + 1 mM EDTA), incubated at 37ºC for 30 mins and stored at -20ºC. A Nanodrop 2000 spectrophotometer (Thermo Fisher Scienti c, MA, USA) was used to determine the quality of the nucleic acids.

Library preparation and sequencing
Genomic DNA of 48 samples were randomly selected per region and used for library preparation. These were quanti ed using a Qubit TM uorometer (Thermo Fisher Scienti c, MA, USA) and normalized to 2.5 ng/µl. Libraries were prepared using Nextera DNA library preparation kit (Illumina, CA, USA) according to the manufacturer's instructions. Brie y, enzymatic fragmentation was carried out on normalized genomic DNA samples (20 µl) via addition of TD buffer (25 µl) and TDE (5 µl). Mixtures were centrifuged (Hettich Centrifugen, D-78532, Germany) at 14,000 rpm at 20 o C for 1 min and transferred into PCR tubes.
Tagmentation was carried out in a pre-programmed thermocycler at 55 o C lid and 55 ˚C incubation temperature, while holding at 10 o C. The tagmented DNA was barcoded using indexed adapters then cleaned with AMPure XP magnetic beads (Beckman Coulter, Inc. Indianapolis, IN) to remove shorter DNA fragments and other impurities. Library quality was con rmed with the Agilent Tape Station 2200 System (Agilent Technologies, Santa Clara, CA). All the 48 libraries were quanti ed using the Qubit TM uorometer (Thermo Fisher Scienti c Inc., Waltham, MA). The indexed DNA libraries of 48 biological samples were each normalized to a concentration of 4 nm before being pooled together. High-throughput sequencing was performed on an Illumina MiSeq System using 2 × 251 v2 kit and 12 pM of 1% PhiX v3 spike to create paired-end reads. Sequencing was performed at the facility of the Biosciences Eastern and Central Africa International Livestock Research Institute (BecA-ILRI) Hub, Nairobi, Kenya.

Sequence processing and assembly
After sequencing, quality control of fastq paired end reads was performed using FastP v.0.20.0 [32] to remove adapters, poly-N sequences (≥15%) and lter low quality reads. The resulting lengths after quality control and the N50 varied amongst the samples (Table 1). Individual sequences were assembled de novo and the resulting contigs were submitted to BLAST analysis.Q≤5). High-quality reads were mapped to the tomato genome (GenBank RefSeq accession number GCA_000188115.3) using Bowtie v.2.3.4.3 [33] under default parameters. Unmapped reads were assembled into contigs de novo using MEGAHIT v.1.1.3 [34] with default settings and those representing ssDNA sequences veri ed using Kaiju virus database [35]. The sequences were then subjected to BLASTN 2.9.0+ [36] to determine similarity match and virus identi cation (Additional File 1: Figure S1). Protein prediction of ORFs was determined using ORF Finder (http://www.ncbi.nlm.nih.gov/projects/gorf).

Sequence alignment, distance matrix and phylogeny
Sequences of monopartite begomoviruses found in tomato were retrieved from GenBank (Additional File 2: Table S1) and aligned with virus contigs using ClustalW multiple sequence alignment program as implemented in BioEdit v.7.2.3 [37] using default parameters. Deduced amino acids from the ToLCV genomes were compared with GenBank isolates while sequence pairwise identities were performed using SDT v1.2 with pairwise gap deletions [38]. A phylogenetic tree was constructed using the maximum likelihood method based on Jukes-Cantor model in MEGA v.6.06 [39]. Bootstrap replicate values were set at 1,000 while a strain of Tomato leaf curl purple vein virus (KY196216) was selected as an outgroup.
Genetic diversity, population genetic analysis and detection of recombination events Genetic structure and diversity within ToLCV populations in Kenya were investigated to understand potential evolutionary dynamics that produce variations. Population structure parameters estimated included average nucleotide diversity (π), haplotype diversity (Hd), number of polymorphic or segregating sites (S), the statistic estimate of population mutation based on the number of segregating sites (θ-W), total number of mutations (Eta), the average number of nucleotide differences between sequences (k) and the statistic estimate of population mutation based on the total number of mutations (θ-Eta). These were estimated using complete genome and protein coding sequences in DnaSP v5.10.01 [40].
The possible occurrences of selection pressure on individual genes and sites within the ToLCV populations were obtained using the single-likelihood ancestor counting (SLAC) method [41] in the HyPhy package [42] as implemented on the Datamonkey software [43] at http://www.datamonkey.org. The ratio of average number of nucleotide differences between the sequences per nonsynonymous site (d N ) to the average number of nucleotide differences between the sequences per synonymous site (d S ) were calculated as an indicator of natural selection. These were used to estimate the occurrence of positive and negative selection at typical begomovirus amino acid ORF sites: the movement protein (MP) or V1 protein, coat protein (CP) or V2 protein, replication protein (Rep) or C1 protein, transcription activator protein (TrAP) or C2 protein, Rep enhancer protein (REn) or C3 protein and the C4 protein. Depending on the dN/dS values, the selection pressure was considered negative or purifying (dN/dS < 1), neutral (dN/dS = 1), or diversifying or positive (dN/dS > 1) for data sets of each coding region. The DNAsp v5.10.01 was used to calculate the Tajima's D, Fu and Li's F* and D*, and Fu's Fs to determine the deviation of ToLCV populations from neutrality assuming a constant population size, with zero recombination and migration [44]. A negative Tajima's D statistic indicates super uous low-frequency polymorphism triggered by background selection, genetic hitchhiking, or population expansions [45]. Conversely, positive values of Tajima's D statistic suggests minimal levels of low and high frequency polymorphisms, indicating a reduction in population size and/or balancing selection.
A scan for recombination signatures were performed on each protein-coding sequence data using the single breakpoint scanning (SBP) and genetic algorithm recombination detection (GARD) methods [46].
These two methods were implemented by the Datamonkey software [43]. Potential recombination events were further investigated using the default settings of detection algorithms BoostScan [47], Chimaera [48], GENCONV [19], MaxChi [49], RDP [50], SiScan [51] and 3Seq which were implemented in the RDP v 4.13 software [52]. Putative recombination events, potential recombinants, and their parental sequences were deemed acceptable only when signals were identi ed by at least four of the seven detection methods, with strong levels of signi cance (P≤0.05).

Results
Sequence data and de novo assembly of DNA viruses After mapping of sequence reads from tomato leaf samples to the tomato reference genome, unmapped reads were subsequently assembled into contigs. The de novo assembly yielded several number of contigs, with the largest contigs having sizes of >45 kb while N50 values ranged from 135 -270 bp ( Table  1). After Kaiju analyses (see Materials and Methods), all assembled virus contigs were subjected to BLASTN 2.9.0+ searches. The results revealed twelve contig matches of lengths >2.7 kb from eleven samples with complete begomovirus genomes within the database (see Additional File 3: Table S2) while several other partial contigs matching DNA viruses were also present. Raw reads from these positive samples have been deposited at the SRA archive (Bioproject number PRJNA646848). Across all the samples, only monopartite begomoviruses with DNA-A-like sequences were recovered. The presence of beta-satellites was not evaluated in this study. However, a sample (Tom54) produced the full-length genome of a separate begomovirus, Chickpea chlorotic dwarf virus, which we recently described [53].

The begomoviruses in Kenyan tomato are a variant of ToLCArV
In all, the full-length genomes of the begomoviruses varied from 2,760 to 2,765 bp (Additional File 2: Table S1). These were deposited in GenBank database under the accession numbers MN894493 to MN894504. Sequence analyses showed that these genomes encoded the six ORFs (V1, V2, C1, C2, C3 and C4) that are typical of monopartite begomoviruses while the intergenic regions ranged from 245-250 nt. Pairwise alignments of begomoviruses with pairwise deletion of gaps using MUSCLE revealed the highest full genome similarity (95.9-98.9%) with an isolate of Tomato leaf curl Arusha virus (ToLCArV, GenBank accession EF194760) from Arusha, Tanzania (Table 2). This was followed by Tomato leaf curl Toliara virus (ToLCToV, GenBank accession AM701768) with 95.9-98.9 % identity and another isolate of Tomato leaf curl virus Arusha virus (ToLCArV, GenBank accession DQ519575) at 89.8-90.5% similarity. Furthermore, all isolates exhibited less than 80% pairwise sequence identity to other begomovirus sequences (see Additional File 4: Figure S2). Based on the species demarcation criteria of the International Committee for the Taxonomy of Viruses set for begomoviruses at <91% nucleotide sequence identity [54], the Kenyan begomoviruses were considered as a variant of ToLCArV. Similar patterns were observed for deduced amino acids as the highest identity was observed with ToLCArV (GenBank accession EF194760) across all the six coding regions (93. 3 (Table 3). Further analyses revealed 95.7-99.7% pairwise nucleotide sequence identities within the twelve Kenyan ToLCArV-like isolates while amino acid residues also revealed high similarities at the MP (94.1-100%), CP (98.5-100%), Rep (94.1-99.4%), TrAP (94.3-100%), REn (95.6-100%) and C4 (95.1-100%) coding regions (Table 5).

Recombination analyses
Evidence of recombination events were found within the CP and Rep genomic regions of ToLCArV-like populations using the automated SBP and GARD tools within the Datamonkey software (data not shown). Further analyses revealed putative recombination events and potential parental isolates using the programs implemented in the RDP3 software ( Table 5). The detection threshold of at least four of the seven programs revealed three recombination events among the ToLCArV-like populations from Kenya.
Two isolates, Tom5a and Tom39 collected from Kirinyaga and Makueni counties respectively, were identi ed as putative recombinants and parental isolates (P≤0.05) while Tom28, Tom35 and Tom45 were only detected as potential major and minor parental sequences ( Table 5). One of the putative recombinants (Tom5a) was identi ed as an intra-isolate recombination event (P≤0.05), possibly as a result of multiple co-infections of the parental isolates within affected plants.
Phylogenetic relationships and genetic diversity of Kenyan tomato begomoviruses A phylogenetic analysis was done using the full genome sequences of the 12 ToLCArV isolates from Kenya, together with TYLCV-like sequences and other tomato begomoviruses from GenBank. Expectedly, all TYLCV-like isolates (n=25) clustered separately from ToLCV-like sequences (n=46) with a clear geographical segregation (Figure 3). African ToLCV-like sequences (n=26) were separated from those of Asian origins (n=20) while isolates from Kenya formed a monophyletic cluster with isolates from Tanzania (ToLCArV, EF194760 and DQ519575) ( Figure 3). Similar results were obtained when isolates identi ed as putative recombinant sequences by the RDP3-implemented programs were excluded from these analyses (data not shown). This nding strengthens the hypothesis that Kenyan ToLCArV-like isolates are closely related to ToLCArV from Tanzania, with both strains having a common ancestor.
Analyses of haplotype number and haplotype diversity, represented by 'h' and 'Hd', respectively revealed varying values among the 12 Kenyan ToLCArV-like sequences and also among other ToLCV-like sequences from GenBank, based on the six coding regions evaluated (Table 6). Based on the total ToLCVlike sequences (n=46), haplotypes number ranged from 43 in the MP region to 46, obtained at the CP, Rep and whole genomes. Similarly, among the Kenyan ToLCArV-like isolates (n=12), 'h' values ranged from 9 (MP gene) to 12 (CP, Rep and complete genomes). Thus, across ToLCV-like sequences from the Genbank and the Kenyan ToLCArV-like sequences, each isolate represented a haplotype at both CP and Rep genes and reveals high genetic variation within the coding regions of each group. This therefore indicates that genetic variation was highest within the CP and Rep coding regions. Interestingly, Hd values were highest for the CP and REn gene and lowest for MP gene, both across ToLCV-like isolates obtained from GenBank and among the 12 Kenyan ToLCV-like sequences obtained in this study (Table 6). Furthermore, genetic distances for each gene-speci c data set were calculated, with highest π values obtained within the REn gene (0.2458) across the ToLCV-like isolates (n=46). The C4 gene and Rep gene recorded the lowest π values i.e. 0.21015 and 0.21165, respectively. Remarkably, the π value of the C4 gene within the 12 Kenyan ToLCArV-like isolates (0.00869) was more than half the π values of other coding regions, indicating that these coding regions were more variable than the C4 gene (Table 6). Collectively, these results show high genetic variability among the CP and Rep coding regions across both ToLCV groups, with C4 gene having the least variation across the isolates.

Tajima's D and estimation of selection pressure
Tajima's D statistical test [55] was used to evaluate the nucleotide polymorphism occurring within each gene and on the complete genomes of Kenyan ToLCArV-like isolates and other ToLCV-like isolates. The Tajima's D, Fu and Li's D and Fu and Li's F statistic revealed negative values for the complete genome datasets which did not statistically deviate from zero (P > 0.10) ( Table 6). Within Kenyan isolates, similar trends were observed for gene-speci c datasets except the MP and CP genes which revealed positive values that are not signi cantly (P > 0.10) different. These results indicate an excess of low-frequency polymorphism caused by background selection, genetic hitchhiking, or population increases.
In order to understand the selection pressure acting on the different coding regions within our ToLCArVlike sequences, the ratios of nonsynonymous substitution per nonsynonymous site (dN) and synonymous substitutions per synonymous sites (dS) were calculated ( Table 7). The dN/dS ratio is an estimator of the evolutionary constraints imposed on a coding region with a value >1 considered as evidence for positive selection, values <1 show evidence of negative selection while values of 1 indicate neutral selection [56]. Across the Kenyan ToLCArV-like sequences, the dN/dS ratio was 0.2067 for the MP gene, 0.067 for the CP gene, 0.3986 for Rep gen, 0.2590 for REn Gene, 0.2908 for TrAP gene and 1.1491 for C4 gene (Table 7). Thus, contrasting patterns of evolution were obtained for the coding region datasets as all except the C4 gene had dN/dS ratio of <1. This indicates a negative or purifying selection among ve out of six coding regions. In addition, these results show that although the MP, CP, Rep, TrAP and REn coding regions are under strong purifying selection, the purifying selective pressure is not distributed uniformly across the genes. The protein encoded by the C4 gene appears to be selectively neutral. The dN/dS values for the CP gene had the lowest values, with other gene sets having at least more than thrice its dN/dS ratio (Table 7).

Discussion
Tomato production in Kenya is widespread and has been limited by the impact of the tomato leaf curl disease. Because of the typical yellow leaf curl symptoms commonly associated with tomato in Africa, it is assumed that the causal organism is Tomato yellow leaf curl virus. Indeed, a tomato leaf curl-like virus infecting tomato in Kenya has previously been reported [57]. The paucity of information on viruses of high economic importance is compounded by the fact that only a few studies from Kenya have described the genomic properties of begomoviruses from cassava [58], sweet potato [59] and a non-cultivated weed host [60]. Using a metagenomics approach, we have described the occurrence of monopartite begomoviruses associated with the leaf curl disease of tomato in Kenya. Our results show that a genetically distinct begomovirus is associated with the disease in Kenya. Analyses of the complete genomes and coding regions of these begomoviruses, together with the failure to detect the presence of DNA-B component a rms that these virus populations were members of the Old World monopartite begomovirus species. Our ndings represent the rst comprehensive description of full begomovirus genomes from tomato in Kenya. This information is crucial for understanding the causal agents associated with the tomato leaf curl disease and its properties as a rst step towards appropriate robust disease management. The availability of full genome sequences will help to elucidate further the evolutionary behavior of the virus.
All the Kenyan ToLArV-like sequences obtained in this study shared very high nucleotide and amino acid sequence similarities, indicating low intra-population genetic diversity. Similar conclusions have been reached in other studies on tomato begomoviruses [61,62]. Curiously, we observed that the nucleotide sequences of the 12 ToLCArV-like isolates shared high identities among themselves but shared lower sequence identities with other begomoviruses (Table 3). This is likely as a result of the genetic bottleneck imposed through the method of begomovirus transmission by white ies [63]. Our study did not investigate virus occurrence within vectors. Nevertheless, the high genetic similarity within the population in our result could be due to a 'founder effect' arising from ecological and epidemiological factors such as vector or seed-mediated spread possibly from Tanzania. The derived amino acid sequences of the population in our results show homologous characteristic with other monopartite begomoviruses, indicating possible similar biological behaviors.
Results from sequence similarity indices together with phylogenetic inferences suggests that the ToLCArV-isolates associated with tomato leaf curl diseases in Kenya were likely of Tanzanian origin. The homogeneity of nucleotide and amino acids as well as phylogenetic inferences supports a single introduction of the tomato begomovirus into Kenya. Intriguingly, ve algorithms detected recombination signals (P ≤ 0.05) from a Tanzanian ToLCArV isolate (GenBank number DQ519575), identifying one of our Kenyan isolates (GenBank number MN894493) as a major parent (Table 5). This suggests that although the Kenyan isolates are just being characterized in this study, they could be the parents that contributed to the emergence of the ToLCArV that was initially described by [64]. It is thus possible, that the Kenyan ToLCV population could pre-date the Tanzanian isolates which were only reported earlier than our Kenyan isolates. Since our analyses reveal clustering of isolates from geographically proximal countries, the dissemination of the ToLCArV-like isolates is likely to have occurred via virus-infected planting material or spread by cross-border spread of viruliferous white ies, leading to genetic similarity among these isolates. Although, our study did not investigate mode of virus transmission, evidence of seed transmission has recently been reported in other closely related begomovirus species from tomato [65] and other hosts [66][67][68][69]. Thus, further research is required to understand how speci c begomovirus species are spread across various borders in East Africa and to determine the epidemiological and ecological implications. Additionally, we propose studies to investigate the effect of white y-mediated transmission on begomovirus diversity in Kenya.
Interestingly, our results show that the begomovirus sequences from Kenya have discernible patterns of geographical structuring with other ToLCV-like isolates of African origin. This is in agreement with previous studies that have shown geographical structuring of African Old World begomovirus subpopulations into clear genetically distinct categories [70,71]. This suggests that these viruses perhaps came from a common ancestor that was introduced to the continent and speciation arose as they interacted with various hosts across different geographical locations. In this study, we determined the genetic diversity of ToLCArV-like sequences from Kenyan within tomato elds using coding regions and complete genome sequences. Over the years, tomato begomoviruses in Kenya have received little or no attention in previous studies [72]. Our current ndings will deepen the knowledge on genetic diversity of tomato begomoviruses, therefore allowing for better diagnostics and appropriate management options.
Our results indicate that although there is low intra-speci c diversity among our isolates, the haplotype number and haplotype diversity analyses revealed varying homogenous levels within the coding regions. Thus, the non-coding regions could have contributed to the overall low diversity indices, similar to the observations of [73].
Our results have identi ed the occurrence of putative recombinants within the virus population (Table 5). Recombination has been shown to crucially contribute to the evolution of tomato-infecting begomoviruses [17,20,22,74] with far-reaching implications. A distinct begomovirus has recently been identi ed from tomato plants in Kenya [53]. Thus, the existence of these naturally-occurring recombinant isolates could evolve into the existence of virus complexes with signi cant effect on disease development, and evolution of virus populations. As intensive tomato cultivation continues in Kenya and as climate-driven changes in uence the spread of white y populations across East Africa, more virusvector-host interplays could give rise to further recombination events, leading to further virus evolution. In addition, begomovirus populations within close geographical delineations are known to possess distinct genetic properties as a result of frequent recombination events [71].
Our results show that varying natural selection pressures appear to be acting on the coding regions of the Kenya ToLCArV-like isolates, indicating independent coevolution of these genes. Our analyses of synonymous and nonsynonymous substitutions revealed that, except the C4 gene, all coding regions appear to be under strong negative or purifying selection to conserve its encoded amino acid sequence. This is in line with similar observations for other related tomato begomovirus species from the Old World [75] and New World [76]. The evolutionary constraints on these coding regions could be intended to preserve their biological functions which include virus replication, accumulation and delity to vector transmission. For example, the CP gene has been reported to mediate interactions between begomoviruses and their white y vectors [77,78]

Conclusions
This study investigated the identity, full sequence properties, genetic diversity, population genetics and recombination analyses of monopartite begomoviruses associated with leaf curl diseases of tomato in Kenya. Nucleotide and amino acid sequence analyses together with phylogenetic inferences identi ed the begomoviruses as variants of ToLCArV with origins from Tanzania. Genome analyses revealed low genetic diversity within the population with contrasting evolutionary patterns among the coding regions.
The information obtained in this research will assist in the design and implementation of quarantine plans to manage virus and vector dynamics. Sequence information and genetic diversity data obtained in this study are also important for the development of rapid and robust detection tools towards the production of virus-free tomato seedlings for farmers. This will ultimately improve tomato production across the country for better food security.