Skip to main content

Application of machine learning in understanding plant virus pathogenesis: trends and perspectives on emergence, diagnosis, host-virus interplay and management



Inclusion of high throughput technologies in the field of biology has generated massive amounts of data in the recent years. Now, transforming these huge volumes of data into knowledge is the primary challenge in computational biology. The traditional methods of data analysis have failed to carry out the task. Hence, researchers are turning to machine learning based approaches for the analysis of high-dimensional big data. In machine learning, once a model is trained with a training dataset, it can be applied on a testing dataset which is independent. In current times, deep learning algorithms further promote the application of machine learning in several field of biology including plant virology.

Main body

Plant viruses have emerged as one of the principal global threats to food security due to their devastating impact on crops and vegetables. The emergence of new viral strains and species help viruses to evade the concurrent preventive methods. According to a survey conducted in 2014, plant viruses are anticipated to cause a global yield loss of more than thirty billion USD per year. In order to design effective, durable and broad-spectrum management protocols, it is very important to understand the mechanistic details of viral pathogenesis. The application of machine learning enables precise diagnosis of plant viral diseases at an early stage. Furthermore, the development of several machine learning-guided bioinformatics platforms has primed plant virologists to understand the host-virus interplay better. In addition, machine learning has tremendous potential in deciphering the pattern of plant virus evolution and emergence as well as in developing viable control options.


Considering a significant progress in the application of machine learning in understanding plant virology, this review highlights an introductory note on machine learning and comprehensively discusses the trends and prospects of machine learning in the diagnosis of viral diseases, understanding host-virus interplay and emergence of plant viruses.


Machine learning: an introduction for biologists

Over the years, extensive research has been carried out in various fields of biology to understand the science behind a plethora of complex biological phenomena. The study of problems such as traits in plants and plant viral diseases lead to generation of massive data sets. The progress in technology has rendered data generation a simple task. Cost-effective technologies such as next generation sequencing (NGS) have made it easier to gather data regarding gene expression, chromosome conformation, genetic variation, traits and diseases of animals and plants, leading to generation of such massive data sets having multiple characteristics [1]. However, the resultant data explosion, especially in the field of omics, has made the handling of large datasets a major concern. The traditional statistical data analysis methodologies are not effective or efficient anymore in this context [2].

Furthermore, biological phenomena comprise various aspects, which lead to the generation of more than one data type. This necessitates an integrated analysis of the different types of data. But the noisiness of heterogeneous biological data makes this a difficult task [3]. Data dimensionality is another major impediment, for instance, omics data is generally highly resolved, hence highly dimensional. Moreover, the sample size in biological studies is limited in most cases. This may lead to issues including overfitting, multi-collinearity and data sparsity [4].

In order to overcome all these barriers, attempts are being made to incorporate machine learning (ML) and deep learning (DL) tools in the analysis of the datasets. ML tools identify patterns in the data using different statistical methods. Based on existing data, the ML paradigm can be used to derive models for classification, pattern recognition, and predictions. DL algorithms extract high-level features from huge datasets (such as collection of genomic sequences, or images), recognize the hidden patterns, and then utilize them to train models [5]. These trained models can be further applied to diverse types of unseen data from different sources for tasks such as prediction and classification. These techniques have the ability to tackle tough problems by detecting structure in seemingly random data, even when the amount of data is too complex and large for human comprehension [6]. Hence, ML especially DL, has the ability to perform analysis of enormous datasets in an extremely efficient, cost-effective, accurate and high-throughput manner [7].

In the context of ML, there are two primary frameworks for training the models: supervised and unsupervised learning. Both of these have potential for use in biology. Under supervised learning, the given collection features, or attributes of a system under investigation, are labeled [8]. Two recurring problems in the supervised learning framework are regression and classification. The classification process assigns objects into classes on the basis on the properties of features. In biology, one example of such training (involving mapping of object-to-class) is mapping of gene expression profiles to their respective diseases. The algorithm returns an assigned class of the object with certain “confidence measure” indicating the correctness of classification. Some of the widely used supervised models are linear/nonlinear regression, support vector machines (SVM), Gaussian processes, and neural nets [9].

In unsupervised learning, the objects involved in the study are not under any predefined labels [10]. The entire goal of these models is to recognize similarities in the various objects by exploring the data. These similarities define clusters in the data (groups of data objects). So, the basic concept of unsupervised learning is to discover natural patterns in data and categorize it into groups. Overall, in supervised learning, the data is pre-labeled and the algorithm learns how to utilize the labels to associate the objects to the classes. On the other hand, in unsupervised learning, the data is unlabeled, and the algorithm also learns to create labels by clustering the objects. Principal component analysis is an important example of unsupervised learning technique which includes k-means clustering, Gaussian mixture models, density-based spatial clustering of applications with noise (DBSCAN), and hierarchical clustering [9].

In certain exceptional cases, a method known as semi-supervised learning has proven to be quite useful. One example of such a scenario could be the classification of protein sequences. Only a few samples of protein sequences are labeled (belonging to a known class) but numerous sequences belong to unknown classes. The semi-supervised algorithm combines a small amount of labeled data with a large amount of unlabeled data during training [11].

The basic steps for creating a machine learning model for the study of biological data are shown in Fig. 1. Following the collection of data (labeled or unlabeled), it is divided into two sets for training and testing. The data samples need to undergo preprocessing and augmentation before the splitting in case they are corrupted with noise and outliers. Next the model is trained using the training dataset. The model can either be created from scratch, or a pre-trained model can be adjusted according to the collected dataset. Once the trained model is ready, the testing data is fed into it to determine the accuracy with which the objects are classified into different labels [12].

Fig. 1
figure 1

Standard flowchart for creation of a machine learning model to study biological data. The figure here shows the steps followed in order to create a machine learning model that can successfully study different types of biological data. The data is initially split up into training and testing sets. Each object of the training set is associated with a feature vector, which is passed into the required machine learning algorithm. After manipulating the various parameters of the model, a resultant machine learning model for prediction is developed. This model is then checked by passing the objects of the testing set through it. The resultant output accuracy determines the usefulness of the created model

Deep learning is broad category of machine learning wherein large multiple layered neural network models are employed for representation learning. Deep learning can be performed in supervised, unsupervised, or semi-supervised fashion. When working with neural networks, we essentially attempt to create the inferences analogous to the human brain by building an artificial neural network (ANN) [13]. An ANN resembles a biological neural network. The artificial neurons used here are basically mathematical models that carry out three main functions: activation, addition and multiplication. The goal is to build layers of neurons, each of which produces a suitable response to any input provided to it. The neurons of each layer multiply their inputs with the corresponding weights. Then it is passed through the activation function and finally transferred to the next layer of neurons. Once the input layer is fired up, the decision moves along to the subsequent layers of the neurons (hidden layers), firing up the respective neurons until the final output layer is reached [14]. A schematic representation of a neural network is presented in Fig. 2, where the various components of the network such as input, hidden and output layers are explained.

Fig. 2
figure 2

A schematic representation of a standard artificial neural network. The network is divided into three major components: the input layer, multiple hidden layers and the output layer. In this figure, it is assumed that the input layer has 3 independent variables, each of which is parsed through a set of weights and activation functions in the hidden layers and finally output layers to yield the model output. The activation functions are nonlinear mathematical function such as Tanh, Sigmoid, ReLU, etc. to induce nonlinearity to the model. Depending on the network structure, there may be ‘n’ neurons (also called hidden layer units) in each hidden layer and there may be multiple hidden layers. Any ANN with more than one hidden layer is technically is deep ANN. Once an input is fed into the network, one after another, each hidden layers gets operated among each other till finally the output layer is reached and activated, producing the final result. Weights in each layer is trained by means of the backpropagation algorithm

In neural networks, the direction of information flow is determined by the internodal connections. On this basis, there are two classifications of neural networks: (i) unidirectional flow which can further be divided into cascade forward and feed-forward; and (ii) bidirectional flow, also known as recurrent flow [15]. In feed-forward networks, the flow of information between the layers takes place in one direction. Cascade forward is similar except that the input to the next layer is weighted. In recurrent networks, flow of information takes place in both directions. All the nodes are interconnected among each other, including self-connection. These networks are extremely complex, bulky, difficult to operate and take up a large amount of computational space. In addition to this, some neural networks architectures, such as self-organizing networks, convolutional neural networks (CNN), variational auto encoders (VAE) and generative adversarial networks (GAN) [16, 17] have recently attracted great attenttion in the DL community.

Various parameters are used to evaluate the classification performance of the developed model. Some of the important parameters include accuracy, sensitivity/recall rate, specificity rate, precision/positive predictive value, negative predictive value and F1score [18]. The performance of models is evaluated by calculating various ratios involving true positives, false positives, true negatives and false negatives. All of these can also be combined into a single confusion matrix to assess the model’s performance [19]. Furthermore, the phenomena of overfitting and underfitting are widely faced while employing ML models [20]. Overfitting occurs when the model is fitted with respect to the noise in the data rather than the underlying signal. The validation data error increases while the training data error decreases [21]. On the other hand, underfitting is the reverse scenario. In this case, the model is not capable of recognizing data variability [22]. Several techniques including penalty methods, training by early stopping, batch normalization, and dropouts are being developed to avoid such imperfect fittings [23].

Main text

Application of machine learning in understanding plant viruses and viral infections

Diagnosis and detection of plant viral diseases

Plant viruses pose major economical constraints in cultivated crop plants across the world. Early detection of plant viral infection is crucial for successful disease management. An empirical evaluation through visible survey is traditionally followed by farmers to identify the symptoms of virus infected plants. The visual assessment bias dictates the inefficiency and inaccuracy of this method. On the other hand, laboratory-based detection techniques are primarily reliant on polymerase chain reaction (PCR) and serological-based method such as enzyme linked immune sorbent assay (ELISA). Despite their improved accuracy, the requirement for professional experts and their time-consuming and invasive nature pinpoints the shortcomings of these diagnosis assays [24, 25]. A pathogen attack significantly alters the biochemical and biophysical state of the plant leading to an alteration of tissue structure, water level and transpiration rate, ultrastructure of chloroplast and pigment content [26, 27]. At the very beginning of the twenty-first century, a few studies used remote sensors to capture and detect altered leaf reflectance and thermography profiling of diseased plants, which empowered the scientific community with an edge in phenotyping of stressed plants [28]. However, this technique was unable to determine if the stress was biotic or abiotic, and if biotic, what kind of pathogen was involved. Hyper-spectral imaging (HSI) and ML assisted data analysis are now revolutionizing the concept of stress phenotyping of diseased plants by enabling the diagnosis of specific plant diseases and even the severity of the disease. In the case of HSI, a light spectrum with a larger range of wavelengths is being used to capture plant images, which enables us to go beyond the limited range of human vision (400-700 nm) in monitoring minor alterations in the growth and development of plants [29]. For ML assisted detection of plant viral diseases, first a ML model has to be trained with a training dataset (images of diseased plants captured through unmanned aerial vehicle, grounded robots or even smartphones) [30,31,32,33,34,35,36,37,38]. There are increasing numbers of free online databases which provide images of specific plant diseases as training datasets. ‘Plantvillage’ is one such initiative [39]. Once a certain ML model has been trained accurately and precisely, a testing dataset (eg: hyperspectral images of specific plants under diagnosis) can be assessed [32, 34]. HSI generates high dimensional data with redundant information and hence, an efficient pre-processing of the data is crucial for the precise functioning of the model. An effective specific range of wavelength can be determined to reduce the dimensionality of HSI data [29]. The next step is feature extraction which minimizes the number of the features present in the raw dataset. The feature extraction method is vital for assuring a simple classifier with a limited variety of features, since multifeatured classification always hinders the smooth performance of the concerned model [40, 41]. ML researchers have devised a variety of feature extraction techniques based on the nature of the data and the model. However, the process is time-consuming and the success of the operation greatly relies on the expertise of the professional. Here comes the benefits of using DL techniques for feature extraction as DL empowers automatic extraction of features rather than handcrafted method used in traditional ML algorithms, for instance the application of convolutional neural networks. DL has substantially improved the reliability of plant stress phenotyping by enabling the accommodation of a large sample size for training and testing [40]. A major constraint of this method is the vast variation of environmental conditions between the field and the lab. While consistent temperature, humidity, and light intensity are maintained in the lab, all of these variables are constantly changing in the field, influencing the captured images [42]. Hence, it is recommended to use field images to train a model since it has been demonstrated that a classifier trained on field images can also classify lab-based images with precision [43]. The lack of availability of a huge collection of field-based images of a specific plant disease is another key challenge for an accurate and reliable diagnosis.

Transfer learning is a recent advancement in the field of ML which enables the data scientists to adopt a previously well-trained model for solving a similar kind of problems [40]. For example-a model trained for chilli-leaf curl disease detection may be used for detecting leaf curl symptoms caused by viruses in tomato (Fig. 3). There are several approaches to adopt a pre-trained model; one can select and finetune the architecture and/or parameters of a model depending upon the types of datasets. Table 1 summarizes the development of ML assisted diagnosis of plant viral diseases over last few years.

Fig. 3
figure 3

Application of ML in understanding plant virus pathogenesis. ML enables early diagnosis of plant viral diseases at field level through analyzing hyperspectral images. Metagenomics study of diseased plant samples helps identification of related and unrelated viral genomes. ML can assist in the classification of these viral sequences which primes our understanding of virus evolution. Furthermore, ML-assisted bioinformatics tools have been developed to identify viral suppressors of RNA silencing (VSRs). ML can also guide us to predict the sub-cellular localization and even the structure of the viral proteins. Prediction of accurate structures of virus encoded proteins may help to identify inhibitors of these effector proteins. To understand the host response, several groups have performed transcriptome, proteome and metabolome of virus infected plants. ML can prime the accurate and fast analysis of these high throughput data to identify gene regulatory networks (GRN) and novel host factors involved in host-virus interplay. Characterization of these host factors in terms of sub-cellular localization and structure prediction will boost understanding of plant virus pathogenesis. ML may also assist plant virologists in genomic selection to identify elite virus resistant cultivars. This figure was created using BioRender (

Table 1 The application of ML-assisted diagnosis of plant viral diseases

Understanding the diversity and emergence of plant viruses

The recent trend of studying plant virome through metagenomics has unveiled the diversity of plant viruses. Huge numbers of phylogenetically related and unrelated virus species have been found in diseased samples [44, 45]. Explosion in virome data generated through NGS necessitates the urgent structural orientation and analysis of sequence data in order to understand the actual portrait of the viral diversity. Although a significant progress has been followed up in the case of animal viruses, limited efforts have yet been recorded in the field of plant virology [46, 47]. V-pipe has provided a bioinformatics pipeline for analyzing genomic diversity of human immunodeficiency virus (HIV) from sequencing data [48].

As RNA viruses use error-prone polymerases during their replication, the chances of mutations in their genome sequences remain quite high. Mutation in the viral genome finally leads to the emergence of new virulent viral strains [49]. A neural network-based model can predict probable point mutations in the RNA sequence. It has been successfully explored in the case of newcastle virus [50], and its optimized form may be very useful for the prediction of mutations in plant viral genome (Fig. 3). Besides RNA viruses, DNA viruses also possess significant genetic variations. Events like recombination and genome reassortment play crucial role in mediating the emergence of new viral forms [51, 52]. The identification of novel virus and satellite molecules through metagenomics approach emphasizes the importance of precise taxonomic classification followed by demarcation of these new species. An excellent effort by Silva and collaborators have developed Fangorn Forest, a ML based method, for classification of geminiviruses. Among the three tested algorithms, random forest (RF) has proven to be best in classification of genes and genera of this largest plant virus family [53]. Recently, a CNN guided sequencing platform has successfully completed human genome sequencing within couple of hours and efficiently identified the disease-causing variations in the genome. This ML based fastest sequencing approach may open up new windows in studying diversity and evolution of plant viruses [54].

Understanding host-virus interplay

Being obligate parasites, viruses rely on cellular machineries of plants for every aspect of pathogenesis including replication, gene expression and movement [55]. Plants elicit a robust antiviral immune response to restrict viral invasion [56]. Viruses encode effector proteins which disarm plant defense signaling. This tug of war continues which fuels the co-evolution of both virus and host [57]. Hence, understanding the interplay between plant and viruses is crucial for an in-depth dissection of viral pathogenesis.

Although plants have evolved a variety of tools and tactics to prevent virus multiplication, the resistance (R) protein-mediated immune response and gene silencing are the most well-known features of their antiviral defense [56]. A majority of canonical R-proteins contain nucleotide binding site leucine-rich repeats (NBS-LRR), which mediate direct or indirect recognition of virus-encoded effector proteins, resulting in the activation of effector triggered immunity (ETI). Very few R-genes imparting immune response against viruses have been identified and characterised till date, which limits our knowledge regarding the detailed mechanism of dominant resistance in plant virus interaction [58]. Support vector machine-assisted development of a high throughput bioinformatics tool, NBSPred, precisely identifies NBS-LRR containing R proteins from genome, transcriptome and proteome data [59]. Receptor-like kinases (RLK) are crucial players in the immune perception of phytopathogens, many of them acting as pattern recognition receptors (PRRs) which lead to induction of pattern triggered immunity (PTI) [60]. However, several plant viruses target RLKs to promote viral pathogenesis [61]. Brustolini et al. have recently developed a machine learning assisted technique for detection of RLKs from proteome data. Identification and annotation of novel RLKs may advance our current understanding of plant-virus interactions [62]. Furthermore, to identify host factors differentially regulated in host-virus interplay, several groups have performed transcriptome and proteome analysis in both resistant and susceptible plant varieties. These studies have revealed that a significant proportion of differentially expressed transcripts are of unknown nature suggesting the existence of novel gene regulatory networks (GRNs) modulating the host-virus interaction [63,64,65,66]. ML helps biologists to predict GRNs from high-throughput transcriptome data [67] which may lead to identification of several regulatory nodes of plant immune signalling (Fig. 3).

On the other hand, viruses encode few but multitasking effector proteins which facilitate the viral pathogenesis. Examining the sub cellular localization of these effector proteins is important to understand their mechanism of action. Furthermore, viruses also redirect the subcellular localization of several host proteins to disrupt their assigned functions [68]. ML assisted development of online tools such as LOCALIZER and MU-LOC enable precise as well as accurate analysis of subcellular localization of effector proteins and host factors by simply using amino acid sequences of proteins as input (Fig. 3) [69, 70]. Application of ML in the successful prediction of fungal effector proteins has added an extra edge in phytopathology research [71]. In the case of viruses, some viral effector proteins have been evolved to block antiviral gene silencing, known as viral suppressors of RNA silencing (VSRs). VSRs expand the negative impact of viral diseases by promoting synergistic associations among different plant viruses [72]. Jagga et al. have developed a bioinformatics platform, pVsupPred, for the prediction of VSRs encoded by plant-infecting viruses. They have used four classifier models including LibSVM, J48, Naı¨ve Bayes and RF, and among all of them, RF algorithm has emerged as the best with an overall accuracy of 86.11% [73]. Later on, in another study, sequential minimal optimization (SMO) algorithm had been improved to achieve an overall accuracy of 95.3% for the successful identification of plant virus encoded VSRs (Fig. 3) [74].

Another significant facet of plant-virus interaction is the virus induced alteration of microRNA (miRNA) homeostasis which impacts the transcriptome profile of the infected cells. Hence, it is important to identify the accurate targets of specific miRNAs regulating plant immunity and viral pathogenesis [75]. The advent of ML in advancing the scope of bioinformatics has significantly eased this difficult job. Supervised ML approaches including graphical models, kernel machines and evolutionary algorithms are being widely used to identify the specific miRNA targets in eukaryotes [76]. Further, a new category of DL models known as graph neural nets (GNN) is emerging as a promising tool in bioinformatics. The biological networks, based on small RNAs–disease associations, can be constructed as graphs with nodes and edges. GNN can operate on the graphical data and has more representative features, which can be efficiently used for inferences [77].

Finally, the best possible way to understand the functional aspect of a protein is to visualize its accurate structure. A very small proportion of plant proteins involved in immune signalling have been structurally characterised yet. In addition, structures of plant viral proteins are also largely unresolved. Labour intensive methods of protein crystallization is the major bottleneck here. However, Jumper and collaborators have revolutionised the idea of protein structure prediction by launching Alphafold2, a neural network-assisted structural bioinformatics platform, which can successfully solve a protein structure with almost equivalent experimental accuracy even if there is no similar protein structure available [78]. ML-guided docking studies efficiently screen chemical inhibitors of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) encoded spike (S) protein [79]. Similarly, structure prediction of plant viral proteins and the prediction of their chemical inhibitors followed by successful delivery will be a novel and effective virus management strategy (Fig. 3).

Conclusions and future perspectives

Plant-infecting viruses not only compromise the yield of the infected plants but also significantly affect the nutritional content of crops. Heavy crop losses due to plant viral outbreaks is a vital concern for global food security and hence necessitates the urgent implementation of smart management measures. Studies aiming to understand the evolutionary biology of plant viruses and the molecular biology of plant-virus interactions have generated large-sized datasets in recent years. Here comes the prospective role of ML. Although the last decade has witnessed a sharp increase of application of ML in solving complex biological problems [80], its usage in the field of plant virology is still at a very naïve state. Several reports highlighted the role of ML in the precise diagnosis of plant viral diseases [30,31,32,33,34,35,36,37,38]. Plant virologists can foresee the tremendous scope of ML in addressing virus evolution, emergence, plant-virus interplay and above all management strategies (Table 2). Moreover, several specific issues need to be explored.

Table 2 Brief description of ML-based bioinformatics platforms used in studying plant virus interactions

Firstly, a vast amount of OMICS data (including transcriptome, proteome and metabolome) of virus infected plants are available. The application of ML may enable the integration of these OMICS data which will definitely uplift our knowledge of host response, especially the impact of novel potential host factors in viral infection. Secondly, the coordinated application of ML and HSI pave a new path in the detection of viral diseases. Now, designing improved DL-based algorithms for easily accessible mobile app-mediated detection of plant viral diseases is the need of the hour. Thirdly, ML algorithms would be very helpful in deciphering the patterns and parameters of plant viral evolution. ML may enable the accurate prediction of recombination, rate of nucleotide substitutions, mutations and phylogenetic relatedness among viruses. The use of ML to predict the advent of aggressive recombinant plant virus strains and the likelihood of associated epidemics will be extremely beneficial. Fourthly, a significant improvement can be achieved in metagenomics data analysis through ML approaches. Along with viral reads, metagenomics data also contains a substantial proportion of host contig contaminations which often hinders the identification of small viral reads. VirFinder is one such k-mer based platform which enables the identification of prokaryotic virus sequences from mixed metagenomic data [81]. A similar approach can be employed for plant virome study. Fifthly, ML is now widely utilized in genomic selection for rapid and better prediction of superior genotypes for breeding purposes [82, 83]. A certain progress of ML assisted genomic prediction will definitely help breeders in developing elite virus tolerant/ resistant varieties. Finally, a collaborative effort by both plant virologists and big data analysts is of prime importance for the fruitful application of ML in the understanding of plant virus pathogenesis followed by the development of antiviral strategies.

Availability of data and materials

Not applicable.



Next generation sequencing


Machine learning


Deep learning


Support vector machines


Density-based spatial clustering of applications with noise


Artificial neural network


Convolutional neural networks


Variational auto encoders


Generative adversarial networks


Polymerase chain reaction


Enzyme-linked immune sorbent assay


Hyper-spectral imaging


Human immunodeficiency virus


Ribonucleic acid


Deoxyribonucleic acid

R protein:

Resistance protein


Nucleotide binding site leucine-rich repeats


Effector triggered immunity


Pattern triggered immunity


Receptor-like kinases


Pattern recognition receptors


Gene regulatory networks


Viral suppressors of RNA silencing


Random forest


Sequential minimal optimization




Graph neural nets


Severe acute respiratory syndrome coronavirus 2

S protein:

Spike protein


Convolutional neural networks


Outlier removal auxiliary classifier generative adversarial nets


Support vector machine


Successive projections algorithm


Extreme learning machine

VGG 16:

Visual geometry group 16


Boosted regression tree


Tomato spotted wilt virus


Groundnut bud necrosis virus


Tobacco mosaic virus


  1. Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow P-M, Zietz M, Hoffman MM, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15:20170387.

    PubMed  PubMed Central  Google Scholar 

  2. Xu C, Jackson SA. Machine learning and complex biological data. Genome Biol. 2019;20:76.

    PubMed  PubMed Central  Google Scholar 

  3. Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine learning for integrating data in biology and medicine: principles, practice, and opportunities. Information Fusion. 2019;50:71–91.

    PubMed  Google Scholar 

  4. Altman N, Krzywinski M. The curse(s) of dimensionality. Nat Methods. 2018;15:399–400.

    CAS  PubMed  Google Scholar 

  5. Bzdok D, Altman N, Krzywinski M. Statistics versus machine learning. Nat Methods. 2018;15:233–4.

    CAS  PubMed  PubMed Central  Google Scholar 

  6. Webb S. Deep learning for biology. Nature. 2018;554:555–8.

    CAS  PubMed  Google Scholar 

  7. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.

    CAS  Google Scholar 

  8. Singh A, Ganapathysubramanian B, Singh AK, Sarkar S. Machine learning for high-throughput stress phenotyping in plants. Trends Plant Sci. 2016;21:110–24.

    CAS  PubMed  Google Scholar 

  9. Tarca AL, Carey VJ, Chen X, Romero R, Drăghici S. Machine learning and Its applications to biology. PLoS Comput Biol. 2007;3:e116.

    PubMed  PubMed Central  Google Scholar 

  10. Prasad V, Gupta SD. Applications and potentials of artificial neural networks in plant tissue culture. In: Plant tissue culture engineering. Springer; 2008. p. 47–67

  11. Weston J, Leslie C, Ie E, Zhou D, Elisseeff A, Noble WS. Semi-supervised protein classification using cluster kernels. Bioinformatics. 2005;21:3241–7.

    CAS  PubMed  Google Scholar 

  12. Tang B, Pan Z, Yin K, Khateeb A. Recent advances of deep learning in bioinformatics and computational biology. Front Genet. 2019;10:214.

    PubMed  PubMed Central  Google Scholar 

  13. Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al. Human-level control through deep reinforcement learning. Nature. 2015;518:529–33.

    CAS  PubMed  Google Scholar 

  14. Krenker A, Bešter J, Kos A. Introduction to the artificial neural networks. Artificial neural networks: methodological advances and biomedical applications. InTech 2011. pp. 1–18.

  15. Osama K, Mishra BN, Somvanshi P: Machine learning techniques in plant biology. In PlantOmics: The omics of plant science. Springer; 2015. Pp. 731–54.

  16. Yao X. Evolving artificial neural networks. Proc IEEE. 1999;87:1423–47.

    Google Scholar 

  17. Yang ZR. A novel radial basis function neural network for discriminant analysis. IEEE Trans Neural Netw. 2006;17:604–12.

    PubMed  Google Scholar 

  18. Taner A, Öztekin YB, Duran H. Performance analysis of deep learning CNN models for variety classification in hazelnut. Sustainability. 2021;13:6527.

    Google Scholar 

  19. Hassan SM, Maji AK, Jasiński M, Leonowicz Z, Jasińska E. Identification of plant-leaf diseases using CNN and transfer-learning approach. Electronics. 2021;10:1388.

    Google Scholar 

  20. Smith LN. A disciplined approach to neural network hyper-parameters: Part 1--learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820. 2018.

  21. Sarle WS. Stopped training and other remedies for overfitting. Comput Sci Stat. 1996;66:352–60.

    Google Scholar 

  22. Van der Aalst WM, Rubin V, Verbeek H, van Dongen BF, Kindler E, Günther CW. Process mining: a two-step approach to balance between underfitting and overfitting. Softw Syst Model. 2010;9:87–111.

    Google Scholar 

  23. Murakoshi K. Avoiding overfitting in multilayer perceptrons with feeling-of-knowing using self-organizing maps. BioSystems. 2005;80:37–40.

    PubMed  Google Scholar 

  24. Rubio L, Galipienso L, Ferriol I. Detection of plant viruses and disease management: relevance of genetic diversity and evolution. Front Plant Sci. 2020;11:1092.

    PubMed  PubMed Central  Google Scholar 

  25. Varma A, Singh MK. Chapter 6—Diagnosis of plant virus diseases. In: Awasthi LP, editor. Applied plant virology. Academic Press; 2020. p. 79–92.

    Google Scholar 

  26. Bhattacharyya D, Gnanasekaran P, Kumar RK, Kushwaha NK, Sharma VK, Yusuf MA, Chakraborty S. A geminivirus betasatellite damages the structural and functional integrity of chloroplasts leading to symptom formation and inhibition of photosynthesis. J Exp Bot. 2015;66:5881–95.

    CAS  PubMed  PubMed Central  Google Scholar 

  27. Pallas V, García JA. How do plant viruses induce disease? Interactions and interference with host components. J Gen Virol. 2011;92:2691–705.

    CAS  PubMed  Google Scholar 

  28. Landgrebe DA. Signal theory methods in multispectral remote sensing. Wiley; 2003.

    Google Scholar 

  29. Lowe A, Harrison N, French AP. Hyperspectral image analysis techniques for the detection and classification of the early onset of plant disease and stress. Plant Methods. 2017;13:80.

    PubMed  PubMed Central  Google Scholar 

  30. Kawasaki Y, Uga H, Kagiwada S, Iyatomi H: Basic study of automated diagnosis of viral plant diseases using convolutional neural networks. In: International symposium on visual computing. Springer; 2015. pp. 638–45.

  31. Ramcharan A, Baranowski K, McCloskey P, Ahmed B, Legg J, Hughes DP. Deep learning for image-based cassava disease detection. Front Plant Sci. 1852;2017:8.

    Google Scholar 

  32. Wang D, Vinson R, Holmes M, Seibel G, Bechar A, Nof S, Tao Y. Early detection of tomato spotted wilt virus by hyperspectral imaging and outlier removal auxiliary classifier generative adversarial nets (OR-AC-GAN). Sci Rep. 2019;9:1–14.

    Google Scholar 

  33. Kadam KU. Identification of groundnut bud necrosis virus on tomato fruits using machine learning based segmentation algorithm. Int J Fut Gener Commun Netw. 2020;13:259–64.

    Google Scholar 

  34. Zhu H, Chu B, Zhang C, Liu F, Jiang L, He Y. Hyperspectral imaging for presymptomatic detection of tobacco disease with successive projections algorithm and machine-learning classifiers. Sci Rep. 2017;7:1–12.

    Google Scholar 

  35. Gu Q, Sheng L, Zhang T, Lu Y, Zhang Z, Zheng K, Hu H, Zhou H. Early detection of tomato spotted wilt virus infection in tobacco using the hyperspectral imaging technique and machine learning algorithms. Comput Electron Agric. 2019;167:105066.

    Google Scholar 

  36. Griffel L, Delparte D, Edwards J. Using support vector machines classification to differentiate spectral signatures of potato plants infected with Potato Virus Y. Comput Electron Agric. 2018;153:318–24.

    Google Scholar 

  37. Joshi RC, Kaushik M, Dutta MK, Srivastava A, Choudhary N. VirLeafNet: Automatic analysis and viral disease diagnosis using deep-learning in Vigna mungo plant. Ecol Inf. 2021;61:101197.

    Google Scholar 

  38. Chakraborty S, Kodamana H, Chakraborty S. Deep learning aided automatic and reliable detection of tomato begomovirus infections in plants. J Plant Biochem Biotechnol. 2021;66:1–8.

    Google Scholar 

  39. Gao Z, Luo Z, Zhang W, Lv Z, Xu YJA. Deep learning application in plant stress imaging: a review. AgriEngineering. 2020;2:430–46.

    Google Scholar 

  40. Singh AK, Ganapathysubramanian B, Sarkar S, Singh A. Deep Learning for plant stress phenotyping: trends and future perspectives. Trends Plant Sci. 2018;23:883–98.

    CAS  PubMed  Google Scholar 

  41. Chen Y-M, Zu X-P, Li D. Identification of proteins of Tobacco mosaic virus by using a method of feature extraction. Front Genet. 2020;11:1186.

    Google Scholar 

  42. Mahlein A-K. Plant disease detection by imaging sensors—parallels and specific demands for precision agriculture and plant phenotyping. Plant Dis. 2016;100:241–51.

    PubMed  Google Scholar 

  43. Ferentinos KP. Deep learning models for plant disease detection and diagnosis. Comput Electron Agric. 2018;145:311–8.

    Google Scholar 

  44. Roossinck MJ. Plant virus metagenomics: biodiversity and ecology. Annu Rev Genet. 2012;46:359–69.

    CAS  PubMed  Google Scholar 

  45. Stobbe AH, Roossinck MJ. Plant virus metagenomics: what we know and why we need to know more. Front Plant Sci. 2014;5:150.

    PubMed  PubMed Central  Google Scholar 

  46. Li J, Zhang S, Li B, Hu Y, Kang X-P, Wu X-Y, Huang M-T, Li Y-C, Zhao Z-P, Qin C-F. Machine learning methods for predicting human-adaptive influenza A viruses based on viral nucleotide compositions. Mol Biol Evol. 2020;37:1224–36.

    CAS  PubMed  Google Scholar 

  47. Randhawa GS, Soltysiak MP, El Roz H, de Souza CP, Hill KA, Kari L. Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study. PLoS ONE. 2020;15:e0232391.

    CAS  PubMed  PubMed Central  Google Scholar 

  48. Posada-Céspedes S, Seifert D, Topolsky I, Jablonski KP, Metzner KJ, Beerenwinkel N. V-pipe: a computational pipeline for assessing viral genetic diversity from high-throughput data. Bioinformatics. 2021;6:66.

    Google Scholar 

  49. Elena SF, Agudelo-Romero P, Carrasco P, Codoñer FM, Martín S, Torres-Barceló C, Sanjuán R. Experimental evolution of plant RNA viruses. Heredity. 2008;100:478–83.

    CAS  PubMed  Google Scholar 

  50. Salama MA, Hassanien AE, Mostafa A. The prediction of virus mutation using neural networks and rough set techniques. EURASIP J Bioinf Syst Biol. 2016;2016:1–11.

    Google Scholar 

  51. Kumar RV, Singh AK, Singh AK, Yadav T, Basu S, Kushwaha N, Chattopadhyay B, Chakraborty S. Complexity of begomovirus and betasatellite populations associated with chilli leaf curl disease in India. J Gen Virol. 2015;96:3143–58.

    CAS  PubMed  Google Scholar 

  52. Devendran R, Kumar M, Ghosh D, Yogindran S, Karim MJ, Chakraborty S. Capsicum-infecting begomoviruses as global pathogens: host-virus interplay, pathogenesis, and management. Trends Microbiol. 2021;6:66.

    Google Scholar 

  53. Silva JCF, Carvalho TFM, Fontes EPB, Cerqueira FR. Fangorn Forest (F2): a machine learning approach to classify genes and genera in the family Geminiviridae. BMC Bioinformatics. 2017;18:431.

    PubMed  PubMed Central  Google Scholar 

  54. Gorzynski JE, Goenka SD, Shafin K, Jensen TD, Fisk DG, Grove ME, Spiteri E, Pesout T, Monlong J, Baid G, et al. Ultrarapid nanopore genome sequencing in a critical care setting. New Engl J Med. 2022;6:66.

    Google Scholar 

  55. Mandadi KK. Scholthof K-BG: Plant immune responses against viruses: How does a virus cause disease? Plant Cell. 2013;25:1489–505.

    CAS  PubMed  PubMed Central  Google Scholar 

  56. Calil IP, Fontes EPB. Plant immunity against viruses: antiviral immune receptors in focus. Ann Bot. 2017;119:711–23.

    CAS  PubMed  Google Scholar 

  57. Wu X, Valli A, García JA, Zhou X, Cheng X. The tug-of-war between plants and viruses: great progress and many remaining questions. Viruses. 2019;66:11.

    Google Scholar 

  58. Kourelis J, van der Hoorn RAL. Defended to the Nines: 25 years of resistance gene cloning identifies nine mechanisms for R protein function. Plant Cell. 2018;30:285–99.

    CAS  PubMed  PubMed Central  Google Scholar 

  59. Kushwaha SK, Chauhan P, Hedlund K, Ahrén D. NBSPred: a support vector machine-based high-throughput pipeline for plant resistance protein NBSLRR prediction. Bioinformatics. 2015;32:1223–5.

    PubMed  Google Scholar 

  60. Tang D, Wang G, Zhou JM. Receptor kinases in plant-pathogen interactions: more than pattern recognition. Plant Cell. 2017;29:618–37.

    CAS  PubMed  PubMed Central  Google Scholar 

  61. Macho AP, Lozano-Duran R. Molecular dialogues between viruses and receptor-like kinases in plants. Mol Plant Pathol. 2019;20:1191–5.

    PubMed  PubMed Central  Google Scholar 

  62. Brustolini OJ, Silva JC, Sakamoto T, Fontes EP. Bioinformatics analysis of the receptor-like kinase (RLK) superfamily. Methods Mol Biol. 2017;1578:123–32.

    CAS  PubMed  Google Scholar 

  63. Liu D, Zhao Q, Cheng Y, Li D, Jiang C, Cheng L, Wang Y, Yang A. Transcriptome analysis of two cultivars of tobacco in response to Cucumber mosaic virus infection. Sci Rep. 2019;9:3124.

    PubMed  PubMed Central  Google Scholar 

  64. Liu Y, Liu Y, Spetz C, Li L, Wang X. Comparative transcriptome analysis in Triticum aestivum infecting wheat dwarf virus reveals the effects of viral infection on phytohormone and photosynthesis metabolism pathways. Phytopathol Res. 2020;2:3.

    Google Scholar 

  65. Rajamäki M-L, Sikorskaite-Gudziuniene S, Sarmah N, Varjosalo M, Valkonen JPT. Nuclear proteome of virus-infected and healthy potato leaves. BMC Plant Biol. 2020;20:355.

    PubMed  PubMed Central  Google Scholar 

  66. Sade D, Shriki O, Cuadros-Inostroza A, Tohge T, Semel Y, Haviv Y, Willmitzer L, Fernie AR, Czosnek H, Brotman Y. Comparative metabolomics and transcriptomics of plant response to Tomato yellow leaf curl virus infection in resistant and susceptible tomato cultivars. Metabolomics. 2015;11:81–97.

    CAS  Google Scholar 

  67. Mochida K, Koda S, Inoue K, Nishii R. Statistical and machine learning approaches to predict gene regulatory networks from transcriptome datasets. Front Plant Sci. 2018;9:1770.

    PubMed  PubMed Central  Google Scholar 

  68. Rodriguez-Peña R, Mounadi KE, Garcia-Ruiz H. Changes in subcellular localization of host proteins induced by plant viruses. Viruses. 2021;13:677.

    PubMed  PubMed Central  Google Scholar 

  69. Zhang N, Rao RSP, Salvato F, Havelund JF, Møller IM, Thelen JJ, Xu D. MU-LOC: a machine-learning method for predicting mitochondrially localized proteins in plants. Front Plant Sci. 2018;9:634.

    PubMed  PubMed Central  Google Scholar 

  70. Sperschneider J, Catanzariti A-M, DeBoer K, Petre B, Gardiner DM, Singh KB, Dodds PN, Taylor JM. LOCALIZER: subcellular localization prediction of both plant and effector proteins in the plant cell. Sci Rep. 2017;7:44598.

    PubMed  PubMed Central  Google Scholar 

  71. Sperschneider J, Gardiner DM, Dodds PN, Tini F, Covarelli L, Singh KB, Manners JM, Taylor JM. EffectorP: predicting fungal effector proteins from secretomes using machine learning. New Phytol. 2016;210:743–61.

    CAS  PubMed  Google Scholar 

  72. Ghosh D. M M, Chakraborty S: Impact of viral silencing suppressors on plant viral synergism: a global agro-economic concern. Appl Microbiol Biotechnol. 2021;105:6301–13.

    CAS  PubMed  Google Scholar 

  73. Jagga Z, Gupta D. Supervised learning classification models for prediction of plant virus encoded RNA silencing suppressors. PLoS ONE. 2014;9:e97446.

    PubMed  PubMed Central  Google Scholar 

  74. Nath A, Subbiah K. Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors. 3 Biotech. 2016;6:93.

    PubMed  PubMed Central  Google Scholar 

  75. Zhang B, Li W, Zhang J, Wang L, Wu J. Roles of small RNAs in virus-plant interactions. Viruses. 2019;11:827.

    PubMed Central  Google Scholar 

  76. Zhang B-T, Nam J-W. Supervised learning methods for microRNA studies. In: Machine learning in bioinformatics. 2008. pp 339–365.

  77. Zhang X-M, Liang L, Liu L, Tang M-J. Graph neural networks and their current applications in bioinformatics. Front Genet. 2021;12:690049.

    CAS  PubMed  PubMed Central  Google Scholar 

  78. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9.

    CAS  PubMed  PubMed Central  Google Scholar 

  79. Batra R, Chan H, Kamath G, Ramprasad R, Cherukara MJ, Sankaranarayanan SK. Screening of therapeutic agents for COVID-19 using machine learning and ensemble docking studies. J Phys Chem Lett. 2020;11:7058–65.

    CAS  PubMed  PubMed Central  Google Scholar 

  80. Ma C, Zhang HH, Wang X. Machine learning for big data analytics in plants. Trends Plant Sci. 2014;19:798–808.

    CAS  PubMed  Google Scholar 

  81. Ren J, Ahlgren NA, Lu YY, Fuhrman JA, Sun F. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome. 2017;5:69.

    PubMed  PubMed Central  Google Scholar 

  82. Poland J, Rutkoski J. Advances and challenges in genomic selection for disease resistance. Annu Rev Phytopathol. 2016;54:79–98.

    CAS  PubMed  Google Scholar 

  83. Crossa J, Pérez-Rodríguez P, Cuevas J, Montesinos-López O, Jarquín D, de los Campos G, Burgueño J, González-Camacho JM, Pérez-Elizalde S, Beyene Y, et al: Genomic selection in plant breeding: methods, models, and perspectives. Trends Plant Sci. 2017;22:961–75.

Download references


Not applicable.


This work was supported by a grant from the Department of Biotechnology, Govt of India vide grant no. BT/PR31648/GET/119/283/2019 to SC. HK acknowledges support from BRNS with 51/14/11/2019-BRNS and SERB India with CRG/2018/001555. DG is supported by the Prime Minister’s Research Fellow (PMRF) scholarship.

Author information




DG, SRC, HK and SC conceived the work; DG and SRC collected information, analysed and wrote the manuscript, DG, SRC, HK and SC edited the manuscript; SC and HK arranged the funding. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Supriya Chakraborty.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors have agreed for publication in the journal.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Ghosh, D., Chakraborty, S., Kodamana, H. et al. Application of machine learning in understanding plant virus pathogenesis: trends and perspectives on emergence, diagnosis, host-virus interplay and management. Virol J 19, 42 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Machine learning
  • Deep learning
  • Plant virus
  • Pathogenesis
  • Host-virus interactions
  • Evolution and emergence
  • Control options