MRF: a tool to overcome the barrier of inconsistent genome annotations and perform comparative genomics studies for the largest animal DNA virus

Background The genome of the largest known animal virus, the white spot syndrome virus (WSSV) responsible for huge economic losses and loss of employment in aquaculture, suffers from inconsistent annotation nomenclature. Novel genome sequence, circular genome and variable genome length led to nomenclature inconsistencies. Since vast knowledge has already accumulated in the past two decades with inconsistent nomenclature, the insights gained on a genome could not be easily extendable to other genomes. Therefore, the present study aims to perform comparative genomics studies in WSSV on uniform nomenclature. Methods We have combined the standard mummer tool with custom scripts to develop missing regions finder (MRF) that documents the missing genome regions and coding sequences in virus genomes in comparison to a reference genome and in its annotation nomenclature. The procedure was implemented as web tool and in command-line interface. Using MRF, we have documented the missing coding sequences in WSSV and explored their role in virulence through application of phylogenomics, machine learning models and homologous genes. Results We have tabulated and depicted the missing genome regions, missing coding sequences and deletion hotspots in WSSV on a common annotation nomenclature and attempted to link them to virus virulence. It was observed that the ubiquitination, transcription regulation and nucleotide metabolism might be essentially required for WSSV pathogenesis; and the structural proteins, VP19, VP26 and VP28 are essential for virus assembly. Few minor structural proteins in WSSV would act as envelope glycoproteins. We have also demonstrated the advantage of MRF in providing detailed graphic/tabular output in less time and also in handling of low-complexity, repeat-rich and highly similar regions of the genomes using other virus cases. Conclusions Pathogenic virus research benefits from tools that could directly indicate the missing genomic regions and coding sequences between isolates/strains. In virus research, the analyses performed in this study provides an advancement to find the differences between genomes and to quickly identify the important coding sequences/genomes that require early attention from researchers. To conclude, the approach implemented in MRF complements similarity-based tools in comparative genomics involving large, highly-similar, length-varying and/or inconsistently annotated viral genomes. Supplementary Information The online version contains supplementary material available at 10.1186/s12985-023-02035-w.


What is MRF?
MRF is a virus comparative genomics tool that compares two genomes and generates completely and partially missing coding sequences (CDS) in one genome (query) with respect to the other (reference) and presents them in a MirCos plot in addition to tabular output.
Genome-wide similarity-search tools like blast often end up reporting similar genomic regions between genomes. These tools cannot indicate genome-wide missing regions in a genome in comparison to another. The genome-wide similarity becomes irrelevant while comparing virus genomes that are highly similar but vary in genome length. Here, documentation of missing CDS in an isolate against a reference would be helpful to understand better about a virus isolate. Moreover, obtaining this information in one annotation nomenclature becomes mandatory when inconsistencies exist in annotations between genomes. In addition, comparative genomics studies involving pathogenic viruses need to interpret on missing CDS in one isolate/strain when compared to another. Therefore, a tool capable of documenting missing CDS has great potential to complement existing alignment-based similarity-search tools in virus research.

Software and availability
The MRF is available for download at https://github.com/vinayciba/MRF. Alternatively, users can access MRF as an online tool in any web browser at http://14.139.181.163/mrf.

Uses
 Main use of MRF is to know the complete and partial deletions in coding sequences of a query virus genome against a reference.  MRF is best suited for comparing closely related virus species, strains of a species or isolates of a strain. However, it can still be useful in distantly related species to counter intuitively check for the presence of a particular coding sequence.  With MRF, one can track complete or partial deletion of one or more genes in a virus genome in comparison to another.  MRF is very useful in comparing genomes having inconsistent annotation nomenclature as it gives output in the annotation nomenclature of reference.  One can quickly explore a newly sequenced virus genome for deletion of genomic regions harboring binding sites for diagnostic primers.  One can quickly compare virus genomes and interpret for deletions in virulence-related or attenuation-related gene or gene families.

Key Features
 MRF can handle virus genomes of any size. The tool has been tested with the largest known virus, Pandoravirus salinus (2,473,870 bp; NC_022098).
 MRF takes the input files and instantly produces multiple outputs in real time.  MRF prints genome coordinates for missing genomic regions as per the base positions of both reference and query genomes.  The missing coding sequences output is given in both tabular and graphic form.  Users can interactively filter out partially deleted coding sequences based on proportion of deleted sequence length.

Quick Introduction
1. The Quickest way to know how MRF works is by running the demo example facilitated in the MRF website through the Run demo button as shown in the following screenshots.
2. One click of Run demo automatically loads example datasets. Successful loading of example datasets is indicated by a message at the bottom of each of the three files. Otherwise, you may upload any other files of your interest. After files are uploaded, user would be prompted to click the Submit button.
The files used for the demo run belong to the largest animal virus, White Spot Syndrome Virus (WSSV). The query fasta file (Accession no: KX686117) has a genome size of 300,223 bp, whereas the reference fasta file (Accession no: AF332093) has a genome size of 305,119 bp. The third file is annotation feature file for the same reference genome, in GFF3 format.

3.
Once the Submit button is clicked, MRF processes the input files and generates output. The user is instantly taken to the results page. The results page is split into two sections, the right-hand section contains the list of completely and partially missing coding sequences in query genome displayed in two separate tables. The left-hand section is a Dashboard with few controls and summary of the results at the bottom.

4.
The missing coding sequences can be viewed as a circular visualization by clicking on 'Generate MirCos Plot' button. Clicking the button will prompt you to enter the title for the plot, clicking ok will reveal a textbox to enter the user preferred title for the plot. The MirCos plot has controls to zoom-in and zoom-out a particular area.

5.
To further understand how the missing coding sequences are obtained, click on 'Generate Genomic Missing Regions' button.
The first four columns give us the co-ordinates of perfect matches between reference genome and query genome. The 5 th column gives us the length of non-hit region. The 6 th and 7 th columns give us the co-ordinates of the non-hit region with respect to the reference genome.
The 5 th column (non-hit region) gives three categories of values as explained in the following Figure 1. 3. Input and Output description 3.1 Inputs 1. query fasta: Nucleotide fasta # file of the query genome in .fa, .fna or .fasta format. Files with more than one chromosome or scaffold or contig are currently not handled. 2. reference fasta: Nucleotide fasta file of the reference genome in .fa, .fna or .fasta format. Files with more than one chromosome or scaffold or contig are currently not handled.
3. reference gff3: Annotation file of the reference genome as obtained from genbank in .gff3 format.

exact match length (or mum length / -m):
The minimum length of an exact match that mummer will find between query and reference genomes being compared. All the resulting matches are greater than or equal to the length being set. [Default : 20]. The terms 'exact match length' and 'mum length' will be used interchangeably though out this document.
The default exact match length of 20 works best for most genomes. However for genomes with high mutational rates such as RNA viruses, reducing the exact match length will result in more matches which in turn prevent false reporting of missing coding sequences.
5. False match length cut off value (-l): False exact matches tend to occur at random locations with shorter mum length leading to masking of true missing regions. To safe guard against finding false matches, MRF is equipped with two parameters, false match length and offset parameters. False match length is the length cut off, where matches less than or equal to the set value [default : 15] are screened further using the offset window options to determine if they are true matches or not.
6. Negative offset and positive offset (-n & -p): These two parameters in advanced options determine whether the matches screened using 'false match length cut off value' parameter are truly a false match. It helps user to rule out false matches especially while running MRF with shorter mum length. When these options are defined, then for each exact match, MRF checks the specified number of exact matches upstream (negative offset) and downstream (positive offset) to the current exact match for the base coordinates representing sequence contiguity of the query genome. If base coordinates are not contiguous then the current exact match would be treated as false match.
At the default setting [1,1], a match in query genome is examined whether it is contiguous by looking at one upstream and one downstream match. Any match that doesn't fall in between the upstream and downstream matches is deemed to be a false match.
# Query fasta file need not necessarily contain full genome sequence. It can contain a single sequence representing partial genome to check for the presence or deletion of that sequence in the reference genome.

Outputs
1. Missing coding sequences: The MRF prints two tables listing the completely missing coding sequences and partially missing coding sequences (along with base position coordinates) in the query in comparison to reference genome. These tables can be downloaded using Download feature in Dashboard.

Missing genomic regions:
The MRF also prints the details of exact matches; missing genomic regions and point mutations located between two exact matches, if any. This table can be downloaded using Download feature in Dashboard.

Completely present coding sequences:
This output can only be downloaded to a file but cannot be visualized in results page. This file consists of the list of completely present coding sequences in the query with respect to the reference.

MirCos plot:
It is a graphical plot representing the reference genome (outer ring); the genes present on sense strand of reference genome (second ring from outside); the genes present on anti-sense strand of reference genome (third ring from outside); completely missing coding sequences (red), partially missing coding sequences (orange), and completely present coding sequences (yellow) of query genome (inner ring). The MirCos plot is available for download with and without the legend.

How MRF works
MRF relies on the exact matches generated by MUMmer (mummer -mum -l 20 Reference.fasta Query.fasta) to extract the successive gaps between those matches. MUMmer generates a three column output which is read as reference_match_start_position, query_match_start_position and exact_match_length or as described in the MUMmer 3 manual "For each match, the three columns list the position in the reference sequence, the position in the query sequence, and the length of the match respectively". To know more about how mummer works, please visit the MUMmer home page. The head of the example ouput is shown below.
The above output is utilized to generate the end positions of the query and the reference and there after the coordinates of the missing regions. A screenshot of which is shown below.

Working with advanced options
The default exact match length set in MRF is 20, therefore each match is 20 bp or longer when run with this setting. At this point, the default value of false match length of 15 and its corresponding offset window does not affect the output as the exact match length is greater than the false match length. For the below example, the default setting yields sufficiently large matches with no false matches.
However when the exact match length is reduced to 5 and offset window is set to 0, many false matches appear which will affect the output.
It can be seen that there are two false matches resulted from low mum length. However, when False match length and offset window parameters are set appropriately, they are eliminated. Observe the output summary given below the 'Dashboard' window. Two and eight CDS were completely missing and partially missing # respectively in the query genome and the total deleted coding sequence length was 6924 bp. These large numbers are owed to the inability of MRF to obtain exact matches between query and reference with a mum length of 20. The regions with mismatches prevent the formation of exact matches of 20 bp and these regions would be counted in missing regions list. This issue can be mitigated by reducing the mum length.
# Though the reference has only 9 proteins, tat protein has two isoforms and one isoform is completely lost while the other is partially lost, so the discrepancy in number Reduced mum length resulted in a decrease in 'deleted coding sequences length'. With shorter mum length, more exact matches could be made between query and reference which were not possible with a mum length of 20. Now, observe the missing genomic regions table. We observe a new problem here. Some of the exact matches were false. This might be due to formation of exact matches at random locations in genome due to short mum length. Notice the highlighted false matches in the screenshot given below.
From the above image, it can be observed that there are few false matches, which in turn mask the true missing regions. For example in the above case, the actual missing region is from 1-160 bp of the reference, which should give a non-hit region of 160bp.
Since there is a false hit at base 90-99, the non-hit region is split in to two regions of 89 bp and 61 bp separated by a false match region (90-99 bp). These false matches can be eliminated by utilizing the false match and offset window parameters.
6.3 Mum length -10, False match length -15, Offset window -1,1 With the new parameters, many false matches were eliminated and many missing regions have appeared (some were extended and some were merged due to the elimination of false matches). This can be clearly observed in missing regions table in the following image where the elimination of false matches was depicted by a red line.
The false matches reported in the previous run (run 6.2) were eliminated by setting the false match cut-off to 15.
To further circumvent the errors in reporting the continuous stretch of exact matches, the mum length can be further reduced to 5, however it will drastically result in many more false matches. How the results would vary and how they can be mitigated is demonstrated below 6.4 Mum length -5, False match length -0, Offset window -0,0 Due to lowering of exact match length to 5, several false matches have resulted, some of which can be eliminated when MRF is run with the below settings.

Filtering partially missing coding sequences
By default, MRF prints all the partially missing coding sequences irrespective of their length or percentage of deletion. However, this output can be controlled by setting a threshold under the 'Filter Partially Missing Coding sequences' tab.
The below example show all the partially missing coding sequences by using exact match length of 10.
It is evident that many are miniscule missing coding sequences and most likely insignificant. By setting the threshold as 10 percent, most of them are filtered out from the results. This is shown in the below screenshots.

MRF for comparing multiple genomes -Batch mode
The standard usage of MRF is to compare a query genome with a reference genome and get the deletions in coding sequences (CDS) of query genome in the nomenclature of the reference genome. This gives a complete picture of the CDS in terms of complete, partial and no deletions in a MirCos plot along with the detailed coordinates in tables. While this usage is adequate for passively sequenced virus genomes such as those involved in causing diseases in poultry, aquaculture and other organisms of interests. However in circumstances such as viruses affecting human populations causing pandemic or endemic, there will be a concerted effort to sequence genomes in a large scale resulting in thousands of complete genomes available online eg. SARS-COV2, HIV, ZIKA etc. It is worthwhile noticing that even passively sequenced virus genomes are in large numbers.
In these scenarios, the standard usage of MRF may not be well suited, as one would want to see all the genomes at once and infer a pattern or narrow down the ones that are affected most. This can be achieved by running the MRF-batch program. This program is currently command line only.

Installation Dependencies
There are very minimal dependencies for running MRF-batch program. Verify if you have the following and install accordingly. Alternatively one can create a conda environment with requisite packages (scroll below)

Running MRF-batch program
MRF-batch program is demonstrated with WSSV and SARS-COV2 test cases. The main program which takes care of everything is batch-run-mrf-wrapper.sh, it first calls the MRF-batch.pl program to generate missing regions and missing coding sequences for all the queried genomes. Next these outputs are parsed with the help of a program called parse-mrf-output.pl to generated three files, summaryAll.txt, cdsALL.txt and cdsHeatmap.txt. Finally, a R script, genPlots.R will utilize those three files and generates the plots.
Users can choose to run the programs parse-mrf-output.pl and genPlots.R independently without calling the main program batch-run-mrf-wrapper.sh. For example if an user is satisfied with the results of or options given to the MRF-batch.pl program, but want to parse the outputs and generate plots in a different manner, he /she can directly run the corresponding programs.
To see the available options of batch-run-mrf-wrapper.sh program, run the command given below.  The main options (-m,-l,-n and -p) work in a similar way as described in the section 3.1 and the option -c works as described in section 7., except for the -o option, which specifies the output file prefix format. As for the Output parsing and Plot options, a simple description has been given in the usage. Nevertheless, how each option works and the expected output of each option is described further with use cases.

Program equivalent options
batch-run-mrf-wrapper.sh parse-mrf-output.pl The above command takes three arguments, the -d option takes a directory called 'query' as input and options -r and -f take fasta and gff3 file of the reference respectively. The query directory contains all the genomes to be compared against the reference in the fasta format. The fasta file should be in nucleotide format and allowed extensions are fasta, fa and fna.

Contents of the query directory is listed below
Once the command is executed, it runs MRF for each genome in the query directory against the reference genome. The output of all the runs are combined and parsed to produce three files called summaryAll.txt, cdsHeatmap.txt and cdsAll.txt . Apart from these three files, four easy to interpret figures, a Scatterplot, a Barplot and two Heatmaps (clustered and unclustered) are produced. These files are explained below.  The Table 1 shows the results of MRF run for each query genome in the query directory against the provided reference genome (AF332093). As described earlier, all the values are reported with respect to the reference genome. The above details are the same as reported in the MRF website for any one vs one run.
For the sake of better understanding, two images are generated out of the summary table to clearly see how the genomes are affected.  Output File 2: cdsAll.txt Table 2: The contents of cdsAll.txt file produced by the batch mode of MRF tool.
The Table 2. contains the length of bases lost for all the proteins present in the reference genome. The first two rows indicate CDS identifiers, protein_name and protein_id as reported in the supplied gff3 table of the reference genome. The rest of rows indicate the bases lost for the respective genome (accession in the first column).
In the above example run, the reference genome contains proteins wsv001 through wsv526. Going through these many proteins and inferring the result again becomes very tedious. So a concise table of top fifteen impacted proteins in the similar format as shown in the below table.
Output File 3: cdsHeatmap.txt  Till now, the tables and images described a run with default parameters, however one can parse the cdsAll.txt file in various ways by calling the respective options. Each option is described below with the help of a heatmap (clustered only).

Search by protein id (-I)
Will fetch the CDS that are supplied through the -I argument. Only protein ids should be provided as comma separated values.

Search by protein names (-N)
Will fetch the CDS that are supplied through the -I argument. Only protein names should be provided as comma separated values.

Filter by missing CDS length (-Y)
One can set a threshold value, where coding sequences that meet the threshold are only listed. But one has to exercise caution while setting a very low threshold as this will produce a very busy heatmap, when the number of CDS in the genome is very high, as is the case in WSSV.

List first n proteins (-F)
To list only first n proteins, call the program with -F option. In this example, the first 10 coding sequences are fetched.

List last n proteins (-Z)
To list only last n proteins, call the program with -Z option. In this example, the last 10 coding sequences are fetched.

List proteins in range (-R)
To list proteins from CDS 'm' to CDS 'n', call the program with -R option. In this example, the coding sequences from 475 to 485 are fetched.

List only top n affected genomes (-U)
By default only top ten affected genomes are listed, however this can be modified by setting the -U option. This option can be combined with options -I/-N/-F/-Z/-R, as shown in below example it is combined with -R.

Plot options:
Options -L and-B are used to modify the default output of heatmap and Stacked barplot respectively. The -L option can be used to increase of decreased the number of coding sequences that are shown in heatmap. The -B option is used to increase or decrease the number of genomes shown. One can see from the above image, out of more than 30k genomes, 10 genomes were identified with the largest deletions in comparison with the reference. Subsequently one can go through the Stacked Bar chart to see the top affected genomes.

41
To see the top 20 genomes in the heatmap instead of the default 10, it can be run with following command But since the above command will run MRF for all the genomes again, one can directly run the perl parser command as below This will recreate the cdsHeatmap.txt file again by fetching the top 20 affected genomes.To see this as a heatmap, run the following command.
The above command will generate the plots again by using the new cdsHeatmap.txt file.

Conclusion
The MRF tool is a fast and reliable tool to identify and visualize the genomic differences between a pair of virus genomes. The tool produces two important output files, missing regions and missing coding sequences in a tabular format. In addition, a publication-friendly and downloadable graphical output also would be produced by MRF. The tool is robust as it is built on top of the exact match algorithm and also flexible as it allows users to screen the exact matches and optimize the parameters to obtain the best results. At present it has the capability of handling both DNA viruses and RNA viruses and small genomes of organelles. Complex genomes with recombination are currently not handled well by the tool but the future versions will be designed to handle them.

APPENDIX -I (Motivation for building MRF)
In virus genomics research, it is very common to compare genome sequences of two or more virus isolates/strains in order to understand the similarities and differences between them. Currently, one of the most common practices in comparative genomics is to perform a BLAST search for one query genome against a reference genome to gain insights. Mostly, researchers involved in virus genomics studies would be interested to know the coding sequences that are gained or lost in certain isolates in comparison to others. Towards this goal, how MRF has potential to supplement the output of blast search is explained in this Appendix. The genome of White Spot Syndrome Virus (WSSV) was taken to explain the contrast between blast and MRF.
Different genome accessions of WSSV vary in genome length ranging from 280 Kb to 305 Kb. Let us compare two genomes of WSSV, KX686117 (300,223 bp) and AF332093 (305,119 bp). A nucleotide blast (blastn) search was performed for KX686117 against a well annotated AF332093 with the following parameters.
Following is the output of blastn search. It is apparent that despite the obvious genome size difference of around 5kb between the two genomes, the result is, a near exact match with 100% query coverage and with near 100% identity.
The results infer that the genomes are very identical and there is not much difference between them. But it does not account for the difference in genome length at least not in a straight forward way. It is further understood from the alignments tab that the results are supported by 1242 alignments and each alignment has its individual percentage identity and query coverage as shown in the snapshots given below.
So, the quest to map the base coordinates for the 5kb sequence (reason for length difference between genomes) in the larger genome has becomes a cumbersome exercise as one has to manually go through all the alignments considering the percentage identity and query coverage for each alignment. Added to this, the circular genomes pose further complication as these genomes may not have uniform start positions in linear sequence accession.
This situation might get better by re-visiting the objective. Instead of trying to identify the 5kb portion in the larger genome and understand its significance, the blast search can be modified to find if any CDS in the larger genome is lost in the smaller genome. The modified objective can be met by performing a blastn search for all the coding sequences (CDS) of the larger genome (AF332093) against the smaller genome (KX686117) and identify if any CDS does not have a match. Even though this seems simple, there may be certain problems, which are listed below.
1. The blast output table contains the similarity hits that are labelled by sequence identifiers (accession number in most cases) only. You may also browse hits with CDS nomenclature, but one hit only at a time. In genomes with several CDS, it is not convenient to track all of them individually. This situation warrants pre-processing of CDS fasta in such way to identify all the matches using CDS names. 2. It is imperative that each match has query coverage listed along with it, as it gives information about how much of the query is actually having a match. But the hit table available from the Blast website does not have this information and one has to go through each match separately to fetch query coverage. This problem may be mitigated by performing blastn search through command line using -OUTFMT 6 option. Therefore, we can affirmatively conclude that with blast or any other similarity search tools, one cannot find the missing CDS and their base coordinates in one go. In case of partially missing CDS, knowledge of missing base coordinates is highly valuable. This gap is filled with MRF. The MRF compares two genomes, one as query and the other as reference genome. The users need to give gff3 file of the reference genome also as input. Then MRF can tabulate and depict the complete and partially missing CDS in the query genome as per the CDS nomenclature of the reference genome. Therefore, users have the opportunity to compare several query genomes to a single reference genome and build a deletion profile. MRF utilizes the exact match algorithm of mummer to find perfect matches of defined mum length or more and extracts the missing regions and deleted coding sequences from the mummer output. Blast output is a list of matches along with its statistics. The CDS that do not have a hit and partial hit have to be manually extracted.
The output of MRF is a table of complete and partially deleted coding sequences. The CDS that have perfect hits can be obtained from the 'completely present coding sequences' file and can be further confirmed through the 'generate missing regions' option Since blast is optimized to find matches, it is able to span across the regions with poor identity Works well when the sequences have high similarity by reporting unambiguous matches. Reducing the word length increases the possibility of finding a match, but this leads to too many matches and sometimes with very weak alignments.
Reducing the mum length increases the possibility of finding a match, however this may result in many false matches but does not lead to an overall false alignment as each match is perfect and is not extended The following table compares the some of the complimentary features of Blast and MRF Table A2. Contrast between the commonly used blast-based searches and MRF.

Feature
Simple blast-based search MRF Initial seed to initiate alignment Word length Mum length or exact match length

Can compare two genomes
Yes. One genome as reference and the other as query.
Yes. One genome as reference and the other as query.

No Yes
Output includes the list of completely deleted CDS in query genome No. However, the coding sequences without blast hits can be taken as missing in query.

Yes
Can infer partially missing coding sequences in query

No
Yes. The partially missing coding sequences in query are tabulated.

No Yes
Can take both genome sequence and annotation as input No. Can take either total genome or coding sequences in fasta file as input. Only 2 files as input, reference sequence and query Yes. Takes both sequence and annotation as input. Takes 3 files as input, reference sequence, query sequence and .gff3 file of reference genome.
Few notes on MRF output 1. The primary output of MRF is missing genomic regions, which are the base coordinates of genomic regions (of reference) that do not have a match with query genome. Here, we conclude these regions as missing in the query genome. These missing genomic regions are generally true in most cases except for genomes which have not lost the region but have undergone too many mutations. Sometimes, a genomic region might vary between query and reference genomes due to several point mutations but without deletions. In this scenario, a perfect match could not be established and the genomic region might end up in 'missing genomic regions' list. 2. The output of 'completely missing coding sequences' always indicate the CDS that are lost in the query genome. There might be exceptions while handling highly mutable genomes. Sometimes, there might be a case of partial deletion only. But if the existing partial sequence harbors too many mutations leading to non-detection of exact matches, then a 'partially missing CDS' might be tabulated under 'completely missing CDS'. 3. Again in case of output indicating 'partially missing CDS', the missing lengths might be inflated in case of highly mutable genomes. 4. MRF has three levels of safety to avoid problems related to highly mutable genomes as explained in points 2 and 3. a. Point mutations: Single base deletions have been excluded from 'missing length' calculations as these are actually mutated sites in most cases. These point mutations are still displayed in missing genomic regions output table (Non_hit_region = X). Indirectly, this could be a summary of mutation sites between analyzed genomes. b. Exact match length: Users have the provision to choose their own 'exact match length' before submitting a job to MRF. While handling highly mutable genomes, users can reduce 'exact match length' from the default value, 20. This avoids reporting of highly mutated regions as missing because exact matches of shorter lengths could be made between mutated sites. c. Negative and positive offset: Running jobs with shorter 'exact match length' might lead to false matches at random positions. The negative and positive offset parameters safeguard against false matches. Assume that we define, negative offset = 2 and positive offset =2. Then MRF checks the 2 exact match upstream (negative offset) and downstream (positive offset) to the current exact match for the base coordinates representing sequence contiguity of the query genome. If base coordinates are not contiguous then the current exact match would be treated as false match. 5. MRF cannot detect mutated genomic regions shorter than 'exact match length' because the mutated sites prevent detection of exact matches between query and reference genomes.

APPENDIX -II (Benchmarking studies)
This sections describes as how MRF complements and differs with Blast in terms of ease of use, precise identification of deletions and summarization. The blast search is the most commonly used similarity-search tool in comparative genomics. Therefore it would be ideal to compare the results provided by MRF with that of blast search. This Appendix section provides a detailed account of the relative merits and demerits between MRF and blast search utilizing a few virus genomes as use cases.
In addition to comparison with Blast, a few case studies have been included by analyzing multiple isolates of WSSV, multiple strains of AFSV and several strains of HIV-1. These studies help to identify 'deletion hotspots', regions which are prone to frequent deletions and mutations. In addition to that, it is also possible to identify genes which are crucial to virulence or fitness of a particular strain when compared to vaccine or attenuated strain.  Table A3. The tabulated output from MRF indicating the complete and partial (>5% lost) missing coding sequences has been given in Table A4 and A5 respectively.  It is interesting to note and further probe the results obtained for wsv095 and wsv164. The blast search yielded no hits for these two CDS which means they were absent in subject genome (KX686117) whereas MRF reported no deletions in these two CDS in query genome (KX686117).

A2.1 USE CASE 1: White Spot Syndrome Virus (highly-similar but length-varying genomes)
Let us examine the exact match coordinates for these two CDS in the 'genomic missing regions' table obtained with MRF.
wsv095 (48619 -48939) wsv164 (89897 -90091) The results from MRF clearly indicated exact matches for these 2 CDS between query and reference genomes. Therefore, we shall repeat blast search with less stringent conditions. For each of these two CDS, blast search was performed with the 'somewhat similar sequences' option, which runs with word size of 11.
The changed parameters gave a hit for wsv095 in the subject genome with 100% query coverage as shown below.
Following is the alignment for the hit region.
Similarly, for wsv164 also, a hit was found in blast search.
Reducing the word length from 28 to 11 has a drastic change in hit status, from 'no hit' to a hit with 100% query coverage and near 100% identity. When the alignments were looked closely, there were even perfect matches greater than the length 28; however, they do not get reported as they were not considered as High Scoring Portions (HSPs). The alignments have low-complexity regions which blast generally indicates with lower-case nucleotides. The CDS with low complexity regions is not an issue while working MRF as it does not do any post processing for its matches and they are reported as it is. wsv178 Now, let us consider the case of wsv178 for which blast reported a hit in subject genome with 100% similarity and 100% query coverage however MRF reported deletion of 58.09% of CDS length. Let us check the alignments printed during blast search for wsv178.
Notice the different regions of query making alignment to same base positions of the subject genome (236575 -236640). It is evident from the alignments that there is no continuous match, but there is a 66 bp repeat region for which blast is able to cover the entire query sequence.
In In fact, the KX686117 has lost some repeat regions. The MRF on account of mummer's exact match algorithm is able to report those lost regions. However, blast failed to report the missing regions because several repeat regions of query make multiple alignments with one repeat region of the subject. That is the reason for the blast reporting 100% query coverage with 100% identity for wsv178. In certain virus cases, number of repeats might be linked to virulence. Therefore obtaining accurate information becomes mandatory.
These are few cases of complete and partially missing coding sequences of WSSV where MRF proved to be performing better than blast search thus offering the potential to supplement blast search in virus genomics.

A2.1.2 Case study:Identifying deletion hotspots in isolates of WSSV with MRF
White Spot Syndrome Virus exhibits a highly variable genome size with respect to its geography and host. There is a length difference of about 28.6 Kb between shortest and longest genomes. To study this huge variation, we have collected 14 complete WSSV genome sequences (Table 5) and analyzed them using MRF tool to extract the missing proteins. The China isolate, AF332093 was kept as the reference and 13 other genomes were considered as queries. Nucleotide fasta sequences of the reference and a query along with the gff3 table of the reference were given as input for MRF tool.

A2.1.2.1 Deletions in WSSV genomes
The length of fourteen complete WSSV genomes analysed in this study varied from 280,591 to 309,286 base pairs. The GC content is mostly uniform ranging from 40.85 % to 41.08 %. The 14 isolates belong to 9 different countries. The genome length of the reference isolate, CN is higher than all other isolates except the CN01 and Taiwan.
Compared to CN isolate, the bases deleted in most other isolates are quite close to the absolute genome length difference between them (Fig A1). However, there are certain isolates which differ, for example, the MEX2008 isolate is only 11,936 bases shorter but lost about 13,183 bases compared to CN isolate. Similarly, the IN_AP4RU and CN-Pc isolates which are shorter by 24,528 and 4,896 bases but show deletions of 26,713 and 6,429 bases respectively when compared to CN isolate. Therefore the longer genome does not necessarily have all the bases of the shorter genomes. This is very much evident in the case of CN01 and Taiwan isolates which are 4,167 and 2,168 bases longer than the reference but still lack 970 and 772 bases respectively that are present in CN isolate. Figure A1. Comparison of genome length difference (blue line) and base length deleted (red line) in WSSV isolates with respect to CN isolate as identified by MRF tool.
The MRF tool helped us to document the completely and partially missing CDS in all the 13 WSSV isolates in comparison to CN isolate. Importantly, the missing CDS reported and commented in the study follow the nomenclature of the reference genome (in this case CN isolate). The CN isolate has 524 CDS, out of which 82 CDS are either completely or partially lost in at least one of the other WSSV isolates.
There are 46 CDS that show complete deletion in at least one isolate compared to CN isolate. Some of them are deleted in only one isolate while others in as many as 10 isolates (Table A8).

.2 Deletion hotspots
In this study, the sequence regions where a stretch of 3 CDS are completely lost in three or more isolates are defined as deletion hotspots. We report three deletion hotspots in WSSV genome which are wsv481/wsv499 (read as wsv481 through wsv499), wsv237/wsv241 and wsv178/wsv180 ( Fig A2). As MRF also prints the missing genomic regions coordinates, it is possible to understand the position of deleted bases (5' end or 3'end or within) in the CDS present in deletion hotspots. Figure A2. Depiction of the deleted regions within the coding sequences of WSSV genomes at the identified deletion hotspots. Grey represents completely missing genes, green represents completely present genes and yellow to orange represents partially missing genes based on percentage of the missing regions.
All three proteins in this deletion hotspot are completely lost in CN03, CN04 and AU isolates but all 3 proteins showed only partial deletions in IN_AP4RU isolate. The CN01, EC-15098, K-LV1, CN02 and CN-Pc isolates showed partial deletions in wsv178 and wsv179 only while retaining full coding sequence for wsv180. The EG3, Taiwan, TH and MEX2008 are not affected by this deletion hotspot. The wsv178 displaying partial to complete deletion in various isolates is an immediate early protein.
The EG3, Taiwan, CN01, K-LV1, CN-Pc, TH, and MEX2008 isolates did not show any deletions in this deletion hotspot which consists of two important envelope proteins VP41A (wsv237) and VP52A (wsv238 iii. wsv481/wsv499. The sequence region that encodes wsv481/wsv499 is lost in EC-15098, IN_AP4RU, MEX2008, TH and CN03 & CN04 isolates of China. These 6 isolates along with AU showed partial deletion (> 40%) of wsv479. The TH and MEX2008 isolates in particular share a common deletion of wsv480/wsv503, few more CDS extending on either side of deletion hotspot. The genomes of CN02, CN-Pc and AU isolates share a common deletion pattern ranging from wsv489 to wsv499. The K-LV1 isolate also shares similar deletion pattern except that wsv489 is present and wsv497 is partially (89%) lost. The CN01, Taiwan and EG3 isolates retained entire genomic region that included this deletion hotspot.
The annotated coding sequences and the genome-wide deleted regions in WSSV isolates have been depicted in S1 Fig. The CDS for wsv53 is deleted in CN01 and AU isolates whereas the CDS for wsv60 is deleted in EC-15098 isolate. Only CN03 and AU isolates showed deletion of CDS for wsv74 and wsv196 proteins. The coding sequence of two proteins, wsv319 and wsv320 are deleted in only IN_AP4RU isolate. The CDS of an envelope protein, VP62 (wsv338) is completely deleted in CN03, CN04 and AU isolates whereas the same isolates lost more than 95% of the CDS of another envelope protein, VP39 (wsv339). The VP39 was identified as integument protein (12) and studies show targeting this protein increases the survival rate in shrimps (13). The coding sequence of protein, wsv337 was observed to be lost in only AU isolate. The genomes of K-LV1, CN-Pc, IN_AP4RU, CN03, CN04 and AU isolates completely lost the CDS for wsv462 and wsv463 proteins.

A2.2 USE CASE 2: African swine fever virus (ASFV)
The ASFV has several multigene families where copy number variation in a multigene family contributes to differences in genome length between isolates. It is interesting to compare MRF and blast in assessing deleted CDS between genomes of different isolates varying in genome length.   The CDS 'MGF 110-13L' was reported as completely lost by MRF however blast reported an alignment with several mismatches and 69% query coverage.
Here, the MRF was re-run with different parameters. As this genome display several mismatches, reducing exact match length might give better results. During re-run, the mum length was reduced to 15 and false match threshold was reduced to 10 as shown in the below snapshot.
As per the revised results, the 'MGF 110-13L' was in 'partially missing CDS' list. But still, the percentage of missing CDS length is very high. This is due to the MRF's inability to find enough exact matches of length '15', as the CDS in the query genome has significant variations when compared to reference as evident from the blast alignment. This is a situation where the 'exact match' algorithm suffers over blast or any other comparable program.
It is also worth noting that despite the reduced mum length, the coding sequences In the case of 'MGF 505-2R', it was reported as lost by MRF whereas blast reported a poor alignment. It was noted from the list of CDS that the OURT88/3 genome does not have 'MGF 505-2R'. However, blast with its heuristic nature of the algorithm is able to find a weak match.
When the coordinates of the hit region were further examined, it was found to be the sequence of 'MGF 505-10R'. Here blast search erred by falsely reporting an alignment with MGF-505-10R as a hit to MGF 505-2R. The sequence of MGF 505-2R shares similarity with the sequence of MGF 505-10R. This is depicted in pairwise sequence alignment given below.
In fact, MGF 505-3R was reported as 60% lost by MRF and blast result too supported with a perfect alignment, even though for only 39% of the query sequence.
As for the other CDS with less than 50% deletion reported by the MRF, we can infer that those CDS have undergone significant mutations which can be further examined through blast alignments.

A2.2.2 Case study: Analyze multiple strains of ASFV with MRF and identify deletion hotspots
African Swine Fever Virus (ASFV) is a double-stranded DNA virus that causes hemorrhagic fever with high mortality in domestic pigs and related animals. Like WSSV, the ASFV also shows variation in genome length ranging between 170 to 193 Kb among different strains (14)(15)(16). Like ASFV, the major envelope proteins of WSSV are also not glycosylated (17), a feature that is uncommon in animal viruses. A vaccine for protection against ASFV is not available but immunization of pigs with attenuated isolates like OURT88/3 can confer protection against lethal challenges from virulent isolates (18). When compared with the highly virulent isolate, Benin 97/1, the OURT88/3 isolate contains major deletions in the MGF360 and MGF530/505 gene families which have the potential to modulate type 1 interferon response (19). The presence or absence of these gene families was attributed to the variation in genome length (16) as well as virus attenuation (19). Therefore, in the present study, we tested the utility of MRF tool for quick tabulation of these missing gene families in various isolates of ASFV in comparison to highly virulent, Benin 97/1.
While analyzing ASFV genomes with MRF tool, major emphasis was to highlight the deletion profile of 8 genes in MGF360 and MGF530/505 gene families (MGF360-10L to MGF360-14L and MGF530/505-1R to MGF530/505-3R), which are crucial for virulence of the pathogen. The deletion of these 8 genes has been documented in 17 complete genomes of AFSV in comparison to highly virulent, Benin 97/1 isolate and depicted in figure A3. The missing genes are presented as two groups, complete (100 %) and partially (>50 %) deleted genes.
The low virulent isolates, OURT88/3 and NHV shared an identical deletion profile for both complete and partially deleted genes. Both of them did not have genes, MGF360-10L to MGF360-14L and MGF530/505-1R to MGF530/505-2R. The MGF530/505-3R gene is only partially lost in these isolates. Another avirulent isolate, Ba71V also shared similar deletion profile as that of OURT88/3 and NHV isolates, except that it still retained MGF530/505-2R and MGF530/505-3R genes and the MGF360-14L gene is only partially lost. A single run of MRF tabulated all the completely and partially deleted genes in query genomes compared to reference genome. Therefore, we could notice that apart from the MGF360 and MGF530/505 genes, many isolates showed complete deletion of certain MGF110 family of genes. The MGF110-12L and MGF110-13L genes have been found to be completely deleted in OURT88/3, NHV, Ba71V and tengani isolates. However it has been reported that the presence or absence of MGF110 family of genes does not alter the state of virulence of the virus (20). The pDP71L gene is also lost in ken05, ken06, kenya1950 and malawi isolates. Several other genes showing partial deletions have been depicted in Figure A4.         Generally, the mutation rates in RNA genomes are higher than DNA genomes of viruses (21). Therefore, we have tested the suitability of MRF tool for handling RNA genomes of the Human Immunodeficiency Virus (HIV). The Negative factor (Nef) gene of HIV genome plays a crucial role in progression of HIV infected individuals towards Acquired Immune Deficiency Syndrome (AIDS) (22,23). Attenuation of HIV was observed in mutated strains carrying deletions in the Nef gene, where the onset of AIDS was also delayed (24,25). Here, we tracked the Nef gene in 9 HIV-1 type C strains from India (   Figure A5. Deletions in CDS of HIV-1 subtype C strains observed as a heatmap. Figure A6. Deletions in selected CDS of HIV-1 subtype C strains observed as a heatmap. The CDS are restricted to those genes by applying 'filter by protein name' argument.

A2.4 USE CASE 4: SARS-COV2 genomes
The utility of MRF is better exemplified in the case of SARS-COV2 where lots of samples from different regions have been sequenced at rapid pace. Majority of research focuses on construction of phylogenies and evaluating the distances between the genomes but it is also important to know the CDS contributing differences between genomes.
Here  The ORF1ab sequence is reported as partially deleted by MRF. Similarly, blast search also showed the same in graphic summary and dotplot.
To conclude, MRF and blast can be used in conjunction to compare and list the deleted coding sequences in one genome with respect to another genome. MRF simplifies the procedure on account of its exact match algorithm and any ambiguous results could be verified with blast at least while handling pathogen genomes. In one go, MRF also identifies the missing genome sequence coordinates.

A2.4.3 Case study: Identification of Single Nucleotide Polymorphisms (SNP's)
MRF tool also identifies the substitutions (SNP's) as a part of identifying the deleted genomic regions. The substitutions are marked as X in the output file. The feature is demonstrated using one of the B.1.617 strain of SARS-COV2 as the query genome (Accession no. MZ140618) against the reference genome (Accession no. NC_045512). The B.1.617 lineage carries multiple mutations but two mutations namely L452R and E484Q in the spike protein is of significant concern and hence referred to as "double mutant", this variant has been suggested to more transmissible than the previous ones.
The results showed that MRF was able to detect the corresponding nucleotide changes of the above mutations as highlighted in the following images.
Eventhough MRF is able to detect single base pair changes (substitutions), it is currently not able to map these to amino acid changes. Another limitation being, MRF is able to identify only single base pair substitutions but not able to characterize indels.

A2.5 USE CASE 5: Marek's disease virus (Gallid herpesvirus 2)
To protect poultry from Marek's disease, many mutant strains of Marek's disease virus (MDV) have been developed as live vaccines. Out of many, the most successful has been the CVI988/Rispens strain. The key difference between the virulent and vaccine strain is an 18 bp deletion in UL49 gene in the latter. The UL49 gene codes for the protein vp22 which is involved in cell-to-cell transmission of the virus.
A virulent strain and a vaccine strain are compared with MRF and blast as well.

A2.6 SUMMARY
While similarity-search tools rely on the word length to build seed alignments and find similarities between two genomes, MRF uses the exact match length to find matching and missing genome regions. The blast or any other similarity search tool cannot directly output the missing CDS and their base coordinates in one go. This kind of output is highly valuable while studying pathogenic viruses. The MRF fills void for such tools in comparative genomics of viruses.
The MRF has three types of output, summary, tabular and graphic. The summary contains the number of coding sequences and the length of genome missing in query genome. The tables include completely missing coding sequences, partially missing coding sequences and genomic missing regions along with base coordinates. The graphic output is a MirCos plot of completely and partially missing coding sequences in query genome.
A highly-similar but length-varying genome (White Spot Syndrome Virus, WSSV), two clinical viruses (African Swine Fever Virus, ASFV and Human Immunodeficiency Virus-1, HIV-1), an emerging pathogen (2019 novel corona virus, 2019-nCoV) and a bird virus (Marek's disease virus, MDV) were taken to benchmark the efficiency of MRF over the commonly used similarity-search tool, the blast.
In case of highly-similar but length-varying genomes like WSSV, blast search ended up reporting high similarity between genomes and building several alignments between query and reference. Whereas the output of MRF could be inferred in a straightforward way to note the missing CDS and their base coordinates in genome. In addition, the circular nature and inconsistent annotations among WSSV genomes made MRF a better tool than blast to identify the genome-wide differences. In other use cases MRF gave comparable results to blast search while offering certain merits while handling genomes with high mutations.
Sometimes, blast algorithm might report false hits due to sequence alignments generated at random positions in the genome as was observed in the case of MGF 505-2R gene of ASFV. In such cases, the advanced features of MRF consisting of false match length and offset parameters gave an opportunity for users to control false matches and to perform several re-runs in less time. In genomes with high mutations like HIV-1, both MRF and blast algorithms struggled to avoid false matches/alignments. More number of mismatches between query and reference genomes led to several spurious alignments with blast and false matches with MRF. In case of HIV-1 genomes, utility of stringent offset parameters to control most of the false matches was further demonstrated.
Comparable results were obtained between MRF and blast search while analyzing 2019 nCoV genomes. Further, in the case of MDV, where an 18 bp deletion in UL49 gene differentiates a vaccine strain from a virulent strain, MRF was good enough to find this deletion in partially missing coding sequences table. Finally, MRF could handle the genome of the largest (2 Mbp) known virus, Pandoravirus salinus. Here a genome with custom-deletions was given as query as against a full genome and MRF could successfully complete the analysis and print results.
The following table gives a brief comparison of blast and MRF based on the use cases mentioned in the previous sections