Study design
Seventy-three VHSV (n = 55) and IHNV (n = 18) viral strains collected by the Novimark consortium were used for the exercise [Additional file 1] and processed by three participant laboratories, later referred as Lab 1, Lab 2 and Lab 3, according to the protocols and sequencer machines available at their facilities. In detail, 36 VHSV and 13 IHNV were sequenced with Illumina MiSeq by Lab 2 and Lab 3, and 19 VHSV and 5 IHNV were sequenced with Ion Proton™ Sequencer by Lab 1. Additionally, a calibrator specimen constituted by a recombinant VHSV strain (see below) was included in the exercise and analyzed with both sequencing technologies (Fig. 1). All the viruses were subject to library preparation and whole genome sequencing (WGS) according to the protocols reported below. In total, 75 unique raw datasets, later referred as “raw data” or “sample(s)”, were generated by Lab 1, Lab 2 and Lab 3 according to the methodologies implemented at their facilities, and shared in FASTQ format, either as a single file or as two paired files, via a secure File Transfer Protocol (FTP) site. For each sample, participants applied their own BI pipeline and produced a single consensus sequence for comparison.
Specimens processing and sequencing
Lab 1
Viral strains available at Lab 1 repository [Additional file 1], originally lyophilized or frozen at − 80 °C, were inoculated at a multiplicity of infection (MOI) of 2–10 onto 25 cm2 flasks seeded with blue gill fry (BF-2) or epithelioma papulosum cyprini (EPC) cells [12, 13] grown in Eagle’s minimal essential medium adjusted at pH 7.6 ± 0.2 with 0.19 M Tris-HCl buffer and supplemented with 10% fetal bovine serum (Eurobio), 100 U/mL penicillin and 100 μg/mL streptomycin (Pan Biotech). After 1-h adsorption at 14 °C ± 1, inocula were removed and replaced with fresh medium. Cell cultures were then incubated at 14 °C and checked regularly for the development of cytopathic effect (CPE). Upon completion of CPE, flasks were frozen and thawed, and viral suspensions were collected and clarified at 2000×g for 15 min at 5 °C ± 3 °C. Supernatants were then aliquoted and stored at − 80 °C until use.
For each specimen, total RNA was extracted from 1 ml of supernatant using TRIzol™ LS Reagent (Life Technologies) following the manufacturer’s instructions. Double stranded complementary DNA (ds-cDNA) libraries were prepared with the Ion Total RNA-Seq Kit V2 (Life Technologies). The protocol recommended by the supplier was slightly modified and is available from the authors upon request. Finally, cDNA libraries were quality-checked with Agilent 2100 Bioanalyzer (Agilent High Sensitivity DNA kit, Agilent Technologies), quantified by qPCR (Ion Library TaqMan™ Quantitation Kit, ThermoFisher Scientific) and finally sequenced on the Ion Proton™ Sequencer using an Ion PI™ Chip Kit v2 (Life Technologies).
Lab 2
Viral stocks provided by the European Union Reference Laboratory for Fish and Crustacean Diseases (Denmark) and processed by Lab 2 [Additional file 1], were freshly propagated at 15 °C on exponentially growing EPC cells [13] cultivated in 75 cm2 flasks with L-15 cell medium supplemented with 1 mM L-glutamine, 2% fetal bovine serum, 100 U/ml penicillin and 200 μg/ml streptomycin (Gibco). Once full CPE was obtained, culture supernatants were collected, and cell debris removed by centrifugation at 2500×g for 15 min at 4 °C. Viruses were then concentrated by ultracentrifugation at 100000×g for 1 h at 4 °C, and pellets re-suspended in 200 μl of cell culture medium.
Viral RNA was extracted using the EZ1 Virus Mini Kit v2.0 and the EZ1 Advanced extraction robot (Qiagen), and eluted in 60 μl of elution buffer. Approximately 100 ng of RNA were reverse transcribed at 37 °C for 1 h in a 20 μl reaction containing 1 mM dNTPs, 500 ng of random primers and 200 units M-MLV Reverse Transcriptase (Promega). Double-stranded cDNA was then generated with random primers using Sequenase V2.0 DNA Polymerase (Affymetrix), purified with the QIAquick PCR Purification Kit (Qiagen) and quantitated using Quantus™ Fluorometer (Promega). Libraries were prepared with the Nextera XT DNA Library Preparation Kit (Illumina), checked for quality and size with Agilent 2100 Bioanalyzer (Agilent High Sensitivity DNA kit, Agilent Technologies) and sequenced with Miseq v2 Reagent Kit (150PE) (Illumina) using Illumina MiSeq platform.
Lab 3
Three ml of each virus originating from Lab 3 repository [Additional file 1] were inoculated onto 24-h EPC cells seeded in 72 cm2 flasks (Falcon®). After 1-h adsorption at 15 °C under gentle shacking, 15 ml of MEM Eagle (Sigma-Aldrich) supplemented with 10% foetal calf serum (FCS) (Hyclone), 1% L-glutamine 200 mM (Sigma-Aldrich) and 1% antibiotic antimycotic solution 100X (Sigma-Aldrich) were added, and viruses were propagated at 15 °C until completion of CPE. Approximately 20 ml of viral suspension were collected from each flask and clarified at 4 °C for 10 min at 2800×g. Subsequently, viruses were concentrated by ultracentrifugation at 90000×g for 1 h using a Beckman Coulter Optima L-100 K, and then re-suspended in 500 μl of MEM Eagle (Sigma-Aldrich).
Viral RNA was isolated from 140 μl of concentrated virus using the QIAamp Viral RNA Mini kit (Qiagen) and then subject to retrotranscription with the SuperScript III Reverse Transcriptase (Invitrogen). Double-stranded cDNA was synthesized using NEBNext mRNA Second Strand Synthesis Module (Euroclone), purified with MagSI-NGSPREP PLUS beads (MagnaMedics) and quantified with Qubit dsDNA HS assay kit (ThermoFisher Scientific). The cDNA library was prepared using Illumina Nextera XT DNA Sample Preparation kit (Illumina), and fragments were selected with MagSI-NGSPREP PLUS beads (MagnaMedics). Library was checked for quality and size with Agilent 2100 Bioanalyzer (Agilent High Sensitivity DNA kit, Agilent Technologies) and sequenced with MiSeq v3 Reagent Kit (300PE) using Illumina MiSeq platform.
Calibrator production and sequencing
The recombinant VHSV strain r23/75 kindly provided by the Institut National de la Recherche Agronomique (INRA) was synthetized via reverse genetics by transfecting EPC cells with an expression plasmid containing the VHSV full-length cDNA [14]. Being a recombinant virus with a predetermined genome sequence, strain r23/75 acted as calibrator for the exercise and was subject to NGS with both Ion Proton™ and Illumina MiSeq according to the procedures available at Lab 1 and Lab 3, respectively. To further verify its sequence, the vector encoding r23/75 genome (p23/75) and used for EPC transfection was also Sanger sequenced by Lab 3. Briefly, the plasmid was propagated in XL 10-Gold ultracompetent cells (Agilent Technologies) and isolated using the EndoFree Plasmid Maxi Kit (Qiagen). The purified vector was then subject to sequencing using primers listed in Additional file 2 and the BrilliantDye™ Terminator Cycle Sequencing Kit (Nimagen) according to the manufacturer’s instructions. Sequencing reactions were then purified with CENTRI-SEP 96 Well Plates (Princeton Separations) and sequenced in a 16-capillary ABI PRISM 3130xl Genetic Analyzer (Applied Biosystems). Sequence data were assembled and edited using the SeqScape software v2.5 (Applied Biosystems) and the final consensus was compared with the genome sequences of r23/75 obtained with NGS. The VHSV complete genome of p23/75 is publicly available under the GenBank acc. no. MK792283.
Bioinformatics analysis
Lab 1
Reads were cleaned with Trimmomatic v0.36 [15] (ILLUMINACLIP: oligos.fasta: 2:30:5:1: true; LEADING: 22; TRAILING: 22 for Ion Proton™ sequencing and 28 for Illumina MiSeq sequencing; MAXINFO: 40:0.2; MINLEN: 36). Bowtie2 v2.2.5 [16] was adopted to perform a rapid alignment with down-sampled reads on the local NT database. VHSV or IHNV complete genomes with the highest number of matching reads were used as a primary reference to align all cleaned reads with BWA v0.7.15 [17]. Based on the alignment, the coverage was estimated using an in-house perl script, bam stat v1.0.13 (http://bamstats.sourceforge.net/) and bam2fastx v1.0 (https://github.com/PacificBiosciences/bam2fastx). Raw reads were down-sampled to fit an estimated global coverage of 80X fold and were de novo assembled with Mira v4.0.2 [18] and, after being cleaned with Trimmomatic, also with SPAdes v3.10.0 [19]. Subsequently, assembled contigs were aligned against the integrated NT database using BLAST [20]. The best match (i.e. the longest sequence corresponding to a complete genome of VHSV or IHNV) was selected as reference for a BWA alignment. The generated SAM file was then used to compute the mpileup using samtools v1.9 [21], and variant calling was performed with bcftools v1.9 [21]. Called variants were then used to produce a consensus sequence with vcfutils.pl v1.9 [21] and seqtk v1.2 (https://github.com/lh3/seqtk). Finally, assembled contigs and BWA alignments were visually inspected for ambiguous bases, indels, ORF integrity, genome integrity and, if required, were manually adjusted.
Lab 2
Unless otherwise specified, pre-processing of raw data, reads alignment, variant calling and alignment of the consensus sequences against the reference genomes were done using CLC Genomics Workbench v4.9 (https://www.qiagenbioinformatics.com/). Sequences were trimmed based on quality scores with a cumulative error probability of 0.03, and including a maximum of 3 ambiguous bases. Sequences shorter than 40 bp were discarded. Sequencing adaptors were trimmed with mismatch and insertion costs of 2 and 3, respectively, and with a minimum internal and end scores of 10 and 3, respectively. Reads were aligned against the following reference sequences: VHSV 23–75 (GenBank acc. no. FN665788), VHSV MI03GL (GenBank acc. no. GQ385941), VHSV SE-SVA-14-3D (GenBank acc. no. AB839745) and IHNV (GenBank acc. no. X89213). Mismatch, insertion and deletion costs were set at 2, 3 and 3, respectively. At least half of the reads was required to match the reference genomes with a nucleotide similarity ≥80%. For those reads mapping to multiple positions within the reference genome, the position in the final alignment was randomly assigned. The consensus sequence was inferred based on the alignment with the highest coverage. In case of disagreement among reads, those predominantly represented were used for base calling. Once the drafted consensus sequence was attempted, reads were re-aligned using the same criteria as before. Potentially deleterious mutations were examined by a) aligning the ultimate consensus sequence against the reference with the highest similarity; b) listing any mismatch with a putative detrimental effect on viral phenotype; c) predicting the reading frames using Prokka v1.12 [22]. The support for such deleterious mutations was obtained through examination of the original alignment and with comparison against alternative variant calling tools, such as Snippy [23].
Lab 3
Reads quality was assessed using FastQC v0.11.2 (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Illumina MiSeq raw data were cleaned by removing: a) reads with more than 10% of undetermined (“N”) bases; b) reads with more than 100 bases with Q score below 10; c) duplicated paired-end reads. Ion Proton™ raw data were cleaned by removing duplicated reads. Filtered reads were clipped from Illumina adaptors Nextera XT or Ion Proton™ adaptors with scythe v0.991 (https://github.com/vsbuffalo/scythe). Low quality ends were trimmed with sickle v1.33 (https://github.com/najoshi/sickle) over a range with minimum average quality of 25 for Illumina MiSeq reads, and 15 for Ion Proton™ reads. Reads shorter than 80 bases or unpaired after previous filters were discarded. Filtered reads were aligned against the integrated NT database (version 8 February 2017) with BLAST v2.6.0+ adopting default parameters and e-value <10e-50. Alignment hits matching against VHSV and IHNV complete genomes, with sequence similarity ≥95% and match length of the query ≥99%, were selected. For each sample, sequences with the highest number of matching reads were chosen as proper reference genomes. Reads were then re-aligned against their respective reference genome selected as above using BWA v0.7.12 and standard parameters. Before SNPs calling, alignments were processed with SAMtools v0.1.19 for conversion into BAM format and sorted by position. Subsequently, potential errors of the alignment were corrected with Picard-tools v2.1.0 (http://broadinstitute.github.io/picard/) and GATK v3.5 [24,25,26], reads were re-aligned in the proximity of indels and base quality was recalibrated. LoFreq v2.1.2 [27] was then run with the function “--call-indels” to produce a vcf file containing both SNPs and indels. Indels with a frequency lower than 50% and SNPs with a frequency lower than 25% were discarded. Indels within coding regions, determining a reading frameshift, were also removed. The consensus sequence was obtained adopting the following criteria: a) for coverage insufficient for reliable variant calling (<10X), the “N” base was assigned; b) for coverage ≥10X and no SNP call, the reference genome base was assigned; c) for coverage ≥10X and calling of at least one SNP, the nucleotide representing the observed bases was assigned adopting the IUPAC code. High quality reads were re-aligned against their consensus sequence with BWA. Finally, the alignment was visually inspected with tablet v1.14.10.21 [28] and used to manually revise the consensus sequence.
Consensus sequences comparison
The consensus sequences produced by Lab 1, Lab 2 and Lab 3 were saved in fasta format and collected for subsequent analysis via a secure FTP site. Complete genomes of VHSV and IHNV were aligned separately with MAFFT v7.294b [29] using standard parameters. Differences between the consensus sequences produced by the participants related to the same viral strain were identified with a perl script developed in-house and available upon request. For the recombinant calibrator, consensus sequences obtained with NGS were compared also with the Sanger sequence (Genbank acc. no. MK792283). Discrepancies were evaluated by taking into account their genome localization, namely a) coding regions (CDS), consisting of viral genes translated into viral proteins; b) non-coding intergenic regions, consisting of untranslated sequences that intersperse viral coding regions; c) genome termini, consisting of untranslated regions at the 3′ and 5′ ends of the viral genome. Discrepancies were further categorized as SNPs or indels. A summary of all the inconsistencies observed was generated and scrutinized. PT reproducibility was estimated as the ratio between the number of consistent sites observed and the cumulative genome size of the 75 samples, as previously proposed by Kozyreva et al. (2017) [30]. Accuracy of the BI pipelines was assessed by pairwise comparison of the consensus sequences related to the calibrator sample against its reference genome obtained with Sanger. The accuracy value was estimated as the ratio between observed identical bases and the expected genome size [30].
To deepen the reasons behind inconsistent results, an additional explanatory analysis was performed on a subset of 50 discrepancies randomly selected. We intentionally avoided discrepancies located at genome termini because the variability in the three BI pipelines adopted did not allow a proper assessment of the reasons for variant calls. The analysis was performed in three phases. Firstly, for each discrepancy, participants were asked to provide a justification for their choice in nucleotide assignment. Secondly, all the explanations provided were ascribed to one or a combination of key steps of the BI pipeline namely, i) alignment; ii) variant calling; iii) manual curation; iv) de novo assembly. The three laboratories performed steps i), ii) and iii) in the same sequential order to produce the final consensus sequence. Step iv) was unique to the BI pipeline of Lab 1 and was used during manual inspection of the initial consensus sequence from step iii). Finally, to identify the source of the discrepancies, the explanations of the three laboratories were compared with respect to the order of key steps described within the BI pipeline. Among the three explanations provided for each of the 50 sites, the last by order in the steps series i-iv was arbitrarily assumed to be the source of the inconsistency among laboratories.