Evaluation on the use of Nanopore sequencing for direct characterization of coronaviruses from respiratory specimens, and a study on emerging missense mutations in partial RdRP gene of SARS-CoV-2

The coronavirus disease 2019 (COVID-19) pandemic has been a catastrophic burden to global healthcare systems. The fast spread of the etiologic agent, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), highlights the need to identify unknown coronaviruses rapidly for prompt clinical and public health decision making. Moreover, owing to the high mutation rate of RNA viruses, periodic surveillance of emerging variants of key virus components is essential for evaluating the efficacy of antiviral drugs, diagnostic assays and vaccines. These 2 knowledge gaps formed the basis of this study. In the first part, we evaluated the feasibility of characterizing coronaviruses directly from respiratory specimens. We amplified the partial RdRP gene, a stable genetic marker of coronaviruses, from a collection of 57 clinical specimens positive for SARS-CoV-2 or other human coronaviruses, and sequenced the amplicons with Nanopore Flongle and MinION, the fastest and most scalable massively parallel sequencing platforms to date. Partial RdRP sequences were successfully amplified and sequenced from 82.46% (47/57) of specimens, ranging from 75 to 100% by virus type, with consensus accuracy of 100% compared with the Sanger sequences available (n = 40). In the second part, we compared 19 SARS-CoV-2 RdRP sequences collected from the first to third waves of the COVID-19 outbreak in Hong Kong with 22,173 genomes from the GISAID EpiCoV™ database. No single nucleotide variants (SNVs) were found in our sequences, whereas 125 SNVs were observed in the global data, with 56.8% being low-frequency (n = 1–47) missense mutations affecting the rear part of the RNA polymerase. Among the 9 SNVs found in 4 conserved domains, the frequency of 15438G > T was the highest (n = 34), and this SNV was predominantly found in Europe. Our data provide a glimpse into the sequence diversity of a primary antiviral drug and diagnostic target. Further studies are warranted to investigate the significance of these mutations.

In Hong Kong, the first 2 COVID-19 cases were confirmed on 23 January 2020 [2]. At that time, a number of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome sequences and real-time reverse transcription polymerase chain reaction (rRT-PCR) protocols were already available, so we were better prepared than Wuhan for tracing and controlling the circulation of this virus. Nevertheless, we cannot predict when and where the next coronavirus spillover will take place. What we can do is prepare by building on accumulating knowledge of this virus family and make good use of state-of-the-art tools to facilitate early identification and timely containment. In addition, owing to the high mutation rate of RNA viruses, periodic surveillance of emerging variants of key virus components is essential to combat these viruses. By studying their functional characteristics and evolution patterns, we can monitor and evaluate the impact of emerging variants on the efficacy of antiviral drugs, diagnostic assays and vaccines.
To control the spread of a highly contagious, unknown virus, rapid and accurate characterization of the virus genome is crucial for developing sensitive screening assays. Metagenomic sequencing is a useful tool for rapid reconstruction of virus genomes, as evidenced by the discovery and characterization of SARS-CoV-2 [3][4][5]. Successful retrieval of a complete virus genome from complex clinical specimens requires very deep sequencing to compensate for contamination by host and commensal reads, with sequencing data processed on high-performance computers and analyzed by bioinformatics experts. As these resources are beyond the reach of most clinical laboratories, identification and characterization of unknown viruses are usually confined to reference laboratories. As a result, there is a time lapse between the initial presentation of one or more patients infected by an unknown coronavirus, inconclusive microbiological investigations in frontline laboratories, and the eventual referral to reference laboratories for etiologic investigation. The duration of this lapse may determine the controllability of an outbreak. Compared with metagenomic sequencing, characterization of a partial virus genome involves a simpler workflow that is easier to implement as part of etiologic investigation in frontline laboratories, providing hints for more timely follow-up actions. This pan-coronavirus approach was also adopted for the initial investigation of the Middle East respiratory syndrome (MERS) and COVID-19 outbreaks [4][5][6].
In the first part of this study, we evaluated the feasibility of characterizing coronaviruses directly from clinical specimens. We selected the partial RNA-dependent RNA polymerase gene (RdRP) as the amplification target, as it has been commonly used for coronavirus classification and phylogenetic analysis [7,8]. We sequenced the amplicons using Nanopore technology, which is the fastest and most scalable option in the current massively parallel sequencing market, and assessed its consensus accuracy against Sanger sequencing. As the COVID-19 pandemic is ongoing, every piece of genetic information about the causative agent may save lives. Therefore, in the second part of this study, we compared the SARS-CoV-2 RdRP sequences from our laboratory with genomes worldwide and looked for mutations that might alter the function of this key virus component. An overview of this study is shown in Fig. 1.

RNA extraction
Standard laboratory practices were applied to minimize risk of infection and contamination. RNA was extracted from 200–500 µL of respiratory specimens using EMAG® (bioMérieux, Marcy l'Étoile, France). Nasal swabs, nasopharyngeal swabs and throat swabs preserved in universal transport medium (UTM®, Copan, Murrieta, CA, USA) were homogenized by vortexing and added directly to NUCLISENS® lysis buffer (bioMérieux, Marcy l'Étoile, France). Posterior oropharyngeal saliva, nasopharyngeal aspirate and sputum were liquefied with equal volume of working sputasol (Oxoid, Poole, England), briefly centrifuged to sediment large cell debris, and 400 µL of supernatant was added to lysis buffer. Off-board lysis was performed at ambient temperature for 10 min before loading into EMAG® for total nucleic acid extraction, with elution volume of 50 µL. The extracts were kept on ice before testing or stored at −80 °C.
For the 6 PCR-negative specimens with sufficient residual RNA (Specimens 18, 20, 21, 23, 26 and 27), RT-PCR was repeated using an in-house developed protocol (Table 3). A new set of primers was designed by aligning second PCR primers [11] to 56 coronavirus reference genomes, with degenerate bases added at appropriate positions. SuperScript® III First-Strand Synthesis System (Invitrogen, Carlsbad, CA, USA) was used for reverse transcription from 8 µL of RNA, followed by PCR using AmpliTaq Gold™ DNA Polymerase with 20 µL of cDNA. PCR was optimized with higher

¶ N and orf1b correspond to nucleocapsid gene and open reading frame 1b, respectively. The results of BioFire® FilmArray® Respiratory 2 Panel were qualitative and therefore Ct values were not available
* Collected during the first wave of COVID-19 outbreak (January to early March, 2020, n = 4)
# Collected during the second wave of COVID-19 outbreak (mid-March to May, 2020, n = 7)
† Collected during the third wave of COVID-19 outbreak (late June onwards, 2020, n = 17)
‡ Sanger sequence was not available due to high level of background noise. The Nanopore read/consensus sequence was compared to SARS-CoV-2 reference genome (NC_045512.2)
§ The number of Nanopore reads was insufficient for generating accurate consensus sequence (< 30×)
Ct threshold cycle, HCoV human coronavirus, N/A not available, ND not done, NPA nasopharyngeal aspirate, NPS nasopharyngeal swab, NS nasal swab, pOS posterior oropharyngeal saliva, RT-PCR reverse transcription polymerase chain reaction, SP sputum, TS throat swab, VC virus culture

Bioinformatics
The bioinformatics workflow is summarized in Fig. 1. Nanopore sequencing reads from 'fastq pass' folders were used for data analysis. Reads from the first  [12]. From the resulting BAM files, consensus sequences were built against the best-matched reference using Unipro UGENE (version 1.29.0), and primer sequences were trimmed. If coverage depth was less than 30×, more sequencing reads were used for consensus building to attain a minimum depth of 30×. Identity of consensus sequences and similarity to their Sanger counterparts were evaluated using NCBI BLASTn. Full SARS-CoV-2 genomes were downloaded from the Global Initiative on Sharing All Influenza Data (GISAID) EpiCoV™ database (accessed on 3 June 2020) with the following search criteria: collection date from 1 December 2019 to 31 May 2020, human host, complete genomes > 29,000 bp, and high coverage. Partial RdRP sequence was extracted from SARS-CoV-2 Wuhan-Hu-1 reference

Nanopore sequencing results
Results are shown in Table 1.

Discussion
We successfully characterized coronaviruses directly from the majority of clinical specimens. For SARS-CoV-2, full-length RdRP sequences could be retrieved from specimens with Ct values of 31.68 (N gene) or less, suggesting that this method may be best used right after symptom onset, when viral load is at its maximum [15]. Our data showed that highly accurate consensus sequences could be built from error-prone Nanopore reads if coverage depth was sufficient (> 30×). Because the reference sequence of an unknown coronavirus would not be readily available, we repeated consensus building for selected specimens without the SARS-CoV-2 reference genome, and the consensus accuracy was not compromised.
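The effect of coverage depth on consensus accuracy can be illustrated with a minimal sketch. This is not the UGENE workflow used in the study; it is a simplified majority-rule consensus over reads that are assumed to be already aligned to the same coordinates. The function names, the toy sequence and the simulated error pattern are all hypothetical; only the 30× depth threshold comes from the text.

```python
from collections import Counter

MIN_DEPTH = 30  # minimum coverage depth stated in the text


def majority_consensus(aligned_reads, min_depth=MIN_DEPTH):
    """Majority-rule consensus of equal-length, pre-aligned reads.

    '-' marks a position not covered by a read. Columns with fewer than
    `min_depth` covering reads are reported as 'N'.
    """
    if not aligned_reads:
        return ""
    consensus = []
    for i in range(len(aligned_reads[0])):
        column = [read[i] for read in aligned_reads if read[i] != "-"]
        if len(column) < min_depth:
            consensus.append("N")
        else:
            consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)


def percent_identity(a, b):
    """Percentage of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def introduce_error(seq, pos):
    """Substitute the base at `pos` with a different base (toy error model)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return seq[:pos] + swap[seq[pos]] + seq[pos + 1:]


# Toy data: 40 reads of a hypothetical 8-base target, each carrying one error.
truth = "ACGTACGT"
reads = [introduce_error(truth, k % len(truth)) for k in range(40)]

consensus = majority_consensus(reads)
print(consensus, percent_identity(consensus, truth))
```

Every simulated read contains an error, yet each column is wrong in only 5 of 40 reads, so the majority base is correct at every position. With fewer than 30 reads the sketch returns 'N' instead of calling a base, mirroring the depth requirement described above.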
From our experience, the universal primers used in this study occasionally amplified human and commensal sequences. As the non-specific band(s) was very close to the target, gel purification is required to obtain clean Sanger chromatograms. In this regard, Nanopore sequencing facilitates a simpler workflow, as sequencing reads can be analyzed independently without gel purification. It may therefore provide better resolution for mixed coronavirus infection, which comprised about 4.3% of SARS-CoV-2-positive respiratory specimens from symptomatic patients [16]. Nanopore sequencing is also a faster option, as the time from amplicons to sequence data is about half that of Sanger sequencing. Compared with direct metagenomic sequencing, our method involved target enrichment by PCR and less complicated data processing, and consensus sequences were typically built in minutes. Using Flongle flow cells, reagent cost may be as low as 12 USD per sample for a 24-plex run [17], which is comparable to Sanger sequencing.
In general, the proportions of genomes possessing SNVs by geographical area (America, Asia/Middle East, Europe and Oceania) and by month of collection (Jan–May 2020) were similar, ranging from 3.00 to 6.22% (Table 4), with the exception of Africa (34.64%). As only 153 genomes were retrieved from Africa, at least 8 times fewer than from other areas, this relatively high proportion of genomes with SNVs may require confirmation by more representative sampling.
The partial RdRP gene we targeted encompasses parts of conserved domains that are important to polymerase functionality (Fig. 3). Our data displayed the diversity of SNVs involving 114 bases (28.93%) in a short segment of 394 bp, and missense mutations generally occurred at low frequencies (ranging from 1 to 47 genomes) compared to the 15324C > T synonymous mutation (n = 553). Among the missense mutations found in conserved domains, the frequency of 15438G > T was the highest (n = 34). This mutation changes the last residue of the cofactor nsp8 interaction site from methionine to isoleucine (M666I) and was predominantly found in Europe. As mutation is a double-edged sword, the effect of these missense mutations on the pathogenicity of SARS-CoV-2 awaits further investigation, and added knowledge in this area is important for the development of antiviral drugs, vaccines and diagnostic assays.
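The distinction between synonymous and missense SNVs drawn above can be sketched in a few lines. This is an illustrative helper, not the annotation method used in the study; the function name and its arguments are hypothetical. The ATG codon in the first example follows from the text, since methionine has a single codon and a G > T change at its third base yields ATT (isoleucine). The AAC codon in the second example is purely hypothetical and is not taken from the SARS-CoV-2 genome.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(codon, offset, alt):
    """Classify a single-base substitution within one DNA codon.

    codon  -- reference codon, e.g. "ATG"
    offset -- 0-based position of the substituted base within the codon
    alt    -- substituted base
    Returns (kind, ref_aa, alt_aa), kind being "synonymous", "missense"
    or "nonsense".
    """
    codon, alt = codon.upper(), alt.upper()
    if codon not in CODON_TABLE or offset not in (0, 1, 2) or alt not in BASES:
        raise ValueError("invalid codon, offset or base")
    if codon[offset] == alt:
        raise ValueError("alternative base equals reference base")
    mutant = codon[:offset] + alt + codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
    if ref_aa == alt_aa:
        kind = "synonymous"
    elif alt_aa == "*":
        kind = "nonsense"
    else:
        kind = "missense"
    return kind, ref_aa, alt_aa


# G > T at the third base of the methionine codon (M to I, as in M666I).
print(classify_snv("ATG", 2, "T"))
# Hypothetical codon showing a silent C > T change at the third base.
print(classify_snv("AAC", 2, "T"))
# Share of variable positions reported in the text: 114 of 394 bases.
print(round(100 * 114 / 394, 2))
```

Third-position changes are often silent because of codon degeneracy, but methionine is encoded only by ATG, so any substitution in that codon changes the residue.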
This study had several limitations. First, the variety of HCoVs might not be sufficient for thorough evaluation of a 'pan-coronavirus' assay, and further studies with more comprehensive sample collection are warranted. Second, as a portion of MinION flow cells possessed a suboptimal number of active pores, the sequencing time of some specimens might be overestimated. Third, as Nanopore consensus sequences were built by majority rule, minority SNVs present in the specimens might not be detected. In addition, as the GISAID EpiCoV™ database is expanding continuously, geographical and temporal SNV patterns may change as more SARS-CoV-2 genome data accumulate.

Conclusion
We developed and evaluated a method for direct characterization of coronaviruses from respiratory specimens, based on pan-coronavirus amplification and sequencing of partial RdRP gene. It provides a viable option for first-line etiologic investigation of suspected infection by unknown coronavirus, which may lead to more timely follow-up actions. The SNV data shed light on global distribution and frequencies of missense mutations in partial RdRP gene of SARS-CoV-2, providing valuable information for surveillance of this important antiviral drug and diagnostic target.