Modifications to JCVI’s primer design pipeline
The input into the degenerate primer design pipeline is a consensus reference sequence of the targeted genome. This consensus sequence is used as a reference for detecting alternative products and as a template from which primer pair candidates are selected. For segmented viruses, or if only a subset of the entire genome is targeted, a subset may be specified as the template, but the entire genome sequence is still required as a reference for detecting potential alternative amplification products.
The generation of the consensus sequence with ambiguity codes may be established by first generating a multiple sequence alignment (MSA) with a tool such as Clustalw  or Muscle  and then using a MSA-to-consensus generator such as cons from EMBOSS , or one of the applications from ANDES . If the sequence diversity of the targeted population or genomic region is significant, then it is recommended to use a tool, such as ANDES, to determine if it is necessary to stratify the population and also filter positions of potential variance by various means. This is described later in the Material and Methods subsection, “Constructing the consensus sequence with ANDES”.
The output is a set of degenerate primer pairs and their theoretical amplicons based on the consensus reference sequence. A variety of supplemental design information is also generated. Among them is a summary file of critique results including information such as primer and amplicon location on the reference, their melting temperatures and primer dimer calculations along with the sequences of the primers. An Adobe Portable Document Format (PDF) file, which contains a visualization of the tiling layout of the amplicons across the targeted genome, is also generated along with the summary file (Figure 3).
The core architecture for primer design has remained unmodified from its original form. The Coverage Manager (CM) component determines a dynamic tiling path across the genome based on the results of the Primer Critiquor (PC) component’s examination of candidate primer pairs. The PC component consists of a series of tests, or critiques, that determine whether a primer pair will be successful based on empirical evidence from previous sequencing experiments. The CM component utilizes primer3  to generate a list of candidate primer pairs, however because primer3 will return an error message if ambiguity codes are found on the user-specified template, an extra step is performed to convert ambiguity codes into N’s. After primer3 has generated a list of primer pair candidates with embedded N’s, the CM remaps the N’s back to their original ambiguity code and passes the degenerate primer pair candidates to the PC component for additional critiquing.
The following critique modules, utilized by the PC component, were modified to assess the anticipated success of primer pair candidates with degenerate base positions.
Alternative amplification detection
Instead of using BLAST , with a small word size, to search for primer binding sites on the reference consensus genome, dreg, from EMBOSS, is utilized. dreg uses regular expressions, a highly flexible pattern recognition specification, for sequence matching. Based on the conditions of the laboratory protocol utilized in the high-throughput sequencing laboratory, it was determined that the binding of the 11 base pairs of the 3’ end were sufficient for the polymerase to initiate synthesis of the copy strand. For this reason, only the 11 base pairs of the three prime end of the degenerate primer sequence are specified in the regular expression pattern. Detected primer binding sites are then analyzed to determine whether an alternative product will be formed, based on detected primer binding site orientation and pair-wise distances.
Primer dimer detection and melting temperature calculations
For each degenerate sequence in a primer pair candidate, a list of disambiguated sequences is generated by enumerating all possible non-degenerate primer sequences that can be represented by each degenerate sequence. Self and pair-wise primer dimer predictions are calculated within and between the disambiguated primer sequence lists, respectively. Primer melting temperatures are computed by averaging the nearest neighbor thermodynamic calculations  of the melting temperatures of the disambiguated sequences derived from the degenerate sequence.
Constructing the consensus sequence with ANDES
The construction of the template and/or reference consensus sequence with ambiguous bases is a critical step prior to executing the degenerate primer design pipeline. Unlike using standard, i.e. non-degenerate, primers where the analyst must assume that the primers are binding to conserved regions surrounding the target, the necessity of degenerate primers presumes that the target population is sufficiently heterogeneous such that there are insufficient conserved regions to allow standard primers to successfully bind and amplify all samples from the population. Degenerate primer design allows for a mixed population of primer pairs to be simultaneously available during the PCR reaction, thus increasing the likelihood of successfully targeting an unknown sample from a heterogeneous population. A properly designed consensus sequence, or set of consensus sequences, will represent this heterogeneous population, maximizing the likelihood of attaining PCR products for all samples, while at the same time, minimizing the potential for alternative products. The number of alternative products may exhibit a quadratic relationship with the number of unique primer sequences involved in the PCR. Thus, it is a goal to minimize the number of degenerate bases in each primer sequence, while maximizing the proportion of the target population the primers specifically bind to. The end effect is to limit the number of degenerate bases per primer to no more than 4, but preferably less than 3 bases.
ANDES tools provide the analyst with two key methodologies to support minimizing the number of degenerate bases utilized, while maximizing the proportion of the population targeted: abundance filtering and sample stratification. Ultimately, the targeted percent of degenerate bases in the consensus sequence should be no more than 10%, but ideally less than 8%.
Abundance filtering in ANDES can be performed using several techniques to reduce the number of degenerate bases introduced into the consensus sequence. When filtering is applied, the analyst assumes that not all point mutations represented in the set of sequences acquired from the population are equally important. Some polymorphisms may be due to sequencing error, while others may only represent an insignificant subpopulation. Given a MSA constructed from a set of sequences determined to be a reasonable representation of the population, ANDES computes the distribution of nucleotides for each position. A nominal filter or percentage filter can be applied to remove the presence of alleles that are not represented at or above a user-specified frequency or percentage threshold, respectively. For example, the degenerate base W is required if both the alleles A and T exist in the population for a specific position. However if 2% of the samples are A’s, and 98% are T’s, applying a 4% percentage filter would change the W to the non-degenerate base, T, if the analyst believes that this polymorphism was actually due to sequencing error. Alternatively, a statistical filter, combining both frequency and percentage, may be applied. This utilizes the binomial distribution to determine whether an allele would be detected upon resampling to the same depth based on a user-specified confidence, i.e. alpha. Lastly, an optimization-based filter is also available, in which the analyst specifies the maximum percentage of degenerate bases to be allowed in the consensus sequences, and the algorithm iteratively removes alleles starting with the ones with the lowest proportion across the samples, until the maximum percentage of degenerate bases have been achieved. If these filtering techniques either remove alleles that need to be targeted, or do not reduce the number of degenerate bases in the constructed consensus sequence sufficiently, then sample stratification will be necessary.
The goal of sample stratification is to reduce the total variation in a combined population by placing individuals into clusters which contain less intra-cluster variance than the entire population. The more a difference between clusters can be used to separate the clusters, the less variation will exist within each of the formed clusters with respect to the entire population. This stratification, i.e. clustering, can be performed with ANDES. Given a MSA, ANDES can be used to compute a distance matrix and dendrogram, based on sequence similarity, which can then be used to generate clusters given a user-specified cutoff. The maximum intra-cluster percent dissimilarity cutoff that works well is approximately 10%. When the percent similarity within a cluster (or undivided set of sequences) falls below 90%, the number of degenerate bases that must be introduced into the consensus sequence becomes too high. For each cluster that is created to reduce the amount of intra-cluster dissimilarity, a separate consensus sequence is subsequently generated to represent that subpopulation.
When determining an appropriate threshold for any of these methodologies, it is important to keep in mind the number of sequences included, since the confidence of real polymorphisms existing in the population can only be attained with sufficient sample size. A general landscape of variation can be ascertained with a plot of normalized entropy versus location. With sufficient conserved regions flanking highly variable regions, and a sufficiently large amplicon size, regions with localized high rates of variation do not generally introduce difficulty in primer pair selection due to the primer designer’s dynamic tiling capabilities.
Computational degenerate primer design parameters
The JCVI primer design pipeline automatically detects sub-regions of the targeted genome sequence that may use the standard protocol, or must use the high GC protocol. Once the sub-regions have been separated, the proper primer design parameters are utilized for the design and selection of primer pairs.
Key parameters shared between the standard and high GC protocols include: optimal/minimum amplicon dynamic tiling overlap of 100/100 bp, and optimal/minimum/maximum primer lengths of 20/18/25 bp, low complexity filter of 80%, primer binding site hairpin detection at >14 stem hydrogen bonds and <11 bp loop circumference size, maximum internal primer dimer of 8 bp and maximum end primer dimer of 3 bp.
Key parameters where the two protocols differed include: minimum/maximum amplicon melting temperatures of 68°/81.5°C (standard) and 85°/95°C (high GC), optimal/minimum/maximum amplicon sizes of 650/600/700 bp (standard) and 325/300/350 bp (high GC), and maximum amplicon dynamic tiling overlap of 550 bp (standard) and 300 bp (high GC). (MeV was the only exception. Its primer design utilized a project specific optimal/minimum/maximum amplicon size of 750/550/800 bp to reduce the number of amplicons necessary.)
Amplicon sizes were selected to be shorter than sequencing read length to achieve full bi-directional coverage, increasing the accuracy of downstream computational variation detection and decreasing the necessity for manual review for indeterminate base calls.
For both the standard and the high GC protocols, reverse transcription PCR (RT-PCR) was performed using Qiagen One Step RT-PCR Kit (cat# 210212). Each viral sample underwent RT-PCR reactions with 96 pairs of designed primers in a 96-well format. Qiagen One Step RT-PCR was set up in 10μl reactions containing 0.6μl water, 1.6μl buffer (Qiagen), 1.6μl Q solution (Qiagen), 0.3μl dNTP (Qiagen), 0.3μl enzyme (Qiagen), 0.4μl RNaseOut (Invitrogen), 2.2μl undiluted RNA, and 3μl primers (1.6μM, forward and reverse mixed). An additional 1.6μl of Q solution was added to each RUBV-G1 and RUBV-G2 reactions due to the high GC content of the genome, while for the other genomes, an additional 1.6μl of water was added to each reaction. Amplification was done in 96-well format on an MJ Research DNA Engine Tetrad 2 thermal cycler with the following cycling conditions: 1 cycle of 50°C (30 min); 1 cycle of 95°C (5 min); 35 cycles of 94°C (30 sec), 55°C (30 sec), and 72°C (1 min); 1 cycle of 72°C (10 min); hold at 4°C.
After amplification, 2μl of each reaction was run on a 1% agarose gel for 1 hour to visualize bands and confirm amplification. Unincorporated dNTPs were dephosphorylated and excess primers were removed from products using a SAP/Exo I reaction containing 0.5 units of shrimp alkaline phosphatase (USB Corporation), 1.0 unit of exonuclease I (USB Corporation) and molecular biology grade water. The cycling program was: 1 cycle of 37°C (60 min) and 1 cycle of 72°C (15 min).
Sanger sequencing reactions were performed on each RT-PCR product on a standard high-throughput sequencing system using Big Dye Terminator v3.1 (Applied Biosystems) with M13 sequencing primers (Forward primer: TGTAAAACGACGGCCAGT; Reverse primer: CAGGAAACAGCTATGACC). Dye terminators were removed by ethanol precipitation and sequences were obtained with a 3730xl DNA Analyzer (Applied Biosystems).
Raw sequence data was computationally trimmed to remove any primer-derived sequence, as well as low quality sequence. Genome sequences were assembled using JCVI’s internally developed assembly software named “FLAP” (unpublished) that is based on minimus, which is a component of the open-source AMOS project  (http://amos.sourceforge.net). All sequencing data was released to Genbank with annotation and can be found using the BioProject ID as provided in Table 1.