Repeated sequences in the soybean and rice genomes are primarily comprised of autonomous and non-autonomous transposable elements, and this class of repeats constitutes the majority of the soybean genome (Mudge et al., 2004; IRGSP, 2005; Schlueter et al., 2007). In species within the Plant Kingdom, transposable elements are known to have an elevated rate of a methyl group covalently linked to the cytosine in the dinucleotide 5’-CG-3’ and the trinucleotide 5’-CNG-3’ (Gruenbaum et al., 1981; Chandler and Walbot, 1986; Vongs et al., 1993; Meyer and Heidmann, 1994; Tran et al., 2005; Zilberman and Henikoff, 2007); thus the hypomethylated fraction, known to be gene-rich, can be distinguished from the hypermethylated fraction, generally rich in repeated sequences, by restriction endonucleases whose activity is blocked by methylation of cytosines in the recognition sequence (Antequera and Bird, 1988; Bennetzen et al., 1994; Rabinowicz et al., 2003). Among plant species, the ability to exploit methylation status as a method to separate hypermethylated repeat sequence from hypomethylated “genic” regions has been most successfully used in maize genomic sequencing via the Sanger platform by a number of researchers (Rabinowicz et al., 1999; Yuan et al., 2002; Rabinowicz et al., 2003; Whitelaw et al., 2003; Emberton et al., 2005).
While the majority of the total size of the repeated sequence in a genome is comprised of transposable elements, there are additional sources of repeated sequence that are not derived from transposable elements (Bennetzen et al., 2005; DeBarry et al., 2008). By having a large majority of the genomic sequence identified and ordered, the degree of repetitiveness of any given locus can be determined computationally using a range of metrics (Bao and Eddy, 2002; DeBarry et al., 2008). With the ability to predict the degree of repetitiveness of any given locus from the reference assembly, the promotion of any given SNP into marker development would depend on its context, where unique loci within or adjacent to gene-rich regions would be of greatest interest for downstream applications (Barbazuk et al., 2007; Gore et al., 2009).
The advent of massively parallel sequencing technology has allowed for the rapid resequencing of genomes at an exponentially larger scale for a fraction of the time and cost when compared with traditional Sanger technology (Shendure and Li, 2008). Therefore, rapid genome-wide SNP detection via deep resequencing using massively parallel sequencing strategies, in combination with a comprehensive reference genomic assembly, has become tractable (Hillier et al., 2008; Ossowski et al., 2008). However, the overall size and structure of plant genomes generally constitutes a major obstacle to conventional sequencing methods. While SNP discovery with next-generation sequencing technologies for the moderately complex genomes of Arabidopsis thaliana (Arabidopsis) and rice has been driven by whole genome resequencing without a need to reduce the complexity of the sequenced fraction of the genome (Ossowski et al., 2008; Huang et al., 2009), a majority of plant genomes, particularly among agronomically important species, appear to be more complex in their architecture, ploidy level, and repeat content when compared with these two model species (Bennetzen et al., 2005). Furthermore, the typically short read lengths of the sequences generated with next-generation sequencing technologies presents additional challenges when aligning those reads to complex genome sequences which possess a large content of repeated sequences (Pop and Salzberg, 2008). As a result, researchers have come up with various strategies to reduce the complexity of large genomes before next-generation sequencing. One of those strategies, reduced representation libraries (RRLs), was originally developed for SNP discovery with Sanger resequencing pools (Altshuler et al., 2000) and later adapted for SNP discovery with next generation resequencing platforms (Van Tassell et al., 2008). The rationale behind using an RRL for resequencing is to reduce the complexity of the subset of the genome that is being sequenced (potentially by orders of magnitude) and to generate independently constructed libraries that contain a large set of common fragments for alignment, assembly, and SNP discovery (Altshuler et al., 2000; Van Tassell et al., 2008). The utilization of RRLs for SNP discovery via massively parallel sequencing platforms in plant species has recently been described in Zea mays (maize) (Gore et al., 2009). In this study, Gore et al. (2009) described a method that relies on the use of a single methylation-sensitive restriction endonuclease followed by size selection to isolate blocks of genomic DNA corresponding to the hypomethylated fractions of maize B73 and Mo17 inbred genomes. SNPs were identified by comparing the public B73 genome assembly with sequencing data from the Mo17 methyl-filtered library generated with the 454 FLX system. Using coverage requirements of ≥2 reads as well as filtration for paralogous sequences in the maize B73 reference assembly generated acceptable rates of true SNP discovery (84.9%). cDNA libraries also are an effective way to target coding regions of a genome and avoid sequencing repetitive regions (transcribed sequences derived from transposable elements are infrequently observed in cDNA libraries). Recent work in complex genomes such as maize, Eucalyptus grandis (rose gum eucalyptus), and Brassica napus (oilseed rape) utilized the strategy of deep resequencing of the transcriptome with next generation sequencing platforms for SNP discovery (Barbazuk et al., 2007; Novaes et al., 2008; Trick et al., 2009), with SNPs validation rates ranging from 83 (Novaes et al., 2008) to 87.4% (Trick et al., 2009). Focusing on the transcriptome is a method of reducing the complexity of these genomes, but the two major limitations stem from the fact that (i) the expressed sequences in transcriptomes inherently have variable representation in the cDNA libraries and (ii) the transcribed regions are predominantly exon-specific.
We report here the development of a SNP discovery assay using the Illumina next generation sequencing platform, and the use of RRLs derived from sequential digestion of genomic DNA with a methylation-sensitive restriction endonuclease, then with the 4-bp restriction enzyme DpnII, followed by biotin selection and size selection. This strategy for SNP discovery, applied to the moderately complex genome of rice and the highly complex genome of soybean yields highly redundant sequences while using a modest sequencing capacity. As a result of this coverage (or redundancy) via the RRL strategy, SNPs have very low rates of false positive calls (3.6% or less) and are primarily located in or are closely adjacent to genes. Unlike transcript-based strategies where coverage is a by-product of expression level and the sequencing is exon-specific, the SNPs discovered by this RRL strategy are found in coding sequences, introns, regulatory regions, and intergenic regions. Further, the SNP distribution follows the gene-rich hypomethylated fraction which is less biased in its distribution relative to transcript libraries that are based on temporal and spatial expression patterns within the tissue(s) sampled. The protocol described here can be applied with a series of individual methylation-sensitive restriction endonucleases to a wide range of moderately or highly complex plant genomes where sufficient genomic reference sequence is available to generate appropriate numbers of SNPs for a wide range of genetics and breeding applications.
MATERIALS AND METHODS
Tissue Preparation and Genomic DNA Sample Isolation
The soybean and rice genomic DNA were extracted according to protocols given in the Supplementary Data section.
Rice Genomic DNA Preparation and Library Construction
Rice genomic DNA (10 μg) was digested for 4 h with 15 U of AlwNI (New England Biolabs, Ipswich, MA) in a total volume of 20 μL of 1X NEB4 buffer (New England Biolabs) then purified and concentrated with QIAquick PCR purification spin columns (Qiagen, Valencia, CA). Biotinylated AlwNI-specific adapters were created by mixing 30 pmol each of two synthetic oligonucleotides (upper strand,/5’Bio/GGTTGACATGCTGGATTGAGACTTCCAGNNN; where NNN is ACA, AAC, CAC, TCT, TTC or CTT; lower strand,/5’PO4–/CTGGAAGTCTCAATCCAGCATGTC) in 20 μL water, heating them up at 95°C for 2 min, allowing them to cool slowly to room temperature, and then mixing all six different adapters together. Adapters (37.5 pmol) were ligated to the digested DNA in a total volume of 20 μL of 1X ligase buffer (New England Biolabs) containing 10,000 U of T4 DNA ligase (New England Biolabs). The reaction was incubated overnight at 16°C, followed by purification and concentration with QIAquick PCR purification spin columns (Qiagen), and then resuspended in a total volume of 88 μL with water.
The ligated DNA was digested for 45 min with 20 U of DpnII (New England Biolabs) in a total volume of 100 μL of 1X DpnII buffer (New England Biolabs), then extracted with 100 μL of phenol/chloroform/isoamyl alcohol (25:24:1). The sample was precipitated with ethanol and 3 mol L−1 NaOAc and resuspended in 36 μL of water. 30 pmol of GEX DpnII Adaptor 1 (Illumina, San Diego, CA) was ligated to the digested DNA in a total volume of 50 μL of 1X ligation buffer (Invitrogen, Carlsbad, CA) containing 5 U of T4 DNA ligase (Invitrogen). The reaction was incubated overnight at 16°C followed by extraction with 100 μL of phenol/chloroform/isoamyl alcohol (25:24:1) and precipitation with ethanol and 3 mol L−1 NaOAc. DNA was resuspended in 100 μL of water. 100 μL of Streptavidin-Dynabeads M-280 (Invitrogen) were washed twice with 1 mL TE buffer then resuspended in 100 μL 2X B&W buffer (10mmol L−1 Tris-HCl, pH 7.5; 1 mmol L−1 EDTA; 2 mol L−1 NaCl). The GEX DpnII Adaptor 1-ligated DNA fragments then were added to the beads and incubated for 30 min at 30°C with horizontal mixing. After withdrawing the supernatant, the beads were collected, washed three times with 1 mL 1X B&W buffer then three times with 1 mL TE buffer. The washed beads then were resuspended in 200 uL of 1X NEBuffer 4 (New England Biolabs) containing 20 U AlwNI (New England Biolabs) and incubated for 90 min at 37°C with horizontal mixing. The supernatant containing the released fragments was then transferred to a new tube, extracted with 100 μL of phenol/chloroform/isoamyl alcohol (25:24:1) and precipitated with ethanol and 3 mol L−1 NaOAc. DNA was resuspended in 6 μL of water. AlwNI-specific GEX Adaptor 2 were created by mixing 30 pmol each of two synthetic oligonucleotides (upper strand, CAAGCAGAAGACGGCATACGANNN; where NNN is ACA, AAC, CAC, TCT, TTC or CTT; lower strand,/5’PO4–/TCGTATGCCGTCTTCTGCTTG) in 20 μL water, heating them to 95°C for 2 min, and then allowing them to cool slowly to room temperature, and then mixing all six different adapters together. 10 pmol of AlwNI-specific GEX Adaptor 2 was then ligated to the DNA fragments in a total volume of 10 μL of 1X ligation buffer (Invitrogen) containing 5 U of T4 DNA ligase (Invitrogen). The reaction was incubated for 2 h at 20°C. 1.5 μL of the ligation reaction were subject to PCR amplification (30 s at 98°C, followed by 14 rounds of 10 s at 98°C, 30 s at 60°C, 30 s at 72°C, and a final extension step of 10 min at 72°C) by using 25 pmol of GX1 primers (Illumina) and 25 pmol of GX2 primers (Illumina) in a total volume of 50 μL of 1X Phusion HF master mix (Finnzymes) containing 250 U Phusion HF DNA polymerase (Finnzymes) and 25 mmol L−1 dNTPs. The resulting PCR products then were loaded on an E-Gel 2% with SYBR Safe agarose gel (Invitrogen) in the presence of a low molecular weight DNA ladder (New England Biolabs). After running the gel for 30 min in an E-Gel PowerBase v.4 electrophoresis unit (Invitrogen), a 150–500bp DNA smear was cut out and the DNA was purified with QIAquick Gel Extraction kit (Qiagen) and resuspended in 30 μL EB Buffer (Qiagen).
Soybean Genomic DNA Preparation and Library Construction
The soybean libraries were constructed according to a protocol similar to the one used for making the rice libraries. Genomic DNA (10 μg) was digested for 4 h with 15 U of PstI (Promega) in a total volume of 20 μL of 1X PstI buffer (Promega), containing 1X acetylated BSA (Promega). Biotinylated PstI-specific adapters were created by mixing 1500 pmol each of two synthetic oligonucleotides (upper strand,/5Bio/GGTTGACATGCTGGATTGAGACTTCTGCA; lower strand,/5’PO4–GAAGTCTCAATCCAGCATGTC) in 100 μL water, heating them to 95°C for 2 min, and then allowing them to cool slowly to room temperature. A ∼50-fold excess of PstI-specific adapters (37.5 pmol) relative to available PstI ends was ligated to the digested DNA.
After digestion with DpnII, ligation of GEX DpnII Adaptor 1 (Illumina) and incubation of the GEX DpnII Adaptor 1-ligated DNA fragments in presence of Streptavidin Dynabeads M-280 (Invitrogen), the washed beads then were resuspended in 200 μL of 1X PstI buffer (Promega) containing 1X acetylated BSA (Promega) and 20 U PstI (Promega), and incubated for 90 min at 37°C with horizontal mixing. The supernatant containing the released fragments then was transferred to a new tube, extracted with 100 μL of phenol/chloroform/isoamyl alcohol (25:24:1) and precipitated with ethanol and 3 mol L−1 NaOAc. DNA was resuspended in 6 μL of water. PstI-specific GEX Adaptor 2 were created by mixing 1500 pmol each of two synthetic oligonucleotides (upper strand, CAAGCAGAAGACGGCATACGATGCA; lower strand,/5’PO4–/TCGTATGCCGTCTTCTGCTTG) in 100 μL water, heating them to 95°C for 2 min, and then allowing them to cool slowly to room temperature. 15 pmol of PstI-specific GEX Adaptor 2 (Illumina) then was ligated to the DNA fragments and 2.5 μL of the ligation reaction were subject to final PCR amplification and size-selection steps, using conditions similar to what was used for the rice libraries.
Cluster Generation and Single Read Sequencing
Cluster generation and single-read sequencing were performed on an Illumina Cluster Station and Genome Analyzer, respectively, according to the manufacturer's instruction. Sequencing primer hybridization was performed on the Cluster Station using the smRNA/GEX-DpnII sequencing primer (Illumina) and 36 cycle single-read recipes were used on the Genome Analyzer. Sequences and quality scores were generated with the Illumina Pipeline Version 1.4 software for image analysis and basecalling. During basecalling, the GERALD alignment from the control lane (with ELAND) was used to obtain error estimates and recalibrate the raw quality scores (from Bustard) for the other sample lanes. After initial base calling, additional custom quality filtering was performed, in which reads were excluded if they harbored one or more violations of the “chastity” rule, as defined in the Illumina basecalling pipeline, in the first 32 bases of a read (rather than the default 25 bases). Briefly, the “chastity” at a given position is defined by IntA/(IntA + IntB) > = 0.6 where IntA and IntB are the highest and second highest intensities, respectively, recorded from the Illumina pipeline at that given position. Finally, the last four bases were removed from 36-base reads to generate 32-base reads (Tables 1 and 2).
|No. of total reads generated (after initial basecalling)||37,666,279||38,000,474|
|No. of filtered total reads||24,519,484||23,101,973|
|No. of unitags (generated from filtered total reads)||965,610||885,429|
|No. of HQ unitags||255,918||246,102|
|Alignment of HQ unitags against the reference sequence:|
|Two or more mismatches||19,225||21,388|
|HQ unitags aligning uniquely to the reference sequence with zero mismatch||152,185||144,559|
|No. of total reads generated (after initial basecalling)||6,090,577||6,184,638|
|No. of filtered total reads||2925,512||2740,977|
|No. of unitags (generated from filtered total reads)||208,607||244,427|
|No. of HQ unitags||64,923||71,848|
|Alignment of HQ unitags against the reference sequence:|
|Two or more mismatches||7202||14,651|
|HQ unitags aligning uniquely to the reference sequence with zero mismatch||42,884||36,666|
Generation of Unitags and Post-filtration High Quality Unitag Datasets
Thirty-two-base reads from a given genotype were filtered using recalibrated quality scores generated by the Illumina basecalling pipeline to remove reads containing at least one base with a PHRED-equivalent quality score below 15. The score threshold of 15 was used to remove all low or very low quality reads from the analysis. Reads then were condensed into a unique set by collapsing identical reads into a single sequence or “unitag.” A new consensus quality value per base was re-calculated for each unitag using a best-evidence per-base algorithm analogous to that employed by traditional Sanger sequence assemblers. A second filtration was then applied on the unitags, in which only unitags formed from two or more reads were kept for further analysis, as single copy reads repeatedly had high false positive rates for SNP discovery during our initial validation (data not shown). A third optional filtration step is available, in which unitags containing at least one base with a consensus quality score below some defined value can be filtered out. While the consensus score threshold for this third filter is user-dependent, this optional filtering step was applied at the nucleotide position containing the SNP only, after the first two filtration steps were applied. Unique sequences passing the first two filtration steps were labeled as high-quality (HQ) unitags and represented only a small but highly stringent portion of the original sequence space (Tables 1 and 2).
Indexing Unitags with Vmatch and Performing Pairwise Alignment for Single Nucleotide Polymorphism Detection
HQ Unitag sets for a pair of genotypes were compared to each other (in all vs. all pairwise fashion) using the software Vmatch (http://www.vmatch.de/ [verified 27 Apr. 2010]). Vmatch requires the building of a suffix array index from one of the unitag sets, which was done with the following command options: “mkvtree -db GenotypeA.HQunitag.fasta -dna -indexname SFTree.GenotypeA -pl -allout.” The next command options were then used to compare the unitags, allowing up to one mismatch in their global alignments (hamming distance = 1): “vmatch -q GenotypeB.HQunitag.fasta -d -l tag_length -h 1 -s abbrev -showdesc 0 -v SFTree.GenotypeA,” where “tag_length” was 32. Results from Vmatch were parsed with local scripts and a bipartite graph was built from the pairwise matches. Filters coded into the scripts removed unwanted alignments from the graph to make certain that complex relationships of one-to-many or many-to-many matches (modeling alignments between very similar paralogs), as well as perfect one-to-one matches, were removed while retaining only the one-to-one matches carrying a SNP.
Mapping of the Single Nucleotide Polymorphism-carrying Unitags to the Reference Genome using a k-mer Table
The list of SNP-bearing one-to-one alignments between HQ unitags of two genotypes was then referenced against the assembly of the genome (the reference) to ascertain the uniqueness of their sequence as well as to obtain location coordinates (those having multiple locations in the genome do not provide simple reliable markers). To achieve this, the sequence of the reference genomes (soybean Glyma1.0 released assembly from JGI or rice assembly from TIGR v5) was indexed previously for all possible fixed-length substrings (k-mers) of length equal to the length of the unitags. Local software tools (Kent van der Velden, personal communication, 2008) scanned the sequence and efficiently stored all overlapping substrings of length k as well as their position in the genome (determined by the position of the first nucleotide in the k-mer) and their direction (sense or antisense). Each scanned k-mer was compared to its own reverse-complement string and the string sorting first lexicographically was the one to be stored. Efficiency in storage was achieved by storing each k-mer (or its reverse-complement) sequence only once as a 64-bit binary number representation (with 2 bits consumed per nucleotide for a maximum length of 32 bases). Subsequent sampling of the same substring simply added to its count in the index, and a new position and direction was recorded in a lookup table. Because all substrings stored in the index are lexicographically sorted, future seeks with a binary search function are very quick and all string comparisons are performed using bitwise operations. The process of indexing a full genome may be slow and results in large files, but needs to be done only once per value of k. The k-mer tools were then used to query the k-mer indices of the reference genomes with each pair of aligned unitags that identified SNPs between the two genotypes in the Vmatch alignment. Assuming a high quality assembly of a reference sequence, from a pair of aligned unitags, these were the possible outcomes of the query: (i) both of the polymorphic unitags have perfect unique matches to k-mers mapped in different coordinates in the genome (paralogs); (ii) one or both unitags match a k-mer mapped to several locations in the genome (likely to be part of a repeat area); (iii) one of the unitags matches a single copy k-mer perfectly while the other aligns with a mismatch at the site of the SNP (further evidence for SNP with two alleles); (iv) both unitags match imperfectly the same single-copy k-mer with the mismatch at the site of the SNP (three alleles discovered); and (v) the unitags fail to match any k-mer within the limits imposed. Only unitags that mapped to a single location in the reference genome with up to one mismatch at the position of the SNP identified previously, corresponding to Vmatch alignments that classified to outcomes (iii) and (iv) above, were captured. Only SNPs contained within these captured unitags, and whose recalibrated consensus quality scores for both genotypes (as defined by the Illumina basecalling pipeline) was equal to or higher than 25, were listed as confirmed SNPs between the tested genotypes and identified as candidates for Sanger-based validation.
Sanger-based Validation and Allele Frequency Determination
Sequences (400 bps) flanking either side of the confirmed SNP sites were extracted using the sequence of the reference genomes (soybean Glyma1.0 released assembly from JGI or rice assembly from TIGR v5). Local k-mer tools used for mapping SNP-carrying unitags to the reference genomes were used to mask repetitive k-mer strings from each 801 bp sequence. PCR primers (length 18–22bp, Tm 57–59°C) were designed from each masked 801 bp sequence, using a local version of the Primer3 primer design software tool, and M13 forward and reverse “tails” were added to the 5’ ends of the PCR primer sequences before ordering. PCR amplifications were performed on 192 randomly selected sequences (sequences for which PCR primers had been automatically designed were sorted by increasing random numbers, using the “RAND” function in Excel, and the first 192 sequences selected for processing). Genomic DNA from each genotype was subject to PCR amplification (15 min at 95°C, followed by 40 rounds of 30 s at 95°C, 30 s at 60°C, 1 min at 72°C, and a final extension step of 10 min at 72°C) by using 5 pmol each of the “tailed” PCR primers in a total volume of 10 μL of 1X HotStarTaq Master Mix (Qiagen) containing 25 U HotStar Taq DNA polymerase (Qiagen), 1.5 mmol L−1 MgCl2 and 200 μmol L−1 dNTPS. PCR cleanup reactions were then performed by mixing 2 μL of PCR products with 0.75 μL of ExoSAP-IT (USB Corporation, Cleveland, OH) in a total volume of 17 μL with sterile distilled water, and incubating at 37°C for 25 min then 80°C for 25 min. Five microliters of the cleaned-up amplified DNA then were end-sequenced using M13 forward and reverse oligonucleotides and the ABI BigDye version 3.1 Prism sequencing kit. After ethanol-based cleanup, cycle sequencing reaction products were resolved and detected on Applied Biosystems (Foster City, CA) ABI3730xl automated sequencers. Individual sequences from each genotype were combined into a single project (one project per amplified fragment) and assembled with the Phred/Phrap/Consed package (see http://www.phrap.org/phredphrapconsed.html [verified 27 Apr. 2010]). Confirmed SNPs identified on the Genome Analyzer were validated by comparison to single-base mismatches between genotypes located in regions that matched the original read sequence generated by the Genome Analyzer.
To determine allele frequency, the same PCR primers and conditions were used on genomic DNA from 25 public genotypes in soybean across seven randomly chosen loci (out of the 192 previously sequenced via Sanger sequencing) For rice, 10 randomly chosen loci (out of the 192 previously sequenced via Sanger sequencing) were sequenced across 34 public genotypes. The DNA was extracted, amplified, sequenced and assembled as described above, and single-base mismatches identified as SNPs were compared to the original SNP found by sequencing the two original genotypes on the Genome Analyzer.
RESULTS AND DISCUSSION
Construction of Reduced Representation Sequencing Libraries
To construct the RRLs, we used a strategy previously adapted to maize (Deschamps et al., 2008) and adopted from a protocol developed by Illumina/Solexa for digital gene expression-tag profiling with DpnII (Fig. 1). The goal of our RRL strategy was to (i) use a highly sensitive methylation-sensitive restriction endonuclease with the intention of enriching for genic regions and avoiding the capture of the repeated fraction of the genome enriched in transposable elements, (ii) incorporate a second digestion with DpnII and size selection on an agarose gel to capture fragments of a size range that would amplify efficiently on the Illumina platform, and (iii) add a biotin selection step following digestion with DpnII to selectively capture DNA fragments digested by PstI and DpnII.
A recent analysis of genome filtering using methylation-sensitive enzymes in Aegilops tauschii (bread wheat), Zea mays (maize), and Nicotiana tabacum (tobacco) revealed that among nine enzymes evaluated, PstI and AatII were found to enrich for either unique (no significant hit with BLAST against annotated or curated genes) or genic (significant BLAST hits with either nucleotide or protein libraries) regions while displaying relatively low rates of clones with homology to either transposable elements or organellar genomic sequence (Fellers, 2008). The genomic DNA of the selected soybean Williams82 and Pintado cultivars was completely digested with the methylation-sensitive restriction endonuclease PstI (5’-C*TGCA^C-3’), recognizing its historical use in low copy clone libraries during the construction of the original RFLP maps in diverse plant species (Burr et al., 1988; McCouch et al., 1988; Gill et al., 1991). The PstI activity is blocked by 5-C methylation of the cytosine in the first position (C*) in the recognition sequence (Fellers, 2008). DNA was then subjected to a column-based cleanup, allowing the removal of large undigested fragments containing large blocks of hypermethylated DNA. Smaller PstI-digested fragments corresponding to hypomethylated DNA fractions were subsequently ligated to biotinylated PstI-specific adapters, and subjected to the protocol described in Fig. 1 to generate the RRLs.
The rice genome at an approximated 389 Mb (IRGSP, 2005) is roughly one-quarter the size of the soybean genome at 1.15 Gb (Nelson and Shoemaker, 2006) with the rice genome having a repeat content of roughly 35%. The relatively small genome size and relatively low repeat content (relative to other related species within the Plant Kingdom) presented the opportunity to validate the use of the methylation-sensitive restriction enzyme AlwNI which has a nine base pair recognition site (5’-CAGNNNCTG-3’) where the middle three bases can be any of the four nucleotides. AlwNI activity has been shown to be blocked by methylation of the cytosine in the sixth position (C*) in the recognition sequence (Bourbonniere and Nalbantoglu, 1991; McClelland et al., 1994) and the online resource Rebase (http://rebase.neb.com/cgi-bin/damlist?eAlwNI [verified 27 Apr. 2010]) further indicates that cytosine methylation at either the first or sixth position blocks restriction endonuclease activity. After AlwNI digestion of the genomic DNA for the selected rice Taichung65 and Kasalath cultivars, a column-based clean-up and the ligation of biotinylated AlwNI-specific adapters (6 of the potential 64 combinations were used in this study, see Materials and Methods for details) to the digested DNA, the same modified strategy as the one used for soybean (Fig. 1) was employed to generate the RRLs.
DNA Sequencing and Filtering
The RRLs for the soybean cultivars Williams82 and Pintado as well as the rice cultivars Taichung65 and Kasalath were sequenced on the Illumina platform for the first 36 bps from the DpnII site. The sequencing primers for all reactions included the DpnII recognition sequence (5’-GATC-3’) in the last 4 bp of the sequencing primer. One full flow cell was employed for the soybean genotypes, with three lanes for Williams82 and four lanes for Pintado. For rice, two lanes of a flow cell were used for each genotype. For the 36 bp reads that were generated in the sequencing runs, the last 4 bp were trimmed due to the elevated rate of low quality sequence known to be present at the 3’ end of the reads (Ossowski et al., 2008; Hillier et al., 2008). The soybean run collectively yielded over 75 million reads (Table 1), while the rice run generated roughly 12 million reads (Table 2). The disparity between the average numbers of reads per run was due in part to an upgrade in optics and chemistry that was installed on the Illumina Genome Analyzer between the rice and soybean runs and showed the dramatic improvement of read density per run for soybean (Tables 1 and 2). Reads then were filtered based on individual base quality score, as reads containing at least one base with a score less than 15 were removed from further analysis. A quality score threshold of 15 ensures that not more than one error/SNP on average was present within a read (a PHRED score of 15 corresponds to one error in ∼32 bps). After filtering for individual base pair quality, the individual full length 32 bp reads were compared and those sharing identical sequences for the 32 bps were condensed and collapsed into a single representative unitag and then subsequently into a HQ unitag. The first set of unitags does not incorporate copy number of the tag whereas the set comprising the HQ unitags has a copy number requirement of greater than or equal to two. Therefore, all HQ unitags are a single 32 bps sequence which is a single copy representation generated from all redundant (i.e., identical) sequences having a copy number of greater than or equal to two. By subtracting the HQ unitag total from the unitag total, the number of singletons can be calculated. For the Williams82 run where greater than 24 million reads passed filtration based on the quality score of 15 or greater, only 709,692 reads (2.9%) were singletons and Pintado had a comparably low number (2.8%) of singletons. For rice, the percentage of singletons was modestly higher with Taichung65 and Kasalath having 4.9 and 6.3%, respectively. Thus the RRL strategy of focusing resequencing to discrete regions of the genome and providing sufficient coverage (where coverage is defined as two or more identical 32 bps reads for all HQ unitags), which is necessary for accurate SNP detection, has been demonstrated.
For all reads passing quality filtration, the majority in both rice and soybean aligned to the reference assembly one or more times with zero mismatches (or perfect alignment). When restricting the perfectly aligning reads to a single (i.e., unique) place in the genome, this accounted for approximately 73% of the total unitags for the two soybean cultivars and 85.3 and 81.9% for Taichung65 and Kasalath, respectively. Thus the RRL strategy of focusing on sequencing the hypomethylated fraction is successfully identifying a large amount of unique sequences (when compared to the reference assembly). For those tags that did not align perfectly to the reference assembly, the majority had only a single mismatch for both rice and soybean and this is the subset of tags from which the SNPs were identified.
Read Count Distributions of the High Quality Unitags in Soybean and Rice
To assess the rate of redundant sequencing in the RRLs, histograms for read count versus occurrence were generated for soybean (Fig. 2A) and rice (Fig. 2B) using the redundant reads in each library. The histograms for both rice and soybean showed a similar pattern; reads with two copies within a sequencing run were the most abundant and the frequency of reads with increasing copy number steadily declined in occurrence. This indicates that for both species, the occurrence of reads having relatively low counts (i.e., between two and 50 copies along the x axis) were quite frequent (i.e., the y axis). These histograms also reflected that a small minority of reads were heavily over-represented in both libraries (the maximum copy number for soybean and rice was > 1 × 106); the excessive copy number of these infrequently occurring reads relative to the predominance of reads that have a much smaller copy number (i.e., <100) suggests that they are derived from very low level contamination of the DNA preparation by repeated elements (presumably either hypomethylated transposable elements or plastid contamination). Overall, these data show clearly that the HQ unitags are being derived largely from single or low copy regions based on the frequencies observed in the histograms and these histogram results support the run metrics in Tables 1 and 2.
Single Nucleotide Polymorphism Identification in Soybean and Rice
The SNP detection pipeline as described in “Materials and Methods” is summarized in Fig. 3. It filters both quality information per base and counts per unique read (unitag), compares the sequence of the HQ unitags from the two genotypes, detects single nucleotide differences between the HQ unitags, and then compares the putative SNP-containing HQ unitags to the reference genome. The goal is to find SNP-containing HQ unitags that match one to one across the two datasets and to map those reads to a single location in the reference genome (i.e., avoids nearly identical paralogs). The rationale for mapping to a single genome position is that reads matching in multiple locations in the genome would present difficulties in generating robust SNP markers that would have a unique genetic position.
We identified a total of 1682 SNPs by comparing 255,918 HQ unitags for Williams82 with 246,102 HQ unitags for Pintado (Supplementary Table 1) and 2618 SNPs by comparing 64,923 HQ unitags for Taichung65 with 71,848 HQ unitags for Kasalath (Supplementary Table 2), all with a consensus quality score at the nucleotide position containing the SNP greater than or equal to 25. The SNP detection pipeline also identified four additional SNPs between Williams82 and Pintado, and 21 additional SNPs between Taichung65 and Kasalath, that were removed from subsequent analysis because of consensus quality scores at the SNP position in one or both lines lower than 25. The consensus score of 25 was chosen to ensure high quality of the basecall at the SNP position. Putative basecalling errors at any additional base position of the HQ unitags also would lead to the removal of the HQ unitags by the SNP detection pipeline as they would exhibit more than one mismatch when comparing the two HQ unitags from both genotypes with Vmatch.
The RRL strategy that we have employed has focused the sequencing effort onto a relatively small fraction of the genome. For soybean, the number of 32 bp HQ unitags for Williams82 totaled 255,918 and for Pintado totaled 246,102 (Table 1) which generated 1682 SNPs or a rate of approximately 1 SNP per 150 HQ unitags that were evaluated. For rice, the number of HQ unitags was considerably less than those captured for soybean; a total of 64,923 for Taichung65 and 71,848 for Kasalath generated 2618 SNPs or an approximate rate of one SNP per 25 HQ unitags that were evaluated. These approximate rates of SNP discovery between two genotypes are dependent on two primary factors: (i) the relatedness of the genotypes and (ii) the degree of coverage of sequencing. Thus they are not necessarily comparable to published rates of SNP discovery in either soybean or rice which use multiple genotype comparisons. However, this RRL sequencing strategy can be easily adapted to a multiple genotype comparison with success for both of these species and compared with established rates of SNP discovery (M. Campbell and S. Deschamps, unpublished data, 2009).
For a subset of the SNPs identified by the SNP detection pipeline, we employed Sanger sequencing to confirm that (i) the SNP exists between the two genotypes and (ii) the presence of the SNP was within a 32 bp region corresponding to the HQ unitags of interest. Unique primer sets were generated from a flanking sequence derived from the reference genomic assembly for each species (see Materials and Methods). A total of 192 loci were targeted for resequencing on the ABI3730xl platform. In soybean, out of 168 loci that were successfully amplified and sequenced, 163 (97.0%) contained SNPs identified via the Sanger biochemistry that matched the base calls generated with the Illumina sequencing. All 163 SNPs were contained within 32 bp sequences generated with the Sanger biochemistry that matched the sequence of the original HQ unitags. In rice, 162 loci (96.4%), out of 168 loci that were successfully amplified and sequenced contained SNPs that confirmed the original SNP call made with the Illumina platform. In both species, the cause for the remaining loci that failed Sanger validation was primarily reflected as monomorphism between the two genotypes in the Sanger assemblies, reflecting a false SNP call, as opposed to tri-allelism (data not shown).
Individual read counts for HQ unitags contained within loci that were successfully sequenced with the Sanger biochemistry were plotted and compared between genotypes (Supplementary Fig. 1). SNPs confirmed via Sanger validation were present in HQ unitags made of various numbers of identical reads (from 2 to >1000 in soybean and from 2 to >500 in rice). The small number of loci that failed Sanger validation prohibits the possibility of distinguishing specific patterns relating specific HQ unitag read counts to failed Sanger validation, although a slight bias could be observed in soybean, where four false positives (i.e., SNPs not validated via Sanger sequencing), out of the five originally detected, were present in HQ unitags comprised of two identical reads in at least one genotype.
Mapping of the High Quality Unitags to the Genome Assemblies and Annotations
The RRL strategy employed here for both soybean and rice targets hypomethylated regions of the genome for sequencing. Therefore, it is expected that the distribution of the HQ unitags should be primarily correlated with sequences within and associated with genic regions in both soybean and rice. To illustrate that correlation, graphical representations were generated in soybean (Fig. 4) and rice (Fig. 5), in which the physical positions of HQ unitags matching perfectly and uniquely the reference assembly (the Glyma1.0 reference assembly for soybean and the TIGR assembly release 5 for rice) were shown, along with the physical positions of the SNPs between the two genotypes, in relation to the physical position of annotated genes from each whole-genome assembly (note that all genes with significant homology to transposable elements were excluded from this analysis for both soybean and rice). The counts of HQ unitags, the counts of SNPs, and the counts of annotated genes were all represented for each soybean contigs (Fig. 4) or rice chromosomes (Fig. 5) as heat maps consisting of four separate colored bars with shades of colors corresponding to regions of low to high counts within distinct sliding windows (see figure legends for details). For soybean, a total of 152,185 Williams82 uniquely mapped HQ unitags, 144,559 Pintado uniquely mapped HQ unitags (see Table 1), and the 1682 SNPs found between the two genotypes were aligned to the whole genome Glyma1.0 reference assembly (Fig. 4) and their positions in each contigs of the soybean reference assembly were compared to the positions of the 65,748 predicted protein-coding loci (excluding alternative transcripts). In rice, the 42,884 Taichung65 uniquely mapped HQ unitags, the 36,666 uniquely mapped Kasalath HQ unitags (Table 2), and the 2618 SNPs found between the two genotypes were represented in Fig. 5 and their positions in each chromosome of the rice reference assembly were compared with the positions of the 41,752 predicted protein-coding loci (excluding alternative transcripts). A rapid visual inspection of Fig. 4 shows that, in soybean, HQ unitags and SNPs primarily map in positions of the genome corresponding to regions of high gene density. In rice (Fig. 5), HQ unitags and SNPs exhibit a genome-wide distribution that reflects the distribution of annotated genes in the rice genome, and can be explained by the relatively small size of the genome and its relatively low repeat content relative to the more complex genome of soybean which possesses a much higher repeat content in its genome, genic regions distributed toward the ends of the chromosomes and genic sequences under-represented around the centromere.
To confirm that HQ unitags map preferentially to regions in and about annotated genes (excluding transposable elements), the coordinates (in bps) of SNPs and of HQ unitags matching perfectly and uniquely to their respective reference assemblies in soybean and rice were compared to the coordinates (in bps) of annotated genes, as determined by their physical positions from the Glyma1.0 reference assembly for soybean and from the TIGR assembly release 5 for rice. Briefly, a custom Python script was designed to infer the exon and intron coordinates for the annotated genes from the GFF annotation files distributed with the genome assemblies and containing CDS and UTR coordinates. All genes annotated as transposable elements were removed from the rice GFF files and no genes with significant homology to transposable elements were present in the soybean GFF files. Another custom Python script was designed to compare the locations of HQ unitags and SNPs to the newly calculated intron/exon coordinates and to separate HQ unitags and SNPs into four distinct categories by determining their physical positions relative to the physical positions of annotated genes in soybean (Fig. 6A) and rice (Fig. 6B). To resolve ambiguities at any given location due to the existence of alternative transcripts, for instance, when a SNP position falls within the range of multiple features being annotated to different types (CDS for one transcript, intron for an alternative transcript), a priority table was designed that gave preference to CDS over intron and UTRs, and to those over the intergenic region. Figure 6A confirms that in soybean, the majority of HQ unitags (81.7% for Pintado and 79.4% for Williams82) and SNPs (78.6%) can be mapped in “genic” regions (i.e., within intron, within CDS or within 5 Kbps upstream or downstream from an annotated coding region of a gene and including UTRs). UTRs were separated from CDS as many annotated genes from the Glyma1.0 assembly do not have well defined UTRs. In rice (Fig. 6B), a similar distribution is observed; the majority of HQ unitags (90.6% for Taichung65 and 91.6% for Kasalath) and SNPs (90.0%) map in genic regions. The lower percentage of SNPs present in CDS for both species (16.2% in soybean and 18.8% in rice) when compared to the percentage of SNPs found in introns or within 5 Kbps of an annotated gene is presumably related to negative selection pressure within the coding regions of genes when compared with the potentially more neutral selection in introns and flanking regions of a genomic sequence. The lower percentage of HQ unitags and SNPs in intergenic regions in rice relative to soybean (i.e., more than 5 Kbps from an annotated coding region of a gene) may be correlated to the overall methylation state of the genome, or may indicate that the annotation project of the reference assembly of rice is quite mature (Yuan et al., 2005; Ouyang et al., 2007). By contrast, the annotation of the soybean genome is an ongoing effort. Thus, the gene space of rice may be relatively better characterized relative to soybean leading to higher rates of SNPs located within 5 Kbp of an annotated gene. For both rice and soybean, the fact that a large majority of HQ unitags are present in genic regions suggests that the utilization of a methylation-sensitive restriction enzyme is an effective tool to select for gene-containing regions of the genome.
Statistical Analysis of the Distribution of High Quality Unitags in Soybean and Rice
To determine whether HQ unitags in soybean and rice were randomly distributed in representative (i.e., gene-rich) regions of the genome, the statistical properties of the intervals between adjacent HQ unitags within a chromosome were determined. Only HQ unitags matching their respective reference assemblies perfectly and uniquely were used for this analysis. Chromosome 1 was chosen for the soybean and rice analysis after visually comparing the distribution of HQ unitags in chromosome 1 to those in other chromosomes. A simple random spatial effects model would predict that the HQ unitag spacing within a representative region of the genome would follow a Poisson distribution (Feller, 1950) and the exponential distribution of the HQ unitags in Fig. 7 largely supports this view and confirms the random distribution of HQ unitags within regions of the genomes (note that the distribution functions are plotted on log-log scales since this provides the best means to graphically assess decay curves). The dashed curves for exponential decay provide a good fit for the first part of the decay, with correlation coefficients of approximately 0.993 and 0.98 for rice and soybean, respectively. The slower decay in the tail of the distribution fits a stretched exponential distribution (Laherrere and Sornette, 1998), with correlation coefficients for the stretched exponential fits of 0.9997 and 0.998 for rice and soybean, respectively. This is a well known statistical effect in systems where stochastic processes occur on a wide range of scales, for which the stretched exponential distribution usually provides an adequate characterization. Interestingly, a similar distribution is observed when analyzing the distribution of SNPs observed between the two lines in soybean and in rice (data not shown).
In comparing HQ unitag interval data for each pair of lines, the expected inverse dependence of the decay length on the HQ unitag density is observed (data not shown). However, the average HQ unitag density adjusted decay length for soybean is significantly less than that for rice, with a Kolmogorov-Smirnov p value less than 1e−12. This different decay length may be a direct reflection of the gene density and repeat content of both the soybean and rice genomes, and might indicate that a majority of soybean HQ unitags are distributed in discrete areas of the genome (see Fig. 4).
Allele Distribution of Single Nucleotide Polymorphisms in Public Lines
To validate the RRL strategy in soybean Williams82 and Pintado were chosen for resequencing; Williams82 is the same genotype used to generate the reference genome assembly and Pintado is a public line that has been adapted to lower latitudes, displays phenotypic and morphological variation relative to Williams82, and these two genotypes are not closely related based on the known pedigree structure. For any SNP found between Pintado and Williams82, the Pintado allele would presumably be expected to differ with the reference assembly allele and the Williams82 allele should match the reference assembly allele. Supplementary Fig. 2A displays all 1682 SNPs on the physical assembly (visualized as horizontal bars) where a red bar indicates that the Williams82 allele matches the reference assembly allele and a blue bar indicates that Pintado matches the reference genome assembly. The overwhelming majority of SNPs identified (98.6%) were monomorphic between Williams82 and the Williams82 reference genome assembly. A low rate of discrepancies between the two accessions of a shared genotype was observed recently in Arabidopsis and was attributable to either error in the reference assembly or to slight variations between the accessions (Ossowski et al., 2008). Given that the Glyma1.0 assembly was derived from a whole genome shotgun and has been recently assembled and considering the results of the Col-0 Arabidopsis resequencing, this low rate of discrepancy is not surprising. To ascertain whether these SNPs might be informative in the broader context, we resequenced seven loci across 25 public soybean lines and identified the base at the SNP position which is displayed in Table 3. In the United States, soybean varieties are classified into one of 13 maturity groups (MG) where flowering time and seed maturity are driven by the effects of day length and temperature. The MG 000 is the earliest to mature (these lines are adapted to the higher latitudes of northern Minnesota and southern Canada) while the MG X (ten) has been adapted to the more equatorial latitudes in South America. Williams82 is classified as a mid-maturity group (with a MG score of III) whereas Pintado has a MG score of VIII. The public lines that were chosen for resequencing were selected to span a wide range of MGs. Table 3 shows that for the seven randomly chosen loci, the frequency of the alleles varies considerably across the 25 lines. In a generalized view, alleles from mid-maturity Williams82 are present at a higher frequency in the early and mid-maturity groups and conversely the alleles from Pintado are present at a higher frequency in the later maturity groups.
For SNP discovery in rice, Taichung65 belongs within the japonica varietal group, and more specifically is classified as a temperate japonica (Dalrymple, 1986). Kasalath is commonly grouped within indica varietal group but is more specifically arranged within the aus group which diverged from the indica subspecies shortly after domestication (Garris et al., 2005; Caicedo et al., 2007). Rice genotypes that are in the separate varietal groups are historically known to have higher rates of polymorphisms than those that are within a varietal group. (Caicedo et al., 2007). The 2618 rice SNPs found were placed onto the Nipponbare reference assembly (Supplementary Fig. 2B). For Taichung65 alleles that match the reference genome allele, the tags are displayed in red and for those where the Kasalath allele matches the reference allele the tags are blue Table 4. Given that Nipponbare is classified as a temperate japonica and within the same varietal group as Taichung65, it is not surprising that 97.6% of the alleles are shared between these two genotypes when compared with Kasalath which is classified in a separate varietal group. Ten randomly chosen loci were resequenced across 34 public genotypes of rice that represent a diverse range of previously analyzed temperate japonica, tropical japonica, and indica lines (Jiang et al., 2003; Caicedo et al., 2007). Similar to soybean, the SNP frequencies range broadly across the 10 loci (Table 4). When analyzing this small sampling, the Taichung65 alleles do generally have a higher frequency in the temperate japonica class and the Kasalath alleles have a much higher frequency in the indica varietal group which is congruent with published reports.
This report has shown that single-end sequencing of reduced-representation libraries on the Illumina Genome Analyzer platform can efficiently identify large numbers of SNPs in rice and soybean that are largely localized within and adjacent to annotated genes while sampling a small fraction of the total genomic space. The pipeline developed to identify these SNPs involves extensive filtering to remove reads with low quality base calls, elimination of reads that occur only once (i.e., singletons), and requires that the remaining reads that contain SNPs match the genome (allowing only the SNP base to vary) uniquely. By imposing a coverage requirement of at least two copies per HQ unitag, this eliminates large numbers of SNPs that originate from singletons and have comparably high rates of being false (M. Campbell and S. Deschamps, unpublished results, 2009). These filtering steps for quality and coverage yielded SNP validation rates for soybean and for rice, when confirmed with Sanger resequencing, at 97 and 96.4%, respectively. In both species, the vast majority of those SNPs (78.6% for soybean; 90.0% for rice) map within regions of high gene density, and primarily within annotated genes or adjacent (i.e., less than 5 Kbps upstream or downstream) to annotated genes. Such results confirm the effectiveness of using methylation-sensitive restriction endonucleases in generating RRLs which target sequencing to genic and also unique genic regions of the genome. In the case of a highly complex genome of soybean, our results suggest that by using a combination of a methylation sensitive enzyme along with DpnII to generate RRLs, a thorough sampling of the genome for SNP discovery could be performed using a modest sequencing capacity. For the moderately complex genome of rice, a similar RRL strategy can be employed that uses a methylation sensitive enzyme with a degenerate recognition sequence along with DpnII to obtain a thorough sampling of the genome for SNP discovery out of a single RRL library. This strategy of sequencing all of the various combinations for an AlwNI digested library would require much higher sequencing capacity to get sufficient coverage, an absolute requirement for SNP discovery, but would offer the advantage of only needing to construct a single library. Additionally, the AlwNI strategy could be applied to more complex genomes to generate a saturated set of SNPs within the genic fraction of the genome but would require additional sequencing capacity.
With the advent of ultra-high-throughput sequencing, SNP discovery in species having extensive reference sequence resources is both rapid and economically feasible. There are diverse applications of this RRL strategy when used on two genotypes. One powerful application is to obtain a whole genome snapshot of variation to identify regions that are polymorphic as well as identifying regions that are monomorphic (i.e., derived from a common lineage). This is useful for quickly generating genome-wide and genotype specific SNP markers for the two genotypes in polymorphic regions while simultaneously identifying and obviating the need for SNP discovery in monomorphic regions. This whole genome fingerprint is particularly useful in populations derived from a bi-parental cross to engage in de novo mapping of qualitative and quantitative traits where marker arrangement and spacing is critical. A similar application would be to identify introgressed segments in near isogenic lines, to identify the origin of chromosomal segments in recombinant inbred lines, or to identify residual chromosomal segments from the donor line in backcrossing scenarios. The RRL strategy for SNP discovery can easily be extended beyond the consideration of two genotypes. By using the same restriction enzyme(s) choices in RRL construction across genotypes in a species, those reads are most efficiently combined into an all vs. all comparison to maximize the efficiency of SNP discovery for subsequent marker development in a broader context of genotypes within a species.