Measures of genetic polymorphism, genetic distance, and population differentiation are important to the management of germplasm resources, crop improvement programs, and the success of association mapping. Tomato has provided a strong model for identifying genes that distinguish domestic and wild plants but has been explored less from the perspective of post-domestication selection. Mapping in wide crosses and the characterization of genes that affect specific traits have produced substantial understanding into the mechanisms of disease resistance (e.g., Jones et al., 1994; Martin et al., 1993), plant and fruit development (e.g., Frary et al., 2000; Pnueli et al., 1998; Xiao et al., 2009), and the regulation of biochemical pathways (e.g., Liu et al., 2003; Ronen et al., 2000). Our understanding of how selection influences cultivated populations has been limited by genomic resources that emphasize only a small number of accessions and are focused on biparental populations constructed from wide crosses. The shortage of genetic tools for investigation of diversity within cultivated lineages therefore limits our ability to ask important questions regarding human selection and restricts available tools for crop improvement.
The genomic resources available for tomato include 301,822 expressed sequence tags (ESTs) for S. lycopersicum (NCBI, 2011a) and draft genome sequences for S. lycopersicum (Heinz 1706) (SGN, 2011b) and Solanum pimpinellifolium L. (LA1589) (SGN, 2011a). The genomic resources have not sampled germplasm well enough to offer effective insight into the rich diversity conserved in germplasm collections or breeding programs. For example, 78.6% of the EST resources are derived from just two accessions, TA496 (116,711 ESTs) (NCBI, 2011a) and Micro-Tom (120,392 ESTs) (NCBI, 2011a). The accession TA496 is a processing tomato, E6203, with the addition of the Tm-2a introgression (Tanksley et al., 1998) while Micro-Tom is a novelty dwarf variety (Scott and Harbaugh, 1989). These genomic resources do not capture the breadth of variation within S. lycopersicum. The availability of markers for genetic analysis within cultivated tomato has been limiting as many markers selected based on polymorphism in wide crosses are not polymorphic within relevant germplasm (Jimenez-Gomez and Maloof, 2009).
Single nucleotide polymorphisms (SNPs) are the most common type of sequence variation in plant species (Ching et al., 2002). In tomato, strategies to develop resources for cultivated germplasm include in silico analysis of EST databases (Labate and Baldo, 2005; Yang et al., 2004) and simple sequence repeats (Frary et al., 2005), oligonucleotide array hybridization (Sim et al., 2009), and sequencing introns of conserved orthologous set genes (Van Deynze et al., 2007). These strategies demonstrated the feasibility of discovering sequence variation within genetically restricted germplasm pools, including cultivated populations. Single nucleotide polymorphism discovery through sequencing therefore appears quite promising as a means to uncover variation in agriculturally relevant populations (Robbins et al., 2011; Shirasawa et al., 2010).
While Sanger sequencing (Sanger and Coulson, 1975) and SNP discovery has provided a useful starting point for detecting polymorphic loci in tomato, the number of polymorphisms that can be defined is restricted by the genotypes sequenced and the depth of sequencing performed. Alternative methods for sequencing and genotyping that rely on highly parallel reactions and detection systems have made it possible to cost-effectively sequence and genotype large numbers of individuals (Hamilton et al., 2011). The high-throughput sequencing systems, also known as “next generation sequencing,” are characterized by higher error rates, higher levels of redundancy, and much higher throughput in most platforms. For example, emulsion-based pyrosequencing as implemented by 454 (Roche Inc.) can produce up to 1 million reads of 600 to 1,000 bp in a single run (Margulies et al., 2005) whereas emulsion polymerase chain reaction and sequencing by ligation technology as implemented by Applied Biosystems, Inc. (Life Technologies Inc.), produces 100 million reads of 50 bp (Valouev et al., 2008) and sequencing by synthesis as implemented by Illumina Hi-Seq (Illumina Inc.) produces 320 to 640 million reads of 150 bp (Bentley et al., 2008).
To increase the number of SNPs available for basic and applied tomato genetics, we sequenced the transcriptomes from six tomato accessions representing fresh market, processing, cherry, and S. pimpinellifolium, a close progenitor of cultivated tomatoes. We generated >2.7 Gb for each accession, representing an average of 32.5 Mb of unique transciptomic sequence per line. Using the transcriptome data coupled with the draft tomato genome sequence, we were able to identify a large collection of putative SNPs for use in high-throughput genotyping. Our polymorphism discovery was confirmed using high-throughput genotyping assays, with validation rates for SNP calls based on Genome Analyzer II (GA2) (Illumina Inc.) data greater than 95.8% and as high as 100%. These results demonstrate the potential of using high-throughput sequencing technologies to identify differences between cultivated plants and study the distribution of sequence variation within genes and crop lineages.
Materials and Methods
Tomato accessions used for transcriptome sequencing were assembled from three public breeding programs across the United States and the USDA National Plant Germplasm System (NPGS) (Table 1). Sequencing efforts were designed to expand available genomic resources to fresh-market breeding efforts while also providing additional resources for processing tomatoes as well as comparisons to more distant species. Three of the accessions (NC84173, FL7600, and OH08-6405) represent germplasm relevant to fresh-market breeding efforts, including a parent of commercial hybrids (Gardner, 1992). Previous public genomic efforts have ignored the high-value fresh-market germplasm. The line OH9242 (Francis, 2002) was also included as a commercially relevant parent to maximize genetic variation within the processing germplasm as a complement to the EST resources for TA496 (Tanksley et al., 1998) and the genome sequence for Heinz 1706 (Ozminkowski, 2004). TA496 represents the California processing tomato germplasm pool while Heinz 1706, developed in Bowling Green, OH, represents one of two subpopulations of midwestern U.S. processing material (Sim et al., 2011). The cherry accession PI 114490 and the S. pimpinellifolium accession PI 128216 were chosen as close relatives for the cultivated tomatoes and because both accessions have contributed to contemporary breeding populations for fresh and processing market classes.
|Accession||Species||Market class||Sequence data||Comments|
|FL7600||Solanum lycopersicum||Fresh market||GA2 ESTs||Sequenced in this study|
|NC84173||S. lycopersicum||Fresh market||GA2 ESTs||Sequenced in this study|
|OH08-6405||S. lycopersicum||Fresh market||GA2 ESTs||Sequenced in this study|
|OH9242||S. lycopersicum||Processing||GA2 ESTs||Sequenced in this study|
|PI 114490||S. lycopersicum||Cherry||GA2 ESTs||Sequenced in this study|
|PI 128216||S. pimpinellifolium||Wild||GA2 ESTs||Sequenced in this study|
|Micro-Tom||S. lycopersicum||Novelty||Sanger ESTs||Novelty dwarf variety|
|TA496||S. lycopersicum||Processing||Sanger ESTs||E6203 genetic background|
|Heinz 1706||S. lycopersicum||Processing||Genome sequence||Used in Tomato Genome Initiative|
For validation of SNPs called from the transcriptome sequences, we used a core collection of 88 tomato accessions including the six accessions selected for sequencing. These accessions are described in Supplemental Table S1. Briefly, the collection included nine representatives of wild species, three Latin American cultivars (LA1216, LA2256, and LA2281), two unimproved breeding lines (Ha7981 and Ha7998), 18 vintage cultivars, 21 fresh-market varieties, and 35 processing varieties. These accessions were assembled from 10 breeding programs in North America (the United States and Canada), the NPGS, and the C.M. Rick Tomato Genetics Resource Center. The collection contained parents of populations utilized by the tomato research community such as segmental substitution lines (M82 and LA0716; Eshed and Zamir, 1995), parents of several important recombinant inbred and inbred backcross populations (Doganlar et al., 2002; Graham et al., 2004; Kabelka et al., 2002; Robbins et al., 2009; Yang et al., 2005), and a mutation library (Menda et al., 2004). During SNP validation eight accessions were duplicated for quality control purposes.
Sequencing, de novo Assembly, and Annotation of the Tomato Transcriptomes
Ribonucleic acid (RNA) was isolated from roots, callus, young leaves, flowers, and three stages of fruit development using the cetyltrimethylammonium bromide method (Chang et al., 1993). Callus was produced on Murashige and Skoog medium supplemented with Gamborg's B-5 (Gamborg et al., 1968; Murashige and Skoog, 1962). Fruit stages corresponded to green (including immature to mature green), breaker to turning (10 to 30% red), and ripe fruit (more than 90% of the surface was red) based on the USDA color classification. The isolated RNA was pooled in equimolar concentrations to synthesize normalized complementary DNA prepared as described previously (Hamilton et al., 2011). Libraries were sequenced on an Illumina GA2 with one paired end lane and two single end lanes of 47 to 84 bp for each accession (Supplemental Table S2). Sequence reads are available in the National Center for Biotechnology Information Sequence Read Archive (accession number SRP007969) (NCBI, 2011b).
The GA2 reads for each of the six tomato accessions were assembled separately into contigs using the Velvet assembler (Zerbino and Birney, 2008) using a k-mer length of 31 and a minimum contig length of 150 bp. The insert length of FL7600 fragments was 300 bp whereas all the other accessions had fragment lengths of 350 bp (Supplemental Table S2). The estimated sequence coverage for the transcriptomes are FL7600 (35.2x), NC84173 (34.6x), OH9242 (33.7x), PI 114490 (33.7x), PI 128216 (37.3x), and OH08-6405 (38.0x). For all downstream analyses with GA2-generated sequences, only contigs (and not singleton reads) were included due to the short nature of GA2-generated reads. Sanger-derived ESTs from TA496 and Micro-Tom were assembled as described previously (Hamilton et al., 2011) and for Sanger-derived sequences, contigs (≥150 bp) were included in downstream analyses. Gene Ontology Slim (GOSlim) associations were assigned to the contigs (GA2- and Sanger-derived) using a BLASTX search (cutoff E = 1 × 10−5) against the Arabidopsis thaliana (L.) Heynh. proteome (The Arabidopsis Information Resource [TAIR] 10); TAIR, 2011) and transferring the TAIR GOSlim associations from the top hit. To assess representation of the complete tomato transcriptome, contigs (GA2- and Sanger-derived) were aligned with the Heinz 1706 Tomato Genome Initiative (TGI) gene model set (version iTAG2; ITAG, 2011) using Exonerate (Slater and Birney, 2005).
Single Nucleotide Polymorphism Discovery
Genome Analyzer II-Derived Sequences
The reads from each accession were mapped separately to the TGI v1.03 tomato scaffolds (SGN, 2011b) using Bowtie v0.12.3 (Langmead et al., 2009). Only read alignments where the read mapped uniquely on the genome were retained. Paired end reads were mapped separately from the single end reads and the paired and single end SAM alignment files were sorted and merged before further analysis. The merged SAM file was converted to BAM using SAMTools v 0.1.7 (Li et al., 2009) and SNPs were called using the SAMTools pileup tool.
The raw SNP calls were filtered using the samtools.pl varFilter script with a minimum depth of 20 reads, a maximum read depth of 240, and a maximum of one other SNP in a 100 bp flanking window, and the SNP had to be 50 bp from areas identified as indels by the pipeline. (samtools.pl varFilter-d 20-D 240-W 100-N 2-w50). The SNPs were further filtered using a custom Perl script that filtered out SNPs with a minimum consensus quality score of 20, a minimum SNP quality score 20, and a minimum mapping score 60 and SNPs where the reads aligning over the SNP position were composed of >10% of 3′ end base. The final filtered SNPs calls for each variety were then coalesced using a Perl script into a unified file of high confidence SNPs on the TGI scaffolds. The SNP positions and associated metadata were loaded into a PostgreSQL database (PostgreSQL Global Development Group, 2011) using a custom schema.
Single nucleotide polymorphisms were called using the Sanger-derived TA496 ESTs in two phases based on availability of tomato genome sequence. Single nucleotide polymorphisms for the BeadXpress design (Illumina Inc.) were called using alignments of accession-specific assemblies to 1207 Heinz 1706 BAC sequences made available by the TGI (SolCAP, 2011). Upon availability of the draft tomato genome scaffold sequences by the TGI v1.03 (SGN, 2011b), a second phase of SNP calling was performed using Sanger-derived ESTs derived from both TA496 and Micro-Tom to augment SNP identification on a genome level using a total of eight transcriptome datasets. Sanger-derived SNPs used in the identification of high confidence SNPs required a minimum of two ESTs with the SNP (relative to Heinz 1706 genome sequence) with a minimum depth of four ESTs.
Single Nucleotide Polymorphism Validation
Single nucleotide polymorphisms were selected for validation using the BeadXpress platform in this study and the Illumina Infinium platform (Illumina Inc.) for future studies. To meet the Illumina design requirements, the contigs were aligned to the genome scaffolds using GMAP (Wu and Watanabe, 2005) to annotate the boundaries separating introns and exons. Single nucleotide polymorphisms located in exons that aligned at >95% identity without gaps were retained whereas SNPs within 50 bp of a boundary separating exons and introns and SNPs that were not biallelic across the accessions were removed (Fig. 1A). The remaining SNPs were scored for suitability for the BeadXpress and Infinium platforms by Illumina, and SNPs with a BeadXpress score <0.75, an Infinium score <0.90, or a fail code were removed. The SNPs that passed the filtering and design requirements described above were classified into six subgroups (subgroups 1–6) based on the presence of the SNP in combinations of market class, cherry, and the S. pimpinellifolium accession. Subgroup 1 SNPs have a SNP in the wild tomato and no SNPs in the cultivated or cherry accessions. Subgroup 2 SNPs have a SNP in the wild tomato and one cultivated variety. Subgroup 3 SNPs have a SNP in the wild tomato and a SNP in two, three, or four of the cultivated or cherry accessions. Subgroup 4 SNPs have a SNP in two, three, or four of the cultivated or cherry accessions and no SNP in the wild tomato. Subgroup 5 SNPs have one SNP in a cultivated or cherry variety. Subgroup 6 SNPs have a SNP in wild tomato and the cherry variety and no SNPs in the cultivated accessions. The subgroup membership of the SNPs was added to the SNP metadata. The GA2-derived SNPs were also classified based on sequence representation, that is, coverage within the six surveyed transcriptomes: class A SNPs contained data from all six varieties or accessions, class B from five, and class C from four. For the validation work in this study, 96 GA2-derived SNPs were selected for the BeadXpress assay and represent 60 class A SNPs, 24 class B SNPs, and 12 class C SNPs. To provide an assessment of SNP distribution among and between market classes, SNPs from the subgroups 1 through 6 were included: 12 SNPs for each of the subgroups 1 to 4 and 24 SNPs for each of subgroups 5 and 6.
For the Sanger-derived SNPs, 93 SNPs with a range of EST coverage were selected for a separate BeadXpress assay: 49 SNPs based on two ESTs, 18 SNPs based on three ESTs, six SNPs based on four ESTs, and 20 SNPs based on five or more ESTs. In addition, we included three SNPs for which calls were inconsistent among the EST sequences. For the BeadXpress assay, a total of 50 ng of genomic DNA per accession was used following the manufacturer's protocol. The BeadXpress raw data were processed using Illumina GenomeStudio software (genotyping module v1.7.4; Illumina Inc., 2010) for SNP calling.
To assess the utility of the SNPs for characterizing germplasm, SNP genotyping was performed using 88 tomato varieties with eight duplicated for quality control. The BeadXpress allele calls (Supplemental Table S1) were converted to numerical calls compatible with the Microsatellite Analyzer software v4.05 (Dieringer and Schlotterer, 2003) and to a proportional scoring (in which 2 is equal to homozygous for the common allele, 1 is equal to heterozygotes, and 0 is equal to homozygous for the rare allele) for principal component analysis (PCA) using R (R Development Core Team, 2011). Accessions were classified into subpopulations as described above. Analysis was conducted to determine the proportion of markers that were polymorphic within each subpopulation, standard allelic richness (El Mousadik and Petit, 1996), and Nei's genetic distance(Nei, 1978). The p-values for the pairwise distances were calculated based on 10,000 permutations and a Bonferroni correction was applied. Principal component analysis was conducted to visualize relationships and to identify SNPs that contribute to variance among the germplasm. Analysis of variance was conducted to determine whether there were significant differences between germplasm classes and principal components (PCs). Marker loadings were sorted to identify SNPs that contributed high positive and negative loadings to the eigenvector corresponding to PC1, PC2, and PC3.
Analysis of Loci under Selection
We investigated loci under positive selection using an outlier detection method as implemented in the LOSITAN workbench (Antao et al., 2008). The outlier detection method uses the available data to derive a distribution of genetic differentiation based on the proportion of total genetic variance in a subpopulation relative to the total variance (Fst, calculated according to Weir and Cockerham, 1984) and expected heterozygosity. Five simulations for each of three pairwise comparisons between three market classes (fresh market, processing, and vintage) were run with 10,000 iterations, a 95% confidence interval, and options for neutral and forced mean Fst. For the mutation model option, we used an infinite allele model. Loci that deviate from the expected distribution of neutral markers are identified based on excessively high or low Fst. Outliers suggest directional selection when Fst is higher than expected or balancing selection when Fst is lower than expected.
RESULTS AND DISCUSSION
Tomato Transcriptome Sequencing and Assembly
By sequencing normalized libraries from six accessions, we were able to generate 19.6 Gb (raw) and 17 Gb (quality filtered) of sequence using the Illumina GA2 platform (Table 2). For each of the six accessions, a similar number of reads were generated (51.9–59.4 million raw reads and 45.7–53.2 million quality filtered reads) suggesting that library construction and sequencing reactions were comparable among the RNA samples. De novo assembly of the GA2-derived reads resulted in 59,051 to 66,181 contigs representing 30.6 to 34.9 Mb across the six accessions further suggesting consistency of our sampling, sequencing, and assembly methods. We further examined the representation of our GA2-derived sequences of the complete tomato transcriptome by aligning these against the predicted gene set in the tomato genome. As shown in Table 3, not only is there substantial coverage of the predicted gene set (55.3–59.6%) but there also is substantial overlap between all six accessions based on alignment to the tomato gene model set as there were a limited number (185–497) of unique genes detected in each of the six accessions. Analysis of molecular function Gene Ontology (GO) associations demonstrated that the assembled transcriptomes for each of the six accessions provide broad representation of the tomato transcriptome (294,176 total associations among the six accessions) and that the representation of genes encoding various molecular functions was similar among all six accessions (Fig. 2; Supplemental Table S3).
||Genome Analyzer II
|TA496||Micro-Tom||All Sanger||FL7600||NC84173||OH9242||OH08-6405||PI 114490||PI 128216||All Genome Analyzer II|
|Total no. sequences||131,308||120,462||251,770||54,162,444||52,539,617||51,954,487||59,348,840||52,727,224||57,699,707||328,432,319|
|Total no. bp sequences||55.3 Mb||63.6 Mb||118.9 Mb||3.1 Gb||3.2 Gb||2.9 Gb||3.4 Gb||3.1 Gb||3.9 Gb||19.6 Gb|
|No. sequences passed quality filters||101,154||117,562||218,716||49,053,794||45,741,571||47,425,783||53,160,164||46,228,186||50,305,539||291,915,037|
|No. of bp of sequences passed quality filters||55.0 Mb||62.7 Mb||117.7 Mb||2.8 Gb||2.8 Gb||2.7 Gb||3 Gb||2.7 Gb||3.0 Gb||17 Gb|
|No. Mb contigs||10.8||10.2||21.0||30.6||31.7||31||33.7||34.9||33.3||195.2|
|N50 contig size (bp)||879||761||794||863||850||880||1030||1016||812||908|
|Max. contig size (bp)||3,107||3,234||3,234||12,143||13,288||11,689||14,001||11,685||13,981||14,001|
|Min. contig size (bp)||150||102||102||150||150||150||150||150||150||150|
|Tomato accession||No. contigs||No. tomato genes covered||Percent of tomato genes covered||Unique tomato gene hits||Total SNPs||Accession-restricted SNPs|
To provide historical reference to our GA2-generated sequences, we examined publicly available Sanger-derived ESTs from TA496 (Van der Hoeven et al., 2002) and Micro-Tom (Aoki et al., 2010). As shown in Table 2, substantially fewer ESTs were available using the Sanger platform. However, even though there are fewer reads, the length of the reads resulted in representation of 10.2 and 10.8 Mb of total assembled sequence for Micro-Tom and TA496, respectively. As anticipated, there was lower coverage of the tomato gene model set (Table 3) and fewer GO molecular function associations (Fig. 2; Supplemental Table S3) than observed with the GA2-derived contigs.
Single Nucleotide Polymorphism Discovery
In the first step of the SNP discovery pipeline (Fig. 1A), 3,730,151 SNPs were called against the tomato reference genome using the SAMTools (Li et al., 2009) SNP caller and GA2 data for the six sequenced accessions. Further filtering using SAMTools and custom Perl scripts was done on depth at the SNP position, SNP density, SNP quality, and consensus quality resulting in 86,556 SNPs that passed quality control criteria. The positions of the SNPs in the accessions were then collapsed into a set of high confidence SNPs (57,996 SNPs; Supplemental Table S4) with unique positions on the genome. The accessions with a SNP at each genomic position were tracked in a relational database. A parallel yet modified pipeline was invoked to identify SNPs in the two Sanger-generated EST assemblies, Micro-Tom and TA496, to enable a comparison of SNPs across these two sequencing pipelines and augment SNP discovery in tomato. Single nucleotide polymorphisms were called from the CAP3 sequence assembly (Huang and Madan, 1999) read alignment file using a custom Perl script yielding 12,707 SNPs. Filtering of the Sanger-derived SNPs based on base quality and density yielded 5490 high confidence SNPs (Fig. 1B; Supplemental Table S5). These SNPs were then mapped to a location on the TGI scaffolds by aligning the contigs with GMAP (Wu and Watanabe, 2005) and calculating the genomic position with a Perl script. Collectively, 62,576 nonredundant SNPs were identified in the tomato genome from the Sanger- and GA2-derived sequences. All downstream analyses were performed with the 57,996 high confidence SNPs from the six GA2-derived accessions and the 5490 high confidence SNPs from the two Sanger-derived accessions.
We identified high confidence SNPs in all eight accessions, ranging from 2618 in TA496 to 42,622 in PI 128216 (Table 3), attributable to the sequence coverage difference in Sanger- versus GA2-derived accessions and true genotype differences. A subset of these SNPs was restricted to a single accession ranging from 942 in NC84173 to 31,095 in PI 128216. Note that while we were able to identify accession-specific SNPs, this is an overestimate due to sampling limitations with transcriptome-derived SNP discovery and the stringent filtering imposed in our computational pipeline. Pairwise comparison of SNPs between all eight accessions revealed a range of total and accession-restricted SNPs (Table 4) with the fewest SNPs in any pairwise combination involving Micro-Tom and TA496, indicative of the reduced sampling of SNPs obtained through the Sanger platform. With respect to identifying SNPs within a market class, we were able to identify 2044 nonredundant SNPs unique to the fresh market class; of these, 134 SNPs were restricted to the three fresh market accessions (FL7600, NC84173, and OH08-6405; Table 5). Even fewer SNPs (89) were unique to the processing lineage, with five common to OH9242 and TA496 (for which sampling is reduced based on Sanger sequencing) relative to the reference genome Heinz 1706 (Table 5). Ample SNPs were apparent within the cherry and wild market classes (Table 5) reflective of the genetic diversity between these accessions.
|Accession 1||Accession 2||Total SNPs||Accession-restricted SNPs|
|PI 114490||PI 128216||8282||5534|
|Market class||Total nonredundant SNPs||Market class restricted|
|Wild: PI 128216||42,622||31,095|
Validation of Single Nucleotide Polymorphism Discovery using the BeadXpress Platform
Important components of SNP selection for genotyping are false positive rate, false negative rate, and informativeness, that is, is the SNP polymorphic in the population. We tested our SNP predictions using the BeadXpress platform. Further filtering to meet the design requirements of SNP detection platforms such as distance to the boundary between intron and exon and allelic variants was performed using the high confidence GA2-derived SNP set to yield 25,740 SNPs compatible with both the Illumina BeadXpress and Infinium platforms (Fig. 1A; Supplemental Table S6). We first binned the Illumina-compatible SNPs into the three classes (class A, B, and C) based on coverage of that polymorphic base in the six GA2-sequenced accessions (Supplemental Table S7). We then binned the GA2-derived SNPs into six subgroups based on the presence of the SNP in each of the market classes (processing vs. fresh market), cherry, and the wild accession (S. pimpinellifolium) (Supplemental Table S7). This enabled us to validate our computational pipeline based on representation among the accessions and polymorphism across the genotypes. Validation tests with 96 SNPs derived from GA2-transcriptome sequences yielded a validation rate of 98% across all classes of SNPs suggesting we developed a robust computational pipeline for SNP discovery. High rates were obtained for class A (98.3%), class B (95.8%), and class C (100%), further supporting our filtering criteria for information content (Table 6).
|Genome Analyzer II-derived SNPs
|Group||Concordant||Discordant||Total||Validation rate (%)|
|EST coverage||Concordant||Discordant||Total||Validation rate (%)|
To compare our computational pipeline with SNPs detected using the more conventional Sanger sequencing platform, we selected 96 SNPs from our alignments of TA496 ESTs to Heinz 1706 genome sequences. Validation rates for the SNPs called from the TA496 ESTs ranged from 61.3 to 95% (Table 7). The lowest rate was detected when EST coverage was two reads while the highest rate resulted from read coverage of five or greater, consistent with the notion that increasing depth of coverage improves concordance of SNPs calls. Three SNP calls based on inconsistent EST alignments were not validated, which also suggested that inconsistent calls were based on sequence error in the EST data, not heterozygosity within the source accession.
|Market class||Prop. PM||Efficiency||Allelic richness||Processing||Vintage||Latin American||Wild|
Examination of high confidence SNPs based on molecular function (Fig. 2) revealed broad coverage of the GOSlim molecular function categories among both the GA2- and the Sanger-derived sequences. However, a higher fraction of the Sanger-derived sequences had SNPs than the GA2-derived sequences.
We assessed how informative SNPs were for detection of polymorphism between and within germplasm classes. High levels of polymorphism were detected within all subclasses (Table 7). As a measure of SNP marker efficiency, the proportion of markers with no missing data was low with 86 to 99% of markers yielding no failed allele calls in each of the market classes of cultivated germplasm. We detected a higher level of missing data in the wild germplasm, a classification that consisted of S. lycopersicum var. cerasiforme, S. pimpinellifolium, S. pennellii Correll, and S. habrochaites S. Knapp & D. M. Spooner. The majority (74%) of missing SNP calls occurred in the S. pennellii and S. habrochaites accessions, suggesting that sequence divergence may have contributed to assay failure (Table 7; Supplemental Table S1). Levels of allelic richness (El Mousadik and Petit, 1996) for SNPs will fall between 1 and 2. Allelic richness in cultivated germplasm ranged from 1.28 to 1.39, indicating that the markers are informative (Table 7). The value of 1.54 in the wild germplasm reflected higher sequence diversity in the class. Between class genetic diversities were adjusted for sample sizes (Nei, 1978) and were consistent with between population subdivision detected in previous studies (Sim et al., 2011).
Use of PCA to visualize the distribution of germplasm based on SNP data revealed separation of the defined classes. Over 50% of the variance was explained by the first five PCs (out of 88). The first PC explained 22% of the total variance and separated wild germplasm from cultivated types (p < 0.0001). The second PC explained 11.5% variance and separated the cultivated classes (Fig. 3) with significant differences between vintage and landrace and both fresh market and processing (p < 0.0004). The two contemporary cultivated classes, processing and fresh market, were also significantly separated along the second PC (p = 0.042). Although the green-fruited accessions LA0407 and LA0716 were distinct from the cultivated lineages along PC1, these accessions clustered with the red-fruited S. pimpinellifolium accessions. This lack of discrimination may reflect some ascertainment bias as the more distant accessions were not sequenced and only polymorphisms shared with the sequenced accessions will be informative.
Inspection of the loadings for SNPs along each PC is a useful approach to identify markers and their associated chromosome segments that distinguish the germplasm. Four of the top seven SNPs (5% of total), based on high positive and negative loadings, were CL542Contig1, CL5590Contig1, CL6432Contig1, and sl_15930, from chromosome 4.
Candidate Loci under Positive Selection
Analysis of the distribution of SNP alleles within and among accessions representing distinct populations or market classes can provide insight into which areas of the genome might be under selection (Sim et al., 2009). We used an outlier detection method as implemented in the LOSITAN program (Antao et al., 2008) to investigate the distribution of validated SNPs. A total of 20 unique loci were detected from pairwise comparisons in three market classes (fresh market, processing, and vintage) as falling outside of the 95% confidence interval (Table 8). We identified four loci between fresh market and processing, eight loci between fresh market and vintage, and 12 loci between processing and vintage. Four loci overlapped between the pairwise comparisons. The Fst estimates of these 20 loci ranged from 0.28 to 0.88 (Table 8). The 20 loci were distributed on eight tomato chromosomes and a high portion of these loci were derived from chromosomes 2 (5 loci), 4 (5 loci), and 11 (3 loci). We inferred putative functions of these loci based on the corresponding UniRef100 and A. thaliana annotation (Table 8). A high proportion of these annotations are for genes involved in biotic or abiotic stress resistance, but caution is needed in interpreting the results of outlier detection as direct cause and effect. The decay of linkage disequilibrium in cultivated tomatoes occurs over centimorgan intervals (Robbins et al., 2011). The SNPs detected based on outlier detection point to regions of the genome likely to have been selected during the crop improvement process.
Interestingly, many of the same SNPs were identified based on Fst outlier detection and based on high loadings in the first and second PC. Chromosome 2 contains several genes involved in fruit size and shape whereas chromosome 11 contains a number of disease resistance genes that differentiate fresh-market varieties from vintage varieties. The second PC, which separates the cultivated varieties, contained the most overlap with the LOSITAN (Antao et al., 2008) outlier detection for SNPs on chromosome 4. The significance of chromosome 4 is less clear, compared to chromosomes 2 and 11, as there have been few genes from this chromosome that are characterized at the molecular level. A search of known morphological genes from this chromosome (TGRC,2011) reveals numerous alleles affecting plant habit, leaf morphology, and sugar metabolism, which are all likely targets for modification through selection.
|Fresh market vs. processing
||Fresh market vs. vintage
||Processing vs. vintage
|Locus||Chromosome||He||Fst||He||Fst||He||Fst||UniRef100 and Arabidopsis annotation||Gene Ontology term|
|solcap_snp_sl_12352||1||0.58||0.56||0.54||0.18||0.25||0.23||Protein transport protein sec23||endoplasmic reticulum (ER) to Golgi transport (GO:0006888)|
|solcap_snp_sl_13404||1||0.57||0.47||0.53||0.12||0.33||0.17||Stress-responsive protein||response to stress (GO:0006950)|
|CL6523Contig1||2||0.62||0.50||0.08||0.06||0.65||0.64||Senescence-related gene 1||flavonoid biosynthesis (GO:0009813)|
|CL542Contig1||4||0.43||0.26||0.57||0.47||0.88||0.86||Annexin 1||oxidation-reduction process (GO:0055114)|
|CL5590Contig1||4||0.31||0.12||0.70||0.62||0.88||0.86||Exostosin family protein||biological process (GO:0008150)|
|CL6432Contig1||4||0.37||0.13||0.65||0.56||0.85||0.83||Serine carboxypeptidase-like 42||proteolysis (GO:0006508)|
|CL7515Contig1||4||0.32||0.08||0.70||0.62||0.85||0.83||Pathogenesis-related family protein||biological process (GO:0008150)|
|solcap_snp_sl_15930||4||0.50||0.37||0.50||0.38||0.89||0.88||Glucose-1-phosphate uridylyltransferase (UDP-glucose):glucosyltransferase||metabolic process (GO:0008152)|
|solcap_snp_sl_24440||6||0.37||0.06||0.15||0.13||0.31||0.30||Nucleic acid binding protein||regulation of transcription, DNA-dependent (GO:0006355)|
|CL2524Contig1||8||0.06||0.03||0.36||0.21||0.34||0.32||Haloacid dehalogenase-like hydrolase protein||metabolic process (GO:0008152)|
|CL657Contig1||8||0.17||0.16||0.32||0.30||0.38||0.04||Plant AT-rich sequence- and zinc-binding protein (PLATZ) transcription factor family protein||biological process (GO:0008150)|
|solcap_snp_sl_10976||11||0.55||0.48||0.54||0.34||0.13||0.03||L-ascorbate oxidase||oxidation-reduction process (GO:0055114)|
|solcap_snp_sl_21030||11||0.03||0.02||0.58||0.57||0.58||0.52||Disease resistance protein||defense response (GO:0006952)|
|solcap_snp_sl_1567||12||0.30||0.28||0.30||0.28||0.00||0.00||Subtilisin-like protease||lateral root formation (GO:0010102)|
We generated extensive sequence data of transcriptomes of six tomato accessions representing fresh market (NC84173, FL7600, and OH08-6405), processing (OH9242), cherry (PI 114490), and S. pimpinellifolium (PI 128216). In conjunction with a draft sequence for the tomato genome, we were able to identify a large number of SNPs which were validated with high confidence. These polymorphisms complement a set identified in two Sanger-derived EST collections and in total we were able to identify 62,576 nonredundant SNPs in tomato. Validation rates of our computational pipeline for GA2 data were greater than 95.8% and as high as 100%, suggesting the set of high confidence and Illumina-compatible SNPs will be robust in genotyping assays. We demonstrate that these SNPs will be effective for characterization of cultivated and wild tomato populations. In addition, population level analysis with these SNPs appears to be a promising approach for identifying regions of the genome that are under selection due to crop improvement. As these SNPs were identified in elite germplasm and across market classes, they provide a resource for genetic diversity, genome-wide association studies, and marker-assisted selection in populations that are directly relevant to plant breeders.
Supplemental Information Available
Supplemental material is available at http://www.crops.org/publications/tpg.
Supplemental Table S1. List of germplasm used for single nucleotide polymorphism (SNP) validation using the BeadXpress assay.
Supplemental Table S2. Description of Genome Analyzer II (GA2) libraries sequenced.
Supplemental Table S3. Gene Ontology Slim (GOSlim) molecular function associations with eight tomato transcriptome assemblies and percentage of associations associated with single nucleotide polymorphisms (SNPs).
Supplemental Table S4. List of high confidence single nucleotide polymorphisms (SNPs) identified from Genome Analyzer II (GA2) sequences.
Supplemental Table S5. List of high confidence single nucleotide polymorphisms (SNPs) identified from Sanger sequences (Sanger and Coulson, 1975).
Supplemental Table S6. List of Illumina compatible single nucleotide polymorphisms (SNPs) identified from Genome Analyzer II (GA2) sequences.
Supplemental Table S7. Classification of Illumina-compatible Genome Analyzer II (GA2) single nucleotide polymorphisms (SNPs) based on sequence coverage and market class representation.