A Solanaceae reference genome will be an invaluable resource in addressing two fundamental biological questions: first, how genomes code for extensive phenotypic differences using relatively conserved sets of genes; and second, how phenotypic diversity can be harnessed for the improvement of agricultural products. Sequence data from other species, such as expressed sequence tags (ESTs) (Adams et al., 1991), methylation-filtered sequence (Palmer et al., 2003; Whitelaw et al., 2003; Fu et al., 2004), or Cot-filtered sequence (Peterson et al., 2002; Yuan et al., 2003), together with sequencing by novel very-high-throughput approaches such as 454 sequencing (Margulies et al., 2005) or Solexa sequencing (Shendure et al., 2005), in combination with good comparative maps (Tanksley et al., 1992; Doganlar et al., 2002; Fulton et al., 2002) between many Solanaceae plants (Hoeven et al., 2002; D'Agostino et al., 2007), will enable insights into evolution, domestication, development, response, and signal transduction pathways.
After the sequencing of a number of dicots from the rosid clade (Angiosperm Phylogeny Group, 2003), including Arabidopsis thaliana L. (AGI, 2000) and Medicago truncatula Gaertn. (Cannon et al., 2006) using bacterial artificial chromosome (BAC)-by-BAC approaches, and poplar [Populus trichocarpa (Torr. & A. Gray)] (Tuskan et al., 2006), grape (Vitis vinifera L.) (Jaillon et al., 2007), and others using whole-genome shotgun (WGS) techniques, the sequencing of the first genome in the asterids will shed light on this clade, permitting comparisons over longer evolutionary distances and providing information about the larger picture of angiosperm evolution.
Ten countries are involved in sequencing the tomato genome and the 12 chromosomes have been allocated among the countries as depicted in Fig. 1. The chloroplast genome was recently completed by a European consortium (Kahlau et al., 2006) and the mitochondrial genome is being sequenced by the Instituto Nacional de Tecnología Agropecuaria in Argentina within the framework of the EU-SOL project (http://www.eu-sol.net [verified 10 Jan. 2009]).
The 950-Mb tomato genome is structured into distal, gene-rich euchromatin and gene-poor pericentromeric heterochromatin. The heterochromatic fraction, consisting mostly of repetitive sequences, will be extremely difficult to sequence. Therefore, the strategy is to initially sequence the euchromatic portions of the genome, which are estimated to make up about one-quarter (220 Mb) of the tomato genomic sequence (Peterson et al., 1996) and to include >90% of the genes (Wang et al., 2006). As a consequence, the effort to sequence the majority of the gene space is less than twice the effort required to sequence the Arabidopsis genome at 157 Mb (Bennett et al., 2003).
To render the emerging tomato sequence immediately useful to the community, it is being annotated by the International Tomato Annotation Group (ITAG). Annotations are available on the SOL Genomics Network (SGN) website (http://sgn.cornell.edu/ [verified 10 Jan. 2009]), and a number of Web-based tools have been developed that allow researchers to download and analyze the emerging sequence.
Here, we provide a summary of the status of the project and relevant insights drawn from the annotation of the tomato genome performed to date.
Results and Discussion
To sequence the tomato euchromatin, a BAC-by-BAC approach was chosen in preference to a WGS strategy. This will generate a high-quality “gold standard” sequence, which is essential for use as a reference genome (International Rice Genome Sequencing Project, 2005) and which will serve as the scaffold for the related Solanaceae genomes. In short, the BAC-by-BAC strategy involves the anchoring of BACs or contigs of BACs to a reference genetic map. These anchored BACs are sequenced, and the sequence information is used to extend these BACs and BAC contigs further (“BAC walking”). Gaps between BAC contigs are closed by targeting novel markers or BACs to these gaps, which is then followed by successive rounds of BAC walking.
The high-density F2–2000 map (Fulton et al., 2002) is used as a reference genetic map for the sequencing project. This map is based on 80 F2 individuals from the cross Solanum lycopersicum LA925 × S. pennellii Correll LA716 and contains a subset of restriction fragment length polymorphism markers from the Tomato-EXPEN 1992 map (Tanksley et al., 1992). Most of the markers are conserved ortholog set (COS) markers (Fulton et al., 2002; Wu et al., 2006) derived from a comparison of Solanaceae ESTs against the entire Arabidopsis genome. Those COS markers selected were single–low copy, having a highly significant match with a putative orthologous locus in Arabidopsis. Maps constructed using COS markers can readily be compared and analyzed for chromosome inversions, duplications, and other large-scale genome rearrangements, a characteristic that will be useful for transferring knowledge from tomato to other species. In addition to COS markers, the map also contains a significant number of simple sequence repeat (SSR) markers, most of which were identified in ESTs (usually in 5′ or 3′ untranslated regions).
The BACs used in the tomato sequencing project are derived from several libraries, all of which were constructed from the Heinz 1706 tomato line. In addition to a HindIII library consisting of 129,024 clones that was available at the outset of the project (Budiman et al., 2000), two additional BAC libraries were generated: an EcoRI library of 72,264 clones and an MboI library of 52,992 clones. Together, these libraries provide more than 25× genome coverage. The BAC libraries have been deep end-sequenced in the United States, with >340,000 high-quality reads equivalent to 20% of the entire genome sequence. The BAC libraries are complemented by a fosmid library. Currently, >180,000 high-quality fosmid end sequences from the Wellcome Trust Sanger Institute and the University of Padua are available, equivalent to 15% of the entire genome sequence. Fosmid libraries are crucial in a genome sequencing project because their narrowly defined insert length can be used as an analytical tool to detect potential misassemblies of BACs, and their generally shorter insert length is ideal for filling smaller gaps and thereby reducing redundant sequence (Kim et al., 1995). The fosmid library was constructed from sheared DNA rather than restriction-digested DNA, to obtain clone coverage in regions that contain few or none of the relevant restriction sites.
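The end-sequence figures above can be sanity-checked with simple arithmetic. The following is a minimal sketch, assuming the 950-Mb genome size used elsewhere in this article; the function name is illustrative. Because the read counts are quoted as lower bounds (">340,000", ">180,000"), the implied mean read lengths are upper estimates rather than reported values.

```python
# Back-of-the-envelope check of the end-sequence coverage figures quoted above.
GENOME_SIZE_BP = 950_000_000  # tomato genome size used in the text


def implied_mean_read_length(n_reads: int, genome_fraction: float,
                             genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Mean read length (bp) at which n_reads add up to genome_fraction of the genome."""
    return genome_fraction * genome_size_bp / n_reads


# BAC ends: >340,000 reads, equivalent to 20% of the genome.
bac_end_len = implied_mean_read_length(340_000, 0.20)
# Fosmid ends: >180,000 reads, equivalent to 15% of the genome.
fosmid_end_len = implied_mean_read_length(180_000, 0.15)

print(f"BAC ends:    ~{bac_end_len:.0f} bp per read")
print(f"Fosmid ends: ~{fosmid_end_len:.0f} bp per read")
```

Both values fall in the range expected for Sanger-type end reads, which suggests the quoted percentages refer to raw sequence length rather than to nonredundant coverage.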
All BACs from the HindIII library and from the MboI library were fingerprinted and contigs of overlapping BACs were generated using the fingerprinted contigs (FPC) tool (Soderlund et al., 2000). First, an analysis of the BAC fingerprint data yielded 6000 contigs, of which >3500 could be anchored to the genetic map. In an effort to globally reduce the number of contigs, the entire FPC data were reassembled using less stringent assembly criteria (cutoff E-value of 1 × 10⁻¹² and tolerance of 7). This resulted in 4360 contigs representing about 658 Mb of sequence. To increase the contig size and to reduce the contig number further, the contigs were manually edited with anchoring information by contig end-search and merging, resulting in 4156 contigs.
Finally, a total of 837 markers were used to anchor the contigs to the tomato genetic map. The anchored contigs represent about 187 Mb of genomic DNA and are mainly composed of euchromatic sequences from the tomato genome.
Validation of the physical map was performed using fluorescent in situ hybridization (FISH) on pachytene complements with entire BAC clones as probes (Chang et al., 2007; Szinay et al., 2008) (see also FISH map on SGN, http://sgn.cornell.edu/cview/map.pl?map_id=13 [verified 10 Jan. 2009]), and by genetic mapping of anchored BACs using panels of tomato introgression line populations (Eshed and Zamir, 1995). The integrated map is available through WebFPC.
Since the current sequencing effort focuses on the tomato euchromatin, determining the chromosomal borders between euchromatin and heterochromatin is essential. Currently, we use FISH to identify BAC inserts from euchromatin–heterochromatin boundaries based on linkage map information and on the specific staining by FISH of the repetitive fraction of the tomato genome (Szinay et al., 2008); see Fig. 2.
In a multinational project, it is important that all participants use the same standards for completing their sequences. The Tomato Genome Project started to develop these standards early on, and they will be maintained and developed when new issues arise. The full quality standards are described in the Tomato Sequencing Guidelines document available online at http://docs.google.com/View.aspx?docid=dggs4r6k_1dd5p56 (verified 5 Feb. 2009).
In summary, the BACs are being sequenced to the following quality standards:
The BAC sequence submitted in high-throughput genome sequence (HTGS) Phase 3 consists of a single contig.
All bases of the HTGS Phase 3 consensus sequence must have a Phred quality score of at least 30.
As a result of the shotgun process, the bulk of sequence will be derived from multiple subclones sequenced from both strands. Any regions of unidirectional sequence coverage with a single sequencing chemistry must pass manual inspection for sequence problems but need not be annotated. Regions covered by only a single subclone must be attempted from an alternate subclone or by direct walking on BAC DNA or by BAC polymerase chain reaction. These regions must concur with a restriction digest analysis of the clone. In addition, these regions must be annotated.
At least 99% of the sequence must have less than one error in 10,000 bp as reported by Phrap or other sequence assembly consensus scores. Exceptions must be manually checked and pass inspection for possible problems. Any areas not meeting this standard must be annotated as such.
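The two numeric thresholds in these standards are related through the standard Phred definition, Q = −10 log10(P), where P is the estimated per-base error probability. The sketch below is illustrative only: the function names are assumptions, and the actual checks used by the sequencing centers are described in the Tomato Sequencing Guidelines, not here.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Per-base error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score for a per-base error probability p."""
    return -10 * math.log10(p)


def fraction_meeting(quals, min_q: float) -> float:
    """Fraction of bases whose quality score is at least min_q."""
    quals = list(quals)
    return sum(q >= min_q for q in quals) / len(quals)


# Phred 30 corresponds to one expected error in 1,000 bases.
print(phred_to_error_prob(30))
# One error in 10,000 bases corresponds to Phred 40.
print(error_prob_to_phred(1e-4))

# Hypothetical consensus quality values for a short stretch of sequence.
example_quals = [40, 40, 30, 50]
print(fraction_meeting(example_quals, 40))
```

Read this way, the standards combine a per-base floor of Phred 30 with the requirement that at least 99% of the sequence reach the equivalent of Phred 40.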
To date (September 2008), 689 BACs have been sequenced and reported in the SGN BAC registry database (either HTGS Phase 2 or Phase 3) (Fig. 1), representing 74.8 Mb including overlaps (available from SGN and GenBank). Of these, 419 are included in the Accessioned Golden Path (AGP) files, which can be viewed in the SGN AGP map; they comprise 44.5 Mb of sequence, or roughly 20% of the tomato euchromatin. These BACs have been placed into 282 contigs and have been annotated using the ITAG annotation pipeline (see below).
Genome Annotation by ITAG
To render the sequence immediately useful to the community, ITAG is producing a high-quality automated annotation of the tomato genome in a distributed collaborative effort, which involves groups from Europe, Asia, and the United States. The centerpiece of the structural annotation is the EuGene gene prediction platform (Foissac et al., 2008), a powerful predictor capable of integrating a diverse array of inputs, such as evidence-based alignments and ab initio predictions. For the functional annotation, InterPro domains are determined using InterProScan and homology searches are performed. Where possible, other sequence features (e.g., noncoding RNAs) are predicted. An important initial activity of the ITAG group was to generate a training and test set of gene sequences to train gene finders for tomato. Gene finders that are being trained or have been trained include EuGene (Foissac et al., 2008), GeneMark (Isono et al., 1994), TwinScan (Korf et al., 2001), and Augustus (Stanke et al., 2008). Results of predicted gene models and their functional annotations are available via the SGN Web site.
In the first batch of annotations, partially based on as yet untrained gene finders, the ITAG pipeline has identified 7464 protein-coding genes longer than 180 nucleotides in 44 Mb of nonredundant sequence. This represents a gene density of approximately one gene per 6 kb, slightly lower than the density of one gene per ∼4.5 kb in Arabidopsis (AGI, 2000) but higher than the one protein-coding gene per 9.9 kb in the rice (Oryza sativa L.) genome (International Rice Genome Sequencing Project, 2005). The average coding sequence is 996 bp long and is composed of 3.7 exons. The primary difference between tomato and Arabidopsis genes is that tomato genes, including their introns, are longer. The average gene length from this analysis is ∼2 kb, with an average intron length of 485 bp and an average exon length of 268 bp, significantly larger than those in Arabidopsis. While the lower number of exons per gene almost certainly reflects the current lower annotation quality of tomato genes, it is notable that the average intron length is more than twice that in Arabidopsis. Assuming a gene density of one gene per 6 kb in the rest of the tomato euchromatin, we can expect that the euchromatin of the tomato genome contains just over 40,000 genes, close to the estimated number of about 35,000 (Hoeven et al., 2002). Some of these parameters may change with improved tomato genome annotations and the further improvement of trained tomato gene finders. Figure 3 shows the number of tomato genes falling into certain annotation categories, and a comparison to the numbers in the categories found in Arabidopsis, rice, and poplar. The numbers in each category are similar between species, indicating that the gene content of the portion of the tomato genome sequenced so far is similar to that of other plant genomes.
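The density and extrapolation figures in this paragraph can be reproduced approximately from the rounded values quoted in the text. The sketch below is only a rough check: it assumes the 220-Mb euchromatin size given earlier, and the variable names are illustrative. With these rounded inputs the extrapolation comes out near 37,000 genes, between the earlier estimate of about 35,000 and the figure of just over 40,000 quoted above; the exact result depends on the euchromatin size and gene density assumed.

```python
# Values quoted in the text (rounded).
genes_found = 7464            # protein-coding genes in the first annotation batch
annotated_bp = 44_000_000     # nonredundant sequence annotated
euchromatin_bp = 220_000_000  # estimated euchromatin size used in this article

# Gene density in the annotated sequence.
kb_per_gene = annotated_bp / genes_found / 1000

# Extrapolation to the whole euchromatin at the observed density.
extrapolated_genes = genes_found / annotated_bp * euchromatin_bp

# Same extrapolation using the rounded density of one gene per 6 kb.
extrapolated_rounded = euchromatin_bp / 6000

# Mean exon length implied by a 996-bp coding sequence split over 3.7 exons.
mean_exon_bp = 996 / 3.7

print(f"one gene per {kb_per_gene:.1f} kb")
print(f"extrapolated genes (observed density): {extrapolated_genes:.0f}")
print(f"extrapolated genes (1 per 6 kb):       {extrapolated_rounded:.0f}")
print(f"implied mean exon length: {mean_exon_bp:.0f} bp")
```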
De novo repeat analysis was performed on the available BAC-end sequences, and the resulting repeats were used to analyze both the BAC-end sequences and the complete BAC sequences. The de novo repeat set masked 57% of BAC-ends and 24% of full BAC sequence, indicating that the BACs selected from the euchromatin contain fewer repeats than the genome as a whole. These results support the recently described distribution of tomato repetitive sequences as determined by FISH (Chang et al., 2008). The fraction of long terminal repeat elements was much higher in BAC-ends (30%) than in the full BAC sequences (12.6%), indicating that there are large differences in the nature of repeats occurring in different genome regions.
The distribution of repeats and gene content on selected chromosomes is shown in Fig. 4, defined by repeat analysis and EST coverage. The information is reported only for those chromosomes for which Tiling Path Format files, which represent the tentative order of the BACs in the chromosome assembly as provided by the sequencing centers, are available at the SGN Web site to date. The following numbers of BACs were analyzed for each chromosome: chromosome 4, 94; chromosome 5, 35; chromosome 6, 100; chromosome 9, 43; and chromosome 12, 34. This analysis includes a number of BACs that were attributed to heterochromatin but nevertheless have been sequenced. The bars in each panel represent the percentage of nucleotides in a BAC that could be aligned to Solanum lycopersicum ESTs (blue bars) and repeat sequences (red bars). Figure 4 shows that repeats are much less abundant in the euchromatic arms and in some cases form a gradient of increasing density into the heterochromatin, whereas on other arms the transition appears less gradual. Also, in general, the gene-rich BACs have lower repeat content, supporting the general assumption that genes are predominantly present in the relatively repeat-poor euchromatin. The tomato heterochromatin contains the bulk of the repetitive DNA fraction but nevertheless also contains some genes, as has been described by Yasuhara and Wakimoto (2006).
Transcription factors (TFs) play key roles in regulation of gene expression in various biological processes. The assembled ESTs (Plant Genome Database [PlantGDB]–assembled unique transcripts [PUTs]) of Solanum lycopersicum from PlantGDB were searched for putative TFs using hidden Markov model (HMM) profiles, which resulted in the identification of 1463 such PUTs that included 66 of the 71 known TF gene families. Considering that 40,000 genes are predicted in the tomato genome (Hoeven et al., 2002), this indicates that ∼3.6% of the total genes in the euchromatic region may be TFs. For Arabidopsis, 5.9 to 7% (Riechmann et al., 2000; Riano-Pachon et al., 2007), and for rice, 4% (Goff et al., 2002; Riano-Pachon et al., 2007), of the total genes are TFs. Further, 237 PUTs (16%) encoding putative TFs could be mapped on 559 tomato BACs, representing around 56 Mb of sequenced tomato genome. On average, one TF gene is present in every 200 kb (assuming an average BAC size of 100 kb); see Table 1. Chromosomes 12 and 11 seem to harbor the highest and lowest density of TF genes, respectively. The three largest TF gene families in tomato are the AP2-EREBP (APETALA2-ethylene responsive element binding protein), MYB, and bHLH (basic helix-loop-helix) families (not shown).
Table 1. Putative transcription factor genes mapped on sequenced tomato BACs, by chromosome.
|Chromosome|1|2|3|4|5|6|7|8|9|10|11|12|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Chromosome size (Mb)|108|85.6|83.6|82.1|80|53.8|80.3|64.7|81.8|88.5|64.7|76.4|
|No. of BACs analyzed|9|86|14|88|34|76|87|85|45|4|16|15|
|No. of transcription factors|4|41|4|31|10|42|34|31|21|3|3|13|
|No. of transcription factors per BAC|0.44|0.47|0.28|0.35|0.29|0.55|0.40|0.36|0.46|0.75|0.18|0.86|
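The genome-wide TF figures quoted in the text follow from a few ratios, and the per-chromosome rows of Table 1 add up to the stated totals. The following is a minimal sketch using only numbers given in this section; the 100-kb average BAC size is the assumption stated in the text, and the variable names are illustrative.

```python
# Figures quoted in the text.
tf_puts = 1463            # PUTs identified as putative transcription factors
predicted_genes = 40_000  # predicted gene number used for the estimate
mapped_tf_puts = 237      # TF PUTs located on sequenced BACs
assumed_bac_kb = 100      # average BAC size assumed in the text

# Rows of Table 1, chromosomes 1 to 12.
bacs_analyzed = [9, 86, 14, 88, 34, 76, 87, 85, 45, 4, 16, 15]
tfs_mapped = [4, 41, 4, 31, 10, 42, 34, 31, 21, 3, 3, 13]

total_bacs = sum(bacs_analyzed)
total_tfs = sum(tfs_mapped)

tf_share = tf_puts / predicted_genes                # share of genes that may be TFs
mapped_share = mapped_tf_puts / tf_puts             # share of TF PUTs placed on BACs
kb_per_tf = total_bacs * assumed_bac_kb / total_tfs # sequence per mapped TF gene

print(f"BACs: {total_bacs}, mapped TFs: {total_tfs}")
print(f"TF share of predicted genes: {tf_share:.1%}")
print(f"TF PUTs mapped: {mapped_share:.0%}")
print(f"about one TF gene per {kb_per_tf:.0f} kb")
```

Under the 100-kb assumption the spacing works out to a little over 200 kb per TF gene, in line with the rounded value given in the text.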
Sequence analysis of cloned plant disease resistance genes (R-genes) conferring resistance to viral, bacterial, and fungal pathogens has shown that the majority of them possess common sequences and structural motifs. These R-genes can be grouped into three major classes (NBS-LRR type, LZ-NBS-LRR type, or LRR-Tm type) on the basis of their encoded protein motifs, such as leucine zippers (LZ), nucleotide binding sites (NBS), leucine-rich repeats (LRR), protein kinase domains, trans-membrane (Tm) domains, and Toll-IL-1R homology regions. We analyzed 48,945 unigene (PUT) sequences of tomato from PlantGDB for the presence of R-gene homologs by a BLASTX analysis against the nonredundant database of the National Center for Biotechnology Information (NCBI) and classified them into the above three categories. PUTs matching other putative R-genes, or LRR motifs only, were grouped into a miscellaneous R-gene category. In addition, defense response genes such as glucanases, chitinases, and thaumatin-like proteins were also included in the analysis.
We found a total of 155 annotations similar to resistance-like genes and 83 annotations showed homology to the defense-response-like genes (Fig. 5).
These R-gene and defense-response gene homologs were mapped in silico onto the sequenced BACs of the different chromosomes to find their physical locations, resulting in the localization of 59 R-gene homologs and of 21 defense-response gene homologs (see Table 2). Thus, the 80 mapped genes represent about one-third of the 238 resistance-like and defense-response-like PUTs identified in the tomato EST assemblies. Since the number of BACs analyzed per chromosome varied considerably, we normalized the frequency of these genes per BAC clone to evaluate their relative distribution on different tomato chromosomes. Based on this analysis, chromosomes 4, 9, and 11 seem to harbor a larger than average number of R-gene homologs per BAC, whereas chromosome 5 has the largest number of defense-response genes per BAC. However, this may change as more sequence data become available, particularly from chromosomes 1, 3, 10, and 11, which were underrepresented when this analysis was undertaken.
Table 2. Resistance-like and defense-response-like genes mapped on sequenced tomato BACs, by chromosome.
|Chromosome|1|2|3|4|5|6|7|8|9|10|11|12|
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Chromosome size (Mb)|108|85.6|83.6|82.1|80|53.8|80.3|64.7|81.8|88.5|64.7|76.4|
|No. of BACs sequenced (available at SGN)|19|91|15|105|42|126|100|127|57|4|18|50|
|Disease-resistance-like genes (PUTs) mapped|1|6|0|12|3|9|4|11|7|0|3|3|
|No. of resistance-like genes per BAC|0.05|0.07|0.00|0.11|0.07|0.07|0.04|0.09|0.12|0.00|0.17|0.06|
|No. of defense-response-like unigenes mapped|0|5|0|0|7|1|2|1|1|0|0|4|
|No. of defense-response-like genes per BAC|0|0.05|0|0|0.17|0.01|0.02|0.01|0.02|0|0|0.08|
Total mapped: 59 resistance-like genes and 21 defense-response-like genes.
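The per-BAC normalization described above can be sketched as follows, using the counts from Table 2. This is an illustrative reconstruction, not the authors' script: the helper names are assumptions, and ranking the chromosomes by density is just one simple way to read "larger than average" from the normalized values.

```python
def per_bac(counts, bacs):
    """Normalize gene counts by the number of sequenced BACs per chromosome."""
    return [c / b for c, b in zip(counts, bacs)]


def top_chromosomes(densities, n):
    """Return the n chromosome numbers (1-based) with the highest densities."""
    ranked = sorted(range(len(densities)), key=lambda i: densities[i], reverse=True)
    return [i + 1 for i in ranked[:n]]


# Rows of Table 2, chromosomes 1 to 12.
bacs_sequenced = [19, 91, 15, 105, 42, 126, 100, 127, 57, 4, 18, 50]
r_genes = [1, 6, 0, 12, 3, 9, 4, 11, 7, 0, 3, 3]
defense_genes = [0, 5, 0, 0, 7, 1, 2, 1, 1, 0, 0, 4]

r_density = per_bac(r_genes, bacs_sequenced)
d_density = per_bac(defense_genes, bacs_sequenced)

# Share of the 155 + 83 identified PUTs that could be placed on BACs.
mapped_share = (sum(r_genes) + sum(defense_genes)) / (155 + 83)

print("R-gene homologs per BAC, top three chromosomes:",
      sorted(top_chromosomes(r_density, 3)))
print("Defense-response genes per BAC, top chromosome:",
      top_chromosomes(d_density, 1))
print(f"Mapped share of identified PUTs: {mapped_share:.0%}")
```

Chromosomes with very few sequenced BACs (for example chromosome 10, with four) contribute little information to such a ranking, which is why the distribution may shift as more sequence becomes available.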
Comparison to Potato Sequence
An initial effort was made to compare the gene and repeat content of the tomato and potato genomes, based on the available BAC-end sequences for both species (Datema et al., 2008). The BAC-end sequence comparison is of particular interest as it provides a picture for the complete genome, including both euchromatic and heterochromatic sequence. Comparison using only sequenced tomato BACs will mainly provide a comparison between the euchromatin of tomato and potato. In total, 310,580 BAC-end sequences representing ∼19% of the 950-Mb tomato genome were compared to 128,819 potato BAC-end sequences representing ∼10% of the 840-Mb potato genome. It is important to note that while most potato varieties used in agriculture are tetraploid, the potato line being sequenced is diploid (van Os et al., 2006).
The tomato genome has a higher overall dispersed repeat content than the potato genome, with the majority of dispersed repeats in both species belonging to the Gypsy and Copia retrotransposon families. Specifically, the Copia:Gypsy ratio is higher in tomato than in potato, suggesting that the retrotransposon amplification associated with the genome expansion in tomato is predominantly the result of additional Copia elements. On the other hand, simple sequence repeat (SSR) motifs are more abundant in potato than in tomato. In both genomes penta-nucleotide repeats are the most common form of SSRs, and AAAAT is the predominant repeat motif. This is in contrast to previously studied plant species, in which di- and penta-nucleotide repeats generally occur least frequently (Asp et al., 2007).
The potato BAC-end sequences have a 1.5- to 1.6-fold higher protein coverage than tomato when aligned to the NCBI nonredundant protein database, and a 1.3- to 1.4-fold higher coverage when compared with the species-specific EST data. Taking into account the difference in genome size and assuming that tomato has ∼40,000 genes, potato appears to contain up to 6400 more putative coding regions than tomato. Moreover, the P450 superfamily appears to have expanded dramatically in both species compared with Arabidopsis thaliana (Datema et al., 2008), suggesting an expanded network of specialized metabolic pathways in the Solanaceae.
Tomato Genome Tools Available for Researchers
A number of tools have been created for the tomato genome sequencing project that are also useful to the larger research community.
SGN Database, FTP Site, and BLAST Data Sets
All data, sequences, mapping information, and project statistics can be found on http://sgn.cornell.edu/.
The SGN database keeps track of the status of each BAC in the sequencing pipeline. The BACs can be searched at SGN (http://sgn.cornell.edu/search/direct_search.pl?search=bacs [verified 13 Jan. 2009]).
The Tomato Genome Browser displays the annotation for each BAC (http://sgn.cornell.edu/gbrowse/). All data sets can be downloaded from the SGN File Transfer Protocol (FTP) site (ftp://ftp.sgn.cornell.edu/tomato_genome/ [verified 16 Jan. 2009]), including BAC and contig sequences, BAC-end sequences, annotations in gff3 and GAME XML format, chromatograms and assembly files, and FPC raw data. The BAC-end and full BAC sequences generated in the tomato genome project, as well as tomato transcript sequences generated through other projects, are available in the SGN BLAST tool (http://sgn.cornell.edu/tools/blast/ [verified 16 Jan. 2009]). The SGN comparative map viewer (http://sgn.cornell.edu/cview/ [verified 16 Jan. 2009]) (Mueller et al., 2008) displays a number of genetic and physical maps for the tomato genome project.
Tomato and Potato Assembly Assistance System
The Tomato and Potato Assembly Assistance System was developed to automate the assembly and scaffolding of contig sequences for tomato chromosome 6 (Peters et al., 2006).
A tomato-specific data set was added to the Morgan2McClintock tool (Lawrence et al., 2006). This tool was implemented at the MaizeGDB database (http://www.maizegdb.org/) and initially used the maize Recombination Nodule map (Anderson et al., 2003, 2004) to calculate approximate chromosomal positions for loci given a genetic map for a single chromosome in maize. With the new data set (Chang et al., 2007), the tool can also be used for queries related to tomato.
U Padua PABS (Platform Assisted BAC-by-BAC Sequencing)
The Platform Assisted BAC-by-BAC Sequencing pipeline (Todesco et al., 2008) is an informatics pipeline to optimize BAC-by-BAC sequencing projects.
An Italian SOLAnaceae genomics resource, ISOL@ (http://biosrv.cab.unina.it/isola/ [verified 16 Jan. 2009]), was designed to provide full Web access to details of the genome annotation based on experimental evidence as derived from EST–full-length cDNA sequences (Chiusano et al., 2008).
Summary and Outlook
Recently, the Tomato Genome Sequencing Project has made highly significant progress toward its goal of sequencing 220 Mb of euchromatin space of the tomato genome, which has been predicted to contain the majority of tomato genes. In total, more than 950 BACs have been sequenced, representing over one-third of the targeted genome space. Sequences are being deposited at GenBank (http://www.ncbi.nlm.nih.gov/sites/entrez?Db=genomeprj&cmd=ShowDetailView&TermToSearch=9509 [verified 5 Feb. 2009]) and the SGN database (http://sgn.cornell.edu/), and are being annotated using a pipeline established by an international group (ITAG) of bioinformatics centers. A number of tools have been created that allow both researchers and tomato breeders to work with the emerging sequence. Through the extensive comparative maps that are available, much of the information from the tomato sequence can readily be transferred to other Solanaceae and related asterids such as coffee (Coffea canephora L.) (Gentianales, Rubiaceae) or mint (Mentha) (Lamiales, Lamiaceae).
A BAC-by-BAC sequencing approach was chosen to sequence the tomato genome because it provides the highest possible sequence quality. However, since the project was started, novel “next generation” sequencing technologies have become available that are now being applied to WGS sequencing for complex genomes. The BAC-by-BAC approach has inherent advantages, and yields insights beyond sequence space as the approach is based on careful evaluation of BAC positions by genetic mapping and by FISH. For example, several inversions could be identified between the cultivated tomato and its wild relative parent used in the reference map (Tang et al., 2008). The main drawback of the BAC-by-BAC approach is that it is relatively more expensive and slower than the WGS approach. Recently, the grape genome was sequenced using a shotgun approach, resulting in >2000 unordered contigs. However, it was estimated that >95% of grape gene sequences were recovered in the sequence (Velasco et al., 2007). Thus, in the future, a hybrid approach for sequencing the tomato genome will be pursued by using WGS as an additional resource for finishing the euchromatic part of the genome and for obtaining sequence for the heterochromatic part of the genome.
A preliminary annotation of about 11% of the total assembled euchromatic space of tomato gives a gene density of one gene per 6 kb, which corresponds to an extrapolated gene count of just over 40,000 genes for the entire euchromatin, consistent with previous estimates. Notably, certain well-known tomato genes have been recovered in the genome sequence, such as R-gene alleles at the Mi resistance locus, the fruit shape locus ovate, and the phytoene synthase 1 gene involved in carotenoid biosynthesis.
The tomato genome is repeat-rich, and analyses of BAC-end sequences, which sampled sequence from both the heterochromatin and euchromatin, revealed that about 70% of the sequence was masked and hence largely represents heterochromatin repeats. In full BAC sequences, which were biased toward euchromatin, only 24% of the sequence was repeat masked, confirming earlier results from FISH analyses that the repeat content of hetero- and euchromatic regions differs significantly.
In some chromosomes whose sequencing is advanced, difficulties were encountered in finding new seed BACs in the gap regions. A number of initiatives have been put in place to increase the number of seed BACs, such as additional screening of BAC library filters with markers not used in the overgo process, computational mapping of BAC ends to marker sequences, and mapping of BACs on tomato chromosomes using introgression lines (Eshed and Zamir, 1995). To find novel cleaved amplified polymorphic sequence (CAPS) markers, BACs were selected that contain open reading frames or unique sequences at their ends. In a preliminary screening of a set of 120 BACs, nearly 41% were successfully mapped to specific tomato chromosomes. The proposed procedure requires minimal cost and effort to generate new CAPS markers, and the identified BACs can be used directly for sequencing. The 200,000 fosmid-end sequences currently available have already proven to be extremely valuable for increasing the possibilities of extensions from other sequenced BACs.
Considerable synergies will be derived from the ongoing potato genome sequencing project. Potato, another important food staple in Solanum, is being sequenced by a separate but similarly structured consortium (http://www.potatogenome.net/ [verified 16 Jan. 2009]). The first sequences should be available this year. Within Solanum, tomato and potato are closely related: both are members of the same group of phylogenetically similar species, and only five major pericentromeric inversions have been observed between these two species (Tanksley et al., 1992). Because of their phylogenetic proximity, we expect that it will be possible to close sequence gaps in the tomato genome based on potato data and vice versa. The two projects have a good working relationship and regularly meet at the SOL genome workshops held once a year. All data related to the tomato genome sequencing project can be found on SGN (http://sgn.cornell.edu/) and BAC sequences are deposited to GenBank (http://www.ncbi.nlm.nih.gov/). We expect that the euchromatin sequence will be close to finished in 2010.
Data Availability and Sequencing Statistics
All data, including BAC and BAC-end sequences, chromatograms, assembly files, FISH localizations, overgo results, and mapping data are available on the SGN Web site (http://sgn.cornell.edu/). Sequence data are also available from GenBank (http://www.ncbi.nlm.nih.gov/). To track the progress of the project, a BAC registry database is run as a central resource on the SGN website. The sequencing teams have special log-in accounts that allow them to assign BACs to their projects and then adjust the status of each BAC in their sequencing pipeline. Based on this information, the summary statistics about project progress are calculated and displayed in real time on the International Tomato Sequencing Project overview page at http://sgn.cornell.edu/about/tomato_sequencing.pl [verified 16 Jan. 2009].
A comprehensive repeat database specific for tomato was generated by running RepeatScout (Price et al., 2005) on the BAC-end sequences of each library. The three different repeat collections (one per BAC library) were assembled into one library using the CAP3 program. The resulting set was assayed for repeat frequency in the entire BAC-end database, and repeats occurring fewer than 30 times were discarded. This set, referred to as the unirepeat set, was annotated using BLAST against different databases (The Institute for Genomic Research repeat set and the GenBank nonredundant database), and was used to assess repeat content in BAC-ends and in full BAC sequences.
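The frequency filter described above can be sketched in a few lines. This is a simplified illustration, not the pipeline that was actually run: it assumes that the matches of each repeat against the BAC-end database have already been collected as a list of repeat identifiers, and the identifiers and function names are hypothetical.

```python
from collections import Counter


def count_occurrences(hit_repeat_ids):
    """Count how often each repeat was found in the BAC-end database."""
    return Counter(hit_repeat_ids)


def filter_repeats(hit_counts, min_occurrences=30):
    """Keep repeats seen at least min_occurrences times; rarer ones are discarded."""
    return {rid: n for rid, n in hit_counts.items() if n >= min_occurrences}


# Hypothetical hit list: one entry per match of a repeat in a BAC-end read.
hits = ["rep_A"] * 500 + ["rep_B"] * 30 + ["rep_C"] * 29
counts = count_occurrences(hits)
unirepeat_set = filter_repeats(counts)

print(sorted(unirepeat_set))
```

The threshold is inclusive at 30, matching the statement that repeats occurring fewer than 30 times were discarded.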
ITAG Genome Annotation Pipeline
The ITAG annotation pipeline operates on batches of contigs composed of one or more BACs. These contigs are generated at SGN from the AGP files and the BAC sequences. Analyses such as repeat masking, EST alignment, and gene prediction using different gene finders such as GeneID (Parra et al., 2000), GeneMark (Isono et al., 1994), and Augustus (Stanke et al., 2008) are performed on these sequences. To generate a consensus annotation, these data are combined with homology to protein or genomic sequences from other species (BlastX, TblastX), and fed into the combiner software EuGene (Foissac et al., 2008). The resulting gene models are then functionally annotated based on homology searches (BlastP), protein domain searches (InterPro) (Mulder et al., 2003), and gene ontology assignment (Ashburner et al., 2000). Noncoding RNAs are identified using the Infernal program (Griffiths-Jones et al., 2003).
Estimation of Transcription Factors in Tomato Genome Using Expressed Sequence Tags
To search for putative TFs in the EST data sets of Solanum lycopersicum, the assembled ESTs from PlantGDB (version 161a, September 2007 release; 257,093 ESTs assembled into 48,945 PUTs) were downloaded and translated using ESTScan-3.0.2 (Iseli et al., 1999). These translated PUTs were categorized into TF gene families based on the classification process defined by two plant transcription factor databases: PlnTFDB (Riano-Pachon et al., 2007) and PlantTFDB (http://planttfdb.cbi.pku.edu.cn/ [verified 16 Jan. 2009]). A list of domains necessary for classifying a TF into a particular gene family was prepared and the available HMM profiles from PFAM (v22.0 [Finn et al., 2008]) were downloaded. The HMM profiles for the remaining domains were created using the protein alignments available at PlnTFDB. HMMER searches (http://hmmer.janelia.org/ [verified 16 Jan. 2009]) were performed on translated PUTs using HMM profiles, and hits having E-values of ≤10⁻² were selected. Further, these putative TFs were localized on 559 tomato BACs (finished and unfinished BAC sequences downloaded from SGN [bacsv205]) by performing BLASTN with selection criteria of ≥90% identity and ≥80% length coverage.
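The BLASTN selection criteria can be illustrated with a small filter over tabular BLAST output. This is a minimal sketch under stated assumptions: it expects the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score), it interprets "length coverage" as the aligned fraction of the query PUT, and it takes query lengths from a separately supplied dictionary. The function name and the example identifiers are hypothetical.

```python
import csv
import io


def filter_hits(tabular_text, query_lengths, min_identity=90.0, min_coverage=0.80):
    """Keep hits with >= min_identity percent identity covering >= min_coverage of the query."""
    kept = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid = row[0], row[1]
        pident = float(row[2])
        qstart, qend = int(row[6]), int(row[7])
        # Assumption: coverage is measured on the query (PUT) sequence.
        aligned = abs(qend - qstart) + 1
        coverage = aligned / query_lengths[qseqid]
        if pident >= min_identity and coverage >= min_coverage:
            kept.append((qseqid, sseqid, pident, round(coverage, 3)))
    return kept


# Hypothetical example: three PUTs of 1000 nt each aligned to three BACs.
example = (
    "PUT1\tBAC_A\t95.0\t900\t45\t0\t1\t900\t100\t999\t0.0\t1500\n"
    "PUT2\tBAC_B\t88.0\t950\t114\t0\t1\t950\t50\t999\t0.0\t1200\n"
    "PUT3\tBAC_C\t99.0\t300\t3\t0\t1\t300\t10\t309\t1e-150\t540\n"
)
lengths = {"PUT1": 1000, "PUT2": 1000, "PUT3": 1000}
hits = filter_hits(example, lengths)
print(hits)
```

In this example only PUT1 passes: PUT2 falls below 90% identity and PUT3 covers only 30% of its length. Spliced transcripts aligned to genomic sequence usually produce several separate alignment segments per gene, so a real analysis would need to combine segments before applying a coverage threshold; that step is omitted here.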
Analysis of Resistance and Defense-Response-Like Genes
We analyzed the 48,945 PUT sequences of tomato downloaded from PlantGDB (Duvick et al., 2008). All the PUTs were used for a BLASTX search against the NCBI nonredundant database (http://www.ncbi.nlm.nih.gov/), and the top hits of all the genes were extracted in tabulated form. Genes showing homology to the three major classes of R-genes mentioned above (NBS-LRR type, LZ-NBS-LRR type, and LRR-Tm type), together with other putative resistance proteins and defense-response genes, making five categories in total, were tabulated in Microsoft Excel (Microsoft, Redmond, WA) format. These R-gene and defense-response gene homologs were then mapped in silico on 754 sequenced BACs of the respective chromosomes to find their physical locations.