TriFLDB is a database of the full-length coding sequences currently available for wheat and barley, together with their annotations. The current version of TriFLDB houses 7341 putative full-length coding sequences with annotations for barley and 8530 for wheat. These include data derived from 5006 barley full-length cDNA sequences retrieved from BarleyDB of the Research Institute for Bioresources, Okayama University, and from 6162 wheat full-length cDNA sequences (3.2 MB) (6137 in the filtered putative FLCDS set) that were newly sequenced at RIKEN and the Kihara Institute for Biological Research, Yokohama City University, with support from the National Bio Resource Project in Japan.
TriFLDB provides a keyword-based search interface covering gene function and related Gene Ontology (GO) terms, and a homology search for DNA and deduced amino acid sequences, so that users can browse the annotation of each Triticeae full-length coding sequence (TriFLCDS). The annotation of TriFLCDS is based on homology search results against several protein databases and against clustered and representative cDNA databases of wheat and barley; domain organization predicted by InterProScan; hierarchical clustering of homologous sequences with the proteome datasets of rice, sorghum, and Arabidopsis; and homology mapping onto the rice and sorghum genome sequences.
Growth of Full-Length CDS collection in Triticeae
- Full-length CDS prediction
Full-length CDS prediction in TriFLDB is carried out by two methods: identifying the longest open reading frame (ORF) and using DECODER (Fukunishi and Hayashizaki 2001). The prediction results for each full-length cDNA are provided, and an overview of the predictions is shown on the statistics page.
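The longest-ORF approach can be illustrated with a minimal sketch. This is not the TriFLDB pipeline itself: it assumes that an ORF runs from an ATG to the first in-frame stop codon, that only the forward strand of the cDNA is scanned, and that the standard stop codons apply. The function name longest_orf is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
START_CODON = "ATG"


def longest_orf(seq):
    """Return (start, end, frame) of the longest ATG..stop ORF, or None.

    Coordinates are 0-based, end is exclusive and includes the stop codon.
    Only the three forward-strand frames are scanned.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end, frame)
                start = None  # look for the next ORF in this frame
    return best


if __name__ == "__main__":
    cdna = "GGATGAAACCCTAGTT"
    hit = longest_orf(cdna)
    if hit is not None:
        start, end, frame = hit
        print(frame, cdna[start:end])
```

ORFs that lack an in-frame stop codon are ignored in this sketch; a real pipeline would need a policy for such truncated sequences.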
- Similarity search results
Similarity searches against a range of databases were carried out to predict the gene function of TriFLCDS. The following databases were used: the nonredundant (nr) protein database of NCBI and UniProt/TrEMBL of EBI as representative protein databases; the Rice Annotation Project Database (RAP-DB) and The Institute for Genomic Research (TIGR) rice database as annotated rice protein datasets; protein data of predicted genes in the Sorghum genome from the Joint Genome Institute (JGI); The Arabidopsis Information Resource (TAIR) as the annotated Arabidopsis protein dataset; and cDNA sequences of barley and wheat from UniGene, TIGR GI, Plant-Genome database (GDB), and HarvEST as clustered or representative cDNA sequences. To support the predicted reading frame of each cDNA, similarity searches were performed against the protein databases using both BLASTX, with nucleotide sequences as queries, and BLASTP, with the corresponding translated protein sequences as queries.
- Hierarchical protein cluster
For phylogenetic insights, the predicted protein sequences of TriFLCDS entries were hierarchically clustered with the proteome datasets of other plants, i.e., Arabidopsis (TAIR7), rice (RAP-DB), and Sorghum (JGI), on the basis of amino acid identity using the Cd-Hit package. The Cd-Hit clustering was carried out hierarchically, lowering the identity threshold by 10% at each step: from 100% to 40% global amino acid identity, followed by 30% local identity. The "Hierarchically Clustered Protein Viewer" allows users to find homologous counterparts at hierarchical identity thresholds of 90%, 60%, and 30%. Furthermore, within each hierarchical cluster, the clustered proteins and the protein sequences of the TriFLCDS entries were aligned using ClustalW.
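The stepwise thresholds described above can be sketched as follows. This is a toy illustration of hierarchical, greedy incremental clustering, not a reimplementation of Cd-Hit: the identity measure is a naive position-by-position comparison without alignment, it does not distinguish global from local identity, and all function names are hypothetical.

```python
# Thresholds from the text: 100% down to 40% in 10% steps, then 30%.
THRESHOLDS = list(range(100, 30, -10)) + [30]


def toy_identity(a, b):
    """Percent of matching positions over the longer sequence (no alignment)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / longest


def greedy_cluster(seqs, threshold):
    """Longest sequence first; each sequence joins the first representative
    it matches at or above the threshold, else it founds a new cluster."""
    clusters = {}  # representative -> list of members
    for s in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if toy_identity(rep, s) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


def hierarchical_cluster(seqs, thresholds=THRESHOLDS):
    """Recluster the representatives of each level at the next, lower threshold."""
    levels = {}
    reps = list(seqs)
    for t in thresholds:
        clusters = greedy_cluster(reps, t)
        levels[t] = clusters
        reps = list(clusters)  # only representatives move to the next level
    return levels


if __name__ == "__main__":
    proteins = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAATTTTT", "CCCCCCCCCC"]
    for level, clusters in hierarchical_cluster(proteins, [100, 90, 50]).items():
        print(level, len(clusters))
```

Passing only the representatives to the next level keeps each step small, which is the main reason for clustering hierarchically rather than running every threshold on the full dataset.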
- Protein domain organization and GeneOntology assignment
Protein domain searches for TriFLCDS entries were carried out using the iprscan program. GO terms were also assigned to each entry on the basis of the iprscan search results.
NCBI WWW BLAST has been implemented on the TriFLDB server as TriFL-BLAST. TriFL-BLAST provides the nucleotide sequences of TriFLCDS entries and the protein sequences deduced from them, as well as the proteome data of Arabidopsis (TAIR7), rice (RAP-DB), and Sorghum (JGI, ver. 1.4), as BLAST search databases.
Homology mapping onto the rice and sorghum genome
To predict the gene structure of each TriFLCDS, SIM4 is used to map the coding sequences onto the rice and sorghum genome sequences. The predicted exon-intron structures are shown together with the rice gene annotations in RAP-DB (http://rapdb.dna.affrc.go.jp/) and the Sorghum gene annotations from JGI (http://genome.jgi-psf.org/Sorbi1/Sorbi1.download.html). Exon-intron structures and associated rice genome annotation data are displayed using the Generic Genome Browser (GBrowse).
EST assembly with TriFLCDS
Wheat and barley ESTs currently available in dbEST were associated with the TriFLCDS entries on the basis of nucleotide identity using BLAST searches. The EST alignments for each TriFLCDS are shown in a contig alignment browser.
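As a rough illustration of how ESTs might be grouped under TriFLCDS entries from BLAST results, the sketch below reads BLAST tabular output (12 tab-separated columns: query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score) and assigns each EST to its best-scoring TriFLCDS. The 95% identity cutoff, the sequence identifiers, and the function name assign_ests are assumptions for illustration; the criteria actually used by TriFLDB are not stated here.

```python
from collections import defaultdict


def assign_ests(blast_tab_lines, min_identity=95.0):
    """Group EST ids under the TriFLCDS id of their best hit.

    min_identity is a hypothetical cutoff, not a value taken from TriFLDB.
    """
    best = {}  # EST id -> (CDS id, bit score)
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        cols = line.rstrip("\n").split("\t")
        est, cds = cols[0], cols[1]
        pident, bitscore = float(cols[2]), float(cols[11])
        if pident < min_identity:
            continue
        if est not in best or bitscore > best[est][1]:
            best[est] = (cds, bitscore)
    groups = defaultdict(list)
    for est, (cds, _) in best.items():
        groups[cds].append(est)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical identifiers and hit values, for illustration only.
    rows = [
        ["EST1", "CDS_A", "99.0", "500", "5", "0", "1", "500", "1", "500", "0.0", "900"],
        ["EST1", "CDS_B", "96.0", "400", "16", "0", "1", "400", "1", "400", "0.0", "650"],
        ["EST2", "CDS_B", "90.0", "300", "30", "0", "1", "300", "1", "300", "1e-90", "330"],
    ]
    print(assign_ests("\t".join(r) for r in rows))
```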
TriFLDB: A Database of Clustered Full-Length Coding Sequences from Triticeae with Applications to Comparative Grass Genomics.
Keiichi Mochida, Takuhiro Yoshida, Tetsuya Sakurai, Yasunari Ogihara, and Kazuo Shinozaki
Plant Physiol. 2009 May 15