First, the user is guided to download, parse
and import NCBI's Gene database entries into GeneBase.
GeneBase contains
three correlated tables: "Gene_Summary" collects details about each gene, such
as the official gene symbol, the official gene full name,
the organism name and a brief description of the gene; "Gene_Table" consists of one record for each exon
including the corresponding intron (if an intron follows
that exon), representing the exon/intron structure of each
transcript isoform; "Gene_Ontology" contains specific Gene Ontology
labels, codes and terms for each gene, when available.
In
addition, a table named "Reports" is generated to
provide statistics such as the mean lengths of exons and
introns.
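As an illustration of the record layout and of the kind of statistic stored in "Reports", the following is a minimal sketch, not GeneBase's actual implementation. The class name `ExonRecord`, its field names and the 1-based inclusive coordinate convention are assumptions made for this example only.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Optional


@dataclass
class ExonRecord:
    """One hypothetical "Gene_Table" row: an exon plus the intron that follows it."""
    gene_id: int
    transcript: str
    exon_start: int                      # assumed 1-based, inclusive
    exon_end: int
    intron_start: Optional[int] = None   # None when no intron follows the exon
    intron_end: Optional[int] = None


def span_length(start: int, end: int) -> int:
    """Length of an inclusive interval, regardless of coordinate order."""
    return abs(end - start) + 1


def mean_lengths(records: Iterable[ExonRecord]) -> dict:
    """Mean exon and intron lengths over a set of records."""
    records = list(records)
    exons = [span_length(r.exon_start, r.exon_end) for r in records]
    introns = [
        span_length(r.intron_start, r.intron_end)
        for r in records
        if r.intron_start is not None and r.intron_end is not None
    ]
    return {
        "mean_exon_length": mean(exons) if exons else None,
        "mean_intron_length": mean(introns) if introns else None,
    }


# Toy data: two exons of one transcript; only the first is followed by an intron.
toy = [
    ExonRecord(1, "NM_toy", 100, 199, 200, 499),
    ExonRecord(1, "NM_toy", 500, 599),
]
report = mean_lengths(toy)
```

Storing the downstream intron in the same record as its exon means the last exon of each transcript simply has empty intron fields, which is why the intron mean skips those rows.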
Furthermore, following the download of the chromosome
sequences from NCBI Nucleotide database, the user can
extract and import exon and intron sequences.
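The extraction step can be pictured as slicing a chromosome sequence by the exon or intron coordinates. The sketch below assumes 1-based inclusive coordinates and a "+"/"-" strand flag; the function names are hypothetical and the conventions actually used by GeneBase are not stated here.

```python
# Translation table for complementing DNA bases (N is left unchanged).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(chromosome: str, start: int, end: int, strand: str = "+") -> str:
    """Slice a region given 1-based inclusive coordinates (an assumption).

    For features on the minus strand, the reverse complement is returned.
    """
    if start > end:
        start, end = end, start
    region = chromosome[start - 1:end]
    return reverse_complement(region) if strand == "-" else region


# Toy chromosome used only to show the call pattern.
toy_chromosome = "AACCGGTTAC"
plus_exon = extract_region(toy_chromosome, 3, 6)
minus_exon = extract_region(toy_chromosome, 7, 10, strand="-")
```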
Each table view includes a box showing
related fields from the other tables,
which makes it possible to perform cross-table
searches.
Figure: sample screenshot of the "Gene_Table" section, showing the
exon/intron structure of each transcript
isoform with the corresponding sequences and
related Gene Ontology categories.
Information
calculated by GeneBase that is not available
in NCBI's Gene database is highlighted in red.
Database construction
We downloaded from NCBI Gene all current (alive/live)
eukaryotic records with a genomic gene source (excluding
gene models) available up to April 22nd, 2015. We obtained
679,451 entries for Animalia (Metazoa), 1,203,082 for Fungi
and 534,875 for Plants (Viridiplantae), for a total of 359
organisms. Among the 2,417,408 total gene entries, 76,182
are "Reviewed", 41,862 "Validated", 2,245,205 "Provisional",
31,464 "Inferred", 22,691 "Predicted", 1 "Model" and 1
"Withdrawn" (despite the gene model exclusion performed
using the web search described in the tutorial).
After the initial parsing and importing steps, the three
main tables in the GeneBase database are populated as follows:
"Gene_Summary" contains 2,417,408 records (one for each NCBI
Gene entry), "Gene_Table" (Figure) contains 13,824,965
records (one record for each gene exon, including the
corresponding intron if an intron follows that exon) and
"Gene_Ontology" contains 149,064 records in all (one for
each gene with GO information available). Due to the lack of
annotated transcribed products, a gene structure was not
available for 86,824 Gene unique identifiers (UIDs).
Among the total gene entries, 2,368,726 are protein-coding genes,
25,796 are pseudogenes (pseudo), 21,247 are non-coding RNA (ncRNA) genes,
527 code for small nucleolar RNA (snoRNA), 137 for small
nuclear RNA (snRNA), 86 for ribosomal RNA (rRNA) and 6 for
small cytoplasmic RNA (scRNA); the gene type of the remaining
entries is not specified.
Then, in order to integrate nucleotide sequences, we selected from the
"Gene_Table" table of our database 861,550 records with
the "Validated" RefSeq status and 534,578 records with the "Reviewed" RefSeq
status (in both cases having an "NM_" or "NR_" type
of RefSeq RNA accession number, in order to exclude "XM_" or
"XR_" model RefSeq records generated by automated pipelines),
for a total of 1,396,128 exon entries. Using Batch Entrez we
were able to retrieve and download 1,336 records out of the
1,338 corresponding chromosome sequences. This selection
yielded a total of 1,385,944 "Gene_Table" records,
which represent about 10% of all available entries, updated with
exon, coding exon (for protein-coding genes) and
corresponding downstream intron sequences as of May 5th,
2015. The whole database, including sequences, has a size of
25.1 gigabytes after decompression.
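The selection criterion described above (curated RefSeq status combined with an "NM_" or "NR_" accession prefix) can be sketched as a simple filter. The dictionary keys `status` and `rna_accession` and the function names are hypothetical and only illustrate the logic; the counts in the final lines are those reported in the text.

```python
CURATED_STATUSES = {"REVIEWED", "VALIDATED"}
CURATED_PREFIXES = ("NM_", "NR_")  # curated mRNA / non-coding RNA accessions


def is_curated(status: str, rna_accession: str) -> bool:
    """True for Reviewed/Validated records with an NM_ or NR_ accession."""
    return (
        status.strip().upper() in CURATED_STATUSES
        and rna_accession.startswith(CURATED_PREFIXES)
    )


def select_curated(rows):
    """Keep only rows passing the curated filter (XM_/XR_ models are dropped)."""
    return [row for row in rows if is_curated(row["status"], row["rna_accession"])]


# Toy rows with made-up accessions, used only to show the filter.
toy_rows = [
    {"status": "Reviewed", "rna_accession": "NM_000001.1"},
    {"status": "Validated", "rna_accession": "NR_000002.1"},
    {"status": "Reviewed", "rna_accession": "XM_000003.1"},
    {"status": "Provisional", "rna_accession": "NM_000004.1"},
]
selected = select_curated(toy_rows)

# Arithmetic check on the figures reported in the text.
selected_total = 861_550 + 534_578
percent_with_sequences = round(100 * 1_385_944 / 13_824_965, 1)
```

The two arithmetic lines reproduce the reported total of 1,396,128 selected exon entries and the share of "Gene_Table" records that received sequences, which comes to roughly 10%.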
The whole process, including data
import and processing, took about 4 days to
complete; 1 additional day was required to obtain exon
and intron sequences.
Tutorial
The tutorial
guides the user step by step through
setting up and using the software for the analysis of any
organism.