First, the user is guided to download, parse
and import NCBI's Gene database entries into GeneBase.
GeneBase 1.1
contains three correlated tables: "Gene_Summary" collects details about each gene, such
as the official gene symbol, the official gene full name,
the organism name and a brief description of the gene; "Gene_Table" consists of one record for each exon
including the corresponding intron (if an intron follows
that exon), representing the exon/intron structure of each
transcript isoform; "Gene_Ontology" contains specific Gene Ontology
labels, codes and terms for each gene, when available.
In
addition, a table named "Reports" is generated to
provide statistics such as the mean lengths of exons and
introns. A table named "Transcripts"
shows a set of useful fields from the "Gene_Summary" and "Gene_Table" tables
to give an overview of the main information available for each
transcript. Finally, a table named "Genes"
shows a set of useful fields from the same two tables
to give an overview of the main information available
for each gene. Here, by arbitrary choice, only the transcript isoform with the
highest number of exons is shown for each gene.
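As an illustration of the selection rule used for the "Genes" table, the following minimal sketch picks, for each gene, the transcript isoform with the most exons. It is not GeneBase code: the record layout, the gene and accession values, and the function name `representative_transcripts` are hypothetical, and the tie-breaking rule (first isoform encountered wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter

# Hypothetical exon records: (gene_id, transcript_accession, exon_number).
# One record per exon, mirroring the "one record for each exon" layout.
exon_records = [
    ("GENE_A", "NM_0001", 1),
    ("GENE_A", "NM_0001", 2),
    ("GENE_A", "NM_0002", 1),
    ("GENE_A", "NM_0002", 2),
    ("GENE_A", "NM_0002", 3),
    ("GENE_B", "NR_0003", 1),
]


def representative_transcripts(records):
    """Return {gene_id: (transcript_accession, exon_count)} for the
    isoform with the most exons per gene.

    Assumption: on a tie, the isoform seen first in the input is kept.
    """
    # Count exon records per (gene, transcript) pair.
    exon_counts = Counter((gene, tx) for gene, tx, _ in records)

    best = {}
    for (gene, tx), n_exons in exon_counts.items():
        # Strict ">" keeps the first isoform when counts are equal.
        if gene not in best or n_exons > best[gene][1]:
            best[gene] = (tx, n_exons)
    return best


result = representative_transcripts(exon_records)
```

With the sample records above, `result` maps GENE_A to its three-exon isoform and GENE_B to its only isoform.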
Furthermore, following the download of the chromosome
sequences from NCBI Nucleotide database, the user can
extract and import exon and intron sequences.
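The idea of extracting exon and intron sequences from a chromosome sequence can be sketched as follows. This is only an illustration under stated assumptions, not the GeneBase implementation: coordinates are assumed to be 1-based and inclusive, features on the minus strand are assumed to be reported as the reverse complement of the chromosome slice, and the names `extract_feature` and `reverse_complement` as well as the toy sequence are hypothetical.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_feature(chrom_seq, start, end, strand="+"):
    """Slice an exon or intron out of a chromosome sequence.

    Assumptions: start/end are 1-based inclusive coordinates on the
    plus strand; minus-strand features are reverse-complemented.
    """
    if start < 1 or end < start or end > len(chrom_seq):
        raise ValueError("coordinates outside the chromosome sequence")
    sub = chrom_seq[start - 1:end]
    return reverse_complement(sub) if strand == "-" else sub


# Toy 16-base "chromosome" with one exon followed by its downstream intron.
chromosome = "AAACCCGGGTTTACGT"
exon_seq = extract_feature(chromosome, 1, 6)          # first six bases
intron_seq = extract_feature(chromosome, 7, 12)       # next six bases
minus_exon = extract_feature(chromosome, 1, 6, "-")   # same span, minus strand
```

In a real setting the chromosome string would come from the downloaded NCBI Nucleotide record, and the coordinates from the exon/intron records of each transcript.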
Each software table presents a box showing
related fields from the other software
tables, which makes cross-table
searches possible. A sample screenshot of the "Gene_Table" software section shows the
exon/intron structure of each transcript
isoform, with the corresponding sequences and
related Gene Ontology categories.
Information
calculated specifically by GeneBase 1.1, and not
available in NCBI's Gene database, is highlighted in red.
Human database construction
We obtained 59,801 entries by downloading all current, live
human records with a genomic gene source that were available
in NCBI Gene up to January 19th, 2016.
Following the initial parsing and importing steps (described in the tutorial), the
three main tables in the GeneBase 1.1 database are populated
as follows. "Gene_Summary" contains 59,801 records (one for
each NCBI Gene entry). "Gene_Table" contains 1,502,237
records (one record for each gene exon, including the
downstream intron if an intron follows that exon),
corresponding to 40,942 genes with 136,694 transcripts in
total (equal to the "Transcripts" table record number),
excluding genes without annotated transcribed products.
"Gene_Ontology" contains 18,726 records in all, one for each
gene with Gene Ontology information available.
To integrate exon and intron nucleotide sequences,
we selected only entries with the "REVIEWED" or "VALIDATED" RefSeq
status and an "NM_" or "NR_" type of RefSeq RNA accession
number, thereby excluding the "XM_" and "XR_" model RefSeq
records generated by automated pipelines.
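The selection criterion described above can be expressed as a simple predicate. The sketch below is illustrative only: the function name `is_curated_transcript`, the entry layout and the sample accession strings are hypothetical, and it assumes that the RefSeq status and the RNA accession are available as plain strings for each entry.

```python
ACCEPTED_STATUSES = {"REVIEWED", "VALIDATED"}
ACCEPTED_PREFIXES = ("NM_", "NR_")  # curated mRNA and non-coding RNA records


def is_curated_transcript(refseq_status, rna_accession):
    """Return True if an entry passes both selection criteria:
    an accepted RefSeq status and a curated RNA accession prefix.
    Model records ("XM_", "XR_") fail the prefix check.
    """
    return (
        refseq_status.strip().upper() in ACCEPTED_STATUSES
        and rna_accession.startswith(ACCEPTED_PREFIXES)
    )


# Hypothetical (status, accession) pairs for demonstration.
entries = [
    ("REVIEWED", "NM_000001.1"),
    ("VALIDATED", "NR_000002.1"),
    ("MODEL", "XM_000003.1"),
    ("PROVISIONAL", "NM_000004.1"),
    ("REVIEWED", "XR_000005.1"),
]

selected = [acc for status, acc in entries if is_curated_transcript(status, acc)]
```

Only the first two sample entries are kept; the others fail either the status check or the accession-prefix check.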
After the chromosome sequence download, parsing and
importing steps (described in the tutorial),
a total of 459,868 "Gene_Table" records were updated with
exon, coding exon (for protein-coding transcript isoforms)
and the corresponding downstream intron sequences up to
January 26th, 2016. The complete, fully indexed database,
including sequences, occupies 6.43 gigabytes after
decompression.
The whole process, including data
import and processing, took about 2 days to
complete.
Tutorial
This tutorial
guides the user step by step through
setting up and using the software for the analysis of any
organism.