GeneBase
Version 1.1 (2016)


Definition

"GeneBase" is a fully structured local database with a simple graphic interface for personal computers which allows users to do original calculations and searches for any information about eukaryotic genes annotated in the National Center for Biotechnology Information's (NCBI) Gene database.

GeneBase 1.1 Database Design Report

Download

Pre-loaded versions of GeneBase 1.1 filled with human data and sequences*:
    Macintosh
    Windows

Pre-loaded versions of GeneBase 1.1 filled with only human data (sequences excluded):
    Macintosh
    Windows

Empty (template) version of GeneBase 1.1 with Python scripts for parsing NCBI Gene entries and NCBI Nucleotide sequences included:
    Macintosh
    Windows

*Because the sequences are indexed to speed up sequence searches in "Gene_Table", this version is slower than the version with sequences excluded when making the summary calculations shown in the related "Report" table. Sequences are not needed to calculate this summary; users interested only in gene and exon/intron number and length statistics should therefore prefer the pre-loaded versions of GeneBase 1.1 filled with only human data (sequences excluded).

Description of the main steps of the analysis

First, the user is guided to download, parse and import NCBI Gene database entries into GeneBase.
GeneBase 1.1 contains three related tables: "Gene_Summary" collects details about each gene, such as the official gene symbol, the official gene full name, the organism name and a brief description of the gene; "Gene_Table" consists of one record for each exon, including the corresponding intron (if an intron follows that exon), and so represents the exon/intron structure of each transcript isoform; "Gene_Ontology" contains specific Gene Ontology labels, codes and terms for each gene, when available.
In addition, a table named "Reports" is generated to provide statistics such as the mean lengths of exons and introns. A table named "Transcripts" shows a set of useful fields from the "Gene_Summary" and "Gene_Table" tables, giving an overview of the main information available for each transcript. Finally, a table named "Genes" shows a set of useful fields from the same two tables, giving an overview of the main information available for each gene. In this table, only the transcript isoform with the highest number of exons is shown for each gene; this choice is arbitrary.
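
The two derived views described above (mean exon and intron lengths, and one representative transcript per gene) can be illustrated with a minimal Python sketch. It does not reproduce GeneBase's internal implementation: the record layout, the field names (gene, transcript, exon_len, intron_len) and the sample values are assumptions made only for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical "Gene_Table"-like records: one per exon, with the length of the
# downstream intron when an intron follows that exon (None for the last exon).
gene_table = [
    {"gene": "GENE_A", "transcript": "T1", "exon_len": 100, "intron_len": 500},
    {"gene": "GENE_A", "transcript": "T1", "exon_len": 200, "intron_len": None},
    {"gene": "GENE_A", "transcript": "T2", "exon_len": 100, "intron_len": 300},
    {"gene": "GENE_A", "transcript": "T2", "exon_len": 50, "intron_len": 700},
    {"gene": "GENE_A", "transcript": "T2", "exon_len": 150, "intron_len": None},
    {"gene": "GENE_B", "transcript": "T3", "exon_len": 400, "intron_len": None},
]


def length_report(records):
    """Return mean exon and intron lengths over all records."""
    exon_lengths = [r["exon_len"] for r in records]
    intron_lengths = [r["intron_len"] for r in records if r["intron_len"] is not None]
    return {
        "mean_exon_len": mean(exon_lengths),
        "mean_intron_len": mean(intron_lengths) if intron_lengths else None,
    }


def representative_transcripts(records):
    """For each gene, pick the transcript with the most exon records.

    Ties are resolved in favour of the transcript seen first.
    """
    exon_counts = defaultdict(lambda: defaultdict(int))
    for r in records:
        exon_counts[r["gene"]][r["transcript"]] += 1
    return {gene: max(counts, key=counts.get) for gene, counts in exon_counts.items()}


report = length_report(gene_table)
representatives = representative_transcripts(gene_table)
```

With the sample records, GENE_A is represented by T2 (three exons) rather than T1 (two exons), mirroring the arbitrary "highest number of exons" rule used for the "Genes" table.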

Furthermore, after downloading the chromosome sequences from the NCBI Nucleotide database, the user can extract and import exon and intron sequences.
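
As an illustration of what extracting exon and intron sequences from a chromosome sequence involves, here is a minimal Python sketch. It is not the parsing script distributed with GeneBase. It assumes 1-based inclusive exon coordinates listed in transcript order, and the function names and the toy chromosome string are hypothetical.

```python
# Translation table for complementing DNA bases (N is left unchanged).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def exon_and_intron_sequences(chrom_seq, exons, strand="+"):
    """Return (exon, downstream_intron) pairs in transcript order.

    exons: list of (start, end) chromosome coordinates, 1-based inclusive,
    given in transcript order (descending positions on the minus strand).
    The intron is None for the last exon.
    """
    results = []
    for i, (start, end) in enumerate(exons):
        exon = chrom_seq[start - 1:end]
        intron = None
        if i + 1 < len(exons):
            next_start, next_end = exons[i + 1]
            if strand == "+":
                # Intron lies between this exon's end and the next exon's start.
                intron = chrom_seq[end:next_start - 1]
            else:
                # On the minus strand the next exon lies at lower coordinates.
                intron = chrom_seq[next_end:start - 1]
        if strand == "-":
            exon = reverse_complement(exon)
            if intron is not None:
                intron = reverse_complement(intron)
        results.append((exon, intron))
    return results


toy_chromosome = "AAACCCGGGTTT"
plus_strand = exon_and_intron_sequences(toy_chromosome, [(1, 3), (7, 9)], "+")
minus_strand = exon_and_intron_sequences(toy_chromosome, [(7, 9), (1, 3)], "-")
```

Each returned pair corresponds to one "Gene_Table"-style record: an exon together with the intron that follows it, if any.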

Flowchart.png

Each table in the software includes a box showing related fields from the other tables, which makes cross-table searches possible. The sample screenshot below shows the "Gene_Table" section, which represents the exon/intron structure of each transcript isoform, with the corresponding sequences and related Gene Ontology categories:

Figures/Sample.png

Useful information specifically calculated by GeneBase 1.1, which is not available in NCBI's Gene database, is highlighted in red.

Human database construction


We obtained 59,801 entries by downloading all current live human records with a genomic gene source available in NCBI Gene up to January 19th, 2016.
Following the initial parsing and importing steps (described in the tutorial), the three main tables in the GeneBase 1.1 database are constituted as follows: "Gene_Summary" contains 59,801 records (one for each NCBI Gene entry). "Gene_Table" contains 1,502,237 records (one record for each gene exon, including the downstream intron if an intron follows that exon), corresponding to 40,942 genes with 136,694 transcripts in total (equal to the "Transcripts" table record number), excluding genes without annotated transcribed products. "Gene_Ontology" contains 18,726 records in all, one for each gene with Gene Ontology information available.
To integrate exon and intron nucleotide sequences, only entries with the "REVIEWED" or "VALIDATED" RefSeq status and an "NM_" or "NR_" type of RefSeq RNA accession number were selected; this excludes "XM_" and "XR_" model RefSeq records generated by automated pipelines. After the chromosome sequence download, parsing and importing steps (described in the tutorial), a total of 459,868 "Gene_Table" records had been updated with exon, coding exon (for protein-coding transcript isoforms) and the corresponding downstream intron sequences as of January 26th, 2016. The whole, fully indexed database including sequences has a size of 6.43 gigabytes following decompression.
The whole process, including data import and processing, required about 2 days for completion.
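
The selection rule used for sequence integration (RefSeq status "REVIEWED" or "VALIDATED", and an "NM_" or "NR_" RNA accession) can be expressed as a simple predicate. The sketch below is only an illustration of that rule; the function name and the sample accession strings are hypothetical placeholders, not real records.

```python
# Status values and accession prefixes named in the text above.
CURATED_STATUSES = {"REVIEWED", "VALIDATED"}
CURATED_PREFIXES = ("NM_", "NR_")


def is_selected_for_sequences(refseq_status, rna_accession):
    """Return True if a transcript passes the status and accession filter."""
    return (
        refseq_status.upper() in CURATED_STATUSES
        and rna_accession.startswith(CURATED_PREFIXES)
    )


# Placeholder (status, accession) pairs, for illustration only.
candidates = [
    ("REVIEWED", "NM_000000.1"),
    ("VALIDATED", "NR_000000.1"),
    ("MODEL", "XM_000000.1"),
    ("PROVISIONAL", "NM_000001.1"),
]
selected = [acc for status, acc in candidates if is_selected_for_sequences(status, acc)]
```

Both conditions must hold: an "NM_" accession with a status other than the two listed is excluded, as are all "XM_" and "XR_" model records.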

Tutorial

This tutorial guides the user step by step through setting up and using the software for the analysis of any organism.