“
parse_entrezgene.py” and “
calculate_sequences.py”
are two useful script files for parsing NCBI Gene files and
NCBI Nucleotide sequences to create the
GeneBase
database.
Requirements:
any operating system (Linux, Mac OS X,
Windows, ...) with
Python 2.7 and IDLE.
In the “Applications” folder, verify if
Python 2.7 is
already present.
If not, download
Python 2.7 from
http://www.python.org/downloads/.
(IDLE is included in this installer). Install the downloaded
version of Python as usual.
For example, for Mac users:
Double-click to open the downloaded “.dmg” file. The window
represented in the following figure will open automatically:
double-click on the “Python.mpkg” file to install Python.
When the Python 2.7 appears in the “Applications” folder, the
system is properly configured.
Section A - parse_entrezgene.py
1. Go to the GeneBase folder.
2. If not already present, save a copy of the script file of
interest in this folder.
Please note that, in order to correctly
execute the script, you need to have the following files in
the same folder:
- parse_entrezgene.py;
- one or more files downloaded from the NCBI Gene website
(usually automatically named gene_result.txt,
gene_result(1).txt and so forth if more than one).
3. Double-click on the “parse_entrezgene.py” script file and
two windows will appear: the “Python Shell” and “IDLE”.
4. To execute the script, keeping the “parse_entrezgene.py”
window active, select “Run Module”
from the “Run” menu.
5. The programme is finished when the message “XXX gene results
processed” appears in the “Python Shell”, where XXX is the
number of NCBI's Gene entries downloaded in the first step of
GeneBase tutorial.
In the GeneBase folder you will obtain three tab-delimited
files: “
gene_ontology.txt”, “
gene_summary.txt” and
“
gene_table.txt”.
To import these files into GeneBase software, please go back to
GeneBase tutorial step 1.3.
Section B -
calculate_sequences.py
1. Go to the GeneBase folder.
2. If not already present, save a copy of the script file of
interest in this folder.
Please note that, in order to correctly
execute the script, you need to have the following files in
the same folder:
- calculate_sequences.py;
- exon_intron.txt;
- FASTA file(s) with the downloaded chromosome sequences, in
this example sequence.fasta and
sequence(1).fasta;
- file_list.txt (with a list of FASTA file names).
3. Double-click on the “calculate_sequences.py” script file
and two windows will appear: the “Python Shell” and “IDLE”.
4. To execute the script, keeping the “calculate_sequences.py”
window active, select “Run Module”
from the “Run” menu.
5. The programme is finished when the message
“Exon and intron sequences calculated” appears in the “Python
Shell” (this could take some hours).
In the GeneBase folder you will obtain a tab-delimited file
named “
exon_intron_seq.tab”.
To import this file into GeneBase software, please go back to
GeneBase tutorial step 2.3.