ABCD steps for the use of 5'_ORF_Extender
A. Creation of a local RefSeq entries database using the RefSeq_Parser database table
1. Download the RefSeq text file of the desired species from:
ftp://ftp.ncbi.nih.gov/refseq/
(choose the "mRNA_Prot" folder, download the "gbff.gz" format file, and decompress it as usual).
2. Edit the downloaded file using the Unix commands "tr" and "awk". The file must be placed in the directory from which the commands are launched. These commands are included in Mac OS X and in most Unix-like systems, e.g. Linux.
Editing is performed using this instruction:
tr -ds "\n" "[:space:]" < gbff.txt | awk '{gsub ("//LOCUS", "\rLOCUS"); print $0;}' | tr -d "//\n" > out.txt
where "gbff.txt" should be replaced by the name of the downloaded RefSeq file, and "out.txt" by the name of the edited file produced as the output.
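To make the effect of this pipeline easier to check on a small test file, the following minimal Python sketch is intended to mirror its three stages: newline removal with squeezing of repeated whitespace, insertion of a carriage return between entries at each "//LOCUS" boundary, and removal of all slash characters. The function name flatten_genbank is hypothetical and is not part of 5'_ORF_Extender; the sketch holds the whole text in memory, so it is meant only for small inputs.

```python
import re


def flatten_genbank(text: str) -> str:
    """Approximate the tr | awk | tr pipeline on a GenBank flat-file string."""
    # Stage 1: delete newlines, then squeeze runs of the same whitespace character.
    text = text.replace("\n", "")
    text = re.sub(r"([ \t\r\x0b\x0c])\1+", r"\1", text)
    # Stage 2: put a carriage return between consecutive entries.
    text = text.replace("//LOCUS", "\rLOCUS")
    # Stage 3: delete every remaining slash (this also strips the "/" that
    # starts feature qualifiers such as /gene=).
    return text.replace("/", "")


if __name__ == "__main__":
    sample = (
        "LOCUS       NM_1\nORIGIN\n  1 atg\n//\n"
        "LOCUS       NM_2\n     /gene=\"abc\"\n//\n"
    )
    for record in flatten_genbank(sample).split("\r"):
        print(record)
```

Each entry ends up on a single carriage-return-terminated record, which is the layout the import step expects.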
3. Import the out.txt file into the RefSeq_Parser table of 5'_ORF_Extender.
To do this, first switch to the RefSeq_Parser table of the software. (To switch among different database tables, use the "Layout" menu at the upper left corner.)
Choose the "Import records" command from the "File" menu.
Select the file to be imported, choosing "Tab-separated text" from the "Show" pop-up menu.
The software will calculate and extract the following information into specific calculated fields:
FIELD "FASTA": the entry in FASTA format, including accession number and mRNA sequence;
FIELD "LOCUS": the entry accession number;
FIELD "bp": length of the entry sequence (in bp);
FIELD "CDS_start": position of the entry-recorded translational start codon;
FIELD "UTR5'_length": the length of the mRNA 5' UTR sequence;
FIELD "Seq": the mRNA sequence;
FIELD "Seq_UTR5'": the mRNA 5' UTR sequence;
FIELD "SYMBOLUM": the gene symbol.
The "NM_ID" field is used by the "Create chunks" script to mark only mRNA sequences with an NM_ accession code, whereas the XM_ code labels entries that are computed mRNA models.
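The relationship among some of these calculated fields can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the FileMaker calculation itself: the function name derive_fields is hypothetical, "CDS_start" is taken as a 1-based position, and the 5' UTR is assumed to be everything upstream of that position.

```python
def derive_fields(locus: str, seq: str, cds_start: int) -> dict:
    """Derive a few RefSeq_Parser-style fields from an accession, sequence and CDS start."""
    utr5 = seq[: cds_start - 1]  # bases upstream of the recorded start codon
    return {
        "FASTA": ">" + locus + "\n" + seq,
        "LOCUS": locus,
        "bp": len(seq),
        "CDS_start": cds_start,
        "UTR5'_length": len(utr5),
        "Seq": seq,
        "Seq_UTR5'": utr5,
    }


if __name__ == "__main__":
    # Toy sequence: 3-base 5' UTR followed by ATG AAA TAA.
    print(derive_fields("NM_000000", "GGAATGAAATAA", 4))
```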
4. Execute the "RefSeq_further_extendable" script from the "Script" menu.
The script screens RefSeq mRNAs for the presence of an in-frame stop codon upstream of the annotated initiation codon in the mRNA sequence itself. Such a stop codon indicates that the recorded 5' UTR sequence cannot be part of a longer continuous coding sequence, so this subset of mRNAs is not analyzed further.
In this case, the FIELD "Further_extendable" shows "No"; otherwise it shows "Yes".
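The screening logic can be illustrated with a short Python sketch. It is an assumption-based reconstruction of the idea described above, not the script shipped with the software: the function name further_extendable is hypothetical, "cds_start" is taken as 1-based, and the standard stop codons TAA, TAG and TGA are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def further_extendable(seq: str, cds_start: int) -> str:
    """Return "No" if an in-frame stop codon lies upstream of the start codon, else "Yes"."""
    seq = seq.upper().replace("U", "T")
    start0 = cds_start - 1  # 0-based index of the start codon
    # Walk upstream codon by codon, staying in the frame of the start codon.
    for i in range(start0 - 3, -1, -3):
        if seq[i:i + 3] in STOP_CODONS:
            return "No"
    return "Yes"


if __name__ == "__main__":
    print(further_extendable("TAAGGGATGAAA", 7))  # in-frame TAA in the 5' UTR
    print(further_extendable("CCGGGATGAAA", 6))   # no in-frame stop upstream
```

Stepping by three bases from the start codon means that stop triplets in the other two reading frames are ignored.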
B. Creation of a local EST entries database using the EST_Parser database table
1. Download the text file of the desired EST entries.
First, go to:
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Limits&db=nucleotide
Perform a query such as: danio rerio[ORGANISM] AND gbdiv_est[PROP]
(changing the organism name as desired), and then choose "GenBank" as the file format from the "Display" menu.
Now you can download the retrieved entry set by choosing "File" from the "Send to" pop-up menu.
If the number of EST entries exceeds 300,000-400,000, download subsets of this size using the "Modification date" range in the "Limits" tab. This keeps the file size within the maximum allowed for download (1 Gb).
2. Edit the downloaded file using the Unix commands "tr" and "awk". The file must be placed in the directory from which the commands are launched. These commands are included in Mac OS X and in most Unix-like systems, e.g. Linux.
Editing is performed using this instruction:
tr -ds "\n" "[:space:]" < est.txt | awk '{gsub ("//LOCUS", "\rLOCUS"); print $0;}' | tr -d "//\n" > out.txt
where "est.txt" should be replaced by the name of the downloaded EST file, and "out.txt" by the name of the edited file produced as the output.
3. Import the out.txt file into the EST_Parser table of 5'_ORF_Extender.
To do this, first switch to the EST_Parser table of the software. (To switch among different database tables, use the "Layout" menu at the upper left corner.)
Choose the "Import records" command from the "File" menu.
Select the file to be imported, choosing "Tab-separated text" from the "Show" pop-up menu.
The software will calculate and extract the following information into specific calculated fields:
FIELD "LOCUS": the accession number of the EST entry;
FIELD "bp": length of the entry sequence (in bp).
C. Obtaining a BLAST results file
1. CREATE THE QUERY
Obtain any number of mRNA sequences in FASTA format to be analyzed for possible coding sequence extension.
For massive comparison, you may choose the "Create chunks" command from the "Script" menu and follow the on-screen instructions to generate text files with the desired number of FASTA format RefSeq entries to be submitted for BLAST analysis.
In addition, the script generates text files with Unix-like scripts for:
batch submission to a locally running BLAST software under the Linux SuSE system;
editing of the end-of-line characters in the exported FASTA sequence entries.
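For readers who prepare query files outside the software, the chunking idea can be sketched in Python as follows. This is not the "Create chunks" script itself; the function name split_fasta and the chunk size argument are hypothetical, and the sketch assumes that every entry starts with a ">" description line.

```python
def split_fasta(text: str, chunk_size: int) -> list:
    """Split multi-FASTA text into strings holding at most chunk_size entries each."""
    entries = []
    for line in text.splitlines():
        if line.startswith(">"):
            entries.append([line])    # a description line opens a new entry
        elif entries:
            entries[-1].append(line)  # sequence line of the current entry
    chunks = []
    for i in range(0, len(entries), chunk_size):
        group = entries[i:i + chunk_size]
        chunks.append("\n".join("\n".join(entry) for entry in group) + "\n")
    return chunks


if __name__ == "__main__":
    for number, chunk in enumerate(split_fasta(">a\nAC\n>b\nGT\n>c\nTT\n", 2), start=1):
        print("chunk", number)
        print(chunk, end="")
```

Each returned chunk ends every line with ASCII 10, which is the end-of-line character required for the BLAST query files.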
Edit these files if necessary (depending on the operating system) so that an ASCII 10 character ends each sequence description and each sequence.
For example, we had to edit these files using the "tr" Mac OS X utility to replace the ASCII 13 or ASCII 11 control characters with ASCII 10, using the command:
tr "\v" "\n" < input.txt | tr "\r" "\n" > output.txt
2. CREATE THE EST TARGET DATABASE
The EST database available online at the NCBI BLAST site may be used as the target database. However, for massive comparison, it is advisable to create a local database.
We followed the BLAST instructions at:
ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/
to create a species-specific local EST database.
Briefly, the Danio rerio EST subset was first downloaded from Entrez Nucleotide in FASTA format. We used as search criteria: Danio rerio[ORGANISM] AND gbdiv_est[PROP]
with the "Limits" date range set to the years 1996-2005, in two chunks (1996-2003 and 2004-2005) because of the maximum allowed downloadable file size.
The downloaded files were merged (concatenated with the "cat" Mac OS X utility) into a single file containing 673,073 entries, and this file was converted by the "formatdb" BLAST utility to create a local est_brare database suitable for BLAST comparison.
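Before formatting the database, it can be useful to confirm that the merged file contains the expected number of entries. The following Python sketch concatenates the downloaded FASTA files in the same way as "cat" and counts the description lines while doing so. The function name merge_fasta and the file names in the usage example are hypothetical.

```python
def merge_fasta(paths, out_path) -> int:
    """Concatenate FASTA files into out_path and return the number of entries written."""
    count = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        count += 1  # one description line per entry
                    out.write(line)
    return count


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file names):
    #   python merge_fasta.py est_1996_2003.fasta est_2004_2005.fasta merged.fasta
    if len(sys.argv) >= 3:
        *inputs, output = sys.argv[1:]
        print(merge_fasta(inputs, output), "entries written")
```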
3. RUN BLAST COMPARISON
Submit the FASTA files as queries to a BLASTN server, using the appropriate target database.
Set the "expect value" to 2e-12 ("-e 2e-12" on the command line); this excludes hits with low similarity.
Select "Hit table" as the result file format ("-m 8" on the command line).
Select "Plain text" format for results.
Select the maximum number of allowed alignments.
For massive comparison, it is advisable to run a local version of BLAST.
We used a local BLASTN running on the processor cluster CLX (operated by GNU Linux SuSE SLES 8, kernel 2.4.21-266).
Batch BLASTN comparisons were launched using each 25-sequence Danio rerio RefSeq FASTA file as the input query and est_brare as the queried database. Twenty-five was the maximum number of sequences that could be processed within the user limit of 6 hours for each task.
A script including all the needed "bsub" commands was executed on the cluster, with multiple lines of the format:
bsub -n 2 -W 6:00 -e %J.err -o %J.out ./blastall -p blastn -a 2 -d est_brare -i fasta.txt -v 100000 -b 100000 -e 2e-12 -m 8 -o blast_results.txt
The very high number of returned alignments is necessary to retrieve all ESTs matching each query mRNA.
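A submission script of this kind can be generated programmatically. The Python sketch below writes one "bsub" line per query file, reusing the options of the command line shown above without change. The function name bsub_lines and the input and output file naming scheme are assumptions made for illustration only.

```python
def bsub_lines(chunk_files, database="est_brare", hours=6):
    """Build one bsub/blastall command line per FASTA chunk file."""
    lines = []
    for name in chunk_files:
        # Hypothetical naming scheme: chunk_001.txt -> chunk_001_blast.txt
        result = name.rsplit(".", 1)[0] + "_blast.txt"
        lines.append(
            f"bsub -n 2 -W {hours}:00 -e %J.err -o %J.out "
            f"./blastall -p blastn -a 2 -d {database} -i {name} "
            f"-v 100000 -b 100000 -e 2e-12 -m 8 -o {result}"
        )
    return lines


if __name__ == "__main__":
    print("\n".join(bsub_lines(["chunk_001.txt", "chunk_002.txt"])))
```

Writing a separate results file for each query file keeps the jobs independent; the files can be merged afterwards.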
D. Using the 5'_ORF_Extender FileMaker Pro template to calculate EST-driven mRNA coding sequence extension
1. Generate a result.txt file by merging all the results files obtained by BLAST, in the event that your BLAST strategy has generated multiple files. The "cat" utility may be used to do this on Unix-like systems.
2. Import the result.txt file into the Five'_ORF_Extender table of the 5'_ORF_Extender software, using the "Import records" command from the "File" menu.
Select the file to be imported, choosing "Tab-separated text" from the "Show" pop-up menu or, if this is not possible, "All available" from the same menu. Select the "Perform auto-enter options" option if required.
(To switch among different database tables, use the "Layout" menu at the upper left corner.)
The software will extract the following information into specific fields:
FIELD "NM_Accession": the accession number of the RefSeq entry (query sequence, "q");
FIELD "SYMBOLUM": the gene symbol (obtained from the RefSeq_Parser table);
FIELD "Sbjct_LOCUS": the accession number of the matched EST entry (subject sequence, "s");
FIELD "%_identity.": the percentage of identity in the BLAST-aligned sequence;
FIELD "align_length": the length of the BLAST-aligned sequence;
FIELD "mismatch": the number of mismatches in the BLAST-aligned sequence;
FIELD "gap_opens": the number of gaps in the BLAST-aligned sequence;
FIELD "q_start": the position of the first aligned base of the "q";
FIELD "q_end": the position of the last aligned base of the "q";
FIELD "s_start": the position of the first aligned base of the "s";
FIELD "s_end": the position of the last aligned base of the "s";
FIELD "bit_score": the BLAST bit score for the alignment;
FIELD "e_value": the BLAST e-value for the alignment;
FIELD "mRNA_bp": the length of the RefSeq mRNA sequence corresponding to the "NM_Accession" accession number;
FIELD "EST_bp": the length of the EST sequence corresponding to the "Sbjct_LOCUS" accession number.
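To inspect a hit table outside the software, each tab-separated line can be mapped onto the field names listed above. The Python sketch below assumes the standard 12-column layout of the "-m 8" output, in which the e-value column precedes the bit score column. The function name parse_hit is hypothetical; "SYMBOLUM", "mRNA_bp" and "EST_bp" are not part of the BLAST output and are therefore not filled here.

```python
M8_FIELDS = [
    "NM_Accession", "Sbjct_LOCUS", "%_identity.", "align_length",
    "mismatch", "gap_opens", "q_start", "q_end", "s_start", "s_end",
    "e_value", "bit_score",
]
INT_FIELDS = {"align_length", "mismatch", "gap_opens",
              "q_start", "q_end", "s_start", "s_end"}
FLOAT_FIELDS = {"%_identity.", "e_value", "bit_score"}


def parse_hit(line: str) -> dict:
    """Convert one tabular BLAST hit line into a dictionary of typed fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(M8_FIELDS):
        raise ValueError("expected 12 tab-separated columns")
    record = dict(zip(M8_FIELDS, values))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    for key in FLOAT_FIELDS:
        record[key] = float(record[key])
    return record


if __name__ == "__main__":
    # Made-up example line, for illustration only.
    print(parse_hit("NM_000000\tEST000001\t98.50\t400\t6\t0\t1\t400\t25\t424\t1e-180\t650"))
```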
Use the "Results" command from the "Script" menu to:
generate results,
retrieve all BLAST hits with a filled "Result" field,
and export these records into the "Results" table, in order to save the calculation results.
Data may be further exported from the software to a text file using the "Export records ..." command from the "File" menu.
The software executes these calculations to generate mRNA sequences that extend the RefSeq mRNA coding sequence following EST comparison:
FIELD DESCRIPTION
"s_s>q_s": it shows "yes" if subject_start is greater than query_start, meaning that the EST sequence contains a sequence upstream of the RefSeq sequence;
"s_s>s_e": it shows "yes" if subject_start is greater than subject_end, meaning that the EST sequence is in the opposite orientation with respect to the RefSeq sequence.
The "Results" script analyzes only BLAST alignments matching these criteria:
the mRNA query sequence “start” position is “1”, and:
"s_s>q_s" = "yes", i.e. the EST subject sequence “start” position is greater than “1”, to be sure that only EST sequences containing additional upstream nucleotides with respect to the most upstream known mRNA sequence are analyzed;
"s_s>s_e" = "no", i.e. the EST subject sequence “start” position is not greater than the EST subject sequence “end” position, to focus the analysis on the plus/plus alignments (i.e. with sequences in the same orientation) in the generation of the extended mRNA models.
To avoid artifacts due to poor alignments between the mRNA and the EST sequences, only alignments with:
1) a percentage of nucleotide identity equal to or greater than 97%, and
2) a length of the EST sequence aligned with the mRNA greater than 49% of the total EST length
are selected.
These parameters are stringent, but the user may modify them if desired by changing the fields
1) "Set_percent_of_identity_in_alignment" and, respectively,
2) "Set_percent_of_EST_sequence_aligned"
before executing the "Results" script.
Lowering these values may allow further identification of extended ORFs when the mRNA sequence only partly aligns with the EST sequence because of:
the existence of ESTs longer than the whole respective mRNA, or
alternative splicing outside the aligned region,
at the risk of retrieving false positive ORF extensions.
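The selection criteria and the two adjustable thresholds can be summarized in a small Python sketch. It is a reading of the rules described above rather than the FileMaker script itself. The function name passes_filters is hypothetical, and the aligned EST length is approximated by the "align_length" field, which is an assumption.

```python
def passes_filters(hit: dict, est_bp: int,
                   min_identity: float = 97.0,
                   min_est_aligned: float = 49.0) -> bool:
    """Return True if a BLAST hit meets the position, orientation and quality criteria."""
    s_s_gt_q_s = hit["s_start"] > hit["q_start"]  # EST carries upstream bases
    s_s_gt_s_e = hit["s_start"] > hit["s_end"]    # EST in opposite orientation
    if hit["q_start"] != 1 or not s_s_gt_q_s or s_s_gt_s_e:
        return False
    # Assumption: the aligned part of the EST is approximated by align_length.
    percent_aligned = 100.0 * hit["align_length"] / est_bp
    return hit["%_identity."] >= min_identity and percent_aligned > min_est_aligned


if __name__ == "__main__":
    example = {"q_start": 1, "s_start": 25, "s_end": 424,
               "align_length": 400, "%_identity.": 98.5}
    print(passes_filters(example, est_bp=500))
```

The two keyword arguments play the role of the "Set_percent_of_identity_in_alignment" and "Set_percent_of_EST_sequence_aligned" fields, so lowering them relaxes the filter in the same way.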
RESULT FIELDS DESCRIPTION
"First_in_frame_ATG": position of the most upstream ATG located in the EST sequence upstream of the RefSeq sequence, which is in-frame with the start codon recorded in the RefSeq mRNA sequence entry; this value is named "ATG_position_in_EST" in the "Results" layout;
"Result": EST sequence extending from the first in-frame ATG to the first base upstream of the known CDS (coding sequence) of the RefSeq mRNA, confirmed after checking that there is no in-frame stop codon within it; this sequence is named "Extended_coding_seq" in the "Results" layout;
"Result_positive": it shows "Yes" if the field "Result" is filled;
"Extension length": length of the sequence in the "Result" field;
"Completeness": the EST sequence containing an mRNA ORF extension is scanned for the presence of an in-frame stop codon upstream of the new initiation codon identified in the EST; if it is found, the field shows "Yes". This value is named "ORF complete" in the "Results" layout.
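The way these result fields relate to one another can be illustrated with the following Python sketch. It is a simplified reconstruction under explicit assumptions, not the calculation implemented in the template: the function name extend_orf is hypothetical, positions are 1-based, the alignment starts at mRNA position 1 and contains no gaps in the 5' UTR, the candidate ATG must lie entirely upstream of the first aligned EST base, and "Completeness" is returned as a boolean.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_orf(est: str, s_start: int, cds_start: int):
    """Search the EST region upstream of the alignment for an in-frame ORF extension.

    est       -- EST sequence, same orientation as the mRNA
    s_start   -- EST position aligned to mRNA position 1
    cds_start -- position of the recorded start codon in the mRNA
    """
    est = est.upper()
    aligned0 = s_start - 1               # 0-based index of first aligned EST base
    cds0 = aligned0 + (cds_start - 1)    # 0-based index of the known start codon
    frame = cds0 % 3                     # codon starts share this remainder
    atg0 = None
    for i in range(frame, aligned0 - 2, 3):  # most upstream in-frame ATG first
        if est[i:i + 3] == "ATG":
            atg0 = i
            break
    if atg0 is None:
        return None
    extension = est[atg0:cds0]
    if any(extension[j:j + 3] in STOP_CODONS for j in range(0, len(extension), 3)):
        return None                      # in-frame stop inside the extension
    complete = any(est[i:i + 3] in STOP_CODONS for i in range(frame, atg0, 3))
    return {
        "First_in_frame_ATG": atg0 + 1,
        "Result": extension,
        "Extension length": len(extension),
        "Completeness": complete,
    }


if __name__ == "__main__":
    # Toy EST: TAG ATG GCC upstream, then the mRNA GGA ATG AAA.
    print(extend_orf("TAGATGGCCGGAATGAAA", s_start=10, cds_start=4))
```

A return value of None corresponds to an empty "Result" field.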