RefSeq_Parser

ABCD steps for the use of 5'_ORF_Extender

A. Creation of a local RefSeq entries database
using RefSeq_parser database table

1. Download the RefSeq text file of the desired species at:
ftp://ftp.ncbi.nih.gov/refseq/

(choose "mRNA_Prot" folder, download "gbff.gz" format file, decompress it as
usual).

2. Edit the downloaded file using the Unix commands "tr" and "awk".
The file must be placed in the same directory from which the commands are
launched.
These commands are also included in Mac OS X and in most Unix-like systems,
e.g. Linux.
Editing is performed using this instruction:

tr -ds "\n" "[:space:]" < gbff.txt | awk '{gsub ("//LOCUS", "\rLOCUS"); print $0;}' | tr -d "//\n" > out.txt

where "gbff.txt" should actually be the name of the downloaded RefSeq file,
and "out.txt" the name of the edited file produced as the output.

3. Import the out.txt file in the RefSeq_Parser table of 5'_ORF_Extender.
To do this, first switch to the RefSeq_Parser table of the software.
(To switch among different database tables, use the "Layout" menu at
the upper left corner).
Choose the command "Import records" from the "File" menu.
Select the file to be imported, choosing: "Tab-separated text" from the
"Show" pop-up menu.

The software will calculate and extract this information in specific
calculated fields:

FIELD "FASTA":        the entry in FASTA format,
                      including accession number and mRNA sequence;
FIELD "LOCUS":        the entry accession number;
FIELD "bp":           length of the entry sequence (in bp);
FIELD "CDS_start":    position of the entry-recorded
                    translational start codon;
FIELD "UTR5'_length": the length of the mRNA 5' UTR sequence;
FIELD "Seq":        the mRNA sequence;
FIELD "Seq_UTR5'":    the mRNA 5' UTR sequence;
FIELD "SYMBOLUM":     the gene symbol.

NM_ID field is used by the script "Create chunks" to mark only NM_ coded
mRNA sequences, while XM_ code labels entries with mRNA computed models.

4. Execute the "RefSeq_further_extendable" script from the "Script" menu.
The script screens RefSeq mRNAs for the presence of an in-frame stop codon
upstream of the described initiation codon in the mRNA sequence itself,
because this indicates that the recorded 5´UTR sequence cannot be part of a
longer continuous coding sequence. Consequently, this subset of mRNAs is
analyzed no further.
In this case, the
FIELD "Further_extendable" shows "No", otherwise: "Yes".

B. Creation of a local EST entries database
using EST_parser database table

1. Download the EST text file of the desired EST entries.
First, go to:
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Limits&db=nucleotide

Perform query (e.g.): danio rerio[ORGANISM] AND gbdiv_est[PROP]
and then choose "GenBank" as file format from the "Display" menu.
Now you can download the found entry set choosing "File" from the "Send to"
pop-up menu.

Change the organism name as desired.

If the number of the EST entries exceeds 3-400,000,
download subsets of this size using the "Modification date" range in the
"Limits" tab.
This is to limit the file size to the maximum allowed for download
(1 Gb).

2. Edit the downloaded file using the Unix commands "tr" and "awk".
The file must be placed in the same directory from which the commands are
launched.
These commands are also included in Mac OS X and in most Unix-like systems,
e.g. Linux.
Editing is performed using this instruction:

tr -ds "\n" "[:space:]" < est.txt | awk '{gsub ("//LOCUS", "\rLOCUS"); print $0;}' | tr -d "//\n" > out.txt

where "est.txt" should actually be the name of the downloaded EST file,
and "out.txt" the name of the edited file produced as the output.

3. Import the out.txt file in the EST_Parser table of 5'_ORF_Extender.
To do this, first switch to the EST_Parser table of the software.
(To switch among different database tables, use the "Layout" menu at the
upper left corner).
Choose command "Import records" from the "File" menu.
Select the file to be imported, choosing: "Tab-separated text" from the
"Show" pop-up menu.

The software will calculate and extract this information in specific
calculated fields:
FIELD "LOCUS":    the accession number of the EST entry;
FIELD "bp":           length of the entry sequence (in bp).

C. Obtainment of a BLAST results file

1. CREATE THE QUERY
Obtain any number of mRNA sequences in FASTA format to be analyzed
for possible coding sequence extension.

For massive comparison, you may:
choose the "Create chunks" command from the "Script" menu and follow the
instructions on the screen to generate text files with the desired number
of FASTA format RefSeq entries to be submitted for BLAST analysis.
In addition, the script generates text files with Unix-like scripts for:
batch submission to a locally running BLAST software under the Linux SuSE
system;
editing of the end-of-line characters in the FASTA sequence entries
exported.

Edit these files if necessary (depending on the operating system)
in order to have ASCII 10 character at the end of each sequence description
and at the end of each sequence.
For example, we had to edit these files using "tr" Mac OS X utility,
to replace ASCII 13 or ASCII 11 with ASCII 10 control character, with the
command:
tr "\v" "\n" < input.txt | tr "\r" "\n" > output.txt

2. CREATE THE EST TARGET DATABASE
The EST database available online at NCBI BLAST site may be used as the
target database.

However, for massive comparison, it is advisable to create a local database.
We followed the BLAST instructions at:
ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/
to create a species-specific local EST database.
Briefly, the Danio rerio EST subset was first downloaded from Entrez
Nucleotide in FASTA format.
We used as search criteria: Danio rerio[ORGANISM] AND gbdiv_est[PROP]
with "Limits" date range set to years 1996-2005,
in two chunks due to maximum allowed downloadable file size
(1996-2003 and 2004-2005).
The downloaded files were merged (concatenated by "cat" Mac OS X utility) in
a single file, containing 673,073 entries, and this file was converted by
the "formatdb" BLAST utility, in order to create a local est_brare database
suitable for BLAST comparison.

3. RUN BLAST COMPARISON
Submit the FASTA files as query on a BLASTN server,
using the appropriate target database.
Set "expect value" to 2e-12 ("e 2e-12") on the command line
(this excludes hits with low similarity).
Select "Hit table" as the result file format ("m 8" on the command line).
Select "Plain text" format for results.
Select the maximum number of allowed alignments.

For massive comparison, it is advisable to run a local version of BLAST.
We used a local BLASTN running on the processor cluster CLX
(operated by GNU Linux SuSE SLES 8 kernel 2.4.21-266).
Batch BLASTN comparisons were launched using each 25-sequence Danio rerio
Refseq FASTA file as the input query, and est_brare as the queried database.
25 sequences was the maximum number of sequences processed within the user
limit of 6 hours for each task.
A script including all needed "bsub" commands was executed on the cluster
with multiple lines of the format:
bsub -n 2 -W 6:00 -e %J.err -o %J.out ./blastall -p blastn -a 2 -d est_brare
-i fasta.txt -v 100000 -b 100000 -e 2e-12 -m 8 -o blast_results.txt

The very high number of returned alignments is necessary to retrieve all
ESTs matching each query mRNA.

D. Using 5'_ORF_Extender FileMaker Pro template to
calculate EST-driven mRNA coding sequence extension

1. Generate a result.txt file merging all the results files obtained by
BLAST, in the event that your BLAST strategy has generated multiple files.
The "cat" utility may be used to do this on Unix-like systems.

2. Import the result.txt file in the Five'_ORF_Extender table of the
5'_ORF_Extender software,
using the command "Import records" from the "File" menu.
Select the file to be imported,
choosing: "Tab-separated text" from the "Show" pop-up menu,
or, if this is not possible,
choosing: "All available" from the "Show" pop-up menu.
Select the option "Perform auto-enter options" if required.
(To switch among different database tables, use the "Layout" menu at the
upper left corner).

The software will extract this information in specific fields:
FIELD "NM_Accession": the accession number of the RefSeq entry
                       (query sequence, "q");
FIELD "SYMBOLUM":      the gene symbol
                       (obtained by the RefSeq_Parser table);
FIELD "Sbjct_LOCUS":   the accession number of the EST matched entry
                       (subject sequence, "s");
FIELD "%_identity.":   the percentage of identity in the BLAST-aligned
                       sequence;
FIELD "align_length": the length of the BLAST-aligned sequence;
FIELD "mismatch":      the number of mismatches in the BLAST-aligned
                       sequence;
FIELD "gap_opens":     the number of gaps in the BLAST-aligned sequence;
FIELD "q_start":       the position of the first aligned base of the "q";
FIELD "q_end":         the position of the last aligned base of the "q";
FIELD "s_start":       the position of the first aligned base of the "s";
FIELD "s_end":         the position of the last aligned base of the "s";
FIELD "bit_score":     the BLAST bit score for the alignment;
FIELD "e_value":       the BLAST e-value for the alignment;

FIELD "mRNA_bp":       the length of the RefSeq mRNA sequence corresponding
                       to the "NM_Accession" accession number;

FIELD "EST_bp":        the length of the EST sequence corresponding
                       to the "Sbjct_LOCUS" accession number.

Use the "Results" command from the "Script" Menu to:
generate results,
retrieve all BLAST hits within a filled "Result" field,
and export these records into the "Results" table,
in order to save the calculation results.
Data may be further exported from the software to a text file using the
"Export records ..." command from the "File" Menu.

The software executes these calculations to generate mRNA sequences that
allow to extend the RefSeq mRNA coding sequence following EST comparison:

FIELD                  DESCRIPTION

"s_s>q_s":             it shows "yes",
                       if subject_start is greater than query_start,
                       meaning that the EST sequence contains a sequence
                       upstream of the RefSeq sequence;

"s_s>s_e":             it shows "yes",
                       if subject_start is greater than subject_end,
                       meaning that the EST sequence is in the opposite
                       orientation with respect to the RefSeq sequence.

The script "Results" analyzes only BLAST alignments matching these criteria:
mRNA sequence query “start” position is “1”, and:
"s_s>q_s" = "yes"

i.e. the EST subject sequence “start” position is greater than “1” ,
to be sure that only EST sequences containing additional upstream nucleotides
with respect to the most upstream mRNA sequence known are analyzed;

"s_s>s_e" = "no"
i.e. the EST subject sequence start position is not greater than the EST subject
sequence “end” position, to focus the analysis on the plus/plus alignments
(i.e. with sequences in the same orientation)
in the generation of the extended mRNA models.

To avoid artifacts due to poor alignments between the mRNA and the EST
sequences, only alignments with:
1) a percentage of nucleotide identity equal to or greater than 97%, and with:
2) a length of the EST sequence aligned with the mRNA greater than 49% of
   the total EST length are selected.

These parameters are stringent but they may be modified by the user if
desired, modifying the fields
1) "Set_percent_of_identity_in_alignment" and, respectively:
2) "Set_percent_of_EST_sequence_aligned"
before executing the "Results" script.

Lowering these values may allow further identification of extended ORF when
mRNA sequence only partly aligns with the EST sequence due to:
the existence of ESTs longer than the whole respective mRNA, or to:
alternative splicing outside the aligned region,
at the risk of possible retrieval of false positive ORF extensions.

RESULT FIELDS        DESCRIPTION

"First_in_frame_ATG": position of the most upstream ATG located in the
                       EST sequence upstream the RefSeq sequence,
                       which is in-frame with the start codon recorded
                       in the RefSeq mRNA sequence entry;
                       this value is named "ATG_position_in_EST"
                       in the "Results" layout;

"Result":              EST sequence extending from the first in-frame ATG
                       to the first base upstream the RefSeq mRNA known
                       CDS (coding sequence),
                       confirmed after checking that there is no
                       in-frame stop codon within it;
                       this sequence is named "Extended_coding_seq"
                       in the "Results" layout;

"Result_positive":     it shows "Yes" if the field "Result" is filled;

"Extension length":    length of the sequence in the "Result" field;

"Completeness":        the EST sequence containing an mRNA ORF extension
                       is scanned for the presence of an in-frame stop codon
                       upstream of the new initiation codon identified in the
                       EST; if it is found, the field shows "Yes".
                       This value is named "ORF complete"
                       in the "Results" layout.