INTRODUCTION
This online Guide provides detailed documentation for the
GeneRecords 1.0 software.
A quick, illustrated Tutorial is also available; it explains
how to install the software
and how to import the desired GenBank entries.
--
GeneRecords is a tool for parsing GenBank flat files.
After import, it provides a structured representation of
each "feature" and "feature qualifier" of a GenBank entry
within a common database management system that runs on a personal computer
(Macintosh and Windows environments).
This collection of related databases enables the local management
of
GenBank records, allowing indexing, retrieval and analysis of both
information and sequences on a personal computer.
Minimum requirements:
Macintosh OS 8.1 or later, 9.x
Macintosh OS X with Classic
Windows 95/98 with Internet Explorer 4.0 or later,
Windows NT 4.0 with Service Pack 3, or Windows 2000.
Download GeneRecords 1.0 for
Mac OS 9 or Mac OS X from the following addresses:
http://apollo11.isto.unibo.it/software/GeneRecords_1.0/GeneRecordsMacOS9.sit
http://apollo11.isto.unibo.it/software/GeneRecords_1.0/GeneRecordsMacOSX.sit
The downloaded file should be decompressed automatically,
generating a
"GeneRecords 1.0 Mac OS 9" or
"GeneRecords 1.0 Mac OS X" folder, respectively.
If this does not happen, decompress the file with the
"Stuffit Expander" utility.
The GeneRecords Folder contains:
GeneRecords file (runtime application),
42 related subdatabases (.DNA files),
the "SelfReplace GeneRecords Filter" helper application along with
its
"SelfReplace Pref" preference document,
the LocusLink database formatted for GeneRecords relationships.
The "Fasta extractor" is a utility that generates sequences in FASTA
format.
The Mac OS X version also includes the "Open Classic" application
(see below).
The "MacOS_Tutorial" and "MacOS_Guide" folders contain a copy
of the on-line documentation, for local (off-line)
use.
GeneRecords is based on FileMaker Pro 6 (FileMaker Pro, Inc.)
database management software (www.filemaker.com/index.html).
It is released as a set of related FileMaker Pro templates that
will import
any set of GenBank-formatted entries into a local database.
A free runtime application is included to run the FileMaker Pro engine
at the core of the software.
The GeneRecords database directly imports GenBank flat file data
sources,
automatically pre-formatted by a text filter, and at the same time
generates
a relational database of 42 files containing the parsed data,
making them
available for complete search and analysis. The system implements
an
independent representation of each feature and its respective qualifiers
in
GenBank within a common database management system usable on a personal
computer (Macintosh™ and Windows™ environments).
The master file of GeneRecords collection is the file "Records"
(extension is .DNA).
Launching the "GeneRecords" application opens
"Records.DNA",
which in turn opens all the related subdatabases.
The "Records.DNA" layout functions as the GeneRecords package main
view,
and it is a way to access all subdatabases that constitute the
package.
It is organized in a GenBank entry-specific way:
each database record in "Records.DNA" contains data
from one GenBank entry.
The software package also contains the related subdatabases listed
here
(names are the same as in the official GenBank feature
specifications):
CDS; ClipX; Conflict; D_loop; Exon; Gene; Hyphen; iDNA;
Intron; LTR; Mat_peptide; Misc_binding; Misc_difference;
Misc_feature; Misc_recomb; Misc_structure; Modified_base; mRNA;
Old_sequence; PolyA_signal; PolyA_site; Primer_bind; Protein_bind;
RBS;
Reference; Region_X; Rep_origin; Repeat_region; Repeat_unit; RNA_X;
Satellite; Segment_X; Sig_peptide; Signal_X; Source; Stem_loop;
STS; Transit_peptide; Unsure; UTRX; Variation.
Please note that some features of the same type will be imported
in the same subdatabase; in particular:
the "ClipX" database includes the features: 5'clip, 3'clip;
the "UTRX" database includes the features: 5'UTR, 3'UTR;
the "Region_X" database includes the features: C_region, S_region,
V_region;
the "RNA_X" database includes the features:
misc_RNA, precursor_RNA, prim_transcript,
rRNA, scRNA, snRNA, snoRNA, tRNA;
the "Segment_X" database includes the features:
V_segment, D_segment, J_segment, N_region;
the "Signal_X" database includes the features:
attenuator, CAAT_signal, enhancer, GC_signal,
misc_signal,
promoter, TATA_signal, terminator, -10_signal,
-35_signal.
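The grouping listed above can be written as a lookup table. The following Python sketch is only an illustration and is not part of GeneRecords; the names GROUPED_FEATURES and shared_subdatabase are hypothetical. It covers only the grouped feature keys and returns None for every other key.

```python
# Feature keys that share a subdatabase, copied from the list above.
GROUPED_FEATURES = {
    "ClipX": ["5'clip", "3'clip"],
    "UTRX": ["5'UTR", "3'UTR"],
    "Region_X": ["C_region", "S_region", "V_region"],
    "RNA_X": ["misc_RNA", "precursor_RNA", "prim_transcript",
              "rRNA", "scRNA", "snRNA", "snoRNA", "tRNA"],
    "Segment_X": ["V_segment", "D_segment", "J_segment", "N_region"],
    "Signal_X": ["attenuator", "CAAT_signal", "enhancer", "GC_signal",
                 "misc_signal", "promoter", "TATA_signal", "terminator",
                 "-10_signal", "-35_signal"],
}

# Invert the table: feature key -> name of the shared subdatabase.
FEATURE_TO_SUBDATABASE = {
    feature: database
    for database, features in GROUPED_FEATURES.items()
    for feature in features
}


def shared_subdatabase(feature_key):
    """Return the shared subdatabase of a grouped feature key, else None."""
    return FEATURE_TO_SUBDATABASE.get(feature_key)
```

For example, shared_subdatabase("tRNA") looks up the "RNA_X" group, while a key such as "exon", which has its own subdatabase, is not in the table.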
METHOD
First, the detailed description of the GenBank flat file format
(ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
was carefully analyzed in order to:
1. identify characters usable as consistent delimiters
for each data type;
2. convert the flat file format into a series of multiple related
tables,
allowing the appropriate import of each data type.
Our strategy is thus based on processing the downloaded, expanded
data
file with a fast text conversion utility that finds, changes
or inserts
control characters (e.g. it inserts carriage returns and tabs at the
desired points, to mark the end of each record and of each
field within a record, respectively; the list of changes is provided
as on-line
supplementary material).
This step is performed by fully automated text filtering
(we have incorporated in the software a SelfReplace application
that performs
all the character changes automatically), using a
provided
replacement filter based on invariant features of the established
GenBank flat
file format.
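The idea of the filtering step can be sketched in Python as follows. This is a minimal illustration, not the replacement list actually used by SelfReplace: it relies only on three invariants of the flat file format (the "FEATURES" line, the "ORIGIN" line and the "//" terminator), it keeps the whole sequence in a single field instead of splitting it into chunks, and the placeholder used for line breaks inside a field (LINE_MARK) is an assumption made for this example.

```python
import re

FIELD_SEP = "\t"     # tab: marks the end of a field
RECORD_SEP = "\r"    # carriage return: marks the end of a record
LINE_MARK = "\x0b"   # assumption: placeholder for line breaks kept inside a field


def preprocess(flat_text):
    """Turn GenBank flat-file text into one tab-delimited record per entry."""
    text = flat_text.replace("\r\n", "\n").replace("\r", "\n")
    records = []
    # Every entry ends with a line that contains only "//".
    for entry in re.split(r"^//[ \t]*$", text, flags=re.MULTILINE):
        entry = entry.strip("\n")
        if not entry.strip():
            continue
        # The feature table starts at "FEATURES", the sequence at "ORIGIN".
        header, _, rest = entry.partition("\nFEATURES")
        features, _, origin = rest.partition("\nORIGIN")
        features = features.partition("\n")[2]  # drop the rest of the FEATURES line
        origin = origin.partition("\n")[2]      # drop the rest of the ORIGIN line
        # Keep only the bases: remove position numbers, blanks and line breaks.
        sequence = re.sub(r"[^A-Za-z]", "", origin)
        fields = [header.replace("\n", LINE_MARK),
                  features.replace("\n", LINE_MARK),
                  sequence]
        records.append(FIELD_SEP.join(fields))
    return RECORD_SEP.join(records)
```

The output holds three fields per entry (header, feature table, sequence), so a tab-delimited import would create one record per GenBank entry.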
At this point the file is ready to be imported into the appropriate
text
fields (e.g. "NM Entry", "Features", "Seq
1 50100") of GeneRecords.
In this step, the master file is filled in
with the entry set data
("raw data" layout), using the tab
characters inserted by the filter.
Each GenBank entry is imported
as one record.
Finally, GeneRecords automatically parses
the data using calculated fields,
which extract each single data type from the
text fields.
The whole import process is driven by FileMaker
Pro (FMP) scripts launched
by clicking on the "Import data" button.
At the end of the import step, several kinds
of subscripts create
a new record in the related subdatabases for
each feature reported in any
entry; these records contain only the accession
number and a progressive
number.
These data create the relationships between
each subdatabase and the
master file, allowing the data (read from
Records.DNA) to be displayed
within specific fields in each
subdatabase.
Each feature of the entry is visualized in
a dedicated subdatabase.
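The creation of the linking records can be sketched as follows. This Python fragment is an illustration only, not the FileMaker Pro scripts themselves; the function name link_records is hypothetical, and the sketch assumes that the progressive number restarts at 1 for each feature type within each entry, which this Guide does not state.

```python
from collections import defaultdict


def link_records(entries):
    """Build the linking records of each subdatabase.

    entries: iterable of (accession, [feature_key, ...]) pairs.
    Returns {feature_key: [(accession, progressive_number), ...]}.
    """
    tables = defaultdict(list)
    for accession, feature_keys in entries:
        counters = defaultdict(int)  # assumption: numbering restarts per entry
        for key in feature_keys:
            counters[key] += 1
            tables[key].append((accession, counters[key]))
    return dict(tables)
```

Each pair (accession number, progressive number) identifies one feature record, and the accession number is the value shared with the master file.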
In the master file, different layouts
are accessible from the "Layout
Menu" (a pop-up Menu in the top left corner,
above the small book icon).
Each layout contains fields within a "portal",
the FileMaker Pro tool for
construction of relational databases.
Each field visualizes the related subdatabase
content for each feature.
It is possible to visualize these layouts
also by clicking on the
"To Features" button, and then on the desired
green button near each
feature name.
--
The included free FMP runtime allows
records management and browsing,
while creating new fields for data processing
or defining further relationships
requires the installation of the
FMP application.
We encourage any creative use, modification
and noncommercial
redistribution of GeneRecords, as long as
the original paper is cited
and a statement is provided that the original
program has been modified (if that is the case).
The availability of complete GenBank datasets
in a relational database
format allows easy integration with other
biological databases
available in the same or a similar format;
for example,
UniGene
(a collection of EST clusters) and LocusLink
(a table of
correlations among different types of biological
data) can easily be
imported into an FMP template.
Our release includes the integration with
LocusLink,
with the respective
Gene Ontology tags.
INSTALLATION
Put the file "SelfReplace Pref" contained in the GeneRecords folder into the "Preferences" folder of the "System Folder".
Note:
Mac OS 9: "System Folder" is the name of the standard system folder.
Mac OS X: "System Folder" is the name of the folder containing
the Mac OS 9 "Classic" compatibility environment.
"Classic" is required for running GeneRecords;
check that it is open by double-clicking on the "Open Classic.app"
file included in the "GeneRecords" folder.
GENERAL DEFINITIONS
File
One set of records pertaining to the same
subject.
GeneRecords is a set of 42 related files.
It includes a "master" GeneRecords
file (Records.DNA),
related to 41 files (subdatabases).
Each subdatabase corresponds to one (or more)
specific
"GenBank Feature", except
"Fasta extractor.DNA"
(a utility to convert
sequence data to FASTA format).
A file with "LocusLink" data is also provided.
Record
One set of fields which constitute one
entry.
One record contains data about one GenBank
entry.
The record browser is a small book icon
at the top left of the
window.
You may browse the database by clicking on
the book pages,
or by entering a record number
and pressing the "Return" key.
The following information is always displayed:
Records: total number of Records in the database
Found: total number of Records currently selected
Sorted: sorting status of the Records (Sorted/Unsorted)
Field
One area of the record containing a specific
data type.
In the subdatabases,
each field corresponds to a specific "GenBank
Feature Qualifier".
Browse Mode
A mode to use the database.
It allows data entry, viewing, browsing,
sorting, manipulation.
It may be selected:
from the green "Browse mode" button on the
window, or
from the "View" menu, or
from the mode pop-up Menu bar, at the bottom
left of the window.
Find Mode
An alternative mode to use the database.
It allows searching for specific content
in the databases fields,
using any different combination of criteria
(see the "Search
mode" section below for details about searching).
It may be selected:
from the green "Find mode" button on the
window, or
from the "View" menu, or
from the mode pop-up Menu bar, at the bottom
left of the window.
Preview Mode
An alternative mode to use the database.
It visualizes a print preview of the found
records.
It may be selected:
from the "View" menu, or
from the mode pop-up Menu bar, at the bottom
left of the window.
Layout
A particular graphical organization of the
database fields.
A file may show data from within different
layouts.
The display of a field is independent of
the
storage of the data it contains.
The user can navigate among the different
layouts of a file with the
pop-up Menu (Layout Menu),
clicking on the bar at the top of the book icon.
USE
1. Download GenBank entries
Download a set of desired GenBank entries
(maximum allowed sequence size for a GenBank entry: 1,102,200 bp)
from any database which supports the GenBank format
(the GenBank "CON" division, i.e. NT_
entry code, is not supported).
There are two usual alternatives:
I. Querying GenBank via
World Wide Web at:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Limits&db=Nucleotide
Users have to save the obtained entry set as follows:
choose "GenBank" format from the pop-up Menu
at the right of "Display" button,
choose "File" from the pop-up Menu
at the right of "Send to" button,
then click on "Send to" button and choose "Save" button
in the next dialog box.
II. Downloading large
data sets via ftp at:
ftp://ftp.ncbi.nih.gov/genbank/
(decompress the files when appropriate)
At the end of this step, the users should have a text file
in GenBank flat file format,
containing the sequences to be imported into the GeneRecords database.
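Before importing, it may be useful to check that the saved file really looks like a GenBank flat file. The Python sketch below is an optional, illustrative check that is not part of GeneRecords; it relies only on two invariants of the format (each entry starts with a "LOCUS" line and ends with a "//" line), and the function names are hypothetical.

```python
def count_entries(flat_text):
    """Return (number of LOCUS lines, number of '//' terminator lines)."""
    locus = terminators = 0
    for line in flat_text.splitlines():
        if line.startswith("LOCUS"):
            locus += 1
        elif line.rstrip() == "//":
            terminators += 1
    return locus, terminators


def looks_like_genbank(flat_text):
    """True when the text holds at least one entry and every entry is closed."""
    locus, terminators = count_entries(flat_text)
    return locus > 0 and locus == terminators
```

The first value returned by count_entries is also the number of records to expect in "Records.DNA" after import, since each GenBank entry becomes one record.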
2. Import GenBank Entries
Open the "GeneRecords" file in the "GeneRecords 1.0" folder.
This action will open the file "Records.DNA", which is
the
"master" file of the program, linked and related to the
feature-specific subdatabases.
Advanced use:
You may open the program files using your own copy of FileMaker
Pro,
which fully enables you to modify the software.
In this case, do not open the program using the "GeneRecords"
file,
but open the master file "Records.DNA" with your FileMaker Pro.
After any modification, correct functioning of the program
requires
that you relaunch it with the "GeneRecords" runtime, due to the data
pathway structure stored in the "GeneRecords" scripts.
Click on the "Import data" button and follow instructions on the
dialog box
(choose the file to be imported).
If the previously imported records have to be deleted,
choose "Erase sub-databases" from "Actions" Menu.
This step may require a long time,
depending on the size of the original data file.
A set of 42 related files will be automatically updated with
the new data,
with each type of information imported in the appropriate file/field.
The actual number of imported records is shown on the left side
of the
Records.DNA window.
Note that each GenBank entry is imported as a single record.
You may adjust the layout appearance using "Zoom In"/"Zoom out"
buttons,
or clicking on the small resizing buttons at the bottom left corner
of any window.
The "Layout Menu" is a pop-up Menu in the
top left corner,
above the small book icon.
Within one database file, separate layouts may be provided.
Layouts determine how data is displayed.
Changes in a field present on a layout are reflected in the same
field
on all the layouts in the database.
Each "feature" (e.g., exon)
of the entry is visualized in
a dedicated subdatabase (e.g.,
"Exon.DNA" file),
while each field of the subdatabase corresponds to a "Feature Qualifier"
according to GenBank Format (e.g., "exon number").
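The correspondence between qualifiers and fields can be illustrated by splitting one feature block of the GenBank feature table into its key, location and qualifiers. The Python sketch below is an assumption-laden illustration, not the calculated fields used by GeneRecords: the function name parse_feature is hypothetical, continuation lines are joined with a single space, and a continuation line that happens to start with "/" inside a quoted value would be misread as a new qualifier.

```python
def parse_feature(block):
    """Split one feature block into (key, location, [(qualifier, value), ...])."""
    lines = block.splitlines()
    key, location = lines[0].split(None, 1)
    qualifiers = []  # list of [name, value]; repeated qualifiers are kept
    for line in lines[1:]:
        text = line.strip()
        if text.startswith("/"):
            name, _, value = text[1:].partition("=")
            qualifiers.append([name, value])
        elif qualifiers:
            # Continuation of the previous qualifier value.
            qualifiers[-1][1] += " " + text
        else:
            # Continuation of a location that spans several lines.
            location += text
    return key, location.strip(), [(n, v.strip('"')) for n, v in qualifiers]
```

In terms of this Guide, the key selects the subdatabase (e.g., "Exon.DNA") and each qualifier name selects a field of the new record.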
You can move among different subdatabases,
clicking on the "To the features" button, then clicking on a single
feature.
You may also choose a particular subdatabase from the "Window" Menu.
You may visualize and search the content of each subdatabase also
from within the master file,
clicking on the "To Features" button,
and then on the desired green button near each feature name.
Finally, you can also switch among different features using the
"Layout Menu".
Please remember that some features of the same type will be imported
in the same subdatabase; in particular:
the "ClipX" database includes the features: 5'clip, 3'clip;
the "UTRX" database includes the features: 5'UTR, 3'UTR;
the "Region_X" database includes the features:
C_region, S_region, V_region;
the "RNA_X" database includes the features:
misc_RNA, precursor_RNA, prim_transcript,
rRNA, scRNA, snRNA, snoRNA, tRNA;
the "Segment_X" database includes the features:
V_segment, D_segment, J_segment, N_region;
the "Signal_X" database includes the features:
attenuator, CAAT_signal, enhancer, GC_signal,
misc_signal,
promoter, TATA_signal, terminator, -10_signal,
-35_signal.
BROWSE MODE (NAVIGATION)
The FileMaker Pro based database may be used in three
"modes":
"Browse", "Find" and "Preview".
You can switch among modes from the "View" Menu.
In "Browse" mode,
you can browse the record set by clicking on the small book
icon
in the upper left corner.
In the GeneRecords database the users find four types of coloured buttons:
The RED buttons open a window
of the default web browser,
showing the related site on the Internet.
The GREEN buttons switch among
different layouts
(i.e., different visualization modes) of the same database file.
The GRAY buttons run a predefined instruction.
The BLUE buttons link to the
related data
stored in the specific GeneRecords subdatabase.
ACCESS TO SEQUENCE DATA
The nucleotide sequence of each feature for each entry is provided
in the respective subdatabase.
Sequences are split into chunks of 50,100 bp.
To view the sequence of a desired feature,
click on the "To the sequence" button in the respective subdatabase.
In some cases, the sequence is not immediately displayed in the subdatabases.
To display the actual sequence chunk data
in the current record of a subdatabase,
click on the button "Click to extract",
or choose the command "Extract this sequence" from the "Actions" Menu.
The command "Extract all sequences of the found entries set"
(from the "Actions" Menu)
will extract the feature sequence for all records in the current
found set.
A FASTA format sequence export function is accessible in
each
feature subdatabase, from the "Sequence" layout,
clicking on the "Export fasta file" button.
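For reference, a FASTA record consists of a header line starting with ">" followed by the sequence wrapped over several lines. The Python sketch below shows the general format only; the header content written by the "Export fasta file" button is not described in this Guide, so the identifier used here and the line width of 60 characters are assumptions, and the function name to_fasta is hypothetical.

```python
def to_fasta(identifier, sequence, width=60):
    """Format one sequence as a FASTA record (header line plus wrapped lines)."""
    lines = [">" + identifier]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

Several records can be written to the same file simply by concatenating the strings returned for each sequence.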
SEARCH ("FIND") MODE
You can switch among modes from the "View"
Menu,
or by clicking on the green "Find mode"
button in the database window.
In "Find" mode, the small book icon in the upper left corner
represents the different "requests" that make up a search
in the database.
In "Find" mode,
the user fills in a blank form to search in specific
fields.
By moving among the different layouts,
very complex searches can be made by
combining searches in different subdatabases
(each corresponding to a feature)
from within the master file "Records.DNA"
(which is dynamically related to the content of each subdatabase).
You can move among different layouts:
using the pop-up Layout Menu,
clicking
on the bar at the top of the book icon; or
using the green buttons available in the "Features" layout,
reached by clicking
on the green "To features" button.
When searching in the master ("Records.DNA") database,
if one entry contains several occurrences of a feature,
all related records of the respective feature
subdatabase
are displayed in the corresponding layout of the master database.
In FileMaker Pro "Find" mode, the "AND" - "OR" - "NOT" operators
may
be used in a search in this way:
"AND" by filling in different fields located in the same "Request",
"OR" by generating additional requests
(from "Requests" Menu) in the same
query,
"NOT" by generating additional requests (from "Requests" Menu)
and checking the "Omit"
box.
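The way requests combine can be modelled in a few lines of Python. This is a simplified model for illustration, not the FileMaker Pro search engine: matching is reduced to a case-insensitive substring test, requests are applied in the order given, a query that starts with an omit request is assumed to start from all records, and the field names in the usage lines are hypothetical.

```python
def matches(record, criteria):
    """AND: every field filled in within one request must match the record."""
    return all(
        str(wanted).lower() in str(record.get(field, "")).lower()
        for field, wanted in criteria.items()
    )


def find(records, requests):
    """Apply a list of (criteria, omit) requests in order.

    Ordinary requests add matching records to the found set (OR);
    requests with omit=True remove matching records from it (NOT).
    """
    first_is_omit = bool(requests) and requests[0][1]
    found = list(records) if first_is_omit else []  # assumption of this sketch
    for criteria, omit in requests:
        if omit:
            found = [r for r in found if not matches(r, criteria)]
        else:
            found += [r for r in records
                      if matches(r, criteria) and r not in found]
    return found


# Usage with hypothetical field names:
records = [
    {"Accession": "A1", "Organism": "Homo sapiens", "Gene": "SOD1"},
    {"Accession": "A2", "Organism": "Mus musculus", "Gene": "Sod1"},
    {"Accession": "A3", "Organism": "Homo sapiens", "Gene": "TP53"},
]
human_not_tp53 = find(records, [({"Organism": "Homo"}, False),
                                ({"Gene": "TP53"}, True)])
```

In the usage lines, the first request finds the human entries and the second, marked as "Omit", removes the TP53 entry from that found set.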
The "Symbols" pop-up Menu in the "Find" mode allows query of
ranges, duplicates, wildcards and so on.
Each feature subdatabase can
also be searched individually,
after selecting it from the "Window" Menu.
The search results are subsets of entries matching the desired criteria.
The "Find record" script looks for the corresponding group of
records
(features) in the specific databases;
if one entry contains several occurrences of a feature,
all related records of the subdatabase are displayed.
GENERECORDS FUNCTIONS AND MENU COMMANDS
FILE MENU
Page setup
Standard page set up command.
Print
Standard print command; you can choose to
print:
all records in the "Found"
set, or
only the current record,
or
a "blank" mask of the
record fields.
The appearance will be that of the layout
currently selected from
the layout Menu.
Import Records
This is the general "Import" function of
FileMaker Pro.
Use only "Import data" function
for correct GenBank file
import, from the "Actions" Menu, or
clicking on the "Import
data" button in "Records.DNA" file.
Export records
Export command for the found records set.
User can choose fields to be exported,
their order and the file format.
Save a copy as
Save a copy of the database, complete, compressed or
as a clone (database structure with no record present).
EDIT MENU
Undo
Standard "Undo" command.
Cut
Standard "Cut" text command.
Copy
Standard "Copy" text command.
Paste
Standard "Paste" text command.
Clear
Deletion of selected text.
Select all
Selection of all the text within a selected
field
(to select a field, click into the field).
Find/Replace
Utility for search/replace text strings within
fields.
Note: Use "Find" mode (from "View" Menu)
for full search
and selection of a record set.
VIEW MENU
Browse Mode
Switch to the "Browse Mode" (see "General Definitions" above).
Find Mode
Switch to the "Find Mode" (see "General Definitions" above).
Preview Mode
Switch to the "Preview Mode" (see "General Definitions" above).
RECORD MENU
New Record
Create a new empty record in the database.
The new Record will be the last of the current
record set.
Duplicate Record
Duplicate the current record in the database.
Delete Record
Delete the current record in the database.
This will delete all the corresponding linked
records
in the related subdatabases.
Delete All Records
Delete all records in the database.
Modify Last Find
Return to the last performed search to edit
it.
Show All Records
Show all the records in the database.
Omit Record
Remove a record from the found record set,
without deleting it.
Show Omitted
Show the records in the database that have
been omitted.
Sort...
Sort the current records set according to
desired criteria.
ACTIONS MENU
New record: Example
Create a new sample record of GeneRecords.
Export data of interest
Equivalent to the "Export" command in the
"File" Menu.
Print found records
Equivalent to the "Print" command in the
"File" Menu.
Import Data
Import from a file with GenBank format entries.
(equivalent to the "Import data" button in
the software window).
WINDOW MENU
List of the databases in use.
A specific database may be selected from
the Menu,
bringing it to the front.
---
Software limits:
Maximum size of GenBank entry to be imported: up to 1,102,200 bp.
Maximum size of the GenBank "Features" section for each entry,
allowing a correct splitting of the Features:
64,000 characters following text processing.
Entries with a larger "Features" section will be processed,
but the splitting of the features in the subdatabases could be
incomplete.
A single GeneRecords database may store up to 2 Gbyte
(a physical limit of the core database).
The CON division of GenBank contains data for joining other sequences,
and it can not be imported.
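These limits can be collected in a small pre-import check. The Python sketch below is illustrative only and is not part of GeneRecords; the function name check_limits and its parameters are hypothetical. Note that the 64,000 character limit applies to the Features section after text processing, whereas the sketch simply measures the text it is given.

```python
MAX_SEQUENCE_BP = 1_102_200   # maximum entry size stated above
MAX_FEATURES_CHARS = 64_000   # maximum "Features" size stated above


def check_limits(sequence_length, features_text, division):
    """Return a list of the software limits that an entry would exceed."""
    problems = []
    if division == "CON":
        problems.append("entries of the CON division cannot be imported")
    if sequence_length > MAX_SEQUENCE_BP:
        problems.append("sequence is longer than 1,102,200 bp")
    if len(features_text) > MAX_FEATURES_CHARS:
        problems.append("Features section is longer than 64,000 characters; "
                        "the splitting of the features could be incomplete")
    return problems
```

An empty list means that none of the three limits is exceeded by the values passed in.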
Supplementary information:
Detailed explanation of the GenBank Flat File format may be found
at:
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
The definition and explanation of each GenBank Feature may be found
at:
http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html
The list of character replacements
is provided.
It is interpreted by the SelfReplace
application (Guoniu Han),
which we have incorporated in the software using the "Developer"
version.
General information about the FileMaker core functions may be found
at:
http://www.strath.ac.uk/CC/Courses/FilemakerPro/filemaker.html
http://www.wellesley.edu/Computing/Filemaker/filemaker4_tutorial.html
We enclose the LocusLink database, maintained at NCBI:
http://www.ncbi.nlm.nih.gov/LocusLink/
http://research.nhgri.nih.gov/microarray/downloadable_cdna.html
as an example of creating a relationship among different databases
using GeneRecords.
Technical notes:
The fields listed below are placed in this order in the right column
of the "Import" dialog box, preceded by an arrow in the "Map." column:
Gene Entry;
Features;
Seq 1 50100;
Seq 50101 100200;
Seq 100201 150300;
Seq 150301 200400;
Seq 200401 250500;
Seq 250501 300600;
Seq 300601 350700;
Seq 350701 400800;
Seq 400801 450900;
Seq 450901 501000;
Seq 501001 551100;
Seq 551101 601200;
Seq 601201 651300;
Seq 651301 701400;
Seq 701401 751500;
Seq 751501 801600;
Seq 801601 851700;
Seq 851701 901800;
Seq 901801 951900;
Seq 951901 1002000;
Seq 1002001 1052100;
Seq 1052101 1102200.
This ensures that each section of the pre-processed GenBank text
file
is properly directed to its respective GeneRecords field for further
extraction and visualization of each "Feature/Qualifier".
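The sequence fields follow a regular pattern: 22 consecutive ranges of 50,100 bp each, which together cover the maximum entry size of 1,102,200 bp. The Python sketch below computes these ranges and splits a sequence accordingly; it is an illustration of the arithmetic only, and the function names are hypothetical.

```python
CHUNK_SIZE = 50_100           # bases stored in each sequence field
MAX_SEQUENCE_BP = 1_102_200   # maximum entry size


def chunk_ranges(chunk_size=CHUNK_SIZE, total=MAX_SEQUENCE_BP):
    """Return the 1-based (start, end) range covered by each sequence field."""
    return [(start, min(start + chunk_size - 1, total))
            for start in range(1, total + 1, chunk_size)]


def split_sequence(sequence, chunk_size=CHUNK_SIZE):
    """Split a sequence into consecutive chunks of at most chunk_size bases."""
    return [sequence[i:i + chunk_size]
            for i in range(0, len(sequence), chunk_size)]
```

For example, a sequence of 120,000 bp would occupy the first three sequence fields, the third one only partially.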
The scripts at the core of the GeneRecords software are "FileMaker Pro"
scripts,
which in part also invoke "AppleScript"
language commands.
Bug reports:
Please report any bug or problem to:
pierluigi.strippoli@unibo.it
p.daddabbo@biologia.uniba.it