GeneRecords 1.0
Guide (Mac OS version)
 

INTRODUCTION

This online Guide is designed for detailed documentation of
GeneRecords 1.0 software.

A quick, illustrated Tutorial teaching how to install the software
and how to import the desired GenBank entries is available.

--

GeneRecords is a solution for parsing GenBank biological flat files:
it implements a structured representation of each "feature" and
"feature qualifier" in GenBank, following import into a common
database management system usable on a personal computer
(Macintosh and Windows environments).

This collection of related databases enables the local management of
GenBank records, allowing indexing, retrieval and analysis of both
information and sequences on a personal computer.

Minimal requirements for use are:
Macintosh OS 8.1 or later, 9.x
Macintosh OS X with Classic
Windows OS 95/98 with Internet Explorer 4.0 or later,
           NT 4.0 with Service Pack 3, 2000.
 

Download GeneRecords 1.0 for Mac OS 9 or Mac OS X from address:
http://apollo11.isto.unibo.it/software/GeneRecords_1.0/GeneRecordsMacOS9.sit
http://apollo11.isto.unibo.it/software/GeneRecords_1.0/GeneRecordsMacOSX.sit

The downloaded file should be automatically decompressed,
generating a "GeneRecords" folder:
"GeneRecords 1.0 Mac OS 9" or
"GeneRecords 1.0 Mac OS X" folder, respectively.

If this does not happen, decompress the file with the
"Stuffit Expander" utility.

The GeneRecords Folder contains:
GeneRecords file (runtime application),
42 related subdatabases (.DNA files),
the "SelfReplace GeneRecords Filter" helper application along with its
    "SelfReplace Pref" preference document,
the LocusLink database formatted for GeneRecords relationships.
The "Fasta extractor" is a utility to generate sequences in FASTA format.
The Mac OS X version also includes the "Open Classic" application
    (see below).
The "MacOS_Tutorial" and "MacOS_Guide" folders contain a copy
    of the on-line documentation, for local (off-line) use.

GeneRecords is based on FileMaker Pro 6 (FileMaker Pro, Inc.)
database management software (www.filemaker.com/index.html).

It is released as a set of related FileMaker Pro templates that will import
any set of GenBank-formatted entries into a local database,
including a free runtime application to run "FileMaker Pro"
at the core of the software.
The GeneRecords database imports GenBank flat file data sources directly,
after automatic pre-formatting by a text filter, and at the same time
generates a relational database with 42 files containing the parsed data,
making them available for complete search and analysis. The system
implements an independent representation of each feature and its
respective qualifiers in GenBank within a common database management
system usable on a personal computer (Macintosh™ and Windows™ environments).

The master file of GeneRecords collection is the file "Records"
(extension is .DNA).
Clicking on the "GeneRecords" application causes the opening of
"Records.DNA",
which readily opens all the related subdatabases.

The "Records.DNA" layout functions as the GeneRecords package main view,
and it is a way to access all subdatabases that constitute the package.

It is organized in a GenBank entry-specific way:
each database record in "Records.DNA" contains data from one GenBank entry.
 

The software package also contains the related subdatabases listed here
(names are the same as in the official GenBank features specifications):

CDS; ClipX; Conflict; D_loop; Exon; Gene; Hyphen; iDNA;
Intron; LTR; Mat_peptide; Misc_binding; Misc_difference;
Misc_feature; Misc_recomb; Misc_structure; Modified_base; mRNA;
Old_sequence; PolyA_signal; PolyA_site; Primer_bind; Protein_bind; RBS;
Reference; Region_X; Rep_origin; Repeat_region; Repeat_unit; RNA_X;
Satellite; Segment_X; Sig_peptide; Signal_X; Source; Stem_loop;
STS; Transit_peptide; Unsure; UTRX; Variation.

Please note that some features of the same type will be imported
in the same subdatabase; in particular:
the "ClipX" database includes the features: 5'clip, 3'clip;
the "UTRX" database includes the features: 5'UTR, 3'UTR;
the "Region_X" database includes the features: C_region, S_region, V_region;
the "RNA_X" database includes the features:
     misc_RNA, precursor_RNA, prim_transcript,
     rRNA, scRNA, snRNA, snoRNA, tRNA;
the "Segment_X" database includes the features:
     V_segment, D_segment, J_segment, N_region;
the "Signal_X" database includes the features:
     attenuator, CAAT_signal, enhancer, GC_signal, misc_signal,
     promoter, TATA_signal, terminator, -10_signal, -35_signal.
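
The grouping above can be read as a simple lookup table. The following
Python sketch is only an illustration (GeneRecords itself is built on
FileMaker Pro scripts, not on Python); the names GROUPED_FEATURES and
grouped_subdatabase are hypothetical and are not part of the software.

```python
# Grouped subdatabases and the GenBank feature keys they collect,
# transcribed from the list above.
GROUPED_FEATURES = {
    "ClipX": ["5'clip", "3'clip"],
    "UTRX": ["5'UTR", "3'UTR"],
    "Region_X": ["C_region", "S_region", "V_region"],
    "RNA_X": ["misc_RNA", "precursor_RNA", "prim_transcript",
              "rRNA", "scRNA", "snRNA", "snoRNA", "tRNA"],
    "Segment_X": ["V_segment", "D_segment", "J_segment", "N_region"],
    "Signal_X": ["attenuator", "CAAT_signal", "enhancer", "GC_signal",
                 "misc_signal", "promoter", "TATA_signal", "terminator",
                 "-10_signal", "-35_signal"],
}

# Reverse index: feature key -> grouped subdatabase name.
FEATURE_TO_GROUP = {
    feature: group
    for group, features in GROUPED_FEATURES.items()
    for feature in features
}


def grouped_subdatabase(feature_key):
    """Return the grouped subdatabase for a feature key, or None if the
    feature is not part of one of the six groups listed above."""
    return FEATURE_TO_GROUP.get(feature_key)
```

For example, grouped_subdatabase("tRNA") gives "RNA_X", while a feature
with its own dedicated subdatabase (such as "exon") gives None.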
 

METHOD

First, the detailed description of the GenBank flat file format
(ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
was carefully analyzed for:

1. identification of characters usable as consistent limits
   for each data type;

2. conversion of the flat file format into a series of multiple
   related tables, allowing the appropriate import for each data type.

Our strategy is thus based on treating the downloaded, expanded data
file with a fast text conversion utility that finds, changes or inserts
control characters (e.g. it inserts carriage returns and tabs at the
desired points, to mark the end of each record and of each field within
a record, respectively; the list of changes is provided as on-line
supplementary material).
This step is performed by fully automated text filtering
(we have incorporated in the software a SelfReplace application that
performs all the character changes automatically), using a provided
replacement filter based on invariant features of the established
GenBank flat file format.
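
The idea of the pre-formatting step can be sketched in Python as follows.
This is not the SelfReplace filter and does not reproduce its replacement
list; it is a minimal illustration under stated assumptions: each entry
ends with a "//" line, the entry is cut at the "FEATURES" and "ORIGIN"
keywords, and line breaks inside a field are replaced by a placeholder
character (inner_break, a hypothetical parameter). The function names are
hypothetical as well.

```python
def entry_to_record(entry_text, inner_break="\x0b"):
    """Turn one GenBank entry into a single tab-delimited line.

    Assumptions (not taken from the actual GeneRecords filter): the entry
    is cut at the "FEATURES" and "ORIGIN" keywords, and line breaks inside
    a field are replaced by `inner_break`.
    """
    header, features, sequence = [], [], []
    target = header
    for line in entry_text.splitlines():
        if line.startswith("FEATURES"):
            target = features
        elif line.startswith("ORIGIN"):
            target = sequence
            continue  # the ORIGIN keyword line itself carries no bases
        target.append(line)
    # Sequence lines look like "        1 gatcctccat ..."; keep letters only.
    bases = "".join(ch for line in sequence for ch in line if ch.isalpha())
    return "\t".join([inner_break.join(header),
                      inner_break.join(features),
                      bases])


def flat_file_to_records(text):
    """Split a GenBank flat file on the "//" terminator lines and return
    one tab-delimited record (header, features, sequence) per entry."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            if current:
                records.append(entry_to_record("\n".join(current)))
            current = []
        else:
            current.append(line)
    return records
```

In this sketch the tab separates the fields of one record and each list
item stands for one record (one GenBank entry), mirroring the role of the
tabs and carriage returns described above.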

At this point the file is ready to be imported into the appropriate text
fields (e.g. "NM Entry", "Features", "Seq 1 50100") of GeneRecords.
In this step, the master file is filled in with the entry set data
("raw data" layout), using the appropriately inserted tab characters
as field delimiters. Each GenBank entry is imported as one record.

Finally, GeneRecords will automatically parse the data using calculated
fields that extract each single data type from the text fields.
The whole import process is driven by FileMaker Pro (FMP) scripts launched
by clicking on the "Import data" button.
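
As an illustration of what "extracting each data type" means for the
Features text, the Python sketch below splits a GenBank feature table into
feature keys, locations and qualifiers. It is not the FileMaker Pro
calculation used by GeneRecords; it assumes the standard flat file layout
(feature key starting at column 6, location and qualifiers starting at
column 22), and the function name split_features is hypothetical.

```python
def split_features(features_text):
    """Return a list of (key, location, qualifiers) tuples.

    Assumes the standard GenBank layout: feature key in columns 6-20,
    location and "/qualifier" lines starting at column 22.
    """
    features = []
    for line in features_text.splitlines():
        if line.startswith("FEATURES") or not line.strip():
            continue
        key = line[5:20].strip()
        rest = line[21:].strip()
        if key:
            # A new feature starts: key plus (first part of) its location.
            features.append([key, rest, []])
        elif not features:
            continue
        elif rest.startswith("/"):
            # A new qualifier of the current feature.
            features[-1][2].append(rest)
        elif features[-1][2]:
            # Continuation line of the last qualifier value.
            features[-1][2][-1] += " " + rest
        else:
            # Continuation line of a long location.
            features[-1][1] += rest
    return [(key, loc, quals) for key, loc, quals in features]
```

Each tuple corresponds to one feature occurrence, i.e. to what becomes one
record in the respective subdatabase, with the qualifiers as its fields.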

At the end of the import step, several kinds of subscripts create
a new record in the related subdatabases for each feature reported in any
entry; these records contain only the accession number and a progressive
number.
These data create the relationships between each subdatabase and the
master file, allowing the visualization of the data (read from
Records.DNA) within specific fields in each subdatabase.
Each feature of the entry is visualized in a dedicated subdatabase.
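
The linking records described above can be pictured with this short Python
sketch. It is only a model of the idea: the function name
build_link_records and the accession "NM_000001" are hypothetical, and the
counter is assumed to restart at 1 for every feature type within an entry
(the actual numbering scheme of the GeneRecords scripts may differ).

```python
from collections import defaultdict


def build_link_records(accession, feature_keys):
    """Return {feature_key: [(accession, n), ...]} for one entry.

    `feature_keys` lists the feature keys found in the entry, in order.
    Assumption: the progressive number n restarts at 1 for every
    feature type.
    """
    links = defaultdict(list)
    for key in feature_keys:
        n = len(links[key]) + 1
        links[key].append((accession, n))
    return dict(links)


links = build_link_records("NM_000001", ["source", "gene", "exon", "exon"])
```

Here links["exon"] holds two (accession, number) pairs, one per exon of
the entry; the pair is all that is needed to relate a subdatabase record
to its entry in the master file.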

In the master file, different layouts are accessible from the "Layout
Menu" (a pop-up Menu in the top left corner, above the small book icon).
Each layout contains fields within a "portal", the FileMaker Pro tool for
construction of relational databases.
Each field visualizes the related subdatabase content for each feature.

It is possible to visualize these layouts also by clicking on the
"To Features" button, and then on the desired green button near each
feature name.

--
The included free FMP runtime allows record management and browsing,
while creating new fields for further processing or defining further
relationships requires the installation of the FMP application.
We encourage any creative use, modification and noncommercial
redistribution of GeneRecords, as long as the original paper is cited
and a statement is provided that the original program has been modified
(if that is the case).

The availability of complete GenBank datasets in a relational database
format allows the easy integration with other biological databases
available in the same or similar format; for example,
Unigene (a collection of EST clusters) and LocusLink (a table of
correlations among different types of biological data) can be easily
imported into an FMP template.
Our release includes the integration with LocusLink, with the respective
GeneOntology tags.
 

INSTALLATION

Put the file "SelfReplace Pref" contained in the GeneRecords folder into the "Preferences" folder of the "System Folder".

Note:
Mac OS 9: "System Folder" is the name of the standard system folder
Mac OS X: "System Folder" is the name of the folder containing the Mac OS 9
          "Classic" compatibility environment.
          "Classic" is required for running GeneRecords;
          check that it is open by double-clicking on the "Open Classic.app"
          file included in the "GeneRecords" folder.
 

GENERAL DEFINITIONS

File
One set of records pertaining to the same subject.
GeneRecords is a set of 42 related files.
It includes a "master" GeneRecords file (Records.DNA),
    related to 41 files (subdatabases).
Each subdatabase corresponds to one (or more) specific
    "GenBank Feature", except "Fasta extractor.DNA"
    (a utility to convert sequence data into FASTA format).
A file with "LocusLink" data is also provided.

Record
One set of fields which constitute one entry.
One record contains data about one GenBank entry.
The record browser is a small book icon
    at the top left of the window.
You may browse the database by clicking on the book pages,
    or enter a record number and press the "Return" key.
The following information is always displayed:
    Records: total number of Records in the database
    Found: total number of Records currently selected
    Sorted: sorting status of the Records (Sorted/Unsorted)

Field
One area of the record containing a specific data type.
In the subdatabases,
each field corresponds to a specific "GenBank Feature Qualifier".

Browse Mode
A mode to use the database.
It allows data entry, viewing, browsing, sorting, manipulation.
It may be selected:
from the green "Browse mode" button on the window, or
from the "View" menu, or
from the mode pop-up Menu bar, at the bottom left of the window.

Find Mode
An alternative mode to use the database.
It allows searching for specific content in the databases fields,
using any different combination of criteria
      (see the "Search mode" section below for details about searching).
It may be selected:
from the green "Find mode" button on the window, or
from the "View" menu, or
from the mode pop-up Menu bar, at the bottom left of the window.

Preview Mode
An alternative mode to use the database.
It visualizes a print preview of the found records.
It may be selected:
from the "View" menu, or
from the mode pop-up Menu bar, at the bottom left of the window.

Layout
A particular graphical organization of the database fields.
A file may show data from within different layouts.
Visualization of a field is independent from the
                   memorization of the data it contains.
The user can navigate among the different layouts of a file with the
pop-up Menu (Layout Menu),
             clicking on the bar at the top of the book icon.
 

USE

1. Download GenBank entries
Download a set of desired GenBank entries
(GenBank entry sequence allowed maximum size: up to 1,102,200 bp)
from any database which supports GenBank format
(GenBank "CON" division, i.e. NT_ entry code, is not supported).

There are two usual alternatives:

I. Querying GenBank via World Wide Web at:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Limits&db=Nucleotide

Users have to save the obtained entry set as follows:
choose "GenBank" format from the pop-up Menu
                        at the right of the "Display" button,
choose "File" from the pop-up Menu
                        at the right of the "Send to" button,
then click on the "Send to" button and choose the "Save" button
                        in the next dialog box.

II. Downloading large data sets via ftp at:
ftp://ftp.ncbi.nih.gov/genbank/
(decompress the files when appropriate)

At the end of this step, the users should have a text file
in GenBank flat file format,
containing the sequences to be imported into the GeneRecords database.

2. Import GenBank Entries

Open the "GeneRecords" file in the "GeneRecords 1.0" folder.

This action will open the file "Records.DNA", which represents the
"master" file of the program, linked and related to the
feature-specific subdatabases.

Advanced use:
You may open the program files using your own copy of FileMaker Pro,
which enables you to make any modification to the software.
In this case, don't open the program using the "GeneRecords" file,
but open the master file "Records.DNA" with your FileMaker Pro.
Following modifications, correct functioning of the program requires
that you relaunch it through the "GeneRecords" runtime, because of the
data pathway structure stored in the "GeneRecords" scripts.

Click on the "Import data" button and follow the instructions in the dialog box
(choose the file to be imported).

If the previously imported records have to be deleted,
choose "Erase sub-databases" from "Actions" Menu.

This step may require a long time,
depending on the size of the original data file.
A set of 42 related files will be automatically updated with the new data,
with each type of information imported in the appropriate file/field.

The actual number of imported records is shown in the left side of the
Records.DNA window.

Note that each GenBank entry is imported as a single record.

You may adjust the layout appearance using "Zoom In"/"Zoom out" buttons,
or clicking on the small resizing buttons at the bottom left corner of any window.
 

The "Layout Menu" is a pop-up Menu in the top left corner,
above the small book icon.
Within one database file, separate layouts may be provided.
Layouts determine how data is displayed.
Changes in a field present on a layout are reflected in the same field
on all the layouts in the database.
 

Each "feature" (e.g., exon) of the entry is visualized in
a dedicated subdatabase (e.g., "Exon.DNA" file),
while each field of the subdatabase corresponds to a "Feature Qualifier"
according to GenBank Format (e.g., "exon number").

You can move among different subdatabases,
clicking on the "To the features" button, then clicking on a single feature.

You may also choose a particular subdatabase from the "Window" Menu.

You may visualize and search the content of each subdatabase also
from within the master file,
clicking on the "To Features" button,
and then on the desired green button near each feature name.

Switching among different features can be finally made also by the
"Layout Menu".

Please remember that some features of the same type will be imported
in the same subdatabase; in particular:
the "ClipX" database includes the features: 5'clip, 3'clip;
the "UTRX" database includes the features: 5'UTR, 3'UTR;
the "Region_X" database includes the features:
     C_region, S_region, V_region;
the "RNA_X" database includes the features:
     misc_RNA, precursor_RNA, prim_transcript,
     rRNA, scRNA, snRNA, snoRNA, tRNA;
the "Segment_X" database includes the features:
     V_segment, D_segment, J_segment, N_region;
the "Signal_X" database includes the features:
     attenuator, CAAT_signal, enhancer, GC_signal, misc_signal,
     promoter, TATA_signal, terminator, -10_signal, -35_signal.
 

BROWSE MODE (NAVIGATION)

The FileMaker Pro based database may be used basically in these "modes":
"Browse", "Find" and "Preview".
Switching among different modes can be made from the "View" Menu.

In the "Browse" mode,
you can browse the record set by clicking on the small book icon
in the upper left corner.

In the GeneRecords database the user will find four types of coloured buttons:

The RED buttons open a window of the default web browser,
to show the related site on the Internet.

The GREEN buttons switch among different layouts
(i.e., different visualization modes) of the same database file.

The GRAY buttons activate a predefined instruction.

The BLUE buttons link to the related data
stored in the specific GeneRecords subdatabase.
 
 

ACCESS TO SEQUENCE DATA

The nucleotide sequence of each feature for each entry is provided
in the respective subdatabase.
Sequences are split into chunks of 50,100 bp.

To view the sequence of a desired feature,
click on the "To the sequence" button in the respective subdatabase.

In some cases, the sequence is not immediately visualized in the subdatabases.

To visualize the actual sequence chunk data
in the current record of subdatabase,
you should click on the button "Click to extract",
or choose the command "Extract this sequence" from the Menu "Actions".

The command "Extract all sequences of the found entries set"
(from the Menu "Actions")
will extract the feature sequence for all records in the current found set.

A FASTA format sequence export function is accessible in each
feature subdatabase, from the "Sequence" layout,
clicking on the "Export fasta file" button.
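
For readers unfamiliar with the FASTA format, the Python sketch below shows
what such an export produces: a header line starting with ">" followed by
the sequence wrapped on several lines. It is an illustration only; the
header content actually written by GeneRecords is not documented here, and
the function name to_fasta, the identifier "NM_000001_exon_1" and the line
width of 60 are assumptions.

```python
def to_fasta(identifier, sequence, description="", width=60):
    """Format one sequence as a FASTA record: a ">" header line followed
    by the sequence wrapped at `width` characters per line."""
    header = f">{identifier} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


# Hypothetical example: an 80-base feature sequence.
record = to_fasta("NM_000001_exon_1", "acgt" * 20)
```

With the default width, the 80-base example is written as a header line,
one line of 60 bases and one line of 20 bases.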
 
 

SEARCH ("FIND") MODE

Switching among different modes can be made from the "View" Menu,
or by clicking on the "Find mode" green button in the database window.

In the "Find" mode, the small book icon in the upper left corner
represents different "requests" that are made for searching
in the database.

In the "Find" mode,
the user can fill a blank form allowing searching in specific fields,
and by moving among the different layouts,
very complex searches can be made by
combining searches in different subdatabases
(each corresponding to a feature)
from within the master file "Records.DNA"
(which is dynamically related to the content of each subdatabase).

You can move among different layouts:
using the Layout (pop-up) Menu (Layout Menu),
          clicking on the bar at the top of the book icon; or:
using the green buttons available in the "Features" layout,
          clicking on the "To features" green button.

When searching in the master ("Records.DNA") database,
if one entry contains several occurrences of a feature,
all related records of the respective feature subdatabase
are displayed in the master database corresponding layout.

In FileMaker Pro "Find" mode, the "AND" - "OR" - "NOT" operators may
be used in a search in this way:

"AND" by filling in different fields located in the same "Request",

"OR"  by generating additional requests
      (from "Requests" Menu) in the same query,

"NOT" by generating additional requests (from "Requests" Menu)
      and checking the "Omit" box.
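
The way requests combine can be modelled with a few lines of Python. This
is a simplified model, not FileMaker Pro code: matching is reduced to a
case-insensitive substring test, requests are assumed to be applied in the
order in which they were created, and the field names used in the example
("Organism", "Gene") are hypothetical.

```python
def matches(record, criteria):
    """AND: every filled-in field of one request must match
    (case-insensitive substring test, a simplification)."""
    return all(str(value).lower() in str(record.get(field, "")).lower()
               for field, value in criteria.items())


def find(records, requests):
    """Apply the requests in order. Each request is (criteria, omit).

    OR:  a normal request adds its matches to the found set.
    NOT: a request with the "Omit" box checked removes its matches
         from the found set built so far.
    """
    found = []
    for criteria, omit in requests:
        if omit:
            found = [r for r in found if not matches(r, criteria)]
        else:
            found += [r for r in records
                      if matches(r, criteria) and r not in found]
    return found


records = [
    {"Organism": "Homo sapiens", "Gene": "SOD1"},
    {"Organism": "Mus musculus", "Gene": "Sod1"},
    {"Organism": "Homo sapiens", "Gene": "APP"},
]
# Human entries, but NOT those for the APP gene:
human_not_app = find(records, [({"Organism": "Homo"}, False),
                               ({"Gene": "APP"}, True)])
```

In the example, the first request selects the two human entries and the
second ("Omit") request removes the APP entry from that set.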

The "Symbols" pop-up Menu in the "Find" mode allows query of
ranges, duplicates, wildcards and so on.

Each feature subdatabase can be also individually searched,
after selection from Menu "Window".

The search results are subsets of entries matching the desired criteria.

The "Find record" script looks for a required corresponding record group
(features) in the specific databases;
if one entry contains several occurrences of a feature,
all related records of the subdatabase are displayed.
 
 

GENERECORDS FUNCTIONS AND MENU COMMANDS
 

FILE MENU

Page setup
Standard page set up command.

Print
Standard print command; you can choose to print:
    all records in the "Found" set, or
    only the current record, or
    a "blank" mask of the record fields.
The appearance will be that of the layout
    currently selected from the layout Menu.

Import Records
This is the general "Import" function of FileMaker Pro.
Use only "Import data" function
    for correct GenBank file import, from the "Actions" Menu, or
    clicking on the "Import data" button in "Records.DNA" file.

Export records
Export command for the found records set.
User can choose fields to be exported,
their order and the file format.

Save a copy as
Save a copy of the database, complete, compressed or
as a clone (database structure with no record present).
 

EDIT MENU

Undo
Standard "Undo" command.

Cut
Standard "Cut" text command.

Copy
Standard "Copy" text command.

Paste
Standard "Paste" text command.

Clear
Deletion of selected text.

Select all
Selection of all the text within a selected field
(to select a field, click into the field).

Find/Replace
Utility for search/replace text strings within fields.
Note: Use "Find" mode (from "View" Menu)
      for full search and selection of a record set.
 

VIEW MENU

Browse Mode
Switch to the "Browse Mode" (see "General Definitions" above).

Find Mode
Switch to the "Find Mode" (see "General Definitions" above).

Preview Mode
Switch to the "Preview Mode" (see "General Definitions" above).
 

RECORD MENU

New Record
Create a new empty record in the database.
The new Record will be the last of the current record set.

Duplicate Record
Duplicate the current record in the database.

Delete Record
Delete the current record in the database.

This will delete all the corresponding linked records
     in the related subdatabases.

Delete All Records
Delete all records in the database.

Modify Last Find
Return to the last performed search to edit it.

Show All Records
Show all the records in the database.

Omit Record
Remove a record from the found record set, without deleting it.

Show Omitted
Show the records in the database that have been omitted.

Sort...
Sort the current records set according to desired criteria.
 

ACTIONS MENU

New record: Example
Create a new sample record of GeneRecords.

Export data of interest
Equivalent to the "Export" command in the "File" Menu.

Print found records
Equivalent to the "Print" command in the "File" Menu.

Import Data
Import from a file with GenBank format entries
(equivalent to the "Import data" button in the software window).
 

WINDOW MENU

List of the databases in use.
A specific database may be selected from the Menu,
  bringing it to the front.
 

---

Software limits:

Maximum size of GenBank entry to be imported: up to 1,102,200 bp.

Maximum size of GenBank "Features" section for each entry,
allowing a correct Features splitting:
64,000 characters following text processing.
Entries with a larger "Features" section will be processed,
but the splitting of the features in the subdatabases could be incomplete.

A single GeneRecords database may store up to 2 Gbyte
(a physical limit of the core database).

The CON division of GenBank contains data for joining other sequences,
and it cannot be imported.
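
The limits above could be checked before import with a small script such
as the following Python sketch. It is not part of GeneRecords; the function
name check_entry is hypothetical, the Features character count must be
supplied by the caller (the limit applies after text processing), and the
CON division is assumed to be recognizable as the token "CON" on the LOCUS
line of the entry.

```python
MAX_SEQUENCE_BP = 1_102_200    # maximum importable entry size (bp)
MAX_FEATURES_CHARS = 64_000    # "Features" limit after text processing


def check_entry(locus_line, sequence_length, features_chars):
    """Return a list of warnings for one entry; an empty list means the
    entry is within the limits listed above."""
    warnings = []
    if sequence_length > MAX_SEQUENCE_BP:
        warnings.append("sequence longer than 1,102,200 bp")
    if features_chars > MAX_FEATURES_CHARS:
        warnings.append("Features section over 64,000 characters: "
                        "feature splitting may be incomplete")
    # Assumption: the division code appears as a separate token
    # on the LOCUS line.
    if "CON" in locus_line.split():
        warnings.append("CON division entries cannot be imported")
    return warnings
```

An entry that returns an empty list can be expected to import fully; any
warning points to one of the limits described in this section.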
 
 

Supplementary information:

Detailed explanation of the GenBank Flat File format may be found at:
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

The definition and explanation of each GenBank Feature may be found at:
http://www.ncbi.nlm.nih.gov/projects/collab/FT/index.html

The list of character replacement is provided.
It is interpreted by the SelfReplace application (Guoniu Han),
which we have incorporated in the software using the "Developer" version.

General information about the FileMaker core functions may be found at:
http://www.strath.ac.uk/CC/Courses/FilemakerPro/filemaker.html
http://www.wellesley.edu/Computing/Filemaker/filemaker4_tutorial.html

We enclose the LocusLink database, maintained at NCBI:
http://www.ncbi.nlm.nih.gov/LocusLink/
http://research.nhgri.nih.gov/microarray/downloadable_cdna.html
as an example of creating a relationship among different databases
using GeneRecords.
 

Technical notes:

The fields listed below are placed in this order in the right column
of the "Import" dialog box, each preceded by an arrow in the "Map." column:
     Gene Entry;
     Features;
     Seq 1 50100;
     Seq 50101 100200;
     Seq 100201 150300;
     Seq 150301 200400;
     Seq 200401 250500;
     Seq 250501 300600;
     Seq 300601 350700;
     Seq 350701 400800;
     Seq 400801 450900;
     Seq 450901 501000;
     Seq 501001 551100;
     Seq 551101 601200;
     Seq 601201 651300;
     Seq 651301 701400;
     Seq 701401 751500;
     Seq 751501 801600;
     Seq 801601 851700;
     Seq 851701 901800;
     Seq 901801 951900;
     Seq 951901 1002000;
     Seq 1002001 1052100;
     Seq 1052101 1102200.
This ensures that each section of the pre-processed GenBank text file
is properly directed to its respective GeneRecords field for further
extraction and visualization of each "Feature/Qualifier".
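
The sequence field boundaries follow directly from the chunk size of
50,100 bp: field number i covers positions (i-1)*50100+1 to i*50100, and
22 fields give the maximum entry size of 1,102,200 bp. The Python sketch
below recomputes the names and splits a sequence accordingly; it is an
illustration only, and the function names are hypothetical.

```python
CHUNK_SIZE = 50_100   # bases per sequence field
N_CHUNKS = 22         # 22 * 50,100 = 1,102,200 bp maximum


def sequence_field_names(chunk_size=CHUNK_SIZE, n_chunks=N_CHUNKS):
    """Return the field names "Seq <start> <end>" in import order."""
    return [
        f"Seq {i * chunk_size + 1} {(i + 1) * chunk_size}"
        for i in range(n_chunks)
    ]


def split_sequence(sequence, chunk_size=CHUNK_SIZE):
    """Cut a sequence string into consecutive chunks of `chunk_size`
    bases; the last chunk may be shorter."""
    return [sequence[i:i + chunk_size]
            for i in range(0, len(sequence), chunk_size)]
```

For instance, a sequence of 120,000 bases would fill the first two fields
completely and place the remaining 19,800 bases in the third one.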

The scripts at the core of GeneRecords software are "FileMaker Pro" scripts,
which in part also invoke "AppleScript" language commands.
 

Bug reports:

Please report any bug or problem to:
pierluigi.strippoli@unibo.it
p.daddabbo@biologia.uniba.it