Bioinformatics analysis in genomics
The term "bioinformatics" is a recent one: it did not appear in the literature until 1991, and even then only in the context of the emerging practice of electronic publication. The current concept of bioinformatics is probably best described as the convergence of two technological revolutions: the explosive growth of biotechnology, matched by that of information technology [Boguski, 1998]. This convergence is illustrated by the fact that the size of the GenBank DNA database and the power of computers have doubled at about the same rate (every 18-24 months) for many years. Although the term is now very fashionable, many researchers had been building databases, developing algorithms and making biological discoveries through sequence analysis since the 1970s, long before anyone thought of giving these activities a specific label. When a field was named at all, many of the activities now classed as bioinformatics were placed under "molecular evolution".
The use of computing in genomics has taken on fundamental importance with the progress of the genome projects, which aim to determine the complete nucleotide sequence of the DNA of various species [Boguski, 1998]. This mass of data is a privileged raw material for the ab initio identification of potential coding sequences, the first step towards the discovery of genes. Another fundamental step was the creation of databases of partial sequences derived from messenger RNA, obtained by the automated analysis of large numbers of bacterial clones from cDNA libraries prepared from many different tissues in different species. These sequences (ESTs, expressed sequence tags; Boguski et al., 1993) are available for sequence analysis, which makes it possible to identify quickly the genome sequences that are expressed (genes) without having to deal with the "background" of extragenic sequences encountered in traditional genome projects. The availability of these databases, together with the evolution of the software tools needed to analyze them, quickly led to a new approach to gene identification in which computer data become the starting point for in vitro experiments (the "in silico" approach, so called because computer processors are made of silicon).
The basic operation in this approach is to compare sequences with each other, deducing and quantifying their mutual similarity [Altschul 1998]. "Similarity" is a purely descriptive term: it denotes a relationship between two sequences that is stronger than would be expected by chance. "Homology", more formally, denotes a common evolutionary origin of the sequences. In practice, homology may be inferred from the similarity relationships between sequences, although outside a formal biological model the descent from a common ancestral gene remains hypothetical. Programs that analyze the similarity between sequences rely, in short, on a score based on the number of substitutions, insertions and deletions needed to convert one sequence into the other; the programs differ in the criteria used for scoring. Currently, the most widely used program for sequence comparison is based on the BLAST algorithm [Altschul et al., 1997], a heuristic procedure that identifies similar sequences very quickly and also assigns a measure of statistical significance to each match found. This value (the "expect value", or "E" value) is the number of matches with an equal or higher similarity score that would be expected in that particular database by chance alone; the smaller it is, the more significant the match.
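The idea of a score built from substitutions, insertions and deletions can be illustrated with a minimal sketch. It computes a global alignment score by dynamic programming (the Needleman-Wunsch scheme), which is not the heuristic local search that BLAST performs; the function name and the match, mismatch and gap values are assumptions chosen only for illustration.

```python
def alignment_score(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score between sequences a and b.

    Illustrative scoring only: +1 per identical position, -1 per
    substitution, -2 per inserted or deleted position (assumed values).
    """
    # prev[j] is the best score for aligning the processed prefix of a
    # with b[:j]; initially the prefix of a is empty, so b[:j] is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # match or substitution
                prev[j] + gap,       # position present in a only (deletion)
                curr[j - 1] + gap,   # position present in b only (insertion)
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(alignment_score("GATTACA", "GATCACA"))
```

Changing the three scoring parameters changes which alignment is considered best, which is the sense in which programs "differ in the criteria used for scoring".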
Variants of BLAST support different kinds of comparison. With TBLASTN [Brenner 1998], for example, it is possible to start from the amino acid sequence of a known protein and look for similar sequences among the ESTs: the nucleotide sequences in the database are automatically translated according to the genetic code and compared with the protein query.
Alternatively, genomic DNA sequences, determined by
high-throughput procedures and made publicly available at every
subsequent finishing stage [Ouellette and Boguski, 1997], can be
compared with mRNA sequences (known, or obtained from the
analysis of ESTs) for a rapid determination of the genomic
structure of genes.
Conversely, DNA sequences can be compared with amino acid
sequences after being translated into all the protein products
they could possibly encode. Because the genetic code is read in
three-letter codons, any nucleotide sequence has six possible
translation frames (frames +1, +2 and +3 on the query strand and
-1, -2 and -3 on the complementary strand). The BLASTX ("blast x
6 frames") variant of BLAST can therefore give clues about the
protein-coding potential of a nucleotide sequence [Brenner 1998].
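The six translation frames can be made concrete with a short sketch. It uses the standard genetic code and assumes an unambiguous DNA sequence over A, C, G and T; the function names are hypothetical, and this is only the translation step, not a reimplementation of BLASTX.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translation(seq: str) -> dict:
    """Map frame number (+1..+3, -1..-3) to its conceptual translation."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])    # query strand
        frames[-(offset + 1)] = translate(rc[offset:])  # complementary strand
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translation("ATGGCCTAA").items()):
        print(f"{frame:+d}: {protein}")
```

Stop codons are written as `*`, so a frame that runs for a long stretch without one is a candidate open reading frame.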
The availability of databases for many different species also
makes it possible to reconstruct the molecular evolution of the
sequences of interest and to distinguish between orthology
(conservation of a particular gene between different species)
and paralogy (presence of a group of homologous genes within a
single species). Finally, there are many collections of short
stretches ("motifs") of amino acid sequence that indicate
particular structural or functional elements. Searching these
collections with newly identified sequences allows reasonably
reliable function predictions to be made [Bork and Gibson,
1996].
Bioinformatics analysis and gene families
DNA sequences in the nuclear diploid genome usually exist in the form of two allelic copies, located on the paternal and maternal homologous chromosomes. In addition to this degree of repetition, about 40% of the human nuclear genome is composed, both in haploid and diploid cells, of groups of closely related non-allelic DNA sequences (families of DNA sequences, or repetitive DNA; Strachan and Read, 1999). Within the considerable variety of repeated DNA sequences, there are also DNA sequence families whose individual members comprise functional genes (multigene families). The operational definition of a family of DNA sequences is the relatively high level of sequence similarity between members of the family, at the level of the whole sequence or its localized regions.
The members of a gene family can be identified by:
The fact that two members of a family of DNA sequences show a high degree of similarity is indicative of a common evolutionary origin and is typically related to the conservation of a function.
A large percentage of actively expressed human genes are members of families of DNA sequences; the PFam catalogue [Bateman et al., 2000], maintained at the Sanger Center (Hinxton, Cambridge, UK), classified 2,478 gene families in its 2001 version and 6,680 in its 2021 version. Different types of gene families can be distinguished. In "classical" gene families, members show a high degree of sequence similarity along most of the length of the genes or, at least, of their coding sequences. In practice this indicates an evolutionary and functional relationship between the sequences; the histone gene families are an example. In other gene families the similarity is particularly pronounced within highly conserved regions of the genes, while the remaining portions of the coding sequence may be only weakly similar. These families often encode transcription factors that play an important role in the early stages of development, and the conserved sequence encodes a protein domain (folding unit) required for selective binding to the DNA of specific target genes (e.g., the homeobox domain). Finally, there are gene families whose members are not obviously related at the DNA sequence level but encode products with a shared general function and conserved short stretches ("motifs") of sequence; for example, the DEAD "box" (amino acid sequence Asp-Glu-Ala-Asp) is found in different genes whose products all appear to act as RNA helicases. Members of a gene family are occasionally clustered in a specific subchromosomal region, as are the class I genes of the major histocompatibility complex (HLA), but are more often dispersed throughout the genome.
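Finding a short conserved motif such as the DEAD box in a protein sequence can be sketched as a simple string search. This is a deliberately minimal illustration: the function name is hypothetical, the example sequence is invented, and real motif collections generally describe motifs with degenerate patterns or profiles rather than a single literal string.

```python
import re


def find_motif(protein: str, motif: str = "DEAD") -> list:
    """Return the 1-based start positions of every occurrence of motif.

    A lookahead is used so that overlapping occurrences are reported too.
    The motif is treated as a literal string, not as a pattern.
    """
    pattern = re.compile(f"(?=({re.escape(motif.upper())}))")
    return [m.start() + 1 for m in pattern.finditer(protein.upper())]


if __name__ == "__main__":
    # Invented toy sequence, not a real helicase.
    print(find_motif("MKDEADLLDEADR"))
```

A hit for a four-residue motif is weak evidence on its own, since such a short string occurs by chance in many unrelated proteins; it is the combination with other conserved features that supports a function prediction.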
Many different groups have addressed the problem of grouping protein sequences into families [review in Hofmann, 1998]. The various approaches differ in their degree of automation, in their completeness, and in whether they focus on complete protein sequences or on protein domains. Indeed, because of the modular composition of proteins, the relationships between genes and gene families are so complex that "no simple hierarchical scheme can be used to make data easily understandable" [Henikoff et al., 1997].
Among the various tools specifically designed for the
reconstruction of gene families through the analysis of amino
acid sequences, PSI-BLAST and programs based on the statistical
method of hidden Markov models (HMMs) are of particular
importance. PSI-BLAST [Altschul et al., 1997] is an iterative,
profile-based search. First, a similarity search is performed on
a database starting from a single sequence, using BLAST. The
significantly similar sequences are aligned to the query
sequence, and a "profile" is constructed: a position-specific
scoring system derived from the frequency with which each amino
acid residue is observed in each column of the alignment. Since
families of sequences preferentially conserve specific residues
and critical regions, this information allows more sensitive
searches to be carried out in repeated rounds (iterations).
HMM-based programs, on the other hand, employ a statistical
method [for a review see Eddy, 1998a] for recognizing patterns
in a series of values (here, the sequence); the model can
represent an alignment of multiple sequences or sequence
segments and so capture the conservation of patterns or
individual residues.
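The profile construction described above for PSI-BLAST can be sketched as follows. This is a simplified illustration, not the PSI-BLAST procedure itself: it assumes a uniform background frequency of 0.05 for each of the 20 amino acids and a flat pseudocount, whereas the real program uses sequence weighting and more refined estimates. All function and parameter names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, background=0.05, pseudocount=1.0):
    """Build position-specific log-odds scores from aligned sequences.

    alignment: equal-length strings; '-' marks a gap and is ignored.
    Returns one dict per column mapping residue -> log2(observed/background).
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score_segment(profile, segment):
    """Sum the column scores for an ungapped segment of standard residues."""
    return sum(column[aa] for column, aa in zip(profile, segment))


if __name__ == "__main__":
    # Invented toy alignment, used only to exercise the functions.
    toy = ["DEAD", "DEAD", "DEVD", "DQAD"]
    prof = build_profile(toy)
    print(round(score_segment(prof, "DEAD"), 2))
    print(round(score_segment(prof, "KKKK"), 2))
```

A residue that dominates a column receives a positive score at that position and a rarely seen residue a negative one, so a segment resembling the family scores higher than an unrelated one; rebuilding the profile from each round's hits is what makes the search iterative.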
The main practical interest in studying human gene families lies
in obtaining indications of the probable function of a gene that
is similar to one already functionally characterized, possibly
making use of information obtained in model organisms of other
species. Despite the progress of large-scale genome sequencing
projects for different species, many of the new genes identified
to date have not been assigned to gene families. Quoting Hofmann
[1998], it can be concluded that "It might appear that
using a combination of domain database searches, BLAST
searches and sub-family classification is too much effort for
the analysis of a single sequence. However, if one considers
how many months of experimental work have been spent on the
identification of the protein and the determination of its
sequence, it might be worth a few extra hours of computing
time too".