Bioinformatics

The term "bioinformatics" is a recent coinage: it did not appear in the literature until 1991, and even then only in the context of the emerging practice of electronic publication. The current concept of "bioinformatics" is probably best described as the convergence of two technological revolutions: the explosive growth of biotechnology and the matching growth of information technology [Boguski, 1998]. This coincidence is illustrated by the fact that both the size of the GenBank DNA database and the computing power of computers have doubled at about the same rate (every 18-24 months) for many years. Although the term "bioinformatics" is now very fashionable, many scholars had been building databases, developing algorithms and making biological discoveries through sequence analysis since the 1970s, long before anyone thought of labelling these activities with a specific term. When a field was named at all, many of the activities listed today as bioinformatics were placed under the heading of "molecular evolution".

The use of computer technology in genomics became fundamentally important with the advance of the genome projects, which aim to determine the complete nucleotide sequence of the DNA of various species [Boguski, 1998]. This mass of data provides the raw material for the ab initio identification of potential coding sequences, the first step towards the discovery of genes. Another fundamental step was the creation of databases of partial sequences derived from messenger RNA, obtained by the automated analysis of large numbers of bacterial clones from cDNA libraries of many different tissues in different species. These sequences (EST, expressed sequence tags; Boguski et al., 1993) are available for sequence analysis, which makes it possible to quickly identify the expressed sequences of a genome (genes) without having to deal with the "background" of extragenic sequences, as happens in traditional "genome projects". The availability of these databases, on the one hand, and the evolution of the software tools needed to analyze them, on the other, quickly led to a new approach to gene identification: computer data became the starting point for in vitro experiments (the "in silico" approach, so called because computer processors are made of silicon).

The basic operation in this approach consists in comparing sequences with each other, deducing and quantifying their mutual "similarity" [Altschul 1998]. The term "similarity" is purely descriptive: it indicates a relationship between two sequences that is more significant than would be expected by chance. The term "homology", more formally, denotes a common evolutionary origin of the sequences. In practice, similarity relationships between sequences may allow homology to be inferred, even though, outside of a formal biological model, descent from a common ancestral gene remains hypothetical. Programs that analyze the similarity between sequences rely, in short, on a score derived from the number of substitutions, insertions and deletions needed to convert one sequence into the other. The programs differ in the criteria used for scoring. Currently, the most widely used program for sequence comparison is based on the BLAST algorithm [Altschul et al., 1997], a heuristic procedure that identifies similar sequences very quickly and also assigns a value of statistical significance to each match found. This value (the "expect value", or "E" value) corresponds to the number of matches with an equal or higher similarity score that would be expected to be found in that particular database by chance alone; the smaller it is, the more significant the match.
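
The idea of scoring two sequences by the substitutions, insertions and deletions that separate them can be illustrated with a minimal sketch in Python. It assumes that every operation costs one unit, which is a simplification made only for illustration: actual comparison programs use substitution matrices and gap penalties, and BLAST is a heuristic local search rather than this exhaustive calculation. The function name edit_distance is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions
    needed to convert sequence a into sequence b (unit costs assumed)."""
    # previous[j] is the distance between the prefix of a processed so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # converting a[:i] into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # Two made-up nucleotide sequences differing by one substitution.
    print(edit_distance("GATTACA", "GACTACA"))
```

A lower value means that fewer changes separate the two sequences; a similarity score of the kind described above rewards matches instead of counting differences, but rests on the same three operations.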

For example, with TBLASTN [Brenner 1998] it is possible to start from the amino acid sequence of a known protein and look for similar sequences among the ESTs: the program translates each nucleotide sequence in the database in all six reading frames according to the genetic code and compares the resulting amino acid sequences with the query.

Alternatively, genomic DNA sequences, determined by high-throughput procedures and made publicly available at every stage of finishing [Ouellette and Boguski, 1997], can be compared with mRNA sequences (known or obtained from the analysis of ESTs) for a rapid determination of the genomic structure of genes.
Conversely, DNA sequences can be compared with amino acid sequences after being translated into all the protein products they could possibly encode. Because the genetic code is read in triplets, any given nucleotide sequence has six possible reading frames (frames +1, +2 and +3 on the query strand and -1, -2 and -3 on the complementary strand). With the BLASTX variant of BLAST, which translates the query in all six frames, one can obtain clues about the protein-coding potential of a nucleotide sequence [Brenner 1998].
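
The six reading frames described above can be made concrete with a short Python sketch. It assumes the standard genetic code, marks stop codons with "*" and any codon containing an ambiguous base with "X", and numbers the negative frames from the first base of the reverse complement; the function names are hypothetical and the example sequence is made up.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict:
    """Map frame labels (+1, +2, +3, -1, -2, -3) to translated sequences."""
    dna = dna.upper()
    frames = {}
    for sign, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six translated sequences can then be compared with a protein database, which is the principle behind the translated searches mentioned above.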

The availability of databases for many different species also makes it possible to reconstruct the molecular evolution of the sequences of interest, and to distinguish between orthology (conservation of a particular gene between different species) and paralogy (presence of a group of homologous genes within a single species). Finally, there are many collections of short stretches ("motifs") of amino acid sequence that indicate particular structural or functional elements. Searching these collections with newly identified sequences allows reasonably reliable predictions of function to be made [Bork and Gibson, 1996].

DNA sequences in the nuclear diploid genome usually exist in the form of two allelic copies, located on the paternal and maternal homologous chromosomes. In addition to this degree of repetition, about 40% of the human nuclear genome is composed, both in haploid and diploid cells, of groups of closely related non-allelic DNA sequences (families of DNA sequences, or repetitive DNA; Strachan and Read, 1999). Within the considerable variety of repeated DNA sequences, there are also DNA sequence families whose individual members comprise functional genes (multigene families). The operational definition of a family of DNA sequences is the relatively high level of sequence similarity between members of the family, at the level of the whole sequence or its localized regions.

Accordingly, the members of a family can be identified by several methods:
DNA hybridization and cloning, using a gene fragment as a probe for the screening of genetic libraries;
cloning by amplification with the polymerase chain reaction (PCR), by designing degenerate "primers" that bind to the regions conserved among family members;
sequence analysis, which allows the direct calculation of the degree of relationship between the genes.

The fact that two members of a family of DNA sequences show a high degree of similarity is indicative of a common evolutionary origin and is typically related to the conservation of a function.

A large percentage of actively expressed human genes are members of families of DNA sequences; the Pfam catalogue [Bateman et al., 2000], maintained at the Sanger Center (Hinxton, Cambridge, UK), classifies 2,478 gene families in the version of 2001, and 6,680 in the version of 2021. Different types of gene families can be distinguished. In "classical" gene families, members show a high degree of sequence similarity along most of the length of the genes or, at least, of their coding sequences. In practice, this characteristic identifies an evolutionary and functional relationship between these sequences. An example is the histone gene families. In other gene families, however, the similarity is particularly pronounced within highly conserved regions of the genes, while the similarity between the remaining portions of the coding sequence can be very low. Often these families encode transcription factors that play an important role in the early stages of development, and the conserved sequence encodes a protein domain (folding unit) required for the selective binding of the factor to the DNA of specific target genes (e.g., the domain encoded by the homeobox). Finally, there are also gene families whose members are not obviously related at the DNA sequence level, but encode products characterized by a shared general function and by the presence of short conserved stretches ("motifs") of sequence; for example, the DEAD "box" (amino acid sequence Asp-Glu-Ala-Asp) is found in different genes whose products all seem to work as RNA helicases. Members of gene families can occasionally be located close to each other in specific subchromosomal regions, such as the class I genes of the major histocompatibility complex (HLA), but are more often found dispersed in the genome.
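
As an illustration of how a short motif such as the DEAD box mentioned above might be located in a protein sequence, the following Python sketch reports every position at which an exact copy of the motif occurs. It is a deliberately simple example: the function name find_motif and the test sequence are hypothetical, and real motif collections generally describe motifs with more flexible patterns than an exact four-residue string.

```python
def find_motif(protein: str, motif: str = "DEAD") -> list:
    """Return the 1-based start positions of every exact occurrence
    of motif in protein (one-letter amino acid code)."""
    protein = protein.upper()
    motif = motif.upper()
    positions = []
    start = protein.find(motif)
    while start != -1:
        positions.append(start + 1)  # report positions counting from 1
        start = protein.find(motif, start + 1)
    return positions


if __name__ == "__main__":
    # Made-up sequence containing two copies of the motif.
    print(find_motif("MKDEADLLGDEADK"))
```

Finding the motif is only a hint of a shared function; as the text notes, the genes carrying it need not be otherwise similar at the sequence level.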

Many different groups have addressed the problem of grouping protein sequences into families [review in Hofmann, 1998]. The various approaches differ in their degree of automation, in their completeness, and in whether they focus on complete protein sequences or on protein domains. Indeed, because of the modular composition of proteins, the relationships between genes and gene families are so complex that "no simple hierarchical scheme can be used to make data easily understandable" [Henikoff et al., 1997].

Among the tools specifically designed for reconstructing gene families through the analysis of amino acid sequences, PSI-BLAST and programs based on the statistical method of hidden Markov models (HMM) are of particular importance. PSI-BLAST [Altschul et al., 1997] performs an "iterative profile-based search". First, a similarity search is run on a database starting from a single sequence, using BLAST. The significantly similar sequences are aligned to the query sequence and a "profile" is constructed: a position-specific scoring system derived from the frequency with which each amino acid residue is observed in each column of the alignment. Since families of sequences preferentially conserve specific residues and critical regions, this information allows more sensitive searches to be carried out in repeated rounds (iterations). HMM-based programs, on the other hand, employ a statistical method [for a review see Eddy, 1998a] for recognizing patterns in a series of values (the sequence); it can be used to represent the alignment of multiple sequences or sequence segments and to identify conserved patterns or individual residues.
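
The construction of a profile from the columns of an alignment can be sketched in Python as follows. This is only an illustration of the principle, not the PSI-BLAST procedure: it assumes an ungapped alignment of equal-length segments, a uniform background frequency for the 20 amino acids and a simple pseudocount, whereas the actual program uses more elaborate sequence weighting and frequency estimates. The names build_profile and score_segment and the example alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build one log-odds score table per alignment column.

    alignment: list of equal-length aligned sequences (one-letter code).
    Returns a list of dicts mapping each amino acid to its column score.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_segment(profile, segment):
    """Sum the position-specific scores of a segment; unknown residues score 0."""
    return sum(column.get(res, 0.0) for column, res in zip(profile, segment))


if __name__ == "__main__":
    # Made-up alignment of four short segments.
    profile = build_profile(["DEAD", "DEAD", "DEAE", "DQAD"])
    print(score_segment(profile, "DEAD"), score_segment(profile, "KKKK"))
```

Residues that are frequent in a column receive positive scores and rare ones negative scores, so a segment resembling the family scores higher than an unrelated one; in an iterative search the profile would be rebuilt after each round from the newly found sequences.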

The main practical interest of studying human gene families lies in obtaining indications of the probable function of a gene that is similar to one already functionally characterized, possibly making use of information obtained in model organisms of other species. Despite the progress of large-scale genome sequencing projects in different species, many of the new genes identified to date have not been assigned to gene families. Quoting Hofmann [1998], it can be concluded that "It might appear that using a combination of domain database searches, BLAST searches and sub-family classification is too much effort for the analysis of a single sequence. However, if one considers how many months of experimental work have been spent on the identification of the protein and the determination of its sequence, it might be worth a few extra hours of computing time too".