SEQUENCE SEARCHING OF PROTEINS AND NUCLEIC ACIDS
Updated: 18 February 2001
A. GENOME--the master blueprint for all cellular structures and
activities in an organism. The human genome is estimated to have
30,000-40,000 genes and 3,000,000,000 base pairs.
B. CHROMOSOMES--23 physically separate pairs into which the
three billion base pairs are organized in the human body,
each a packet of compressed and entwined DNA. [NB. Other animals
have different numbers of chromosomes, e.g., horse: 64; cat: 38.]
C. DNA--Deoxyribonucleic acids, composed of repeating nucleotide units.
1. Nucleotide Unit--a monomer of DNA consisting of:
a. a phosphate group
b. a five-carbon sugar (deoxyribose)
c. a heterocyclic nitrogen-containing organic base
(either guanine, cytosine, thymine or adenine)
2. DNA is normally found as a double-stranded helix:
a. linked by hydrogen bonds between either of the pairs:
- guanine and cytosine: GC
- thymine and adenine: TA
b. each linkage is called a BASE PAIR.
3. DNA SEQUENCE--the order of the bases arranged along the
sugar-phosphate backbone of the DNA.
4. Size of DNA--very high molecular weight, on the order
of several billion. Compare with RNA which has a MW of
D. RNA--Ribonucleic acids: structurally similar to DNA, but the sugar
is ribose, and the bases are A, G, C, U. In conjunction with enzymes, RNA
synthesizes peptides and proteins in a cell.
1. MESSENGER RNA (mRNA)--a transient intermediary molecule
that goes from the nucleus of the cell to the cytoplasm
and serves as a template for protein synthesis.
2. RIBOSOMAL RNA (rRNA)--the RNA component of ribosomes, the
granular bodies in the cytoplasm of the cell that are the
sites of protein synthesis.
3. TRANSFER RNA (tRNA)--transports an amino acid to the mRNA
and its ribosomes as an ester of one ribose unit in a
4. COMPLEMENTARY DNA (cDNA)--synthesized from a messenger
RNA template; the single-stranded form is often used as
a probe in physical mapping.
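The pairing rules above (G with C, T with A, and U replacing T in RNA) can be sketched in a few lines of Python. This is only an illustration: the function names are invented for this example, and the 5'-to-3' reading convention is the usual one rather than something stated in this outline.

```python
# Watson-Crick pairs from the outline: G pairs with C, T pairs with A.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(dna: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(DNA_COMPLEMENT[base] for base in dna.upper())


def reverse_complement(dna: str) -> str:
    """Return the partner strand, reversed so it reads in the usual 5'->3' direction."""
    return complement(dna)[::-1]


def transcribe(coding_strand: str) -> str:
    """Return the RNA with the same base order as the coding strand (T becomes U)."""
    return coding_strand.upper().replace("T", "U")
```

For the strand GATTACA, complement gives CTAATGT, reverse_complement gives TGTAATC, and transcribe gives GAUUACA.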
E. GENE--segment of a DNA molecule arranged linearly along the
chromosomes containing the instructions needed to direct the
synthesis of a single polypeptide.
1. Ranges in size from fewer than 1,000 bases to several million.
2. The specific sequence of nucleotide bases carries the
information for constructing proteins. Markers used in DNA
sequence mapping are:
a. CODONs--groups of three DNA bases that direct the
cell's protein synthesizing machinery
b. EXONs--the protein coding sequences of genes
c. INTRONs--no coding function
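As a rough sketch of how codons direct protein synthesis, the following Python reads a coding DNA sequence three bases at a time and looks each codon up in the standard genetic code. It assumes the sequence is already intron-free and starts in the correct reading frame; the function name is made up for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
# Map each of the 64 codons (TTT, TTC, TTA, ...) to its 1-letter amino acid code.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate codon by codon from position 0, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):  # trailing 1-2 bases are ignored
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, translate("ATGGCCTGGTAA") reads ATG, GCC, TGG and then the stop codon TAA, giving "MAW".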
F. PROTEINS--Biological polymers, made up of alpha-amino acids.
[Structure diagram of an alpha-amino acid: the side chain R is
where variation occurs among the 20 amino acids.]
G. PEPTIDES--an amide containing 2-50 amino acids
1. PEPTIDE BOND--the amide link between the alpha-amino group of
one amino acid and the carboxyl group of another amino acid.
2. Each amino acid in a peptide molecule is called a UNIT
3. The difference between a peptide and a protein is just
one of size. A protein has > 50 amino acid residues.
H. ENZYMES--a class of proteins tailored to catalyze specific
biological reactions. Like other proteins, an enzyme is a
polymeric chain of amino acids held together by peptide bonds.
It brings together chemically reactive amino acids in a
specific arrangement for optimal catalysis.
1. ACTIVE SITE--the part of the enzyme that binds the
substrate molecule of the enzymatic reaction and performs
the conversion of the substrate to the product molecule.
2. COFACTOR--molecules other than amino acids that enzymes
incorporate into their active sites. They range from
simple metals to large aromatic groups such as heme and
3. INHIBITORS--bind to and inactivate an enzyme.
4. RESTRICTION ENZYMES--recognize short DNA sequences and
cut the DNA molecules at those specific sites.
II. STRUCTURE OF PROTEINS
A. Primary Structure--the order (sequence) of amino acids in a
protein, expressed as 1-letter or 3-letter codes.
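The 1-letter and 3-letter codes for the 20 common amino acids can be converted with a simple lookup table. The sketch below is illustrative only: the function names are hypothetical, and the hyphen separator for 3-letter codes is an assumption about how the sequence is written.

```python
# Standard 3-letter and 1-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_seq: str, sep: str = "-") -> str:
    """Convert e.g. 'Gly-Ala-Trp' to 'GAW' (separator is an assumption)."""
    codes = three_letter_seq.split(sep)
    return "".join(THREE_TO_ONE[code.strip().capitalize()] for code in codes)


def to_three_letter(one_letter_seq: str, sep: str = "-") -> str:
    """Convert e.g. 'MKV' to 'Met-Lys-Val', amino end first."""
    return sep.join(ONE_TO_THREE[code] for code in one_letter_seq.upper())
```

For example, to_one_letter("Gly-Ala-Trp") returns "GAW" and to_three_letter("MKV") returns "Met-Lys-Val". Uncommon or undefined residues raise a KeyError in this sketch.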
B. Secondary Structure--the arrangement of neighboring segments of
the protein chain into regular, repeating patterns.
1. alpha-helix: hydrogen bonding between lone pairs on an
oxygen of one amino acid and hydrogen attached to a
nitrogen on another amino acid.
2. beta-sheet (pleated)
C. Tertiary Structure--the overall 3-D shape of the protein
determined mostly by interactions of amino acid side chains that are far
apart along the same backbone.
D. Quaternary Structure--the overall structure of two or more protein
chains that aggregate to form large, ordered structures.
E. Motifs--3-D structure patterns that protein molecules
assume. Conserved patterns of amino acids or nucleotides
that are common to a group of functionally related sequences
and often encode or specify the discrete function.
F. Homology Search--search for patterns across different
species and within the same one, showing descent from a
common ancestor.
G. Protein sequences--run from the amino to the carboxyl end.
III. SEQUENCE DATABASES (Genomic, DNA, Protein)--Number of
available sequences now doubling about every two years.
A. Human Genome Project--an attempt to map each human chromosome
at increasingly finer resolutions.
Genome Map--describes the order of genes or other markers
and the spacing between them on each chromosome.
B. Mapping Databases
Genetic linkage maps--depict the relative chromosomal
locations of DNA markers by their patterns of inheritance
from parent to child.
Physical maps--describe the chemical characteristics of
the DNA molecule itself.
Genome Database (GDB) -- provides location, ordering and
distance information for human genetic markers, probes and
contigs linked to human genetic diseases.
Online Mendelian Inheritance in Man database--a catalog of
inherited human traits and diseases.
Human and Mouse Probes and Libraries Database--(located at
the American Type Culture Collection)
GBASE Mouse database
C. Sequence Databases
1. Nucleic Acids (DNA and RNA)
GenBank--human, plant, virus, bacterial and other species.
Translations of coding regions in a separate db: GenPept.
EMBL (European Molecular Biology Laboratory)
DNA Database of Japan (DDBJ)
2. Proteins
Protein Information Resource--contains sequence information,
motifs, and other features of protein structure.
SWISSPROT--includes translations of coding regions from
EMBL; has a companion database of protein motifs: PROSITE
3. Both DNA and Proteins
CCSD (Complex Carbohydrate Structure Database)--
carbohydrate, glycolipid, and glycopeptide sequences.
Entrez databases from NLM
a. Include protein and nucleotide sequences from PIR and
b. Have MEDLINE citations
c. Have explicit links to the publications and sequences
d. Have implicit links to related sequences and articles
Protein Data Bank--3-D structures from crystallographic,
NMR, and molecular modeling data
D. Sequence Searches in the Chemical Abstracts Service Registry
1. Nucleic Acid Sequences in the Registry File
As of 3/93, REG had about 169,000 nucleic acid sequences
from the primary literature.
A complete sequence is:
a. one identified as a complete nucleic acid molecule by
b. the region (gene) within a larger nucleic acid molecule
which contains all of the coding information for a
protein or RNA product,
c. a genetic element (promoter, regulatory element, etc)
with clearly delineated beginning and ending residues.
Only nucleic acid sequences with lengths of nine or more
residues are available for searching and displaying using
1-letter sequence codes.
Can search by SEQ and SQL (sequence length)
1. Exact Sequence Search of Nucleic Acids /SQEN
2. Subsequence Search of Nucleic Acids /SQSN
(exact plus embedded)
2. Protein Sequences in the Registry File
As of 3/93, REG had over 216,000 protein sequences, going back to 1957.
Those with a chain length of 4 or more are searchable by
1-letter or 3-letter amino acid codes.
Dipeptides and tripeptides are also registered but can only
be searched by name or structure.
Four options to search protein sequences using amino acid
codes (limit of 256 characters for input):
1. Exact Sequence Search of Proteins /SQEP
matches query by length and amino acid sequence
2. Exact Family Sequence Search of Proteins /SQEFP
fixed length polypeptide sequences: excludes uncommon
or undefined amino acids
3. Subsequence Search of Proteins /SQSP
polypeptides with one or more query sequences:
embedded or exact
4. Subsequence Family Search of Proteins /SQSFP
Can also search by gene name, organism name,
sequence length, etc.
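The difference between an exact search and a subsequence search can be illustrated on a small in-memory collection. This is a minimal sketch of the two matching rules only, not how the Registry File itself is implemented; the record names and sequences are made up, and only the 256-character query limit comes from the text above.

```python
MAX_QUERY_CHARS = 256  # input limit quoted in the outline

# Hypothetical records (name -> 1-letter protein sequence), for illustration only.
HYPOTHETICAL_RECORDS = {
    "peptide_1": "GAVL",
    "peptide_2": "MGAVLK",
    "peptide_3": "GAVI",
}


def _check_query(query: str) -> str:
    """Normalize case and enforce the query length limit."""
    if len(query) > MAX_QUERY_CHARS:
        raise ValueError("query exceeds 256 characters")
    return query.upper()


def exact_search(query: str, records: dict) -> list:
    """Names of records that match the query in both length and residue order."""
    q = _check_query(query)
    return [name for name, seq in records.items() if seq.upper() == q]


def subsequence_search(query: str, records: dict) -> list:
    """Names of records containing the query, either embedded or as the whole sequence."""
    q = _check_query(query)
    return [name for name, seq in records.items() if q in seq.upper()]
```

With these made-up records, exact_search("GAVL", HYPOTHETICAL_RECORDS) returns only peptide_1, while subsequence_search("GAVL", HYPOTHETICAL_RECORDS) returns peptide_1 (exact) and peptide_2 (embedded).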
BLAST -- Basic Local Alignment Search Tool -- used
to compare sequences and find all significant similarities.
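To give a feel for what "local alignment" means, here is a short Python sketch of the Smith-Waterman dynamic programming method. This is not BLAST: BLAST uses a faster heuristic to find similar regions, whereas Smith-Waterman scores every possible local alignment exhaustively. The scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions chosen for the example.

```python
def local_alignment(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b) by Smith-Waterman."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table; a cell never drops below 0, which is what makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero-score cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")  # gap in b
            i -= 1
        else:
            top.append("-")  # gap in a
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))
```

For example, local_alignment("TTTGATTACATTT", "CCGATTACACC") finds the shared region GATTACA in both sequences with a score of 14 (seven matches at +2 each).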