SEQUENCE SEARCHING OF PROTEINS AND NUCLEIC ACIDS
I. DEFINITIONS: Basic Biochemistry
A. GENOME--the master blueprint for all cellular structures and
activities in an organism. The human genome was estimated in
the early 1990s to have as many as 100,000 genes (later
estimates are closer to 20,000 protein-coding genes) and
about 3,000,000,000 base pairs.
B. CHROMOSOMES--the 24 distinct, physically separate units (22
autosomes plus the X and Y sex chromosomes) into which the
three billion base pairs of the human genome are organized,
each a packet of compressed and entwined DNA.
C. DNA--Deoxyribonucleic acids, composed of repeating nucleotide
units.
1. Nucleotide Unit--a monomer of DNA consisting of:
a. phosphate group
b. a five-carbon sugar (deoxyribose in DNA; ribose in RNA)
c. a nitrogen-containing organic base
(either guanine, cytosine, thymine or adenine)
2. DNA normally found as a double-stranded helix:
a. linked by hydrogen bonds between either of the pairs:
- guanine and cytosine: GC
- thymine and adenine: TA
b. each linkage is called a BASE PAIR.
3. DNA SEQUENCE--the order of the bases arranged along the
sugar-phosphate backbone of the DNA.
4. Size of DNA--very high molecular weight, on the order of
several billion. Compare with RNA which has a MW of
20,000-40,000.
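The pairing rule in item 2 (G with C, T with A) can be sketched
in a few lines of Python. This is only an illustration, assuming
a sequence is a plain string of the letters A, C, G and T; the
names PAIRS, complement and reverse_complement are hypothetical
and do not come from any particular library.

```python
# Base pairing from item 2: guanine pairs with cytosine,
# thymine pairs with adenine.
PAIRS = {"G": "C", "C": "G", "T": "A", "A": "T"}


def complement(seq):
    """Return the base-by-base partner of each base in a DNA string."""
    return "".join(PAIRS[base] for base in seq.upper())


def reverse_complement(seq):
    """Return the partner strand, written in reverse order.

    The two strands of the helix run in opposite directions, so the
    partner strand is conventionally written reversed.
    """
    return complement(seq)[::-1]


if __name__ == "__main__":
    print(complement("GATTACA"))
    print(reverse_complement("GATTACA"))
```

A KeyError is raised for any letter outside A, C, G, T; a fuller
version would have to decide how to treat ambiguous bases.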
D. GENE--segment of a DNA molecule arranged linearly along the
chromosomes
1. Ranges in size from fewer than 1,000 bases to several
million.
2. The specific sequence of nucleotide bases carries the
information for constructing proteins. Markers in the DNA
sequence mapping are:
a. CODONs--groups of three DNA bases that direct the
cell's protein synthesizing machinery
b. EXONs--the protein coding sequences of genes
c. INTRONs--no coding function
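How codons direct protein synthesis can be illustrated with a
short Python sketch. It assumes the standard genetic code, a
coding strand written with DNA letters (T rather than U), and a
reading frame that starts at the first base; the names
CODON_TABLE and translate are hypothetical. Introns are not
handled: the input is assumed to be exon sequence only.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Read a coding sequence three bases at a time.

    Returns 1-letter amino acid codes and stops at the first stop codon.
    Trailing bases that do not fill a whole codon are ignored.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```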
E. PROTEINS--Biological polymers, made up of alpha-
aminocarboxylic acids:
CO2H
|
H2N--C--H
|
R <-- Variation here occurs
with 20 amino acids
F. PEPTIDES--an amide containing 2-50 amino acids
1. PEPTIDE BOND--the amide link between the alpha-amino group
of one amino acid and the carboxyl group of another amino acid.
2. Each amino acid in a peptide molecule is called a UNIT or
RESIDUE.
3. The difference between a peptide and a protein is just
one of size. A protein has > 50 amino acid residues.
G. ENZYMES--a class of proteins tailored to catalyze specific
biological reactions. The polymeric chain of amino acids,
held together by peptide bonds, brings chemically reactive
amino acids together in a specific arrangement for optimal
catalysis.
1. ACTIVE SITE--the part of the enzyme that binds the
substrate molecule of the enzymatic reaction and performs
the conversion of the substrate to the product molecule.
2. COFACTOR--molecules other than amino acids that enzymes
incorporate into their active sites. They range from
simple metals to large aromatic groups such as heme and
flavin.
3. INHIBITORS--bind to and inactivate an enzyme.
4. RESTRICTION ENZYMES-- recognize short DNA sequences and
cut the DNA molecules at those specific sites.
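Item 4 can be made concrete with a small Python sketch that
finds recognition sites in a single strand and splits the strand
at them. The defaults are an assumption chosen for illustration:
the EcoRI site GAATTC, cut between the G and the first A. The
function names are hypothetical, and the sketch ignores the
second strand.

```python
def cut_positions(dna, site="GAATTC", offset=1):
    """Return the string indices at which the strand would be cut.

    `offset` is the number of bases of the recognition site that
    stay on the left-hand fragment (1 for G^AATTC).
    """
    dna = dna.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start + offset)
        start = dna.find(site, start + 1)
    return positions


def digest(dna, site="GAATTC", offset=1):
    """Split the strand into fragments at every recognition site."""
    bounds = [0] + cut_positions(dna, site, offset) + [len(dna)]
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    print(digest("AAGAATTCTTGAATTCGG"))
```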
H. RNA--Ribonucleic acids, in conjunction with enzymes,
synthesize peptides and proteins in a cell.
1. MESSENGER RNA (mRNA)--a transient intermediary molecule
that goes from the nucleus of the cell to the cytoplasm
and serves as a template for protein synthesis.
2. RIBOSOMAL RNA (rRNA)--the RNA component of ribosomes, the
granular bodies in the cytoplasm of the cell that are the
sites of protein synthesis.
3. TRANSFER RNA (tRNA)--transports an amino acid to the mRNA
and its ribosomes as an ester of one ribose unit in a
tRNA molecule.
4. COMPLEMENTARY DNA (cDNA)--synthesized from a messenger
RNA template; the single-stranded form is often used as
a probe in physical mapping.
II. STRUCTURE OF PROTEINS
A. Primary Structure--the order (sequence) of amino acids,
expressed as 1-letter or 3-letter codes.
B. Secondary Structure--the arrangement of the chain
1. alpha-helix: hydrogen bonding between lone pairs on an
oxygen of one amino acid and hydrogen attached to a
nitrogen on another amino acid.
2. pleated sheet
C. Tertiary Structure--the overall shape of the protein
D. Motifs--recurring 3-D structure patterns that protein
molecules assume; also, conserved patterns of amino acids or
nucleotides that are common to a group of functionally related
sequences and often encode or specify a discrete function.
E. Homology Search--search for sequence patterns shared across
different species or within the same one; shared patterns
suggest descent from a common ancestor.
F. Protein sequences--run from the amino to the carboxyl end.
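The 1-letter and 3-letter codes mentioned under Primary
Structure are interchangeable. A minimal Python sketch of the
conversion follows, covering only the 20 common amino acids and
assuming 3-letter residues are joined by hyphens and written from
the amino end to the carboxyl end; the function names are
hypothetical.

```python
# Standard 3-letter and 1-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_seq, sep="-"):
    """Convert e.g. 'Gly-Ala-Val' to 'GAV'."""
    residues = three_letter_seq.split(sep)
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


def to_three_letter(one_letter_seq, sep="-"):
    """Convert e.g. 'GAV' to 'Gly-Ala-Val'."""
    return sep.join(ONE_TO_THREE[r] for r in one_letter_seq.upper())


if __name__ == "__main__":
    print(to_one_letter("Gly-Ala-Val"))
    print(to_three_letter("MKF"))
```

Uncommon or undefined residues raise a KeyError here; a real
database would need explicit codes for them.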
III. SEQUENCE DATABASES (Genomic, DNA, Protein)--Number of
available sequences now doubling about every two years.
A. Human Genome Project--an attempt to map each human chromosome
at increasingly finer resolutions.
Genome Map--describes the order of genes or other markers
and the spacing between them on each chromosome.
Genetic linkage maps--depict the relative chromosomal
locations of DNA markers by their patterns of inheritance
from parent to child.
Physical maps--describe the chemical characteristics of
the DNA molecule itself.
B. Mapping Databases
Genome Database (GDB) -- provides location, ordering and
distance information for human genetic markers, probes and
contigs linked to human genetic diseases.
Online Mendelian Inheritance in Man database-- a catalog of
inherited human traits and diseases.
Human and Mouse Probes and Libraries Database -- (located at
the American Type Culture Collection)
GBASE Mouse database
C. Sequence Databases
1. Nucleic Acids (DNA and RNA)
GenBank -- human, plant, virus, bacterial and other species.
Translations of coding regions in a separate db: GenPept.
EMBL (European Molecular Biology Laboratory)
DNA Database of Japan (DDBJ)
GenInfo
2. Proteins
Protein Information Resource--contains sequence information,
motifs, and other features of protein structure.
SWISSPROT--includes translations of coding regions from
EMBL; has a companion database of protein motifs: PROSITE
GenPept
CCSD (Complex Carbohydrate Structure Database)--
carbohydrate, glycolipid, and glycopeptide sequences.
3. Both DNA and Proteins
Entrez databases from NLM
a. Include protein and nucleotide sequences from PIR and
GenBank.
b. Have MEDLINE citations
c. Have explicit links to the publications and sequences
d. Have implicit links to related sequences and articles
Protein Data Bank--3-D structures from crystallographic,
NMR, and molecular modeling data
D. Sequence Searches in the Chemical Abstracts Service Registry
File
1. Nucleic Acid Sequences in the Registry File
As of 3/93 had about 169,000 nucleic acid sequences in REG
--from the primary literature
--from GenBank
A complete sequence is:
a. one identified as a complete nucleic acid molecule by
authors,
b. the region (gene) within a larger nucleic acid molecule
which contains all of the coding information for a
protein or RNA product,
c. a genetic element (promoter, regulatory element, etc)
with clearly delineated beginning and ending residues.
Only nucleic acid sequences with lengths of nine or more
residues are available for searching and displaying using 1-
letter sequence codes.
Can search by SEQ and SQL (sequence length)
Sequence Searches:
1. Exact Sequence Search of Nucleic Acids /SQEN
2. Subsequence Search of Nucleic Acids /SQSN
(exact plus embedded)
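The difference between the two search types can be shown on a
toy collection. The Python sketch below is not the Registry
File's implementation and does not use its command syntax; the
record identifiers, sequences and function names are all
hypothetical. It only mirrors two points from the text: exact
search requires the whole sequence to match, subsequence search
also accepts embedded matches, and sequences shorter than nine
residues are not searchable.

```python
MIN_SEARCHABLE = 9  # nine-residue threshold stated in the text

# Hypothetical toy records: identifier -> nucleic acid sequence.
RECORDS = {
    "rec1": "ATGGCCTAAGG",
    "rec2": "GGATGGCCTAAGGTT",
    "rec3": "ATGGCC",  # fewer than nine residues: not searchable
}


def searchable(records):
    """Keep only the records long enough to be searched by sequence."""
    return {k: v for k, v in records.items() if len(v) >= MIN_SEARCHABLE}


def exact_search(query, records):
    """Records whose entire sequence equals the query."""
    return sorted(k for k, v in searchable(records).items() if v == query)


def subsequence_search(query, records):
    """Records that equal the query or contain it embedded."""
    return sorted(k for k, v in searchable(records).items() if query in v)


if __name__ == "__main__":
    print(exact_search("ATGGCCTAAGG", RECORDS))
    print(subsequence_search("ATGGCCTAAGG", RECORDS))
```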
2. Protein Sequences in the Registry File
As of 3/93 over 216,000 protein sequences in REG back to
1957.
Those with chain length of 4 or more are searchable by 1-
letter or 3-letter amino acid codes.
Dipeptides and tripeptides are also registered but can only
be searched by name or structure.
Four options to search protein sequences using amino acid
codes (limit of 256 characters for input):
1. Exact Sequence Search of Proteins /SQEP
matches query by length and amino acid sequence
2. Exact Family Sequence Search of Proteins /SQEFP
fixed length polypeptide sequences: excludes uncommon
or undefined amino acids
3. Subsequence Search of Proteins /SQSP
polypeptides with one or more query sequences:
embedded or exact
4. Subsequence Family Search of Proteins /SQSFP
Also can search things like gene name, organism name,
sequence length, etc.
IV. Tools
BLAST -- Basic Local Alignment Search Tool -- used
to compare sequences and find all significant similarities.
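To show what "local alignment" means, here is a minimal Python
sketch of Smith-Waterman scoring. This is not BLAST itself:
BLAST uses a faster heuristic, while Smith-Waterman is the exact
dynamic-programming method for the same kind of problem. The
scoring values (match +2, mismatch -1, gap -2) and the function
name are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Each cell holds the best score of any alignment ending at that
    pair of positions; scores are never allowed to drop below zero,
    which is what makes the alignment local rather than global.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a fresh alignment here
                prev[j - 1] + pair, # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

The shared stretch ACGT is found even though the flanking
residues differ, which is the property that makes local
alignment useful for finding similar regions inside otherwise
unrelated sequences.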
CCIIM: 09-16.495 GW