SEQUENCE SEARCHING OF PROTEINS AND NUCLEIC ACIDS


I. DEFINITIONS: Basic Biochemistry

   A. GENOME--the master blueprint for all cellular structures and
      activities in an organism.  The human genome is estimated to
      have about 3,000,000,000 base pairs and roughly 20,000
      protein-coding genes (early estimates ran as high as 100,000).

   B. CHROMOSOMES--24 physically separate units into which the
      three billion base pairs are organized in the human body,
      each a packet of compressed and entwined DNA.

   C. DNA--Deoxyribonucleic acids, composed of repeating nucleotide
      units.
      1.  Nucleotide Unit--a monomer of DNA consisting of:
          a. phosphate group
          b. a five-carbon sugar (deoxyribose in DNA; ribose in RNA)
          c. a nitrogen-containing organic base 
             (either guanine, cytosine, thymine or adenine)
      2.  DNA is normally found as a double-stranded helix:
          a. the two strands are linked by hydrogen bonds between
             complementary bases:
             - guanine and cytosine: GC
             - thymine and adenine:  TA
          b. each linkage is called a BASE PAIR.
      3.  DNA SEQUENCE--the order of the bases arranged along the
          sugar-phosphate backbone of the DNA.
      4.  Size of DNA--very high molecular weight, on the order of
          several billion.  Compare with a small RNA such as tRNA,
          which has a MW of 20,000-40,000.
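
      The base-pairing rule in item 2 (G with C, T with A) means
      that one strand determines the other.  A minimal Python
      sketch of that idea follows; the function name is
      illustrative only and the code assumes a sequence written
      with the four letters A, C, G and T.

```python
# Watson-Crick pairing from the outline: G pairs with C, T pairs with A.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the partner strand, read in its own 5'->3' direction.

    The two strands of the helix run in opposite directions, so the
    partner is built by complementing each base and reversing the order.
    Raises KeyError on any letter other than A, C, G or T.
    """
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


if __name__ == "__main__":
    strand = "GATTC"
    print(strand, "pairs with", reverse_complement(strand))
```

      Each position in the input corresponds to one base pair of
      the double helix.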

   D. GENE--segment of a DNA molecule arranged linearly along the 
      chromosomes
      1.  Ranges in size from fewer than 1,000 bases to several 
          million.
      2.  The specific sequence of nucleotide bases carries the
          information for constructing proteins.  Features
          recognized when mapping a DNA sequence include:
          a. CODONs--groups of three bases, each specifying one
             amino acid or a stop signal, that direct the cell's
             protein-synthesizing machinery
          b. EXONs--the protein-coding sequences of genes
          c. INTRONs--intervening sequences within genes that have
             no coding function
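
      How codons carry protein information can be sketched in a
      few lines of Python.  The table below is the standard
      genetic code written in the compact TCAG ordering; the
      function name is illustrative, and the sketch assumes an
      intron-free coding sequence read from its first base.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence into 1-letter amino acid codes.

    Reads successive groups of three bases; a trailing incomplete
    codon is ignored.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```

      In a real gene the introns would have to be removed before
      this step; the sketch does not attempt that.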

   E. PROTEINS--Biological polymers, made up of alpha-
      aminocarboxylic acids:
 
                            CO2H
                            |
                       H2N--C--H
                            |
                            R  <-- Variation here occurs
                                   with 20 amino acids



    F. PEPTIDES--an amide containing 2-50 amino acids

       1. PEPTIDE BOND--the amide link between the alpha-amino
          group of one amino acid and the carboxyl group of
          another amino acid.
       2. Each amino acid in a peptide molecule is called a UNIT or
          RESIDUE.
       3. The difference between a peptide and a protein is just
          one of size.  A protein has > 50 amino acid residues.

    G. ENZYMES--proteins tailored to catalyze specific biological
       reactions.  Like all proteins, an enzyme is a polymeric
       chain of amino acids held together by peptide bonds; its
       folded chain brings chemically reactive amino acids
       together in a specific arrangement for optimal catalysis.
    
       1. ACTIVE SITE--the part of the enzyme that binds the
          substrate molecule of the enzymatic reaction and performs
          the conversion of the substrate to the product molecule.
       2. COFACTOR--molecules other than amino acids that enzymes 
          incorporate into their active sites.  They range from
          simple metals to large aromatic groups such as heme and
          flavin.
       3. INHIBITORS--bind to and inactivate an enzyme.
       4. RESTRICTION ENZYMES-- recognize short DNA sequences and
          cut the DNA molecules at those specific sites.
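
       The action of a restriction enzyme can be imitated as a
       string search.  The sketch below uses EcoRI as its example
       (recognition sequence GAATTC, cut between G and A), which
       is not named in this outline; the function names are
       illustrative, and only one strand is modeled.

```python
def find_sites(dna: str, site: str = "GAATTC") -> list[int]:
    """Return the 0-based start position of every occurrence of `site`."""
    dna = dna.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)  # step by 1 to catch overlaps
    return positions


def digest(dna: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut `dna` at every recognition site and return the fragments.

    `cut_offset` is how many bases into the site the cut falls
    (1 for EcoRI: G^AATTC).
    """
    dna = dna.upper()
    fragments = []
    previous = 0
    for position in find_sites(dna, site):
        cut = position + cut_offset
        fragments.append(dna[previous:cut])
        previous = cut
    fragments.append(dna[previous:])
    return fragments


if __name__ == "__main__":
    print(digest("AAGAATTCGGGAATTCTT"))
```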

    H. RNA--Ribonucleic acids, in conjunction with enzymes,
       synthesize peptides and proteins in a cell.
   
       1. MESSENGER RNA (mRNA)--a transient intermediary molecule
          that goes from the nucleus of the cell to the cytoplasm
          and serves as a template for protein synthesis.
       2. RIBOSOMAL RNA (rRNA)--the RNA component of ribosomes,
          the granular bodies in the cytoplasm of the cell that
          are the sites of protein synthesis.
       3. TRANSFER RNA (tRNA)--transports an amino acid to the mRNA
          and its ribosomes as an ester of one ribose unit in a
          tRNA molecule.
       4. COMPLEMENTARY DNA (cDNA)--synthesized from a messenger  
          RNA template; the single-stranded form is often used as 
          a probe in physical mapping.

II. STRUCTURE OF PROTEINS

    A. Primary Structure--the order (sequence) of amino acids,
       expressed as 1-letter or 3-letter codes.
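
       The two notations are interchangeable for the 20 common
       amino acids.  A small Python sketch of the conversion
       follows; the function names and the hyphen separator are
       choices made for this example, not a fixed convention.

```python
# 3-letter and 1-letter codes for the 20 common amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(sequence: str, separator: str = "-") -> str:
    """Convert 'Met-Ala-Gly' style input to 'MAG'."""
    residues = sequence.split(separator)
    return "".join(THREE_TO_ONE[residue.capitalize()] for residue in residues)


def to_three_letter(sequence: str, separator: str = "-") -> str:
    """Convert 'MAG' style input to 'Met-Ala-Gly'."""
    return separator.join(ONE_TO_THREE[residue] for residue in sequence.upper())


if __name__ == "__main__":
    print(to_three_letter("MAGW"))
    print(to_one_letter("Met-Ala-Gly-Trp"))
```

       Both functions raise KeyError for uncommon or undefined
       residues, which this sketch does not try to handle.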

    B. Secondary Structure--the arrangement of the chain
       1. alpha-helix: hydrogen bonding between lone pairs on an
          oxygen of one amino acid and hydrogen attached to a
          nitrogen on another amino acid.
       2. beta-pleated sheet: hydrogen bonding between adjacent,
          extended segments of the chain.

    C. Tertiary Structure--the overall shape of the protein

    D. Motifs--3-D structure patterns that protein molecules      
       assume.  Conserved patterns of amino acids or nucleotides  
       that are common to a group of functionally related sequences 
       and often encode or specify the discrete function.
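
       A sequence motif can often be written as a short pattern
       and located with a regular expression.  The sketch below
       uses the N-glycosylation site pattern N-{P}-[ST]-{P} as its
       example, an assumption brought in for illustration rather
       than something taken from this outline; the function name
       is likewise illustrative.

```python
import re

# N-{P}-[ST]-{P}: Asn, any residue except Pro, Ser or Thr, any residue
# except Pro.  The lookahead lets overlapping sites be reported.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(protein: str, pattern: re.Pattern = N_GLYCOSYLATION):
    """Return (1-based position, matched residues) for every motif hit."""
    return [
        (match.start() + 1, match.group(1))
        for match in pattern.finditer(protein.upper())
    ]


if __name__ == "__main__":
    for position, residues in find_motif("MKNGSAANPSTQ"):
        print(position, residues)
```

       A pattern match only flags a candidate site; it does not
       show that the site is functional in the folded protein.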

    E. Homology Search--search for similar sequence patterns
       across different species and within the same one; strong
       similarity suggests descent from a common ancestor.

    F. Protein sequences--run from the amino to the carboxyl end.


III. SEQUENCE DATABASES (Genomic, DNA, Protein)--Number of
available sequences now doubling about every two years.

   A. Human Genome Project--an attempt to map each human chromosome
   at increasingly finer resolutions.

   Genome Map--describes the order of genes or other markers 
      and the spacing between them on each chromosome.

      Genetic linkage maps--depict the relative chromosomal
         locations of DNA markers by their patterns of inheritance
         from parent to child.

      Physical maps--describe the chemical characteristics of     
         the DNA molecule itself.

    B. Mapping Databases

    Genome Database (GDB) -- provides location, ordering and
    distance information for human genetic markers, probes and
    contigs linked to human genetic diseases.

    Online Mendelian Inheritance in Man database-- a catalog of
    inherited human traits and diseases.

    Human and Mouse Probes and Libraries Database -- (located at
    the American Type Culture Collection)

    GBASE Mouse database

    C. Sequence Databases

       1. Nucleic Acids (DNA and RNA)
       
       GenBank -- human, plant, virus, bacterial and other species.
       Translations of coding regions in a separate db: GenPept.
       
       EMBL (European Molecular Biology Laboratory)

       DNA Database of Japan (DDBJ)

       GenInfo

       2. Proteins

       Protein Information Resource--contains sequence information,
       motifs, and other features of protein structure.

       SWISSPROT--includes translations of coding regions from
       EMBL; has a companion database of protein motifs: PROSITE

       GenPept 

       CCSD (Complex Carbohydrate Structure Database)--
       carbohydrate, glycolipid, and glycopeptide sequences.

       3. Both DNA and Proteins

       Entrez databases from NLM
          a. Include protein and nucleotide sequences from PIR and
             GenBank.
          b. Have MEDLINE citations
          c. Have explicit links to the publications and sequences
          d. Have implicit links to related sequences and articles

       Protein Data Bank--3-D structures from crystallographic,
       NMR, and molecular modeling data

    D. Sequence Searches in the Chemical Abstracts Service Registry 
       File

       1. Nucleic Acid Sequences in the Registry File

       As of 3/93, REG had about 169,000 nucleic acid sequences:
       --from the primary literature
       --from GenBank

       A complete sequence is:

       a. one identified as a complete nucleic acid molecule by
          authors,
       b. the region (gene) within a larger nucleic acid molecule
          which contains all of the coding information for a
          protein or RNA product,
       c. a genetic element (promoter, regulatory element, etc)
          with clearly delineated beginning and ending residues.

       Only nucleic acid sequences with lengths of nine or more
       residues are available for searching and displaying using
       1-letter sequence codes.

       Can search by SEQ and SQL (sequence length)

       Sequence Searches:
       
          1. Exact Sequence Search of Nucleic Acids    /SQEN
          2. Subsequence Search of Nucleic Acids       /SQSN
          (exact plus embedded)

       2. Protein Sequences in the Registry File

       As of 3/93, REG had over 216,000 protein sequences, going
       back to 1957.

       Those with chain length of 4 or more are searchable by
       1-letter or 3-letter amino acid codes.

       Dipeptides and tripeptides are also registered but can only 
       be searched by name or structure.

       Four options to search protein sequences using amino acid  
       codes (limit of 256 characters for input):
    
          1. Exact Sequence Search of Proteins         /SQEP
             matches query by length and amino acid sequence
       
          2. Exact Family Sequence Search of Proteins  /SQEFP
             fixed length polypeptide sequences: excludes uncommon 
             or undefined amino acids
       
          3. Subsequence Search of Proteins            /SQSP
             polypeptides with one or more query sequences:       
             embedded or exact

          4. Subsequence Family Search of Proteins     /SQSFP

       Other searchable fields include gene name, organism name,
       and sequence length.
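
       The difference between an exact search and a subsequence
       search can be shown on a toy in-memory list.  This is only
       a sketch of the matching logic, not how the Registry File
       is implemented; the Record class, the function names and
       the sample data are all hypothetical.  The family options
       are left out because their matching rules are not spelled
       out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str       # hypothetical label, not a real registry entry
    sequence: str   # 1-letter amino acid codes


MIN_SEARCHABLE_LENGTH = 4   # chain length of 4 or more, per the outline
MAX_QUERY_LENGTH = 256      # input limit, per the outline


def _prepare(query: str) -> str:
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError("query exceeds the 256-character input limit")
    return query.upper()


def _searchable(records):
    return [r for r in records if len(r.sequence) >= MIN_SEARCHABLE_LENGTH]


def exact_search(records, query: str):
    """Match on both length and sequence (the /SQEP idea)."""
    query = _prepare(query)
    return [r for r in _searchable(records) if r.sequence == query]


def subsequence_search(records, query: str):
    """Match when the query is embedded in or equal to a sequence
    (the /SQSP idea)."""
    query = _prepare(query)
    return [r for r in _searchable(records) if query in r.sequence]


if __name__ == "__main__":
    toy = [Record("a", "GIVEQ"), Record("b", "MGIVEQCC"), Record("c", "GIV")]
    print([r.name for r in exact_search(toy, "GIVEQ")])
    print([r.name for r in subsequence_search(toy, "GIVEQ")])
```

       Record "c" is never returned because its chain is shorter
       than four residues.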

IV. Tools 

    BLAST -- Basic Local Alignment Search Tool -- used to compare
    a query sequence against database sequences and find regions
    of significant local similarity.
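
    The idea of a local alignment score can be sketched with the
    Smith-Waterman dynamic-programming method.  This is a
    different and much slower technique than BLAST, which uses a
    faster heuristic; it is shown only to illustrate what a local
    alignment score measures.  The scoring values and the function
    name are arbitrary choices for this example.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of a and b.

    Only the score is computed, using two rows of the scoring matrix;
    the alignment itself is not reconstructed.
    """
    a, b = a.upper(), b.upper()
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                       # start a new local alignment
                previous[j - 1] + pair,  # align a[i-1] with b[j-1]
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTTACGTTTT", "GGACGGG"))
```

    Resetting negative scores to zero is what makes the alignment
    local: a good match in one region is not penalized by
    unrelated sequence around it.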

CCIIM: 09-16.495 GW
