
3. Chemical Identifiers

I371 Lecture Outline

Updated: 21 July 2006


I. Formulas

A. Empirical Formula - the simplest whole number ratio of the various types of atoms in a compound.

B. Molecular Formula - the formula that reflects the correct molecular weight of the substance, i.e., gives the symbol and exact numbers of each kind of atom found in a molecule.

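C. Hill Formula Order - a method to arrange molecular formulas in an index: carbon is listed first, hydrogen second, then all other elements alphabetically by symbol. If the compound contains no carbon, all elements (including hydrogen) are listed alphabetically.

D. Problem - ambiguity: more than one compound can have the same formula (isomers). Example: ethanol and dimethyl ether are both C2H6O.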
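
The formula definitions above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the lecture: the function names `hill_formula` and `empirical_counts` and the dictionary representation of a formula (element symbol mapped to atom count) are assumptions made for the example.

```python
from functools import reduce
from math import gcd


def hill_formula(counts):
    """Format an element -> count mapping as a formula string in Hill order."""
    if "C" in counts:
        # Carbon first, hydrogen second, remaining elements alphabetically.
        rest = sorted(e for e in counts if e not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # No carbon: every element, hydrogen included, alphabetically.
        order = sorted(counts)
    # A count of 1 is conventionally left unwritten.
    return "".join(e + (str(counts[e]) if counts[e] > 1 else "") for e in order)


def empirical_counts(counts):
    """Reduce atom counts to the simplest whole-number ratio."""
    divisor = reduce(gcd, counts.values())
    return {element: n // divisor for element, n in counts.items()}


isatin = {"N": 1, "O": 2, "C": 8, "H": 5}
glucose = {"C": 6, "H": 12, "O": 6}

print(hill_formula(isatin))                     # C8H5NO2
print(hill_formula(empirical_counts(glucose)))  # CH2O
```

The sketch only orders and reduces counts; it says nothing about how the atoms are connected, which is why a formula alone cannot serve as a unique identifier.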

II. Registry Systems
A. Assignment of a unique identifier to a chemical substance

B. Each identifier represents one and only one chemical substance

C. Provides a link to other documents or data in the chemical information system

D. Chemical Abstracts Service Registry System

  1. Largest system in existence: over 58,000,000 substances, about half of which are biomolecular sequences
  2. All types of chemical substances
  3. Used by many other indexing services and government agencies
  4. CAS Registry Number
    • Number of format Y-XX-X, where:
    • Y = 2 - 7 digits, each X is 1 digit; the final X is a check digit
    • No inherent chemical meaning in the number
    • Example: 91-56-5 (Isatin)

      Isatin entry in the Chemical Abstracts Registry File

  5. Problem: CAS RNs are proprietary; must pay CAS to include them in a database.
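
Although a CAS RN carries no chemical meaning, its format can be checked mechanically because the last digit is a check digit: each digit before it is multiplied by its position counted from the right (1, 2, 3, ...), and the sum modulo 10 must equal the final digit. The sketch below is one way to write that check; the function name `is_valid_cas_rn` is hypothetical, and the pattern assumes that the first group may hold up to seven digits, as CAS currently documents.

```python
import re

# Assumed layout: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas_rn(rn):
    """Return True if rn has the CAS RN layout and a correct check digit."""
    match = CAS_PATTERN.match(rn)
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas_rn("91-56-5"))  # True: 6*1 + 5*2 + 1*3 + 9*4 = 55, and 55 mod 10 = 5
print(is_valid_cas_rn("91-56-4"))  # False: wrong check digit
```

A passing check only shows that the number is well formed; it does not show that the number has actually been assigned to a substance.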

E. Other Numeric Identifiers

III. IUPAC International Chemical Identifier (InChI)

A. InChI
  1. Establishes a unique label for a compound
  2. Uses algorithms for converting an input organic chemical structure to this unique (CANONICAL) form
  3. Steps
    • Normalization (Apply chemical rules)
    • Canonicalization
    • Serialization (Assign the identifier)
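
The three steps can be illustrated with a toy example. This is not the InChI algorithm: it is a brute-force sketch that only works for very small graphs, and the function names (`normalize`, `canonical_form`, `serialize`), the atom/bond list representation, and the output string format are all assumptions made for illustration.

```python
from itertools import permutations


def normalize(atoms, bonds):
    """Tidy the input: consistent symbol case, undirected and de-duplicated bonds."""
    clean_atoms = [a.strip().capitalize() for a in atoms]
    clean_bonds = sorted({tuple(sorted(b)) for b in bonds})
    return clean_atoms, clean_bonds


def canonical_form(atoms, bonds):
    """Try every renumbering of the atoms and keep the smallest description."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):  # perm[old_index] = new_index
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(tuple(sorted((perm[a], perm[b]))) for a, b in bonds)
        candidate = (tuple(new_atoms), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best


def serialize(canon):
    """Turn the canonical description into a single string label."""
    atoms, bonds = canon
    return "".join(atoms) + "/" + ",".join(f"{a}-{b}" for a, b in bonds)


def identifier(atoms, bonds):
    return serialize(canonical_form(*normalize(atoms, bonds)))


# Two different numberings of the same C-C-O skeleton (hydrogens omitted)...
print(identifier(["C", "C", "O"], [(0, 1), (1, 2)]))
print(identifier(["O", "C", "C"], [(1, 0), (2, 1)]))
# ...and a C-O-C skeleton with the same atoms but different connectivity.
print(identifier(["C", "O", "C"], [(0, 1), (1, 2)]))
```

The first two calls print the same label because the input numbering is discarded during canonicalization; the third differs because the connectivity differs. Real canonicalization algorithms reach the same goal without enumerating every permutation.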

B. Characteristics:

  1. Non-proprietary identifier for chemical substances
  2. For use in printed and electronic data sources
  3. To enable easier linking of diverse data compilations

C. Example of InChI for Isatin taken from PubChem:

  1. IUPAC Name: 1H-indole-2,3-dione
  2. InChI: InChI=1/C8H5NO2/c10-7-5-3-1-2-4-6(5)9-8(7)11/h1-4H,(H,9,10,11)/f/h9H
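
An InChI string is built from layers separated by slashes: after the `InChI=` prefix come the version and the molecular formula, followed by layers that each start with a lowercase letter (for instance `c` for connectivity, `h` for hydrogens, and `f` opening the fixed-hydrogen part). The sketch below simply splits the Isatin string above into those pieces; the function name `split_inchi` and the returned dictionary layout are assumptions for the example, and no chemical interpretation of the layers is attempted.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula, and the remaining layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    # First field is the version, second the formula, the rest are layers.
    version, formula, *layers = inchi[len(prefix):].split("/")
    return {"version": version, "formula": formula, "layers": layers}


isatin = "InChI=1/C8H5NO2/c10-7-5-3-1-2-4-6(5)9-8(7)11/h1-4H,(H,9,10,11)/f/h9H"
parts = split_inchi(isatin)

print(parts["version"])   # 1
print(parts["formula"])   # C8H5NO2
for layer in parts["layers"]:
    print(layer)
```

Note that the formula field is already in Hill order, which makes it directly comparable with formula-index entries.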

IV. Canonical SMILES for Isatin: C1=CC=C2C(=C1)C(=O)C(=O)N2


Copyright 2006
Gary Wiggins