
3. Chemical Identifiers
I371 Lecture Outline
Updated: 21 July 2006
I. Formulas
A. Empirical Formula - the simplest whole-number ratio of the various types of atoms in a compound.
B. Molecular Formula - the formula that reflects the correct molecular weight
of the substance, i.e., gives the symbol and exact number of each kind of
atom found in a molecule.
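C. Hill Formula Order - a convention for arranging molecular formulas in an index:
carbon first, hydrogen second, then the remaining elements alphabetically by symbol;
if the compound contains no carbon, all elements (including hydrogen) are listed alphabetically.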
D. Problem - ambiguity: more than one compound can have the same formula
(isomers); e.g., ethanol and dimethyl ether are both C2H6O.
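The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a full formula parser: the helper names `parse_formula`, `empirical`, and `hill_formula` are hypothetical, and the parser assumes simple formulas with no parentheses, charges, or isotopes.

```python
import re
from functools import reduce
from math import gcd


def parse_formula(formula):
    """Count atoms in a simple formula such as 'C8H5NO2'.

    Assumption: element symbols followed by optional counts only
    (no parentheses, charges, hydrates, or isotopes).
    """
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def empirical(counts):
    """Reduce atom counts to the simplest whole-number ratio."""
    divisor = reduce(gcd, counts.values())
    return {element: n // divisor for element, n in counts.items()}


def hill_formula(counts):
    """Write atom counts as a formula string in Hill order."""
    if "C" in counts:
        # Carbon first, hydrogen second, everything else alphabetically.
        rest = sorted(el for el in counts if el not in ("C", "H"))
        order = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # No carbon: strictly alphabetical, hydrogen included.
        order = sorted(counts)
    return "".join(el + (str(counts[el]) if counts[el] > 1 else "") for el in order)


# Glucose: molecular formula C6H12O6, empirical formula CH2O.
print(hill_formula(empirical(parse_formula("C6H12O6"))))

# Ambiguity: ethanol and dimethyl ether reduce to the same formula.
print(hill_formula(parse_formula("CH3CH2OH")), hill_formula(parse_formula("CH3OCH3")))
```

The last line shows why a formula alone cannot serve as an identifier: two different condensed structures produce the same Hill-order string.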
II. Registry Systems
A. Assignment of a unique identifier to a chemical substance
B. Each identifier represents one and only one chemical substance
C. Provides a link to other documents or data in the chemical information
system
D. Chemical Abstracts Service Registry System
- Largest system in existence: over 58,000,000 substances, about half of
which are biomolecular sequences
- All types of chemical substances
- Used by many other indexing services and government agencies
- CAS Registry Number
- Number of format Y-XX-X, where:
- Y = 2 to 6 digits, each X is 1 digit
- The final digit is a check digit computed from the other digits
- No inherent meaning in the number
- Example: 91-56-5 (Isatin)
- Problem: CAS RNs are proprietary; must pay CAS to include them in a database.
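The final digit of a CAS Registry Number is a check digit: each of the other digits is weighted by its position counted from the right (1, 2, 3, ...), the products are summed, and the sum modulo 10 gives the check digit. The sketch below validates a number on that basis. The function names are hypothetical, and the pattern allows up to 7 digits in the first block on the assumption that numbers assigned after this outline was written can be longer than the 2 to 6 digits stated above.

```python
import re

# Assumption: first block of 2 to 7 digits, then 2 digits, then 1 check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def cas_check_digit(body_digits):
    """Compute the check digit from all digits except the last one."""
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_valid_cas(rn):
    """Return True if rn has the right format and a matching check digit."""
    match = CAS_PATTERN.fullmatch(rn)
    if match is None:
        return False
    first, second, check = match.groups()
    return cas_check_digit(first + second) == int(check)


# Isatin, 91-56-5: 6*1 + 5*2 + 1*3 + 9*4 = 55, and 55 mod 10 = 5.
print(is_valid_cas("91-56-5"))
```

A check of this kind only catches typing errors; it says nothing about whether the number has actually been assigned to a substance.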
E. Other Numeric Identifiers
III. IUPAC International Chemical Identifier (InChI)
A. InChI
- Establishes a unique label for a compound
- Uses algorithms for converting an input organic chemical structure
to this unique (CANONICAL) form
- Steps
- Normalization (Apply chemical rules)
- Canonicalization
- Serialization (Assign the identifier)
B. Characteristics:
- Non-proprietary identifier for chemical substances
- For use in printed and electronic data sources
- To enable easier linking of diverse data compilations
C. Example of InChI for Isatin taken from PubChem:
- IUPAC Name: 1H-indole-2,3-dione
- InChI: InChI=1/C8H5NO2/c10-7-5-3-1-2-4-6(5)9-8(7)11/h1-4H,(H,9,10,11)/f/h9H
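An InChI string is a sequence of layers separated by slashes: a version, the molecular formula, and then layers introduced by a one-letter prefix (for example `c` for connectivity, `h` for hydrogens, `f` for fixed hydrogens). The following is a minimal sketch of splitting the Isatin string into those layers. The function name `split_inchi` is hypothetical, and the sketch assumes a single-component string in which the formula layer is present.

```python
def split_inchi(inchi):
    """Split an InChI string into version, formula, and prefixed layers.

    Assumption: the string starts with 'InChI=' and has a formula layer.
    Layers are returned as (prefix, body) pairs in their original order,
    because the same prefix can appear more than once.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len(prefix):].split("/")
    tagged = [(layer[:1], layer[1:]) for layer in layers]
    return version, formula, tagged


isatin = "InChI=1/C8H5NO2/c10-7-5-3-1-2-4-6(5)9-8(7)11/h1-4H,(H,9,10,11)/f/h9H"
version, formula, layers = split_inchi(isatin)
print(version, formula)
for tag, body in layers:
    print(tag, body)
```

Keeping the layers as an ordered list rather than a dictionary preserves the second `h` layer that follows `f` in this example.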
IV. Canonical SMILES
- Example for Isatin: C1=CC=C2C(=C1)C(=O)C(=O)N2
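As a quick consistency check, the heavy (non-hydrogen) atoms in the SMILES string can be counted and compared with the molecular formula C8H5NO2. This is a rough sketch, not a SMILES parser: the function name `heavy_atom_counts` is hypothetical, and it only handles unbracketed atoms from the organic subset. The second SMILES below is a hand-written aromatic form of the same molecule, included as an assumption to show that one structure can be written in more than one way, which is why a canonical form is needed.

```python
import re
from collections import Counter

# Unbracketed organic-subset atoms; two-letter symbols are tried first.
ORGANIC_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms in a SMILES string without bracket atoms."""
    if "[" in smiles:
        raise ValueError("bracket atoms are not handled in this sketch")
    # Aromatic atoms are lowercase in SMILES; normalise them to element symbols.
    return Counter(symbol.capitalize() for symbol in ORGANIC_ATOM.findall(smiles))


canonical = "C1=CC=C2C(=C1)C(=O)C(=O)N2"   # from the outline
alternative = "O=C1Nc2ccccc2C1=O"          # hand-written, same molecule

print(dict(heavy_atom_counts(canonical)))
print(heavy_atom_counts(canonical) == heavy_atom_counts(alternative))
```

Matching atom counts do not prove that two SMILES strings describe the same compound; they only confirm agreement with the formula.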
Copyright 2006 Gary Wiggins