
11. Chemical Name and Formula Searching (Searching for Information on a Single Chemical Substance)


C471 Lecture Notes
Updated: 6 October 2003

I. Introduction

Chemical Abstracts Service's Registry File is the single largest collection of data that can be used to identify a chemical substance. Each unique chemical substance is assigned a Registry Number, which CAS uses in preference to a chemical name to index documents in the CA or CAPlus Files. Much of the descriptive information about a compound (its molecular formula, variant names for the substance, and detailed information about its makeup, including the structure) is found in the Registry File. Furthermore, in recent years, actual data (experimental or calculated) have been added to the file, making it much more like a huge handbook. The Registry Number serves as the unique identifier of the record. The Registry File supports a number of search techniques that are built on the chemical name and other fields included in the Registry File records.

In the printed CA, there is no Registry Number Index. Instead, the Chemical Substance Index links the preferred CA Index Name for the substance to the documents that have information on it. However, names for classes of compounds are indexed in the General Subject Index. Also, in the printed Chemical Abstracts, supplemental access to the printed product is found in the Formula Indexes. The CSI has dictated much of the indexing policy for supplemental terms used to describe the role of the chemical substance in the document. The broad indexing terms found in the CAS Roles in the database and the Standard Subject Divisions in the printed CSI can be of considerable use in retrieving information about a compound on which much has been written.

Molecular formula searching in CA is dependent on the Hill Formula system. The concept of the dot-disconnected formula is important in both the database and the printed Molecular Formula Index to Chemical Abstracts.

A search for information on a single chemical substance may start with the name of the substance, its molecular formula, or various other words or codes that can be associated with it. (See: Locating All CA File References Citing a Chemical Substance, and How to Search for CAS Registry Numbers in the CAS Registry File.) In this lesson, we will encounter various coding systems that have been applied to the retrieval of chemical substances from both printed and computer-based sources. The main database to search for such information is the CAS Registry File, which now has in excess of 40,000,000 records for chemical substances. Most of the entries in the Registry File are now for sequences of biological macromolecules. The bulk of the remaining small-molecule entries are for organic compounds, either simple organics (esters, steroids, heterocycles, stereoisomers, etc.) or such things as mixtures, polymers, and organic salts. Just over 10% of the file consists of inorganic compounds.

II. Substance Searching Using Chemical Abstracts Service Registry Numbers

One very effective method of retrieving chemical substance information from a reference source is to use the Chemical Abstracts Service REGISTRY NUMBER for the substance. The Registry Number is a unique number assigned to each substance indexed by CAS. The CAS RN is a number of the format Y-XX-X, where Y can be from two to six digits, and X is one digit, for example, 494-12-2. The Registry Number is found in many databases and increasingly as an index to printed reference works. Bear in mind, however, that the Registry File started in 1965 with new substances that were encountered from that date forward. Most older substances are just now being entered into the system for records that date from 1907-65. CAS is expected to finish this task soon, so there are not many compounds discovered post-1907 that are not in the CAS Registry System. For compounds discovered prior to 1907, it is wise to search the Beilstein and Gmelin databases, which have coverage back to the 18th century.
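The Y-XX-X layout can be checked mechanically before a number is used in a search. The following is a minimal Python sketch, not an official CAS routine. The digit counts are taken from the description above and may need widening for newer numbers. The check-digit rule (the final digit equals the weighted sum of the preceding digits, modulo 10, with weights 1, 2, 3, ... counted from the right) is not stated in these notes and is included here as an added assumption about how the last digit is derived. The function name `is_valid_cas_rn` is hypothetical.

```python
import re

# Layout described in the notes: two to six digits, two digits, one digit.
RN_PATTERN = re.compile(r"^(\d{2,6})-(\d{2})-(\d)$")


def is_valid_cas_rn(rn: str) -> bool:
    """Return True if rn has the Y-XX-X layout and its final digit agrees
    with the weighted checksum of the preceding digits (assumed rule)."""
    match = RN_PATTERN.match(rn.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... counting from the right-hand end.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


if __name__ == "__main__":
    for rn in ("494-12-2", "68-04-2", "91-56-5", "494-12-3"):
        print(rn, is_valid_cas_rn(rn))
```

A check of this kind only catches typing errors; it says nothing about whether the number is actually assigned to a substance in the Registry File.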

The Registry Number appears in the indexing of CA and CAPlus File records in preference to the formal name of the compound.

Registry Numbers in Indexing

The indexing above is part of that given by CAS to the record for:

Grieco, Paul A.; Bahsas, Ali. Reactions of allylstannanes with in situ generated immonium salts in protic solvent: a facile aminomethano destannylation process. J. Org. Chem. (1987), 52(7), 1378-80. CODEN: JOCEAH ISSN:0022-3263. CAN 106:195826 AN 1987:195826 CAPLUS

CAS Registry Numbers are assigned to organic and inorganic substances, metals, alloys, minerals, polymers, coordination compounds, elements, isotopes, peptides, enzymes, biomolecular sequences, and nuclear particles. However, the mere mention of a compound in a document is not enough to ensure that the indexers at Chemical Abstracts Service will tie a CAS RN to the record for that document. To get an entry in the CA indexes, there must be something new reported about the substance. It may be a new method of preparation, a new source for the substance, a new reaction, a new kinetic or mechanistic study, new chemical or physical properties, a new method of analysis, a new use or application, or a new biological effect. Chemical reactants and the resulting products are routinely indexed, but reagents are not indexed unless there is a new preparation of the reagent itself or a novel use of a standard reagent. This must have been the case for some of the compounds in the record above.

III. The Index Guide and Chemical Name Searching in the Printed Chemical Substance Indexes

Just as the Index Guide controls the vocabulary that must be used in the Chemical Abstracts General Subject Index, it also provides the correct name to use in searching the CA Chemical Substance Index. For example, a check of the Index Guide for "Flavan" finds the following:

Flavan
See 2H-1-Benzopyran, 3,4-dihydro-2-phenyl- [494-12-2]

In alphabetizing chemical substance names in the index, locant numbers, stereo designators, etc. are ignored. Thus, we must look in the "B" section of the printed CA Chemical Substance Index for "Benzopyran" in order to find index entries on the compound. Note that the CAS Index Name for Flavan is inverted, with the name of the so-called HEADING PARENT listed first. This keeps structurally related compounds in the same area of the index. The basic Heading Parent compound is listed first, followed by derivatives and other structurally related compounds. The entries in the Chemical Substance Index include the TEXT MODIFICATIONS (other subject words) that give more information about the documents that are indexed.

IV. Qualified Substances in CAS Files and Indexes

If not much has been written about the substance during the indexing period, all of the indexed information is found in a single alphabetical sequence under the Index Name in the printed Chemical Substance Index. However, when the index entries become voluminous, CAS divides them into Standard Subject Divisions. The compounds so treated are referred to as QUALIFIED SUBSTANCES. Originally seven qualifiers were used, but two additional terms (formation and processes) were added in 1994, and one phrase (uses and miscellaneous) was subsequently split apart. The qualifiers are:

V. CAS Roles in the CA and other Files

ROLES are CAS indexing terms assigned to every indexed substance and to controlled index terms for classes of compounds. Roles were first applied to new online CA File records with v. 121 (July 1994). They were then applied retrospectively to all CA File records by means of a computer algorithm. Since there are 38 specific roles and 7 broad super roles, they substantially expand the indexing terms that were used prior to their introduction. The role terms give a more precise link to the substance. For example, it is now possible to specify not only that you want the preparation of the substance, but also that the preparation be a synthetic preparation, as opposed to industrial manufacture. In the past, there was no distinction made in the use of the term "Preparation" in such cases.

VI. Searching the Registry File with a Chemical Name

The Registry File is the largest single source of chemical names in existence. It can be searched by a trade or common name for a substance (CN), by its CAS Index Name (CN) or by fragments of the CAS Index Name (CNS field). (See: Tips for Chemical Name Searching.) Just as we had a Basic Index that is formed from subject words in a bibliographic database, there is also a BASIC INDEX for the Registry File when searched on STN. The Basic Index of the Registry File includes both chemical name fragments and molecular formula fragments. It may be necessary to follow certain protocols for special characters in order to search for a chemical name. Greek characters, for example, are spelled out in their entirety with a period before and after the Greek part of the name. An example of such a chemical name search in SciFinder Scholar is below. Note that in the SciFinder Scholar system, the search will work with or without the periods around the "alpha," but in STN command-language searching, the dots are mandatory.

alpha-Methylbenzoin Name Search

Methylbenzoin Registry File Record
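The convention for Greek characters described above (the letter spelled out, with a period before and after) can be applied automatically before a name is sent to a command-language search. The Python sketch below is illustrative only: the function name `to_stn_name` and the five-letter table are hypothetical, the table is deliberately partial, and the treatment of capital Greek letters is not covered by these notes and is left out.

```python
import re

# Partial, illustrative table: Greek letter -> spelled-out form.
GREEK_WORDS = {
    "\u03b1": "alpha",
    "\u03b2": "beta",
    "\u03b3": "gamma",
    "\u03b4": "delta",
    "\u03c9": "omega",
}

# A spelled-out Greek word that is not already dotted and is not part of a
# longer word (so "betaine" is left alone).
_SPELLED = re.compile(
    r"(?<![.A-Za-z])(" + "|".join(GREEK_WORDS.values()) + r")(?![.A-Za-z])"
)


def to_stn_name(name: str) -> str:
    """Rewrite Greek letters in a chemical name in the dotted, spelled-out form."""
    for char, word in GREEK_WORDS.items():
        name = name.replace(char, "." + word + ".")
    return _SPELLED.sub(r".\1.", name)


if __name__ == "__main__":
    print(to_stn_name("alpha-Methylbenzoin"))
    print(to_stn_name("\u03b1-Methylbenzoin"))
```

Names that already carry the periods pass through unchanged, so the function can be applied to every name without checking first.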

VII. Searching the Registry File and Printed CA Indexes with a Molecular Formula

The system which is most commonly used today for arranging molecular formulas in indexes is the HILL SYSTEM. The Hill System covers both organic and inorganic compounds according to the following rules:

  1. Sum individually all like atoms within the molecule.

  2. If carbon is present, place it and the total number of C's first in the formula.

  3. If both carbon and hydrogen are present, place hydrogen and the total number of H's second. Note that if carbon is not present, rule 4 applies to the substance, and the H is placed in its regular position in the alphabet.

  4. All other atoms in the molecule are arranged alphabetically.
    That means that for inorganic substances without carbon, the arrangement is alphabetical.
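The four rules above can be sketched in a few lines of Python. This is a minimal illustration, not CAS software: the function name `hill_formula` and the input format (a list of element symbol and count pairs, in any order and with repeats allowed) are assumptions made for the example. The output is written without spaces.

```python
from collections import Counter


def hill_formula(groups):
    """Build a Hill System formula from (symbol, count) pairs."""
    counts = Counter()
    for symbol, count in groups:      # rule 1: sum all like atoms
        counts[symbol] += count
    symbols = sorted(counts)          # rule 4: alphabetical order
    if "C" in counts:                 # rules 2 and 3 apply only when carbon is present
        leading = ["C"] + (["H"] if "H" in counts else [])
        symbols = leading + [s for s in symbols if s not in ("C", "H")]
    # A count of 1 is not written out.
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in symbols)


if __name__ == "__main__":
    print(hill_formula([("H", 2), ("S", 1), ("O", 4)]))                        # sulfuric acid
    print(hill_formula([("H", 1), ("Br", 1)]))                                 # hydrobromic acid
    print(hill_formula([("C", 1), ("H", 3), ("C", 1), ("O", 2), ("H", 1)]))    # acetic acid, CH3COOH
```

Because hydrogen is moved to second place only when carbon is present, sulfuric acid comes out as H2O4S and hydrobromic acid as BrH.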

Within the index itself, the numbers of elements come into play. Here is an example of compounds arranged for a Hill System Index:



Al6 Ca5 O14
B2 O3
B2 Zr3
Br H
C Cl4
C H Cl3
C H N O
C2 Ca
C2 H4
C2 H4 Br Cl
C2 H5 Al Br2
C5 H8 O2
C8 H5 N O2
C15 H24 N2
C22 H24 F N3 O2
Ca O3 Ti
Cl H
H2 O4 S
H4 Sn
O3 Pb Rb2
O5 P14 Zn7
Sn Zr4
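One way to reproduce the ordering of the sample index above is to compare formulas element by element, first by symbol and then by the number of atoms. The Python sketch below does this. It is an inference from the sample, not a published CAS sorting rule, and the name `hill_sort_key` is hypothetical. It assumes each formula is already written in Hill order.

```python
import re

# An element symbol followed by an optional count; spaces are simply skipped.
ELEMENT_COUNT = re.compile(r"([A-Z][a-z]?)(\d*)")


def hill_sort_key(formula: str):
    """Turn 'C2 H4 Br Cl' into [('C', 2), ('H', 4), ('Br', 1), ('Cl', 1)]."""
    return [(symbol, int(count) if count else 1)
            for symbol, count in ELEMENT_COUNT.findall(formula)]


if __name__ == "__main__":
    sample = ["Cl H", "C2 H4", "Ca O3 Ti", "C Cl4", "C15 H24 N2",
              "B2 Zr3", "C H Cl3", "Br H", "C8 H5 N O2", "B2 O3"]
    for formula in sorted(sample, key=hill_sort_key):
        print(formula)
```

Comparing the counts as numbers rather than as text is what places C8 before C15, and comparing symbols first is what places every carbon formula before Ca and Cl.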

Note that in the Registry File (including the SciFinder Scholar approach), the formulas may be searched with or without spaces between the element symbols. They are put here for clarity. The Hill System gives rise to some formulas that are quite different from those a chemist is used to seeing, e.g., H2O4S for sulfuric acid or BrH for hydrobromic acid.

The printed CA Formula Indexes do not have entries for the 600 or so qualified substances that have lots of information written about them. Thus, we find in the CA Formula Index from the 10th Collective Index period (1977-81):


C8H5NO2
1H-Indole-2,3-dione [91-56-5].
    See Chemical Substance Index
    sodium salt [3486-31-5], 90: 6180p; 91: 157670v; 94: 209034z

This tells us that the printed Chemical Substance Index must be used for detailed information on isatin itself, but it gives direct information that three documents dealt with the sodium salt of isatin during the period. When a substance would have more than 20 entries in a 6-month volume index or more than 50 entries in the 5-year collective Formula Indexes, a "See" reference is made to the name of the substance in the Chemical Substance Index. We find in the Formula Index the abstract numbers for the sodium salt of isatin since there were relatively few documents written about that compound during the 10th Collective Index period.

A chemical formula in the Hill System may have more than one substance with that formula. For a given formula, isomers are arranged alphabetically by the CAS Index Name.

In the online molecular formula index of the Registry File (/MF), salts, addition compounds, and mixtures have the molecular formulas for the components arranged separately, with ratios for salts and addition compounds specified when known. If the ratios are unknown, a lower case "x" before the second formula or subsequent formulas is used, e.g.,

C15 H24 N2 . 2 Cl H

C22 H24 F N3 O2 .x H2 O4 S

These are examples of the so-called DOT-DISCONNECTED FORMULAS. (See: Tips for Molecular Formula Searching.)
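A dot-disconnected formula can be taken apart into its components and their ratios with a short routine. The Python sketch below is an illustration under stated assumptions: the name `split_dot_disconnected` is hypothetical, a component with no multiplier is reported with a ratio of "1", and only whole-number multipliers and the lower-case "x" shown above are handled.

```python
import re

# Optional multiplier (a whole number or "x"), then the component formula.
COMPONENT = re.compile(r"^\s*(x|\d+)?\s*(.+?)\s*$")


def split_dot_disconnected(formula: str):
    """Return a list of (ratio, component formula) pairs.

    'C15 H24 N2 . 2 Cl H' -> [('1', 'C15H24N2'), ('2', 'ClH')]
    """
    components = []
    for part in formula.split("."):
        if not part.strip():
            continue  # ignore empty pieces such as a trailing period
        ratio, unit = COMPONENT.match(part).groups()
        components.append((ratio or "1", unit.replace(" ", "")))
    return components


if __name__ == "__main__":
    print(split_dot_disconnected("C15 H24 N2 . 2 Cl H"))
    print(split_dot_disconnected("C22 H24 F N3 O2 .x H2 O4 S"))
```

The component formulas are returned without spaces, which is the form needed when formula fragments are searched in the Basic Index.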

VIII. Molecular Formulas of Types of Compounds in CA/STN

A. Salts.

Simple salts such as sodium chloride are treated as any other Hill Formula: ClNa.

1. Metal Salts of Complex Organic or Organometallic Acids

In general these substances have the molecular formula of the parent acid followed by the dot-disconnect symbol (the period) and a multiplier times the symbol of the metal.

For metal salts of organic acids, the metal replaces one or more hydrogens attached to N, O, P, As, Se, or Te in an organic substance. The CAS structuring conventions treat these substances in the following manner:

Example: C6 H8 O7 . 3 Na
1,2,3-Propanetricarboxylic acid, 2-hydroxy-, trisodium salt
CAS RN: 68-04-2

A search of the SciFinder Scholar product for the molecular formula yielded ten answers at the time of the search, among them:

68-04-2 Registry File record

Other examples:

Exceptions:

Organometallic compounds in the Registry File are substances which have a carbon atom directly bonded to a metal atom, e.g., phenyllithium: C6 H5 Li. Note, however, that carbonium ions and carbanions are generally found as dot-disconnects in the Registry File.

Coordination compounds in the Registry File are substances in which an atom or group of atoms is bound to a central metal atom by a pair of electrons supplied by the coordinate group and not by the central metal atom, e.g., metallocenes. These substances have the Class Identifier code CCS in the Registry File records.

B. Polymers.

Polymers are indicated with the molecular formula of the repeating unit(s) in parentheses, to which is appended an "x". The "x" indicates a repeating unit. For example, the molecular formula for the homopolymer of 1,3-butadiene is (C4H6)x. A search for a polymer by molecular formula may retrieve variant forms of the substance, because the syndiotactic, isotactic, graft, and copolymer forms all have separate Registry Numbers.

IX. The Basic Index in the Registry File

The Registry File's Basic Index contains chemical name fragments and molecular formula fragments (including molecular formulas for individual components of multi-component substances and single component substances). Formula fragments searched in the Basic Index must be entered without spaces.

X. Element Information

In command-driven searching, it is possible to search for various information about the elements comprising a chemical substance, such as:

XI. Ring System Data and Ring Indexes

The Ring Identifier information (RID) lets you search a database for everything from the number of rings in a substance to the Ring Formula (minus hydrogens). The Registry File now has much information about rings that can be searched online, such as the Elemental Sequence for the Smallest Ring (/ESS), the number of rings in the ring system (/NRRS), etc. These search techniques can be valuable in refining a substance search in the Registry File. See the Registry File Database Summary Sheet for more options.

The Ring Systems Handbook provides an easy way to find the Heading Parent name for ring compounds. This name can then be used in the printed Chemical Substance Index or, for an online search, either the name or the Registry Number can be used to retrieve the Registry File record. It is important to know that the compound found in the Ring Systems Handbook may not actually exist. That is, there may be no information in the CA File on the substance. When a new ring system is identified, the substituents are stripped off, and a new ring system entry placed in the RSH.

The access to the entries in the Ring Systems Handbook is by name or ring analysis (and then by molecular formula of the rings making up the compound, ignoring hydrogens). The main part of the set is arranged by the number of rings comprising the compounds and the individual sizes of the smallest set of smallest rings. Thus, the number of component rings, the sizes of those rings, and the elements comprising them are enough information to find a ring compound. A section in the main body of the work might be labeled:



We would find in the section an entry for 1H-Indole [120-72-9]
                         H
               C         .
             :   .     . N .
           C:      .C.      . C
           .        :         :
           .        :         :
           C:       C.........C
             :    .            
               :C.              

with the molecular formula C8H7N and a 2-dimensional structural drawing of the molecule.

It would not be too difficult then to guess the proper Chemical Abstracts Index name for isatin: 1H-Indole-2,3-dione

isatin

Chemical Abstracts includes an Index of Ring Systems with each Formula Index, beginning with the 7th Collective Index period (1962-66).

XII. Compound Class Identifiers.

There are a number of other indexes that can be used in an online search of the Registry File, e.g., Compound Class Identifiers (/CI).

Class Name                        Code
Alloy                             AYS
Coordination Compound             CCS
Registered Concept                CTS
Generic Registration              GRS
Incompletely Defined Substance    IDS
Manually Registered Substance     MAN
Mineral                           MNS
Mixture                           MXS
Polymer                           PMS
Radical Ion                       RIS
Ring Parent                       RPS

An example of the use of the CI field in command-level searching is:

=> SEARCH PMS/CI (retrieves polymers)

Such searches are of use in combination with other Registry File searches in order to narrow an answer set. See the Registry File Summary Sheet for additional possibilities.
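For anyone scripting searches or annotating saved answer sets, the class codes listed above can be kept in a small lookup table. The Python sketch below is only a convenience wrapper. The function name `class_search_command` is hypothetical, and the command string simply mirrors the one example given in these notes; it does not attempt to cover STN syntax beyond that.

```python
# Compound Class Identifier codes as listed in these notes.
CLASS_IDENTIFIERS = {
    "AYS": "Alloy",
    "CCS": "Coordination Compound",
    "CTS": "Registered Concept",
    "GRS": "Generic Registration",
    "IDS": "Incompletely Defined Substance",
    "MAN": "Manually Registered Substance",
    "MNS": "Mineral",
    "MXS": "Mixture",
    "PMS": "Polymer",
    "RIS": "Radical Ion",
    "RPS": "Ring Parent",
}


def class_search_command(code: str) -> str:
    """Return a search line in the form shown in the notes, e.g. 'SEARCH PMS/CI'."""
    code = code.strip().upper()
    if code not in CLASS_IDENTIFIERS:
        raise ValueError(f"Unknown class identifier code: {code!r}")
    return f"SEARCH {code}/CI"


if __name__ == "__main__":
    print(class_search_command("pms"), "-", CLASS_IDENTIFIERS["PMS"])
```

Rejecting unknown codes up front avoids sending a mistyped identifier to the search system.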

XIII. The CAOLD File

The CAOLD File contains records for documents indexed in Chemical Abstracts 1907-66. It is possible to search the CAOLD file with the CAS Registry Number. The records for items in the CAOLD file bear little resemblance to those in the CA file, providing merely a link to the printed Chemical Abstracts accession numbers or a mechanism with which to link to a pdf file of the page. It is important to know that the CAOLD file records were generated from the CA Formula Indexes. Since the qualified substances do not have Formula Index entries, there are many CA accession numbers in the period 1907-66 that do not have pointers from the CAOLD file. It is always best to double-check the results of a CAOLD file search against the printed Collective Index. See the STN Database Summary Sheet for the CAOLD File for additional information.

XIV. Other Online Chemical Dictionary Files

Databases such as the Registry File are referred to as ONLINE CHEMICAL DICTIONARY FILES. They exist to help you identify substances, to gather like substances into a set, and to discover which files on the database vendor's system have information on the substance(s).

Of particular interest are the online chemical dictionary files from the National Library of Medicine. Although not nearly as large as the Registry File, NLM's CHEMLINE file contained over 1,360,000 records as of mid-1995. The CAS Registry Number is part of each record. Searching by CAS RNs, molecular formulas, CAS Index Names, synonyms, and various name and structural fragments is possible. A smaller NLM file is ChemIDplus, with nearly 350,000 compounds. An important feature of the ChemIDplus file is SUPERLIST. SUPERLIST designates a collection of lists of chemical substances maintained by key federal and state government regulatory agencies, as well as by scientific organizations concerned with health and environmental hazards of chemical substances. ChemIDplus provides directory assistance to those lists. Searching the NLM files is considerably cheaper than searching the CAS Registry File.

Unlike CAS, the National Library of Medicine has attempted to group compounds with related substances in its index in a hierarchical fashion. From 1963 through 1995, a chemical was generally "treed" in two places: in one Tree showing its chemical structure and in a second Tree under its function, or pharmacological action. The arrangement of chemical headings in MeSH (Medical Subject Headings) has not changed, but NLM no longer puts all drugs under the functional trees.


Copyright
Gary Wiggins
7 October 1995