12. Structure Searching

C471 Lecture Notes
Updated: 8 October 2003

I. What is Structure Searching?

STRUCTURE SEARCHING uses a graphic depiction of a chemical structure as the input for a search. Such searches are generally run against online chemical dictionary files, such as STN's Registry File. Depending on the type of structure search the system allows, the answer set will contain either the complete molecule alone or every compound that contains the query structure. Substitution may be allowed without limit at the free sites on the input molecule (a FULL SUBSTRUCTURE SEARCH) or limited to certain sites (a CLOSED SUBSTRUCTURE SEARCH).

On the STN system, once an answer set is formed in the Registry File, it can be crossed over to the CA or other files to conduct further subject searches on the compounds isolated by the structure search. In these cases, it is actually the CAS Registry Numbers of the compounds that are searched in the crossover files.

It is now also possible to conduct a search that takes into account the stereochemistry of chiral centers and double bonds. Stereo searching can be performed in the Registry File and the Beilstein File on STN or on the Beilstein CrossFire system.

A SIMILARITY SEARCH finds target molecules that are like the query structure in some respects. The likeness might lie in some biological property, such as drug absorption or toxicology with respect to metabolism. Usually, it is the similarity in functional groups that is measured. Finally, MARKUSH STRUCTURE SEARCHING, an important technique in patent searches that allows for considerable variability in the structures retrieved, is another option in some files.
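To make the idea of measuring similarity in functional groups concrete, here is a minimal sketch that ranks candidates by the Tanimoto coefficient (shared features divided by all distinct features). The choice of the Tanimoto coefficient is an assumption made for illustration; these notes do not say which measure any particular system uses. The functional-group labels and candidate names are likewise hypothetical.

```python
def tanimoto(a, b):
    """Shared features divided by total distinct features (0.0 to 1.0)."""
    if not a and not b:
        return 1.0  # two empty feature sets are treated as identical
    return len(a & b) / len(a | b)

# Hypothetical functional-group labels for a query and two candidates.
query = {"ketone", "lactam", "aromatic ring"}
candidates = {
    "candidate A": {"ketone", "lactam", "aromatic ring", "halide"},
    "candidate B": {"aromatic ring", "carboxylic acid"},
}

# Rank the candidates from most to least similar to the query.
ranked = sorted(candidates,
                key=lambda name: tanimoto(query, candidates[name]),
                reverse=True)
for name in ranked:
    print(name, tanimoto(query, candidates[name]))
```

A real system would compare far richer structural descriptors than three or four labels, but the ranking step works the same way.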

II. Why Use Structure Searching?

There are many reasons to do a substructure search, among them:

In combination with other types of searches, structure searching is a very powerful supplement.

III. Structure Searches in the STN Registry and Other Files.

Over 50,000,000 registered chemical substances appear in the Chemical Abstracts Service Registry File. Most of those have been registered since 1965, but, of course, not all of the compounds in the Registry File were discovered since that date. In 2002, Chemical Abstracts Service embarked on a project to retrospectively index all documents in the CA database. Thus, many compounds that have had no new information published about them since the establishment of the CA or CAPlus Files (i.e., since 1967) have now been added to the Registry File.

Most of the millions of compounds in the Registry File have their Registry Numbers linked to the databases on the STN system. The LC (File Locator) field of a Registry File record tells in which databases on STN the Registry Number is found. In addition to the Registry File, structure searches can be conducted in such databases on STN as BEILSTEIN, CASREACT, and others. A similar file locator function is included in other chemical dictionary files, such as NLM's ChemIDplus.

There are several types of structure searches possible in the Registry File, as well as different options for views of the molecules and different methods of inputting the structure. SciFinder Scholar masks to a certain extent the relationship between the Registry File and the CAPlus File, CASREACT, and other databases intertwined with its software.

SciFinder Scholar Structure Window

Within the SciFinder Scholar search stage itself, considerable information can be gleaned about the answer set to be retrieved. In the Preview option, the projected answer set can be analyzed by atom attachments, or, if the drawn structure contains them, by system-defined or user-defined variable groups. Once the structure is built and the answer set is retrieved, the search proceeds as it would if the compounds had been identified by name or molecular formula searches.

The structure search can be further refined with additional structural features or by limiting it to commercially available substances. Once the answer set is refined, the references that have the Registry Numbers of the compounds in their indexing can be retrieved.

With a suitable viewer, the image of the molecule can be viewed as a 3-D model.

Isatin viewed with WebLab Viewer Lite

In traditional, command-driven structure searching, the terminal type chosen when logging on to STN determines how the molecule will be displayed. If you select option 3 at the prompt:

TERMINAL (Enter 1, 2, 3 OR ?)

the structural depictions will be encoded with regular punctuation symbols found on a computer keyboard. Thus a double bond might be indicated by a colon (:) or an equal sign (=). With the proper telecommunications software, selecting option 2 will depict the structures as true graphical representations. That is the default option when using STN Express with Discover! (front-end software that allows the building of the structures offline).

The following types of structure searches are possible on STN: exact, family, closed substructure, and full substructure.

With SciFinder Scholar, one of two options is available, depending on whether the Substructure Search Module is included in the version of the software. The basic SciFinder Scholar search covers an exact and family search. The SSS module allows the fuller search options.

There are actually several stages in a Registry File structure search. The first stage screens the huge file for compounds that have the requisite substituents and other features, without regard to their position on the molecule. The much more computer-intensive iteration stage involves an atom-by-atom, bond-by-bond look at the candidate molecules isolated in the screen search. Since this stage requires so much of STN's computer resources, there are limits on the number of compounds that can be examined during the iterative stage. A sample search must first be run on approximately 5% of the file, after which a prediction is given as to whether the full file search will run to completion. If the prediction is favorable, the full file can be searched against the structure. Otherwise, the structure must be modified so that the search can run to completion. With SciFinder Scholar, there is some built-in intelligence that offers to "autofix" a molecule that might give the system trouble. It is also wise to preview the SciFinder Scholar search to see what kinds of substances will be retrieved with the structure as drawn.
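The screen stage and the iteration stage described above can be sketched in a few lines. This is only an illustration of the general idea, not the actual Registry File algorithm: the molecule representation (hydrogen-suppressed atom lists with bond orders), the element-count screen, and all function names are assumptions, and aromaticity, stereochemistry, and the 5% sample run are ignored.

```python
from collections import Counter

def make_mol(atoms, bonds):
    """atoms: list of element symbols; bonds: {(i, j): order} between atom indices."""
    adj = {i: {} for i in range(len(atoms))}
    for (i, j), order in bonds.items():
        adj[i][j] = order
        adj[j][i] = order
    return {"atoms": atoms, "adj": adj}

def passes_screen(query, target):
    """Stage 1: the target needs at least as many atoms of each element as the
    query, without regard to where they sit on the molecule."""
    need, have = Counter(query["atoms"]), Counter(target["atoms"])
    return all(have[el] >= n for el, n in need.items())

def matches(query, target, mapping=None):
    """Stage 2: backtracking atom-by-atom, bond-by-bond substructure match."""
    mapping = mapping or {}
    if len(mapping) == len(query["atoms"]):
        return True
    q = len(mapping)  # index of the next query atom to place
    for t in range(len(target["atoms"])):
        if t in mapping.values() or query["atoms"][q] != target["atoms"][t]:
            continue
        # Every bond from q to an already placed query atom must exist,
        # with the same order, between the corresponding target atoms.
        if all(target["adj"][t].get(mapping[n]) == order
               for n, order in query["adj"][q].items() if n in mapping):
            if matches(query, target, {**mapping, q: t}):
                return True
    return False

def substructure_search(query, file_):
    """Run the cheap screen first, then iterate only over the survivors."""
    candidates = {name: m for name, m in file_.items() if passes_screen(query, m)}
    return [name for name, m in candidates.items() if matches(query, m)]

# Query: a carbonyl group, C=O.
query = make_mol(["C", "O"], {(0, 1): 2})
file_ = {
    "propane": make_mol(["C", "C", "C"], {(0, 1): 1, (1, 2): 1}),
    "ethanol": make_mol(["C", "C", "O"], {(0, 1): 1, (1, 2): 1}),
    "acetone": make_mol(["C", "C", "O", "C"], {(0, 1): 1, (1, 2): 2, (1, 3): 1}),
}
print(substructure_search(query, file_))
```

In this toy file, propane is rejected by the screen because it has no oxygen, ethanol passes the screen but fails the bond-by-bond check because its C-O bond is single, and only acetone is retrieved.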

IV. Structure Searching on CrossFire

It is also possible to do very precise structure searching on the Beilstein CrossFire system.

CrossFire Structure Module

Unlike the STN system, where the type of structure search (exact, family, closed or full substructure; exact or substructure on SciFinder Scholar) determines the type of compounds retrieved, CrossFire requires the user to "set free sites" by indicating the number of substitutions allowed at given atoms or to make other choices at the time of structure drawing in order to broaden or narrow the scope of a search. Setting free sites is done either all at once in the Query Options menu (once the desired atoms have been selected) or atom by atom by choosing the precise number of free sites allowed for each atom. Other options are the inclusion/exclusion of isotopes and allowing substances that have a charge, radicals, etc. to be retrieved in the search.
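The effect of setting free sites can be pictured as an extra check applied to each atom of a matched structure. The sketch below assumes that a free site means one additional non-hydrogen attachment permitted at a query atom; the function name, the data layout, and the example molecules are hypothetical and do not reflect how CrossFire is implemented.

```python
def within_free_sites(query_degree, free_sites, target_degree, mapping):
    """True if no matched target atom carries more extra attachments than allowed.

    query_degree / target_degree: {atom index: number of non-hydrogen neighbours}
    free_sites: {query atom index: extra attachments allowed (default 0)}
    mapping: {query atom index: matched target atom index}
    """
    for q_atom, t_atom in mapping.items():
        extra = target_degree[t_atom] - query_degree[q_atom]
        if extra < 0 or extra > free_sites.get(q_atom, 0):
            return False
    return True

# Query C-O: atom 0 is carbon (up to 2 free sites), atom 1 is oxygen (none).
query_degree = {0: 1, 1: 1}
free_sites = {0: 2}

# Ethanol, C0-C1-O2, matched as query C -> atom 1, query O -> atom 2.
ethanol_ok = within_free_sites(query_degree, free_sites,
                               {0: 1, 1: 2, 2: 1}, {0: 1, 1: 2})

# Dimethyl ether, C0-O1-C2: the oxygen carries one attachment too many.
ether_ok = within_free_sites(query_degree, free_sites,
                             {0: 1, 1: 2, 2: 1}, {0: 0, 1: 1})

print(ethanol_ok, ether_ok)
```

Raising the free-site count on an atom broadens the search, and setting it to zero everywhere narrows the search toward an exact match.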

As with SciFinder Scholar and other STN options for structure searching, CrossFire includes a number of template files to assist in building complex molecules. To be sure that you are properly drawing a functional group, choose it from the template file "residue.bsd". The template icon is just to the left of the Benzene ring when in structure-drawing mode. Once in a given template, you can use the File-Open option to see the other available templates.

CrossFire also allows predefined groups of variable atoms:

The addition of an H to these symbols means that hydrogen could also be one of the variable atoms. For example, XH means that any of the atoms F, Cl, Br, I, or At, as well as H, would satisfy the search. Likewise, there are generic group symbols to represent such things as carbocyclic or heterocyclic rings, alkyl, alkenyl, or alkynyl chain groups, etc. Finally, the user may define generic groups if the predefined groups are not sufficient.
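One way to think about generic atom symbols is as a lookup from a symbol to the set of elements it may match. The sketch below includes only X, whose membership is inferred from the XH example above; the table layout and the function name are assumptions, and other predefined symbols are deliberately left out.

```python
# X is inferred from the XH example in the text: F, Cl, Br, I, or At.
HALOGENS = frozenset({"F", "Cl", "Br", "I", "At"})
GENERIC_ATOMS = {"X": HALOGENS}

def expand(symbol):
    """Return the set of elements that a query atom symbol may match."""
    if symbol in GENERIC_ATOMS:
        return GENERIC_ATOMS[symbol]
    # A trailing H on a generic symbol adds hydrogen to the allowed set.
    if symbol.endswith("H") and symbol[:-1] in GENERIC_ATOMS:
        return GENERIC_ATOMS[symbol[:-1]] | {"H"}
    # An ordinary element symbol matches only itself.
    return frozenset({symbol})

print(sorted(expand("X")))
print(sorted(expand("XH")))
print(sorted(expand("C")))
```

A user-defined group would simply be another entry in the table, added under whatever symbol the user chooses.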

V. Beilstein and Gmelin

Beilstein is for organic compounds, whereas Gmelin is for inorganic and organometallic compounds.

Beilstein covers compounds containing carbon along with the following elements:

          H
          Li, Be              B, C,  N,  O,  F
          Na, Mg                 Si, P,  S,  Cl
          K,  Ca                     As, Se, Br
          Rb, Sr                         Te, I
          Cs, Ba

Compounds can be single components or salts and mixtures (if they have at least one organic component). Peptides are covered if they contain twelve or fewer amino acids. Polymers or polycondensation products are not treated. The following are not typically treated as Beilstein compounds:

Gmelin covers compounds not covered in Beilstein, i.e., inorganic and organometallic chemistry as well as related fields such as mineralogy and metallurgy. Compounds are indexed with terms such as coordination compounds, alloys, ceramics, and inorganic polymers.

VI. Beilstein Lawson Numbers

Compounds in the Beilstein database are also indexed by a number that indicates various structural features. That is the Lawson Number. It represents certain structural fragments and can be used for structural similarity searches. In general, the smaller the Lawson Number, the more common the fragment. Every substance in Beilstein has at least one Lawson number assigned to it. Dividing the Lawson Number by 8 puts you roughly in the Beilstein system number for the printed Beilstein volume that contains the compound. The compounds are divided into 3 major groups in the printed Beilstein Handbook:

  1. Acyclic Compounds, Volumes 1-4; System Numbers 1-449
  2. Isocyclic Compounds, Volumes 5-16; System Numbers 450-2358
  3. Heterocyclic Compounds, Volumes 17-27; System Numbers 2359-4720.

[Unfortunately, the Beilstein Institute never published the meanings of the 4,720 system numbers used to classify organic compounds.]
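The divide-by-8 rule of thumb and the three handbook divisions listed above can be combined into a small lookup. This is a rough sketch only: the function names are made up for illustration, the result is approximate by the rule's own description (especially near the division boundaries), and the Lawson Number in the example is hypothetical rather than taken from a real compound.

```python
# (first system number, last system number, division, printed volumes)
DIVISIONS = [
    (1, 449, "Acyclic Compounds", "Volumes 1-4"),
    (450, 2358, "Isocyclic Compounds", "Volumes 5-16"),
    (2359, 4720, "Heterocyclic Compounds", "Volumes 17-27"),
]

def approximate_system_number(lawson_number):
    """Apply the rule of thumb: Lawson Number divided by 8."""
    return max(1, lawson_number // 8)

def handbook_division(lawson_number):
    """Return (approximate system number, division, volumes)."""
    system_number = approximate_system_number(lawson_number)
    for low, high, name, volumes in DIVISIONS:
        if low <= system_number <= high:
            return system_number, name, volumes
    return system_number, None, None  # outside the 4,720 system numbers

# Hypothetical Lawson Number, chosen only to show the arithmetic.
print(handbook_division(8000))
```

Here 8000 divided by 8 gives system number 1000, which falls in the range 450-2358 and so points to the Isocyclic Compounds, Volumes 5-16.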

The Lawson Number is effective when used in combination with other search keys, such as molecular formula, element ranges, etc. It is also useful when combined with NOT in substructure searches.

Copyright
Gary Wiggins
29 October 1995