STRUCTURE SEARCHING ON STN AND OTHER SYSTEMS

C472 Lecture Notes
Updated 28 January 2001

I. Introduction

STRUCTURE SEARCHING allows a search to be run using the chemical structure as input. The searches are generally run against online chemical dictionary files, such as STN's Registry File. Depending on the type of structure search allowed by the system, the complete molecule or any compound containing the structure of the molecule will be retrieved as an answer set. The retrieved structures may include salts, isotopically labeled substances, mixtures, and structures in which the drawn structure is contained as a subset of a larger structure.

Unlimited substitution of the input molecule may be allowed at free sites on the molecule (a FULL SUBSTRUCTURE SEARCH) or substitution may be limited to certain sites (a CLOSED SUBSTRUCTURE SEARCH). On the STN system, once an answer set is formed in the Registry File, it can be crossed over to the CA or other files to conduct further subject searches on the compounds isolated in the structure search. In these cases, it is actually the CAS Registry Number of each compound that is being searched in the crossover files. Note that it is now possible to conduct a search that takes into account the stereochemistry of chiral centers and double bonds. Stereo searching can be performed in the Registry File and the Beilstein File on STN or on the Beilstein CrossFire system. Finally, MARKUSH STRUCTURE SEARCHING, an important technique in patent searches that allows for considerable variability in the structures retrieved, is another option in some files.

II. Why Use Structure Searching?

There are many reasons to do a substructure search, among them:

In combination with other types of searches, structure searching is a very powerful complement.

III. Structure Searches in the STN Registry and Other Files

Over 25,000,000 registered compounds appear in the Chemical Abstracts Service Registry File. All of those have been registered since 1965, but, of course, not all of the compounds in the Registry File were discovered since that date. In fact, there are many compounds in the Registry File that have no new information on them in the CA or CAPlus Files (that is, in the literature from 1967 onward). However, most of the millions of compounds in the Registry File have their Registry Numbers linked to databases on the STN system. The LC (File Locator) field of a Registry File record tells in which databases on STN the Registry Number is found. In addition to the Registry File, structure searches can be conducted in such databases on STN as BEILSTEIN, CASREACT, and others. A similar file locator function is included in other chemical dictionary files, such as NLM's ChemID.

There are several types of structure searches possible in the Registry File, as well as different options for views of the molecules and different methods of inputting the structure. SciFinder Scholar masks to a certain extent the relationship between the Registry File and the CAPlus File, CASREACT, and other databases intertwined with its software.

Once the structure is built and the answer set retrieved, the search proceeds as it does with compounds identified by name or molecular formula searches. The structure search can be further refined with additional structural features or by limiting it with other parameters. Once refined, the references can be retrieved that have the Registry Number of the compounds in their indexing.

In traditional, command-driven structure searching, when logging on to STN, the choice of terminal determines what type of view of the molecule you will see. If one selects option 3 at the prompt:

TERMINAL (Enter 1, 2, 3 OR ?)

the structural depictions will be encoded with regular punctuation symbols found on a computer keyboard. Thus a double bond might be indicated by a ":" or a "=". With the proper telecommunications software, selecting option 2 will depict the structures as true graphical representations. That is the default option when using STN Express with Discover! (front-end software that allows the building of the structures offline).

The following types of structure searches are possible on STN:

With SciFinder Scholar, one of two options is available, depending on whether the Substructure Search Module is included in the version of the software. The basic SciFinder Scholar search covers an exact and family search. The SSS module allows the fuller search options.

There are actually several stages of a Registry File structure search. The first stage involves a screening of the huge file for compounds that have the requisite substituents and other features, without regard to their position on the molecule. The much more computer-intensive iteration stage involves an atom-by-atom, bond-by-bond look at the candidate molecules isolated in the screen search. Since this stage requires so much of STN's computer resources, there are limits on the number of compounds that can be looked at during the iterative stage. A sample search must be run on approximately 5% of the file, after which a prediction is given of whether the full file search will run to completion. Assuming the prediction is favorable, the full file can then be searched against the structure. Otherwise, the structure must be modified so that the search can run to completion. With SciFinder Scholar, there is some built-in intelligence that offers to "autofix" a molecule that might give the system trouble. It is also wise to preview the SciFinder Scholar search to see what kinds of substances might be retrieved with the structure as drawn.
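The screen-then-iterate idea can be illustrated with a small Python sketch. This is not STN's actual algorithm: the molecule representation (a dict of atoms plus a set of bonds), the element-count screen, and the names `screen`, `atom_by_atom`, and `substructure_search` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical representation: a molecule is (atoms, bonds), where atoms maps
# node number -> element symbol and bonds is a set of frozenset({a, b}) pairs.

def screen(query_atoms, cand_atoms):
    """Stage 1: cheap test that ignores position. The candidate must contain
    at least as many atoms of each element as the query."""
    need = Counter(query_atoms.values())
    have = Counter(cand_atoms.values())
    return all(have[el] >= n for el, n in need.items())

def atom_by_atom(query, cand):
    """Stage 2: backtracking search that maps every query atom and bond
    onto the candidate."""
    q_atoms, q_bonds = query
    c_atoms, c_bonds = cand
    order = sorted(q_atoms)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        q = order[len(mapping)]
        for c in c_atoms:
            if c in mapping.values() or c_atoms[c] != q_atoms[q]:
                continue
            # Each query bond to an already-mapped atom must exist in the candidate.
            ok = all(
                frozenset((c, mapping[p])) in c_bonds
                for p in mapping
                if frozenset((q, p)) in q_bonds
            )
            if ok:
                mapping[q] = c
                if extend(mapping):
                    return True
                del mapping[q]
        return False

    return extend({})

def substructure_search(query, file_):
    """Run the cheap screen first, then the expensive match on the survivors."""
    candidates = [m for m in file_ if screen(query[0], m[0])]
    return [m for m in candidates if atom_by_atom(query, m)]
```

In this toy version a C-N-O chain passes the screen for a C-O query (it has a carbon and an oxygen) but is rejected in the second stage because the two atoms are not bonded, which is the reason the expensive stage is run only on screened candidates.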

IV. How to Create a Structure in the Registry File

The "old-fashioned" way of building structures on the STN system is to use alphanumeric commands to gradually create the molecule. There are front-end programs such as STN Express or Molkick that can be used to draw a graphic depiction of the molecule offline and upload it to STN once the connection is made. Of course, tools such as SciFinder and STN on the Web also have a structure searching option. Nevertheless, it is instructive to see the original commands used to draw the molecule and the options for assigning parameters to the structure. When building the structure online via commands, it is advisable for cost reasons to build it in the inexpensive LREG file. Once the structure is complete and an L# is assigned to the structure query, you can transfer to the more expensive Registry File to run the search.

These are the basic steps that must be followed to create the structure online on STN using command language:

  1. Initiate the structure creation sub-program on the STN system by giving the STRUCTURE command at the STN LREG file prompt "=>".
  2. Build the outline of the structure using the GRAph command.
  3. Specify the non-carbon atoms with the NODe command.
  4. Specify the types of bonds in the molecule with the BONd command.
  5. Specify additional requirements for the molecule, such as:
  6. Do a final display of the molecule you have built with the DIS SIA (Display the Structure Image and Attributes) command.
  7. Terminate structure building with the END command.

At this point, an L# is assigned to the structure query you have created. Once the Registry File is entered, the structure search is initiated with the SEARCH L# command. An example of the structure building process using commands on STN and a Type 3 (alphanumeric) terminal setting is seen here.

V. The GRA Command and the Use of Pre-Drawn Structures

The Graph command builds the basic outline of the molecule. This can be a cumbersome process for larger molecules. Hence, there are alternatives. One way is to start with the Registry Number of a known substance that is similar to the compound of interest. Once the STRUCTURE command is given, you are prompted to:

ENTER NAME OF STRUCTURE TO BE RECALLED (NONE):

At this point, you could enter a Registry Number or, if you have built another structure in this session, the L# for that query structure. Another alternative is to enter a code for the pre-drawn systems used in creating structures. Rings of size 5 to 12 ring atoms can be created simply by inputting the appropriate number at the prompt. Other pre-drawn options include STEROD (steroids) and ADAMAN (adamantanes).

If starting from scratch, the two basic options for the GRA command are to draw a chain (c) or ring (r) followed by a number indicating the size of the chain or ring. Thus, GRA c3 builds a chain of 3 atoms, and GRA r6 builds a 6-membered ring. The structures appear on the screen with carbon atoms as the default nodes, and unspecified bonds. All nodes are numbered, so further commands to the system utilize the node numbers for appropriate actions.

One potentially confusing use of the GRA command occurs when two nodes are to be connected. Intuitively, this would seem to involve the BON command because we want to form a bond between the two atom nodes. However, BON is used only to modify an unspecified bond created with the GRA command. Thus, if we wanted to create a 14-membered ring, one way to do it would be to GRA c14, then GRA 1-14. That puts the necessary link between the two end nodes (although some further moving of the atoms would be necessary to make the ring appear reasonable on the screen).
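The bookkeeping behind GRA c3, GRA r6, and GRA 1-14 can be sketched in Python. The `Structure` class below is a hypothetical toy model, not part of STN; it assumes nodes are numbered consecutively from 1 and marks every new bond as unspecified ("?"), leaving bond types to a later BON-like step.

```python
class Structure:
    """Toy model of a query structure: numbered nodes and unspecified bonds."""

    def __init__(self):
        self.nodes = {}   # node number -> symbol (carbon by default)
        self.bonds = {}   # frozenset({a, b}) -> bond type ("?" = unspecified)

    def _new_node(self):
        n = len(self.nodes) + 1
        self.nodes[n] = "C"
        return n

    def _link(self, a, b):
        self.bonds[frozenset((a, b))] = "?"

    def gra(self, arg):
        """Interpret 'c<n>' (chain), 'r<n>' (ring) or 'a-b' (connect two nodes)."""
        arg = arg.lower()
        if "-" in arg:
            a, b = (int(x) for x in arg.split("-"))
            if a not in self.nodes or b not in self.nodes:
                raise ValueError("both nodes must already exist")
            self._link(a, b)
            return
        kind, size = arg[0], int(arg[1:])
        if kind not in ("c", "r"):
            raise ValueError("expected c<n>, r<n> or a-b")
        new = [self._new_node() for _ in range(size)]
        for a, b in zip(new, new[1:]):
            self._link(a, b)
        if kind == "r":
            self._link(new[-1], new[0])   # close the ring


s = Structure()
s.gra("c14")    # chain of 14 carbon nodes, 13 unspecified bonds
s.gra("1-14")   # join the two ends to give a 14-membered ring
```

After the second call the model holds 14 nodes and 14 bonds, each node having exactly two neighbors, which is the connectivity of a 14-membered ring.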

VI. Use of the NOD Command

The NOD command takes the form: NOD # symbol where the # refers to the number of the node in the molecule and the symbol is defined either by regular symbols for the elements or by special node symbols understood by the STN system. The latter include such things as "X" to represent any halogen, "M" to represent a metal, or "Gk" (where k represents a number from 1 to 20) to indicate a node which can vary according to your definition of the possible symbols (done with the VARiable command). There are also a number of SHORTCUT SYMBOLS for groups such as methyl "ME" or tert-butyl "T-BU".
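How a special node symbol widens the set of acceptable atoms can be shown with a short Python sketch. The function `node_matches`, the `variables` dictionary standing in for Gk definitions, and the abbreviated metal list are illustrative assumptions, not STN's implementation; shortcut symbols such as ME are not handled.

```python
HALOGENS = {"F", "Cl", "Br", "I", "At"}
# Illustrative subset only; a real system would list every metal.
METALS_SAMPLE = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn"}

def node_matches(symbol, element, variables=None):
    """Return True if a candidate atom satisfies the query node symbol.

    variables maps a Gk name (for example "G1") to the list of symbols
    the user defined for it.
    """
    variables = variables or {}
    if symbol == "X":
        return element in HALOGENS
    if symbol == "M":
        return element in METALS_SAMPLE
    if symbol in variables:
        # A variable node matches if any one of its alternatives matches.
        return any(node_matches(s, element) for s in variables[symbol])
    return symbol == element


# Hypothetical definition: G1 may be nitrogen, oxygen, or any halogen.
g = {"G1": ["N", "O", "X"]}
print(node_matches("G1", "Br", g), node_matches("G1", "C", g))
```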

There are four GENERIC GROUP SYMBOLS:

By issuing the GGC (Generic Group Category) command, these symbols can be further limited by type, for example, linear "LIN" or low carbon (6 or fewer carbons) "LOC".

Finally, it may be necessary to define a node as potentially being in either a ring or a chain. This is done with the command NOD # rc. Since the system assumes by default that the node is only to exist in the environment drawn, it is necessary to override the default with the rc specification when it is acceptable for an end node to be in either a ring or a chain in a substructure answer set.

VII. Use of the BON Command

The bond codes used in the Registry File structure building process are letter codes to specify bond types such as "se" (single exact) or "d" (double) or "n" (normalized). A NORMALIZED BOND is an aromatic bond or one found in a tautomer or combinations of rings and tautomers. If a ring has an even number of atoms and contains alternating single and double bonds all the way around the ring system, the bonds in the ring are designated as normalized. For fused rings, only the outside path is considered.
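The ring rule stated above (an even number of atoms with single and double bonds alternating all the way around) can be written as a small Python check. This sketch covers only that single-ring rule; the tautomer case and the outside-path rule for fused rings are not modeled, and the function name `ring_is_normalized` is an assumption.

```python
def ring_is_normalized(ring_bonds):
    """ring_bonds lists the bond orders (1 = single, 2 = double) in the
    order they are met when walking once around the ring."""
    n = len(ring_bonds)
    if n == 0 or n % 2 != 0:
        return False
    # Every bond and its successor (wrapping around) must be one single
    # and one double bond.
    return all(
        {ring_bonds[i], ring_bonds[(i + 1) % n]} == {1, 2}
        for i in range(n)
    )


benzene = [2, 1, 2, 1, 2, 1]
cyclohexene = [2, 1, 1, 1, 1, 1]
cyclopentadiene = [2, 1, 2, 1, 1]
print(ring_is_normalized(benzene),
      ring_is_normalized(cyclohexene),
      ring_is_normalized(cyclopentadiene))
```

Under this rule the benzene ring would be drawn with normalized bonds, while cyclohexene and cyclopentadiene keep their exact single and double bonds.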

For a tautomer, the following environment must exist:

where:

It is also possible to specify that a bond is only a ring bond or only a chain bond by defining it as BON rs or BON cd, for example. By default the system will assume that the bond is only to be part of a compound that has the environment in which it is drawn.

VIII. Structure Searching on Beilstein

It is also possible to do very precise structure searching on the Beilstein CrossFire system.

CrossFire Structure Module

Unlike the STN system, where the choice of type of structure search (exact, family, closed or full substructure) determines the type of compounds retrieved, CrossFire requires the user to "set free sites" by indicating the number of substitutions allowed at given atoms or to make other choices at the time of structure drawing in order to broaden or narrow the scope of a search. Setting free sites is done either all at once in the Query Options menu (once the desired atoms have been selected) or atom by atom by choosing the precise number of free sites allowed for each atom. Other options include the inclusion or exclusion of isotopes and whether charged substances, radicals, etc. may be retrieved in the search.
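One plausible reading of a free-site count, assumed here rather than taken from CrossFire documentation, is that a matched atom may carry up to that many substituents beyond those drawn in the query. The Python sketch below expresses that reading; the function name and its arguments are hypothetical.

```python
def free_sites_ok(drawn_neighbors, free_sites, candidate_neighbors):
    """Accept the candidate atom if it has no fewer neighbors than were drawn
    and no more extra neighbors than the free-site count allows."""
    extra = candidate_neighbors - drawn_neighbors
    return 0 <= extra <= free_sites


# A query atom drawn with 2 neighbors and 1 free site:
print(free_sites_ok(2, 1, 2))   # no extra substituent
print(free_sites_ok(2, 1, 3))   # one extra substituent
print(free_sites_ok(2, 1, 4))   # two extra substituents
```

With zero free sites on every atom this reduces to an exact match of the drawn connectivity, while larger counts open the query toward a substructure search.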

As with SciFinder Scholar and other STN options for structure searching, CrossFire includes a number of template files to assist in building complex molecules. The standard location for the template files is the directory /xfire/template.

CrossFire also allows predefined groups of variable atoms: A=any, Q=any but C or H, M=metal, and X=halogen. The addition of an H to these symbols means Hydrogen could also be one of the variable atoms. For example, XH implies that any of the atoms F, Cl, Br, I, or At plus H would satisfy the search. Likewise, there are generic group symbols to represent such things as carbocyclic or heterocyclic rings, alkyl, alkenyl, or alkynyl chain groups, etc. Finally, the user may define generic groups if the predefined groups are not sufficient.
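The predefined variable atoms and the effect of the H suffix can be sketched as a Python lookup. The function `crossfire_matches` and the abbreviated metal list are illustrative assumptions; the sketch also assumes that A and Q without the suffix exclude hydrogen, which is how the suffix rule above is read here.

```python
HALOGENS = {"F", "Cl", "Br", "I", "At"}
# Illustrative subset only; a real system would list every metal.
METALS_SAMPLE = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn"}

def crossfire_matches(symbol, element):
    """Return True if element satisfies A, Q, M, X or their H-suffixed forms."""
    allow_h = len(symbol) == 2 and symbol[1] == "H"
    base = symbol[0] if allow_h else symbol
    if base not in ("A", "Q", "M", "X"):
        raise ValueError("unknown generic atom symbol: " + symbol)
    if element == "H":
        return allow_h            # hydrogen only matches the H-suffixed forms
    if base == "A":
        return True               # any atom
    if base == "Q":
        return element != "C"     # any atom but carbon (hydrogen handled above)
    if base == "M":
        return element in METALS_SAMPLE
    return element in HALOGENS    # base == "X"


for el in ("H", "Br", "O"):
    print(el, crossfire_matches("XH", el))
```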

IX. Other Sources

Here are some Internet sources that are of relevance to this topic.

Gary Wiggins
29 October 1995