S PN=5577239 OR PN=5950192 OR PN=4642762
2 PN=5577239
1 PN=5950192
1 PN=4642762
S2 3 PN=5577239 OR PN=5950192 OR PN=4642762
?
TYPE 2/2/ALL
2/2/1 (Item 1 from file: 653)
DIALOG(R)File 653:US Patents Fulltext
(c) format only 2001 The Dialog Corp. All rts. reserv.
01566805
Utility
STORAGE AND RETRIEVAL OF GENERIC CHEMICAL STRUCTURE REPRESENTATIONS
PATENT NO.: 4,642,762
ISSUED: February 10, 1987 (19870210)
INVENTOR(s): Fisanick, William, Columbus, OH (Ohio), US (United States of
America)
ASSIGNEE(s): American Chemical Society, (A U.S. Company or Corporation ),
Washington, DC (District of Columbia), US (United States of
America)
EXTRA INFO: Expired, effective February 10, 1999 (19990210), recorded in
O.G. of April 20, 1999 (19990420)
Reinstated, effective June 28, 1999 (19990628), recorded in
O.G. of July 27, 1999 (19990727)
APPL. NO.: 6-614,219
FILED: May 25, 1984 (19840525)
U.S. CLASS: 707-3 cross ref: 707-104
INTL CLASS: [4] G06F 15-40
FIELD OF SEARCH: 364-200MSFILE; 364-900MSFILE
References Cited
U.S. PATENT DOCUMENTS
4,473,890 9/1984 Araki 364-900
PRIMARY EXAMINER: Zache, Raulfe B.
ATTORNEY, AGENT, OR FIRM: Pollick, Philip J.
CLAIMS: 31
EXEMPLARY CLAIM: 1
DRAWING PAGES: 22
DRAWING FIGURES: 22
ART UNIT: 232
FULL TEXT: 1514 lines
2/2/2 (Item 1 from file: 654)
DIALOG(R)File 654:US PAT.FULL.
(c) format only 2001 The Dialog Corp. All rts. reserv.
02999303
Utility
RELATIONAL DATABASE MANAGEMENT SYSTEM FOR CHEMICAL STRUCTURE STORAGE,
SEARCHING AND RETRIEVAL
PATENT NO.: 5,950,192
ISSUED: September 07, 1999 (19990907)
INVENTOR(s): Moore, Jeffrey, Timonimun, MD (Maryland), US (United States of
America)
Brazil, Joanne, White Hall, MD (Maryland), US (United States
of America)
Hoover, Jeffrey R., Baltimore, MD (Maryland), US (United
States of America)
ASSIGNEE(s): Oxford Molecular Group, Inc , (A U.S. Company or Corporation),
Towson, MD (Maryland), US (United States of America)
APPL. NO.: 8-883,165
FILED: June 26, 1997 (19970626)
This application is a continuation of application Ser. No. 08-715,708,
filed Sep. 19, 1996, now abandoned, which is a continuation application of
Ser. No. 08-288,503, filed Aug. 10, 1994, now U.S. Pat. No. 5,577,239, the
entire disclosure of which is incorporated herein by reference.
U.S. CLASS: 707-3 cross ref: 702-27
INTL CLASS: [6] G06F 17-30
FIELD OF SEARCH: 395-496; 395-497; 395-499; 395-600; 395-603; 707-3; 707-2;
707-1; 707-100; 707-102; 707-104; 707-22; 707-27; 707-19;
707-20; 702-22; 702-27; 702-19; 702-20
References Cited
U.S. PATENT DOCUMENTS
4,642,762 2/1987 Fisanick 707-3
4,811,217 3/1989 Tokizane et al. 364-300
4,855,931 8/1989 Saunders 364-499
5,025,388 6/1991 Cramer, III et al. 364-496
5,056,035 10/1991 Fujita 364-497
5,259,137 11/1993 Wilson et al. 364-496
5,367,058 11/1994 Pitner et al. 530-391.9
5,379,234 1/1995 Wilson et al. 364-496
5,386,507 1/1995 Teig et al. 395-161
5,418,944 5/1995 DiPace et al. 395-600
5,463,564 10/1995 Agrafiotis et al. 364-496
5,577,239 11/1996 Moore et al. 395-603
NON-U.S. PATENT DOCUMENTS
090 895 A2 10/1983 EP (European Patent Office)
213 483 A2 3/1987 EP (European Patent Office)
OTHER REFERENCES
Viking Instruments Corp. (Hewlett Packard); SpectraTrak Transportable GC/MS
Systems; (brochure)-No Date.
Chemical Structures, The International Language of Chemistry; Wendy A. War
(Ed.); "Interfacing DARC--Oracle" AJCM (Juus) de Jong (1988).
J. Chem. Inf. Comput. Sci. (1983), vol. 23, No. 3; pp. 102-108; DARC
Substructure Search System: A New Approach to Chemical Information; Roger
Attias.
J. Chem. Inf. Comput. Sci. (1987), vol. 27, No. 2; pp. 74-82; DARC System:
Notions of Defined and Generic Substructures. Filiation and Coding of FREL
Substructure (SS) Classes; Jacques-Emile Dubois et al.
J. Chem. Inf. Comput. Sci. (1990), vol. 30, No. 2; pp. 191-199,
Substructure Search Systems. 1. Performance Comparison of the MACCS, DARC,
HTSS, and CAS Registry MVSSS, and S4 Substructure Search System; Martin G.
Hicks & Clemens.
J. Chem. Inf. Comput. Sci. (1988), vol. 28, No. 4; pp. 221-226; An
Efficient Graph Approach to Matching Chemical Structures, O. Owolabi.
J. Chem. Inf. Comput. Sci. (1990), vol. 30, No. 4; pp. 332-339; Reactions
in the Beilstein Information System: Nonaporic Organic Synthesis; Martin G.
Hicks.
Analytica Chimica Acta, 235 (1990), pp. 87-92; Substructure Search Systems
for Large Chemical Data Bases; Martin G. Hicks et al.
J. Chem. Inf. Comput. Sci. (1991), vol. 31, No. 2; pp. 320-326; The
Beilstein Structure Registry System. 1. General Design; Laszio Domokos.
J. Chem. Inf. Comput. Sci. (1989), vol. 29, No. 4; pp. 255-260; 3DSearch; A
System for Three-Dimensional Substructure Searching; Robert P. Sheridan, et
al.
Substructure Searches of Chemical Structure Files; (Jan. 23, 1973);
Strategic Considerations in the Design of a Screening System for
Substructure Searches of Chemical Structure Files; George W. Adamson, et
al.
Chemical Structure Searching; (Jan. 21, 1975); An Efficient Design for
Chemical Structure Searching. I. The Screens; Alfred Feldman et al.
J. Chem. Inf. Comput. Sci. (1982), vol. 22, No. 4; The Third BASIC Fragment
Search Dictionary; W. Graf, H. K. Kaindl, et al.
J. Chem. Inf. Comput. Sci. (1983), vol. 23, No. 3; The CAS Online Search
System. 1. General System Design and Selection, Generation, and Use of
Search Screens; P. G. Dittmar, et al.
Computer Chemical, (1991), vol. 15, No. 2, pp. 103-107; A Central Atom
Based Algorithm and Computer Program for Substructure Search; Alf Dengler
and Ivar Ugi.
J. Chem. Inf. Comput. Sci. (1993), vol. 33, No. 4; pp. 545-547; Structure
Searching in Chemical Databases by Direct Lookup Methods; Bradley D.
Christie et al.
PRIMARY EXAMINER: Von Buhr, Maria N.
ATTORNEY, AGENT, OR FIRM: Dickstein Shapiro Morin & Oshinsky
CLAIMS: 14
EXEMPLARY CLAIM: 1
DRAWING PAGES: 7
DRAWING FIGURES: 12
ART UNIT: 277
FULL TEXT: 798 lines
2/2/3 (Item 2 from file: 654)
DIALOG(R)File 654:US PAT.FULL.
(c) format only 2001 The Dialog Corp. All rts. reserv.
02592280
Utility
CHEMICAL STRUCTURE STORAGE, SEARCHING AND RETRIEVAL SYSTEM
PATENT NO.: 5,577,239
ISSUED: November 19, 1996 (19961119)
INVENTOR(s): Moore, Jeffrey, 12 Breezy Tree Ct., Timonimun, MD (Maryland),
US (United States of America), 21093
Brazil, Joanne, 4500 Jolly Acres Rd., White Hall, MD
(Maryland), US (United States of America), 21161
Hoover, Jeffrey R., 8639 Willow Oak Rd., Baltimore, MD
(Maryland), US (United States of America), 21234
[Assignee Code(s): 68000]
EXTRA INFO: Assignment transaction [Reassigned], recorded October 12,
1994 (19941012)
Assignment transaction [Reassigned], recorded January 21,
1997 (19970121)
APPL. NO.: 8-288,503
FILED: August 10, 1994 (19940810)
U.S. CLASS: 707-3 cross ref: 702-27
INTL CLASS: [6] G06F 17-30
FIELD OF SEARCH: 364-DIG.1; 364-DIG.2; 364-496; 364-497; 364-499; 395-600
References Cited
U.S. PATENT DOCUMENTS
4,642,762 2/1987 Fisanick 364-300
4,811,217 3/1989 Tokizane et al. 364-300
4,855,931 8/1989 Saunders 364-499
5,025,388 6/1991 Cramer, III et al. 364-496
5,056,035 10/1991 Fujita 364-497
5,249,137 2/1993 Wilson et al. 364-496
5,367,058 11/1994 Pitner et al. 530-391.9
5,379,234 1/1995 Wilson et al. 364-496
5,386,507 1/1995 Teig et al. 395-161
5,418,944 5/1995 DiPace et al. 395-600
5,463,564 10/1995 Agrafiotis et al. 364-496
OTHER REFERENCES
Viking Instruments Corp. (Hewlett Packard); Spectra Trak Transportable
GC/MS System; (brochure), No date.
Chemical Structure, The International Language of Chemistry; Wendy A. War
(Ed.); "Interfacing DARC-Oracle" AJCM (Juus) de Jong (1988).
J. Chem. Inf. Comput. Sci. (1983), vol. 23, No. 3 pp. 102-108; DARC
Substructure Search System; A New Approach to Chemical Information; Roger
Attias.
J. Chem. Inf. Comput. Sci. (1987), vol. 27, No. 2; pp. 74-82; DARC System;
Notions of Defined and Generic Substructures. Filiation and Coding of FREL
Substructure (SS) Classes; Jacques-Emile Dubois et al.
J. Chem. Inf. Comput. Sci. (1990), vol. 30, No. 2; pp. 191-199,
Substructure Search Systems, 1, Performance Comparison of the MACCS, DARC,
HTSS, CAS Registry MVSSS, and S4 Substructure Search Systems; Martin G.
Hicks.
J. Chem. Inf. Comput. Sci. (1988), vol. 28, No. 4; pp. 221-226; An
Efficient Graph Approach to Matching Chemical Structures, O. Owolabi.
J. Chem. Inf. Comput. Sci. (1990), vol. 30, No. 4; pp. 332-339; Reactions
in the Beilstein Information System: Nonaporic Organic Synthesis; Martin G.
Hicks.
Analytica Chimica Acta, 235 (1990), pp. 87-92; Substructure Search Systems
for Large Chemical Data Bases; Martin G. Hicks et al.
J. Chem. Inf. Comput. Sci. (1991), vol. 31, No. 2; pp. 320-326; The
Beilstein Structure Registry System, 1, General Design; Laszio Domokos.
J. Chem. Inf. Comput. Sci. (1989), vol. 29, No. 4; pp. 255-260; 3DSearch; A
System for Three-Dimensional Substructure Searching; Robert P. Sheridan, et
al.
Substructure Searches of Chemical Structure Files; (Jan. 23, 1973);
Strategic Considerations in the Design of a Screening System for
Substructure Searches of Chemical Structure Files; George W. Adamson, et
al.
Chemical Structure Searching; (Jan. 21, 1975); An Efficient Design for
Chemical Structure Searching, I, The Screens; Alfred Feldman et al.
J. Chem. Inf. Comput. Sci. (1982), vol. 22, No. 4; The Third BASIC Fragment
Search Dictionary; W. Graf, H. K. Kaindl, et al.
J. Chem. Inf. Comput. Sci. (1983), vol. 23, No. 3; The CAS ONLINE Search
System, 1, General System Design and Selection, Generation, and Use of
Search Screens; P. G. Dittmar, et al.
Computer Chemical, (1991), vol. 15, No. 2; pp. 103-107; A Central Atom
Based Algorithm and Computer Program for Substructure Search; Alf Dengler
and Ivar Ugi.
J. Chem. Inf. Comput. Sci. (1993), vol. 33, No. 4; pp. 545-547; Structure
Searching in Chemical Databases by Direct Lookup Methods; Bradley D.
Christie et al.
J. Chem. Inf. Comput. Sci. (1993); vol. 33, No. 4; pp. 539-541;
Substructure Searching on Very Large Files by Using Multiple Storage
Techniques; Alexander Bartmann et al.
PRIMARY EXAMINER: Black, Thomas G.
ASST. EXAMINER: Von Buhr, Maria N.
ATTORNEY, AGENT, OR FIRM: Dickstein Shapiro Morin & Oshinsky LLP
CLAIMS: 12
EXEMPLARY CLAIM: 1
DRAWING PAGES: 7
DRAWING FIGURES: 12
ART UNIT: 237
FULL TEXT: 791 lines
?
TYPE 2/2,EM,SU/ALL
2/2,EM,SU/1 (Item 1 from file: 653)
DIALOG(R)File 653:US Patents Fulltext
(c) format only 2001 The Dialog Corp. All rts. reserv.
01566805
Utility
STORAGE AND RETRIEVAL OF GENERIC CHEMICAL STRUCTURE REPRESENTATIONS
PATENT NO.: 4,642,762
ISSUED: February 10, 1987 (19870210)
INVENTOR(s): Fisanick, William, Columbus, OH (Ohio), US (United States of
America)
ASSIGNEE(s): American Chemical Society, (A U.S. Company or Corporation ),
Washington, DC (District of Columbia), US (United States of
America)
EXTRA INFO: Expired, effective February 10, 1999 (19990210), recorded in
O.G. of April 20, 1999 (19990420)
Reinstated, effective June 28, 1999 (19990628), recorded in
O.G. of July 27, 1999 (19990727)
APPL. NO.: 6-614,219
FILED: May 25, 1984 (19840525)
U.S. CLASS: 707-3 cross ref: 707-104
INTL CLASS: [4] G06F 15-40
FIELD OF SEARCH: 364-200MSFILE; 364-900MSFILE
References Cited
U.S. PATENT DOCUMENTS
4,473,890 9/1984 Araki 364-900
PRIMARY EXAMINER: Zache, Raulfe B.
ATTORNEY, AGENT, OR FIRM: Pollick, Philip J.
CLAIMS: 31
EXEMPLARY CLAIM: 1
DRAWING PAGES: 22
DRAWING FIGURES: 22
ART UNIT: 232
FULL TEXT: 1514 lines
FIELD
This invention relates to a method for storing and retrieving generic
chemical structure representations (Markush formulations) and information
associated with them. It is directed especially to development of
specific(real)-atom and generic-group representations of the Markush
formulation that are used in atom-by-atom and group-by-group comparison of
query and file representations and the use of screening techniques to
eliminate a high percentage of irrelevant file representations prior to
group-by-group and atom-by-atom comparison of generic-group and
specific-atom representations.
BACKGROUND
The ability to effectively retrieve information on generic chemical
structures, i.e., so-called Markush structures, has been a problem of
varying magnitude and complexity since the inception of the use of the
Markush claim by the Patent Office in the 1920's. Many manual and
mechanized information retrieval systems have been developed to meet the
challenge of this problem but the known techniques for such retrieval are
imprecise and often place a premium on the knowledge, intuition, and
cognitive skills of the searcher.
The basic system for dealing with Markush structures is a manual system
in which individual documents containing the Markush structure are
classified according to a highly refined classification system and
physically grouped according to the classification scheme into a search
file. In making a search, the searcher proceeds by classifying the document
(query) in hand and then goes to the appropriately classified physical
group of documents in the search file and manually searches those documents
for relevant retrievals. Such a system places a high premium on the correct
initial classification of search file documents, correct classification of
the query, physical search-file integrity, and highly-developed cognitive
skills of the searcher. Moreover, because the Markush may represent
thousands or even millions of compounds, it often is impossible to
promulgate copies of the document into all of the search file
classifications represented by the Markush formulation. Weaknesses in any
of the aforementioned areas are likely to produce unsatisfactory search
results. (U.S. Department of Commerce, "Development and Use of Patent
Classification Systems", U.S. Government Printing Office, Washington, D.C.,
1966.)
Another technique used in both manual and mechanized systems for the
handling of Markush structures involves the use of a system of
fragmentation codes that are in effect generic or real-atom "group"
representations of portions of a particular Markush formulation. For
example, that portion of the formulation containing chains of carbon atoms
might be generically encoded as alkyl, or OH group as an alcohol or
hydroxide, and F, Cl, Br, and I as a halide. Real-atom groups, such as
methyl for CH sub 3 --, ethyl for CH sub 3 CH sub 2 --, and phenyl for C
sub 6 H sub 5 --, are also typically used. (Balent, M. Z.; Emberger, J. M.
"A Unique Chemical Fragmentation System for Indexing Patent Literature" J.
Chem. Inf. Comput. Sci. 1975, 15, 100-104. Kaback, S. M. "Chemical
Structure Searching in Derwent's World Patents Index" J. Chem. Inf. Comput.
Sci. 1980, 20, 1-6. Rossler, S.; Kolb, A. "The GREMAS System, an Integral
Part of the IDC System for Chemical Documentation" J. Chem. Doc. 1970, 10,
128-134. Rowlett, R. J. "Gleaning Patents with Chemical Abstracts" Chemtec.
1979, June, 348-349. Silk, J. A. "Present and Future Prospects for
Structural Searching of the Journal and Patent Literature." J. Chem. Inf.
Comput. Sci. 1979, 19, 195-198.) However, the inter-relationships among
these groups in a Markush formulation are typically not encoded. As a
result, such systems tend to have good recall, i.e., most of the relevant
search file answers are retrieved but, because the inter-relationship among
the groups cannot be specified and because of the reliance on generic terminology,
such systems have a pronounced tendency to lack precision, i.e., many of
the answers retrieved are irrelevant to the query. Precision has been
improved by incorporation of a higher degree of specificity into the
fragmentation codes, but only at a price paid in terms of higher complexity
and difficulty in file encoding and search profile formulation and a
resulting higher potential for error.
Mechanized specific atom-by-atom structure matching of query and file
structural representations is a well-known commercial technique that has
been available since the 1960s and has demonstrated high recall and
precision as a search and retrieval technique. (Wigington, R. L. "Machine
Methods for Accessing Chemical Abstracts Service Information" in Proceedings
of the IBM Symposium on Computers and Chemistry; IBM Data Processing
Division: White Plains, NY, 1969. Eakin, D. R. "The ICI CROSSBOW System,"
in Ash, J. E.; Hyde, E., Eds. Chemical Information Systems, Chichester,
Horwood, 1975. Dubois, J. E. "DARC System in Chemistry", in Computer
Representation and Manipulation of Chemical Information, Wipke, W. T.;
Heller, S.; Feldman, R.; Hyde, E., Eds., Wiley, New York, 1974. Schenk, H.
R.; Wegmuller, F. "Substructure Search by Means of the Chemical Abstracts
Service Chemical Registry II System" J. Chem. Inf. Comput. Sci. 1976, 16,
153-161. Feldman, R. J. "Interactive Graphic Chemical Substructure
Searching" in Computer Representation and Manipulation of Chemical
Information, Wipke, W. T.; Heller, S.; Feldman, R.; Hyde, E., Eds., Wiley,
New York, 1974.) Because atom-by-atom structure matching is a relatively
slow process, screening techniques have been developed to eliminate a high
percentage of irrelevant file representations. Typically screening involves
capturing key features of the file representations such as atom environment
and atom sequences and then matching similar key features of the query
representation to give a set of answers that are then used in atom-by-atom
structure matching. (Dittmar, P. G.; Farmer, N. A.; Fisanick, W.; Haines,
R. C.; Mockus, J. "The CAS ONLINE Search System. 1. General System Design
and Selection, Generation, and Use of Search Screens" J. Chem. Inf. Comput.
Sci. 1983, 23, 93-102. Attias, R. "DARC Substructure Search System: A New
Approach to Chemical Information" J. Chem. Inf. Comput. Sci. 1983, 23,
102-108.) Unfortunately, structure matching techniques tend to be limited
to files containing representations of unique individual compounds and
queries have been limited to specific structural representations that must
exactly match the structural representation of the file compound
(full-structure search) or be embedded within it (substructure search).
Structure matching techniques have been applied to Markush formulations
which represent a relatively small number of specific compounds using
queries that contain only real atoms. (Meyer, E. "Topological Search for
Classes of Compounds in Large Files--even of Markush Formulas--at
Reasonable Machine Cost" in Computer Representation and Manipulation of
Chemical Information, Wipke, W. T.; Heller, S.; Feldman, R.; Hyde, E.,
Eds., Wiley, New York, 1974.) However, in attempting to apply structure
matching techniques to query and file structures represented by Markush
formulations of the type often found in broad patent claims, one is
immediately faced with the problem that a single Markush formulation may
literally represent millions of specific compounds. When one considers that
the file size of the current large commercial structural matching systems
is a little less than seven million specific compounds, an appreciation is
gained for the difficulty in using structure matching techniques to search
Markush structures effectively. Although proposals have been made to apply
structure matching techniques to broad Markush formulations, no viable
system for searching such Markush formulations that gives a high degree of
recall and precision has yet been achieved. (Lynch, M. F.; Barnard, J. M.;
Welford, S. M. "Computer Storage and Retrieval of Generic Chemical
Structures in Patents. 1. Introduction and General Strategy" J. Chem. Inf.
Comput. Sci. 1981, 21, 148-150. Barnard, J. M.; Lynch, M. F.; Welford, S.
M. "Computer Storage and Retrieval of Generic Chemical Structures in
Patents. 2. GENSAL, a Formal Language for the Description of Generic
Chemical Structures" J. Chem. Inf. Comput. Sci. 1981, 21, 151-161. Welford,
S. M.; Lynch, M. F.; Barnard, J. M. "Computer Storage and Retrieval of
Generic Chemical Structures in Patents. 3. Chemical Grammars and their Role
in the Manipulation of Chemical Structures" J. Chem. Inf. Comput. Sci.
1981, 21, 161-168. Barnard, J. M.; Lynch, M. F.; Welford, S. M. "Computer
Storage and Retrieval of Generic Chemical Structures in Patents. 4. An
Extended Connection Table Representation (ECTR) for Generic Structures." J.
Chem. Inf. Comput. Sci. 1982, 22, 160-164. Nakayama, T.; Fujiwara, Y.
"Computer Representation of Generic Chemical Structures by an Extended
Block-Cutpoint Tree" J. Chem. Inf. Comput. Sci. 1983, 23, 80-87. Kudo, Y.;
Chihara H. "Chemical Substance Retrieval System for Searching Generic
Representations. 1. A Prototype System for the Gazetted List of Existing
Chemical Substances of Japan" J. Chem. Inf. Comput. Sci. 1983, 23,
109-117.)
SUMMARY
A typical Markush storage and retrieval process according to the present
invention comprises the steps of forming a file of structural
representations of Markush formulations in which each Markush formulation
is represented by a single specific atom multiple connectivity node (SpMCN)
representation in which the formal valence requirements of requisite atoms
are relaxed to allow for the attachment of all atoms and groups of atoms
depicted in the Markush formulation and, as a result, gives a
representation containing all implicit specific atom structures found in
the Markush formulation. The SpMCN is then converted to an associated
generic group multiple connectivity node (GnMCN) representation through the
use of a generic-group hierarchy. A query Markush formulation is similarly
converted to SpMCN and GnMCN representations. The query GnMCN
representation then is compared on a group-by-group basis with each file
GnMCN in such fashion so that a match is found when at least one implicit
generic structure representation (IGSR) of the query GnMCN is identical
with (overlaps) or is contained in (embedded in) at least one IGSR of the
file GnMCN. The query SpMCN representation then is compared on an
atom-by-atom basis with the file SpMCN representations associated with the
file GnMCN representations (answers) obtained in the previous query
GnMCN/file GnMCN matching step in such fashion so that a match is found
when at least one implicit specific atom structure representation (ISSR) of
the query SpMCN structure is identical with (overlaps) or is contained in
(embedded in) at least one ISSR of the file SpMCN. An indexing system is
used to identify IGSRs and ISSRs for the matching process and to manipulate
large or complex GnMCN and SpMCN representations.
As a further refinement, generic features of the original Markush
formulation are captured by using the generic-group hierarchy as a means of
representing generic features of the Markush formulation in both the SpMCN
and GnMCN. To ensure high recall, a roll-back feature is used to allow for
the exchange of generic-group and specific-atom representations in SpMCN
matching so that all real atom file or query structural features implied in
the generic structural features of the file or query SpMCN are matched. In
addition, specific features of the SpMCN and specifically identified parts
of generic features of the original Markush formulation, such as specific
atoms, type of bonding, ring size, etc. are associated with each generic
group of the file GnMCN as group attributes and are matched against group
attributes of the generic groups of the query GnMCN prior to SpMCN
matching.
As a further refinement, screening techniques are applied to both SpMCN
and GnMCN representations in order to eliminate a large number of
irrelevant file representations prior to the more exacting group-by-group
and atom-by-atom comparisons. In order to achieve a high level of recall, a
Boolean strategy is used in the query screen logic expression whereby
special, "diagnostic" generic-group screens are used as alternatives for
sets of specific-atom screens in order to retrieve file answers in
situations where real-atom structures of the SpMCN query structure are
implied in the generic portions of the file SpMCNs that originate from
generic features of the original Markush formulation and for which there
are no real-atom counterparts.
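The Boolean screen strategy described above can be illustrated with a minimal sketch. The screen names, the `GEN:` prefix, and the function `passes_screens` are assumptions made for illustration only; the patent does not specify a screen encoding at this point.

```python
# Illustrative sketch only: screen names and data layout are hypothetical.
# Each query term pairs a set of specific-atom screens with one "diagnostic"
# generic-group screen that is accepted as an alternative to that set.

def passes_screens(file_screens, query_terms):
    """Return True if the file record survives screening.

    file_screens: set of screen names set for one file representation.
    query_terms:  list of (specific_screens, diagnostic_screen) pairs,
                  combined as AND over terms, OR within each term.
    """
    return all(
        specific <= file_screens or diagnostic in file_screens
        for specific, diagnostic in query_terms
    )

# Query asks for a chlorine attached to carbon (hypothetical screen names).
query_terms = [({"Cl", "C-Cl"}, "GEN:halide")]

file_real_atoms = {"Cl", "C-Cl", "C-C"}    # chlorine drawn explicitly
file_generic_only = {"GEN:halide", "C-C"}  # text says only "halogen"
file_irrelevant = {"C-C", "C-O"}           # no halogen at all

kept = [
    passes_screens(file_real_atoms, query_terms),
    passes_screens(file_generic_only, query_terms),
    passes_screens(file_irrelevant, query_terms),
]
```

The second file record is kept even though it carries no real-atom chlorine screen, which is the recall-preserving behavior the diagnostic screens are meant to provide.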
DETAILED DESCRIPTION
A simple Markush formulation is set forth in structure Ia of FIG. 1. This
formulation consists of a fixed structure portion to which is attached the
variable groups R sub 1 and R sub 2. As indicated in the text portion of
the formulation, R sub 1 may be chlorine (Cl) or bromine (Br) and R sub 2
may be ethyl (CH sub 3 CH sub 2) or methyl (CH sub 3). Implicit in the
Markush formulation is the representation of four distinct individual
compound representations, Ia1-Ia4, that are, in effect, all of the possible
individual structures resulting from the combinations of fragments in the
variable groups denoted by R sub 1 and R sub 2.
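The count of four implicit compounds follows from taking every combination of the alternatives for R sub 1 and R sub 2. A minimal sketch of that enumeration is given here; the dictionary layout and the placeholder core name are assumptions, not part of the patent.

```python
from itertools import product

# Hypothetical encoding of Markush formulation Ia: a fixed core plus the
# alternatives listed in the text for each variable group.
markush_ia = {
    "core": "fixed-ring",        # placeholder name for the fixed structure
    "R1": ["Cl", "Br"],          # chlorine or bromine
    "R2": ["CH3CH2", "CH3"],     # ethyl or methyl
}

def enumerate_compounds(markush):
    """Yield one dict per individual compound implied by the formulation."""
    groups = sorted(k for k in markush if k != "core")
    for combo in product(*(markush[g] for g in groups)):
        yield dict(zip(groups, combo))

compounds = list(enumerate_compounds(markush_ia))
```

With two alternatives at each of two positions the enumeration yields 2 x 2 = 4 compounds. The same product grows very quickly for broad claims, which is why explicit enumeration is impractical for large Markush formulations.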
In representation Ia, it is noted that carbon (C) typically has a
connectivity (valence) of four, i.e., is capable of attaching or connecting
itself to four other entities or to fewer than four other entities via a
multiple bond to one or more of the entities. Specifically in the ring
system of representation Ia, each carbon is bound to a second carbon by a
multiple (double) bond, to a third carbon atom by a single bond, and to a
hydrogen atom (H) or to a variable group node (R sub 1,R sub 2) by a single
bond to give the usual carbon valence of four. As is shown in structures
Ia1-Ia4, it is common practice in the chemical arts to omit the
hydrogen atoms and to designate the alternating single and double bonds
between carbon atoms in the ring as a circle; the latter convention is felt
to represent more realistically a delocalized bonding situation in which
there are effectively one and a half bonds between all carbon atoms. Except
where noted, these common conventions will be followed throughout the
remainder of the specification and drawings.
Structure Ib of FIG. 1 is a multiple connectivity node (MCN) structure.
In it, all of the fragments belonging to the variable groups described in
the text part of the Markush formulation have been attached to their
respective nodes or points of variability (as shown in the structure part
of the Markush formulation Ia) giving rise to nodes of abnormally high
connectivity and hence the multiple connectivity node (MCN) designation.
Since Ib represents all of the specific atoms identified in the Markush
formulation, it is designated as a specific-atom multiple connectivity node
(SpMCN) representation. It should be noted that the four distinct
individual compound representations, Ia1-Ia4, are also implicit in the
SpMCN. These individual implicit representations are referred to as
implicit specific-atom structural representations (ISSRs).
By using common generic terminology, it is possible to simplify further
the specific multiple connectivity node structure (SpMCN). For example,
carbon ring structures containing only carbon atoms in the ring are often
given the generic description of carbocycles; linear chains of carbon atoms
are generically termed alkyls; and chlorine and bromine are often called
halides. Using this basic generic terminology, it is possible to transform
the SpMCN representation shown in Ib to the generic multiple connectivity
node representation (GnMCN) shown in Ic. In transforming the SpMCN to a
GnMCN, the bonding level between the generic groups is preserved. In this
particular example, only a single bond exists between the carbocycle and
the variable groups. If, however, a multiple bond exists between the
generic representations, such bonding is indicated in the GnMCN. Implicit
within the GnMCN representation are four implicit generic group structure
representations (IGSRs), Ic1-Ic4, corresponding to the four distinct
compound representations, ISSRs, implicit in the original Markush and the
SpMCN. It is critical to note that IGSRs and ISSRs are used for
illustrative purposes only. This invention does not anticipate the storage
of all ISSRs and IGSRs associated with the respective SpMCN and GnMCN.
Rather the invention is directed at the individual ISSRs and IGSRs as they
are implicitly contained within the SpMCN and GnMCN. The actual
representation and processing uses only the explicit SpMCN and GnMCN
representations. The ISSRs and IGSRs are used only as they are implicitly
found within the SpMCN and GnMCN representations.
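The SpMCN-to-GnMCN transformation can be sketched as a lookup in a generic-group hierarchy that leaves the bonding level untouched. The table `GENERIC_OF`, the tuple layout, and the fragment names below are hypothetical and stand in for whatever hierarchy an implementation would actually use.

```python
# Hypothetical one-level generic-group hierarchy: specific fragment -> group.
GENERIC_OF = {
    "C6-ring": "carbocycle",
    "Cl": "halide",
    "Br": "halide",
    "CH3": "alkyl",
    "CH3CH2": "alkyl",
}

# Assumed SpMCN layout for structure Ib: every alternative fragment is
# attached to its node, recorded as (node, fragment, bond order).
spmcn_ib = {
    "core": "C6-ring",
    "attachments": [
        ("R1", "Cl", 1),
        ("R1", "Br", 1),
        ("R2", "CH3CH2", 1),
        ("R2", "CH3", 1),
    ],
}

def to_gnmcn(spmcn):
    """Replace each specific fragment by its generic group, keeping bonds."""
    return {
        "core": GENERIC_OF[spmcn["core"]],
        "attachments": [
            (node, GENERIC_OF[fragment], bond)
            for node, fragment, bond in spmcn["attachments"]
        ],
    }

gnmcn_ic = to_gnmcn(spmcn_ib)
```

Only the explicit node-and-attachment structure is stored and transformed here; the implicit representations are never enumerated, consistent with the statement above that the ISSRs and IGSRs are not stored.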
FIGS. 2, 2', 3, 3' and 4 illustrate the use of the GnMCN and SpMCN
representations in file searching and retrieval. In FIGS. 2 and 2',
representations IIa-VIa are illustrative file Markush formulations as they
might appear in patent documents, IIb-VIb are SpMCN representations of the
corresponding Markush formulation, and IIc-VIc are GnMCN representations of
the corresponding SpMCN representations. The query Markush formulation VIIa
is also shown as a SpMCN representation (VIIb) and a GnMCN representation
(VIIc). As shown in FIGS. 3 and 3', a file search is initiated by comparing
each query IGSR (VIIc1 and VIIc2) with each file IGSR (IIc1-IIc5,
IIIc1-IIIc3, IVc1-IVc3, Vc1-Vc4, and VIc1-VIc4; identical implicit
structures are shown only once). As seen, query IGSR VIIc1 matches with
file IGSRs IIc1-IIc5 and Vc4; VIIc2 matches with Vc2-Vc3 and VIc1-VIc4. At
this point, representations III and IV have been eliminated from the search
and, as shown in FIG. 4, matching now proceeds between the query ISSRs
VIIb1-VIIb2 and the file ISSRs IIb1-IIb6, Vb1-Vb4, and VIb1-VIb4; query
ISSR VIIb1 matches only with file ISSR IIb1 and query ISSR VIIb2 matches
nothing, specifically illustrating that only one ISSR need match one file
ISSR to give an answer. To complete the search, relevant information
associated with the Markush formulation IIa such as, but not limited to, an
abstract, patent number, or patent document is retrieved for the searcher.
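The two-stage flow of this example (generic matching first, specific matching only on the survivors) can be sketched as follows. The records are toy data, not the structures of FIGS. 2 through 4, and for readability the implicit representations are written out as plain sets, which the patent states the actual system does not do.

```python
# Toy file records (hypothetical): each lists generic-level and
# specific-level implicit representations as simple tuples.
file_records = {
    "A": {
        "igsr": {("carbocycle", "halide"), ("carbocycle", "alkyl")},
        "issr": {("C6-ring", "Cl"), ("C6-ring", "CH3")},
    },
    "B": {
        "igsr": {("carbocycle", "halide")},
        "issr": {("C6-ring", "Br")},
    },
    "C": {
        "igsr": {("heterocycle", "alkyl")},
        "issr": {("pyridine-ring", "CH3")},
    },
}

query = {
    "igsr": {("carbocycle", "halide")},
    "issr": {("C6-ring", "Cl")},
}

def two_stage_search(query, records):
    """Generic-level match first, then specific-level match on survivors."""
    generic_hits = [
        name for name, rec in records.items() if query["igsr"] & rec["igsr"]
    ]
    answers = [
        name for name in generic_hits if query["issr"] & records[name]["issr"]
    ]
    return generic_hits, answers

generic_hits, answers = two_stage_search(query, file_records)
```

Record C is eliminated at the generic level and never reaches the more expensive specific comparison; record B passes the generic level but fails the specific one.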
FIG. 5 illustrates the two types of matching criteria that a searcher may
use in carrying out a search. Representation VIII is a single compound
representation in which the ISSR is identical with the actual structure.
This structure matches exactly with the file representation X which is also
a single compound representation. The exact matching of all characteristics
of the query representation with those of the file representation is termed
"overlap" or full-structure search. Exact matching may be relaxed such that
the query representation need only be contained within the file
representation. Thus, although query representation VIII does not exactly
match or "overlap" the single file representation XI, it is contained
within representation XI. Such containment of the query representation
within the file representation is termed "embedment" or substructure
search. Systems for both full-structure and substructure search are
commercially available, e.g., CAS ONLINE: The Registry File, Chemical
Abstracts Service, Columbus, Ohio.
Atom-by-atom searching involves the comparison of a query structure with
a file structure using a path-tracing technique. Typically the path-tracing
technique involves selecting a starting atom (node) of the query structure
(usually a noncarbon atom) and comparing it with the first atom of the file
structure. If the atoms do not match, the file structure is advanced to the
next atom (node) until a match with the starting query node is obtained. If
a match is obtained, the query proceeds to the next connected atom which is
compared with the next connected atom of the file structure. If these next
atoms do not match, the file structure is backtracked to the original
matching atom and another connected atom is selected for match. This
advancing/comparing/backtracking routine is continued until all atoms match
or all atom sequences of the query are exhausted. Overlap requires that all
atoms of the query match with all atoms of the file structure while
embedment requires that all of the atoms of the query be contained within
the file structure. A description of atom-by-atom matching is given in
Lynch, M. F.; Harrison, J. M.; Town, W. G.; Ash, J. E. Computer Handling of
Chemical Information, MacDonald, London and American Elsevier Inc., New
York, 1971 at pp. 73-74, all of which is herein incorporated by reference.
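The advancing/comparing/backtracking routine can be sketched as a small recursive matcher. This is a simplified illustration under stated assumptions (hydrogens omitted, bond orders ignored, structures given as an atom dictionary plus a set of bonds); it is not the routine of the cited reference.

```python
def atom_by_atom_match(query, target, mode="embedment"):
    """Backtracking match of a query structure against a file structure.

    Each structure is (atoms, bonds): atoms maps an id to an element symbol,
    bonds is a set of frozenset({id1, id2}). Bond orders are ignored here.
    """
    q_atoms, q_bonds = query
    t_atoms, t_bonds = target
    if mode == "overlap" and (
        len(q_atoms) != len(t_atoms) or len(q_bonds) != len(t_bonds)
    ):
        return False
    # Start from noncarbon atoms, as the text suggests is usual.
    order = sorted(q_atoms, key=lambda a: q_atoms[a] == "C")

    def extend(mapping):
        if len(mapping) == len(order):
            return True                      # every query atom is placed
        q = order[len(mapping)]
        for t in t_atoms:
            if t in mapping.values() or t_atoms[t] != q_atoms[q]:
                continue
            # Each bond from q to an already placed query atom must exist
            # between the corresponding file atoms.
            bonds_ok = all(
                frozenset((t, mapping[p])) in t_bonds
                for p in mapping
                if frozenset((q, p)) in q_bonds
            )
            if bonds_ok:
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]               # backtrack and try another atom
        return False

    return extend({})

def bond(a, b):
    return frozenset((a, b))

chloroethane = ({1: "C", 2: "C", 3: "Cl"}, {bond(1, 2), bond(2, 3)})
chloropropane = (
    {1: "C", 2: "C", 3: "C", 4: "Cl"},
    {bond(1, 2), bond(2, 3), bond(3, 4)},
)

embedded = atom_by_atom_match(chloroethane, chloropropane, "embedment")
overlapped = atom_by_atom_match(chloroethane, chloropropane, "overlap")
identical = atom_by_atom_match(chloroethane, chloroethane, "overlap")
```

In this sketch overlap is treated as embedment plus equal atom and bond counts, which forces a one-to-one correspondence between the two structures.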
It is an object of this invention to extend both the overlap and
embedment matching concepts to Markush searching. Thus if the query SpMCN
IXa search is limited to overlap only, the query ISSR IXa1 will match only
with file ISSR XIIa2. If the matching criterion is relaxed to embedment,
ISSR XIIIa1 is also a valid match. It is not necessary to limit the
searching of a Markush query to a Markush file, e.g., the ISSRs of the
SpMCN representation also can be compared with both specific compound
representations such as X and XI and the ISSRs of the SpMCN representations
XIIa and XIIIa. At the overlap level of search, query IXa retrieves file
representations X and XIIa; at the embedment level of search, file
representations X, XI, XIIa, and XIIIa are retrieved. Single specific
compound queries also can be searched against the Markush file, e.g., VIII
matches with XIIa1 (overlap) and with XIIIa1 (embedment). Although not
illustrated, embedment and overlap criteria are also used at the generic
level of searching. Thus an implicit generic query representation,
alkyl-halide, overlaps an implicit generic file representation,
alkyl-halide, and is embedded in an implicit generic file representation,
carbocycle-alkyl-halide. Finally it is noted that the overlap criterion can
be applied to the entire SpMCN representation itself. Such a match
condition requires all structural elements of the file SpMCN be identical
to all structural elements of the query SpMCN, i.e., all ISSRs of the file
and query SpMCNs must be identical. Requiring the entire SpMCN
representation IXa to match at the overlap level permits only the retrieval
of file SpMCNs that are identical to it, i.e., contain both IXa1 and IXa2
but only those two implicit representations. For an entire SpMCN
representation to match at the embedment level, the file SpMCN
representation must contain all ISSRs of the query representation.
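The two match conditions for an entire SpMCN representation can be
sketched as set comparisons. This is a simplification that assumes each
ISSR has already been reduced to a comparable canonical label. The labels
and function names below are hypothetical and are not drawn from the
figures.

```python
from typing import Iterable


def spmcn_overlap(query_issrs: Iterable[str], file_issrs: Iterable[str]) -> bool:
    """Overlap: the file SpMCN holds exactly the ISSRs of the query SpMCN."""
    return set(query_issrs) == set(file_issrs)


def spmcn_embedment(query_issrs: Iterable[str], file_issrs: Iterable[str]) -> bool:
    """Embedment: the file SpMCN holds at least every ISSR of the query."""
    return set(query_issrs) <= set(file_issrs)


# Hypothetical canonical ISSR labels, for illustration only.
query = {"issr-A", "issr-B"}
same_file = {"issr-B", "issr-A"}
larger_file = {"issr-A", "issr-B", "issr-C"}

print(spmcn_overlap(query, same_file))      # True
print(spmcn_overlap(query, larger_file))    # False
print(spmcn_embedment(query, larger_file))  # True
```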
In order to convert SpMCN representations to GnMCN representations, it is
highly desirable to have a classification scheme that uses a small number
of controlled-vocabulary hierarchical terms that permit classification of
all groups of atoms likely to be encountered in a specific substance or a
Markush formulation. FIG. 6 illustrates such a classification scheme. The
overall structure of the classification scheme consists of breaking each
less-specific group into two mutually exclusive, more specific groups. The
general group "G" is used to handle groups of atoms that cannot be easily
associated with a more specific group classification, e.g., an
electron-withdrawing group, a group containing nitrogen, etc. The G group
is classified further into two mutually exclusive groups: any cyclic group
(Cy) or any acyclic group (Ay). The cyclic group (Cy) is broken down into
any carbocycle group (Cb) or any heterocycle group (Hc). The carbocycle
group (Cb) characterizes any ring system containing only carbon atoms and
any attached hydrogen atoms. The Cb group may be attached to any other
group, including itself, or it may stand alone. The heterocycle group (Hc)
characterizes any ring system containing one or more hetero (noncarbon)
atoms and any attached hydrogen atoms. Similar to Cb, Hc may be attached to
any group, including Hc, or it may stand alone. A fused ring system, i.e.,
two or more rings joined to one another at two or more atoms on each ring,
is considered a single group, while two rings joined to each other by an
acyclic bond are considered two groups. Thus a naphthalene ring system is
designated as Cb while a biphenyl system would be characterized as Cb--Cb.
A quinoline ring system, which consists of a carbocycle fused to a
heterocycle, is considered as a single heterocycle group, Hc.
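The cyclic side of the hierarchy described so far can be sketched as a
parent table together with a rule for labeling a ring system. The table
name, function names, and the input format (a list of the element symbols
of the ring atoms, hydrogens excluded) are assumptions for illustration
only. A fused ring system is passed in as one list, and rings joined by an
acyclic bond are passed in separately.

```python
from typing import Dict, Iterable, List

# Portion of the hierarchy covered so far: each term maps to its
# less-specific parent term.
PARENT: Dict[str, str] = {"Cy": "G", "Ay": "G", "Cb": "Cy", "Hc": "Cy"}


def generalizations(term: str) -> List[str]:
    """Return the term followed by each less-specific term up to G."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def classify_ring_system(ring_atoms: Iterable[str]) -> str:
    """Cb if every ring atom is carbon, Hc if any ring atom is not."""
    atoms = list(ring_atoms)
    if not atoms:
        raise ValueError("a ring system needs at least one atom")
    return "Cb" if all(el == "C" for el in atoms) else "Hc"


naphthalene = ["C"] * 10            # one fused system, all carbon
quinoline = ["C"] * 9 + ["N"]       # one fused system with a nitrogen
biphenyl = [["C"] * 6, ["C"] * 6]   # two rings joined by an acyclic bond

print(classify_ring_system(naphthalene))                       # Cb
print(classify_ring_system(quinoline))                         # Hc
print("--".join(classify_ring_system(r) for r in biphenyl))    # Cb--Cb
print(generalizations("Hc"))                                   # ['Hc', 'Cy', 'G']
```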
Moving to the acyclic side of the hierarchy, the acyclic group (Ay) is
broken down into any acyclic carbon (chain) group (Ch) or any acyclic
noncarbon (functional) group (Fg). The acyclic noncarbon group (Fg) is
further broken down into any acyclic noncarbon connecting group (Fc) or any
acyclic noncarbon terminal group (Ft). The terminal group (Ft) is
characterized as a single atom that is neither carbon nor hydrogen but may
be attached to one or more hydrogens. The Ft group may stand alone
(unattached to any other group), e.g., NH sub 3, H sub 2 O, Cu, or it may be
attached to one and only one other group. That other group may be any
group, including Ft, except that the Ft group cannot be bound to an
alkyl group (Ak) by a multiple bond since, by definition, an alkyl group
bound to a Ft group by a multiple bond is a Cg group. Thus C sub 6 H sub 5
--NH sub 2 transforms to Cb--Ft while an aldehyde such as CH sub 3 --CH
double bond O transforms to Cg double bond Ft and not Ak double bond Ft.
See infra Cg and Ak. The acyclic noncarbon connecting group (Fc) is defined
as a single atom that is neither carbon nor hydrogen but may be attached to
one or more hydrogens and must be attached to two or more other groups
including itself, e.g., phenyl-O-phenyl is expressed as Cb--Fc--Cb. By
definition, Fc may not stand by itself or be attached to only one other
group.
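The distinction between the terminal group (Ft) and the connecting group
(Fc) depends only on how many other groups the acyclic noncarbon atom is
attached to, which can be sketched as follows. The function name and its
arguments are hypothetical. Attached hydrogens are not counted as groups.

```python
def classify_heteroatom(element: str, attached_groups: int) -> str:
    """Label an acyclic atom that is neither carbon nor hydrogen.

    Ft: stands alone or is attached to exactly one other group.
    Fc: is attached to two or more other groups.
    """
    if element in ("C", "H"):
        raise ValueError("Ft and Fc apply only to noncarbon, nonhydrogen atoms")
    if attached_groups < 0:
        raise ValueError("attached_groups cannot be negative")
    return "Ft" if attached_groups <= 1 else "Fc"


print(classify_heteroatom("O", 0))  # water, standing alone: Ft
print(classify_heteroatom("N", 1))  # nitrogen of C6H5--NH2: Ft
print(classify_heteroatom("O", 2))  # oxygen of phenyl-O-phenyl: Fc
```

Note that an Ft atom bound to acyclic carbon by a multiple bond is still
labeled Ft in this sketch. Under the definitions above, it is the carbon
partner whose label changes from Ak to Cg.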
The acyclic carbon group (Ch) is further broken down into an acyclic
carbon group (Cg) attached to an acyclic noncarbon terminal group (Ft) by a
multiple bond, or any other acyclic carbon group (Ak) not defined as Cg. By
definition, the Cg group cannot stand alone. It must be attached to at
least one Ft group by a multiple bond and it may also be attached to other
groups, except Ak or Cg. The Ak group consists of a group of acyclic carbon
atoms and any attached hydrogen atoms that may stand alone or may be
attached to any group, except Cg or Ak. When a Cg is attached to an Ak or
another Cg or when an Ak is attached to a Cg or another Ak, the two groups
merge into the appropriate single group, e.g., Ft double bond Cg--Cg double
bond Ft becomes Ft double bond Cg double bond Ft, Ft double bond Cg--Ak
becomes Ft double bond Cg, and Ak--Ak becomes Ak. CH sub 3 --CH double bond
O becomes Cg double bond Ft; CH sub 3 --OH becomes Ak--Ft. The compound CH
sub 3 --CH sub 3 is not represented Ak--Ak but rather as simply Ak. The