A New Effective Algorithm for the
Unambiguous Identification of the Stereochemical Characteristics of Compounds
During Their Registration in Databases
Introduction
Chemical structures represent
separate entities, which exist in nature or can be synthesized, and are widely
studied and described in publications. A variety of methods have been devised to
impart information on the structural representation of compounds. These methods
include the use of molecular formulae, various line notations [1], trade/trivial
names, registry numbers, structure diagrams, and systematic nomenclature [2].
For the practising chemist, the most
acceptable visual representation of a chemical compound is a two-dimensional
plan of the three-dimensional structure. It usually gives an adequate view
of the structure and is easily understood, drawn and copied. Used in a reaction
sequence, it clearly shows the course of the reaction, displaying transient
intermediates if required. If it is necessary to show the third dimension, the
use of special bond configurations representing lines going above or below the
plane can accomplish this. This diagrammatic representation is usually referred
to as a structure diagram or structure graph. The structure diagram conventions
are established as an international standard and so far are the only
unique method of communicating information on a chemical compound. The
diagrams, for computer processing, are transformed (manually or automatically)
into a linear string of characters or into two-dimensional matrices listing all
the atoms (nodes) and their mutual interconnections (bonds, edges). These
concise representations are referred to as connection tables (CT). Although
important only for computer storage and processing, connection tables have, in
recent times, been the main means of communication of the information on
chemical compounds [3]. Uniquely derived from the standardised structure
diagrams, CTs have become the most complete structural representation and, at
the same time, the least recognisable to the human eye and the most
computer-friendly.
One of the main purposes of the
various representations of chemical structures is their unambiguous
identification and unique registration for storage in databases of chemical
substances as well as for their effective and fast retrieval from the databases.
The issue of reliable identification becomes ever more crucial as database sizes
grow steadily. The two biggest and most important databases, CAS and
Beilstein, contain an enormous number of structures (CAS over 18.1 million organic
and inorganic substances as of June 2001 [4], and the Beilstein File over 8 million
organic compounds and 1.4 million inorganic and organometallic compounds as of
June 2001 [5]) together with much larger numbers of accompanying records
describing chemical, physical and, more recently, biological characteristics of
the corresponding compounds (for example, the Beilstein File contains over 35
million associated chemical property and bioactivity records as of June 2001).
The data stored in the databases
come from two main sources: scientific journals and
published patents. For both sources - if they contain structural diagrams
representing the compounds - some sort of "offline" graphical
structure editing software package is usually used. The structural diagram of the new
compound selected for registration in a database is drawn using a structure
editor (ISIS/Draw, Beilstein SE, ChemDraw, ACD/ChemSketch, etc.). The resulting
initial representation of the structural diagram as a connection table is sent
as input to a dedicated normalisation program and is afterwards stored, again as a
connection table, as a record in the database. The connection table -
as referred to in this paper - is understood as a labelled graph which
is stored either as an atom adjacency matrix of bond order labels (single,
double, triple, single up, single down, etc.) or as atom adjacency lists
which, for each atom separately, contain bond order labels only to directly
connected atoms (neighbours). In both representations each atom is specified by
an atom property vector containing such information as atomic number, atomic
mass (if non-standard), charge, valency (if non-standard), etc. Such connection
tables are then canonicalized, i.e., a canonical numbering of atoms is generated
by special algorithms [6,7]. The only purpose of the canonicalization is to
generate - independently of the structure coding method used - a numeric or
alphanumeric identifier (strictly numeric for the Beilstein File, for example)
which uniquely and unambiguously labels the structural diagram of the input
compound. Such an identifier - specific to a particular database - is called
a registry number (CRN for CAS or BRN for Beilstein).
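
As an illustration of the connection table described above, the following
Python sketch stores a labelled graph as per-atom adjacency lists, derives the
equivalent adjacency matrix, and computes a numbering-independent key. All
names (Atom, ConnectionTable, build, canonical_key) are hypothetical, hydrogens
are left implicit, and the brute-force search over all numberings is only a
stand-in chosen for clarity; it is not the canonicalization algorithm of
refs [6,7] and is practical only for very small graphs.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional


@dataclass(frozen=True)
class Atom:
    """Atom property vector; the field names are illustrative assumptions."""
    atomic_number: int
    mass: Optional[int] = None     # set only if non-standard
    charge: int = 0
    valency: Optional[int] = None  # set only if non-standard


class ConnectionTable:
    """Labelled graph stored as per-atom adjacency lists."""

    def __init__(self):
        self.atoms = []       # atom property vectors, indexed by atom number
        self.neighbours = []  # neighbours[i] maps neighbour index -> bond label

    def add_atom(self, atom):
        self.atoms.append(atom)
        self.neighbours.append({})
        return len(self.atoms) - 1

    def add_bond(self, a, b, label):
        # Stored symmetrically. A real table would also record which end of a
        # "single up" / "single down" bond is the narrow end of the wedge.
        self.neighbours[a][b] = label
        self.neighbours[b][a] = label

    def to_matrix(self):
        """Equivalent adjacency matrix of bond labels (None = no bond)."""
        n = len(self.atoms)
        return [[self.neighbours[i].get(j) for j in range(n)] for i in range(n)]


def canonical_key(ct):
    """Smallest serialisation over all atom numberings (brute force, O(n!))."""
    n = len(ct.atoms)
    best = None
    for perm in permutations(range(n)):  # perm[new_index] = old_index
        atoms = tuple(
            (a.atomic_number, a.mass or 0, a.charge, a.valency or 0)
            for a in (ct.atoms[old] for old in perm)
        )
        bonds = tuple(
            ct.neighbours[perm[i]].get(perm[j], "")
            for i in range(n) for j in range(i + 1, n)
        )
        key = (atoms, bonds)
        if best is None or key < best:
            best = key
    return best


def build(atomic_numbers, bonds):
    """Helper: build a table from atomic numbers and (a, b, label) triples."""
    ct = ConnectionTable()
    for z in atomic_numbers:
        ct.add_atom(Atom(z))
    for a, b, label in bonds:
        ct.add_bond(a, b, label)
    return ct


# Ethanol (C-C-O) entered with two different atom numberings, and its
# constitutional isomer dimethyl ether (C-O-C).
ethanol_a = build([6, 6, 8], [(0, 1, "single"), (1, 2, "single")])
ethanol_b = build([8, 6, 6], [(0, 1, "single"), (1, 2, "single")])
ether = build([6, 8, 6], [(0, 1, "single"), (1, 2, "single")])

print(canonical_key(ethanol_a) == canonical_key(ethanol_b))  # True
print(canonical_key(ethanol_a) == canonical_key(ether))      # False
```

The two ethanol inputs yield the same key although their atoms are numbered
differently, which is the property a registry identifier relies on.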
An important property of a chemical
compound is its stereoisomerism, i.e., the differentiation of the compounds
exclusively due to the arrangement of their atoms in space. Different
stereoisomers of the same compound should have different registry numbers.
This is not always the case, since the translation of stereoisomers
into unique registry numbers is a very complex task.
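
Part of the difficulty can be seen in a small example. The Python sketch below
is a generic signed-volume parity test, not the algorithm proposed in this
paper. It assumes a single stereocentre at the origin with four different
neighbouring elements, ranks the neighbours simply by atomic number (a
simplification of real priority rules), and maps "single up" and "single down"
labels to a z offset of +1 and -1. The names Z_OFFSET, det3 and parity, and the
four drawings of bromochlorofluoromethane, are hypothetical.

```python
# Each drawing lists the four neighbours of one stereocentre at the origin as
# (atomic number, x, y, label of the bond to the centre).
Z_OFFSET = {"single": 0.0, "single_up": 1.0, "single_down": -1.0}


def det3(a, b, c):
    """Determinant of the 3x3 matrix with rows a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def parity(drawing):
    """Return +1 or -1 for the handedness of the drawn centre, 0 if flat."""
    # Rank neighbours by atomic number, highest first (assumption: all four
    # neighbours are different elements, so no ties occur).
    ranked = sorted(drawing, key=lambda n: n[0], reverse=True)
    pts = [(x, y, Z_OFFSET[label]) for _, x, y, label in ranked]
    last = pts[3]
    rows = [tuple(p[i] - last[i] for i in range(3)) for p in pts[:3]]
    volume = det3(*rows)
    return (volume > 0) - (volume < 0)


# A: Br on a wedge pointing up; Cl, F and H in the plane.
drawing_a = [(35, 1, 0, "single_up"), (17, 0, 1, "single"),
             (9, -1, 0, "single"), (1, 0, -1, "single")]
# B: mirror image of A (same layout, Br now pointing down).
drawing_b = [(35, 1, 0, "single_down"), (17, 0, 1, "single"),
             (9, -1, 0, "single"), (1, 0, -1, "single")]
# C: drawing A rotated by 90 degrees in the plane of the paper.
drawing_c = [(35, 0, 1, "single_up"), (17, -1, 0, "single"),
             (9, 0, -1, "single"), (1, 1, 0, "single")]
# D: Cl and F moved to different positions and Br pointing down.
drawing_d = [(35, 1, 0, "single_down"), (17, -1, 0, "single"),
             (9, 0, 1, "single"), (1, 0, -1, "single")]

print(parity(drawing_a))  # 1
print(parity(drawing_b))  # -1
print(parity(drawing_c))  # 1
print(parity(drawing_d))  # 1
```

Drawings A, C and D depict the same stereoisomer even though D uses a "single
down" label where A uses "single up", while A and B differ only in that one
label and are mirror images. Comparing bond labels entry by entry is therefore
not sufficient; the labels have to be interpreted together with the layout of
the diagram.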
This paper describes a new algorithm
for the analysis and processing of chemical structures represented as structural
graphs in scientific journals and/or patent publications. Correct and unique
interpretation of the steric information contained in the graphs of such structures
- by computer programs - provides for reliable registration of different
stereoisomers of the same compound in databases of chemical compounds.