Representation of the spatial
configuration of a chemical structure
In 1859 August Kekulé found that the
carbon atom is quadrivalent and formulated the basic rules of structural
description. In 1874 van't Hoff and Le Bel introduced, independently of each
other, the theory of spatial bond arrangement for tetrahedral carbon atoms [8].
Their theory was soon confirmed by data from chemical and physical experiments
and the concept of stereochemistry was born. Since then the need to represent 3D
models of compounds using 2D structural diagram conventions has arisen.
The 2D graphical representation of
stereoisomers is not always unambiguous. Even in a representation as strictly
formalised as that of the Beilstein File there are compounds with
stereochemically unspecified centres, even though all the chiral centres were
determined in the original source literature. A statistical analysis of the
graphical representation of structures in the Beilstein File, based on a sample
of over 3.1 million structures [9], revealed considerable difficulties in
avoiding ambiguity.
The problem was recognised only
recently: in 1996 IUPAC published recommendations for the graphic
representation of three-dimensional structures [10]. However, these
recommendations have received very little attention from organic chemists and
database companies and have contributed very little to overcoming the confusion
and ambiguity. An interesting recent suggestion by Lin et al. [11], proposing a
single wedge convention for chirality representation, has significant advantages
over the currently used two wedge (solid and broken) representation. It seems to
offer a solution and is deservedly gaining more and more recognition [12].
It should be noted, however, that
stereochemistry is not limited to chirality, as assumed so far. Another
important type of spatial isomerism considered in chemistry is the so-called
geometrical isomerism resulting from blocked rotation at multiple bonds. A
simple example of a compound occurring in two geometrically isomeric forms is
2-butene. Other examples of this type are cumulated allenes as well as
substituted ring systems not exceeding 5 atoms in size.
Stereochemistry and its perception
are a challenge for any registration software. The various chiral and geometric
isomers should be unambiguously registered under unique registry numbers. The
issue is by no means trivial. The stereocentres are selected depending on the
type of stereochemistry present. In practice these are either carbon atoms
with four different ligands or trivalent nitrogen atoms with three different
ligands. In the case of geometrical isomerism in compounds of the allene type,
the central atom lying on the double bond axis takes over the role of the
stereocentre. It became obvious that registration of such stereoisomers cannot
be handled without systematic nomenclature based on the Cahn-Ingold-Prelog (CIP)
system [13]. As the latter associates a pre-defined label (R, S, r, s, 0, or E,
Z, e, z) with an atom or bond, it can be used to encode stereochemical
characteristics as graph node or graph edge attributes. These attributes can
then be encoded into the connection table of the compound, differentiating it
from other compounds.
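As a minimal sketch of this idea, the following Python fragment stores a
stereo label on each atom (graph node) and bond (graph edge) of a toy
connection table. The class names Atom, Bond and ConnectionTable, the helper
chfclbr, and the reading of the label "0" as "no descriptor" are assumptions
made for illustration only; they do not describe the record layout of any
actual registry system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element: str
    stereo: str = "0"  # assumed: "R", "S", "r", "s", or "0" for no descriptor


@dataclass(frozen=True)
class Bond:
    begin: int
    end: int
    order: int = 1
    stereo: str = "0"  # assumed: "E", "Z", "e", "z", or "0" for no descriptor


@dataclass(frozen=True)
class ConnectionTable:
    atoms: tuple
    bonds: tuple

    def without_stereo(self):
        """Return a copy with every stereo attribute reset to "0"."""
        atoms = tuple(Atom(a.element) for a in self.atoms)
        bonds = tuple(Bond(b.begin, b.end, b.order) for b in self.bonds)
        return ConnectionTable(atoms, bonds)


def chfclbr(label):
    """Toy table for bromochlorofluoromethane with a given centre label."""
    atoms = (Atom("C", label), Atom("H"), Atom("F"), Atom("Cl"), Atom("Br"))
    bonds = tuple(Bond(0, i) for i in range(1, 5))
    return ConnectionTable(atoms, bonds)


r_form = chfclbr("R")
s_form = chfclbr("S")

# The two tables differ only in the node attribute of the central carbon.
print(r_form == s_form)                                    # False
print(r_form.without_stereo() == s_form.without_stereo())  # True
```

Without the attribute the two enantiomers share one connection table; with it
they become distinguishable records.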
The attribute must first be assigned manually
or calculated algorithmically, directly from a graphical representation of the
compound. The calculation depends strongly on how the stereochemistry occurring
in a compound is represented in 2D. For bonds above or below the plane, solid
and broken wedges are used. This wedge representation is unambiguous in most
cases, unless the wedge bond must be located between two directly neighbouring
chiral centres. For geometrical isomers no particular descriptor is provided
for the ligands on either side of the double bond. This is definitely a
disadvantage for any computer-based registry system: it is not clear whether
the arrangement of ligands on both sides of the double bond was consciously
chosen or is simply accidental. It is particularly unclear whether the cited
source article describes a racemate or simply ignores the stereochemistry of
the compound. To avoid such situations, structure drawing software usually
provides additional conventions, such as wavy bonds that visualise the
undefined steric character of a bond. Some modern drawing software packages
(Beilstein SE, ChemDraw) allow steric double bonds to be entered and
represented in the resulting connection table (CT).
Stereodescriptors for structures to
be registered in a database are determined using the CIP ordering sequence
convention [13]. For the ligands a, b, c, d at each chiral centre the CIP
ordering is used to determine a prioritised ranking. The lowest priority ligand
is then selected as the reference ligand, and the centre is viewed along the
line connecting it to this ligand, with the ligand pointing away from the
observer. If the path through the remaining three ligands in order of
decreasing priority runs clockwise, the chiral centre is described as R
(rectus), otherwise as S (sinister).
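When 3D coordinates are available, the clockwise test can be sketched as a
signed-volume calculation. The fragment below is one common formulation, not
necessarily the one used by any particular registry system. It assumes the
four ligand positions are already sorted by descending CIP priority; the
function name chirality_label and the idealised coordinates are hypothetical.

```python
def sub(u, v):
    return tuple(ui - vi for ui, vi in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def chirality_label(a, b, c, d, eps=1e-9):
    """Return "R", "S" or "0" for ligand positions a > b > c > d (CIP order).

    The sign of the triple product (a-d) . ((b-d) x (c-d)) tells whether
    a -> b -> c runs clockwise when d points away from the observer.
    """
    volume = dot(sub(a, d), cross(sub(b, d), sub(c, d)))
    if abs(volume) < eps:
        return "0"  # planar arrangement: no handedness can be assigned
    return "R" if volume < 0 else "S"


# Idealised geometry: d points down the -z axis, the observer sits at +z,
# and a (top) -> b (lower right) -> c (lower left) runs clockwise.
a = (0.0, 1.0, 0.33)
b = (0.866, -0.5, 0.33)
c = (-0.866, -0.5, 0.33)
d = (0.0, 0.0, -1.0)

print(chirality_label(a, b, c, d))  # R
print(chirality_label(b, a, c, d))  # S (two ligands exchanged)
```

Exchanging any two ligands flips the sign of the volume and therefore the
label, which is the property that the descriptor is meant to capture.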
The rules as originally suggested by
Cahn, Ingold, and Prelog and their extensions and revisions [14,15,16] play a
double role. First, they make it possible to determine whether the atom under
consideration is really asymmetric, and second, they rank the ligands connected
to the asymmetric atom, producing a pre-defined sequence starting at the least
important ligand and ending at the most important one. This ranking and the
resulting sequence should be unique and always the same. It is a well-known fact
that this is not always the case. The CIP rules, even after multiple revisions
coming from both chemists and mathematicians, do not always order ligands
unambiguously. This problem is
particularly noticeable for condensed, aromatic ring systems with a number of
atoms different from 6 or other complex and heavily substituted rings with
multiple chirality centres.
Algorithms implemented in the
software used for registration of chemical structures in databases generate a
canonical numbering of atoms independently of the CIP-based ordering of ligands.
Each atom is uniquely indexed within the structure, and the index makes priority
ordering of ligands unnecessary, since the ligands a, b, c, d can be ordered
according to their indexes (which are unique after canonicalization). However,
canonicalization cannot answer the question of whether the ligands of a given
atom are symmetrical. Permuting the indexes of the ligand atoms and registering
the structure for the various permutations can usually answer this question. If
the (alpha)numeric registration string stays constant for all permutations, one
can assume that the ligands are symmetrical.
Practically all effective
canonicalization methods are based on the Morgan algorithm [17]. Internal
symmetries within the compound being canonicalized lead to a combinatorial
explosion. To keep this manageable in practice, the registration algorithm has
to choose some sort of "best" approximation. As a result, compounds can be
mis-registered in a database, i.e., the same structure may occur in the database
two or more times under different registry numbers [18].
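The core of the Morgan approach, the iterative "extended connectivity" step,
can be sketched in a few lines. This is a simplified illustration only: it
omits the subsequent atom numbering and all tie-breaking, and the function name
morgan_values and the adjacency-dictionary input are assumptions made for this
example.

```python
def morgan_values(adjacency):
    """Iterate extended connectivity values until the class count stops growing.

    adjacency maps each atom index to a list of its neighbour indices.
    Returns the values from the last iteration that increased the number
    of distinct classes.
    """
    # Start from the plain connectivity (number of neighbours).
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        # Each atom receives the sum of its neighbours' current values.
        new_values = {atom: sum(values[n] for n in nbrs)
                      for atom, nbrs in adjacency.items()}
        new_n_classes = len(set(new_values.values()))
        if new_n_classes <= n_classes:
            return values
        values, n_classes = new_values, new_n_classes


# Carbon skeleton of n-pentane: C0-C1-C2-C3-C4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(morgan_values(pentane))  # {0: 2, 1: 3, 2: 4, 3: 3, 4: 2}
```

The symmetric pairs (C0, C4) and (C1, C3) keep identical values. A complete
numbering has to break such ties by trying alternatives, and in highly
symmetric structures the number of alternatives grows quickly.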
The use of the CIP rules for ordering
ligands in the course of registration is necessary only for structures that
require one or more explicit stereodescriptors in order to be identified. In
such cases the CIP-based descriptors have to be translated into parity
attributes, which are then used directly in the canonicalization algorithm.
The reverse translation of
parity into CIP-based descriptors has no significance for registration itself,
but can be useful for other CT-based applications, for example
algorithmically generated nomenclature. It was applied in the AutoNom [19]
software package, which generates systematic nomenclature directly from the
connection table of a compound.
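The two translations described above can be sketched with a permutation-sign
calculation. The convention used here is an assumption chosen for
illustration: parity +1 means that listing the ligands by ascending canonical
index gives the same handedness as an R centre listed by descending CIP
priority. Actual registration systems define their own parity conventions, and
the function names are hypothetical.

```python
def permutation_sign(sequence):
    """Return +1 for an even and -1 for an odd permutation (inversion count)."""
    inversions = sum(1
                     for i in range(len(sequence))
                     for j in range(i + 1, len(sequence))
                     if sequence[i] > sequence[j])
    return -1 if inversions % 2 else 1


def cip_to_parity(label, indices_by_cip_priority):
    """Translate "R"/"S" into a parity relative to canonical index order.

    indices_by_cip_priority holds the canonical indices of the four ligands,
    listed in descending CIP priority.
    """
    cip_sign = {"R": 1, "S": -1}[label]
    return cip_sign * permutation_sign(indices_by_cip_priority)


def parity_to_cip(parity, indices_by_cip_priority):
    """Reverse translation, e.g. for generating nomenclature from a CT."""
    sign = parity * permutation_sign(indices_by_cip_priority)
    return "R" if sign == 1 else "S"


# CIP order coincides with index order: the label maps directly.
print(cip_to_parity("R", [2, 5, 7, 9]))   # 1
# One exchange between the two orderings flips the parity.
print(cip_to_parity("R", [5, 2, 7, 9]))   # -1
print(parity_to_cip(-1, [5, 2, 7, 9]))    # R
```

Because each exchange of two ligands inverts the handedness, only the parity
of the permutation between the two orderings matters, not the permutation
itself.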