The IUPAC International Chemical Identifier
The chemical structure of a compound is its true
“identifier”, but structure representations are neither unique
nor convenient for computers. The International Union of Pure and
Applied Chemistry (IUPAC) has therefore developed a method for generating a
freely available, non-proprietary identifier for chemical substances
that can be used in printed and electronic data sources. Such an
identifier makes it easier to link diverse data compilations and to
identify chemical substances unambiguously.
The project to develop the IUPAC International Chemical Identifier
(InChI)1-7 was proposed in 2000 and approved in 2002.8 Version 1 of the
InChI system was launched in 2005. IUPAC took on the project because
the increasing complexity of molecular structures was making
conventional naming procedures inconvenient, and because there
was no suitable, openly available electronic format for exchanging
chemical structure information over the Internet.
In a digital world, structures are not ideal “names”: there
are too many ways to draw them, they are non-linear, and they are
inconvenient for computers to handle. What was needed was
a unique, linear identifier, or “digital signature”. The
InChI algorithm converts a chemical structure (in the form of its
connection table) into a unique, alphanumeric string of characters. The
program can also convert an InChI label back into a molecular
structure. Two requirements must be fulfilled in doing this: different
compounds must have different identifiers, with all the information
needed to distinguish the structures; and any one compound must have
only one identifier, including only the necessary information to
identify that compound.
InChI is free, open-source software, sponsored by IUPAC, implemented by
the US National Institute of Standards and Technology (NIST), and
distributed under the terms of the GNU Lesser General Public License.
Any organization can use it, in either public or private databases. The
source code and associated software, documentation, and licensing
conditions can be downloaded free of charge from the IUPAC website.9 InChI is
written in C and can be compiled on most systems. It can be packaged
into a DLL for Windows or a library for UNIX.10
Creation of an InChI
An InChI identifier is created from an input connection table (in
molfile,11 SDfile,11 or CML7 format) in three steps: normalization,
canonicalization, and serialization. In the normalization step,
electron density is ignored; salts and metal atoms in organometallic
compounds are disconnected; and mobile hydrogens, variable
protonation, and charge are normalized. This step is needed, for example,
to remove variations in the ways of representing a nitro group.
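The nitro group case can be illustrated with a small sketch. It is not the InChI normalization code; it is a deliberately simplified stand-in that assumes a hypothetical input form (lists of atoms and bonds with the same atom numbering in both drawings) and simply discards bond orders and per-atom charges, keeping only elements, hydrogen counts, connectivity, and net charge. The function name `skeleton` and the data layout are assumptions made for illustration.

```python
def skeleton(atoms, bonds):
    """Reduce a drawing to elements, hydrogen counts, connectivity, and net charge.

    atoms: list of (element, attached_hydrogens, formal_charge)
    bonds: list of (atom_index_a, atom_index_b, bond_order)
    """
    elements_and_h = tuple((el, h) for el, h, _charge in atoms)
    # Bond orders are dropped: only "which atoms are connected" is kept.
    connectivity = frozenset(frozenset((a, b)) for a, b, _order in bonds)
    # Charge positions are dropped: only the total charge is kept.
    net_charge = sum(charge for _el, _h, charge in atoms)
    return elements_and_h, connectivity, net_charge


# Nitromethane drawn with a pentavalent nitrogen: CH3-N(=O)=O
pentavalent = (
    [("C", 3, 0), ("N", 0, 0), ("O", 0, 0), ("O", 0, 0)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 2)],
)

# Nitromethane drawn charge-separated: CH3-[N+](=O)[O-]
charge_separated = (
    [("C", 3, 0), ("N", 0, 1), ("O", 0, 0), ("O", 0, -1)],
    [(0, 1, 1), (1, 2, 2), (1, 3, 1)],
)

same_compound = skeleton(*pentavalent) == skeleton(*charge_separated)
print(same_compound)  # True
```

Both drawings reduce to the same skeleton, which is the kind of drawing-dependent variation the normalization step is meant to remove. The real software handles many more cases (salts, metals, mobile hydrogens) that this sketch does not attempt.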
In the canonicalization step, a set of atom labels is algorithmically
generated that does not depend on how the structure was initially
drawn; equivalent atoms get the same label. Bond orders and charge
positions are ignored: connectivity alone is used. This does not
introduce ambiguity as long as all hydrogen atoms are accounted for.
Dmitrii Tchekhovskoi of NIST wrote the canonical numbering algorithm12
by modifying a more recent version13 of the well-known Morgan
algorithm.14 In the final step the labeled structure is serialized and
the InChI character string is output.
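The idea that labels depend on connectivity alone, and that equivalent atoms receive the same label, can be sketched with a simple iterative refinement in the spirit of the Morgan algorithm. This is not the InChI canonical numbering algorithm: it is a generic neighbor-based refinement, the function name `refine_classes` and the adjacency-dictionary input are assumptions, and it only assigns equivalence classes without breaking ties to produce a full numbering. Hydrogens are left implicit to keep the example short.

```python
def refine_classes(adjacency):
    """Assign each atom a class rank using connectivity only.

    adjacency: dict mapping an atom key to a list of neighboring atom keys.
    Atoms that end with the same rank are indistinguishable to this refinement.
    """
    # Start from the number of neighbors (degree) of each atom.
    ranks = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    while True:
        # An atom's signature is its own rank plus the sorted ranks of its neighbors.
        signatures = {
            atom: (ranks[atom], tuple(sorted(ranks[n] for n in adjacency[atom])))
            for atom in adjacency
        }
        order = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_ranks = {atom: order[signatures[atom]] for atom in adjacency}
        # Stop once a round of refinement no longer splits any class.
        if len(set(new_ranks.values())) == len(set(ranks.values())):
            return new_ranks
        ranks = new_ranks


# Carbon skeleton of 2-methylbutane: C1-C2(-C5)-C3-C4
drawing_a = {1: [2], 2: [1, 3, 5], 3: [2, 4], 4: [3], 5: [2]}
# The same skeleton entered with different atom keys and in a different order.
drawing_b = {"a": ["b", "c", "d"], "b": ["a"], "c": ["a"], "d": ["a", "e"], "e": ["d"]}

classes_a = refine_classes(drawing_a)
classes_b = refine_classes(drawing_b)

print(classes_a[1] == classes_a[5])  # True: the two methyls on C2 are equivalent
print(classes_a[1] == classes_a[4])  # False: the terminal ethyl carbon differs
print(sorted(classes_a.values()) == sorted(classes_b.values()))  # True
```

The two drawings yield the same set of class ranks even though the atoms were entered under different keys and in a different order. A refinement this simple can fail to separate non-equivalent atoms in highly regular structures, so it should be read as an illustration of the principle only.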
The identifier is hierarchically “layered”; each layer
holds a distinct and separable class of structural information, with
the layers ordered to provide successive structural refinement. There
are currently six InChI layer types, each representing a different
class of structural information: the main layer, a charge layer, a
stereochemical layer, an isotopic layer, a fixed-H layer, and a
reconnected layer. Except for the main layer (atoms and their bonds),
every layer is optional and appears only when the
corresponding input information has been provided. Layers and sublayers
are separated by the forward slash (/) delimiter. Except in the case of
the chemical formula sublayer of the main layer, each layer starts
(after the slash mark) with a lower-case letter to indicate the type of
information held.
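The delimiter and prefix-letter conventions described above are enough to split an identifier into its segments. The following is a minimal sketch, not part of the InChI software: the function name `split_inchi` and the prefix table are assumptions for illustration, the table covers only commonly seen prefix letters, and the example string is the version 1 identifier usually quoted for ethanol.

```python
# Prefix letters grouped under the six layer types named in the text.
# The grouping is a simplification and covers only common prefixes.
LAYER_OF_PREFIX = {
    "c": "main", "h": "main",
    "q": "charge", "p": "charge",
    "b": "stereochemical", "t": "stereochemical",
    "m": "stereochemical", "s": "stereochemical",
    "i": "isotopic",
    "f": "fixed-H",
    "r": "reconnected",
}


def split_inchi(inchi):
    """Split an InChI string into (layer type, prefix letter, content) tuples."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, *segments = inchi[len("InChI="):].split("/")
    result = [("version", "", version)]
    for index, segment in enumerate(segments):
        if index == 0 and not segment[:1].islower():
            # The chemical formula is the one sublayer without a prefix letter.
            result.append(("main", "formula", segment))
        else:
            prefix = segment[:1]
            result.append((LAYER_OF_PREFIX.get(prefix, "unknown"), prefix, segment[1:]))
    return result


ethanol_layers = split_inchi("InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3")
for entry in ethanol_layers:
    print(entry)
# ('version', '', '1')
# ('main', 'formula', 'C2H6O')
# ('main', 'c', '1-2-3')
# ('main', 'h', '3H,2H2,1H3')
```

One limitation of this sketch is that it classifies each segment by its own letter only. In real identifiers, segments that follow a fixed-H (`f`) or reconnected (`r`) segment belong to that layer even though they reuse letters such as `h` or `q`, so a faithful parser would need to track position as well as prefix.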