QSAR WORLD

The IUPAC International Chemical Identifier

The chemical structure of a compound is its true “identifier”, but structure representations are neither unique nor convenient for computers. The International Union of Pure and Applied Chemistry (IUPAC) has therefore developed a method for generating a freely available, non-proprietary identifier for chemical substances that can be used in printed and electronic data sources, making it easier to link diverse data compilations and to identify chemical substances unambiguously.

The project to develop the IUPAC International Chemical Identifier (InChI) [1-7] was proposed in 2000 and approved in 2002 [8]. Version 1 of the InChI system was launched in 2005. IUPAC decided to tackle this problem because the increasing complexity of molecular structures was making conventional naming procedures inconvenient, and because there was no suitable, openly available electronic format for exchanging chemical structure information over the Internet.

In a digital world, structures are not ideal “names”: there are too many ways to draw them, they are non-linear, and they are inconvenient to handle. What was needed was a unique, linear identifier, or “digital signature”, in an openly available electronic format. The InChI algorithm converts a chemical structure (in the form of its connection table) into a unique, alphanumeric string of characters. The program can also convert an InChI label back into a molecular structure. Two requirements must be fulfilled in doing this: different compounds must have different identifiers, each carrying all the information needed to distinguish the structures; and any one compound must have only one identifier, which includes only the information necessary to identify that compound.

InChI is free, open source software, sponsored by IUPAC, implemented by the US National Institute of Standards and Technology (NIST), and distributed under the terms of the GNU Lesser General Public License. Any organization can use it, in either public or private databases. The source code and associated software, documentation, and licensing conditions can be downloaded free from the IUPAC website [9]. InChI is written in C and can be compiled on most systems. It can be packaged into a DLL for Windows or a library for UNIX [10].

Creation of an InChI

An InChI identifier is created from an input connection table (in molfile [11], SDfile [11], or CML [7] format) in three steps: normalization, canonicalization, and serialization. In the normalization step, electron density is ignored; salts and metal atoms in organometallic compounds are disconnected; and mobile hydrogens, variable protonation, and charge are normalized. This step is needed, for example, to remove variations in the ways of representing a nitro group.
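The nitro group case can be illustrated with a toy Python sketch. It is not the InChI normalization procedure, which is considerably more involved; the `Molecule` tuple and the `connectivity_view` function are hypothetical names introduced here. The sketch only shows that two common drawings of nitromethane (charge-separated and "pentavalent" nitrogen) become identical once bond orders and charge positions are set aside and only elements, connectivity, hydrogen counts, and net charge are kept.

```python
from collections import namedtuple

# Hypothetical minimal structure record: heavy atoms only, with the number of
# hydrogens attached to each atom, bonds as {(i, j): order}, and formal charges.
Molecule = namedtuple("Molecule", "elements h_counts bonds charges")

# Nitromethane drawn with a charge-separated nitro group: C-[N+](=O)[O-]
charge_separated = Molecule(
    elements=("C", "N", "O", "O"),
    h_counts=(3, 0, 0, 0),
    bonds={(0, 1): 1, (1, 2): 2, (1, 3): 1},
    charges={1: +1, 3: -1},
)

# The same compound drawn with a "pentavalent" nitrogen: C-N(=O)=O
pentavalent = Molecule(
    elements=("C", "N", "O", "O"),
    h_counts=(3, 0, 0, 0),
    bonds={(0, 1): 1, (1, 2): 2, (1, 3): 2},
    charges={},
)

def connectivity_view(mol):
    """Drop bond orders and charge positions; keep what both drawings share."""
    return (
        mol.elements,
        mol.h_counts,
        frozenset(mol.bonds),          # which atoms are bonded, not how
        sum(mol.charges.values()),     # net charge only
    )

same = connectivity_view(charge_separated) == connectivity_view(pentavalent)
print(same)
```

The two records differ in one bond order and in their formal charges, yet they compare equal in the reduced view, which is the kind of drawing variation the normalization step is meant to remove.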

In the canonicalization step, a set of atom labels is algorithmically generated that does not depend on how the structure was initially drawn; equivalent atoms get the same label. Bond orders and charge positions are ignored: connectivity alone is used. This does not introduce ambiguity as long as all hydrogen atoms are accounted for. Dmitrii Tchekhovskoi of NIST wrote the canonical numbering algorithm [12] by modifying a more recent version [13] of the well-known Morgan algorithm [14]. In the final step, the labeled structure is serialized and the InChI character string is output.
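The idea of drawing-independent labels can be sketched in Python with a simple iterative refinement in the spirit of Morgan's extended connectivity. This is an assumption-laden illustration, not the NIST algorithm: the function name `refine_classes` and the adjacency-dictionary input are hypothetical, the sketch only groups atoms into classes, and it does not perform the tie-breaking that a full canonical numbering needs. A refinement this simple can also give the same label to some atoms that are not truly equivalent.

```python
def refine_classes(adjacency):
    """Group atoms into classes using connectivity alone.

    adjacency maps each atom index to a list of its neighbours.
    Atoms start in classes given by their number of neighbours; each round
    splits classes by the sorted classes of the neighbours, until the number
    of classes stops growing.
    """
    classes = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    while True:
        signatures = {
            atom: (classes[atom], tuple(sorted(classes[n] for n in nbrs)))
            for atom, nbrs in adjacency.items()
        }
        # Rank the distinct signatures so the labels are small integers.
        ranking = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        new_classes = {atom: ranking[signatures[atom]] for atom in adjacency}
        if len(set(new_classes.values())) == len(set(classes.values())):
            return new_classes
        classes = new_classes

# Carbon skeleton of 2-methylbutane: C0-C1(-C4)-C2-C3
drawing_a = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
# The same skeleton with the atoms numbered in a different order
drawing_b = {3: [0], 0: [3, 4, 2], 4: [0, 1], 1: [4], 2: [0]}

labels_a = refine_classes(drawing_a)
labels_b = refine_classes(drawing_b)
print(labels_a)
print(sorted(labels_a.values()) == sorted(labels_b.values()))
```

In `drawing_a` the two methyl carbons attached to the branch point (atoms 0 and 4) end up with the same label, while the methyl carbon at the other end of the chain (atom 3) gets a different one. Renumbering the atoms changes which index carries which label but not the set of labels produced.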
 
The identifier is hierarchically “layered”; each layer holds a distinct and separable class of structural information, with the layers ordered to provide successive structural refinement. There are currently six InChI layer types, each representing a different class of structural information: the main layer, a charge layer, a stereochemical layer, an isotopic layer, a fixed-H layer, and a reconnected layer. Except for the main layer (atoms and their bonds), every layer is optional and appears only when the corresponding input information has been provided. Layers and sublayers are separated by the forward slash (/) delimiter. Except for the chemical formula sublayer of the main layer, each layer starts (after the slash) with a lower-case letter that indicates the type of information held.
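Because the layers are delimited so regularly, an InChI string is easy to take apart. The following Python sketch is an illustration only and is not part of the InChI software: the function name `split_inchi` is hypothetical, and it assumes that the first segment after "InChI=" is the version, that a following segment which does not start with a lower-case letter is the chemical formula, and that every other segment carries a one-letter prefix. The identifier for ethanol is used as sample input.

```python
def split_inchi(inchi):
    """Split an InChI string into (version, formula, layers).

    layers is a list of (prefix_letter, content) pairs in their original
    order. A list is used rather than a dict because the same prefix letter
    can occur more than once in one identifier.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    segments = inchi[len(prefix):].split("/")
    version, rest = segments[0], segments[1:]
    formula = None
    # The formula sublayer is the only one without a lower-case prefix letter.
    if rest and not rest[0][:1].islower():
        formula, rest = rest[0], rest[1:]
    layers = [(seg[0], seg[1:]) for seg in rest if seg]
    return version, formula, layers

version, formula, layers = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(version)   # 1S
print(formula)   # C2H6O
for letter, content in layers:
    print(letter, content)
```

For ethanol the main layer consists of the formula, a connectivity sublayer (prefix c), and a hydrogen sublayer (prefix h); no other layers are present because the input carries no charge, stereochemical, or isotopic information.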

