A Primer on Molecular Similarity in QSAR and Virtual Screening
Part I - Descriptor Choice
2. Molecular Descriptors
A large number of molecular descriptors have been published and recently reviewed, and we do not attempt a comprehensive overview here. Instead, we focus on three aspects of molecular descriptors that have been put forward in the recent literature: firstly, how to choose descriptors for establishing meaningful and “as-trustworthy-as-possible” structure-activity models; secondly, how well 2D and 3D descriptors are able to perform “scaffold-hopping”; and thirdly, how to assess the information content of the descriptors we currently use.
(a) Guidelines for Choosing Molecular Descriptors – Less is (Sometimes) More
A multitude of molecular descriptors can be used to describe a molecular structure, certainly numbering in the hundreds if not thousands. They range from one-dimensional descriptors and two-dimensional (fragment) representations, through three-dimensional, conformation-dependent descriptors, to those incorporating conformational flexibility (sometimes referred to as four-dimensional descriptors). All of them, in particular the geometric descriptors which are often easier to back-project, have their advantages and disadvantages. When they are combined, however, an analogy from a different area holds: beer and wine, taken separately, may be delicious drinks, but as a mixture they should be avoided.
What does that mean in the world of QSAR or cheminformatics? The analogy refers to an increasingly common tendency in recent years to calculate a very large number of descriptors, since they are readily available, and then to use feature selection methods to retain only those variables that are found to improve predictive performance. While this might intuitively seem the right thing to do, consider the following example (more detailed descriptions can be found in publications by Topliss & Costello and Livingstone & Salt).
Imagine you have a small number of data points, say 10, which represent your measurements (such as solubility), and you also have 2 variables which describe your system. If those 2 variables are sensibly chosen, for example molecular weight and polar surface area, you will probably be able to correlate the output variable, solubility, to a reasonable degree with the 2 input variables in a linear model. The model won’t be ideal, since the two variables describe the system under consideration insufficiently (and the number of data points is probably not sufficient either), but you will be able to achieve some kind of model, and it will be statistically significant.
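To make the two-variable setting concrete, the following is a minimal sketch of such a fit in Python with NumPy. Everything in it is an assumption made for illustration: the descriptor values are randomly generated rather than measured, the linear dependence of "log solubility" on molecular weight and polar surface area is invented, and the helper name fit_linear_model is hypothetical. It only shows the mechanics of fitting 10 data points with 2 input variables and computing R^2 and the F statistic one would use to judge significance.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n_points = 10     # small data set, as in the example in the text
n_variables = 2   # molecular weight and polar surface area

# Synthetic descriptor values (assumed ranges, NOT measured data).
mol_weight = rng.uniform(150.0, 500.0, size=n_points)          # g/mol
polar_surface_area = rng.uniform(20.0, 140.0, size=n_points)   # square angstroms

# Synthetic response: an invented linear dependence plus Gaussian noise.
log_solubility = (-0.01 * mol_weight
                  + 0.02 * polar_surface_area
                  + rng.normal(0.0, 0.5, size=n_points))


def fit_linear_model(X, y):
    """Ordinary least squares with an intercept; returns (coefficients, R^2)."""
    design = np.column_stack([np.ones(len(y)), X])   # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)            # residual sum of squares
    ss_tot = float(((y - y.mean()) ** 2).sum())      # total sum of squares
    return coef, 1.0 - ss_res / ss_tot


X = np.column_stack([mol_weight, polar_surface_area])
coef, r_squared = fit_linear_model(X, log_solubility)

# F statistic of the regression, with (2, 7) degrees of freedom here.
dof_residual = n_points - n_variables - 1
f_statistic = (r_squared / n_variables) / ((1.0 - r_squared) / dof_residual)

print("intercept, b_MW, b_PSA:", coef)
print("R^2:", r_squared)
print("F statistic:", f_statistic, "with", (n_variables, dof_residual), "degrees of freedom")
```

With only 7 residual degrees of freedom, every additional input variable consumes a noticeable share of the information in the data, which is why the number of descriptors offered to such a model matters.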