A Primer on Molecular Similarity in QSAR and Virtual Screening
Part I - Descriptor Choice
Now imagine that you are no longer using just 2 input variables, but instead all the descriptors that current software packages are able to calculate, which will be in the range of thousands of descriptors. What happens if you feed all of those input variables into a feature selection procedure and search for a simple linear model with 2 input variables that describes the solubility of your dataset of 10 compounds? You will almost certainly obtain a model with a close-to-perfect fit, simply because there is a huge number of candidate models to choose from (2,000 descriptors already give 1,999,000 distinct pairs), and some of them will fit your data purely by accident. (One counterargument is that a suitable model validation routine can alleviate the problem. But this is true only to a certain extent: if the number of candidate models is large enough, a random model can always be found that, by pure chance, fits both the training and the test set nearly perfectly.)
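The effect is easy to reproduce numerically. The following is a minimal sketch in Python with NumPy, under loudly stated assumptions: the "descriptors" and the "solubility" values are pure random noise rather than real data, the function name best_two_descriptor_r2 is hypothetical, and the pool sizes of 2, 20 and 200 descriptors are illustrative choices kept small so that the exhaustive search over all pairs stays fast.

```python
import itertools
import numpy as np


def best_two_descriptor_r2(X, y):
    """Exhaustively fit y ~ a + b*x_i + c*x_j for every descriptor pair (i, j)
    and return the highest training R^2 together with the winning pair."""
    n, p = X.shape
    ss_tot = np.sum((y - y.mean()) ** 2)
    best_r2, best_pair = -np.inf, None
    for i, j in itertools.combinations(range(p), 2):
        # Design matrix: intercept column plus the two candidate descriptors.
        A = np.column_stack([np.ones(n), X[:, i], X[:, j]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - np.sum(resid ** 2) / ss_tot
        if r2 > best_r2:
            best_r2, best_pair = r2, (i, j)
    return best_r2, best_pair


rng = np.random.default_rng(0)
n_compounds = 10

# Assumption: random numbers stand in for measured solubility and for
# computed descriptors, so there is no real relationship to discover.
y = rng.normal(size=n_compounds)
X_full = rng.normal(size=(n_compounds, 200))

results = {}
for n_desc in (2, 20, 200):
    # Nested pools: each larger pool contains all descriptors of the smaller one.
    r2, pair = best_two_descriptor_r2(X_full[:, :n_desc], y)
    results[n_desc] = r2
    print(f"{n_desc:4d} descriptors: best 2-variable R^2 = {r2:.3f} (pair {pair})")
```

Because the descriptor pools are nested, the best training R^2 can only stay the same or increase as the pool grows, even though every descriptor is noise. With thousands of descriptors the number of candidate pairs, and hence the room for accidental fits, is larger still.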
The more precise mathematical background of this phenomenon has been described in some recent publications, and anyone thinking about developing structure-activity models that involve feature selection would benefit from a look at them in order to avoid some common pitfalls[5-7]. Using few, interpretable features not only gives a neater model, it also gives one that is more likely to be statistically significant. (And you will be in sync with the likes of Albert Einstein, who is often quoted as saying: “Everything should be made as simple as possible, but not simpler”. He was certainly referring to QSAR models here.)
(b) Scaffold Hopping Capability of Molecular Descriptors
“Scaffold hopping” is a term that has been used extensively in recent literature on virtual screening. It describes the ability of molecular descriptors to identify molecules with similar properties despite different underlying structures (and scaffolds). The opinion is often stated that “3D descriptors are better at finding diverse scaffolds”. While this might be the intuitive answer, a look into recent literature does not paint such a clear picture.