A Primer on Molecular Similarity in QSAR and Virtual Screening Part III - Model Generation
Andreas Bender, PhD, of Novartis Institutes for Biomedical Research, an Editorial Advisor and columnist of QSAR World, concludes his popular three-part 'primer' series.
1. Introduction
Generating a model that relates structure to some measured property, be it bioactivity or a physicochemical property, consists of three steps: the description of the problem, the choice of a dataset of experimental endpoints, and the construction of the actual mathematical model. Descriptor generation was discussed in the first part of this primer, with the conclusion that descriptors should be relevant to the problem and not overly complex. The second part discussed the experimental endpoints that can be used as 'output variables' of models. The conclusion there was that experimental data are not without fault either: data measured in a single laboratory may show significant variation (e.g. in high-throughput screening), and combining results from different laboratories might be even more error-prone. Ideally, therefore, one should use a clean dataset measured by a single, well-defined experimental procedure. In this concluding part of the primer I would like to discuss the basics of constructing a useful mathematical model that takes molecular descriptors as independent variables and outputs (predicted) properties as the dependent variable. The focus will be on two steps commonly performed in the generation of QSAR/QSPR models, namely feature selection and model validation.
2. Feature Selection
One of the common first steps in QSAR/QSPR modeling (although not necessarily the best one) is to calculate a large number of features from the molecular connectivity table (or 3D structure, or electronic wave function). Several thousand descriptors exist, with new ones published almost weekly[1]. Next, feature selection is often employed to find which variables help to give a good regression or classification result: a multitude of models is generated and validated against internal and external validation sets, and the process of feature selection is repeated until the ‘best’ model according to user preferences is obtained. There is some reason for this ‘irrational’ process. Initially, when only a set of structures with their associated properties is known (and nothing whatsoever is known about the target protein, or about the physics of a solute-solvent interaction), the structures are all the user has, with no knowledge of which of their properties may be responsible for the observed effect. If one knows, for example, that a target interaction in QSAR is dominated by hydrogen-bond donor interactions, one can tailor the descriptors accordingly, but this is often not known from the outset. While this trial-and-error approach seems ‘intuitive’ and hence can be sensible in some cases, I would like to draw attention to an important point: models generated via feature selection are sometimes not as good as one might think from the statistics!
Firstly, imagine that you have a small set of only ten compounds with measured activities against a target A. If you calculate hundreds of descriptors for each compound, combine this with a technique such as neural networks or support vector machines, and then generate thousands of possible models, you will necessarily find models that are able to fit your input data by pure chance[2]! This is an important point, and it concerns both the number of features used as input and the modeling technique, which may be more or less flexible. (‘Flexible’ here refers to the number of different models that can be fitted by a given technique, also known as the ‘hypothesis space’ accessible to the model.) As a rule of thumb, you need to be particularly aware of this issue when a large number of descriptors, a flexible modeling technique (such as neural networks, support vector machines, and other nonlinear techniques), and a small dataset for training and testing come together. For neural networks in particular, a very good article by David Livingstone discusses the above points[3].
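The chance-correlation effect described above can be illustrated with a small numerical sketch. It is not taken from the cited articles. It assumes purely random 'descriptors' and random 'activities' (so there is no real structure-activity relationship at all), and the function name and parameters are hypothetical. It then reports the strongest correlation that selecting a single descriptor would find.

```python
import numpy as np


def best_chance_correlation(n_compounds=10, n_descriptors=500, seed=0):
    """Pick the random descriptor that correlates best with random activities.

    Both descriptors and activities are drawn independently from a normal
    distribution (an assumption made for illustration), so any correlation
    found is due to chance alone.
    Returns (index of the selected descriptor, its absolute Pearson r).
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_compounds, n_descriptors))  # fake descriptor matrix
    y = rng.normal(size=n_compounds)                   # fake activities

    # Pearson correlation of every descriptor column with y
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

    best = int(np.argmax(np.abs(r)))
    return best, float(abs(r[best]))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        idx, r = best_chance_correlation(n_compounds=n)
        print(f"{n:5d} compounds: best of 500 random descriptors has |r| = {r:.2f}")
```

With only ten compounds, the 'selected' descriptor typically shows a strong apparent correlation even though it carries no information. With many more compounds and the same number of descriptors, the best chance correlation becomes much weaker. A more flexible model, or a combination of several selected descriptors, would fit the ten random activities even more closely.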