Accuracy of Prediction
Wendy Warr, editorial advisor of QSARWorld, evaluates the current dilemmas and choices in the accuracy of predictions and overfitting. Over to Wendy...
The non-specialist might have assumed that the main objective of a QSAR study is to predict whether an untested compound will be active or inactive (or to do virtual screening, i.e., predictions about a whole virtual library of compounds). In practice, much work has been devoted to 'explanatory' QSAR, relating changes in molecular structure to changes in activity, and only recently has there been considerable interest in predictivity; QSAR is now being used for virtual screening, to find biologically active molecules. There are many reasons why models fail [1, 2]: bad data, bad methodology, inappropriate descriptors, domain inapplicability [3], etc. In this article we can address only a few of the issues. Vendors are supplying models that may or may not be applicable to a corporate virtual library [2] and many (in-house approved) models are now available to non-experts on corporate intranets. How are these users to judge applicability?
Two significant issues concerning accuracy of prediction are extrapolation (whether the model can be applied to molecules unlike those in the training set) and overfitting. Overfitting has been considered for a long time [4], but extrapolation has received too little attention. Running cross-validation studies on the data to get an overall RMS error for prediction is a reasonable check for overfitting, but it is inadequate as a measure of extrapolation [5].
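The gap between a cross-validated error and an extrapolation error can be illustrated with a toy calculation. The sketch below is not taken from any of the cited studies: the single descriptor x, the curved "activity" function and the descriptor ranges are all invented for illustration. A straight line is fitted to compounds with x between 0 and 1, and its leave-one-out RMS error is compared with its RMS error on hypothetical compounds with x between 2 and 3.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares straight line; returns (slope, intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def rmse(y_true, y_pred):
    """Root-mean-square error between two arrays."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def loo_rmse(x, y):
    """RMS error of leave-one-out predictions for the straight-line model."""
    preds = np.empty_like(y)
    for i in range(len(x)):
        keep = np.arange(len(x)) != i          # drop compound i
        slope, intercept = fit_line(x[keep], y[keep])
        preds[i] = slope * x[i] + intercept    # predict the left-out compound
    return rmse(y, preds)

def true_activity(x):
    """Invented toy activity, curved in the descriptor (an assumption)."""
    return x ** 2

# Training compounds cover only the descriptor range 0..1
x_train = np.linspace(0.0, 1.0, 20)
y_train = true_activity(x_train)

# Hypothetical query compounds well outside that range
x_far = np.linspace(2.0, 3.0, 20)

slope, intercept = fit_line(x_train, y_train)
cv_error = loo_rmse(x_train, y_train)
far_error = rmse(true_activity(x_far), slope * x_far + intercept)

print(f"LOO RMS error inside the training range: {cv_error:.3f}")
print(f"RMS error on the extrapolated compounds: {far_error:.3f}")
```

In this toy case the cross-validated error looks acceptable because every left-out compound still lies inside the range covered by the others, so the check never sees the kind of molecule on which the model fails.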
The outcome of a leave-one-out (LOO), or leave-many-out, cross-validation procedure is the cross-validated R2 (q2; LOO q2 in the leave-one-out case). The inadequacy of q2 as a measure of predictivity was realized more than ten years ago, in the case of 3D QSAR, in what John van Drie refers to as 'the Kubinyi paradox': models that give the best retrospective fit give the worst prospective results [6]. To obtain good R2 values in external prediction, you should not simply choose the model with the highest q2. The 'best fit' models are not the best ones in external prediction because internal validation rewards fitting the compounds in the training set as well as possible and does not take new compounds into account [7].
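For readers who want to see how q2 is obtained, here is a minimal sketch that assumes a multiple linear regression model and the common definition q2 = 1 - PRESS / SS, where PRESS is the sum of squared leave-one-out prediction errors and SS is the sum of squared deviations of the observed activities from their mean. The descriptor matrix and activities are random numbers generated for illustration, and the function names are hypothetical; other QSAR methods and other conventions for SS exist.

```python
import numpy as np

def design(X):
    """Prepend an intercept column to the descriptor matrix."""
    return np.column_stack([np.ones(len(X)), X])

def fit_ols(X, y):
    """Ordinary least-squares coefficients for y ~ intercept + X."""
    coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return coef

def fitted_r2(X, y):
    """Conventional R2 of the model fitted to all compounds."""
    resid = y - design(X) @ fit_ols(X, y)
    return float(1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2))

def loo_q2(X, y):
    """Leave-one-out cross-validated R2 (q2 = 1 - PRESS / SS)."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                   # leave compound i out
        coef = fit_ols(X[keep], y[keep])           # refit without it
        pred = (design(X[i:i + 1]) @ coef)[0]      # predict it
        press += (y[i] - pred) ** 2
    return float(1.0 - press / np.sum((y - y.mean()) ** 2))

# Invented data: 25 compounds, 2 descriptors, noisy linear activity
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 2))
y = X @ np.array([1.0, -1.0]) + 0.5 * rng.normal(size=25)

r2 = fitted_r2(X, y)
q2 = loo_q2(X, y)
print(f"fitted R2 = {r2:.3f}, LOO q2 = {q2:.3f}")
```

For a least-squares model q2 cannot exceed the fitted R2, since each compound is predicted by a model that never saw it. Note that both numbers are computed from the training set alone, which is why neither says anything about compounds outside it.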
Thus, it is not safe to assume that internally cross-validated models will automatically be externally predictive. A low value of q2 for the training set may well indicate low predictivity in a model, but a high q2 does not guarantee high predictivity: it is a necessary condition for high predictive power, not a sufficient one. Tropsha and his colleagues argue that a reliable model should be characterized by both a high q2 and a high correlation coefficient (or R2) between the predicted and observed activities of compounds from a test set [8, 9]. They have proposed several approaches to the division of experimental data sets into training and test sets and have formulated a set of general criteria for the evaluation of the predictive power of QSAR models.
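The two-part check described above (a high q2 on the training set together with a high R2 between predicted and observed activities on a test set) might be sketched as follows. This is only an illustration under stated assumptions: the model is a plain least-squares regression, the data are random numbers, the split simply takes the first 30 compounds for training, and the cutoffs q2_min and r2_min are placeholder values chosen for the example, not criteria quoted from this article. The full set of criteria in [8, 9] is not reproduced here.

```python
import numpy as np

def design(X):
    """Prepend an intercept column to the descriptor matrix."""
    return np.column_stack([np.ones(len(X)), X])

def fit_ols(X, y):
    """Ordinary least-squares coefficients for y ~ intercept + X."""
    coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return coef

def loo_q2(X, y):
    """Leave-one-out cross-validated R2 on the training set."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_ols(X[keep], y[keep])
        press += (y[i] - (design(X[i:i + 1]) @ coef)[0]) ** 2
    return float(1.0 - press / np.sum((y - y.mean()) ** 2))

def external_r2(y_obs, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    r = np.corrcoef(y_obs, y_pred)[0, 1]
    return float(r ** 2)

def assess(X_train, y_train, X_test, y_test, q2_min=0.5, r2_min=0.6):
    """Report q2, external R2 and whether both exceed the placeholder cutoffs."""
    coef = fit_ols(X_train, y_train)
    q2 = loo_q2(X_train, y_train)
    r2_ext = external_r2(y_test, design(X_test) @ coef)
    ok = bool(q2 > q2_min and r2_ext > r2_min)   # both conditions required
    return {"q2": q2, "r2_ext": r2_ext, "acceptable": ok}

# Invented data: 40 compounds, 3 descriptors, nearly linear activity
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=40)

# Naive split for illustration only: first 30 train, last 10 test
result = assess(X[:30], y[:30], X[30:], y[30:])
print(result)
```

The point of returning both numbers is that a model passes only when the internal and the external measure agree; a high q2 with a poor external R2 would be reported as not acceptable.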