Accuracy of Prediction
Researchers at Merck [5] have proposed a way to estimate how reliable a given QSAR model's prediction is for an arbitrary chemical structure, using the training set from which the model was derived. Based on a set of retrospective cross-validation experiments using 20 diverse in-house activity sets, they found two useful measures: the similarity of the molecule to be predicted to the nearest molecule in the training set, and the number of neighbors in the training set, where neighbors are those molecules more similar than a user-chosen cut-off. The molecules with the highest similarity and/or the most neighbors are the best predicted, even for many diverse training sets (though to a lesser degree). The result does not depend on which QSAR method or descriptor is used. Three years later, workers at Strand Life Sciences, unaware of the Merck publication, drew similar conclusions [16].
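As a rough illustration of the two measures, here is a minimal Python sketch. It assumes each molecule is represented as a set of fingerprint feature identifiers and that similarity is the Tanimoto coefficient; the function names, the set-based representation, and the 0.7 cut-off are illustrative assumptions of mine, not details taken from the Merck paper.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of feature identifiers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty feature sets
    return len(a & b) / len(union)


def reliability_measures(query, training_set, cutoff=0.7):
    """Return (similarity to nearest training molecule, neighbor count).

    A neighbor is any training molecule whose similarity to the query
    exceeds the user-chosen cutoff (0.7 is an arbitrary example value).
    """
    sims = [tanimoto(query, mol) for mol in training_set]
    nearest = max(sims, default=0.0)
    neighbors = sum(1 for s in sims if s > cutoff)
    return nearest, neighbors


# Hypothetical toy data: integers stand in for fingerprint bits.
training = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
query = {1, 2, 3, 4, 5}
nearest, neighbors = reliability_measures(query, training)
print(nearest, neighbors)
```

In this picture, a query with a high nearest-neighbor similarity and many neighbors would be expected to be predicted more reliably than one that sits alone in descriptor space.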
Nevertheless, says Gerry Maggiora, incorrect predictions of activity still arise among similar molecules even in cases where overall predictivity is high, because, in his metaphor, activity landscapes are not always like gently rolling hills, but may be more like the rugged landscape of Bryce Canyon [17]. Even very local, linear models cannot account satisfactorily for landscapes with lots of 'cliffs', and perfectly valid data points located in cliff regions may appear to be outliers. It may also be necessary to assay additional compounds in the neighborhoods around the cliffs, to ensure that activity landscapes are adequately represented in these rapidly varying regions. Maggiora also discusses the consequences of the lack of invariance of chemical space to changes in the set of descriptors.
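One simple way to picture a 'cliff' is as a pair of molecules that are very similar in structure yet far apart in activity. The sketch below flags such pairs. It is only an illustration under my own assumptions: molecules are sets of feature identifiers, similarity is Tanimoto, and the two thresholds (0.7 similarity, 2.0 activity units) are arbitrary values, not ones proposed by Maggiora.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two sets of feature identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_cliff_pairs(features, activities, sim_cutoff=0.7, activity_gap=2.0):
    """List pairs (i, j, similarity, gap) that are similar but differ in activity.

    Both thresholds are hypothetical and would need tuning per data set.
    """
    cliffs = []
    for i, j in combinations(range(len(features)), 2):
        sim = tanimoto(features[i], features[j])
        gap = abs(activities[i] - activities[j])
        if sim >= sim_cutoff and gap >= activity_gap:
            cliffs.append((i, j, sim, gap))
    return cliffs


# Hypothetical toy data: molecules 0 and 1 are near-identical but differ
# sharply in activity; molecule 2 is unrelated to both.
features = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {7, 8, 9}]
activities = [5.0, 8.5, 5.2]
for i, j, sim, gap in find_cliff_pairs(features, activities):
    print(f"molecules {i} and {j}: similarity {sim:.2f}, activity gap {gap:.1f}")
```

Pairs flagged this way would mark the neighborhoods where assaying additional compounds might be most informative.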
Bob Clark referred to 'clumpy' data sets, rather than 'activity cliffs', in a recent presentation [18]. A larger data set is not necessarily better than a smaller one in the case of cross-validation: larger data sets in which the observations are unevenly distributed through the descriptor space are particularly susceptible to problematic distortions of the validation statistics. Clark’s paper was given in a symposium on the evaluation of computational methods at the fall 2007 ACS Meeting. Papers arising from that symposium, selected by guest editors, should shortly appear in the Journal of Computer-Aided Molecular Design.
Only a small number of the oral presentations related to QSAR; most papers concerned measures of the quality of docking results. In his concluding remarks, Terry Stouch said that there was agreement on the need for better test, validation, and decoy sets, and that we are approaching agreement on what more is necessary [19]. Two significant new data sets are now available for testing docking algorithms: a Directory of Useful Decoys (DUD) [20, 21] and WOMBAT Data for Enrichment Studies [22, 23]. Is there a need for newer, better QSAR data sets, and what should be the criteria for building them? Since QSAR and docking are both being used now for virtual high-throughput screening, comparisons of the two methods are likely to be of interest. I can see here topics worthy of further discussion in QSARWorld. The spring 2008 ACS meeting also promises a symposium entitled 'Model Applicability Domains: When Can I Use my Model?' Maybe I will be writing more about accuracy of prediction for QSARWorld in 2008.