A Primer on Molecular Similarity in QSAR and Virtual Screening Part III - Model Generation
Several additional techniques exist to show that your model does not fit your output function by pure chance, one of them being Y-scrambling[9]. Here, the output variable is randomly permuted, and the model is fit to the new (random) variable. If model performance similar to that with the real output variable can be obtained, this indicates a high likelihood that the model obtained its credentials purely by chance. (Typically, the correlation coefficients obtainable by Y-scrambling should be much lower: if the real model shows a correlation of, for example, 0.7, the models obtained using random scrambling should not show correlations larger than maybe half that number. As always, this depends on the dataset size, the modeling algorithm, the number of random scramblings performed, etc., and for suitably large datasets correlations on the scrambled output variable should not differ significantly from zero.)
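The Y-scrambling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the "model" is assumed to be a single-descriptor least-squares line scored by its squared correlation, and the names `fit_r2` and `y_scramble`, the toy data, and the number of scrambling rounds are all hypothetical choices made for this example.

```python
import random
import statistics


def fit_r2(x, y):
    """Squared correlation of the least-squares line y = a*x + b.

    Stand-in for "fit the model and measure its performance"; a real
    QSAR study would plug in its own modeling algorithm here.
    """
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable cannot be correlated with anything
    return (sxy * sxy) / (sxx * syy)


def y_scramble(x, y, n_rounds=100, seed=0):
    """Return the real r^2 and the r^2 values from n_rounds permutations of y."""
    rng = random.Random(seed)
    real = fit_r2(x, y)
    scrambled = []
    for _ in range(n_rounds):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break the descriptor-activity pairing
        scrambled.append(fit_r2(x, y_perm))
    return real, scrambled


# Toy data (assumption): one descriptor, activity = 2 * descriptor + noise.
noise = random.Random(42)
x = [float(i) for i in range(30)]
y = [2.0 * xi + noise.gauss(0.0, 3.0) for xi in x]

real_r2, scrambled_r2 = y_scramble(x, y)
print("real r^2:", round(real_r2, 3))
print("mean scrambled r^2:", round(statistics.fmean(scrambled_r2), 3))
print("max scrambled r^2:", round(max(scrambled_r2), 3))
```

If the scrambled values come anywhere near the real one, the model should be treated with suspicion. Comparing against the maximum over all rounds, and not only the mean, is the more conservative check.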
What also needs to be kept in mind, no matter what you generate a model for, is its applicability domain: the compounds for which you can be confident of obtaining good predictions. This depends largely on the chemistry covered in the training set, but since models are based on descriptors rather than on the structures themselves, one school of thought is to make predictions only for compounds whose descriptors fall within the ranges covered by the training set. Applicability domains have become more and more important in the recent cheminformatics literature, and I would like to refer readers in the business of creating predictive models to some recent articles in the area[10-12].
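The descriptor-range view of the applicability domain mentioned above can be illustrated with a short Python sketch. It assumes each compound is already represented as a list of numeric descriptor values in a fixed order. The function names `descriptor_ranges` and `in_domain` and the example values are hypothetical, and the cited literature describes more refined definitions than this simple bounding-box check.

```python
def descriptor_ranges(training_set):
    """Return one (min, max) pair per descriptor column of the training set."""
    columns = zip(*training_set)  # transpose: rows of compounds -> columns of descriptors
    return [(min(col), max(col)) for col in columns]


def in_domain(compound, ranges):
    """True if every descriptor value lies within the training-set range."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(compound, ranges))


# Hypothetical training set: two descriptors per compound
# (for instance a lipophilicity value and a molecular weight).
training_set = [
    [1.0, 200.0],
    [3.5, 350.0],
    [2.2, 410.0],
]
ranges = descriptor_ranges(training_set)

print(in_domain([2.0, 300.0], ranges))  # both descriptors inside the ranges
print(in_domain([5.1, 300.0], ranges))  # first descriptor outside its range
```

A compound that passes this check is not guaranteed a good prediction, since it may still fall into a sparsely populated region of descriptor space; the check only filters out clear extrapolations.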
4. Summary & Conclusions
In this final primer of the series on generating structure-activity models we discussed some of the dangers we encounter when using descriptors and experimental data to generate a mathematical descriptor-property relationship. It should be clear from the paragraphs above that generating a reliable QSAR model is no trivial task, and that its quality depends on many different factors. To summarize, it is best to have sufficient and reliable experimental data (reproducible; usually from a single source); to use a small number of descriptors that are relevant to the problem (but no fewer than needed); to apply a modeling technique that is as simple as possible (but no simpler); and to validate the model using repeated, appropriately sized dataset splits. Even if all those points are considered, and you apply your final model to compounds you think are in its applicability domain, you still cannot be sure that the model will make good predictions in the future, but at least you paid attention to all the details and did the best you could.
And all that this leaves me to do is to wish you good luck with your next QSAR study!