QSAR WORLD

A Primer on Molecular Similarity in QSAR and Virtual Screening
Part III - Model Generation

2. Feature Selection

In the context of 'model significance', one of the most commonly reported figures of merit for multiple linear regression models is the so-called 'F measure', which tells the user how 'significant' (that is, how unlikely to have occurred by pure chance) a given model is. Now imagine that, for a small dataset and a large number of descriptors, you generate millions of models and find one that gives a very high correlation coefficient on the test dataset. Is this a 'significant' model? It might well be, but it might also have occurred by chance, given that you first created a large number of models to choose from. The F measure was originally derived for the case where you have a single shot at creating a model, not a large number of them. A recent publication described how the significance of F measures changes when feature selection is performed[4]: as might be expected, the more features (and hence models) there are to choose from, the less significant a given correlation becomes at identical dataset size.
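The effect of having many models to choose from can be illustrated with a small simulation. The sketch below is not the procedure from reference [4]; it is a minimal illustration under assumed settings (20 compounds, purely random Gaussian 'descriptors' and 'activities', and the hypothetical helper names `best_chance_r2` and `mean_best_r2`). It picks the single best-correlating descriptor out of a pool of random ones and reports how the best chance r² grows with the size of the pool.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_chance_r2(n_compounds, n_descriptors, rng):
    """Best r^2 between a random 'activity' and any of n random 'descriptors'.

    Neither the activity nor the descriptors carry any real signal, so every
    correlation found here is a chance correlation by construction.
    """
    activity = [rng.gauss(0, 1) for _ in range(n_compounds)]
    best = 0.0
    for _ in range(n_descriptors):
        descriptor = [rng.gauss(0, 1) for _ in range(n_compounds)]
        best = max(best, pearson_r(descriptor, activity) ** 2)
    return best


def mean_best_r2(n_compounds, n_descriptors, n_repeats=20, seed=0):
    """Average the best chance r^2 over several independent random datasets."""
    rng = random.Random(seed)
    results = [best_chance_r2(n_compounds, n_descriptors, rng)
               for _ in range(n_repeats)]
    return statistics.fmean(results)


if __name__ == "__main__":
    # Same dataset size, increasingly large descriptor pools (assumed values).
    for pool_size in (1, 10, 100):
        print(pool_size, round(mean_best_r2(20, pool_size), 3))
```

Because the data contain no signal, any increase of the best r² with the pool size is due to selection alone, which is why a significance test designed for a single model is too optimistic once feature selection has been performed.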

3. Model Validation

As outlined above, a given correlation coefficient (or RMSE, for that matter) does not, on its own, tell you a whole lot about the performance of your model. Also relevant are the number of descriptors, the size of your dataset, the training/test/validation set split, the diversity of structures (which define the applicability domain for new compounds) and the quality of your experimental data, and these are only the most important factors. Some of them are fixed and some are not: let’s assume you have a given dataset, with data points of a given quality. How do you then go about creating the best possible model from your data? How do you ensure applicability to future structures, and how do you validate your model? We will now discuss some of the methods at our disposal to ensure the quality of the models we create.

First of all, it is crucial to use multiple splits of your data, and this usually means three sets: one set to derive the parameters of your model, one set to judge the quality of your model (and perform model selection), and one set to assess the performance of the final model. An alternative is to use cross-validation: your dataset is split into, for example, five parts of equal size. In each run, four of the parts are used to train the model and its performance is assessed on the remaining fifth part; the results of the five runs are then averaged. 'Leave-one-out' cross-validation, where every compound is left out in turn and predicted, has sometimes also been used, but this practice has been strongly advocated against and should no longer be used[5], since the model performance obtained this way is not predictive of the performance on new compounds at all. For details, and for guidance on which method to use in which case, the reader is referred to the literature, since the choice depends on the size of the dataset[6]. Also of tremendous value is the use of leave-multiple-out splits (that is, using a set of several compounds to judge model performance), performed not once in a fixed, systematic partition (as in conventional cross-validation) but repeated over and over again with new random splits[7]. In published applications[8], this protocol gave a better assessment of model performance, along with less complex and smaller models.
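The two resampling schemes described above can be sketched as follows. This is a minimal illustration, not the protocol of references [7] and [8]: the one-descriptor least-squares model, the synthetic data from the hypothetical helper `make_toy_data`, and all function names and parameter values are assumptions made for the example. `kfold_splits` partitions the data once so that every compound is held out exactly once, while `repeated_random_splits` draws a fresh leave-multiple-out split in every repetition.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


def rmse(model, x, y):
    """Root-mean-square error of a fitted (slope, intercept) model."""
    slope, intercept = model
    errors = [(slope * a + intercept - b) ** 2 for a, b in zip(x, y)]
    return statistics.fmean(errors) ** 0.5


def kfold_splits(n, k, rng):
    """Yield (train, test) index lists; every index is held out exactly once."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def repeated_random_splits(n, test_size, n_repeats, rng):
    """Yield (train, test) index lists from a new random split per repeat."""
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[test_size:], idx[:test_size]


def evaluate(x, y, splits):
    """Mean and standard deviation of the held-out RMSE over all splits."""
    scores = []
    for train, test in splits:
        model = fit_line([x[i] for i in train], [y[i] for i in train])
        scores.append(rmse(model, [x[i] for i in test], [y[i] for i in test]))
    return statistics.fmean(scores), statistics.stdev(scores)


def make_toy_data(n, rng, noise=0.5):
    """Synthetic 'descriptor' x and 'activity' y = 2x + 1 + Gaussian noise."""
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [2 * a + 1 + rng.gauss(0, noise) for a in x]
    return x, y


if __name__ == "__main__":
    rng = random.Random(0)
    x, y = make_toy_data(50, rng)
    print("5-fold CV:       ", evaluate(x, y, kfold_splits(len(x), 5, rng)))
    print("repeated splits: ",
          evaluate(x, y, repeated_random_splits(len(x), 10, 100, rng)))
```

Repeating the random splits many times gives a distribution of held-out errors rather than a single number, so the spread returned by `evaluate` indicates how strongly the performance estimate depends on the particular split. In a real study, a final test set would still be put aside before any of these splits are drawn.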


Facilitated by Strand Life Sciences Pvt. Ltd.