QSAR WORLD

Expectations of a chemist from a ‘good’ QSAR Model

We are glad to present the first guest column of QSARWorld, written by accomplished chemists Dr. Mukund Chorghade of Chorghade Enterprises and Jon Heal, Bill Hamilton and Joe Sheridan of Prosarix Ltd.

Dr. Mukund Chorghade is currently President of D&O Pharmachem, where he consults for major American and European pharmaceutical companies on collaborations with academic, government and industrial laboratories worldwide. William "Bill" Hamilton is the CEO of Prosarix. He has 20 years' experience in the commercial biotech sector and holds a BSc in Biochemistry and a PhD in Molecular Biology from Imperial College, London. Jonathan "Jon" Heal, the CSO of Prosarix, has a first-class degree and a PhD in Chemistry from Imperial College, London, followed by over 10 years of software development experience as a Microsoft Certified Developer in the financial services and pharmaceutical industries. Joe Sheridan, who heads the Drug Discovery division at Prosarix, has a first-class degree in Chemistry from UMIST, Manchester, and a PhD in Drug Design from the University of Manchester.

Here they discuss what a 'good' QSAR model should offer and the issues and challenges that chemists face with in silico modelling techniques. Over to them...




Constructing a predictive model for compound activity (QSAR) or some other property (QSPR) from experimental data is now common practice in drug discovery. Judicious use of QSAR has increased efficiency in the hit-finding and lead-optimisation stages of a project: computational predictive models have been used by researchers in several drug discovery campaigns.

The degree of reliance on QSAR depends on the type of property being predicted, the stage of the project and the relative ease and cost of compound synthesis and subsequent testing. Many QSAR models provide useful predictions; a number do not, despite good statistics generated from internal data used in training.

Chemoinformaticians understand that a useful QSAR model must comply with the general characteristics proposed by Topliss & Costello [1] and Unger & Hansch [2]:

  • The training set must have sufficient examples to cover the range of properties that the model is required to predict. Usually this spans several orders of magnitude of the end point being predicted.
  • The model should be based on non-correlated descriptors that relate biophysically to the property being predicted, and the number of descriptors should be far smaller than the number of compounds in the training set (by a factor of at least 5 to 10).
  • The simplest model should preferably be selected.
  • The model should be characterised by a number of statistical parameters (including correlation, standard deviation, F values and confidence intervals) and cross-validated to test internal predictivity. Ideally, model performance should be measured against a separate test set (external predictivity).
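The checks in the list above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended protocol: the data are synthetic, the model is plain least-squares regression, and every name (`fit_linear`, `loo_q2`, the 40/15 split, the three descriptors) is an assumption made for the example. It reports the compound-to-descriptor ratio, the fitted R², the leave-one-out cross-validated q² (internal predictivity) and R² on a held-out test set (external predictivity).

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply fitted coefficients (intercept first) to a descriptor matrix."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def r_squared(y_obs, y_pred, y_ref_mean):
    """1 - SS_res / SS_tot, with SS_tot taken about y_ref_mean."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_ref_mean) ** 2)
    return 1.0 - ss_res / ss_tot

def loo_q2(X, y):
    """Leave-one-out q2: each compound is predicted by a model fitted without it."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_linear(X[keep], y[keep])
        preds[i] = predict(coef, X[i:i + 1])[0]
    return r_squared(y, preds, y.mean())

# Synthetic stand-in for real data: 3 hypothetical descriptors, 40 training
# compounds and 15 external test compounds, activity in log units.
rng = np.random.default_rng(0)
true_weights = np.array([1.0, -0.5, 0.8])
X_train = rng.normal(size=(40, 3))
y_train = X_train @ true_weights + rng.normal(scale=0.3, size=40)
X_test = rng.normal(size=(15, 3))
y_test = X_test @ true_weights + rng.normal(scale=0.3, size=15)

# Compounds per descriptor; the guideline above asks for at least 5 to 10.
ratio = X_train.shape[0] / X_train.shape[1]

coef = fit_linear(X_train, y_train)
r2_fit = r_squared(y_train, predict(coef, X_train), y_train.mean())
q2_internal = loo_q2(X_train, y_train)
# External R2 measured about the training-set mean (one of several conventions).
r2_external = r_squared(y_test, predict(coef, X_test), y_train.mean())

print(f"compounds per descriptor: {ratio:.1f}")
print(f"R2 (fit): {r2_fit:.3f}  q2 (LOO): {q2_internal:.3f}  R2 (external): {r2_external:.3f}")
```

Because the synthetic test compounds are drawn from the same distribution as the training compounds, all three statistics agree closely here. With real data, the external figure is the one to watch.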

Adhering to these general principles improves the likelihood that the QSAR model will be predictive. Often, however, there is no extensive external testing (using compounds unseen in training). In this familiar scenario, correlations for external predictions can be far worse than expected, regardless of the encouraging statistics generated from the training data. The reasons for such poor model performance are not yet fully understood, notwithstanding the many theoretical studies in this area.

The ‘Kubinyi Paradox’, following systematic investigations into the relationship between internal and external predictivity, states that high internal predictivity may often result in low external predictivity and vice versa [3]. One explanation is that the overall error of the prediction is compounded when errors inherent in the model are coupled with experimental errors in the data from external compounds. This unexpected conclusion dictates a more critical definition of a ‘good’ QSAR model as one that has been validated with significant external testing. Unfortunately, this degree of validation is usually gained only as a project progresses, by testing sets of synthesised analogues selected from model predictions. Many models, when first constructed, do not have significant external validation.
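One way to make the compounding of errors concrete, under the simplifying assumption (not stated in the column) that model error and experimental error are independent and unbiased, is to add their variances. The symbols and the numerical values below are illustrative only.

```latex
% Assumption: model and experimental errors are independent, so variances add.
\sigma_{\text{total}}^{2} = \sigma_{\text{model}}^{2} + \sigma_{\text{exp}}^{2}

% Illustrative values in log units: \sigma_{\text{model}} = 0.5,\ \sigma_{\text{exp}} = 0.3
\sigma_{\text{total}} = \sqrt{0.5^{2} + 0.3^{2}} = \sqrt{0.34} \approx 0.58
```

Under this assumption, the scatter observed between predicted and measured values for external compounds is always larger than the model error alone, even when the model itself is sound.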

Even for a model exhibiting good external predictivity, a second problem may become apparent, relating to the chemical space of the training set and the scope of the trained model. A training set is, by definition, always limited, and the model learns only about the properties displayed by this set; that is, most models are local. A model used to predict properties for compounds that are substantially dissimilar to the training set will show significantly diminished predictive capability, because the descriptor values in the test set lie outside the range ‘seen’ in the training set. Understanding the scope of the model is critical to recognising its capability to make predictions on diverse structures, although, depending on the descriptors used, this may not always be obvious! An awareness of this problem is possible only from an analysis of the training data. Usually the chemist does not have access to this data and is less likely to understand this potential pitfall than the originator of the model. Poor predictions relating to model scope are very common: it is the responsibility of the chemoinformatician to explain where the model is likely to be applicable and to flag its possible limitations. Conversely, it is also the responsibility of the chemist to see a QSAR model not as a black box, but to understand its likely scope based on the chemical space of the training set.
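The point about descriptor ranges suggests a very simple scope check, sketched below in Python. It only flags query compounds whose descriptor values fall outside the minimum and maximum seen in training. This is an assumption about how scope might be screened, not a method described in the column, and a compound that passes may still be dissimilar to the training set in other ways. The descriptor columns (a logP-like and a molecular-weight-like value) and all numbers are hypothetical.

```python
import numpy as np

def descriptor_ranges(X_train):
    """Per-descriptor minimum and maximum observed in the training set."""
    return X_train.min(axis=0), X_train.max(axis=0)

def outside_training_range(X_query, lo, hi):
    """Boolean array (n_query, n_descriptors): True where a value is out of range."""
    return (X_query < lo) | (X_query > hi)

# Hypothetical descriptors: column 0 is logP-like, column 1 is molecular-weight-like.
X_train = np.array([
    [1.2, 250.0],
    [2.5, 310.0],
    [3.1, 420.0],
    [0.8, 280.0],
])
X_query = np.array([
    [2.0, 300.0],  # both descriptors inside the training range
    [5.6, 350.0],  # first descriptor above the training maximum
    [1.0, 610.0],  # second descriptor above the training maximum
])

lo, hi = descriptor_ranges(X_train)
flags = outside_training_range(X_query, lo, hi)
in_domain = ~flags.any(axis=1)  # in scope only if every descriptor is in range

for i, ok in enumerate(in_domain):
    status = "within training range" if ok else "outside training range"
    print(f"query compound {i}: {status}")
```

A report of this kind, supplied alongside the predictions, is one way for the originator of a model to flag where it is likely to apply without handing over the full training data.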
