
New Horizons in Toxicity Prediction. Lhasa Limited Symposium Event in Collaboration with the University of Cambridge - February 2009


A Report by Wendy A. Warr, wendy@warr.com, http://www.warr.com
Consensus QSAR models
Mark Cronin, Liverpool John Moores University

Integrated testing strategies (ITS) can involve compilation of in silico predictions from the same or similar techniques, such as regression-based models; compilation of in silico predictions from different techniques, weighting or averaging predictions; and compilation of in silico predictions with in vitro and in chemico data. Consensus modeling of regression-based QSARs for large, heterogeneous data sets requires large groups of physicochemical descriptors and/or properties and a method to select them. A pool of models is created (usually regression models, but neural networks can also be used) and the statistically best and/or most diverse QSARs are determined. Their predictions are then weighted or averaged. The method often performs better than a single QSAR [16]. However, one study by Cronin’s own team found that the use of consensus models did not seem warranted, given the minimal improvement in model statistics for the data sets in question [17].
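The select-then-combine step described above can be sketched in a few lines. This is a minimal illustration, not the procedure used in the cited studies: the model names, cross-validated scores, and predictions are hypothetical, and weighting by cross-validated score is only one of several possible weighting schemes.

```python
from statistics import mean


def select_best_models(cv_scores, keep):
    """Return the names of the `keep` models with the highest score."""
    ranked = sorted(cv_scores, key=cv_scores.get, reverse=True)
    return ranked[:keep]


def consensus_prediction(predictions, weights=None):
    """Combine per-model predictions for one compound.

    With no weights this is a plain average; otherwise it is a
    weighted average with the weights normalised to sum to one.
    """
    if weights is None:
        return mean(predictions)
    if len(weights) != len(predictions):
        raise ValueError("need one weight per prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Hypothetical pool of four models: a cross-validated score per model
# and each model's prediction for a single compound.
cv_scores = {"mlr_a": 0.71, "mlr_b": 0.64, "nn_c": 0.68, "mlr_d": 0.32}
preds = {"mlr_a": 1.0, "mlr_b": 2.0, "nn_c": 3.0, "mlr_d": 9.0}

best = select_best_models(cv_scores, keep=3)  # drops the weakest model
unweighted = consensus_prediction([preds[m] for m in best])
weighted = consensus_prediction(
    [preds[m] for m in best], [cv_scores[m] for m in best]
)
```

Dropping the weakest model before averaging reflects the point that the pool should be filtered on statistical quality rather than combined blindly.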

Consensus QSAR has also been applied to models developed from different techniques and different data sets. Matthews et al. used MC4PC, MDL-QSAR, BioEpisteme, Leadscope PDM, and Derek for Windows, with the same data sets, to predict carcinogenicity [2]. The QSAR models were based upon a weight-of-evidence paradigm, which has a bigger “cost” than weighting or averaging. The individual models made complementary predictions of carcinogenesis and had equivalent predictive performance. Consensus predictions for two programs achieved better performance, better confidence predictions, and better sensitivity. Four QSAR programs predicted carcinogenicity with high specificity (85%). Consensus positive predictions identified clusters of carcinogens with reasonable mechanisms of action. Consensus models from three different expert systems have also been used with some success in prediction of mutagenicity using the commercial system KnowItAll [18]. The system discussed by Boyer [9] earlier in the meeting is a good example of how consensus models can be tailored to a risk assessment scenario.

Consensus can improve models, confirm predictions in expert systems, and provide greater confidence in predictions; it accounts for outliers and has regulatory significance and proven use. There are, however, disadvantages. Consensus models hide outliers, incorrect data and interesting parts of the data set. They lack portability, transparency and mechanistic interpretation. It is not clear how to characterize and develop a QSAR Model Reporting Format (QMRF) for compliance with the REACH regulations, for example. Other concerns are defining the applicability domain, statistical issues, how to carry out validation, cost, and difficulty in use.

The initial stage in an ITS is in silico assessment and this may include consensus QSAR. If there is insufficient confidence in the in silico assessment, in chemico assessment, bringing in reactivity data, can be used. From stage to stage in an ITS, more and more information is gathered. In vitro assessment follows in chemico assessment (although it is not clear whether this will be acceptable under the REACH regulations), and only after all the other methods have been used is in vivo assessment necessary. A special supplement [19] to Alternatives to Laboratory Animals deals with the development of ITS for REACH. Integrated testing strategies will reduce animal usage by providing frameworks to use non-test data, but the European Chemicals Agency supplies only guidance [20]; it does not deal with all the requirements (weighting factors, costs, probabilistic techniques for decision making, tools and case studies) for making an ITS functional. Expertise is required for success.

Ensemble models are controversial. Ann Richard pointed out that you cannot just use eight models some of which are awful: you must apply some judgment. Douglas Hawkins added that you will not achieve much from consensus of good models; consensus adds improvement if you have several weak methods. You should not mix good and bad. Bobby Glen is suspicious of mixing models. Mark Cronin says that if your base model is poor it may tell you something about your data set. Bobby thought that it might be better to model the process; he referred to phenomenological models. Mark Cronin is currently looking at reactivity, specifically at modeling glutathione activity.


QSAR approaches, models and statistics relating to toxicity prediction

Douglas M. Hawkins, University of Minnesota

 

(Jessica J. Kraker, University of Wisconsin, was a co-author.) Consider a set of dependent measures, Y, and predictors, X. The dependent measures may be binary (e.g., toxic or non-toxic) or numeric. The range of possible predictors is almost unlimited: topological descriptors, atom pairs, Burden numbers, etc. QSAR modeling relates Y to X. Models can be broadly categorized as global, where a single X:Y relationship is used for the whole data set, or local, where different models are used in different parts of predictor space. Note that some people may use a slightly different definition and equate local with congeneric.

 

Global models may arise from linear methods (ordinary regression, ridge regression, least angle regression, lasso, elastic net, partial least squares, principal component regression, logistic regression) or nonlinear methods (primarily neural nets). Local models are derived using k nearest neighbors, kernel methods, Support Vector Machines (SVM) or tree models. Global methods are good when the assumed global relationship actually holds, but potentially disastrous when it does not. Local methods are arguably conservative and safe, but they are of lower statistical efficiency, squeezing less information from the data.

 

In global, additive models a predictor set X is written as x1, x2, … xp, there are n compounds for the fitting, and the model determines suitable functions hj and predicts Y on the basis of

$$\hat{Y} = h_1(x_1) + h_2(x_2) + \dots + h_p(x_p)$$

Neural nets take this form. Usually functions h are monotonic, so they require that “more x is better” or “more x is worse”. Additive models assume no interaction between predictors. Linear methods further specialize the additive model to the form

$$\hat{Y} = b_1 x_1 + b_2 x_2 + \dots + b_p x_p = X^{T}b$$

With feature selection, some coefficients are set to zero. The prediction is not necessarily $X^{T}b$ itself; a link function $g(X^{T}b)$ could be needed if the relationship is a curve rather than a straight line.
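As a small illustration of a link function applied to the linear predictor, the sketch below uses a logistic curve for g. That choice is an assumption made for the example (the talk did not name a specific g), and the coefficient values are hypothetical. In generalized linear model terminology this g is usually called the inverse link.

```python
import math


def linear_predictor(x, b):
    """Compute the linear combination b1*x1 + ... + bp*xp."""
    if len(x) != len(b):
        raise ValueError("x and b must have the same length")
    return sum(xj * bj for xj, bj in zip(x, b))


def g_logistic(eta):
    """Map a linear predictor onto (0, 1) with a logistic curve.

    Written in a numerically stable form so that large negative
    values of eta do not overflow math.exp.
    """
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


# Hypothetical coefficients; the second is zero, as it would be for a
# descriptor dropped by feature selection.
b = [0.5, 0.0, -0.25]
x = [1.0, 7.0, 2.0]

eta = linear_predictor(x, b)      # 0.5*1 + 0*7 - 0.25*2
probability = g_logistic(eta)     # value between 0 and 1
```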

 

The linear regression family is intended for numeric dependents, but it also works for a binary classification (by regression formulation of linear discriminant analysis (LDA)). The traditional method is ordinary least squares (OLS), but this requires n to be much greater than p, so it is not useful in many QSAR applications. If n is less than p, variants to get round under-determination are ridge regression, least angle regression, lasso, and elastic net. The last three methods can also do feature selection. OLS often works better than it has a right to.
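To illustrate how a penalized variant gets round under-determination, here is a minimal ridge regression sketch using NumPy. The data are synthetic, the penalty value `lam` is arbitrary, and the closed-form solution shown is just one standard way to compute ridge coefficients; it is not taken from the talk.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Fit ridge regression; returns (intercept, coefficients).

    The data are centred so the intercept is not penalised. Adding
    lam * I makes the p x p system solvable even when n < p.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    p = X.shape[1]
    b = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    b0 = y_mean - x_mean @ b
    return b0, b


def ridge_predict(X, b0, b):
    """Predict with a fitted ridge model."""
    return b0 + np.asarray(X, dtype=float) @ b


# Synthetic example with fewer compounds (n) than descriptors (p),
# where ordinary least squares has no unique solution.
rng = np.random.default_rng(0)
n, p = 10, 30
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)

b0, b = ridge_fit(X, y, lam=1.0)
fitted = ridge_predict(X, b0, b)
```

Ridge shrinks all coefficients but does not set any exactly to zero, which is consistent with the remark that only the other three variants also perform feature selection.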

 

Two other linear methods are Partial Least Squares (PLS) and Principal Component Regression (PCR). PLS is computationally fast and empirically performs well, though formal statistical proofs of good properties are sparse. PCR relies on the assumption that a few latent dimensions drive both X and Y. Its performance is spotty and it is probably safe to ignore it.

 

At first glance, kernel Support Vector Machines are linear regressions applied to transforms of x. In practice though, SVM is a local method. It rests on the choice of a kernel function, and its effectiveness depends on how well this is selected. A kernel regression method predicts the Y at some future X as a weighted average of all Yi in the calibration data, with weights that depend on the distance between X and Xi. The quality of the results depends on the weighting function. Any prediction in principle requires the full calibration data set, so this method does not scale well.
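The kernel regression prediction described above can be sketched as follows. The Gaussian weighting function and the `bandwidth` parameter are assumptions chosen for illustration (this form is often called a Nadaraya-Watson estimator); the talk did not specify a particular weighting function.

```python
import math


def gaussian_weight(distance, bandwidth):
    """Weight that decays smoothly as distance grows."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def kernel_regression_predict(x_new, X_cal, y_cal, bandwidth=1.0):
    """Predict Y at x_new as a distance-weighted average of all y_cal.

    Every calibration compound contributes, which is why the whole
    calibration set must be kept available at prediction time.
    """
    weights = [
        gaussian_weight(math.dist(x_new, xi), bandwidth) for xi in X_cal
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("x_new is too far from all calibration points")
    return sum(w * yi for w, yi in zip(weights, y_cal)) / total


# Tiny hypothetical calibration set with one descriptor per compound.
X_cal = [[0.0], [1.0]]
y_cal = [0.0, 2.0]

midpoint = kernel_regression_predict([0.5], X_cal, y_cal)
near_first = kernel_regression_predict([0.0], X_cal, y_cal, bandwidth=0.1)
```

With a narrow bandwidth the prediction is dominated by the nearest calibration compound, which shows how strongly the result depends on the weighting function.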

 

Nearest neighbor is a cousin of the kernel regression methods. To predict at X, it finds the k calibration cases closest to X and uses these cases’ Y values to predict Y by the average if Y is numeric, or the modal class if Y is a classification. kNN has some drawbacks. Prediction requires the full data set. A distance metric is needed. Conventionally Euclidean distance is used, ignoring the impact of correlation among predictors. Scaling is also a concern.
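A minimal sketch of the k nearest neighbor prediction just described, using plain Euclidean distance on unscaled descriptors. The calibration data and labels are hypothetical, and no scaling or correlation adjustment is applied, so the drawbacks noted above apply to this sketch as well.

```python
import math
from collections import Counter


def knn_predict(x_new, X_cal, y_cal, k=3, classify=False):
    """Predict from the k calibration cases closest to x_new.

    Returns the mean of the neighbours' Y values for a numeric Y,
    or the most common class when classify is True.
    """
    if not 1 <= k <= len(X_cal):
        raise ValueError("k must be between 1 and the calibration size")
    order = sorted(
        range(len(X_cal)), key=lambda i: math.dist(x_new, X_cal[i])
    )
    neighbours = [y_cal[i] for i in order[:k]]
    if classify:
        return Counter(neighbours).most_common(1)[0][0]
    return sum(neighbours) / len(neighbours)


# Hypothetical calibration set with two descriptors per compound.
X_cal = [[0, 0], [0, 1], [1, 0], [5, 5], [6, 5]]
y_numeric = [1.0, 2.0, 3.0, 10.0, 12.0]
y_class = ["non-toxic", "non-toxic", "toxic", "toxic", "toxic"]

numeric_pred = knn_predict([0.1, 0.1], X_cal, y_numeric, k=3)
class_pred = knn_predict([0.1, 0.1], X_cal, y_class, k=3, classify=True)
```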

 

Recursive partitioning (RP) produces a tree model. This has minimal statistical assumptions and making predictions is easy but one drawback is that RP needs big samples. Random forests improve on single trees, squeezing more information out of data. The goal of feature selection is to pick the predictors that matter, and eliminate redundant ones. It is vital in drug discovery but may be less so in toxicology. Some linear regression models can do feature selection, and RP relies on it. Feature selection is harder with the other methods.

 

Models must be validated. If many compounds are available, a learning set and a validation set can be split out. All model building is done on the learning set and testing is done on the validation set. If only a moderate number of compounds is available then it is advisable to use cross validation instead.
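Cross validation can be sketched as below. The helper names, the fold count, and the trivial "predict the training mean" model are all hypothetical and serve only to show the mechanics: each compound is predicted by a model that was fitted without it.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_mse(X, y, fit, predict, k=5, seed=0):
    """Mean squared error over held-out predictions.

    `fit(X_train, y_train)` returns a model and
    `predict(model, x)` returns a prediction for one compound.
    All model building happens inside the loop, on training data only.
    """
    sq_errors = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            sq_errors.append((predict(model, X[i]) - y[i]) ** 2)
    return sum(sq_errors) / len(sq_errors)


# Placeholder model for illustration: always predict the training mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    return model


X = [[float(i)] for i in range(20)]
y = [3.0] * 20
mse = cross_validated_mse(X, y, fit_mean, predict_mean, k=5)
```

Any feature selection or parameter tuning would also need to sit inside `fit`, so that the held-out compounds play no part in building the model that predicts them.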

 

Hawkins presented two examples from his own work on two data sets using about 300 mainly topological descriptors. The first data set was a mutagenicity one: 508 compounds with binary data from the CRC Handbook of Identified Carcinogens and Non-Carcinogens. The second was the Crebelli data set of 55 halocarbons assessed for D37 toxicity (with a numeric objective). The models were evaluated by cross validation. For the CRC Handbook data set there were appreciable differences in method capability. Random Forest was best, in line with the notion that “ensemble” methods work well. PLSLDA was second best. SVM was worse than random. For the halocarbons, elastic net was best, with RP methods a little behind.

 

Hawkins also summarized some results reported by Young and Hughes-Oliver at the 2008 Spring National ACS Meeting (to be published in Cheminformatics). For 57,821 compounds tested in cathepsin L, these authors found that Random Forest and atom pairs were a good choice. These are only three examples. Sometimes global methods win; sometimes local ones. It depends on the descriptors and the dependent measures. The lesson is perhaps not to be wedded to a single QSAR methodology. From the audience, Stephen Pickett commented that in other QSAR areas SVMs have been shown to perform comparably to the other methods employed here [21]. SVM should not be used without feature selection.

