New Horizons in Toxicity Prediction.
Lhasa Limited Symposium Event in Collaboration with
the University of Cambridge - February 2009
A Report by Wendy A. Warr, wendy@warr.com, http://www.warr.com
Consensus QSAR models
Mark Cronin, Liverpool John Moores University
Integrated testing strategies (ITS) can involve compilation of in
silico predictions from the same or similar techniques, such as
regression-based models; compilation of in silico predictions from
different techniques, weighting or averaging predictions; and
compilation of in silico predictions with in vitro and in chemico data.
Consensus modeling of regression-based QSARs for large, heterogeneous
data sets requires large groups of physicochemical descriptors and/or
properties and a method to select them. A pool of models is created
(usually regression models but neural networks can also be used) and
the best (statistically) QSARs and/or most diverse QSARs are
determined. Predictions are weighted or averaged. The method often
performs better than a single QSAR [16]. However, one study by Cronin’s
own team shows that the use of consensus models does not seem warranted
given the minimal improvement in model statistics for the data sets in
question [17].
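The weighting-or-averaging step described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the cited studies: the predictions and error values are hypothetical, and weighting each model by the inverse of its cross-validated error is an assumed scheme chosen only for the example.

```python
def consensus_prediction(predictions, weights=None):
    """Combine per-model predictions for one compound.

    predictions: numeric predictions, one per model in the pool.
    weights: optional non-negative weights (same length);
             None gives a plain average.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("weights and predictions must have equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, predictions)) / total


def inverse_error_weights(cv_errors):
    """Assumed scheme: weight each model by 1 / (cross-validated error)."""
    return [1.0 / e for e in cv_errors]


# Hypothetical predictions for one compound from three regression QSARs.
preds = [2.0, 3.0, 4.0]
plain = consensus_prediction(preds)
weighted = consensus_prediction(preds, inverse_error_weights([0.5, 1.0, 1.0]))
```

With these made-up numbers the first model has half the error of the others, so it receives twice the weight and pulls the weighted consensus toward its own prediction.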
Consensus QSAR has also been applied to models developed from different
techniques and different data sets. Matthews et al. used MC4PC,
MDL-QSAR, BioEpisteme, Leadscope PDM, and Derek for Windows, with the
same data sets, to predict carcinogenicity [2]. The QSAR models were
based upon a weight-of-evidence paradigm, which has a bigger “cost”
than weighting or averaging. The individual models
made complementary predictions of carcinogenesis and had equivalent
predictive performance. Consensus predictions for two programs achieved
better performance, better confidence predictions, and better
sensitivity. Four QSAR programs predicted carcinogenicity with high
specificity (85%). Consensus positive predictions identified clusters
of carcinogens with reasonable mechanisms of action. Consensus models
from three different expert systems have also been used with some
success in prediction of mutagenicity using the commercial system
KnowItAll [18]. The system discussed by Boyer [9] earlier in the
meeting is a good example of how consensus models can be tailored to a
risk assessment scenario.
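For classification endpoints such as carcinogenicity, a consensus call and its sensitivity and specificity can be sketched as below. The rule used here (positive when at least two programs agree) is a simple voting rule assumed for illustration; it is not the weight-of-evidence scheme of Matthews et al., and all calls and outcomes are invented.

```python
def consensus_call(calls, min_positive=2):
    """Return True (positive) if at least `min_positive` programs say positive."""
    return sum(1 for c in calls if c) >= min_positive


def sensitivity_specificity(predicted, observed):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Assumes the observed data contain at least one positive and one negative.
    """
    pairs = list(zip(predicted, observed))
    tp = sum(1 for p, o in pairs if p and o)
    tn = sum(1 for p, o in pairs if not p and not o)
    fn = sum(1 for p, o in pairs if not p and o)
    fp = sum(1 for p, o in pairs if p and not o)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical calls from three programs (columns) for five compounds (rows).
program_calls = [
    (True, True, False),
    (True, False, False),
    (False, False, False),
    (True, True, True),
    (False, True, False),
]
observed = [True, True, False, True, False]  # made-up experimental outcomes

consensus = [consensus_call(row) for row in program_calls]
sens, spec = sensitivity_specificity(consensus, observed)
```

Raising `min_positive` makes the consensus stricter, which in general trades sensitivity for specificity.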
Consensus can improve models, confirm predictions in expert systems,
provide greater confidence in predictions, and account for outliers; it
also has regulatory significance and proven use. There are,
however, disadvantages. Consensus models hide outliers, incorrect data
and interesting parts of the data set. They lack portability,
transparency and mechanistic interpretation. It is not clear how to
characterize and develop a QSAR Model Reporting Format (QMRF) for
compliance with the REACH regulations, for example. Other concerns are
how to define the applicability domain, statistical issues, how to
carry out validation, cost, and difficulty in use.
The initial stage in an ITS is in silico assessment and this may
include consensus QSAR. If there is insufficient confidence in the in
silico assessment, in chemico assessment, bringing in reactivity data,
can be used. From stage to stage in an ITS, more and more information
is gathered. In vitro assessment follows in chemico (although it is not
clear whether this will be acceptable under the REACH regulations) and
only after all the other methods have been used is in vivo assessment
necessary. A special supplement [19] to Alternatives to Laboratory
Animals deals with the development of ITS for REACH. Integrated testing
strategies will reduce animal usage by providing frameworks to use
non-test data, but the European Chemicals Agency supplies only
guidance [20]; it does not deal with all the requirements (weighting
factors, costs, probabilistic techniques for decision making, tools and
case studies) for making an ITS functional. Expertise is required for
success.
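The staged logic of an ITS can be sketched as a walk through the tiers that stops at the first sufficiently confident result. Everything specific here is an assumption made for illustration: the numeric confidence scores, the 0.8 threshold, and the function and variable names are hypothetical, and the guidance discussed above does not define such values.

```python
TIERS = ["in silico", "in chemico", "in vitro", "in vivo"]


def run_its(assessments, threshold=0.8):
    """Walk the tiers in order and stop at the first confident result.

    assessments: dict mapping tier name -> (conclusion, confidence in [0, 1]).
    Returns (tier_used, conclusion, tiers_visited).
    """
    visited = []
    for tier in TIERS:
        if tier not in assessments:
            continue  # no data available for this tier
        visited.append(tier)
        conclusion, confidence = assessments[tier]
        # In vivo is treated as the final arbiter in this sketch.
        if confidence >= threshold or tier == "in vivo":
            return tier, conclusion, visited
    raise ValueError("no tier produced a usable assessment")


# Hypothetical compound: the in silico result is not confident enough,
# so reactivity (in chemico) data are brought in and settle the question.
example = {
    "in silico": ("sensitizer", 0.55),
    "in chemico": ("sensitizer", 0.85),
    "in vitro": ("sensitizer", 0.90),
}
tier, conclusion, visited = run_its(example)
```

In this example the in vitro tier is never consulted, and no animal test is needed.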
Ensemble models are controversial. Ann Richard pointed out that you
cannot just use eight models some of which are awful: you must apply
some judgment. Douglas Hawkins added that you will not achieve much
from consensus of good models; consensus adds improvement if you have
several weak methods. You should not mix good and bad. Bobby Glen is
suspicious of mixing models. Mark Cronin says that if your base model
is poor it may tell you something about your data set. Bobby thought
that it might be better to model the process; he referred to
phenomenological models. Mark Cronin is currently looking at
reactivity, specifically at modeling glutathione activity.
QSAR approaches, models and statistics relating to toxicity prediction
Douglas M. Hawkins, University of Minnesota
(Jessica J. Kraker, University of Wisconsin, was a co-author.) Consider
a set of dependent measures, Y, and predictors, X. The dependent
measures may be binary (e.g., toxic or non-toxic) or numeric. The
predictors are almost unlimited: topological descriptors, atom pairs,
Burden numbers, etc. QSAR modeling relates Y to X. Models can be broadly
categorized as global, where a single X:Y relationship is used for the
whole data set, or local, where different models are used in different
parts of predictor space. Note that some people may use a slightly
different definition and equate local with congeneric.
Global models may arise from linear methods (ordinary regression, ridge
regression, least angle regression, lasso, elastic net, partial least
squares, principal component regression, logistic regression) or
nonlinear methods (primarily neural nets). Local models are derived
using k nearest neighbors, kernel methods, Support Vector Machines
(SVM) or tree models. Global methods are good when the assumed global
relationship holds true, but potentially disastrous when it does not.
Local methods are arguably conservative and safe, but they are of lower
statistical efficiency, squeezing less information from the data.
In global, additive models a predictor set X is written as
$x_1, x_2, \dots, x_p$, there are n compounds for the fitting, and the
model determines suitable functions $h_j$ and predicts Y on the basis of

$$\hat{Y} = h_1(x_1) + h_2(x_2) + \dots + h_p(x_p) = \sum_{j=1}^{p} h_j(x_j)$$

Neural nets take this form. Usually the functions $h_j$ are monotonic,
so they require that “more x is better” or “more x is worse”. Additive
models assume no interaction between predictors. Linear methods further
specialize the additive model to the form

$$\hat{Y} = b_1 x_1 + b_2 x_2 + \dots + b_p x_p = X^{T} b$$

In feature selection, some coefficients are set to zero. Prediction is
not necessarily $X^{T}b$; a link function $g(X^{T}b)$ could be needed
if a curve is produced instead of a straight line.
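As a minimal sketch of prediction through a link function, the Python below computes the linear predictor and passes it through a logistic curve, following the report's use of “link function” for g. The logistic form of g is an assumption suited to a binary Y, and the coefficients and descriptor values are hypothetical; the zero coefficients mimic predictors dropped by feature selection.

```python
import math


def linear_predictor(x, b, intercept=0.0):
    """Compute X^T b (plus an optional intercept) for one compound."""
    if len(x) != len(b):
        raise ValueError("x and b must have the same length")
    return intercept + sum(xj * bj for xj, bj in zip(x, b))


def logistic(z):
    """Assumed choice of g: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients; zeros stand for predictors removed by
# feature selection, so x[1] and x[3] have no effect on the prediction.
b = [0.5, 0.0, -1.0, 0.0]
x = [2.0, 7.0, 1.0, 3.0]

z = linear_predictor(x, b)  # 0.5 * 2.0 - 1.0 * 1.0
p_toxic = logistic(z)       # probability of the "toxic" class
```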
The linear regression family is intended for numeric dependents, but it
also works for a binary classification (by regression formulation of
linear discriminant analysis (LDA)). The traditional method is ordinary
least squares (OLS), but this requires n to be much greater than p, so
it is not useful in many QSAR applications. If n is less than p,
variants to get round under-determination are ridge regression, least
angle regression, lasso, and elastic net. The last three methods can
also do feature selection. OLS often works better than it has a right
to.
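For reference, the penalized least-squares criteria usually associated with three of these variants are given below in their standard textbook forms. The tuning parameters $\lambda$, $\lambda_1$ and $\lambda_2$ are notation introduced here, not taken from the talk. Least angle regression is an algorithm rather than a penalty, so it has no criterion of this kind, although it is closely related to the lasso.

```latex
\begin{aligned}
\text{ridge:} \quad
  & \hat{b} = \arg\min_{b} \sum_{i=1}^{n} \bigl(y_i - x_i^{T} b\bigr)^2
    + \lambda \sum_{j=1}^{p} b_j^{2} \\
\text{lasso:} \quad
  & \hat{b} = \arg\min_{b} \sum_{i=1}^{n} \bigl(y_i - x_i^{T} b\bigr)^2
    + \lambda \sum_{j=1}^{p} \lvert b_j \rvert \\
\text{elastic net:} \quad
  & \hat{b} = \arg\min_{b} \sum_{i=1}^{n} \bigl(y_i - x_i^{T} b\bigr)^2
    + \lambda_1 \sum_{j=1}^{p} \lvert b_j \rvert
    + \lambda_2 \sum_{j=1}^{p} b_j^{2}
\end{aligned}
```

The absolute-value penalty can shrink coefficients exactly to zero, which is why the lasso and elastic net perform feature selection, whereas the squared penalty of ridge regression only shrinks them toward zero.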
Two other linear methods are Partial Least Squares (PLS) and Principal
Component Regression (PCR). PLS is computationally fast and empirically
performs well, though formal statistical proofs of good properties are
sparse. PCR relies on the assumption that a few latent dimensions drive
both X and Y. Its performance is spotty and it is probably safe to
ignore it.
At first glance, kernel Support Vector Machines are linear regressions
applied to transforms of x. In practice though, SVM is a local method.
It rests on the choice of a kernel function and its effectiveness
depends on how well this is selected. A kernel regression method
predicts the Y at some future X as a weighted average of all $Y_i$ in
the calibration data, with weights that depend on the distance between
X and $X_i$. The quality of the results depends on the weighting
function. Any prediction in principle requires the full calibration
data set, so this method does not scale well.
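The kernel regression idea described above can be sketched as follows. The Gaussian weighting function and the bandwidth parameter are assumptions chosen for illustration (the talk does not name a specific kernel), and the calibration data are invented.

```python
import math


def gaussian_kernel(distance, bandwidth):
    """Assumed weighting function: weight decays smoothly with distance."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)


def kernel_regression(x_new, X, Y, bandwidth=1.0):
    """Predict Y at x_new as a weighted average of all calibration Y_i."""
    # Every calibration point contributes, which is why the full
    # calibration set is needed for each prediction.
    weights = [gaussian_kernel(math.dist(x_new, xi), bandwidth) for xi in X]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all weights underflowed; try a larger bandwidth")
    return sum(w * y for w, y in zip(weights, Y)) / total


# Hypothetical one-descriptor calibration set.
X = [(0.0,), (1.0,), (2.0,)]
Y = [1.0, 2.0, 3.0]
pred = kernel_regression((1.0,), X, Y)
```

A smaller bandwidth makes the method more local, since distant calibration compounds then receive almost no weight.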
Nearest neighbor is a cousin of the kernel regression methods. To
predict at X, it finds the k calibration cases closest to X and uses
these cases’ Y values to predict Y by the average if Y is numeric, or
the modal class if Y is a classification. kNN has some drawbacks.
Prediction requires the full data set. A distance metric is needed.
Conventionally Euclidean distance is used, ignoring the impact of
correlation among predictors. Scaling is also a concern.
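A minimal k nearest neighbor sketch follows, covering both the numeric case (average) and the classification case (modal class). It uses unscaled Euclidean distance, so it shares the drawbacks just listed; the descriptor values and class labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(x_new, X, Y, k=3, numeric=True):
    """Predict from the k calibration cases closest to x_new."""
    if not 1 <= k <= len(X):
        raise ValueError("k must be between 1 and the number of cases")
    # Rank all calibration cases by Euclidean distance to x_new.
    order = sorted(range(len(X)), key=lambda i: math.dist(x_new, X[i]))
    neighbors = [Y[i] for i in order[:k]]
    if numeric:
        return sum(neighbors) / k                   # average for numeric Y
    return Counter(neighbors).most_common(1)[0][0]  # modal class


# Hypothetical calibration set with two descriptors per compound.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
Y_numeric = [1.0, 2.0, 3.0, 10.0, 12.0]
Y_class = ["non-toxic", "non-toxic", "toxic", "toxic", "toxic"]

query = (0.2, 0.1)
value = knn_predict(query, X, Y_numeric, k=3)
label = knn_predict(query, X, Y_class, k=3, numeric=False)
```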
Recursive partitioning (RP) produces a tree model. This has minimal
statistical assumptions and making predictions is easy but one drawback
is that RP needs big samples. Random forests improve on single trees,
squeezing more information out of data. The goal of feature selection
is to pick the predictors that matter, and eliminate redundant ones. It
is vital in drug discovery but may be less so in toxicology. Some
linear regression models can do feature selection, and RP relies on it.
Feature selection is harder with the other methods.
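To illustrate why recursive partitioning relies on feature selection, the sketch below performs a single partitioning step for a numeric Y: it searches every predictor and every candidate threshold for the split that minimizes the within-group sum of squares. The squared-error criterion is an assumed, common choice rather than one stated in the talk, and the data are invented. A full tree would apply the same search recursively to each resulting group.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, Y):
    """Return (feature_index, threshold, total_sse) of the best single split."""
    best = None
    for j in range(len(X[0])):
        levels = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between adjacent observed values.
        for lo, hi in zip(levels, levels[1:]):
            t = (lo + hi) / 2.0
            left = [y for row, y in zip(X, Y) if row[j] <= t]
            right = [y for row, y in zip(X, Y) if row[j] > t]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


# Hypothetical data: only the first descriptor is related to Y.
X = [(1.0, 5.0), (2.0, 3.0), (3.0, 9.0), (4.0, 1.0)]
Y = [1.0, 1.0, 5.0, 5.0]
split = best_split(X, Y)
```

The search picks the first descriptor and ignores the second, which is the implicit feature selection that a tree performs at every node.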
Models must be validated. If many compounds are available, a learning
set and a validation set can be split out. All model building is done
on the learning set and testing is done on the validation set. If only
a moderate number of compounds is available then it is advisable to use
cross-validation instead.
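A k-fold cross-validation loop can be sketched as below. The number of folds, the random seed, the squared-error score and the trivial mean-only model are all assumptions made to keep the example self-contained; any model can be plugged in through the hypothetical `fit` and `predict` callables.

```python
import random


def k_fold_indices(n, k=5, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X, Y, fit, predict, k=5, seed=0):
    """Return the mean squared prediction error over held-out folds."""
    squared_errors = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # All model building uses the training part only.
        model = fit([X[i] for i in train], [Y[i] for i in train])
        for i in held_out:
            squared_errors.append((predict(model, X[i]) - Y[i]) ** 2)
    return sum(squared_errors) / len(squared_errors)


def fit_mean(X_train, Y_train):
    """Trivial placeholder model: remember the mean of the training Y."""
    return sum(Y_train) / len(Y_train)


def predict_mean(model, x):
    """Predict the stored mean regardless of x."""
    return model


# Hypothetical data with a constant response, so the error is zero.
X = [(float(i),) for i in range(10)]
Y = [2.0] * 10
folds = k_fold_indices(len(X), k=3)
cv_mse = cross_validate(X, Y, fit_mean, predict_mean, k=5)
```

Any feature selection or parameter tuning belongs inside `fit`, so that the held-out compounds play no part in model building.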
Hawkins presented two examples from his own work, each using about 300
mainly topological descriptors. The first data set was a mutagenicity
one: 508 compounds with binary data from the CRC Handbook of Identified
Carcinogens and Non-Carcinogens. The second was the Crebelli data set
of 55 halocarbons assessed for D37 toxicity (with a numeric objective).
The models were evaluated by cross-validation. For the CRC Handbook
data set there were appreciable differences in method capability.
Random Forest was best, in line with the notion that “ensemble” methods
work well. PLSLDA was second best. SVM was worse than random. For the
halocarbons, elastic net was best, with RP methods a little behind.
Hawkins also summarized some results reported by Young and
Hughes-Oliver at the 2008 Spring National ACS Meeting (to be published
in Cheminformatics). For 57,821 compounds tested in cathepsin L, these
authors found that Random Forest and atom pairs were a good choice.
These are only three examples. Sometimes global methods win; sometimes
local ones. It depends on the descriptors and the dependent measures.
The lesson is perhaps not to be wedded to a single QSAR methodology.
From the audience, Stephen Pickett commented that in other QSAR areas
SVMs have been shown to perform comparably to the other methods
employed here [21]. SVM should not be used without feature selection.