Expectations from a good QSAR tool in
Drug Discovery Applications, December 2008
At every stage of the drug discovery pipeline, the application of QSAR
is evidently beneficial, yet its reliability in its current state is
limited. In this commentary, the various applications of QSAR are
reviewed with respect to the drug discovery stages of compound library
design, virtual screening, and lead optimization. The features
required, performance expectations, and design constraints for an
effective QSAR application vary significantly for each drug discovery
stage; however, certain requirements are common across all stages.
A QSAR software application could comprise three functionally
distinct modules: (i) ‘Model Building Software Tools’, (ii) ‘QSAR
Models’, and (iii) ‘Model Deployment and Prediction Systems’.
(i) Model Building Software Tools:
A QSAR model building software toolset is expected to handle common
molecular structure format representations, perform structure
optimizations, compute or import descriptors and property values for
the input compounds, and provide a set of machine learning algorithms
for building QSAR models, as well as methods to validate them.
While QSAR modelers use a collection of statistical and computational
chemistry software tools to achieve the above functions, very
sophisticated specialty QSAR modeling software products are now
available. These software products provide a broad selection of
features useful at all stages of model building. By implementing
current best practices, intelligent wizards, and guided modeling
workflows, these products enable modelers of all skill levels to build
good models of their data. Building the best models of any data is
possible only by running the data through a wide range of statistical
methods and model building algorithms over a wide range of parameter
sets, and by applying a variety of methods to validate the models and
assess their robustness over their intended ranges. Further, QSAR
modeling software provides a rich interactive graphical interface for
visual examination of data and results at all stages.
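The idea of sweeping a model building algorithm over a range of
parameter settings and validating each candidate can be sketched in a
few lines. The example below is only an illustration under stated
assumptions: it uses k-nearest-neighbor regression as a stand-in for
whatever algorithms a real package offers, leave-one-out
cross-validation as the validation method, and a made-up toy data set.
The names knn_predict, loo_rmse, and select_k are hypothetical and do
not come from any QSAR product.

```python
import math

def knn_predict(train, query, k):
    """Predict the mean activity of the k training compounds nearest to query."""
    # train is a list of (descriptor_vector, activity) pairs.
    ranked = sorted(train, key=lambda row: math.dist(row[0], query))
    nearest = ranked[:k]
    return sum(activity for _, activity in nearest) / len(nearest)

def loo_rmse(data, k):
    """Leave-one-out RMSE: predict each compound from all the others."""
    squared_errors = []
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        squared_errors.append((knn_predict(rest, x, k) - y) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def select_k(data, candidates):
    """Return (best_k, scores), where scores maps each k to its LOO RMSE."""
    scores = {k: loo_rmse(data, k) for k in candidates}
    best_k = min(scores, key=scores.get)
    return best_k, scores

# Toy data (assumption): two made-up descriptors and a made-up activity.
toy = [
    ((0.0, 0.0), 1.0),
    ((0.1, 0.0), 1.1),
    ((1.0, 1.0), 3.0),
    ((1.1, 1.0), 3.1),
    ((2.0, 2.0), 5.0),
    ((2.1, 2.0), 5.1),
]
best_k, scores = select_k(toy, candidates=[1, 2, 3])
print(best_k, scores)
```

In practice the same loop would run over several algorithms and
descriptor sets, and an external test set would normally complement
the cross-validated score.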
(ii) QSAR Models:
QSAR models can be categorized in a few ways. Depending on the type of
end-point they are meant to predict, models can be activity, ADME, or
toxicity models. Models are either global or local; local models are
designed to predict over a small chemical space, such as a
target-focused library, a therapeutic class, or a certain range of
end-point values, while global models are expected to cover a wider
range of chemical space.
There are several ‘pre-built’ QSAR models for ADME and toxicity
predictions available commercially and in the public domain. Most of
the pre-built models available are ‘black boxes’, with little
information about the applicability domain and the prediction
confidence metrics available to the users of the models. Some model
providers, though, supply abundant information about the models, such
as the training compounds, the range and distribution of end-point
values used, the descriptor features used in building the model, the
algorithms and parameter settings employed, and so on. When the
training data set is packaged with the pre-built models, modelers can
“localize” the models by sub-setting the training data, or “globalize”
them by adding new or in-house data, and then retrain them.
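As a rough illustration of localizing and globalizing, the sketch
below retrains a deliberately simple model (a one-descriptor
least-squares line) on a subset of a packaged training set and on the
packaged set merged with in-house data. Everything here is an
assumption made for the example: the record fields "x", "y", and
"series", the toy values, and the function names fit_line, localize,
and globalize. A real model would use many descriptors and a more
capable algorithm.

```python
def fit_line(rows):
    """Ordinary least squares for activity = slope * descriptor + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def localize(training_set, series):
    """Retrain on only the training compounds that belong to one series."""
    subset = [(r["x"], r["y"]) for r in training_set if r["series"] == series]
    return fit_line(subset)

def globalize(training_set, in_house):
    """Retrain on the packaged training set plus new in-house data."""
    merged = training_set + in_house
    return fit_line([(r["x"], r["y"]) for r in merged])

# Packaged training set (assumption): two chemical series with opposite trends.
packaged = [
    {"x": 1.0, "y": 2.0, "series": "A"},
    {"x": 2.0, "y": 4.0, "series": "A"},
    {"x": 3.0, "y": 6.0, "series": "A"},
    {"x": 1.0, "y": 9.0, "series": "B"},
    {"x": 2.0, "y": 8.0, "series": "B"},
    {"x": 3.0, "y": 7.0, "series": "B"},
]
# One new in-house measurement (assumption).
in_house = [{"x": 4.0, "y": 8.0, "series": "A"}]

local_slope, local_intercept = localize(packaged, "A")
global_slope, global_intercept = globalize(packaged, in_house)
print(local_slope, local_intercept)
print(global_slope, global_intercept)
```

The local fit recovers the trend of series A alone, whereas the
retrained global fit averages over both series, which is the trade-off
between local and global models described above.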
(iii) Model Deployment and Prediction Systems:
Information that allows users of models to attach confidence to the
predictions, such as the similarity of input compounds to the model
training compounds in the chemical and descriptor spaces, is an
essential aspect of an effective software system through which models
are deployed to users. The model deployment system should allow users
to visually examine the effect of variations in the compounds, such as
R-group enumerations, on the predictions. More often than not, the
users of models are less sophisticated users of computer programs than
the modelers, so greater attention to ease of use and intuitiveness is
essential in designing model deployment and prediction systems.
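One common way to express similarity to the training compounds is the
Tanimoto coefficient between structural fingerprints. The sketch below
attaches the nearest-neighbor similarity to each prediction and flags
compounds that fall below a cutoff. It rests on several assumptions:
fingerprints are represented as plain sets of "on" bit indices, the
0.5 cutoff is arbitrary, and the function prediction_with_confidence
is a hypothetical name rather than part of any deployment system.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0  # Convention chosen here for two empty fingerprints.
    return len(a & b) / len(a | b)

def prediction_with_confidence(query_bits, training_bits, predicted,
                               threshold=0.5):
    """Attach the nearest training-set similarity to a model prediction."""
    nearest = max(tanimoto(query_bits, t) for t in training_bits)
    return {
        "prediction": predicted,
        "nearest_similarity": nearest,
        # The threshold is an arbitrary illustrative cutoff, not a standard.
        "in_domain": nearest >= threshold,
    }

# Toy fingerprints (assumption): sets of made-up bit indices.
training = [{1, 2, 3, 4}, {2, 3, 5}]
close = prediction_with_confidence({1, 2, 3}, training, predicted=6.2)
far = prediction_with_confidence({7, 8, 9}, training, predicted=6.2)
print(close)
print(far)
```

A deployment system could show the similarity value next to every
prediction so that users see at a glance when a compound lies outside
the chemical space the model was trained on.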
QSAR models are commonly built and “thrown over the wall” to users.
Little attention is paid to collecting information on the performance
and usage of these models. Such feedback is vital for model builders
to continually improve the predictive performance of the models. It
also allows organizations to assess the value that QSAR applications
add to their research efficiency. An effective model deployment system
should focus on keeping the models updated. New data, especially data
on compounds for which decisions were made upstream based on QSAR
model predictions, should be used to tune and improve the models as
and when it becomes available from the labs.
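The feedback loop described here can be sketched as a small log that
pairs each stored prediction with the measured value once it arrives
from the lab, and reports when the observed error suggests that the
model should be retrained. This is a minimal sketch under assumptions:
the class PredictionLog, its method names, the compound identifiers,
and the error limits are all hypothetical.

```python
import math

class PredictionLog:
    """Pairs model predictions with later lab measurements."""

    def __init__(self):
        self._predicted = {}  # compound id -> predicted value
        self._pairs = []      # (predicted, measured) for measured compounds

    def record_prediction(self, compound_id, value):
        self._predicted[compound_id] = value

    def record_measurement(self, compound_id, value):
        # Only compounds that were actually predicted contribute feedback.
        if compound_id in self._predicted:
            self._pairs.append((self._predicted[compound_id], value))

    def rmse(self):
        """RMSE over all predicted-and-measured compounds, or None if empty."""
        if not self._pairs:
            return None
        total = sum((p - m) ** 2 for p, m in self._pairs)
        return math.sqrt(total / len(self._pairs))

    def needs_retraining(self, limit):
        """True when the observed error exceeds the chosen limit."""
        error = self.rmse()
        return error is not None and error > limit

log = PredictionLog()
log.record_prediction("cmpd-1", 5.0)
log.record_prediction("cmpd-2", 6.0)
log.record_measurement("cmpd-1", 5.0)
log.record_measurement("cmpd-2", 8.0)
log.record_measurement("cmpd-3", 7.0)  # never predicted, so ignored
print(log.rmse(), log.needs_retraining(limit=1.0))
```

The measured pairs collected this way are the same data that would be
added to the training set when the model is next updated.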