Understanding toxicity from predictive data mining
Chihae Yang, ORISE fellow, US FDA Center for Food Safety and Applied
Nutrition (CFSAN)
US FDA CFSAN has initiated a project to develop a knowledge base for
food additives and ways to implement computational risk assessment
methods within the workflow of reviewers. One of the goals of this
project is to develop the center’s knowledge base, which will be
disseminated through structural categories and predictive models.
Predictive data mining is the process that is being used at US FDA
CFSAN to build these components of the knowledge base.
The currently available SAR paradigm is associated with a couple of
inherent issues. First, when linking chemistry and biology, there is an
inadequate description of biology per chemical feature: complex biology
is compressed into a highly summarized outcome, while the chemistry
domain is extremely diverse and sparse. Chemical diversity is of the
order of 10^59 while biological diversity is of the order of 10^9 and is
very highly summarized (0 or 1, or an LC50 value). Most of the time, the
training sets of the models suffer from this issue and the data mining
process has been a black box. Second, most of the QSAR models are
“global” and data-driven while the original SAR paradigm is defined
only within a mechanistic domain. Hence the issue of “global” versus
“mechanistic” models requires a clear definition of the applicability
domain, where valid ranges of
independent variables must be established. We must make both the data
mining process of the training data set and the limitations of the
knowledge base transparent and sustainable.
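The idea of establishing valid ranges of independent variables can be illustrated with a minimal sketch. It assumes the simplest possible applicability domain, a per-descriptor minimum and maximum taken from the training set; the descriptor names (logP, mol_weight) and all values are hypothetical and are not taken from the FDA CFSAN knowledge base.

```python
from typing import Dict, List, Tuple

Ranges = Dict[str, Tuple[float, float]]


def fit_ranges(training: List[Dict[str, float]]) -> Ranges:
    """Record the minimum and maximum of each descriptor in the training set."""
    ranges: Ranges = {}
    for row in training:
        for name, value in row.items():
            lo, hi = ranges.get(name, (value, value))
            ranges[name] = (min(lo, value), max(hi, value))
    return ranges


def out_of_domain(query: Dict[str, float], ranges: Ranges) -> List[str]:
    """Return the descriptors whose value is unseen or outside the training range."""
    return [
        name
        for name, value in query.items()
        if name not in ranges or not (ranges[name][0] <= value <= ranges[name][1])
    ]


# Hypothetical training descriptors (illustrative values only).
training = [
    {"logP": 1.2, "mol_weight": 180.0},
    {"logP": 3.5, "mol_weight": 320.0},
    {"logP": -0.4, "mol_weight": 95.0},
]
ranges = fit_ranges(training)

inside = out_of_domain({"logP": 2.0, "mol_weight": 200.0}, ranges)
outside = out_of_domain({"logP": 5.1, "mol_weight": 200.0}, ranges)
print(inside, outside)
```

A prediction for a query with a non-empty list would be flagged as lying outside the domain. Real applicability domains are often defined with distance or density measures rather than simple ranges; the range check is only the most transparent starting point.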
The predictive data mining process begins with data preparation, which
is followed by data mining and analysis, then knowledge base
development of structural rules and prediction models, and then
applying and disseminating knowledge. There are two types of learning
goal: (1) identifying relationships between structural classes and
various toxicity endpoints so that an intelligent testing decision can
be made (“what do I make next?”); and (2) developing structural
alerts and QSAR models to assist safety decisions.
Data sources included the Leadscope-FDA genetic and carcinogenic
toxicity databases (constructed according to the criteria of the ToxML
standard) [15] and biological assay data from the first National
Toxicology Program High Throughput Screening (NTPHTS) campaign. The
toxicity endpoints considered in this study are genetic toxicity
(salmonella, mouse lymphoma, in vitro chromosome aberration and in vivo
micronucleus) and rodent carcinogenicity (mouse and rat). The sources
of the Leadscope genetic toxicity and carcinogenicity databases include
NTP, CCRIS, Tokyo-Eiken (Tokyo Metropolitan Institute of Public Health
Epidemiological Information Office), US FDA, and primary publications.
The NTPHTS campaign data are from an HTS project which NTP has
initiated to explore new approaches to evaluating chemicals across a
spectrum of high-throughput biological assays. Assays are being
selected based on their potential to be informative of animal bioassay
results and relevant to human health risk assessments. As an initial
phase of this project, NTP has provided a set of 1,408 chemicals from
NTP inventories for HTS in bioassays relevant to toxicology, to the NIH
Chemical Genomics Center (NCGC), part of the NIH Molecular Libraries
and Imaging Roadmap (MLR) initiative. Assays are described and assay
results reported in PubChem for this NTPHTS chemical data set in the
same manner as for compounds from the Molecular Libraries Small
Molecule Repository. The DSSTox project [11] is collaborating with the
NTPHTS project to provide structure annotation and cheminformatics
support for this effort. Drawing largely from the contents of the
existing NTP Bioassay Online Indicator (BSI) Structure-Index Locator
File, the DSSTox NTPHTS Structure-Index File provides the full
complement of DSSTox standard chemical fields for the NTPHTS chemical
set.
Once the data set is prepared, the data mining and analysis steps
follow. The compound-level profile, a data matrix of compounds and
activity (positive or negative) for the six endpoints, is sparse. For
example, out of a total of 3,548 structures included in this study,
only 45 have all the data for four genetic toxicity endpoints. To
profile the association between chemistry and biological endpoint
better, chemical structures are decomposed into features. A
feature-level profile is a data matrix of structural classes and average
endpoint results for each class, which has very few empty cells. The
structural classes can be any chemically meaningful fragments; Yang
used Leadscope features as an example. Multivariate analysis of
structure classes allows one to detect, for example, salmonella
negatives which are mouse lymphoma positive. Non-concordant chemical
classes can give insights, e.g., the pyrrolidine, 2-oxo class is
salmonella positive, and mouse lymphoma negative; aromatic amines and
alkyl halides are in vitro chromosome aberration positive but
micronucleus negative [15].
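The step from a sparse compound-level profile to a dense feature-level profile can be sketched as follows. This is a toy illustration under stated assumptions: the compound names, the class labels (class_1, class_2) and the outcomes are invented, and the class assignment stands in for whatever structural features are actually used. Missing results are stored as None and are skipped when averaging.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Compound-level profile: 1 = positive, 0 = negative, None = not tested.
compound_profile: Dict[str, Dict[str, Optional[int]]] = {
    "cmpd_A": {"salmonella": 1, "mouse_lymphoma": None},
    "cmpd_B": {"salmonella": 1, "mouse_lymphoma": 0},
    "cmpd_C": {"salmonella": None, "mouse_lymphoma": 1},
    "cmpd_D": {"salmonella": 0, "mouse_lymphoma": 1},
}

# Hypothetical structural classes found in each compound.
compound_features: Dict[str, List[str]] = {
    "cmpd_A": ["class_1"],
    "cmpd_B": ["class_1", "class_2"],
    "cmpd_C": ["class_2"],
    "cmpd_D": ["class_2"],
}


def feature_profile(
    profile: Dict[str, Dict[str, Optional[int]]],
    features: Dict[str, List[str]],
) -> Dict[str, Dict[str, float]]:
    """Average each endpoint over the tested compounds in each structural class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for compound, results in profile.items():
        for feature in features.get(compound, []):
            for endpoint, outcome in results.items():
                if outcome is None:  # untested: contributes nothing
                    continue
                sums[feature][endpoint] += outcome
                counts[feature][endpoint] += 1
    return {
        f: {e: sums[f][e] / counts[f][e] for e in counts[f]}
        for f in counts
    }


table = feature_profile(compound_profile, compound_features)
for feature, row in table.items():
    print(feature, row)
```

In this toy data no endpoint is complete at the compound level, yet every cell of the feature-level table is filled. The hypothetical class_1 averages 1.0 for salmonella and 0.0 for mouse lymphoma, which is the kind of non-concordant pattern the text describes.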
Probabilistic analysis is possible when the database is sufficiently
large. For example, the probability that a compound or a structural
class is mutagenic or carcinogenic can be estimated from a large database.
Probabilities can be marginal, conditional, or joint. If 2000 out of
8000 compounds are salmonella positive, the marginal probability of a
salmonella positive result is 0.25. Conditional probability is defined
under a given condition; for example, if 75 out of 150 compounds are
salmonella positive given that micronucleus is positive, then the
conditional probability is 0.5. Joint probability is the probability of
both salmonella and micronucleus being positive. If the marginal
probability of a micronucleus positive result is also 0.25 and the two
results are treated as independent, then the joint probability of a
compound being positive for both salmonella and micronucleus is
0.25 x 0.25 = 0.0625.
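The arithmetic in this paragraph can be reproduced in a few lines. The counts are the ones quoted in the text; the variable names are illustrative. Note that the product 0.25 x 0.25 is a joint probability only if the two endpoints are treated as independent. The last line shows, as an assumption for comparison, what the chain rule would give if the conditional estimate of 0.5 applied to the same population.

```python
def proportion(n_positive: int, n_total: int) -> float:
    """Estimate a probability as the fraction of positive results."""
    return n_positive / n_total


# Marginal: 2000 of 8000 compounds are salmonella positive.
p_salmonella = proportion(2000, 8000)

# Conditional: 75 of the 150 micronucleus-positive compounds are salmonella positive.
p_salmonella_given_mn = proportion(75, 150)

# Marginal probability of a micronucleus positive result, as given in the text.
p_micronucleus = 0.25

# Joint probability under an independence assumption: P(S) * P(M).
joint_independent = p_salmonella * p_micronucleus

# Joint probability by the chain rule: P(S | M) * P(M).
# Assumption: the conditional estimate holds in the same population.
joint_chain_rule = p_salmonella_given_mn * p_micronucleus

print(p_salmonella, p_salmonella_given_mn, joint_independent, joint_chain_rule)
```

The two joint values differ (0.0625 versus 0.125) because a conditional probability of 0.5 is higher than the marginal 0.25, meaning the two endpoints would not be independent in that case.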
This probability analysis can be extended to the structure-class level.
If a class is a structural alert for salmonella, then the probability
of the class for salmonella should be high. The mean values from each
structural class can be used in the probabilistic analysis and further
for predictive likelihood. Since a chemical is made of these structural
classes, a joint probability can be calculated to estimate the
likelihood that a chemical is, for example, a salmonella mutagen. These
structural classes provide a chemical-features dimension and act as a
link between the structure-toxicity matrix for compounds tested in vivo
and the structure-assay matrix for compounds tested in vitro.
Significant features describing compounds can be related to the
probabilities, and probabilistic feature analysis can be carried out.
This probabilistic analysis provides validation for using these classes
as structural alerts and molecular descriptors in QSAR models.
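The text does not say how the class-level probabilities are combined into a likelihood for a whole chemical. One possible reading, shown here purely as an assumption, is a noisy-OR combination: each structural class present in the chemical is treated as an independent chance of producing a positive result. The class probabilities below are hypothetical.

```python
import math
from typing import Sequence


def noisy_or(class_probabilities: Sequence[float]) -> float:
    """Probability that at least one class triggers a positive result,
    assuming the classes act independently (an assumption, not the
    method described in the talk)."""
    p_all_negative = math.prod(1.0 - p for p in class_probabilities)
    return 1.0 - p_all_negative


# Hypothetical mean salmonella-positive rates for the classes in one chemical.
two_moderate_classes = noisy_or([0.5, 0.5])
alert_plus_weak_class = noisy_or([0.8, 0.1])

print(two_moderate_classes, alert_plus_weak_class)
```

Under this reading, a chemical containing one strong alert (0.8) stays likely to be positive whatever else it contains, which matches the intuition behind structural alerts. Other combination rules, such as a fitted statistical model, would be equally consistent with the text.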
After training sets and endpoints have been prepared and descriptors
selected and validated, a weight of evidence approach can be applied to
statistical QSAR modeling. For example, Yang presented a rat
carcinogenicity model based on molecular descriptors including the
structural classes, physicochemical calculated properties, and NTPHTS
screening assays. To model compounds that are salmonella negative but
rat carcinogenic, NTPHTS screening assays reflecting mostly the
apoptosis cycle were used.
Four submodels based on structural classes were put together by
optimizing the weights for individual structural class models to result
in one final model. A combination of descriptors selected on the basis
of knowledge of chemistry and biology leads to a much simpler
interpretation of the domain of applicability, and weight-of-evidence
optimization improves reliability.
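Combining submodels by optimizing their weights can be sketched as follows. This is not the model Yang presented: the four submodel scores and the labels are invented, the weights are constrained to be non-negative and sum to one, and a coarse grid search minimizing the Brier score stands in for whatever optimizer and criterion were actually used.

```python
from itertools import product
from typing import List, Sequence, Tuple


def combine(weights: Sequence[float], scores: Sequence[float]) -> float:
    """Weighted sum of the submodel scores for one compound."""
    return sum(w * s for w, s in zip(weights, scores))


def brier(weights: Sequence[float],
          submodel_scores: List[Sequence[float]],
          labels: List[int]) -> float:
    """Mean squared difference between combined score and observed label."""
    errors = [(combine(weights, row) - y) ** 2
              for row, y in zip(submodel_scores, labels)]
    return sum(errors) / len(errors)


def grid_search(submodel_scores: List[Sequence[float]],
                labels: List[int],
                steps: int = 10) -> Tuple[Tuple[float, ...], float]:
    """Try every weight vector on a grid whose entries sum to 1."""
    n_models = len(submodel_scores[0])
    best = None
    for combo in product(range(steps + 1), repeat=n_models):
        if sum(combo) != steps:
            continue
        weights = tuple(c / steps for c in combo)
        loss = brier(weights, submodel_scores, labels)
        if best is None or loss < best[1]:
            best = (weights, loss)
    return best


# Hypothetical scores from four submodels (columns) for five compounds (rows).
scores = [
    (0.9, 0.6, 0.2, 0.7),
    (0.2, 0.4, 0.1, 0.3),
    (0.8, 0.3, 0.9, 0.6),
    (0.1, 0.5, 0.3, 0.2),
    (0.7, 0.8, 0.6, 0.9),
]
labels = [1, 0, 1, 0, 1]  # 1 = rat carcinogen, 0 = non-carcinogen (invented)

best_weights, best_loss = grid_search(scores, labels)
print(best_weights, best_loss)
```

Because every single-submodel weighting lies on the grid, the combined model can never score worse on the training data than the best individual submodel. Any real use would need a held-out set to check that the fitted weights generalize.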
In the FDA CFSAN critical path project, this predictive data mining
method will be transparently documented for reproducibility. The plan
is eventually to disseminate the knowledge through a decision-tree type
of algorithm that makes the computational knowledge available to
reviewers on demand. This plan also includes making some of this knowledge
base publicly available.