QSAR WORLD

New Horizons in Toxicity Prediction. Lhasa Limited Symposium Event in Collaboration with the University of Cambridge - February 2009


A Report by Wendy A. Warr, wendy@warr.com, http://www.warr.com
Strengths and Limitations of Current Toxicity Prediction Systems

Understanding toxicity from predictive data mining
Chihae Yang, ORISE fellow, US FDA Center for Food Safety and Applied Nutrition (CFSAN)

US FDA CFSAN has initiated a project to develop a knowledge base for food additives and ways to implement computational risk assessment methods within the workflow of reviewers. One of the goals of this project is to develop the center’s knowledge base, which will be disseminated through structural categories and predictive models. Predictive data mining is the process that is being used at US FDA CFSAN to build these components of the knowledge base.

The current SAR paradigm has two inherent issues. First, when linking chemistry and biology, there is an inadequate description of biology per chemical feature: complex biology is compressed into a highly summarized outcome, while the chemistry domain is extremely diverse and sparse. Chemical diversity is of the order of 10^59, while biological diversity is of the order of 10^9 and is very highly summarized (0 or 1, or an LC50 value). Most of the time, the training sets of the models suffer from this issue, and the data mining process has been a black box. Second, most QSAR models are “global” and data-driven, while the original SAR paradigm is defined only within a mechanistic domain. Hence the issue of “global” versus “mechanistic” models requires a clear definition of the applicability domain, in which valid ranges of independent variables must be established. Both the data mining process applied to the training data set and the limitations of the knowledge base must be made transparent and sustainable.
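One simple way to picture "valid ranges of independent variables" is a bounding-box check on descriptors. The following is only a minimal sketch under stated assumptions: the descriptor names (`logP`, `mol_weight`), the values, and the helper functions are hypothetical and are not taken from the talk, and real applicability-domain methods are usually more elaborate than a per-descriptor range test.

```python
# Hypothetical training-set descriptors; names and values are illustrative only.
training_descriptors = [
    {"logP": 1.2, "mol_weight": 180.0},
    {"logP": 3.4, "mol_weight": 310.5},
    {"logP": 2.1, "mol_weight": 250.2},
]


def descriptor_ranges(rows):
    """Return {descriptor: (min, max)} observed over the training set."""
    names = rows[0].keys()
    return {
        name: (min(r[name] for r in rows), max(r[name] for r in rows))
        for name in names
    }


def in_domain(compound, ranges):
    """True only if every descriptor lies inside its training range."""
    return all(low <= compound[name] <= high for name, (low, high) in ranges.items())


ranges = descriptor_ranges(training_descriptors)
inside = in_domain({"logP": 2.5, "mol_weight": 200.0}, ranges)
outside = in_domain({"logP": 5.0, "mol_weight": 200.0}, ranges)
```

In this toy setup a query compound whose `logP` exceeds anything seen in training is flagged as outside the domain, so a prediction for it would be treated with caution.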

The predictive data mining process begins with data preparation, followed by data mining and analysis, then development of a knowledge base of structural rules and prediction models, and finally application and dissemination of that knowledge. There are two types of learning goal: (1) identifying relationships between structural classes and various toxicity endpoints so that an intelligent testing decision can be made (“what do I make next?”); and (2) developing structural alerts and QSAR models to assist safety decisions.

Data sources included the Leadscope-FDA genetic and carcinogenic toxicity databases (constructed according to the criteria of the ToxML standard) [15] and biological assay data from the first National Toxicology Program High Throughput Screening (NTPHTS) campaign. The toxicity endpoints considered in this study are genetic toxicity (salmonella, mouse lymphoma, in vitro chromosome aberration and in vivo micronucleus) and rodent carcinogenicity (mouse and rat). The sources of the Leadscope genetic toxicity and carcinogenicity databases include NTP, CCRIS, Tokyo-Eiken (Tokyo Metropolitan Institute of Public Health Epidemiological Information Office), US FDA, and primary publications.

The NTPHTS campaign data come from an HTS project that NTP has initiated to explore new approaches to evaluating chemicals across a spectrum of high-throughput biological assays. Assays are being selected based on their potential to be informative of animal bioassay results and relevant to human health risk assessments. As an initial phase of this project, NTP has provided the NIH Chemical Genomics Center (NCGC), part of the NIH Molecular Libraries and Imaging Roadmap (MLR) initiative, with a set of 1,408 chemicals from NTP inventories for HTS in bioassays relevant to toxicology. Assays are described and assay results reported in PubChem for this NTPHTS chemical data set in the same manner as for compounds from the Molecular Libraries Small Molecule Repository. The DSSTox project [11] is collaborating with the NTPHTS project to provide structure annotation and cheminformatics support for this effort. Drawing largely from the contents of the existing NTP Bioassay Online Indicator (BSI) Structure-Index Locator File, the DSSTox NTPHTS Structure-Index File provides the full complement of DSSTox standard chemical fields for the NTPHTS chemical set.

Once the data set is prepared, the data mining and analysis steps follow. The compound-level profile, a data matrix of compounds and activity (positive or negative) for the six endpoints, is sparse. For example, out of a total of 3,548 structures included in this study, only 45 have data for all four genetic toxicity endpoints. To profile the association between chemistry and biological endpoints better, chemical structures are decomposed into features. A feature-level profile is a data matrix of structural classes and average endpoint results for each class, and it has very few empty cells. The structural classes can be any chemically meaningful fragments; Yang used Leadscope features as an example. Multivariate analysis of structural classes allows one to detect, for example, salmonella negatives that are mouse lymphoma positive. Non-concordant chemical classes can give insights: for example, the pyrrolidine, 2-oxo class is salmonella positive and mouse lymphoma negative, while aromatic amines and alkyl halides are in vitro chromosome aberration positive but micronucleus negative [15].
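The step from a sparse compound-level profile to a denser feature-level profile can be sketched as follows. This is a toy illustration, not the Leadscope procedure: the class names (`class_A`, `class_B`), the compounds, and the results are invented, and the sketch assumes that each class value is simply the mean of the available 0/1 results for compounds containing that class.

```python
from collections import defaultdict

# Toy compound-level profile: 1 = positive, 0 = negative, None = not tested.
# Structural classes and results are invented for illustration.
compounds = [
    {"features": ["class_A"], "salmonella": 1, "micronucleus": None},
    {"features": ["class_A", "class_B"], "salmonella": 1, "micronucleus": 0},
    {"features": ["class_B"], "salmonella": None, "micronucleus": 0},
    {"features": ["class_B"], "salmonella": 0, "micronucleus": 1},
]
ENDPOINTS = ["salmonella", "micronucleus"]


def feature_profile(rows, endpoints):
    """Average each endpoint over the tested compounds that contain a feature."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for feature in row["features"]:
            for endpoint in endpoints:
                result = row[endpoint]
                if result is not None:  # skip missing cells
                    sums[feature][endpoint] += result
                    counts[feature][endpoint] += 1
    return {
        feature: {
            endpoint: sums[feature][endpoint] / counts[feature][endpoint]
            for endpoint in endpoints
            if counts[feature][endpoint] > 0
        }
        for feature in sums
    }


profile = feature_profile(compounds, ENDPOINTS)
```

Although half of the toy compounds have a missing result, every cell of the resulting class-by-endpoint table is filled, which is the property the feature-level profile is meant to provide.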

Probabilistic analysis is possible when the database is sufficiently large. For example, the probability that a compound or a structural class is mutagenic or carcinogenic can be estimated from a large database. Probabilities can be marginal, conditional, or joint. If 2,000 out of 8,000 compounds are salmonella positive, the marginal probability of a salmonella positive result is 0.25. A conditional probability is defined under a given condition; for example, if 75 out of 150 micronucleus-positive compounds are also salmonella positive, then the conditional probability of a salmonella positive result given a micronucleus positive result is 0.5. A joint probability is the probability of both salmonella and micronucleus being positive. If the marginal probability of a micronucleus positive result is also 0.25 and the two results are treated as independent, then the joint probability of a compound being positive for both salmonella and micronucleus is 0.25 x 0.25 = 0.0625.
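The three kinds of probability can be reproduced from counts with a few lines of arithmetic. The sketch below uses the illustrative numbers from the paragraph above; the function names are hypothetical. The product of two marginals equals the joint probability only when the two outcomes are independent, so the chain-rule helper is included as a standard alternative for the dependent case; it is an addition for illustration and was not part of the talk.

```python
def marginal(n_positive, n_total):
    """P(A): fraction of all compounds that are positive for A."""
    return n_positive / n_total


def conditional(n_both, n_condition):
    """P(A | B): fraction of B-positive compounds that are also A-positive."""
    return n_both / n_condition


def joint_independent(p_a, p_b):
    """P(A and B) under the assumption that A and B are independent."""
    return p_a * p_b


def joint_chain_rule(p_a_given_b, p_b):
    """P(A and B) = P(A | B) * P(B); holds without assuming independence."""
    return p_a_given_b * p_b


p_salmonella = marginal(2000, 8000)                 # 0.25
p_salmonella_given_mn = conditional(75, 150)        # 0.5
p_both_independent = joint_independent(0.25, 0.25)  # 0.0625
```

When a conditional probability is available from the same data set as the marginal, `joint_chain_rule` is the safer choice, because genetic toxicity endpoints are often correlated rather than independent.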

This probability analysis can be extended to the structure-class level. If a class is a structural alert for salmonella, then the probability of a salmonella positive result for that class should be high. The mean values from each structural class can be used in the probabilistic analysis and, further, to estimate predictive likelihood. Since a chemical is made up of these structural classes, a joint probability can be calculated to estimate the likelihood that a chemical is, for example, a salmonella mutagen. These structural classes provide a chemical features dimension and act as a link between the structure-toxicity matrix for compounds tested in vivo and the structure-assay matrix for compounds tested in vitro. Significant features describing compounds can be related to the probabilities, and probabilistic feature analysis can be carried out. This probabilistic analysis provides validation for using these classes as structural alerts and as molecular descriptors in QSAR models.

After training sets and endpoints have been prepared and descriptors selected and validated, a weight of evidence approach can be applied to statistical QSAR modeling. For example, Yang presented a rat carcinogenicity model based on molecular descriptors including the structural classes, calculated physicochemical properties, and NTPHTS screening assays. To model compounds that are salmonella negative but rat carcinogenic, NTPHTS screening assays mostly reflecting the apoptosis cycle were used. Four submodels based on structural classes were combined into one final model by optimizing the weights of the individual structural class models. A combination of descriptors selected on the basis of knowledge of chemistry and biology leads to a much simpler interpretation of the domain of applicability, and weight of evidence optimization improves reliability.
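The idea of combining several submodels through optimized weights can be illustrated with a small sketch. Everything here is an assumption made for illustration: the report does not say how the weights were optimized, so a coarse grid search minimizing squared error is swapped in, and the submodel scores and labels are invented toy values.

```python
from itertools import product

# Toy data: four hypothetical submodel scores (0..1) per compound, plus the
# observed label (1 = carcinogenic, 0 = not). Values are invented.
rows = [
    ([0.9, 0.8, 0.3, 0.6], 1),
    ([0.2, 0.1, 0.4, 0.5], 0),
    ([0.7, 0.9, 0.6, 0.4], 1),
    ([0.3, 0.2, 0.7, 0.5], 0),
]


def combine(scores, weights):
    """Weighted average of submodel scores."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def squared_error(weights, data):
    """Sum of squared differences between combined score and label."""
    return sum((combine(scores, weights) - label) ** 2 for scores, label in data)


def grid_search(data, steps=4):
    """Try every weight vector on a coarse grid and keep the best one."""
    grid = [i / steps for i in range(steps + 1)]
    candidates = (w for w in product(grid, repeat=4) if sum(w) > 0)
    return min(candidates, key=lambda w: squared_error(w, data))


best_weights = grid_search(rows)
best_error = squared_error(best_weights, rows)
equal_error = squared_error((1.0, 1.0, 1.0, 1.0), rows)
```

Because equal weighting is one of the grid points, the selected weights can never fit the toy data worse than a plain average of the four submodels. In practice the weights would be chosen on held-out data to avoid overfitting.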

In the FDA CFSAN critical path project, this predictive data mining method will be transparently documented for reproducibility. The plan is eventually to disseminate knowledge in a decision tree type of algorithm for making the computational knowledge available on demand to the reviewers. This plan also includes making some of this knowledge base publicly available.

