QSAR WORLD

QSAR Modeling Competition 2008 - Results

Runner-up:

Anthony E. Klon
Scientist, Computational Chemistry, Locus Pharmaceuticals

Details:

All 643 molecules in the training set were read into a MOE database and minimized using the MMFF94x force field. All 2D and i3D (3D descriptors based on internal coordinates) descriptors, as well as the 166 MACCS keys, were then calculated, for a total of 465 computed descriptors. The descriptors and measured bioavailability values were exported to a CSV file and imported into Weka 3-6-0. Attribute selection was carried out using the CfsSubsetEval attribute evaluator (locallyPredictive option set to True) with the BestFirst search method (search direction = Forward, lookupCacheSize = 1, searchTermination = 5). This selection process resulted in 20 attributes:
   
Descriptor       Class    Description
BCUT_SLOGP_0     2D       LogP BCUT (0/3)
BCUT_SMR_0       2D       Molar Refractivity BCUT (0/3)
a_nP             2D       Number of phosphorus atoms
opr_violation    2D       Oprea Violation Count
MACCS(--8)       2D       # of heteroatoms in 4-membered rings
MACCS(-13)       2D       # N connected to 1 O and 2 C
MACCS(-15)       2D       # C connected to 3 O
MACCS(-16)       2D       # of heteroatoms in 3-membered rings
MACCS(-21)       2D       # C = bonded to C and 3 heavy atoms
MACCS(-23)       2D       # C bonded to 1 N and 2 O
MACCS(-28)       2D       # of XCH2X, where X<>C
MACCS(-29)       2D       # of phosphorus atoms
MACCS(-30)       2D       # of non-C Q4 bonded to >= 3 C
MACCS(-37)       2D       # of C bonded to >= 1 O & >= 2 N
MACCS(-49)       2D       # of charged atoms
MACCS(-51)       2D       # of S bonded to a C and an O
MACCS(107)       2D       # of XQ>3 bonded to at least 1 halogen
a_base           2D       Number of basic atoms
vsurf_IW7        2D       Hydrophilic integy moment at -5.0
vsurf_Wp8        2D       Polar volume at -6.0

a_nP and MACCS(-29) are redundant (both count phosphorus atoms), so a_nP was dropped from further consideration.
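
The idea behind correlation-based subset evaluation can be illustrated with a small sketch. It is not Weka's CfsSubsetEval code: it assumes the usual CFS merit formula with absolute Pearson correlations, replaces the BestFirst search with a plain greedy forward search, and uses invented toy values for three of the descriptor names above. It shows why a duplicated descriptor such as a phosphorus count adds nothing to a subset that already contains it.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def cfs_merit(columns, target, subset):
    """CFS merit: k * r_cf / sqrt(k + k*(k-1) * r_ff) for a list of column names."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute correlation between each feature and the target.
    r_cf = sum(abs(pearson(columns[f], target)) for f in subset) / k
    if k == 1:
        return r_cf
    # Mean absolute correlation between all pairs of features in the subset.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

def forward_select(columns, target):
    """Greedy forward search: add the best feature until the merit stops improving."""
    selected, best = [], 0.0
    remaining = list(columns)
    while remaining:
        scored = [(cfs_merit(columns, target, selected + [f]), f) for f in remaining]
        merit, feature = max(scored, key=lambda pair: pair[0])
        if merit <= best + 1e-12:  # no real improvement, stop
            break
        selected.append(feature)
        remaining.remove(feature)
        best = merit
    return selected, best

# Toy data (invented for illustration, not taken from the competition set).
columns = {
    "a_nP":       [0, 1, 0, 2, 0, 1],
    "MACCS(-29)": [0, 1, 0, 2, 0, 1],  # exact duplicate of a_nP
    "a_base":     [1, 0, 2, 1, 0, 2],
}
target = [65, 30, 80, 25, 50, 60]

selected, merit = forward_select(columns, target)
print(selected, round(merit, 4))
```

In this toy case the search keeps one of the two phosphorus counts plus a_base, because adding the duplicate raises the feature-feature correlation without adding any new information about the target.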

Several regression methods were tried in Weka, including Gaussian processes, support vector machines, and linear regression. Gaussian processes without hyperparameter tuning (GP) gave the best results and were carried forward for model building and refinement. The remaining 19 descriptors from the above list were whittled down by iteratively building GP models with one descriptor left out at a time. A descriptor was dropped from the final model if its removal either improved the correlation coefficient or had only a minor effect on its value. The final model contained thirteen descriptors:

opr_violation
MACCS(-15)
MACCS(-21)
MACCS(-28)
MACCS(-29)
MACCS(-30)
MACCS(-37)
MACCS(-49)
MACCS(-51)
MACCS(107)
a_base
vsurf_IW7
vsurf_Wp8
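
The leave-one-descriptor-out pruning described above can be sketched as a simple loop. This is an illustration only: the `score` callback stands in for "build a GP model in Weka and read off the correlation coefficient", the `tolerance` value that defines a "minor effect" is an assumption (the write-up gives no threshold), and the toy contributions are invented.

```python
def prune_descriptors(descriptors, score, tolerance=0.01):
    """Drop descriptors one at a time while the score improves or barely changes.

    `score` takes a list of descriptor names and returns a number where
    higher is better (for example a cross-validated correlation coefficient).
    """
    kept = list(descriptors)
    current = score(kept)
    changed = True
    while changed and len(kept) > 1:
        changed = False
        for name in list(kept):
            trial = [d for d in kept if d != name]
            trial_score = score(trial)
            # Accept the removal if the score went up or fell by less than tolerance.
            if trial_score >= current - tolerance:
                kept, current, changed = trial, trial_score, True
                break
    return kept, current

# Hypothetical stand-in for model building: each descriptor contributes a
# fixed amount to the score. These numbers are made up for the example.
toy_contribution = {
    "a_base": 0.30,
    "vsurf_Wp8": 0.20,
    "BCUT_SLOGP_0": 0.005,   # negligible effect, should be dropped
    "MACCS(-13)": -0.02,     # hurts the model, should be dropped
}

def toy_score(names):
    return sum(toy_contribution[n] for n in names)

kept, final_score = prune_descriptors(list(toy_contribution), toy_score)
print(kept, final_score)
```

Because each accepted removal resets the reference score, many small losses can add up over a long run; comparing against the score of the full starting set instead is an equally reasonable design choice.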

Different parameters were explored in the GP model, and the best model found used the RBF kernel (gamma = 0.5) with normalized training data and a Gaussian noise level of 1. The final model was saved for use on the set of test compounds. The performance of the model on the training set was as follows:

Ten-fold cross validation

Correlation coefficient         0.5073
Mean absolute error            24.2396
Root mean squared error        28.7131
Relative absolute error        82.5596 %
Root relative squared error    86.0962 %
Total Number of Instances     643

66 % Training Set, Predict on Remaining 34 %

Correlation coefficient         0.5632
Mean absolute error            23.7236
Root mean squared error        28.1363
Relative absolute error        79.3712 %
Root relative squared error    83.5003 %
Total Number of Instances     219

Full Training Set

Correlation coefficient         0.5445
Mean absolute error            23.5233
Root mean squared error        27.9861
Relative absolute error        80.2099 %
Root relative squared error    84.0205 %
Total Number of Instances     643
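
For readers unfamiliar with the five statistics reported above, the following sketch computes them from a list of measured and predicted values. It uses the textbook definitions and takes the mean of the measured values as the baseline for the two relative errors, which is an assumption; Weka's own evaluation code may compute the baseline differently. The input numbers are invented.

```python
import math

def regression_report(actual, predicted):
    """Return correlation, MAE, RMSE, RAE (%) and RRSE (%) for two equal-length lists."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Baseline errors: what you get by always predicting the mean measured value.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100 * abs_err / base_abs,
        "rrse_percent": 100 * math.sqrt(sq_err / base_sq),
    }

# Invented bioavailability values, for illustration only.
report = regression_report(actual=[10, 20, 30, 40], predicted=[20, 20, 30, 30])
for name, value in report.items():
    print(f"{name:14s} {value:.4f}")
```

A relative error below 100 % means the model beats the constant-mean baseline; values in the 80 to 86 % range, as in the tables above, indicate a modest improvement over that baseline.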
   
The 162 compounds in the test set were read into a MOE database, energy minimized as described previously, and the thirteen descriptors listed above were calculated. These descriptors were imported into Weka as described for the training set and the GP model built with the data in the training set was used to predict the bioavailability values for the compounds in the test set.
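
The fit-then-predict workflow with an RBF-kernel Gaussian process can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reproduction of Weka's GaussianProcesses class: the kernel is taken to be exp(-gamma * squared distance), descriptors are min-max scaled using the training set, the target is only mean-centered, and the noise value is added directly to the kernel diagonal. Weka may differ in these details. All data values are invented.

```python
import math

def rbf(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def make_scaler(rows):
    """Return a function that min-max scales a row using the training ranges."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    span = [(max(c) - min(c)) or 1.0 for c in cols]
    def scale(row):
        return [(v - l) / s for v, l, s in zip(row, lo, span)]
    return scale

def fit_gp(X, y, gamma=0.5, noise=1.0):
    """Fit a GP regressor and return a predict(x) function (posterior mean only)."""
    scale = make_scaler(X)
    Xs = [scale(row) for row in X]
    mean_y = sum(y) / len(y)
    # Kernel matrix with the noise term on the diagonal.
    K = [[rbf(a, b, gamma) + (noise if i == j else 0.0)
          for j, b in enumerate(Xs)] for i, a in enumerate(Xs)]
    alpha = solve(K, [v - mean_y for v in y])
    def predict(x):
        xs = scale(x)  # test rows are scaled with the training ranges
        return mean_y + sum(a * rbf(xs, xi, gamma) for a, xi in zip(alpha, Xs))
    return predict

# Invented training data: two descriptors per compound, one measured value each.
train_X = [[0, 1.0], [1, 2.0], [2, 0.5], [3, 3.0]]
train_y = [20, 50, 35, 80]

predict = fit_gp(train_X, train_y, gamma=0.5, noise=1.0)
test_X = [[1.5, 1.0], [2.5, 2.5]]
print([round(predict(x), 2) for x in test_X])
```

With a noise level as large as 1 the fitted values are pulled toward the mean of the training targets, which is consistent with a model that trades training-set fit for smoother predictions on unseen compounds.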

From QSARworld:

The test set predictions for this model give an RMSE of 30.9716, the second lowest among all the entries.
Facilitated by Strand Life Sciences Pvt. Ltd