Anthony E. Klon
Scientist, Computational Chemistry, Locus Pharmaceuticals
Details:
All 643 molecules in the training set were read into a MOE database and
minimized using the MMFF94x force field. All 2D and i3D (3D descriptors
based on internal coordinates) descriptors, as well as the 166 MACCS
keys, were computed, for a total of 465 descriptors. The descriptors and
measured bioavailability values were exported to a CSV file
and imported into Weka 3-6-0. Attribute selection was carried out using
the CfsSubsetEval attribute evaluator (locallyPredictive option set to
True) with the BestFirst search method (search direction = Forward,
lookupCacheSize = 1, searchTermination = 5). This selection process
resulted in 20 attributes:
Descriptor     Class  Description
BCUT_SLOGP_0   2D     LogP BCUT (0/3)
BCUT_SMR_0     2D     Molar Refractivity BCUT (0/3)
a_nP           2D     Number of phosphorus atoms
opr_violation  2D     Oprea Violation Count
MACCS(--8)     2D     # of heteroatoms in 4-membered rings
MACCS(-13)     2D     # N connected to 1 O and 2 C
MACCS(-15)     2D     # C connected to 3 O
MACCS(-16)     2D     # of heteroatoms in 3-membered rings
MACCS(-21)     2D     # C = bonded to C and 3 heavy atoms
MACCS(-23)     2D     # C bonded to 1 N and 2 O
MACCS(-28)     2D     # of XCH2X, where X<>C
MACCS(-29)     2D     # of phosphorus atoms
MACCS(-30)     2D     # of non-C Q4 bonded to >= 3 C
MACCS(-37)     2D     # of C bonded to >= 1 O & >= 2 N
MACCS(-49)     2D     # of charged atoms
MACCS(-51)     2D     # of S bonded to a C and an O
MACCS(107)     2D     # of XQ>3 bonded to at least 1 halogen
a_base         2D     Number of basic atoms
vsurf_IW7      2D     Hydrophilic integy moment at -5.0
vsurf_Wp8      2D     Polar volume at -6.0
a_nP and MACCS(-29) both count phosphorus atoms and are therefore
redundant, so a_nP was dropped from further consideration.
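CfsSubsetEval scores a candidate subset with a correlation-based merit
that rewards attributes correlated with the target and penalizes
attributes correlated with each other. The sketch below illustrates that
heuristic only; it is not Weka's code. It assumes absolute Pearson
correlation as the correlation measure (Weka's internal measure may
differ), and the arrays a, b and y are made-up stand-ins, not the
bioavailability data. It shows why a duplicated column, like a_nP next
to MACCS(-29), adds nothing to a subset.

```python
import numpy as np

def cfs_merit(X, y):
    """CFS merit of the subset formed by all columns of X:
    k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = X.shape[1]
    # Mean absolute attribute-target correlation (assumed: Pearson).
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(k)])
    if k == 1:
        return r_cf
    # Mean absolute attribute-attribute correlation over distinct pairs.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    r_ff = corr[np.triu_indices(k, 1)].mean()
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical data: two independent attributes that both drive y.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
y = a + b + 0.1 * rng.normal(size=200)

merit_ab = cfs_merit(np.column_stack([a, b]), y)
# Same subset plus an exact copy of column a (a redundant attribute).
merit_aab = cfs_merit(np.column_stack([a, a, b]), y)
print(merit_ab, merit_aab)
```

In this toy case the subset with the duplicated column gets the lower
merit, because the copy raises the mean attribute-attribute correlation
without adding information about y.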
Several
classifiers were tried in Weka, including Gaussian processes,
support vector machines, and linear regression. Gaussian processes
without hyperparameter tuning (GP) gave the best results and were
taken forward for model building and refinement. The remaining 19
descriptors from the above list were pruned by iteratively
building GP models with one descriptor left out at a time.
A descriptor was dropped from the final model if its removal
either improved the correlation coefficient or had only a minor effect
on its value. The final model contained thirteen descriptors:
Different parameter settings were explored for the GP model, and the best
model found used the RBF kernel (gamma = 0.5) with normalized training
data and a Gaussian noise level of 1. The final model was saved for use
on the set of test compounds. The performance of the model on the
training set was as follows:
Measure                      10-fold CV   66/34 split   Full training set
Correlation coefficient      0.5073       0.5632        0.5445
Mean absolute error          24.2396      23.7236       23.5233
Root mean squared error      28.7131      28.1363       27.9861
Relative absolute error      82.5596 %    79.3712 %     80.2099 %
Root relative squared error  86.0962 %    83.5003 %     84.0205 %
Total number of instances    643          219           643
66/34 split: the model was trained on 66 % of the training set and
evaluated on the remaining 34 % (219 compounds).
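The error measures reported above have standard definitions, sketched
below for reference. This is an illustration, not Weka's implementation.
The two relative errors compare the model against a baseline that always
predicts the mean; using the mean of the actual values passed in is an
assumption here, since Weka may take the baseline from the training data
instead. The input values in the example are made up.

```python
import numpy as np

def regression_summary(actual, predicted):
    """Correlation, MAE, RMSE and the two baseline-relative errors (in %)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - actual
    # Assumed baseline: always predict the mean of the actual values.
    baseline = actual - actual.mean()
    return {
        "correlation": np.corrcoef(actual, predicted)[0, 1],
        "mae": np.mean(np.abs(err)),
        "rmse": np.sqrt(np.mean(err ** 2)),
        "rae_pct": 100 * np.sum(np.abs(err)) / np.sum(np.abs(baseline)),
        "rrse_pct": 100 * np.sqrt(np.sum(err ** 2) / np.sum(baseline ** 2)),
    }

# Hypothetical bioavailability values (%), for illustration only.
summary = regression_summary([10, 50, 90], [20, 50, 80])
print(summary)
```

A relative absolute error near 80 %, as in the results above, means the
model's absolute error is about four fifths of what the mean-only
baseline would give.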
The 162 compounds in the test set were read into a MOE database, energy
minimized as described previously, and the thirteen descriptors listed
above were calculated. These descriptors were imported into Weka as
described for the training set and the GP model built with the data in
the training set was used to predict the bioavailability values for the
compounds in the test set.
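The fit-then-predict workflow can be sketched in a few lines of NumPy.
This is a minimal illustration of GP regression with an RBF kernel under
stated assumptions, not the Weka model itself: the kernel is taken to be
exp(-gamma * ||a - b||^2), "normalized" is taken to mean min-max scaling
of each descriptor using the training-set range, the noise level is
added to the kernel diagonal as noise**2, and the target is centered on
its training mean. The function names and the random arrays are
hypothetical stand-ins for the thirteen-descriptor data.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed for all row pairs.
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def fit_gp(X, y, gamma=0.5, noise=1.0):
    # Min-max scale each descriptor with the training-set range (assumed).
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    Xn = (X - lo) / span
    y_mean = y.mean()
    # Noise enters on the diagonal as noise**2 (assumed scaling).
    K = rbf_kernel(Xn, Xn, gamma) + noise ** 2 * np.eye(len(X))
    alpha = np.linalg.solve(K, y - y_mean)
    return {"lo": lo, "span": span, "Xn": Xn, "alpha": alpha,
            "y_mean": y_mean, "gamma": gamma}

def predict_gp(model, X_new):
    # Apply the training-set scaling to new compounds, then use the GP mean.
    Xn_new = (X_new - model["lo"]) / model["span"]
    K_star = rbf_kernel(Xn_new, model["Xn"], model["gamma"])
    return model["y_mean"] + K_star @ model["alpha"]

# Random stand-in data: 40 "training" and 5 "test" rows, 13 descriptors.
rng = np.random.default_rng(1)
X_train = rng.uniform(size=(40, 13))
y_train = 100.0 * X_train[:, 0]
X_test = rng.uniform(size=(5, 13))

model = fit_gp(X_train, y_train, gamma=0.5, noise=1.0)
preds = predict_gp(model, X_test)
print(preds)
```

The test compounds are scaled with the ranges stored from the training
set, which mirrors reusing a saved model rather than refitting on the
test data.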
From QSARworld:
The test set predictions for this model give an RMSE of 30.9716, the second lowest among all the entries.