A Primer on Molecular Similarity in QSAR and Virtual Screening
Part I - Descriptor Choice
The figure below shows the number of active compounds retrieved by conventional circular fingerprints relative to the number retrieved by the “dumb” atom counts. A value of 1 means that both methods perform equally well, while a value of 2, for example, indicates that the conventional fingerprints perform twice as well as the dumb atom counts (they retrieve twice as many actives).
Across the 11 classes of active compounds, the difference between the methods varies between a factor of about 1 and a factor of about 3. Conventional fingerprints therefore perform better than counting atoms overall, no question. On average, however, they do not even perform twice as well: where circular fingerprints obtain enrichments of around 7, counting atoms still obtains enrichments of around 4. So are we so much better than counting atoms right now? Overall, we certainly do perform better, but not by as much as one might expect.
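To make the arithmetic behind "not even twice as well" explicit, here is a minimal Python sketch. The function name `performance_ratio` is hypothetical, and the two inputs are simply the approximate average enrichments quoted above, not values recomputed from the underlying data.

```python
def performance_ratio(fingerprint_value, atom_count_value):
    """Return how many times better the fingerprints perform than atom counts.

    Both arguments are performance figures on the same scale, for example
    enrichment factors or numbers of actives retrieved.
    """
    if atom_count_value <= 0:
        raise ValueError("atom_count_value must be positive")
    return fingerprint_value / atom_count_value


# Approximate average enrichments quoted in the text (assumed, rounded values).
fingerprint_enrichment = 7.0
atom_count_enrichment = 4.0

ratio = performance_ratio(fingerprint_enrichment, atom_count_enrichment)
print(f"Fingerprints perform {ratio:.2f} times as well as atom counts")
```

With these rounded averages the ratio comes out at 1.75, which is below the factor of 2 discussed above.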
(More recently, the “performance of the atom count descriptor” has also been evaluated on another dataset, with similar results. The above tendency thus seems to be general, hinting that descriptors which just count atoms already capture a surprisingly large part of the total information content of other descriptors employed in virtual screening.)
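For readers who want a feel for how little an atom count descriptor contains, the following Python sketch builds one from a molecular formula and compares two molecules. It is an illustration only and not necessarily the descriptor used in the studies mentioned above: the helper names `atom_counts` and `count_tanimoto` are hypothetical, the parser assumes simple formulas without brackets or charges, and the count-based Tanimoto coefficient is an assumed choice of similarity measure.

```python
import re
from collections import Counter

# One element symbol (e.g. "C", "Cl") followed by an optional count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula):
    """Count atoms per element in a simple formula such as 'C9H8O4'.

    Assumption: no brackets, charges, or isotope labels in the formula.
    """
    counts = Counter()
    for element, number in _TOKEN.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def count_tanimoto(a, b):
    """Tanimoto coefficient on count vectors: sum of minima over sum of maxima."""
    elements = set(a) | set(b)
    shared = sum(min(a[e], b[e]) for e in elements)
    total = sum(max(a[e], b[e]) for e in elements)
    return shared / total if total else 1.0


aspirin = atom_counts("C9H8O4")
salicylic_acid = atom_counts("C7H6O3")
print(dict(aspirin))
print(round(count_tanimoto(aspirin, salicylic_acid), 3))
```

The descriptor ignores connectivity entirely, so isomers are indistinguishable to it. That is exactly why it serves as a "dumb" baseline.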