A Primer on Molecular Similarity in QSAR and Virtual Screening Part I - Descriptor Choice
Recently, while sub-structural keys were found to retrieve fewer scaffolds for diverse classes than 3D fingerprints, topological (2D) fingerprints were found to be at least on par with them[8], and the superior performance of 2D descriptors on other test databases was attributed to the large number of close analogues in those databases. On the other hand, in a large comparative study circular fingerprints (which are a 2D representation) were found to retrieve a large number of active compounds as well as a large number of different scaffolds: indeed, they retrieved a similar percentage of the scaffolds as of the active compounds in the whole database, which would point in the opposite direction, at least for fingerprints not based on sub-structural keys[9], [10]. In addition, it has been suggested that whether descriptors are able to identify novel scaffolds depends heavily on the particular dataset under consideration, as shown for four different targets (all from different target classes)[11]. So are 3D descriptors more likely than 2D descriptors to identify novel scaffolds in a virtual screening setting? Possibly this is true for some 2D/3D descriptor combinations, but it is still open to discussion whether this is due to an inherent property of the descriptor dimensionality or to each particular descriptor definition. Given that the global spatial arrangement of atoms is, by and large, already defined by the (local) connectivity information, it is possible that no large intrinsic bias between 2D and 3D descriptors exists. Clearly, further research is needed here.
(c) The Information Content of Current Descriptors
While circular fingerprints such as ECFP4[12] or MOLPRINT 2D fingerprints[13] have been shown to be information-rich, and are currently the best-performing descriptors available as benchmarked on standard datasets[9], the question arises how well those descriptors actually perform when compared not to random selection, but to a very simple, basic classifier.
This comparison has indeed recently been performed, with quite surprising results[14]. Molecules in a virtual screening setting were described by simple descriptors that did not include any information about the connectivity of the molecule. Each molecule was assigned only "atom counts", that is, the frequency of heavy atoms of different types (carbon, nitrogen, oxygen and so on) in the molecule, and nothing else. The similarity of molecules was assessed via the distance between those 12-dimensional count vectors, and the number of active compounds retrieved was compared to the number retrieved by a standard circular fingerprint (MOLPRINT 2D) in combination with the Tanimoto coefficient. Given that enrichments for current virtual screening methods are often in the range of 20-fold (20 times better than random) and higher, one would expect simply counting atoms to perform much worse. To the contrary, however, the results obtained on a standard dataset[15] were rather surprising.
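To make the atom-count baseline concrete, the following is a minimal sketch and not the procedure used in the cited study. Several details are assumptions made for illustration only: the particular list of 12 heavy-atom types, the use of Euclidean distance (the text above says only "distance"), the parsing of molecular formulas instead of full structures, and all function names. The Tanimoto coefficient is shown on plain feature sets, and the enrichment factor is computed as the hit rate among the top-ranked compounds divided by the hit rate of the whole library.

```python
import math
import re

# ASSUMPTION: the source states only that the count vectors are
# 12-dimensional; this particular list of heavy-atom types is hypothetical.
ATOM_TYPES = ["C", "N", "O", "S", "P", "F", "Cl", "Br", "I", "B", "Si", "Se"]


def atom_count_vector(formula):
    """Count heavy atoms in a molecular formula such as 'C9H8O4'.

    Hydrogen and any element not in ATOM_TYPES are ignored.
    """
    counts = dict.fromkeys(ATOM_TYPES, 0)
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element in counts:
            counts[element] += int(number) if number else 1
    return [counts[atom] for atom in ATOM_TYPES]


def euclidean_distance(u, v):
    """ASSUMPTION: Euclidean distance; the source does not name the metric."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def tanimoto(features_a, features_b):
    """Tanimoto coefficient of two sets of fingerprint features."""
    a, b = set(features_a), set(features_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_atom_counts(query_formula, library):
    """Sort library names by increasing count-vector distance to the query."""
    query = atom_count_vector(query_formula)
    return sorted(
        library,
        key=lambda name: euclidean_distance(query, atom_count_vector(library[name])),
    )


def enrichment_factor(ranked, actives, top_n):
    """Hit rate in the top_n ranked compounds relative to the overall hit rate."""
    hits = sum(1 for name in ranked[:top_n] if name in actives)
    return (hits / top_n) / (len(actives) / len(ranked))


# Toy library (molecular formulas only), used purely for illustration.
library = {
    "aspirin": "C9H8O4",
    "salicylic acid": "C7H6O3",
    "caffeine": "C8H10N4O2",
}
ranking = rank_by_atom_counts("C9H8O4", library)
```

In this toy library, compounds whose heavy-atom composition is closest to the query rank first, regardless of how the atoms are connected. That is exactly the information the baseline deliberately discards, and it is why its performance relative to a circular fingerprint is an informative comparison.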