QSAR WORLD

Fingerprint-based Similarity

A simple count of shared features (common fragment substructures), used in a similarity coefficient, can serve as a measure of chemical similarity or, conversely, of chemical distance. It is a simple and computationally efficient way of quantifying the degree of structural resemblance.

Dictionaries of predefined structural fragments, such as the MACCS keys from MDL Information Systems, are used to identify the features contained in a molecule. The drawback is that any fragment not considered when the keys were designed is not part of the dictionary and therefore cannot be encoded.

The structural fragments or features that are present in a given molecule are turned ON (set to 1), and those that are absent are kept OFF (set to 0). Each molecule thus ends up as a string of 1s and 0s (a bit string) whose positions are determined by the entries of the dictionary. Note that some aspects of molecular structure cannot be captured in a bit-string representation. A bit is set only once, irrespective of how often the corresponding key occurs in the molecule. Bits are also set on the basis of fragments of the structure, so properties of the molecule as a whole are captured poorly.
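As an illustration of the encoding described above, here is a minimal Python sketch. The five-entry dictionary, the fragment names, and the two example molecules are all hypothetical and chosen only for illustration; real key sets are far larger, and the step that detects fragments in a structure (substructure matching) is assumed to have been done already.

```python
# Hypothetical fragment dictionary; each entry corresponds to one bit position.
FRAGMENT_DICTIONARY = ["hydroxyl", "carbonyl", "amine", "aromatic ring", "halogen"]


def to_bit_string(fragment_counts, dictionary=FRAGMENT_DICTIONARY):
    """Return a string of 1s and 0s, one bit per dictionary entry.

    fragment_counts maps a fragment name to how often it occurs in the
    molecule (assumed to come from an earlier substructure search).
    """
    return "".join(
        "1" if fragment_counts.get(key, 0) > 0 else "0" for key in dictionary
    )


# Two made-up molecules, described by fragment occurrence counts.
one_hydroxyl = {"hydroxyl": 1, "aromatic ring": 1}
# "nitro" is not in the dictionary, so it cannot be encoded at all.
three_hydroxyls = {"hydroxyl": 3, "aromatic ring": 1, "nitro": 1}

print(to_bit_string(one_hydroxyl))     # 10010
print(to_bit_string(three_hydroxyls))  # 10010
```

Both molecules receive the same bit string: the frequency of the hydroxyl fragment is lost, and the fragment missing from the dictionary leaves no trace.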

Once the molecules have been represented by such bit strings, any of the association coefficients can be used to assess the similarity between two given molecules. The Tanimoto coefficient is a frequently used measure for assessing similarity based on fingerprints (bit-string representations). Suppose we are comparing two molecules A and B. If NA is the number of features (ON bits) in A, NB is the number of features (ON bits) in B, and NAB is the number of features (ON bits) common to both A and B, then the Tanimoto coefficient is simply:

τ = NAB / (NA + NB - NAB)

Note that the OFF bits do not determine the similarity. In other words, if some molecular features are absent in both molecules then that is not taken as an indication of similarity between the two.
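The coefficient, and the point that OFF bits play no role, can be sketched in a few lines of Python. The function name, the example bit strings, and the choice to return 0.0 when both fingerprints have no ON bits are assumptions made for this sketch, not part of any particular toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two equal-length strings of '0' and '1'."""
    if len(bits_a) != len(bits_b):
        raise ValueError("bit strings must have the same length")
    n_a = bits_a.count("1")  # ON bits in A
    n_b = bits_b.count("1")  # ON bits in B
    # ON bits common to both A and B
    n_ab = sum(1 for x, y in zip(bits_a, bits_b) if x == "1" and y == "1")
    denominator = n_a + n_b - n_ab
    # Assumed convention: two all-OFF fingerprints get a similarity of 0.0.
    return n_ab / denominator if denominator else 0.0


a = "110100"
b = "100110"
print(tanimoto(a, b))  # 3 ON bits each, 2 in common: 2 / (3 + 3 - 2) = 0.5

# Appending positions that are OFF in both molecules changes nothing.
print(tanimoto(a + "0000", b + "0000"))  # 0.5
```

Identical fingerprints with at least one ON bit give 1.0, and fingerprints with no ON bits in common give 0.0.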

Binary representations of molecules, in combination with similarity coefficients, have some implicit properties that can skew the results of similarity searches and introduce unintended bias with respect to factors such as the size of the molecule. In a similarity search using fragment bit strings or fingerprints, a large molecule in the database is a priori much more likely to have bits in common with the target structure than a small molecule is.

It has also been noted that as the query structure becomes larger and more complex, the average similarity appears to increase.
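The size effect can be illustrated with a small simulation. This is only a sketch under strong assumptions: fingerprints are random rather than derived from real structures, a "larger" molecule is modelled simply as one with more ON bits, and the fingerprint length of 166 bits and all function names are chosen for illustration.

```python
import random

N_BITS = 166  # assumed fingerprint length, for illustration only


def random_fingerprint(n_on, rng):
    """Return a set of n_on randomly chosen ON-bit positions."""
    return set(rng.sample(range(N_BITS), n_on))


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of ON positions."""
    common = len(fp_a & fp_b)
    denominator = len(fp_a) + len(fp_b) - common
    return common / denominator if denominator else 0.0


def mean_similarity(query_on, database_on, trials=2000, seed=0):
    """Average shared ON bits and Tanimoto over random fingerprint pairs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    shared = 0
    total = 0.0
    for _ in range(trials):
        query = random_fingerprint(query_on, rng)
        entry = random_fingerprint(database_on, rng)
        shared += len(query & entry)
        total += tanimoto(query, entry)
    return shared / trials, total / trials


for query_on in (10, 40):
    for database_on in (10, 40, 80):
        shared, similarity = mean_similarity(query_on, database_on)
        print(f"query ON bits={query_on:3d}  database ON bits={database_on:3d}  "
              f"mean shared={shared:5.2f}  mean Tanimoto={similarity:.3f}")
```

In this random model, database entries with more ON bits share more bits with the query purely by chance, and a query with more ON bits yields a higher average coefficient. Real fingerprints are not random, so the numbers show the direction of the bias rather than its magnitude.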

See Also:
similarity principle, chemical similarity, descriptor-based similarity


References:

Flower, D. R. On the Properties of Bit String-Based Measures of Chemical Similarity. J. Chem. Inf. Comput. Sci., 38(3), 379-386, 1998.

Glen, R. C. and Adams, S. E. Similarity Metrics and Descriptor Spaces - Which Combinations to Choose? QSAR Comb. Sci., 25(12), 1133-1142, 2006.

Willett, P., Barnard, J. M. and Downs, G. M. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci., 38(6), 983-996, 1998.

http://www.daylight.com/dayhtml/doc/theory/theory.finger.html

http://en.wikipedia.org/wiki/Jaccard_index


Cite This As:

Dogra, Shaillay K., "Fingerprint-based Similarity." From QSARWorld--A Strand Life Sciences Web Resource.
http://www.qsarworld.com/insilico-chemistry-fingerprint-based-similarity.
