A Primer on Molecular Similarity in QSAR and Virtual Screening
Part I - Descriptor Choice
We are glad to present Dr. Andreas Bender of Novartis Institutes for Biomedical Research in Cambridge, MA as an Editorial Advisor and columnist of QSAR World.
Dr. Bender has a PhD in Chemistry from Cambridge University. He is a Presidential Postdoctoral Fellow with the Novartis Institutes for Biomedical Research, working on projects related to ligand-based drug design in the Lead Discovery Informatics group.
He will be meeting you regularly through his columns in this space, discussing the latest issues and challenges in similarity searching and virtual screening. Over to Andreas...
Methods such as QSAR and various cheminformatics techniques have gained huge popularity over recent years and decades. This can partly be attributed to increased productivity pressure in the pharmaceutical industry and the assumption that computational models can replace some experiments, but also to the fact that more data, better-validated modeling methods and more computing power are readily available.
While methods abound, one should also step back from time to time, look at the bigger picture, and ask what has actually been gained and which expectations were overhyped and failed to deliver on their promises.
In this article, the first of a series, we will discuss what progress has been made since the early works, such as those by Crum Brown and Fraser and by Hansch, who were among the first to correlate chemical structure with biological (or physicochemical) properties and to postulate a causal relationship between the two.
Relating a molecular property to the underlying structure involves three broad steps. First, the molecule must be represented in a way suitable for computational treatment, often referred to as the choice of a descriptor. Second, one must choose the variable to be modeled, often called the endpoint, which can be any molecular property that can be measured experimentally. Frequently used endpoints (since they are relevant in practice) are solubility and logP as physicochemical properties, or bioactivity as a biological measurement. Third, descriptors (input variables) and endpoints (output variables) need to be connected by means of one of a variety of available model generation methods.
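The three steps can be sketched in a few lines of Python. This is a toy illustration only, under loudly stated assumptions: the descriptor (`toy_descriptor`, a hypothetical helper that counts C, N and O characters in a SMILES string) is far cruder than anything used in practice, the endpoint values are synthetic numbers generated from made-up weights rather than measurements, and ordinary least squares stands in for the model generation method.

```python
import numpy as np

# Step 1: descriptor. A deliberately crude, hypothetical representation:
# the number of C, N and O characters in a SMILES string.
def toy_descriptor(smiles):
    s = smiles.upper()
    return np.array([s.count("C"), s.count("N"), s.count("O")], dtype=float)

molecules = ["CCO", "CCCO", "CCN", "CC(=O)O", "CCCC", "NCCO"]
X = np.array([toy_descriptor(s) for s in molecules])

# Step 2: endpoint. SYNTHETIC values generated from made-up weights,
# standing in for an experimentally measured property.
true_weights = np.array([0.5, -1.0, -0.8])  # assumption, not real data
true_intercept = 0.2                        # assumption, not real data
y = X @ true_weights + true_intercept

# Step 3: model. Connect descriptors and endpoint by ordinary least squares,
# with a column of ones appended so that an intercept is fitted as well.
design = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

predicted = design @ coef
print("fitted weights and intercept:", np.round(coef, 3))
```

Because the synthetic endpoint is an exact linear function of the descriptor, the fit recovers the made-up weights. With real measurements, neither the descriptor nor the linear model would be guaranteed to be adequate, which is the point of the discussion that follows.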
Each of those steps deserves particular attention. It is crucial that the descriptor chosen to represent the system, the mathematical method employed to generate the model, and the property or endpoint measured form a suitable combination that at least makes successful model generation possible. If the system is not adequately described by the descriptors used as input variables, the model is effectively based on random variables and is doomed from the beginning. If the modeling method cannot handle the expected input-output relationship, as when a linear method is applied to a bilinear relationship, no suitable fit of the function can be expected. If the measured endpoints are not purely a function of the molecular structure, but also of the procedure used to obtain the measurements, information is missing from the system and the model cannot be more accurate than the available, error-riddled data. Large variability of measurements between laboratories, or between experimental procedures, decreases the quality of the model that is ultimately obtained, so a dataset that is as homogeneous as possible is of crucial importance here.
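The mismatch between modeling method and relationship can be made concrete with a small numerical sketch. It assumes that "bilinear" refers to a Kubinyi-type dependence of activity on lipophilicity, activity = a·logP − b·log10(β·P + 1) + c; the parameter values below are made up, and the data are synthetic and noise-free.

```python
import numpy as np

# Synthetic, noise-free data from an assumed bilinear relationship.
# All parameter values are made up for illustration.
a, b, c, beta = 1.0, 2.0, 3.0, 0.01
log_p = np.linspace(-2.0, 6.0, 41)

def bilinear_term(log_p, beta):
    # log10(beta * P + 1), with P = 10 ** logP
    return np.log10(beta * 10.0 ** log_p + 1.0)

activity = a * log_p - b * bilinear_term(log_p, beta) + c

def r_squared(design, y):
    # Least-squares fit followed by the coefficient of determination.
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    return 1.0 - (residual @ residual) / np.sum((y - y.mean()) ** 2)

ones = np.ones_like(log_p)

# Straight line in logP: cannot follow the rise and fall of the curve.
r2_linear = r_squared(np.column_stack([log_p, ones]), activity)

# Same fitting method, but with the bilinear term supplied as an input
# (beta is treated as known here, which is a simplification).
r2_bilinear = r_squared(
    np.column_stack([log_p, bilinear_term(log_p, beta), ones]), activity
)

print(f"R^2, linear model:   {r2_linear:.3f}")
print(f"R^2, bilinear model: {r2_bilinear:.3f}")
```

In this setup the straight line explains almost none of the variance, while the model whose functional form matches the data fits it essentially perfectly, even though both use the same least-squares machinery. In practice β would have to be estimated as well, which turns the problem into a nonlinear fit.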