QSAR WORLD

Y Scrambling

When selecting descriptors relevant to modeling the property of interest, some descriptors can appear important purely by chance, given the high dimensionality of the feature space being searched. For example, with correlation-based selection methods it is quite possible to select, say, 10 descriptors out of 1000 that have no real significance but seem to fit the property of interest well. This happens through pure statistical chance: some descriptors simply happen to correlate well with the property. In this example, if each descriptor has a 1 in 100 chance of correlating well with the outcome, we would expect about 1000 × 0.01 = 10 descriptors to be selected even though they capture no real relation with the property of interest.
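The effect of chance correlation can be illustrated with a small simulation. The sketch below is not from the original article: the function name `count_chance_hits`, the sample size of 30 compounds, and the use of normally distributed noise are all assumptions chosen for illustration. It generates 1000 descriptors that are pure noise, correlates each with an equally random property, and reports the 10 strongest absolute correlations.

```python
import numpy as np

def count_chance_hits(n_compounds=30, n_descriptors=1000, top_k=10, seed=0):
    """Return indices and |r| of the top_k noise descriptors most correlated with a noise property."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_compounds, n_descriptors))  # descriptors: pure noise (assumption)
    y = rng.normal(size=n_compounds)                   # property: pure noise, unrelated to X

    # Pearson correlation of every descriptor column with y
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

    # Keep the top_k descriptors by absolute correlation, strongest first
    best = np.argsort(-np.abs(r))[:top_k]
    return best, np.abs(r[best])

best, abs_r = count_chance_hits()
print("Top 10 |r| among 1000 random descriptors:", np.round(abs_r, 2))
```

Even though no descriptor here has any real relation to the property, the strongest correlations are far from zero, which is exactly the situation a correlation-based selection step can mistake for signal.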

To guard against the possibility of having learned such chance models, the method of Y-scrambling is advocated. In this method, models are fitted for randomly reordered property/activity values and compared with the model obtained for the actual property/activity values.

The procedure is as follows:

1) For the training set on which the given model was learned, the descriptor data (‘X’) is left as is, while the activity data (‘Y’) is randomly shuffled. The values, and hence their statistical distribution, stay the same, but each value is now paired with a different compound and its descriptors, which destroys any meaningful relation that may have existed between the independent variables ‘X’ and the dependent variable ‘Y’.

2) Next, a new QSAR model is fitted to the permuted data, and metrics such as R-square and Q-square are recorded for the fitted model.

3) Steps 1 and 2 are repeated for a sufficient number of iterations; 50 to 100 is a good number.

4) The values obtained in this fashion are compared with the ‘true’ values obtained for the model fitted on the real data. The comparison can be made with a histogram or scatter plot that shows how the ‘true’ value differs from the ‘background’ reference distribution produced by the permutation tests described above.
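Steps 1 to 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's own implementation: the model is an ordinary least-squares fit, only R-square is tracked (Q-square would need a cross-validation loop), the data are synthetic, and the names `r_squared` and `y_scramble` are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R-square of an ordinary least-squares fit with an intercept term."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def y_scramble(X, y, n_iter=100, seed=0):
    """Return the true R-square and the R-square values of n_iter Y-scrambled fits."""
    rng = np.random.default_rng(seed)
    true_r2 = r_squared(X, y)
    # Steps 1-3: keep X fixed, shuffle y, refit, and record the metric each time
    scrambled = np.array([r_squared(X, rng.permutation(y)) for _ in range(n_iter)])
    return true_r2, scrambled

# Synthetic demo data (assumption): 40 compounds, 3 descriptors, a real linear relation plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=40)

true_r2, scrambled_r2 = y_scramble(X, y)
# Step 4: compare the true value with the background distribution
print(f"true R-square = {true_r2:.2f}")
print(f"scrambled R-square: mean = {scrambled_r2.mean():.2f}, max = {scrambled_r2.max():.2f}")
```

The array `scrambled_r2` is the 'background' reference distribution; it can be passed to a histogram routine, with the true value marked on the same axis, to make the comparison visually.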

The true values should lie well outside this ‘background’ reference distribution before one can confidently say that the originally learned model captures a real relationship in the data and is not comparable to the models learned by chance through Y-scrambling.
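One way to put a number on "well outside" is an empirical p-value together with a z-score. The article does not prescribe either; both are common conventions for permutation tests and are offered here only as a sketch. The function name `compare_to_background` is hypothetical, and the background values below are made-up illustrative numbers, not results from any real model.

```python
import numpy as np

def compare_to_background(true_value, background):
    """Return (empirical p-value, z-score) of true_value against scrambled-model metrics."""
    background = np.asarray(background, dtype=float)
    # Fraction of scrambled models scoring at least as well as the true model.
    # The +1 in numerator and denominator counts the true model itself (a common convention).
    p = (np.sum(background >= true_value) + 1) / (len(background) + 1)
    # Distance of the true value from the background mean, in background standard deviations
    z = (true_value - background.mean()) / background.std(ddof=1)
    return p, z

# Made-up R-square values for ten scrambled models (illustration only)
background = [0.05, 0.12, 0.08, 0.20, 0.03, 0.15, 0.10, 0.07, 0.18, 0.02]
p, z = compare_to_background(0.85, background)
print(f"empirical p = {p:.3f}, z = {z:.1f}")
```

With only ten scrambled models the smallest attainable p-value under this convention is 1/11, which is one reason to run the 50 to 100 iterations suggested above.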

[Figure: Y-scrambling]

Cite This As:

Dogra, Shaillay K., "Y Scrambling" From QSARWorld--A Strand Life Sciences Web Resource.
http://www.qsarworld.com/qsar-statistics-y-scrambling.php
