Training involves learning a model from a dataset with known values. However, a model that fits the training dataset very well may mis-predict new data points. Such over-fitting of the training data yields a model that does not generalize and is therefore not useful.
Therefore, an algorithm and its associated parameters must be validated before they are used to predict new data. The training data is segmented into two sets: one is used to train the model and the other to test it. Typically, validation is run with a variety of algorithms and parameters, and the results are monitored to choose the best combination. That combination can then be used to build a model from the entire training dataset and, subsequently, to predict for new data.
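The hold-out procedure above can be sketched in plain Python. The dataset, the toy one-nearest-neighbour learner, and the accuracy measure here are illustrative assumptions, not part of the original text; a real workflow would substitute actual descriptors and learning algorithms.

```python
import random

def holdout_split(data, test_fraction=0.3, seed=0):
    """Randomly segment a dataset into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

def accuracy(fit, predict, train, test):
    """Fit on the training set, then score on the held-out test set."""
    model = fit(train)
    correct = sum(1 for x, y in test if predict(model, x) == y)
    return correct / len(test)

# Illustrative data: (descriptor value, class label) pairs.
data = [(i, int(i >= 10)) for i in range(20)]

# A toy 1-nearest-neighbour "algorithm" standing in for a real learner.
def fit_1nn(train):
    return train

def predict_1nn(train, x):
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

train, test = holdout_split(data)
print(accuracy(fit_1nn, predict_1nn, train, test))
```

To compare algorithm/parameter combinations, the same split would be scored once per combination and the highest-scoring combination retained for the final model.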
Cross-validation is an important tool for avoiding over-fitting, since an over-fitted model will show low accuracy on validation. Validation also helps in choosing the right set of descriptors, an appropriate algorithm, and suitable parameters for a given dataset. It can be run on the same dataset with various algorithms, altering the parameters of each; the validation results are then examined to choose the best algorithm and parameters for the model.
Two types of validation are frequently used:
Leave One Out - All compounds except one are used to train the learning algorithm, and the resulting model is used to classify the held-out compound. The process is repeated for every compound in the dataset, and the average results are reported.
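A minimal leave-one-out sketch in plain Python; the (descriptor, class) data and the majority-class "learner" are illustrative assumptions standing in for a real algorithm:

```python
def fit_majority(train):
    """Toy learner: remember the most common class in the training set."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def predict_majority(majority_label, x):
    """Predict the majority class regardless of the descriptor value."""
    return majority_label

def leave_one_out(data, fit, predict):
    """Hold each compound out once, train on the rest, average the results."""
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        model = fit(train)
        if predict(model, x) == y:
            correct += 1
    return correct / len(data)

# Illustrative data: (descriptor, class) pairs -- three actives, one inactive.
data = [(0.1, 1), (0.2, 1), (0.3, 1), (0.9, 0)]
print(leave_one_out(data, fit_majority, predict_majority))  # → 0.75
```

The single inactive compound is always mis-predicted when it is held out, so the reported average is 3/4.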
N-fold - The compounds in the input data are randomly divided into N equal parts; N-1 parts are used for training, and the remaining part is used for testing. The process is repeated N times, with a different part used for testing in each iteration. Thus, each compound is used exactly once for testing and N-1 times for training, and the average results are reported. This whole process is then repeated as many times as specified by the ‘number of repeats’.
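The N-fold procedure with repeats can be sketched as follows; again, the dataset and the majority-class learner are illustrative assumptions:

```python
import random

def fit_majority(train):
    """Toy learner: remember the most common class in the training set."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def predict_majority(majority_label, x):
    return majority_label

def n_fold_cv(data, fit, predict, n_folds=5, n_repeats=3, seed=0):
    """Repeat randomized N-fold cross-validation and average fold accuracies."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)  # a new random division for each repeat
        folds = [shuffled[i::n_folds] for i in range(n_folds)]
        for k in range(n_folds):
            test_fold = folds[k]
            train = [p for j, fold in enumerate(folds) if j != k for p in fold]
            model = fit(train)
            correct = sum(1 for x, y in test_fold if predict(model, x) == y)
            scores.append(correct / len(test_fold))
    return sum(scores) / len(scores)

# Illustrative data in which every compound shares one class,
# so the majority-class learner scores perfectly in every fold.
data = [(i, 1) for i in range(10)]
print(n_fold_cv(data, fit_majority, predict_majority))  # → 1.0
```

Leave-one-out is the special case N = len(data) with a single repeat.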
Cite This As:
Dogra, Shaillay K. "Cross-validation." From QSARWorld--A Strand Life Sciences Web Resource.