In data science and machine learning, a model is often trained on one part of the data (the training set) and evaluated on the remaining part (the testing set). Performance metrics such as accuracy, precision, and recall can then be calculated on the testing set, giving some indication of how well the model performs on new data.
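As a minimal illustration of a single train/test split, the sketch below uses only the Python standard library; the one-feature toy dataset and midpoint-threshold "classifier" are hypothetical stand-ins for real data and a real model:

```python
import random

random.seed(0)
# Toy dataset (hypothetical): one feature x, label y = 1 when x > 0.5
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(100))]

# Single train/test split: 80% training, 20% testing
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

def fit(rows):
    """'Train' a trivial classifier: a threshold halfway between the classes."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (min(pos) + max(neg)) / 2

def accuracy(threshold, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

model = fit(train)
acc = accuracy(model, test)   # performance estimate on unseen data
```

The key point is that `acc` is computed only on observations the model never saw during training.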
However, this approach isn’t always desirable, especially when very small datasets are used. To make better use of the data, statisticians use so-called resampling techniques. Two very common resampling techniques are cross-validation and bootstrapping, and they will be briefly described here.
One commonly performed type of cross-validation is k-fold cross-validation. In k-fold cross-validation, the data is randomly partitioned into k subsets (folds). For example, taking k to be 10, the dataset will be split into 10 equal parts. The model is then trained on k-1 folds (9 folds in the case of 10-fold cross-validation), and performance metrics such as accuracy are computed from its predictions on the remaining fold. This process is repeated k times, with each fold serving exactly once as the test set, and the k performance metric scores can be averaged to form a cross-validation accuracy score.
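The k-fold procedure can be sketched in plain Python; the toy dataset and midpoint-threshold classifier below are hypothetical stand-ins for a real dataset and model:

```python
import random

random.seed(1)
# Toy dataset (hypothetical): label y = 1 when x > 0.5
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(100))]

def fit(rows):
    # Trivial classifier: a threshold halfway between the two classes
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (min(pos) + max(neg)) / 2

def accuracy(threshold, rows):
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

def k_fold_cv(rows, k=10):
    rows = rows[:]
    random.shuffle(rows)                      # random partition into k folds
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]                       # each fold is the test set exactly once
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        scores.append(accuracy(fit(train), test))
    return scores

scores = k_fold_cv(data, k=10)
cv_score = sum(scores) / len(scores)          # averaged cross-validation score
```

Note that every observation appears in exactly one test fold, so all of the data contributes to the final averaged score.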
K-fold cross-validation makes better use of the data than a single train/test split, since every observation is used exactly once as testing data. Moreover, averaging over k folds yields performance metrics that are more representative of the model's actual performance than those from a single split.
Other variations on cross-validation exist besides k-fold cross-validation. For example, leave-one-out cross-validation (LOOCV) is another commonly used technique. In LOOCV, the model is trained on all of the data except for one observation, and that held-out observation serves as the test set. Repeating this for each of the n observations yields n predictions, from which performance metrics such as accuracy can be computed.
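LOOCV can be sketched the same way; it is equivalent to k-fold cross-validation with k equal to n, the number of observations. As before, the toy dataset and threshold classifier are hypothetical:

```python
import random

random.seed(2)
# Toy dataset (hypothetical): label y = 1 when x > 0.5
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(50))]

def fit(rows):
    # Trivial classifier: a threshold halfway between the two classes
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (min(pos) + max(neg)) / 2

def loocv_accuracy(rows):
    correct = 0
    for i, (x, y) in enumerate(rows):
        train = rows[:i] + rows[i + 1:]        # all observations except one
        threshold = fit(train)
        correct += (x > threshold) == bool(y)  # predict the held-out observation
    return correct / len(rows)                 # accuracy over all n predictions

loo_acc = loocv_accuracy(data)
```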
Bootstrapping is another statistical technique that can be used instead of cross-validation. The bootstrapping process is as follows:
- The dataset is randomly resampled with replacement: observations are drawn from the dataset, and drawing an observation does not preclude it from being drawn again, so duplicate observations can appear in the sample. How many times is the dataset sampled with replacement? It is common practice to draw n times, where n is the number of observations in the dataset.
- The observations that were never drawn are collected into an “out-of-bag” set. This set is analogous to a testing set. (When n draws are made, roughly 36.8% of the observations, a fraction of about 1/e, end up out-of-bag on average.)
- The model is trained on the observations drawn from the dataset with resampling, and is tested on the out-of-bag set. Performance metrics can be calculated.
- The process is then repeated, starting from a fresh random resample of the dataset.
The procedure outlined above is repeated B times (where B is any positive integer), resulting in B separate out-of-bag performance metrics. These can be averaged if a single performance estimate is helpful to the statistician. Additionally, a histogram of the B metrics can be plotted to assess whether their distribution is approximately normal.
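The bootstrap loop above can be sketched as follows, again using only the standard library with a hypothetical toy dataset and threshold classifier:

```python
import random
import statistics

random.seed(3)
# Toy dataset (hypothetical): label y = 1 when x > 0.5
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(200))]

def fit(rows):
    # Trivial classifier: a threshold halfway between the two classes
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return (min(pos) + max(neg)) / 2

def accuracy(threshold, rows):
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

def bootstrap_scores(rows, B=100):
    n = len(rows)
    scores = []
    for _ in range(B):
        # Draw n observations with replacement (duplicates allowed)
        drawn = [random.randrange(n) for _ in range(n)]
        drawn_set = set(drawn)
        train = [rows[i] for i in drawn]
        oob = [rows[i] for i in range(n) if i not in drawn_set]
        if oob:   # the out-of-bag set is rarely empty, but guard anyway
            scores.append(accuracy(fit(train), oob))
    return scores

scores = bootstrap_scores(data, B=100)
mean_score = statistics.mean(scores)   # single point estimate over B repetitions
```

The list `scores` corresponds to the B out-of-bag metrics described above, and could be plotted as a histogram to inspect their distribution.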
Although bootstrapping is a very powerful statistical technique, it suffers from some disadvantages. The main disadvantage is that it is computationally expensive. Training and testing models many times can be infeasible, especially when datasets are large and/or computational resources are limited.