In a typical machine learning experiment, we have data that (in the simplest variant) is split into three disjoint sets: training, validation, and test. The training set is used to estimate model parameters (e.g., neural network weights, regression coefficients, distribution parameters in Naive Bayes, etc.) by minimizing a chosen loss function or maximizing likelihood. The goal of learning is not to reproduce the training data perfectly, but to achieve good generalization, i.e., low error on new, unseen examples (assuming they come from the same distribution as the training data).
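A minimal sketch of such a three-way split, using scikit-learn's train_test_split applied twice; the synthetic data and the 60/20/20 proportions are illustrative assumptions, not a prescription:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: X holds features, y holds binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# First split off the test set (20%), then carve a validation set out of the rest.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0, stratify=y_trainval)
# Resulting proportions: 60% train, 20% validation, 20% test.
```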
In practice, the model may start to overfit during training: the error on the training set keeps decreasing, while performance on unseen data stops improving or even deteriorates. We therefore introduce a validation set, which is not used to fit parameters in a given training run, but rather for model selection and for choosing the configuration of the learning process (hyperparameters), as well as for procedures such as early stopping (halting training when the loss or metric on the validation set stops improving).
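A sketch of early stopping with a patience criterion is given below; scikit-learn's SGDClassifier serves only as a convenient model that can be trained one epoch at a time, and the patience value is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

def fit_with_early_stopping(X_train, y_train, X_val, y_val,
                            max_epochs=100, patience=5):
    """Train epoch by epoch; stop when validation loss has not improved
    for `patience` consecutive epochs, then restore the best parameters."""
    model = SGDClassifier(loss="log_loss", random_state=0)
    classes = np.unique(y_train)
    best_val_loss, best_params, stale = np.inf, None, 0

    for epoch in range(max_epochs):
        model.partial_fit(X_train, y_train, classes=classes)    # one pass over the training set
        val_loss = log_loss(y_val, model.predict_proba(X_val))  # loss on the held-out validation set

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_params = (model.coef_.copy(), model.intercept_.copy())  # checkpoint
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # no recent improvement on validation: stop training

    model.coef_, model.intercept_ = best_params  # restore the best checkpoint
    return model, best_val_loss

# Usage with the split from the previous sketch:
# model, val_loss = fit_with_early_stopping(X_train, y_train, X_val, y_val)
```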
Selecting the “best” model is usually done by minimizing the loss or maximizing a chosen metric (accuracy, F1, AUC, etc.) on the validation set. However, one must remember that intensive tuning (many trials of architectures, hyperparameters, random seeds, augmentations, preprocessing decisions) can lead to “overfitting to the validation set” and an overly optimistic evaluation. This is a form of selection bias introduced by the model selection process.
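A sketch of such validation-based selection, reusing the X_train, y_train, X_val, y_val arrays from the split above; the logistic-regression model and the grid of regularization strengths are illustrative assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Pick the regularization strength that maximizes validation accuracy.
# Every candidate is scored on the same validation set, so the best
# validation score is itself an optimistically biased estimate.
candidate_C = [0.01, 0.1, 1.0, 10.0, 100.0]  # illustrative hyperparameter grid
best_score, best_model = float("-inf"), None

for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_score, best_model = score, model

# `best_score` should not be reported as the final result; the untouched
# test set provides the less biased estimate for the selected model.
```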
The test set should be kept “for the end” and used only once, after the entire design process (model selection and hyperparameter tuning) is complete. The result on the test set serves as an estimate of the generalization error of the adopted protocol, but it is still an estimate subject to uncertainty (depending, among other factors, on the test set size and the variance of the problem).
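To make this uncertainty concrete, here is a rough sketch of a normal-approximation confidence interval for test accuracy, under the simplifying assumption of i.i.d. test examples and a 0/1 metric:

```python
import numpy as np

def accuracy_with_interval(y_true, y_pred, z=1.96):
    """Test accuracy with an approximate 95% normal-approximation interval.
    The half-width shrinks as 1/sqrt(n_test)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    acc = np.mean(y_true == y_pred)
    half_width = z * np.sqrt(acc * (1 - acc) / n)  # standard error of a proportion
    return acc, (acc - half_width, acc + half_width)

# For example, at 85% accuracy a test set of 200 examples gives roughly
# a +/- 5 percentage-point interval, while 20,000 examples give roughly +/- 0.5.
```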
When data is scarce, or when we require a more stable evaluation, resampling procedures are employed. Classically, k-fold cross-validation (with leave-one-out as the extreme case) and bootstrap methods are used for prediction error estimation. The bootstrap (e.g., in the .632/.632+ variant) is sometimes treated as a “smoothed” alternative to CV, with a different bias–variance trade-off. If cross-validation serves both for hyperparameter tuning and for quality assessment, a nested cross-validation procedure is necessary to avoid optimistic bias. In this approach, the inner loop tunes hyperparameters, while the outer loop estimates the generalization error.
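A minimal nested cross-validation sketch with scikit-learn, reusing X and y from the first snippet; the SVC model, the parameter grid, and the 3-fold inner / 5-fold outer split are illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

# Inner loop: GridSearchCV tunes hyperparameters on each outer training fold.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
tuned_model = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Outer loop: estimates the generalization error of the whole
# "tune hyperparameters, then fit" procedure.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_scores = cross_val_score(tuned_model, X, y, cv=outer_cv)

print(f"estimated accuracy: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```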
Empirical research consists of designing models and learning procedures and comparing them within a controlled protocol. The theoretical strand is complementary: it provides formal frameworks in which (under certain assumptions) one can obtain guarantees such as “with high probability, the true risk will not exceed a certain value,” depending on the empirical risk and a measure of solution complexity. In particular, PAC-Bayes theory provides probabilistic upper bounds on the generalization error (often for stochastic/Gibbs classifiers) as a function of the empirical error and a complexity term expressed, among others, by the Kullback-Leibler divergence KL(ρ ‖ π) between the posterior distribution ρ and the prior π. This does not imply a simple deterministic relationship such as err_test = err_val; the relationship between validation and test errors depends on the selection protocol and can be biased by the model selection process.
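As an illustration, one commonly cited (McAllester-style) PAC-Bayes bound states that, with probability at least 1 − δ over an i.i.d. sample of size n, simultaneously for all posteriors ρ over hypotheses,

$$
\mathbb{E}_{h \sim \rho}\, R(h) \;\le\; \mathbb{E}_{h \sim \rho}\, \hat{R}(h) \;+\; \sqrt{\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln\frac{2\sqrt{n}}{\delta}}{2n}},
$$

where R and R̂ denote the true and empirical risks and π is a prior fixed before seeing the data; tighter variants replace the square-root term with an inverted binary KL divergence.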