Introduction

Resampling methods are an indispensable tool in modern statistics. They involve repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model. For example, in order to estimate the variability of a linear regression fit, we can repeatedly draw different samples from the training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ. Such an approach may allow us to obtain information that would not be available from fitting the model only once using the original training sample.
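To make that idea concrete, here is a minimal sketch (not from the text) of the resampling procedure: it repeatedly draws samples with replacement from a simulated training set, refits a simple linear regression each time, and inspects how much the fitted slope varies across refits. The data-generating process and the use of NumPy's `polyfit` are illustrative assumptions.

```python
# Sketch: estimate the variability of a linear regression fit by
# refitting it on repeated resamples of the training data.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: y = 2 + 3x + noise
n = 100
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(scale=2.0, size=n)

slopes = []
for _ in range(1000):
    # Draw a sample (with replacement) from the training set and refit.
    idx = rng.integers(0, n, size=n)
    slope, intercept = np.polyfit(x[idx], y[idx], deg=1)
    slopes.append(slope)

# The spread of the refitted slopes indicates the variability of the fit,
# information a single fit on the original sample would not provide.
print("slope estimate:", np.mean(slopes), "+/-", np.std(slopes))
```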

Resampling approaches can be computationally expensive, because they involve fitting the same statistical method multiple times using different subsets of the training data. However, due to recent advances in computing power, the computational requirements of resampling methods generally are not prohibitive. In this chapter, we discuss two of the most commonly used resampling methods, cross-validation and the bootstrap. Both methods are important tools in the practical application of many statistical learning procedures. For example, cross-validation can be used to estimate the test error associated with a given statistical learning method in order to evaluate its performance, or to select the appropriate level of flexibility. The process of evaluating a model’s performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection. The bootstrap is used in several contexts, most commonly to provide a measure of accuracy of a parameter estimate or of a given statistical learning method.
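As a concrete illustration of model assessment and model selection with cross-validation, the following sketch assumes scikit-learn is available and uses 5-fold cross-validation to estimate the test error of polynomial regressions of different degrees. The simulated data set and the choice of polynomial degree as the "flexibility" knob are assumptions made for illustration only.

```python
# Sketch: k-fold cross-validation for estimating test error (model
# assessment) and for choosing a level of flexibility (model selection).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=200)

for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold CV: each fold is held out once and used to estimate test error.
    mse = -cross_val_score(model, x, y, cv=5,
                           scoring="neg_mean_squared_error")
    # The degree with the lowest estimated test error would be selected.
    print(degree, mse.mean())
```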

Training Error versus Test Error

This section outlines the critical difference between a model's performance on data it has already seen and its performance on new, unseen data.


Here is a breakdown of the concepts presented:

Training Error: This is the error a model produces when evaluated against the exact same observations it was trained on. It is easy to calculate but can be misleadingly optimistic.

Test Error: This is the true measure of a model's quality and ability to generalize. It is the average error the model makes when predicting a new observation that was not used during the training process.

The most important point is that the training error rate is often very different from the test error rate. Specifically, the training error can "dramatically underestimate" the test error.

This gap between low training error and high test error is the definition of overfitting. The model has not learned the general patterns in the data; instead, it has essentially "memorized" the training data, including its noise, and fails when presented with new examples.
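The following sketch (an illustration, not taken from the text) shows that gap numerically: a fully grown decision tree nearly memorizes a simulated training set, so its training error is close to zero while its test error remains much higher. The data set and model choices are assumptions.

```python
# Sketch: the gap between training error and test error for an overly
# flexible model (a fully grown decision tree) versus a constrained one.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.4, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (2, None):  # shallow tree vs. fully grown tree
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    # The fully grown tree drives the training MSE toward zero while its
    # test MSE stays much higher: the training error dramatically
    # underestimates the test error.
    print(depth, round(train_mse, 3), round(test_mse, 3))
```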

This concept is the primary motivation for using ensembles:

  1. Bagging (Bootstrap Aggregating): As discussed in the initial report, this method primarily reduces variance. High variance is the cause of overfitting, where a model is overly sensitive to the training data. By training many models on different samples of the data and averaging their results, bagging creates a final model that is more stable and generalizes better to new data (i.e., it lowers the test error).
  2. Stacking with K-Fold Cross-Validation: The K-Fold process described earlier is a direct solution to the problem outlined above (see the sketch after this list). If we trained the "meta-model" on the same data its base models were trained on, it would overfit. Instead, by training the meta-model only on "out-of-fold" predictions (predictions made on data the base models have not seen), we simulate its performance on new data and force it to learn to generalize, thereby reducing the final test error.
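Here is a minimal sketch of that out-of-fold mechanism, assuming scikit-learn is available; the base models, meta-model, and data are illustrative choices, not taken from the text. `cross_val_predict` fits each base model on K-1 folds and predicts the held-out fold, so every prediction the meta-model trains on was made on data the base model had not seen.

```python
# Sketch: generating out-of-fold predictions with K-fold cross-validation
# and training the stacking meta-model only on those predictions.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)

base_models = [DecisionTreeRegressor(max_depth=4, random_state=0),
               Ridge(alpha=1.0)]

# Each column holds one base model's out-of-fold predictions:
# cross_val_predict fits on K-1 folds and predicts the held-out fold.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5) for m in base_models
])

# The meta-model learns how to combine base predictions on unseen data,
# rather than on predictions the base models made for their own training set.
meta_model = LinearRegression().fit(oof, y)

# At prediction time, the base models are refit on all of the data.
fitted_bases = [m.fit(X, y) for m in base_models]
```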

[Figure]

The Core Problem (This Graph)