Resampling methods

The validation set approach

In [1]:
library(ISLR)
set.seed(1)
train=sample(392, 196)

We use the sample() function to split the set of observations into two halves, selecting a random subset of 196 observations out of the original 392 observations to serve as the training set.

In [4]:
length(train)
196
In [5]:
lm.fit = lm(mpg ~ horsepower, data = Auto, subset=train)
In [6]:
attach(Auto)
In [8]:
mean( (mpg - predict(lm.fit, Auto))[-train]^2) # MSE
26.1414211520072

Therefore, the estimated test MSE for the linear regression fit is 26.14. We can use the poly() function to estimate the test error for the quadratic and cubic regressions.
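The quantity computed above, and in the cells that follow, is the mean squared error on the held-out half of the data,

$$\mathrm{MSE}_{\mathrm{val}} = \frac{1}{n_{\mathrm{val}}} \sum_{i \notin \mathrm{train}} \bigl(y_i - \hat{f}(x_i)\bigr)^2,$$

where $\hat{f}$ is the model fit on the training observations and $n_{\mathrm{val}}$ is the number of validation observations (here 196).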

In [11]:
lm.fit2 = lm(mpg ~ poly(horsepower, 2), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
19.8225850408262
In [12]:
lm.fit2 = lm(mpg ~ poly(horsepower, 3), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
19.7825166856023

These error rates are 19.82 and 19.78, respectively. If we choose a different training set instead, then we will obtain somewhat different errors on the validation set.

In [13]:
set.seed(2)
train = sample(392, 196)
lm.fit = lm(mpg ~ horsepower, data = Auto, subset=train)
mean( (mpg - predict(lm.fit, Auto))[-train]^2) # MSE
23.2955851508862
In [14]:
lm.fit2 = lm(mpg ~ poly(horsepower, 2), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
18.9012408317778
In [15]:
lm.fit2 = lm(mpg ~ poly(horsepower, 3), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
19.2573982608642
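With this second split, the validation errors are 23.30, 18.90, and 19.26, again favoring the quadratic fit over the linear one, with little evidence that the cubic term helps. To see this variability across splits more systematically, here is a small sketch (not part of the original lab code) that repeats the split for a few seeds and polynomial degrees:

# Sketch: validation-set MSE for polynomial degrees 1-3 under three random splits
val.mse = matrix(NA, nrow = 3, ncol = 3,
                 dimnames = list(paste0("seed", 1:3), paste0("degree", 1:3)))
for (s in 1:3) {
    set.seed(s)
    train = sample(392, 196)
    for (d in 1:3) {
        fit = lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
        val.mse[s, d] = mean((Auto$mpg - predict(fit, Auto))[-train]^2)
    }
}
val.mse  # each row is one random split; each column a polynomial degree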

Leave-one-out cross-validation (LOOCV)

The LOOCV estimate can be automatically computed for any generalized linear model using the glm() and cv.glm() functions. In the lab for Chapter 4, we used the glm() function to perform logistic regression by passing in the family="binomial" argument. But if we use glm() to fit a model without passing in the family argument, then it performs linear regression, just like the lm() function. So for instance,

In [18]:
glm.fit = glm(mpg~horsepower, data=Auto)
coef(glm.fit)
(Intercept)
39.9358610211705
horsepower
-0.157844733353654
In [20]:
lm.fit = lm(mpg~horsepower, data=Auto)
coef(lm.fit)
(Intercept)
39.9358610211705
horsepower
-0.157844733353654

yield identical linear regression models. In this lab, we will perform linear regression using the glm() function rather than the lm() function, because glm() can be used together with cv.glm(). The cv.glm() function is part of the boot library.

In [21]:
library(boot)
glm.fit = glm(mpg~horsepower, data=Auto)
cv.err = cv.glm(Auto, glm.fit)
cv.err$delta  # delta[1]: raw cross-validation estimate of prediction error; delta[2]: bias-corrected version
  1. 24.2315135179292
  2. 24.2311440937562
In [23]:
summary(cv.err)
      Length Class  Mode   
call    3    -none- call   
K       1    -none- numeric
delta   2    -none- numeric
seed  626    -none- numeric
In [25]:
cv.error = rep(0, 5)
# LOOCV estimate of test MSE for polynomial fits of degree 1 to 5
for (i in 1:5){
    glm.fit = glm(mpg~poly(horsepower, i), data=Auto)
    cv.error[i] = cv.glm(Auto, glm.fit)$delta[1]
}
cv.error
  1. 24.2315135179292
  2. 19.2482131244897
  3. 19.334984064029
  4. 19.4244303104303
  5. 19.0332138547041

k-Fold cross-validation

The cv.glm() function can also be used to implement k-fold CV. Below we use k = 10, a common choice for k, on the Auto data set. We once again set a random seed and initialize a vector in which we will store the CV errors corresponding to the polynomial fits of orders one to ten.

In [26]:
set.seed(17)
cv.error.10 = rep(0, 10)
for (i in 1:10){
    glm.fit = glm(mpg~poly(horsepower, i), data=Auto)
    cv.error.10[i] = cv.glm(Auto, glm.fit, K=10)$delta[1]
}
cv.error.10
  1. 24.2051967567753
  2. 19.1892390663471
  3. 19.3066158967501
  4. 19.3379909004929
  5. 18.8791148363354
  6. 19.0210341885228
  7. 18.8960903802809
  8. 19.7120146188182
  9. 18.9514005667302
  10. 19.501959228555
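To compare these 10-fold estimates with the LOOCV errors computed earlier, here is a quick plotting sketch (not part of the original lab code; it assumes cv.error and cv.error.10 from the cells above are still in the workspace):

# Sketch: LOOCV (degrees 1-5) vs. 10-fold CV (degrees 1-10) error estimates
plot(1:10, cv.error.10, type = "b", xlab = "Degree of polynomial",
     ylab = "CV error", ylim = range(cv.error, cv.error.10))
lines(1:5, cv.error, type = "b", col = "red")
legend("topright", legend = c("10-fold CV", "LOOCV (degrees 1-5)"),
       col = c("black", "red"), lty = 1)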

The bootstrap

Estimating the accuracy of a statistic of interest

One of the great advantages of the bootstrap approach is that it can be applied in almost all situations. No complicated mathematical calculations are required. Performing a bootstrap analysis in R entails only two steps. First, we must create a function that computes the statistic of interest. Second, we use the boot() function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the data set with replacement.

In [30]:
alpha.fn = function(data, index) {
    X = data$X[index]
    Y = data$Y[index]
    return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
}
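The statistic returned by alpha.fn() is the usual estimate of the allocation $\alpha$ that minimizes the variance of a two-asset portfolio $\alpha X + (1-\alpha)Y$, computed from the sample variances and covariance of the observations selected by index:

$$\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\,\hat{\sigma}_{XY}}.$$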
In [31]:
alpha.fn(Portfolio, 1:100)
0.57583207459283
In [29]:
head(Portfolio)
           X          Y
  -0.8952509 -0.2349235
  -1.5624543 -0.8851760
  -0.4170899  0.2718880
   1.0443557 -0.7341975
  -0.3155684  0.8419834
  -1.7371238 -2.0371910

The next command uses the sample() function to randomly select 100 observations from the range 1 to 100, with replacement. This is equivalent to constructing a new bootstrap data set and recomputing $\hat{\alpha}$ based on the new data set.

In [32]:
set.seed(1)
alpha.fn(Portfolio,sample(100,100,replace=T))
0.596383302006392
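Repeating this command many times, recording each estimate, and taking the standard deviation of the results would give a bootstrap standard error for $\hat{\alpha}$ by hand. A minimal sketch of that manual loop (not part of the original lab code):

# Sketch: hand-rolled bootstrap of alpha-hat, using alpha.fn() and Portfolio from above
set.seed(1)
alpha.boot = replicate(1000, alpha.fn(Portfolio, sample(100, 100, replace = TRUE)))
sd(alpha.boot)  # bootstrap estimate of SE(alpha-hat), computed without boot()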

However, the boot() function automates this approach. Below we produce R = 1,000 bootstrap estimates for $\alpha$.

In [33]:
boot(Portfolio, alpha.fn, R = 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Portfolio, statistic = alpha.fn, R = 1000)


Bootstrap Statistics :
     original        bias    std. error
t1* 0.5758321 -7.315422e-05  0.08861826

Estimating the accuracy of a linear regression model

The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we use the bootstrap approach in order to assess the variability of the estimates for $\beta_0$ and $\beta_1$, the intercept and slope terms for the linear regression model that uses horsepower to predict mpg in the Auto data set. We will compare the estimates obtained using the bootstrap to those obtained using the formulas for $SE(\hat{\beta}_0)$ and $SE(\hat{\beta}_1)$ described in Section 3.1.2.

In [36]:
boot.fn = function(data, index) {
    return(coef(lm(mpg ~ horsepower, data = data, subset = index)))
}
In [37]:
boot.fn(Auto, 1:392)
(Intercept)
39.9358610211705
horsepower
-0.157844733353654
In [38]:
boot(Auto, boot.fn, 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Auto, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
      original        bias    std. error
t1* 39.9358610  0.0126152644 0.871267432
t2* -0.1578447 -0.0002691801 0.007540188
In [39]:
summary(lm(mpg~horsepower, data=Auto))$coef
              Estimate  Std. Error   t value      Pr(>|t|)
(Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81