Resampling methods

The validation set approach

In [1]:
library(ISLR)
set.seed(1)
train=sample(392, 196)

We use the sample() function to split the set of observations into two halves, selecting a random subset of 196 observations out of the original 392 observations to serve as the training set.

In [4]:
length(train)
196
In [5]:
lm.fit = lm(mpg ~ horsepower, data = Auto, subset=train)
In [6]:
attach(Auto)
In [8]:
mean( (mpg - predict(lm.fit, Auto))[-train]^2) # MSE
26.1414211520072

Therefore, the estimated test MSE for the linear regression fit is 26.14. We can use the poly() function to estimate the test error for the quadratic and cubic regressions.
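The quantity computed above, and in the cells that follow, is the mean squared error on the held-out half of the data,

$$\mathrm{MSE}_{\mathrm{val}} = \frac{1}{n_{\mathrm{val}}} \sum_{i \notin \mathrm{train}} \bigl(y_i - \hat{f}(x_i)\bigr)^2,$$

where $\hat{f}$ is the model fit on the training observations and $n_{\mathrm{val}}$ is the number of validation observations (here 196).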

In [11]:
lm.fit2 = lm(mpg ~ poly(horsepower, 2), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
19.8225850408262
In [12]:
lm.fit2 = lm(mpg ~ poly(horsepower, 3), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
19.7825166856023

These error rates are 19.82 and 19.78, respectively. If we choose a different training set instead, then we will obtain somewhat different errors on the validation set.

In [13]:
set.seed(2)
train = sample(392, 196)
lm.fit = lm(mpg ~ horsepower, data = Auto, subset=train)
mean( (mpg - predict(lm.fit, Auto))[-train]^2) # MSE
23.2955851508862
In [14]:
lm.fit2 = lm(mpg ~ poly(horsepower, 2), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
18.9012408317778
In [15]:
lm.fit2 = lm(mpg ~ poly(horsepower, 3), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
19.2573982608642
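With this second split, the validation errors are 23.30, 18.90, and 19.26, again favoring the quadratic fit over the linear one, with little evidence that the cubic term helps. To see this variability across splits more systematically, here is a small sketch (not part of the original lab code) that repeats the split for a few seeds and polynomial degrees:

# Sketch: validation-set MSE for polynomial degrees 1-3 under three random splits
val.mse = matrix(NA, nrow = 3, ncol = 3,
                 dimnames = list(paste0("seed", 1:3), paste0("degree", 1:3)))
for (s in 1:3) {
    set.seed(s)
    train = sample(392, 196)
    for (d in 1:3) {
        fit = lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
        val.mse[s, d] = mean((Auto$mpg - predict(fit, Auto))[-train]^2)
    }
}
val.mse  # each row is one random split; each column a polynomial degree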

Leave-one-out cross-validation (LOOCV)

The LOOCV estimate can be automatically computed for any generalized linear model using the glm() and cv.glm() functions. In the lab for Chapter 4, we used the glm() function to perform logistic regression by passing in the family="binomial" argument. But if we use glm() to fit a model without passing in the family argument, then it performs linear regression, just like the lm() function. So for instance,

In [18]:
glm.fit = glm(mpg~horsepower, data=Auto)
coef(glm.fit)
(Intercept)
39.9358610211705
horsepower
-0.157844733353654
In [20]:
lm.fit = lm(mpg~horsepower, data=Auto)
coef(lm.fit)
(Intercept)
39.9358610211705
horsepower
-0.157844733353654

yield identical linear regression models. In this lab, we will perform linear regression using the glm() function rather than the lm() function, because glm() can be used together with cv.glm(). The cv.glm() function is part of the boot library.

In [21]:
library(boot)
glm.fit = glm(mpg~horsepower, data=Auto)
cv.err = cv.glm(Auto, glm.fit)
cv.err$delta  # delta[1]: raw cross-validation estimate of prediction error; delta[2]: bias-corrected version
  1. 24.2315135179292
  2. 24.2311440937562
In [23]:
summary(cv.err)
      Length Class  Mode   
call    3    -none- call   
K       1    -none- numeric
delta   2    -none- numeric
seed  626    -none- numeric
In [25]:
cv.error = rep(0, 5)
# LOOCV estimate of test MSE for polynomial fits of degree 1 to 5
for (i in 1:5){
    glm.fit = glm(mpg~poly(horsepower, i), data=Auto)
    cv.error[i] = cv.glm(Auto, glm.fit)$delta[1]
}
cv.error
  1. 24.2315135179292
  2. 19.2482131244897
  3. 19.334984064029
  4. 19.4244303104303
  5. 19.0332138547041

k-Fold cross-validation

The cv.glm() function can also be used to implement k-fold CV. Below we use k = 10, a common choice for k, on the Auto data set. We once again set a random seed and initialize a vector in which we will store the CV errors corresponding to the polynomial fits of orders one to ten.

In [26]:
set.seed(17)
cv.error.10 = rep(0, 10)
for (i in 1:10){
    glm.fit = glm(mpg~poly(horsepower, i), data=Auto)
    cv.error.10[i] = cv.glm(Auto, glm.fit, K=10)$delta[1]
}
cv.error.10
  1. 24.2051967567753
  2. 19.1892390663471
  3. 19.3066158967501
  4. 19.3379909004929
  5. 18.8791148363354
  6. 19.0210341885228
  7. 18.8960903802809
  8. 19.7120146188182
  9. 18.9514005667302
  10. 19.501959228555
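To compare these 10-fold estimates with the LOOCV errors computed earlier, here is a quick plotting sketch (not part of the original lab code; it assumes cv.error and cv.error.10 from the cells above are still in the workspace):

# Sketch: LOOCV (degrees 1-5) vs. 10-fold CV (degrees 1-10) error estimates
plot(1:10, cv.error.10, type = "b", xlab = "Degree of polynomial",
     ylab = "CV error", ylim = range(cv.error, cv.error.10))
lines(1:5, cv.error, type = "b", col = "red")
legend("topright", legend = c("10-fold CV", "LOOCV (degrees 1-5)"),
       col = c("black", "red"), lty = 1)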

The bootstrap

Estimating the accuracy of a statistic of interest

One of the great advantages of the bootstrap approach is that it can be applied in almost all situations. No complicated mathematical calculations are required. Performing a bootstrap analysis in R entails only two steps. First, we must create a function that computes the statistic of interest. Second, we use the boot() function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the data set with replacement.

In [30]:
alpha.fn = function(data, index) {
    X = data$X[index]
    Y = data$Y[index]
    return((var(Y)-cov(X,Y))/(var(X)+var(Y)-2*cov(X,Y)))
}
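The statistic returned by alpha.fn() is the usual estimate of the allocation $\alpha$ that minimizes the variance of a two-asset portfolio $\alpha X + (1-\alpha)Y$, computed from the sample variances and covariance of the observations selected by index:

$$\hat{\alpha} = \frac{\hat{\sigma}_Y^2 - \hat{\sigma}_{XY}}{\hat{\sigma}_X^2 + \hat{\sigma}_Y^2 - 2\,\hat{\sigma}_{XY}}.$$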
In [31]:
alpha.fn(Portfolio, 1:100)
0.57583207459283
In [29]:
head(Portfolio)
           X          Y
  -0.8952509 -0.2349235
  -1.5624543 -0.8851760
  -0.4170899  0.2718880
   1.0443557 -0.7341975
  -0.3155684  0.8419834
  -1.7371238 -2.0371910

The next command uses the sample() function to randomly select 100 observations from the range 1 to 100, with replacement. This is equivalent to constructing a new bootstrap data set and recomputing $\hat{\alpha}$ based on the new data set.

In [32]:
set.seed(1)
alpha.fn(Portfolio,sample(100,100,replace=T))
0.596383302006392
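Repeating this command many times, recording each estimate, and taking the standard deviation of the results would give a bootstrap standard error for $\hat{\alpha}$ by hand. A minimal sketch of that manual loop (not part of the original lab code):

# Sketch: hand-rolled bootstrap of alpha-hat, using alpha.fn() and Portfolio from above
set.seed(1)
alpha.boot = replicate(1000, alpha.fn(Portfolio, sample(100, 100, replace = TRUE)))
sd(alpha.boot)  # bootstrap estimate of SE(alpha-hat), computed without boot()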

However, the boot() function automates this approach. Below we produce R = 1,000 bootstrap estimates for $\alpha$.

In [33]:
boot(Portfolio, alpha.fn, R = 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Portfolio, statistic = alpha.fn, R = 1000)


Bootstrap Statistics :
     original        bias    std. error
t1* 0.5758321 -7.315422e-05  0.08861826

Estimating the accuracy of a linear regression model

The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we use the bootstrap approach in order to assess the variability of the estimates for $\beta_0$ and $\beta_1$, the intercept and slope terms for the linear regression model that uses horsepower to predict mpg in the Auto data set. We will compare the estimates obtained using the bootstrap to those obtained using the formulas for $SE(\hat{\beta}_0)$ and $SE(\hat{\beta}_1)$ described in Section 3.1.2.

In [36]:
boot.fn = function(data, index) {
    return(coef(lm(mpg ~ horsepower, data = data, subset = index)))
}
In [37]:
boot.fn(Auto, 1:392)
(Intercept)
39.9358610211705
horsepower
-0.157844733353654
In [38]:
boot(Auto, boot.fn, 1000)
ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = Auto, statistic = boot.fn, R = 1000)


Bootstrap Statistics :
      original        bias    std. error
t1* 39.9358610  0.0126152644 0.871267432
t2* -0.1578447 -0.0002691801 0.007540188
In [39]:
summary(lm(mpg~horsepower, data=Auto))$coef
              Estimate  Std. Error   t value      Pr(>|t|)
(Intercept) 39.9358610 0.717498656  55.65984 1.220362e-187
horsepower  -0.1578447 0.006445501 -24.48914  7.031989e-81