library(ISLR)
set.seed(1)
train=sample(392, 196)
Here we used the sample() function to split the set of observations into two halves, by selecting a random subset of 196 observations out of the original 392 observations.
length(train)
lm.fit = lm(mpg ~ horsepower, data = Auto, subset=train)
attach(Auto)
mean( (mpg - predict(lm.fit, Auto))[-train]^2) # MSE
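Because Auto is attached, mpg above refers to the full vector of 392 responses, and indexing with [-train] keeps only the 196 held-out observations. An equivalent computation that avoids attach(), shown here as a minor stylistic variant rather than part of the original lab, is:
mean((Auto$mpg - predict(lm.fit, Auto))[-train]^2)  # same validation MSE, no attach() needed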
Therefore, the estimated test MSE for the linear regression fit is 26.14. We can use the poly()
function to estimate the test error for the quadratic and cubic regressions.
lm.fit2 = lm(mpg ~ poly(horsepower, 2), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
lm.fit2 = lm(mpg ~ poly(horsepower, 3), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
These error rates are 19.82 and 19.78, respectively. If we choose a different training set instead, then we will obtain somewhat different errors on the validation set.
set.seed(2)
train = sample(392, 196)
lm.fit = lm(mpg ~ horsepower, data = Auto, subset=train)
mean( (mpg - predict(lm.fit, Auto))[-train]^2) # MSE
lm.fit2 = lm(mpg ~ poly(horsepower, 2), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
lm.fit2 = lm(mpg ~ poly(horsepower, 3), data=Auto, subset=train)
mean( (mpg - predict(lm.fit2, Auto))[-train]^2) # MSE
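These results are consistent with the first split: the quadratic fit improves substantially on the linear fit, while the cubic term adds little. Since the same validation-MSE expression is repeated for every model, it can be wrapped in a small helper; val.mse is a hypothetical name introduced here for illustration, not part of the original lab.
val.mse = function(fit) mean((Auto$mpg - predict(fit, Auto))[-train]^2)  # validation-set MSE
val.mse(lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train))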
The LOOCV estimate can be automatically computed for any generalized linear model using the glm() and cv.glm() functions. In the lab for Chapter 4, we used the glm() function to perform logistic regression by passing in the family="binomial" argument. But if we use glm() to fit a model without passing in the family argument, then it performs linear regression, just like the lm() function. So for instance,
glm.fit = glm(mpg~horsepower, data=Auto)
coef(glm.fit)
lm.fit = lm(mpg~horsepower, data=Auto)
coef(lm.fit)
yield identical linear regression models. In this lab, we will perform linear regression using the glm() function rather than the lm() function because the former can be used together with cv.glm(). The cv.glm() function is part of the boot library.
library(boot)
glm.fit = glm(mpg~horsepower, data=Auto)
cv.err=cv.glm(Auto,glm.fit)
cv.err$delta  # two numbers: the raw LOOCV estimate and a bias-corrected version
summary(cv.err)  # inspect the components of the cv.glm() output
We can repeat this procedure for increasingly complex polynomial fits, using a for loop that fits polynomial regressions of orders one to five and stores the corresponding LOOCV errors.
cv.error = rep(0, 5)
for (i in 1:5){
  glm.fit = glm(mpg~poly(horsepower, i), data=Auto)
  cv.error[i] = cv.glm(Auto, glm.fit)$delta[1]
}
cv.error
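As a quick check on these numbers, a plot of the LOOCV errors against polynomial degree (a sketch, not part of the original lab) makes the sharp drop from the linear to the quadratic fit easy to see:
plot(1:5, cv.error, type = "b", xlab = "Degree of polynomial", ylab = "LOOCV error")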
The cv.glm() function can also be used to implement k-fold CV. Below we use k = 10, a common choice for k, on the Auto data set. We once again set a random seed and initialize a vector in which we will store the CV errors corresponding to the polynomial fits of orders one to ten.
set.seed(17)
cv.error.10 = rep(0, 10)
for (i in 1:10){
  glm.fit = glm(mpg~poly(horsepower, i), data=Auto)
  cv.error.10[i] = cv.glm(Auto, glm.fit, K=10)$delta[1]
}
cv.error.10
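To compare the two approaches visually, here is a sketch (not part of the original lab) overlaying the 10-fold errors with the LOOCV errors computed earlier for degrees one to five:
plot(1:10, cv.error.10, type = "b", xlab = "Degree of polynomial", ylab = "CV error")
points(1:5, cv.error, type = "b", col = "red")  # LOOCV errors from above, degrees 1-5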
One of the great advantages of the bootstrap approach is that it can be applied in almost all situations. No complicated mathematical calculations are required. Performing a bootstrap analysis in R entails only two steps. First, we must create a function that computes the statistic of interest. Second, we use the boot() function, which is part of the boot library, to perform the bootstrap by repeatedly sampling observations from the data set with replacement.
The Portfolio data set in the ISLR package contains returns X and Y for two assets; the quantity of interest is α, the fraction invested in X that minimizes the variance of αX + (1−α)Y. The function alpha.fn() below computes the estimate of α based on the observations indexed by index.
alpha.fn = function(data, index) {
  X = data$X[index]
  Y = data$Y[index]
  # minimum-variance weight: (var(Y) - cov(X,Y)) / (var(X) + var(Y) - 2*cov(X,Y))
  return((var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 * cov(X, Y)))
}
head(Portfolio)
alpha.fn(Portfolio, 1:100)  # estimate alpha using all 100 observations
The next command uses the sample() function to randomly select 100 observations from the range 1 to 100, with replacement. This is equivalent to constructing a new bootstrap data set and recomputing α̂ based on the new data set.
set.seed(1)
alpha.fn(Portfolio, sample(100, 100, replace = TRUE))
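To see concretely what boot() automates, here is a manual sketch (not part of the original lab) that repeats this resampling step many times and takes the standard deviation of the resulting estimates; R and alpha.boot are names introduced here for illustration.
R = 1000
alpha.boot = rep(0, R)
for (r in 1:R) {
  alpha.boot[r] = alpha.fn(Portfolio, sample(100, 100, replace = TRUE))
}
sd(alpha.boot)  # bootstrap estimate of SE(alpha-hat)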
However, the boot() function automates this approach. Below we produce R = 1,000 bootstrap estimates for α.
boot(Portfolio, alpha.fn, R = 1000)
The bootstrap approach can be used to assess the variability of the coefficient estimates and predictions from a statistical learning method. Here we use the bootstrap approach in order to assess the variability of the estimates for β0 and β1, the intercept and slope terms for the linear regression model that uses horsepower to predict mpg in the Auto data set. We will compare the estimates obtained using the bootstrap to those obtained using the formulas for SE(β̂0) and SE(β̂1) described in Section 3.1.2.
boot.fn = function(data, index) {
  # fit using the data frame passed in (not Auto directly), restricted to the
  # observations given by index, so that boot() can resample correctly
  return(coef(lm(mpg ~ horsepower, data = data, subset = index)))
}
boot.fn(Auto, 1:392)
boot(Auto, boot.fn, 1000)
summary(lm(mpg~horsepower, data=Auto))$coef
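To place the two sets of standard errors side by side, here is a small sketch (not part of the original lab); boot.out is a name introduced for illustration, and boot.out$t holds the 1,000 bootstrap replicates of the two coefficients.
boot.out = boot(Auto, boot.fn, R = 1000)
se.boot = apply(boot.out$t, 2, sd)  # bootstrap SEs for the intercept and slope
se.formula = summary(lm(mpg ~ horsepower, data = Auto))$coef[, 2]  # formula-based SEs
rbind(bootstrap = se.boot, formula = se.formula)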