library(MASS)
library(ISLR)
The MASS library contains the Boston data set, which records medv (median house value) for 506 neighborhoods around Boston. We will seek to predict medv using 13 predictors such as rm (average number of rooms per house), age (average age of houses), and lstat (percent of households with low socioeconomic status).
fix(Boston)
names(Boston)
lm.fit = lm(medv ~ lstat, data=Boston)
lm.fit
summary(lm.fit)
names(lm.fit)
coef(lm.fit)
We can obtain a confidence interval for the coefficient estimates using the confint() function.
confint(lm.fit)
predict(lm.fit, data.frame(lstat=c(5,10,15)), interval="confidence")
predict(lm.fit, data.frame(lstat=c(5,10,15)), interval="prediction")
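As a sanity check on what confint() reports, the sketch below (using simulated data rather than Boston, so it runs on its own) reproduces the default 95% interval by hand as estimate ± t-quantile × standard error:

```r
# Minimal sketch: confint() matches estimate +/- t * SE computed manually.
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

ci   <- confint(fit)                      # 95% intervals from R
est  <- coef(summary(fit))                # estimates and standard errors
t975 <- qt(0.975, df = fit$df.residual)   # t critical value at 97.5%
manual <- cbind(est[, "Estimate"] - t975 * est[, "Std. Error"],
                est[, "Estimate"] + t975 * est[, "Std. Error"])
all.equal(unname(ci), unname(manual))     # TRUE
```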
attach(Boston)
plot(lstat, medv)
abline(lm.fit)
par(mfrow=c(2,2))
plot(lm.fit)
plot(predict(lm.fit), residuals(lm.fit))
plot(predict(lm.fit), rstudent(lm.fit))
plot(hatvalues(lm.fit))
which.max(hatvalues(lm.fit))
The which.max() function identifies the index of the largest element of a vector. In this case, it tells us which observation has the largest leverage statistic.
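A quick base-R illustration of the point above: which.max() returns the index of the (first) maximum, not the maximum itself.

```r
v <- c(3, 7, 2, 7)
which.max(v)  # 2: index of the first occurrence of the largest element
max(v)        # 7: the value itself
```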
lm.fit = lm(medv ~ lstat + age, data=Boston)
summary(lm.fit)
The Boston data set contains 13 variables, and so it would be cumbersome to have to type all of these in order to perform a regression using all of the predictors. Instead, we can use the following short-hand:
lm.fit = lm(medv~., data=Boston)
summary(lm.fit)
?summary.lm
summary(lm.fit)$r.squared
What if we would like to perform a regression using all of the variables but one? For example, in the above regression output, age has a high p-value. So we may wish to run a regression excluding this predictor. The following syntax results in a regression using all predictors except age.
lm.fit1 = lm(medv ~ . - age, data=Boston)
summary(lm.fit1)
Alternatively, the update() function can be used.
lm.fit2 = update(lm.fit, ~.-age)
summary(lm.fit2)
The syntax lstat:black tells R to include an interaction term between lstat and black. The syntax lstat*age simultaneously includes lstat, age, and the interaction term lstat×age as predictors; it is a shorthand for lstat+age+lstat:age.
summary(lm(medv~lstat*age, data=Boston))
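The shorthand equivalence can be verified directly. This sketch (simulated data, so it does not depend on Boston) fits both formulas and confirms they produce identical coefficients:

```r
# x1*x2 in a formula expands to x1 + x2 + x1:x2 (main effects + interaction).
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
y  <- 1 + x1 + x2 + x1 * x2 + rnorm(50)

f1 <- lm(y ~ x1 * x2)
f2 <- lm(y ~ x1 + x2 + x1:x2)
all.equal(coef(f1), coef(f2))  # TRUE: the two formulas fit the same model
```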
The lm() function can also accommodate non-linear transformations of the predictors. For instance, given a predictor X, we can create a predictor X^2 using I(X^2). The function I() is needed since ^ has a special meaning in a formula; wrapping as we do allows the standard usage in R, which is to raise X to the power 2. We now perform a regression of medv onto lstat and lstat^2.
lm.fit2 = lm(medv ~ lstat + I(lstat^2))
summary(lm.fit2)
The near-zero p-value associated with the quadratic term suggests that it leads to an improved model. We use the anova() function to further quantify the extent to which the quadratic fit is superior to the linear fit.
lm.fit = lm(medv ~ lstat, data=Boston)
anova(lm.fit, lm.fit2)
Here Model 1 represents the linear submodel containing only one predictor, lstat, while Model 2 corresponds to the larger quadratic model that has two predictors, lstat and lstat^2. The anova() function performs a hypothesis test comparing the two models. The null hypothesis is that the two models fit the data equally well, and the alternative hypothesis is that the full model is superior. Here the F-statistic is 135 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors lstat and lstat^2 is far superior to the model that only contains the predictor lstat. This is not surprising, since earlier we saw evidence for non-linearity in the relationship between medv and lstat.
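The F-statistic anova() reports can be computed by hand from the two residual sums of squares. A sketch with simulated quadratic data (so it stands alone) shows F = ((RSS1 − RSS2)/q) / (RSS2/df2), where q is the number of extra parameters and df2 the residual degrees of freedom of the larger model:

```r
set.seed(1)
x <- rnorm(100)
y <- 1 + x + 0.5 * x^2 + rnorm(100)
fit1 <- lm(y ~ x)            # smaller (linear) model
fit2 <- lm(y ~ x + I(x^2))   # larger (quadratic) model

rss1 <- sum(resid(fit1)^2)
rss2 <- sum(resid(fit2)^2)
q    <- fit1$df.residual - fit2$df.residual   # 1 extra parameter
Fman <- ((rss1 - rss2) / q) / (rss2 / fit2$df.residual)

all.equal(Fman, anova(fit1, fit2)$F[2])  # TRUE: matches anova()'s F
```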
par(mfrow=c(2,2))
plot(lm.fit2)
In order to create a cubic fit, we can include a predictor of the form I(X^3). However, this approach can start to get cumbersome for higher-order polynomials. A better approach involves using the poly() function to create the polynomial within lm().
lm.fit5 = lm(medv ~ poly(lstat, 5))
summary(lm.fit5)
par(mfrow=c(2,2))
plot(lm.fit5)
This suggests that including additional polynomial terms, up to fifth order, leads to an improvement in the model fit! However, further investigation of the data reveals that no polynomial terms beyond fifth order have significant p-values in a regression fit.
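One way to see how such a check works: because poly() generates orthogonal polynomials, each coefficient's t-test gauges its own degree, so the summary shows directly which orders carry signal. A sketch with simulated cubic data (not the Boston data) illustrates the idea:

```r
set.seed(1)
x <- rnorm(200)
y <- 1 + x - 2 * x^2 + 0.5 * x^3 + rnorm(200)

fit7 <- lm(y ~ poly(x, 7))          # deliberately over-fit to degree 7
round(coef(summary(fit7))[, "Pr(>|t|)"], 4)
# degrees 1-3 get tiny p-values; degrees 4-7 typically look insignificant
```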
summary(lm(medv~log(rm), data=Boston))
We will now examine the Carseats data, which is part of the ISLR library. We will attempt to predict Sales (child car seat sales) in 400 locations based on a number of predictors.
names(Carseats)
head(Carseats)
lm.fit=lm(Sales~.+Income:Advertising+Price:Age,data=Carseats)
summary(lm.fit)
attach(Carseats)
contrasts(ShelveLoc)
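For context on what contrasts() reports, a small base-R illustration (using a toy factor, not the Carseats column itself): for an unordered factor, R's default is treatment coding, with the first level as the baseline that receives all zeros.

```r
shelf <- factor(c("Bad", "Good", "Medium"))
contrasts(shelf)
#        Good Medium
# Bad       0      0    <- baseline level
# Good      1      0
# Medium    0      1
```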