We will begin by examining some numerical and graphical summaries of the Smarket data, which is part of the ISLR library. This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, Lag1 through Lag5. We have also recorded Volume (the number of shares traded on the previous day, in billions), Today (the percentage return on the date in question), and Direction (whether the market was Up or Down on this date).

In [14]:
library(ISLR)
In [2]:
names(Smarket)
  1. 'Year'
  2. 'Lag1'
  3. 'Lag2'
  4. 'Lag3'
  5. 'Lag4'
  6. 'Lag5'
  7. 'Volume'
  8. 'Today'
  9. 'Direction'
In [3]:
head(Smarket)
  Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
1 2001  0.381 -0.192 -2.624 -1.055  5.010 1.1913  0.959        Up
2 2001  0.959  0.381 -0.192 -2.624 -1.055 1.2965  1.032        Up
3 2001  1.032  0.959  0.381 -0.192 -2.624 1.4112 -0.623      Down
4 2001 -0.623  1.032  0.959  0.381 -0.192 1.2760  0.614        Up
5 2001  0.614 -0.623  1.032  0.959  0.381 1.2057  0.213        Up
6 2001  0.213  0.614 -0.623  1.032  0.959 1.3491  1.392        Up
In [4]:
dim(Smarket)
  1. 1250
  2. 9
In [5]:
summary(Smarket)
      Year           Lag1                Lag2                Lag3          
 Min.   :2001   Min.   :-4.922000   Min.   :-4.922000   Min.   :-4.922000  
 1st Qu.:2002   1st Qu.:-0.639500   1st Qu.:-0.639500   1st Qu.:-0.640000  
 Median :2003   Median : 0.039000   Median : 0.039000   Median : 0.038500  
 Mean   :2003   Mean   : 0.003834   Mean   : 0.003919   Mean   : 0.001716  
 3rd Qu.:2004   3rd Qu.: 0.596750   3rd Qu.: 0.596750   3rd Qu.: 0.596750  
 Max.   :2005   Max.   : 5.733000   Max.   : 5.733000   Max.   : 5.733000  
      Lag4                Lag5              Volume           Today          
 Min.   :-4.922000   Min.   :-4.92200   Min.   :0.3561   Min.   :-4.922000  
 1st Qu.:-0.640000   1st Qu.:-0.64000   1st Qu.:1.2574   1st Qu.:-0.639500  
 Median : 0.038500   Median : 0.03850   Median :1.4229   Median : 0.038500  
 Mean   : 0.001636   Mean   : 0.00561   Mean   :1.4783   Mean   : 0.003138  
 3rd Qu.: 0.596750   3rd Qu.: 0.59700   3rd Qu.:1.6417   3rd Qu.: 0.596750  
 Max.   : 5.733000   Max.   : 5.73300   Max.   :3.1525   Max.   : 5.733000  
 Direction 
 Down:602  
 Up  :648  
In [6]:
pairs(Smarket)

The cor() function produces a matrix that contains all of the pairwise correlations among the predictors in a data set. The first command below gives an error message because the Direction variable is qualitative.

In [7]:
cor(Smarket)
Error in cor(Smarket): 'x' must be numeric
Traceback:

1. cor(Smarket)
2. stop("'x' must be numeric")
In [8]:
cor(Smarket[, -9])
             Year         Lag1         Lag2         Lag3         Lag4         Lag5      Volume        Today
Year   1.00000000  0.029699649  0.030596422  0.033194581  0.035688718  0.029787995  0.53900647  0.030095229
Lag1   0.02969965  1.000000000 -0.026294328 -0.010803402 -0.002985911 -0.005674606  0.04090991 -0.026155045
Lag2   0.03059642 -0.026294328  1.000000000 -0.025896670 -0.010853533 -0.003557949 -0.04338321 -0.010250033
Lag3   0.03319458 -0.010803402 -0.025896670  1.000000000 -0.024051036 -0.018808338 -0.04182369 -0.002447647
Lag4   0.03568872 -0.002985911 -0.010853533 -0.024051036  1.000000000 -0.027083641 -0.04841425 -0.006899527
Lag5   0.02978799 -0.005674606 -0.003557949 -0.018808338 -0.027083641  1.000000000 -0.02200231 -0.034860083
Volume 0.53900647  0.040909908 -0.043383215 -0.041823686 -0.048414246 -0.022002315  1.00000000  0.014591823
Today  0.03009523 -0.026155045 -0.010250033 -0.002447647 -0.006899527 -0.034860083  0.01459182  1.000000000
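
Equivalently, the qualitative column can be dropped by type rather than by position; a quick sketch (not part of the original lab):

# Keep only the numeric columns, so the call does not depend on
# Direction being the ninth column.
cor(Smarket[, sapply(Smarket, is.numeric)])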
In [9]:
heatmap(cor(Smarket[, -9]))
In [10]:
attach(Smarket)
In [11]:
plot(Volume)

Logistic regression

Next, we will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of lm(), except that we must pass in the argument family=binomial in order to tell R to run a logistic regression rather than some other type of generalized linear model.

In [12]:
glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial)
In [13]:
summary(glm.fit)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
    Volume, family = binomial, data = Smarket)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.446  -1.203   1.065   1.145   1.326  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3

The smallest p-value here is associated with Lag1. The negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.
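
The same conclusion can be reached programmatically (a quick sketch, not part of the original lab):

# P-values occupy the fourth column of the coefficient matrix;
# drop the intercept row before locating the minimum.
pvals = summary(glm.fit)$coef[-1, 4]
names(which.min(pvals))   # "Lag1"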

In [15]:
coef(glm.fit)
(Intercept)
-0.12600025655927
Lag1
-0.0730737458900263
Lag2
-0.0423013440073083
Lag3
0.0110851083796763
Lag4
0.0093589383702788
Lag5
0.0103130684758178
Volume
0.135440658859162
In [16]:
summary(glm.fit)$coef
                Estimate Std. Error    z value  Pr(>|z|)
(Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
Lag1        -0.073073746 0.05016739 -1.4565986 0.1452272
Lag2        -0.042301344 0.05008605 -0.8445733 0.3983491
Lag3         0.011085108 0.04993854  0.2219750 0.8243333
Lag4         0.009358938 0.04997413  0.1872757 0.8514445
Lag5         0.010313068 0.04951146  0.2082966 0.8349974
Volume       0.135440659 0.15835970  0.8552723 0.3924004
In [17]:
summary(glm.fit)$coef[,4]
(Intercept)
0.600698319413347
Lag1
0.145227211568647
Lag2
0.398349095427021
Lag3
0.824333346101535
Lag4
0.851444506926453
Lag5
0.83499739049983
Volume
0.392400433202423

The predict() function can be used to predict the probability that the market will go up, given values of the predictors. The type="response" option tells R to output probabilities of the form P(Y = 1|X), as opposed to other information such as the logit. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model. Here we have printed only the first ten probabilities. We know that these values correspond to the probability of the market going up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.
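
As a brief aside (a sketch, not part of the original lab), predict() with no type argument returns the linear predictor on the logit scale, and plogis() converts it back to a probability:

# The default predict() output for a glm is the logit, log(p / (1 - p));
# plogis() applies the inverse logit and recovers P(Y = 1 | X).
glm.logit = predict(glm.fit)
head(plogis(glm.logit))   # agrees with predict(glm.fit, type = "response")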

In [21]:
contrasts(Smarket$Direction)
     Up
Down  0
Up    1
In [22]:
glm.probs = predict(glm.fit, type="response")
In [23]:
glm.probs[1:10]
1
0.507084133395401
2
0.481467878454591
3
0.481138835214201
4
0.515222355813022
5
0.510781162691538
6
0.506956460534911
7
0.492650874187038
8
0.509229158207377
9
0.517613526170958
10
0.488837779771376
In [24]:
glm.pred = rep("Down", 1250)
glm.pred[glm.probs>0.5] = "Up"
In [25]:
table(glm.pred, Direction)
        Direction
glm.pred Down  Up
    Down  145 141
    Up    457 507
In [26]:
(507+145)/1250
0.5216
In [27]:
mean(glm.pred == Direction)
0.5216

At first glance, it appears that the logistic regression model is working a little better than random guessing. However, this result is misleading because we trained and tested the model on the same set of 1,250 observations. In other words, 100 − 52.2 = 47.8% is the training error rate. As we have seen previously, the training error rate is often overly optimistic—it tends to underestimate the test error rate. In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model’s performance not on the data that we used to fit the model, but rather on days in the future for which the market’s movements are unknown.
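
The training error rate itself can be computed in one line (a quick sketch):

# The training error rate is the complement of the training accuracy.
mean(glm.pred != Direction)   # 1 - 0.5216 = 0.4784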

To implement this strategy, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.

In [28]:
train = (Year < 2005)
Smarket.2005 = Smarket[!train, ]
dim(Smarket.2005)
Direction.2005 = Direction[!train]
  1. 252
  2. 9
In [30]:
ls()
  1. 'Direction.2005'
  2. 'glm.fit'
  3. 'glm.pred'
  4. 'glm.probs'
  5. 'Smarket.2005'
  6. 'train'
In [32]:
glm.fit = glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Smarket, family=binomial, subset=train)
In [33]:
glm.probs = predict(glm.fit, Smarket.2005, type="response")
In [34]:
glm.pred = rep("Down", 252)
glm.pred[glm.probs > 0.5] = "Up"
In [40]:
table(glm.pred, Direction.2005)
        Direction.2005
glm.pred Down  Up
    Down   35  35
    Up     76 106
In [36]:
mean(glm.pred == Direction.2005)
0.48015873015873
In [37]:
mean(glm.pred != Direction.2005)
0.51984126984127

The results are rather disappointing: the test error rate of about 52% is worse than random guessing! We recall that the logistic regression model had very underwhelming p-values associated with all of the predictors, and that the smallest p-value, though not very small, corresponded to Lag1. Perhaps by removing the variables that appear not to be helpful in predicting Direction, we can obtain a more effective model. After all, using predictors that have no relationship with the response tends to cause a deterioration in the test error rate (since such predictors cause an increase in variance without a corresponding decrease in bias), and so removing such predictors may in turn yield an improvement. Below we have refit the logistic regression using just Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model.

In [39]:
glm.fit = glm(Direction~Lag1+Lag2, data=Smarket, family=binomial, subset=train)
glm.probs = predict(glm.fit, Smarket.2005, type="response")
glm.pred = rep("Down", 252)
glm.pred[glm.probs > 0.5] = "Up"
mean(glm.pred == Direction.2005)
0.55952380952381

Suppose that we want to predict the returns associated with particular values of Lag1 and Lag2. In particular, we want to predict Direction on a day when Lag1 and Lag2 equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and −0.8. We do this using the predict() function.

In [41]:
predict(glm.fit, newdata=data.frame(Lag1=c(1.2, 1.5), Lag2=c(1.1, -0.8)), type="response")
1
0.479146239171912
2
0.496093872956532

Linear Discriminant Analysis

Now we will perform LDA on the Smarket data. In R, we fit an LDA model using the lda() function, which is part of the MASS library. Notice that the syntax for the lda() function is identical to that of lm(), and to that of glm() except for the absence of the family option. We fit the model using only the observations before 2005.

In [42]:
library(MASS)
lda.fit = lda(Direction~Lag1+Lag2, data=Smarket, subset=train)
In [43]:
lda.fit
Call:
lda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293
In [44]:
plot(lda.fit)
In [45]:
lda.pred = predict(lda.fit, Smarket.2005)
In [46]:
names(lda.pred)
  1. 'class'
  2. 'posterior'
  3. 'x'

The predict() function returns a list with three elements. The first element, class, contains LDA’s predictions about the movement of the market. The second element, posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, computed from (4.10). Finally, x contains the linear discriminants, described earlier.

In [51]:
lda.pred$class[1:10]
  1. Up
  2. Up
  3. Up
  4. Up
  5. Up
  6. Up
  7. Up
  8. Up
  9. Up
  10. Up
Levels:
  1. 'Down'
  2. 'Up'
In [50]:
lda.pred$x[1:10]
  1. 0.0829309552084355
  2. 0.591141023027852
  3. 1.16723063295097
  4. 0.833350216682104
  5. -0.0379289182430164
  6. -0.0874314155399546
  7. -0.145127187879209
  8. 0.217013240543786
  9. 0.0587379182873576
  10. 0.350686418624796
In [52]:
mean(lda.pred$class == Direction.2005)
0.55952380952381
In [53]:
lda.pred$posterior
          Down        Up
999  0.4901792 0.5098208
1000 0.4792185 0.5207815
1001 0.4668185 0.5331815
1002 0.4740011 0.5259989
1003 0.4927877 0.5072123
1004 0.4938562 0.5061438
1005 0.4951016 0.5048984
1006 0.4872861 0.5127139
1007 0.4907013 0.5092987
1008 0.4844026 0.5155974
1009 0.4906963 0.5093037
1010 0.5119988 0.4880012
1011 0.4895152 0.5104848
1012 0.4706761 0.5293239
1013 0.4744593 0.5255407
1014 0.4799583 0.5200417
1015 0.4935775 0.5064225
1016 0.5030894 0.4969106
1017 0.4978806 0.5021194
1018 0.4886331 0.5113669
1019 0.5006568 0.4993432
1020 0.5108735 0.4891265
1021 0.5039925 0.4960075
1022 0.4916335 0.5083665
1023 0.5041772 0.4958228
1024 0.5026751 0.4973249
1025 0.4914043 0.5085957
1026 0.4805964 0.5194036
1027 0.4882718 0.5117282
1028 0.5062187 0.4937813
...
1221 0.4901606 0.5098394
1222 0.5069730 0.4930270
1223 0.5084764 0.4915236
1224 0.5041288 0.4958712
1225 0.5048299 0.4951701
1226 0.5023879 0.4976121
1227 0.4986903 0.5013097
1228 0.4824758 0.5175242
1229 0.4825469 0.5174531
1230 0.4831600 0.5168400
1231 0.5017497 0.4982503
1232 0.5058708 0.4941292
1233 0.4890321 0.5109679
1234 0.4911052 0.5088948
1235 0.4864250 0.5135750
1236 0.4847062 0.5152938
1237 0.4944890 0.5055110
1238 0.4962261 0.5037739
1239 0.5005702 0.4994298
1240 0.5039068 0.4960932
1241 0.4946376 0.5053624
1242 0.4864366 0.5135634
1243 0.4807022 0.5192978
1244 0.4851439 0.5148561
1245 0.4951734 0.5048266
1246 0.5005893 0.4994107
1247 0.4972210 0.5027790
1248 0.4791988 0.5208012
1249 0.4831673 0.5168327
1250 0.4892591 0.5107409
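
Notice that the two posterior probabilities in each row sum to one, and that lda.pred$class simply reports whichever class has the larger posterior. A quick consistency check (a sketch, not part of the original lab):

# With two classes, predicting "Up" exactly when its posterior exceeds 0.5
# should reproduce the labels in lda.pred$class.
lda.class = ifelse(lda.pred$posterior[, "Up"] > 0.5, "Up", "Down")
all(lda.class == as.character(lda.pred$class))   # expected: TRUE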

Quadratic Discriminant Analysis

QDA is implemented in R using the qda() function, which is also part of the MASS library. The qda() syntax is identical to that of lda().

In [54]:
qda.fit = qda(Direction~Lag1+Lag2, data=Smarket, subset=train)
qda.fit
Call:
qda(Direction ~ Lag1 + Lag2, data = Smarket, subset = train)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

The output contains the group means. But it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. The predict() function works in exactly the same fashion as for LDA.

In [55]:
qda.class = predict(qda.fit, Smarket.2005)$class
In [56]:
table(qda.class, Direction.2005)
         Direction.2005
qda.class Down  Up
     Down   30  20
     Up     81 121
In [57]:
mean(qda.class == Direction.2005)
0.599206349206349

Interestingly, the QDA predictions are accurate almost 60% of the time, even though the 2005 data was not used to fit the model. This level of accuracy is quite impressive for stock market data, which is known to be quite hard to model accurately. This suggests that the quadratic form assumed by QDA may capture the true relationship more accurately than the linear forms assumed by LDA and logistic regression. However, we recommend evaluating this method’s performance on a larger test set before betting that this approach will consistently beat the market!

K-Nearest Neighbors

We will now perform KNN using the knn() function, which is part of the class library. This function works rather differently from the other model-fitting functions that we have encountered thus far. Rather than a two-step approach in which we first fit the model and then we use the model to make predictions, knn() forms predictions using a single command. The function requires four inputs.

  1. A matrix containing the predictors associated with the training data, labeled train.X below.
  2. A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.
  3. A vector containing the class labels for the training observations, labeled train.Direction below.
  4. A value for K, the number of nearest neighbors to be used by the classifier.
In [59]:
library(class)
train.X = cbind(Lag1, Lag2)[train,]
test.X = cbind(Lag1, Lag2)[!train,]
train.Direction = Direction[train]
In [62]:
set.seed(10)  # ties among nearest neighbors are broken at random, so fix the seed for reproducibility
knn.pred = knn(train.X, test.X, train.Direction, k = 1)
table(knn.pred, Direction.2005)
mean(knn.pred == Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   43 58
    Up     68 83
0.5

The results using K = 1 are not very good, since only 50% of the observations are correctly predicted. Of course, it may be that K = 1 results in an overly flexible fit to the data. Below, we repeat the analysis using K = 3.

In [64]:
knn.pred = knn(train.X, test.X, train.Direction, k = 3)
table(knn.pred, Direction.2005)
mean(knn.pred == Direction.2005)
        Direction.2005
knn.pred Down Up
    Down   48 55
    Up     63 86
0.531746031746032

The results have improved slightly. But increasing K further turns out to provide no further improvements. It appears that for this data, QDA provides the best results of the methods that we have examined so far.
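
One quick way to see this (a sketch, not part of the original lab) is to loop over several values of K and record the held-out accuracy of each fit:

# Refit KNN for a range of K and report the accuracy on the 2005 test set.
set.seed(10)
for (k in c(1, 3, 5, 7, 9)) {
  knn.pred = knn(train.X, test.X, train.Direction, k = k)
  cat("K =", k, " accuracy =", mean(knn.pred == Direction.2005), "\n")
}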