Chapter 4 Classification


In this chapter, we study approaches for predicting qualitative responses, a process known as classification. Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. At the same time, the methods used for classification often first predict the probability of each category of a qualitative variable as the basis for making the classification, so in this sense they also behave like regression methods.

We will first discuss three of the most widely used classifiers: logistic regression, linear discriminant analysis, and K-nearest neighbors.

An Overview of Classification

Just as in the regression setting, in the classification setting we have a set of training observations (x_1,y_1), \ldots, (x_n, y_n) that we can use to build a classifier. We want our classifier to perform well not only on the training data, but also on test observations that were not used to train the classifier.

Why Not Linear Regression?

We have stated that linear regression is not appropriate in the case of a qualitative response. Why not?

Suppose that we are trying to predict the medical condition of a patient in the emergency room on the basis of her symptoms. In this simplified example, there are three possible diagnoses: stroke, drug overdose, and epileptic seizure. We could consider encoding these values as a quantitative response variable, Y, as follows:

y = \begin{cases} 1 \text{ if stroke}; \\ 2 \text{ if drug overdose};\\ 3 \text{ if epileptic seizure}. \end{cases}

Using this coding, least squares could be used to fit a linear regression model to predict Y on the basis of a set of predictors X_1,\ldots,X_p. Unfortunately, this coding implies an ordering on the outcomes, putting drug overdose in between stroke and epileptic seizure, and insisting that the differences between them are the same. In practice, there is no particular reason that this needs to be the case. For instance, one could choose an equally reasonable coding,

y = \begin{cases} 1 \text{ if drug overdose};\\ 2 \text{ if epileptic seizure};\\ 3 \text{ if stroke}, \end{cases}

which would imply a totally different relationship among the three conditions. Each of these codings would produce a fundamentally different linear model, ultimately leading to different sets of predictions on test observations.

If the response variable’s values did take on a natural ordering, such as mild, moderate, and severe, and we felt the gap between mild and moderate was similar to the gap between moderate and severe, then a 1, 2, 3 coding would be reasonable. Unfortunately, in general there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression.

For a binary (two-level) qualitative response, the situation is better. For instance, perhaps there are only two possibilities for the patient’s medical condition: stroke and drug overdose. We could then use the dummy variable approach to code the response as follows:

y = \begin{cases} 0 \text{ if stroke}; \\ 1 \text{ if drug overdose}. \end{cases}

We could then fit a linear regression to this binary response, and predict drug overdose if \hat{Y} > 0.5 and stroke otherwise. In this case, even if we flip the above coding, linear regression will produce the same final predictions.

For a binary response with a 0/1 coding as above, regression by least squares does make sense; it can be shown that the X\hat{\beta} obtained using linear regression is in fact an estimate of Pr(\text{drug overdose}|X) in this special case. However, if we use linear regression, some of our estimates might lie outside the [0,1] interval, making them hard to interpret as probabilities!

Curiously, it turns out that the classifications we get if we use linear regression to predict a binary response are the same as those from the linear discriminant analysis (LDA) procedure.

Logistic Regression

The Logistic Model

How should we model the relationship between p(X) = Pr(Y=1|X) and X?

In the previous section, we talked about using a linear regression model to represent these probabilities:

p(X) = \beta_0 + \beta_1 X

If we use this approach to predict default = Yes using balance (from the Default data set), we will obtain predicted probabilities much bigger than 1 for large balances. These predictions are not sensible, since the true probability of default, regardless of credit card balance, must fall between 0 and 1.

To avoid this problem, we must model p(X)p(X) using a function that gives outputs between 0 and 1 for all values of XX. Many functions meet this description. In logistic regression, we use the logistic function,

p(X)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}
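As a quick sketch, the logistic function can be evaluated directly to confirm that its output always lies strictly between 0 and 1 (the coefficient values below are made up for illustration, not fitted to any data):

```python
import numpy as np

def logistic(x, beta0=-4.0, beta1=0.002):
    # p(X) = e^(b0 + b1 X) / (1 + e^(b0 + b1 X)); betas are illustrative only
    z = beta0 + beta1 * x
    return np.exp(z) / (1 + np.exp(z))

balances = np.array([0.0, 1000.0, 2000.0, 3000.0])
probs = logistic(balances)
print(probs)  # monotonically increasing, always strictly inside (0, 1)
```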

To fit the model, we use a method called maximum likelihood, which we discuss in the next section.

After a bit of manipulation, we find that

\frac{p(X)}{1-p(X)} = e^{\beta_0+\beta_1 X}

The quantity p(X)/[1-p(X)] is called the odds, and can take on any value between 0 and \infty. Values of the odds close to 0 and \infty indicate very low and very high probabilities, respectively.

By taking the logarithm of both sides of the formula, we arrive at

\log\Bigg(\frac{p(X)}{1-p(X)}\Bigg) = \beta_0+\beta_1 X

The left-hand side is called the log-odds or logit. We see that the logistic regression model has a logit that is linear in X.

Estimating the Regression Coefficients

The coefficients \beta_0 and \beta_1 are unknown, and must be estimated based on the available training data. Although we could use non-linear least squares to fit the model, the more general method of maximum likelihood is preferred, since it has better statistical properties.

The basic intuition behind using maximum likelihood to fit a logistic regression model is as follows: we seek estimates for \beta_0 and \beta_1 such that the predicted probability \hat{p}(x_i) of default for each individual corresponds as closely as possible to the individual’s observed default status. In other words, we try to find \hat{\beta}_0 and \hat{\beta}_1 such that plugging these estimates into the model for p(X) yields a number close to one for all who defaulted, and a number close to zero for all who did not. This can be formalized using a mathematical equation called a likelihood function:

\ell(\beta_0, \beta_1) = \prod_{i:y_i=1} p(x_i) \prod_{i':y_{i'}=0} (1 - p(x_{i'}))

The estimates \hat{\beta}_0 and \hat{\beta}_1 are chosen to maximize this likelihood function.

Maximum likelihood is a very general approach that is used to fit many of the non-linear regression models.

Making Predictions

Once the coefficients have been estimated, it is a simple matter to compute the probability for any given input.

Multiple Logistic Regression

We now consider the problem of predicting a binary response using multiple predictors. By analogy with the extension from simple to multiple linear regression, we can generalize as follows:

\log\Big(\frac{p(X)}{1-p(X)}\Big) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p

where X = (X_1,\ldots,X_p) are p predictors. The equation can be rewritten as

p(X) = \frac{e^{\beta_0+\beta_1 X_1 + \cdots + \beta_p X_p}}{1+e^{\beta_0+\beta_1 X_1 + \cdots + \beta_p X_p}}
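Evaluating this formula is a single dot product followed by the logistic transform. A tiny sketch with hypothetical coefficient values (all numbers invented for illustration):

```python
import numpy as np

# Hypothetical fitted coefficients for p = 3 predictors (illustrative values).
beta = np.array([-4.0, 0.002, 0.5, -1.2])   # beta_0, beta_1, beta_2, beta_3
x = np.array([1.0, 1500.0, 2.0, 0.0])       # leading 1 multiplies the intercept

eta = beta @ x                     # beta_0 + beta_1 x_1 + ... + beta_p x_p
p = np.exp(eta) / (1 + np.exp(eta))
print(p)  # the predicted probability, strictly between 0 and 1
```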

Logistic Regression for >2 Response Classes

The two-class logistic regression models discussed in the previous sections have multiple-class extensions, but in practice they tend not to be used all that often. Instead, discriminant analysis is popular for multiple-class classification.

Linear Discriminant Analysis (LDA)

Logistic regression involves directly modeling Pr(Y=k|X=x) using the logistic function. We now consider an alternative and less direct approach to estimating these probabilities. In this alternative approach, we model the distribution of the predictors X separately in each of the response classes (i.e., given Y), and then use Bayes’ theorem to flip these around into estimates for Pr(Y=k|X=x). When these distributions are assumed to be normal, it turns out that the model is very similar in form to logistic regression.

Why do we need another method, when we have logistic regression?

  • When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.

  • If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.

  • As mentioned earlier, linear discriminant analysis is popular when we have more than two response classes.

Using Bayes’ Theorem for Classification

Suppose that we wish to classify an observation into one of K classes, where K \geq 2. In other words, the qualitative response variable Y can take on K possible distinct and unordered values. Let \pi_k represent the overall or prior probability that a randomly chosen observation comes from the kth class; this is the probability that a given observation is associated with the kth category of the response variable Y. Let f_k(x) \equiv Pr(X=x|Y=k) denote the density function of X for an observation that comes from the kth class. In other words, f_k(x) is relatively large if there is a high probability that an observation in the kth class has X \approx x, and f_k(x) is small if it is very unlikely that an observation in the kth class has X \approx x. Then Bayes’ theorem states that

Pr(Y=k|X=x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)}

In accordance with our earlier notation, we will use the abbreviation p_k(x) = Pr(Y=k|X=x). This suggests that instead of directly computing p_k(x), we can simply plug estimates of \pi_k and f_k(x) into the formula. In general, estimating \pi_k is easy if we have a random sample of Ys from the population: we simply compute the fraction of the training observations that belong to the kth class. However, estimating f_k(x) tends to be more challenging, unless we assume some simple forms for these densities. We refer to p_k(x) as the posterior probability that an observation X=x belongs to the kth class. That is, it is the probability that the observation belongs to the kth class, given the predictor value for that observation.

We know that the Bayes classifier, which classifies an observation to the class for which p_k(x) is largest, has the lowest possible error rate out of all classifiers. Therefore, if we can find a way to estimate f_k(x), then we can develop a classifier that approximates the Bayes classifier.
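The Bayes-theorem computation above is just a normalized product of priors and densities. A minimal sketch at a single point x, with all numbers made up for illustration:

```python
import numpy as np

# Hypothetical two-class example: priors pi_k and density values f_k(x)
# evaluated at one point x (the numbers are invented).
priors = np.array([0.3, 0.7])        # pi_1, pi_2
densities = np.array([1.2, 0.4])     # f_1(x), f_2(x)

# Bayes' theorem: p_k(x) = pi_k f_k(x) / sum_l pi_l f_l(x)
posteriors = priors * densities / np.sum(priors * densities)
print(posteriors)           # posterior probabilities; they sum to 1
print(posteriors.argmax())  # the Bayes classifier picks the largest posterior
```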

Linear Discriminant Analysis for p = 1

For now, assume that p = 1; that is, we have only one predictor. We would like to obtain an estimate for f_k(x) that we can plug into the formula in order to estimate p_k(x). We will then classify an observation to the class for which p_k(x) is greatest. In order to estimate f_k(x), we will first make some assumptions about its form.

Suppose we assume that f_k(x) is normal or Gaussian. In the one-dimensional setting, the normal density takes the form

f_k(x) = \frac{1}{\sqrt{2\pi}\sigma_k}\exp\Big(-\frac{1}{2\sigma_k^2}(x - \mu_k)^2\Big),

where \mu_k and \sigma_k^2 are the mean and variance parameters for the kth class.

For now, let us further assume that \sigma_1^2=\cdots=\sigma_K^2: that is, there is a shared variance term across all K classes, which for simplicity we can denote by \sigma^2. Plugging this f_k(x) into the formula, we find that

p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{1}{2\sigma^2}(x - \mu_k)^2\Big)}{\sum_{l=1}^K \pi_l\frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{1}{2\sigma^2}(x - \mu_l)^2\Big)}

Taking the log of the formula and rearranging the terms, we see that this is equivalent to assigning the observation to the class for which

\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)

is largest.

For instance, if K = 2 and \pi_1 = \pi_2, then the Bayes classifier assigns an observation to class 1 if 2x(\mu_1 - \mu_2) > \mu_1^2 - \mu_2^2, and to class 2 otherwise. In this case, the Bayes decision boundary corresponds to the point where

x = \frac{\mu_1^2 - \mu_2^2}{2(\mu_1 - \mu_2)}=\frac{\mu_1 + \mu_2}{2}
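A one-line numeric check of this midpoint identity, with arbitrary example means:

```python
mu1, mu2 = 1.5, -0.5   # arbitrary example class means

# Boundary x = (mu1^2 - mu2^2) / (2 (mu1 - mu2)), which simplifies to the midpoint
x_boundary = (mu1**2 - mu2**2) / (2 * (mu1 - mu2))
print(x_boundary)      # equals (mu1 + mu2) / 2
```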

In practice, the linear discriminant analysis method approximates the Bayes classifier by plugging estimates for \pi_k, \mu_k, and \sigma^2 into this formula. In particular, the following estimates are used:

\begin{aligned} \hat{\mu}_k &= \frac{1}{n_k}\sum_{i:y_i=k} x_i\\ \hat{\sigma}^2 &= \frac{1}{n-K}\sum_{k=1}^K\sum_{i:y_i=k} (x_i - \hat{\mu}_k)^2 \end{aligned}

where n is the total number of training observations, and n_k is the number of training observations in the kth class. The estimate for \mu_k is simply the average of all the training observations from the kth class, while \hat{\sigma}^2 can be seen as a weighted average of the sample variances for each of the K classes. Sometimes we have knowledge of the class membership probabilities \pi_1,\ldots,\pi_K, which can be used directly. In the absence of any additional information, LDA estimates \pi_k using the proportion of the training observations that belong to the kth class. In other words,

\hat{\pi}_k = n_k / n

The LDA classifier plugs these estimates into the formula above and assigns an observation X = x to the class for which

\hat{\delta}_k(x) = x \cdot \frac{\hat{\mu}_k}{\hat{\sigma}^2} - \frac{\hat{\mu}_k^2}{2\hat{\sigma}^2} + \log(\hat{\pi}_k)

is largest. The word linear in the classifier’s name stems from the fact that the discriminant functions \hat{\delta}_k(x) are linear functions of x (as opposed to more complex functions of x).
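Putting the pieces together, here is a small sketch of the p = 1 LDA recipe on simulated data: two Gaussian classes with a shared variance, all numbers invented.

```python
import numpy as np

rng = np.random.default_rng(1)
# Two classes drawn from N(-1, 1) and N(2, 1): different means, shared variance.
x = np.concatenate([rng.normal(-1, 1, 60), rng.normal(2, 1, 40)])
y = np.concatenate([np.zeros(60, dtype=int), np.ones(40, dtype=int)])

n, K = len(x), 2
pi_hat = np.array([np.mean(y == k) for k in range(K)])       # n_k / n
mu_hat = np.array([x[y == k].mean() for k in range(K)])      # class means
# Pooled variance: (1 / (n - K)) sum_k sum_{i: y_i = k} (x_i - mu_hat_k)^2
sigma2_hat = sum(((x[y == k] - mu_hat[k]) ** 2).sum() for k in range(K)) / (n - K)

def delta_hat(x0):
    # Linear discriminants: x mu_k / s^2 - mu_k^2 / (2 s^2) + log pi_k
    return x0 * mu_hat / sigma2_hat - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)

print(delta_hat(2.0).argmax())   # a point near the class-1 mean goes to class 1
print(delta_hat(-1.0).argmax())  # a point near the class-0 mean goes to class 0
```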

To reiterate, the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean and a common variance \sigma^2, and plugging estimates for these parameters into the Bayes classifier. In a later section, we will consider a less stringent set of assumptions, by allowing the observations in the kth class to have a class-specific variance, \sigma_k^2.

Linear Discriminant Analysis for p > 1

We now extend the LDA classifier to the case of multiple predictors. To do this, we will assume that X = (X_1,X_2,\ldots,X_p) is drawn from a multivariate Gaussian (or multivariate normal) distribution, with a class-specific mean vector and a common covariance matrix.

The multivariate Gaussian distribution assumes that each individual predictor follows a one-dimensional normal distribution, with some correlation between each pair of predictors.

To indicate that a p-dimensional random variable X has a multivariate Gaussian distribution, we write X \sim N(\mu, \Sigma). Here E(X) = \mu is the mean of X (a vector with p components), and Cov(X)=\Sigma is the p \times p covariance matrix of X. Formally, the multivariate Gaussian density is defined as

f(x)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\exp\Big(-\frac{1}{2}(x-\mu)^T\Sigma^{-1} (x-\mu)\Big)
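Evaluated directly, this density matches scipy's implementation; the parameter values below are arbitrary examples.

```python
import numpy as np
from scipy.stats import multivariate_normal

p = 2
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, 0.5])

# f(x) = (2 pi)^(-p/2) |Sigma|^(-1/2) exp(-(1/2) (x - mu)^T Sigma^{-1} (x - mu))
diff = x - mu
quad = diff @ np.linalg.solve(Sigma, diff)
f_manual = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2)
                                  * np.sqrt(np.linalg.det(Sigma)))

print(np.isclose(f_manual, multivariate_normal(mu, Sigma).pdf(x)))
```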

In the case of p > 1 predictors, the LDA classifier assumes that the observations in the kth class are drawn from a multivariate Gaussian distribution N(\mu_k, \Sigma), where \mu_k is a class-specific mean vector, and \Sigma is a covariance matrix that is common to all K classes. Plugging the density function for the kth class, f_k(X = x), into Bayes’ theorem and performing a little algebra reveals that the Bayes classifier assigns an observation X=x to the class for which

\delta_k(x) = x^T\Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1}\mu_k + \log\pi_k

is the largest.

Quadratic Discriminant Analysis

As we have discussed, LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes. Quadratic discriminant analysis (QDA) provides an alternative approach. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes’ theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix. That is, it assumes that an observation from the kth class is of the form X \sim N(\mu_k, \Sigma_k), where \Sigma_k is a covariance matrix for the kth class. Under this assumption, the Bayes classifier assigns an observation X = x to the class for which

\begin{aligned} \delta_k(x) &= -\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1}(x - \mu_k) - \frac{1}{2}\log|\Sigma_k|+\log\pi_k \\ &=-\frac{1}{2}x^T\Sigma_k^{-1}x + x^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k-\frac{1}{2}\log|\Sigma_k| + \log\pi_k \end{aligned}

is largest. So the QDA classifier involves plugging estimates for \Sigma_k, \mu_k, and \pi_k into this formula. The quantity x appears as a quadratic function in \delta_k(x); this is where QDA gets its name.

Why does it matter whether or not we assume that the K classes share a common covariance matrix? In other words, why would one prefer LDA to QDA, or vice versa? The answer lies in the bias-variance trade-off. When there are p predictors, estimating a covariance matrix requires estimating p(p+1)/2 parameters. QDA estimates a separate covariance matrix for each class, for a total of Kp(p+1)/2 parameters. With 50 predictors this is some multiple of 1,275, which is a lot of parameters. By instead assuming that the K classes share a common covariance matrix, the LDA model becomes linear in x, which means there are Kp linear coefficients to estimate. Consequently, LDA is a much less flexible classifier than QDA, and so has substantially lower variance. This can potentially lead to improved prediction performance. But there is a trade-off: if LDA’s assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias. Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.
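The trade-off can be seen on simulated data where the two classes have very different covariance matrices, so LDA's common-covariance assumption is badly violated. This sketch uses scikit-learn's implementations; the data are invented.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(2)
n = 500
# The classes share a mean but have opposite correlation structure,
# so the true boundary is quadratic: QDA's setting, not LDA's.
X0 = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], n)
X1 = rng.multivariate_normal([0, 0], [[1, -0.8], [-0.8, 1]], n)
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

lda_acc = LinearDiscriminantAnalysis().fit(X, y).score(X, y)
qda_acc = QuadraticDiscriminantAnalysis().fit(X, y).score(X, y)
print(lda_acc, qda_acc)  # QDA should clearly beat LDA here
```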

A Comparison of Classification Methods

We have considered three different classification approaches: logistic regression, LDA, and QDA. In Chapter 2, we also discussed the K-nearest neighbors (KNN) method. We now consider the types of scenarios in which one approach might dominate the others.

Though their motivations differ, the logistic regression and LDA methods are closely connected. Consider the two-class setting with p = 1 predictor, and let p_1(x) and p_2(x)=1-p_1(x) be the probabilities that the observation X=x belongs to class 1 and class 2, respectively. In the LDA framework, we can show that the log odds is given by

\log\Big(\frac{p_1(x)}{1-p_1(x)}\Big) = \log\Big(\frac{p_1(x)}{p_2(x)}\Big) = c_0 + c_1 x

where c_0 and c_1 are functions of \mu_1, \mu_2, and \sigma^2. We also know that in logistic regression,

\log\Big(\frac{p_1}{1-p_1}\Big) = \beta_0 + \beta_1 x

Both are linear functions of x. Hence, both logistic regression and LDA produce linear decision boundaries. The only difference between the two approaches lies in the fact that \beta_0 and \beta_1 are estimated using maximum likelihood, whereas c_0 and c_1 are computed using the estimated mean and variance from a normal distribution. This same connection between LDA and logistic regression also holds for multidimensional data with p > 1.

Since logistic regression and LDA differ only in their fitting procedures, one might expect the two approaches to give similar results. This is often, but not always, the case. LDA assumes that the observations are drawn from a Gaussian distribution with a common covariance matrix in each class, and so can provide some improvements over logistic regression when this assumption approximately holds. Conversely, logistic regression can outperform LDA if these Gaussian assumptions are not met.

Recall that KNN takes a completely different approach from the classifiers seen in this chapter. In order to make a prediction for an observation X = x, the K training observations that are closest to x are identified. Then X is assigned to the class to which the plurality of these observations belong. Hence KNN is a completely non-parametric approach: no assumptions are made about the shape of the decision boundary. Therefore, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear. On the other hand, KNN does not tell us which predictors are important; we don’t get a table of coefficients.
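A short sketch of KNN's non-parametric behavior on simulated data with a circular decision boundary, using scikit-learn's `KNeighborsClassifier` (the data are invented):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
# Two classes separated by a circle: a highly non-linear boundary.
X = rng.uniform(-1, 1, size=(300, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)

# Each prediction is a plurality vote among the K = 5 nearest training points.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
preds = knn.predict([[0.0, 0.0], [0.9, 0.9]])
print(preds)  # the origin falls inside the circle, the corner outside
```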

Finally, QDA serves as a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. Since QDA assumes a quadratic decision boundary, it can accurately model a wider range of problems than can the linear methods. Though not as flexible as KNN, QDA can perform better in the presence of a limited number of training observations because it does make some assumptions about the form of the decision boundary.

No one method will dominate the others in every situation. When the true decision boundaries are linear, the LDA and logistic regression approaches will tend to perform well. When the boundaries are moderately non-linear, QDA may give better results. Finally, for much more complicated decision boundaries, a non-parametric approach such as KNN can be superior. But the level of smoothness for a non-parametric approach must be chosen carefully.

Finally, recall that in the regression setting we can accommodate a non-linear relationship between the predictors and the response by performing regression using transformations of the predictors. A similar approach could be taken in the classification setting. For instance, we could create a more flexible version of logistic regression by including X^2, X^3, and even X^4 as predictors. This may or may not improve logistic regression’s performance, depending on whether the increase in variance due to the added flexibility is offset by a sufficiently large reduction in bias. We could do the same for LDA. If we add all possible quadratic terms and cross-products to LDA, the form of the model would be the same as the QDA model, although the parameter estimates would be different. This device allows us to move somewhere between an LDA and a QDA model.
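As a sketch of this idea, adding polynomial terms lets logistic regression fit a quadratic boundary; a plain linear fit cannot. This uses a scikit-learn pipeline on simulated data (all values invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
# True class boundary is x^2 = 1, which no linear function of x can capture.
X = rng.uniform(-2, 2, size=(400, 1))
y = (X[:, 0] ** 2 > 1).astype(int)

plain = LogisticRegression().fit(X, y)
# Including X^2 and X^3 as extra predictors makes the model more flexible.
flexible = make_pipeline(PolynomialFeatures(degree=3),
                         LogisticRegression()).fit(X, y)
print(plain.score(X, y), flexible.score(X, y))  # the flexible fit is far better
```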

