Introduction

Linear Discriminant Analysis (LDA) is a widely used technique for both classification and dimensionality reduction. Its goal is to project data into a lower-dimensional subspace where class separability is maximized. While it is routinely applied in many fields, practitioners often leverage its power without fully grasping what the algorithm actually does.

Recently, during one of my applied statistical learning classes, students raised a question about the R implementation in MASS::lda(). They were curious about how the associated predict() method actually transforms the feature data into what is given as “LD” entries in the output object. It turns out that the method projects the feature data into a lower-dimensional space in which class separability is optimal. More mathematically: MASS::lda() implements reduced-rank LDA, where the optimal decision boundaries are determined in a lower-dimensional feature space obtained by projecting the original features onto the discriminant directions.

Here’s a summary of the mathematics behind this projection, along with a from-scratch implementation in R and a comparison to MASS::lda().

(A more thorough treatment of the math behind this is given in Chapter 4.3.3 of Hastie, Tibshirani, and Friedman (2009).)

Mathematical Framework

The key steps in the dimensionality reduction involve algebraic manipulation of the feature data using two matrices:

  • The within-class covariance matrix $\boldsymbol{W}$, which captures the spread of data points within each class and

  • the between-class covariance matrix $\boldsymbol{B}$, which describes the separation between class means

For $k$ classes with means $\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_k$, these matrices are defined as:

$$ \boldsymbol{W} := \frac{1}{k}\sum_{i=1}^k \boldsymbol{\Sigma}_i, $$

where $\boldsymbol{\Sigma}_i$ is the covariance matrix of the features for class $i$, and

$$ \boldsymbol{B} := \text{Cov}(\boldsymbol{M}), $$

where $\boldsymbol{M}$ is the matrix of class centroids.
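
Since the class centroids are treated as $k$ data points here, the sample version of this definition (this is what R's var() will compute for us later) reads

$$ \boldsymbol{B} = \frac{1}{k-1}\sum_{i=1}^k (\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})(\boldsymbol{\mu}_i - \bar{\boldsymbol{\mu}})', $$

where $\bar{\boldsymbol{\mu}}$ denotes the mean of the class centroids.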

Reduced-rank transformation

The reduced-rank transformation involves these steps:

  1. Compute $\boldsymbol{W}^{-1/2}$ using eigen-decomposition of $\boldsymbol{W}$

  2. Transform the class centroids $$\boldsymbol{M}^* := \boldsymbol{M}\boldsymbol{W}^{-1/2}$$

  3. Compute $$\boldsymbol{B}^* := \text{Cov}(\boldsymbol{M}^*)$$

  4. Find $\boldsymbol{V}^{*}$, the matrix of eigenvectors of $\boldsymbol{B}^{*}$

  5. Obtain the final projection matrix: $$\boldsymbol{P}^* := -\boldsymbol{W}^{-1/2}\boldsymbol{V}^*$$ (all five steps are condensed into a single function in the sketch that follows this list)
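
For readers who prefer code over bullet points, here is a minimal sketch that condenses steps 1-5 into a single R function (the function reduced_rank_lda_proj and its interface are my own illustration, not part of MASS; the detailed, step-by-step walk-through with the iris data follows in the next section):

# minimal sketch: steps 1-5 condensed into one function (illustrative only)
# X: numeric feature matrix, y: factor of class labels
reduced_rank_lda_proj <- function(X, y) {
  # class centroids, one row per class
  M <- apply(X, 2, function(col) tapply(col, y, mean))
  # within-class covariance W: average of the per-class covariance matrices
  W <- Reduce(`+`, lapply(split(as.data.frame(X), y), var)) / nlevels(y)
  # step 1: W^(-1/2) via the eigen-decomposition of W
  e <- eigen(W)
  W_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
  # steps 2 and 3: transform the centroids and compute B*
  Bstar <- var(M %*% W_inv_sqrt)
  # step 4: eigenvectors of B*
  Vstar <- eigen(Bstar)$vectors
  # step 5: projection matrix (the overall sign is arbitrary)
  -W_inv_sqrt %*% Vstar
}

Applied as reduced_rank_lda_proj(as.matrix(iris[, 1:4]), iris$Species), this should reproduce (up to column signs) the coefficient matrix we derive step by step below.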

Implementation in R

As in my R tutorial, we use the iris dataset to demonstrate the steps of the projection outlined in the previous section. Here’s the from-scratch implementation:

# load libraries and dataset
library(dplyr)
data(iris)

# matrix of class centroids
M <- iris %>% 
  group_by(Species) %>% 
  summarise_all(mean) %>% 
  select(-Species) %>%
  as.matrix()

# *within*-class covariance matrix W
seto <- iris %>% filter(Species == "setosa") %>% select(-Species) 
vers <- iris %>% filter(Species == "versicolor") %>% select(-Species) 
virg <- iris %>% filter(Species == "virginica") %>% select(-Species) 

W <- (var(seto) + var(vers) + var(virg)) / 3
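
For completeness (the recipe above never needs it directly, since it works with $\boldsymbol{B}^*$ instead), the between-class covariance matrix $\boldsymbol{B}$ from the previous section is simply the covariance of the centroid matrix:

# *between*-class covariance matrix B (shown for completeness only)
B <- var(M)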

Eigen-Decomposition of Covariance Matrices

The matrix $\boldsymbol{W}^{-1/2}$ is computed using the eigen-decomposition of $\boldsymbol{W}$: If $\boldsymbol{W} = \boldsymbol{Q}\boldsymbol{\Lambda}\boldsymbol{Q}'$, where $\boldsymbol{Q}$ holds the eigenvectors of $\boldsymbol{W}$ in its columns and $\boldsymbol{\Lambda}$ is a diagonal matrix of the corresponding eigenvalues, then:

$$ \boldsymbol{W}^{-1/2} = \boldsymbol{Q}\boldsymbol{\Lambda}^{-1/2}\boldsymbol{Q}' $$

We use this to compute $\boldsymbol{W}^{-1/2}$ from the eigen-decomposition of $\boldsymbol{W}$, transform the class centroids, and then apply a second eigen-decomposition to obtain $\boldsymbol{V}^*$.

# eigen-decomposition of W
r <- eigen(W)
# note: despite its name, W_sqrt holds W^(-1/2) (solve() inverts the square root)
W_sqrt <- solve(r$vectors %*% diag(sqrt(r$values)) %*% t(r$vectors))
Mstar <- M %*% W_sqrt

# between-class covariance matrix Bstar
Bstar <- var(Mstar)
r <- eigen(Bstar)

# eigen-vectors of Bstar
Vstar <- r$vectors
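
Before moving on, two quick sanity checks may be reassuring (they assume the objects W, W_sqrt, and Vstar from the chunks above): despite its name, W_sqrt holds $\boldsymbol{W}^{-1/2}$, since solve() inverts the square root, so sandwiching $\boldsymbol{W}$ between two copies of it should give the identity matrix; and the eigenvector matrix Vstar should be orthonormal.

# W_sqrt holds W^(-1/2), so W^(-1/2) %*% W %*% W^(-1/2) should be the identity
max(abs(W_sqrt %*% W %*% W_sqrt - diag(ncol(W))))   # should be ~0

# the eigenvectors of Bstar are orthonormal: t(Vstar) %*% Vstar = I
max(abs(t(Vstar) %*% Vstar - diag(ncol(Vstar))))    # should be ~0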

Optimal Projections

The optimal projection vectors are the eigenvectors of $\boldsymbol{B}^*$ corresponding to the largest eigenvalues. For a $k$-class problem, there are at most $k-1$ non-zero eigenvalues, which determines the maximum meaningful dimensionality of the projection.
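
We can verify this for the decomposition computed above (assuming the object r, the eigen-decomposition of Bstar from the previous chunk, is still in the workspace): with three classes, only the first two eigenvalues of $\boldsymbol{B}^*$ should be non-zero up to numerical noise.

# eigenvalues of Bstar: at most k - 1 = 2 of them are non-zero
round(r$values, 6)  # the last two entries should be (numerically) zero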

For the iris data, we have three classes (setosa, versicolor, and virginica) and four features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width). The optimal projection (which is convenient for visualizing the decision boundaries!) therefore reduces the original four-dimensional feature space to two dimensions. We obtain the projected data using the first two eigenvectors of $\boldsymbol{B}^*$, stored in Vstar.

# full feature matrix
X <- iris %>% select(-Species) %>% as.matrix()

# compute projections of the centered features
# (separately for each dimension of the two dim. target space)

# for first eigen-vector
Z_1 <- scale(X, scale = FALSE) %*% -(W_sqrt %*% Vstar[, 1]) 

# for second eigen-vector
Z_2 <- scale(X, scale = FALSE) %*% -(W_sqrt %*% Vstar[, 2])
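
As an aside, the two separate products above are only meant to mirror the step-by-step exposition; both target dimensions can equally be obtained in a single matrix multiplication with the first two columns of the projection matrix:

# equivalent: project onto both eigenvectors at once
Z <- scale(X, scale = FALSE) %*% -(W_sqrt %*% Vstar[, 1:2])
# Z[, 1] is identical to Z_1 and Z[, 2] to Z_2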

Comparison with MASS::lda()

Now that we have the rank-reduced data in Z_1 and Z_2, we can check that our results correspond to what MASS::lda() computes. To do so, we fit the model and obtain predictions using predict().

library(MASS)

iris_lda <- lda(Species ~ ., data = iris)
iris_preds <- predict(iris_lda)

First, let’s check the centroids and the LD coefficients, which define the projection¹:

# compare centroids
## MASS::lda()
iris_lda$means 
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026
## our implementation
M
     Sepal.Length Sepal.Width Petal.Length Petal.Width
[1,]        5.006       3.428        1.462       0.246
[2,]        5.936       2.770        4.260       1.326
[3,]        6.588       2.974        5.552       2.026
# compare LD *coefficients*
## MASS::lda()
iris_lda$scaling
                    LD1         LD2
Sepal.Length  0.8293776 -0.02410215
Sepal.Width   1.5344731 -2.16452123
Petal.Length -2.2012117  0.93192121
Petal.Width  -2.8104603 -2.83918785
## P^* in our implementation
cbind(
  -(W_sqrt %*% Vstar[, 1]),
  -(W_sqrt %*% Vstar[, 2])
)
           [,1]        [,2]
[1,]  0.8293776  0.02410215
[2,]  1.5344731  2.16452123
[3,] -2.2012117 -0.93192121
[4,] -2.8104603  2.83918785

Given the low dimension of the feature space, we may easily check by eyeballing that the outcomes of both approaches are indeed equivalent!²
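
Beyond eyeballing, we can also compare the projected data numerically. Up to the flipped sign of LD2 and small floating-point differences, the two versions should coincide, so the maximum absolute differences below should be close to zero:

# numerical comparison of the projections (LD2 sign flipped, cf. footnote 2)
max(abs(as.numeric(Z_1) - iris_preds$x[, "LD1"]))
max(abs(as.numeric(-Z_2) - iris_preds$x[, "LD2"]))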

We may visualize the rank-reduced version(s) of the dataset using scatter plots.

library(ggplot2)
library(gridExtra)

# create a data frame for plotting
plot_data <- tibble(
    LD1 = iris_preds$x[, 1],
    LD2 = iris_preds$x[, 2],
    Z1 = as.numeric(Z_1),
    Z2 = as.numeric(-Z_2),  # sign flipped to match MASS::lda(), cf. footnote 2
    Species = iris$Species
)

# the plots
## MASS::lda()
p1 <- ggplot(plot_data, aes(x = LD1, y = LD2, color = Species)) +
    geom_point() +
    labs(title = "MASS::lda() Projection") +
    theme_minimal()

## from-scratch implementation
p2 <- ggplot(plot_data, aes(x = Z1, y = Z2, color = Species)) +
    geom_point() +
    labs(title = "Reduced-Rank LDA Projection") +
    theme_minimal()

# arrange plots
grid.arrange(p1, p2, ncol = 1)

Figure 1: MASS::lda() vs. from-scratch implementation of reduced-rank LDA projection

Unsurprisingly, the subplots of the projected data in Figure 1 are identical.

At this stage, we could apply LDA within the rank-reduced feature space. But why bother with rank reduction in the first place? Well, rank reduction is valuable when the number of features exceeds the number of observations, as it helps prevent overfitting by isolating the most discriminative directions. It thereby ensures numerical stability, enhances generalization by focusing on class-relevant variance, and boosts computational efficiency, making it especially powerful in fields like image analysis, where high-dimensional data is common.
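
To illustrate the first point, here is a minimal sketch of classification in the rank-reduced space, assuming the objects Z_1, Z_2, and iris_preds from above: because the projection spheres the within-class covariance and the iris classes have equal priors, LDA in the reduced space amounts to assigning each observation to its nearest class centroid, which should reproduce the class labels returned by predict().

# nearest-centroid classification in the reduced space (sketch; equal priors)
Zmat <- cbind(Z_1, Z_2)
centroids <- apply(Zmat, 2, function(col) tapply(col, iris$Species, mean))
# squared Euclidean distance of every observation to each class centroid
d2 <- sapply(seq_len(nrow(centroids)), function(i) {
  rowSums(sweep(Zmat, 2, centroids[i, ])^2)
})
pred_reduced <- factor(rownames(centroids)[max.col(-d2)],
                       levels = levels(iris$Species))
# compare with the class labels from MASS::lda()
table(reduced = pred_reduced, mass = iris_preds$class)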

And that’s it for now 😊.

References

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer.


  1. The LD coefficients are the entries of $\boldsymbol{P}^{*}$. Note that we have not assigned the result for $\boldsymbol{P}^{*}$ to a variable but used it directly in the computation of the projected data. ↩︎

  2. Note that the sign of the LD2 coefficients is opposite to the sign of the coefficients in my result. While this has no effect on the implied decision boundaries, I flip the coefficient signs in the plotting code. ↩︎