Published on

Notes on Regression - Singular Vector Decomposition


Here's a fun take on the OLS that I picked up from The Elements of Statistical Learning. It applies the Singular Value Decomposition, also known as the method used in principal component analysis, to the regression framework.

Singular Vector Decomposition (SVD)

First, a little background on the SVD. The SVD could be thought of as a generalisation of the eigendecomposition. An eigenvector v of matrix A\mathbf{A} is a vector that is mapped to a scaled version of itself:

Av=λv\mathbf{A}v = \lambda v

where λ\lambda is known as the eigenvalue. For a full rank matrix (this guarantees orthorgonal eigenvectors), we can stack up the eigenvalues and eigenvectors (normalised) to obtain the following equation:

AQ=QΛA=QΛQ1\begin{aligned} \mathbf{A}\mathbf{Q} &= \mathbf{Q}\Lambda \\ \mathbf{A} &= \mathbf{Q}\Lambda\mathbf{Q}^{-1} \end{aligned}

where Q\mathbf{Q} is an orthonormal matrix.

For the SVD decomposition, A\mathbf{A} can be any matrix (not square). The trick is to consider the square matrices AA\mathbf{A}'\mathbf{A} and AA\mathbf{A}\mathbf{A}'. The SVD of the n×kn \times k matrix A\mathbf{A} is UDV\mathbf{U}\mathbf{D}\mathbf{V}', where U\mathbf{U} is a square matrix of dimension nn and V\mathbf{V} is a square matrix of dimension kk. This implies that AA=VD2V\mathbf{A}'\mathbf{A} = \mathbf{V}\mathbf{D}^{2}\mathbf{V} and V\mathbf{V} can be seen to be the eigenvalue matrix of that square matrix. Similarly, the eigenvectors of AA\mathbf{A}\mathbf{A}' forms the columns of U\mathbf{U} while D\mathbf{D} is the square root of the eigenvalues of either matrix.

In practice, there is no need to calculate the full set of eigenvectors for both matrices. Assuming that the rank of A\mathbf{A} is k, i.e. it is a long matrix, there is no need to find all n eigenvectors of AA\mathbf{A}\mathbf{A}' since only the elements of the first k eigenvalues will be multiplied by non-zero elements. Hence, we can restrict U\mathbf{U} to be a n×kn \times k matrix and let D\mathbf{D} be a k×kk \times k matrix.


Applying the SVD to OLS

To apply the SVD to the OLS formula, we re-write the fitted values, substituting the data input matrix XX with its equivalent decomposed matrices:

Xβ^=X(XX)1Xy=UDV(VDUUDV)1VDUy=UD(DD)1DUy=UUy\begin{aligned} \mathbf{X}\hat{\beta} &= \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \\ &= \mathbf{U}\mathbf{D}\mathbf{V}'(\mathbf{V}\mathbf{D}'\mathbf{U}'\mathbf{U}\mathbf{D}\mathbf{V}')^{-1}\mathbf{V}\mathbf{D}'\mathbf{U}'\mathbf{y} \\ &= \mathbf{U}\mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}\mathbf{U}'\mathbf{y} \\ &= \mathbf{U}\mathbf{U}'\mathbf{y} \end{aligned}

where the third to fourth line comes from the fact that (DD)1(\mathbf{D}'\mathbf{D})^{-1} is a k×kk \times k matrix with the square root of the eigenvalues on the diagonal, and D\mathbf{D} is a square diagonal matrix. Here we see that the fitted values are computed with respect to the orthonormal basis U\mathbf{U}.1

The ridge regression is an OLS regression with an additional penalty term on the size of the coefficients and is a popular model in the machine learning literature. In other words, the parameters are chosen to minimalise the penalised sum of squares:

i=1n(yij=1kxijβj)2+λj=1kβj2\sum_{i=1}^{n}(y_{i} - \sum_{j=1}^{k} x_{ij}\beta_{j})^{2} + \lambda \sum_{j=1}^{k} \beta_{j}^{2}

The solution to the problem is given by: β^ridge=(XX+λIk)1XY\hat{\beta}^{ridge} = (\mathbf{X}'\mathbf{X} + \lambda I_{k})^{-1}\mathbf{X}'\mathbf{Y}. Substituting the SVD formula into the fitted values of the ridge regression:

Xβ^ridge=X(XX+λI)1Xy=UD(DD+λI)1DUy=j=1kujdj2dj2+λujy\begin{aligned} \mathbf{X}\hat{\beta}^{ridge} &= \mathbf{X}(\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y} \\ &= \mathbf{U}\mathbf{D}(\mathbf{D}'\mathbf{D} + \lambda\mathbf{I})^{-1}\mathbf{D}\mathbf{U}'\mathbf{y} \\ &= \sum_{j=1}^{k} \mathbf{u}_{j} \frac{d^{2}_{j}}{d^{2}_{j} + \lambda} \mathbf{u}_{j}'\mathbf{y} \end{aligned}

where u\mathbf{u} is a n-length vector from the columns of U\mathbf{U}. This formula makes the idea of regularisation really clear. It shrinks the predicted values by the factor dj2/(dj2+λ)d^{2}_{j}/(d^{2}_{j} + \lambda). Moreover, a greater shrinkage factor is applied to the variables which explain a lower fraction of the variance of the data i.e. lower djd_{j}. This comes from the fact that the eigenvectors associated with a higher eigenvalue explain a greater fraction of the variance of the data (see Principal Component Analysis).

The difference between how regularisation works when one uses the Principal Component Analysis (PCA) method vs the ridge regression also becomes clear with the above formulation. The PCA approach truncates variables that fall below a certain threshold, while the ridge regression applies a weighted shrinkage method.

  1. Doing a QR decomposition will also give a similar set of results, though the orthogonal bases will be different.