Reduced rank linear regression

5/25/2026

Recently, I have been interested in reduced-rank linear regression. My first stop was section 3.7 (Multiple Outcome Shrinkage and Selection) of Elements of Statistical Learning, but I found the treatment there a bit unclear (others have felt the same way). So, I decided to derive the solution by hand as a fun exercise.

Derivation

Suppose we have $X\in \mathbb{R}^{n\times p}$ , $Y\in \mathbb{R}^{n\times q}$ , and positive integer rank $r$ (typically $r \lt n, p, q$ ). We want to find a matrix $\Theta \in \mathbb{R}^{p\times q}$ that solves the problem

$\begin{align} \min_{\Theta}. \quad & \|Y - X\Theta\|_F^2 \\ \text{s.t.} \quad & \text{rank}(\Theta) \leq r.\nonumber \end{align}$

This problem looks quite benign, but the rank constraint is non-convex, so it's not immediately clear how to solve it. With some simple manipulations, though, we can derive a closed-form, two-step method that produces an exact (up to numerical precision) solution.

First, let $\hat{\Theta}$ be the ordinary least squares solution to the unconstrained problem $\min_{\Theta}. \|Y - X\Theta\|_F^2$ . The normal equations of the unconstrained problem give us that

$X^T(X\hat{\Theta} - Y) = 0.$

If we define the residual matrix $R = Y - X\hat{\Theta}$ , then $X^TR = 0$ , and hence $\mathrm{tr}\left(R^TXA\right) = 0$ for every $A\in \mathbb{R}^{p\times q}$ . It follows that for all $\Theta\in \mathbb{R}^{p\times q}$ , we have

$\begin{align*} \|Y - X\Theta\|_F^2 & = \|R + X\hat{\Theta} - X\Theta\|_F^2 \\ & = \|R + X(\hat{\Theta} - \Theta)\|_F^2 \\ & = \|R\|_F^2 + \|X \hat{\Theta} - X\Theta\|_F^2 + 2\mathrm{tr}\left(R^TX(\hat{\Theta} - \Theta)\right) \\ & = \|R\|_F^2 + \|X \hat{\Theta} - X\Theta\|_F^2. \end{align*}$
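The decomposition above can be checked numerically. The following is a minimal sketch using NumPy on random data; the dimensions, the seed, and the variable names (`theta_ols`, `cross`, and so on) are illustrative assumptions, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 50, 6, 4
X = rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))

# Ordinary least squares solution and its residual.
theta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
R = Y - X @ theta_ols

# An arbitrary coefficient matrix to compare against.
theta = rng.standard_normal((p, q))

# Left- and right-hand sides of the decomposition.
lhs = np.linalg.norm(Y - X @ theta, "fro") ** 2
rhs = (np.linalg.norm(R, "fro") ** 2
       + np.linalg.norm(X @ (theta_ols - theta), "fro") ** 2)

# The cross term that the normal equations force to zero.
cross = np.trace(R.T @ X @ (theta_ols - theta))

print(lhs, rhs, cross)
```

Up to floating-point error, `lhs` and `rhs` should agree and `cross` should vanish for any choice of `theta`.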

Since $R$ has no dependence on $\Theta$ , the problem (1) is equivalent to

$\begin{align} \min_{\Theta}. \quad & \|X \hat{\Theta} - X\Theta\|_F^2 \\ \text{s.t.} \quad & \text{rank}(\Theta) \leq r.\nonumber \end{align}$

Now, consider the following relaxation of problem (2):

$\begin{align} \min_{M}. \quad & \|X \hat{\Theta} - M\|_F^2 \\ \text{s.t.} \quad & \text{rank}(M) \leq r,\nonumber \end{align}$

where $M\in \mathbb{R}^{n\times q}$ is a new variable. Problem (3) is a relaxation of problem (2) because for any feasible $\Theta$ in problem (2), we can set $M = X\Theta$ to get a feasible solution to problem (3) with the same objective value. Thus, the optimal value of problem (3) is less than or equal to the optimal value of problem (2). However, we will show that the optimal values of problems (2) and (3) are actually equal, and that we can use the solution to problem (3) to construct a solution to problem (2).

The solution to (3) is the best rank- $r$ (or less) approximation to $X\hat{\Theta}$ in the Frobenius norm. If we define $X\hat{\Theta} = U\Sigma V^T$ to be the thin SVD of $X\hat{\Theta}$ (with singular values in decreasing order), then by the Eckart-Young-Mirsky theorem, this approximation is given by $U_{:r}\Sigma_{:r} V_{:r}^T$ , where $U_{:r}$ and $V_{:r}$ are the first $r$ columns of $U$ and $V$ , and $\Sigma_{:r}$ is the leading $r\times r$ block of $\Sigma$ . Because the columns of $V$ are orthonormal, $V^TV_{:r}$ consists of the first $r$ columns of the identity matrix, so $U\Sigma V^TV_{:r} = U_{:r}\Sigma_{:r}$ . Thus, we have that

$\begin{align*} M^\star & = U_{:r}\Sigma_{:r} V_{:r}^T \\ & = U\Sigma V^T V_{:r} V_{:r}^T \\ & = X\hat{\Theta} V_{:r} V_{:r}^T. \end{align*}$

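Since $X\hat{\Theta} V_{:r} V_{:r}^T$ is a solution to the relaxation (3) and $\hat{\Theta} V_{:r} V_{:r}^T$ has rank at most $r$ , the matrix $\hat{\Theta} V_{:r} V_{:r}^T$ is feasible for problem (2) and attains the optimal value of the relaxation. It follows that $\hat{\Theta} V_{:r} V_{:r}^T$ is a solution to problem (2), and therefore to the original problem (1). $\square$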
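The two-step method can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than a reference implementation: the function name `reduced_rank_regression`, the synthetic data, and the seed are all made up for the example.

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Return a coefficient matrix of rank at most r minimizing ||Y - X @ Theta||_F."""
    # Step 1: ordinary least squares on the unconstrained problem.
    theta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    # Step 2: thin SVD of the fitted values; the rows of Vt are the
    # right singular vectors, sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(X @ theta_ols, full_matrices=False)
    V_r = Vt[:r].T  # q x r matrix of the leading r right singular vectors
    # Project the least squares coefficients onto the span of V_r.
    return theta_ols @ V_r @ V_r.T

# Synthetic example: a rank-r signal plus a little noise (assumed setup).
rng = np.random.default_rng(0)
n, p, q, r = 200, 10, 8, 3
X = rng.standard_normal((n, p))
Y = (X @ rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
     + 0.1 * rng.standard_normal((n, q)))

theta_rr = reduced_rank_regression(X, Y, r)
print(np.linalg.norm(Y - X @ theta_rr, "fro") ** 2)
```

Note that the SVD is taken of the fitted values $X\hat{\Theta}$ , not of $\hat{\Theta}$ itself; the only thing carried over from it is the projection $V_{:r} V_{:r}^T$ .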

The core intuition behind this problem is the following: the least squares residual is always orthogonal to the column space of $X$ , and the task of selecting the best rank- $r$ coefficient matrix $\Theta$ involves choosing the best subspace within the column space of $X$ . Since the least squares residual is orthogonal to the difference $X\hat{\Theta} - X\Theta$ between the least squares fit and any other fit, we can split the problem into two steps: first find the least squares solution, and then find the best rank- $r$ approximation to the fitted values $X\hat{\Theta}$ .

Afterthoughts

Suppose $X$ and $Y$ are both orthogonal and low rank but corrupted by noise so that they appear full rank. The least squares fit $\hat{\Theta}$ will then be full rank, with $X\hat{\Theta}$ matching or nearly matching $Y$ . Intuitively, reduced-rank regression might seem like a good choice here, but when $X\hat{\Theta}=Y$ , the reduced-rank approximation simply picks out the leading left singular vectors of $Y$ from the column space of $X$ (a column space artificially enlarged by noise). The resulting fit will appear much better than it actually is: reduced-rank regression doesn't act as a regularizer here, and the singular values of $\Theta$ can be quite large when the model is fitting to noise.

When $X$ and $Y$ are noisy, it might also be tempting to directly compute the rank- $r$ approximation of the least squares solution $\hat{\Theta}$ , but this is even worse. When $\hat{\Theta}$ fits to noise in $X$ , the corresponding singular values of $\hat{\Theta}$ will be large, and the rank- $r$ approximation of $\hat{\Theta}$ will precisely pick out these components.
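To make the difference between the two approaches concrete, here is a small sketch that compares truncating $\hat{\Theta}$ directly with the reduced-rank regression solution on the same data. The helper `truncate_rank`, the column scaling of $X$ , and the seed are assumptions chosen only for illustration; the one thing that holds in general is that the reduced-rank regression loss can never exceed the loss of the naive truncation, since both are feasible for problem (1).

```python
import numpy as np

def truncate_rank(A, r):
    """Best rank-r approximation of A in the Frobenius norm (via the SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

rng = np.random.default_rng(1)
n, p, q, r = 100, 8, 6, 2
# Columns of X on very different scales (an assumed setup), so that the
# large directions of Theta_hat need not be the large directions of X @ Theta_hat.
X = rng.standard_normal((n, p)) * np.logspace(0, -2, p)
Y = rng.standard_normal((n, q))

theta_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Naive approach: truncate the coefficient matrix itself.
theta_naive = truncate_rank(theta_ols, r)

# Reduced-rank regression: project onto the leading right singular
# vectors of the fitted values.
_, _, Vt = np.linalg.svd(X @ theta_ols, full_matrices=False)
theta_rrr = theta_ols @ Vt[:r].T @ Vt[:r]

loss_naive = np.linalg.norm(Y - X @ theta_naive, "fro") ** 2
loss_rrr = np.linalg.norm(Y - X @ theta_rrr, "fro") ** 2
print(loss_naive, loss_rrr)
```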

In contrast, taking a low-rank approximation of $X$ before doing least squares is much more like regularization. Optimistically, this low-rank approximation will zero out the noise in $X$ , and $\Theta$ will be forced to fit on the signal in $X$ rather than the noise. At this stage, performing reduced-rank regression on the fitted values $X\hat{\Theta}$ might be more reasonable since we don't run the risk of picking out noise from $X$ .
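One way to sketch this idea is to replace $X$ by its best rank- $k$ approximation and take the minimum-norm least squares solution. The function name `truncated_svd_least_squares`, the rank parameter `k`, and the synthetic data below are assumptions for illustration; choosing `k` in practice is a separate question not addressed here.

```python
import numpy as np

def truncated_svd_least_squares(X, Y, k):
    """Minimum-norm least squares fit of Y on the best rank-k approximation of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Apply the pseudoinverse of X_k = U_k diag(s_k) V_k^T to Y:
    # theta = V_k diag(1 / s_k) U_k^T Y.
    return Vt[:k].T @ ((U[:, :k].T @ Y) / s[:k, None])

# Assumed setup: X is a rank-k signal plus small noise.
rng = np.random.default_rng(2)
n, p, q, k = 100, 8, 5, 3
signal = rng.standard_normal((n, k)) @ rng.standard_normal((k, p))
X = signal + 0.01 * rng.standard_normal((n, p))
Y = rng.standard_normal((n, q))

theta_k = truncated_svd_least_squares(X, Y, k)
print(theta_k.shape)
```

Because the trailing singular directions of $X$ are discarded before inverting, the small singular values that would otherwise inflate the coefficients never enter the solution.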

While deriving the reduced-rank regression coefficients, I was also reminded of the orthogonal Procrustes problem. The orthogonal Procrustes problem asks for the orthogonal matrix $Q$ that most closely maps one matrix $A$ to another matrix $B$ in the Frobenius norm, i.e., that minimizes $\|QA - B\|_F$ . As cleanly demonstrated on Wikipedia, the solution is $Q = UV^T$ , where $U\Sigma V^T$ is the SVD of $BA^T$ . Both problems are examples of least squares with a non-convex constraint, and both have closed-form solutions derived using the SVD.
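For comparison, the Procrustes solution is just as short in code. This is a minimal sketch for the left-multiplication form $\min_Q \|QA - B\|_F$ ; the function name `orthogonal_procrustes_left` and the test data are illustrative assumptions.

```python
import numpy as np

def orthogonal_procrustes_left(A, B):
    """Orthogonal Q minimizing ||Q @ A - B||_F, computed from the SVD of B @ A.T."""
    U, _, Vt = np.linalg.svd(B @ A.T)
    return U @ Vt

# Assumed setup: B is an exact rotation/reflection of A, so Q should be recovered.
rng = np.random.default_rng(3)
d, m = 4, 20
A = rng.standard_normal((d, m))
Q_true, _ = np.linalg.qr(rng.standard_normal((d, d)))  # a random orthogonal matrix
B = Q_true @ A

Q = orthogonal_procrustes_left(A, B)
print(np.linalg.norm(Q @ A - B, "fro"))
```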