Reduced rank linear regression

5/25/2026

Recently, I have been interested in reduced-rank linear regression. My first stop was section 3.7 (Multiple Outcome Shrinkage and Selection) of Elements of Statistical Learning, but I found the treatment there a bit unclear (others have felt the same way). So, I decided to derive the solution by hand as a fun exercise.

Derivation

Suppose we have XRn×pX\in \mathbb{R}^{n\times p}, YRn×qY\in \mathbb{R}^{n\times q}, and positive integer rank rr (typically r<n,p,qr \lt n, p, q). We want to find a matrix ΘRp×q\Theta \in \mathbb{R}^{p\times q} that solves the problem

minΘ.YXΘF2s.t.rank(Θ)r. \begin{align} \min_{\Theta}. \quad & \|Y - X\Theta\|_F^2 \\ \text{s.t.} \quad & \text{rank}(\Theta) \leq r.\nonumber \end{align}

This problem looks quite benign, but the rank constraint is non-convex, so it's not immediately clear how to solve it. With some simple manipulations though, we can show a closed-form, two-step method to produce an exact (up to numerical precision) solution.

First, let Θ^\hat{\Theta} be the ordinary least squares solution to the unconstrained problem minΘ.YXΘF2\min_{\Theta}. \|Y - X\Theta\|_F^2. The normal equations of the unconstrained problem give us that

tr(XT(XΘ^Y))=0. \mathrm{tr}\left(X^T(X\hat{\Theta} - Y)\right) = 0.

If we define the residual matrix R=YXΘ^R = Y - X\hat{\Theta}, then tr(XTR)=0\mathrm{tr}\left(X^TR\right) = 0. It follows that for all ΘRp×q\Theta\in \mathbb{R}^{p\times q}, we have

YXΘF2=R+XΘ^XΘF2=R+X(Θ^Θ)F2=RF2+XΘ^XΘF2+2tr(RTX(Θ^Θ))=RF2+XΘ^XΘF2. \begin{align*} \|Y - X\Theta\|_F^2 & = \|R + X\hat{\Theta} - X\Theta\|_F^2 \\ & = \|R + X(\hat{\Theta} - \Theta)\|_F^2 \\ & = \|R\|_F^2 + \|X \hat{\Theta} - X\Theta\|_F^2 + 2\mathrm{tr}\left(R^TX(\hat{\Theta} - \Theta)\right) \\ & = \|R\|_F^2 + \|X \hat{\Theta} - X\Theta\|_F^2. \end{align*}

Since RR has no dependence on Θ\Theta, the problem (1) is equivalent to

minΘ.XΘ^XΘF2s.t.rank(Θ)r. \begin{align} \min_{\Theta}. \quad & \|X \hat{\Theta} - X\Theta\|_F^2 \\ \text{s.t.} \quad & \text{rank}(\Theta) \leq r.\nonumber \end{align}

Now, consider the following relaxation of problem (2):

minM.XΘ^MF2s.t.rank(M)r, \begin{align} \min_{M}. \quad & \|X \hat{\Theta} - M\|_F^2 \\ \text{s.t.} \quad & \text{rank}(M) \leq r,\nonumber \end{align}

where MRn×qM\in \mathbb{R}^{n\times q} is a new variable. Problem (3) is a relaxation of problem (2) because for any feasible Θ\Theta in problem (2), we can set M=XΘM = X\Theta to get a feasible solution to problem (3) with the same objective value. Thus, the optimal value of problem (3) is less than or equal to the optimal value of problem (2). However, we will show that the optimal values of problems (2) and (3) are actually equal, and that we can use the solution to problem (3) to construct a solution to problem (2).

The solution to (3) is the best rank-rr (or less) approximation to XΘ^X\hat{\Theta} in the Frobenius norm. If we define XΘ^=UΣVTX\hat{\Theta} = U\Sigma V^T to be the thin SVD of XΘ^X\hat{\Theta}, then by the Eckart-Young-Mirsky theorem, this approximation is given by U:rΣ:rV:rTU_{:r}\Sigma_{:r} V_{:r}^T, where U:rU_{:r}, Σ:r\Sigma_{:r}, and V:rV_{:r} are the first rr columns of UU, the first rr singular values of Σ\Sigma, and the first rr columns of VV, respectively. Thus, we have that

M=U:rΣ:rV:rT=UΣV:rT=UΣVTV:rV:rT=XΘ^V:rV:rT. \begin{align*} M^\star & = U_{:r}\Sigma_{:r} V_{:r}^T \\ & = U\Sigma V_{:r}^T \\ & = U\Sigma V^T V_{:r} V_{:r}^T \\ & = X\hat{\Theta} V_{:r} V_{:r}^T. \end{align*}

Since XΘ^V:rV:rTX\hat{\Theta} V_{:r} V_{:r}^T is a solution to the relaxation (3), it follows that Θ^V:rV:rT\hat{\Theta} V_{:r} V_{:r}^T is a solution to the original problem (2). \square

The core intuition behind this problem is realizing the following: the least squares residual is always orthogonal to the column space of XX, and the task of selecting the best rank-rr coefficient matrix Θ\Theta involves choosing the best subspace within the column space of XX. Since the least squares residual and the reduced-rank residual are orthogonal, we can split the problem into two steps: first find the least squares solution, and then find the best rank-rr approximation to the fitted values XΘ^X\hat{\Theta}.

Afterthoughts

Suppose XX and YY are both orthogonal and low rank but corrupted by noise so that they appear full rank. A least squares fit Θ\Theta will then be full rank, with XΘX\Theta matching or nearly matching YY. Intuitively, reduced-rank regression might seem like a good choice here, but when XΘ^=YX\hat{\Theta}=Y, the reduced-rank approximation simply picks out the leading left singular vectors of YY from the column space of XX (a column space artificially enlarged by noise). The resulting fit will appear much better than it actually is: reduced-rank regression doesn't act as a regularizer here, and the singular values of Θ\Theta can be quite large when the model is fitting to noise.

When XX and YY are noisy, it might also be tempting to directly compute the rank-rr approximation of the least squares solution Θ^\hat{\Theta}, but this is even worse. When Θ^\hat{\Theta} fits to noise in XX, the corresponding singular values of Θ^\hat{\Theta} will be large, and the rank-rr approximation of Θ^\hat{\Theta} will precisely pick out these components.

In contrast, taking a low-rank approximation of XX before doing least squares is much more like regularization. Optimistically, this low-rank approximation will zero out the noise in XX, and Θ\Theta will be forced to fit on the signal in XX rather than the noise. At this stage, performing reduced-rank regression on the fitted values XΘ^X\hat{\Theta} might be more reasonable since we don't run the risk of picking out noise from XX.

While deriving the reduced-rank regression coefficients, I was also reminded of the orthogonal Procrustes problem. The orthogonal Procrustes problem asks for the best orthogonal matrix QQ that maps one matrix AA to another matrix BB in the Frobenius norm. As cleanly demonstrated on Wikipedia, the solution is Q=UVTQ = UV^T, where UΣVTU\Sigma V^T is the thin SVD of BATBA^T. Both problems are examples of least-squares with a non-convex constraint, and both have closed-form solutions derived using the SVD.