Per Mattsson, Dave Zachariah, Petre Stoica
Linear regression models have a wide range of applications in statistics, signal processing, and machine learning. In this Lecture Notes column we will examine the performance of the least-squares (LS) estimator with a focus on the case when there are more parameters than training samples, which is often overlooked in textbooks on estimation.
A linear regression model can be written as
\[\hat{y}(\mathbf{\varphi};\mathbf{\theta}) = \mathbf{\varphi}^{\top}\mathbf{\theta} \tag{1}\]
where $\hat{y}(\mathbf{\varphi};\mathbf{\theta})$ is the predicted output given a $d$-dimensional regressor/feature vector $\mathbf{\varphi}$, and $\mathbf{\theta}$ is a $d$-dimensional parameter vector. The central problem is to form a parameter estimate $\hat{\mathbf{\theta}}_{n}$ from a set of $n$ training samples $(y_{i},\mathbf{\varphi}_{i})$, $i = 1,\ldots,n$. The standard approach is the LS method, introduced by Legendre and Gauss more than two centuries ago, which minimizes the sum of the in-sample squared errors (SEs), i.e., the prediction errors on the training data.
A common way of analyzing estimation methods is based on the assumption that the data are generated according to
\[y_{i} = \mathbf{\varphi}_{i}^{\top}\mathbf{\theta}_{\circ} + \varepsilon_{i}, \quad i = 1,\ldots,n \tag{2}\]
where $\mathbf{\theta}_{\circ}$ contains the unknown system parameters and $\varepsilon_{i}$ is a noise term. It is then possible to ask how far the estimate $\hat{\mathbf{\theta}}_{n}$ is from the unknown $\mathbf{\theta}_{\circ}$. If $\varepsilon_{i}$ is a stochastic process, one can measure the discrepancy using the mean SE (MSE)
\[\text{MSE}_{n} = \mathbb{E}\left[\|\mathbf{\theta}_{\circ} - \hat{\mathbf{\theta}}_{n}\|^{2}\right] \tag{3}\]
where $\|\cdot\|$ is the $\ell_{2}$-norm, so that $\|x\|^{2} = x^{\top}x$. For $n \geq d$ it is well known that the MSE decreases as $n$ increases if, for example, $\varepsilon_{i}$ is a white noise process with zero mean.
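For concreteness, the following minimal sketch (in Python with NumPy; the dimensions, noise level, and variable names are our own choices) generates data according to (2) and computes the LS estimate in the oversampled case $n \geq d$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10                         # oversampled case: n >= d
theta_true = rng.standard_normal(d)    # plays the role of theta_o in (2)
Phi = rng.standard_normal((n, d))      # rows are the regressors phi_i
y = Phi @ theta_true + 0.1 * rng.standard_normal(n)   # data model (2)

# LS estimate: minimizes the sum of in-sample squared errors
theta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

se = np.sum((theta_true - theta_hat) ** 2)   # single-realization SE; averaging over
print(se)                                    # noise realizations would approximate (3)
```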
The case of $n < d$ has received less attention in the literature and is the focus of this Lecture Notes column. In this undersampled regime, it is possible to fit the model perfectly to any training data if the features are linearly independent. The conventional wisdom has been that such interpolating estimators are problematic, since they may be subject to high variability and therefore generalize poorly to features not in the training data. For this reason, regularized solutions, such as ridge regression, are often employed in the undersampled case.
Let us look at an example. Figure 1 shows the MSE of a specific LS estimator that we will study in detail in this Lecture Notes column. We see that in the undersampled regime, $n < d$, the MSE is not monotonically decreasing with $n$. In fact, we have a double-descent behavior: the MSE decreases for small $n$, then starts to increase dramatically as $n$ gets close to $d$, and finally decreases again for $n > d$. This behavior would appear to go against the conventional wisdom.
Figure 1. A typical example of the MSE of the LS estimator (see “The SE” section for details). The vertical dashed line indicates ${n} = {d},$ so the left half of the figure corresponds to the undersampled regime.
In this Lecture Notes column, the aim is to explain the behavior of the LS estimator using only basic linear algebra.
Interpolating estimators have received a lot of attention recently. In [1] the general phenomenon of double descent was introduced by studying the predictive performance of a number of interpolating estimators. This phenomenon was later analyzed in the setting of linear regression in several articles, such as [2], [3], [4], and [5]. An overview can be found in [6]. The cited articles consider the predictive performance of interpolating estimators when the number of parameters $d$ varies. Their theoretical results are relatively involved and generally depend on how $\mathbf{\varphi}$ in (1) is generated. Thus, they do not translate into insights that are easily conveyed in a lecture.
In this Lecture Notes column we instead consider the MSE in (3) in the case when the number of parameters d is fixed while the number of samples n varies. This allows us to give an intuitive explanation of the double-descent behavior using simple linear algebra.
The reader is required to have basic knowledge of linear algebra and elementary probability theory.
Our analysis of the interpolating LS estimator rests heavily on the Moore–Penrose pseudoinverse of a matrix (see "Moore–Penrose Pseudoinverse," below) and the orthogonal projector (see "Orthogonal Projection Matrices," below) onto the null space of a matrix.
Moore–Penrose Pseudoinverse
Let $\mathbf{\Phi}$ be a matrix in $\mathbb{R}^{n\times d}$. If $\mathbf{\Phi}$ is square ($n = d$) and has full rank, then there exists a unique inverse $\mathbf{\Phi}^{-1}$ such that $\mathbf{\Phi}\mathbf{\Phi}^{-1} = \mathbf{\Phi}^{-1}\mathbf{\Phi} = I$. The Moore–Penrose pseudoinverse $\mathbf{\Phi}^{\dagger}$ is a generalization of the inverse to cases when $n \ne d$ or $\mathbf{\Phi}$ is rank-deficient. It is the unique matrix that satisfies the following four conditions:
\[\mathbf{\Phi}\mathbf{\Phi}^{\dagger}\mathbf{\Phi} = \mathbf{\Phi} \tag{S1}\]
\[\mathbf{\Phi}^{\dagger}\mathbf{\Phi}\mathbf{\Phi}^{\dagger} = \mathbf{\Phi}^{\dagger} \tag{S2}\]
\[(\mathbf{\Phi}\mathbf{\Phi}^{\dagger})^{\top} = \mathbf{\Phi}\mathbf{\Phi}^{\dagger} \tag{S3}\]
\[(\mathbf{\Phi}^{\dagger}\mathbf{\Phi})^{\top} = \mathbf{\Phi}^{\dagger}\mathbf{\Phi}. \tag{S4}\]
It can be shown that the pseudoinverse also satisfies
\[\mathbf{\Phi}^{\dagger} = (\mathbf{\Phi}^{\top}\mathbf{\Phi})^{\dagger}\mathbf{\Phi}^{\top} = \mathbf{\Phi}^{\top}(\mathbf{\Phi}\mathbf{\Phi}^{\top})^{\dagger}. \tag{S5}\]
If we assume that $\mathbf{\Phi}$ has full rank, then (S5) gives us convenient formulas for $\mathbf{\Phi}^{\dagger}$. If $n = d$, then $\mathbf{\Phi}^{\dagger} = \mathbf{\Phi}^{-1}$. If $n \geq d$, then $\mathbf{\Phi}^{\top}\mathbf{\Phi}$ is invertible, and
\[\mathbf{\Phi}^{\dagger} = (\mathbf{\Phi}^{\top}\mathbf{\Phi})^{-1}\mathbf{\Phi}^{\top}. \tag{S6}\]
Finally, if $n \leq d$, then $\mathbf{\Phi}\mathbf{\Phi}^{\top}$ is invertible and
\[\mathbf{\Phi}^{\dagger} = \mathbf{\Phi}^{\top}(\mathbf{\Phi}\mathbf{\Phi}^{\top})^{-1}. \tag{S7}\]
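These properties are easy to verify numerically. The sketch below (Python/NumPy, with an arbitrary random matrix) checks the four conditions (S1)–(S4) and the formula (S7) for a full-rank matrix with $n < d$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                      # undersampled: n < d
Phi = rng.standard_normal((n, d))

P = np.linalg.pinv(Phi)          # Moore-Penrose pseudoinverse

# The four Penrose conditions (S1)-(S4)
assert np.allclose(Phi @ P @ Phi, Phi)
assert np.allclose(P @ Phi @ P, P)
assert np.allclose((Phi @ P).T, Phi @ P)
assert np.allclose((P @ Phi).T, P @ Phi)

# For a full-rank Phi with n <= d, (S7) gives an explicit formula
assert np.allclose(P, Phi.T @ np.linalg.inv(Phi @ Phi.T))
```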
What explains the puzzling “double-descent” performance of the LS estimator? Specifically, why does the MSE first decrease with n and then increase until ${n} = {d}{?}$ When can a standard regularization method, such as ridge regression, improve the performance over the LS estimator?
We will provide a solution by decomposing the error of the LS estimate into two orthogonal components: a systematic component that is due to the unidentifiable part of ${\mathbf{\theta}}_{\circ},$ and a noise-induced component that is due to ${\varepsilon}_{i}$ in (2). For ${n}\leq{d}$ we will show that the systematic component decreases while the noise-induced component increases with n, and this leads to the double-descent behavior.
We will also show that ridge regression cannot reduce the error below the systematic component, and therefore it can typically not improve over the minimum-norm LS estimator when ${n}\ll{d}{.}$
The LS method finds an estimate $\tilde{\mathbf{\theta}}_{n}$ by minimizing the in-sample errors using the following criterion:
\begin{align*}V_{n}(\mathbf{\theta}) &= \sum\limits_{i=1}^{n}(y_{i} - \hat{y}(\mathbf{\varphi}_{i};\mathbf{\theta}))^{2}\\ &= \sum\limits_{i=1}^{n}(y_{i} - \mathbf{\varphi}_{i}^{\top}\mathbf{\theta})^{2}. \tag{4}\end{align*}
We will find it convenient to express the quantities in the above criterion as
\begin{align*}y_{n} = \left[\begin{array}{c}y_{1}\\ \vdots\\ y_{n}\end{array}\right]\quad\text{and}\quad\mathbf{\Phi}_{n} = \left[\begin{array}{c}\mathbf{\varphi}_{1}^{\top}\\ \vdots\\ \mathbf{\varphi}_{n}^{\top}\end{array}\right].\end{align*}
Using this notation, (4) can be written as
\[V_{n}(\mathbf{\theta}) = \|y_{n} - \mathbf{\Phi}_{n}\mathbf{\theta}\|^{2} \tag{5}\]
and (2) as
\[y_{n} = \mathbf{\Phi}_{n}\mathbf{\theta}_{\circ} + \mathbf{\varepsilon}_{n} \tag{6}\]
where
\[\mathbf{\varepsilon}_{n} = \left[\begin{array}{ccc}\varepsilon_{1} & \cdots & \varepsilon_{n}\end{array}\right]^{\top}.\]
All minimizers of $V_{n}(\mathbf{\theta})$ can be expressed using the pseudoinverse of the regressor matrix, $\mathbf{\Phi}_{n}^{\dagger}$, and the orthogonal projector onto its null space, which we denote $\mathbf{\Pi}_{n}^{\bot}$ (see "Moore–Penrose Pseudoinverse" and "Orthogonal Projection Matrices," above and below).
Result 1: $\tilde{\mathbf{\theta}}_{n}$ is a minimizer of $V_{n}(\mathbf{\theta})$ if and only if
\[\tilde{\mathbf{\theta}}_{n} = \mathbf{\Phi}_{n}^{\dagger}y_{n} + \mathbf{\Pi}_{n}^{\bot}x \tag{7}\]
for some $x\in\mathbb{R}^{d}$.
Proof: The derivative of the criterion (5) is given by
\[\frac{\partial V_{n}(\mathbf{\theta})}{\partial\mathbf{\theta}} = 2\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n}\mathbf{\theta} - 2\mathbf{\Phi}_{n}^{\top}y_{n}.\]
The minimizers are those vectors $\mathbf{\theta}$ for which the derivative is zero:
\[\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n}\mathbf{\theta} = \mathbf{\Phi}_{n}^{\top}y_{n}.\]
This is a linear system of equations, and the general solution can be written as
\[\tilde{\mathbf{\theta}}_{n} = z + \mathbf{\Pi}_{n}^{\bot}x\]
where $z$ is any particular solution, and $\mathbf{\Pi}_{n}^{\bot}x$ can represent any vector in the null space of $\mathbf{\Phi}_{n}$. To see that $z = \mathbf{\Phi}_{n}^{\dagger}y_{n}$ is a particular solution, note that
\[\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n}\mathbf{\Phi}_{n}^{\dagger}y_{n} = \mathbf{\Phi}_{n}^{\top}(\mathbf{\Phi}_{n}^{\top})^{\dagger}\mathbf{\Phi}_{n}^{\top}y_{n} = \mathbf{\Phi}_{n}^{\top}y_{n}\]
where (S3) was used in the first equality and (S1) in the last equality.
For simplicity, throughout the rest of this Lecture Notes column we will assume that ${\mathbf{\Phi}}_{n}$ has full rank, so that ${\mathbf{\Phi}}_{n}^{\dagger}$ can be computed using (S7) when ${n}\leq{d}$ and (S6) when ${n}\geq{d}{.}$
If $n \geq d$, it follows from (S6) that $\mathbf{\Pi}_{n}^{\bot} = \mathbf{0}$, and we get a unique LS solution:
\[\tilde{\mathbf{\theta}}_{n} = \mathbf{\Phi}_{n}^{\dagger}y_{n} = (\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n})^{-1}\mathbf{\Phi}_{n}^{\top}y_{n}. \tag{8}\]
However, if $n < d$, then the null space of $\mathbf{\Phi}_{n}$ is nontrivial, so $\mathbf{\Pi}_{n}^{\bot} \ne \mathbf{0}$. Thus (7) provides infinitely many minimizers of (5), obtained by varying $x$. Furthermore, in such a case all minimizers yield $V_{n}(\tilde{\mathbf{\theta}}_{n}) = 0$, which means that $\hat{y}(\mathbf{\varphi}_{i};\tilde{\mathbf{\theta}}_{n})$ matches $y_{i}$ exactly. In other words, any such $\tilde{\mathbf{\theta}}_{n}$ is an interpolating LS estimator. The question is: Which solution should we choose, given that they all produce the same in-sample predictions but are likely to result in different out-of-sample predictions?
The object of inference $\mathbf{\theta}_{\circ}$ can be decomposed into two orthogonal terms
\[\mathbf{\theta}_{\circ} = \mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ} + \mathbf{\Pi}_{n}\mathbf{\theta}_{\circ}\]
as explained in "Orthogonal Projection Matrices." According to (6), $\mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ}$ has no effect on the training data and thus cannot be inferred from them, so we call this term the unidentifiable part of $\mathbf{\theta}_{\circ}$. Since the training data contain no information about this unidentifiable part, a common choice is to set the corresponding part of $\tilde{\mathbf{\theta}}_{n}$ to zero. This corresponds to choosing $x = \mathbf{0}$ in (7), and we let $\hat{\mathbf{\theta}}_{n}$ denote the resulting estimator:
\[\hat{\mathbf{\theta}}_{n} = \mathbf{\Phi}_{n}^{\dagger}y_{n}. \tag{9}\]
This is the only LS estimator that belongs to the range space of $\mathbf{\Phi}_{n}^{\top}$. It is also the minimizer of (5) that has minimum norm. To see this, note that
\begin{align*}\|\tilde{\mathbf{\theta}}_{n}\|^{2} &= \|\mathbf{\Phi}_{n}^{\dagger}y_{n}\|^{2} + \|\mathbf{\Pi}_{n}^{\bot}x\|^{2} + 2x^{\top}\mathbf{\Pi}_{n}^{\bot}\mathbf{\Phi}_{n}^{\dagger}y_{n}\\ &= \|\mathbf{\Phi}_{n}^{\dagger}y_{n}\|^{2} + \|\mathbf{\Pi}_{n}^{\bot}x\|^{2}\end{align*}
since $\mathbf{\Pi}_{n}^{\bot}\mathbf{\Phi}_{n}^{\dagger} = \mathbf{0}$ according to (S2). Hence, we get the minimum-norm solution (9) by setting $x = \mathbf{0}$ in (7).
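As a quick numerical illustration (a sketch with arbitrarily chosen dimensions), the code below constructs the minimum-norm solution (9) and another minimizer of the form (7), and confirms that both interpolate the training data while the former has the smaller norm:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 50
Phi = rng.standard_normal((n, d))
y = rng.standard_normal(n)

P = np.linalg.pinv(Phi)
theta_mn = P @ y                        # minimum-norm LS estimate (9)
Pi_perp = np.eye(d) - P @ Phi           # projector onto the null space of Phi, (S9)

x = rng.standard_normal(d)
theta_other = theta_mn + Pi_perp @ x    # another minimizer, by Result 1

# Both interpolate the training data (zero in-sample error) ...
print(np.linalg.norm(y - Phi @ theta_mn))     # ~0
print(np.linalg.norm(y - Phi @ theta_other))  # ~0
# ... but the minimum-norm solution has the smallest norm.
print(np.linalg.norm(theta_mn), np.linalg.norm(theta_other))
```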
Orthogonal Projection Matrices
The null space of $\mathbf{\Phi}\in\mathbb{R}^{n\times d}$ consists of all vectors $x$ such that $\mathbf{\Phi}x = \mathbf{0}$, i.e.,
\[\mathcal{N}(\mathbf{\Phi}) = \left\{x\in\mathbb{R}^{d} : \mathbf{\Phi}x = \mathbf{0}\right\}.\]
The range space of $\mathbf{\Phi}$ is defined as
\[\mathcal{R}(\mathbf{\Phi}) = \left\{x\in\mathbb{R}^{n} : x = \mathbf{\Phi}z\ \text{for some}\ z\in\mathbb{R}^{d}\right\}.\]
The matrix
\[\mathbf{\Pi} = \mathbf{\Phi}^{\dagger}\mathbf{\Phi} \tag{S8}\]
is the orthogonal projector onto the range space of $\mathbf{\Phi}^{\top}$, while
\[\mathbf{\Pi}^{\bot} = I - \mathbf{\Pi} \tag{S9}\]
is the orthogonal projector onto the null space of $\mathbf{\Phi}$. Since $\mathbf{\Pi}^{\bot}$ is a projector, we must have $(\mathbf{\Pi}^{\bot})^{2} = \mathbf{\Pi}^{\bot}$, which can be verified using (S1). Since $\mathbf{\Pi}^{\bot}$ projects onto the null space of $\mathbf{\Phi}$, we have that $\mathbf{\Phi}\mathbf{\Pi}^{\bot} = \mathbf{0}$, which can also be verified using (S1). Using these projection matrices, we can decompose any vector $x\in\mathbb{R}^{d}$ into two orthogonal terms
\[x = \mathbf{\Pi}^{\bot}x + \mathbf{\Pi}x\]
where $\mathbf{\Pi}^{\bot}x$ is the part that lies in the null space of $\mathbf{\Phi}$, while $\mathbf{\Pi}x$ is orthogonal to the null space, which means that it lies in the range space of $\mathbf{\Phi}^{\top}$. That is,
\begin{align*}\mathbf{\Pi}^{\bot}x &\in\mathcal{N}(\mathbf{\Phi}),\\ \mathbf{\Pi}x &\in\mathcal{N}(\mathbf{\Phi})^{\bot} = \mathcal{R}(\mathbf{\Phi}^{\top}).\end{align*}
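A short numerical check of these projector properties (again a sketch with an arbitrary random matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 4, 7
Phi = rng.standard_normal((n, d))

Pi = np.linalg.pinv(Phi) @ Phi     # (S8): projector onto R(Phi^T)
Pi_perp = np.eye(d) - Pi           # (S9): projector onto N(Phi)

assert np.allclose(Pi_perp @ Pi_perp, Pi_perp)   # idempotent
assert np.allclose(Phi @ Pi_perp, 0)             # projects into the null space of Phi

x = rng.standard_normal(d)
x_null, x_range = Pi_perp @ x, Pi @ x
assert np.allclose(x, x_null + x_range)          # decomposition of x
assert np.isclose(x_null @ x_range, 0)           # the two parts are orthogonal
```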
In this section we will conduct an analysis that aims to explain the behavior in Figure 1 for $n < d$, and we specifically consider the minimum-norm estimate in (9). To keep the analysis as general as possible, we will not make any statistical assumptions at this point, and thus we will study the SE
\[\text{SE}_{n} = \|\mathbf{\theta}_{\circ} - \hat{\mathbf{\theta}}_{n}\|^{2} \tag{10}\]
in this section. Then in "The MSE" section, below, we will add some statistical assumptions and take the expectation of the SE to obtain the MSE.
We next show that the SE can be decomposed into one term that is due to the unidentifiable part of ${\mathbf{\theta}}_{\circ}$ and another term that arises from the noise ${\mathbf{\varepsilon}}_{n}{.}$
Result 2: The SE of the minimum-norm estimate (9) is given by
\[\text{SE}_{n} = \|\mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ}\|^{2} + \|\mathbf{\Phi}_{n}^{\dagger}\mathbf{\varepsilon}_{n}\|^{2}. \tag{11}\]
We refer to the first term as the systematic error component, and to the second as the noise-induced error component.
Proof: The estimation error is given by
\begin{align*}\mathbf{\theta}_{\circ} - \hat{\mathbf{\theta}}_{n} &= \mathbf{\theta}_{\circ} - \mathbf{\Phi}_{n}^{\dagger}(\mathbf{\Phi}_{n}\mathbf{\theta}_{\circ} + \mathbf{\varepsilon}_{n})\\ &= \mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ} - \mathbf{\Phi}_{n}^{\dagger}\mathbf{\varepsilon}_{n}.\end{align*}
Inserting this into (10), and noting that the cross-term vanishes since ${\mathbf{\Pi}}_{n}^{\bot}{\mathbf{\Phi}}_{n}^{\dagger} = {\mathbf{0}},$ we get (11).
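Result 2 is straightforward to check numerically; the following sketch (arbitrary dimensions and noise level) compares the two sides of (11) for one data realization:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 100
Phi = rng.standard_normal((n, d))
theta_0 = rng.standard_normal(d)
eps = 0.1 * rng.standard_normal(n)
y = Phi @ theta_0 + eps                        # data model (6)

P = np.linalg.pinv(Phi)
Pi_perp = np.eye(d) - P @ Phi                  # projector onto the null space of Phi_n
theta_hat = P @ y                              # minimum-norm estimate (9)

se = np.sum((theta_0 - theta_hat) ** 2)                                    # left side of (11)
se_decomposed = np.sum((Pi_perp @ theta_0) ** 2) + np.sum((P @ eps) ** 2)  # right side of (11)
print(se, se_decomposed)                       # the two agree
```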
We will now study how these two error components behave as n increases.
The systematic component ${\Vert}{{\mathbf{\Pi}}_{n}^{\bot}{\mathbf{\theta}}_{\circ}}{\Vert}^{2}$ in (11) is due to the unidentifiable part of ${\mathbf{\theta}}_{\circ}.$ Before obtaining any training data, no part of ${\mathbf{\theta}}_{\circ}$ can be identified (assuming that no prior information is available) so the systematic component equals ${\Vert}{\mathbf{\theta}}_{\circ}{\Vert}^{2}{.}$ As more data are collected, the null space of ${\mathbf{\Phi}}_{n}$ shrinks, so the unidentifiable part of ${\mathbf{\theta}}_{\circ}$ is reduced. When ${n}\geq{d},$ the null space of ${\mathbf{\Phi}}_{n}$ is trivial, i.e., ${\mathbf{\Pi}}_{n}^{\bot} = {\mathbf{0}}$ and ${\mathbf{\theta}}_{\circ}$ is fully identifiable.
To summarize, the systematic error component decreases monotonically from $\|\mathbf{\theta}_{\circ}\|^{2}$ to zero as the number of samples increases, so that we have for $n = 1,2,\ldots,d-1$:
\begin{align*}\|\mathbf{\theta}_{\circ}\|^{2} &> \|\mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ}\|^{2} \geq \|\mathbf{\Pi}_{n+1}^{\bot}\mathbf{\theta}_{\circ}\|^{2}\\ &\geq \|\mathbf{\Pi}_{d}^{\bot}\mathbf{\theta}_{\circ}\|^{2} = 0. \tag{12}\end{align*}
An explicit update formula,
\[\|\mathbf{\Pi}_{n+1}^{\bot}\mathbf{\theta}_{\circ}\|^{2} = \|\mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ}\|^{2} - \frac{(\mathbf{\varphi}_{n+1}^{\top}\mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ})^{2}}{\mathbf{\varphi}_{n+1}^{\top}\mathbf{\Pi}_{n}^{\bot}\mathbf{\varphi}_{n+1}},\]
can be derived using (S7) and the matrix inversion lemma (the derivation is left as an exercise to the reader).
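The formula can at least be verified numerically, as in the following sketch (arbitrary dimensions; the helper pi_perp is our own construction based on (S9)):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 30, 10
theta_0 = rng.standard_normal(d)
Phi_n = rng.standard_normal((n, d))
phi_new = rng.standard_normal(d)      # plays the role of phi_{n+1}

def pi_perp(Phi):
    # orthogonal projector onto the null space of Phi, (S9)
    return np.eye(Phi.shape[1]) - np.linalg.pinv(Phi) @ Phi

Pn = pi_perp(Phi_n)
lhs = np.sum((pi_perp(np.vstack([Phi_n, phi_new])) @ theta_0) ** 2)
rhs = np.sum((Pn @ theta_0) ** 2) \
      - (phi_new @ Pn @ theta_0) ** 2 / (phi_new @ Pn @ phi_new)
print(lhs, rhs)                       # the two agree
```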
When ${n}\geq{d},$ the systematic error component vanishes: ${\Vert}{{\mathbf{\Pi}}_{n}^{\bot}{\mathbf{\theta}}_{\circ}}{\Vert}^{2}\equiv{0}{.}$
In this section we will analyze the noise-induced component ${\Vert}{{\mathbf{\Phi}}_{n}^{\dagger}{\mathbf{\varepsilon}}_{n}}{\Vert}^{2}$ in (11). Before any training data have been collected, there is no noise, so we can consider this error to be zero for ${n} = {0}{.}$ We will now show that as we start to collect data, this error component will monotonically increase until ${n} = {d}{.}$
To see this, consider the linear system of equations
\[\mathbf{\Phi}_{n}\mathbf{\vartheta} = \mathbf{\varepsilon}_{n} \tag{13}\]
with $n$ equations and $d$ unknowns. When $n < d$, this system of equations has infinitely many solutions, and $\mathbf{\vartheta} = \mathbf{\Phi}_{n}^{\dagger}\mathbf{\varepsilon}_{n}$ is the solution with minimum $\ell_{2}$-norm. Thus, the noise-induced component in (11) is the squared norm of this solution. Let us now study what happens to $\mathbf{\vartheta}$ as $n$ increases.
When $n$ is increased by one, we add an equation to (13), and consequently the set of solutions shrinks. Since the minimum norm over a smaller solution set cannot be smaller, $\|\mathbf{\Phi}_{n}^{\dagger}\mathbf{\varepsilon}_{n}\|$ is monotonically increasing for $n = 1,2,\ldots,d$. When $n = d$ there is only one solution: $\mathbf{\vartheta} = \mathbf{\Phi}_{d}^{-1}\mathbf{\varepsilon}_{d}$.
To summarize, the noise-induced error component is monotonically increasing as the number of samples increases, so that we have for $n = 1,2,\ldots,d-1$:
\begin{align*}0 \leq \|\mathbf{\Phi}_{n}^{\dagger}\mathbf{\varepsilon}_{n}\|^{2} \leq \|\mathbf{\Phi}_{n+1}^{\dagger}\mathbf{\varepsilon}_{n+1}\|^{2} \leq \|\mathbf{\Phi}_{d}^{-1}\mathbf{\varepsilon}_{d}\|^{2}. \tag{14}\end{align*}
An explicit update formula,
\[\|\mathbf{\Phi}_{n+1}^{\dagger}\mathbf{\varepsilon}_{n+1}\|^{2} = \|\mathbf{\Phi}_{n}^{\dagger}\mathbf{\varepsilon}_{n}\|^{2} + \frac{(\mathbf{\varphi}_{n+1}^{\top}\mathbf{\Phi}_{n}^{\dagger}\mathbf{\varepsilon}_{n} - \varepsilon_{n+1})^{2}}{\mathbf{\varphi}_{n+1}^{\top}\mathbf{\Pi}_{n}^{\bot}\mathbf{\varphi}_{n+1}},\]
can be derived using (S7) and the matrix inversion lemma (this derivation is also left as an exercise to the reader).
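This formula can be checked numerically in the same way (a sketch with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 40, 15
Phi_n = rng.standard_normal((n, d))
eps_n = rng.standard_normal(n)
phi_new = rng.standard_normal(d)       # phi_{n+1}
eps_new = rng.standard_normal()        # epsilon_{n+1}

Phi_n1 = np.vstack([Phi_n, phi_new])
eps_n1 = np.append(eps_n, eps_new)

P_n = np.linalg.pinv(Phi_n)
Pi_perp_n = np.eye(d) - P_n @ Phi_n

lhs = np.sum((np.linalg.pinv(Phi_n1) @ eps_n1) ** 2)
rhs = np.sum((P_n @ eps_n) ** 2) \
      + (phi_new @ P_n @ eps_n - eps_new) ** 2 / (phi_new @ Pi_perp_n @ phi_new)
print(lhs, rhs)                        # the two agree
```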
For ${n}{>}{d},$ (13) is overdetermined and does not have an exact solution anymore. Here the noise-induced error component can either increase or decrease as n increases, but we will see in “The MSE” section, below, that under mild assumptions the general trend is that it decreases with increasing n when ${n}\geq{d}{.}$
To illustrate the analysis so far, consider Figure 2. Here the training data were generated according to (6) with $d = 1{,}000$ parameters drawn from a Gaussian distribution with zero mean and identity covariance matrix. The parameter vector was then normalized so that $\|\mathbf{\theta}_{\circ}\|^{2} = 1$. The features $\mathbf{\varphi}_{i}$ were independently drawn from a Gaussian distribution with zero mean and identity covariance matrix. The noise $\varepsilon_{i}$ is a zero-mean white Gaussian noise process with variance chosen so that the signal-to-noise ratio is 10.
Figure 2. The SE for minimum-norm and optimal ridge regression for one realization, plotted together with the systematic and noise-induced components.
For $n \leq d$ we see that the systematic component decreases monotonically from $\|\mathbf{\theta}_{\circ}\|^{2} = 1$ to 0 as the number of samples increases from $n = 0$ to $d$, while the noise-induced component is monotonically increasing. In this example the systematic component dominates the error for small $n$, so the SE decreases when $n$ is relatively small, but as $n$ gets close to $d$ the noise-induced component starts to dominate and thus the SE increases.
For ${n}\geq{d},$ the systematic component is zero. If we study Figure 2 carefully, we can see that the noise-induced component can increase sometimes, but the general trend is that it decreases. This will be explained in “The MSE” section, below.
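The curves in Figure 2 can be reproduced, up to the particular random realization, with a short simulation. The sketch below sweeps $n$ and, for each $n$, computes the SE of the minimum-norm estimate (9) together with its two components in (11); the setup follows the description above, except that a smaller $d$ is used to keep the run time short:

```python
import numpy as np

rng = np.random.default_rng(7)
d = 100                                   # smaller than in the article, for speed
theta_0 = rng.standard_normal(d)
theta_0 /= np.linalg.norm(theta_0)        # normalize so that ||theta_0||^2 = 1
snr = 10
v = 1.0 / snr                             # noise variance; E[(phi^T theta_0)^2] = 1

n_max = 2 * d
Phi_all = rng.standard_normal((n_max, d))
eps_all = np.sqrt(v) * rng.standard_normal(n_max)

se, se_sys, se_noise = [], [], []
for n in range(1, n_max + 1):
    Phi, eps = Phi_all[:n], eps_all[:n]
    y = Phi @ theta_0 + eps
    P = np.linalg.pinv(Phi)
    Pi_perp = np.eye(d) - P @ Phi
    theta_hat = P @ y                              # minimum-norm LS estimate (9)
    se.append(np.sum((theta_0 - theta_hat) ** 2))  # SE, (10)
    se_sys.append(np.sum((Pi_perp @ theta_0) ** 2))
    se_noise.append(np.sum((P @ eps) ** 2))

# se_sys decreases to zero at n = d, se_noise peaks around n = d,
# and se = se_sys + se_noise exhibits the double-descent shape.
```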
Figure 2 also includes an optimally tuned ridge regression estimator, which we will now explain.
Interpolating LS estimators are typically not covered in standard textbooks. Instead, to avoid multiple solutions and mitigate interpolation of the noise, regularization of the LS criterion is discussed. A popular choice is ridge regression, which minimizes the criterion
\[\tilde{V}_{n}(\mathbf{\theta}) = \|y_{n} - \mathbf{\Phi}_{n}\mathbf{\theta}\|^{2} + \lambda\|\mathbf{\theta}\|^{2}\]
where $\lambda > 0$ is a tuning parameter that controls the size of the penalty term. The minimizer of $\tilde{V}_{n}(\mathbf{\theta})$ is unique and is given by
\begin{align*}\hat{\mathbf{\theta}}_{n}^{(\lambda)} &= \mathbf{\Phi}_{n}^{\top}(\mathbf{\Phi}_{n}\mathbf{\Phi}_{n}^{\top} + \lambda I)^{-1}y_{n}\\ &= (\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n} + \lambda I)^{-1}\mathbf{\Phi}_{n}^{\top}y_{n}. \tag{15}\end{align*}
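Both expressions in (15) are useful in practice, since the first involves an $n\times n$ inverse and the second a $d\times d$ inverse. The sketch below (arbitrary dimensions and $\lambda$) checks their equivalence and illustrates that a very small $\lambda$ essentially recovers the minimum-norm LS estimate (9), which is the limit discussed next:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, lam = 15, 60, 0.5
Phi = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# The two equivalent forms of the ridge estimate (15)
theta_r1 = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)
theta_r2 = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
assert np.allclose(theta_r1, theta_r2)

# A very small lambda essentially gives the minimum-norm LS estimate (9)
theta_mn = np.linalg.pinv(Phi) @ y
theta_r_small = Phi.T @ np.linalg.solve(Phi @ Phi.T + 1e-10 * np.eye(n), y)
print(np.linalg.norm(theta_r_small - theta_mn))   # ~0
```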
From the first expression in (15), we see that the ridge regression estimator lies in the range space of $\mathbf{\Phi}_{n}^{\top}$, so if we let $\lambda\rightarrow 0$ it can be shown that $\hat{\mathbf{\theta}}_{n}^{(\lambda)}$ approaches the only LS estimator that lies in the range space of $\mathbf{\Phi}_{n}^{\top}$, i.e., the minimum-norm estimator (9). Another implication of (15) is that $\mathbf{\Pi}_{n}^{\bot}\hat{\mathbf{\theta}}_{n}^{(\lambda)} = \mathbf{0}$, so
\begin{align*}\|\mathbf{\theta}_{\circ} - \hat{\mathbf{\theta}}_{n}^{(\lambda)}\|^{2} &= \|\mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ}\|^{2} + \|\mathbf{\Pi}_{n}(\mathbf{\theta}_{\circ} - \hat{\mathbf{\theta}}_{n}^{(\lambda)})\|^{2}\\ &\geq \|\mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ}\|^{2}.\end{align*}
That is, ridge regression, or any other estimator that lies in the range space of $\mathbf{\Phi}_{n}^{\top}$, cannot reduce the SE below the systematic component.
In Figure 2 we can see that for small $n$ the error of the LS estimator is dominated by the systematic component. Since the systematic component is a lower bound for ridge regression, it is not possible to significantly improve the accuracy using this type of regularization. This fact is illustrated in Figure 2, where $\lambda$ for the ridge regression estimate has been tuned for each $n$ to achieve as small an error as possible. Note that such tuning requires the true $\mathbf{\theta}_{\circ}$, which obviously is not available in applications; nonetheless, it shows that in this example no $\lambda$ significantly improves the estimate compared to minimum-norm LS when $n$ is small. When $n$ is close to $d$, however, it is possible to avoid the noise-induced peak by choosing a suitable $\lambda$.
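The oracle tuning used in Figure 2 can be mimicked as follows: for a given training set, sweep $\lambda$ over a grid, compute the SE of each ridge estimate using the true $\mathbf{\theta}_{\circ}$, and keep the smallest value. The sketch below (arbitrary dimensions and a hypothetical grid of $\lambda$ values) also shows that even this oracle choice cannot go below the systematic component:

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 20, 100
theta_0 = rng.standard_normal(d)
theta_0 /= np.linalg.norm(theta_0)
Phi = rng.standard_normal((n, d))
y = Phi @ theta_0 + np.sqrt(0.1) * rng.standard_normal(n)

Pi_perp = np.eye(d) - np.linalg.pinv(Phi) @ Phi
sys_component = np.sum((Pi_perp @ theta_0) ** 2)      # lower bound on the SE

lambdas = np.logspace(-6, 3, 50)                      # hypothetical grid
se_ridge = []
for lam in lambdas:
    theta_r = Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), y)  # (15)
    se_ridge.append(np.sum((theta_0 - theta_r) ** 2))

print(min(se_ridge), ">=", sys_component)   # even oracle-tuned ridge cannot beat the bound
```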
So far in the Lecture Notes column we have not made any assumptions about the noise in (6), so the results are valid for any data realization. However, to explain the behavior of the LS estimate for ${n}{>}{d},$ some statistical assumptions are needed.
A common statistical assumption is that $\varepsilon_{i}$ in (6) is a white noise process with zero mean and variance $v$, so that $\mathbb{E}[\mathbf{\varepsilon}_{n}] = \mathbf{0}$ and $\mathbb{E}[\mathbf{\varepsilon}_{n}\mathbf{\varepsilon}_{n}^{\top}] = vI$. If we take the expected value of (11), the MSE of the estimator can be written as
\begin{align*}\text{MSE}_{n} &= \mathbb{E}\left[\|\mathbf{\theta}_{\circ} - \hat{\mathbf{\theta}}_{n}\|^{2}\right]\\ &= \|\mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ}\|^{2} + \mathbb{E}\left[\|\mathbf{\Phi}_{n}^{\dagger}\mathbf{\varepsilon}_{n}\|^{2}\right]\\ &= \|\mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ}\|^{2} + v\,\text{tr}\left\{\mathbf{\Phi}_{n}^{\dagger}(\mathbf{\Phi}_{n}^{\dagger})^{\top}\right\}.\end{align*}
By using the first equality in (S5) and (S2), it follows that $\mathbf{\Phi}_{n}^{\dagger}(\mathbf{\Phi}_{n}^{\dagger})^{\top} = (\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n})^{\dagger}\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n}(\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n})^{\dagger} = (\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n})^{\dagger}$, so
\[\text{MSE}_{n} = \|\mathbf{\Pi}_{n}^{\bot}\mathbf{\theta}_{\circ}\|^{2} + v\,\text{tr}\left\{(\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n})^{\dagger}\right\}.\]
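A Monte Carlo check of the noise-induced term (a sketch; the dimensions and number of noise realizations are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, v, runs = 30, 80, 0.5, 20000
Phi = rng.standard_normal((n, d))          # fixed regressor matrix
P = np.linalg.pinv(Phi)

# Empirical average of ||Phi^+ eps||^2 over white-noise realizations ...
mc = np.mean([np.sum((P @ (np.sqrt(v) * rng.standard_normal(n))) ** 2)
              for _ in range(runs)])
# ... versus the closed-form expression v * tr{(Phi^T Phi)^+}
closed_form = v * np.trace(np.linalg.pinv(Phi.T @ Phi))
print(mc, closed_form)                     # approximately equal
```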
Recall that for ${n} = {1},{2},\ldots,{d},$ the systematic component decreases to zero, while the noise-induced component increases monotonically. For ${n} = {d},{d} + {1},\ldots,$ we can show that the expected noise-induced component decreases monotonically.
Result 3: If ${n}\geq{d}$ and ${\mathbf{\varepsilon}}_{n}$ is a white noise sequence with zero mean, then \[{\text{MSE}}_{{n} + {1}}\leq{\text{MSE}}_{n}{.}\]
Proof: In this case, ${\mathbf{\Pi}}_{n}^{\bot}{\mathbf{\theta}}_{\circ} = {\mathbf{0}}$ and ${\mathbf{\Phi}}_{n}^{\top}{\mathbf{\Phi}}_{n}$ is invertible, so \[{\text{MSE}}_{n} = {v}{\text{tr}}\left\{{({\mathbf{\Phi}}_{n}^{\top}{\mathbf{\Phi}}_{n}{)}^{{-}{1}}}\right\}{.}\]
Note that: \[{\mathbf{\Phi}}_{{n} + {1}}^{\top}{\mathbf{\Phi}}_{{n} + {1}} = {\mathbf{\Phi}}_{n}^{\top}{\mathbf{\Phi}}_{n} + {\mathbf{\varphi}}_{{n} + {1}}{\mathbf{\varphi}}_{{n} + {1}}^{\top}{.}\]
Using the matrix inversion lemma, we see that
\begin{align*}\text{MSE}_{n+1} &= v\,\text{tr}\left\{(\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n} + \mathbf{\varphi}_{n+1}\mathbf{\varphi}_{n+1}^{\top})^{-1}\right\}\\ &= v\,\text{tr}\left\{(\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n})^{-1}\right\} - v\,\text{tr}\left\{\frac{(\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n})^{-1}\mathbf{\varphi}_{n+1}\mathbf{\varphi}_{n+1}^{\top}(\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n})^{-1}}{1 + \mathbf{\varphi}_{n+1}^{\top}(\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n})^{-1}\mathbf{\varphi}_{n+1}}\right\}\\ &\leq v\,\text{tr}\left\{(\mathbf{\Phi}_{n}^{\top}\mathbf{\Phi}_{n})^{-1}\right\} = \text{MSE}_{n}\end{align*}
where the inequality holds since the trace of a positive semidefinite matrix is greater than or equal to zero.
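Result 3 can be checked directly for a single added sample (a sketch with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(10)
n, d, v = 50, 20, 1.0                      # oversampled: n >= d
Phi_n = rng.standard_normal((n, d))
phi_new = rng.standard_normal(d)
Phi_n1 = np.vstack([Phi_n, phi_new])

mse_n = v * np.trace(np.linalg.inv(Phi_n.T @ Phi_n))
mse_n1 = v * np.trace(np.linalg.inv(Phi_n1.T @ Phi_n1))
print(mse_n1 <= mse_n)                     # True: adding a sample cannot increase the MSE
```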
We return to the experiment in “The SE” section, but now evaluate the MSE. We select the noise variance v of ${\varepsilon}_{i}$ such that the resulting signal-to-noise ratio is 1, 10, or 100.
Figure 3(a)–(c) shows the result when the $\mathbf{\varphi}_{i}$ are independently drawn from a Gaussian distribution with zero mean and identity covariance matrix. We see again that the systematic component decreases, while the expected value of the noise-induced component increases, for $n \leq d$. For $n \geq d$ the MSE is monotonically decreasing. Note also that the relative sizes of the systematic and noise-induced components depend on the signal-to-noise ratio. For higher signal-to-noise ratios, the systematic component dominates for small $n$, so from the "Ridge Regression" section, above, we know that ridge regression cannot significantly improve the accuracy in such a case.
Figure 3. MSE of LS estimator decomposed into the systematic and noise-induced components. (a)–(c) Random Gaussian features. (d)–(f) Orthogonal feature vectors. At high signal-to-noise ratios (SNR), the MSE is dominated by the systematic component when ${n}\ll{d}{.}$
In Figure 3(d)–(f), the features $\mathbf{\varphi}_{i}$ have been chosen so that they are mutually orthogonal for $i \leq d$. For $i > d$, the $\mathbf{\varphi}_{i}$ are independently drawn from a Gaussian distribution as above. While the noise-induced component is monotonically increasing for $n \leq d$ in this case too, we see that the increase is much smaller than for independently drawn features. We can understand this by noticing that the peak is at $\mathbb{E}\|\mathbf{\Phi}_{d}^{-1}\mathbf{\varepsilon}_{d}\|^{2}$, so the condition number of $\mathbf{\Phi}_{d}$ determines how large the peak is. With orthogonal features, the condition number of $\mathbf{\Phi}_{d}$ can be quite small.
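The effect of the conditioning of $\mathbf{\Phi}_{d}$ on the peak can be illustrated by comparing i.i.d. Gaussian features with an orthogonal design of comparable scale (a sketch; the orthogonal matrix is obtained from a QR factorization and rescaled so that the two designs have similar row norms):

```python
import numpy as np

rng = np.random.default_rng(11)
d, v = 200, 0.1
eps = np.sqrt(v) * rng.standard_normal(d)

Phi_gauss = rng.standard_normal((d, d))                    # i.i.d. Gaussian features
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Phi_orth = np.sqrt(d) * Q                                  # orthogonal rows of norm sqrt(d)

for name, Phi in [("Gaussian", Phi_gauss), ("orthogonal", Phi_orth)]:
    peak = np.sum(np.linalg.solve(Phi, eps) ** 2)          # ||Phi_d^{-1} eps_d||^2 at n = d
    print(name, "cond =", np.linalg.cond(Phi), "peak =", peak)
```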
In this Lecture Notes column, we have investigated the SE performance of the minimum-norm LS estimator. We have shown that 1) the estimation error decomposes into a systematic component, due to the unidentifiable part of $\mathbf{\theta}_{\circ}$, and a noise-induced component; 2) for $n \leq d$, the systematic component decreases monotonically to zero while the noise-induced component increases monotonically; and 3) for $n \geq d$, the systematic component vanishes and, under a white noise assumption, the MSE decreases monotonically.
Together, these points explain the double-descent behavior seen in the examples of this Lecture Notes column. For ${n}{<}{d},$ the error is initially decreasing due to the decrease of the systematic component, but as more noise is interpolated the noise-induced component grows so that the SE may increase. For ${n}\geq{d},$ the systematic component vanishes and the expected noise-induced component decreases.
Furthermore, we have seen that ridge regression regularization cannot reduce the error below the systematic component. Hence, if the error is dominated by the systematic component for ${n}\ll{d},$ ridge regression will not yield any significant improvements even if it is optimally tuned. The numerical examples show that this typically happens for high signal-to-noise ratios.
This work has been partly supported by the Swedish Research Council (VR) under Contracts 621-2016-0607, 2018-05040, and 2021-05022.
Per Mattsson (per.mattsson@it.uu.se) received his M.S. degree in engineering physics and his Ph.D. degree in automatic control from Uppsala University, Uppsala, Sweden, in 2010 and 2016, respectively. He is an assistant professor at Uppsala University, 751 05 Uppsala, Sweden. His research interests include statistical signal processing, system identification, control theory, and machine learning.
Dave Zachariah (dave.zachariah@it.uu.se) received his M.S. degree in electrical engineering and his Technical License and Ph.D. degrees in signal processing from the Royal Institute of Technology, Stockholm, Sweden, in 2007, 2011, and 2013, respectively. He is an associate professor at Uppsala University, 751 05 Uppsala, Sweden. His research interests include statistical signal processing, machine learning, causal inference, and localization.
Petre Stoica (ps@it.uu.se) is a professor in signals and systems modeling at Uppsala University, 751 05 Uppsala, Sweden. He is a member of the Royal Swedish Academy of Engineering Sciences, Stockholm, Sweden, an international member of the United States National Academy of Engineering, Washington, D.C. USA, an honorary member of the Romanian Academy, Bucharest, Romania, a member of the European Academy of Sciences, Liege, Belgium, and a member of the Royal Society of Sciences, Uppsala, Sweden. He is also a fellow of the European Association for Signal Processing and the Royal Statistical Society, a distinguished fellow of the International Engineering and Technology Institute, and a Fellow of IEEE.
[1] M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proc. Nat. Acad. Sci., vol. 116, no. 32, pp. 15,849–15,854, Aug. 2019, doi: 10.1073/pnas.1903070116.
[2] V. Muthukumar, K. Vodrahalli, V. Subramanian, and A. Sahai, “Harmless interpolation of noisy data in regression,” IEEE J. Sel. Areas Inf. Theory, vol. 1, no. 1, pp. 67–83, May 2020, doi: 10.1109/JSAIT.2020.2984716.
[3] M. Belkin, D. Hsu, and J. Xu, “Two models of double descent for weak features,” SIAM J. Math. Data Sci., vol. 2, no. 4, pp. 1167–1180, 2020, doi: 10.1137/20M1336072.
[4] T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, “Surprises in high-dimensional ridgeless least squares interpolation,” Ann. Statist., vol. 50, no. 2, pp. 949–986, Apr. 2022, doi: 10.1214/21-AOS2133.
[5] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler, “Benign overfitting in linear regression,” Proc. Nat. Acad. Sci., vol. 117, no. 48, pp. 30,063–30,070, Dec. 2020, doi: 10.1073/pnas.1907378117.
[6] M. Belkin, “Fit without fear: Remarkable mathematical phenomena of deep learning through the prism of interpolation,” Acta Numerica, vol. 30, pp. 203–248, May 2021, doi: 10.1017/S0962492921000039.
Digital Object Identifier 10.1109/MSP.2023.3242083