Rahul Parhi, Robert D. Nowak
Deep learning (DL) has been wildly successful in practice, and most of the state-of-the-art machine learning methods are based on neural networks (NNs). Lacking, however, is a rigorous mathematical theory that adequately explains the amazing performance of deep NNs (DNNs). In this article, we present a relatively new mathematical framework that provides the beginning of a deeper understanding of DL. This framework precisely characterizes the functional properties of NNs that are trained to fit data. The key mathematical tools that support this framework include transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory, which are all techniques deeply rooted in signal processing. This framework explains the effect of weight decay regularization in NN training, the use of skip connections and low-rank weight matrices in network architectures, the role of sparsity in NNs, and why NNs can perform well in high-dimensional problems.
DL has revolutionized engineering and the sciences in the modern data age. The typical goal of DL is to predict an output ${y}\in{\mathcal{Y}}$ (e.g., a label or response) from an input ${\mathbf{x}}\in{\mathcal{X}}$ (e.g., a feature or example). An NN is “trained” to fit to a set of data consisting of the pairs ${\left\{{{(}{\mathbf{x}}_{n},{y}_{n}{)}}\right\}}_{{n} = {1}}^{N}$ by finding a set of NN parameters $\mathbf{\theta}$ so that the NN mapping closely matches the data. The trained NN is a function, denoted by ${f}_{\mathbf{\theta}}{:}{\mathcal{X}}\rightarrow{\mathcal{Y}},$ that can be used to predict the output ${y}\in{\mathcal{Y}}$ of a new input ${\mathbf{x}}\in{\mathcal{X}}{.}$ This paradigm is referred to as supervised learning, which is the focus of this article. The success of DL has spawned a burgeoning industry that is continually developing new applications, NN architectures, and training algorithms. This article reviews recent developments in the mathematics of DL, focused on the characterization of the kinds of functions learned by NNs fit to data. There are currently many competing theories that explain the success of DL. These developments are part of a wider body of theoretical work that can be crudely organized into three broad categories: 1) approximation theory with NNs, 2) the design and analysis of optimization (“training”) algorithms for NNs, and 3) characterizations of the properties of trained NNs.
This article belongs to the third category of research and investigates the functional properties (i.e., the regularity) of solutions to NN training problems with explicit, Tikhonov-type regularization. Although much of the success of DL in practice comes from networks with highly structured architectures, it is hard to establish a rigorous and unified theory for such NNs used in practice. Therefore, we primarily focus on fully connected, feedforward NNs with the popular rectified linear unit (ReLU) activation function. This article introduces a mathematical framework that unifies a line of work from several authors over the last few years that sheds light on the nature and behavior of NN functions that are trained to a global minimizer with explicit regularization. The presented results are just one piece of the puzzle toward developing a mathematical theory of DL. The purpose of this article is, in particular, to provide a gentle introduction to this new mathematical framework, accessible to readers with a mathematical background in signals and systems and applied linear algebra. The framework is based on mathematical tools familiar to the signal processing community, including transform-domain sparse regularization, the Radon transform of computed tomography, and approximation theory. It is also related to well-known signal processing ideas such as wavelets, splines, and compressed sensing. This framework provides a new take on fundamental questions such as the following: What kinds of functions do NNs learn? What is the role of the NN activation function? Why do NNs seemingly break the curse of dimensionality?
The task of DL corresponds to learning the input–output mapping from a dataset in a hierarchical or multilayer manner. DNNs are complicated function mappings built from many smaller, simpler building blocks. The simplest building block of a DNN is an (artificial) neuron, inspired by the biological neurons of the brain [24]. A neuron is a function mapping ${\mathbb{R}}^{d}\rightarrow{\mathbb{R}}$ of the form ${\mathbf{z}}\mapsto{\sigma}{(}{\mathbf{w}}^{\top}{\mathbf{z}}{-}{b}{),}$ where ${\mathbf{w}}\in{\mathbb{R}}^{d}$ corresponds to the weights of the neuron and ${b}\in{\mathbb{R}}$ corresponds to the bias of the neuron. The function ${\sigma}{:}{\mathbb{R}}\rightarrow{\mathbb{R}}$ is referred to as the activation function of the neuron and controls the nonlinear response of the neuron. A neuron “activates” when the weighted combination of its input ${\mathbf{z}}$ exceeds a certain threshold, i.e., when ${\mathbf{w}}^{\top}{\mathbf{z}}\,{>}\,{b}{.}$ Therefore, typical activation functions such as the sigmoid, unit step function, or ReLU activate when their input exceeds zero, as seen in Figure 1.
Figure 1. The typical activation functions found in NNs. (a) Sigmoid. (b) Unit step. (c) ReLU.
A neuron is composed of a linear mapping followed by a nonlinearity. A popular form (or “architecture”) of a DNN is a fully connected feedforward DNN, which is a cascade of alternating linear mappings and componentwise nonlinearities. A feedforward DNN ${f}_{\mathbf{\theta}}$ (parameterized by $\mathbf{\theta}{)}$ can be represented as the function composition \[{f}_{\mathbf{\theta}}{(}{\mathbf{x}}{)} = {\mathbf{A}}^{(L)}\circ\mathbf{\sigma}\circ{\mathbf{A}}^{{(}{L}{-}{1}{)}}\circ\cdots\circ\mathbf{\sigma}\circ{\mathbf{A}}^{(1)}{(}{\mathbf{x}}{)} \tag{1} \] where for each ${\ell} = {1},\ldots,{L},$ the function ${\mathbf{A}}^{{(}{\ell}{)}}{(}{\mathbf{z}}{)} = {W}^{{(}{\ell}{)}}{\mathbf{z}}{-}{\mathbf{b}}^{{(}{\ell}{)}}$ is an affine linear mapping with weight matrix ${W}^{{(}{\ell}{)}}$ and bias vector ${\mathbf{b}}^{{(}{\ell}{)}}{.}$ The functions $\mathbf{\sigma}$ that appear in the composition apply the activation function ${\sigma}{:}{\mathbb{R}}\rightarrow{\mathbb{R}}$ componentwise to the vector ${\mathbf{A}}^{{(}{\ell}{)}}{(}{\mathbf{z}}{).}$ Although the activation function could change from neuron to neuron, in this article, we assume that the same activation function is used in the entire network. The parameters of this DNN are the weights and biases, i.e., $\mathbf{\theta} = \left\{{{(}{W}^{{(}{\ell}{)}},{\mathbf{b}}^{{(}{\ell}{)}}{)}}\right\}_{{\ell} = {1}}^{L}{.}$ Each mapping $\mathbf{A}^{{(}{\ell}{)}}$ corresponds to a layer of the DNN, and the number of mappings L is the depth of the DNN. The dimensions of the weight matrices ${W}^{{(}{\ell}{)}}$ correspond to the number of neurons in each layer (i.e., the width of the layer). Alternative DNN architectures can be built with other simple building blocks, e.g., with convolutions and pooling/downsampling operations, which would correspond to deep convolutional NNs. DNN architectures are often depicted with diagrams, as in Figure 2.
Figure 2. An example depiction of a DNN and its components. (a) A feedforward DNN architecture where the nodes represent the neurons and the edges represent the weights. (b) A single neuron from the DNN in (a) mapping an input ${\mathbf{z}}\in{\mathbb{R}}^{d}$ to an output in ${\mathbb{R}}^{D}$ via ${\mathbf{z}}\mapsto{\mathbf{v}}{\sigma}{(}{\mathbf{w}}^{\top}{\mathbf{z}}{-}{b}{).}$
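To make the composition in (1) concrete, the following minimal sketch (our own illustration; the widths, weights, and input are arbitrary placeholders) implements a fully connected feedforward ReLU network as an alternating cascade of affine maps and componentwise nonlinearities.

```python
import numpy as np

def relu(z):
    # Componentwise ReLU activation sigma(t) = max{0, t}.
    return np.maximum(z, 0.0)

def feedforward(x, params):
    # params is a list of (W, b) pairs, one per layer, so that
    # f_theta(x) = A^(L) o sigma o A^(L-1) o ... o sigma o A^(1)(x),
    # with A^(l)(z) = W^(l) z - b^(l), as in (1).
    z = x
    for ell, (W, b) in enumerate(params):
        z = W @ z - b
        if ell < len(params) - 1:   # no activation after the final affine map
            z = relu(z)
    return z

# Example: a depth-3 network mapping R^4 -> R^2 with hidden widths 8 and 8.
rng = np.random.default_rng(0)
widths = [4, 8, 8, 2]
params = [(rng.standard_normal((widths[l + 1], widths[l])),
           rng.standard_normal(widths[l + 1]))
          for l in range(len(widths) - 1)]
print(feedforward(rng.standard_normal(4), params))
```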
Given a DNN ${f}_{\mathbf{\theta}}$ parameterized by $\mathbf{\theta}\in\Theta$ (of any architecture), the task of learning from the data ${\left\{{{(}{\mathbf{x}}_{n},{y}_{n}{)}}\right\}}_{{n} = {1}}^{N}$ is formulated as the optimization problem \[\mathop{\min}\limits_{\mathbf{\theta}\,\in\,\Theta}\mathop{\sum}\limits_{{n} = {1}}^{N}{{\mathcal{L}}\left({{y}_{n},{f}_{\mathbf{\theta}}{(}{\mathbf{x}}_{n}{)}}\right)} \tag{2} \] where ${\mathcal{L}}\left({\cdot,\cdot}\right)$ is a loss function (squared error, logistic, hinge loss, and so on). For example, the squared error loss is given by ${\mathcal{L}}{(}{y},{z}{)} = {(}{y}{-}{z)}^{2}{.}$ A DNN is trained by solving this optimization problem, usually via some form of gradient descent. In typical scenarios, this optimization problem is ill-posed, so the problem is regularized explicitly through the addition of a regularization term and/or implicitly through constraints on the network architecture or the behavior of gradient descent procedures [34]. A surprising phenomenon of gradient descent training algorithms for overparameterized NNs is that, among the many solutions that overfit the data, these algorithms select one that often generalizes well on new data, even without explicit regularization. This has led researchers to try to understand the role of overparameterization and the effect of random initialization of NN parameters on the implicit bias of gradient-based training algorithms [8].
On the other hand, explicit regularization corresponds to solving an optimization problem of the form \[\mathop{\min}\limits_{\mathbf{\theta}\,\in\,\Theta}\mathop{\sum}\limits_{{n} = {1}}^{N}{{\mathcal{L}}\left({{y}_{n},{f}_{\mathbf{\theta}}\left({{\mathbf{x}}_{n}}\right)}\right) + {\lambda}{C}\left({\mathbf{\theta}}\right)} \tag{3} \] where ${C}\left({\mathbf{\theta}}\right)\geq{0}$ for all $\mathbf{\theta}\in\Theta$ is a regularizer, which measures the “size” (or “capacity”) of the DNN parameterized by $\mathbf{\theta}\in\Theta,$ and ${\lambda}{>}{0}$ is an adjustable hyperparameter, which controls the tradeoff between the data-fitting term and the regularizer. DNNs are often trained using gradient descent algorithms with weight decay, which corresponds to solving the optimization problem \[\mathop{\min}\limits_{\mathbf{\theta}\,\in\,\Theta}\mathop{\sum}\limits_{{n} = {1}}^{N}{{\mathcal{L}}\left({{y}_{n},{f}_{\mathbf{\theta}}\left({{\mathbf{x}}_{n}}\right)}\right) + {\lambda}{C}_{\text{wd}}\left({\mathbf{\theta}}\right)} \tag{4} \] where the weight decay regularizer ${C}_{\text{wd}}\left({\mathbf{\theta}}\right)$ is the squared Euclidean norm of all the network weights. In some formulations, the weight decay objective regularizes all the parameters, including the biases, while in others it regularizes only the weights (so that the biases are unregularized). This article focuses on the variant of weight decay with unregularized biases.
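As a concrete illustration of this variant, the following PyTorch sketch (our own; the architecture, data, learning rate, and λ are placeholder choices) places the weights and biases in separate parameter groups so that only the weights are decayed.

```python
import torch
import torch.nn as nn

# Toy data and a small fully connected ReLU network (placeholder sizes).
X, y = torch.randn(64, 10), torch.randn(64, 1)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# Weight decay on the weights only: the biases go in a parameter group with
# weight_decay=0 so that they remain unregularized, as in (4).
weights = [p for n, p in model.named_parameters() if n.endswith("weight")]
biases = [p for n, p in model.named_parameters() if n.endswith("bias")]
opt = torch.optim.SGD([{"params": weights, "weight_decay": 1e-3},
                       {"params": biases, "weight_decay": 0.0}], lr=1e-2)

loss_fn = nn.MSELoss()
for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(X), y)   # data-fitting term in (4)
    loss.backward()
    opt.step()                    # gradient step; weight decay acts on the weights only
```

The optimizer's weight_decay option adds the penalty gradient for the weights only, so the biases are left unregularized, matching the variant of weight decay considered in this article (up to the usual factor-of-2 convention in how λ is scaled).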
Neural Balance Theorem
Let ${f}_{\mathbf{\theta}}$ be a deep neural network (DNN) of any architecture parameterized by $\mathbf{\theta}\in\Theta,$ which solves the DNN training problem with weight decay in (4). Then, the weights satisfy the following balance constraint: if $\mathbf{w}$ and $\mathbf{v}$ denote the input and output weights of any homogeneous unit in the DNN, respectively, then ${\Vert}{\mathbf{w}}{\Vert}_{2} = {\Vert}{\mathbf{v}}{\Vert}_{2}{.}$
Weight decay is a common form of regularization for DNNs. On the surface, it appears to simply be the familiar Tikhonov or “ridge” regularization. In standard linear models, it is well known that this sort of regularization tends to reduce the size of the weights but does not produce sparse weights. However, when this regularization is used in conjunction with NNs, the results are strikingly different. Regularizing the sum of squared weights turns out to be equivalent to a type of ${\ell}^{1}$-norm regularization on the network weights, leading to sparse solutions in which the weights of many neurons are zero [47]. This is due to the key property that the most commonly used activation functions in DNNs are homogeneous. A function ${\sigma}{(}{t}{)}$ is said to be homogeneous (of degree 1) if ${\sigma}{(}{\gamma}{t}{)} = {\gamma}{\sigma}{(}{t}{)}$ for any ${\gamma}{>}{0}{.}$ The most common NN activation function, the ReLU, is homogeneous, as are the leaky ReLU, the linear activation, and pooling/downsampling units. This homogeneity leads to the following theorem, referred to as the neural balance theorem (NBT) [47, Th. 1] (see “Neural Balance Theorem”).
The proof of this theorem boils down to the simple observation that for any homogeneous unit with input weights ${\mathbf{w}}$ and output weights ${\mathbf{v}},$ we can scale the input weights by ${\gamma}{>}{0}$ and the output weights by ${1}{/}{\gamma}$ without changing the function mapping. For example, consider the single neuron ${\mathbf{z}}\mapsto{\mathbf{v}}{\sigma}{(}{\mathbf{w}}^{\top}{\mathbf{z}}{-}{b}{)}$ with homogeneous activation function ${\sigma},$ as depicted in Figure 2(b). In the case of a DNN, as in (1), ${\mathbf{w}}$ corresponds to a row of a weight matrix in the affine mapping of a layer, ${\mathbf{v}}$ corresponds to a column of the weight matrix in the subsequent layer, and b corresponds to an entry in the bias vector. It is immediate that ${(}{\mathbf{v}}{/}{\gamma}{)}{\sigma}{((}{\gamma}{\mathbf{w}}{)}^{\top}{\mathbf{z}}{-}{\gamma}{b}{)} = {\mathbf{v}}{\sigma}{(}{\mathbf{w}}^{\top}{\mathbf{z}}{-}{b}{).}$ Because the biases are unregularized, the theorem follows by noting that ${\min}_{{\gamma}{>}{0}}{\Vert}{{\gamma}{\mathbf{w}}}{\Vert}_{2}^{2} + {\Vert}{{\mathbf{v}}{/}{\gamma}}{\Vert}_{2}^{2}$ is attained at ${\gamma} = \sqrt{{\Vert}{\mathbf{v}}{\Vert}_{2}/{\Vert}{\mathbf{w}}{\Vert}_{2}},$ which implies that the minimum squared Euclidean norm solution must satisfy the property that the input and output weights, ${\mathbf{w}}$ and ${\mathbf{v}},$ respectively, are balanced.
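The rescaling argument behind the NBT is easy to check numerically. The following sketch (our own; the weights are arbitrary) verifies that rescaling the input weights and bias of a ReLU neuron by γ and its output weights by 1/γ leaves the function unchanged, and that the squared-norm cost along this rescaling orbit is minimized at γ = √(‖v‖₂/‖w‖₂), where the input and output weights are balanced and the cost equals 2‖w‖₂‖v‖₂.

```python
import numpy as np

rng = np.random.default_rng(1)
w, b, v = rng.standard_normal(5), 0.3, rng.standard_normal(3)
relu = lambda t: np.maximum(t, 0.0)
neuron = lambda z, w, b, v: v * relu(w @ z - b)   # z -> v * sigma(w^T z - b)

z, gamma = rng.standard_normal(5), 2.7
# Homogeneity: scaling (w, b) by gamma and v by 1/gamma leaves the map unchanged.
assert np.allclose(neuron(z, w, b, v), neuron(z, gamma * w, gamma * b, v / gamma))

# The weight decay cost along the rescaling orbit, g -> ||g w||^2 + ||v / g||^2,
# is minimized at g* = sqrt(||v||_2 / ||w||_2), where it equals 2 ||w||_2 ||v||_2.
cost = lambda g: np.sum((g * w) ** 2) + np.sum((v / g) ** 2)
g_star = np.sqrt(np.linalg.norm(v) / np.linalg.norm(w))
assert np.isclose(cost(g_star), 2 * np.linalg.norm(w) * np.linalg.norm(v))
assert all(cost(g_star) <= cost(g) for g in np.linspace(0.1, 10, 1000))
```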
The balancing effect of the NBT has striking consequences for solutions to the weight decay objective, particularly a sparsity-promoting effect akin to least absolute shrinkage and selection operator (LASSO) regularization [41]. As an illustrative example, consider a shallow ${(}{L} = {2}{)}$ feedforward NN mapping ${\mathbb{R}}^{d}\rightarrow{\mathbb{R}}^{D}$ with a homogeneous activation function (e.g., the ReLU) and K neurons. In this case, the NN is given by \[{f}_{\mathbf{\theta}}{(}{\mathbf{x}}{)} = \mathop{\sum}\limits_{{k} = {1}}^{K}{{\mathbf{v}}_{k}{\sigma}\left({{\mathbf{w}}_{k}^{\top}{\mathbf{x}}{-}{b}_{k}}\right)}{.} \tag{5} \]
Here, the weight decay regularizer is of the form $\left({1/2}\right){\Sigma}_{{k} = {1}}^{K}\left({{\Vert}{{\mathbf{v}}_{k}}{\Vert}_{2}^{2} + {\Vert}{{\mathbf{w}}_{k}}{\Vert}_{2}^{2}}\right),$ where ${\mathbf{w}}_{k}$ and ${\mathbf{v}}_{k}$ are the input and output weights of the kth neuron, respectively. By the NBT, this is equivalent to using the regularizer ${\Sigma}_{{k} = {1}}^{K}{\Vert}{{\mathbf{v}}_{k}}{\Vert}_{2}{\Vert}{{\mathbf{w}}_{k}}{\Vert}_{2}{.}$ Due to the homogeneity of the activation function, we can assume, without loss of generality, that ${\Vert}{{\mathbf{w}}_{k}}{\Vert}_{2} = {1}$ by “absorbing” the magnitude of the input weight ${\mathbf{w}}_{k}$ into the output weight ${\mathbf{v}}_{k}.$ Therefore, by constraining the input weights to be unit norm, the training problem can be reformulated using the regularizer ${\Sigma}_{{k} = {1}}^{K}{\Vert}{{\mathbf{v}}_{k}}{\Vert}_{2}$ [47]. Remarkably, this is exactly the well-known group LASSO regularizer [48], which favors solutions with few active neuron connections (i.e., solutions typically have many ${\mathbf{v}}_{k}$ exactly equal to zero), although the overall training objective remains nonconvex. We also note that there is a line of work that has reformulated the nonconvex training problem as a convex group LASSO problem [32].
More generally, consider the feedforward DNN architecture in (1) with a homogeneous activation function, and consider training the DNN with weight decay only on the network weights. An application of the NBT shows that the weight decay problem is equivalent to the regularized DNN training problem with the regularizer \[{C}\left({\mathbf{\theta}}\right) = \frac{1}{2}\mathop{\sum}\limits_{k=1}^{{K}^{(1)}}{\Vert}{\mathbf{w}}_{k}^{(1)}{\Vert}_{2}^{2} + \frac{1}{2}\mathop{\sum}\limits_{k=1}^{{K}^{(L)}}{\Vert}{\mathbf{v}}_{k}^{(L)}{\Vert}_{2}^{2} + \mathop{\sum}\limits_{{\ell} = {1}}^{L}\mathop{\sum}\limits_{{k} = {1}}^{{K}^{{(}\ell{)}}}{\Vert}{\mathbf{w}}_{k}^{{(}{\ell}{)}}{\Vert}_{2}{\Vert}{\mathbf{v}}_{k}^{{(}{\ell}{)}}{\Vert}_{2} \tag{6} \] where ${K}^{{(}{\ell}{)}}$ denotes the number of neurons in layer ${\ell},\,{\mathbf{w}}_{k}^{{(}{\ell}{)}}$ denotes the input weights of the kth neuron in layer ${\ell},$ and ${\mathbf{v}}_{k}^{{(}{\ell}{)}}$ denotes the output weights of the kth neuron in layer ${\ell}$ (see [47, eq. (2)]). The solutions based on this regularizer will also be sparse because the two norms that appear in the last term in (6) are not squared, akin to the group LASSO regularizer. In particular, this regularizer can be viewed as a mixed ${\ell}^{2,1}$-norm on the weight matrices. Moreover, increasing the regularization parameter ${\lambda}$ will increase the number of weights that are zero in the solution. Therefore, training the DNN with weight decay favors sparse solutions, where sparsity is quantified via the number of active neuron connections. An early version of this result appeared in 1998 [16], although it did not become well known until it was rediscovered in 2015 [25].
The sparsity-promoting effect of weight decay arising from the NBT in network architectures with homogeneous activation functions has several consequences for the properties of trained NNs. In this section, we focus on the popular ReLU activation function ${\rho}{(}{t}{)} = \max\left\{{{0},{t}}\right\}{.}$ Weight decay not only promotes sparsity in the sense of the number of active neuron connections but also a kind of transform-domain sparsity. This transform-domain sparsity suggests the inclusion of skip connections and low-rank weight matrices in network architectures.
In the univariate case, a shallow feedforward ReLU NN with K neurons is realized by the mapping \[{f}_{\mathbf{\theta}}{(}{x}{)} = \mathop{\sum}\limits_{{k} = {1}}^{K}{{v}_{k}}{\rho}{(}{w}_{k}{x}{-}{b}_{k}{).} \tag{7} \]
Training this NN with weight decay corresponds to solving the optimization problem \[\mathop{\min}\limits_{\mathbf{\theta}\,\in\,\Theta}\mathop{\sum}\limits_{{n} = {1}}^{N}{{\mathcal{L}}\left({{y}_{n},{f}_{\mathbf{\theta}}{(}{x}_{n}{)}}\right)} + \frac{\lambda}{2}\mathop{\sum}\limits_{{k} = {1}}^{K}\left({{\left|{{v}_{k}}\right|}^{2} + {\left|{{w}_{k}}\right|}^{2}}\right){.} \tag{8} \]
From the “The Secret Sparsity of Weight Decay” section, we saw that the NBT implies that this problem is equivalent to \[\mathop{\min}\limits_{\mathbf{\theta}\,\in\,\Theta}\mathop{\sum}\limits_{{n} = {1}}^{N}{{\mathcal{L}}\left({{y}_{n},{f}_{\mathbf{\theta}}{(}{x}_{n}{)}}\right) + {\lambda}\mathop{\sum}\limits_{{k} = {1}}^{K}{\left|{{v}_{k}}\right|}\left|{{w}_{k}}\right|}. \tag{9} \]
As illustrated in “Rectified Linear Unit Sparsity in the Second Derivative Domain,” we see that (9) is actually regularizing the integral of the second derivative of the NN, which can be viewed as a measure of sparsity in the second derivative domain. The integral in (S6) must be understood in the distributional sense because the Dirac impulse is not a function, but a generalized function or distribution. To make this precise, let ${g}_{\varepsilon}{(}{x}{)} = {e}^{{-}{x}^{2}{/}{2}\varepsilon}{/}\sqrt{{2}{\pi}\varepsilon}$ denote the Gaussian density with variance $\varepsilon\,{>}\,{0}{.}$ As is well known in signal processing, ${g}_{\varepsilon}$ converges to the Dirac impulse as $\varepsilon\rightarrow{0}{.}$ Using this idea, given a distribution f, define the norm \[{\Vert}{f}{\Vert}_{\mathcal{M}}\colon = \mathop{\sup}\limits_{\varepsilon{>}{0}}{\Vert}{{f}\ast{g}_{\varepsilon}}{\Vert}_{{L}^{1}} = \mathop{\sup}\limits_{\varepsilon{>}{0}}\mathop{\int}\nolimits_{{-}\infty}\nolimits^{\infty}{\left|{\mathop{\int}\nolimits_{{-}\infty}\nolimits^{\infty}{f}{(}{x}{)}{g}_{\varepsilon}\left({{y}{-}{x}}\right){\text{d}}{x}}\right|}{\text{d}}{y}{.} \tag{10} \]
This definition provides an explicit construction, via the convolution with a Gaussian, of a sequence of smooth functions that converge to f, where the supremum acts as the limit. For example, if ${f}{(}{x}{)} = {g}{(}{x}{)} + {\Sigma}_{{k} = {1}}^{K}{v}_{k}{\delta}{(}{x}{-}{t}_{k}{)},$ where g is an absolutely integrable function, then ${\Vert}{f}{\Vert}_{\mathcal{M}} = {\Vert}{g}{\Vert}_{{L}^{1}} + {\Sigma}_{{k} = {1}}^{K}\left|{{v}_{k}}\right| = {\Vert}{g}{\Vert}_{{L}^{1}} + {\Vert}{\mathbf{v}}{\Vert}_{1},$ with ${\Vert}{\mathbf{v}}{\Vert}_{1} = {\Sigma}_{{k} = {1}}^{K}\left|{{v}_{k}}\right|{.}$ It is in this sense that (S6) holds, i.e., ${\Vert}{{D}^{2}{f}_{\mathbf{\theta}}}{\Vert}_{\mathcal{M}} = {\Sigma}_{{k} = {1}}^{K}\left|{{v}_{k}}\right|\left|{{w}_{k}}\right|{.}$ In particular, the ${\mathcal{M}}$-norm is precisely the continuous-domain analog of the sparsity-promoting discrete ${\ell}^{1}$-norm. Therefore, we see that training an NN with weight decay, as in (8), prefers solutions with sparse second derivatives.
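The identity (S6) can also be checked numerically: differentiate a randomly generated shallow ReLU network twice with finite differences and integrate the absolute value. The sketch below (our own; the network parameters are arbitrary) compares this numerical second-order total variation with Σ_k |v_k||w_k|.

```python
import numpy as np

rng = np.random.default_rng(2)
K = 6
v, w, b = rng.standard_normal(K), rng.standard_normal(K), rng.standard_normal(K)

def f(x):
    # Univariate shallow ReLU network, as in (7).
    return np.sum(v * np.maximum(np.outer(x, w) - b, 0.0), axis=1)

# Integrate |D^2 f| with finite differences on a grid covering all knots b_k / w_k.
knots = b / w
x = np.linspace(knots.min() - 1.0, knots.max() + 1.0, 400_001)
h = x[1] - x[0]
d2f = np.diff(f(x), 2) / h**2                 # second difference approximates D^2 f
tv2_numeric = np.sum(np.abs(d2f)) * h         # ~ || D^2 f_theta ||_M
tv2_exact = np.sum(np.abs(v) * np.abs(w))     # = sum_k |v_k| |w_k|, as in (S6)
print(tv2_numeric, tv2_exact)                 # the two values agree up to discretization
```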
It turns out that the connection between sparsity in the second derivative domain and NNs is even tighter. Let ${\text{BV}}^{2}\left({\mathbb{R}}\right)$ denote the space of functions mapping ${\mathbb{R}}\rightarrow{\mathbb{R}}$ such that ${\Vert}{{D}^{2}{f}}{\Vert}_{\mathcal{M}}$ is finite. This is the space of functions of second-order bounded variation and the quantity ${\Vert}{{D}^{2}{f}}{\Vert}_{\mathcal{M}}$ is the second-order total variation (TV) of f. Note that the classical notion of TV, often used in signal denoising problems [35], is ${\text{TV}}\left({f}\right)\colon = {\Vert}{{D}{f}}{\Vert}_{\mathcal{M}},$ and so the second-order TV of f can be viewed as the TV of the derivative of ${f}{:}{\Vert}{{D}^{2}{f}}{\Vert}_{\mathcal{M}} = {\text{TV}}\left({{D}{f}}\right){.}$
It is well known from spline theory [14], [23], [44] that functions that fit data and have minimal second-order TV are continuous piecewise linear (CPwL) functions. As the ReLU is a CPwL function, ReLU NNs are CPwL functions [3]. In fact, under mild assumptions on the loss function, the solution set to the optimization problem \[\mathop{\min}\limits_{{f}\in{\text{BV}}^{2}\left({\mathbb{R}}\right)}\mathop{\sum}\limits_{{n} = {1}}^{N}{\mathcal{L}}\left({{y}_{n},{f}{(}{x}_{n}{)}}\right) + {\lambda}{\Vert}{{D}^{2}{f}}{\Vert}_{\mathcal{M}} \tag{11} \] is completely characterized by NNs of the form \[{f}_{\mathbf{\theta}}{(}{x}{)} = \mathop{\sum}\limits_{{k} = {1}}^{K}{{v}_{k}{\rho}{(}{w}_{k}{x}{-}{b}_{k}{)} + {c}_{1}{x} + {c}_{0}} \tag{12} \] where the number of neurons is strictly less than the number of data ${(}{K}{<}{N}{)}$ in the sense that the solution set to (11) is a closed convex set whose extreme points take the form of (12) with ${K}{<}{N}$ [9], [28], [37]. In NN parlance, the ${c}_{1}{x} + {c}_{0}$ term is a skip connection [17]. This term is an affine function that naturally arises because the second-order TV of an affine function is zero, and so the regularizer places no penalty for the inclusion of this term.
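A function of the form (12) with K < N neurons is easy to exhibit for any dataset: the “connect-the-dots” interpolant of the data is CPwL and can be written as a ReLU network with K = N − 2 neurons plus a skip connection. The sketch below (our own construction based on standard spline arguments; it produces a feasible function of the form (12), not necessarily a minimizer of (11) for λ > 0) makes this explicit.

```python
import numpy as np

def cpwl_interpolant(x_data, y_data):
    """Connect-the-dots interpolant of N points, written in the form (12):
    f(x) = sum_k v_k * relu(x - b_k) + c1 * x + c0, with K = N - 2 neurons."""
    idx = np.argsort(x_data)
    x, y = x_data[idx], y_data[idx]
    s = np.diff(y) / np.diff(x)            # slopes of the N - 1 linear segments
    v = np.diff(s)                         # slope changes at the interior knots
    b = x[1:-1]                            # knot locations (biases), K = N - 2 of them
    c1, c0 = s[0], y[0] - s[0] * x[0]      # skip connection matches the first segment
    return lambda t: np.maximum(np.subtract.outer(t, b), 0.0) @ v + c1 * t + c0

# The resulting network interpolates the data exactly with K = N - 2 < N neurons.
rng = np.random.default_rng(3)
x_data, y_data = rng.uniform(-1, 1, 8), rng.standard_normal(8)
f = cpwl_interpolant(x_data, y_data)
assert np.allclose(f(x_data), y_data)
```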
Rectified Linear Unit Sparsity in the Second Derivative Domain
Given a rectified linear unit neuron ${r}{(}{x}{)} = {\rho}{(}{wx}{-}{b}{),}$ its first derivative, ${\text{D}}\,{r}{(}{x}{)},$ is \begin{align*}{\text{D}}\,{r}{(}{x}{)} & = {\text{D}}\,{\rho}{(}{wx}{-}{b}{)} \\ & = {wu}{(}{wx}{-}{b}{)} \tag{S1} \end{align*} where $u$ is the unit step function [Figure 1(b)]. Therefore, its second derivative, ${\text{D}}^{2}{r}{(}{x}{)},$ is \begin{align*}{\text{D}}^{2}{r}{(}{x}{)} & = {\text{D}}\,{wu}{(}{wx}{-}{b}{)} \\ & = {w}^{2}{\delta}{(}{wx}{-}{b}{).} \tag{S2} \end{align*}
By the scaling property of the Dirac impulse [S1, Problem 1.38(a)] \[{\delta}{(}{\gamma}{x}{)} = \frac{1}{\left|{\gamma}\right|}{\delta}{(}{x}{)} \tag{S3} \] we have \begin{align*}{\text{D}}^{2}{r}{(}{x}{)} & = \frac{{w}^{2}}{\left|{w}\right|}{\delta}\left({{x}{-}\frac{b}{w}}\right) \\ & = \left|{w}\right|{\delta}\left({{x}{-}\frac{b}{w}}\right){.} \tag{S4} \end{align*}
The second derivative of the neural network (7) is then \[{\text{D}}^{2}{f}_{\mathbf{\theta}}{(}{x}{)} = \mathop{\sum}\limits_{{k} = {1}}^{K}{{v}_{k}}\left|{{w}_{k}}\right|{\delta}\left({{x}{-}\frac{{b}_{k}}{{w}_{k}}}\right){.} \tag{S5} \]
Therefore, \[\mathop{\int}\nolimits_{{-}\infty}\nolimits^{\infty}{\left|{{\text{D}}^{2}{f}_{\mathbf{\theta}}(x)}\right|}{\text{d}}{x} = \mathop{\sum}\limits_{{k} = {1}}^{K}{\left|{{v}_{k}}\right|}\left|{{w}_{k}}\right|{.} \tag{S6} \]
Reference
[S1] A. V. Oppenheim, A. S. Willsky, and S. H. Nawab, Signals and Systems (Prentice-Hall Signal Processing Series). Englewood Cliffs, NJ, USA: Prentice-Hall, 1997.
The intuition behind this result is due to the fact that the second derivative of a CPwL function is an impulse train and therefore exhibits extreme sparsity in the second derivative domain, as illustrated in Figure S1. Therefore, the optimization problem (11) will favor sparse CPwL functions that always admit a representation, as in (12). In signal processing parlance, “signals” that are sparse in some transform domain are said to have a finite rate of innovation [45]. Here, the involved transform is the second derivative operator, and the innovation is the impulse train that arises after applying the second derivative operator to a CPwL function.
Figure S1. An illustration of the sparsity in the second derivative domain of a univariate, shallow feedforward neural network with six neurons.
Consider the optimization over the NN parameter space ${\Theta}_{K}$ of networks, as in (12), with fixed width ${K}\geq{N}{.}$ From the derivation in “Rectified Linear Unit Sparsity in the Second Derivative Domain,” we have $\left\{{{f}_{\mathbf{\theta}}{:}\mathbf{\theta}\in{\Theta}_{K}}\right\}\subset{\text{BV}}^{2}\left({\mathbb{R}}\right){.}$ Furthermore, from the earlier discussion, we know that there always exists an optimal solution to (11) that takes the form of (12) with ${K}{<}{N},$ i.e., there always exists a solution to (11) in $\left\{{{f}_{\mathbf{\theta}}{:}\mathbf{\theta}\in{\Theta}_{K}}\right\}{.}$ Therefore, from the equivalence of (8) and (9), we see that training a sufficiently wide $\left({{K}\geq{N}}\right)$ NN with a skip connection (12) and weight decay (8) results in a solution to the optimization problem (11) over the function space ${\text{BV}}^{2}\left({\mathbb{R}}\right){.}$ Although this result may seem obvious in hindsight, it is remarkable because it says that the functions learned by NNs trained with weight decay (to a global minimizer) are exactly optimal functions in ${\text{BV}}^{2}\left({\mathbb{R}}\right){.}$ Moreover, this result sheds light on the role of overparameterization, as this correspondence hinges on the network being critically parameterized or overparameterized (because ${K}\geq{N}{).}$
In the multivariate case, a shallow feedforward NN has the form \[{f}_{\mathbf{\theta}}{(}{\mathbf{x}}{)} = \mathop{\sum}\limits_{{k} = {1}}^{K}{{v}_{k}{\rho}\left({{\mathbf{w}}_{k}^{\top}{\mathbf{x}}\,{-}\,{b}_{k}}\right)}{.} \tag{13} \]
The key property that connects the univariate case and ${\text{BV}}^{2}\left({\mathbb{R}}\right)$ is that ReLU neurons are sparsified by the second derivative operator, as in (S4). A similar analysis can be carried out in the multivariate case by finding an operator that is the sparsifying transform of the multivariate ReLU neuron ${r}{(}{\mathbf{x}}{)} = {\rho}{(}{\mathbf{w}}^{\top}{\mathbf{x}}{-}{b}{).}$ The sparsifying transform was proposed in 2020 in the seminal work of Ongie et al. [26] and hinges on the Radon transform that arises in tomographic imaging. Connections between the Radon transform and neural networks have been known since at least the 1990s, gaining popularity due to the proposal of ridgelets [6], and early versions of the sparsifying transform for neurons were studied as early as 1997 [19]. A summary of the Radon transform appears in “The Radon Transform and Fourier Slice Theorem.” The sparsifying transform for multivariate ReLU neurons is based on a result regarding the (filtered) Radon transform, which appears in “Filtered Radon Transform of a Neuron With Unit-Norm Input Weights.”
The Radon Transform and Fourier Slice Theorem
The Radon transform, first studied by Radon in 1917 [S2], of a function mapping ${\mathbb{R}}^{d}\rightarrow{\mathbb{R}}$ is specified by the integral transform \[{\mathscr{R}}{\{}{f}{\}(}\mathbf{\alpha},{t}{)} = \mathop{\int}\nolimits_{{\mathbb{R}}^{d}}{f}{(}{\mathbf{x}}{)}{\delta}{(}{\mathbf{\alpha}}^{\top}{\mathbf{x}}{-}{t}{)}{\text{d}}{x} \tag{S7} \] where ${\delta}$ is the univariate Dirac impulse, $\mathbf{\alpha}\in{\mathbb{S}}^{{d}{-}{1}} = {\{}{\boldsymbol{u}}\in{\mathbb{R}}^{d}{:}{\Vert}{\boldsymbol{u}}{\Vert}_{2} = {1}{\}}$ is a unit vector, and ${t}\in{\mathbb{R}}$ is a scalar. The Radon transform of ${f}$ at ${(}\mathbf{\alpha},{t}{)}$ is the integral of ${f}$ along the hyperplane ${\{}{\mathbf{x}}\in{\mathbb{R}}^{d}\,{:}\,{\mathbf{\alpha}}^{\top}{\mathbf{x}} = {t}{\}.}$
The Radon transform is tightly linked with the Fourier transform, specified by \[\hat{f}{(}\mathbf{\omega}{)} = \mathop{\int}\nolimits_{{\mathbb{R}}^{d}}{{f}\left({\mathbf{x}}\right){e}^{{-}{j}{\mathbf{\omega}}^{\top}{\mathbf{x}}}{\text{d}}{x}} \tag{S8} \] where ${j}^{2} = {-}{1}{.}$ Indeed, \begin{align*}&\widehat{\mathscr{R}\{f\}}{(}\mathbf{\alpha},{\omega}{)} \\ & = \mathop{\int}\nolimits_{\mathbb{R}}{\left({\mathop{\int}\nolimits_{{\mathbb{R}}^{d}}{f}{(}{\mathbf{x}}{)}{\delta}{(}{\mathbf{\alpha}}^{\top}{\mathbf{x}}{-}{t}{)}{\text{d}}{x}}\right)}{e}^{{-}{j}{\omega}{t}}{\text{d}}{t} \\ & = \mathop{\int}\nolimits_{{\mathbb{R}}^{d}}{f}{(}{\mathbf{x}}{)}\left({\mathop{\int}\nolimits_{\mathbb{R}}{\delta}{(}{\mathbf{\alpha}}^{\top}{\mathbf{x}}{-}{t}{)}{e}^{{-}{j}{\omega}{t}}{\text{d}}{t}}\right){\text{d}}{x} \\ & = \mathop{\int}\nolimits_{{\mathbb{R}}^{d}}{f}{(}{\mathbf{x}}{)}{e}^{{-}{j}{(}{\omega}\mathbf{\alpha}{)}^{\top}{\mathbf{x}}}{\text{d}}{x} = \hat{f}{(}{\omega}\mathbf{\alpha}{).} \tag{S9} \end{align*}
This result is known as the Fourier slice theorem. It states that the univariate Fourier transform in the Radon domain corresponds to a slice of the Fourier transform in the spatial domain.
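The Fourier slice theorem is easy to verify numerically in two dimensions for the direction α = e₁, where the Radon projection reduces to summing the image over its second coordinate. The following sketch (our own; the test image and grid are arbitrary) compares the one-dimensional discrete Fourier transform of the projection with the corresponding slice of the two-dimensional transform.

```python
import numpy as np

# A smooth test image f on an n x n grid over [-1, 1)^2.
n = 256
xs = np.linspace(-1.0, 1.0, n, endpoint=False)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs, indexing="ij")     # axis 0 is x1, axis 1 is x2
f = np.exp(-8 * (X**2 + 2 * Y**2)) * (1 + 0.5 * np.cos(4 * np.pi * X))

# Radon projection for alpha = e1: integrate f over the second coordinate.
proj = f.sum(axis=1) * dx                     # R{f}(e1, t) sampled at t = xs

# Fourier slice theorem (S9): the 1D DFT of the projection equals the
# omega_2 = 0 slice of the 2D DFT of f (with matching grid scaling).
proj_hat = np.fft.fft(proj) * dx
slice_hat = np.fft.fft2(f)[:, 0] * dx * dx
assert np.allclose(proj_hat, slice_hat)
```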
[S2] J. Radon, “Über die bestimmung von funktionen durch ihre integralwerte längs gewisser mannigfaltigkeiten,” Berichte Verhandlungen Sächsische Akademie Wissenschaften, vol. 69, pp. 262–277, 1917.
The filter $\text{K}$ in (S13) is exactly the backprojection filter that arises in the filtered backprojection algorithm in tomographic image reconstruction and acts as a high-pass filter (or ramp filter) to correct the attenuation of high frequencies from the Radon transform. The intuition behind this is that the Radon transform integrates a function along hyperplanes. In the univariate case, the magnitude of the frequency response of an integrator behaves as ${1}{/}\mid{\omega}\mid$ and therefore attenuates high frequencies. The magnitude of the frequency response of integration along a hyperplane therefore behaves as ${1}{/}\mid{\omega}{\mid}^{{d}{-}{1}}$ as hyperplanes are of dimension ${(}{d}{-}{1}{)}{.}$ Note that the even projector that appears in (S17) is due to the fact that the Radon-domain variables ${(}\mathbf{\alpha},{t}{)}$ and ${(}{-}\mathbf{\alpha},{-}{t}{)}$ parameterize the same hyperplane.
From the derivation in “Filtered Radon Transform of a Neuron With Unit-Norm Input Weights,” we immediately see that the sparsifying transform of the multivariate ReLU neuron ${r}{(}{\mathbf{x}}{)} = {\rho}{(}{\mathbf{w}}^{\top}{\mathbf{x}}{-}{b}{)}$ with ${(}{\mathbf{w}},{b}{)}\in{\mathbb{S}}^{{d}{-}{1}}\times{\mathbb{R}}$ is the ${\text{D}}_{t}^{2}{\text{K}}\,{\mathscr{R}}$ operator, where ${\text{D}}_{t}^{2} = {\partial}^{2}{/}\partial{t}^{2}$ denotes the second-order partial derivative with respect to t. We have \[{\text{D}}_{t}^{2}{\text{K}}\,{\mathscr{R}}{\{}{r}{\}(}\mathbf{\alpha},{t}{)} = {\delta}_{\mathscr{R}}{((}\mathbf{\alpha},{t}{)}{-}{(}{\mathbf{w}},{b}{))} \tag{14} \] where ${\delta}_{\mathscr{R}}{(}{\mathbf{z}}{-}{\mathbf{z}}_{0}{)}{:} = {\text{P}}_{\text{even}}{\{}{\delta}{(}{\mathbf{z}}{-}{\mathbf{z}}_{0}{)\}} = {(}{\delta}{(}{\mathbf{z}}{-}{\mathbf{z}}_{0}{)} + {\delta}{(}{\mathbf{z}} + {\mathbf{z}}_{0}{))} / {2}$ is the even symmetrization of the Dirac impulse that arises due to the even symmetry of the Radon domain. From the homogeneity of the ReLU activation, applying this sparsifying transform to the (unconstrained) neuron ${r}{(}{\mathbf{x}}{)} = {\rho}{(}{\mathbf{w}}^{\top}{\mathbf{x}}{-}{b}{)}$ with ${(}{\mathbf{w}},{b}{)}\in{\mathbb{R}}^{d}\times{\mathbb{R}}$ yields \[{\text{D}}_{t}^{2}{\text{K}}\,{\mathscr{R}}{\{}{r}{\}(}\mathbf{\alpha},{t}{)} = {\Vert}{\mathbf{w}}{\Vert}_{2}{\delta}_{\mathscr{R}}{((}\mathbf{\alpha},{t}{)}{-}{(}\tilde{\mathbf{w}},\tilde{b}{))} \tag{15} \] where $\tilde{\mathbf{w}} = {\mathbf{w}}{/}{\Vert}{\mathbf{w}}{\Vert}_{2}$ and $\tilde{b} = {b}{/}{\Vert}{\mathbf{w}}{\Vert}_{2}{.}$ This is analogous to how $\text{D}^{2}$ is the sparsifying transform for univariate neurons, as in (S4). The sparsifying operator is simply the second derivative in the filtered Radon domain. The key idea is that the (filtered) Radon transform allows us to extract the (univariate) activation function from the multivariate neuron and apply the univariate sparsifying transform in the t variable. Figure 3 is a cartoon diagram that depicts the sparsifying transform of a ReLU neuron.
Figure 3. A cartoon diagram illustrating the sparsifying transform of the ReLU neuron ${r}{(}{\mathbf{x}}{)} = {\rho}{(}{\mathbf{w}}^{\top}{\mathbf{x}}{-}{b}{)}$ with ${(}{\mathbf{w}},{b}{)}\in{\mathbb{S}}^{{d}{-}{1}}\times{\mathbb{R}}{.}$ The heatmap is a top-down view of the ReLU neuron depicted in (a). (a) The surface plot of $r(\mathbf{x}).$ (b) The filtered Radon transform when $\mathbf{\alpha} = {\mathbf{w}}{.}$ (c) The filtered Radon transform when $\mathbf{\alpha}\ne{\mathbf{w}}{.}$ (d) Sparsifying transform ${\text{D}}_{t}^{2}{\text{K}}\,{\mathscr{R}}{\{}{r}{\}.}$
Filtered Radon Transform of a Neuron With Unit-Norm Input Weights
First, consider the neuron ${r}{(}{\mathbf{x}}{)} = {\sigma}{(}{\mathbf{w}}^{\top}{\mathbf{x}}{)}$ with ${\mathbf{w}} = {\mathbf{e}}_{1} = {(}{1},{0},\ldots,{0}{)}$ (the first canonical unit vector). In this case, ${r}{(}{\mathbf{x}}{)} = {\sigma}{(}{x}_{1}{).}$ By noticing that this function can be written as a tensor product, the Fourier transform is given by the following product: \[\widehat{r}{(}\mathbf{\omega}{)} = \hat{\sigma}{(}{\omega}_{1}{)}\mathop{\prod}\limits_{{k} = {2}}^{d}{2}{\pi}{\delta}{(}{\omega}_{k}{).} \tag{S10} \]
By the Fourier slice theorem \[\widehat{\mathscr{R}\{r\}}{(}\mathbf{\alpha},{\omega}{)} = \hat{\sigma}{(}{{\omega}{\alpha}}_{1}{)}\mathop{\prod}\limits_{{k} = {2}}^{d}{2}{\pi}{\delta}{(}{{\omega}{\alpha}}_{k}{).} \tag{S11} \]
By the scaling property of the Dirac impulse (S3), the quantity in (S11) equals \[ = \hat{\sigma}{(}{{\omega}{\alpha}}_{1}{)}\frac{{(}{2}{\pi}{)}^{{d}{-}{1}}}{{\left|{\omega}\right|}^{{d}{-}{1}}}{\delta}{(}{\alpha}_{2},\ldots,{\alpha}_{d}{).} \tag{S12} \]
If we define the filter via the frequency response \[\widehat{\text{K}f}{(}{\omega}{)} = \frac{{\left|{\omega}\right|}^{{d}{-}{1}}}{{2}{(}{2}{\pi}{)}^{{d}{-}{1}}}\hat{f}{(}{\omega}{)} \tag{S13} \] we find \[\widehat{\text{K}\mathscr{R}\{r\}}{(}\mathbf{\alpha},{\omega}{)} = \frac{\hat{\sigma}({{\omega}{\alpha}}_{1})}{2}{\delta}{(}{\alpha}_{2},\ldots,{\alpha}_{d}{).} \tag{S14} \]
Taking the inverse Fourier transform, \begin{align*}&{\text{K}}{\mathscr{R}}{\{}{r}{\}(}\mathbf{\alpha},{t}{)} \\ & = \frac{1}{{2}\left|{{\alpha}_{1}}\right|}{\sigma}\left({\frac{t}{{\alpha}_{1}}}\right){\delta}{(}{\alpha}_{2},\ldots,{\alpha}_{d}{)} \\ & = \frac{{\sigma}{(}{t}{)}{\delta}{(}{\alpha}_{1}{-}{1}{)} + {\sigma}{(}{-}{t}{)}{\delta}{(}{\alpha}_{1} + {1}{)}}{2}{\delta}{(}{\alpha}_{2},\ldots,{\alpha}_{d}{)} \\ & = \frac{{\sigma}{(}{t}{)}{\delta}{(}\mathbf{\alpha}{-}{\mathbf{e}}_{1}{)} + {\sigma}{(}{-}{t}{)}{\delta}{(}\mathbf{\alpha} + {\mathbf{e}}_{1}{)}}{2} \\ & = {:}{\text{P}}_{\text{even}}{\{}{\sigma}{(}{t}{)}{\delta}{(}\mathbf{\alpha}{-}{\mathbf{e}}_{1}{)\}} \tag{S15} \end{align*} where ${\text{P}}_{\text{even}}$ is the projector, which extracts the even part of its input [in terms of the variables ${(}\mathbf{\alpha},{t}{)}$]. The second line holds by the dilation property of the Fourier transform [S1, eq. (4.34)] \[\frac{1}{\left|{\gamma}\right|}{f}\left({\frac{t}{\gamma}}\right)\mathop{\longleftrightarrow}\limits^{{\mathscr{F}}}\hat{f}{(}{\gamma}{\omega}{).} \tag{S16} \]
As $\mathbf{\alpha}\in{\mathbb{S}}^{{d}{-}{1}},$ the third line holds by observing that when ${\alpha}_{1} = \pm{1},{\alpha}_{2},\ldots,{\alpha}_{d} = {0},$ the second line is ${{\sigma}{(}\pm{t}{)}}{/}{2}$ multiplied by an impulse, and when ${\alpha}_{1}\ne\pm{1},$ the second line is zero, which is exactly the third line. By the rotation properties of the Fourier transform, we have the following result for the neuron ${r}{(}{\mathbf{x}}{)} = {\sigma}{(}{\mathbf{w}}^{\top}{\mathbf{x}}{):}$ \[{\text{K}}\,{\mathscr{R}}{\{}{r}{\}(}\mathbf{\alpha},{t}{)} = {\text{P}}_{\text{even}}{\{}{\sigma}{(}{t}{)}{\delta}{(}\mathbf{\alpha}{-}{\mathbf{w}}{)\}} \tag{S17} \] where ${\mathbf{w}}\in{\mathbb{S}}^{{d}{-}{1}}{.}$
The story is now analogous to the univariate case. Indeed, by the NBT, training the NN in (13) with weight decay is equivalent to solving the optimization problem \[\mathop{\min}\limits_{\mathbf{\theta}\,\in\,\Theta}\mathop{\sum}\limits_{{n} = {1}}^{N}{\mathcal{L}}{(}{y}_{n},{f}_{\mathbf{\theta}}{(}{\mathbf{x}}_{n}{))} + {\lambda}\mathop{\sum}\limits_{{k} = {1}}^{K}{\left|{{v}_{k}}\right|}{\Vert}{{\mathbf{w}}_{k}}{\Vert}_{2}{.} \tag{16} \]
From (15), we see that ${\Vert}{\text{D}}_{t}^{2}{\text{K}}{\mathscr{R}}{f}_{\mathbf{\theta}}{\Vert}_{\mathcal{M}} = {\Sigma}_{{k} = {1}}^{K}\left|{{v}_{k}}\right|{\Vert}{{\mathbf{w}}_{k}}{\Vert}_{2},$ so training the NN (13) with weight decay prefers solutions with sparse second derivatives in the filtered Radon domain. This measure of sparsity can be viewed as the second-order TV in the (filtered) Radon domain. Let ${\mathscr{R}}{\text{BV}}^{2}({\mathbb{R}}^{d})$ denote the space of functions on ${\mathbb{R}}^{d}$ of second-order bounded variation in the (filtered) Radon domain (i.e., the second-order TV in the (filtered) Radon domain is finite). A key result regarding ${\mathscr{R}}{\text{BV}}^{2}({\mathbb{R}}^{d})$ is the following representer theorem for NNs, first proven in [29]. Under mild assumptions on the loss function, the solution set to the optimization problem \[\mathop{\min}\limits_{{f}\in{\mathscr{R}}{\text{BV}}^{2}{(}{\mathbb{R}}^{d}{)}}\mathop{\sum}\limits_{{n} = {1}}^{N}{\mathcal{L}}{(}{y}_{n},{f}{(}{\mathbf{x}}_{n}{))} + {\lambda}{\Vert}{{\text{D}}_{t}^{2}\text{K}\,\mathscr{R}f}{\Vert}_{\mathcal{M}} \tag{17} \] is completely characterized by NNs of the form \[{f}_{\mathbf{\theta}}{(}{\mathbf{x}}{)} = \mathop{\sum}\limits_{{k} = {1}}^{K}{{v}_{k}}{\rho}{(}{\mathbf{w}}_{k}^{\top}{\mathbf{x}}{-}{b}_{k}{)} + {\mathbf{c}}^{\top}{\mathbf{x}} + {c}_{0} \tag{18} \] where the number of neurons is strictly less than the number of data ${(}{K}{<}{N}{)}$ in the sense that the solution set to (17) is a closed convex set whose extreme points take the form of (18) with ${K}{<}{N}$ (see [5], [27], and [43] for further refinements of this result). Common loss functions such as the squared error satisfy these mild assumptions. The skip connection ${\mathbf{c}}^{\top}{\mathbf{x}} + {c}_{0}$ arises because the null space of the sparsifying transform is the space of affine functions. Therefore, by the same argument presented in the univariate case, sufficiently wide $\left({{K}\geq{N}}\right)$ NNs [as in (18)] trained with weight decay to global minimizers are exactly optimal functions in ${\mathscr{R}}{\text{BV}}^{2}({\mathbb{R}}^{d}).$
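To connect the representer theorem to code, the following PyTorch sketch (our own; the widths are placeholders) implements a network of the form (18), with the skip connection as a separate affine branch, together with a helper that evaluates the regularizer Σ_k ‖v_k‖₂‖w_k‖₂, which by (15) equals the second-order TV of the network in the filtered Radon domain (the affine part lies in the null space of the transform and contributes nothing).

```python
import torch
import torch.nn as nn

class ShallowReLUNet(nn.Module):
    """A network of the form (18): a shallow ReLU layer plus an affine skip connection."""
    def __init__(self, d, K, D=1):
        super().__init__()
        self.hidden = nn.Linear(d, K)             # input weights w_k and biases b_k
        self.output = nn.Linear(K, D)             # output weights v_k and offset c_0
        self.skip = nn.Linear(d, D, bias=False)   # skip connection c^T x

    def forward(self, x):
        return self.output(torch.relu(self.hidden(x))) + self.skip(x)

def rbv2_seminorm(model):
    # sum_k ||v_k||_2 * ||w_k||_2, the weight-decay-equivalent regularizer in (16);
    # by (15), this equals || D_t^2 K R f_theta ||_M for a network of the form (18).
    v_norms = model.output.weight.norm(dim=0)     # ||v_k||_2 over the output dimension
    w_norms = model.hidden.weight.norm(dim=1)     # ||w_k||_2 over the input dimension
    return (v_norms * w_norms).sum()

model = ShallowReLUNet(d=5, K=32)
print(rbv2_seminorm(model).item())
```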
The machinery is straightforward to extend to the case of DNNs. The key idea is to consider fitting data using compositions of ${\mathscr{R}}{\text{BV}}^{2}$ functions. It is shown in [30] and [39] that under mild assumptions on the loss function, a solution to the optimization problem \[\mathop{\min}\limits_{{f}^{(1)},\ldots,{f}^{(L)}}\mathop{\sum}\limits_{{n} = {1}}^{N}{\mathcal{L}}{(}{\mathbf{y}}_{n}{,}{f}^{{(}{L}{)}}\circ\cdots\circ{f}^{{(}{1}{)}}{(}{\mathbf{x}}_{n}{))} + {\lambda}\mathop{\sum}\limits_{{\ell} = {1}}^{L}\mathop{\sum}\limits_{{i} = {1}}^{{d}_{\ell}}{{\Vert}{{\text{D}}_{t}^{2}{\text{K}}\,{\mathscr{R}}{f}_{i}^{{(}{\ell}{)}}}{\Vert}_{\mathcal{M}}} \tag{19} \] has the form of a DNN, as in (1), where ${d}_{\ell}$ are the intermediary dimensions in the function compositions. In particular, the resulting DNN is a ReLU network with skip connections and linear bottleneck layers whose widths are controlled by the ${d}_{\ell}{.}$
Such an architecture is illustrated in Figure 4. The result shows that ReLU DNNs with skip connections and linear bottlenecks trained with a variant of weight decay [30, Remark 4.7] are optimal solutions to fitting data using compositions of ${\mathscr{R}}{\text{BV}}^{2}$ functions. Linear bottlenecks may be written as factorized (low-rank) weight matrices of the form ${W}^{{(}{\ell}{)}} = {U}^{{(}{\ell}{)}}{V}^{{(}{\ell}{)}}{.}$ These bottleneck layers correspond to layers with linear activation functions $\left({{\sigma}{(}{t}{)} = {t}}\right){.}$ They arise naturally from the compositions of functions in (19). The number of neurons in each bottleneck layer is bounded by ${d}_{\ell}.$ The incorporation of linear bottlenecks of this form in DNNs has been shown to speed up learning [1] and increase the accuracy [15], robustness [36], and computational efficiency [46] of DNNs.
Figure 4. A feedforward DNN architecture with linear bottlenecks. The blue nodes represent ReLU neurons, gray nodes represent linear neurons, and white nodes depict the DNN inputs. As the linear layers are narrower than the ReLU ones, this architecture is referred to as a DNN with linear bottlenecks.
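A minimal sketch of such an architecture (our own; all widths and ranks are placeholder choices) is given below: each weight matrix between consecutive ReLU layers is factorized as W^(ℓ) = U^(ℓ)V^(ℓ) by inserting a narrow linear layer, which is exactly a linear bottleneck.

```python
import torch.nn as nn

def relu_block_with_bottleneck(d_in, width, rank):
    """A ReLU layer followed by a linear bottleneck: the linear map out of the
    block has a weight matrix factorized as W = U V with inner dimension `rank`."""
    return nn.Sequential(
        nn.Linear(d_in, width), nn.ReLU(),
        nn.Linear(width, rank, bias=False),   # V: compress to the bottleneck width
        nn.Linear(rank, width, bias=False),   # U: expand back, so W = U V is low rank
    )

# A DNN with linear bottlenecks, as in Figure 4 (widths and ranks are arbitrary).
model = nn.Sequential(relu_block_with_bottleneck(10, 128, rank=4),
                      relu_block_with_bottleneck(128, 128, rank=4),
                      nn.Linear(128, 1))
```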
The primary focus of the article so far has been the ReLU activation function ${\rho}{(}{t}{)} = {\text{max}}{\{}{0},{t}{\}.}$ Many of the previously discussed ideas can be extended to a broad class of activation functions. The property of the ReLU exploited so far has been that it is sparsified by the second derivative operator in the sense that ${D}_{t}^{2}{\rho} = {\delta}{.}$ Indeed, we can define a broad class of neural function spaces akin to ${\mathscr{R}}{\text{BV}}^{2}({\mathbb{R}}^{d})$ by defining spaces characterized by different sparsifying transforms matched to an activation function. This entails replacing ${D}_{t}^{2}$ in (14) with a generic sparsifying transform $\text{H}.$ Table 1 (adapted from [42]) provides examples of common activation functions that fall into this framework, where each sparsifying transform $\text{H}$ is defined by its frequency response $\hat{H}{(}{\omega}{).}$ For the ReLU, we have ${\text{H}} = {\text{D}}_{t}^{2},$ and so $\hat{H}{(}{\omega}{)} = {(}{j}{\omega}{)}^{2} = {-}{\omega}^{2}{.}$
Table 1. Common activation functions.
Many of the previously discussed results can thus be directly extended to a broad class of activation functions, including the classical sigmoid and arctan activation functions. We remark that the sparsity-promoting effect of weight decay hinges on the homogeneity of the activation function in the DNN. Although the ReLU and truncated power activation functions in Table 1 are homogeneous, the other activation functions are not. This provides evidence that one should prefer homogeneous activation functions like the ReLU to exploit the tight connections between weight decay and sparsity. Although the sparsity-promoting effect of weight decay does not apply to the nonhomogeneous activation functions, statements akin to (14) do hold by considering neurons with input weights constrained to be unit-norm. These sparsifying transforms thus uncover the innovations of finite-width NNs with unit-norm input weights. Therefore, by restricting to neurons with unit-norm input weights, the key results that characterize the solution sets of optimization problems akin to (17) and (19) still hold, providing insight into the kinds of functions favored by DNNs using these activation functions.
In 1993, Barron published his seminal paper [4] on the ability of NNs with sigmoid activation functions to approximate a wide variety of multivariate functions. Remarkably, he showed that NNs can approximate functions that satisfy certain decay conditions on their Fourier transforms at a rate that is completely independent of the input dimension of the functions. This property has led many to herald his work as “breaking the curse of dimensionality.” Today, the function spaces he studied are often referred to as the spectral Barron spaces. It turns out that this remarkable approximation property of NNs is due to sparsity.
To explain this phenomenon, we first recall a problem that “suffers the curse of dimensionality.” A classical problem in signal processing is reconstructing a signal from its samples. Shannon’s sampling theorem asserts that sampling a band-limited signal on a regular grid at a rate faster than the Nyquist rate guarantees that the sinc interpolator perfectly reconstructs the signal. As the sinc function and its shifts form an orthobasis for the space of band-limited signals, the energy of the signal (squared ${L}^{2}$-norm) corresponds to the squared (discrete) ${\ell}^{2}$-norm of its samples. Multivariate versions of the sampling theorem are similar and assert that sampling multivariate, band-limited signals on a sufficiently fine regular grid guarantees perfect reconstruction with (multivariate) sinc interpolation. It is easy to see that the number of grid points (and therefore the number of samples) grows exponentially with the dimension of the signal. This shows that the sampling and reconstruction of band-limited signals suffers the curse of dimensionality. The fundamental reason for this is that the energy or “size” of a band-limited signal corresponds to the ${\ell}^{2}$-norm of the signal’s expansion coefficients in the sinc basis.
It turns out that there is a stark difference if we measure the “size” of a function by the more restrictive ${\ell}^{1}$-norm instead of the ${\ell}^{2}$-norm, an idea popularized by wavelets and compressed sensing. Let ${\text{D}} = {\{}{\psi}{\}}_{{\psi}\in{\text{D}}}$ be a dictionary of atoms (e.g., sinc functions, wavelets, neurons, and so on). Consider the problem of approximating a multivariate function mapping ${\mathbb{R}}^{d}\rightarrow{\mathbb{R}}$ that admits a decomposition ${f}{(}{\mathbf{x}}{)} = {\Sigma}_{{k} = {1}}^{\infty}{v}_{k}{\psi}_{k}{(}{\mathbf{x}}{),}$ where ${\psi}_{k}\in{\text{D}}$ and the expansion coefficients satisfy ${\Sigma}_{{k} = {1}}^{\infty}\left|{{v}_{k}}\right| = {\Vert}{\mathbf{v}}{\Vert}_{{\ell}^{1}}{<}\infty{.}$ It turns out that there exists an approximant constructed with K terms from the dictionary $\text{D}$ whose ${L}^{2}$-approximation error ${\Vert}{{f}{-}{f}_{K}}{\Vert}_{{L}^{2}}$ decays at a rate completely independent of the input dimension d.
Here, we illustrate the argument when ${\text{D}} = {\{}{\psi}_{k}{\}}_{{k} = {1}}^{\infty}$ is an orthonormal basis (e.g., multivariate Haar wavelets). Given a function ${f}{:}{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}$ that admits a decomposition ${f}{(}{\mathbf{x}}{)} = {\Sigma}_{{k} = {1}}^{\infty}{v}_{k}{\psi}_{k}{(}{\mathbf{x}}{)}$ such that ${\Vert}{\mathbf{v}}{\Vert}_{{\ell}^{1}}{<}\infty,$ we can construct an approximant ${f}_{{K}}$ by a simple thresholding procedure that keeps the K largest coefficients of f and sets all other coefficients to zero. If we let $\left|{{v}_{(1)}}\right|\geq\left|{{v}_{(2)}}\right|\geq\cdots$ denote the coefficients of f sorted in nonincreasing magnitude, then the squared approximation error satisfies \[{\Vert}{{f}{-}{f}_{K}}{\Vert}_{{L}^{2}}^{2} = {\Vert}{\mathop{\sum}\limits_{{k}{>}{K}}{{v}_{(k)}}{\psi}_{(k)}}{\Vert}_{{L}^{2}}^{2} = \mathop{\sum}\limits_{{k}{>}{K}}{{\left|{{v}_{(k)}}\right|}^{2}} \tag{20} \] where the last equality follows by exploiting the orthonormality of the $\{{\psi}_{k}{\}}_{{k} = {1}}^{\infty}.$ Finally, as the original sequence of coefficients ${\mathbf{v}} = {(}{v}_{1},{v}_{2},\ldots{)}$ is absolutely summable, the sorted coefficients must satisfy $\left|{{v}_{(k)}}\right|\leq{\Vert}{\mathbf{v}}{\Vert}_{{\ell}^{1}}{/}{k}$ (the k largest coefficients each have magnitude at least $\left|{{v}_{(k)}}\right|,$ so ${k}\left|{{v}_{(k)}}\right|\leq{\Vert}{\mathbf{v}}{\Vert}_{{\ell}^{1}}$), and therefore the tail sum in (20) is bounded by ${\Vert}{\mathbf{v}}{\Vert}_{{\ell}^{1}}^{2}{/}{K}{.}$ Putting this together with (20), the ${L}^{2}$-approximation error ${\Vert}{{f}{-}{f}_{K}}{\Vert}_{{L}^{2}}$ must decay as ${K}^{{-}{1}{/}{2}},$ completely independent of the input dimension d. For a more precise treatment of this argument, we refer the reader to [22, Th. 9.10]. These kinds of thresholding procedures, particularly with wavelet bases [12], revolutionized signal and image processing and were the foundations of compressed sensing [7], [11].
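The ${K}^{-1/2}$ rate is easy to observe numerically. The sketch below (our own; the particular coefficient sequence is an arbitrary ℓ¹ example) forms the best K-term approximation by keeping the K largest coefficients and checks that the error times √K stays bounded, consistent with the K^{-1/2} guarantee (for this particular sequence, the decay is in fact even faster).

```python
import numpy as np

# Coefficients v_k with ||v||_1 < infinity (here v_k = k^{-1.5}, an l1 sequence).
n = 200_000
v = np.arange(1, n + 1, dtype=float) ** -1.5

def keep_K_largest(v, K):
    # Best K-term approximation in an orthonormal basis: keep the K largest
    # coefficients in magnitude and set the rest to zero.
    vK = np.zeros_like(v)
    idx = np.argsort(-np.abs(v))[:K]
    vK[idx] = v[idx]
    return vK

for K in [10, 100, 1000, 10000]:
    # By orthonormality, ||f - f_K||_{L^2} equals the l2 norm of the dropped coefficients.
    err = np.linalg.norm(v - keep_K_largest(v, K))
    print(K, err, err * np.sqrt(K))   # err * sqrt(K) remains bounded
```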
By a more sophisticated argument, a similar phenomenon occurs when the orthonormal basis is replaced with an essentially arbitrary dictionary of atoms. The result for general atoms is based on a probabilistic technique presented by Pisier in 1981 at the Functional Analysis Seminar at École Polytechnique, Palaiseau, France, crediting the idea to Maurey [33]. An implication of Maurey’s technique is that, given a function that is an ${\ell}^{1}$ combination of bounded atoms from a dictionary, there exists a K-term approximant that admits a dimension-free approximation rate that decays as ${K}^{{-}{1}{/}{2}}{.}$ Motivated by discussions with Jones [18] on his work on greedy approximation, which provides a deterministic algorithm to find the approximant that admits the dimension-free rate, Barron used the technique of Maurey to prove his dimension-free approximation rates with sigmoidal NNs. This abstract approximation result is now referred to as the Maurey–Jones–Barron lemma.
In particular, the Maurey–Jones–Barron lemma can be applied to any function space where the functions are ${\ell}^{1}$ combinations of bounded atoms. Such spaces are sometimes called variation spaces [2], [20]. Recall from the “What Kinds of Functions Do NNs Learn?” and “What Is the Role of NN Activation Functions” sections that the ${\text{H}}{\text{K}}\,{\mathscr{R}}$ operator sparsifies neurons of the form ${\sigma}{(}{\mathbf{w}}^{\top}{\mathbf{x}}{-}{b}{),}$ where ${(}{\mathbf{w}},{b}{)}\in{\mathbb{S}}^{{d}{-}{1}}\times{\mathbb{R}}$ and ${\sigma}$ is matched to $\text{H}.$ This implies that the space of functions ${f}{:}{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}$ such that ${\Vert}{\text{H}\text{K}\,\mathscr{R}f}{\Vert}_{\mathcal{M}}{<}\infty$ can be viewed as a variation space, where the dictionary corresponds to the neurons ${\{}{\sigma}{(}{\mathbf{w}}^{\top}{\mathbf{x}}{-}{b}{)\}}_{{(}{\mathbf{w}},{b}{)}\in{\mathbb{S}}^{{d}{-}{1}}\times{\mathbb{R}}}{.}$ Therefore, given ${f}{:}{\mathbb{R}}^{d}\rightarrow{\mathbb{R}}$ such that ${\Vert}{\text{H}\text{K}\,\mathscr{R}f}{\Vert}_{\mathcal{M}}{<}\infty,$ there exists a K-term approximant ${f}_{{K}}$ that takes the form of a shallow NN with K neurons such that the ${L}^{2}$-approximation error decays as ${K}^{{-}{1}{/}{2}}.$ These techniques have been studied in great detail [40] and extended to the setting of DNNs [13] by considering compositional function spaces akin to the compositional space introduced in the “DNNs” section.
Combining these dimension-free approximation rates with the sparsity-promoting effect of weight decay regularization for ReLU NNs has a striking effect on the learning problem. Suppose that we train a shallow ReLU NN with weight decay on data generated from the noisy samples ${y}_{n} = {f}{(}{\mathbf{x}}_{n}{)} + {\varepsilon}_{n},{n} = {1},\ldots,{N},$ of ${f}\in{\mathscr{R}}{\text{BV}}^{2}{(}{\mathbb{R}}^{d}{),}$ where ${\mathbf{x}}_{n}$ are independent identically distributed (i.i.d.) uniform random variables on some bounded domain $\Omega\subset{\mathbb{R}}^{d},$ and ${\varepsilon}_{n}$ are i.i.d. Gaussian random variables. Let ${f}_{{N}}$ denote this trained NN. Then, it has been shown [31] that the mean integrated squared error (MISE) $\mathbb{E}{\Vert}{{f}{-}{f}_{N}}{\Vert}_{{L}^{2}{(}\Omega{)}}^{2}$ decays at a rate of ${N}^{{-}{1}{/}{2}},$ independent of the input dimension d. Moreover, this result also shows that the generalization error of the trained NN on a new example x generated uniformly at random on Ω is also immune to the curse of dimensionality. Furthermore, these ideas have been studied in the context of DNNs [38], proving dimension-free MISE rates for estimating Hölder functions (that exhibit low-dimensional structure) with ReLU DNNs.
The national meeting of the American Mathematical Society in 2000 was held to discuss the mathematical challenges of the 21st century. There, Donoho [10] gave a lecture titled “High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality.” In this lecture, he coined the term mixed variation to refer to the kinds of functions that lie in variation spaces, citing the spectral Barron spaces as an example. Variation spaces are different from classical multivariate function spaces in that they favor functions that have weak variation in multiple directions (very smooth functions) as well as functions that have very strong variation in one or a few directions (very rough functions). These spaces also disfavor functions with strong variation in multiple directions. It is this fact that makes them quite “small” compared to classical multivariate function spaces, giving rise to their dimension-free approximation and MISE rates. Examples of functions with different kinds of variation are illustrated in Figure 5. The prototypical examples of functions that lie in mixed variation spaces can be thought of as superpositions of a few neurons with different directions or superpositions of many neurons (even continuously many) in only a few directions.
Figure 5. Examples of functions exhibiting different kinds of variation. (a) Weak variation in multiple directions. (b) Strong variation in one direction. (c) Strong variation in multiple directions.
To interpret the idea of mixed variation in the context of modern data analysis and DL, we turn our attention to Figure 5(b). In this figure, the function has strong variation, but only in a single direction. In other words, this function has a low-dimensional structure. It has been observed by a number of authors that DNNs are able to automatically adapt to the low-dimensional structure that often arises in natural data. This is possible because the input weights can be trained to adjust the orientation of each neuron. The dimension-independent approximation rate quantifies the power of this tunability. This explains why DNNs are good at learning functions with a low-dimensional structure. In contrast, the function in Figure 5(c) has strong variation in all directions, so no method can overcome the curse of dimensionality in this sort of situation. On the other hand, in Figure 5(a), the function has weak variation in all directions, and in Figure 5(b), the function has strong variation in only one direction, so these are functions for which NNs will overcome the curse. For Figure 5(b), the sparsity-promoting effect of weight decay promotes DNN solutions with neurons oriented in the direction of variation (i.e., it automatically learns the low-dimensional structure).
In this article, we presented a mathematical framework to understand DNNs from first principles, through the lens of sparsity and sparse regularization. Using familiar mathematical tools from signal processing, we provided an explanation for the sparsity-promoting effect of the common regularization scheme of weight decay in NN training, the use of skip connections and low-rank weight matrices in network architectures, and why NNs seemingly break the curse of dimensionality. This framework provides the mathematical setting for many future research directions.
The framework suggests the possibility of new neural training algorithms. The equivalence of solutions using weight decay regularization and the regularization in (6) leads to the use of proximal gradient methods akin to iterative soft-thresholding algorithms to train DNNs. This avenue has already begun to be explored. The preliminary results in [47] have shown that proximal gradient training algorithms for DNNs perform as well as and often better (particularly when labels are corrupted) than standard gradient-based training with weight decay, while simultaneously producing sparser networks.
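A minimal sketch of the core ingredient of such an algorithm is given below (our own; it assumes a model like the ShallowReLUNet sketch earlier, whose output layer holds the outer weights v_k, and it is only a simplified illustration of the idea, not the specific algorithm of [47], which handles the remaining parameters and normalization differently). The proximal operator of the group LASSO regularizer λΣ_k‖v_k‖₂ is a group soft-thresholding that sets entire neurons to zero.

```python
import torch

def group_soft_threshold(V, tau):
    # Proximal operator of tau * sum_k ||v_k||_2 applied to the rows of V: each
    # row is shrunk toward zero and zeroed out entirely when its norm is below tau.
    norms = V.norm(dim=1, keepdim=True).clamp_min(1e-12)
    return torch.clamp(1.0 - tau / norms, min=0.0) * V

def proximal_gradient_step(model, X, y, loss_fn, lr, lam):
    # One proximal gradient step: a gradient step on the data-fitting term for all
    # parameters, followed by group soft-thresholding of the outer weights v_k.
    model.zero_grad()
    loss_fn(model(X), y).backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad
        W = model.output.weight                         # shape (D, K); columns are v_k
        W.copy_(group_soft_threshold(W.T, lr * lam).T)
```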
The large body of work dating back to 1989 [21] on NN pruning has shown empirically that large NNs can be compressed or sparsified to a fraction of their size while still maintaining their predictive performance. The connection between weight decay and sparsity-promoting regularizers, like in (6), suggests new approaches to pruning. For example, one could apply proximal gradient algorithms to derive sparse approximations to large pretrained NNs [39]. There are many open questions in this direction, both experimental and theoretical, including applying these algorithms to other DNN architectures and deriving convergence results for these algorithms.
The framework in this article also shows that trained ReLU DNNs are compositions of ${\mathscr{R}}{\text{BV}}^{2}$ functions. As we have seen, we now have a clear and nearly complete understanding of ${\mathscr{R}}{\text{BV}}^{2}({\mathbb{R}}^{d})$. In particular, the ${\mathscr{R}}{\text{BV}}^{2}$ space favors functions that are smooth in most or all directions, which explains why NNs seemingly break the curse of dimensionality. Much less is understood about compositions of ${\mathscr{R}}{\text{BV}}^{2}$ functions, which characterize DNNs. A better understanding of these compositional function spaces could provide new insights into the benefits of depth in NNs, which in turn could lead to new guidelines for designing NN architectures and training algorithms.
The authors would like to thank Rich Baraniuk, Misha Belkin, Çağatay Candan, Ron DeVore, Kangwook Lee, Greg Ongie, Dimitris Papailiopoulos, Tomaso Poggio, Lorenzo Rosasco, Joe Shenouda, Jonathan Siegel, Ryan Tibshirani, Michael Unser, Becca Willett, Stephen Wright, Liu Yang, and Jifan Zhang for many insightful discussions on the topics presented in this article.
Rahul Parhi was supported in part by the National Science Foundation (NSF) Graduate Research Fellowship Program under Grant DGE-1747503 and the European Research Council’s Project FunLearn under Grant 101020573. Robert Nowak was supported in part by NSF Grants DMS-2134140 and DMS-2023239, U.S. Office of Naval Research MURI Grant N00014-20-1-2787, Air Force Office of Scientific Research/Air Force Research Laboratory Grant FA9550-18-1-0166, and the Keith and Jane Nosbusch Professorship.
Rahul Parhi (rahul.parhi@epfl.ch) received his B.S. degrees in mathematics and computer science from the University of Minnesota, Twin Cities, in 2018, and his M.S. and Ph.D. degrees in electrical engineering from the University of Wisconsin–Madison in 2019 and 2022, respectively. During his Ph.D. studies, he was supported by a National Science Foundation Graduate Research Fellowship. Currently, he is a postdoctoral researcher with the Biomedical Imaging Group, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland. His research interests include applications of functional and harmonic analysis to problems in signal processing and data science. He is a Member of IEEE.
Robert D. Nowak (rdnowak@wisc.edu) received his Ph.D. degree in electrical engineering from the University of Wisconsin–Madison, Madison, WI 53706 USA, where he is the Grace Wahba Professor of Data Science and Keith and Jane Morgan Nosbusch Professor of Electrical and Computer Engineering. He serves as a section editor of SIAM Journal on Mathematics of Data Science and a senior editor of IEEE Journal on Selected Areas in Information Theory. With research focusing on signal processing, machine learning, optimization, and statistics, his work on sparse signal recovery and compressed sensing has won several awards. He is a Fellow of IEEE.
[1] L. J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Proc. 27th Int. Conf. Neural Inf. Process. Syst., 2014, vol. 2, pp. 2654–2662.
[2] F. Bach, “Breaking the curse of dimensionality with convex neural networks,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 629–681, Jan. 2017.
[3] R. Balestriero and R. G. Baraniuk, “Mad max: Affine spline insights into deep learning,” Proc. IEEE, vol. 109, no. 5, pp. 704–727, May 2021, doi: 10.1109/JPROC.2020.3042100.
[4] A. R. Barron, “Universal approximation bounds for superpositions of a sigmoidal function,” IEEE Trans. Inf. Theory, vol. 39, no. 3, pp. 930–945, May 1993, doi: 10.1109/18.256500.
[5] F. Bartolucci, E. De Vito, L. Rosasco, and S. Vigogna, “Understanding neural networks with reproducing kernel Banach spaces,” Appl. Comput. Harmon. Anal., vol. 62, pp. 194–236, Jan. 2023, doi: 10.1016/j.acha.2022.08.006.
[6] E. J. Candès, “Ridgelets: Theory and applications,” Ph.D. dissertation, Stanford Univ., Stanford, CA, USA, 1998.
[7] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006, doi: 10.1109/TIT.2005.862083.
[8] L. Chizat and F. Bach, “Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss,” in Proc. 33rd Conf. Learn. Theory (PMLR), 2020, pp. 1305–1338.
[9] T. Debarre, Q. Denoyelle, M. Unser, and J. Fageot, “Sparsest piecewise-linear regression of one-dimensional data,” J. Comput. Appl. Math., vol. 406, May 2022, Art. no. 114044, doi: 10.1016/j.cam.2021.114044.
[10] D. L. Donoho, “High-dimensional data analysis: The curses and blessings of dimensionality,” AMS Math Challenges Lectures, vol. 1, p. 32, Aug. 2000.
[11] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006, doi: 10.1109/TIT.2006.871582.
[12] D. L. Donoho and I. M. Johnstone, “Minimax estimation via wavelet shrinkage,” Ann. Statist., vol. 26, no. 3, pp. 879–921, Jun. 1998, doi: 10.1214/aos/1024691081.
[13] E. Weinan and S. Wojtowytsch, “On the Banach spaces associated with multi-layer ReLU networks: Function representation, approximation theory and gradient descent dynamics,” CSIAM Trans. Appl. Math., vol. 1, no. 3, pp. 387–440, 2020, doi: 10.4208/csiam-am.20-211.
[14] S. D. Fisher and J. W. Jerome, “Spline solutions to L1 extremal problems in one and several variables,” J. Approximation Theory, vol. 13, no. 1, pp. 73–83, Jan. 1975, doi: 10.1016/0021-9045(75)90016-7.
[15] A. Golubeva, B. Neyshabur, and G. Gur-Ari, “Are wider nets better given the same number of parameters?” in Proc. Int. Conf. Learn. Representations, 2021.
[16] Y. Grandvalet, “Least absolute shrinkage is equivalent to quadratic penalization,” in Proc. Int. Conf. Artif. Neural Netw., London, U.K.: Springer-Verlag, 1998, pp. 201–206, doi: 10.1007/978-1-4471-1599-1_27.
[17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2016, pp. 770–778.
[18] L. K. Jones, “A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training,” Ann. Statist., vol. 20, no. 1, pp. 608–613, Mar. 1992, doi: 10.1214/aos/1176348546.
[19] V. Kůrková, P. C. Kainen, and V. Kreinovich, “Estimates of the number of hidden units and variation with respect to half-spaces,” Neural Netw., vol. 10, no. 6, pp. 1061–1068, Aug. 1997, doi: 10.1016/S0893-6080(97)00028-2.
[20] V. Kůrková and M. Sanguineti, “Bounds on rates of variable-basis and neural-network approximation,” IEEE Trans. Inf. Theory, vol. 47, no. 6, pp. 2659–2665, Sep. 2001, doi: 10.1109/18.945285.
[21] Y. LeCun, J. Denker, and S. Solla, “Optimal brain damage,” in Proc. 2nd Int. Conf. Neural Inf. Process. Syst. (NIPS), 1989, pp. 598–605.
[22] S. Mallat, A Wavelet Tour of Signal Processing, 3rd ed. Amsterdam, The Netherlands: Elsevier/Academic Press, 2009.
[23] E. Mammen and S. van de Geer, “Locally adaptive regression splines,” Ann. Statist., vol. 25, no. 1, pp. 387–413, Feb. 1997, doi: 10.1214/aos/1034276635.
[24] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” Bull. Math. Biophys., vol. 5, no. 4, pp. 115–133, Dec. 1943, doi: 10.1007/BF02478259.
[25] B. Neyshabur, R. Tomioka, and N. Srebro, “In search of the real inductive bias: On the role of implicit regularization in deep learning,” in Proc. Int. Conf. Learn. Representations (Workshop), 2015.
[26] G. Ongie, R. Willett, D. Soudry, and N. Srebro, “A function space view of bounded norm infinite width ReLU nets: The multivariate case,” in Proc. Int. Conf. Learn. Representations, 2020.
[27] R. Parhi, “On ridge splines, neural networks, and variational problems in Radon-domain BV spaces,” Ph.D. dissertation, The Univ. of Wisconsin–Madison, Madison, WI, USA, 2022.
[28] R. Parhi and R. D. Nowak, “The role of neural network activation functions,” IEEE Signal Process. Lett., vol. 27, pp. 1779–1783, Sep. 2020, doi: 10.1109/LSP.2020.3027517.
[29] R. Parhi and R. D. Nowak, “Banach space representer theorems for neural networks and ridge splines,” J. Mach. Learn. Res., vol. 22, no. 1, pp. 1960–1999, Jan. 2021.
[30] R. Parhi and R. D. Nowak, “What kinds of functions do deep neural networks learn? Insights from variational spline theory,” SIAM J. Math. Data Sci., vol. 4, no. 2, pp. 464–489, 2022, doi: 10.1137/21M1418642.
[31] R. Parhi and R. D. Nowak, “Near-minimax optimal estimation with shallow ReLU neural networks,” IEEE Trans. Inf. Theory, vol. 69, no. 2, pp. 1125–1140, Feb. 2023, doi: 10.1109/TIT.2022.3208653.
[32] M. Pilanci and T. Ergen, “Neural networks are convex regularizers: Exact polynomial-time convex optimization formulations for two-layer networks,” in Proc. 37th Int. Conf. Mach. Learn. (PMLR), 2020, pp. 7695–7705.
[33] G. Pisier, “Remarques sur un résultat non publié de B. Maurey,” in Proc. Séminaire d’Anal. Fonctionnelle (dit “Maurey-Schwartz”), Apr. 1981, pp. 1–12.
[34] T. Poggio, A. Banburski, and Q. Liao, “Theoretical issues in deep networks,” Proc. Nat. Acad. Sci. USA, vol. 117, no. 48, pp. 30,039–30,045, Dec. 2020, doi: 10.1073/pnas.1907369117.
[35] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Phys. D, Nonlinear Phenomena, vol. 60, nos. 1–4, pp. 259–268, Nov. 1992, doi: 10.1016/0167-2789(92)90242-F.
[36] A. Sanyal, P. H. Torr, and P. K. Dokania, “Stable rank normalization for improved generalization in neural networks and GANs,” in Proc. Int. Conf. Learn. Representations, 2019.
[37] P. Savarese, I. Evron, D. Soudry, and N. Srebro, “How do infinite width bounded norm networks look in function space?” in Proc. 32nd Conf. Learn. Theory (PMLR), 2019, pp. 2667–2690.
[38] J. Schmidt-Hieber, “Nonparametric regression using deep neural networks with ReLU activation function,” Ann. Statist., vol. 48, no. 4, pp. 1875–1897, Aug. 2020, doi: 10.1214/19-AOS1875.
[39] J. Shenouda, R. Parhi, K. Lee, and R. D. Nowak, “Vector-valued variation spaces and width bounds for DNNs: Insights on weight decay regularization,” 2023, arXiv:2305.16534.
[40] J. W. Siegel and J. Xu, “Sharp bounds on the approximation rates, metric entropy, and n-widths of shallow neural networks,” Found. Comput. Math., early access, 2022, doi: 10.1007/s10208-022-09595-3.
[41] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc., Ser. B (Methodological), vol. 58, no. 1, pp. 267–288, Jan. 1996, doi: 10.1111/j.2517-6161.1996.tb02080.x.
[42] M. Unser, “From kernel methods to neural networks: A unifying variational formulation,” 2022, arXiv:2206.14625.
[43] M. Unser, “Ridges, neural networks, and the Radon transform,” J. Mach. Learn. Res., vol. 24, no. 37, pp. 1–33, 2023.
[44] M. Unser, J. Fageot, and J. P. Ward, “Splines are universal solutions of linear inverse problems with generalized TV regularization,” SIAM Rev., vol. 59, no. 4, pp. 769–793, 2017, doi: 10.1137/16M1061199.
[45] M. Vetterli, P. Marziliano, and T. Blu, “Sampling signals with finite rate of innovation,” IEEE Trans. Signal Process., vol. 50, no. 6, pp. 1417–1428, Jun. 2002, doi: 10.1109/TSP.2002.1003065.
[46] H. Wang, S. Agarwal, and D. Papailiopoulos, “Pufferfish: Communication-efficient models at no extra cost,” in Proc. 3rd Mach. Learn. Syst., 2021, pp. 365–386.
[47] L. Yang, J. Zhang, J. Shenouda, D. Papailiopoulos, K. Lee, and R. D. Nowak, “A better way to decay: Proximal gradient training algorithms for neural nets,” in Proc. 14th Annu. Workshop Optim. Mach. Learn. (OPT), 2022.
[48] M. Yuan and Y. Lin, “Model selection and estimation in regression with grouped variables,” J. Roy. Statist. Soc., Ser. B (Statist. Methodology), vol. 68, no. 1, pp. 49–67, Feb. 2006, doi: 10.1111/j.1467-9868.2005.00532.x.