Luis Albert Zavala-Mondragón, Peter H.N. de With, Fons van der Sommen
Encoding-decoding convolutional neural networks (CNNs) play a central role in data-driven noise reduction and can be found within numerous deep learning algorithms. However, the development of these CNN architectures is often done in an ad hoc fashion and theoretical underpinnings for important design choices are generally lacking. Several relevant works have striven to explain the internal operation of these CNNs, but these ideas are scattered and often require significant expertise to be accessible to a broader audience. To open up this exciting field, this article builds intuition on the theory of deep convolutional framelets (TDCFs) and explains diverse encoding-decoding (ED) CNN architectures in a unified theoretical framework. By connecting basic principles from signal processing to the field of deep learning, this self-contained material offers significant guidance for designing robust and efficient novel CNN architectures.
A well-known image processing application is noise/artifact reduction of images, which consists of estimating a noise/artifact-free signal out of a noisy observation. To achieve this, conventional signal processing algorithms often employ explicit assumptions on the signal and noise characteristics, which has resulted in well-known algorithms such as wavelet shrinkage [1], sparse dictionaries [2], total-variation minimization [3], and low-rank approximation [4]. With the advent of deep learning techniques, signal processing algorithms applied to image denoising have been regularly outperformed and increasingly replaced by encoding-decoding CNNs.
In this article, rather than conventional signal processing algorithms, we focus on so-called encoding-decoding CNNs. These models contain an encoder that maps the input to multichannel/redundant representations and a decoder, which maps the encoded signal back to the original domain. In both the encoder and the decoder, sparsifying nonlinearities are applied, which suppress parts of the signal. In contrast to conventional signal processing algorithms, encoding-decoding CNNs are often presented as a solution that does not make explicit assumptions on the signal and noise. For example, in supervised algorithms, an encoding-decoding CNN learns the optimal parameters to filter the signal from a set of paired examples of noise/artifact-free images and images contaminated with noise/artifacts [5], [6], [7], which greatly simplifies the noise-reduction problem, as it circumvents explicit modeling of the signal and noise. Furthermore, the good performance and simple use of encoder-decoder CNNs have enabled additional data-driven noise-reduction algorithms, where CNNs are embedded as part of a larger system. Examples of such approaches are unsupervised noise reduction [8] and denoising based on generative adversarial networks [9]. Besides this, smoothness in signals can also be obtained by advanced regularization using CNNs, e.g., by exploiting data-driven model-based iterative reconstruction [10].
Despite the impressive noise-reduction performance and flexibility of encoding-decoding CNNs, these models also have downsides that should be considered. First, the complexity and heuristic nature of such designs often offer only a restricted understanding of the internal operation of such architectures [11]. Second, training and deployment of CNNs require specialized hardware and use of significant computational resources. Third and finally, the restricted understanding of signal modeling in encoding-decoding CNNs does not clearly reveal the limitations of such models and, consequently, it is not obvious how to overcome these problems.
To overcome the limitations of encoding-decoding CNNs, new research has tackled the lack of explainability of these models by acknowledging the similarity between the building blocks of encoding-decoding CNNs applied to image noise reduction and the elements of well-known signal processing algorithms, such as wavelet decomposition, low-rank approximation [12], [13], [14], variational methods [15], lower-dimensional manifolds [8], and convolutional sparse coding [16]. Furthermore, practical works on shrinkage-based CNNs, inspired by well-established wavelet shrinkage algorithms, have further deepened the connections between signal processing and CNNs [17], [18]. This unified treatment of signal processing-inspired CNNs has resulted in more explainable [6], better performing [6], and more memory-efficient designs [19].
This article has three main objectives. The first is to summarize the diverse explanations of the components of encoding-decoding CNNs applied to image noise reduction based on the concept of deep convolutional framelets [12], and on elementary signal processing concepts. Both aspects are considered with the aim of achieving an in-depth understanding of the internal operation of encoding-decoding CNNs, and to show that the design choices have implicit assumptions about the signal behavior inside the CNN. A second objective is to offer practitioners tools for optimizing their CNN designs with signal processing concepts. Third and finally, the aim is to show practical use cases where existing CNNs are analyzed in a unified framework, thereby enabling a better comparison of different designs by making their internal operation explicitly visible. Our analysis is based on existing works [6], [12], [20], in which CNNs are analyzed while the nonlinearities are ignored. In this article, we overcome this limitation and present a complete analysis including the nonlinear activations, which reveals important assumptions implicit in the analyzed models.
The structure of this article is as follows. The “Notation” section introduces the notation used in this text. The “Encoding-Decoding CNNs” section describes the signal model and the architecture of encoding-decoding networks. Afterward, the “Signal Processing Fundamentals” section addresses fundamental aspects of signal processing, such as singular value decomposition (SVD), low-rank approximation, and framelets as well as estimation of signals in the framelet domain. All the concepts of the “Encoding-Decoding CNNs” and “Signal Processing Fundamentals” sections converge in the “Bridging the Gap Between Signal Processing and CNNs: Deep Convolutional Framelets and Shrinkage-Based CNNs” section, where the encoding-decoding CNNs are interpreted in terms of a data-driven low-rank approximation and of wavelet shrinkage. Next, based on the learnings from the “Bridging the Gap Between Signal Processing and CNNs: Deep Convolutional Framelets and Shrinkage-Based CNNs” section, the “Mathematical Analysis of Relevant Designs” section presents the analysis of diverse architectures from a signal processing perspective and under a set of explicit assumptions. Subsequently, the “What Happens in Trained Models?” section explores whether some of the theoretical properties exposed here are related to trained models. Based on the diverse described models and theoretical operation of CNNs, the “Which Network Fits My Problem?” section addresses a design criterion that can be used to design or choose new models and briefly describes the state of the art for noise reduction with CNNs. Finally, the “Conclusions and Future Outlook” section elaborates on concluding remarks and discusses the diverse elements that have not yet been (widely) explored by current CNN designs.
CNNs are composed of basic elements, such as convolution, activation, and down-/upsampling layers. To achieve better clarity in the explanations given in this article, we define the mathematical notation to represent the basic operations of CNNs. Some of the definitions presented here are based on the work of Zavala-Mondragón et al. [19].
In the following, a scalar is represented by a lowercase letter (e.g., a), while a vector is represented by an underlined lowercase letter (e.g., $\underline{a}$). Furthermore, a matrix, such as an image or convolution mask, is represented by a boldface lowercase letter (e.g., ${\bf{x}}$ and ${\bf{y}}$). Finally, a tensor is defined by a boldface uppercase letter. For example, the two arbitrary tensors ${\bf{A}}$ and ${\bf{Q}}$ are defined by \begin{align*}{\bf{A}} = \left({\begin{array}{ccc}{{\bf{a}}_{0}^{0}}&{\ldots}&{{\bf{a}}_{{N}_{C}{-}{1}}^{0}}\\{\vdots}&{\ddots}&{\vdots}\\{{\bf{a}}_{0}^{{N}_{R}{-}{1}}}&{\ldots}&{{\bf{a}}_{{N}_{C}{-}{1}}^{{N}_{R}{-}{1}}}\end{array}}\right),{\bf{Q}} = \left({\begin{array}{c}{{\bf{q}}^{0}}\\{\vdots}\\{{\bf{q}}^{{N}_{R}{-}{1}}}\end{array}}\right){.} \tag{1} \end{align*}
Here, entries ${\bf{a}}_{c}^{r}$ and ${\bf{q}}^{r}$ represent 2D arrays (matrices). As the defined tensors are used in the context of CNNs, matrices ${\bf{a}}_{c}^{r}$ and ${\bf{q}}^{r}$ are learned filters, which have dimensions of ${(}{N}_{V}\,{\times}\,{N}_{H}{),}$ where ${N}_{V}$ and ${N}_{H}$ denote the filter dimensions in the vertical and horizontal directions, respectively. Finally, we define the total tensor dimension of A and Q by ${(}{N}_{C}\,{\times}\,{N}_{R}\,{\times}\,{N}_{V}\,{\times}\,{N}_{H}{)}$ and ${(}{N}_{R}\,{\times}\,{1}\,{\times}\,{N}_{V}\,{\times}\,{N}_{H}{),}$ where ${N}_{R}$ and ${N}_{C}$ are the number of row and column entries, respectively. If the tensor A contains the convolution weights in a CNN, the row-entry dimensions represent the input number of channels to a layer, while the number of column elements denotes the number of output channels.
Having defined the notation for the variables, we focus on a few relevant operators. First, the transpose of a tensor, denoted by ${(}\,{\cdot}\,{)}^{\top}$, is expressed by \[{\bf{Q}}^{\top} = \left({\bf{q}}^{0}\quad{\ldots}\quad{\bf{q}}^{{N}_{R}{-}{1}}\right){.} \tag{2} \]
Furthermore, the convolution of two tensors is written as ${\bf{AQ}}$ and specified by \begin{align*}{\bf{AQ}} = \left({\begin{array}{c}{\mathop{\sum}\limits_{{r} = {0}}\limits^{{N}_{R}{-}{1}}{{\bf{a}}_{0}^{r}}\ast{\bf{q}}^{r}}\\{\vdots}\\{\mathop{\sum}\limits_{{r} = {0}}\limits^{{N}_{R}{-}{1}}{{\bf{a}}_{{N}_{C}{-}{1}}^{r}}\ast{\bf{q}}^{r}}\end{array}}\right){.} \tag{3} \end{align*}
Here, the symbol $\ast$ defines the convolution between two matrices (images).
In this article, images that are 2D arrays (matrices) are often convolved with 4D tensors. When this operation is performed, images are considered to have dimensions of ${(}{1}\,{\times}\,{1}\,{\times}\,{N}_{V}\,{\times}\,{N}_{H}{)}$. In addition, in this article, matrix I is the identity signal for the convolution operator, which, for a 2D image, is the Kronecker delta/discrete impulse (an image with a single nonzero pixel with unity amplitude at the center of the image). Furthermore, we indicate that variables in the decoding path of a CNN are distinguished with a tilde (e.g., $\tilde{\bf{K}},\underline{\tilde{b}}{)}$.
Additional symbols that are used throughout the article are the down- and upsampling operations by a factor s, which are denoted by ${f}_{{(}{s}\downarrow{)}}{(}\,{\cdot}\,{)}$ and ${f}_{{(}{s}\uparrow{)}}{(}\,{\cdot}\,{)}$, respectively. In this article, both operations are defined in the same way as in multirate filter banks. For example, consider the signal \[\underline{x} = \left({1,\,2,\,3,\,4,\,5,\,6,\,7,\,8,\,9,\,10}\right){.} \tag{4} \]
If we apply the downsampling operator to $\underline{x}$ by a factor of two, it results in \[\underline{z} = {f}_{{(}{2}\downarrow{)}}{(}\underline{x}{)} = \left({\begin{array}{ccccc}{1,}&{3,}&{5,}&{7,}&{9}\end{array}}\right) \tag{5} \] where $\underline{z}$ is the downsampled version of $\underline{x}.$ Conversely, applying the upsampling operator ${f}_{{(}{2}\uparrow{)}}{(}\,{\cdot}\,{)}$ gives the result \[{f}_{{(}{2}\uparrow{)}}{(}\underline{z}{)} = \left({\begin{array}{cccccccccc}{1,}&{0,}&{3,}&{0,}&{5,}&{0,}&{7,}&{0,}&{9,}&{0}\end{array}}\right){.} \tag{6} \]
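As an illustration, the following minimal NumPy sketch (the function names are ours, chosen for clarity) implements both multirate operators and reproduces the example of (4)–(6).

```python
import numpy as np

def downsample(x, s=2):
    # Keep every sth sample, as in multirate filter banks, cf. (5).
    return x[::s]

def upsample(z, s=2):
    # Insert s - 1 zeros after every retained sample, cf. (6).
    y = np.zeros(len(z) * s, dtype=z.dtype)
    y[::s] = z
    return y

x = np.arange(1, 11)   # the signal of (4)
z = downsample(x)      # [1, 3, 5, 7, 9]
print(upsample(z))     # [1, 0, 3, 0, 5, 0, 7, 0, 9, 0]
```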
Additional operators used in the article are rectified linear units (ReLUs), shrinkage/thresholding, and clipping, which are represented by ${(}\,{\cdot}\,{)}_{+}$, ${\tau}_{{(}\,{\cdot}\,{)}}{(}\,{\cdot}\,{)}$, and ${\mathcal{C}}_{{(}\,{\cdot}\,{)}}{(}\,{\cdot}\,{),}$ respectively.
For better clarity, the most important symbols used in this article are listed in Table 1. In addition, graphical representations of some of the symbols that are used to graphically describe CNNs are shown in Figure 1.
Figure 1. Symbols used for the schematic representations of the CNNs addressed in this article.
Table 1. Relevant symbols used in this article.
In noise-reduction applications, the common additive signal model is defined by \[{\bf{y}} = {\bf{x}} + \mathbf{\eta} \tag{7} \] where the observed signal y is the result of contaminating a noiseless image x with additive noise $\mathbf{\eta}$. Assume that the noiseless signal x is to be estimated from the noisy observation y. In deep learning applications, this is often achieved by models with the form \[{\hat{\bf{x}}} = {G}{(}{\bf{y}}{)}{.} \tag{8} \]
Here, ${G}{(}{\cdot}{)}$ is a generic encoding-decoding CNN. We refer to this form of noise reduction as nonresidual. Alternatively, it is possible to find ${\hat{\bf{x}}}$ by training ${G}{(}{\cdot}{)}$ to estimate the noise component $\hat{\mathbf{\eta}}$ and subtracting it from the noisy image y to estimate the noiseless image ${\hat{\bf{x}}}$, or equivalently, \[{\hat{\bf{x}}} = {\bf{y}}{-}{G}{(}{\bf{y}}{)}{.} \tag{9} \]
This model is referred to as residual [5], [7], [21] because the output of the network is subtracted from its input. For reference, Figure 2 portrays the difference of the placement of the encoding-decoding structure in residual and nonresidual configurations.
Figure 2. Residual and nonresidual network configurations. Note that the main difference between both designs is the global skip connection occurring in the residual structure. Still, it can be observed that the network ${G}{(}\,{\cdot}\,{)}$ may contain skip connections internally.
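The two configurations can be condensed into a few lines of Python; here, G is a placeholder for any encoding-decoding network, so this is merely a sketch of the estimators in (8) and (9).

```python
def denoise(y, G, residual=False):
    """Estimate the noiseless image from the noisy observation y."""
    if residual:
        return y - G(y)   # G estimates the noise, which is subtracted, cf. (9)
    return G(y)           # G directly estimates the noiseless image, cf. (8)
```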
Encoding-decoding (convolutional) neural networks are rooted in techniques for data-dimensionality reduction and unsupervised feature extraction, where a given signal is mapped to an alternative space via a nonlinear transformation. This space should have properties that are somehow attractive for the considered task. For example, for dimensionality reduction, the alternative space should be lower dimensional than the original input. In this article, we are interested in models that are useful for noise-reduction applications. Specifically, this manuscript addresses models that are referred to as encoding-decoding CNNs, such as the model by Ranzato et al. [22], in which the encoder uses convolution filters to produce multichannel/redundant representations, in which sparsifying nonlinearities are applied. The sparsified signal is later mapped back to the original representation. Although the origins of encoding-decoding CNNs are linked to feature extraction, this type of architecture was quickly shown to be useful for other applications, such as noise reduction, which is the topic of this article. For the rest of this manuscript, whenever we mention an encoding-decoding CNN, we are referring to a design that follows the same basic principles as Ranzato’s design.
It can be observed that encoding-decoding CNNs consist of three main parts. The first is the encoder, which uses convolution layers to map the incoming image to a representation with more image channels. Every channel of the resulting redundant representation contains a fraction of the content of the original signal. It should be noted that the encoder often (but not necessarily) decreases the resolution of the higher-dimensional representation to enable multiresolution processing and to decrease the memory requirements of the design. The second main part is the decoder, which maps the multichannel representation back to the original space. The third main part is the nonlinearities, which suppress specific parts of the signal. In summary, the most basic encoding-decoding step in a CNN ${G}{(}{\cdot}{)}$ is expressed by \[{G}{(}{\bf{y}}{)} = {G}_{\text{dec}}{(}{G}_{\text{enc}}{(}{\bf{y}}{)}{)} \tag{10} \] where ${G}_{\text{enc}}{(}{\cdot}{)}$ is the encoder, which is generally defined by \begin{align*}{\bf{C}}_{0} & = {E}_{0}{(}{\bf{y}}{),} \\ {\bf{C}}_{1} & = {E}_{1}{(}{\bf{C}}_{0}{),} \\ {\bf{C}}_{2} & = {E}_{2}{(}{\bf{C}}_{1}{),} \\ & \,\,\vdots \\ {\bf{C}}_{{N}{-}{1}} & = {E}_{{N}{-}{1}}{(}{\bf{C}}_{{N}{-}{2}}{),} \\ {G}_{\text{enc}}{(}{\bf{y}}{)} & = {\bf{C}}_{{N}{-}{1}}{.} \tag{11} \end{align*}
Here, ${\bf{C}}_{n}$ represents the code generated by the nth encoder layer ${E}_{n}{(}\cdot{)}$, which can be expressed by \[{\bf{C}}_{n} = {E}_{n}{(}{\bf{C}}_{{n}{-}{1}}{)} = {f}_{{(}{s}\downarrow{)}}{(}{A}_{({\underline{b}}_{{n}{-}{1}})}{(}{\bf{K}}_{{n}{-}{1}}{\bf{C}}_{{n}{-}{1}}{)}{)}{.} \tag{12} \]
Here, the function ${A}_{{(}\cdot{)}}{(}\cdot{)}$ is a generic activation used in the encoder, and ${f}_{{(}{s}\downarrow{)}}{(}\cdot{)}$ is a downsampling function by factor s. Complementary to the encoder, the decoder network maps the multichannel sparse signal back to the original domain. Here, we define the decoder by \begin{align*}{\tilde{\bf{C}}}_{{N}{-}{2}} & = {D}_{{N}{-}{1}}{(}{\bf{C}}_{{N}{-}{1}}{),} \\ &\,\,\vdots \\ {\tilde{\bf{C}}}_{1} & = {D}_{2}{(}{\tilde{\bf{C}}}_{2}{),} \\ {\tilde{\bf{C}}}_{0} & = {D}_{1}{(}{\tilde{\bf{C}}}_{1}{),} \\ {G}{(}{\bf{y}}{)} & = {D}_{0}{(}{\tilde{\bf{C}}}_{0}{)} \tag{13} \end{align*} where ${\tilde{\bf{C}}}_{n}$ is the nth decoded signal, which is produced by the nth decoder layer, yielding the general expression \[{\tilde{\bf{C}}}_{{n}{-}{1}} = {D}_{n}{(}{\tilde{\bf{C}}}_{n}{)} = {\tilde{A}}_{(\underline{\tilde{b}})}{(}{\tilde{\bf{K}}}_{n}^{\top}{f}_{{(}{s}\uparrow{)}}{(}{\tilde{\bf{C}}}_{n}{)}{)}{.} \tag{14} \]
In (14), ${\tilde{A}}_{{(}\cdot{)}}{(}\cdot{)}$ is the activation function used in the decoder, and ${f}_{{(}{s}\uparrow{)}}{(}\cdot{)}$ is an upsampling function of factor s.
An important remark is that the encoder-decoder CNN does not always contain down-/upsampling layers, in which case the decimation factor s is unity, which causes ${f}_{{(}{1}\uparrow{)}}{(}{\bf{x}}{)} = {f}_{{(}{1}\downarrow{)}}{(}{\bf{x}}{)} = {\bf{x}}$ for any matrix ${\bf{x}}$. Furthermore, we assume that the number of channels of the code ${\bf{C}}_{n}$ is always larger than that of the previous code ${\bf{C}}_{{n}{-}{1}}$. Finally, it should be noted that a single encoder layer ${E}_{n}{(}\cdot{)}$ and its corresponding decoder layer ${D}_{n}{(}\cdot{)}$ can be considered a single-layer encoder-decoder network/pair.
For this article, the encoding convolution filter for a given layer ${\bf{K}}$ has dimensions of ${(}{N}_{o}\,{\times}\,{N}_{i}\,{\times}\,{N}_{h}\,{\times}\,{N}_{v}{),}$ where ${N}_{i}$ and ${N}_{o}$ are the number of input and output channels for a convolution layer, respectively. Similarly, ${N}_{h}$ and ${N}_{v}$ are the number of elements in the horizontal and vertical directions, respectively. Note that the encoder increases the number of channels of the signal (e.g., ${N}_{o}{>}{N}_{i}{),}$ akin to Ranzato’s design [22]. Furthermore, it is assumed that the decoder is symmetric in the number of channels to the encoder; therefore, the dimensions of the decoding convolution kernel ${\tilde{\bf{K}}}^{\top}$ are ${(}{N}_{i}\,{\times}\,{N}_{o}\,{\times}\,{N}_{h}\,{\times}\,{N}_{v}{)}$. The motivation for this symmetry is to emphasize the similarity between the signal processing and the CNN elements.
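To make (12) and (14) concrete, the sketch below implements a single encoder/decoder pair with NumPy and SciPy. The ReLU activation, the 'same' padding, and the channels-first layout are illustrative choices on our part, not prescriptions of the framework.

```python
import numpy as np
from scipy.signal import convolve2d

def encoder_layer(C, K, b, s=2):
    """One encoding step, cf. (12): downsample(A(KC + b))."""
    out = []
    for k_out, b_out in zip(K, b):                 # loop over the N_o output channels
        acc = sum(convolve2d(c, k, mode='same')    # sum over input channels, cf. (3)
                  for c, k in zip(C, k_out))
        out.append(np.maximum(acc + b_out, 0.0))   # ReLU as the generic activation A
    return np.stack(out)[:, ::s, ::s]              # decimation by a factor s

def decoder_layer(C, Kt, bt, s=2):
    """One decoding step, cf. (14): A~(K~^T upsample(C) + b~)."""
    up = np.zeros((C.shape[0], s * C.shape[1], s * C.shape[2]))
    up[:, ::s, ::s] = C                            # zero insertion, cf. (6)
    out = []
    for k_out, b_out in zip(Kt, bt):
        acc = sum(convolve2d(c, k, mode='same') for c, k in zip(up, k_out))
        out.append(np.maximum(acc + b_out, 0.0))
    return np.stack(out)
```

Here, K is expected in the ${(}{N}_{o}\,{\times}\,{N}_{i}\,{\times}\,{N}_{h}\,{\times}\,{N}_{v}{)}$ layout described above, and Kt in the transposed ${(}{N}_{i}\,{\times}\,{N}_{o}\,{\times}\,{N}_{h}\,{\times}\,{N}_{v}{)}$ layout.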
As shown by Ye et al. [12], within encoding-decoding CNNs, the signal is treated akin to well-known sparse representations, where the coefficients used for the transformation are directly learned from the training data. Prior to addressing this important concept in more detail, relevant supporting concepts such as sparsity, sparse transformations, and nonlinear signal estimation in the wavelet domain are explained.
A sparse image is a signal where most of the coefficients are small and the relatively few large coefficients capture most of the information [23]. This characteristic makes it possible to discard low-amplitude components with relatively small perceptual changes. Hence, the use of sparse signals is attractive for applications such as image compression, denoising, and suppression of artifacts.
Despite the convenient characteristics of sparse signals, natural images are often nonsparse. Still, there are numerous transformations that allow for mapping the signal to a sparse domain and that are analogous to the internal operations of CNNs. For example, SVD factorizes the image in terms of two sets of orthogonal bases, of which few basis pairs contain most of the energy of the image. An alternative transformation is based on framelets, where an image is decomposed in a multichannel representation, whereby each resulting channel contains a fragment of the Fourier spectrum. In the following sections, we address all of these representations in more detail.
Assume that an image (patch) is represented by a matrix ${\bf{y}}$ with dimensions of ${(}{N}_{r}\,{\times}\,{N}_{c}{),}$ where ${N}_{r}$ and ${N}_{c}$ are the number of rows and columns, respectively. Then, the SVD factorizes ${\bf{y}}$ as \[{\bf{y}} = \mathop{\sum}\limits_{{n} = {0}}^{{N}_{\text{SV}}{-}{1}}{({\underline{u}}_{n}{\underline{v}}_{n}^{\top})}\cdot\underline{\sigma}{[}{n}{]} \tag{15} \] in which ${N}_{\text{SV}}$ is the number of singular values, n is a scalar index, while ${\underline{u}}_{n}$ and ${\underline{v}}_{n}$ are the nth left and right singular vectors, respectively. Furthermore, vector $\underline{\sigma}$ contains the singular values, and each of its entries $\underline{\sigma}[n]$ is the weight assigned to every basis pair ${\underline{u}}_{n},{\underline{v}}_{n}$. This means that the product $({\underline{u}}_{n}{\underline{v}}_{n}^{\top})$ contributes more to the image content for higher values of $\underline{\sigma}{[}{n}{]}$. It is customary for the singular values to be ranked in descending order and for the amplitudes of the singular values $\underline{\sigma}$ to be sparse; therefore, $\underline{\sigma}{[}{0}{]}\gg\underline{\sigma}{[}{N}_{\text{SV}}{-}{1}{]}$. The reason for this sparsity is that image patches intrinsically have high correlation. For example, many images contain repetitive patterns (e.g., a wall with bricks, a fence, rooftop tiles, or a zebra’s stripes) or uniform regions (for example, the sky or a person’s skin). This means that an image patch may contain only a few linearly independent vectors that describe most of the image’s content. Consequently, a higher weight is assigned to such image bases.
Given that the amplitudes of the singular values of ${\bf{y}}$ in SVD are sparse, it is possible to approximate ${\bf{y}}$ with only a few basis pairs ${(}{\underline{u}}_{n}{\underline{v}}_{n}^{\top}{)}$. Note that this procedure reduces the rank of signal ${\bf{y}}$, and hence it is known as low-rank approximation. This process is equivalent to \[\hat{\bf{y}} = \mathop{\sum}\limits_{{n} = {0}}^{{N}_{\text{LR}}{-}{1}}{({\underline{u}}_{n}{\underline{v}}_{n}^{\top})}\,{\cdot}\,\underline{\sigma}{[}{n}{]} \tag{16} \] where ${N}_{\text{SV}}{>}{N}_{\text{LR}}$. Note that this effectively cancels the products $({\underline{u}}_{n}{\underline{v}}_{n}^{\top})$ for which the weight given by $\underline{\sigma}{[}{n}{]}$ is low. Alternatively, it is possible to assign a weight of zero to the product $({\underline{u}}_{n}{\underline{v}}_{n}^{\top})$ for ${n}\geq{N}_{\text{LR}}$.
The low-rank representation of a matrix is desirable for diverse applications, among which is image denoising. The motivation for using low-rank approximation for this application results from the fact that, as mentioned earlier, natural images are considered low rank due to the strong spatial correlation between pixels, whereas noise is high rank (it is spatially uncorrelated). As a consequence, reducing the rank/number of singular values decreases the presence of noise while still providing a good approximation of the noise-free signal, as exemplified in Figure 3.
Figure 3. SVD reconstruction of clean and corrupted images with a different number of singular values. Note that reconstruction of the clean image with eight or 32 singular values ${(}{N}_{\text{SV}} = {8}$ or ${N}_{\text{SV}} = {32}$, respectively) yields reconstructions indistinguishable from the original image. This contrasts with their noisy counterparts, where ${N}_{\text{SV}} = {8}$ reconstructs a smoother image in which the noise is attenuated, while ${N}_{\text{SV}} = {32}$ reconstructs the noise texture perfectly.
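A short NumPy experiment reproduces the essence of Figure 3; the rank-1 test image and the noise level are arbitrary choices for illustration.

```python
import numpy as np

def low_rank(y, n_lr):
    """Truncated-SVD approximation of an image, cf. (16)."""
    u, s, vt = np.linalg.svd(y, full_matrices=False)
    return (u[:, :n_lr] * s[:n_lr]) @ vt[:n_lr]

rng = np.random.default_rng(0)
x = np.outer(np.sin(np.linspace(0, 3, 64)), np.cos(np.linspace(0, 3, 64)))  # rank-1 image
y = x + 0.1 * rng.standard_normal(x.shape)                                  # noisy observation
print(np.linalg.norm(low_rank(y, 4) - x) < np.linalg.norm(y - x))           # True: noise is reduced
```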
Just as with SVD, framelets are also commonly used for image processing. In a nutshell, a framelet transform is a signal representation that factorizes/decomposes an arbitrary signal into multiple bands/channels. Each of these channels contains a segment of the energy of the original signal. In image and signal processing, the framelet bands are the result of convolving the analyzed signal with a group of discrete filters that have finite length/support. In this article, the most important characteristic that the filters of the framelet transform should comply with is that the bands they generate capture all the energy contained in the input to the decomposition. This is important to avoid the loss of information of the decomposed signal. In this text, we refer to framelets that comply with the previous characteristics as tight framelets, and the following paragraphs describe this property in more detail.
In its decimated version, the framelet decomposition for tight frames is represented by \[{\bf{Y}}_{\text{fram}} = {f}_{{(}{2}\downarrow{)}}{(}{\bf{Fy}}{)} \tag{17} \] in which ${\bf{Y}}_{\text{fram}}$ is the decomposed signal, and ${\bf{F}}$ is the framelet basis (tensor). Note that the signal ${\bf{Y}}_{\text{fram}}$ has more channels than ${\bf{y}}$. Furthermore, the original signal ${\bf{y}}$ is recovered from ${\bf{Y}}_{\text{fram}}$ by \[{\bf{y}} = {\tilde{\bf{F}}}^{\top}{f}_{{(}{2}\uparrow{)}}{(}{\bf{Y}}_{\text{fram}}{)}{\cdot}{c}{.} \tag{18} \]
Here, $\tilde{\bf{F}}$ is the filter of the inverse framelet transform and c denotes an arbitrary constant. If ${c} = {1}$, the framelet is normalized. Finally, note that the framelet transform can also be undecimated. This means that in undecimated representations, the downsampling and upsampling layers, ${f}_{{(}{2}\downarrow{)}}{(}\,{\cdot}\,{)}$ and ${f}_{{(}{2}\uparrow{)}}{(}\,{\cdot}\,{),}$ are not used. An important property of the undecimated representation is that it is less prone to aliasing than its decimated counterpart, but more computationally expensive. Therefore, for efficiency reasons, the decimated framelet decomposition is often preferred over the undecimated representation. In summary, the decomposition and synthesis of the decimated framelet decomposition is represented by \[{\bf{y}} = {\tilde{\bf{F}}}^{\top}{f}_{{(}{2}\uparrow{)}}{(}{f}_{{(}{2}\downarrow{)}}{(}{\bf{Fy}}{))}\,{\cdot}\,{c} \tag{19} \] while for the undecimated framelet it holds that \[{\bf{y}} = {\tilde{\bf{F}}}^{\top}{(}{\bf{Fy}}{)}{\cdot}{c}{.} \tag{20} \]
A notable normalized framelet is the discrete wavelet transform (DWT), where variables ${\bf{F}}$ and $\tilde{\bf{F}}$ are replaced by tensors ${\bf{W}} = \left({\begin{array}{c}{{\bf{w}}_{\text{LL}},{\bf{w}}_{\text{LH}},{\bf{w}}_{\text{HL}},{\bf{w}}_{\text{HH}}}\end{array}}\right)$ and $\tilde{\bf{W}} = \left({\begin{array}{c}{{\tilde{\bf{w}}}_{\text{LL}},{\tilde{\bf{w}}}_{\text{LH}},{\tilde{\bf{w}}}_{\text{HL}},{\tilde{\bf{w}}}_{\text{HH}}}\end{array}}\right)$, respectively. Here, ${\bf{w}}_{\text{LL}}$ is the filter for the low-frequency band, while ${\bf{w}}_{\text{LH}},{\bf{w}}_{\text{HL}},{\bf{w}}_{\text{HH}}$ are the filters used to extract the detail in the horizontal, vertical, and diagonal directions, respectively. Finally, ${\tilde{\bf{w}}}_{\text{LL}},{\tilde{\bf{w}}}_{\text{LH}},{\tilde{\bf{w}}}_{\text{HL}},{\tilde{\bf{w}}}_{\text{HH}}$ are the filters of the inverse decimated DWT.
To understand the DWT more intuitively, Figure 4 shows the decimated framelet decomposition using the filters of the DWT. Note that the convolution ${\bf{Wy}}$ results in a four-channel signal, where each channel contains only a fraction of the spectrum of image ${\bf{y}}$. This allows for downsampling of each channel with minimal aliasing. Furthermore, to recover the original signal, each individual channel is upsampled, thereby introducing aliasing, which is then removed by the filters of the inverse transform. Finally, all the channels are added and the original signal is recovered.
Figure 4. 2D spectrum analysis of the decimated discrete framelet decomposition and reconstruction. In the figure, function ${\mathcal{F}}\left\{{\cdot}\right\}$ stands for the amplitude Fourier spectrum of the input argument. The yellow squares indicate a region in the low-frequency area of the Fourier spectrum, while the orange, purple, and blue squares indicate the high-pass/detail bands. For these images, ideal orthogonal bases are assumed. Note that the forward transform is composed of two steps. First, the signal is convolved with the wavelet basis $\left({\bf{Wy}}\right)$. Afterward, downsampling is applied to the signal $\left({{f}_{{(}{2}\downarrow{)}}({\bf{Wy}})}\right)$. During the inverse transformation, the signal is upsampled by inserting zeros between each sample $\left({{f}_{{(}{2}\uparrow{)}}({f}_{{(}{2}\downarrow{)}}({\bf{Wy}}))}\right)$, which causes spatial aliasing (dashed blocks). Finally, the spatial aliasing is removed by the inverse transform filter ${\tilde{\bf{W}}}$ and all the channels are added $\left({\tilde{\bf{W}}}^{\top}{f}_{{(}{2}\uparrow{)}}{(}{f}_{{(}{2}\downarrow{)}}{(}{\bf{Wy}}{)}{)}\right)$.
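For the Haar filters, the decimated DWT of Figure 4 reduces to 2 × 2 block sums and differences. The following sketch (our own implementation, with the normalization c = 1) verifies the perfect-reconstruction property of (19).

```python
import numpy as np

def haar2(y):
    """One-level decimated 2D Haar DWT, cf. (17)."""
    a, b = y[0::2, 0::2], y[0::2, 1::2]
    c, d = y[1::2, 0::2], y[1::2, 1::2]
    return ((a + b + c + d) / 2,   # LL: approximation band
            (a - b + c - d) / 2,   # LH: horizontal detail
            (a + b - c - d) / 2,   # HL: vertical detail
            (a - b - c + d) / 2)   # HH: diagonal detail

def ihaar2(ll, lh, hl, hh):
    """Inverse transform, cf. (18): upsample, filter, and add the bands."""
    y = np.zeros((2 * ll.shape[0], 2 * ll.shape[1]))
    y[0::2, 0::2] = (ll + lh + hl + hh) / 2
    y[0::2, 1::2] = (ll - lh + hl - hh) / 2
    y[1::2, 0::2] = (ll + lh - hl - hh) / 2
    y[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return y

y = np.random.default_rng(1).standard_normal((8, 8))
print(np.allclose(ihaar2(*haar2(y)), y))   # True: perfect reconstruction, cf. (19)
```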
Analogous to the low-rank approximation, in framelets, the reduction of noise is achieved by setting the noisy components to zero. These components are typically assumed to have low amplitude when compared to the amplitude of the sparse signal, as expressed by \[{\tilde{\bf{y}}} = {\tilde{\bf{F}}}^{\top}{f}_{{(}{2}\uparrow{)}}{(}{\tau}_{(t)}{(}{f}_{{(}{2}\downarrow{)}}{(}{\bf{Fy}}{)))}\,{\cdot}\,{c} \tag{21} \] where ${\tau}_{{(}{t}{)}}{(}\,{\cdot}\,{)}$ is a generic thresholding/shrinkage function, which sets each of the pixels in ${f}_{{(}{2}\downarrow{)}}{(}{\bf{Fy}}{)}$ to zero when its value is lower than the threshold level t.
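Reusing haar2 and ihaar2 from the previous sketch, (21) takes only a few lines; thresholding the detail bands with a soft threshold while leaving the approximation band untouched is a common choice that we assume here.

```python
import numpy as np

def soft(z, t):
    # Soft threshold: shrink the coefficients toward zero, cf. (31) further on.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def wavelet_denoise(y, t):
    """Shrinkage in the framelet domain, cf. (21), using haar2/ihaar2 from above."""
    ll, lh, hl, hh = haar2(y)
    return ihaar2(ll, soft(lh, t), soft(hl, t), soft(hh, t))
```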
As mentioned in the “Framelets” section, framelets decompose a given image ${\bf{y}}$ by convolving it with a tensor ${\bf{F}}$. Note that many of the filters that compose ${\bf{F}}$ have a high-pass nature. Images often contain approximately uniform regions in which the variation is low; therefore, convolving a signal ${\bf{y}}$ with a high-pass filter ${\bf{f}}_{h}\,{\in}\,{\bf{F}}$ produces the sparse detail band ${\bf{d}} = {\bf{f}}_{h}\ast{\bf{y}}$, in which uniform regions have low amplitudes, while transitions, i.e., edges, contain most of the energy of the band.
Assume a model in which a single pixel ${d}\,{\in}\,{\bf{d}}$ is observed, which is contaminated with additive noise ${\eta}$. Then, the resulting observed pixel z is defined by \[{z} = {d} + {\eta}{.} \tag{22} \]
To recover the noiseless pixel d from observation z, it is possible to use the point-maximum a posteriori (MAP) estimate [1], [24], defined by the maximization problem \[\hat{d} = \mathop{\text{argmax}}\limits_{d}{[}\ln{(}{P}{(}{d}{\vert}{z}{))].} \tag{23} \]
Here, the log-posterior $\ln{(}{P}{(}{d}\mid{z}{))}$ is defined by \[\ln{(}{P}{(}{d}\mid{z}{))} = \ln{(}{P}{(}{z}\mid{d}{))} + \ln{(}{P}{(}{d}{))} \tag{24} \] where the conditional probability density function (PDF) ${P}{(}{z}\mid{d}{)}$ expresses the noise distribution, which is often assumed Gaussian and defined by \[{P}{(}{z}{\vert}{d}{)}\propto\exp\left({{-}\frac{{(}{z}{-}{d}{)}^{2}}{2{\sigma}_{\eta}^{2}}}\right){.} \tag{25} \]
Here, ${\sigma}_{\eta}^{2}$ is the noise variance. Furthermore, as prior probability, it is assumed that the distribution of ${P}{(}{d}{)}$ corresponds to a Laplacian distribution, which has been used in wavelet-based denoising [1]. Therefore, ${P}{(}{d}{)}$ is mathematically described by \[{P}{(}{d}{)}\propto\exp\left({{-}\frac{\mid{d}\mid}{{\sigma}_{d}}}\right), \tag{26} \] where ${\sigma}_{d}$ is the dispersion measure of the Laplace distribution. For reference, Figure 5 portrays an example of both a Gaussian and a Laplacian PDF. Note that the Laplacian distribution has a higher probability of zero elements occurring than the Gaussian distribution for the same standard deviation. Finally, substituting (25) and (26) in (24) results in \[\ln{(}{P}{(}{d}{\vert}{z}{))}\propto{-}\frac{{(}{z}{-}{d}{)}^{2}}{2{\sigma}_{\eta}^{2}}{-}\frac{{\vert}{d}{\vert}}{{\sigma}_{d}}{.} \tag{27} \]
Figure 5. The probability density function for (a) Gaussian and (b) Laplacian distributions.
In (27), maximizing $\ln{(}{P}{(}{d}{\vert}{z}{))}$ over d with the first-derivative criterion, in a constrained or unconstrained way, leads to two common activations in noise-reduction CNNs: the ReLU and the soft-shrinkage function. Furthermore, the solution can also be used to derive the so-called clipping function, which is useful in residual networks.
For reference and further understanding, Figure 6 portrays the elements composing the noise model of (22), signal-transfer characteristics of the ReLU, soft-shrinkage and clipping functions, and the effect that these functions have on the signal of the observed noisy detail band z.
Figure 6. Signals involved in the additive noise model, input/output transfer characteristics of activation layers and estimates produced by the activation layers when applied to the noise-contaminated signal. (a) The signals involved in the additive noise model. (b) The output amplitude of activation functions with respect to the input amplitude. (c) Finally, the application of the activation functions to the noisy observation $z.$
If (27) is solved for d while constraining the estimator to be positive, the noiseless estimate $\hat{d}$ becomes \[{\hat{d}} = {(}{z}{-}{t}{)}_{+} \tag{28} \] which is also expressed by \[{(}{z}{-}{t}{)}_{+} = \begin{cases}\begin{array}{ll}{z}{-}{t},&{\text{if}}\,{z}\geq{t},\\{0},&{\text{if}}\,{t}{>}{z}{.}\end{array}\end{cases} \tag{29} \]
Here, the threshold level is defined by \[{t} = \frac{{\sigma}_{\eta}^{2}}{{\sigma}_{d}}{.} \tag{30} \]
Note that this estimator cancels the negative elements and the elements of z with amplitude lower than the threshold level t. For example, if the signal content on the feature map is low, then ${\sigma}_{d}\rightarrow{0}$. In such a case, ${t}\rightarrow + \infty$ and, consequently, $\hat{d}\rightarrow{0}$. This means that the channel is suppressed. Alternatively, if the feature map has a strong signal presence, i.e., ${\sigma}_{d}\rightarrow\infty$, then ${t}\rightarrow{0}$ and $\hat{d}\rightarrow{(}{z}{)}_{+}$.
A final remark is made on the modeling of functions of a CNN. It should be noted that the estimator of (28) is analogous to the activation function of a CNN, known as an ReLU. However, in a CNN, the value of t would be the bias b learned from the training data.
If (27) is maximized in an unconstrained way, the estimate $\hat{d}$ is \[{\hat{d}} = {\tau}_{(t)}^{\text{Soft}}{(}{z}{)} = {(}{z}{-}{t}{)}_{+}{-}{(}{-}{z}{-}{t}{)}_{+}{.} \tag{31} \]
Here, ${\tau}_{(t)}^{\text{Soft}}{(}\,{\cdot}\,{)}$ denotes the soft-shrinkage/-thresholding function, which is often also written in the form \[{\tau}_{(t)}^{\text{Soft}}{(}{z}{)} = \begin{cases}\begin{array}{ll}{{z}{-}{t}},&{{\text{if}}\,{z}\geq{t},}\\{0,}&{{\text{if}}\,{t}{>}{z}\geq{-}{t},}\\{{z} + {t},}&{{\text{if}}\,{-}{t}{>}{z}{.}}\end{array}\end{cases} \tag{32} \]
It can be observed that the soft threshold forces the low-amplitude components, whose magnitude is lower than the threshold level t, to zero. In this case, t is also defined by (30). It should be noted that the soft-shrinkage estimator can also be obtained from a variational perspective [25]. Finally, it can be observed that soft shrinkage is the superposition of two ReLU functions, which has been pointed out by Fan et al. [18].
In the “ReLU” and “Soft Shrinkage/Thresholding” sections, the estimate $\hat{d}$ is obtained directly from the noisy observation z. Alternatively, it is possible to estimate the noise ${\eta}$ and subtract it from z, akin to the residual CNNs represented by (9). This can be achieved by solving the model \[{\hat{\eta}} = {z}{-}\hat{d} = {z}{-}{\tau}_{(t)}^{\text{Soft}}{(}{z}{)} \tag{33} \] which is equivalent to \[\hat{\eta} = {\mathcal{C}}_{(t)}^{\text{Soft}}{(}{z}{)} = {z}{-}{((}{z}{-}{t}{)}_{+}{-}{(}{-}{z}{-}{t}{)}_{+}{)} \tag{34} \] where ${\mathcal{C}}_{(t)}^{\text{Soft}}{(}\,{\cdot}\,{)}$ is the soft-clipping function. Note that this function also can be expressed by \begin{align*}{\mathcal{C}}_{(t)}^{\text{Soft}}{(}{z}{)} = \begin{cases}\begin{array}{ll}{t,}&{{\text{if}}\,{z}\geq{t},}\\{z,}&{{\text{if}}\,{t}\geq{z}{>}{-}{t},}\\{{-}{t},}&{{\text{if}}\,{-}{t}\geq{z}{.}}\end{array}\end{cases} \tag{35} \end{align*}
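All three estimators are elementwise operations; the sketch below verifies that the soft shrinkage of (31) and the soft clipping of (34) split the observation z into complementary signal and noise estimates.

```python
import numpy as np

def relu(z, t):
    """Positivity-constrained MAP estimate, cf. (28) and (29)."""
    return np.maximum(z - t, 0.0)

def soft_shrink(z, t):
    """Soft shrinkage as the superposition of two ReLUs, cf. (31)."""
    return relu(z, t) - relu(-z, t)

def soft_clip(z, t):
    """Soft clipping: the residual noise estimate, cf. (34)."""
    return z - soft_shrink(z, t)

z = np.linspace(-3.0, 3.0, 13)
print(soft_shrink(z, 1.0))                                      # samples with |z| < 1 become zero
print(np.allclose(soft_shrink(z, 1.0) + soft_clip(z, 1.0), z))  # True: the estimates sum to z
```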
One of the main drawbacks of the soft-threshold activation is that it is a biased estimator. This limitation has been addressed by the hard and semihard thresholds, which are (asymptotically) unbiased estimators for large input values. In this section, we focus solely on the semihard threshold and avoid the hard variant because it is discontinuous and therefore not suited to models that rely on gradient-based optimization, such as CNNs.
Among the semihard thresholds, two notable examples are the garrote shrink and the shrinkage functions generated by derivatives of Gaussians (DoGs) [19], [26]. The garrote shrink function ${\tau}_{{(}\,{\cdot}\,{)}}^{\text{Gar}}{(}\,{\cdot}\,{)}$ is defined by \[{\tau}_{(t)}^{\text{Gar}}{(}{z}{)} = \frac{{(}{z}^{2}{-}{t}^{2}{)}_{+}}{z}{.} \tag{36} \]
Furthermore, an example of a shrinkage function based on the DoG is given by \[{\tau}_{(t)}^{\text{DoG}}{(}{z}{)} = {z}{-}{\mathcal{C}}_{(t)}^{\text{DoG}}{(}{z}{)} \tag{37} \] where the semihard clipping function with the DoG ${\mathcal{C}}_{{(}\,{\cdot}\,{)}}^{\text{DoG}}{(}\,{\cdot}\,{)}$ is given by \[{\mathcal{C}}_{(t)}^{\text{DoG}}{(}{z}{)} = {z}{\cdot}{\text{exp}}\left({{-}\frac{{z}^{p}}{{t}^{p}}}\right) \tag{38} \] in which p is an even number.
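Both semihard functions are again simple elementwise maps; in this sketch, the guard against division by zero in the garrote and the default p = 2 are our own choices.

```python
import numpy as np

def garrote_shrink(z, t):
    """Garrote shrinkage, cf. (36); the z == 0 guard avoids a 0/0 division."""
    return np.maximum(z**2 - t**2, 0.0) / np.where(z == 0.0, 1.0, z)

def dog_clip(z, t, p=2):
    """Semihard clipping based on a derivative of Gaussian, cf. (38)."""
    return z * np.exp(-(z**p) / (t**p))

def dog_shrink(z, t, p=2):
    """Semihard DoG shrinkage, cf. (37)."""
    return z - dog_clip(z, t, p)
```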
The garrote and semihard DoG shrinkage functions are shown in Figure 7 as well as their clipping counterparts. Note that the shrinkage functions approach the identity for ${|}{z}{|}\gg{t}$; therefore, they are asymptotically unbiased for large signal values.
Figure 7. Transfer characteristics of the semihard thresholds based on the difference of Gaussians and of the garrote shrink as well as their clipping counterparts. Note that in contrast with the soft-shrinkage and clipping functions shown in Figure 6, the semihard thresholds approach the identity for large input values, while the semihard clipping functions tend to zero for large signal intensities.
The final thresholding function addressed in this section is the linear expansion of thresholds (LET) proposed by Blu and Luisier [26]. This technique combines multiple thresholding functions to improve performance and is defined by \[{\tau}_{{(}\underline{t}{)}}^{\text{LET}}{(}{z}{)} = \mathop{\sum}\limits_{{n} = {0}}\limits^{{N}_{T}{-}{1}}{{a}_{n}}{\cdot}{\tau}_{({t}_{n})}{(}{z}{)} \tag{39} \] where ${a}_{n}$ is the weighting factor assigned to the nth thresholding function; all weighting factors should add up to unity.
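A LET is then simply a weighted combination of such functions. The sketch below mixes the soft and garrote shrinks defined in the previous sketches, with illustrative weights.

```python
import numpy as np

def let_shrink(z, shrinks, thresholds, weights):
    """Linear expansion of thresholds, cf. (39)."""
    assert np.isclose(np.sum(weights), 1.0)   # the weights should add up to unity
    return sum(a * tau(z, t) for a, tau, t in zip(weights, shrinks, thresholds))

z = np.linspace(-3.0, 3.0, 13)
z_hat = let_shrink(z, [soft_shrink, garrote_shrink], [1.0, 1.0], [0.4, 0.6])
```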
The next sections address the theoretical operation of noise-reduction CNNs based on ReLUs and shrinkage/thresholding functions. The first part describes the TDCFs [12], which is the most extensive study on the operation of encoding-decoding ReLU-based CNNs up to now. Afterward, we focus on the operation of networks that use shrinkage functions instead of ReLUs [17], [18], [19], with the aim of mimicking well-established denoising algorithms [1]. Finally, the last part addresses the connections between both methods and additional links between CNNs and signal processing.
The TDCFs [12] describes the operation of encoding-decoding ReLU-based CNNs. Its most relevant contributions are 1) to establish the equivalence between framelets and the convolutional layers of CNNs, 2) to provide conditions that preserve the signal integrity within an ReLU CNN, and 3) to explain how ReLUs and convolution layers reduce noise within an encoding-decoding CNN.
The similarity between framelets and the encoding and decoding convolutional filters can be observed when comparing (12) and (14) with (17) and (18), where it becomes visible that the convolution structure of encoding-decoding CNNs is analogous to the forward and inverse framelet decomposition.
Regarding the signal reconstruction characteristics, the TDCFs [12] states the following. First, to be able to recover an arbitrary signal ${y}\,{\in}\,{\mathbb{R}}^{N}$, the number of output channels of a convolution layer with ReLU activation should be at least double the number of input channels. Second, the encoding convolution kernel ${\bf{K}}$ should be composed of pairs of filters with opposite phases. These two requirements ensure that any negative and positive values propagate through the network. Under these conditions, the encoding and decoding convolution filters, ${\bf{K}}$ and ${\tilde{\bf{K}}}$, respectively, should comply with \[{\bf{y}} = {\tilde{\bf{K}}}^{\top}{(}{\bf{Ky}}{)}_{+}{\cdot}{c}{.} \tag{40} \]
It can be noted that (40) is an extension of (20), which describes the reconstruction characteristics of tight framelets. From this point on, we refer to convolutional kernels compliant with (40) as phase-complementary tight framelets. As a final remark, it should be noted that a common practice in CNN designs is to also use ReLU nonlinearities in the decoder. In such a case, the phase-complementary tight-framelet condition can still be met as long as the pixels ${y}\,{\in}\,{\bf{y}}$ comply with ${y}\geq{0}$, which is equivalent to \[{\bf{y}} = {(}{\bf{y}}{)}_{+} = {(}{\tilde{\bf{K}}}^{\top}{(}{\bf{Ky}}{)}_{+}{\cdot}{c}{)}_{+}{.} \tag{41} \]
It can be observed that the relevance of the properties defined in (40) and (41) is that they ensure that a CNN can propagate any arbitrary signal, which is important to avoid any distortions (such as image blur) in the processed images.
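The phase-complementation principle of (40) can be verified numerically. In the minimal case, the framelet is the identity, so ${\bf{K}} = {(}{I}\;{-}{I}{)}^{\top}$ doubles the channels, and the ReLU pair transmits both signal polarities; by linearity, the same argument extends to any tight framelet.

```python
import numpy as np

y = np.random.default_rng(2).standard_normal((8, 8))
# K = (I, -I)^T maps y to the channel pair (y, -y); after the ReLU, the decoder
# K~^T = (I, -I) recombines the two channels so no sign information is lost, cf. (40).
x_hat = np.maximum(y, 0.0) - np.maximum(-y, 0.0)
print(np.allclose(x_hat, y))   # True: the ReLU pair propagates any signal
```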
An additional element of the TDCFs regarding reconstruction of the signal is to show that conventional pooling layers (e.g., average pooling) can discard high-frequency information of the signal, which effectively blurs the processed signals. Furthermore, Ye et al. [12] have demonstrated that this can be fixed by replacing the conventional up-/downsampling layers by reversible operations, such as the DWT. To exemplify this property, we refer to Figure 4. If only an average pooling layer followed by an upsampling stage were to be applied, the treatment of the signal would be equivalent to the low-frequency branch of the DWT. Consequently, only the low-frequency spectrum of the signal would be recovered and the images processed with that structure would become blurred. In contrast, if the full-forward and inverse wavelet transform of Figure 4 is used for up and downsampling, it is possible to reconstruct any signal, irrespective of its frequency content.
The ultimate key contribution of the TDCFs is its explanation of the operation of ReLU-based noise-reduction CNNs. For a nonresidual configuration, ReLU CNNs perform the following operations. 1) The convolution filters decompose the incoming signal into a sparse multichannel representation. 2) The feature maps that are uncorrelated with the signal contain mainly noise. In this case, the bias and the ReLU activation cancel the noisy feature maps in a process analogous to the MAP estimate shown in the “ReLU” section. 3) The decoder reconstructs the filtered image. Note that this process is analogous to the low-rank decomposition described in the “SVD and Low-Rank Approximation” section. In the case of residual networks, the CNN learns to estimate the noise, which means that in that configuration the ReLU nonlinearities suppress the channels with high activation.
A visual example of low-rank approximation in ReLU CNNs is shown in Figure 8, illustrating the operation of an idealized single-layer encoding-decoding ReLU CNN operating in both a residual and a nonresidual way. It can be noted that the ReLU activation suppresses specific channels in the sparse decomposition provided by the encoder, thereby preserving the low-rank structures in the nonresidual network. Alternatively, in the residual example, the ReLUs eliminate the feature maps with high activation, which results in a noise estimate that is subtracted from the input to estimate the noiseless signal.
Figure 8. Operation of a simplified denoising (non)residual ReLU CNN according to the TDCFs. In the figure, the noisy observation ${\bf{y}}$ is composed of two vertical bars plus uncorrelated Gaussian noise. Furthermore, for this example, the encoding and decoding convolution filters (${\bf{K}}$ and ${\tilde{\bf{K}}}$, respectively) are the Haar basis of the 2D DWT and its phase-inverted counterparts. Given the content of the image, the image in the decomposed domain ${\bf{Ky}}$ produces only a weak activation for the vertical and diagonal filters (${\bf{w}}_{\text{LH}}$ and ${\bf{w}}_{\text{HH}}$, respectively), and those feature maps contain mainly noise. In the case of the nonresidual network, the ReLUs and biases suppress the channels with low activation [see the ${(}{\bf{Ky}} + \underline{b}{)}_{+}$ column], which is akin to the low-rank approximation. In contrast, in the residual example, the channels with image content are suppressed while preserving the uncorrelated noise. Finally, the decoding section reconstructs the noise-free estimate ${\hat{\bf{x}}}$ for the nonresidual network or the noise estimate $\hat{\mathbf{\eta}}$ for the residual example, where it is subtracted from ${\bf{y}}$ to compute the noiseless estimate ${\hat{\bf{x}}}.$
As in ReLU networks, the encoder of shrinkage networks [17], [18], [19] separates the input signal into a multichannel representation. As a second processing stage, shrinkage networks estimate the noiseless encoded signal by canceling the low-amplitude pixels in the feature maps in a process akin to the MAP estimate of the “Soft Shrinkage/Thresholding” section. As a final step, the decoder reconstructs the estimated noiseless image. Note that the use of shrinkage functions reduces the number of channels required for perfect signal reconstruction compared to ReLU counterparts because the shrinkage activation preserves positive and negative values, while ReLUs preserve only the positive part of the signal.
As shown in the “Signal Model and Noise-Reduction Configurations” section, in residual learning, a given encoding-decoding network estimates the noise signal $\mathbf{\eta}$ so that it can be subtracted from the noisy observation y to generate the noiseless estimate ${\hat{\bf{x}}}.$ As shown in the “Soft Clipping” section, in the framelet domain, this is achieved by preserving the low-amplitude values of the feature maps by clipping the signal. Therefore, in residual networks, the shrinkage functions can be explicitly replaced by clipping activations.
Visual examples of the operation of a single-layer shrinkage and of clipping networks are presented in Figure 9, where it can be noted that the operation of shrinkage and clipping networks is analogous to their ReLU counterparts, with the main difference being that shrinkage and clipping networks do not require phase complements in the encoding and decoding layers as ReLU-based CNNs do.
Figure 9. Operation of denoising in shrinkage and clipping networks. In the nonresidual configuration, the noisy signal ${\bf{y}}$ is decomposed by a set of convolution filters, which, for this example, are the 2D Haar basis functions of the DWT ${(}{\bf{Ky}}{)}$. As a second step, the semihard shrinkage produces an MAP estimate of the noiseless detail bands/feature maps $\left({{\tau}{}_{{(}\underline{b}{)}}^{\text{DoG}}{(}{\bf{Ky}}{)}}\right)$. As a third and final step, the decoder maps the estimated noiseless encoded signal to the original image domain. In the residual network, the behavior is similar, but the activation layer is a clipping function that performs an MAP estimate of the noise in the feature maps, which is reconstructed by the decoder to generate the noise estimate $\hat{\mathbf{\eta}}.$ After reconstruction, the noise estimate is subtracted from the noisy observation ${\bf{y}}$ to generate the noise-free estimate ${\hat{\bf{x}}}.$
As addressed in the “Nonlinear Signal Estimation in the Framelet Domain” section, the soft-threshold function is the superposition of two ReLU activations. As a consequence, it is feasible that in ReLU CNNs, shrinkage behavior could arise in addition to the low-rankness enforcement mentioned in the “TDCFs” section. It should be noted that this can only happen if the number of channels of the encoder and decoder complies with the redundancy constraints of the TDCFs, and if the decoder is linear. To prove this, (31) is reparameterized as \[\hat{\bf{d}} = {\tilde{\bf{K}}}^{\top}{(}{\bf{Ky}} + {\underline{b}}{)}_{+} \tag{42} \] where the convolution filters ${\bf{K}}$ and ${\tilde{\bf{K}}}^{\top}$ are defined by ${\bf{K}} = {\left({\begin{array}{cc}{I}&{{-}{I}}\end{array}}\right)}^{\top}$ and ${\tilde{\bf{K}}}^{\top} = \left({\begin{array}{cc}{I}&{{-}{I}}\end{array}}\right)$, respectively, and $\underline{b} = {\left({\begin{array}{cc}{{-}{t}}&{{-}{t}}\end{array}}\right)}^{\top}$ represents the threshold value.
In addition to the soft-shrinkage function, note that the clipping function described by (34) can also be expressed by (42) if ${\bf{K}} = {\left({\begin{array}{cccc}{I}&{{-}{I}}&{I}&{{-}{I}}\end{array}}\right)}^{\top},{\tilde{\bf{K}}}^{\top} = \left({\begin{array}{cccc}{I}&{{-}{I}}&{{-}{I}}&{I}\end{array}}\right)$, and $\underline{b} = {\left({\begin{array}{cccc}{0}&{0}&{{-}{t}}&{{-}{t}}\end{array}}\right)}^{\top}$. It can be noted that representing the clipping function in convolutional form requires four times more channels than the original input signal.
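These reparameterizations are easy to check; with identity filters, the convolutions of (42) reduce to elementwise operations.

```python
import numpy as np

t = 0.8
y = np.linspace(-2.0, 2.0, 9)
# Soft shrinkage in the form K~^T (Ky + b)_+ with K = (I, -I)^T and b = (-t, -t)^T:
d_hat = np.maximum(y - t, 0.0) - np.maximum(-y - t, 0.0)
# Soft clipping in the four-channel form described above:
eta_hat = (np.maximum(y, 0.0) - np.maximum(-y, 0.0)
           - np.maximum(y - t, 0.0) + np.maximum(-y - t, 0.0))
print(np.allclose(d_hat + eta_hat, y))   # True: the two estimates are complementary
```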
It should be noted that the ability of ReLUs to approximate other functions has also been observed by Daubechies et al. [29], who have proven that deep ReLU CNNs are universal function approximators. In addition, Ye and Sung [13] have demonstrated that the ReLU function is the main source of the high approximation power of CNNs.
Up to now, it has been assumed that the operation of the encoding and decoding convolution filters is limited to mapping the input image to a multichannel representation and to reconstructing it (i.e., ${\bf{K}}$ and ${\tilde{\bf{K}}}^{\top}$ comply with ${\tilde{\bf{K}}}^{\top}{(}{\bf{K}}{)}_{+} = {\bf{I}}\,{\cdot}\,{c}{)}$. Still, it is possible that, in addition to performing decomposition and synthesis tasks, the encoding-decoding structure also filters/colors the signal in a way that improves image estimates. It should be noted that this implies that the perfect reconstruction property of the encoding-decoding structure is no longer preserved. For example, consider the following linear encoding-decoding structure \[{\hat{\bf{x}}} = {\tilde{\bf{K}}}^{\top}{(}{\bf{Ky}}{)} \tag{43} \] which can be reduced to \[{\hat{\bf{x}}} = {\bf{k}}\ast{\bf{y}}{.} \tag{44} \]
Here, ${\bf{k}} = {\tilde{\bf{K}}}^{\top}{\bf{K}}$ is optimized to reduce the distance between the estimate ${\hat{\bf{x}}}$ and the ground truth ${\bf{x}}$. Consequently, the equivalent filter ${\bf{k}}$ can be considered a Wiener filter. It should be noted that this article is not the first to address the potential Wiener-like behavior of a CNN. For example, Mohan et al. [14] suggested that by eliminating the bias of the convolution layers, the CNN could behave more akin to the Wiener filter and be able to generalize better to unseen noise levels. It should be noted that by doing so, the CNN can also behave akin to the switching behavior described by the TDCFs, which can be described by the equation \[{(}{z}{)}_{+} = \begin{cases}\begin{array}{ll}{z,}&{{\text{if}}\,{z}\geq{0},}\\{0}&{{\text{if}}\,{z}{<}{0}}\end{array}\end{cases} \tag{45} \] where z is a pixel that belongs to the signal ${\bf{z}} = {\bf{k}}\ast{\bf{x}}$. It can be observed that in contrast with the low-rank behavior described in the “TDCFs” section, in this case, the switching behavior is only dependent on the correlation between signal ${\bf{x}}$ and filter ${\bf{k}}$. Consequently, if the value of z is positive, its value is preserved. On the contrary, if the correlation between ${\bf{x}}$ and ${\bf{k}}$ is negative, then the value of z is canceled. Consequently, the noise reduction becomes independent/invariant of the noise level. It can be observed that this effect can be considered a nonlinear extension of signal annihilation filters [30].
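The collapse of a bias-free linear encoder-decoder into the single equivalent filter of (44) follows from the linearity and associativity of convolution; the following check uses random filters and 'full' convolutions so that the identity holds exactly.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(3)
K = rng.standard_normal((4, 3, 3))    # encoder: 1 input channel -> 4 channels
Kt = rng.standard_normal((4, 3, 3))   # decoder: 4 channels -> 1 output channel
y = rng.standard_normal((16, 16))

two_step = sum(convolve2d(convolve2d(y, k, mode='full'), kt, mode='full')
               for k, kt in zip(K, Kt))
k_eq = sum(convolve2d(k, kt, mode='full') for k, kt in zip(K, Kt))  # k = K~^T K, cf. (44)
print(np.allclose(two_step, convolve2d(y, k_eq, mode='full')))      # True
```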
It should be noted that aside from the low-rank approximation interpretation of ReLU-based CNNs, additional links to other techniques can be derived. For example, the decomposition and synthesis provided by the encoding-decoding structure is also akin to nonnegative matrix factorization [31], in which a signal is factorized as a weighted sum of positive bases. In this conception, the feature maps are the bases, which are constrained to be positive by the ReLU function. Furthermore, an additional interpretation of encoding-decoding CNNs can be obtained by analyzing them from a low-dimensional manifold representation perspective [8]. Here, the convolution layers of CNNs are interpreted as two operations: on one hand, they can provide a Hankel representation, and on the other, a bottleneck that reduces the dimensionality of the manifold of image patches. It should be noted that the Hankel-like structure attributed to the convolution layers of CNNs has also been noted by the TDCFs [12]. Two final connections between signal processing and CNNs are the variational formulation combined with kernel-based methods [15] and the convolutional sparse coding interpretation of CNNs [16].
To demonstrate an application of the principles summarized in the “Signal Processing Fundamentals” and “Bridging the Gap Between Signal Processing and CNNs: Deep Convolutional Framelets and Shrinkage-Based CNNs” sections, this section analyzes relevant designs of ReLU and shrinkage CNNs. The analyses focus on three main aspects: 1) overall descriptions of the network architecture, 2) the signal reconstruction characteristics provided by the convolutional layers of the encoder and decoder subnetworks, and 3) the number of operations ${\mathcal{O}}{(}\,{\cdot}\,{)}$ executed by the trainable parts of the network, as this gives insight into the computational requirements needed to execute each network and its overall complexity.
Signal reconstruction analysis provides a theoretical indication that a given CNN design can propagate any arbitrary signal when considering the use of ideal filters (i.e., they provide perfect reconstruction and are maximally sparse). In other words, for a fixed network architecture, there exists a selection of parameters (weights and biases) that makes the neural network equal to the identity function. This result is important because a design that cannot propagate arbitrary signals under ideal conditions will potentially distort the signals that propagate through it by design. Consequently, this cannot be fixed by training with large datasets and/or with the application of any special loss term. To better understand the signal reconstruction analysis, we provide a brief example with a nonresidual CNN ${G}{(}\,{\cdot}\,{)}$, through which we propagate a noiseless signal ${\bf{x}}$ contaminated with noise $\mathbf{\eta}$ so that \[{\bf{x}}\approx{G}{(}{\bf{x}} + \mathbf{\eta}{).} \tag{46} \]
Here, an ideal CNN allows us to propagate any x while canceling the noise component $\mathbf{\eta}$, irrespective of the content of x. If we switch our focus to an ideal residual CNN ${R}{(}\,{\cdot}\,{),}$ it is possible to observe that \[{\hat{\bf{x}}}\approx{R}{(}{\bf{y}}{)} = {\bf{y}}{-}{G}{(}{\bf{y}}{).} \tag{47} \]
Here, ${G}{(}\,{\cdot}\,{)}$ is the encoding-decoding section of the residual network ${R}{(}\,{\cdot}\,{)}$. Consequently, it is desirable that the network ${G}{(}\,{\cdot}\,{)}$ is able to propagate the noise $\mathbf{\eta}$, while suppressing the noiseless signal x, which is equivalent to \[\begin{array}{l}{\mathbf{\eta}\approx{G}{(}{\bf{x}} + \mathbf{\eta}{).}}\end{array} \tag{48} \]
It should be noted that in both residual and nonresidual cases, there are two behaviors. On one hand, there is a signal that the network decomposes and reconstructs (almost) perfectly, and on the other, a signal that is suppressed. Signal reconstruction analysis focuses on the signals that the network can propagate or reconstruct, rather than on the signal cancelation behavior. Consequently, we focus on the linear part of ${G}{(}\,{\cdot}\,{)}$ (i.e., its convolution structure), of which, according to the “TDCFs” section, we assume that it handles the decomposition and reconstruction of the signal within the CNN. It should be noted that the idealized model assumed here is only considered for analysis purposes, as practical implementations do not guarantee that this exact behavior is factually obtained. For more information, see the “Additional Links Between Encoding-Decoding CNNs and Existing Signal Processing Techniques” section as well as “Fitting Low-Rank Approximation in Rectified Linear Unit Convolutional Neural Networks” and “Network Depth.”
Fitting Low-Rank Approximation in Rectified Linear Unit Convolutional Neural Networks
To further understand the analogy between convolutional neural networks (CNNs) and low-rank approximation established by the theory of deep convolutional framelets, we can use as a starting point the definition of the singular value decomposition expressed in (15): \[{\bf{y}} = \mathop{\sum}\limits_{{n} = {0}}^{{N}_{\text{SV}}{-}{1}}{(}{\mathit{\underline{u}}}_{n}{\mathit{\underline{v}}}_{n}^{\top}{)}\cdot{\underline{\sigma}}{[}{n}{].} \]
Given that each left and right singular vector pair ${\mathit{\underline{u}}}_{n}{\mathit{\underline{v}}}_{n}^{\top}$ generates an image ${\bf{D}}{[}{n}{]}$, (15) can be rewritten as \[{\bf{y}} = \mathop{\sum}\limits_{{n} = {0}}^{{N}_{\text{SV}}{-}{1}}{\bf{D}}{[}{n}{]}\cdot{\underline{\sigma}}{[}{n}{]} \tag{S1} \] where tensor ${\bf{D}} = {(}{(}{\mathit{\underline{u}}}_{0}{\mathit{\underline{v}}}_{0}^{\top}{)}\ldots{(}{\mathit{\underline{u}}}_{{N}_{\text{SV}}{-}{1}}{\mathit{\underline{v}}}_{{N}_{\text{SV}}{-}{1}}^{\top}{)}{)}^{\top}$ contains the products of the left and right singular vectors and has dimensions of ${(}{N}_{\text{SV}}\,{\times}\,{1}\,{\times}\,{M}\,{\times}\,{N}{)}$. Furthermore, the equation can be reformulated as \[{\bf{y}} = {\tilde{\bf{K}}}^{\top}{\bf{D}} \tag{S2} \] in which ${\tilde{\bf{K}}}^{\top} = {(}{(}{\underline{\sigma}}{[}{0}{]}{)}\ldots{(}{\underline{\sigma}}{[}{N}_{\text{SV}}{-}{1}{]}{)}{)}$, where the brackets of the ${(}{1}\,{\times}\,{1}{)}$ filters have been excluded for simplicity. Assume now that it is desirable to perform a low-rank approximation of signal ${\bf{y}}$ based on the reformulation in (S2). If we assume that ${\bf{D}}\,{\in}\,{\mathbb{R}}_{\geq{0}}^{N}$, then the low-rank approximation can be expressed by \[{\hat{\bf{y}}} = {\tilde{\bf{K}}}^{\top}{(}{\bf{D}} + {\mathit{\underline{b}}}{)}_{+} \tag{S3} \] in which the entries of ${\underline{b}}$ are set to zero for the channels of ${\bf{D}}$ that have high contributions to the image content. Conversely, the channels ${\bf{D}}{[}{n}{]}$ with less perceptual relevance are canceled by assigning large negative values to the corresponding entries of ${\underline{b}}$. As a final reformulation, we can assume that the basis images ${\bf{D}}$ are the result of decomposing the input image ${\bf{y}}$ with a set of convolution filters, i.e., ${\bf{D}} = {\bf{Ky}}$; this transforms (S3) into \[{\hat{\bf{x}}} = {\tilde{\bf{K}}}^{\top}{(}{\bf{Ky}} + {\underline{b}}{)}_{+}{.} \tag{S4} \]
Here, it is visible that (S4) is analogous to the encoding-decoding architecture defined in (10)–(14), and the encoder and decoder filters are akin to the framelet formulation presented in the “Framelets” section. Note that (S4) assumes that the entries of ${\bf{D}} = {\bf{Ky}}$ are positive, which may not always be true. In this situation, tensor ${\bf{D}}$ requires redundant channels whose phases are inverted to avoid signal loss. Furthermore, it should also be noted that in a CNN, the bias/threshold level is not inferred from the statistics of the feature maps but learned from the data presented to the network during training.
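To make the chain (S1)-(S3) tangible, the following minimal NumPy sketch writes a truncated SVD in the decoder-times-thresholded-bases form of (S3), including the phase-complementary trick mentioned above for handling the negative entries of ${\bf{D}}$; the image size, rank, and bias values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of (S1)-(S3): a truncated SVD written in the
# "decoder weights times ReLU-thresholded basis images" form.
rng = np.random.default_rng(0)
y = rng.standard_normal((64, 64))                    # test image y

U, s, Vt = np.linalg.svd(y, full_matrices=False)
D = np.stack([np.outer(U[:, n], Vt[n]) for n in range(len(s))])  # D[n] = u_n v_n^T

relu = lambda t: np.maximum(t, 0.0)
n_keep = 8                                           # rank of the approximation
# b = 0 for the kept channels and a large negative value for the discarded
# ones, so that (D + b)_+ cancels the latter, as in (S3).
b = np.where(np.arange(len(s)) < n_keep, 0.0, -1e9)[:, None, None]

# D has negative entries, so the phase-complementary trick noted in the text
# is used: redundant channels (D, -D) with weights (s, -s) recover D exactly.
y_hat = np.tensordot(s, relu(D + b), axes=1) - np.tensordot(s, relu(-D + b), axes=1)

y_ref = (U[:, :n_keep] * s[:n_keep]) @ Vt[:n_keep]   # classical rank-8 estimate
print(np.allclose(y_hat, y_ref))                     # True
```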
Multilayer Designs
It should be noted that CNNs contain multiple layers, which recursively decompose/reconstruct the signal. This may pose an advantage with respect to conventional low-rank approximation algorithms for a few reasons. First, the data-driven nature of CNNs allows us to learn the basis functions, which optimally decompose and suppress noise in the signal. Second, as networks are deep, the incoming signal is recursively decomposed and sparsified. This multidecomposition scheme is very similar to the designs used in noise-reduction algorithms based on framelets. It can be noted that, in the past, recursive sparsifying principles have been observed in methods such as the (learned) iterative soft-thresholding algorithm [27], [28] as well as convolutional sparse coding. In fact, the convolutional sparse-coding approach has been used for interpreting the operation of CNNs [16].
What about practical implementations?
When training a CNN, the parameters of the model (i.e., ${\bf{K}}$, ${\tilde{\bf{K}}}^{\top}$, and ${\mathit{\underline{b}}}$) are updated to reduce the loss between the processed noisy signal and the ground truth, which does not guarantee that the numerical values of the convolution filters and biases of the trained model comply with the assumptions made here. This is because CNNs do not have mechanisms to enforce that the filters have properties such as sparsity or perfect reconstruction, or that the biases take negative values. Consequently, CNNs may not necessarily perform a low-rank approximation of the signal, although the mathematical formulations of the low-rank approximation and of the single-layer encoding-decoding are similar. Hence, the analysis presented here should be treated as insight into the mathematical formulation and/or the potential properties that can be enforced for specific applications, and not as a literal description of what trained models do.
Network Depth
It should be noted that one of the key elements of convolutional neural networks (CNNs) is their network depth, which we address in this section. To illustrate the effect of network depth, assume an arbitrary N-layer encoding-decoding CNN, in which the encoding layers are defined by \begin{align*}{\bf{E}}_{0} & = {(}{\bf{K}}_{0}{\bf{y}} + {\mathit{\underline{b}}}_{0}{)}_{+}, \\ {\bf{E}}_{1} & = {(}{\bf{K}}_{1}{\bf{E}}_{0} + {\mathit{\underline{b}}}_{1}{)}_{+}, \\ {\bf{E}}_{2} & = {(}{\bf{K}}_{2}{\bf{E}}_{1} + {\mathit{\underline{b}}}_{2}{)}_{+}, \\ & \quad\vdots \\ {\bf{E}}_{{N}{-}{1}} & = {(}{\bf{K}}_{{N}{-}{1}}{\bf{E}}_{{N}{-}{2}} + {\mathit{\underline{b}}}_{{N}{-}{1}}{)}_{+} \tag{S5} \end{align*} or, in compact form, \[{\bf{E}}_{n} = {(}{\bf{K}}_{n}{\bf{E}}_{{n}{-}{1}} + {\mathit{\underline{b}}}_{n}{)}_{+}{.} \tag{S6} \]
Here, ${\bf{E}}_{n}$ represents the encoded signal at the nth decomposition level, while ${\bf{K}}_{n}$ and ${\mathit{\underline{b}}}_{n}$ are the convolution weights and biases for the nth encoding layer, respectively. As addressed in the “ReLU” and “TDCFs” sections, the role of the rectified linear unit activations is to enforce sparsity and nonnegativity, which can be interpreted as the process of suppressing noninformative bases in the low-rank approximation algorithm. Consequently, every encoded signal ${\bf{E}}_{n}$ is an encoded and sparsified version of the signal ${\bf{E}}_{{n}{-}{1}}.$ To recover the signal, we apply the decoder part of the CNN, given by \begin{align*}{\tilde{\bf{E}}}_{{N}{-}{1}} & = {(}{\tilde{\bf{K}}}_{{N}{-}{1}}^{\top}{\bf{E}}_{{N}{-}{1}} + {\mathit{\underline{b}}}_{{N}{-}{1}}{)}_{+}, \\ & \quad\vdots \\ {\tilde{\bf{E}}}_{1} & = {(}{\tilde{\bf{K}}}_{2}^{\top}{\tilde{\bf{E}}}_{2} + {\mathit{\underline{b}}}_{2}{)}_{+}, \\ {\tilde{\bf{E}}}_{0} & = {(}{\tilde{\bf{K}}}_{1}^{\top}{\tilde{\bf{E}}}_{1} + {\mathit{\underline{b}}}_{1}{)}_{+}, \\ \hat{\bf{x}} & = {(}{\tilde{\bf{K}}}_{0}^{\top}{\tilde{\bf{E}}}_{0} + {\mathit{\underline{b}}}_{0}{)}_{+} \tag{S7} \end{align*} or, in compact form, \[{\tilde{\bf{E}}}_{{n}{-}{1}} = {(}{\tilde{\bf{K}}}_{n}^{\top}{\tilde{\bf{E}}}_{n} + {\mathit{\underline{b}}}_{n}{)}_{+}{.} \tag{S8} \]
Here, $\hat{\bf{x}}$ is the low-rank estimate/denoised version of the input signal ${\bf{y}}$, while ${\tilde{\bf{E}}}_{n},{\tilde{\bf{K}}}_{n}^{\top},{\mathit{\underline{b}}}_{n}$ are the decoded signal components at the nth composition level and the decoder convolution weights and biases for the nth layer, respectively. In (S8), every decoded signal ${\tilde{\bf{E}}}_{n}$ is the low-rank estimate of the encoded layer ${\bf{E}}_{{(}{n}{-}{1}{)}}.$ It should be noted that the activation of each of the decoder layers ${(}\,{\cdot}\, + {\mathit{\underline{b}}}_{n}{)}_{+}$ can further enforce sparsity on the low-rank estimates ${\tilde{\bf{E}}}_{{(}{n}{-}{1}{)}}{.}$
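The recursion (S5)-(S8) maps directly onto code. The following is a hedged PyTorch sketch of such an N-layer encoder-decoder; the channel widths, the kernel size, and the use of plain convolutions for the decoder filters ${\tilde{\bf{K}}}_{n}^{\top}$ are illustrative choices, not prescribed by the analysis.

```python
import torch
import torch.nn as nn

# Hedged sketch of the recursion (S5)-(S8): an N-layer encoder that repeatedly
# decomposes and sparsifies the signal, followed by a mirrored decoder.
class EncoderDecoder(nn.Module):
    def __init__(self, channels=(1, 16, 32, 64), kernel=3):
        super().__init__()
        pairs = list(zip(channels[:-1], channels[1:]))
        # Encoder: E_n = (K_n E_{n-1} + b_n)_+  as in (S6)
        self.encoder = nn.ModuleList(
            nn.Conv2d(c_in, c_out, kernel, padding=kernel // 2)
            for c_in, c_out in pairs)
        # Decoder: E~_{n-1} = (K~_n^T E~_n + b_n)_+  as in (S8), mirrored order
        self.decoder = nn.ModuleList(
            nn.Conv2d(c_out, c_in, kernel, padding=kernel // 2)
            for c_in, c_out in reversed(pairs))

    def forward(self, y):
        e = y
        for enc in self.encoder:      # recursive decomposition + sparsification
            e = torch.relu(enc(e))
        for dec in self.decoder:      # recursive low-rank reconstruction
            e = torch.relu(dec(e))
        return e

x_hat = EncoderDecoder()(torch.randn(1, 1, 32, 32))
print(x_hat.shape)  # torch.Size([1, 1, 32, 32])
```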
Summary
In conclusion, the mathematical formulation of deep networks is analogous to a recursive data-driven low-rank approximation, where the input to the successive encoding-decoding pairs is the low-rank approximated encoded signal generated by the encoder of the previous level. Still, as mentioned in “Fitting Low-Rank Approximation in Rectified Linear Unit Convolutional Neural Networks,” low-rank approximation algorithms and CNNs are similar in terms of mathematical formulation, but we cannot guarantee that the values obtained during training for the encoding and decoding filters and their biases have the properties needed for a CNN to be an exact recursive data-driven low-rank approximation. For example, it is possible that the filters of the encoder and decoder do not reconstruct the signal perfectly because this may not be necessary to reduce the loss function used to optimize the network.
Is it possible to impose a tighter relationship between low-rank approximation and CNNs?
In specific applications where signal preservation and interpretability are required (e.g., medical imaging), it is desirable that the operation of CNNs be closer to the low-rank approximation description. To achieve this, the CNNs embedded in frameworks such as the convolutional analysis operator [S1] and the fast iterative soft-thresholding algorithm network [S2] explicitly train the filters ${\bf{K}}_{n}$ and ${\tilde{\bf{K}}}_{n}$ to have properties such as perfect signal reconstruction and sparsity. By enforcing these characteristics, the mathematical descriptions of the low-rank behavior and of the CNNs become more similar, and the models become inherently more interpretable and predictable in their operation.
References
[S1] I. Y. Chun, Z. Huang, H. Lim, and J. Fessler, “Momentum-net: Fast and convergent iterative neural network for inverse problems,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 4, pp. 4915–4931, Apr. 2023, doi: 10.1109/TPAMI.2020.3012955.
[S2] J. Xiang, Y. Dong, and Y. Yang, “FISTA-net: Learning a fast iterative shrinkage thresholding network for inverse problems in imaging,” IEEE Trans. Med. Imag., vol. 40, no. 5, pp. 1329–1339, May 2021, doi: 10.1109/TMI.2021.3054167.
To test the perfect reconstruction property in nonresidual CNNs, we propose the following procedure. 1) We assume an idealized model ${G}{(}\,{\cdot}\,{)}$ whose convolution filters ${\bf{K}}_{n}$ and ${\tilde{\bf{K}}}_{n}$ comply with the phase-complementary tight-framelet condition, and whose biases and nonlinearities suppress low-amplitude (and, for ReLU activations, negative) samples from the feature maps. 2) The biases/thresholds of ReLU/shrinkage CNNs are set to zero (or to infinity for clipping activations). This condition disables the low-rank (or, for residual models, high-rank) approximation behavior of the idealized CNN; under this circumstance, it should be possible to prove that the analyzed CNN can perfectly reconstruct any signal. 3) The last step involves simplifying the mathematical description of the model resulting from the previous point. The simplification should lead to the identity function if the model complies with the perfect reconstruction property.
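The procedure can also be verified numerically for a filter bank that is known to comply with the assumptions of step 1. The sketch below builds an idealized one-level ${G}{(}\,{\cdot}\,{)}$ from the undecimated 2D Haar bank with phase-complemented channels, sets the biases to zero as in step 2, and confirms the identity of step 3; all names and sizes are illustrative.

```python
import torch
import torch.nn.functional as F

# Idealized one-level G(.): the encoder K is an undecimated 2D Haar filter
# bank with phase-complemented channels (+W, -W); the decoder applies the
# adjoint (flipped) filters. With zero biases (step 2), P{G}(x) = x (step 3).
h = torch.tensor([1.0, 1.0]) / 2.0
g = torch.tensor([1.0, -1.0]) / 2.0
bank = torch.stack([torch.outer(a, b) for a in (h, g) for b in (h, g)])
W = torch.cat([bank, -bank]).unsqueeze(1)       # (8,1,2,2), phase-complemented

def circ_conv(x, w, adjoint):
    if adjoint:                                  # flipped kernels, mapped back
        w = torch.flip(w, dims=[2, 3]).transpose(0, 1)
        x = F.pad(x, (0, 1, 0, 1), mode="circular")
    else:
        x = F.pad(x, (1, 0, 1, 0), mode="circular")
    return F.conv2d(x, w)

x = torch.randn(1, 1, 16, 16)
e = torch.relu(circ_conv(x, W, adjoint=False))  # encoder with zero bias
x_rec = circ_conv(e, W, adjoint=True)           # decoder K~^T
print(torch.allclose(x_rec, x, atol=1e-5))      # True: the identity is recovered
```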
To conclude the explanation of the perfect reconstruction analysis, we provide two relevant considerations. First, it can be claimed that a residual network, such as the model ${R}{(}{\bf{y}}{)} = {\bf{y}}{-}{G}{(}{\bf{y}}{)}$ discussed in (47), is able to reconstruct any signal when ${G}{(}{\bf{y}}{)} = {0}$ for any ${\bf{y}} = {\bf{x}} + \mathbf{\eta}$. Still, this does not convey information about the behavior of the encoding-decoding network ${G}{(}\,{\cdot}\,{),}$ which should be able to perform a perfect decomposition and reconstruction of the noise signal $\mathbf{\eta}$, as discussed in (48). To avoid this trivial solution, instead of analyzing the network ${R}{(}\,{\cdot}\,{),}$ the analysis described for nonresidual models is applied to the encoding-decoding structure ${G}{(}\,{\cdot}\,{),}$ which means that the residual connection is excluded from the analysis.
The second concluding remark is that, to distinguish the equations of the perfect signal reconstruction analysis from those of other models, we denote the perfect reconstruction variants of the analyzed designs, in which the low-rank approximation behavior is disabled by setting the bias values to zero, with the special operator ${\scr{P}}\left\{{\cdot}\right\}$.
For the analyses regarding the total number of operations of the trainable parameters, it is assumed that the tensors ${\bf{K}}_{0}$, ${\tilde{\bf{K}}}_{0}^{\top}$, $({\tilde{\bf{K}}}_{0}^{u}){}^{\top}$, $({\tilde{\bf{K}}}_{0}^{d}){}^{\top}$, ${\bf{K}}_{1}$, and ${\tilde{\bf{K}}}_{1}^{\top}$, shown in Figure 10, have dimensions of ${(}{C}_{0}\,{\times}\,{1}\,{\times}\,{N}_{f}\,{\times}\,{N}_{f}{),}$ ${(}{1}\,{\times}\,{C}_{0}\,{\times}\,{N}_{f}\,{\times}\,{N}_{f}{),}$ ${(}{1}\,{\times}\,{C}_{0}\,{\times}\,{N}_{f}\,{\times}\,{N}_{f}{),}$ ${(}{1}\,{\times}\,{C}_{0}\,{\times}\,{N}_{f}\,{\times}\,{N}_{f}{),}$ ${(}{C}_{1}\,{\times}\,{C}_{0}\,{\times}\,{N}_{f}\,{\times}\,{N}_{f}{),}$ and ${(}{C}_{0}\,{\times}\,{C}_{1}\,{\times}\,{N}_{f}\,{\times}\,{N}_{f}{),}$ respectively. Here, ${C}_{0}$ and ${C}_{1}$ represent the number of channels after the first and second convolution layers, respectively, and all the convolution filters are assumed to be square with ${N}_{f}\,{\times}\,{N}_{f}$ pixels. Furthermore, the input signal ${\bf{x}}$ has dimensions of ${(}{1}\,{\times}\,{1}\,{\times}\,{N}_{r}\,{\times}\,{N}_{c}{),}$ where ${N}_{r}$ and ${N}_{c}$ denote the number of rows and columns, respectively.
Figure 10. The simplified structure of encoding-decoding ReLU CNNs. The displayed networks are the U-Net/filtered backprojection network, the residual encoder-decoder CNN (RED), and, finally, the learned wavelet-frame shrinkage network (LWFSN). Note that for all the designs, the encoding-decoding structures are indicated by dashed blocks. It should be kept in mind that the drawings are simplified: they do not contain normalization layers, are shallow, and commonly appearing dual convolutions are drawn as a single layer.
The analyses shown for the different networks in this article have the following limitations. 1) The analyzed networks have only enough decomposition levels and convolution layers to understand their basic operation. The motivation for this simplification is to keep the analyses short and clear. Moreover, the same principles can be extended to deeper networks because the same single-decomposition CNNs would be recursively embedded within the given architectures. 2) The normalization layers are not considered because they are linear operators that provide mean shifts and amplitude scaling. Consequently, for analysis purposes, it can be assumed that they are embedded in the convolution weights. 3) For every encoder convolution kernel, it is assumed that there is at least one decoder filter. 4) No coadaptation between the filters of the encoder and decoder layers is considered.
The remainder of this section presents analyses of a few representative designs. Specifically, the chosen designs are the U-Net [32] and its residual counterpart, the filtered backprojection network [21] (a Matlab implementation by their authors is available at https://github.com/panakino/FBPConvNet). Additional designs analyzed here are the residual encoder-decoder CNN [5] (a Pytorch implementation by their authors is available at https://github.com/SSinyu/RED-CNN) as well as the learned wavelet frame-shrinkage network (LWFSN) (a Pytorch implementation is available at https://github.com/LuisAlbertZM/demo LWFSN TMI, and an interactive demo is available at IEEE’s Code Ocean, https://codeocean.com/capsule/9027829/tree/v1; the demo also includes, as references, Pytorch implementations of FBPConvNet and the tight-frame U-Net). For reference, all the designs are portrayed in Figure 10.
The first networks analyzed are the U-Net and filtered backprojection networks, both of which share the encoding-decoding structure $U(\cdot)$. However, they differ in the fact that the U-Net is nonresidual, while the filtered backprojection network operates in a residual configuration. Therefore, an estimate of the noiseless signal ${\hat{\bf{x}}}$ from the noisy observation y in the conventional U-Net is achieved by \[{\hat{\bf{x}}} = {U}{(}{\bf{y}}{)} \tag{49} \] whereas in the filtered backprojection network, ${U}{(}\,{\cdot}\,{)}$ is used in a residual configuration, which is equivalent to \[{\hat{\bf{x}}} = {\bf{y}}{-}{U}{(}{\bf{y}}{).} \tag{50} \]
If we now switch our focus to the encoding-decoding structure of the U-Net $U({\bf{y}})$, it can be shown that it is described by \[{U}{(}{\bf{y}}{)} = {U}^{u}{(}{\bf{y}}{)} + {U}^{d}{(}{\bf{y}}{)} \tag{51} \] where ${U}^{u}({\bf{y}})$ corresponds to the undecimated path and is defined by \[{U}^{u}{(}{\bf{y}}{)} = {(}{\tilde{\bf{K}}}_{0}^{u}{)}{}^{\top}{(}{\bf{K}}_{0}{\bf{y}} + {\underline{b}}_{0}{)}_{+} \tag{52} \] while the decimated path is \[{U}^{d}{(}{\bf{y}}{)} = {(}{\tilde{\bf{K}}}_{0}^{d}{)}{}^{\top}{\tilde{\bf{W}}}_{L}^{\top}{f}_{{(}{2}\uparrow{)}}\left({{(}{\tilde{\bf{K}}}_{1}^{\top}{(}{K}_{1}{Z} + {\underline{b}}_{1}{)}_{+} + {\tilde{\underline{b}}}_{1}{)}_{+}}\right){.} \tag{53} \]
Here, signal Z is defined by \[{Z} = {f}_{{(}{2}\downarrow{)}}{(}{W}_{L}{(}{K}_{0}{\bf{y}} + {\underline{b}}_{0}{)}_{+}{).} \tag{54} \]
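For concreteness, (51)-(54) can be written as the following PyTorch sketch, where ${W}_{L}$ combined with ${f}_{{(}{2}\downarrow{)}}$ is realized as a fixed strided Haar low-pass convolution, and its transpose implements ${f}_{{(}{2}\uparrow{)}}$ followed by ${\tilde{\bf{W}}}_{L}^{\top}$; the channel counts and the $3\,{\times}\,3$ kernels are illustrative assumptions, and the biases live inside the convolution layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of the simplified two-path U-Net of (51)-(54).
C0, C1 = 8, 16
K0   = nn.Conv2d(1,  C0, 3, padding=1)
K0uT = nn.Conv2d(C0, 1,  3, padding=1)   # (K~_0^u)^T, undecimated path
K0dT = nn.Conv2d(C0, 1,  3, padding=1)   # (K~_0^d)^T, decimated path
K1   = nn.Conv2d(C0, C1, 3, padding=1)
K1T  = nn.Conv2d(C1, C0, 3, padding=1)
w_L  = torch.full((C0, 1, 2, 2), 0.5)    # fixed 2x2 Haar low-pass W_L

def U(y):
    e0 = torch.relu(K0(y))                                # (K_0 y + b_0)_+
    u  = K0uT(e0)                                         # undecimated path (52)
    z  = F.conv2d(e0, w_L, stride=2, groups=C0)           # Z as in (54)
    d  = torch.relu(K1T(torch.relu(K1(z))))               # inner pair in (53)
    d  = F.conv_transpose2d(d, w_L, stride=2, groups=C0)  # upsample + W~_L^T
    return u + K0dT(d)                                    # (51)

print(U(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 1, 32, 32])
```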
Note that the decimated path contains two nested encoding-decoding architectures, as observed by Jin et al. [21], who acknowledged that the nested filtering structure is akin to the (learned) iterative soft-thresholding algorithm [27], [28].
To prove whether the U-Net can perfectly reconstruct any signal, we assume that the biases are equal to zero; under this condition, the network ${\scr{P}}{\{}{U}{\}(}{\bf{y}}{)}$ is defined by \[{\scr{P}}{\{}{U}{\}(}{\bf{y}}{)} = {\scr{P}}{\{}{U}^{\text{u}}{\}(}{\bf{y}}{)} + {\scr{P}}{\{}{U}^{\text{d}}{\}(}{\bf{y}}{)} \tag{55} \] where subnetwork ${\scr{P}}{\{}{U}^{u}{\}(}\,{\cdot}\,{)}$ is defined by \[{\scr{P}}{\{}{U}^{\text{u}}{\}}{(}{\bf{y}}{)} = {(}{\tilde{\bf{K}}}_{0}^{u}{)}^{\top}{(}{\bf{K}}_{0}{\bf{y}}{)}_{+}{.} \tag{56} \]
Assuming that $({K}_{0},{\tilde{\bf{K}}}_{0}^{u})$ is a complementary-phase tight-framelet pair, then ${\scr{P}}{\{}{U}^{u}{\}(}{\bf{y}}{)}$ is simplified to \[{\scr{P}}{\{}{U}^{\text{u}}{\}(}{\bf{y}}{)} = {\bf{y}}{\cdot}{c}_{0}{.} \tag{57} \]
Furthermore, the low-frequency (decimated) path becomes \[{\scr{P}}{\{}{U}^{\text{d}}{\}(}{\bf{y}}{)} = {(}{\tilde{\bf{K}}}_{0}^{d}{)}^{\top}{\tilde{\bf{W}}}_{L}^{\top}{f}_{{(}{2}\uparrow{)}}\left({{(}{\tilde{\bf{K}}}_{1}^{\top}{(}{\bf{K}}_{1}{\scr{P}}{\{}{Z}{\}}{)}_{+}{)}_{+}}\right) \tag{58} \] where ${\scr{P}}{\{}{Z}{\}}$ is defined by \[{\scr{P}}{\{}{Z}{\}} = {f}_{{(}{2}\downarrow{)}}{(}{\bf{W}}_{L}{(}{\bf{K}}_{0}{\bf{y}}{)}_{+}{)}{.} \tag{59} \]
If ${\bf{K}}_{1}$ is a phase-complementary tight frame, we know that ${\tilde{\bf{K}}}_{1}^{\top}{(}{\bf{K}}_{1}{Z}{)}_{+} = {Z}\,{\cdot}\,{c}_{1}$. Furthermore, assuming the nonnegativity of ${\scr{P}}{\{}{Z}{\}}\,{\cdot}\,{c}_{1}$, the outer ReLU in (58) can be omitted, so that (58) becomes \[{\scr{P}}{\{}{U}^{d}{\}}{(}{\bf{y}}{)} = {(}{\tilde{\bf{K}}}_{0}^{d}{)}^{\top}{\tilde{\bf{W}}}_{L}^{\top}{f}_{{(}{2}\uparrow{)}}{(}{f}_{{(}{2}\downarrow{)}}{(}{\bf{W}}_{L}{(}{\bf{K}}_{0}{\bf{y}}{)}_{+}{))}\,{\cdot}\,{c}_{1}{.} \tag{60} \]
Here, it can be noted that if ${\bf{K}}_{0}$ is a phase-complementary tight framelet, then ${\scr{P}}{\{}{U}^{d}{\}(}{\bf{y}}{)}$ approximates a low-pass version of ${\bf{y}}$, or equivalently, \[{\scr{P}}{\{}{U}^{d}{\}(}{\bf{y}}{)}\approx{\mathcal{W}}_{L}{\bf{y}}\,{\cdot}\,{c}_{1} \tag{61} \] where ${\mathcal{W}}_{L}$ is a low-pass filter. Finally, substituting (57) and (61) in (55) results in \[{\scr{P}}{\{}{U}{\}(}{\bf{y}}{)}\approx{(}{I}\,{\cdot}\,{c}_{0} + {\mathcal{W}}_{L}\,{\cdot}\,{c}_{1}{)}{\bf{y}}{.} \tag{62} \]
This result proves that the design of the U-Net cannot evenly reconstruct all the frequencies of ${\bf{y}}$ unless ${c}_{1} = {0}$, in which case the whole low-frequency branch of the network is ignored. Note that this limitation is inherent to the design and cannot be circumvented by training with large datasets and/or with any loss function.
It can be noted that the encoding filter ${\bf{K}}_{0}$ convolves ${\bf{x}}$ at its original resolution and maps it to a tensor with ${C}_{0}$ channels. Therefore, the number of operations ${\mathcal{O}}{(}\,{\cdot}\,{)}$ for kernel ${\bf{K}}_{0}$ is ${\mathcal{O}}{(}{\bf{K}}_{0}{)} = {C}_{0}\,{\cdot}\,{N}_{r}\,{\cdot}\,{N}_{c}\,{\cdot}\,{N}{}_{f}^{2}$ floating-point operations (FLOPs). Conversely, due to the symmetry between the encoder and decoder filters, ${\mathcal{O}}{(}{\tilde{\bf{K}}}_{0}^{u}{)} = {\mathcal{O}}{(}{\tilde{\bf{K}}}_{0}^{d}{)} = {\mathcal{O}}{(}{\bf{K}}_{0}{)}$. Furthermore, for this design, filter ${\bf{K}}_{1}$ processes the signal encoded by ${\bf{K}}_{0}$, which is downsampled by a factor of two in each dimension, and maps it from ${C}_{0}$ to ${C}_{1}$ channels. This results in the estimated operation cost ${\mathcal{O}}{(}{\bf{K}}_{1}{)} = {\mathcal{O}}{(}{\tilde{\bf{K}}}_{1}{)} = {C}_{0}\,{\cdot}\,{C}_{1}\,{\cdot}\,{N}_{r}\,{\cdot}\,{N}_{c}\,{\cdot}\,{N}{}_{f}^{2}\,{\cdot}\,{(}{2}{)}{}^{{-}{2}}$ [FLOPs]. Finally, adding the contributions of filters ${\bf{K}}_{0}$, ${\tilde{\bf{K}}}_{0}^{u}$, ${\tilde{\bf{K}}}_{0}^{d}$, ${\bf{K}}_{1}$, and ${\tilde{\bf{K}}}_{1}$ results in \[{\mathcal{O}}{(}{U}{)} = {(}{3} + {2}^{{-}{1}}\,{\cdot}\,{C}_{1}{)}\,{\cdot}\,{C}_{0}\,{\cdot}\,{N}_{r}\,{\cdot}\,{N}_{c}\,{\cdot}\,{N}{}_{f}^{2}\quad{[}{\text{FLOPs}}{].} \tag{63} \]
The U-Net/FBPConvNet is a flexible multiresolution architecture. Still, as has been shown, the pooling structure of this CNN may be suboptimal for noise-reduction applications because its configuration does not allow for an even recovery of the signal’s frequency content. This has been noted and fixed by Han and Ye [6], who introduced the tight-frame U-Net, in which the down-/upsampling structure is replaced by the DWT and its inverse. This simple modification overcomes the limitation of the U-Net and improves its performance for artifact removal in compressed sensing imaging.
The residual encoder-decoder CNN shown in Figure 10 consists of nested single-layer residual encoding-decoding networks. For example, in the network showcased in Figure 10, we can see that network ${R}_{1}{(}\,{\cdot}\,{)}$ is nested into ${R}_{0}{(}\,{\cdot}\,{)}$. Furthermore, for this case, the image estimate is given by \[{\hat{\bf{x}}} = {(}{\bf{y}} + {R}_{0}{(}{\bf{y}}{)} + \underline{\tilde{b}}{}_{0}{)}_{+} \tag{64} \] in which ${R}_{0}{(}\,{\cdot}\,{)}$ is the outer residual network and $\underline{\tilde{b}}{}_{0}$ is the bias for the output layer. Note that the ReLU placed at the output layer intrinsically assumes that the estimated signal ${\hat{\bf{x}}}$ is positive.
From (64), the output of the subnetwork ${R}_{0}{(}\,{\cdot}\,{)}$ is defined by \[{Z} = {R}_{0}^{\text{dec}}{(}{\hat{\bf{Q}}}{).} \tag{65} \]
Here, the decoder ${R}_{0}^{\text{dec}}{(}\,{\cdot}\,{)}$ is defined by \[{R}_{0}^{\text{dec}}{(}\hat{\bf{Q}}{)} = \tilde{\bf{K}}{}_{0}^{\top}\hat{\bf{Q}}{.} \tag{66} \]
In (66), $\hat{\bf{Q}}$ is the noiseless estimate of the intermediate signal Q and is defined by \[\hat{\bf{Q}} = {(}{\bf{Q}} + {R}_{1}{(}{\bf{Q}}{)} + \underline{\tilde{b}}{}_{1}{)}_{+} \tag{67} \] where the network ${R}_{1}{(}\,{\cdot}\,{)}$ is \[{R}_{1}{(}{\bf{Q}}{)} = {\tilde{\bf{K}}}_{1}^{\top}{(}{\bf{K}}_{1}{\bf{Q}} + \underline{b}{}_{1}{)}_{+}{.} \tag{68} \]
Furthermore, Q represents the signal encoded by ${R}_{0}{(}\,{\cdot}\,{),}$ or equivalently, \[{\bf{Q}} = {R}_{0}^{\text{enc}}{(}{\bf{y}}{)} \tag{69} \] where ${R}_{0}^{\text{enc}}{(}\,{\cdot}\,{)}$ is defined by \[{R}_{0}^{\text{enc}}{(}{\bf{y}}{)} = {\bf{K}}_{0}{\bf{y}}{.} \tag{70} \]
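A compact sketch of the nested structure in (64)-(70) reads as follows; the channel counts and kernel sizes are illustrative, and the separate bias terms $\underline{\tilde{b}}{}_{0}$ and $\underline{\tilde{b}}{}_{1}$ are absorbed into the convolution layers for brevity.

```python
import torch
import torch.nn as nn

# Hedged sketch of the nested residual blocks of the RED network, (64)-(70).
C0, C1 = 8, 16
K0, K0T = nn.Conv2d(1, C0, 3, padding=1), nn.Conv2d(C0, 1, 3, padding=1)
K1, K1T = nn.Conv2d(C0, C1, 3, padding=1), nn.Conv2d(C1, C0, 3, padding=1)

def R1(q):                            # inner network, (68)
    return K1T(torch.relu(K1(q)))

def red(y):
    q = K0(y)                         # R_0^enc, (69)-(70)
    q_hat = torch.relu(q + R1(q))     # inner residual + output ReLU, (67)
    z = K0T(q_hat)                    # R_0^dec, (65)-(66)
    return torch.relu(y + z)          # outer residual + output ReLU, (64)

print(red(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 1, 32, 32])
```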
As mentioned earlier, the residual encoder-decoder CNN is composed of nested residual blocks, which are analyzed independently to study the reconstruction characteristics of this network. First, the perfect reconstruction variant of block ${R}_{1}{(}\,{\cdot}\,{)}$ is given by \[{\scr{P}}\left\{{{R}_{1}}\right\}{(}{\bf{Q}}{)} = {\tilde{\bf{K}}}_{1}^{\top}{(}{\bf{K}}_{1}{\bf{Q}}{)}_{+}{.} \tag{71} \]
Under complementary-phase tight-frame assumptions for the pair ${(}{\bf{K}}_{1},{\tilde{\bf{K}}}_{1}{),}$ (71) reduces to \[{\scr{P}}\left\{{{R}_{1}}\right\}{(}{\bf{Q}}{)} = {\bf{Q}} \tag{72} \] which shows that the encoder-decoder pair of ${R}_{1}{(}\,{\cdot}\,{)}$ can approximately reconstruct any signal. Now, switching to ${R}_{0}{(}\,{\cdot}\,{)}$, it can be observed that its linear part is \[{\scr{P}}\left\{{{R}_{0}}\right\}{(}{\bf{y}}{)} = {\tilde{\bf{K}}}_{0}^{\top}{(}{\bf{K}}_{0}{\bf{y}}{)}_{+}{.} \tag{73} \]
Just as with ${R}_{1}{(}\,{\cdot}\,{)}$, it is assumed that the convolution kernels are tight framelets. Therefore, (73) becomes \[{\scr{P}}\left\{{{R}_{0}}\right\}{(}{\bf{y}}{)} = {\bf{y}}{.} \tag{74} \]
Consequently, ${R}_{0}{(}\,{\cdot}\,{)}$ and ${R}_{1}{(}\,{\cdot}\,{)}$ can reconstruct any arbitrary signal under complementary-phase tight-frame assumptions.
In this case, all the convolution layers operate at the original resolution of image ${\bf{x}}$. Therefore, the number of operations ${\mathcal{O}}{(}\,{\cdot}\,{)}$ for kernels ${\bf{K}}_{0}$ and ${\tilde{\bf{K}}}_{0}$ is ${\mathcal{O}}{(}{\bf{K}}_{0}{)} = {\mathcal{O}}{(}{\tilde{\bf{K}}}_{0}{)} = {C}_{0}\,{\cdot}\,{N}_{r}\,{\cdot}\,{N}_{c}\,{\cdot}\,{N}{}_{f}^{2}$ [FLOPs], while ${\bf{K}}_{1}$ and ${\tilde{\bf{K}}}_{1}$ require ${\mathcal{O}}{(}{\bf{K}}_{1}{)} = {\mathcal{O}}{(}{\tilde{\bf{K}}}_{1}{)} = {C}_{0}\,{\cdot}\,{C}_{1}\,{\cdot}\,{N}_{r}\,{\cdot}\,{N}_{c}\,{\cdot}\,{N}{}_{f}^{2}$ [FLOPs]. By adding the contributions of both encoding-decoding pairs, the total number of operations for the residual encoder-decoder becomes \[{\mathcal{O}}{(}{R}{)} = {2}\,{\cdot}\,{(}{1} + {C}_{1}{)}\,{\cdot}\,{C}_{0}\,{\cdot}\,{N}_{r}\,{\cdot}\,{N}_{c}\,{\cdot}\,{N}{}_{f}^{2}\quad{[}{\text{FLOPs}}{]}{.} \tag{75} \]
The residual encoder-decoder network consists of a set of nested single-resolution residual encoding-decoding CNNs. The single-resolution design increases its computation cost with respect to multiresolution designs, such as the U-Net. In addition, it should be noted that the use of an ReLU as the output layer of the encoder-decoder residual network forces the signal estimates to be positive, but this is not always convenient. For example, in computerized tomography imaging, it is common that images contain positive and negative values.
The LWFSN network is a multiresolution architecture in which the DWT is used for down-/upsampling and also as part of the decomposition where shrinkage is applied. In this CNN, the noiseless estimates are produced by \[{\hat{\bf{x}}} = {L}{(}{\bf{y}}{)} \tag{76} \] where ${L}{(}\,{\cdot}\,{)}$ represents the encoding-decoding structure of the LWFSN, and the encoding-decoding network ${L}{(}\,{\cdot}\,{)}$ is \[{L}{(}{\bf{y}}{)} = {L}_{\text{L}}{(}{\bf{y}}{)} + {L}_{\text{H}}{(}{\bf{y}}{).} \tag{77} \]
Here, the high-frequency path is given by \[{L}_{\text{H}}{(}{\bf{y}}{)} = {\tilde{\bf{K}}}_{0}^{\top}{\tilde{\bf{W}}}_{\text{H}}^{\top}{f}_{{(}{2}\uparrow{)}}\left({\tau}_{{(}{\underline{t}}_{0}{)}}^{\text{LET}}\left({f}_{{(}{2}\downarrow{)}}\left({\bf{W}}_{\text{H}}{\bf{K}}_{0}{\bf{y}}\right)\right)\right){.} \tag{78} \]
Note that in this design, the encoder leverages the filter ${\bf{W}}_{H}$ to generate a sparse signal prior to the shrinkage stage, i.e., ${\tau}_{{(}{\underline{t}}_{0}{)}}^{\text{LET}}\left({f}_{{(}{2}\downarrow{)}}\left({\bf{W}}_{H}{\bf{K}}_{0}{\bf{y}}\right)\right)$. Meanwhile, the low-frequency path ${L}_{\text{L}}{(}\,{\cdot}\,{)}$ is \[{L}_{\text{L}}{(}{\bf{y}}{)} = {\tilde{\bf{K}}}_{0}^{\top}{\tilde{\bf{W}}}_{\text{L}}^{\top}{f}_{{(}{2}\uparrow{)}}\left({{f}_{{(}{2}\downarrow{)}}\left({{W}_{\text{L}}{\bf{K}}_{0}{\bf{y}}}\right)}\right){.} \tag{79} \]
When analyzing signal propagation of the LWFSN, we set the threshold level $\underline{{t}_{0}} = {0}$. This turns (77) into \[{\scr{P}}\left\{{L}\right\}{(}{\bf{y}}{)} = {\scr{P}}\left\{{{L}_{\text{L}}}\right\}{(}{\bf{y}}{)} + {\scr{P}}\left\{{{L}_{\text{H}}}\right\}{(}{\bf{y}}{).} \tag{80} \]
Here, ${\scr{P}}\left\{{{L}_{\text{H}}}\right\}{(}\,{\cdot}\,{)}$ is defined by \[{\scr{P}}\left\{{{L}_{\text{H}}}\right\}{(}{\bf{y}}{)} = {\tilde{\bf{K}}}_{0}^{\top}{\tilde{\bf{W}}}_{\text{H}}^{\top}{f}_{{(}{2}\uparrow{)}}\left({{f}_{{(}{2}\downarrow{)}}\left({{W}_{\text{H}}{\bf{K}}_{0}{\bf{y}}}\right)}\right) \tag{81} \] while the low-frequency path ${\scr{P}}\left\{{{L}_{\text{L}}}\right\}{(}\,{\cdot}\,{)}$ is mathematically described by \[{\scr{P}}\left\{{{L}_{\text{L}}}\right\}{(}{\bf{y}}{)} = {\tilde{\bf{K}}}_{0}^{\top}{\tilde{\bf{W}}}_{\text{L}}^{\top}{f}_{{(}{2}\uparrow{)}}\left({{f}_{{(}{2}\downarrow{)}}\left({{W}_{\text{L}}{\bf{K}}_{0}{\bf{y}}}\right)}\right){.} \tag{82} \]
Substituting (81) and (82) in (80) results in \[{\scr{P}}\left\{{L}\right\}{(}{\bf{y}}{)} = {\tilde{\bf{K}}}_{0}^{\top}{\tilde{\bf{W}}}^{\top}{f}_{{(}{2}\uparrow{)}}\left({{f}_{{(}{2}\downarrow{)}}\left({{W}{\bf{K}}_{0}{\bf{y}}}\right)}\right){.} \tag{83} \]
For the DWT, it holds that ${\bf{Q}} = {\tilde{\bf{W}}}^{\top}{f}_{{(}{2}\uparrow{)}}\left({{f}_{{(}{2}\downarrow{)}}\left({{\bf{W}}{\bf{Q}}}\right)}\right)$. Consequently, (83) simplifies to \[{\scr{P}}\left\{{L}\right\}{(}{\bf{y}}{)} = {\tilde{\bf{K}}}_{0}^{\top}{\bf{K}}_{0}{\bf{y}}{.} \tag{84} \]
Assuming that ${\bf{K}}_{0}$ is a tight framelet, i.e., ${\tilde{\bf{K}}}_{0}^{\top}{\bf{K}}_{0} = {I}\,{\cdot}\,{c}$, with ${c} = {1}$, then \[{\scr{P}}\left\{{L}\right\}{(}{\bf{y}}{)} = {\bf{y}}{.} \tag{85} \]
This proves that the encoding-decoding section of the LWFSN allows for perfect signal reconstruction.
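The DWT identity used in the step from (83) to (84) is easy to confirm numerically, e.g., with the PyWavelets package; the image size and the Haar basis are illustrative choices.

```python
import numpy as np
import pywt  # PyWavelets

# One decomposition level of the (decimated) DWT followed by its inverse
# reconstructs any signal exactly; the Haar basis is used here.
q = np.random.default_rng(0).standard_normal((32, 32))
coeffs = pywt.dwt2(q, "haar")        # W and the downsampling f_(2, down)
q_rec = pywt.idwt2(coeffs, "haar")   # the upsampling f_(2, up) and W~^T
print(np.allclose(q, q_rec))         # True
```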
The LWFSN contains a simpler convolution structure than the networks reviewed up to now. Therefore, for a single-level decomposition architecture, the total number of operations is \[{\mathcal{O}}{(}{L}{)} = {2}\,{\cdot}\,{C}_{0}\,{\cdot}\,{N}_{\text{r}}\,{\cdot}\,{N}_{\text{c}}\,{\cdot}\,{N}_{\text{f}}^{2}\quad{[}{\text{FLOPS}}{].} \tag{86} \]
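To compare the three architectures at a glance, the operation counts (63), (75), and (86) can be evaluated directly; the values chosen below for ${C}_{0}$, ${C}_{1}$, the image size, and the filter size are arbitrary assumptions.

```python
# The operation counts (63), (75), and (86) as plain functions. The example
# values for C0, C1, the image size, and the filter size are arbitrary.
def flops_unet(C0, C1, Nr, Nc, Nf):   # (63)
    return (3 + 0.5 * C1) * C0 * Nr * Nc * Nf**2

def flops_red(C0, C1, Nr, Nc, Nf):    # (75)
    return 2 * (1 + C1) * C0 * Nr * Nc * Nf**2

def flops_lwfsn(C0, Nr, Nc, Nf):      # (86)
    return 2 * C0 * Nr * Nc * Nf**2

C0, C1, Nr, Nc, Nf = 64, 128, 512, 512, 3
print(f"U-Net: {flops_unet(C0, C1, Nr, Nc, Nf) / 1e9:.2f} GFLOPs")
print(f"RED:   {flops_red(C0, C1, Nr, Nc, Nf) / 1e9:.2f} GFLOPs")
print(f"LWFSN: {flops_lwfsn(C0, Nr, Nc, Nf) / 1e9:.2f} GFLOPs")
```

For these example values, the single-resolution residual encoder-decoder is by far the most expensive design, while the LWFSN is the cheapest, in line with the discussion above.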
To illustrate the use of clipping activations in residual noise reduction, the residual version of the LWFSN is included. Note that there are two main differences from the conventional LWFSN. First, the shrinkage functions are replaced by clipping activations. Second, the low-frequency signal is suppressed; this is done because the original design of the LWFSN does not contain any nonlinearities in that section, and it is akin to the low-frequency nulling proposed by Kwon and Ye [33]. The modified LWFSN is shown in Figure 11. It can be observed that by setting the low-frequency branch of the design to zero, the model inherently assumes that the noise is high pass.
Figure 11. The residual version of the LWFSN. It can be noticed that the low-frequency branch of the network is nulled. In deeper networks, this branch would be further decomposed and the nulling would be applied at the deepest level (lowest resolution).
The (residual) LWFSN, or (r)LWFSN, is a design that explicitly mimics wavelet-shrinkage algorithms. It can be observed that the (r)LWFSN inherently assumes that noise is high frequency and explicitly avoids nonlinear processing in the low-frequency band. Follow-up experiments also included nonlinearities in the low-frequency band of the LWFSN [34] and obtained results similar to those of the original design.
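Like the soft threshold, the clipping activation used in the rLWFSN can be expressed with ReLUs; a minimal sketch under the common definition of clipping as the complement of soft thresholding (the threshold value is an arbitrary choice):

```python
import torch

# Clipping keeps the small, noise-like amplitudes and saturates at +/- t; it is
# the complement of the soft threshold and can be built from ReLUs:
# clip_t(x) = x - (x - t)_+ + (-x - t)_+
def clipping(x, t):
    return x - torch.relu(x - t) + torch.relu(-x - t)

x = torch.linspace(-2.0, 2.0, 9)
print(clipping(x, 1.0))
# tensor([-1.0, -1.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.0, 1.0])
```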
The assumption that the convolution filters of a CNN behave as (complementary-phase) tight framelets is useful for analyzing the theoretical ability of a CNN to propagate signals. However, it is difficult to prove that trained models comply with this assumption because there are diverse elements affecting the optimization of the model, e.g., initialization of the network, the data presented to the model, and the optimization algorithm as well as its parameters. In addition, in real CNNs, there may be coadaptation between diverse CNN layers, which may prevent the individual filters of the CNN from behaving as tight framelets as the decomposition and filtering performed by one layer is not independent from the rest [35].
To test whether the behavior of the filters of trained CNNs can converge to complementary-phase tight framelets, at least in a simplified environment, we propose training a toy model, as displayed in Figure 12. If the trained filters of an encoder-decoder pair $\left({{\bf{K}}_{l},{\tilde{\bf{K}}}_{l}}\right)$ of the toy model, where l denotes one of the decomposition levels, behave as a complementary-phase tight framelet, then the pair $\left({{\bf{K}}_{l},{\tilde{\bf{K}}}_{l}}\right)$ approximately complies with the condition presented in (40), which, for identity input ${\bf{I}}$, simplifies to \[{\tilde{\bf{K}}}_{l}^{\top}{(}{\bf{K}}_{l}{)}_{+} = {\bf{I}}\,{\cdot}\,{c}_{l} \tag{87} \] in which ${c}_{l}$ is an arbitrary constant.
Figure 12. The toy model used for the experiment on the properties of the filters of a trained CNN. The dimensions for tensors ${\bf{K}}_{0}$, ${\bf{K}}_{1}$, and ${\bf{K}}_{2}$ are ${(}{6}\,{\times}\,{1}\,{\times}\,{3}\,{\times}\,{3}{),}$ ${(}{12}\,{\times}\,{6}\,{\times}\,{3}\,{\times}\,{3}{),}$ and ${(}{24}\,{\times}\,{12}\,{\times}\,{3}\,{\times}\,{3}{),}$ respectively. The network is symmetric and the filter dimensions for the decoder convolution kernels ${\tilde{\bf{K}}}_{n}$ are the same as their corresponding encoding kernel ${\bf{K}}_{n}.$
The toy model is trained on images that contain multiple randomly generated overlapping triangles. All the images were scaled to the range of [0,1]. For this experiment, the input to the network is the noise-contaminated image, and the objective/desired output is the noiseless image. For training the CNNs, normally distributed noise with a standard deviation of 0.1 was added to the ground truth. For every epoch, a batch of 192 training images was generated. For the validation and test images, we used the “astronaut” and “cameraman” images included in the Scipy software package. The model was optimized with Adam for 25 epochs with a linearly decreasing learning rate. The initial learning rate for the optimizer was set to ${10}^{{-}{3}}$, and the batch size was set to one sample. The convolution kernels were initialized with Xavier initialization using a uniform distribution (see Glorot and Bengio [36]). The code is available at IEEE’s Code Ocean at https://codeocean.com/capsule/7845737/tree.
Using the described settings, we trained the toy model and tested whether the phase-complementary tight-framelet property holds for filters of the deepest level: ${l} = {2}$. The results for the operation ${\tilde{\bf{K}}}_{2}^{\top}{(}{\bf{K}}_{2}{)}_{+}$ are displayed in Figure 13(a), which shows that when the weights of the encoder and decoder have different initial values, the kernel pair $\left({{\bf{K}}_{2},{\tilde{\bf{K}}}_{2}}\right)$ is not a complementary-phase tight framelet. We have observed that the forward and inverse filters of wavelets/framelets are often the same or at least very similar. Based on this reasoning, we initialized the toy model with the same initial values of the kernel pair $\left({{\bf{K}}_{n},{\tilde{\bf{K}}}_{n}}\right)$. As shown in Figure 13(b), with the proposed initialization, the filters of the CNN converge to tensors with properties reminiscent of complementary-phase tight-framelets. This suggests that the initialization of the CNN has an important influence on the convergence of the model to a specific solution.
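For readers who wish to repeat this test on their own trained weights, the check in (87) can be probed numerically as in the following sketch; here, enc_w and dec_w are placeholders for a trained pair $({\bf{K}}_{2},{\tilde{\bf{K}}}_{2})$, and with the random values shown, the test fails, as in Figure 13(a).

```python
import torch
import torch.nn.functional as F

# Numerical probe of (87): apply K~_l^T ( K_l . )_+ with zero biases to test
# signals and check whether the response is a scaled identity. enc_w and dec_w
# are placeholders for a trained pair (K_2, K~_2).
enc_w = torch.randn(24, 12, 3, 3)    # K_2, dimensions as in Figure 12
dec_w = torch.randn(12, 24, 3, 3)    # K~_2

x = torch.randn(8, 12, 16, 16)
y = F.conv2d(torch.relu(F.conv2d(x, enc_w, padding=1)), dec_w, padding=1)

c = (x * y).sum() / (x * x).sum()    # least-squares estimate of the constant c_2
rel_err = (y - c * x).norm() / y.norm()
print(float(c), float(rel_err))      # small rel_err suggests framelet-like behavior
```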
Figure 13. The phase-complementary tight-framelet test for the trained toy network. (a) The product ${\tilde{\bf{K}}}_{2}^{\top}({\bf{K}}_{2}{)}_{+}$, where the initializations of ${\bf{K}}_{2}$ and ${\tilde{\bf{K}}}_{2}$ are different. It can be seen that the pair ${(}{\bf{K}}_{2},{\tilde{\bf{K}}}_{2}{)}$ does not comply with the complementary-phase framelet criterion of (87). This contrasts with (b), which displays the result of the product ${\tilde{\bf{K}}}_{2}^{\top}({\bf{K}}_{2}{)}_{+}$ for the same CNN, but where the initial values of ${\tilde{\bf{K}}}_{2}$ and ${\bf{K}}_{2}$ are identical. For this initialization, the filters approximate the complementary-phase tight-framelet criterion.
Figure 14 displays test images processed with two toy models, one trained with different and one trained with the same initial values for the encoding-decoding pairs. It can be observed that there are no significant differences between the images produced by both models. In Figure 14(e) and (f), we set the bias of both networks to zero. In this case, it is expected that the networks will reconstruct the noisy input, as confirmed by the figure, where both CNNs partly reconstruct the original noisy signal. This result suggests that the ReLU plus bias pairs operate akin to the low-rank approximation mechanism proposed by the TDCFs.
Figure 14. The processed “cameraman” image for (in)dependently sampled initializations of the encoding and decoding filters. (a) The noise-contaminated input $\left({{\sigma}_{\eta} = {0}{.}{1}}\right)$ and (d) the noiseless reference. (b) and (e) show the noisy image processed with the toy model trained with different initial values for its convolution filters, while (c) and (f) show images processed with the model where the same initial values are used for the encoding and decoding filters. (b) and (c) are nearly identical in terms of quality and signal-to-noise ratio (SNR), so the initialization has no noticeable effect on the denoised estimates. (e) and (f) are produced by the same models as (b) and (c), respectively, but with the biases set to zero. As expected, the noise is partly reconstructed.
The following conclusions can be drawn from this experiment. First, the filters of the CNN may not necessarily converge to complementary-phase tight framelets. This is possibly caused by initialization of the network and/or the interaction/coadaptation between the multiple encoder/decoder layers. Second, we confirm that for our experimental setting, the low-rank approximation behavior in the CNN can be observed. For example, when setting the biases and thresholds to zero, part of the noise texture (high-rank information) is recovered. Third, it is possible that linear filtering happens in the network as well, which may explain why noise texture is not fully recovered when setting the biases to zero. Fourth and finally, we observed that the behavior of the trained models changes drastically depending on factors such as the learning rate and initialization values of the model. For this reason, we consider this experiment and its outcome more as a proof of concept, where further investigation is needed.
From the explanations in the “Nonlinear Signal Estimation in the Framelet Domain” section, it can be noted that the bias/threshold used in CNNs can modulate how much of the signal is suppressed by the nonlinearities. In addition, the “Additional Links Between Encoding-Decoding CNNs and Existing Signal Processing Techniques” section established that there are additional mechanisms for noise reduction within the CNN, such as the Wiener-like behavior observed by Mohan et al. [14]. This raises the question of how robust conventional CNNs are to noise-level changes different from the level at which the model has been trained. To perform such an experiment, we trained two variants of the toy model. The first variant is inspired by the multiscale sparse coding network of Mentl et al. [17], where the biases of each of the nonlinearities (in this case, an ReLU) are multiplied by an estimate of the standard deviation of the noise. In this design, the noise estimate ${\hat{\sigma}}_{\eta}$ is computed in accordance with Chang et al. [1] and is defined by \[{\hat{\sigma}}_{\eta} = {1.4826}\,{\cdot}\,{\text{median}}\left({{\vert}{\bf{f}}_{\text{HH}}\ast{\bf{x}}{\vert}}\right){.} \tag{88} \]
Here, ${\bf{f}}_{\text{HH}}$ is the diagonal convolution filter of the DWT with the Haar basis. For comparison purposes, we refer to this model as the adaptive toy model. The second variant of the toy model examines the case where the convolution layers of the model do not add bias to the signal. This model is based on the bias-free CNNs proposed by Mohan et al. [14], in which the bias of every convolution filter is set to zero during training. The purpose of this setting is to achieve better generalization, as it is claimed that this modification causes the model to behave independently of the noise level.
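A hedged sketch of the estimator in (88) follows, implemented here with an undecimated diagonal Haar filter, which leaves the median-based estimate unchanged for stationary noise; the image size and noise level are illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

# The noise-level estimate of (88): 1.4826 times the median absolute value of
# the diagonal (HH) Haar detail coefficients of the input image.
def estimate_sigma(x):
    f_hh = np.array([[0.5, -0.5],
                     [-0.5, 0.5]])            # diagonal Haar filter f_HH
    hh = convolve2d(x, f_hh, mode="valid")
    return 1.4826 * np.median(np.abs(hh))

rng = np.random.default_rng(0)
noisy = np.zeros((128, 128)) + 0.1 * rng.standard_normal((128, 128))
print(estimate_sigma(noisy))                  # approximately 0.1
```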
We trained the described variants of the toy model with the same settings as the experiment in the “Properties of Convolution Kernels and Low-Rank Approximation” section. The three models are evaluated on the test image with varying noise levels: ${\sigma}_{\eta}\in\left[{{0}{.}{1},{0}{.}{15},{0}{.}{175},{0}{.}{2},{0}{.}{225}}\right]$. The result of this evaluation is displayed in Figure 15. These results confirm that the performance of the original toy model degrades at higher noise levels. In contrast, the adaptive and bias-free toy models perform better than the original toy model for most of the noise levels.
Figure 15. A comparison of the baseline (original) toy model against its adaptive and bias-free variants. The models are evaluated on the cameraman picture with increasing noise levels. (a) The noisy input. (b) The images processed with the original toy model. (c) Results of the adaptive toy model. (d) Results corresponding to the bias-free model. It can be observed that the performance of the original toy model degrades as the noise level increases, while the adaptive and bias-free models degrade less with increased noise levels, resulting in pictures with lower residual noise.
The results of this experiment confirm the diverse noise-reduction mechanisms within a CNN and show that CNNs have certain modeling limitations, for example, the lack of noise-level invariance. This can be addressed by further incorporating prior knowledge into the model, as in the adaptive model, or by forcing the model toward a more Wiener-like behavior, as in the bias-free model. In the case of the bias-free model, note that, theoretically, it should be possible to obtain exactly the same behavior with the original toy model if the biases of that model had converged to zero. This reasoning suggests that the large number of free parameters and the nonlinear behavior of the model can prevent finding the optimal/robust solution, in which case the incorporation of prior knowledge can help improve the model.
When choosing or designing a CNN for a specific noise-reduction application, multiple choices and design elements should be considered, for example, the required performance, memory required to train/deploy models, whether certain signal preservation characteristics are required, target execution time for the model, characteristics of the images being processed, and so on. Based on these requirements, diverse design elements of CNNs can be more or less desirable, for example, the activation functions, use of single/multiresolution models, need for skip connections, and so forth. The following sections briefly discuss such elements by focusing on the impact that such elements have in terms of performance and potential computational cost. A summary of the main conclusions of these elements is included in Table 2.
Table 2. Design elements and their impact on performance and computation cost.
In the literature, the most common activation function in CNNs is the ReLU. There are two main advantages of the ReLU with respect to other activations. First, ReLUs potentially enforce more sparsity in the feature maps than, for example, soft shrinkage, because ReLUs not only cancel the small values of the feature maps, as shrinkage functions do, but also cancel all the negative values. The second advantage of the ReLU is its capacity to approximate other functions (see the “Shrinkage and Clipping in ReLU Networks” section). Note that the high capacity of the ReLU to represent other functions [13], [29] (often referred to as expressivity) may also be one of the reasons why these models are prone to overfitting.
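As an illustration of this expressivity, the soft threshold itself can be written exactly with two ReLUs and fixed biases, ${\tau}_{t}^{\text{soft}}{(}{x}{)} = {(}{x}{-}{t}{)}_{+}{-}{(}{-}{x}{-}{t}{)}_{+}$; a minimal sketch:

```python
import torch

# The soft threshold written exactly with two ReLUs and fixed biases:
# soft_t(x) = (x - t)_+ - (-x - t)_+
def soft_threshold(x, t):
    return torch.relu(x - t) - torch.relu(-x - t)

x = torch.linspace(-2.0, 2.0, 9)
print(soft_threshold(x, 1.0))
# tensor([-1.0, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0])
```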
The better expressivity of ReLU CNNs may be the reason why, at the time of this writing, ReLU-based CNNs perform marginally better than shrinkage-based models in terms of metrics such as the signal-to-noise ratio or the structural similarity index metric [19], [37], [38]. Despite this small benefit, the visual characteristics of the estimates produced by ReLU- and shrinkage-based networks are very similar. Furthermore, the computational cost of ReLU-based designs is potentially higher than that of designs with shrinkage functions because ReLUs require more feature maps to preserve signal integrity. For example, the LWFSN shown in the “LWFSN” section achieves a performance very close to that of the FBPConvNet and the tight-frame U-Net for noise reduction in computerized tomography, but with only a small fraction of the total trainable parameters, which allows for a faster and less computation-expensive model [19].
As a concluding remark, it can be noted that, regardless of the expressivity of the ReLU activation, it is not entirely clear whether ReLU activations outperform other functions, such as the soft threshold, in general. We were unable to find articles that specifically focus on comparing the performance of ReLU- and shrinkage-based models. In spite of this, there are some works that compare shrinkage-based CNNs with other (architecturally different) models based on ReLUs, and these indicate that the compared ReLU-based designs slightly outperform the shrinkage-based ones. For example, Herbreteau and Kervrann [38] proposed the DCT2-Net, a shrinkage-based CNN, which, despite its good performance, is still outperformed by the ReLU-based denoising CNN (DnCNN) [7]. Similar behavior was observed by Zavala-Mondragón et al. [19], where their shrinkage-based LWFSN could not outperform the ReLU-based FBPConvNet [21] nor the tight-frame U-Net [6]. Another similar case is the deep K-singular-value-decomposition network [37], which achieves a performance close to (but slightly below) that of the ReLU-based DnCNN. One of the few examples we found where a shrinkage-based model performed better than an ReLU-based one is the work of Fan et al. [18], who compared variants of the soft autoencoder and found that the shrinkage-based variant outperformed the ReLU variant.
The advantage of single-scale models is that they avoid aliasing because no down-/upsampling layers are used. Still, this comes at the expense of more computations and more memory. Furthermore, single-scale models may require larger filters and/or deeper networks to achieve the same receptive field as multiscale models, which may further increase their computation costs.
In the case of multiscale models, the main consideration should be that the down-/upsampling structure should allow perfect signal reconstruction to avoid introducing aliasing and/or distortion to the image estimates (e.g., the DWT in the tight-frame U-Net and in the LWFSN).
Residual noise-reduction CNNs often perform better than their nonresidual counterparts (e.g., the U-Net versus FBPConvNet and the LWFSN versus the rLWFSN). This may be because the trained models have more freedom to learn the filters because the design does not need to learn to reconstruct the noiseless signal, it need only estimate the noise [12]. Also, it can be observed that nonresidual models potentially need more parameters than residual networks because the propagation/reconstruction of the noiseless signal is also dependent on the number of channels of the network.
Defining the state of the art in image denoising with CNNs is challenging for diverse reasons. First, there is a wide variety of available CNNs, which are not often compared to each other. Second, the suitability of a CNN for a given task may depend on image and noise characteristics, such as noise distribution and (non)stationarity. Third, the large number of variables in terms of, e.g., optimization, data, and data augmentation, adds reproducibility issues, which further complicate making a fair comparison among all the available models [11]. In addition, it should be noted that for many of the existing models, the performance gap between state-of-the-art models and other CNNs is often small.
Despite the aforementioned challenges, we have found some models that could be regarded as the state of the art. The first is the denoising residual U-Net (DRU-Net) [39], which is a bias-free model [14] that incorporates a U-Net architecture with residual blocks. In addition, the DRU-Net uses an additional input to indicate the noise intensity to the network, which increases its generalization to different noise levels. An additional state-of-the-art model is the DnCNN [7]. This network is residual and single scale, and it uses ReLU activations. Another state-of-the-art model is the multilevel-wavelet CNN [40], which has a design very similar to that of the tight-frame U-Net [6]. Both of these models are based on the original U-Net design [32] but are deployed in a residual configuration, and the down-/upsampling structure is based on the DWT. Furthermore, in addition to being used as standalone encoding-decoding CNNs, CNNs have been used as proximal operators within model-based methods [39], [41], which further improves the denoising power of nonmodel-based encoding-decoding CNNs.
In this article, the widely used encoding-decoding CNN architecture was analyzed from several signal processing principles. This analysis revealed the following conclusions. 1) Multiple signal processing concepts converge in the mathematical formulation of encoding-decoding CNN models. For example, the convolution and down-/upsampling structure of the encoder-decoder is akin to the framelet decomposition, while the activation functions are rooted in classical signal estimators. In addition, linear filtering may also happen within the model. 2) The activations implicitly assume noise and signal characteristics of the feature maps. 3) There are still many signal processing developments that can be integrated with current CNNs, further improving their performance in terms of accuracy, efficiency, or robustness.
Despite the signal processing nature of encoding-decoding CNNs, at the time of this writing, the integration of CNNs and existing signal processing algorithms is at an early stage. A clear example of the signal modeling limitations of current CNN denoisers is the activation functions, where the estimators provided by current activation layers neglect the spatial correlation of the feature maps. Possible alternatives for solving this limitation could be activation functions inspired by denoisers working on principles such as Markov random fields [42], locally spatial indicators [43], and multiscale shrinkage [24]. Further ideas are provided by the extensive survey on denoising algorithms by Pižurica and Philips [44]. Additional approaches that can be further explored are nonlocal [45] and collaborative filtering [46]. Both techniques exploit the redundancy in natural images, and only a few models are exploring these properties [47], [48].
Finally, we encourage the reader to actively consider the properties of the signals being processed, the design requirements, and existing signal processing algorithms when designing new CNNs. By doing so, we expect that the next generation of CNN denoisers will not only perform better but also be more interpretable and reliable.
We thank Dr. Ulugbek Kamilov and the anonymous reviewers for their valuable feedback and suggestions for this article.
Luis Albert Zavala-Mondragón (lzavala905@gmail.com) received his M.Sc. degree in electrical engineering from Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands, where he is currently a Ph.D. candidate in signal processing. He has experience in the field of hardware emulation (Intel, Mexico) and computer vision (Thirona, The Netherlands). His research interests include the development of efficient and explainable computer vision pipelines for health-care applications. He is a Student Member of IEEE.
Peter H.N. de With (p.h.n.de.with@tue.nl) received his Ph.D. degree in computer vision from Delft University of Technology, where he is a full professor and leads the Video Coding and Architectures research group at Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands. He was a researcher at Philips Research Labs, full professor at the University of Mannheim, and VP of video technology at CycloMedia. He is the coauthor of more than 70 refereed book chapters and journal articles, 500 conference publications, and 40 international patents. He was a Technical Committee member of the IEEE Consumer Electronics Society, ICIP, Society of Photo-Optical Instrumentation Engineers, and he is co-recipient of multiple paper awards. He is a Fellow of IEEE and a member of the Royal Holland Society of Sciences and Humanities.
Fons van der Sommen (fvdsommen@tue.nl) received his Ph.D. degree in computer vision. He is an associate professor at Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands. As head of the health-care cluster at the Video Coding and Architectures research group, he has worked on a variety of image processing and computer vision applications, mainly in the medical domain. His research interests include signal processing and information theory and strives to exploit methods from these fields to improve the robustness, efficiency, and interpretability of modern-day artificial intelligence architectures, such as convolutional neural networks. He is a Member of IEEE.
[1] S. G. Chang, B. Yu, and M. Vetterli, “Adaptive wavelet thresholding for image denoising and compression,” IEEE Trans. Image Process., vol. 9, no. 9, pp. 1532–1546, Sep. 2000, doi: 10.1109/83.862633.
[2] M. Elad and M. Aharon, “Image denoising via learned dictionaries and sparse representation,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit. (CVPR), Piscataway, NJ, USA: IEEE Press, 2006, vol. 1, pp. 895–900, doi: 10.1109/CVPR.2006.142.
[3] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Phys. D, Nonlinear Phenomena, vol. 60, nos. 1–4, pp. 259–268, Nov. 1992, doi: 10.1016/0167-2789(92)90242-F.
[4] K. H. Jin and J. C. Ye, “Annihilating filter-based low-rank Hankel matrix approach for image inpainting,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3498–3511, Nov. 2015, doi: 10.1109/TIP.2015.2446943.
[5] H. Chen et al., “Low-dose CT with a residual encoder-decoder convolutional neural network,” IEEE Trans. Med. Imag., vol. 36, no. 12, pp. 2524–2535, Dec. 2017, doi: 10.1109/TMI.2017.2715284.
[6] Y. Han and J. C. Ye, “Framing u-net via deep convolutional framelets: Application to sparse-view CT,” IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1418–1429, Jun. 2018, doi: 10.1109/TMI.2018.2823768.
[7] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, Jul. 2017, doi: 10.1109/TIP.2017.2662206.
[8] T. Yokota, H. Hontani, Q. Zhao, and A. Cichocki, “Manifold modeling in embedded space: An interpretable alternative to deep image prior,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 3, pp. 1022–1036, Mar. 2022, doi: 10.1109/TNNLS.2020.3037923.
[9] K. C. Kusters, L. A. Zavala-Mondragón, J. O. Bescós, P. Rongen, P. H. de With, and F. van der Sommen, “Conditional generative adversarial networks for low-dose CT image denoising aiming at preservation of critical image content,” in Proc. 43rd Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), Piscataway, NJ, USA: IEEE Press, 2021, pp. 2682–2687, doi: 10.1109/EMBC46164.2021.9629600.
[10] H. Gupta, K. H. Jin, H. Q. Nguyen, M. T. McCann, and M. Unser, “CNN-based projected gradient descent for consistent CT image reconstruction,” IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1440–1453, Jun. 2018, doi: 10.1109/TMI.2018.2832656.
[11] M. T. McCann, K. H. Jin, and M. Unser, “Convolutional neural networks for inverse problems in imaging: A review,” IEEE Signal Process. Mag., vol. 34, no. 6, pp. 85–95, Nov. 2017, doi: 10.1109/MSP.2017.2739299.
[12] J. C. Ye, Y. Han, and E. Cha, “Deep convolutional framelets: A general deep learning framework for inverse problems,” SIAM J. Imag. Sci., vol. 11, no. 2, pp. 991–1048, 2018, doi: 10.1137/17M1141771.
[13] J. C. Ye and W. K. Sung, “Understanding geometry of encoder-decoder CNNs,” in Proc. 36th Int. Conf. Mach. Learn., PMLR, Cambridge, MA, USA, 2019, pp. 7064–7073.
[14] S. Mohan, Z. Kadkhodaie, E. P. Simoncelli, and C. Fernandez-Granda, “Robust and interpretable blind image denoising via bias-free convolutional neural networks,” in Proc. Int. Conf. Learn. Representations, 2020. [Online]. Available: https://iclr.cc/virtual_2020/poster_HJlSmC4FPS.html
[15] M. Unser, “From kernel methods to neural networks: A unifying variational formulation,” 2022, arXiv:2206.14625.
[16] V. Papyan, Y. Romano, and M. Elad, “Convolutional neural networks analyzed via convolutional sparse coding,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 2887–2938, 2017.
[17] K. Mentl et al., “Noise reduction in low-dose CT using a 3D multiscale sparse denoising autoencoder,” in Proc. IEEE 27th Int. Workshop Mach. Learn. Signal Process. (MLSP), Piscataway, NJ, USA: IEEE Press, 2017, pp. 1–6, doi: 10.1109/MLSP.2017.8168176.
[18] F. Fan, M. Li, Y. Teng, and G. Wang, “Soft autoencoder and its wavelet adaptation interpretation,” IEEE Trans. Comput. Imag., vol. 6, pp. 1245–1257, Aug. 2020, doi: 10.1109/TCI.2020.3013796.
[19] L. A. Zavala-Mondragón, P. Rongen, J. O. Bescós, P. H. de With, and F. van der Sommen, “Noise reduction in CT using learned wavelet-frame shrinkage networks,” IEEE Trans. Med. Imag., vol. 41, no. 8, pp. 2048–2066, Aug. 2022, doi: 10.1109/TMI.2022.3154011.
[20] L. A. Zavala-Mondragón, P. H. de With, and F. van der Sommen, “Image noise reduction based on a fixed wavelet frame and CNNs applied to CT,” IEEE Trans. Image Process., vol. 30, pp. 9386–9401, 2021, doi: 10.1109/TIP.2021.3125489.
[21] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4509–4522, Sep. 2017, doi: 10.1109/TIP.2017.2713099.
[22] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit., Piscataway, NJ, USA: IEEE Press, 2007, pp. 1–8, doi: 10.1109/CVPR.2007.383157.
[23] E. J. Candes and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Process. Mag., vol. 25, no. 2, pp. 21–30, Mar. 2008, doi: 10.1109/MSP.2007.914731.
[24] L. Sendur and I. W. Selesnick, “Bivariate shrinkage functions for wavelet-based denoising exploiting interscale dependency,” IEEE Trans. Signal Process., vol. 50, no. 11, pp. 2744–2756, Nov. 2002, doi: 10.1109/TSP.2002.804091.
[25] G. Steidl and J. Weickert, “Relations between soft wavelet shrinkage and total variation denoising,” in Proc. Joint Pattern Recognit. Symp., Berlin, Germany: Springer-Verlag, 2002, pp. 198–205, doi: 10.1007/3-540-45783-6_25.
[26] T. Blu and F. Luisier, “The SURE-LET approach to image denoising,” IEEE Trans. Image Process., vol. 16, no. 11, pp. 2778–2786, Nov. 2007, doi: 10.1109/TIP.2007.906002.
[27] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Commun. Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, Nov. 2004, doi: 10.1002/cpa.20042.
[28] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 399–406.
[29] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova, “Nonlinear approximation and (deep) ReLU networks,” Constructive Approximation, vol. 55, no. 1, pp. 127–172, 2022, doi: 10.1007/s00365-021-09548-z.
[30] J. C. Ye, J. M. Kim, K. H. Jin, and K. Lee, “Compressive sampling using annihilating filter-based low-rank interpolation,” IEEE Trans. Inf. Theory, vol. 63, no. 2, pp. 777–801, Feb. 2017, doi: 10.1109/TIT.2016.2629078.
[31] A. Cichocki, R. Zdunek, and S.-i. Amari, “Nonnegative matrix and tensor factorization [Lecture Notes],” IEEE Signal Process. Mag., vol. 25, no. 1, pp. 142–145, 2008, doi: 10.1109/MSP.2008.4408452.
[32] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi, Eds. Cham, Switzerland: Springer International Publishing, 2015, pp. 234–241.
[33] T. Kwon and J. C. Ye, “Cycle-free CycleGAN using invertible generator for unsupervised low-dose CT denoising,” IEEE Trans. Comput. Imag., vol. 7, pp. 1354–1368, 2021, doi: 10.1109/TCI.2021.3129369.
[34] L. A. Zavala-Mondragón et al., “On the performance of learned and fixed-framelet shrinkage networks for low-dose CT denoising,” Med. Imag. Deep Learn., 2022. [Online]. Available: https://openreview.net/pdf?id=WGLqD0zHXy9
[35] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” 2012, arXiv:1207.0580.
[36] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. 13th Int. Conf. Artif. Intell. Statist., JMLR Workshop Conf. Proc., 2010, pp. 249–256.
[37] M. Scetbon, M. Elad, and P. Milanfar, “Deep K-SVD denoising,” IEEE Trans. Image Process., vol. 30, pp. 5944–5955, Jun. 2021, doi: 10.1109/TIP.2021.3090531.
[38] S. Herbreteau and C. Kervrann, “DCT2net: An interpretable shallow CNN for image denoising,” IEEE Trans. Image Process., vol. 31, pp. 4292–4305, Jun. 2022, doi: 10.1109/TIP.2022.3181488.
[39] K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, and R. Timofte, “Plug-and-play image restoration with deep denoiser prior,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 10, pp. 6360–6376, Oct. 2022, doi: 10.1109/TPAMI.2021.3088914.
[40] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo, “Multi-level wavelet-CNN for image restoration,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. Workshops, 2018, pp. 773–782.
[41] V. Monga, Y. Li, and Y. C. Eldar, “Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing,” IEEE Signal Process. Mag., vol. 38, no. 2, pp. 18–44, Mar. 2021, doi: 10.1109/MSP.2020.3016905.
[42] M. Malfait and D. Roose, “Wavelet-based image denoising using a Markov random field a priori model,” IEEE Trans. Image Process., vol. 6, no. 4, pp. 549–565, Apr. 1997, doi: 10.1109/83.563320.
[43] A. Pižurica and W. Philips, “Estimating the probability of the presence of a signal of interest in multiresolution single- and multiband image denoising,” IEEE Trans. Image Process., vol. 15, no. 3, pp. 654–665, Mar. 2006, doi: 10.1109/TIP.2005.863698.
[44] A. Pižurica, “Image denoising algorithms: From wavelet shrinkage to nonlocal collaborative filtering,” in Wiley Encyclopedia of Electrical and Electronics Engineering. Hoboken, NJ, USA: Wiley, 1999, pp. 1–17.
[45] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image restoration by sparse 3D transform-domain collaborative filtering,” in Proc. SPIE Image Process., Algorithms Syst. VI, International Society for Optics and Photonics, Bellingham, WA, USA, 2008, vol. 6812, pp. 62–73.
[46] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit. (CVPR), Piscataway, NJ, USA: IEEE Press, 2005, vol. 2, pp. 60–65, doi: 10.1109/CVPR.2005.38.
[47] D. Yang and J. Sun, “BM3D-Net: A convolutional neural network for transform-domain collaborative filtering,” IEEE Signal Process. Lett., vol. 25, no. 1, pp. 55–59, 2018, doi: 10.1109/LSP.2017.2768660.
[48] H. Lee, H. Choi, K. Sohn, and D. Min, “KNN local attention for image restoration,” in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit., 2022, pp. 2139–2149.