Nir Shlezinger, Tirza Routtenberg
Inference tasks in signal processing are often characterized by the availability of reliable statistical modeling with some missing instance-specific parameters. One conventional approach uses data to estimate these missing parameters and then infers based on the estimated model. Alternatively, data can also be leveraged to directly learn the inference mapping end to end. These approaches for combining partially known statistical models and data in inference are related to the notions of generative and discriminative models used in the machine learning literature [1], [2], typically considered in the context of classifiers.
The goal of this “Lecture Notes” column is to introduce the concepts of generative and discriminative learning for inference with a partially known statistical model. While machine learning systems often lack the interpretability of traditional signal processing methods, we focus on a simple setting where one can interpret and compare the approaches in a tractable manner that is accessible and relevant to signal processing readers. In particular, we exemplify the approaches for the task of Bayesian signal estimation in a jointly Gaussian setting with the mean-square error (MSE) objective, i.e., a linear estimation setting. Here, the discriminative end-to-end approach directly learns the linear minimum MSE (LMMSE) estimator, while the generative strategy yields a two-stage estimator, which first uses data to fit the linear model and then formulates the LMMSE estimator for the fitted model. The ability to derive these estimators in closed form facilitates their analytical comparison. It is rigorously shown that discriminative learning results in an estimate that is more robust to mismatches in the mathematical description of the setup. Generative learning, which utilizes prior knowledge on the distribution of the signals, can exploit this prior to achieve improved MSE in some settings. These analytical findings are demonstrated in a numerical study, which is available online as a Python Notebook so that it can be presented alongside the lecture detailed in this note.
Signal processing algorithms traditionally rely on mathematical models for describing the problem at hand. These models correspond to domain knowledge obtained from, e.g., established statistical models and understanding of the underlying physics. In practice, statistical models often include parameters that are unknown in advance, such as noise levels and channel coefficients, and are estimated from data.
Recent years have witnessed the dramatic success of machine learning and, particularly, of deep learning in domains such as computer vision and natural language processing [3]. For inference tasks, these data-driven methods typically learn the inference rule directly from data rather than estimating missing parameters in the underlying model, and they can operate without any mathematical modeling. Nonetheless, when one has access to some level of domain knowledge, it can be harnessed to design inference rules that benefit over black-box approaches in terms of performance, interpretability, robustness, complexity, and flexibility [4]. This is achieved by formulating the suitable inference rule given full domain knowledge and then using data to optimize the resulting solver directly with various methodologies, including learned optimization [5], deep unfolding [6], and the augmentation of classic algorithms with trainable modules [7].
The fact that signal processing tasks are often carried out based on partial domain knowledge, i.e., statistical models with some missing parameters and data, motivates inspecting which design approach is preferable: is it the model-oriented approach of using the data to estimate the missing parameters or the task-oriented strategy, which leverages data to directly optimize a suitable solver in an end-to-end manner? These approaches can be related to the notions of generative learning and discriminative learning, typically considered in the machine learning literature in the context of classification tasks [8, Ch. 3]. In this “Lecture Notes” column, we address this fundamental question for an analytically tractable setting of linear Bayesian estimation, for which the approaches can be rigorously compared, connecting machine learning concepts with interpretable signal processing practices and techniques.
This “Lecture Notes” column is intended to be as self-contained as possible and suitable for the undergraduate level without a deep background in estimation theory and machine learning. As such, it requires only basic knowledge of probability and calculus.
To formulate the considered problem, we first review some basic concepts in statistical inference, following [9]. Then, we elaborate on model-based and data-driven approaches for inference. Finally, we present the running example considered in the remainder of this “Lecture Notes” column of linear estimation in partially known measurement models.
The term inference refers to the ability to conclude based on evidence and reasoning. While this generic definition can refer to a broad range of tasks, we focus, in our description, on systems that estimate or make predictions based on a set of observed measurements. In this wide family of problems, the system is required to map an input variable x, taking values in an input space ${\cal{X}}$, into a prediction of a target variable y, which takes values in the target space ${\cal{Y}}$.
The inputs are related to the targets via a probability measure ${\cal{P}}$, referred to as the data-generating distribution, which is defined over ${\cal{X}}\,{\times}\,{\cal{Y}}$. Formally, ${\cal{P}}$ is a joint distribution over the domain of inputs and targets. One can view such a distribution as being composed of two parts: a distribution over the unlabeled input ${\cal{P}}_{\boldsymbol{x}}$, which sometimes is called the marginal distribution, and the conditional distribution over the targets given the inputs ${\cal{P}}_{{\boldsymbol{y}}\,{\vert}\,{\boldsymbol{x}}}$, also referred to as the discriminative or inverse distribution.
Inference rules can, thus, be expressed as mappings of the form \[{f}{:}{\cal{X}}\,{\mapsto}\,{\cal{Y}}{.} \tag{1} \]
We write the decision variable for a given input ${\boldsymbol{x}}\,{\in}\,{\cal{X}}$ as ${\hat{\boldsymbol{y}}} = {f}{(}{\boldsymbol{x}}{)}\,{\in}\,{\cal{Y}}$. The space of all possible inference mappings of the form (1) is denoted by ${\scr{F}}$. The fidelity of an inference mapping is measured using a loss function \[{l}{:}{\scr{F}}\,{\times}\,{\cal{X}}\,{\times}\,{\cal{Y}}\,{\mapsto}\,{\Bbb{R}} \tag{2} \] with $\Bbb{R}$ being the set of real numbers. We are generally interested in carrying out inference that minimizes the risk function, also known as the generalization error, given by \[{\cal{L}}_{\cal{P}}{(}{f}{)}\,{≜}\,{\Bbb{E}}_{{(}{\boldsymbol{x}},{\boldsymbol{y}}{)}\,{\sim}\,{\cal{P}}}{\{}{l}{(}{f},{\boldsymbol{x}},{\boldsymbol{y}}{)\}} \tag{3} \] where ${\Bbb{E}}{\{}{\cdot}{\}}$ is the stochastic expectation. Thus, the goal is to design the inference rule ${f}{(}{\cdot}{)}$ to minimize the generalization error ${\cal{L}}_{\cal{P}}{(}{f}{)}$ for a given problem.
The risk function in (3) allows us to evaluate inference rules and to formulate the desired mapping as the one that minimizes ${\cal{L}}_{\cal{P}}{(}{f}{)}$. The main question is how to find this mapping. Approaches to design ${f}{(}{\cdot}{)}$ can be divided into two main strategies: the statistical model-based strategy, referred to henceforth as model based, and the pure machine learning approach, which relies on data and is, thus, referred to as data driven. The main difference between these strategies is what information is utilized to tune ${f}{(}\cdot{)}$.
Model-based methods, also referred to as hand-designed schemes, set their inference rule, i.e., tune ${f}{(}{\cdot}{)}$ in (1), to minimize the risk function ${\cal{L}}_{\cal{P}}{(}\cdot{)}$ based on full domain knowledge. The term domain knowledge typically refers to prior knowledge of the underlying statistics relating the input x and the target y, where the term full domain knowledge implies that the joint distribution ${\cal{P}}$ is known. For instance, under the squared-error loss, ${l}_{\text{MSE}}{(}{f},{\boldsymbol{x}},{\boldsymbol{y}}{)} = {\left\Vert{\boldsymbol{y}}{-}{f}{(}{\boldsymbol{x}}{)}\right\Vert}_{2}^{2}$, the optimal inference rule is the minimum MSE (MMSE) estimator, given by the conditional expected value ${f}_{\text{MMSE}}{(}{\boldsymbol{x}}{)} = {\Bbb{E}}_{{\cal{P}}_{{\boldsymbol{y}}\,{\vert}\,{\boldsymbol{x}}}}{\{}{\boldsymbol{y}}\,{\vert}\,{\boldsymbol{x}}{\}}$, whose computation requires knowledge of the discriminative distribution ${\cal{P}}_{{\boldsymbol{y}}\,{\vert}\,{\boldsymbol{x}}}$.
Model-based methods are the foundations of statistical signal processing. However, in practice, accurate knowledge of the distribution that relates the observations and the desired information is often unavailable. Thus, applying such techniques commonly requires imposing some assumptions on the underlying statistics, which, in some cases, reflects the actual behavior but may also constitute a crude approximation of the true dynamics. Moreover, in the presence of inaccurate model knowledge, either as a result of estimation errors or due to enforcing a model that does not fully capture the environment, the performance of model-based techniques tends to degrade. Finally, solving the coupled problem of model selection and estimation of the parameters of the selected model is a difficult task, with nontrivial inference rules [10], [11].
Data-driven methods learn their mapping from data rather than from statistical modeling. This approach builds upon the fact that, while in many applications coming up with accurate and tractable statistical modeling is difficult, we are often given access to data describing the setup. In the supervised setting considered henceforth, the data comprise a training set consisting of ${n}_{t}$ pairs of inputs and their corresponding target values, denoted by ${\cal{D}} = {\{}{\boldsymbol{x}}_{t},{\boldsymbol{y}}_{t}{\}}_{{t} = {1}}^{{n}_{t}}$.
Machine learning provides various data-driven methods that form an inference rule f from the data ${\cal{D}}$. Broadly speaking and following the terminology of [1], these approaches can be divided into two main classes: generative learning, which uses ${\cal{D}}$ to estimate the data-generating distribution ${\cal{P}}$, denoted ${\hat{\cal{P}}}_{\cal{D}}$, and then sets the inference rule to minimize the risk computed with respect to the estimated distribution, i.e., \[{f}^{\ast} = \mathop{\arg\min}\limits_{{f}\,{\in}\,{\scr{F}}}{\cal{L}}_{{\hat{\cal{P}}}_{\cal{D}}}{(}{f}{)} \tag{4} \] and discriminative learning, which uses ${\cal{D}}$ to directly tune the inference rule based on the empirical risk \[{\cal{L}}_{\cal{D}}{(}{f}{)}\,{≜}\,\frac{1}{{n}_{t}} \mathop{\sum}\limits_{{t} = {1}}\limits^{{n}_{t}}{l}{(}{f},{\boldsymbol{x}}_{t},{\boldsymbol{y}}_{t}{)}{.} \tag{5} \]
To avoid overfitting, i.e., coming up with an inference rule that minimizes (5) by memorizing the data, one has to constrain the set ${\scr{F}}$. This requires imposing a structure on the mapping, which is often dictated by a set of parameters denoted by ${\theta}$, taking values in some parameter set ${\Theta}$, as considered henceforth. Thus, the system mapping is written as ${f}_{\theta}$, which is tuned via \begin{align*}{\theta}^{\ast} & = \mathop{\arg\min}\limits_{{\theta}\,{\in}\,{\Theta}}{\cal{L}}_{\cal{D}}{(}{f}_{\theta}{)} \\ & = \mathop{\arg\min}\limits_{{\theta}\,{\in}\,{\Theta}} \frac{1}{{n}_{t}} \mathop{\sum}\limits_{{t} = {1}}\limits^{{n}_{t}}{l}{(}{f}_{\theta},{\boldsymbol{x}}_{t},{\boldsymbol{y}}_{t}{)} \tag{6} \end{align*} where the last equality is obtained by substituting (5). In practice, the empirical risk in (6) is often combined with regularizing terms to facilitate training and mitigate overfitting.
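As a concrete, hypothetical illustration of the empirical risk minimization in (6), the following sketch fits a linear parametric family ${f}_{\theta}{(}{\boldsymbol{x}}{)} = {\bf{A}}{\boldsymbol{x}} + {\boldsymbol{b}}$ under the squared-error loss; the data and parameter values are illustrative choices, not taken from this column. In this special case, the empirical risk minimizer is available in closed form via least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy supervised dataset (illustrative): y depends linearly on x plus noise.
n_t, N_x, N_y = 500, 3, 2
X = rng.standard_normal((n_t, N_x))
A_true = rng.standard_normal((N_y, N_x))
Y = X @ A_true.T + 0.1 * rng.standard_normal((n_t, N_y))

# Empirical risk minimization (6) over the family f_theta(x) = A x + b:
# augment x with a constant 1 so that [A, b] is found by one least-squares solve.
X_aug = np.hstack([X, np.ones((n_t, 1))])
theta, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
A_hat, b_hat = theta[:-1].T, theta[-1]

# Empirical risk (5) attained by the fitted mapping.
emp_risk = np.mean(np.sum((Y - X @ A_hat.T - b_hat) ** 2, axis=1))
```

With enough samples, the fitted parameters approach the underlying ones and the empirical risk approaches the noise floor.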
One can further divide discriminative models into the following categories: model-aware discriminative models, whose parameterized mapping ${f}_{\theta}$ is structured based on the available domain knowledge, e.g., via deep unfolding, and model-agnostic discriminative models, which employ generic parameterizations, such as deep neural networks (DNNs), that are invariant to the underlying statistical model.
The different design approaches are illustrated in Figure 1. Many tasks encountered in practical applications can, in fact, be tackled using any of these approaches. For instance, superresolution, which is a fundamental task in biomedical imaging and optics, can be tackled in a model-based manner by treating it as a fully known sparse recovery problem, while data can be leveraged to estimate the parameters of the sparse recovery setup via generative learning; alternatively, one can use data to directly learn the superresolution solver using deep unfolding [6] as a form of model-aware discriminative learning or by training a model-agnostic discriminative model, e.g., a DNN. A detailed account of these examples (and more) can be found in [4].
Figure 1. The different design approaches for inference rules based on domain knowledge and/or data.
The unprecedented success of deep learning in areas such as computer vision and natural language processing [3] notably boosted the popularity of model-agnostic discriminative models that rely on abstract, purely data-driven pipelines, trained with massive datasets, for inference. Specifically, by letting ${f}_{\theta}$ be a DNN with parameters ${\theta}$, one can train inference rules from data via (6) that operate in scenarios where analytical models are unknown or highly complex [12]. For instance, a statistical model ${\cal{P}}$ relating an image of a dog and the breed of the dog is likely to be intractable, and, thus, inference rules that rely on full domain knowledge or on estimating ${\cal{P}}$ are likely to be inaccurate. However, the abstractness and extreme parameterization of DNNs results in them often being treated as black boxes, while the training procedure of DNNs typically is lengthy, is computationally intensive, and requires massive volumes of data. Furthermore, understanding how their predictions are obtained and how reliable they are tends to be quite challenging, and, thus, deep learning lacks the interpretability, flexibility, versatility, and reliability of model-based techniques.
Unlike conventional deep learning domains, such as computer vision and natural language processing, in signal processing, one often has access to some level of reliable domain knowledge. Many problems in signal processing applications are characterized by faithful modeling based on the understanding of the underlying physics, the operation of the system, and models backed by extensive measurements. Nonetheless, existing modeling often includes parameters, e.g., channel coefficients and noise energy, that are specific to a given scenario and are typically unknown in advance, though they can be estimated from data. The key question in scenarios involving such partial domain knowledge in addition to training data is which design approach is preferable: the generative approach, which uses the data to estimate the missing model parameters and then infers based on the estimated model, or the discriminative approach, which leverages the data to directly learn the inference mapping end to end?
We tackle this fundamental question using a simple tractable scenario of linear estimation with partial domain knowledge, where it is known that the setting is linear, yet the exact linear mapping is unknown. In this setting, which is mathematically formulated in the following section, both approaches can be explicitly derived and analytically compared.
To provide an analytically tractable comparison between the aforementioned approaches to jointly leverage domain knowledge and data in inference, we consider a linear estimation scenario. Such scenarios are not only simple and enable rigorous analysis, but they also correspond to a broad range of statistical estimation problems that are commonly encountered in signal processing applications. Here, the input x and the target y are real-valued vectors taking values in ${\cal{X}} = {\Bbb{R}}^{{N}_{x}}$ and in ${\cal{Y}} = {\Bbb{R}}^{{N}_{y}}$, respectively. The loss measure is the squared-error loss; i.e., ${l}_{\text{MSE}}{(}{f},{\boldsymbol{x}},{\boldsymbol{y}}{)} = {\left\Vert{\boldsymbol{y}}{-}{f}{(}{\boldsymbol{x}}{)}\right\Vert}_{2}^{2}$.
In the considered setting, one has prior knowledge that the target y admits a Gaussian distribution with mean ${\mu}_{y}$ and covariance ${\boldsymbol{C}}_{\boldsymbol{yy}}$, i.e., ${\boldsymbol{y}}\,{\sim}\,{\cal{N}}{(}{\mu}_{y},{\boldsymbol{C}}_{\boldsymbol{yy}}{)}$, and that the measurements follow a linear model: \[{\boldsymbol{x}} = {\boldsymbol{Hy}} + {\boldsymbol{w}},{\boldsymbol{w}}\,{\sim}\,{\cal{N}}{(}{\mu},{\sigma}^{2}{\boldsymbol{I}}{)} \tag{7} \] i.e., w is a Gaussian noise with mean ${\mu}$ and covariance matrix ${\sigma}^{2}{\boldsymbol{I}}$ and is assumed to be independent of y. However, the available domain knowledge is partial in the sense that the parameters ${\boldsymbol{H}}$ and ${\mu}$ are unknown. Consequently, while the generative distribution ${\cal{P}}$ is known to be jointly Gaussian, its parameters are unknown. Thus, conventional Bayesian estimators, such as the MMSE and LMMSE estimators, cannot be implemented for this case. Nonetheless, we are given access to a dataset ${\cal{D}} = {\{}{\boldsymbol{x}}_{t},{\boldsymbol{y}}_{t}{\}}_{{t} = {1}}^{{n}_{t}}$ comprised of independent identically distributed (i.i.d.) samples drawn from ${\cal{P}}$. An illustration of the setting is depicted in Figure 2(a). The goal is to utilize the available domain knowledge and the data ${\cal{D}}$ to formulate an inference mapping that achieves the lowest generalization error, i.e., minimizes (3) with the squared-error loss.
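The setting in (7) is straightforward to simulate. The following sketch draws an i.i.d. dataset ${\cal{D}}$ from the partially known linear Gaussian model; all dimensions and parameter values are illustrative choices (not taken from this column), and H and the noise mean play the role of the parameters unknown to the designer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and parameters for the linear model (7).
N_y, N_x, n_t = 4, 4, 10_000
mu_y = np.ones(N_y)              # known prior mean of y
C_yy = np.eye(N_y)               # known prior covariance of y
H = rng.standard_normal((N_x, N_y))   # unknown to the designer
mu_w = 0.5 * np.ones(N_x)             # unknown noise mean
sigma2 = 0.1                          # known noise variance

# Draw the i.i.d. dataset D = {(x_t, y_t)}: x = H y + w, w ~ N(mu_w, sigma2 I).
Y = rng.multivariate_normal(mu_y, C_yy, size=n_t)                       # n_t x N_y
X = Y @ H.T + mu_w + np.sqrt(sigma2) * rng.standard_normal((n_t, N_x))  # n_t x N_x
```

The rows of `X` and `Y` are the sample pairs $({\boldsymbol{x}}_{t},{\boldsymbol{y}}_{t})$ used by both learning approaches in the sequel.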
Figure 2. The linear estimation setting: (a) the design basis, which shows the available data and domain knowledge, where the unknown model parameters are highlighted in red fonts; (b) the generative learning-based estimator; and (c) the discriminative learning-based estimator.
In the following, we exemplify a generative learning estimator (in the “Generative Learning Estimator” section) and a discriminative learning estimator (in the “Discriminative Learning Estimator” section), which both aim to recover the random signal y from the observed x for the partially known jointly Gaussian setting described in the “Case Study: Linear Estimation With Partially Known Measurement Models” section. To this end, we use the following notations of sample means: \[{\bar{\boldsymbol{x}}}\,{≜}\,\frac{1}{{n}_{t}} \mathop{\sum}\limits_{{t} = {1}}\limits^{{n}_{t}}{\boldsymbol{x}}_{t},\,\,\,{\bar{\boldsymbol{y}}}\,{≜}\,\frac{1}{{n}_{t}} \mathop{\sum}\limits_{{t} = {1}}\limits^{{n}_{t}}{\boldsymbol{y}}_{t} \tag{8} \] and the sample covariance/cross-covariance matrices \begin{align*}{\hat{\boldsymbol{C}}}_{\boldsymbol{yx}} & = \frac{1}{{n}_{t}} \mathop{\sum}\limits_{{t} = {1}}\limits^{{n}_{t}}{(}{\boldsymbol{y}}_{t}{-}{\bar{\boldsymbol{y}}}{)}{(}{\boldsymbol{x}}_{t}{-}{\bar{\boldsymbol{x}}}{)}^{T} \tag{9a} \\ {\hat{\boldsymbol{C}}}_{\boldsymbol{yy}} & = \frac{1}{{n}_{t}} \mathop{\sum}\limits_{{t} = {1}}\limits^{{n}_{t}}{(}{\boldsymbol{y}}_{t}{-}{\bar{\boldsymbol{y}}}{)}{(}{\boldsymbol{y}}_{t}{-}{\bar{\boldsymbol{y}}}{)}^{T} \tag{9b} \\ {\hat{\boldsymbol{C}}}_{\boldsymbol{xx}} & = \frac{1}{{n}_{t}} \mathop{\sum}\limits_{{t} = {1}}\limits^{{n}_{t}}{(}{\boldsymbol{x}}_{t}{-}{\bar{\boldsymbol{x}}}{)}{(}{\boldsymbol{x}}_{t}{-}{\bar{\boldsymbol{x}}}{)}^{T}{.} \tag{9c} \end{align*}
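The sample statistics (8) and (9) can be computed in a few lines; the following helper (a minimal sketch with illustrative names) takes the data as row-stacked arrays and returns the biased sample moments used by both estimators.

```python
import numpy as np

def sample_stats(X, Y):
    """Sample means (8) and (biased) sample covariances (9) from rows of X, Y."""
    n_t = X.shape[0]
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_bar, Y - y_bar
    C_yx = Yc.T @ Xc / n_t            # (9a)
    C_yy = Yc.T @ Yc / n_t            # (9b)
    C_xx = Xc.T @ Xc / n_t            # (9c)
    return x_bar, y_bar, C_yx, C_yy, C_xx
```

Note that (9) uses the normalization $1/{n}_{t}$ (the biased estimator), matching `numpy.cov` with `bias=True`.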
The generative approach uses data to estimate the missing domain knowledge parameters. To estimate the distribution based on the model in (7), we need to estimate the matrix H and the noise mean ${\mu}$ from the training data and then use the estimates denoted ${\hat{\boldsymbol{H}}}$ and ${\hat{\mu}}$ to form the linear estimator, as illustrated in Figure 2(b).
The unknown H and ${\mu}$ are fitted to the data using the maximum likelihood rule. Letting ${\cal{P}}{(}{\boldsymbol{x}},{\boldsymbol{y}}{;}{\boldsymbol{H}},{\mu}{)}$ be the joint distribution of x and y for given H and ${\mu}$, the log-likelihood can be written as \begin{align*}&{\log}\,{\cal{P}}{(}{\boldsymbol{x}}_{t},{\boldsymbol{y}}_{t}{;}{\boldsymbol{H}},{\mu}{)} \\ & \quad = {\log}\,{\cal{P}}{(}{\boldsymbol{x}}_{t}\,{\vert}\,{\boldsymbol{y}}_{t}{;}{\boldsymbol{H}},{\mu}{)} + {\log}{\cal{P}}{(}{\boldsymbol{y}}_{t}{)} \\ & \quad = {\text{const}}{-}\frac{1}{{2}{\sigma}^{2}}{\left\Vert{\boldsymbol{x}}_{t}{-}{\boldsymbol{Hy}}_{t}{-}{\mu}\right\Vert}_{2}^{2}{.} \tag{10} \end{align*}
In (10), “const” denotes a constant term that is not a function of the unknown parameters H and ${\mu}$.
Under the assumption that ${\cal{D}} = {\{}{\boldsymbol{x}}_{t},{\boldsymbol{y}}_{t}{\}}_{{t} = {1}}^{{n}_{t}}$ is composed of i.i.d. samples drawn from a Gaussian generative distribution ${\cal{P}}$, maximizing the likelihood is equivalent to minimizing the sum of squared norms in (10), and, thus, the maximum likelihood estimates are given by \[{\hat{\boldsymbol{H}}},{\hat{\mu}} = \mathop{\arg\min}\limits_{{\boldsymbol{H}}\,{\in}\,{\Bbb{R}}^{{N}_{x}\,{\times}\,{N}_{y}},{\mu}\,{\in}\,{\Bbb{R}}^{{N}_{x}}} \mathop{\sum}\limits_{{t} = {1}}\limits^{{n}_{t}}{\left\Vert{\boldsymbol{x}}_{t}{-}{\boldsymbol{Hy}}_{t}{-}{\mu}\right\Vert}_{2}^{2}{.} \tag{11} \]
The solutions to (11) are given by [8, Sec. 3.3] \[{\hat{\mu}} = {\bar{\boldsymbol{x}}}{-}{\hat{\boldsymbol{H}}}{\bar{\boldsymbol{y}}},\,\,\,{\hat{\boldsymbol{H}}} = {\hat{\boldsymbol{C}}}_{\boldsymbol{xy}}{\hat{\boldsymbol{C}}}_{\boldsymbol{yy}}^{{-}{1}} \tag{12} \] where ${\hat{\boldsymbol{C}}}_{\boldsymbol{xy}} = {\hat{\boldsymbol{C}}}_{\boldsymbol{yx}}^{T}$, which is defined in (9a), and it is assumed that ${\hat{\boldsymbol{C}}}_{\boldsymbol{yy}}$ from (9b) is a nonsingular matrix. By substituting the estimators from (12) in (7), the estimated ${\hat{\cal{P}}}_{\cal{D}}$ is obtained from the linear model \[{\boldsymbol{x}} = {\hat{\boldsymbol{H}}}{\boldsymbol{y}} + {\tilde{\boldsymbol{w}}},\,\,\,{\tilde{\boldsymbol{w}}}\,{\sim}\,{\cal{N}}{(}{\hat{\mu}},{\sigma}^{2}{\boldsymbol{I}}{)}{.} \tag{13} \]
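The closed-form solutions (12) translate directly into code. The following sketch (illustrative names; a minimal implementation, not the column's notebook) fits ${\hat{\boldsymbol{H}}}$ and ${\hat{\mu}}$ from the data; in the noiseless case, the fit recovers the true parameters exactly.

```python
import numpy as np

def fit_generative(X, Y):
    """Maximum likelihood fit (12) of the unknown H and mu in the linear
    model (7): H_hat = C_xy_hat C_yy_hat^{-1}, mu_hat = x_bar - H_hat y_bar."""
    n_t = X.shape[0]
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_bar, Y - y_bar
    C_xy = Xc.T @ Yc / n_t                    # transpose of (9a)
    C_yy = Yc.T @ Yc / n_t                    # (9b), assumed nonsingular
    H_hat = C_xy @ np.linalg.inv(C_yy)
    mu_hat = x_bar - H_hat @ y_bar
    return H_hat, mu_hat
```

With noisy data, ${\hat{\boldsymbol{H}}}$ and ${\hat{\mu}}$ converge to the true H and ${\mu}$ as ${n}_{t}$ grows.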
Having estimated the generative model, we proceed to finding the inference rule that minimizes the risk function with respect to ${\hat{\cal{P}}}_{\cal{D}}$; i.e., \[{f}^{\ast} = \mathop{\arg\min}\limits_{{f}\,{\in}\,{\scr{F}}}{\Bbb{E}}_{{(}{\boldsymbol{x}},{\boldsymbol{y}}{)}\,{\sim}\,{\hat{\cal{P}}}_{\cal{D}}}{\left\{{\left\Vert{\boldsymbol{y}}{-}{f}{(}{\boldsymbol{x}}{)}\right\Vert}_{2}^{2}\right\}} \tag{14} \] where we substitute the squared-error loss and (3) into (4). Since the estimated distribution ${\hat{\cal{P}}}_{\cal{D}}$ is a jointly Gaussian distribution, the solution of (14) is the LMMSE estimator under the estimated distribution, which is given by \begin{align*}{\hat{\boldsymbol{y}}}_{g} = & {f}^{*}{(}{\boldsymbol{x}}{)} = {\Bbb{E}}_{{(}{\boldsymbol{x}},{\boldsymbol{y}}{)}\,{\sim}\,{\hat{\cal{P}}}_{\cal{D}}}{\left\{{\boldsymbol{y}}\,{\vert}\,{\boldsymbol{x}}\right\}} \\ \mathop=\limits^{(a)} & {\mu}_{y} + {\bf{C}}_{\bf{yy}}{\hat{\boldsymbol{H}}}^{T}{(}{\hat{\boldsymbol{H}}}{\boldsymbol{C}}_{\boldsymbol{yy}}{\hat{\boldsymbol{H}}}^{T} + {\sigma}^{2}{I}{)}^{{-}{1}} \\ & {\times}\,{(}{\boldsymbol{x}}{-}{\bar{\boldsymbol{x}}}{-}{\hat{\boldsymbol{H}}}{(}{\mu}_{y} {-}{\bar{\boldsymbol{y}}}{)}{)} \\ \mathop=\limits^{(b)} & {\mu}_{y} + {(}{\hat{\boldsymbol{H}}}^{T}{\hat{\boldsymbol{H}}} + {\sigma}^{2}{\boldsymbol{C}}_{\boldsymbol{yy}}^{{-}{1}}{)}^{{-}{1}} \\ & {\times}\,{\hat{\boldsymbol{H}}}^{T}{(}{\boldsymbol{x}}{-}{\bar{\boldsymbol{x}}}{-}{\hat{\boldsymbol{H}}}{(}{\mu}_{y} {-} {\bar{\boldsymbol{y}}}{)}{)}. \tag{15} \end{align*}
Here, (a) follows from the estimated jointly Gaussian model (13), and (b) is obtained using the matrix inversion lemma, as long as ${\hat{\boldsymbol{H}}}^{T}{\hat{\boldsymbol{H}}} + {\sigma}^{2}{\boldsymbol{C}}_{\boldsymbol{yy}}^{{-}{1}}$ is invertible.
Applying the matrix inversion lemma requires the computation of the inverse of an ${N}_{y}\,{\times}\,{N}_{y}$ matrix instead of an ${N}_{x}\,{\times}\,{N}_{x}$ matrix as in the direct expression for the LMMSE estimate. This reduces the computational complexity when ${N}_{y}\,{<}\,{N}_{x}$; i.e., the target signal is of a lower dimensionality compared with the input signal. By substituting (12) in (15), one obtains the generative learning estimator as in (16). It can be seen in (16) that, in the general case, the generative learning estimator is a function of both the empirical covariance of y, ${\hat{\boldsymbol{C}}}_{\boldsymbol{yy}}$, and its true covariance, ${\boldsymbol{C}}_{\boldsymbol{yy}}$: \begin{align*}{\hat{\boldsymbol{y}}}_{g} = & {f}^{*}{(}{\boldsymbol{x}}{)} = {\mu}_{y} + {(}{\hat{\boldsymbol{C}}}_{\boldsymbol{yy}}^{{-}{1}}{\hat{\boldsymbol{C}}}_{\boldsymbol{yx}}{\hat{\boldsymbol{C}}}_{\boldsymbol{xy}}{\hat{\boldsymbol{C}}}_{\boldsymbol{yy}}^{{-}{1}} \\ & + {\sigma}^{2}{\boldsymbol{C}}_{\boldsymbol{yy}}^{{-}{1}}{)}^{{-}{1}}{\hat{\boldsymbol{C}}}_{\boldsymbol{yy}}^{{-}{1}}{\hat{\boldsymbol{C}}}_{\boldsymbol{yx}}{(}{\boldsymbol{x}}{-}{\bar{\boldsymbol{x}}} \\ & {-}{\hat{\boldsymbol{C}}}_{\boldsymbol{xy}}{\hat{\boldsymbol{C}}}_{\boldsymbol{yy}}^{{-}{1}}{(}{\mu}_{\boldsymbol{y}}{-}{\bar{\boldsymbol{y}}}{)}{)}. \tag{16} \end{align*}
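The two-stage generative estimator can be sketched as follows, using the form in (15): fit ${\hat{\boldsymbol{H}}}$ via (12) and then apply the LMMSE gain under the fitted model. This is a minimal, self-contained implementation with illustrative names, assuming the prior ${\mu}_{y}$, ${\boldsymbol{C}}_{\boldsymbol{yy}}$ and the noise variance ${\sigma}^{2}$ are known, as in the considered setting.

```python
import numpy as np

def generative_estimator(x, X, Y, mu_y, C_yy, sigma2):
    """Generative learning estimator (15): fit H_hat via (12), then apply
    the LMMSE rule under the fitted linear model (13)."""
    n_t, N_x = X.shape
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_bar, Y - y_bar
    H_hat = (Xc.T @ Yc / n_t) @ np.linalg.inv(Yc.T @ Yc / n_t)    # (12)
    gain = C_yy @ H_hat.T @ np.linalg.inv(
        H_hat @ C_yy @ H_hat.T + sigma2 * np.eye(N_x))
    return mu_y + gain @ (x - x_bar - H_hat @ (mu_y - y_bar))
```

For example, with H = I and ${\boldsymbol{C}}_{\boldsymbol{yy}} = {\boldsymbol{I}}$, the learned gain approaches the analytic LMMSE shrinkage ${1}/{(}{1} + {\sigma}^{2}{)}$ as the dataset grows.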
In this section, we consider a discriminative learning approach for the considered estimation problem. Here, the partial domain knowledge regarding the underlying joint Gaussianity indicates that the estimator should take a linear form; i.e., \[{f}_{\theta}{(}{\boldsymbol{x}}{)} = {\bf{A}}{\boldsymbol{x}} + {\boldsymbol{b}} \tag{17} \] where ${\theta} = {\{}{\bf{A}},{\boldsymbol{b}}{\}}$. For the considered parametric model, the available data are used to directly identify the parameters that minimize the empirical risk, as illustrated in Figure 2(c); i.e., \begin{align*}{\theta}^{*} & = \mathop{\arg\min}\limits_{{\theta}\,{\in}\,{\Theta}} \frac{1}{{n}_{t}} \mathop{\sum}\limits_{{t} = {1}}\limits^{{n}_{t}}{\left\Vert{\boldsymbol{y}}_{t}{-}{f}_{\theta}{(}{\boldsymbol{x}}_{t}{)}\right\Vert}_{2}^{2} \\ & = \mathop{\arg\min}\limits_{{\bf{A}}\,{\in}\,{\Bbb{R}}^{{N}_{y}\,{\times}\,{N}_{x}},\,{\boldsymbol{b}}\,{\in}\,{\Bbb{R}}^{{N}_{y}}} \frac{1}{{n}_{t}} \mathop{\sum}\limits_{{t} = {1}}\limits^{{n}_{t}}{\left\Vert{\boldsymbol{y}}_{t}{-}{\bf{A}}{\boldsymbol{x}}_{t}{-}{\boldsymbol{b}}\right\Vert}_{2}^{2}{.} \tag{18} \end{align*}
Since (18) is a convex function of ${\bf{A}}$ and b, the optimal solution is obtained by equating the derivatives of (18) with respect to ${\bf{A}}$ and b to zero, which results in \[{\boldsymbol{b}}^{*} = {\bar{\boldsymbol{y}}}{-}{\bf{A}}^{*}{\bar{\boldsymbol{x}}},\,\,\,{\bf{A}}^{*} = {\hat{\boldsymbol{C}}}_{\boldsymbol{yx}}{\hat{\boldsymbol{C}}}_{\boldsymbol{xx}}^{{-}{1}} \tag{19} \] where the sample means and sample covariance matrices are defined in (8) and (9). It is noted that the solution in (19) is valid and unique if and only if ${\hat{\boldsymbol{C}}}_{\boldsymbol{xx}}$ is a nonsingular matrix (see Sec. 3.3 and 4.2 in [8]). This generally holds for ${\sigma}^{2}\,{>}\,{0}$ when ${n}_{t}\,{>}\,{N}_{x}$. By substituting (19) into (17), we obtain the discriminative learned estimator, which is given by \[{\hat{\boldsymbol{y}}}_{d} = {f}_{\theta}^{\ast}{(}{\boldsymbol{x}}{)} = {\hat{\boldsymbol{C}}}_{\boldsymbol{yx}}{\hat{\boldsymbol{C}}}_{\boldsymbol{xx}}^{{-}{1}}{(}{\boldsymbol{x}}{-}{\bar{\boldsymbol{x}}}{)} + {\bar{\boldsymbol{y}}}{.} \tag{20} \]
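The discriminative estimator (20) is equally compact in code. The following sketch (illustrative names) implements the sample-LMMSE rule; since (19) solves the least-squares problem (18), its predictions coincide with an intercept-augmented least-squares fit.

```python
import numpy as np

def discriminative_estimator(x, X, Y):
    """Discriminative (sample-LMMSE) estimator (20):
    y_hat = C_yx_hat C_xx_hat^{-1} (x - x_bar) + y_bar."""
    n_t = X.shape[0]
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_bar, Y - y_bar
    A = (Yc.T @ Xc / n_t) @ np.linalg.inv(Xc.T @ Xc / n_t)        # (19)
    return A @ (x - x_bar) + y_bar
```

Note that no knowledge of the prior or the noise variance is used here; the data alone determine the linear mapping.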
It can be verified that the learned estimator in (20) coincides with the sample LMMSE estimator that is obtained by plugging the sample mean and sample covariance matrices into the LMMSE estimator.
By focusing on the simple yet common scenario of linear estimation with a partially known measurement model, we obtained closed-form expressions for the suitable estimators attained via generative learning and via discriminative learning. This allows us to compare the resulting estimators and, thus, draw insights into the general approaches of generative versus discriminative learning in the context of signal processing applications. In the following, we provide a theoretical analysis of the estimators in the “Theoretical Comparison” section, followed by a qualitative comparison discussed in the “Qualitative Comparison” section. A numerical study is presented in the “Numerical Comparison” section.
We next provide an analysis of the estimators obtained via generative learning in (16) and via discriminative learning in (20) by studying their behavior in asymptotic regimes. Specifically, we first study the asymptotic setting where the number of samples ${n}_{t}$ is arbitrarily large, after which we inspect the case of high signal-to-noise ratio (SNR); i.e., ${\sigma}^{2}\,{\rightarrow}\,{0}$.
Let us inspect the derived estimators in the asymptotic case where ${n}_{t}\,{\rightarrow}\,{\infty}$ and the data ${\cal{D}}$ comprise i.i.d. samples (i.e., an ergodic and stationary scenario). In this case, the discriminative learning estimator ${\hat{\boldsymbol{y}}}_{d}$ in (20) converges to the LMMSE estimator; i.e., \[{\hat{\boldsymbol{y}}}_{d} = {\boldsymbol{C}}_{\boldsymbol{yx}}{\boldsymbol{C}}_{\boldsymbol{xx}}^{{-}{1}}{(}{\boldsymbol{x}}{-}{\mu}_{x}{)} + {\mu}_{y} \tag{21} \] where ${\boldsymbol{C}}_{\boldsymbol{yx}},{\boldsymbol{C}}_{\boldsymbol{xx}}$ are the true covariance matrices, and ${\mu}_{x},{\mu}_{y}$ are the true expected values.
Similarly, for ${n}_{t}\,{\rightarrow}\,{\infty}$, the generative learning estimator ${\hat{\boldsymbol{y}}}_{g}$ stated in (16) converges to \begin{align*}{\hat{\boldsymbol{y}}}_{g} = & {(}{\boldsymbol{C}}_{\boldsymbol{yy}}^{{-}{1}}{\boldsymbol{C}}_{\boldsymbol{yx}}{\boldsymbol{C}}_{\boldsymbol{xy}}{\boldsymbol{C}}_{\boldsymbol{yy}}^{{-}{1}} + {\sigma}^{2}{\boldsymbol{C}}_{\boldsymbol{yy}}^{{-}{1}}{)}^{{-}{1}} \\ & {\times}\,{\boldsymbol{C}}_{\boldsymbol{yy}}^{{-}{1}}{\boldsymbol{C}}_{\boldsymbol{yx}}\,{\times}\,{(}{\boldsymbol{x}}{-}{\mu}_{x}{)} + {\mu}_{y}{.} \tag{22} \end{align*}
When the linear model in (7) holds, it can be verified that \[{\boldsymbol{C}}_{\boldsymbol{xy}} = {\boldsymbol{HC}}_{\boldsymbol{yy}},\,\,\,{\boldsymbol{C}}_{\boldsymbol{xx}} = {\boldsymbol{HC}}_{\boldsymbol{yy}}{\boldsymbol{H}}^{T} + {\sigma}^{2}{\boldsymbol{I}}{.} \tag{23} \]
By substituting (23) into (22) and using the matrix inversion lemma again, we obtain \begin{align*}{\hat{\boldsymbol{y}}}_{g} = & {(}{\boldsymbol{H}}^{T}{\boldsymbol{H}} + {\sigma}^{2}{\boldsymbol{C}}_{\boldsymbol{yy}}^{{-}{1}}{)}^{{-}{1}} \\ & {\times}\,{\boldsymbol{H}}^{T}{(}{\boldsymbol{x}}{-}{\mu}_{x}{)} + {\mu}_{y} \\ = & {\boldsymbol{C}}_{\boldsymbol{yy}}{\boldsymbol{H}}^{T}{(}{\boldsymbol{HC}}_{\boldsymbol{yy}}{\boldsymbol{H}}^{T} + {\sigma}^{2}{\bf{I}}{)}^{{-}{1}} \\ & {\times}\,{(}{\boldsymbol{x}}{-}{\mu}_{x}{)} + {\mu}_{y} \\ = & {\boldsymbol{C}}_{\boldsymbol{yx}}{\boldsymbol{C}}_{\boldsymbol{xx}}^{{-}{1}}{(}{\boldsymbol{x}}{-}{\mu}_{x}{)} + {\mu}_{y}{.} \tag{24} \end{align*}
Thus, asymptotically, if the true generative model is linear and given by (7), then the two estimators coincide.
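This asymptotic agreement can be checked numerically. The following Monte Carlo sketch (illustrative dimensions, seed, and parameter values) compares the coefficient matrices of the discriminative estimator (20), the generative estimator (15), and the analytic LMMSE estimator (21) computed via (23), for a large dataset drawn from the linear model (7).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 3
n_t = 100_000
mu_y, C_yy = np.zeros(N), np.eye(N)
H = rng.standard_normal((N, N))
mu_w, sigma2 = np.ones(N), 0.5

# Large i.i.d. dataset from the linear model (7).
Y = rng.standard_normal((n_t, N))
X = Y @ H.T + mu_w + np.sqrt(sigma2) * rng.standard_normal((n_t, N))

x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
Xc, Yc = X - x_bar, Y - y_bar

# Coefficient matrices of the two learned linear estimators.
A_d = (Yc.T @ Xc) @ np.linalg.inv(Xc.T @ Xc)                    # from (20)
H_hat = (Xc.T @ Yc) @ np.linalg.inv(Yc.T @ Yc)                  # from (12)
A_g = C_yy @ H_hat.T @ np.linalg.inv(
    H_hat @ C_yy @ H_hat.T + sigma2 * np.eye(N))                # from (15)

# Analytic LMMSE coefficients (21), computed via (23).
C_yx = C_yy @ H.T
C_xx = H @ C_yy @ H.T + sigma2 * np.eye(N)
A_lmmse = C_yx @ np.linalg.inv(C_xx)
```

For this large ${n}_{t}$, all three coefficient matrices are close, in line with the analysis above.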
Often, in practice, the generative model is nonlinear. For instance, consider the case where the true generative model is not the linear one in (7) but is given by \[{\boldsymbol{x}} = {\boldsymbol{g}}{(}{\boldsymbol{H}},{\boldsymbol{y}}{)} + {\boldsymbol{w}} \tag{25} \] where the measurement function g : ${\Bbb{R}}^{{N}_{y}}\,{\rightarrow}\,{\Bbb{R}}^{{N}_{x}}$ is a nonlinear function, and the statistical properties of y and w are the same as described in the “Case Study: Linear Estimation With Partially Known Measurement Models” section. Although the model is nonlinear, the estimator may be designed based on the linear model in (7), either due to mismatches or due to intentional linearization carried out to simplify the problem.
When the true generative model is nonlinear (i.e., under the misspecified model), then the two linear estimators ${\hat{\boldsymbol{y}}}_{g}$ and ${\hat{\boldsymbol{y}}}_{d}$ are different even asymptotically. The asymptotic estimators in this case are given by (21) and (22), but (23) does not hold, and, thus, ${\hat{\boldsymbol{y}}}_{g}$ here does not coincide with the LMMSE estimator as it does in (24) for the linear generative model. In this case, the discriminative model approach is asymptotically preferred, being the LMMSE estimator, while the generative learning approach, which is based on a mismatched generative model, yields a linear estimate that differs from the LMMSE estimator and is, thus, suboptimal.
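The effect of misspecification can be illustrated numerically. In the following sketch, the nonlinearity g in (25) is taken, purely for illustration, to be an elementwise tanh; both linear estimators are learned from a large training set and evaluated on a held-out test set, where the discriminative (sample-LMMSE) estimator attains a lower MSE, as predicted by the analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 3, 50_000
H = rng.standard_normal((N, N))
sigma2 = 0.1
mu_y, C_yy = np.zeros(N), np.eye(N)   # true prior, known by assumption

def draw(num):
    # True generative model is the nonlinear (25) with g(H, y) = tanh(H y).
    Y = rng.standard_normal((num, N))
    X = np.tanh(Y @ H.T) + np.sqrt(sigma2) * rng.standard_normal((num, N))
    return X, Y

X, Y = draw(n)       # training data
Xs, Ys = draw(n)     # test data

x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
Xc, Yc = X - x_bar, Y - y_bar

# Discriminative: sample LMMSE (20), oblivious to the (wrong) linear model.
A_d = (Yc.T @ Xc) @ np.linalg.inv(Xc.T @ Xc)
mse_d = np.mean(np.sum((Ys - ((Xs - x_bar) @ A_d.T + y_bar)) ** 2, axis=1))

# Generative: fit the misspecified linear model via (12), then LMMSE under it (15).
H_hat = (Xc.T @ Yc) @ np.linalg.inv(Yc.T @ Yc)
A_g = C_yy @ H_hat.T @ np.linalg.inv(H_hat @ C_yy @ H_hat.T + sigma2 * np.eye(N))
pred_g = (Xs - x_bar - (mu_y - y_bar) @ H_hat.T) @ A_g.T + mu_y
mse_g = np.mean(np.sum((Ys - pred_g) ** 2, axis=1))
```

Here, `mse_d` is lower than `mse_g`: the generative estimator underestimates the input covariance of the saturating nonlinear channel and over-weights the observations.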
Another setting in which one can rigorously compare the estimators is in the high-SNR regime, where ${\sigma}^{2}\,{\rightarrow}\,{0}$. In the following analysis, we focus on settings where ${N}_{y}\,{≥}\,{N}_{x}$; i.e., the number of input variables is, at most, the number of target variables. Specifically, when ${N}_{y}\,{<}\,{N}_{x}$, the input covariance becomes singular when ${\sigma}^{2}\,{\rightarrow}\,{0}$, and the corresponding estimators as well as the MMSE estimate may diverge. In this case, the data satisfy the model from (7); i.e., ${\boldsymbol{x}}_{t}{{\underset{{\sigma}^{2}\,{\rightarrow}\,{0}}\longrightarrow}}{\boldsymbol{Hy}}_{t} + {\mu}$. Thus, the sample mean in (8) satisfies \[{\bar{\boldsymbol{x}}}{{\underset{{\sigma}^{2}\,{\rightarrow}\,{0}}\longrightarrow}}{\boldsymbol{H}}{\bar{\boldsymbol{y}}} + {\mu}{.} \tag{26} \]
Similarly, the sample covariance matrices in (9) satisfy \begin{align*} \hat{\boldsymbol{C}}_{\boldsymbol{yx}} \underset{\sigma^{2} \rightarrow 0}{\longrightarrow} \frac{1}{n_{t}} \sum_{t=1}^{n_{t}} \left(\boldsymbol{y}_{t} - \bar{\boldsymbol{y}}\right)\left(\boldsymbol{H}\boldsymbol{y}_{t} + \boldsymbol{\mu} - \bar{\boldsymbol{x}}\right)^{T} = \hat{\boldsymbol{C}}_{\boldsymbol{yy}}\boldsymbol{H}^{T} \tag{27} \end{align*} and \begin{align*} \hat{\boldsymbol{C}}_{\boldsymbol{xx}} \underset{\sigma^{2} \rightarrow 0}{\longrightarrow} \frac{1}{n_{t}} \sum_{t=1}^{n_{t}} \left(\boldsymbol{H}\boldsymbol{y}_{t} + \boldsymbol{\mu} - \bar{\boldsymbol{x}}\right)\left(\boldsymbol{H}\boldsymbol{y}_{t} + \boldsymbol{\mu} - \bar{\boldsymbol{x}}\right)^{T} = \boldsymbol{H}\hat{\boldsymbol{C}}_{\boldsymbol{yy}}\boldsymbol{H}^{T}. \tag{28} \end{align*}
Under this limit, we next derive the considered estimators, starting with the generative one. First, we note that, according to (12) and by using (27), one obtains, under the assumption that $\hat{\boldsymbol{C}}_{\boldsymbol{yy}}$ is a nonsingular matrix, \[ \hat{\boldsymbol{H}} \underset{\sigma^{2} \rightarrow 0}{\longrightarrow} \hat{\boldsymbol{C}}_{\boldsymbol{xy}}\hat{\boldsymbol{C}}_{\boldsymbol{yy}}^{-1} = \boldsymbol{H}. \tag{29} \]
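In the noiseless limit, (26)-(29) are algebraic identities, so they can be verified directly by generating data $\boldsymbol{x}_t = \boldsymbol{H}\boldsymbol{y}_t + \boldsymbol{\mu}$; the dimensions and sample size below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# In the sigma^2 -> 0 limit, x_t = H y_t + mu exactly, so the limits
# (26)-(29) hold as exact sample identities.
N_y, N_x, n_t = 5, 4, 50
H = rng.standard_normal((N_x, N_y))
mu = rng.standard_normal(N_x)
y = rng.standard_normal((n_t, N_y))
x = y @ H.T + mu                        # noiseless data, per the limit of (7)

y_bar, x_bar = y.mean(0), x.mean(0)
C_yy_hat = (y - y_bar).T @ (y - y_bar) / n_t
C_yx_hat = (y - y_bar).T @ (x - x_bar) / n_t
C_xx_hat = (x - x_bar).T @ (x - x_bar) / n_t

# (26): the sample mean of x equals H y_bar + mu
assert np.allclose(x_bar, H @ y_bar + mu)
# (27): the sample cross covariance equals C_yy_hat H^T
assert np.allclose(C_yx_hat, C_yy_hat @ H.T)
# (28): the sample covariance of x equals H C_yy_hat H^T
assert np.allclose(C_xx_hat, H @ C_yy_hat @ H.T)
# (29): the fitted measurement matrix recovers the true H
H_fit = C_yx_hat.T @ np.linalg.inv(C_yy_hat)
assert np.allclose(H_fit, H)
print("limits (26)-(29) verified")
```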
Thus, taking the limit $\sigma^{2} \rightarrow 0$ in the third equality in (15) and substituting (29), we obtain \begin{align*} \hat{\boldsymbol{y}}_{g} \underset{\sigma^{2} \rightarrow 0}{\longrightarrow} \boldsymbol{\mu}_{y} + \boldsymbol{C}_{\boldsymbol{yy}}\boldsymbol{H}^{T}\left(\boldsymbol{H}\boldsymbol{C}_{\boldsymbol{yy}}\boldsymbol{H}^{T}\right)^{-1}\left(\boldsymbol{x} - \bar{\boldsymbol{x}} - \boldsymbol{H}\left(\boldsymbol{\mu}_{y} - \bar{\boldsymbol{y}}\right)\right). \tag{30} \end{align*}
It is assumed here that $\boldsymbol{H}\boldsymbol{C}_{\boldsymbol{yy}}\boldsymbol{H}^{T}$ is nonsingular, for which ${N}_{y} \geq {N}_{x}$ is a necessary condition. Similarly, by substituting (27) and (28) into the discriminative learned estimator in (20), we obtain \begin{align*} \hat{\boldsymbol{y}}_{d} \underset{\sigma^{2} \rightarrow 0}{\longrightarrow} \bar{\boldsymbol{y}} + \hat{\boldsymbol{C}}_{\boldsymbol{yy}}\boldsymbol{H}^{T}\left(\boldsymbol{H}\hat{\boldsymbol{C}}_{\boldsymbol{yy}}\boldsymbol{H}^{T}\right)^{-1}\left(\boldsymbol{x} - \bar{\boldsymbol{x}}\right). \tag{31} \end{align*}
While both asymptotic estimators in (30) and (31) apply the same linear transformation to the input term $(\boldsymbol{x} - \bar{\boldsymbol{x}})$, they differ in the terms that depend on the mean and covariance of the target. That is, comparing the generative and discriminative cases in (30) and (31), we see that they yield identical expressions except that the true values $\boldsymbol{\mu}_{y}$ and $\boldsymbol{C}_{\boldsymbol{yy}}$ used in the generative case are replaced by the empirical values $\bar{\boldsymbol{y}}$ and $\hat{\boldsymbol{C}}_{\boldsymbol{yy}}$ in the discriminative case. This difference stems from the prior knowledge of these moments available to the generative estimator. While these estimators differ, they coincide when $\bar{\boldsymbol{y}} = \boldsymbol{\mu}_{y}$ and $\hat{\boldsymbol{C}}_{\boldsymbol{yy}} = \boldsymbol{C}_{\boldsymbol{yy}}$, which is approached when ${n}_{t}$ is sufficiently large such that the empirical mean $\bar{\boldsymbol{y}}$ and empirical covariance $\hat{\boldsymbol{C}}_{\boldsymbol{yy}}$ approach the true mean $\boldsymbol{\mu}_{y}$ and true covariance matrix $\boldsymbol{C}_{\boldsymbol{yy}}$, respectively.
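This coincidence can be checked directly: the sketch below (with arbitrary illustrative dimensions) evaluates the limit expressions (30) and (31) after setting the empirical moments of $\boldsymbol{y}$ to their true values, which is the large-$n_t$ regime.

```python
import numpy as np

rng = np.random.default_rng(2)

# Verify that the high-SNR limits (30) and (31) coincide once the
# empirical moments of y match the true ones (y_bar = mu_y, C_yy_hat = C_yy).
N_y, N_x = 5, 4
H = rng.standard_normal((N_x, N_y))
C_yy = np.diag(rng.uniform(1.0, 2.0, N_y))
mu_y = rng.standard_normal(N_y)
x_bar = rng.standard_normal(N_x)        # common centering term
x = rng.standard_normal(N_x)            # a test observation

# Set the empirical moments to their large-n_t values
y_bar, C_yy_hat = mu_y, C_yy

# (30): generative limit, using the true mu_y and C_yy
A = C_yy @ H.T @ np.linalg.inv(H @ C_yy @ H.T)
yhat_g = mu_y + A @ (x - x_bar - H @ (mu_y - y_bar))

# (31): discriminative limit, using the empirical y_bar and C_yy_hat
A_hat = C_yy_hat @ H.T @ np.linalg.inv(H @ C_yy_hat @ H.T)
yhat_d = y_bar + A_hat @ (x - x_bar)

assert np.allclose(yhat_g, yhat_d)
print("limits (30) and (31) coincide when the moments match")
```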
The theoretical comparison in the “Theoretical Comparison” section allows us to rigorously identify scenarios in which one approach is preferable over the other, e.g., that discriminative learning is more suitable for handling modeling mismatches. Another aspect in which the estimators are comparable is their sample complexity. The discriminative linear estimator in (20) requires the computation of the inverse sample covariance matrix of x, $\hat{\boldsymbol{C}}_{\boldsymbol{xx}}$, from (9c). On the other hand, the generative linear estimator in (16) requires the computation of the inverse sample covariance matrix of y, $\hat{\boldsymbol{C}}_{\boldsymbol{yy}}$, in (9b). The dimensions of these matrices differ: $\hat{\boldsymbol{C}}_{\boldsymbol{xx}}$ is ${N}_{x} \times {N}_{x}$, while $\hat{\boldsymbol{C}}_{\boldsymbol{yy}}$ is ${N}_{y} \times {N}_{y}$. Thus, for a limited dataset, it may be easier to implement one of these estimators while guaranteeing the stability of the inverted covariance matrix.
In particular, in settings where the sample size ${(}{n}_{t}{)}$ is comparable to the observation dimension ${(}{N}_{x}{)}$, the discriminative sample LMMSE estimator exhibits severe performance degradation. This is because the sample covariance matrix is not well conditioned in the small sample size regime, and inverting it amplifies the estimation error. Similarly, if ${n}_{t}$ is comparable to the estimated vector dimension ${(}{N}_{y}{)}$, the performance of the generative learning estimator in (16) degrades. In such cases, the available data are not enough to recover a sufficiently accurate model, and the fitted model can be misleading due to the presence of noise and possible outliers.
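The conditioning issue is easy to demonstrate: the sketch below (illustrative sizes, with the true covariance taken as the identity) compares the condition number of a sample covariance in dimension 28 computed from barely enough samples versus many samples.

```python
import numpy as np

rng = np.random.default_rng(3)

# With n_t comparable to the dimension, the sample covariance is nearly
# singular and its inverse amplifies errors; with n_t >> N_x it is well
# conditioned and close to the true (identity) covariance.
N_x = 28

def sample_cov_cond(n_t):
    x = rng.standard_normal((n_t, N_x))     # true covariance: identity
    C_hat = np.cov(x.T, bias=True)
    return np.linalg.cond(C_hat)

cond_small = sample_cov_cond(30)            # n_t comparable to N_x
cond_large = sample_cov_cond(3000)          # n_t >> N_x
print(f"cond(C_hat), n_t=30: {cond_small:.1e}; n_t=3000: {cond_large:.2f}")
```

The small-sample condition number is typically orders of magnitude larger, even though both sample covariances estimate the same well-conditioned matrix.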
A possible additional operational gain of generative learning stems from the fact that it estimates the underlying distribution, which can be used for other tasks. The discriminative learning approach is highly task specific, learning only what must be learned to produce its estimate, and is, thus, not adaptive to a new task.
In this section, we evaluate the considered estimators in comparison with the oracle MMSE, which is the MMSE estimator for the model in (7) with known $\boldsymbol{H}$. The numerical study is available online at https://gist.github.com/nirshlezinger1/3e92bc16d28c8f2f7feb5031e32b5618. The purpose of this study is to numerically validate the theoretical findings reported in the previous section and to empirically compare the considered approaches for combining data with partial domain knowledge in a nonasymptotic regime.
We simulate jointly Gaussian signals via the signal model in (7) with ${N}_{x} = {28}$ observations and ${N}_{y} = {30}$ target entries. The target y has zero mean and a covariance matrix representing spatial exponential decay, where the $(i, j)$th entry of $\boldsymbol{C}_{\boldsymbol{yy}}$ is set to ${e}^{-\left\vert i - j\right\vert / 5}$. The measurement matrix $\boldsymbol{H}$ is generated randomly with i.i.d. zero-mean unit variance Gaussian entries. The results are averaged over ${10}^{4}$ Monte Carlo simulations.
We numerically evaluate the performance of the discriminative learning estimator of (20) and the generative learning estimator of (16). We consider two scenarios: 1) a setting in which ${\boldsymbol{C}}_{\boldsymbol{yy}}$ and ${\mu}_{\boldsymbol{y}}$ are accurately known and 2) a mismatched case in which the estimator approximates ${\boldsymbol{C}}_{\boldsymbol{yy}}$ as the identity matrix. Since the discriminative estimator is invariant with respect to the prior distribution of y, the presence of mismatch only affects the generative approach.
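A compact version of this study can be sketched as follows. The estimator forms are reconstructions consistent with (12), (16), and (20), zero means are assumed, and the number of Monte Carlo runs and test points are reduced for speed; the full study is in the linked notebook.

```python
import numpy as np

rng = np.random.default_rng(4)

# Setup per the text: N_x = 28, N_y = 30, exponentially decaying prior
# covariance, sigma = 0.3, n_t = 100 training samples per run.
N_x, N_y, sigma, n_t, n_mc = 28, 30, 0.3, 100, 200

i = np.arange(N_y)
C_yy = np.exp(-np.abs(i[:, None] - i[None, :]) / 5)     # (i,j) entry e^{-|i-j|/5}
L = np.linalg.cholesky(C_yy)
H = rng.standard_normal((N_x, N_y))

def sample(n):
    y = (L @ rng.standard_normal((N_y, n))).T            # y ~ N(0, C_yy)
    x = y @ H.T + sigma * rng.standard_normal((n, N_x))
    return y, x

def lmmse_gain(C_prior, H_mat):
    # LMMSE gain for a linear Gaussian model with prior covariance C_prior
    return C_prior @ H_mat.T @ np.linalg.inv(H_mat @ C_prior @ H_mat.T + sigma**2 * np.eye(N_x))

err = {"oracle": 0.0, "gen": 0.0, "gen-mis": 0.0, "disc": 0.0}
for _ in range(n_mc):
    y_tr, x_tr = sample(n_t)
    y_bar, x_bar = y_tr.mean(0), x_tr.mean(0)
    C_yx = (y_tr - y_bar).T @ (x_tr - x_bar) / n_t
    C_yy_hat = (y_tr - y_bar).T @ (y_tr - y_bar) / n_t
    C_xx_hat = (x_tr - x_bar).T @ (x_tr - x_bar) / n_t
    H_hat = C_yx.T @ np.linalg.inv(C_yy_hat)             # generative model fit

    y_te, x_te = sample(500)
    ests = {
        "oracle": x_te @ lmmse_gain(C_yy, H).T,                                  # known H, known prior
        "gen": (x_te - x_bar + y_bar @ H_hat.T) @ lmmse_gain(C_yy, H_hat).T,     # fitted H, true C_yy
        "gen-mis": (x_te - x_bar + y_bar @ H_hat.T) @ lmmse_gain(np.eye(N_y), H_hat).T,  # C_yy ~ I
        "disc": y_bar + (x_te - x_bar) @ (C_yx @ np.linalg.inv(C_xx_hat)).T,     # sample LMMSE
    }
    for k, yh in ests.items():
        err[k] += np.mean(np.sum((y_te - yh) ** 2, axis=1)) / n_mc

print({k: round(v, 2) for k, v in err.items()})
```

As expected, the oracle attains the lowest MSE, and the mismatched generative estimator incurs a clear penalty relative to the matched one.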
The resulting MSE values versus the SNR ${1} / {\sigma}^{2}$ for ${n}_{t} = {100}$ data samples are reported in Figure 3, while Figure 4 illustrates the MSE curves versus ${n}_{t}$ for ${\sigma} = {0.3}$. Observing Figure 3, we note that, for the considered setting of small ${n}_{t}$, the generative estimator, which fully knows $\boldsymbol{C}_{\boldsymbol{yy}}$, outperforms the discriminative approach due to its ability to incorporate the prior knowledge of ${\sigma}^{2}$ and of the statistical moments of y. We also observe that, at high SNR, generative learning approaches the MMSE, while discriminative learning exhibits a performance gap, which is consistent with our analysis in the previous section. We note, though, that, when repeating this study with ${N}_{x} = {N}_{y}$, all estimators achieve performance within a minor gap of the MMSE in the high-SNR regime.
Figure 3. The MSE versus the SNR.
Figure 4. The MSE versus the number of samples.
Nonetheless, in the presence of small mismatches in ${\boldsymbol{C}}_{\boldsymbol{yy}}$, the discriminative approach yields improved MSE, indicating its ability to cope better with modeling mismatches compared with generative learning. In Figure 4, where the SNR is fixed and finite, we observe that the effect of a misspecified model does not vanish when the number of samples increases, and the mismatched generative model remains within a notable gap from the MMSE, while both the discriminative learning estimator and the nonmismatched generative one approach the MMSE as ${n}_{t}$ grows. These findings are all in line with the theoretical analysis presented in the “Theoretical Comparison” section.
In this “Lecture Notes” column, we reviewed two different approaches to combining partial domain knowledge with data for forming an inference rule. The first approach is related to the machine learning notion of generative learning, which operates in two stages: it first uses data to estimate the missing components in the statistical description of the problem at hand; then, the estimated statistical model is used to form the inference rule. The second approach is task oriented, leveraging the available domain knowledge to specify the structure of the inference rule while using data to optimize the resulting mapping in an end-to-end fashion.
To compare the approaches in a manner that is relevant and insightful to signal processing students and researchers, we focused on a case study representing linear estimation. For such settings, we obtained a closed-form expression for both the generative learning estimator as well as the discriminative learning one. The resulting explicit derivations enabled us to rigorously compare the approaches and draw insights into their conceptual differences and individual pros and cons.
In particular, we noted that discriminative learning, which uses the available domain knowledge only to determine the inference rule structure, is more robust to mismatches in the mathematical description of the setup. This property indicates the ability of end-to-end learning to better cope with mismatched and complex models. However, when the partial domain knowledge available is indeed accurate, it was shown that generative learning can leverage this prior to operate with few samples and to improve the performance in noisy settings. These findings were not only analytically proven but also backed by numerical evaluations, which are made publicly available as a highly accessible Python Notebook intended to be used for presenting this lecture in class. While our conclusions are rigorously proven for a simple setting of linear estimation in a Gaussian model, they can be empirically shown to generalize to more complex settings. For instance, the robustness of discriminative learning to model mismatches is one of the key motivations for converting model-based algorithms into trainable discriminative models via model-based deep learning [4], while the improved sample complexity of generative learning was also observed in [1].
We thank Arbel Yaniv and Lital Dabush, who helped during the development of the numerical example.
Nir Shlezinger (nirshl@bgu.ac.il) received his B.Sc., M.Sc., and Ph.D. degrees in 2011, 2013, and 2017, respectively, from Ben-Gurion University, Israel, all in electrical and computer engineering. He is an assistant professor in the School of Electrical and Computer Engineering at Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel. From 2017 to 2019, he was a postdoctoral researcher at the Technion, and from 2019 to 2020, he was a postdoctoral researcher at the Weizmann Institute of Science, where he was awarded the Feinberg Graduate School prize for outstanding research achievements. His research interests include communications, information theory, signal processing, and machine learning.
Tirza Routtenberg (tirzar@bgu.ac.il) received her B.Sc. degree in biomedical engineering from the Technion–Israel Institute of Technology, Haifa, Israel, in 2005, and her M.Sc. and Ph.D. degrees in electrical engineering from the Ben-Gurion University of the Negev, Beer-Sheva, Israel, in 2007 and 2012, respectively. She is an associate professor in the School of Electrical and Computer Engineering at Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel. In addition, she has been appointed as the William R. Kenan, Jr., Visiting Professor for Distinguished Teaching in the Electrical and Computer Engineering Department at Princeton University. She was the recipient of four Best Student Paper Awards at international conferences. She is currently serving as an associate editor of IEEE Transactions on Signal and Information Processing Over Networks and IEEE Signal Processing Letters. She is a member of the IEEE Signal Processing Theory and Methods Technical Committee. Her research interests include statistical signal processing, estimation and detection theory, graph signal processing, and optimization and signal processing for smart grids. She is a Senior Member of IEEE.
[1] A. Ng and M. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes,” in Proc. Adv. Neural Inf. Process. Syst., 2001, vol. 14, pp. 841–848.
[2] T. Jebara, Machine Learning: Discriminative and Generative, vol. 755. New York, NY, USA: Springer Science & Business Media, 2012.
[3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015, doi: 10.1038/nature14539.
[4] N. Shlezinger, Y. C. Eldar, and S. P. Boyd, “Model-based deep learning: On the intersection of deep learning and optimization,” IEEE Access, vol. 10, pp. 115,384–115,398, Nov. 2022, doi: 10.1109/ACCESS.2022.3218802.
[5] A. Agrawal, S. Barratt, and S. Boyd, “Learning convex optimization models,” IEEE/CAA J. Autom. Sin., vol. 8, no. 8, pp. 1355–1364, Aug. 2021, doi: 10.1109/JAS.2021.1004075.
[6] V. Monga, Y. Li, and Y. C. Eldar, “Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing,” IEEE Signal Process. Mag., vol. 38, no. 2, pp. 18–44, Mar. 2021, doi: 10.1109/MSP.2020.3016905.
[7] N. Shlezinger, J. Whang, Y. C. Eldar, and A. G. Dimakis, “Model-based deep learning,” Proc. IEEE, early access, 2023, doi: 10.1109/JPROC.2023.3247480.
[8] S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective, 2nd ed. New York, NY, USA: Elsevier, 2020.
[9] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms. New York, NY, USA: Cambridge Univ. Press, 2014.
[10] N. Harel and T. Routtenberg, “Bayesian post-model-selection estimation,” IEEE Signal Process. Lett., vol. 28, pp. 175–179, Jan. 2021, doi: 10.1109/LSP.2020.3048830.
[11] E. Meir and T. Routtenberg, “Cramér-Rao bound for estimation after model selection and its application to sparse vector estimation,” IEEE Trans. Signal Process., vol. 69, pp. 2284–2301, Mar. 2021, doi: 10.1109/TSP.2021.3068356.
[12] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, Jan. 2009, doi: 10.1561/2200000006.
Digital Object Identifier 10.1109/MSP.2023.3271431