Jie Chen, Min Zhao, Xiuheng Wang, Cédric Richard, Susanto Rahardja
Spectral unmixing is central to the analysis of hyperspectral data. Physics-based methods have become popular for this task because, with their explicit mixing models, they provide a clear interpretation. Nevertheless, owing to their limited modeling capabilities, especially when analyzing real scenes with unknown, complex physical properties, these methods may not be accurate. On the other hand, data-driven methods, particularly those based on deep learning, have developed rapidly in recent years, thanks to their superior capability in modeling complex nonlinear systems. However, simply transferring these methods as black boxes to perform unmixing may lead to low interpretability and poor generalization ability. To bring together the best of both worlds, recent research efforts have focused on combining the advantages of both physics-based models and data-driven methods. In this article, we present an overview of recent advances on this topic from various perspectives, including deep neural network (DNN) design, prior capturing, and loss selection. We summarize these methods within a common optimization framework and discuss ways of enhancing our understanding of these methods. The related source codes are made publicly available at http://github.com/xiuheng-wang/awesome-hyperspectral-image-unmixing.
Conventional machine vision applications widely use red-green-blue cameras, which are suitable for identifying objects with shapes and colors consistent with the human visual system. As a significant advancement in imaging techniques, hyperspectral cameras measure objects or scenes by recording hundreds of narrow bands across a wide spectral range that can extend beyond visible light. In this way, hyperspectral images enable the identification of materials based on their unique spectral signatures, beyond their visible characteristics, coupled with spatial information [1]. Early hyperspectral imaging applications centered on the remote sensing of Earth’s surface using satellite or airborne cameras, e.g., for mineral exploration, vegetation monitoring, and land analysis. More recently, technical advancements and the ongoing cost reduction of hyperspectral cameras have made their deployment practical in other emerging applications, such as food safety inspection, medical diagnostics, industrial sorting, drug analysis, biometric forensics, and archaeology.
The spectral content of individual pixels in hyperspectral images is typically a mixture of the spectra of multiple materials [2], [3]. This phenomenon, called spectral mixture, is attributed to multiple factors, including mainly the low spatial resolution of hyperspectral imaging devices, the diversity and intimate interactions of materials in the imaged scenes, and multiple photon reflections from layered objects. Separating the spectra of individual pixels into a set of spectral signatures (called endmembers), and determining the abundance fraction of each endmember, is an essential task in quantitative hyperspectral subpixel analysis. This process, denoted as spectral unmixing or mixed pixel decomposition, is currently used in many applications, such as conventional mineral distribution analysis in remote sensing, biodistribution analysis in fluorescence images, food quality and composition control, and pharmaceutical inspection.
Diverse solutions have been proposed for the spectral unmixing problem. They can be divided primarily into two categories: physics-based and data-driven methods. Physics-based methods exploit physical models to estimate light scattering and interaction mechanisms among multiple materials in the imaged scene. The observed spectra are explicitly related to the endmembers and abundances, albeit with strong assumptions regarding photon-interaction processes. In practice, it is difficult to accurately model real scenes, and sophisticated models with complex mathematics are usually difficult to estimate. By contrast, data-driven unmixing methods can overcome some limitations of the physics-based methods by directly deriving mixture models from the observed hyperspectral data. These techniques are effective in uncovering the underlying relationships in complicated scenes when abundant data are available. Nevertheless, explicit and useful knowledge of the physical mixing process is often overlooked in naive data-driven approaches. To reap the benefits of both physics-based and data-driven models, it is essential to properly integrate these techniques to achieve a superior unmixing performance with clear interpretation.
The main purpose of this article is to provide an overview of hyperspectral unmixing models and techniques with special attention focused on the integration of physics-based models and data-driven methods. These methods are analyzed from multiple perspectives, including model structure design, prior capturing, and loss function design. In addition to reviewing the materials dispersed in the literature, we establish a mathematical framework that can characterize and relate these methods. This article provides a set of methods that can serve as good examples for understanding the topics and a selection of references that is sufficient to cover the main contributions. The symbols used in this article are provided in Table 1.
Table 1. The main symbols used in this article.
This section establishes a general modeling and optimization framework for spectral unmixing. The classical and recent physics-based and data-driven unmixing methods described subsequently are compatible with this framework.
Generally, hyperspectral unmixing consists of three essential steps: the estimation of the number of endmembers present in the imaged scene, their extraction from available data, and the estimation of their respective abundances. Estimating the number of endmembers can be performed with signal dimension estimation techniques or circumvented with sparse regression methods. Estimating the endmembers and their abundances can be performed sequentially or simultaneously. During the unmixing process, endmembers and abundances are often constrained to preserve physical properties and interpretability. The endmembers represent the pure spectra existing in the imaged scene, implying that the so-called endmember nonnegativity constraint (ENC) is usually imposed. The abundances represent the contribution in percentage of each pure component, meaning that the abundance nonnegativity constraint (ANC) and the abundance sum-to-one constraint (ASC) are often considered. The feasible regions of endmembers and abundances are denoted by ${\Omega}_{M}$ and ${\Omega}_{a},$ with \[{\Omega}_{M} = {\{}{\boldsymbol{M}}\,{:}\,{\boldsymbol{M}}\geq{\boldsymbol{0}}{\},} \tag{1} \] \[{\Omega}_{a} = {\{}{\boldsymbol{a}}\,{:}\,{\boldsymbol{a}}\geq{\boldsymbol{0}}{;}\,{a}_{1} + {a}_{2} + \cdots + {a}_{R} = {1}{\},} \tag{2} \] where ${\cdot}\geq{\boldsymbol{0}}$ denotes elementwise nonnegativity of its matrix or vector argument. The feasible region ${\Omega}_{M}$ is for the ENC, and the feasible region ${\Omega}_{a}$ is for the ANC and ASC. The parameter R denotes the number of endmembers available in the imaged scene.
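For illustration, these constraints translate directly into code. The following minimal NumPy sketch, which is an illustration rather than part of any specific method surveyed here, enforces the ANC and ASC jointly by Euclidean projection onto the probability simplex using the classical sort-based routine; the ENC amounts to elementwise clipping.

```python
import numpy as np

def project_simplex(a):
    """Euclidean projection onto {a : a >= 0, 1^T a = 1}, i.e., the
    feasible region Omega_a enforcing the ANC and ASC jointly
    (classical sort-based algorithm)."""
    u = np.sort(a)[::-1]                     # sort in descending order
    css = np.cumsum(u)
    k = np.arange(1, a.size + 1)
    rho = np.nonzero(u + (1.0 - css) / k > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1.0)     # shift that restores the ASC
    return np.maximum(a - tau, 0.0)          # clipping restores the ANC

# The ENC on the endmember matrix M reduces to elementwise clipping:
# M = np.maximum(M, 0.0)
```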
An observed pixel spectrum y can be described by the following general mixture mechanism: \[{\boldsymbol{y}} = {\cal{F}}^{\ast}{(}{\boldsymbol{M}},{\boldsymbol{a}}{)} + \boldsymbol{\epsilon} \tag{3} \] where ${\cal{F}}^{\ast}$ defines the inherent photon interaction mechanism parameterized by the endmembers and their abundances. In a nutshell, spectral unmixing aims at determining the endmembers and their respective abundances by considering the inverse process of ${(}{\cal{F}}^{\ast}{)}^{{-}{1}}{:}{\boldsymbol{y}}\rightarrow{\{}\hat{\boldsymbol{M}},\hat{\boldsymbol{a}}{\}}$ under the constraints defined by ${\Omega}_{M}$ and ${\Omega}_{a},$ provided that this inverse is known or can be estimated. From the perspective of mathematical optimization, the general form of the unmixing problem can be formulated in different terms depending on known information and priors.
Consider a scene with N pixels $\{{\boldsymbol{y}}_{i}{\}}_{{i} = {1}}^{N}.$ Under the assumption that the mixture mechanism is defined by an explicit transformation ${\cal{F}},$ the optimization formulation of the unmixing problem is given by \begin{align*}\{{\hat{\boldsymbol{M}}},\{{\hat{\boldsymbol{a}}}_{i}\}_{{i} = {1}}^{N}\} & = \mathop{\text{argmin}}\limits_{{\boldsymbol{M}},\{{\boldsymbol{a}}_{i}\}_{{i} = {1}}^{N}}\mathop{\sum}\limits_{{i} = {1}}^{N}{\cal{L}}{(}{\boldsymbol{y}}_{i},\,{\hat{\boldsymbol{y}}}_{i}{)} + {\cal{R}}{(}{\boldsymbol{M}},\{{\boldsymbol{a}}_{i}\}_{{i} = {1}}^{N}{)} \\ & \quad {\text{with}} \quad {\hat{\boldsymbol{y}}}_{i} = {\cal{F}}{(}{\boldsymbol{M}},{\boldsymbol{a}}_{i}{)} \\ & \quad {\text{s.t.}}\quad{\boldsymbol{M}}\in{\Omega}_{M}\,{\text{and}}\,{\boldsymbol{a}}_{i}\in{\Omega}_{a}\,\forall{i}{,} \tag{4} \end{align*} where ${\cal{L}}$ is a loss function that measures the discrepancy between each observed pixel ${\boldsymbol{y}}_{i}$ and its reconstruction ${\hat{\boldsymbol{y}}}_{i},$ and ${\cal{R}}$ is a regularizer that encodes prior knowledge on the endmembers and abundances.
Extensive research has been conducted to derive explicit mathematical formulations for describing mixture mechanisms. These physics-based models are constructed with different assumptions regarding how photons interact with scene materials. The works in [2] and [4] provide thorough reviews on physical mixture models and associated unmixing methods. Some typical models are presented here as examples.
In the linear mixture model (LMM), the macroscopically pure components are assumed to be homogeneously distributed in separate patches within the view field. Each photon interacts with only one pure material before reaching the observer, without being affected by any other material. The LMM is expressed as follows: \begin{align*}{\boldsymbol{y}} & = {a}_{1}{\boldsymbol{m}}_{1} + {a}_{2}{\boldsymbol{m}}_{2} + \cdots + {a}_{R}{\boldsymbol{m}}_{R} + \boldsymbol{\epsilon} \\ & = {\boldsymbol{Ma}} + {\boldsymbol{\epsilon}}{;} \tag{5} \end{align*} that is, the observed spectrum y is a linear combination of ${a}_{j}{\boldsymbol{m}}_{j},$ where ${\boldsymbol{m}}_{j}$ is the spectrum of the jth pure material weighted by its abundance ${a}_{j}{.}$ An additive noise is denoted by $\boldsymbol{\epsilon}{.}$ A general form of the LMM considers different endmembers at each pixel due to, e.g., spectral variability. It is formulated as \[{\boldsymbol{y}}_{i} = {\boldsymbol{M}}_{i}{\boldsymbol{a}}_{i} + {\boldsymbol{\epsilon}}_{i}{,} \tag{6} \] where ${\boldsymbol{M}}_{i}$, having the same structure as M defined in Table 1, is the endmember matrix at the ith pixel.
Bilinear models [5] generalize the LMM by introducing second-order reflection signatures to capture, e.g., photon scatterings caused by multiple vegetation layers in agricultural scenarios. The mixture model is described by \[{\boldsymbol{y}} = {\boldsymbol{Ma}} + \mathop{\sum}\limits_{{i} = {1}}^{R}\mathop{\sum}\limits_{{j} = {1}}^{R}{{\gamma}_{i,j}}{\boldsymbol{m}}_{i}\odot{\boldsymbol{m}}_{j} + \boldsymbol{\epsilon}{,} \tag{7} \] where $\odot$ denotes the Hadamard product, and ${\gamma}_{i,j}$ stands for the nonlinear contribution of the ith and jth endmembers. The expression of ${\gamma}_{i,j}$ allows us to define different bilinear models. For instance, ${\gamma}_{i,j} = {\beta}_{i,j}{a}_{i}{a}_{j}$ yields the generalized bilinear model in [5], where ${\beta}_{i,j}$ controls the interaction strength between the ith and jth endmembers.
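To make the bilinear mechanism concrete, the following sketch simulates one pixel under the generalized bilinear model of [5], i.e., (7) with ${\gamma}_{i,j} = {\beta}_{i,j}{a}_{i}{a}_{j}$ and interactions restricted to pairs ${i}<{j}$ as in that model; the noise level is an illustrative assumption.

```python
import numpy as np

def gbm_pixel(M, a, beta, noise_std=0.01):
    """Simulate a pixel with the generalized bilinear model:
    linear part Ma, cf. (5), plus second-order endmember
    interactions weighted by beta[i, j] * a[i] * a[j], cf. (7).
    M: (L, R) endmembers; a: (R,) abundances; beta: (R, R)."""
    y = M @ a                                    # linear contribution
    n_bands, R = M.shape
    for i in range(R):
        for j in range(i + 1, R):                # pairwise interactions
            y += beta[i, j] * a[i] * a[j] * M[:, i] * M[:, j]
    return y + noise_std * np.random.randn(n_bands)   # additive noise
```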
The Hapke model has been developed to characterize intimate mixtures with complex photon interactions [6]. The full Hapke model is complex and requires a number of parameters that are generally unavailable in real-world scenes. Under certain reasonable assumptions and simplifications, such as spherical particles and isotropic scatterers, the relationship between the bidirectional reflectance y and the single-scattering albedo (SSA) w becomes \[{\boldsymbol{y}} = {\cal{S}}{(}{\boldsymbol{w}}{)} = \frac{\boldsymbol{w}}{{(}{1} + {2}{\mu}\sqrt{{1}{-}{\boldsymbol{w}}}{)}\left({{1} + {2}{\mu}_{0}\sqrt{{1}{-}{\boldsymbol{w}}}}\right)}{,} \tag{8} \] where the division is elementwise, and ${\mu}_{0}$ and ${\mu}$ are the cosines of the incoming and outgoing radiation angles, respectively. In spite of being nonlinear in terms of reflectance, the intimate mixture model is linear in the SSA domain and can be expressed as follows: \[{\boldsymbol{w}} = \mathop{\sum}\limits_{{i} = {1}}^{R}{{a}_{i}}{\boldsymbol{w}}_{i}{.} \tag{9} \] Unmixing problems with physics-based models frequently fall under Problem formulation 1 when ${\cal{F}}$ is explicitly defined. We now give two examples of linear unmixing problems with known and unknown endmembers, respectively.
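Before turning to these examples, note that the simplified Hapke relation (8) and the SSA-domain linearity (9) translate directly into code. The hedged NumPy sketch below assumes the endmember SSA matrix and the two geometry cosines are available.

```python
import numpy as np

def ssa_to_reflectance(w, mu0, mu):
    """Map single-scattering albedo w (values in [0, 1)) to
    bidirectional reflectance with the simplified relation (8);
    all operations are elementwise."""
    g = np.sqrt(1.0 - w)
    return w / ((1.0 + 2.0 * mu * g) * (1.0 + 2.0 * mu0 * g))

def intimate_mixture_pixel(W, a, mu0, mu):
    """Intimate mixture: linear in the SSA domain, cf. (9), then
    mapped back to reflectance. W: (L, R) endmember SSAs; a: (R,)."""
    return ssa_to_reflectance(W @ a, mu0, mu)
```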
Under the linear model assumption ${\cal{F}}{(}{\boldsymbol{M}},{\boldsymbol{a}}{)} = {\boldsymbol{Ma}}$ with known ${\boldsymbol{M}}$, considering independent pixels and using the squared ${\ell}_{2}$-norm error ${\cal{L}}{(}{\boldsymbol{y}},\hat{\boldsymbol{y}}{)} = {\Vert}{\boldsymbol{y}}{-}\hat{\boldsymbol{y}}{\Vert}^{2},$ problem (4) reduces to the following fully constrained least-squares (FCLS) problem [7]: \begin{align*}{\hat{\boldsymbol{a}}} & = \mathop{\text{argmin}}\limits_{\boldsymbol{a}}{\Vert}{\boldsymbol{y}}{-}{\boldsymbol{Ma}}{\Vert}^{2} \\ & \quad\quad {\text{s.t.}}\quad{\boldsymbol{a}}\in{\Omega}_{a}{.} \tag{10} \end{align*}
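As a minimal sketch of (10), the snippet below solves the FCLS problem with projected gradient descent, reusing the `project_simplex` routine sketched earlier; this is an illustrative solver, whereas the reference algorithm in [7] relies on an active-set least-squares procedure.

```python
import numpy as np

def fcls(y, M, n_iter=500):
    """Fully constrained least squares, cf. (10), via projected
    gradient descent: a gradient step on ||y - Ma||^2, followed by
    projection onto the simplex Omega_a (ANC + ASC)."""
    R = M.shape[1]
    a = np.full(R, 1.0 / R)                     # feasible starting point
    step = 1.0 / np.linalg.norm(M.T @ M, 2)     # inverse Lipschitz constant
    for _ in range(n_iter):
        grad = M.T @ (M @ a - y)                # (scaled) gradient
        a = project_simplex(a - step * grad)    # defined in earlier sketch
    return a
```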
Under the linear model assumption ${\cal{F}}{(}{\boldsymbol{M}},{\boldsymbol{a}}_{i}{)} = {\boldsymbol{Ma}}_{i}$ for ${i} = {1},\cdots,{N}$, and using ${\cal{L}}{(}{\boldsymbol{y}}_{i},{\hat{\boldsymbol{y}}}_{i}{)} = {\Vert}{\boldsymbol{y}}_{i}{-}{\hat{\boldsymbol{y}}}_{i}{\Vert}^{2}$ with a regularization term, problem (4) becomes \begin{align*}\{{\hat{\boldsymbol{M}}},\{{\hat{\boldsymbol{a}}}_{i}\}_{{i} = {1}}^{N}\} & = \mathop{\text{argmin}}\limits_{{\boldsymbol{M}},\{{\boldsymbol{a}}_{i}\}_{{i} = {1}}^{N}}{\Vert}{\boldsymbol{Y}}{-}{\boldsymbol{MA}}{\Vert}_{F}^{2} + {\cal{R}}\left({\boldsymbol{M}},\{{\boldsymbol{a}}_{i}\}_{{i} = {1}}^{N}\right) \\ & {\text{s.t.}}\quad{\boldsymbol{M}}\in{\Omega}_{M},\,{\text{and}}\,{\boldsymbol{a}}_{i}\in{\Omega}_{a}{,} \tag{11} \end{align*} where we notice that ${\Vert}{\boldsymbol{Y}}{-}{\boldsymbol{MA}}{\Vert}_{F}^{2} = {\Sigma}_{{i} = {1}}^{N}{\Vert}{\boldsymbol{y}}_{i}{-}{\boldsymbol{Ma}}_{i}{\Vert}^{2}{.}$ The design of regularizers ${\cal{R}}$ for enhancing the unmixing performance has been extensively studied in the literature. For example, a minimum volume-based term is usually introduced as a geometrical constraint on the endmember estimation [8]. Total variation, sparsity, and low rankness are usually used to regularize the abundance estimation [9].
Though physics-based models and unmixing via Problem formulation 1 have been widely used, the real mixture mechanism may not be explicitly known or accurately assumed, and actual physics-based models have a limited capacity for modeling nonlinearity. Simultaneously learning the function ${\cal{F}}$ from data while determining the endmembers and abundances appears to be a remedy to this problem.
Under the assumption that the mixture mechanism ${\cal{F}}$ is unknown or partially unknown, the optimization formulation that learns ${\cal{F}}$ from the data can be formulated as \begin{align*}\{{\hat{\boldsymbol{M}}},\{{\hat{\boldsymbol{a}}}_{i}\}_{{i} = {1}}^{N},{\hat{\cal{F}}}\} & = \mathop{\text{argmin}}\limits_{{\boldsymbol{M}}{,\{}{\boldsymbol{a}}_{i}{\}}_{{i} = {1}}^{N},{\cal{F}}\in{\cal{H}}}\mathop{\sum}\limits_{{i} = {1}}^{N}{\cal{L}}{(}{\boldsymbol{y}}_{i},{\hat{\boldsymbol{y}}}_{i}{)} + {\cal{R}}{(}{\boldsymbol{M}},\{{\boldsymbol{a}}_{i}\}_{{i} = {1}}^{N}{)} \\ & \quad\quad {\text{with}}\quad{\hat{\boldsymbol{y}}}_{i} = {\cal{F}}{(}{\boldsymbol{M}},{\boldsymbol{a}}_{i}{)} \\ & \quad\quad {\text{s.t.}}\quad{\boldsymbol{M}}\in{\Omega}_{M},\,{\text{and}}\,{\boldsymbol{a}}_{i}\in{\Omega}_{a}, \tag{12} \end{align*} where ${\cal{H}}$ denotes a functional space in which ${\cal{F}}$ lies. Typical nonlinear unmixing problems with conventional machine learning techniques often fall into this category [10], [11].
Learning ${\cal{F}}$ from data opens the possibility of determining mixture mechanisms that are difficult to formulate using explicit descriptions and expressions. However, this strategy may fail if ${\cal{F}}$ is not sufficiently parameterized and jointly constrained with physics-based models. Some researchers have specified several forms for ${\cal{F}}$ to make it more interpretable. In addition to the trivial linear form, the additive nonlinear model and postnonlinear model are two major categories that have been extensively investigated.
Models in this category consider that the mixture consists of a linear mixture with an additive nonlinear component involving the endmembers and their abundances; that is, \[{\boldsymbol{y}} = {\boldsymbol{Ma}} + {\cal{F}}_{\text{add}}{(}{\boldsymbol{M}},{\boldsymbol{a}}{)} + \boldsymbol{\epsilon}{.} \tag{13} \]
Several subcategories, depending on specific forms for ${\cal{F}}_{\text{add}},$ have been investigated to make the problem tractable, including the linear-mixture/nonlinear-fluctuation model, where the nonlinear term depends only on the endmembers, \[{\boldsymbol{y}} = {\boldsymbol{Ma}} + {\cal{F}}_{\text{add}}{(}{\boldsymbol{M}}{)} + \boldsymbol{\epsilon}{,} \tag{14} \] and models where the nonlinear term involves the endmembers scaled by their abundances, \[{\boldsymbol{y}} = {\boldsymbol{Ma}} + {\cal{F}}_{\text{add}}{(}{a}_{1}{\boldsymbol{m}}_{1},\cdots,{a}_{R}{\boldsymbol{m}}_{R}{)} + \boldsymbol{\epsilon}{.} \tag{15} \]
Models in this category consider that a spectrum is produced by distorting the LMM with a nonlinear function, that is, \[{\boldsymbol{y}} = {\cal{F}}_{\text{post}}{(}{\boldsymbol{Ma}}{)} + \boldsymbol{\epsilon}{.} \tag{16} \]
A specific form can be constructed by setting ${\cal{F}}_{\text{post}} = {I} + {\cal{F}}_{\text{add}},$ leading to the following additive postnonlinear model: \[{\boldsymbol{y}} = {\boldsymbol{Ma}} + {\cal{F}}_{\text{add}}{(}{\boldsymbol{Ma}}{)} + \boldsymbol{\epsilon}, \tag{17} \] which can also be viewed as a special case of (15) by setting ${\cal{F}}_{\text{add}}{(}{a}_{1}{\boldsymbol{m}}_{1},\cdots,{a}_{R}{\boldsymbol{m}}_{R}{)} = {\cal{F}}_{\text{add}}{(}{a}_{1}{\boldsymbol{m}}_{1} + \cdots + {a}_{R}{\boldsymbol{m}}_{R}{).}$
Figure 1 shows the hierarchical structure of the preceding categories and how they cover typical physics-based mixture models. The following examples illustrate the relationship between Problem formulation 2 and two typical nonlinear unmixing techniques.
Figure 1. The hierarchical structure of general mixture models and physics-based models. The arrows go from general models to more specific ones. Physics-based models are marked with $\dagger$.
The derivation of the K-Hype algorithm is based on a linear-mixture/nonlinear-fluctuation model of the form (14). The nonlinear fluctuation function ${\cal{F}}_{\text{add}}$ is constrained to be a member of a reproducing kernel Hilbert space (RKHS) ${\cal{H}}_{\text{RKHS}}$ and expressed with kernels. Considering pixels independently and assuming that ${\boldsymbol{M}}$ is known, the optimization problem is given by \begin{align*}\{{\hat{\boldsymbol{a}}},{\hat{\cal{F}}}_{\text{add}}\} = & \mathop{\text{argmin}}\limits_{{\boldsymbol{a}},{\cal{F}}_{\text{add}}\in{\cal{H}}_{\text{RKHS}}}\frac{1}{{2}{\mu}}\mathop{\sum}\limits_{{\ell} = {1}}^{L}{e}_{\ell}^{2} + \frac{1}{2}\left({\Vert}{\cal{F}}_{\text{add}}{\Vert}^{2} + {\Vert}{\boldsymbol{a}}{\Vert}^{2}\right) \\ & \quad {\text{with}}\quad{e}_{\ell} = {y}_{\ell}{-}{\boldsymbol{m}}_{{\lambda}_{\ell}}^{\top}{\boldsymbol{a}}{-}{\cal{F}}_{\text{add}}{(}{\boldsymbol{m}}_{{\lambda}_{\ell}}{)} \\ & \quad {\text{s.t.}}\quad{\boldsymbol{a}}\in{\Omega}_{a}{,} \tag{18} \end{align*} where ${\boldsymbol{m}}_{{\lambda}_{\ell}}$ denotes the ${\ell}$th row of ${\boldsymbol{M}}$, i.e., all reflectances at band ${\lambda}_{\ell}.$ This problem can be solved using the Lagrangian method, which yields an explicit expression relating $\hat{\boldsymbol{a}}$ and the dual variables and expands ${\hat{\cal{F}}}_{\text{add}}$ as an expansion of kernel functions parameterized by $\{{\boldsymbol{m}}_{{\lambda}_{\ell}}{\}}_{{\ell} = {1}}^{L}.$ The K-Hype formulation, along with its sparse variant [10] and its spatially regularized variant in [11], clearly falls into Problem formulation 2.
The geodesic distance can be calculated to conduct data unmixing on a manifold rather than in a linear space. This process considers a nonlinear but continuous bijective mapping ${\cal{C}}$ between the linear space of abundance coefficients and a manifold in the spectral space, \[{\boldsymbol{y}} = {\cal{C}}{(}{\boldsymbol{Ma}}{)} + \boldsymbol{\epsilon}, \tag{19} \] which is of the form (16), as shown in [12] and [13]. More specifically, defining a K-nearest neighbor graph over the dataset, the geodesic distance can be approximated by the length of the shortest path between two points. The endmembers $\{{\boldsymbol{m}}_{i}{\}}_{{i} = {1}}^{R}$ span the simplex with the largest volume on the data manifold, and the abundance coefficients of each pixel to be unmixed can be expressed as volume ratios in this manifold. The manifold formulation also falls into Problem formulation 2, as ${\cal{C}}$ is learned from the data.
Problem formulations 1 and 2 cover physics-based methods or conventional machine learning techniques with handcrafted regularizers. See “Implications of Problem Formulations 1 and 2”. This article focuses more on the integration of physics-based methods with recent machine learning techniques powered by DNNs and other data-driven techniques. Problem formulation 3 encompasses these new methods by introducing additional hierarchical structures on endmembers and abundances.
Implications of Problem Formulations 1 and 2
Problem formulations 1 and 2 imply that the following three steps are essential for achieving a good unmixing performance: 1) defining or learning a mixture model ${\cal{F}}$ that faithfully describes the actual mixing mechanism, 2) designing a regularizer ${\cal{R}}$ that captures prior knowledge on the endmembers and abundances, and 3) selecting a loss function ${\cal{L}}$ suited to the data.
Under the assumption that the mixture mechanism ${\cal{F}}$ is unknown or partially unknown, the optimization formulation that 1) jointly learns ${\cal{F}}$ from data and 2) considers hierarchical reparameterization of the endmembers and abundances can be formulated as \begin{align*}&\{{\hat{\boldsymbol{M}}},{\boldsymbol{\Theta}}_{\boldsymbol{M}}^{\ast},\{{\hat{\boldsymbol{a}}}_{i}\}_{{i} = {1}}^{N},{\boldsymbol{\Theta}}_{\boldsymbol{a}}^{\ast},{\hat{\cal{F}}}\} = \\ & \quad \mathop{\text{argmin}}\limits_{{\boldsymbol{M}},{\boldsymbol{\Theta}}_{\boldsymbol{M}},\{{\boldsymbol{a}}_{i}\}_{{i} = {1}}^{N},{\boldsymbol{\Theta}}_{\boldsymbol{a}},{\cal{F}}\in{\cal{H}}}\mathop{\sum}\limits_{{i} = {1}}^{N}{\cal{L}}{(}{\boldsymbol{y}}_{i},{\hat{\boldsymbol{y}}}_{i}{)} + {\cal{R}}{(}{\boldsymbol{M}},\{{\boldsymbol{a}}_{i}\}_{{i} = {1}}^{N}{)}, \\ & \qquad\qquad\qquad {\text{with}}\quad{\hat{\boldsymbol{y}}}_{i} = {\cal{F}}{(}{\boldsymbol{M}}{(}{\boldsymbol{\Theta}}_{\boldsymbol{M}}{),}{\boldsymbol{a}}_{i}{(}{\boldsymbol{\Theta}}_{\boldsymbol{a}}{)}{)} \\ & \qquad\qquad\qquad {\text{s.t.}}\quad{\boldsymbol{M}}\in{\Omega}_{M},\,{\text{and}}\,{\boldsymbol{a}}_{i}\in{\Omega}_{a} \tag{20} \end{align*} where ${\boldsymbol{\Theta}}_{\boldsymbol{M}}$ and ${\boldsymbol{\Theta}}_{\boldsymbol{a}}$ are parameters that allow us to determine the endmember matrix ${\boldsymbol{M}}$ and the abundances $\{{\boldsymbol{a}}_{i}{\}}_{{i} = {1}}^{N},$ respectively, and ${\cal{H}}$ denotes a given function space. For example, in recent DNN structures, ${\boldsymbol{\Theta}}_{\boldsymbol{a}}$ can represent the encoder in an autoencoder unmixing framework, and ${\boldsymbol{\Theta}}_{\boldsymbol{M}}$ can represent the decoder part that generates the endmembers. Recent physically inspired data-driven methods will be reviewed in the following three sections. Their connections to Problem formulation 3 will also be established. In contrast to conventional machine learning-based unmixing methods, which mostly focus on learning mixture models from data, recent data-driven methods have shown their superiority in improving a variety of aspects, including model design, prior capturing, and loss function design.
In practice, it appears natural to model ${\cal{F}}^{\ast}$ in (3) using DNNs since they offer an enhanced modeling capacity compared to conventional nonlinear modeling techniques. Instead of using DNNs as black boxes, elegant designs of network structures for modeling (15)–(17) are indispensable for preserving physical interpretation and thus facilitating the extraction of abundances and endmembers. Deep autoencoder networks are considered the most appropriate architectures and have been intensively studied for this purpose. Essentially, an autoencoder consists of an encoder network and a decoder network. The encoder maps the input to a low-dimensional representation (e.g., the abundances). Given an input ${\boldsymbol{y}}$, \[{\boldsymbol{\varrho}} = {f}_{{\boldsymbol{\Theta}}_{\text{enc}}}{(}{\boldsymbol{y}}{),} \tag{21} \] where ${f}_{{\boldsymbol{\Theta}}_{\text{enc}}}$ is the function representing the encoder, with ${\boldsymbol{\Theta}}_{\text{enc}}$ denoting all network parameters. The decoder reverses the functionality of the encoder by decompressing the hidden representation vector to reconstruct the original input data, namely, \[\hat{\boldsymbol{y}} = {f}_{{\boldsymbol{\Theta}}_{\text{dec}}}{(}{\boldsymbol{\varrho}}{),} \tag{22} \] where ${f}_{{\boldsymbol{\Theta}}_{\text{dec}}}$ is the function representing the decoder, with ${\boldsymbol{\Theta}}_{\text{dec}}$ denoting all network parameters. For instance, an autoencoder can be designed to provide an estimate ${\boldsymbol{\varrho}}$ of the abundances ${\boldsymbol{a}}$ and to estimate the endmembers and the mixture mechanism with the decoder network. Then, the identifications ${\boldsymbol{\Theta}}_{\boldsymbol{a}}\leftarrow{\boldsymbol{\Theta}}_{\text{enc}}$ and ${\cal{F}}\leftarrow{f}_{{\boldsymbol{\Theta}}_{\text{dec}}}$ can be considered to fit these structures into Problem formulation 3.
In the following, rather than simply presenting the original autoencoders introduced in existing works as a whole, we explore the designs of encoders and decoders separately to demonstrate their extra flexibility and the potential of combining different designs. The basic structure of an autoencoder and typical encoder and decoder designs for unmixing are summarized in Figure 2. We start with decoders that are, in fact, associated with mixture models. Then, we proceed with the encoders that implicitly benefit from data priors.
Figure 2. The structure of an autoencoder for hyperspectral unmixing and a variety of selections of encoder and decoder designs. FCN: fully connected network; CNN: convolutional neural network.
Starting with the decoder of a single linear layer, we progressively demonstrate that the mixture models learned by an autoencoder are actually dependent on the decoder.
The relationship between the input and output of a fully connected linear layer decoder is as follows: \[{\hat{\boldsymbol{y}}} = {f}_{{\boldsymbol{\Theta}}_{\text{dec}}}{(}{\boldsymbol{\varrho}}{)} = {\boldsymbol{W}}{\boldsymbol{\varrho}}{.} \tag{23} \]
As shown in D1 of Figure 2, this form is consistent with the LMM defined in (5). Consequently, the learning process of the autoencoder, which minimizes the gap between $\hat{\boldsymbol{y}}$ and ${\boldsymbol{y}}$, leads to ${\boldsymbol{\varrho}}$ as an estimate of ${\boldsymbol{a}}$ and ${\boldsymbol{W}}$ as an estimate of the endmember matrix ${\boldsymbol{M}}$. Although the linear unmixing problem has been extensively studied with various classes of methods [14], [15], [16], [17], [18], [19], this simple structure motivates the more sophisticated designs that learn structured models from the data.
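As a concrete illustration of these identifications and of the D1 design, the following PyTorch sketch pairs a fully connected encoder, cf. (21), with a single bias-free linear decoder layer, cf. (23); the layer widths and activations are illustrative assumptions, not a design prescribed by the surveyed works.

```python
import torch
from torch import nn

class LinearDecoderAE(nn.Module):
    """Unmixing autoencoder with the D1 linear decoder: the encoder
    plays the role of Theta_a in (20), and the decoder weight is
    read off as the endmember matrix M (i.e., Theta_M)."""
    def __init__(self, n_bands, n_endmembers):
        super().__init__()
        self.encoder = nn.Sequential(            # cf. (21)
            nn.Linear(n_bands, 128), nn.ReLU(),
            nn.Linear(128, n_endmembers),
            nn.Softmax(dim=-1),                  # enforces ANC + ASC
        )
        self.decoder = nn.Linear(n_endmembers, n_bands, bias=False)  # (23)

    def forward(self, y):
        a = self.encoder(y)                      # abundance estimate
        return self.decoder(a), a                # reconstruction, cf. (5)

# After training, the endmember estimate is the decoder weight:
# M_hat = model.decoder.weight.detach().clamp(min=0)   # ENC by clipping
```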
With the linear decoder at hand, one can connect it to additional nonlinear layers to create the following postnonlinear structure: \[{\hat{\boldsymbol{y}}} = {f}_{{\boldsymbol{\Theta}}_{\text{dec}}}{(}{\boldsymbol{\varrho}}{)} = \Psi{(}{\boldsymbol{W}}{\boldsymbol{\varrho}}{),} \tag{24} \] where $\Psi$ denotes an implicit mapping function represented by the subsequent network layers with nonlinear activation functions. The parameters ${\boldsymbol{\Theta}}_{\text{dec}}$ can be determined by training the autoencoder so that the nonlinear function is learned from the data. Compared with the postnonlinear mixture model defined in (16), $\Psi$ is learned to mimic the generative model ${\cal{F}}_{\text{post}},$ and the endmembers are obtained from the weights of the first layer of the decoder [20]. This structure is shown by D2 in Figure 2.
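A possible realization of this D2 design is sketched below; the depth and widths of the layers implementing $\Psi$ are illustrative assumptions.

```python
import torch
from torch import nn

class PostNonlinearDecoder(nn.Module):
    """D2 decoder, cf. (24): a bias-free linear layer whose weight is
    read as the endmember matrix, followed by layers that learn the
    postnonlinear distortion Psi mimicking F_post in (16)."""
    def __init__(self, n_bands, n_endmembers):
        super().__init__()
        self.linear = nn.Linear(n_endmembers, n_bands, bias=False)
        self.psi = nn.Sequential(                # learned distortion
            nn.Linear(n_bands, n_bands), nn.Tanh(),
            nn.Linear(n_bands, n_bands),
        )

    def forward(self, rho):
        return self.psi(self.linear(rho))        # Psi(W rho), cf. (24)
```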
Instead of adding subsequent nonlinear layers, a network branch can be added in parallel to the linear layer to construct additive nonlinear structures. However, the parameters coupled in the linear and nonlinear components must be handled carefully.
Instead of being extracted from the weights of a decoder, the endmembers can also be estimated from a subnetwork; that is, \[\hat{\boldsymbol{y}} = {f}_{{\boldsymbol{\Theta}}_{\text{end}}}{(}{\boldsymbol{z}}{)}{\boldsymbol{\varrho}} = \Psi{(}{\boldsymbol{z}}{)}{\boldsymbol{\varrho}}, \tag{30} \] where ${f}_{{\boldsymbol{\Theta}}_{\text{end}}}$ represents the endmember generative model, with ${\boldsymbol{\Theta}}_{\text{end}}$ denoting the network parameters. This subnetwork can be a regular or variational autoencoder (VAE) that maps pixels to endmembers [25], [26] or a pretrained decoder to generate the endmembers from a latent vector [27]. The variable z denotes the input of this subnetwork, which can be pixels, candidate endmembers, or latent representations. This decoder is also a powerful tool to cope with spectral variability [26], [27]. Compared with Problem formulation 3, we identify ${\boldsymbol{\Theta}}_{\boldsymbol{M}}\leftarrow{\boldsymbol{\Theta}}_{\text{end}},$ leading to a hierarchical structure $\boldsymbol{M}({\boldsymbol{\Theta}}_{\boldsymbol{M}})$.
The encoder part is designed to map the input spectra into a latent representation space, which is often considered as the abundance space. The encoder implicitly acts as a regularizer relating abundances and spectra and imposes a hierarchical structure on a with parameters ${\boldsymbol{\Theta}}_{a},$ i.e., of the form ${\boldsymbol{a}}{(}{\boldsymbol{\Theta}}_{\boldsymbol{a}}{)}{.}$ The encoder part has been designed in the following ways in the literature.
Early works [14], [15], [16], [17] employ fully connected layers to construct the encoder. The latent representation of the kth layer is given by \begin{align*}&{\boldsymbol{h}}_{k} = {\sigma}_{k}{(}{\boldsymbol{U}}_{k}{\boldsymbol{h}}_{{k}{-}{1}} + {\boldsymbol{b}}_{k}{)} \\ & {\text{with}}\quad{\boldsymbol{h}}_{0} = {\boldsymbol{y}}{,} \tag{31} \end{align*} where ${\sigma}_{k}$ represents the nonlinear activation function, and ${\boldsymbol{U}}_{k}$ and ${\boldsymbol{b}}_{k}$ stand for the weights and bias of the kth layer, respectively. The output of the final layer is denoted by ${\boldsymbol{\varrho}}$ and is connected to the decoder. Some variants have also been designed based on this structure. For example, in [14] and [15], a sparse encoder is used to enforce abundance sparsity. The work [15] applies a denoising encoder to cope with noise and outliers in hyperspectral data.
Convolutional neural networks (CNNs) have also been used for designing the encoder [18], [19], [22]. Encoder inputs can be the entire image or overlapping patches. 2D or 3D convolution operators act on 3D image cubes ${\mathbb{Y}}_{n}$ as follows: \begin{align*}&{\mathbb{H}}_{k} = {\sigma}_{k}{(}{\mathbb{H}}_{{k}{-}{1}}\,{\circledast}\,{\mathbb{U}}_{k} + {\mathbb{B}}_{k}{)} \\ & {\text{with}}\quad{\mathbb{H}}_{0} = {\mathbb{Y}}_{n}, \tag{32} \end{align*} where ${\mathbb{U}}_{k}$ denotes a convolution kernel, and ${\mathbb{B}}_{k}$ represents the bias parameters. As 2D convolutional kernels may cause local spectral distortion, 3D-CNN encoders are used in [19] and [22] to jointly learn the spatial structures and local spectral features of 3D input images.
Recurrent neural networks, such as long short-term memory (LSTM) networks, benefit from memory cells and are better suited for processing sequences. With their recurrent hidden states depending on previous steps, they can take full advantage of the sequential characteristics in hyperspectral data. The abundances are estimated with an LSTM to encode spectral correlation and band-to-band variability information [21].
Another class of approaches uses kernel layers to process hyperspectral data. Such layers can map the input pixels into a manifold and estimate the abundances according to a Riemannian metric. For example, in [24], the kernel layer is defined as \[{a}_{i} = \exp\left({-}{\beta}\,{\cdot}\,{\Vert}{\boldsymbol{y}}{-}{\boldsymbol{c}}_{i}{\Vert}^{2}\right), \tag{33} \] where ${\beta}$ is the kernel parameter, ${\boldsymbol{y}}$ is the input pixel, and ${\boldsymbol{c}}_{i}$ is the centroid related to the ith endmember. The smaller the distance between the mixed pixel and the endmember centroid, the larger the corresponding abundance.
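A sketch of such a kernel layer is given below; the softmax-style normalization appended at the output is our assumption, added so that the responses satisfy the ANC and ASC, and is not necessarily how [24] post-processes them.

```python
import torch
from torch import nn

class KernelLayer(nn.Module):
    """Gaussian kernel layer in the spirit of (33): unit i responds
    strongly when the input pixel is close to the learnable
    centroid c_i associated with the ith endmember."""
    def __init__(self, n_bands, n_endmembers, beta=1.0):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_endmembers, n_bands))
        self.beta = beta

    def forward(self, y):                        # y: (batch, n_bands)
        d2 = torch.cdist(y, self.centroids) ** 2 # squared distances to c_i
        a = torch.exp(-self.beta * d2)           # kernel responses, cf. (33)
        return a / a.sum(dim=-1, keepdim=True)   # ANC + ASC (our assumption)
```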
A variational encoder encodes each input pixel with a distribution in the latent space as opposed to a single vector. In practice, it is common to consider Gaussian distributions parameterized by their means and covariance matrices. The work in [16] uses a variational encoder to infer abundances. In [26], a probabilistic generative model is used to address endmember variability and provide more accurate abundance and endmember estimates.
Some works use multibranch encoders to capture different features by sharing weights or concatenating features. In [28], an endmember-guided encoder is used to transfer the endmember information into a parallel abundance estimate encoder. The work [29] uses a dual-branch encoder, where one branch consists of fully connected layers to extract spectral information, and the other branch consists of 2D convolution layers to capture spatial information.
The ANC and ASC are usually addressed at the last layer of the encoder, either by using specific activation functions [e.g., a rectified linear unit (ReLU), a sigmoid, or an absolute operator] to enforce the ANC, followed by a normalization layer to enforce the ASC, or by using SoftMax to enforce the ANC and ASC simultaneously. They can also be addressed approximately via a penalty term in the loss function.
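These two constraint-handling options translate into a few lines of code, sketched below for illustration.

```python
import torch.nn.functional as F

def abundances_softmax(logits):
    """SoftMax enforces the ANC and ASC simultaneously."""
    return F.softmax(logits, dim=-1)

def abundances_relu_normalized(logits, eps=1e-8):
    """ReLU enforces the ANC; dividing by the sum enforces the ASC."""
    a = F.relu(logits)
    return a / (a.sum(dim=-1, keepdim=True) + eps)
```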
Considering the 3D-CNN-based encoder (E2 in Figure 2) and the generalized additive nonlinear decoder defined in (28) (D5 in Figure 2) leads to the algorithm proposed in [22].
Considering the denoising and sparse encoder and the linear decoder defined in (23) (E1 and D1 in Figure 2) leads to the method proposed in [15].
Considering the kernel-layer-based encoder (E4 in Figure 2), the bilinear nonlinear decoder defined in (29) (D4 in Figure 2), and the postnonlinear decoder defined in (25) (D3 in Figure 2) yields the method proposed in [24].
A two-stream autoencoder framework with a probabilistic generative model (E3 and D6 in Figure 2) that copes with spectral variability leads to the method proposed in [26].
The combination of two deep autoencoders with a multitask learning framework leads to the method in [23]. One of the autoencoders models the linear components and estimates the endmembers as well as their abundances, while the other one models the bilinear components defined in (29) and estimates interaction abundances.
In addition to the regular autoencoders described so far, variants of the autoencoder framework have also been proposed for spectral unmixing. For instance, in [16], a group of autoencoders is first applied to address issues regarding outliers and to generate a good initialization. Then, an autoencoder with a single-layer linear decoder is used to perform the unmixing. To improve the unmixing performance, two cascaded autoencoders are utilized in an end-to-end fashion to enforce cycle-consistency constraints in [30]. In [31], the endmembers are extracted using a geometric endmember extraction method, and the abundances are estimated using a deep image prior based on a CNN (depicted by E5 in Figure 2). In [32], a two-stream network, with one stream for spectral information and the other for spatial information, is collaboratively learned to make full use of both. The work [33] constructs an attention network to fuse the information of lidar data into the unmixing process. See “Unmixing With Deep Autoencoder Networks” for summarized insights into the roles of the encoder and decoder in unmixing.
Unmixing With Deep Autoencoder Networks
Remarks on unmixing with deep autoencoder networks are as follows: the decoder embodies the mixture model, so its structure determines the physical interpretation of the learned representation; the encoder maps spectra to abundances and implicitly acts as a regularizer that benefits from data priors; and, because the two parts are largely independent, encoder and decoder designs can be combined flexibly to suit the scene at hand.
In addition to modeling the spectral mixture mechanism, DNNs are superior to conventional machine learning techniques in extracting information from data priors. It is intriguing to use DNNs to explicitly learn prior knowledge from data and use it with physical model-based inversion algorithms. Instead of designing sophisticated regularizers, the plug-and-play paradigm uses a denoiser with deep architectures to learn an image prior and then incorporates the denoiser into an iterative optimization of the problem in (4). On the other hand, deep unrolling algorithms aim to unfold an iterative optimization of the problem in (4) with a specific regularizer into a trainable end-to-end deep network.
Benefiting from the variable splitting technique, the plug-and-play framework makes it possible to exploit the priors learned from training data to solve various image restoration problems. State-of-the-art denoising algorithms, especially fast and powerful CNN-based denoisers, are usually plugged into this framework as the proximity operator that captures the intrinsic spatial and spectral structures of hyperspectral images. By leveraging the flexibility of model-based optimization and the powerful learning ability of CNNs, this framework has shown great potential in performing unmixing tasks. Among the three problem formulations, the form based on Problem formulation 1 is considered here for illustration because of its simplicity. In the plug-and-play framework, an auxiliary variable Z is introduced in (4) to replace A in the regularization term, with an extra constraint ${\boldsymbol{Z}} = {\boldsymbol{A}}$ to ensure the equivalence of the optimization problems. Using the alternating direction method of multipliers (ADMM) results in the following iterative updates: \begin{align*}\{{\boldsymbol{M}}^{{(}{k} + {1}{)}},\,{\boldsymbol{A}}^{{(}{k} + {1}{)}}\} = & \mathop{\text{argmin}}\limits_{{\boldsymbol{M}},\{{\boldsymbol{a}}_{i}\}_{{i} = {1}}^{N}}\mathop{\sum}\limits_{{i} = {1}}^{N}{\cal{L}}{(}{\boldsymbol{y}}_{i},{\hat{\boldsymbol{y}}}_{i}{)} + \frac{\rho}{2}{\Vert}{\boldsymbol{A}}{-}{\boldsymbol{Z}}^{(k)} + {\boldsymbol{V}}^{{(}{k}{)}}{\Vert}_{F}^{2} \\ {\text{with}}\quad{\hat{\boldsymbol{y}}}_{i} = & {\cal{F}}{(}{\boldsymbol{M}},{\boldsymbol{a}}_{i}{)} \\ {\text{s.t.}}\quad{\boldsymbol{M}}\in&{\Omega}_{M}\,{\text{and}}\,{\boldsymbol{a}}_{i}\in{\Omega}_{a}, \tag{34a} \\ {\boldsymbol{Z}}^{{(}{k} + {1}{)}} = & \mathop{\text{argmin}}\limits_{\boldsymbol{Z}}\frac{\rho}{2}{\Vert}{\boldsymbol{Z}}{-}{(}{\boldsymbol{A}}^{{(}{k} + {1}{)}} + {\boldsymbol{V}}^{(k)}{)}{\Vert}_{F}^{2} + {\cal{R}}{(}{\boldsymbol{Z}}{),} \tag{34b} \\ {\boldsymbol{V}}^{{(}{k} + {1}{)}} = & {\boldsymbol{V}}^{(k)} + {\boldsymbol{A}}^{{(}{k} + {1}{)}}{-}{\boldsymbol{Z}}^{{(}{k} + {1}{)}}, \tag{34c} \end{align*} where the superscript ${}^{{(}{k}{)}}$ refers to the iteration index, ${\rho}$ is the penalty parameter, and ${\boldsymbol{V}}^{(k)}$ is the dual variable to be updated at iteration k. In this formulation, the regularizer ${\cal{R}}{(}\boldsymbol{A}{)}$ can be implicitly defined by a plugged denoising operator Denoiser, and (34b) can be seen as the denoising procedure of ${(}{\boldsymbol{A}}^{{(}{k} + {1}{)}} + {\boldsymbol{V}}^{(k)}{)}$ and replaced by \[{\boldsymbol{Z}}^{{(}{k} + {1}{)}} = {\text{Denoiser}}\,{(}{\boldsymbol{A}}^{{(}{k} + {1}{)}} + {\boldsymbol{V}}^{(k)}{)}{.} \tag{35} \]
Generally, Denoiser can be any off-the-shelf denoising operator. This offers the opportunity of incorporating CNN-based denoisers with powerful prior learning ability into the physical model-based iterative optimization. In particular, the work [34] considers the LMM assumption in (5) with known M and incorporates conventional or deep learning-based denoisers with two strategies, namely, a direct regularizer on A and a regularizer on reconstructed spectra, i.e., MA . Considering the case of an unknown M , the work in [35] jointly estimates M and exploits a handcrafted sparsity prior with the regularizer ${\Vert}{\boldsymbol{A}}{\Vert}_{2,1}$ and a denoiser prior with implicit regularizer ${\cal{R}}(\boldsymbol{A})$ in the plug-and-play framework to produce more stable unmixing results. The work in [36] proposes a nonlinear plug-and-play unmixing method based on the general bilinear model in (7), and both traditional and CNN-based denoisers are plugged into the iterative optimization.
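A minimal sketch of this plug-and-play loop under the LMM with known ${\boldsymbol{M}}$ (the setting of [34]) is given below; `denoiser` stands for any off-the-shelf operator acting on the abundance maps, and the projection of ${\boldsymbol{A}}$ onto ${\Omega}_{a}$ in (34a) is omitted for brevity.

```python
import numpy as np

def pnp_admm_unmix(Y, M, denoiser, rho=1.0, n_iter=50):
    """Plug-and-play ADMM for linear unmixing, cf. (34a)-(34c) and
    (35). Y: (L, N) observed pixels; M: (L, R) known endmembers;
    denoiser: callable acting on an (R, N) abundance array."""
    R, N = M.shape[1], Y.shape[1]
    A = np.full((R, N), 1.0 / R)
    Z, V = A.copy(), np.zeros_like(A)
    MtM = M.T @ M
    for _ in range(n_iter):
        # (34a): least-squares data fit plus proximity to Z - V
        A = np.linalg.solve(MtM + (rho / 2.0) * np.eye(R),
                            M.T @ Y + (rho / 2.0) * (Z - V))
        # (34b) replaced by the plugged denoiser, cf. (35)
        Z = denoiser(A + V)
        # (34c): dual update
        V = V + A - Z
    return A
```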
Unlike plug-and-play algorithms, the deep unrolling paradigm unfolds and truncates iterative optimization algorithms into trainable deep architectures. Considering the LMM in (5) with known ${\boldsymbol{M}}$, deep unrolling methods solve the unmixing problem in (10) with the ${\Vert}{\boldsymbol{A}}{\Vert}_{1}$ sparsity regularizer using iterative algorithms, such as the iterative soft thresholding algorithm (ISTA) and the ADMM. In [37], by introducing an auxiliary variable ${\boldsymbol{z}}$ such that ${\boldsymbol{z}} = {\boldsymbol{a}}$ along with a dual variable ${\boldsymbol{v}}$, the optimization problem is iteratively solved by the ADMM with the following steps: \begin{align*}{\boldsymbol{a}}^{{(}{k} + {1}{)}} = & {\boldsymbol{Wy}} + {\boldsymbol{B}}{(}{\boldsymbol{z}}^{(k)} + {\boldsymbol{v}}^{(k)}{),} \tag{36a} \\ {\boldsymbol{z}}^{{(}{k} + {1}{)}} = & \max{(}{\text{soft}}{(}{\boldsymbol{a}}^{{(}{k} + {1}{)}}{-}{\boldsymbol{v}}^{(k)},{\lambda}{/}{\rho}{),}{0}{),} \tag{36b} \\ {\boldsymbol{v}}^{{(}{k} + {1}{)}} = & {\boldsymbol{v}}^{(k)}{-}{\eta}{(}{\boldsymbol{a}}^{{(}{k} + {1}{)}}{-}{\boldsymbol{z}}^{{(}{k} + {1}{)}}{),} \tag{36c} \end{align*} where ${\lambda}$ is the regularization parameter, ${\rho}$ denotes the penalty parameter, ${\eta}$ is a parameter offering additional flexibility, and ${\boldsymbol{W}} = {(}{\boldsymbol{M}}^{\top}{\boldsymbol{M}} + {\mu}{\boldsymbol{I}}{)}^{{-}{1}}{\boldsymbol{M}}^{\top}$ and ${\boldsymbol{B}} = {(}{\boldsymbol{M}}^{\top}{\boldsymbol{M}} + {\mu}{\boldsymbol{I}}{)}^{{-}{1}}\,{\rho}{.}$ The operator ${\text{soft}}{(}{\cdot}{)}$ in (36b) is the soft-thresholding operator given by ${\text{soft}}{(}{\boldsymbol{x}},{\lambda}{/}{\rho}{)} = {\text{sign}}{(}{\boldsymbol{x}}{)(}{\vert}{\boldsymbol{x}}{\vert}{-}{\lambda}{/}{\rho}{)}_{+}{.}$ The key feature of deep unrolling unmixing algorithms is to leverage a reparameterization of these iterative update steps through the consecutive DNN layers (denoted as ${f}_{\boldsymbol{a}},{f}_{\boldsymbol{z}},\,{\text{and}}\,{f}_{\boldsymbol{v}}{)}$ with learnable parameters ${\{}{\boldsymbol{W}}^{{(}{k} + {1}{)}},{\boldsymbol{B}}^{{(}{k} + {1}{)}},{\theta}^{{(}{k} + {1}{)}},{\eta}^{{(}{k} + {1}{)}}{\},}$ respectively: \begin{align*}{\boldsymbol{a}}^{{(}{k} + {1}{)}} = & {f}_{\boldsymbol{a}}{(}{\boldsymbol{z}}^{(k)},{\boldsymbol{v}}^{(k)},{\boldsymbol{y}}{;}{\boldsymbol{W}}^{{(}{k} + {1}{)}},{\boldsymbol{B}}^{{(}{k} + {1}{)}}{)} \\ = & {\boldsymbol{W}}^{{(}{k} + {1}{)}}{\boldsymbol{y}} + {\boldsymbol{B}}^{{(}{k} + {1}{)}}{(}{\boldsymbol{z}}^{(k)} + {\boldsymbol{v}}^{(k)}{),} \\ {\boldsymbol{z}}^{{(}{k} + {1}{)}} = & {f}_{\boldsymbol{z}}{(}{\boldsymbol{a}}^{{(}{k} + {1}{)}},{\boldsymbol{v}}^{(k)}{;}{\theta}^{{(}{k} + {1}{)}}{)} \\ = & {\text{ReLU}}{(}{\boldsymbol{a}}^{{(}{k} + {1}{)}}{-}{\boldsymbol{v}}^{(k)}{-}{\theta}^{{(}{k} + {1}{)}}{\boldsymbol{I}}{),} \\ {\boldsymbol{v}}^{{(}{k} + {1}{)}} = & {f}_{\boldsymbol{v}}{(}{\boldsymbol{a}}^{{(}{k} + {1}{)}},{\boldsymbol{z}}^{{(}{k} + {1}{)}},{\boldsymbol{v}}^{(k)}{;}{\eta}^{{(}{k} + {1}{)}}{)} \\ = & {\boldsymbol{v}}^{(k)}{-}{\eta}^{{(}{k} + {1}{)}}{(}{\boldsymbol{a}}^{{(}{k} + {1}{)}}{-}{\boldsymbol{z}}^{{(}{k} + {1}{)}}{),} \tag{37} \end{align*} where ${\theta}^{{(}{k} + {1}{)}}$ is a learnable parameter that plays the role of ${\lambda}{/}{\rho},$ and ${\text{ReLU}}{(}{\cdot}{)}$ is a componentwise ReLU operation. These specifically designed layers are then cascaded into an entire DNN architecture for unmixing, which can be trained with supervised loss functions in an end-to-end manner.
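For illustration, one unrolled iteration of (37) can be written as a PyTorch module, as sketched below; stacking K such layers and training them end to end yields the unrolled network, and the initial values of the learnable parameters are assumptions made here.

```python
import torch
from torch import nn

class UnrolledADMMLayer(nn.Module):
    """One unrolled ADMM iteration, cf. (37): W, B, theta, and eta
    are learned per layer instead of being fixed by the model."""
    def __init__(self, n_bands, n_endmembers):
        super().__init__()
        self.W = nn.Linear(n_bands, n_endmembers, bias=False)
        self.B = nn.Linear(n_endmembers, n_endmembers, bias=False)
        self.theta = nn.Parameter(torch.tensor(0.01))  # plays lambda/rho
        self.eta = nn.Parameter(torch.tensor(1.0))

    def forward(self, y, z, v):
        a = self.W(y) + self.B(z + v)            # cf. (36a)
        z = torch.relu(a - v - self.theta)       # cf. (36b)
        v = v - self.eta * (a - z)               # cf. (36c)
        return a, z, v
```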
For the blind unmixing problem in (11), the work in [38] builds a nonconvex sparse NMF model using ${\Vert}{\boldsymbol{A}}{\Vert}_{p}$ with ${0}\,{<}\,{p}\,{<}\,{1}$ and designs an interpretable sparsity-constrained NMF network based on the ISTA to jointly estimate ${\boldsymbol{M}}$ and ${\boldsymbol{A}}$.
The loss functions ${\cal{L}}$ in (4), (12), and (20) also play a crucial role in determining the unmixing performance. In addition to data-driven models, data-learned loss functions have recently been proposed to improve the unmixing performance, particularly for Problem formulation 3. The subsequent review focuses on autoencoder-based works, as they mostly refer to Problem formulation 3.
Geometric distances, which are usually used as the essential loss in autoencoder-based unmixing methods, enforce the network to reconstruct the input image. The works [14], [15], and [16] use the mean-squared error (MSE) to measure the similarity between the input and reconstructed pixels; that is, \[{\cal{L}}_{\text{MSE}} = \frac{1}{N}\mathop{\sum}\limits_{{i} = {1}}^{N}{\Vert}{\boldsymbol{y}}_{i}{-}{\hat{\boldsymbol{y}}}_{i}{\Vert}^{2}{.} \tag{38} \]
In contrast to the MSE, which is sensitive to the scale of spectra, the spectral angle distance (SAD) and spectral information divergence (SID) are scale invariant. The SAD considers the angle between two spectra as a spectral similarity measure: \[{\cal{L}}_{\text{SAD}} = \frac{1}{N}\mathop{\sum}\limits_{{i} = {1}}^{N}{\text{arccos}}\left(\frac{\left\langle{\boldsymbol{y}}_{i},{\hat{\boldsymbol{y}}}_{i}\right\rangle}{{\Vert}{\boldsymbol{y}}_{i}{\Vert}{\Vert}{\hat{\boldsymbol{y}}}_{i}{\Vert}}\right){.} \tag{39} \]
The SID models the spectra as probability distributions and treats band-to-band spectral variability as uncertainty in random variables to reduce the gap between input and reconstructed pixels, \[{\cal{L}}_{\text{SID}} = \frac{1}{N}\mathop{\sum}\limits_{{i} = {1}}^{N}{\boldsymbol{p}}_{i}^{\top}\log\left({\frac{{\boldsymbol{p}}_{i}}{{\hat{\boldsymbol{p}}}_{i}}}\right), \tag{40} \] where ${\boldsymbol{p}} = \left({\boldsymbol{y}/{\boldsymbol{1}}^{\top}\boldsymbol{y}}\right)$ and $\hat{\boldsymbol{p}} = \left({\hat{\boldsymbol{y}}/{\boldsymbol{1}}^{\top}\hat{\boldsymbol{y}}}\right)$ represent the probability distribution vectors of the input and estimated pixels, respectively, with the division and logarithm applied elementwise.
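For concreteness, the scale-invariant losses (39) and (40) amount to a few lines each; the PyTorch sketches below are illustrative, with small constants added for numerical stability as an implementation assumption.

```python
import torch

def sad_loss(y, y_hat, eps=1e-8):
    """Spectral angle distance, cf. (39), averaged over pixels."""
    cos = (y * y_hat).sum(-1) / (y.norm(dim=-1) * y_hat.norm(dim=-1) + eps)
    return torch.acos(torch.clamp(cos, -1.0, 1.0)).mean()

def sid_loss(y, y_hat, eps=1e-8):
    """Spectral information divergence, cf. (40): spectra are
    normalized to probability vectors and compared with the
    one-sided KL divergence of (40)."""
    p = y / (y.sum(-1, keepdim=True) + eps)
    q = y_hat / (y_hat.sum(-1, keepdim=True) + eps)
    return (p * torch.log((p + eps) / (q + eps))).sum(-1).mean()
```

Nevertheless, these conventional losses are defined at the pixel level based on basic statistical assumptions that disregard inherent image structures.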
Generative adversarial networks (GANs) have recently been used to address the limitations of geometric distance loss functions. A GAN consists of two components, a generator and a discriminator, and it models the distribution of data using DNNs. The generator ${\cal{G}}$ is trained to generate samples that are similar to real data to confuse the discriminator. The discriminator ${\cal{D}}$ receives samples from the real data distribution (positive data) and samples from the generator (negative data) and attempts to distinguish between the real and generated samples. These two networks engage in a min–max adversarial game during the training process that can be expressed as follows: \[\mathop{\min}\limits_{\cal{G}}\mathop{\max}\limits_{\cal{D}}\,{\mathbb{E}}_{{\boldsymbol{z}}\sim{q}{(}{\boldsymbol{z}}\mid{\boldsymbol{y}}{)}}{[}\log{\cal{D}}{(}{\boldsymbol{z}}{)]} + {\mathbb{E}}_{{\boldsymbol{z}}\sim{p}{(}{\boldsymbol{z}}{)}}{[}\log{(}{1}{-}{\cal{D}}{(}{\cal{G}}{(}{\boldsymbol{z}}{)))].} \tag{41} \]
The architecture of unmixing methods using an adversarial loss usually contains two components: an autoencoder and a discriminator. The former, considered to be the generator, aims to estimate the abundances and extract the endmembers. The latter is a deep binary classifier used to distinguish between generated and real samples. Thus, the discriminator may be able to transfer intrinsic properties of real data that are useful for unmixing. The loss function utilized for generator training is usually composed of two terms, \[{\cal{L}}_{\cal{G}} = {\cal{L}}_{\text{AE}} + {\cal{L}}_{\text{adv}}, \tag{42} \] where ${\cal{L}}_{\text{adv}} = {(}{1}{/}{N}{)}{\Sigma}_{{i} = {1}}^{N}\log{\cal{D}}\left({{\boldsymbol{z}}_{i}}\right) + {(}{1}{/}{N}{)}{\Sigma}_{{i} = {1}}^{N}\log\left({{1}{-}{\cal{D}}\left({{\hat{\boldsymbol{z}}}_{i}}\right)}\right)$ is the loss function that drives the generation of samples $\hat{\boldsymbol{z}}$ that confuse the discriminator, and ${\cal{L}}_{\text{AE}}$ is the reconstruction loss, e.g., a geometric cost function. The loss function for training the discriminator is defined as ${\cal{L}}_{\cal{D}} = {-}{\cal{L}}_{\text{adv}}{.}$ The work [39] introduces the Wasserstein distance calculated from the discriminator as a regularization term to characterize the distribution similarity between the input and reconstructed spectra. In [40], an adversarial network is utilized to model the abundance distribution.
Recent works have introduced the perception mechanism to discover the discrepancy between reconstructed pixels and their corresponding ground truth (input pixels). This strategy allows us to relax pixel-level reconstruction and enhance the unmixing performance using an end-to-end process. The perceptual similarity can be viewed as a feature-matching operator that compares the perceptual features of the reconstructed output pixels to those of the input pixels. The expression for the perceptual loss defined in a feature domain is \[{\cal{L}}_{\text{Perceptual}} = \frac{1}{N}\mathop{\sum}\limits_{{i} = {1}}^{N}{\Vert}{\cal{P}}{(}{\boldsymbol{y}}_{i}{)}{-}{\cal{P}}{(}{\hat{\boldsymbol{y}}}_{i}{)}{\Vert}^{2}, \tag{43} \] where ${\cal{P}}$ stands for a feature extractor. For example, in [30], a cycle-consistency strategy is used to further refine the detailed information in the unmixing process. In [39], the features extracted from the hidden layers of a discriminator are used to ensure the consistency of high-level representations.
This article reviewed hyperspectral unmixing methods with special attention focused on the integration of physics-based and data-driven methods. A mathematical framework was established to introduce a variety of existing works with a unified formalism. As summarized in this article, notable advances have been recently achieved in this topic. Table 2 lists some features of the surveyed methods and summarizes some of their pros and cons. However, there are still several important aspects that merit further discussion and investigation.
Table 2. Features of the surveyed methods.
Evaluating the performance of hyperspectral unmixing algorithms has always been nontrivial since few research works have been devoted to producing ground-truth data [41]. Performance evaluation becomes even more difficult when considering data-driven methods. Preliminary comparison results of several autoencoder-based unmixing methods with synthetic and real datasets have been provided recently in [42]. This study demonstrates the significance of using appropriate loss functions, nonlinear methods, spatial information, and proper initialization. However, thorough and objective evaluation studies are still missing.
Real data usually involve multiple mixing models in complex scenes. It is anticipated that data-driven models, such as the generalized linear-mixture/nonlinear-fluctuation model, will exhibit superior scene-specific adaptability. Nevertheless, it is challenging, if not impossible, to preselect an optimal model in advance. Using a mixing model with parallel multiple branches and an attention module to balance the contribution of each mixing model is a possible strategy to address this issue. Utilizing superpixel-based strategies and adopting distinct models within each superpixel and at its borders is an alternative solution.
This article reviewed the use of autoencoders to learn mixture models from data, the plug-and-play paradigm to learn priors from data, and perceptual dissimilarity characterized by features learned from data. Existing works include only one or two of these techniques. An elegant strategy that combines physics-based and data-driven methods and benefits from all of these factors is still missing. Developing one requires a deeper understanding and original methods in both optimization and neural network design.
Compared to classical optimization methods, data-driven approaches using DNNs require significantly higher computational resources, especially when multiple data-driven modules collaborate to perform unmixing. Fast implementations must be considered for the practical use of unmixing algorithms. The formulation of the optimization problems, the design of the network structures, and the way of integrating them are crucial for enhancing the convergence speed of the learning process. A trained encoder can be used for supervised or semisupervised unmixing when computational resources are limited. Further, a CUDA (Compute Unified Device Architecture) implementation can also be useful for executing parallel optimization-solving modules.
The work of J. Chen was supported in part by the National Natural Science Foundation of China under Grants 62192713 and 62171380, Guangdong International Cooperation Project 2022A0505050020, Shenzhen Research Grant JCYJ20220530161606014, Shaanxi Key Industrial Innovation Chain Project 2022ZDLGY01-02, and Xi’an Technology Industrialization Plan XA2020-RGZNTJ-0076. The work of C. Richard was supported by the French government through the 3IA Côte d’Azur Investments in the Future project with reference number ANR-19-P3IA-0002. Jie Chen is the corresponding author.
Jie Chen (dr.jie.chen@ieee.org) received his Ph.D. degree from Troyes University of Technology (UTT), France, in 2013. He received his B.S. degree in 2006 from Xi’an Jiaotong University (XJTU), China, and his M.S. and Dipl.-Ing. degrees in 2009 from XJTU and UTT, respectively, all in information and telecommunications engineering. From 2013 to 2014, he was with the University of Nice Sophia Antipolis, France, and from 2014 to 2015, he was with the University of Michigan, United States. He is currently a professor at the School of Marine Science and Technology at Northwestern Polytechnical University, Xi’an 710072, China. His current research interests include adaptive signal processing and distributed optimization with applications to hyperspectral image processing and acoustic signal processing. He is a Senior Member of IEEE.
Min Zhao (minzhao@mail.nwpu.edu.cn) received her B.S. degree in electronic and information engineering and her M.S. degree in signal and information processing from Northwestern Polytechnical University, Xi’an 710072, China, in 2017 and 2020, where she is currently pursuing her Ph.D. degree. Her research interests include hyperspectral image analysis and object detection. She received the People’s Choice Award for the Three Minute Thesis Competition at the IEEE International Geoscience and Remote Sensing Symposium 2020 and the (with her team) Champion of Grand Challenges on Near-Infrared Image Colorization at the IEEE International Conference on Visual Communications and Image Processing 2020. She is a Student Member of IEEE.
Xiuheng Wang (xiuheng.wang@oca.eu) received his B.S. degree in electronic and information engineering and his M.S. degree in signal and information processing from Northwestern Polytechnical University, Xi’an, China, in 2018 and 2021, respectively. He is currently pursuing his Ph.D. degree at the Université Côte d’Azur, Nice 06000, France. His research interests include inverse problems in image processing, machine learning, and change point detection. He was a corecipient of the Champion of Grand Challenges on Near-Infrared Image Colorization at the IEEE International Conference on Visual Communications and Image Processing 2020. He is a Student Member of IEEE.
Cédric Richard (cedric.richard@unice.fr) received his Dipl.-Ing. and M.S. degrees in 1994 and his Ph.D. degree in 1998, all in electrical and computer engineering from the Compiègne University of Technology, France. He is a full professor at Université Côte d’Azur, Nice 06000, France. He is a senior chair at the Interdisciplinary Institute for Artificial Intelligence Côte d’Azur, in the chapter Artificial Intelligence for Smart and Secure Territories. His current research interests include statistical signal processing and machine learning. He is the author of more than 300 journal and conference articles. He was distinguished as a junior member of the Institut Universitaire de France in 2010–2015. He is a Senior Member of IEEE.
Susanto Rahardja (susantorahardja@ieee.org) received his Ph.D. and M.Eng. degrees from Nanyang Technological University, Singapore, and his B.Eng. degree from the National University of Singapore, all in the field of electrical and electronic engineering. He is a Ph.D. advisor at Northwestern Polytechnical University, Xi’an 710072, China, and a professor at the Singapore Institute of Technology. His research interests are in multimedia, signal processing, wireless communications, discrete transforms, machine learning and signal processing algorithms, implementation, and optimization. He is a Fellow of IEEE and a fellow of the Academy of Engineering, Singapore.
[1] P. Ghamisi et al., “Advances in hyperspectral image and signal processing: A comprehensive overview of the state of the art,” IEEE Geosci. Remote Sens. Mag., vol. 5, no. 4, pp. 37–78, Dec. 2017, doi: 10.1109/MGRS.2017.2762087.
[2] N. Dobigeon, J.-Y. Tourneret, C. Richard, J. C. M. Bermudez, S. McLaughlin, and A. O. Hero, “Nonlinear unmixing of hyperspectral images: Models and algorithms,” IEEE Signal Process. Mag., vol. 31, no. 1, pp. 82–94, Jan. 2014, doi: 10.1109/MSP.2013.2279274.
[3] R. Borsoi et al., “Spectral variability in hyperspectral data unmixing: A comprehensive review,” IEEE Geosci. Remote Sens. Mag., vol. 9, no. 4, pp. 223–270, Dec. 2021, doi: 10.1109/MGRS.2021.3071158.
[4] R. Heylen, M. Parente, and P. Gader, “A review of nonlinear hyperspectral unmixing methods,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 1844–1868, Jun. 2014, doi: 10.1109/JSTARS.2014.2320576.
[5] A. Halimi, Y. Altmann, N. Dobigeon, and J.-Y. Tourneret, “Nonlinear unmixing of hyperspectral images using a generalized bilinear model,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 11, pp. 4153–4162, Nov. 2011, doi: 10.1109/TGRS.2010.2098414.
[6] B. Hapke, Theory of Reflectance and Emittance Spectroscopy. Cambridge, U.K.: Cambridge Univ. Press, 2012.
[7] D. Heinz and C. Chang, “Fully constrained least squares linear spectral mixture analysis method for material quantification in hyperspectral imagery,” IEEE Trans. Geosci. Remote Sens., vol. 39, no. 3, pp. 529–545, Mar. 2001, doi: 10.1109/36.911111.
[8] L. Miao and H. Qi, “Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization,” IEEE Trans. Geosci. Remote Sens., vol. 45, no. 3, pp. 765–777, Mar. 2007, doi: 10.1109/TGRS.2006.888466.
[9] J. Peng et al., “Low-rank and sparse representation for hyperspectral image processing: A review,” IEEE Geosci. Remote Sens. Mag., vol. 10, no. 1, pp. 10–43, Mar. 2022, doi: 10.1109/MGRS.2021.3075491.
[10] Z. Li, J. Chen, and S. Rahardja, “Kernel-based nonlinear spectral unmixing with dictionary pruning,” Remote Sens., vol. 11, no. 5, 2019, Art. no. 529, doi: 10.3390/rs11050529.
[11] J. Chen, C. Richard, and P. Honeine, “Nonlinear estimation of material abundances in hyperspectral images with ℓ1-norm spatial regularization,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2654–2665, May 2014, doi: 10.1109/TGRS.2013.2264392.
[12] R. Heylen, D. Burazerovic, and P. Scheunders, “Non-linear spectral unmixing by geodesic simplex volume maximization,” IEEE J. Sel. Topics Signal Process., vol. 5, no. 3, pp. 534–542, Jun. 2011, doi: 10.1109/JSTSP.2010.2088377.
[13] R. Heylen and P. Scheunders, “A distance geometric framework for nonlinear hyperspectral unmixing,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 1879–1888, Jun. 2014, doi: 10.1109/JSTARS.2014.2319894.
[14] S. Ozkan, B. Kaya, and G. B. Akar, “EndNet: Sparse autoencoder network for endmember extraction and hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 1, pp. 482–496, Jan. 2019, doi: 10.1109/TGRS.2018.2856929.
[15] Y. Qu and H. Qi, “uDAS: An untied denoising autoencoder with sparsity for spectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 3, pp. 1698–1712, Mar. 2019, doi: 10.1109/TGRS.2018.2868690.
[16] Y. Su, J. Li, A. Plaza, A. Marinoni, P. Gamba, and S. Chakravortty, “DAEN: Deep autoencoder networks for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 7, pp. 4309–4321, Jul. 2019, doi: 10.1109/TGRS.2018.2890633.
[17] B. Palsson, J. Sigurdsson, J. R. Sveinsson, and M. O. Ulfarsson, “Hyperspectral unmixing using a neural network autoencoder,” IEEE Access, vol. 6, pp. 25,646–25,656, Mar. 2018, doi: 10.1109/ACCESS.2018.2818280.
[18] B. Palsson, M. O. Ulfarsson, and J. R. Sveinsson, “Convolutional autoencoder for spectral–spatial hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 1, pp. 535–549, Jan. 2021, doi: 10.1109/TGRS.2020.2992743.
[19] F. Khajehrayeni and H. Ghassemian, “Hyperspectral unmixing using deep convolutional autoencoders in a supervised scenario,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 13, pp. 567–576, Feb. 2020, doi: 10.1109/JSTARS.2020.2966512.
[20] M. Wang, M. Zhao, J. Chen, and S. Rahardja, “Nonlinear unmixing of hyperspectral data via deep autoencoder networks,” IEEE Geosci. Remote Sens. Lett., vol. 16, no. 9, pp. 1467–1471, Sep. 2019, doi: 10.1109/LGRS.2019.2900733.
[21] M. Zhao, L. Yan, and J. Chen, “LSTM-DNN based autoencoder network for nonlinear hyperspectral image unmixing,” IEEE J. Sel. Topics Signal Process., vol. 15, no. 2, pp. 295–309, Feb. 2021, doi: 10.1109/JSTSP.2021.3052361.
[22] M. Zhao, M. Wang, J. Chen, and S. Rahardja, “Hyperspectral unmixing for additive nonlinear models with a 3-D-CNN autoencoder network,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5509415, doi: 10.1109/TGRS.2021.3098745.
[23] Y. Su, X. Xu, J. Li, H. Qi, P. Gamba, and A. Plaza, “Deep autoencoders with multitask learning for bilinear hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 59, no. 10, pp. 8615–8629, Oct. 2021, doi: 10.1109/TGRS.2020.3041157.
[24] K. T. Shahid and I. D. Schizas, “Unsupervised hyperspectral unmixing via nonlinear autoencoders,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5506513, doi: 10.1109/TGRS.2021.3077833.
[25] Q. Jin, Y. Ma, X. Mei, and J. Ma, “TANet: An unsupervised two-stream autoencoder network for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5506215, doi: 10.1109/TGRS.2021.3094884.
[26] S. Shi, M. Zhao, L. Zhang, Y. Altmann, and J. Chen, “Probabilistic generative model for hyperspectral unmixing accounting for endmember variability,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5516915, doi: 10.1109/TGRS.2021.3121799.
[27] R. Borsoi, T. Imbiriba, and J. Bermudez, “Deep generative endmember modeling: An application to unsupervised spectral unmixing,” IEEE Trans. Comput. Imag., vol. 6, pp. 374–384, 2020, doi: 10.1109/TCI.2019.2948726.
[28] Z. Han, D. Hong, L. Gao, B. Zhang, and J. Chanussot, “Deep half-siamese networks for hyperspectral unmixing,” IEEE Geosci. Remote Sens. Lett., vol. 18, no. 11, pp. 1996–2000, Nov. 2021, doi: 10.1109/LGRS.2020.3011941.
[29] Z. Hua, X. Li, Y. Feng, and L. Zhao, “Dual branch autoencoder network for spectral-spatial hyperspectral unmixing,” IEEE Geosci. Remote Sens. Lett., vol. 19, 2022, Art. no. 5507305, doi: 10.1109/LGRS.2021.3091858.
[30] L. Gao, Z. Han, D. Hong, B. Zhang, and J. Chanussot, “CyCU-Net: Cycle-consistency unmixing network by learning cascaded autoencoders,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5503914, doi: 10.1109/TGRS.2021.3064958.
[31] B. Rasti, B. Koirala, P. Scheunders, and P. Ghamisi, “UnDIP: Hyperspectral unmixing using deep image prior,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5504615, doi: 10.1109/TGRS.2021.3067802.
[32] L. Qi, F. Gao, J. Dong, X. Gao, and Q. Du, “SSCU-Net: Spatial–spectral collaborative unmixing network for hyperspectral images,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5407515, doi: 10.1109/TGRS.2022.3150970.
[33] Z. Han, D. Hong, L. Gao, J. Yao, B. Zhang, and J. Chanussot, “Multimodal hyperspectral unmixing: Insights from attention networks,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5524913, doi: 10.1109/TGRS.2022.3155794.
[34] M. Zhao, X. Wang, J. Chen, and W. Chen, “A plug-and-play priors framework for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5501213, doi: 10.1109/TGRS.2020.3047479.
[35] M. Zhao, T. Gao, J. Chen, and W. Chen, “Hyperspectral unmixing via nonnegative matrix factorization with handcrafted and learned priors,” IEEE Geosci. Remote Sens. Lett., vol. 19, 2022, Art. no. 5501905, doi: 10.1109/LGRS.2020.3047481.
[36] Z. Wang, L. Zhuang, L. Gao, A. Marinoni, B. Zhang, and M. K. Ng, “Hyperspectral nonlinear unmixing by using plug-and-play prior for abundance maps,” Remote Sens., vol. 12, no. 24, 2020, Art. no. 4117, doi: 10.3390/rs12244117.
[37] C. Zhou and M. R. Rodrigues, “ADMM-based hyperspectral unmixing networks for abundance and endmember estimation,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5520018, doi: 10.1109/TGRS.2021.3136336.
[38] F. Xiong, J. Zhou, S. Tao, J. Lu, and Y. Qian, “SNMF-Net: Learning a deep alternating neural network for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5510816, doi: 10.1109/TGRS.2021.3081177.
[39] A. Min, Z. Guo, H. Li, and J. Peng, “JMnet: Joint metric neural network for hyperspectral unmixing,” IEEE Trans. Geosci. Remote Sens., vol. 60, 2022, Art. no. 5505412, doi: 10.1109/TGRS.2021.3069476.
[40] Q. Jin, Y. Ma, F. Fan, J. Huang, X. Mei, and J. Ma, “Adversarial autoencoder network for hyperspectral unmixing,” IEEE Trans. Neural Netw. Learn. Syst., early access, 2022, doi: 10.1109/TNNLS.2021.3114203.
[41] M. Zhao, J. Chen, and Z. He, “A laboratory-created dataset with ground truth for hyperspectral unmixing evaluation,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 12, no. 7, pp. 2170–2183, Jul. 2019, doi: 10.1109/JSTARS.2019.2905099.
[42] B. Palsson, J. R. Sveinsson, and M. O. Ulfarsson, “Blind hyperspectral unmixing using autoencoders: A critical comparison,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 15, pp. 1340–1372, Jan. 2022, doi: 10.1109/JSTARS.2021.3140154.