Henry Arguello, Jorge Bacca, Hasindu Kariyawasam, Edwin Vargas, Miguel Marquez, Ramith Hettiarachchi, Hans Garcia, Kithmini Herath, Udith Haputhanthri, Balpreet Singh Ahluwalia, Peter So, Dushan N. Wadduwage, Chamira U.S. Edussooriya
Computational optical imaging (COI) systems leverage optical coding elements (CEs) in their setups to encode a high-dimensional scene in a single or in multiple snapshots and decode it by using computational algorithms. The performance of COI systems highly depends on the design of their main components: the CE pattern and the computational method used to perform a given task. Conventional approaches rely on random patterns or analytical designs to set the distribution of the CE. However, the availability of data and the algorithmic capabilities of deep neural networks (DNNs) have opened a new horizon in data-driven CE designs that jointly consider the optical encoder and the computational decoder. Specifically, by modeling the COI measurements through a fully differentiable image-formation model that considers the physics-based propagation of light and its interaction with the CEs, the parameters that define the CE and the computational decoder can be optimized in an end-to-end (E2E) manner. Moreover, by optimizing just the CEs in the same framework, inference tasks can be performed from pure optics. This work surveys the recent advances in data-driven CE design and provides guidelines on how to parameterize different optical elements to include them in the E2E framework. As the E2E framework can handle different inference applications by changing the loss function and the DNN, we present low-level tasks, such as spectral image reconstruction, and high-level tasks, such as pose estimation with privacy preservation, enhanced by using optimal task-based optical architectures. Finally, we illustrate classification and 3D object-recognition applications performed at the speed of light using all-optical DNNs.
We are in a new era of COI systems that break traditional sensing limits by leveraging computational signal processing. COI systems have enabled the acquisition of high-dimensional scene information, such as polarization [1], spectrum [2], depth [3], and time [4], with unprecedented performance in commercial applications such as medical imaging, precision agriculture, surveillance, and vision-guided vehicles [5]. With current measurement devices, it is only possible to acquire low-dimensional intensity values of high-dimensional scenes. Therefore, the main goal of computational imaging design is to encode the desired information in a low-dimensional sensor image by introducing optical CEs into the optical setup. The most popular CEs include coded apertures (CAs) [2], constructed by an arrangement of opaque or blocking elements that spatially modulate the wavefront; diffractive optical elements (DOEs) [3], which are phase-relief elements that use microstructures to alter the phase of the light propagated through them; and color filter arrays (CFAs) [6], which are mosaics of tiny bandpass filters placed over the sensor pixels (i.e., on the image plane) to capture wavelength information.
The decoding or recovery procedure of a high-dimensional image from the coded measurements is carried out in a postprocessing step leveraging computational algorithms. The reconstruction quality mainly depends on the CE pattern and the image processing algorithm employed to decode the desired information. Thus, several works have covered these two fundamental lines of research. Over the last few years, different CE designs have been proposed, adopting random patterns, handcrafted assumptions [7], or theoretical constraints such as the mutual coherence [2] or concentration of measure [2]. On the other hand, alternative studies propose computational processing methods to perform inference tasks directly from the coded measurements, avoiding the reconstruction step. Specifically, in [8], theoretical and simulation results showed that it is possible to learn features directly from the coded measurements. These features are used as inputs to classifiers, such as support vector machines [9], sparse subspace clustering [7], and, more recently, deep models applied to different tasks [10]. The unprecedented gains in performance of DNNs have rapidly made them a standard algorithmic approach for processing the coded measurements.
The continuous progress in deep learning and the growing amount of image data have enabled a new data-driven CE design where the optical encoder and computational decoder are jointly designed. More precisely, it consists of simulating the COI system as a fully differentiable image-formation model that considers the physics-based propagation of light and its interaction with the CEs. In this model, the CE can be represented by learnable parameters and interpreted as an optical layer. In the same way, the overall COI system can be interpreted as an optical encoder composed of different optical layers. The optical encoder can be coupled with a DNN or differentiable algorithm in a deep learning model where the ensemble parameters (CEs and DNN parameters) can be optimized in an E2E manner using a training dataset and a backpropagation algorithm. The optimized CE provides novel data-driven optical designs that outperform the conventional nondata-driven approaches (see Figure 1 for some visual examples, or refer to [4], [6], [10], [11], and [12] for more details). Interestingly, it has also been shown that a cascade of multiple optical layers can be trained to perform a specific task such as classification [13], image formation [13], 3D object recognition [14], or saliency segmentation [15]. This all-optical deep learning framework is referred to as diffractive DNNs (D2NNs). Each optical layer represents a CE, and the transmission coefficient at each spatial location is treated as a learnable parameter. Then the optical model can be trained in an E2E manner to perform a specific task. In this context, the optical system also works as an inference algorithm, and the inference can be achieved from pure optics at the speed of light.
Figure 1. A comparison of nondata-driven and deep optical (data-driven) CE designs in terms of peak signal-to-noise ratio for applications of (a) sensor multiplexing, (b) compressive video, and (c) spectral imaging. The CEs correspond to a CFA [6], a binary CA [4], and a DOE [11], respectively.
An important aspect of the E2E scheme is the optical system’s constraints that must be considered in the coupled deep training model; for example, the lens surface must be regular and smooth, or the attenuation values in the CA could be binary. Additionally, some critical assembling properties and modeling considerations, such as the amount of light and the number of projections, must be considered in the learning-based design of the optical system. These constraints have been tackled by parameterizing the CE or by augmenting the optimization problem with regularization functions [10]. After finding the optimal CE parameters of the COI system and the decoder, the optical elements are carefully manufactured or implemented in an optical setup that acquires the optimized coded measurements. The optimal COI measurements are fed to the trained decoder to perform the inference task of interest. In some cases, the mismatch between the simulated forward model and the transfer function of the experimental setup can demand an additional calibration of the optical setup and a retraining (or fine-tuning) of the neural network [3].
This article surveys the recent advances and foundations of data-driven CE design. Our goal is to provide readers with step-by-step guidance on how to efficiently model optical systems and parameterize different CEs to be included in the E2E optimization framework. Furthermore, we describe how to address some physical constraints or assembling properties that apply to a wide range of COI systems. After providing a tutorial on how to include the COI system in a deep model, we demonstrate the flexibility of the deep optical design in several real applications, including the implementation of D2NNs. Finally, we discuss current challenges and possible future research directions.
One of the main challenges in computational imaging is acquiring high-dimensional scenes with a low-dimensional intensity detector. Researchers have converged on modeling the high-dimensional world information with the up-to-8D function ${f}{(}{x},{y},{z},{\alpha},{\psi},{t},{\lambda},{p}{),}$ called the plenoptic function, where (x, y) stands for the spatial dimensions, z for depth, ${(}{\alpha},{\psi}{)}$ for the angular views, t for time, ${\lambda}$ for wavelength, and p for the polarization dimension [5]. Along this line, several authors have proposed the design of sophisticated imagers based on CEs to sense two or more dimensions of the plenoptic function (see [16] and [17] for more details). Table 1 summarizes some of the optical systems that employ CEs in their optical setups to encode, and then recover, some dimensions of the wavefront. In the following, we present two fundamental steps for obtaining a physics-based model of the COI system, which is integrated later into the deep optical design. The first step defines the most suitable light propagation model, which is selected based on the light source conditions, propagation distances, optical elements, and target physical constraints.
Table 1. A summary of COI systems: the optical CE, the encoded dimensions, and the design parameters. (Only a few references are cited; for more details, please see [5], [16], [17], and [18].)
The Rayleigh–Sommerfeld (RS) diffraction formulation is the scalar diffraction model that allows obtaining a suitable solution for the output field of a given input field under some physical conditions [13]. Specifically, considering an input field ${U}^{0}$ (usually given by the scene of interest), the resultant field of the diffracted optical wave at a given spatial point is denoted as \begin{align*}&{U}^{1}{(}{x}_{1},{y}_{1},{z}_{1}{;}{\lambda}{)}\\ & \quad = \mathop{\iint}\limits_{A}{{U}^{0}}{(}{x}_{0},{y}_{0},{z}_{0}{;}{\lambda}{)}\left({\frac{{z}_{1}{-}{z}_{0}}{r{}^{2}}}\right)\left({\frac{1}{{2}{\pi}{r}} + \frac{1}{{j}{\lambda}}}\right)\exp\left({\frac{{j}{2}{\pi}{r}}{\lambda}}\right){dA} \tag{1} \end{align*} where A is the area of the input field. Here, ${r} = \sqrt{{(}{x}_{1}{-}{x}_{0}{)}^{2} + {(}{y}_{1}{-}{y}_{0}{)}^{2} + {(}{z}_{1}{-}{z}_{0}{)}^{2}}$ is the distance between the spatial points, ${\lambda}$ is the wavelength of the light, and ${j} = \sqrt{{-}{1}}{.}$ Although the RS formulation offers high-precision propagation modeling, the majority of COI systems employ the angular spectrum (AS) [13], Fresnel [11], or Fraunhofer [2] diffraction approximation models for practicality as they are sufficiently accurate. For instance, the AS method is an efficient approach to approximate the RS propagation model. It uses the Fourier relationship between the electric/magnetic field of a light wave and its AS to perform the computations in the Fourier domain. Here, the AS of the field is propagated using the transfer function given by \[{P}{(}{z}{)} = \exp\left({{j}{2}{\pi}{z}\sqrt{\frac{1}{{\lambda}^{2}}{-}{f}_{x}^{2}{-}{f}_{y}^{2}}}\right) \tag{2} \] where ${f}_{x}$ and ${f}_{y}$ are the 2D spatial frequency components [13].
Once the light propagation model is selected, the next step consists of modeling the interactions with the CE. According to the CE’s physical composition and structure, the incoming wavefront is modulated in its phase [21] or intensity [4]. Current fabrication technologies limit CE structures to discrete distributions. For that reason, we restrict our attention to the propagation and modulation model in a discrete form. For instance, consider a DOE as the CE. The propagation of a discrete input field ${\bf{U}}^{0}$ at plane ${z}_{0}$ and its interaction with the discretized DOE with ${N}{\times}{N}$ pixels located in plane ${z}_{1}$ is illustrated in Figure 2. Each DOE pixel has a particular light response, known as the transmission coefficient. Based on a Riemann summation of the integral (1), the propagation of ${\bf{U}}^{0}$ up to the DOE as well as the DOE’s modulation can be mathematically expressed as \[{\bf{U}}_{{i}_{1},{j}_{1}}^{1} = {\bf{\phi}}_{{i}_{1},{j}_{1}}^{1}\cdot\mathop{\sum}\limits_{{i}_{0} = {0}}\limits^{{N}{-}{1}}{\mathop{\sum}\limits_{{j}_{0} = {0}}\limits^{{N}{-}{1}}{{\bf{U}}_{{i}_{0},{j}_{0}}^{0}}}{w}{\Delta}_{x}{\Delta}_{y}\colon = {\mathcal{P}}_{{\phi}^{1}}{(}{\bf{U}}^{0}{),} \tag{3} \]
Figure 2. A visual representation of the discretized propagation and the modulation of a CE in an optical layer. The incident field ${\bf{U}}^{0}$ is propagated and then modulated by the CE $({\bf{\phi}})$, where the field immediately after the ${i}_{1},\,{j}_{1}$ pixel (or neuron) is given by ${\bf{U}}_{{i}_{1},{j}_{1}}^{1}{.}$
where \[{w} = \left({\frac{{z}_{1}{-}{z}_{0}}{{r}^{2}}}\right)\left({\frac{1}{{2}{\pi}{r}} + \frac{1}{{j}{\lambda}}}\right)\exp\left({\frac{{j}{2}{\pi}{r}}{\lambda}}\right) \tag{4} \] with ${r} = \sqrt{{(}{\Delta}_{x}{i}_{1}{-}{\Delta}_{x}{i}_{0}{)}^{2} + {(}{\Delta}_{y}{j}_{1}{-}{\Delta}_{y}{j}_{0}{)}^{2} + {(}{z}_{1}{-}{z}_{0}{)}^{2}}$ the distance between pixels and $\left\{{{\Delta}_{x},{\Delta}_{y}}\right\}$ the pixel pitch along the horizontal and vertical dimensions, respectively; ${\bf{U}}_{{i}_{0},{j}_{0}}^{0}$ is the input-diffracted wavefront at pixel ${(}{i}_{0},{j}_{0}{);}$ ${\bf{U}}_{{i}_{1},{j}_{1}}^{1}$ is the modulated field at the spatial point ${(}{i}_{1},{j}_{1}{)}$ just after the DOE; and ${\bf{\phi}}_{{i}_{1},{j}_{1}}^{1} = {\bf{a}}_{{i}_{1},{j}_{1}}^{1}{e}^{j{\bf{\psi}}_{{i}_{1},{j}_{1}}^{1}}$ is the transmission coefficient at position ${(}{i}_{1},{j}_{1}{),}$ with ${\bf{a}}_{{i}_{1},{j}_{1}}^{1}$ and $\bf{\psi}_{{i}_{1},{j}_{1}}^{1}$ being the amplitude and phase terms, respectively. In this case, light modulation is performed by the complex transmission coefficient of the DOE, which is the primary customized variable. The operator ${\mathcal{P}}_{\bf{\phi}^{1}}{(}\cdot{)}$ represents the light propagation and modulation of the CE $({\bf{\phi}}^{1}),$ which is referred to as an optical layer in the E2E framework. Note that ${\mathcal{P}}_{\bf{\phi}^{1}}{(}\cdot{)}$ also depends on the destination coordinates ${i}_{1},\,{j}_{1},\,{z}_{1},$ which we omit in the notation for simplicity.
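For illustration, the following is a minimal NumPy sketch (our own, not taken from the references) of the direct Riemann-sum evaluation of (3) and (4); all function and variable names are illustrative. The nested loops over the destination pixels make explicit why this direct evaluation becomes expensive as the number of pixels grows, a point revisited later when the direct method is compared with the AS method.

```python
import numpy as np

def direct_propagate_and_modulate(U0, phi, z0, z1, dx, dy, wavelength):
    """Riemann-sum evaluation of (3)-(4): propagate the discrete field U0
    from plane z0 to plane z1 and modulate it with the CE phi (both N x N)."""
    N = U0.shape[0]
    idx = np.arange(N)
    # pairwise coordinate differences, e.g., dxi[i1, i0] = Delta_x * (i1 - i0)
    dxi = dx * (idx[:, None] - idx[None, :])
    dyj = dy * (idx[:, None] - idx[None, :])
    U1 = np.zeros((N, N), dtype=complex)
    for i1 in range(N):
        for j1 in range(N):
            # distance r between every source pixel (i0, j0) and the target (i1, j1)
            r = np.sqrt(dxi[i1][:, None]**2 + dyj[j1][None, :]**2 + (z1 - z0)**2)
            w = ((z1 - z0) / r**2) * (1.0 / (2 * np.pi * r) + 1.0 / (1j * wavelength)) \
                * np.exp(1j * 2 * np.pi * r / wavelength)
            U1[i1, j1] = np.sum(U0 * w) * dx * dy   # Riemann sum over the input plane
    return phi * U1                                  # elementwise modulation by the CE
```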
Usually, COI systems comprise more than one CE, as is the case of the D2NN, which is composed of n sequential concatenated CEs (optical layers). Consequently, the encoded wavefront in (3) continues propagating for a set of additional arbitrary CEs to finally converge to the sensor, which converts photons to electrons, i.e., it measures the photon flux per unit of surface area. A general sensing process can be expressed as \[{\bf{G}} = {\mathcal{S}}\left({{\mathcal{P}}_{{\bf{\phi}}^{n}}{(}{\mathcal{P}}_{{\bf{\phi}}^{{n}{-}{1}}}{(}\cdots{\mathcal{P}}_{{\bf{\phi}}^{1}}{(}{U}^{0}{)}\cdots{))}}\right) \tag{5} \] where ${\mathcal{S}}{(}\cdot{)}$ denotes the transfer function of the sensor, and ${\bf{G}}\in{\mathbb{R}}^{{N}\times{N}}$ is the detected intensity, known as coded measurements. For convenience of notation, we henceforth represent the COI system in vector form as \[{\bf{g}} = {\mathcal{M}}_{\bf{\phi}}{(}{\bf{f}}{)}{,} \tag{6} \] where ${g}\in{\mathbb{R}}^{m}$ denotes the coded measurement, which is the vector representation of G; ${\bf{f}}\in{\mathbb{C}}^{n}$ (or ${\mathbb{R}}^{n}{)}$ represents the underlying scene given by the input field; and ${\mathcal{M}}_{\bf{\phi}}$ denotes the mapping function that models the three basic COI system operators: propagation, interaction with CEs ${\bf{\phi}} = \left\{{\bf{\phi}}^{l}\right\}_{{l} = {1},{2},\ldots,{n}}$, and the sensing process. Although the modeling expressed in (5) is accurate, depending on the application, the operator ${\mathcal{M}}_{\bf{\phi}}$ can be further simplified. For imaging applications, where we consider that different points of a scene are added incoherently, the operator can be modeled by a shift-invariant convolution of the image (intensity) and a point-spread function (PSF) defined by $\bf{\phi}{.}$ Another simplification is employed in a CA-based system where the modulation is just in amplitude. In this case, the COI system can be modeled as a linear system, and thus, the operator ${\mathcal{M}}_{\bf{\phi}}$ can be expressed as a sensing matrix whose values depend on the CA. In the next section, we present the deep optical coding design framework that requires the appropriate COI sensing model.
Optical coding design based on deep learning has shown considerable improvement for multiple computational imaging applications compared with analytical or mathematical optical coding design. Specifically, the deep optical coding design jointly optimizes the CEs of a COI system and the computational algorithm or DNN for a specific task in an E2E manner. The key idea for an E2E design is to accurately simulate the COI measurements, as is explained in the “Computational Imaging System Based on Coding Elements” section, and employ them as an input of a differentiable computational algorithm that performs a given task. Therefore, the task error, defined by a loss function, can be propagated to the trainable parameters of the algorithm and further to the COI system parameters to update them toward an optimal point. In this sense, we can consider the model of the COI system and the DNN as a coupled DNN (see Figure 3), where the physical optical layers can be interpreted as an optical encoder whose learnable parameters describe the CEs.
Figure 3. An E2E scheme where the COI system is modeled as optical layers. In training, a set of images passes through the optical system, obtaining the projected measurements that enter the computational decoder, producing an output for an arbitrary task. The estimated task error is propagated from the output of the decoder to the optical layer, updating the weights of the decoder and the optical system.
To model the real physical system, the following are necessary: an accurate propagation of light, a correct parameterization of the CEs, and a proper sensing model. Leveraging the automatic differentiation tools used to efficiently optimize neural networks, it is only necessary to employ a differentiable model of ${\mathcal{M}}_{\bf{\phi}}{(}\cdot{)}$ to optimize the CEs $\bf{\phi}$ using a backpropagation algorithm. Therefore, the key point is a CE parameterization that provides a simple and accurate representation of the optical element. For instance, the height map of the DOE needs to be regular and smooth for easy fabrication. Thus, the work in [3] proposed using Zernike polynomials to describe the height map of the DOE; consequently, the CE parameters $\bf{\phi}$ are the coefficients of the Zernike polynomials, as illustrated in Figure 4(a), i.e., instead of learning the height of each pixel, it is only necessary to learn the coefficient associated with each Zernike polynomial. Another example is the multiplexing pattern design of a sensor array to reconstruct a red, green, blue image from a single 2D sensor [6]. In this case, the set of parameters $\bf{\phi}$ corresponds to the multiplexing pattern that selects which color must be assigned to each pixel on the sensor, as illustrated in Figure 4(b). To model $\bf{\phi}$ as a selector, it is parameterized using either the softmax function [6] or the sigmoid function [4] to guarantee binary values, as depicted in Figure 4(c). In some scenarios, the parameterization of the optical elements relies on nondifferentiable functions, such as the step function [6] employed for discrete values or a thresholding operation [4] used in binary optimization. In these scenarios, the gradient needs to be approximated or modified to achieve the desired direction. A clear example is when the sign function is used to binarize; in this case, the authors in [4] replace the gradient of that function with the identity.
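As a concrete illustration of the last point, the sketch below shows a PyTorch-style binary CA parameterized by a continuous variable whose thresholding uses an identity (straight-through) gradient; the class and variable names are our own and are not taken from [4] or [6].

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Thresholding in the forward pass; identity (straight-through) gradient
    in the backward pass, as discussed above for nondifferentiable CEs."""
    @staticmethod
    def forward(ctx, x):
        return (x > 0).float()              # binary CA entries in {0, 1}

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                  # gradient of the threshold replaced by the identity

class BinaryCodedAperture(torch.nn.Module):
    """Optical layer whose trainable parameter is a continuous variable, as in Figure 4(c)."""
    def __init__(self, N):
        super().__init__()
        self.phi = torch.nn.Parameter(0.01 * torch.randn(N, N))

    def forward(self, f):
        ca = BinarizeSTE.apply(self.phi)    # binary pattern used in the forward model
        return ca * f                       # amplitude-only modulation of the scene f
```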
Figure 4. Parameterization of different optical elements. (a) A DOE, where the trainable parameters are the coefficients associated with Zernike polynomial ${Z}_{i};$ (b) color filters, where the parameters are a vector of weights that select one of the available filters; and (c) a CA, where the trainable parameter is a continuous variable.
Depending on the task to be performed, different decoders are employed in the E2E framework [10]. For instance, [3] develops a fully differentiable Wiener filtering operator to recover a deblurred, superresolved image. On the other hand, the remarkable advances of DNNs have made them excel in various tasks. For example, autoencoders, residual networks, and the well-known U-Net have been used for image recovery. Furthermore, unrolled approaches exploit iterative reconstruction algorithms as layers [26]. More recently, DNNs with self-attention mechanisms along the spatial or spectral dimensions have also been proposed [27]. Furthermore, complex architectures that perform high-level tasks such as classification or human-pose estimation (HPE) [20] have been employed. In some cases, the coded measurements do not necessarily share the same spatial resolution as the target image. For example, in CA snapshot spectral imagers [2], where a dispersive element is used, the coded optical measurements are spatially distorted. This spatial mismatch prevents some state-of-the-art deep models from being directly applied because they require the spatial resolutions of the input and output images to match. Consequently, some lifting strategies are included as an intermediate step between the optical layer and the neural network decoder to obtain the appropriate image size. Commonly, in linear systems of the form ${\bf{g}} = {\mathcal{M}}_{\bf{\phi}}{(}{\bf{f}}{)} = {\bf{Hf}}$, lifting of the measurements can be performed via the transpose operator ${\bf{H}}^{T}{\bf{g}}$ [12] or the pseudoinverse ${\bf{H}}^{\dagger}{\bf{g}}$ [10]. Another suitable alternative is to add some layers that learn the transpose operator [20].
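For a linear system ${\bf{g}} = {\bf{Hf}}$, the lifting step can be sketched as follows; the dense matrices used here are purely illustrative, since in practice ${\bf{H}}$ is typically applied implicitly.

```python
import torch

def lift_transpose(H, g):
    """Map the coded measurement g back to the image grid via H^T g."""
    return H.t() @ g

def lift_pseudoinverse(H, g):
    """Map the coded measurement g back to the image grid via the pseudoinverse of H."""
    return torch.linalg.pinv(H) @ g
```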
Once the COI system $\left({{\mathcal{M}}_{\bf{\phi}}}\right)$ and the decoder $\left({{\mathcal{N}}_{\theta}}\right)$ have been modeled as layers, the E2E optimization consists of training the encoder-decoder parameters using the following optimization problem: \begin{align*}\left\{{\bf{\phi}}^{\ast},{\bf{\theta}}^{\ast}\right\} & = \mathop{\text{argmin}}\limits_{\bf{\phi},\bf{\theta}}{\mathcal{L}}{(}{\bf{\phi}},{\bf{\theta}}{)} \\ & {:=} \mathop{\sum}\limits_{k}{{\mathcal{L}}_{\text{task}}}\left({{\mathcal{N}}_{\bf{\theta}}\left({{\mathcal{M}}_{\bf{\phi}}\left({{\bf{f}}_{k}}\right)}\right),{\bf{d}}_{k}}\right) + {\rho}{R}_{\rho}{(}{\bf{\phi}}{)} + {\sigma}{R}_{\sigma}{(}{\bf{\theta}}{)} \tag{7} \end{align*} where $\left\{{{\bf{\phi}}^{\ast},{\bf{\theta}}^{\ast}}\right\}$ represent the set of optimal optical coding parameters and the optimal weights of the network, respectively, and ${\left\{{{\bf{f}}_{k},{\bf{d}}_{k}}\right\}}_{{k} = {1}}^{K}$ accounts for the training database with K elements, with ${\bf{f}}_{k}$ as the input image and ${\bf{d}}_{k}$ as the desired output of the neural decoder, which can be a target image, classification vector, or segmentation map, among others. The loss function ${\mathcal{L}}_{\text{task}}$ is linked to a specific inference task. For instance, the mean-square error (MSE) and cross-entropy metrics are conventionally used for reconstruction and classification tasks, respectively [10]. ${R}_{\rho}{(}\bf{\phi}{)}$ and ${R}_{\sigma}{(}\bf{\theta}{)}$ denote regularization functions that act on the optical parameters and the weights of the decoder, respectively, with ${\rho}$ and ${\sigma}$ as regularization parameters. Regularization functions have been widely used for training neural networks, as ${R}_{\sigma}{(}\bf{\theta}{)}$ has been shown to reduce overfitting, a common issue that appears when training DNNs. For instance, the ${\Vert}{\bf{\theta}}{\Vert}_{2}$ and ${\Vert}\bf{\theta}{\Vert}_{1}$ norms have been successfully applied [10].
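A minimal PyTorch-style training loop for (7) could look like the sketch below, assuming an optical encoder module implementing ${\mathcal{M}}_{\bf{\phi}}$ (e.g., the binary CA sketched earlier), a decoder network implementing ${\mathcal{N}}_{\bf{\theta}}$, a data loader yielding $({\bf{f}}_{k},{\bf{d}}_{k})$ pairs, and user-supplied regularizers; all names and hyperparameter values are illustrative.

```python
import torch

def train_e2e(encoder, decoder, loader, task_loss, R_phi, R_theta,
              rho=1e-3, sigma=1e-4, lr=1e-3, epochs=50):
    """Joint (E2E) optimization of the optical parameters phi and the decoder
    weights theta following (7)."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for f, d in loader:                      # training pairs (f_k, d_k)
            g = encoder(f)                       # simulated coded measurements M_phi(f_k)
            out = decoder(g)                     # task output N_theta(g)
            loss = (task_loss(out, d)
                    + rho * R_phi(encoder)       # regularization on the optical parameters
                    + sigma * R_theta(decoder))  # regularization on the decoder weights
            optimizer.zero_grad()
            loss.backward()                      # gradients reach both theta and phi
            optimizer.step()
    return encoder, decoder
```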
Regularization over the optical parameters ${R}_{\rho}{(}\bf{\phi}{)}$ plays a different role than that over the network weights, as the optical parameters directly determine the physical optics. Therefore, it is useful to promote desired properties of the CE. The key point of including regularization in the training is that the gradient of the composed loss function $\mathcal{L}$ with respect to $\bf{\phi}$ is calculated using the chain rule as \[\frac{\partial{\mathcal{L}}}{\partial\bf{\phi}} = \frac{\partial{\mathcal{L}}_{\text{task}}}{\partial{\mathcal{N}}_{\theta}}\frac{\partial{\mathcal{N}}_{\theta}}{\partial{\bf{g}}}\frac{\partial{\bf{g}}}{\partial\bf{\phi}} + {\rho}\frac{\partial{R}_{\rho}}{\partial\bf{\phi}}{.} \tag{8} \]
Therefore, the design of the optical elements is directly influenced by both the task loss and the regularization function. For example, physical restrictions of the CE’s implementation process that are not addressed in the parameterization of the optical elements impose constraints on the optimization, such as CA entries that must be binary. This constraint can be adequately addressed by including a regularization function. Some examples of regularizers used in the state of the art are presented in Table 2. Additionally, the parameter ${\rho}$ plays an essential role in the tradeoff between the optimal task performance and the desired properties imposed by the regularization. Therefore, the work in [10] uses an exponential increase strategy in which, during the first epochs, the derivative of the task loss provides the direction to converge to the desired task values, and then ${\rho}$ is increased to guarantee that the regularization takes effect.
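As an example, a binary-promoting regularizer on a CA and the exponentially increasing schedule for ${\rho}$ described above can be sketched as follows; the specific functional forms are our illustration rather than the exact choices of [10], and the sketch assumes the CA module introduced earlier exposes its continuous parameter as `phi`.

```python
import torch

def binary_regularizer(encoder):
    """Penalizes CA entries that are far from the binary set {0, 1}."""
    phi = encoder.phi
    return torch.sum((phi ** 2) * ((phi - 1.0) ** 2))

def rho_schedule(epoch, rho0=1e-4, gamma=1.1):
    """Exponentially increasing regularization weight: small in the first epochs,
    so the task loss dictates the optimization direction, and larger afterward
    to enforce the desired (e.g., binary) structure on the CE."""
    return rho0 * (gamma ** epoch)
```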
Table 2. Examples of the regularization on the CAs to obtain specific characteristics.
Instead of having an encoder consisting of optical layers coupled with an electronic decoder, as discussed in the previous section, we can have a D2NN entirely consisting of optical layers that can be trained to learn an arbitrary linear function, which can then be performed through a physical setup. Even though inference and prediction with the physical network are all optical, learning the parameters to design the network happens through software simulations. As shown in Figure 5, the network consists of several CEs, where each CE pixel is called a neuron that can be modeled to transmit or reflect an incoming wave [13]. Such neurons are connected to other neurons of the following layers through optical diffraction [13]. This enables the entire network to be modeled as a differentiable function. Analogous to a standard DNN, we can consider the complex-valued transmission or reflection coefficient of each neuron as a multiplicative bias term, where both the amplitude and phase can be treated as learnable parameters. They can be adjusted iteratively via an error-backpropagation method. As mentioned in [13], the input object information can be encoded in the amplitude and/or phase channel of the input field, and the desired result is captured as the intensity of the output field by a detector. The modeled optical system $\left({{\mathcal{M}}_{\bf{\phi}}}\right)$ is trained to optimize the parameters, i.e., \[{\bf{\phi}}^{\ast} = \mathop{\text{argmin}}\limits_{\bf{\phi}}\mathop{\sum}\limits_{{k} = {1}}\limits^{K}{{\mathcal{L}}_{\text{task}}}\left({{\mathcal{M}}_{\bf{\phi}}{(}{\bf{f}}_{k}{)}{,}{\bf{d}}_{k}}\right), \tag{9} \]
Figure 5. A D2NN that consists of ${L}\times{L}$-sized CEs spaced at a distance ${d}_{\text{layer}}$ from each other. Each optical layer consists of neurons characterized by a transmission coefficient. The input plane is a distance ${d}_{\text{in}}$ away from the first layer of the network, while the output intensity is captured by the detector, which is placed a distance ${d}_{\text{out}}$ from the last layer of the network. The D2NN depicted here is trained for a classification task. Different applications of a D2NN are discussed in the “All-Optical Applications of D2NNs” section.
where ${\bf{\phi}}^{\ast},\,{\bf{f}}_{k},\,{\bf{d}}_{k}$ denote the optimal transmission coefficient parameters, input field, and target from the training database consisting of K instances, respectively. Once the network is trained, the design of the D2NN is fixed and the layers can be fabricated subject to the constraints, as discussed in the “Implementation and Fabrication of Optical Coding Elements” section. The network can then be used to perform the learned task at the speed of light with little or no power consumption.
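A hedged sketch of a phase-only D2NN implementing the forward model behind (9) is shown below; here `propagate` stands for any differentiable free-space propagation routine (for instance, the AS-based routine sketched after Figure 6), and the remaining names are illustrative rather than taken from [13].

```python
import torch

class DiffractiveNetwork(torch.nn.Module):
    """Cascade of trainable phase-only diffractive layers followed by intensity
    detection, to be trained with a task loss as in (9)."""
    def __init__(self, num_layers, N, propagate):
        super().__init__()
        self.propagate = propagate               # differentiable free-space propagation
        # one trainable phase map psi^l per optical layer (unit amplitude assumed)
        self.phases = torch.nn.ParameterList(
            [torch.nn.Parameter(torch.zeros(N, N)) for _ in range(num_layers)])

    def forward(self, U0):
        U = self.propagate(U0)                   # input plane -> first layer
        for psi in self.phases:
            U = U * torch.exp(1j * psi)          # modulation by the transmission coefficient
            U = self.propagate(U)                # diffraction to the next layer / detector
        return torch.abs(U) ** 2                 # the detector captures intensity
```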
During the simulation of the forward propagation of light through a D2NN, we may directly use the RS integral (1) and the discrete model (3) to compute the field at each neuron in all the layers. This approach is denoted as the direct method and can be implemented to perform computations in parallel. However, the direct method requires ${N}^{2}$ matrices of size ${N}\times{N}$ for each layer, making the memory requirement for the computations at each layer of the order of ${N}^{4}{.}$ This requirement grows rapidly as the number of neurons increases, resulting in long computation times.
To eliminate this drawback, the AS method can be employed to efficiently compute the fields at the neurons of a D2NN. The Fourier relationship between the electromagnetic light-wave field and its AS can be evaluated with the discrete Fourier transform (DFT), computed using the fast Fourier transform algorithm. This DFT-based approximation is commonly used in D2NN approaches to overcome the intrinsically high computational complexity of the direct method. For instance, consider two CEs located at the ${z}_{{l}{-}{1}}$ and ${z}_{l}$ planes in the D2NN-based COI system, as shown in Figure 5. In this case, the CEs, of width and height L, are sampled as ${N}\times{N}$ grids with ${N} = {L}{/}{dx}$, where ${\bf{\phi}}^{{l}{-}{1}}$ and ${\bf{\phi}}^{l}$ are the complex transmission coefficients of the respective optical layers.
Figure 6 shows the computation pipeline of the output field using the AS method. Initially, the input field is sampled with a sampling interval of dx in the spatial domain. According to the modeling explained in the “Computational Imaging System Based on Coding Elements” section, the propagation and modulation of the input field by the ${(}{l}{-}{1}{)}{\text{th}}$ optical layer is given by ${\bf{U}}^{{l}{-}{1}}\circ{\bf{\phi}}^{{l}{-}{1}}$, where $\circ$ is the elementwise matrix multiplication. Then, the 2D DFT of a zero-padded version of the encoded field is applied to obtain the AS, given by ${\bf{A}}^{{l}{-}{1}}$, with a computational window of size ${wN}\times{wN}{.}$ The propagation transfer function matrix ${\bf{P}}{(}{\bf{z}}_{0}{)}$ at ${z}_{0}$ is also created in the computation window with the same shape as ${\bf{A}}^{{l}{-}{1}}{.}$ This matrix is the discrete representation of the propagation transfer function given in (2). The region where ${f}_{x}^{2} + {f}_{y}^{2}{>}{1}{/}{\lambda}^{2}$ in ${\bf{P}}{(}{\bf{z}}_{0}{)}$ corresponds to evanescent waves, which do not propagate energy along the z-axis. Hence, they are filtered out using a binary mask D, as shown in Figure 6. The AS of the field at ${z}_{0}$ is obtained by the elementwise multiplication of ${\bf{A}}^{{l}{-}{1}}$ and the masked propagation transfer function. Then, a 2D inverse DFT operation is performed to obtain the padded resulting field. Finally, the padding is removed to retrieve the resulting field ${\bf{U}}^{l}{.}$ As all the computations happen in the computational window, the memory requirement for the computations at each layer of the AS method is ${(}{wN}{)}^{2}{.}$ Therefore, a significant reduction in the memory requirement can be achieved with the AS method compared to the direct method for typical D2NNs. As an example, the D2NN classifier used in [13] has ${200}\times{200}$ neurons per layer ${(}{N} = {200}{)}{.}$ If the direct method is used for the implementation, it requires matrices with a total number of 1.6 billion $\left({ = {200}^{4}}\right)$ elements. In contrast, using the AS method with ${w} = {4},$ which is sufficient to provide accurate results, each layer only requires matrices with a total number of ${640},{000} = {(}{4}\times{200}{)}^{2}$ elements. This is a 2,500-times reduction in the memory requirement per single layer.
Figure 6. The computational pipeline of the AS method. The 2D DFT operation is performed after a DFT shift operation to make sure that the origin of the AS aligns with that of the propagation transfer function. In the figure, “°” denotes the elementwise matrix multiplication. IDFT: inverse DFT.
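The pipeline in Figure 6 can be sketched in a few lines of NumPy; this is a hedged illustration with our own names, assuming a square field sampled at interval dx and a padding factor w.

```python
import numpy as np

def as_propagate(U, dx, wavelength, z, w=4):
    """AS-method propagation of the field U over a distance z (Figure 6):
    zero-pad to a roughly wN x wN computational window, transform to the
    Fourier domain, apply the propagation transfer function (2) with the
    evanescent-wave mask D, and transform back."""
    N = U.shape[0]
    wN = w * N
    p0 = (wN - N) // 2
    p1 = wN - N - p0
    U_pad = np.pad(U, ((p0, p1), (p0, p1)))                 # computational window
    A = np.fft.fft2(np.fft.ifftshift(U_pad))                # angular spectrum A^{l-1}
    fx = np.fft.fftfreq(wN, d=dx)
    FX, FY = np.meshgrid(fx, fx, indexing="ij")
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    D = arg > 0                                             # mask out evanescent waves
    P = np.exp(1j * 2 * np.pi * z * np.sqrt(np.maximum(arg, 0.0)))
    U_out = np.fft.fftshift(np.fft.ifft2(A * P * D))        # back to the spatial domain
    return U_out[p0:p0 + N, p0:p0 + N]                      # remove the zero padding

# One D2NN layer step: modulate the field by the CE and propagate to the next plane,
# e.g., U_next = as_propagate(U * phi, dx, wavelength, d_layer).
```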
When the E2E-based CE design ends, the CEs are manufactured or implemented following the learned parameters. Digital micromirror device (DMD) [2] or liquid crystal on silicon (LCoS) [20] optical elements are commonly used to mimic and validate the trained CE pattern. DMD-based CE experiments exploit the spatial modulation capabilities of an ${M}\times{N}$ array of metalized polymer mirrors bonded to a silicon circuit. Each micromirror can have a pixel-size resolution of ${1}{-}{25}\,{\mu}{\text{m}}$ with two degrees of freedom (i.e., a rotation around two orthogonal axes) and refresh rates of up to 25 kHz. The adjustable-angle property of the DMD is used to pass or block the light, which allows it to work as a binary CA. Along this line, DMD-based gray-scale modulation approaches have been proposed by synchronizing the DMD with the sensor’s integration time [10], i.e., several DMD patterns are used for generating the coded measurements. In contrast, LCoS devices comprise a layer of liquid crystal sandwiched between a top sheet of glass coated with a transparent electrode and a pixelated silicon substrate. Because the silicon is reflective, it serves as a mirror element for each pixel, with the strength of the reflection electronically controlled by the amount of light transmitted through the liquid crystal above it. The LCoS’s capability to regulate light transmission has made it the most widely used optical device for modulating the wavefront’s phase.
The construction of a CE with a fixed pattern allows for considerably reducing its size as well as its cost, and improves its portability. For instance, an inexpensive way to implement a CA is through photographic film. Accordingly, the work in [11] prints a color-CA using a Fujichrome Velvia 50 transparency film (35 mm, 36 exposures) with a film speed of ISO 50, which provides high sharpness and daylight-balanced color. On the other hand, the CEs used in the D2NN [13] were fabricated to work at terahertz (THz) wavelengths. This work realized D2NNs consisting of ${300}\,{\mu}{\text{m}}\times{300}\,{\mu}{\text{m}}$ or ${400}\,{\mu}{\text{m}}\times{400}\,{\mu}{\text{m}}$-sized neurons with millimeter- to centimeter-level layer distances. The operating wavelength was ${749.48}\,{\mu}{\text{m}}$ (THz range). For imaging, however, visible light is the most prominent operating wavelength range. To scale down the D2NN in THz proposed by [13] to a visible wavelength (e.g., 632.8 nm), nanometer-scale optical neurons are required, in contrast to the millimeter scale used in THz. Fabricating structures with feature sizes smaller than the diffraction limit of visible light is challenging and therefore requires special fabrication methods. In the following, we briefly present current potential fabrication technologies, including electron beam lithography (EBL), two-photon polymerization, and implosion fabrication.
EBL is a technique with an extremely high diffraction-limited resolution (compared with conventional optical or ultraviolet photolithography) that transfers desired patterns having feature sizes of roughly 10 nm onto an EBL resist. Then, the carved resist is dissolved in a chemical bath and filled with the desired material. Finally, the resist is dissolved again to obtain the material with the desired patterns. Two-photon polymerization is a nonlinear laser-writing technique based on the simultaneous absorption of two photons in a photosensitive material, e.g., polymers. This technique allows for the creation of complex 3D structures with feature sizes on the order of 100 nm. Implosion fabrication is a method that creates large-scale objects embedded in expanded hydrogels and then shrinks them to nanoscale. Specifically, this technique uses an absorbent material made of polyacrylate as the scaffold for its nanofabrication process. The scaffold is bathed in a solution containing fluorescein molecules, which attach to the scaffold in defined 3D patterns when activated by laser light. The patterned and functionalized gel scaffolds are shrunken using acid/divalent cations. Then, the final structure is obtained after a dehydration process.
One of the essential steps in E2E-based CE design is the experimental validation of the DNN, i.e., with measurements acquired by a testbed system. This validation is conducted by first assembling the COI system using the CE obtained from the E2E optimization strategy, followed by a characterization of the optical system response, as illustrated in Figure 7. Then, a DNN retraining, termed fine-tuning [11], is carried out on the decoder weights, while the optical layer weights are fixed to the characterized CE values, by passing the training dataset through the real sensing model. For example, in [11], this process was carried out by obtaining the PSFs of the implemented system and a small set of 10 ground-truth spectral scenes with their corresponding measurements to retrain the decoder network over 100 epochs using a small learning rate of ${10}^{{-}{6}}{.}$ This process is indispensable to improve the decoder robustness and fidelity in real scenarios, as the analytical propagation model does not consider nonideal CE characteristics, such as fabrication errors or COI system misalignments, as depicted in Figure 7. For instance, in [2], the designed pixelated optical filters are not perfectly square as expected, and in [20], the resulting experimental PSF has slight distortions associated with objective and relay lens imperfections. After calibrating the camera and retraining the decoder, the task-oriented COI system is ready. A remarkable feature of the task-oriented CE design framework resides in optimally adapting the CE distribution according to the COI system’s configuration and task specifications. For instance, task-oriented DOE design has been studied in several COI applications, such as privacy-preserving pose estimation [20], video recovery [28], superresolution [3], depth estimation [3], and spectral imaging [11], as displayed in Figure 7, where it is highlighted that each task generates a unique DOE structure.
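As a hedged sketch of this fine-tuning step, the loop below freezes the optical parameters and updates only the decoder weights using data passed through the characterized (real) sensing model; the module names follow the earlier E2E training sketch and are illustrative, with the learning rate and epoch count echoing the values reported in [11].

```python
import torch

def fine_tune_decoder(decoder, real_loader, task_loss, lr=1e-6, epochs=100):
    """Retrain only the decoder weights theta; the CE stays fixed to its
    characterized values. real_loader yields (g_real, d) pairs, where g_real
    are measurements obtained with the assembled/characterized system."""
    optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(epochs):
        for g_real, d in real_loader:
            loss = task_loss(decoder(g_real), d)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return decoder
```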
Figure 7. An illustration of the real implementation using the deep optical coding design. With the real optical system, new measurements are acquired and the same trained network is used to perform the desired task (depth [3], privacy [20], superresolution [3], and spectral imaging [11]) directly to the measurements. CCA: color-CA; SCCD: shift-variant color-coded diffractive spectral imaging.
The E2E optimization framework has been applied in diverse applications, such as reconstruction, classification, or object detection. This survey presents some relevant applications that have leveraged E2E optimization and describes the specific encoder-decoder parameters as well as the cost functions employed to optimize them.
Spectral imaging aims to capture 2D images of the same scene at different wavelengths. Spectral images enable a diverse range of applications, including medical imaging, remote sensing, defense and surveillance, and food-quality assessment [2]. The amount of spatial information across a multitude of wavelengths represents one of the main challenges for traditional scanning-acquisition imaging systems: to obtain several high-definition images, these systems require long exposure times, thereby limiting their use in real-time applications [2]. To overcome this limitation, several computational approaches have been proposed to acquire a single snapshot and recover the desired spectral cube in a computational stage [2]. More recent approaches have employed the E2E framework to jointly optimize the CEs and the image processing algorithm, leading to state-of-the-art results [11], [28].
For instance, the authors in [11] propose a shift-variant color-coded diffractive spectral imaging (SCCD) system composed of a color-CA (CCA) and a DOE, as shown in Figure 8(a). The proposed optical system has a different PSF for each pixel, improving the coding of the optical system. In this case, the optical parameters $\bf{\phi}$ consist of a binary spatial mask that selects a given primary color for each spatial pixel location of the CCA and a set of coefficients that weight the Zernike polynomials to represent the DOE’s height map [3]. As an electronic decoder, a deep network based on the U-Net with skip connections was employed [10]. The decoder parameters $\bf{\theta}$ are the weights of the U-Net network. To optimize $\bf{\phi}$ and $\bf{\theta},$ the MSE metric was employed, setting ${\mathcal{L}}_{\text{task}}$ in (7) to ${\mathcal{L}}_{\text{task}}{(}{\bf{x}},{\hat{\bf{x}}}{)} = {\Vert}{\bf{x}}{-}{\hat{\bf{x}}}{\Vert}_{2}^{2}{.}$ Furthermore, the binary regularization introduced in Table 2 is employed to promote binary values on the spatial mask, selecting the optimal color.
Figure 8. Three representative optical encoding and computational decoding applications that employ E2E optimization. (a) Spectral imaging, (b) image classification, and (c) HPE.
Image classification is a complex task that has been addressed successfully using deep convolutional neural networks (CNNs). However, the deployment of CNNs in mobile or low-resource devices is prohibitive due to the high costs of memory, processing, and power. To increase computational efficiency, recent approaches design CEs that can perform computations via light propagation and that can work as a coprocessing unit. Thus, the image information can be processed at the speed of light with reduced power consumption. For instance, the authors in [29] proposed a hybrid optical-electronic CNN that implements the first layer of a CNN using an optimizable DOE and the remaining layers in an electronic decoder, as illustrated in Figure 8(b). The optical parameter in this system is the height profile of a DOE that produces the kernels of the first optical layer. The digital decoder is just a simple nonlinear activation function, followed by a fully connected layer. Similar to standard optimization methods for classification, the cross-entropy loss is employed to optimize the kernels generated by the DOE and the fully connected linear layer of the decoder. As the generated PSFs from the DOE cannot have negative values, an additional nonnegative regularization is employed.
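A simplified PyTorch-style stand-in for such a hybrid optical-electronic classifier is sketched below; note that, unlike [29], nonnegativity of the "optical" kernels is enforced here with a softplus for brevity rather than with a regularization term, and all names, sizes, and hyperparameters are illustrative.

```python
import torch

class HybridOpticalCNN(torch.nn.Module):
    """Illustrative hybrid classifier: the first layer is a convolution with
    nonnegative kernels (emulating PSFs that a DOE could realize optically),
    followed by an electronic nonlinearity and a fully connected layer."""
    def __init__(self, num_kernels=8, kernel_size=7, image_size=28, num_classes=10):
        super().__init__()
        self.psf_logits = torch.nn.Parameter(
            torch.randn(num_kernels, 1, kernel_size, kernel_size))
        self.fc = torch.nn.Linear(num_kernels * image_size * image_size, num_classes)

    def forward(self, x):
        psfs = torch.nn.functional.softplus(self.psf_logits)  # nonnegative "optical" kernels
        feat = torch.nn.functional.conv2d(x, psfs, padding="same")
        feat = torch.relu(feat)                                # electronic nonlinearity
        return self.fc(feat.flatten(1))                        # fully connected classifier
```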
Cameras in smartphones, cars, homes, and cities collect a huge amount of information in an always-connected digital world. However, this raises a major challenge for the preservation of privacy. Conventional privacy-preserving approaches rely on a traditional digital imaging system. For example, the detection of privacy-sensitive everyday situations can be done in software; then, an eye-tracker’s first-person camera can be enabled or disabled using a mechanical shutter [20]. However, such methods perform software-level processing on videos acquired by traditional cameras, which may already contain privacy-sensitive data that could be exposed in an attack. A recent approach based on the design of CEs proposes a freeform lens that protects sensitive data while maintaining useful features for perceiving humans in the scene, specifically, HPE with hardware-level protection [20], as shown in Figure 8(c). Similar to the applications presented previously, in this work, the optimization of the optical encoder is carried out through the varying surface height of a lens parameterized using the Zernike polynomials. To obtain the locations of body key points, the OpenPose network, composed of a VGG backbone and two CNN branches, is employed. The cost function $\mathcal{L}$ of the proposed privacy-preserving HPE framework consists of the OpenPose loss ${\mathcal{L}}_{p}$ plus a cost function that enforces image degradation. The latter term, in this case, consists of the negative of the MSE between the output private image of the designed camera and the high-resolution latent image.
In contrast to optoelectronic applications, D2NNs provide an opportunity to build all-optical systems that perform computations at the speed of light through optical diffraction. As the layers do not require power to operate, the inference of these networks becomes scalable with an increasing number of neurons. The first demonstration of D2NNs presented experiments in image classification and amplitude-to-amplitude imaging [13]. For the experimental evaluation of image classification, five-layer D2NNs were trained with trainable, phase-only transmission coefficients ${(}\bf{\phi}{)}$ while using the MSE as the loss function $\left({{\mathcal{L}}_{\text{task}}}\right)$ to quantify classification accuracy for the Modified National Institute of Standards and Technology (MNIST) and FashionMNIST datasets separately. Furthermore, the D2NN was physically implemented with 3D-printed layers, each having ${200}\times{200}$ neurons in an 8-cm $\times$ 8-cm area, and the frequency of the illuminating wave was 0.4 THz. The all-optical classifiers reported 88 and 90% accuracy for classifying 3D-printed handwritten digits and fashion products, respectively.
Several research works have extended D2NNs to tasks such as 3D object recognition [14] and saliency segmentation [15]. To address the task of 3D object recognition, multiple D2NNs have been employed to capture light fields from different views and aggregate them to obtain better accuracy compared to a single view. Each of these D2NNs is composed of two layers and learns to produce an optical spot pattern from which a one-hot encoding can be obtained. These networks employ the softmax cross entropy as the training loss function. At inference, a weighted summation of the sublight fields is obtained to perform the prediction. Although the previously discussed research works learn a transformation through a D2NN in the real space, the authors in [15] explore learning such a transformation in the Fourier space to perform saliency segmentation. They utilize a 4-f system [13], an optical system with two lenses placed two focal lengths (f) apart from each other that performs a cascade of Fourier transforms. They place a D2NN in the Fourier plane (i.e., one focal length after the first lens) of the 4-f system and show through simulations that the saliency map can be obtained all optically by training to minimize the MSE between the output intensity distribution and the ground-truth saliency map. We present a summary of these all-optical applications of D2NNs in Figure 9.
Figure 9. Three types of all-optical applications of D2NNs: (a) image classification, (b) 3D object recognition using multiview D2NNs, and (c) saliency segmentation by placing the D2NN at the Fourier plane of a 4-f system.
This article summarized the essential concepts for data-driven CE design via the joint optimization of the optics and an image processing algorithm. These concepts include CE parameterization, modeling the physics-based propagation of light, and the underlying optimization framework, and they are illustrated via the design of two of the most popular customizable optical elements, CAs and DOEs. We showcased practical applications of this rapidly developing framework in various imaging problems, such as spectral imaging, privacy-preserving human-pose estimation, and image classification. Furthermore, we illustrated data-driven CE design in an all-optical framework that performs DNN inference at the speed of light.
Although optical coding design has become the state-of-the-art way to design optical CEs in multiple COI applications, it is still evolving. In this direction, there are some challenges and open problems when applying the E2E framework. For instance, what is the proper training scheme? The current optimization framework mainly relies on the backpropagation algorithm. Therefore, the optical layer, which is naturally the first layer of the coupled deep model, may not be properly optimized and may be fundamentally limited by vanishing gradients at the CE (i.e., gradients that converge to zero). Another essential question is the influence of a proper initialization of the optical parameters, which is well studied for DNN layers but not for optical layers. As the optical design is obtained from a database, the results lack interpretability. Some works try to analyze the PSFs of the system, for instance, by measuring the coherence between bands [11] or by giving some intuition about the results [3] based on the concept of PSF engineering. However, designing metrics or providing some guidance to understand the results is still an open problem. Along these lines, even though custom neural networks are suitable digital decoders, there are no clear insights into which characteristics of a deep architecture lead to the best deep optical design. Therefore, several issues related to the interaction between the optical encoder and the digital decoder are still to be understood.
From the perspective of constructing the optimized physical imaging system, there is also a fundamental issue regarding the mismatch between the mathematical and the real model. This problem is addressed by calibrating the assembled optical system and/or fine-tuning the digital decoder. Additionally, D2NNs perform only linear operations in the all-optics framework. Thus, there still exist limitations in the optical implementation of inherent nonlinearities, such as activation functions and pooling operations. To overcome this limitation, a few recent works are based on mode coupling in a multimode optical fiber [30] or on nonlinear light–material interactions such as electromagnetically induced transparency and saturable absorption [13]. However, making an all-optical model that incorporates the latest ideas in DNNs is still an open field.
Finally, the CE design framework has focused on a limited set of elements, relegating the other elements and parameters that compose COI systems to a predesigned structure. This fixed structure suggests a future direction that explores more flexible camera designs. For instance, physical considerations that have remained fixed until now, such as the distance between the optical elements, the sensor size, and the number of measurements, can be incorporated into the deep optical design with proper modeling. We anticipate that, in the near future, we will use compact cameras in our smartphones or devices designed using this methodology for diverse, mainly task-driven, applications.
This work was partially supported by the Air Force Office of Scientific Research under award number FA9550-21-1-0326, the Research Council of Norway, the INTPART project (number 309802), National Institutes of Health grants R21-MH130067 and P41-EB015871, the Center for Advanced Imaging at Harvard University, the John Harvard Distinguished Science Fellowship Program, and Fujikura Inc. Henry Arguello and Jorge Bacca contributed equally to the article.
Henry Arguello (henarfu@uis.edu.co) received his Ph.D. degree from the Electrical and Computer Engineering Department at the University of Delaware in 2013. He is currently a titular professor in the Systems Engineering Department, Universidad Industrial de Santander, Bucaramanga 680002, Colombia. He is president of the IEEE Signal Processing Society Colombia Chapter. His current research interests include computational imaging techniques, high-dimensional signal coding and processing, and optical design. He is a Senior Member of IEEE.
Jorge Bacca (jorge.bacca1@correo.uis.edu.co) received his Ph.D. degree in computer science from Universidad Industrial de Santander (UIS) in 2021. He is currently a postdoctoral researcher at UIS, Bucaramanga 680002, Colombia. He is a consulting associate editor for IEEE Open Journal of Signal Processing. His current research interests include inverse problems, deep learning methods, optical imaging, and hyperspectral imaging.
Hasindu Kariyawasam (170287A@uom.lk) received his B.Sc. degree in electronic and telecommunication engineering from the University of Moratuwa, Sri Lanka. He is currently a postbaccalaureate research fellow in the Center for Advanced Imaging at Harvard University, Cambridge, Massachusetts 02138 USA. His research interests include physics-based machine learning, digital system designing, optical neural networks, human–computer interaction, signal processing, and natural language processing.
Edwin Vargas (edwin.vargas4@correo.uis.edu.co) received his master’s degree in electronic engineering in 2018 from Universidad Industrial de Santander, Bucaramanga 680002, Colombia, where he is currently working toward his Ph.D. degree in the same field of study. His research interests include high-dimensional signal processing, compressive sensing, and the development of new computational cameras using deep learning.
Miguel Marquez (hds.marquez@gmail.com) received his M.Sc. degree in applied mathematics in 2018 from Universidad Industrial de Santander, Bucaramanga 680002, Colombia, where he is currently working toward his Ph.D. degree in physics. His main research interests include computational imaging, compressive sensing, optimization algorithmic, and optical system design.
Ramith Hettiarachchi (170221T@uom.lk) received his B.Sc. degree in electronic and telecommunication engineering from the University of Moratuwa, Sri Lanka. He is currently a postbaccalaureate research fellow at the Division of Science of the Faculty of Arts and Sciences at Harvard University, Cambridge, Massachusetts 02138 USA. He worked as a student researcher for the Robotics and Autonomous Systems Group at the Commonwealth Scientific and Industrial Research Organisation’s Data 61, Brisbane, Australia. His research interests include optical computing, computer vision, and machine learning.
Hans Garcia (hans.garcia@saber.uis.edu.co) received his master’s degree in electronics engineering in 2018 from Universidad Industrial de Santander, Bucaramanga 680002, Colombia, where he is currently working toward his Ph.D. degree in the same field of study. His research interests include multiresolution image recovery, compressive spectral imaging, computational imaging, remote sensing, and compressive sensing. Also, in the last several years, he has worked on optics implementation of designed coding imaging systems.
Kithmini Herath (170213V@uom.lk) received her bachelor of science degree in engineering (with honors) from the Department of Electronic and Telecommunication Engineering, University of Moratuwa, Sri Lanka. She is currently a postbaccalaureate research fellow at the Division of Science of the Faculty of Arts and Sciences at Harvard University, Cambridge, Massachusetts 02138 USA. Her research interests include signal processing, machine learning, computer vision, and human–computer interaction.
Udith Haputhanthri (170208K@uom.lk) received his B.Sc. degree in electronic and telecommunication engineering from the University of Moratuwa, Sri Lanka. He is currently a post-baccalaureate research fellow in the Center for Advanced Imaging at Harvard University, Cambridge, Massachusetts 02138 USA. Prior to his full-time position at Harvard, he worked as a part-time remote visiting undergraduate research fellow in the same lab at Harvard parallel to his university studies. His research interests include computer vision, computational imaging, and computational neuroscience/artificial intelligence intersection.
Balpreet Singh Ahluwalia (balpreet.singh.ahluwalia@uit.no) received his Ph.D. degree in photonics from Nanyang Technological University—Singapore, in 2007. He is a professor in the Department of Physics and Technology, UiT The Arctic University of Norway, Tromsø 9037, Norway. He is also affiliated with the Department of Clinical Sciences, Intervention and Technology, Karolinska Institute, Stockholm 171 77, Sweden. He also cofounded Chip NanoImaging AS. Ahluwalia was awarded the Tycho Jæger Prize in Electro-optics and the University’s Research & Development Award in 2018. His research interests include microscopy, integrated optics, and bioimaging.
Peter So (ptso@mit.edu) received his Ph.D. degree in physics from Princeton University in 1992. He is currently a professor in the Department of Mechanical and Biological Engineering, at the Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. He is the director of the MIT Laser Biomedical Research Center, a National Institutes of Health National Institute of Biomedical Imaging and Bioengineering P41 research resource. His research focuses on developing high-resolution and high-information content microscopic imaging instruments. These instruments are applied in biomedical studies, such as for noninvasive optical biopsies of cancer.
Dushan N. Wadduwage (wadduwage@fas.harvard.edu) received his Ph.D. degree in biological science from the National University of Singapore under the Singapore-MIT Alliance Fellowship in 2016. He is a John Harvard Distinguished Science Fellow in Imaging and the principal investigator of the Wadduwage Lab at the Center for Advanced Imaging, Harvard University, Cambridge, Massachusetts 02138 USA. His research focuses on developing learning-based differentiable microscopy systems for high-throughput, high-content biomedical applications.
Chamira U.S. Edussooriya (chamira@uom.lk) received his Ph.D. degree in electrical engineering from the University of Victoria, Victoria, BC, Canada, in 2015. He has been a senior lecturer in the Department of Electronic and Telecommunication Engineering, University of Moratuwa, Moratuwa 10400, Sri Lanka, since 2016, and a courtesy postdoctoral associate in the Department of Electrical and Computer Engineering, Florida International University, Miami, Florida 33174 USA, since December 2019. His current research interests include analysis and design of low-complexity multidimensional digital filters, 4D and 5D light field video processing, and computational imaging.
[1] C. Fu, H. Arguello, B. M. Sadler, and G. R. Arce, “Compressive spectral polarization imaging by a pixelized polarizer and colored patterned detector,” J. Opt. Soc. Amer. A, vol. 32, no. 11, pp. 2178–2188, 2015, doi: 10.1364/JOSAA.32.002178.
[2] G. R. Arce, D. J. Brady, L. Carin, H. Arguello, and D. S. Kittle, “Compressive coded aperture spectral imaging: An introduction,” IEEE Signal Process. Mag., vol. 31, no. 1, pp. 105–115, Jan. 2014, doi: 10.1109/MSP.2013.2278763.
[3] V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, “End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging,” ACM Trans. Graph., vol. 37, no. 4, pp. 1–13, 2018, doi: 10.1145/3197517.3201333.
[4] M. Iliadis, L. Spinoulas, and A. K. Katsaggelos, “DeepBinaryMask: Learning a binary mask for video compressive sensing,” Digital Signal Process., vol. 96, p. 102,591, Jan. 2020, doi: 10.1016/j.dsp.2019.102591.
[5] J. N. Mait, G. W. Euliss, and R. A. Athale, “Computational imaging,” Adv. Opt. Photon., vol. 10, no. 2, pp. 409–483, 2018, doi: 10.1364/AOP.10.000409.
[6] A. Chakrabarti, “Learning sensor multiplexing design through back-propagation,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3081–3089.
[7] C. Hinojosa, J. Bacca, and H. Arguello, “Coded aperture design for compressive spectral subspace clustering,” IEEE J. Sel. Topics Signal Process., vol. 12, no. 6, pp. 1589–1600, Dec. 2018, doi: 10.1109/JSTSP.2018.2878293.
[8] M. A. Davenport, P. T. Boufounos, M. B. Wakin, and R. G. Baraniuk, “Signal processing with compressive measurements,” IEEE J. Sel. Topics Signal Process., vol. 4, no. 2, pp. 445–460, Apr. 2010, doi: 10.1109/JSTSP.2009.2039178.
[9] R. Calderbank and S. Jafarpour, “Finding needles in compressed haystacks,” in Proc. IEEE Int. Conf. Acoustics, Speech Signal Process., 2012, pp. 3441–3444, doi: 10.1109/ICASSP.2012.6288656.
[10] J. Bacca, T. Gelvez-Barrera, and H. Arguello, “Deep coded aperture design: An end-to-end approach for computational imaging tasks,” IEEE Trans. Comput. Imag., vol. 7, pp. 1148–1160, Oct. 2021, doi: 10.1109/TCI.2021.3122285.
[11] H. Arguello, S. Pinilla, Y. Peng, H. Ikoma, J. Bacca, and G. Wetzstein, “Shift-variant color-coded diffractive spectral imaging system,” Optica, vol. 8, no. 11, pp. 1424–1434, 2021, doi: 10.1364/OPTICA.439142.
[12] J. Bacca, L. Galvis, and H. Arguello, “Coupled deep learning coded aperture design for compressive image classification,” Opt. Exp., vol. 28, no. 6, pp. 8528–8540, 2020, doi: 10.1364/OE.381479.
[13] X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, “All-optical machine learning using diffractive deep neural networks,” Science, vol. 361, no. 6406, pp. 1004–1008, 2018, doi: 10.1126/science.aat8084.
[14] J. Shi et al., “Multiple-view integrated DNNs array (MIDA): Realizing robust all-optical 3D object recognition,” Opt. Lett., vol. 46, no. 14, pp. 3388–3391, 2021, doi: 10.1364/OL.432309.
[15] T. Yan et al., “Fourier-space diffractive deep neural network,” Phys. Rev. Lett., vol. 123, no. 2, p. 023901, 2019, doi: 10.1103/PhysRevLett.123.023901.
[16] X. Cao, T. Yue, X. Lin, S. Lin, X. Yuan, Q. Dai, L. Carin, and D. J. Brady, “Computational snapshot multispectral cameras: Toward dynamic capture of the spectral world,” IEEE Signal Process. Mag., vol. 33, no. 5, pp. 95–108, Sep. 2016, doi: 10.1109/MSP.2016.2582378.
[17] X. Yuan, D. J. Brady, and A. K. Katsaggelos, “Snapshot compressive imaging: Theory, algorithms, and applications,” IEEE Signal Process. Mag., vol. 38, no. 2, pp. 65–88, Feb. 2021, doi: 10.1109/MSP.2020.3023869.
[18] G. Wetzstein et al., “Inference in artificial intelligence with deep optics and photonics,” Nature, vol. 588, no. 7836, pp. 39–47, 2020, doi: 10.1038/s41586-020-2973-6.
[19] M. Marquez, Y. Lai, X. Liu, C. Jiang, S. Zhang, J. Liang, and H. Arguello, “Deep-learning supervised snapshot compressive imaging enabled by an end-to-end convolutional neural network,” IEEE J. Sel. Topics Signal Process., vol. 16, no. 4, pp. 688–699, Jun. 2022, doi: 10.1109/JSTSP.2022.3172592.
[20] C. Hinojosa, J. C. Niebles, and H. Arguello, “Learning privacy-preserving optics for human pose estimation,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 2573–2582.
[21] J. Bacca, S. Pinilla, and H. Arguello, “Super-resolution phase retrieval from designed coded diffraction patterns,” IEEE Trans. Image Process., vol. 29, pp. 2598–2609, Oct. 2019, doi: 10.1109/TIP.2019.2949436.
[22] T. Bell, B. Li, and S. Zhang, “Structured light techniques and applications,” in Wiley Encyclopedia of Electrical and Electronics Engineering, J. G. Webster, Ed. New York, NY, USA: Wiley, 1999, pp. 1–24.
[23] M. R. Kellman, E. Bostan, N. A. Repina, and L. Waller, “Physics-based learned design: Optimized coded-illumination for quantitative phase imaging,” IEEE Trans. Comput. Imag., vol. 5, no. 3, pp. 344–353, Sep. 2019, doi: 10.1109/TCI.2019.2905434.
[24] M. Marquez, P. Meza, H. Arguello, and E. Vera, “Compressive spectral imaging via deformable mirror and colored-mosaic detector,” Opt. Exp., vol. 27, no. 13, pp. 17,795–17,808, 2019, doi: 10.1364/OE.27.017795.
[25] M. Marquez, P. Meza, F. Rojas, H. Arguello, and E. Vera, “Snapshot compressive spectral depth imaging from coded aberrations,” Opt. Exp., vol. 29, no. 6, pp. 8142–8159, 2021, doi: 10.1364/OE.415664.
[26] R. Jacome, J. Bacca, and H. Arguello, “D2UF: Deep coded aperture design and unrolling algorithm for compressive spectral image fusion,” 2022, arXiv:2205.12158.
[27] Y. Cai, J. Lin, X. Hu, H. Wang, X. Yuan, Y. Zhang, R. Timofte, and L. Van Gool, “Mask-guided spectral-wise transformer for efficient hyperspectral image reconstruction,” 2021, arXiv:2111.07910.
[28] E. Vargas, J. N. Martel, G. Wetzstein, and H. Arguello, “Time-multiplexed coded aperture imaging: Learned coded aperture and pixel exposures for compressive imaging systems,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 2672–2682, doi: 10.1109/ICCV48922.2021.00269.
[29] J. Chang, V. Sitzmann, X. Dun, W. Heidrich, and G. Wetzstein, “Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification,” Scientific Rep., vol. 8, no. 1, pp. 1–10, 2018, doi: 10.1038/s41598-018-30619-y.
[30] U. Teğin, M. Yıldırım, I. Oğuz, C. Moser, and D. Psaltis, “Scalable optical learning operator,” Nature Comput. Sci., vol. 1, no. 8, pp. 542–549, 2021, doi: 10.1038/s43588-021-00112-0.