Qiumeng Wei, Bin Gao, Jianshi Tang, He Qian, Huaqiang Wu
In this article, we review the development of emerging memory-based neuromorphic computing. First, we discuss the motivation and advantages of this approach. We then summarize the mechanisms and electrical behaviors of commonly used emerging memory devices as well as their application characteristics in various neuromorphic computing paradigms, including computing in memory (CIM) and other biologically plausible brain-like computing. Next, we introduce the principles of CIM and analyze its characteristics from the perspective of parallelism, precision, and signal domains. We also summarize some representative works on chip and system implementations. Additionally, we review the state of the art of other brain-like computing approaches, such as spiking neural networks (SNNs) and reservoir computing (RC). Finally, we provide insights and conclusions on the potential and obstacles of developing neuromorphic computing based on emerging memory, from both the device and system levels.
Since the advent and widespread adoption of backpropagation algorithms in the 1980s, neural networks have undergone exponential growth in both scale and learning capacity. However, the increasing number of parameters required for artificial intelligence (AI) places a heavy burden on computing hardware. For example, the latest GPT-4 model reportedly includes more than 100 trillion parameters, approximately 570 times more than GPT-3 [1]. Training a massive model like GPT-3 reportedly requires more than 285,000 CPU cores and 10,000 GPUs, resulting in a power consumption of more than 1,000 MWh. In addition to large models, AI has been extensively used in edge computing applications to reduce response latency and enhance privacy. However, due to the volatility of on-chip memory and the difficulty of integrating embedded flash at advanced technology nodes [2], achieving favorable energy efficiency, minimal area overhead, and large memory capacity simultaneously has been challenging. In contrast, the human brain can execute more than 1 million tera-operations per second (TOPS) using only 20 W of power and exhibits exceptional memory and learning abilities [3].
In contrast to the binary information representation of conventional digital systems, the human brain processes and stores information in an analog manner, significantly boosting storage density and information utilization. In addition, the human brain features an integrated memory–computing architecture, obviating the necessity for frequent data transfer between separate memory and computing units in conventional computing platforms [Figure 1(a)]. Moreover, memory in the brain is represented as analog values, such as continuous synaptic weights, with information delivery taking the form of spike firing, contributing to sparser network activity. Additionally, the brain’s processing exhibits nonlinearity, dynamics, and stochasticity, augmenting its ability to express information [4]. Inspired by the workings of the human brain, neuromorphic computing aims to create systems that emulate the brain at multiple levels, from top-level algorithms to bottom-level units, as shown in Figure 1(b).
Figure 1. (a) Separated memory and computation units in conventional computing systems. (b) Algorithm, circuit, device, and architecture levels to implement neuromorphic computing.
At the algorithm level, a typical approach is to use SNNs [5], which differ from artificial neural networks (ANNs) in their spike-based information representation and dynamic neurons, giving them greater temporal information processing ability. Synaptic plasticity-based learning rules are also utilized to train networks without supervision [6], [7]. At the architecture level, on the one hand, in-memory computing is employed to mitigate frequent data transfers. On the other hand, asynchronous event-driven information transfer, with flexible data routing, is realized to mimic the operation of biological neural networks (BioNNs), as in Loihi [8]. At the circuit level, some neuromorphic chips, like Tianjic [9], use digital-circuit-based neuron cores, including axons, dendrites, somas, and neuron routers, to construct the system. Other works emulate neural units in the mixed-signal domain, such as the ROLLS processor [10], which uses exponential I–V characteristics to replicate the behavior of neurons.
Several demonstrations have highlighted the advantages of neuromorphic computing systems in terms of their low-power and high-speed dynamic information processing capabilities [11]. Recently, some studies have implemented neuromorphic computing at the device level, utilizing the specific behaviors of the devices to perform neuromorphic computing functions. Emerging memory, with its high integration density and tunable multilevel states, can efficiently emulate synapses. Additionally, the nonvolatility of certain devices allows for the fusion of computing and memory, without the power consumption or latency associated with data transfer. Based on these features, CIM has been explored for use in emerging device arrays to accelerate large-scale matrix–vector multiplication (MVM) operations. Furthermore, by utilizing emerging memory to emulate dynamic behaviors of synapse plasticity and neuron dynamics, sparse and event-driven network activities and complex neuronal behavior can be realized with lower power and minimal area overheads.
The behaviors of devices are determined by their underlying mechanisms, which in turn influence the suitability of these devices for use in neuromorphic computing systems. Specifically, certain devices are optimized for nonvolatile and analog resistive-switching behaviors, while others with dynamic properties are better suited for constructing artificial neural units. In the following paragraphs, we will summarize various device types along with their corresponding mechanisms and describe the requirements regarding device behavior for high-performance neuromorphic computing systems.
The memristor is an emerging memory device that can modulate its conductance state through programming pulses. It can be classified into nonvolatile and volatile types based on its data retention. Nonvolatile memristor-based arrays, also known as resistive random-access memories (RRAMs), are promising memories with nanosecond-level read latency, picojoule-to-femtojoule-level write power consumption, a footprint of tens of F², and the potential for large-scale integration with CMOS [12]. RRAMs can improve storage density and represent neural network weights by utilizing multilevel resistive states, while volatile memristors exhibiting dynamic behaviors have been utilized in constructing synaptic plasticity and neuronal nonlinear dynamics.
Memristors can be categorized as filament type and interface type, according to their working mechanisms [13]. Filament-type memristors have two subcategories: cation and anion filament types. The former rely on ions of active metals like Ag and Cu [14], [15], while the latter involve oxygen vacancies [16]. The resistive switching of a filament-type memristor depends on the change in filament morphology between the electrode plates, which can be volatile or nonvolatile depending on the time scale of the ion dynamics. Interfacial memristors depend on the distribution of cations at the interface to modify the potential barrier and resistance state. They offer better analog resistive-switching features than filament-type memristors but have a slower write speed and poorer retention [17], [18], [19].
Phase-change memory (PCM) is another promising nonvolatile memory technology [20], [21] that depends on the conversion of a chalcogenide-based material between crystalline and amorphous phases through Joule heating. PCMs also offer multilevel conductance states. The transition to the crystalline state requires a high temperature sustained for a sufficient duration, resulting in large write latency and power consumption. In addition, PCMs suffer from slow resistance drift [22].
Ferroelectric field-effect transistors (FeFETs) have a structure similar to that of conventional transistors but utilize a ferroelectric insulator layer as the dielectric layer. The application of a gate voltage induces polarization switching in the ferroelectric layer, modulating the conductance of the channel in FeFETs. By using HfOx thin films as the dielectric layer instead of perovskites [23], the fabrication of FeFETs becomes compatible with CMOS technology, resulting in a moderate memory window and good data retention. Similar to memristors, the polarization-induced conductance modulation in FeFETs is ideal for constructing synaptic plasticity [24], [25]. In addition, polarization in FeFETs exhibits an accumulation effect, where the device state abruptly switches after a certain number of pulses is applied. By harnessing these device dynamics, FeFETs can be used to implement capacitorless integration neurons [26], [27].
An electrochemical random-access memory (ECRAM) device [28], [29] is a three-terminal memory device similar to an FeFET but with an insulating electrolyte layer between the metal gate electrode and the conductive channel. Under the electric field induced by the applied gate voltage, ionic transport between the electrolyte and the channel modulates the device conductance. The analog conductance tuning of ECRAMs is more stable than that of memristors due to deterministic ion motion. However, the large-scale integration of ECRAMs with CMOS still faces difficulties at present, such as ion contamination and the need to withstand the high temperatures of back-end-of-line processing.
A flash memory device is a conventional nonvolatile memory device that stores multibit weights as electrical charge in a floating gate. It is available in both NAND and NOR array structures and is an established technology. Compared to other emerging memories, flash has a larger memory window and mature 3D integration technology. Recently, vertical split-gate flash has been fabricated to enhance the performance of flash at advanced technology nodes [30]. Combined with heterogeneous integration, flash-based CIM can provide very high computing density [31].
In addition, other emerging memory devices have been widely studied for neuromorphic computing, including ferroelectric tunnel junctions [32], [33] and magnetoresistive random-access memory (MRAM) [34]. The device structures of representative memory devices are shown in Figure 2. The different behavior characteristics of these devices are applicable for various types of neuromorphic applications, as discussed in the following section.
Figure 2. Illustration of emerging memory devices. (a) Filament-type memristor. (b) Interfacial memristor. (c) PCM. (d) FeFET. (e) ECRAM. (f) Split-gate flash.
Currently, neuromorphic computing based on emerging devices can primarily be divided into two types. One is CIM, which draws inspiration from the structure of BioNNs. The other is brain-like computing, which simulates biological neural units and learning mechanisms. Emerging memory-based CIM can be broadly categorized into inference-only CIM and CIM with in situ training. For CIM, memory devices function as tunable weights, requiring fine-tuning of analog conductance. The reliability requirements for these two types of applications differ [2], [35]. In inference-only CIM, the overhead of iterative read–verify can be disregarded, as mapping software-trained weights to the memory array is performed only a few times. Therefore, the critical metrics center on the number of distinguishable conductance levels, device retention, conductance range, and read noise, as these factors significantly affect the weight mapping and sensing precision. Rao et al. developed a denoising programming method that reduces random read noise [36]. An array of $256\times 256$ Ti/Ta/HfO2/Al2O3/Pt memristors achieved up to 2,048 conductance levels with a read current fluctuation of less than 0.4 µA under a 0.2-V read voltage. On the other hand, in situ training requires frequent weight updating, so endurance, write power consumption and latency, programming nonlinearity, and asymmetry significantly impact in situ training performance.
For brain-like computing, to emulate behaviors of biological neural units, memory devices require dynamics and nonlinearity. To implement synaptic plasticity like spike-time-dependent plasticity (STDP) and paired-pulse facilitation, conductance relaxation and nonlinear I–V behavior can be utilized, in which the conductance changes nonlinearly with the applied pulse rate, intervals, and amplitudes [15], [37]. Several studies showed that device-based neurons can emulate multiple oscillating and firing modes [38], [39]. Moreover, temporal filtering of devices has been leveraged to implement artificial neural dendrites [40].
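The STDP rule mentioned above can be sketched behaviorally: the conductance change depends on the interval between pre- and postsynaptic spikes, decaying as the spikes grow further apart. The exponential window shape, amplitudes, and time constants below are illustrative assumptions, not measured device values.

```python
import math

# Illustrative STDP weight-update window (all parameters assumed):
# a positive pre-before-post interval potentiates, a negative
# post-before-pre interval depresses, both decaying exponentially.
def stdp_dw(dt_ms, a_plus=0.01, a_minus=0.012, tau=20.0):
    if dt_ms > 0:   # presynaptic spike precedes postsynaptic -> potentiation
        return a_plus * math.exp(-dt_ms / tau)
    else:           # postsynaptic spike precedes presynaptic -> depression
        return -a_minus * math.exp(dt_ms / tau)

assert stdp_dw(5.0) > 0 > stdp_dw(-5.0)
assert abs(stdp_dw(5.0)) > abs(stdp_dw(50.0))  # closer spikes, larger change
```

In a device implementation, this interval dependence emerges from the conductance relaxation of the memristor itself rather than from an explicit formula.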
Emerging memory device-based CIM capable of parallelizing MVM operations can significantly accelerate the computing of ANNs. On the one hand, a considerable amount of computation is performed in parallel based on Kirchhoff’s law. On the other hand, the weights are directly mapped to the computing array to avoid frequent weight transfer. For example, in the crossbar array structure (Figure 3), a synapse cell storing the weight is located at the intersection of the bitline (BL) and sourceline (SL). Vertical SLs are clamped to a fixed voltage $V_{\text{slc}}$, while input voltages are applied to horizontal BLs through digital-to-analog converters (DACs). The current converged at each SL is $I_{\text{out}<i>} = \sum_{j=1}^{N} G_{j,i} \times (V_{\text{in}<j>} - V_{\text{slc}})$, in which $V_{\text{in}<j>} - V_{\text{slc}}$ represents the input $A_{<j>}$, and the MVM results are obtained by quantizing the converged SL current $I_{\text{out}}$ through analog-to-digital converters (ADCs). In this scenario, a single readout operation can process $M \times N$ multiply-and-accumulate (MAC) operations, involving $2 \times M \times N$ operands, where N refers to the input parallelism and M corresponds to the readout parallelism. To represent multibit input values, pulse-amplitude or pulsewidth encoding can be used, as depicted in Figure 3(c). Weighted coding is a well-accepted technique for balancing precision and efficiency. For instance, an 8-bit input is divided into two 4-bit sections with a weight ratio of 16:1. The sensed outcomes of each section are weighted and summed to generate the final MVM results. Similarly, multibit weights can also be fragmented into multiple columns, called bit-slicing [Figure 3(d)] [41].
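The crossbar summation and the weighted input coding described above can be sketched numerically. This is a behavioral model only: the array size, conductance range, clamp voltage, and volts-per-LSB scaling are arbitrary assumptions for illustration, not values from any reported chip.

```python
import numpy as np

# Behavioral sketch of crossbar MVM with weighted input coding
# (array sizes, voltages, and conductances are assumed values).
rng = np.random.default_rng(0)

N, M = 4, 3                              # input rows (BLs), output columns (SLs)
G = rng.uniform(1e-6, 1e-4, (N, M))      # cell conductances in siemens
V_slc = 0.1                              # SL clamp voltage (V)

def crossbar_mvm(v_in, G, v_slc):
    """Kirchhoff accumulation: I_out<i> = sum_j G[j,i] * (V_in<j> - V_slc)."""
    return (v_in - v_slc) @ G

# Weighted coding: an 8-bit input is split into two 4-bit sections,
# applied in two read cycles, and recombined with a 16:1 weight ratio.
x = np.array([200, 17, 255, 3])          # 8-bit digital inputs
hi, lo = x >> 4, x & 0xF                 # high and low 4-bit sections
lsb_volt = 0.01                          # assumed volts per input LSB

i_hi = crossbar_mvm(V_slc + hi * lsb_volt, G, V_slc)
i_lo = crossbar_mvm(V_slc + lo * lsb_volt, G, V_slc)
i_total = 16 * i_hi + i_lo               # weighted sum of the two cycles

# The recombined result matches a single ideal 8-bit MVM.
i_ideal = crossbar_mvm(V_slc + x * lsb_volt, G, V_slc)
assert np.allclose(i_total, i_ideal)
```

In hardware, the weighting and summation of the two sections happen on the quantized digital codes via shift-and-add rather than on analog currents; the algebra is the same.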
Regarding bipolar inputs and weights, two methods include mapping positive and negative weights to two separate arrays, followed by differentiating the results to obtain the final output, or using a differential 2T2R structure [42] to combine the positive and negative parts into a single array.
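The differential mapping for bipolar weights can be illustrated as follows: a signed weight is split into a nonnegative positive part and a nonnegative negative part stored in two devices, and the two column currents are subtracted. The conductance range and scaling are assumptions for illustration.

```python
import numpy as np

# Sketch of bipolar weight mapping with a differential device pair per
# weight (2T2R-style): W = G_pos - G_neg. Values are assumed.
rng = np.random.default_rng(1)

W = rng.uniform(-1, 1, (4, 3))            # signed software weights
g_max = 1e-4                              # assumed max cell conductance (S)
scale = g_max / np.abs(W).max()           # fit weights into the device window

G_pos = np.where(W > 0, W * scale, 0.0)   # positive part in one device
G_neg = np.where(W < 0, -W * scale, 0.0)  # negative part in the other

v = rng.uniform(0, 0.2, 4)                # read voltages (assumed range)
i_diff = v @ G_pos - v @ G_neg            # differential column currents

# The differential current reproduces the signed MVM up to the scale factor.
assert np.allclose(i_diff, v @ (W * scale))
```

The two-array approach works identically, except that the positive and negative parts sit in separate arrays and the subtraction is performed after readout.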
Figure 3. (a) MVM operations in ANNs. (b) MVM based on RRAM array. (c) Amplitude coding, pulsewidth coding, and weighted coding. (d) Bit-slicing for weight. DAC: digital-to-analog converter; ADC: analog-to-digital converter; BL: bitline; SL: sourceline.
Emerging memory-based CIMs reported in the literature can be broadly categorized into two groups: full-precision CIM and limited-precision CIM, where “full-precision” and “limited-precision” refer to the accuracy of result quantization at the array level. In the following, we discuss the features of these two CIM modes.
The full-precision CIM produces lossless quantization for every input batch, meaning that an input of 1 multiplying a weight of 1 acts as the least significant bit (LSB) in the binary outcomes. This implies that if the input parallelism is M, and the numbers of weight levels and input levels are W and N, then the minimum quantization precision required is $(\log_2(M) + \log_2(W) + \log_2(N) + 1)$ bits. To achieve this, it is essential to ensure a nonoverlapping distribution of each signal level so that each level can be sensed accurately without producing errors. Thus, a high ON/OFF conductance ratio and small conductance variation are desired to generate a more distinguishable distribution for each readout level [43]. Both the input and weight are often binarized in full-precision CIMs to achieve a larger signal margin, and the input parallelism is kept low to obtain reduced logic ambiguity, negligible current leakage, faster read latency (<10 ns), and lower parasitics. In this parallelism setting, readout ADCs or sense amplifiers (SAs) utilize 5-bit precision or less to achieve full-precision quantization [44]. Furthermore, the lossless quantization in full-precision CIMs allows for flexible digital operations, such as shift-and-add. Direct subtraction of the corresponding quantized codes for positive and negative weights can also produce bipolar MVM outputs.
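The precision expression above is easy to evaluate directly. The helper below simply implements that formula; the example parameter sets are illustrative, chosen to show how quickly the required ADC resolution grows with parallelism and with multibit inputs and weights.

```python
import math

# Minimum lossless readout precision for full-precision CIM, per the
# expression above: log2(M) + log2(W) + log2(N) + 1 bits, where M is the
# input parallelism and W and N are the weight- and input-level counts.
def full_precision_bits(m_parallel, w_levels, n_levels):
    return math.log2(m_parallel) + math.log2(w_levels) + math.log2(n_levels) + 1

# Binary inputs and weights with 16 parallel rows:
print(full_precision_bits(16, 2, 2))      # -> 7.0 bits

# 512 inputs, 8 weight levels, 4-bit (16-level) inputs:
print(full_precision_bits(512, 8, 16))    # -> 17.0 bits
```

The second case shows why full-precision readout becomes impractical at high parallelism: a 17-bit single-shot quantization is far beyond what array-level ADCs can economically provide.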
On the other hand, for limited-precision MVM, the number of bits in the output readout is not determined by the quantity of input and weight levels or the system’s parallelism, but rather by the required accuracy levels of the algorithm. Specifically, if an N-bit quantization precision is required by the algorithm, the LSB of the output readout should be equal to the maximum output range divided by $2^N$ [45]. Here, the “maximum output range” refers to the signal range covered by all outcomes of the network. This mode is suitable for high-parallelism CIM, where multibit inputs and weights can boost computing power. However, it does result in narrowed quantization resolution due to the read voltage range and device conductance window not being proportionally enlarged. A full-precision readout would require an excessively high demand for quantization precision that would cause significant latency and power overhead. For instance, with 512 inputs, eight distinguishable weight levels, and 4-bit input, a one-time quantization precision of 17 bits would be required for a full-precision readout. In contrast, the limited-precision mode offers a better balance between efficiency and performance. It is important to note that, due to the precision loss during limited-precision CIM quantization, subtraction operations must be performed in the analog domain to prevent information loss, as depicted in Figure 4(a).
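A minimal sketch of limited-precision quantization follows: the LSB is set by the maximum output range and the algorithm's required bit count, not by the number of distinguishable analog levels. The full-scale current range and the sample value are assumed for illustration.

```python
# Limited-precision readout: the LSB equals the maximum output range
# divided by 2**n_bits. Range and inputs below are assumed values.
def quantize(i_out, i_max_range, n_bits):
    lsb = i_max_range / (2 ** n_bits)
    code = round(i_out / lsb)
    return max(0, min(code, 2 ** n_bits - 1))  # clamp to the code range

# An 8-bit readout over an assumed 100-µA full-scale output range:
print(quantize(37e-6, 100e-6, 8))   # -> 95
```

Note that currents closer together than one LSB map to the same code, which is exactly the information loss that forces subtraction of positive and negative partial sums to happen in the analog domain, before quantization.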
Figure 4. (a) Information loss in the subtraction of limited-precision quantization. (b) Definition of the signal range of limited-precision MVM.
It is important to bear in mind that the terms “full precision” and “limited precision” refer to the accuracy of readout quantization at the array level, not the precision of weight devices. While using analog devices as weights may appear to be more precise than binary cells, it is vital to consider that the overlapping conductance distribution and reduced signal margins in analog weight cells can make precise readout challenging.
These two modes of CIM are suitable for different application scenarios, and the choice of array size and computing precision should be carefully considered based on several factors, such as the number of algorithm parameters, accuracy requirements, energy consumption, and area limitations. Table 1 summarizes some characteristics of these two types of CIM from the present works. Here we briefly analyze the reasons for the different characteristics of the two models in terms of energy efficiency and computing power density.
Table 1. Characteristics of full-precision CIM and high-parallelism CIM.
From the perspective of input circuits, in full-precision but low-parallelism CIM, binary inputs can be applied to the wordlines (WLs) in the memory array, which connect to the gate of transistors. This structure lowers the requirement of driving capability and eliminates the IR drop. The application of inputs only necessitates switches and a global buffer. However, in limited-precision but full-parallelism CIM, multilevel inputs can only be utilized in crossbars, where the input voltage is applied to low-impedance nodes. To meet the demands of precise and swift voltage setup, power-hungry amplifiers with feedback become unavoidable.
From the perspective of readout circuits, while a low-parallelism CIM with full precision requires multiple inputs for computing multibit inputs, it benefits from a smaller array size and parasitic effects, enabling a faster voltage settling. Additionally, the relaxed resolution requirements, such as the low-resistance state cell current typically ranging from 4 to 20 µA for CIM chips [43], [45], [46], make it possible to achieve multibit quantization at once, ultimately reducing readout latency. On the other hand, the effective differential signal distribution in full-parallelism but limited-precision CIM is smaller than the single-ended signal amplitude [42], as depicted in Figure 4(b). Moreover, employing multibit inputs and weights leads to a reduction in the signal margin, thereby augmenting the need for higher quantization resolution. For ADCs limited by thermal noise, achieving an additional bit of precision translates to a fourfold increase in power consumption [47]. Thus, to mitigate the noise, ADCs intended for readout quantization usually demand a significant amount of power, leading to a decline in overall energy efficiency.
Regarding computing power density, full parallelism enables the completion of multiple operations simultaneously, while analog weights improve the weight density. Full-parallelism CIMs with large-size arrays result in lower overheads on peripheral circuits, thereby facilitating high computing density for the same number of parameters [48], [49]. Additionally, a larger array can accommodate a greater number of weights simultaneously, reducing the frequency of weight transfers.
Therefore, full-precision CIMs are usually designed for high-energy-efficiency applications but with medium computing power and computing density, while limited-precision CIMs are more suitable for high-computing-power applications. From the perspective of brain computing, it is notable that computation in the brain is not precise and is accompanied by noise [50], [51]. Hence, the limited-precision but high-parallelism CIM seems more akin to how the human brain computes. To better exploit the advantages of emerging memory-based highly parallel computing, the learning mechanisms of the human brain remain to be further revealed.
As the basic operation in CIMs, MVMs can be implemented in multiple signal domains, including the current, voltage, charge, and time domains. In current-mode MVM, the cell current is controlled by the device conductance, as shown in Figure 5(a). The current-mode MVM operation has the drawback of high quiescent currents, resulting in a large static power overhead [52]. Additionally, sensing currents induced by reading voltages implies that the precision of current-mode CIM is sensitive to the variation of clamp voltages caused by noise, offsets, and IR drop [46], [53], [54]. Nonetheless, current-mode CIM is expected to perform better with higher parallelism compared to other CIM modes due to the clamping circuit. For energy-efficient CIM applications, some studies have explored voltage-domain CIM, such as using the BL discharge rate [52] and the settled floating SL voltage [55] to represent the MVM calculation results. Figure 5(b) illustrates the proposed structure for such applications.
Figure 5. Illustrations for MVM in different signal domains. (a) Current domain. (b) Voltage domain. (c) Charge domain. (d) Time domain. TDC: time-to-digital converter.
Moreover, charge- and time-domain MVMs are also employed in CIM hardware designs. In charge-mode CIM, the products of input and weight bits are responsible for controlling the bottom plate voltages of capacitors, further establishing the voltage of the common top plate [56], [57]. Finally, the top plate voltage is sampled and converted to the MVM result. The array structure is depicted in Figure 5(c). However, it is worth noting that larger capacitor areas can lead to decreased computing density. Time-domain MVM [58] also presents an efficient solution for high energy-efficiency CIM. This is due to sparse pulse activities and the absence of static currents, which contribute to enhanced energy efficiency. In this method of MVM, the delay of each cell is determined by the corresponding inputs and weights, as seen in Figure 5(d). The MVM results are later read out by time-to-digital converters (TDCs) in accordance with the delayed pulses.
Compared to voltage- and current-mode CIMs, charge- and time-mode CIMs are better suited for advanced technology nodes due to the sensitivity of analog circuits to power and voltage scaling. The dynamic signal range in the voltage and charge domains is frequently restricted by the precharged voltage, thus limiting the signal margins. However, the voltage range in the voltage-mode domain is further limited by the read disturb. Table 2 summarizes the features of MVMs in different signal domains. Combining processing methods from different signal domains can leverage the strengths of each domain. For example, current-mode CIM-based SNNs transfer spikes in the time domain, enabling weight-and-sum processing with short latency and sparse information routing with low power consumption.
Table 2. Summary of features of different signal domain-based CIMs.
Emerging device-based CIM systems’ performance may suffer from nonideal properties of devices, circuits, and systems, making them less accurate than software-based simulations. Nonideal behaviors of devices primarily affect the precision of weight programming and readout. To address the impact of stochastic conductance fluctuations on weight mapping precision, read–verify methods are widely employed to constrain conductance states within narrow distributions [60]. However, multibit programming can incur significant power and time overhead due to the iterative read–verify process. Thus, precision is often limited to four bits or less when programming single devices, resulting in a disparity between software-trained floating-point weights and finite device levels. Additionally, accuracy may suffer from conductance variation, drift, read noise, and other nonideal RRAM device characteristics. Some works propose incorporating RRAM behavior models and noise injection techniques into network training processes to enhance resilience to nonideal properties [61]. Hybrid training [45] is an effective strategy that enables the network to adapt to device behaviors by training part of the layers of the network in situ.
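The iterative read–verify loop described above can be sketched with a toy device model: apply a programming pulse, read back the conductance, and repeat until the read value falls inside the target window. The conductance window, step size, and noise magnitudes are assumed values, not characteristics of any particular device.

```python
import numpy as np

# Sketch of iterative read-verify programming with a toy device model
# (conductance range, pulse step, and noise levels are assumed).
rng = np.random.default_rng(42)

def program_cell(g_target, tol, g_init=1e-5, step=2e-6, max_iter=100):
    g = g_init
    for i in range(max_iter):
        g_read = g + rng.normal(0, 5e-7)       # read with assumed read noise
        if abs(g_read - g_target) <= tol:
            return g, i                         # verified within the window
        # Pulse polarity follows the sign of the error; each pulse moves
        # the conductance by roughly one step plus write stochasticity.
        direction = np.sign(g_target - g_read)
        g += direction * step + rng.normal(0, 3e-7)
        g = np.clip(g, 1e-6, 1e-4)              # device conductance limits
    return g, max_iter

g_final, n_pulses = program_cell(g_target=5e-5, tol=2e-6)
assert abs(g_final - 5e-5) < 5e-6 and n_pulses < 100
```

The loop illustrates the cost argument in the text: tightening `tol` to gain one more bit of weight precision multiplies the number of program-and-verify cycles, which is why single-device precision is usually capped at a few bits.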
At the circuit level, large-scale arrays with circuit noise, capacitive parasitics, IR drop, and current leakage can degrade read speed and precision, particularly in current-mode CIM crossbar architectures. Fluctuations in the voltage of SL clamping have the potential to affect both static errors, such as readout circuit offsets and array current–voltage IR drop, and dynamic errors, such as noise and incomplete voltage settling. Input-aware clamp circuitry or the use of differential 2T2R weight and 2T1R weight structures can address these issues [42], [46], [62].
When it comes to processing large neural network computations, multicore systems are essential. However, their peak performance is inhibited by factors such as weight mapping, data input loading, and data routing among arrays, leading to latency. Furthermore, unbalanced task distribution can congest data routes among multiple cores. Fortunately, simulators can be utilized to scrutinize system performance and optimize weight mapping and array size under diverse structures [49], [63], [64].
In this section, we will review the CIM chips and systems based on emerging devices, ranging from array-level demonstrations to macrolevel chips and systems. Figure 6 is a road map of emerging device-based CIM hardware.
Figure 6. Road map of emerging device-based CIM hardware. MNIST: Modified National Institute of Standards and Technology; LSTM: long short-term memory. (Source: Adapted from [42], [45], [55], [65], [67], [69], [70], and [72].)
One early demonstration was in situ training on ${3}\times{3}$ binary image classification based on a ${12}\times{12}$ passive memristor crossbar [65], showcasing the potential for neural network acceleration. Afterward, various algorithm demonstrations were implemented using small-scale memristor arrays [66]. To overcome cell leakage and improve programming precision in large-scale arrays, 1T1R memristor arrays were used in several studies to implement algorithms such as image compression, convolutional filtering, and the in situ training of two-layer perceptrons and long short-term memory (LSTM) networks [60], [67], [68].
Beyond array demonstrations, several prototype full-precision CIM chips based on emerging memory have been reported. In 2017, Su et al. [69] presented a nonvolatile RRAM-based CIM processor fabricated in 150-nm CMOS technology, consisting of four ${32}\times{32}$ RRAM MVM engines. In 2018, Chen et al. [70] implemented a 65-nm 1-Mb RRAM-based CIM macro for accelerating a binary neural network, achieving a 14.8-ns access time for a ${3}\times{3}$ convolutional kernel and adopting the full-precision CIM mode. To mitigate the signal margin deterioration caused by increasing parallelism, Xue et al. [43] proposed a triple-margin small-offset current-mode SA that enlarges the sense margin by sampling and canceling the threshold mismatch of diode-connected load transistors, resulting in a threefold increase in the effective signal margin. The fabricated 1-Mb RRAM CIM core in the 55-nm process achieved an energy efficiency of 53.17 TOPS/W in 1 b-input, 3 b-weight, and 4 b-output mode. Hung et al. [52] proposed a dc-current-free CIM macro that achieved up to 1,286.4 TOPS/W in 1 b-input, 1 b-weight, and 3 b-output mode. In this scheme, the BL is first precharged to a fixed voltage and then discharges at different rates depending on the input-activated WL and stored weights. Instead of using single-level cells (SLCs) as weight cells, Khwa et al. [71] used multilevel cells (MLCs) to improve the weight density and proposed a voltage-swing remapping voltage SA to enlarge the signal margin and rescale the sensed voltage range. In 2022, Huo et al. [72] reported the first 3D vertical RRAM-based full-precision CIM macro, in which each memory cell supports a 2-bit weight. The 3D vertical RRAM density is 16.6 times higher than that of previous 2D RRAM-based CIM.
Regarding high-parallelism CIM, Liu et al. [42] reported the first analog RRAM-based CIM chip for high parallelism, fabricated in 130-nm technology. The chip adopted a differential 2T2R weight structure to largely cancel SL currents and alleviate the IR drop. This chip achieved high parallelism with an array size of 156.8 Kb and a read latency of 77 µs/image as well as a peak energy efficiency of 78.4 TOPS/W. Correll et al. [73] fabricated an SoC prototype chip integrating four isolated RRAM CIM tiles, each supporting a parallelism of 256 inputs and 32 readouts, with a 16-level RRAM weight cell. They proposed using a binary-weighted multicycle sampling ADC to weight the partial bitwise output and quantize the overall result, so only one-time quantization is needed for each MAC operation. The performance of recent representative CIM macro chips is summarized in Table 3.
Table 3. Summary of recently reported CIM macro chips. (Adapted from [44], [46], [52], [71], [73], and [74].)
To improve the performance of CIM systems, optimizations can be implemented at both the circuit and system levels. In 2022, a multiarray system integrating eight 2-Kb RRAM arrays was reported [45]. The system ran a five-layer convolutional neural network (CNN) and achieved an accuracy of more than 96%, demonstrating better energy efficiency and performance density than state-of-the-art GPUs. To address the precision loss caused by weight mapping, this work adopted a hybrid training method that performed in situ retraining of the last layer. Furthermore, convolutional kernels were replicated across RRAM arrays to parallelize the otherwise serial sliding-window convolution. Another voltage-mode CIM chip, named NeuRRAM, was reported [55]. The chip included 48 RRAM-CIM bidirectional transposable neurosynaptic arrays and supported flexible computation without duplicating ADCs and input buffers. Several hardware-algorithm co-optimization methods, such as model-driven chip calibration and noise-resilient neural-network training, were proposed to mitigate hardware nonidealities. NeuRRAM demonstrated outstanding performance on various tasks, such as an error rate of 0.98% on the Modified National Institute of Standards and Technology (MNIST) dataset using a seven-layer CNN and 14.34% on the CIFAR-10 dataset using ResNet-20.
As mentioned previously, the device conductance is dependent upon the motion of conductive ions, exhibiting a wide range of dynamic and stochastic characteristics. It is therefore expected that such features could be harnessed in the design of brain-like computing systems, with the potential to realize higher energy efficiency and integration density compared to conventional CMOS-based systems. In the following sections, several behavioral attributes of the emerging memory and related demonstrations of brain-like systems will be discussed.
The selector device employed in passive crossbars to suppress leakage can exhibit abrupt conductance switching under a varying voltage, as depicted in Figure 7(a). This threshold-switching behavior is widely utilized to mimic the firing behaviors of biological neurons and consequently facilitates the construction of SNNs. Mott device-based neurons have been demonstrated to emulate multiple firing modes of biological neurons [38], [39]. Ag filament-based diffusive memristors and FeFETs have also been adopted to implement the leaky integrate-and-fire (LIF) neurons commonly used in SNNs [26], [75], [76]. Device dynamics have likewise been leveraged to implement synaptic plasticity. Kim et al. [37] utilized the second-order behavior of memristors to implement STDP, in which the conductance change depends on the interval between pulses applied at the two terminals. Li et al. [40] introduced a dendrite device exhibiting a nonlinear conductance response with respect to the amplitude and interval of continuously applied pulses. These properties resemble the nonlinear filtering behavior of biological dendrites, and the device can be employed to create neural network systems that incorporate dendritic structures, unlike the point-neuron models used in conventional ANNs. Some emerging devices possess intrinsic stochasticity, which has been leveraged in physically unclonable functions [77], true random number generators [78], Bayesian neural networks [79], and simulated annealing algorithms [80], among other applications. A schematic of these device behaviors and the related applications is shown in Figure 7.
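The LIF dynamics that these devices emulate in hardware can be written in a few lines. The following is a toy discrete-time model with illustrative parameter values, not a model of any cited device:

```python
def lif_spikes(i_in, tau=20.0, v_th=1.0, v_reset=0.0, dt=1.0):
    """Leaky integrate-and-fire neuron: the membrane potential leaks
    toward rest, integrates the input current, and emits a spike (then
    resets) whenever it crosses the threshold."""
    v, spikes = v_reset, []
    for i_t in i_in:
        v += dt / tau * (-(v - v_reset) + i_t)  # leak + integrate
        if v >= v_th:                           # threshold crossed: fire
            spikes.append(1)
            v = v_reset                         # reset after the spike
        else:
            spikes.append(0)
    return spikes

# Constant drive: a stronger input produces a higher firing rate.
weak = sum(lif_spikes([1.2] * 200))
strong = sum(lif_spikes([3.0] * 200))
```

In the device implementations above, the leak and integration are realized physically (e.g., by filament relaxation in a diffusive memristor) rather than computed, which is the source of their energy advantage.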
Figure 7. (a) Schematic of device behaviors: threshold switching, nonlinear dynamics, and stochasticity. (b) A synaptic device exhibiting STDP. (c) A diffusive memristor as an LIF neuron. (d) A dendrite device. LIF: leaky integrate and fire. (Source: Adapted from [37], [75], and [81].)
Neuromorphic computing demonstrations have been achieved using emerging memory devices. For example, Wang et al. [82] utilized ${8}\times{8}$ drift memristor-based synapses and eight diffusive memristor-based neurons for unsupervised learning. Under the stimuli of input images, the synapses evolved to fixed patterns after training with simple STDP and lateral inhibition. Fu et al. [83] developed a forming-free V/VOx/HfWOx/Pt device with both threshold-switching and nonvolatile resistive-switching behavior, enabling capacitor-free neurons and high-precision synapses in a single device and thus avoiding process incompatibilities. A fully memristive SNN based on the fabricated device achieved ∼10 fJ per operation for one synapse with an accuracy of 86% on the MNIST dataset, using in situ learning based on the remote supervised method (ReSuMe). Furthermore, an SNN system integrating hybrid memristor-CMOS stochastic LIF neurons and resistive-switching memristor synapses was demonstrated in [84]. The neuron emulated the "all-or-none" feature and supported a simple Hebbian learning rule along with lateral inhibition to achieve winner-take-all behavior. A two-layer SNN was also demonstrated, in which the first layer underwent unsupervised training in situ and the second layer was trained in a supervised manner to output classification results.
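The pairwise STDP rule used in these unsupervised demonstrations can be sketched as follows; the learning rates and time constant are illustrative, not taken from the cited devices:

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.05, a_minus=0.055, tau=20.0):
    """Pairwise STDP: potentiate when the presynaptic spike precedes the
    postsynaptic one, depress otherwise, with an exponential dependence
    on the spike interval -- the timing rule that a second-order
    memristor realizes through the decay of its internal state."""
    dt = t_post - t_pre
    if dt > 0:                                   # pre before post: LTP
        dw = a_plus * np.exp(-dt / tau)
    else:                                        # post before (or at) pre: LTD
        dw = -a_minus * np.exp(dt / tau)
    return float(np.clip(w + dw, 0.0, 1.0))      # conductance stays bounded

w_ltp = stdp_update(0.5, t_pre=10.0, t_post=15.0)  # dt = +5: potentiation
w_ltd = stdp_update(0.5, t_pre=15.0, t_post=10.0)  # dt = -5: depression
```

Combined with lateral inhibition (suppressing all but the most active output neuron), repeated application of this rule lets synapses converge to input patterns without labels, as in [82].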
The previous studies treated the neuron as a conventional point model and overlooked the interactions among neuronal components. In contrast, [81] constructed neuromorphic systems with a complete neuron structure comprising somas, dendrites, and synapses, using various memristors. The serial connection of dendrite and Mott soma devices within the neuron enables nonlinear filtering behavior, which enhances robustness to input noise and enriches temporal information processing capabilities. A fully memristive LSTM network was implemented, incorporating 75 memristor synapses, 18 dendrites, and three somas, achieving 96% accuracy on the Nanyang Technological University-Red Green Blue (NTU-RGB) dataset. By leveraging the filtering behavior of the dendrite devices, the power consumption of the system was significantly reduced, to 2,000 times lower than that of a GPU. Beyond such device-level emulation, some systems also adopt brain-like architectures.
RC is a novel computing paradigm that efficiently captures spatiotemporal signals. Training an RC system is highly efficient, requiring only modulation of the linear readout layer. To create a reservoir layer without recurrent connections, dynamic memristors exhibiting nonlinear and temporal filtering features have been leveraged to provide rich temporal information to the reservoir. Zhong et al. [85] reported a fully integrated RC system using dynamic memristors that included a 24-dynamic-memristor reservoir layer and a readout layer composed of ${192}\times{8}$ drift memristors. A voltage buffer connects the reservoir node states directly to the readout layer, eliminating the need for ADCs. The system achieved 96.6% accuracy in detecting temporal arrhythmia and 97.9% accuracy in recognizing spatiotemporal dynamic gestures.
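The training economy of RC (only the readout is learned) can be illustrated with a toy software reservoir. The leaky-tanh node below is a software stand-in for a dynamic-memristor node, and all parameter values, labels, and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N_NODES = 24
W_IN = rng.uniform(-1, 1, N_NODES)      # fixed, untrained input weights

def reservoir_states(x_seq, decay=0.7):
    """Each node acts as a leaky nonlinear filter of the input stream,
    so its final state summarizes recent temporal context."""
    s = np.zeros(N_NODES)
    for x in x_seq:
        s = decay * s + np.tanh(W_IN * x)   # fading memory + nonlinearity
    return s

# Only the linear readout is trained (here by ridge regression);
# the reservoir itself stays fixed, which is what makes RC cheap to train.
X = np.stack([reservoir_states(rng.standard_normal(50)) for _ in range(200)])
y = rng.integers(0, 2, 200).astype(float)          # toy binary labels
w_out = np.linalg.solve(X.T @ X + 1e-2 * np.eye(N_NODES), X.T @ y)
pred = X @ w_out
```

In the hardware system of [85], the reservoir dynamics come for free from device physics, and only the ${192}\times{8}$ drift-memristor readout weights are programmed.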
Table 4 is a summary of representative demonstrations of emerging memory device-based brain-like computing.
Table 4. Summary of emerging device-based brain-like demonstrations.
The utilization of emerging memories possessing high integration density, nonvolatility, and rich dynamic behaviors provides a promising approach to constructing large-scale neuromorphic computing systems capable of advanced performance with high energy efficiency. Here, we put forward some perspectives on improving neuromorphic computing systems from the device level to the system level.
At the device level, optimizing the analog behavior of devices is essential. Keeping the device conductance within an appropriate range is also crucial for balancing current consumption and readout accuracy. Emerging memory-based CIM in the charge and time domains requires further investigation at the circuit and algorithm levels due to its suitability for advanced CMOS technology. To maximize full-precision CIM's parallelism, two key factors must be addressed through optimized circuit design: establishing a larger signal margin by widening the gap between the 0 and 1 states of weight cells and compensating for the array IR drop.
At the system level, it is necessary to combine CIM with a CPU to build a general-purpose processor that supports implementing all operations on chip [87]. To achieve high efficiency, the data transfer frequency must be minimized and the speed gap between CIM and CPU processing reduced while maximizing CIM core utilization. This requires optimization of the system architecture, instruction set, and data routing. Another crucial direction at the system level is process integration, which focuses on 3D integration and optimized selector devices to increase integration density and construct monolithic 3D systems. It also involves integrating multiple modules, such as sensing, storage, and computing, into a single chip, further reducing data transmission and enhancing the energy efficiency and data security of edge computing. Some monolithic 3D integration implementations have been reported. For example, [88] monolithically integrated silicon-based CMOS logic, RRAM-based CIM, and carbon nanotube FET-based ternary content-addressable memory layers and demonstrated a one-shot learning task with 97.8% accuracy on the Omniglot dataset at 162× lower energy consumption than a GPU. From a system perspective, multimodal systems are expected to integrate the advantages of multiple types of neuromorphic computing. For example, ANNs and SNNs have their own advantages in task processing [5], with ANNs focusing on accuracy and SNNs having an advantage in processing binary bitstreams containing time information, such as dynamic vision sensor (DVS) data. Some works have integrated CIM-based ANNs and SNNs [89] to combine high energy efficiency and accuracy. In addition, a multimodal system that integrates multiple sensor arrays, CIM-based preprocessing units, and nonvolatile memory is expected to facilitate high-performance edge computing.
Moreover, a rigorous methodology for benchmarking CIM is currently lacking [90]. At present, the accuracy of CIM hardware is reported in terms of the accuracy of specific network algorithms, but such an evaluation criterion is ill defined and can be easily manipulated. Evaluating the signal-to-noise ratio is also challenging because it depends on the array loads as well as the weight and input patterns. Additionally, published works can only report performance under specific precisions, such as INT4 and INT8. However, different circuit structures and computing principles have unique calculation processes, making it difficult to extrapolate performance across precision levels; for example, energy efficiency at 1-bit precision cannot be estimated simply by multiplying the 4-bit efficiency by 4. In conclusion, the high level of customization involved in analog CIM hinders the establishment of uniform benchmarks.
In this review, we have provided a comprehensive summary of RRAM-based CIM for neural network acceleration as well as demonstrations of emerging memory for other brain-like computing purposes. Different array structures and MVM modes suit CIM chips with different performance requirements. In addition, emerging memories exhibiting diverse properties are well suited to realizing diverse types of neuromorphic computing. To achieve a reliable, high-performance neuromorphic computing system, collaborative optimization is required at every level, from low-level device manufacturing to high-level system algorithms.
This work was supported in part by the STI 2030-Major Projects (2021ZD0201200), the National Natural Science Foundation of China (92064001 and 62025111), the XPLORER Prize, the Shanghai Municipal Science and Technology Major Project, and the Beijing Advanced Innovation Center for Integrated Circuits.
Qiumeng Wei (wqm20@mails.tsinghua.edu.cn) is with the School of Integrated Circuits, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China.
Bin Gao (gaob1@tsinghua.edu.cn) is with the School of Integrated Circuits, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China.
Jianshi Tang (jtang@tsinghua.edu.cn) is with the School of Integrated Circuits, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China.
He Qian (qianh@tsinghua.edu.cn) is with the School of Integrated Circuits, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China.
Huaqiang Wu (wuhq@tsinghua.edu.cn) is with the School of Integrated Circuits, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China.
[1] “GPT-4 will have 100 trillion parameters — 500x the size of GPT-3,” Towards Data Science, Sep. 2021. [Online]. Available: https://towardsdatascience.com/gpt-4-will-have-100-trillion-parameters-500x-the-size-of-gpt-3-582b98d82253
[2] Y. Xi et al., “In-memory learning with analog resistive switching memory: A review and perspective,” Proc. IEEE, vol. 109, no. 1, pp. 14–42, Jan. 2021, doi: 10.1109/jproc.2020.3004543.
[3] R. J. Douglas and K. A. Martin, “Recurrent neuronal circuits in the neocortex,” Current Biol., vol. 17, no. 13, pp. R496–R500, Jul. 2007, doi: 10.1016/j.cub.2007.04.024.
[4] P. Poirazi and A. Papoutsi, “Illuminating dendritic function with computational models,” Nature Rev. Neurosci., vol. 21, no. 6, pp. 303–321, Jun. 2020, doi: 10.1038/s41583-020-0301-7.
[5] A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida, “Deep learning in spiking neural networks,” Neural Netw., vol. 111, pp. 47–63, Mar. 2019, doi: 10.1016/j.neunet.2018.12.002.
[6] P. U. Diehl and M. Cook, “Unsupervised learning of digit recognition using spike-timing-dependent plasticity,” Frontiers Comput. Neurosci., vol. 9, Aug. 2015, Art. no. 99, doi: 10.3389/fncom.2015.00099.
[7] T. Masquelier and S. J. Thorpe, “Unsupervised learning of visual features through spike timing dependent plasticity,” PLoS Comput. Biol., vol. 3, no. 2, Feb. 2007, Art. no. e31, doi: 10.1371/journal.pcbi.0030031.
[8] M. Davies et al., “Loihi: A neuromorphic manycore processor with on-chip learning,” IEEE Micro, vol. 38, no. 1, pp. 82–99, Jan./Feb. 2018, doi: 10.1109/mm.2018.112130359.
[9] J. Pei et al., “Towards artificial general intelligence with hybrid Tianjic chip architecture,” Nature, vol. 572, no. 7767, pp. 106–111, Aug. 2019, doi: 10.1038/s41586-019-1424-8.
[10] N. Qiao et al., “A reconfigurable on-line learning spiking neuromorphic processor comprising 256 neurons and 128K synapses,” Frontiers Neurosci., vol. 9, Apr. 2015, Art. no. 141, doi: 10.3389/fnins.2015.00141.
[11] R. Massa, A. Marchisio, M. Martina, and M. Shafique, “An efficient spiking neural network for recognizing gestures with a DVS camera on the LOIHI neuromorphic processor,” in Proc. IEEE Int. Joint Conf. Neural Netw. (IJCNN), 2020, pp. 1–9, doi: 10.1109/IJCNN48605.2020.9207109.
[12] C.-C. Chou et al., “An N40 256K×44 embedded RRAM macro with SL-precharge SA and low-voltage current limiter to improve read and write performance,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2018, pp. 478–480, doi: 10.1109/ISSCC.2018.8310392.
[13] W. Sun et al., “Understanding memristive switching via in situ characterization and device modeling,” Nature Commun., vol. 10, no. 1, Aug. 2019, Art. no. 3453, doi: 10.1038/s41467-019-11411-6.
[14] J. Zahurak et al., “Process integration of a 27nm, 16Gb Cu ReRAM,” in Proc. IEEE Int. Electron. Devices Meeting, 2014, pp. 6.2.1–6.2.4, doi: 10.1109/IEDM.2014.7046994.
[15] Z. Wang et al., “Memristors with diffusive dynamics as synaptic emulators for neuromorphic computing,” Nature Mater., vol. 16, no. 1, pp. 101–108, Jan. 2017, doi: 10.1038/nmat4756.
[16] W. Wu et al., “A methodology to improve linearity of analog RRAM for neuromorphic computing,” in Proc. IEEE Symp. VLSI Technol., 2018, pp. 103–104, doi: 10.1109/VLSIT.2018.8510690.
[17] S. Park et al., “Neuromorphic speech systems using advanced ReRAM-based synapse,” in Proc. IEEE Int. Electron Devices Meeting, 2013, pp. 25.6.1–25.6.4, doi: 10.1109/IEDM.2013.6724692.
[18] W. Zhang et al., “Analog-type resistive switching devices for neuromorphic computing,” Physica Status Solidi, Rapid Res. Lett., vol. 13, no. 10, Oct. 2019, Art. no. 1900204, doi: 10.1002/pssr.201900204.
[19] X. Li et al., “Electrode-induced digital-to-analog resistive switching in TaOx-based RRAM devices,” Nanotechnology, vol. 27, no. 30, Jul. 2016, Art. no. 305201, doi: 10.1088/0957-4484/27/30/305201.
[20] G. W. Burr et al., “Phase change memory technology,” J. Vac. Sci. Technol. B, vol. 28, no. 2, pp. 223–262, Mar. 2010, doi: 10.1116/1.3301579.
[21] G. Servalli, “A 45nm generation phase change memory technology,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2009, pp. 1–4, doi: 10.1109/IEDM.2009.5424409.
[22] D. Ielmini, A. L. Lacaita, and D. Mantegazza, “Recovery and drift dynamics of resistance and threshold voltages in phase-change memories,” IEEE Trans. Electron Devices, vol. 54, no. 2, pp. 308–315, Feb. 2007, doi: 10.1109/ted.2006.888752.
[23] T. Böscke, J. Müller, D. Bräuhaus, U. Schröder, and U. Böttger, “Ferroelectricity in hafnium oxide thin films,” Appl. Phys. Lett., vol. 99, no. 10, Sep. 2011, Art. no. 102903, doi: 10.1063/1.3634052.
[24] H. Mulaosmanovic et al., “Novel ferroelectric FET based synapse for neuromorphic systems,” in Proc. IEEE Symp. VLSI Technol., 2017, pp. T176–T177, doi: 10.23919/VLSIT.2017.7998165.
[25] M. Jerry et al., “Ferroelectric FET analog synapse for acceleration of deep neural network training,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2017, pp. 6.2.1–6.2.4, doi: 10.1109/IEDM.2017.8268338.
[26] J. Luo et al., “A novel ferroelectric FET-based adaptively-stochastic neuron for stimulated-annealing based optimizer with ultra-low hardware cost,” IEEE Electron Device Lett., vol. 43, no. 2, pp. 308–311, Feb. 2022, doi: 10.1109/LED.2021.3138765.
[27] H. Mulaosmanovic, E. Chicca, M. Bertele, T. Mikolajick, and S. Slesazeck, “Mimicking biological neurons with a nanoscale ferroelectric transistor,” Nanoscale, vol. 10, no. 46, pp. 21,755–21,763, 2018, doi: 10.1039/C8NR07135G.
[28] J. Tang et al., “ECRAM as scalable synaptic cell for high-speed, low-power neuromorphic computing,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2018, pp. 13.1.1–13.1.4, doi: 10.1109/IEDM.2018.8614551.
[29] S. Kim et al., “Metal-oxide based, CMOS-compatible ECRAM for deep learning accelerator,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2019, pp. 35.7.1–35.7.4, doi: 10.1109/IEDM19573.2019.8993463.
[30] S. K. Saha, “Design considerations for sub-90-nm split-gate flash-memory cells,” IEEE Trans. Electron Devices, vol. 54, no. 11, pp. 3049–3055, Dec. 2007, doi: 10.1109/TED.2007.907265.
[31] T.-H. Hsu et al., “A vertical split-gate flash memory featuring high-speed source-side injection programming, read disturb free, and 100K endurance for embedded flash (eFlash) scaling and computing-in-memory (CIM),” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2020, pp. 6.3.1–6.3.4, doi: 10.1109/IEDM13553.2020.9372036.
[32] S. Slesazeck, T. Ravsher, V. Havel, E. T. Breyer, H. Mulaosmanovic, and T. Mikolajick, “A 2TnC ferroelectric memory gain cell suitable for compute-in-memory and neuromorphic application,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2019, pp. 38.6.1–38.6.4, doi: 10.1109/IEDM19573.2019.8993663.
[33] J. Yang et al., “A 9Mb HZO-based embedded FeRAM with 10^12-cycle endurance and 5/7ns read/write using ECC-assisted data refresh and offset-canceled sense amplifier,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2023, pp. 1–3, doi: 10.1109/ISSCC42615.2023.10067752.
[34] H. Cai et al., “33.4 A 28nm 2Mb STT-MRAM computing-in-memory macro with a refined bit-cell and 22.4 - 41.5TOPS/W for AI inference,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2023, pp. 500–502, doi: 10.1109/ISSCC42615.2023.10067339.
[35] M. Zhao, B. Gao, J. Tang, H. Qian, and H. Wu, “Reliability of analog resistive switching memory for neuromorphic computing,” Appl. Phys. Rev., vol. 7, no. 1, Jan. 2020, Art. no. 011301, doi: 10.1063/1.5124915.
[36] M. Rao et al., “Thousands of conductance levels in memristors integrated on CMOS,” Nature, vol. 615, no. 7954, pp. 823–829, Mar. 2023, doi: 10.1038/s41586-023-05759-5.
[37] S. Kim, C. Du, P. Sheridan, W. Ma, S. Choi, and W. D. Lu, “Experimental demonstration of a second-order memristor and its ability to biorealistically implement synaptic plasticity,” Nano Lett., vol. 15, no. 3, pp. 2203–2211, Mar. 2015, doi: 10.1021/acs.nanolett.5b00697.
[38] M. D. Pickett, G. Medeiros-Ribeiro, and R. S. Williams, “A scalable neuristor built with Mott memristors,” Nature Mater., vol. 12, no. 2, pp. 114–117, Feb. 2013, doi: 10.1038/nmat3510.
[39] W. Yi, K. K. Tsang, S. K. Lam, X. Bai, J. A. Crowell, and E. A. Flores, “Biological plausibility and stochasticity in scalable VO2 active memristor neurons,” Nature Commun., vol. 9, no. 1, Nov. 2018, Art. no. 4661, doi: 10.1038/s41467-018-07052-w.
[40] X. Li et al., “Power-efficient neural network with artificial dendrites,” Nature Nanotechnol., vol. 15, no. 9, pp. 776–782, Sep. 2020, doi: 10.1038/s41565-020-0722-5.
[41] S. Diware, A. Singh, A. Gebregiorgis, R. V. Joshi, S. Hamdioui, and R. Bishnoi, “Accurate and energy-efficient bit-slicing for RRAM-based neural networks,” IEEE Trans. Emerg. Topics Comput. Intell., vol. 7, no. 1, pp. 164–177, Feb. 2023, doi: 10.1109/TETCI.2022.3191397.
[42] Q. Liu et al., “33.2 A fully integrated analog ReRAM based 78.4TOPS/W compute-in-memory chip with fully parallel MAC computing,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2020, pp. 500–502, doi: 10.1109/ISSCC19947.2020.9062953.
[43] C.-X. Xue et al., “24.1 A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN based AI edge processors,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2019, pp. 388–390, doi: 10.1109/ISSCC.2019.8662395.
[44] W.-H. Huang et al., “A nonvolatile AI-edge processor with 4MB SLC-MLC hybrid-mode ReRAM compute-in-memory macro and 51.4-251TOPS/W,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2023, pp. 15–17, doi: 10.1109/ISSCC42615.2023.10067610.
[45] P. Yao et al., “Fully hardware-implemented memristor convolutional neural network,” Nature, vol. 577, no. 7792, pp. 641–646, Jan. 2020, doi: 10.1038/s41586-020-1942-4.
[46] W. Ye et al., “A 28nm hybrid 2T1R RRAM computing-in-memory macro for energy-efficient AI edge inference,” in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), 2022, pp. 2–4, doi: 10.1109/A-SSCC56115.2022.9980669.
[47] R. H. Walden, “Analog-to-digital converter survey and analysis,” IEEE J. Sel. Areas Commun., vol. 17, no. 4, pp. 539–550, Apr. 1999, doi: 10.1109/49.761034.
[48] T. Gokmen and Y. Vlasov, “Acceleration of deep neural network training with resistive cross-point devices: Design considerations,” Frontiers Neurosci., vol. 10, Jul. 2016, Art. no. 333, doi: 10.3389/fnins.2016.00333.
[49] W. Zhang et al., “Design guidelines of RRAM based neural-processing-unit: A joint device-circuit-algorithm analysis,” in Proc. 56th Annu. Des. Autom. Conf., 2019, pp. 1–6.
[50] T. J. Hamilton, S. Afshar, A. van Schaik, and J. Tapson, “Stochastic electronics: A neuro-inspired design paradigm for integrated circuits,” Proc. IEEE, vol. 102, no. 5, pp. 843–859, May 2014, doi: 10.1109/JPROC.2014.2310713.
[51] P. N. Steinmetz, A. Manwani, C. Koch, M. London, and I. Segev, “Subthreshold voltage noise due to channel fluctuations in active neuronal membranes,” J. Comput. Neurosci., vol. 9, no. 2, pp. 133–148, Sep. 2000, doi: 10.1023/A:1008967807741.
[52] J.-M. Hung et al., “An 8-Mb DC-current-free binary-to-8b precision ReRAM nonvolatile computing-in-memory macro using time-space-readout with 1286.4-21.6TOPS/W for edge-AI devices,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2022, vol. 65, pp. 1–3, doi: 10.1109/ISSCC42614.2022.9731715.
[53] P.-Y. Chen et al., “Technology-design co-optimization of resistive cross-point array for accelerating learning algorithms on chip,” in Proc. Des., Automat. Test Europe Conf. Exhib. (DATE), 2015, pp. 854–859.
[54] C.-X. Xue et al., “15.4 A 22nm 2Mb ReRAM compute-in-memory macro with 121-28TOPS/W for multibit MAC computing for tiny AI edge devices,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2020, pp. 244–246, doi: 10.1109/ISSCC19947.2020.9063078.
[55] W. Wan et al., “A compute-in-memory chip based on resistive random-access memory,” Nature, vol. 608, no. 7923, pp. 504–512, Aug. 2022, doi: 10.1038/s41586-022-04992-8.
[56] P. Chen et al., “7.8 A 22nm delta-sigma computing-in-memory (Δ∑CIM) SRAM macro with near-zero-mean outputs and LSB-first ADCs achieving 21.38TOPS/W for 8b-MAC edge AI processing,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2023, pp. 140–142, doi: 10.1109/ISSCC42615.2023.10067289.
[57] S.-E. Hsieh et al., “7.6 A 70.85-86.27TOPS/W PVT-insensitive 8b word-wise ACIM with post-processing relaxation,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2023, pp. 136–138, doi: 10.1109/ISSCC42615.2023.10067335.
[58] P.-C. Wu et al., “A 28nm 1Mb time-domain computing-in-memory 6T-SRAM macro with a 6.6ns latency, 1241GOPS and 37.01TOPS/W for 8b-MAC operations for edge-AI devices,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2022, vol. 65, pp. 1–3, doi: 10.1109/ISSCC42614.2022.9731681.
[59] J.-H. Yoon, M. Chang, W.-S. Khwa, Y.-D. Chih, M.-F. Chang, and A. Raychowdhury, “A 40-nm 118.44-TOPS/W voltage-sensing compute-in-memory RRAM macro with write verification and multi-bit encoding,” IEEE J. Solid-State Circuits, vol. 57, no. 3, pp. 845–857, Mar. 2022, doi: 10.1109/JSSC.2022.3141370.
[60] C. Li et al., “Analogue signal and image processing with large memristor crossbars,” Nature Electron., vol. 1, no. 1, pp. 52–59, Dec. 2017, doi: 10.1038/s41928-017-0002-z.
[61] W. Zhang, B. Gao, P. Yao, J. Tang, H. Wu, and H. Qian, “A circuit-algorithm codesign method to reduce the accuracy drop of RRAM based computing-in-memory chip,” in Proc. IEEE Int. Conf. Integr. Circuits, Technol. Appl. (ICTA), 2020, pp. 108–109, doi: 10.1109/ICTA50426.2020.9332118.
[62] B. Yan, M. Liu, Y. Chen, K. Chakrabarty, and H. Li, “On designing efficient and reliable nonvolatile memory-based computing-in-memory accelerators,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2019, pp. 14.5.1–14.5.4, doi: 10.1109/IEDM19573.2019.8993562.
[63] X. Peng, S. Huang, H. Jiang, A. Lu, and S. Yu, “DNN+NeuroSim V2.0: An end-to-end benchmarking framework for compute-in-memory accelerators for on-chip training,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 40, no. 11, pp. 2306–2319, Nov. 2021, doi: 10.1109/TCAD.2020.3043731.
[64] Y. Jiang et al., “HARNS: High-level architectural model of RRAM based computing-in-memory NPU,” in Proc. IEEE Int. Conf. Integr. Circuits, Technol. Appl. (ICTA), 2021, pp. 35–36, doi: 10.1109/ICTA53157.2021.9661827.
[65] M. Prezioso, F. Merrikh-Bayat, B. D. Hoskins, G. C. Adam, K. K. Likharev, and D. B. Strukov, “Training and operation of an integrated neuromorphic network based on metal-oxide memristors,” Nature, vol. 521, no. 7550, pp. 61–64, May 2015, doi: 10.1038/nature14441.
[66] P. M. Sheridan, F. Cai, C. Du, W. Ma, Z. Zhang, and W. D. Lu, “Sparse coding with memristor networks,” Nature Nanotechnol., vol. 12, no. 8, pp. 784–789, Aug. 2017, doi: 10.1038/nnano.2017.83.
[67] C. Li et al., “Efficient and self-adaptive in-situ learning in multilayer memristor neural networks,” Nature Commun., vol. 9, no. 1, Jun. 2018, Art. no. 2385, doi: 10.1038/s41467-018-04484-2.
[68] C. Li et al., “Long short-term memory networks in memristor crossbar arrays,” Nature Mach. Intell., vol. 1, no. 1, pp. 49–57, Jan. 2019, doi: 10.1038/s42256-018-0001-4.
[69] F. Su et al., “A 462GOPs/J RRAM-based nonvolatile intelligent processor for energy harvesting IoE system featuring nonvolatile logics and processing-in-memory,” in Proc. IEEE Symp. VLSI Technol., 2017, pp. T260–T261, doi: 10.23919/VLSIT.2017.7998149.
[70] W.-H. Chen et al., “A 65nm 1Mb nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2018, pp. 494–496, doi: 10.1109/ISSCC.2018.8310400.
[71] W.-S. Khwa et al., “A 40-nm, 2M-cell, 8b-precision, hybrid SLC-MLC PCM computing-in-memory macro with 20.5 - 65.0TOPS/W for tiny-AI edge devices,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2022, vol. 65, pp. 1–3, doi: 10.1109/ISSCC42614.2022.9731670.
[72] Q. Huo et al., “A computing-in-memory macro based on three-dimensional resistive random-access memory,” Nature Electron., vol. 5, no. 7, pp. 469–477, Jul. 2022, doi: 10.1038/s41928-022-00795-x.
[73] J. M. Correll et al., “An 8-bit 20.7 TOPS/W multi-level cell ReRAM-based compute engine,” in Proc. IEEE Symp. VLSI Technol. Circuits (VLSI Technol. Circuits), 2022, pp. 264–265, doi: 10.1109/VLSITechnologyandCir46769.2022.9830490.
[74] P. Deaville, B. Zhang, and N. Verma, “A 22nm 128-KB MRAM row/column-parallel in-memory computing macro with memory-resistance boosting and multi-column ADC readout,” in Proc. IEEE Symp. VLSI Technol. Circuits (VLSI Technol. Circuits), 2022, pp. 268–269, doi: 10.1109/VLSITechnologyandCir46769.2022.9830153.
[75] Y. Zhang et al., “Highly compact artificial memristive neuron with low energy consumption,” Small, vol. 14, no. 51, Dec. 2018, Art. no. 1802188, doi: 10.1002/smll.201802188.
[76] C. Sun et al., “Novel a-IGZO anti-ferroelectric FET LIF neuron with co-integrated ferroelectric FET synapse for spiking neural networks,” in Proc. Int. Electron Devices Meeting (IEDM), 2022, pp. 2.1.1–2.1.4, doi: 10.1109/IEDM45625.2022.10019526.
[77] Y. Pang, H. Wu, B. Gao, D. Wu, A. Chen, and H. Qian, “A novel PUF against machine learning attack: Implementation on a 16 Mb RRAM chip,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2017, pp. 12.2.1–12.2.4, doi: 10.1109/IEDM.2017.8268376.
[78] B. Gao, B. Lin, X. Li, J. Tang, H. Qian, and H. Wu, “A unified PUF and TRNG design based on 40-nm RRAM with high entropy and robustness for IoT security,” IEEE Trans. Electron Devices, vol. 69, no. 2, pp. 536–542, Feb. 2022, doi: 10.1109/TED.2021.3138365.
[79] B. Lin et al., “A high-speed and high-reliability TRNG based on analog RRAM for IoT security application,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2019, pp. 14.8.1–14.8.4, doi: 10.1109/IEDM19573.2019.8993486.
[80] K. Yang, Q. Duan, Y. Wang, T. Zhang, Y. Yang, and R. Huang, “Transiently chaotic simulated annealing based on intrinsic nonlinearity of memristors for efficient solution of optimization problems,” Sci. Adv., vol. 6, no. 33, Aug. 2020, Art. no. eaba9901, doi: 10.1126/sciadv.aba9901.
[81] X. Li et al., “A memristors-based dendritic neuron for high-efficiency spatial-temporal information processing,” Adv. Mater., early access, Jun. 2022, doi: 10.1002/adma.202203684.
[82] Z. Wang et al., “Fully memristive neural networks for pattern classification with unsupervised learning,” Nature Electron., vol. 1, no. 2, pp. 137–145, Feb. 2018, doi: 10.1038/s41928-018-0023-2.
[83] Y. Fu et al., “Forming-free and annealing-free V/VOx/HfWOx/Pt device exhibiting reconfigurable threshold and resistive switching with high speed (<30 ns) and high endurance (>10^12/>10^10),” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2021, pp. 12.6.1–12.6.4, doi: 10.1109/IEDM19574.2021.9720551.
[84] X. Zhang et al., “Hybrid memristor-CMOS neurons for in-situ learning in fully hardware memristive spiking neural networks,” Sci. Bull., vol. 66, no. 16, pp. 1624–1633, Aug. 2021, doi: 10.1016/j.scib.2021.04.014.
[85] Y. Zhong et al., “A memristor-based analogue reservoir computing system for real-time and power-efficient signal processing,” Nature Electron., vol. 5, no. 10, pp. 672–681, Oct. 2022, doi: 10.1038/s41928-022-00838-3.
[86] X. Zhang et al., “Fully memristive SNNs with temporal coding for fast and low-power edge computing,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2020, pp. 29.6.1–29.6.4, doi: 10.1109/IEDM13553.2020.9371937.
[87] M. Chang et al., “A 40nm 60.64TOPS/W ECC-capable compute-in-memory/digital 2.25MB/768KB RRAM/SRAM system with embedded cortex m3 microprocessor for edge recommendation systems,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2022, vol. 65, pp. 1–3, doi: 10.1109/ISSCC42614.2022.9731679.
[88] Y. Li et al., “Monolithic 3D integration of logic, memory and computing-in-memory for one-shot learning,” in Proc. IEEE Int. Electron Devices Meeting (IEDM), 2021, pp. 21.5.1–21.5.4, doi: 10.1109/IEDM19574.2021.9720534.
[89] M. Chang et al., “A 73.53TOPS/W 14.74TOPS heterogeneous RRAM in-memory and SRAM near-memory SoC for hybrid frame and event-based target tracking,” in Proc. IEEE Int. Solid-State Circuits Conf. (ISSCC), 2023, pp. 426–428, doi: 10.1109/ISSCC42615.2023.10067544.
[90] N. R. Shanbhag and S. K. Roy, “Comprehending in-memory computing trends via proper benchmarking,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), 2022, pp. 1–7, doi: 10.1109/CICC53496.2022.9772817.
Digital Object Identifier 10.1109/MED.2023.3296084
Date of current version: 15 September 2023