Ming Lei, Daning Yang, Xiaoming Weng
This article presents an integrated sensor fusion (ISF) solution based on the multiple-input, multiple-output (MIMO) radar, camera, and on-device computing. The MIMO radar is capable of estimating an object’s attributes in four dimensions—range, velocity, azimuth angle, and elevation angle—which can be further used to estimate the length, width, and height of the object. The camera is responsible for object classification based on deep learning. The respective signal processing pipelines and the fusion of results are carried out by the on-device computing platform. These two sensors complement each other very well in detecting and classifying traffic objects. Compared with existing sensor fusion solutions based on multiple distributed devices, ISF exhibits superior performance in terms of latency and the total cost of ownership (TCO). It also simplifies time synchronization among different sensors and facilitates the deeper fusion of the signal processing algorithms of different sensors. The comprehensive roadside sensing capabilities provided by the ISF solution can enhance the safety and efficiency of both the automated driving and human driving of connected vehicles.
Roadside traffic sensors can complement vehicle sensors in perceiving the traffic environment because they can provide a broader view and different angles of view. This is crucial for enhancing the safety and efficiency of both automated driving and human driving. As Figure 1 shows, roadside sensing analytic results (e.g., vehicle velocity and heading, traffic event category and impact scope, and so on) are generated by roadside edge computing equipment, which analyzes raw data collected by roadside traffic sensors, such as cameras, radars, and lidars. These analytic results (usually structured data) are packed into the various types of infrastructure-to-vehicle (I2V) messages and transmitted to connected vehicles by a roadside unit (RSU) via a cellular vehicle-to-everything (C-V2X) wireless link (i.e., the PC5 sidelink) [1]. These I2V messages can benefit not only connected vehicles but also other road traffic participants, such as vulnerable road users. C-V2X greatly facilitates the rapid development of roadside sensing.
Figure 1 The connected vehicle applications based on roadside sensing, roadside computing, and C-V2X technology. gNB: 5G base station; eNB: 4G base station; mm-Wave: millimeter-wave; RSU: roadside unit; MEC: mobile edge computing; I2V: infrastructure-to-vehicle; V2V: vehicle-to-vehicle.
Cameras, radars, and lidars are the most often used traffic sensors in intelligent transportation systems (ITS), but each has its own advantages and disadvantages. No single type of traffic sensor can meet all the requirements, and that is where sensor fusion comes in [2], [3], [4]. Sensor fusion refers to the collaboration of two or more sensors to complete object detection and classification tasks. Because the involved sensors complement one another in functions, sensor fusion offers more comprehensive sensing capabilities and better performance than a single sensor. According to the system architecture detailed in Figure 2, we classify roadside sensor fusion solutions into two types: ISF, in which the sensors and the edge computing platform are integrated in a single device, and distributed sensor fusion (DSF), in which they are split across multiple devices connected by a network.
Figure 2 The roadside infrastructure supporting connected vehicles. PON: passive optical network; ITMS: intelligent traffic management system; OBU: onboard unit.
Due to its advantage of low cost (summarized in Table 1), the roadside sensor fusion of radar and camera is becoming a thriving research area in both academia and industry. Liu et al. [3] investigated an evidential architecture for radar and camera fusion to manage uncertain and conflicting data caused by challenging environments. Du et al. [4] investigated an optimization model to solve the spatiotemporal synchronization problem for fusion. Bai et al. [2] developed an experimental DSF system to demonstrate promising results for object detection and tracking by using fusion based on an improved Gaussian mixture probability hypothesis density filter. These recent works explore important topics for advancing this technology.
Table 1 The advantages and disadvantages of traffic sensors.
On the other hand, to support the rapid development of ITS, the market urgently needs solutions that can meet the very challenging practical requirements of low latency, easy time synchronization and spatial calibration, robustness against environmental influences, support for advanced fusion algorithms, and a low TCO. We are highly motivated to solve these challenges by leveraging mature results from the research community. To the best of our knowledge, our market-adopted solution is the first one in the industry to meet the preceding requirements simultaneously.
In this article, we present an ISF solution that integrates camera, radar, and computing platform in one device for roadside sensing. The different steps of the signal processing pipelines for these two sensors are processed by the on-device computing platform, which also fuses the results of these two pipelines at the object level. This solution and its future evolutions can not only enhance the automated driving and human driving of connected vehicles but also have huge potential for applications of intelligent traffic management systems, such as traffic signal control and roadway digital twins (Figure 2), which are the building blocks of a holographic traffic system.
The system architecture of the roadside infrastructure supporting connected vehicles is presented in Figure 2. It has three basic functions: roadside sensing (traffic sensors, such as cameras, radars, and lidars), roadside edge computing (signal processing and the fusion of results), and roadside communication (RSUs delivering I2V messages to connected vehicles via C-V2X).
In practical implementations, the two functions of roadside sensing and roadside edge computing can either be integrated in one single device (ISF) or distributed among multiple split devices connected by a network (DSF). ISF has many advantages over DSF, including lower latency, simpler time synchronization and spatial calibration, greater robustness against environmental influences, support for deeper fusion of the sensors’ signal processing algorithms, and a lower TCO; these advantages are detailed in the following sections.
In the transportation domain, the most commonly used traffic sensors are cameras, radars, and lidars. Their respective advantages and disadvantages are summarized in Table 1. From this comparison, we can see that radar and camera complement each other very well in providing comprehensive sensing capabilities for connected vehicle applications. Their respective costs and signal processing complexities are lower than those of lidar, which makes cost-effective solutions possible. Therefore, we adopt radar and camera in our ISF solution.
The radar in our solution features a MIMO antenna array, the millimeter-wave (mm-wave) band, and a frequency-modulated continuous wave (FMCW) waveform [5]. Since the MIMO radar [6], [7] can estimate an object’s attributes in 4D—range, velocity, azimuth angle, and elevation angle—it is also known as 4D radar. Some of these attribute parameters can be processed to generate the length, width, and height of the object. Therefore, in addition to the advantages summarized in Table 1, MIMO radar makes up for the disadvantages of traditional radar: it can estimate the elevation angle and hence the height of an object, and it can generate a point cloud from which the object’s length, width, and height are derived to assist object detection and classification.
Compared with low-frequency counterparts, radar operating in the mm-wave band has a large system bandwidth, which is key to improving the resolution of range estimation. The FMCW waveform can greatly simplify radar signal processing in estimating the range, velocity, and angles of arrival (AoAs) (azimuth and elevation). The camera in our solution is used for the classification of traffic objects, traffic events, and traffic signs and signals.
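For reference, the standard FMCW relations behind these statements (with symbols introduced here only for illustration: sweep bandwidth $B$, chirp duration $T_c$, beat frequency $f_b$, Doppler shift $f_d$, and speed of light $c$) are
\[
\Delta R = \frac{c}{2B}, \qquad R = \frac{c\,T_c\,f_b}{2B}, \qquad v = \frac{\lambda f_d}{2}.
\]
For example, a sweep bandwidth of $B = 1$ GHz yields a range resolution of $\Delta R = 15$ cm, which illustrates why the wide bandwidth available in the mm-wave band is key to fine range estimation.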
Roadside edge computing is responsible for the signal processing of the involved sensors and the fusion of the results. As demonstrated in Figure 2, it can be in either the device hosting the sensors (ISF) or a split device (e.g., roadside MEC) connected with the sensors via the network (DSF). Connected vehicle applications related to safety, e.g., automated driving, have strict requirements for end-to-end (E2E) latency. Roadside edge computing deployed near roadside sensing (traffic sensors) and roadside communication (RSUs) has a better guarantee for low E2E latency.
As shown in Figure 3, the sensor fusion flowchart includes three major parts: radar signal processing, image signal processing, and the fusion of results. This applies to both ISF and DSF solutions. The difference is that video decoding is not needed in the image signal processing of ISF because the camera does not need video coding to compress the data, which are transmitted to the on-device computing platform in the same device via an internal high-speed interface [e.g., the Mobile Industry Processor Interface (MIPI)]. In DSF, by contrast, video has to be encoded (compressed) by the camera before it is transmitted to the computing device via a bandwidth-limited network (e.g., Ethernet). Moreover, network switches and other equipment cause further delay in DSF. Therefore, the latency of ISF can be significantly shorter than that of DSF.
Figure 3 The MIMO radar and camera sensor fusion. RF: radio frequency; ADC: analog-to-digital converter; FFT: fast Fourier transform; CFAR: constant false alarm rate.
MIMO radar is capable of estimating an object’s attributes in 4D: range, velocity, azimuth angle, and elevation angle. The signal processing steps given in Figure 3 are grouped into the following sections.
Moving target detection (MTD) includes the first- and second-dimension windowed fast Fourier transforms (FFTs), which are used to estimate an object’s range and velocity, respectively. FFTs are computation intensive, and therefore, dedicated digital signal processors (DSPs) are used. The complexity of computing the FFTs is $O(N \log N)$, where $N$ is the FFT length. We can shorten the FFT length by reducing the chirp duration of the FMCW waveform with a fixed system bandwidth. In addition to faster processing, this also brings the benefits of a larger maximum measurable velocity and reduced impacts from the 1/f noise [8].
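A minimal sketch of this MTD step in NumPy is given below for a single virtual antenna; the array shape, window choice (Hanning), and variable names are illustrative assumptions, not the deployed DSP implementation.

```python
import numpy as np

def range_doppler_map(adc_frame: np.ndarray) -> np.ndarray:
    """adc_frame: complex ADC samples of shape (num_chirps, samples_per_chirp)."""
    num_chirps, num_samples = adc_frame.shape
    # First-dimension (fast-time) windowed FFT across each chirp -> range bins
    range_fft = np.fft.fft(adc_frame * np.hanning(num_samples), axis=1)
    # Second-dimension (slow-time) windowed FFT across chirps -> Doppler (velocity) bins
    doppler_fft = np.fft.fft(range_fft * np.hanning(num_chirps)[:, None], axis=0)
    rd_map = np.fft.fftshift(doppler_fft, axes=0)   # center zero Doppler
    return np.abs(rd_map) ** 2                      # power heat map fed to CFAR after accumulation
```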
The input data of constant false alarm rate (CFAR) detection consist of the incoherent accumulation of the range–Doppler heat maps generated by MTD across the virtual antenna array. We use the clutter map algorithm, which divides the clutter map into sectors according to the different ranges and Doppler frequencies. The clutter map is a type of time series data, and it is processed by exponential smoothing for predicting CFAR thresholds. CFAR detection based on the clutter map algorithm exhibits good environmental adaptability and low computation complexity.
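The sketch below illustrates the clutter map idea, assuming the map holds one cell per range–Doppler sector; the smoothing factor and threshold scale are placeholder values, not the calibrated parameters of the deployed system.

```python
import numpy as np

class ClutterMapCFAR:
    """Clutter-map CFAR: thresholds predicted by exponential smoothing across frames."""

    def __init__(self, alpha: float = 0.1, scale: float = 8.0):
        self.alpha = alpha        # exponential smoothing factor
        self.scale = scale        # threshold multiplier (controls the false alarm rate)
        self.clutter = None       # per-sector clutter power estimate

    def detect(self, rd_power: np.ndarray) -> np.ndarray:
        if self.clutter is None:
            self.clutter = rd_power.copy()            # initialize from the first frame
        detections = rd_power > self.scale * self.clutter
        # Update the clutter map for the next frame (exponential smoothing)
        self.clutter = (1.0 - self.alpha) * self.clutter + self.alpha * rd_power
        return detections
```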
AoA estimation is related to the antenna array design and estimation algorithm. We use a sparse 2D MIMO antenna array to estimate the azimuth angle and elevation angle of objects [9]. According to the requirements of the radio-frequency front end, we purposely set the position of the repeated virtual array so that the array has the minimum redundancy and a short feeder length for reducing the signal transmission loss. The improved fast back projection algorithm [10] is used for AoA estimation. It is not limited to the specific antenna array and has low computation complexity and satisfying accuracy.
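Reproducing the improved fast back projection algorithm [10] on the sparse 2D array is beyond the scope of this article; as a much simpler stand-in, the sketch below estimates the azimuth of one detected range–Doppler cell with a plain FFT over an assumed uniform $\lambda/2$-spaced virtual array.

```python
import numpy as np

def azimuth_from_snapshot(snapshot: np.ndarray, fft_len: int = 128) -> float:
    """snapshot: complex values of one range-Doppler cell across N virtual antennas
    (assumed uniform lambda/2 spacing). Returns the azimuth estimate in degrees."""
    spectrum = np.fft.fftshift(np.fft.fft(snapshot, n=fft_len))
    k = int(np.argmax(np.abs(spectrum))) - fft_len // 2
    # With lambda/2 spacing, spatial-frequency bin k maps to sin(azimuth) = 2k / fft_len.
    sin_az = np.clip(2.0 * k / fft_len, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_az)))
```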
MIMO radar is capable of generating an object’s point cloud, with each point having 4D attributes: range, azimuth angle, elevation angle, and velocity. We use the density-based spatial clustering of applications with noise (DBSCAN) clustering algorithm [11] for processing the point cloud in the 4D space formed by these attributes. Object tracking is based on the extended Kalman filter (EKF) [2]. With a high enough sampling rate at the radar receiver, we can assume that the object is in a state of linear motion, with uniform acceleration in a short time window. This simplified model can significantly reduce the computation complexity, with negligible performance loss in tracking.
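A minimal sketch of the 4D clustering step is shown below; the per-dimension scaling and the DBSCAN parameters are illustrative assumptions that would need tuning to the radar’s actual resolution. Each non-noise label then becomes one object candidate, which is passed to the EKF-based tracker.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_point_cloud(points: np.ndarray) -> np.ndarray:
    """points: (N, 4) array of [range_m, azimuth_deg, elevation_deg, velocity_mps]."""
    # Scale each attribute so that one unit roughly corresponds to its resolution.
    scales = np.array([1.0, 2.0, 2.0, 0.5])            # hypothetical values
    labels = DBSCAN(eps=1.5, min_samples=4).fit_predict(points / scales)
    return labels                                       # label -1 marks noise points
```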
Three attributes (range, azimuth angle, and elevation angle) are the spherical coordinates determining the position of every point in radar’s spherical CS, which can be converted to 3D coordinates (x, y, and z) in radar’s Cartesian CS. For every object (i.e., a group of clustered points) in the current frame, a bounding box can be generated, and its 3D metrics (length, width, and height) can be obtained. The 3D metrics estimated from the current frame need to be smoothed by using prior observations, i.e., the length, width, and height, estimated from previous frames, which can be obtained from the object-to-track association at the tracking step. Due to its low computation complexity and satisfying performance, we use exponential smoothing as the low-pass filter to reduce the high-frequency noise caused by estimation errors in the current frame. The smoothed 3D metrics are used to filter out the fake targets and therefore reduce the false alarm rate. With enough spatial resolution, MIMO radar has the potential for object classification.
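The conversion and smoothing described above can be sketched as follows; the angle conventions (azimuth measured from boresight in the horizontal plane, elevation from the horizontal plane) and the smoothing factor are assumptions for illustration.

```python
import numpy as np

def spherical_to_cartesian(rng, az_deg, el_deg):
    """Convert radar spherical coordinates to (x, y, z) in the radar's Cartesian CS."""
    az, el = np.radians(az_deg), np.radians(el_deg)
    x = rng * np.cos(el) * np.sin(az)   # lateral offset
    y = rng * np.cos(el) * np.cos(az)   # distance along the radar boresight
    z = rng * np.sin(el)                # height
    return np.stack([x, y, z], axis=-1)

def smooth_3d_metrics(cluster_xyz, prev_lwh=None, beta=0.2):
    """Bounding-box length/width/height of one clustered object, smoothed across frames."""
    lwh_now = cluster_xyz.max(axis=0) - cluster_xyz.min(axis=0)
    if prev_lwh is None:                 # first observation of this track
        return lwh_now
    # Exponential smoothing with the track's previous length/width/height
    return (1.0 - beta) * np.asarray(prev_lwh) + beta * lwh_now
```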
For connected vehicle applications, the latency, throughput, and accuracy of the image signal processing need to satisfy certain requirements, while the system cost and power consumption need to be kept as low as possible. In our solution, the detection and classification of traffic objects are based on the deep learning inference of video images generated by the camera, which requires relatively high computing power. We also need to accelerate the computer vision tasks, such as resizing and color space conversion in the pipeline.
The You Only Look Once version 5 (YOLOv5) neural network model [12] is used for deep learning inference. YOLOv5 inherits the advantages of the one-stage algorithms of the YOLO family. Because there is no need for the region proposal required by two-stage algorithms, and since the dual tasks of object detection and object classification can be completed in one inference process, the YOLO family models can achieve higher inference throughputs. Compared to its previous generation, YOLOv4, YOLOv5 is lighter, faster, and more precise.
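A hedged sketch of running inference with the public Ultralytics YOLOv5 release [12] is shown below; the pretrained 'yolov5s' weights, the confidence threshold, and the image path are placeholders, since the model described in this article is trained on a roadside dataset with five traffic-object classes.

```python
import torch

# Load a pretrained YOLOv5 model from the public repository (placeholder weights).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
model.conf = 0.4                                 # confidence threshold (assumed value)

results = model('intersection_frame.jpg')        # one inference: detection + classification
boxes = results.pandas().xyxy[0]                 # one row per detected object
print(boxes[['name', 'confidence', 'xmin', 'ymin', 'xmax', 'ymax']])
```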
The ISF solution does not have the time synchronization problem because the time stamps of signals from different sensors are generated by the same on-device computing platform; i.e., the same clock source is used when signals from different sensors reach the same on-device computing platform for digital signal processing. In DSF, the time stamps of signals from different sensors are generated by different computing platforms in split devices; therefore, time synchronization is a problem that must be solved. The network time protocol (NTP) can achieve only millisecond-level synchronization accuracy, while the more accurate precision time protocol (PTP) requires specific hardware support (e.g., hardware time stamping), which means a much higher implementation cost and additional complexity.
Each sensor in the sensor fusion solution has its own CS, and therefore, we need spatial calibration to determine their relation. Since MIMO radar is used, we extend the spatial calibration method in [13] to estimate the transformation matrix between the 3D radar coordinates and 2D image coordinates. This single transformation matrix represents the combined effects of the extrinsic parameters (translation and rotation between radar and camera) and the intrinsic parameters of the camera. In ISF, the radar and camera can share the same coordinate origin, and their spatial relation (translation and rotation) is relatively fixed, given that they are colocated in the same device. This makes spatial calibration much easier in ISF than in DSF. Moreover, ISF is more robust against environmental influences that cause sensor displacement.
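A standard way to obtain such a single transformation matrix is the direct linear transform (DLT) over a set of radar–image point correspondences; the sketch below follows this generic approach and is not necessarily the exact calibration procedure used in the deployed system.

```python
import numpy as np

def estimate_projection(radar_xyz: np.ndarray, pixels_uv: np.ndarray) -> np.ndarray:
    """Estimate the 3x4 matrix P mapping homogeneous radar points to image pixels.
    radar_xyz: (N, 3), pixels_uv: (N, 2), with N >= 6 correspondences."""
    rows = []
    for (x, y, z), (u, v) in zip(radar_xyz, pixels_uv):
        rows.append([x, y, z, 1, 0, 0, 0, 0, -u * x, -u * y, -u * z, -u])
        rows.append([0, 0, 0, 0, x, y, z, 1, -v * x, -v * y, -v * z, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 4)          # encodes intrinsic and extrinsic effects jointly

def project_to_image(P: np.ndarray, xyz: np.ndarray) -> np.ndarray:
    """Project (N, 3) radar points into (N, 2) pixel coordinates."""
    homogeneous = np.c_[xyz, np.ones(len(xyz))] @ P.T
    return homogeneous[:, :2] / homogeneous[:, 2:3]
```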
In the data fusion shown in Figure 3, the radar’s Cartesian CS is selected as the common CS. Both the radar and the image signal processing pipelines produce their respective bounding boxes [14] of the objects and the coordinates of their centers. We project the image results into the common CS (i.e., the radar’s Cartesian CS) and calculate the weighted sum of the differences among the 3D metrics (length, width, and height) of the two bounding boxes and the distance between the two centers. If the outcome is smaller than a preset threshold, the two tracks generated by the radar and camera, respectively, are associated (i.e., track-to-track association), as they are determined to be from the same target. Then, the object-level results of the two sensors are fused. The fusion results include the object’s category, length, width, height, position, velocity, and heading.
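The association test described above can be sketched as follows; the weights and gate threshold are illustrative assumptions that would be tuned per deployment.

```python
import numpy as np

def tracks_associated(radar_track: dict, camera_track: dict,
                      w_lwh=(0.3, 0.3, 0.3), w_center=1.0, gate=3.0) -> bool:
    """Both tracks are expressed in the common CS (the radar's Cartesian CS):
    each dict holds 'lwh' (length, width, height) and 'center' (x, y, z)."""
    d_lwh = np.abs(np.asarray(radar_track['lwh']) - np.asarray(camera_track['lwh']))
    d_center = np.linalg.norm(np.asarray(radar_track['center'])
                              - np.asarray(camera_track['center']))
    cost = float(np.dot(w_lwh, d_lwh) + w_center * d_center)
    return cost < gate    # same target: fuse the object-level results of both sensors
```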
Compared with traditional radar, MIMO radar can estimate one more dimension of information (i.e., the elevation angle). The estimate of the elevation angle, combined with the range estimate, can be used to measure the height of an object. And this height information can effectively filter out fake targets caused by clutters reflected by the road surface. Moreover, MIMO radar is capable of estimating the length, width, and height of the object. This feature can assist object classification in sensor fusion because common traffic objects (e.g., cars, buses, trucks, motorcycles, and pedestrians) have their own typical 3D metrics.
We evaluate two systems based on ISF and DSF, respectively. Both systems are installed on equipment poles at an urban intersection. Each system is composed of one MIMO radar, one camera, and one computing platform.
Detailed system parameters are listed in Table 2. Figure 4 shows the antenna array of the MIMO radar in both ISF and DSF. The unit of the horizontal axis is the minimum Rx antenna spacing, which is ${\lambda} / {2}$, and the unit of the vertical axis is ${\lambda}$, where ${\lambda}$ is the wavelength (≈ 4 mm) at 77 GHz. Based on the three transmit antennas (in the 2D array) and six receive antennas (in the 1D array), the total number of virtual antennas is 42. The horizontal aperture of the virtual array is 54 mm.
Figure 4 The 2D virtual antenna array of MIMO radar in ISF and DSF.
Table 2 The system parameters for ISF and DSF.
The on-device computing platform of the ISF solution is composed of the CPUs, DSPs, neural compute engine (NCE), input–output (I/O) interfaces, and other components. Their key features and main functions in ISF are summarized in Table 3.
Table 3 The core components of the on-device computing platform and their main functions in ISF.
Figure 5 displays the effects of the ISF solution. Object classification results are displayed in Figure 5(a). The fusion of the results with the radar detection results is displayed in Figure 5(b) (in the radar’s Cartesian CS). The velocity and heading of each object detected by the radar are visualized by a line segment associated with each object ID. The red line represents the normal direction of the radar. The ISF solution is adopted to optimize the signal phase and timing (SPaT) of the traffic lights. It significantly improves traffic efficiency. For example, the utilization rate of green lights (defined as the total vehicle passing time divided by the green light time) has increased from 45.3% (without ISF) to 78% (with ISF). The SPaT signals optimized by ISF can be sent to vehicles via C-V2X, which can further improve the traffic efficiency of the intersections.
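Expressed as a formula (with notation introduced here only for clarity), the quoted utilization rate is
\[
U_{\text{green}} = \frac{\text{total vehicle passing time}}{\text{green light time}} \times 100\%,
\]
which rose from 45.3% to 78% after ISF-based SPaT optimization at this intersection.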
Figure 5 The effects of deploying an ISF solution (one MIMO radar and one camera) at an urban intersection. (a) The object classification results based on video images. (b) The fusion of object classification results and radar detection results.
Table 4 lists the average E2E latency of both ISF and DSF. In DSF, the video coding and decoding time is included in the data transmission latency. We can observe that ISF has a significantly shorter E2E latency than DSF. This is due to the two prominent advantages of ISF: 1) no video encoding or decoding is needed, because raw images reach the on-device computing platform via the internal high-speed interface, and 2) there is no transmission over an external network with switches and other equipment.
Table 4 The average E2E latency of ISF and DSF.
ISF enables us to provide real-time roadside sensing analytic results to connected vehicles, which is crucial for traffic safety.
The YOLOv5 model is trained using a dataset with more than 60,000 images. There are five types of traffic objects (cars, buses, trucks, nonmotor vehicles, and pedestrians) in the annotation. The testing dataset consists of more than 6,000 images, with objects evenly distributed in the two range segments in Table 5. We compared the object classification results with the ground truth and then calculated the accuracy. The results are in Table 5.
Table 5 The accuracy of object classification.
The MIMO radar with a 2D antenna array can estimate one more dimension than radar without such an array: the elevation angle. It can be used to estimate the height of an object. Given that each type of traffic object has its own typical shape characteristics, the height estimate can be used to assist with object classification. This feature is particularly useful for recognizing distant objects, where the performance of object classification based on visual analytics deteriorates significantly. From the results in Table 5, we observe that ISF performs better than DSF in both range segments. This is because the video information in DSF has a longer delay and higher frame loss when transmitted through the network, resulting in occasional fusion failures.
We proposed an ISF solution based on MIMO radar, camera, and on-device computing. MIMO radar is capable of estimating an object’s 4D attributes (range, velocity, azimuth angle, and elevation angle), which can be further used to estimate the object’s length, width, and height. This feature is very useful in both object detection and object classification. The signal processing workload for both sensors (MIMO radar and camera) and the data fusion are carried out by a single on-device computing platform in the same device. These two sensors complement each other very well in functions. The proposed solution exhibits superior performance in terms of latency, object classification accuracy, and TCO. It also simplifies time synchronization among different sensors and enables deeper fusion of the signal processing algorithms of different sensors. We are also developing the next generation of ISF, featuring a versatile x86 processor and FFT and computer vision acceleration using the Integrated Performance Primitives (IPP) software library [15].
5G mobile communications have been commercialized on a large scale worldwide, laying the foundation for the interconnection of terminals (including vehicles) with a high-speed, low-latency, and extensively connected network. The connected vehicle is one of the most prominent 5G vertical industry applications, and its technology development is being accelerated. The comprehensive roadside sensing capabilities provided by the proposed solution can significantly enhance the safety and efficiency of both the automated driving and human driving of connected vehicles.
Ming Lei (ming.lei@intel.com) is a senior platform architect at Intel, Beijing 100013, China. He received his Ph.D. degree from the Beijing University of Posts and Telecommunications in 2003. His research interests include deep learning, sensor fusion, radar signal processing, and artificial intelligence in wireless communications. He is a Senior Member of IEEE.
Daning Yang (daning.yang@raysunradar.com) is with Raysun Radar Electronic Technology, Suzhou 215299, China, which he founded in 2018 and where he designs several types of radars, including marine radar and active electronically scanned array radar for drone detection. He received his M.Sc. degree in computer science from Uppsala University in 2006. His research interests include traffic radar and its fusion with cameras.
Xiaoming Weng (xiaoming.weng@raysunradar.com) is a system engineer at Raysun Radar Electronic Technology, Suzhou 215299, China. He received his M.Sc. degree from the Nanjing University of Science and Technology in 2010. His research interests include frequency-modulated continuous wave; multiple-input, multiple-output; and millimeter-wave radars used in intelligent transportation systems.
[1] R. Molina-Masegosa and J. Gozalvez, “LTE-V for Sidelink 5G V2X vehicular communications: A new 5G technology for short-range vehicle-to-everything communications,” IEEE Veh. Technol. Mag., vol. 12, no. 4, pp. 30–39, Dec. 2017, doi: 10.1109/MVT.2017.2752798.
[2] J. Bai, S. Li, H. Zhang, L.-B. Huang, and P. Wang, “Robust target detection and tracking algorithm based on roadside radar and camera,” Sensors, vol. 21, no. 4, p. 1116, Feb. 2021, doi: 10.3390/s21041116.
[3] P. Liu, G. Yu, Z. Wang, B. Zhou, and P. Chen, “Object classification based on enhanced evidence theory: Radar–vision fusion approach for roadside application,” IEEE Trans. Instrum. Meas., vol. 71, pp. 1–12, Feb. 2022, doi: 10.1109/TIM.2022.3154001.
[4] Y. Du, B. Qin, C. Zhao, Y. Zhu, J. Cao, and Y. Ji, “A novel spatio-temporal synchronization method of roadside asynchronous MMW radar-camera for sensor fusion,” IEEE Trans. Intell. Transp. Syst., early access, doi: 10.1109/TITS.2021.3119079.
[5] H. Rohling and M. Meinecke, “Waveform design principles for automotive radar systems,” in Proc. CIE Int. Conf. Radar, 2001, pp. 1–4, doi: 10.1109/ICR.2001.984612.
[6] J. Li and P. Stoica, “MIMO radar with colocated antennas,” IEEE Signal Process. Mag., vol. 24, no. 5, pp. 106–114, Sep. 2007, doi: 10.1109/MSP.2007.904812.
[7] A. M. Haimovich, R. S. Blum, and L. J. Cimini, “MIMO radar with widely separated antennas,” IEEE Signal Process. Mag., vol. 25, no. 1, pp. 116–129, 2008, doi: 10.1109/MSP.2008.4408448.
[8] F. N. Hooge, “1/f noise sources,” IEEE Trans. Electron. Devices, vol. 41, no. 11, pp. 1926–1935, Nov. 1994, doi: 10.1109/16.333808.
[9] S. K. Goudos, K. Siakavara, T. Samaras, E. E. Vafiadis, and J. N. Sahalos, “Sparse linear array synthesis with multiple constraints using differential evolution with strategy adaptation,” IEEE Antennas Wireless Propag. Lett., vol. 10, pp. 670–673, Jul. 2011, doi: 10.1109/LAWP.2011.2161256.
[10] L. M. H. Ulander, H. Hellsten, and G. Stenstrom, “Synthetic-aperture radar processing using fast factorized back-projection,” IEEE Trans. Aerosp. Electron. Syst., vol. 39, no. 3, pp. 760–776, Jul. 2003, doi: 10.1109/TAES.2003.1238734.
[11] K. Khan, S. U. Rehman, K. Aziz, S. Fong, and S. Sarasvady, “DBSCAN: Past, present and future,” in Proc. Int. Conf. Appl. Digit. Inf. Web Technol., 2014, pp. 232–238, doi: 10.1109/ICADIWT.2014.6814687.
[12] “YOLOv5 in PyTorch.” GitHub. Accessed: Mar. 11, 2022. [Online]. Available: https://github.com/ultralytics/yolov5
[13] S. Sugimoto, H. Tateda, H. Takahashi, and M. Okutomi, “Obstacle detection using millimeter-wave radar and its visualization on image sequence,” in Proc. Int. Conf. Pattern Recognit., 2004, vol. 3, pp. 342–345, doi: 10.1109/ICPR.2004.1334537.
[14] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang, “GS3D: An efficient 3D object detection framework for autonomous driving,” in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit., 2019, pp. 1019–1028, doi: 10.1109/CVPR.2019.00111.
[15] “Intel® Integrated Performance Primitives.” Intel. Accessed: Mar. 11, 2022. [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/ipp.html
Digital Object Identifier 10.1109/MVT.2022.3207453