Jongho Kim, Kyongsu Yi
This article describes the development and implementation of a 3D lidar perception framework that provides precise awareness of the surrounding environment for urban autonomous driving. The proposed framework consists of two detection modules operating in parallel: a deep learning-based method and a geometric model-free cluster-based method. The first module utilizes the convolutional gated recurrent unit (ConvGRU)-based residual network (CGRN), which repredicts 3D objects based on the results of a continuous single-frame detection network. A vision-fusion methodology based on 2D projection is adopted for postprocessing in the first module. The second module utilizes geometric model-free area (GMFA) cluster detection and is designed to cope with false-negative cases of unclassified objects missed by the prior module. For the second module, cluster variance-based ground removal is conducted to prevent false-positive cases. A kinematic model-based particle filter (PF) is then applied to estimate the dynamic states of the detections. The suggested framework has been developed with real-time operation in mind, to be implemented in autonomous vehicles equipped with automotive lidars and low-cost cameras. The test results show that the framework with CGRN and GMFA successfully improves surrounding-object detection and state estimation accuracy in urban autonomous driving.
Advancements in the field of autonomous driving have led to a shift in driving scenarios, from highways to urban roads, to accommodate level 4 autonomous driving. Unlike highways, urban roads with substantial traffic require much more information about the surroundings to ensure that autonomous driving systems make safe decisions. Consequently, the lidar sensor, which utilizes reflections to obtain point clouds with high positional accuracy, is used by most autonomous vehicles to recognize short- to medium-range objects in urban scenarios.
Lidar object perception for modern autonomous driving systems generally consists of detection and tracking. The detection module distinguishes and categorizes 3D boxes from the point clouds of surrounding objects. According to the tracking-by-detection paradigm, the tracking module then maintains object IDs and estimates the dynamic states of moving objects from the continuous detection sequence. Although many researchers have studied the individual perception modules for several years, the perception framework as a whole has not yet been fully discussed and still requires evaluation and validation of real-time performance and robustness for implementation in autonomous vehicles. Therefore, this article proposes a lidar object perception framework, including detection, tracking, and several postprocessing methods. Focusing on application to urban autonomous driving, the proposed framework is designed to confirm real-time capability and robustness against perception failure.
Regarding 3D object detection, numerous studies have addressed point cloud object detection through methodologies ranging from conventional geometric shapes to the latest deep neural networks. Most modern deep-learning approaches quantize the point cloud and represent it as a pseudoimage. Typically, the point cloud is reorganized into 2D pillar or 3D voxel units to detect objects through a single-shot detector [1], [2], [3]. From the perspective of an autonomous vehicle, previous studies have attempted to expand the input from a single frame to temporal frames for the 3D object-detection task. Luo et al. proposed a network that concatenates the point cloud sequence to learn multiple tasks simultaneously [4]. Huang et al. proposed a spatiotemporal fusion method based on U-net and long short-term memory (LSTM), which is the origin of recurrent neural network-based approaches [5]. However, jointly training the spatial backbone and the LSTM for temporal fusion is problematic for real-time applications due to the heavy network. Moreover, transforming 3D quantized features to compensate for ego-motion is computationally expensive. The CGRN in the proposed detection module improves detection performance by taking the prior detection sequence from a single-frame detector with frozen parameters [6]. Treating the previous detections as points allows ego-motion compensation with relatively little computation, so the proposed detection module achieves real-time capability and performance enhancement with only a slight increase in training parameters.
For multiobject tracking, CenterPoint and several other works have proposed deep learning-based methods and results for 3D object tracking [7]. However, it is essential to note that the 3D object-tracking tasks on open data sets do not include an evaluation of the dynamic state estimation of tracked objects; they only address the task of assigning an ID to each tracked object. Many decision-making studies, such as cut-in trajectory prediction and intersection motion planning, rely on the dynamic states (velocity, yaw rate) of an object [8], [9]. Geometric model-free tracking (GMFT) has been developed to estimate the dynamic state of point clusters based on the extended Kalman filter (EKF) [10]. However, the GMFT methodology shows delays in yaw rate estimation due to the absence of yaw detection. In this study, the proposed state tracking module adopts a PF to provide satisfactory estimates of velocity and yaw rate under the uncertainty arising from the nonlinear motion of a vehicle [11].
Open data sets (such as KITTI and the Waymo Open Dataset) provide vast amounts of point-cloud and ground truth data that are difficult to obtain directly [12], [13]. Average precision (AP), a representative evaluation index for the 3D object-detection task, is determined regardless of the location and direction of the ground truth object. However, in autonomous driving systems, the significance of a detection error depends on vehicle behavior: unrecognition or misrecognition around the future driving path holds comparatively more risk, in terms of safety and ride comfort, than errors in other areas. Therefore, the proposed framework detects point clusters in parallel in the GMFA to prevent the unrecognition (false negatives) of unclassified objects. In addition, the deep-learning and clustering-based detection modules each include postprocessing to reduce misrecognition (false positives).
The contributions of this article are summarized as follows: 1) a ConvGRU-based residual network (CGRN) that refines single-frame 3D detections using the prior detection sequence with little additional computation; 2) a GMFA cluster detection module that operates in parallel to cover unclassified objects with potential collision risk; 3) a PF-based tracking module that estimates the velocity and yaw rate of surrounding objects under nonlinear vehicle motion; and 4) real-time implementation and validation of the entire framework on an autonomous vehicle in urban driving.
As shown in Figure 1, the proposed perception framework receives multiple lidar point clouds and front camera images. The states of surrounding vehicles (including velocity and angular velocity) are then given as an output. Before being sent to the main perception modules, the inputs are processed by a preprocessing module with two functions: 1) sensor calibration and merging of point clouds from multiple lidars; and 2) 2D object detection from the front camera image. As the 2D object detector for the vision-fusion method, the Yolov4 detector has been adopted in the actual implementation [14]. The three main modules are deep learning-based detection, GMFA cluster detection, and PF-based object state estimation. The two detection modules include vision fusion and ground removal as postprocessing steps. Details of each module are described below.
Figure 1 The overall architecture of the proposed perception framework.
The point pillars network has low complexity and an intuitive structure among single-frame 3D object detection networks [1]. While its inference speed is significantly faster, its detection performance (such as AP) is low compared to that of state-of-the-art networks. Our approach improves detection performance at the current frame by using an auxiliary ConvGRU-based network [6].
The deep learning-based detection module consists of a baseline network (point pillars) and CGRN. Typical single-frame detection networks predict the difference (location, scale, direction) between a predefined anchor and ground truth through the output of the last region proposal network (RPN) stage. A temporal fusion approach was adopted to refine errors in the RPN output by letting the CGRN only predict the error from the single-frame RPN output of the current step. A recurrent neural network-based ConvGRU has been trained to predict the residual of the RPN by feeding the detection sequence of pretrained single-frame detectors as an input.
As depicted in Figure 2, the baseline network with frozen parameters only provides the previous single-frame detections and the RPN output at the current frame. The prior detection sequences, treated as points with attributes (scale, direction, score), are transformed to the current frame for ego-motion compensation. Each transformed detection is expressed as an 8D point with 3D position, 3D size, direction, and score, and taken as an input to the CGRN. Similar to detectors that take a point cloud as input, the CGRN first extracts each pillar's features with a PointNet-based pillar feature network (PFN) [15]. Spatial features composed of pillar features for each time step are input to the ConvGRU to learn temporal and spatial features simultaneously. The fused feature is then upsampled to match the size of the baseline feature. Finally, to refine the error of the single-frame RPN results, the residual RPN processes the spatiotemporal features using a shared multilayer perceptron, in the same dimension as the baseline anchors, to obtain residuals. The final detections are therefore computed through an elementwise summation of the RPN residual (predicted by the CGRN) and the RPN output (predicted by the single-frame detector from the current point cloud).
Figure 2 An overview of the deep learning-based detection network. Each frame of temporal point clouds is processed by a pretrained point pillars network (gray) to predict anchor features. The proposed CGRN network (blue) predicts the residuals of the anchor features as auxiliary.
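As a rough illustration of the residual refinement described above, the sketch below forms the final regression maps by elementwise summation of the frozen single-frame RPN output and the CGRN-predicted residual; the tensor layout, grid size, and channel counts are illustrative assumptions rather than the actual implementation.

```python
import numpy as np

# Illustrative shapes only: in the actual network, the residual maps share the
# dimensions of the baseline RPN head (anchors per cell, box regression channels).
H, W, A, C = 336, 336, 2, 7   # pseudoimage grid, anchors per cell, box channels (assumed)

rng = np.random.default_rng(0)
rpn_single = rng.normal(size=(H, W, A, C))          # frozen single-frame RPN output
rpn_residual = 0.1 * rng.normal(size=(H, W, A, C))  # residual predicted by the CGRN

# The final detections are decoded from the elementwise sum, using the same
# predefined anchors and decoding procedure as the baseline (decoding omitted).
rpn_refined = rpn_single + rpn_residual
print(rpn_refined.shape)   # (336, 336, 2, 7)
```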
Since the detection network is trained by the location and the class of objects from the point cloud data set, it does not guarantee the detection of unclassified objects in the autonomous driving area. Therefore, clustering-based detection is processed in parallel to utilize the advantages of both methods.
Since the computing cost of point clustering increases rapidly with the number of points, clustering over the entire perception area is inefficient for state estimation. Therefore, according to the vehicle behavior plan, the clustering region is reduced to detect unclassified objects with a potential collision risk. In this study, two safety indices are defined: the lane-width margin (LWM) and the time to collision (TTC). The proposed GMFA is shown in Figure 3. The GMFA is defined by the lane width weighted by the LWM and the distance given by the vehicle velocity multiplied by the TTC, and it is altered depending on the lateral position and planned behavior of the autonomous vehicle. Clustering of the point clouds in this region is performed through density-based spatial clustering of applications with noise (DBSCAN). After the clustering-based detection, duplicates are eliminated by measuring the rotated intersection over union (RIOU) with the deep learning-based results, where the RIOU is defined as the ratio of the overlapping region to the union of two rotated 3D boxes.
Figure 3 The GMFA defined by vehicle behavior (ego vehicle is white and the clustered vehicle is red).
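To make the clustering branch concrete, the following sketch clusters points inside a behavior-dependent GMFA and then drops clusters that duplicate deep learning-based detections via RIOU. The GMFA bounds, the DBSCAN parameters, the axis-aligned cluster boxes, and the RIOU threshold are illustrative assumptions; only the overall procedure follows the description above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from shapely.geometry import Polygon

def gmfa_bounds(lane_width, lwm, ego_speed, ttc):
    """Assumed GMFA extent: lateral half-width from the LWM-weighted lane width,
    longitudinal reach from ego speed times TTC."""
    return lane_width * lwm / 2.0, ego_speed * ttc

def rotated_box_polygon(cx, cy, length, width, yaw):
    dx, dy = length / 2.0, width / 2.0
    corners = np.array([[dx, dy], [dx, -dy], [-dx, -dy], [-dx, dy]])
    rot = np.array([[np.cos(yaw), -np.sin(yaw)], [np.sin(yaw), np.cos(yaw)]])
    return Polygon(corners @ rot.T + np.array([cx, cy]))

def riou(box_a, box_b):
    """Bird's-eye-view rotated IoU of two (cx, cy, length, width, yaw) boxes."""
    pa, pb = rotated_box_polygon(*box_a), rotated_box_polygon(*box_b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter + 1e-9)

def gmfa_cluster_detect(points_xy, ego_speed, dl_boxes,
                        lane_width=3.5, lwm=2.5, ttc=4.0, riou_thresh=0.1):
    half_w, reach = gmfa_bounds(lane_width, lwm, ego_speed, ttc)
    mask = (np.abs(points_xy[:, 1]) < half_w) & (points_xy[:, 0] > 0) & (points_xy[:, 0] < reach)
    region = points_xy[mask]
    if len(region) == 0:
        return []
    labels = DBSCAN(eps=0.7, min_samples=5).fit_predict(region)
    clusters = []
    for lbl in set(labels) - {-1}:                      # -1 marks DBSCAN noise
        pts = region[labels == lbl]
        cx, cy = pts.mean(axis=0)
        length = max(float(np.ptp(pts[:, 0])), 0.5)
        width = max(float(np.ptp(pts[:, 1])), 0.5)
        cand = (cx, cy, length, width, 0.0)             # axis-aligned box (yaw unknown)
        if all(riou(cand, b) < riou_thresh for b in dl_boxes):
            clusters.append(cand)                       # not a duplicate of a DL detection
    return clusters
```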
In this approach, the vehicle model is approximated as a rigid body, in which yaw rate is generated by a steering angle and velocity is controlled by throttle and brake pedals. Assuming that this vehicle motion model has a constant velocity and yaw rate, the vehicle travels along a circular trajectory.
A PF is used to estimate the velocity and yaw rate states of the tracked vehicle (called track) with a position and yaw obtained through the detection modules. To cope with the nonlinearity of vehicle motion, the PF directly expresses the probability distribution of the state via several particles. The tracked vehicle can change speed and yaw rate, but since acceleration and yaw acceleration cannot be measured directly, the PF adds noise representing the uncertainty of speed and yaw rate [11].
Since the vehicle's motion model is assumed to have constant velocity and yaw rate, the circular motion equation can be utilized to predict the particle states. At each period of the detection process ($dt$), the particle states are transformed to the current time step by the ego-motion $(T_{\text{ego}})$. Each particle state ($x$) is predicted as follows: \[ x_{k+1} = T_{\text{ego}} \cdot \begin{bmatrix} p_{x,k} - \dfrac{v_k}{\gamma_k}\sin(\theta_k) + \dfrac{v_k}{\gamma_k}\sin(\theta_k + \gamma_k\,dt) \\ p_{y,k} + \dfrac{v_k}{\gamma_k}\cos(\theta_k) - \dfrac{v_k}{\gamma_k}\cos(\theta_k + \gamma_k\,dt) \\ \theta_k + \gamma_k\,dt \\ v_k + w_v \\ \gamma_k + w_\gamma \end{bmatrix} \tag{1} \]
Each particle has position $(p_x, p_y)$ and yaw $(\theta)$ states in the ego vehicle coordinate system, and velocity $(v)$ and yaw rate $(\gamma)$ states in the global coordinate system. The process noise of the prediction step is applied only to the velocity and yaw rate to efficiently utilize the limited number of particles. In addition, the noise $(w)$ is assumed to have a uniform distribution to cope with random variations in the steering angle and velocity of the vehicle.
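A minimal numerical sketch of the prediction step in (1), assuming a simple (rotation, translation, yaw-change) representation of the ego-motion transform and illustrative uniform noise bounds:

```python
import numpy as np

def predict_particles(particles, dt, T_ego, v_noise=0.5, gamma_noise=0.1, rng=None):
    """particles: (N, 5) array of [p_x, p_y, theta, v, gamma] per particle.
    T_ego = (R, t, dyaw): assumed 2D rigid ego-motion (2x2 rotation, translation, yaw change)."""
    rng = rng if rng is not None else np.random.default_rng()
    px, py, theta, v, gamma = particles.T
    g = np.where(np.abs(gamma) < 1e-6, 1e-6, gamma)       # avoid division by zero

    # Constant velocity and yaw rate: circular motion over one detection period dt.
    px_new = px - (v / g) * np.sin(theta) + (v / g) * np.sin(theta + gamma * dt)
    py_new = py + (v / g) * np.cos(theta) - (v / g) * np.cos(theta + gamma * dt)
    theta_new = theta + gamma * dt

    # Uniform process noise on velocity and yaw rate only.
    v_new = v + rng.uniform(-v_noise, v_noise, size=v.shape)
    gamma_new = gamma + rng.uniform(-gamma_noise, gamma_noise, size=gamma.shape)

    # Transform position and yaw into the current ego vehicle frame.
    R, t, dyaw = T_ego
    pos = np.stack([px_new, py_new], axis=1) @ R.T + t
    return np.column_stack([pos, theta_new + dyaw, v_new, gamma_new])
```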
For multiple object tracking based on the PF, a track management process is implemented that includes track initialization, track removal, and measurement assignment of detected vehicles. Upon completion of the detection process, the displacements between the tracks updated in the prediction step and the newly detected vehicles are calculated. Each detected vehicle is assigned to the nearest track as a measurement, and an unassigned detected vehicle is initialized as a new track with several particles. A track is removed when no measurements are assigned to it over a certain period of time.
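The track management logic could be sketched as follows; the gate distance, the miss count before removal, the particle count, and the zero-velocity initialization are hypothetical values rather than the authors' settings.

```python
import numpy as np

def assign_and_manage(tracks, detections, gate=2.5, max_misses=10, n_particles=200):
    """tracks: list of dicts with 'particles' (N, 5) and 'misses';
    detections: (M, 3) array of [x, y, yaw] measurements."""
    unassigned = list(range(len(detections)))
    for trk in tracks:
        trk['measurement'] = None
        if not unassigned:
            trk['misses'] += 1
            continue
        center = trk['particles'][:, :2].mean(axis=0)
        dists = np.linalg.norm(detections[unassigned, :2] - center, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < gate:                              # nearest detection within the gate
            trk['measurement'] = detections[unassigned.pop(j)]
            trk['misses'] = 0
        else:
            trk['misses'] += 1

    # Unassigned detections start new tracks; stale tracks are removed.
    for i in unassigned:
        x, y, yaw = detections[i]
        particles = np.tile([x, y, yaw, 0.0, 0.0], (n_particles, 1)).astype(float)
        tracks.append({'particles': particles, 'misses': 0, 'measurement': None})
    return [t for t in tracks if t['misses'] <= max_misses]
```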
The measurement update of the PF refers to updating each particle's weight by the relative likelihood obtained from a comparison between the predicted state ($x$) and the measurement ($z$). The relative likelihood is defined as follows: \begin{align*} p(z \mid x) &= p(z_x, z_y \mid p_x, p_y) \cdot p(z_\theta \mid \theta) \\ &= N\!\left(L2(z_{x,y}, p_{x,y}) \mid R_D^2\right) \cdot N\!\left(\mathrm{diff}(z_\theta, \theta) \mid R_\theta^2\right) \tag{2} \end{align*} where $z_x$, $z_y$, and $z_\theta$ are the position and yaw states of the measurement. The relative likelihood is defined as the conditional probability of the measurement given the track state. It is assumed that the measurement assigned to the track is normally distributed in position and yaw around each track particle. Therefore, a particle's weight decreases as the displacement and the yaw difference between the particle and the measurement increase. To prevent the particle weights from becoming biased, the track is resampled when the effective number of particles is lower than half of the total number of particles, which has the effect of splitting large-weighted particles into smaller ones.
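A minimal sketch of the weight update in (2) and the effective-sample-size resampling rule; the Gaussian standard deviations R_D and R_theta and the systematic resampling variant are illustrative assumptions.

```python
import numpy as np

def update_weights(particles, weights, z, R_d=0.5, R_theta=0.2):
    """particles: (N, 5) [p_x, p_y, theta, v, gamma]; z: (z_x, z_y, z_theta) measurement."""
    d = np.linalg.norm(particles[:, :2] - np.asarray(z[:2]), axis=1)       # L2 position error
    dyaw = np.arctan2(np.sin(particles[:, 2] - z[2]), np.cos(particles[:, 2] - z[2]))
    lik = np.exp(-0.5 * (d / R_d) ** 2) * np.exp(-0.5 * (dyaw / R_theta) ** 2)
    weights = weights * lik + 1e-300                     # guard against all-zero weights
    return weights / weights.sum()

def resample_if_needed(particles, weights, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    n = len(weights)
    n_eff = 1.0 / np.sum(weights ** 2)                   # effective number of particles
    if n_eff >= n / 2:
        return particles, weights
    # Systematic resampling: large-weight particles are split into several copies.
    positions = (np.arange(n) + rng.uniform()) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    return particles[idx].copy(), np.full(n, 1.0 / n)
```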
Autonomous driving systems may have fatal consequences when perception algorithms fail to detect objects. Therefore, to increase the recall rate, the score threshold of the detection network is set to a low value. However, the tradeoff between recall and precision needs to be addressed. To tackle this problem, this article proposes a method of postprocessing false-positive cases through vision fusion.
Numerous studies involving multidomain networks based on lidar and vision fusion have been conducted recently. However, instead of fully training a multidomain network, our framework filters false-positive cases by processing simple projections onto the image plane for scalability. Using the front camera's intrinsic and extrinsic transformation matrices, the center of each 3D bounding box detected by the deep learning-based module is projected as one pixel onto the image plane. On the image plane, false-positive detections are removed using the distance between the projected location and the 2D object boxes obtained in preprocessing.
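A hedged sketch of this projection-based filtering follows; the camera matrices K and T_cam, the [x1, y1, x2, y2] format of the 2D boxes, and the pixel-distance threshold are placeholders rather than the actual calibration or tuning. Detections that fall outside the camera view are kept here, since they cannot be verified by the front camera (an assumption).

```python
import numpy as np

def project_center(center_3d, K, T_cam):
    """Project a 3D box center (lidar frame) to a pixel; return None if behind the camera.
    K: 3x3 intrinsic matrix, T_cam: 4x4 lidar-to-camera extrinsic transform (assumed)."""
    p_cam = T_cam @ np.append(center_3d, 1.0)
    if p_cam[2] <= 0:
        return None
    uv = K @ p_cam[:3]
    return uv[:2] / uv[2]

def filter_by_vision(boxes_3d, boxes_2d, K, T_cam, max_px_dist=60.0):
    """Keep 3D detections whose projected center lies near some 2D box center."""
    centers_2d = np.array([[(b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0] for b in boxes_2d])
    kept = []
    for box in boxes_3d:
        px = project_center(np.asarray(box[:3], dtype=float), K, T_cam)
        if px is None:
            kept.append(box)                 # outside the front camera view: keep (assumption)
            continue
        if len(centers_2d) and np.min(np.linalg.norm(centers_2d - px, axis=1)) < max_px_dist:
            kept.append(box)
    return kept
```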
The height of the GMFA is set within a predefined range to avoid clustering the ground. However, in an environment where the ground is not level, such as on a ramp, false-positive detections occur due to clustering of the ground. For ground points, the vertical dispersion of the cluster corresponding to each channel of the 3D lidar is relatively small regardless of the height of the cluster. Therefore, point clusters with a vertical variance smaller than a threshold are excluded as ground misrecognition.
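The variance check could look like the following sketch; the threshold is an illustrative assumption.

```python
import numpy as np

def is_ground_cluster(cluster_points, z_var_thresh=0.01):
    """cluster_points: (N, 3) array; a very small vertical variance suggests a ground patch."""
    return float(np.var(cluster_points[:, 2])) < z_var_thresh

def remove_ground_clusters(clusters, z_var_thresh=0.01):
    """Drop clusters whose vertical spread is too small to be an object."""
    return [c for c in clusters if not is_ground_cluster(c, z_var_thresh)]
```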
The proposed framework has been developed and designed to cover the entire lidar perception system for autonomous vehicles, and thus contributes to both object detection and tracking. The deep learning-based detection module containing the CGRN has been evaluated on the Waymo Open Dataset [13]. PF-based object-state tracking has been evaluated using data obtained from an overtake scenario. The overall framework has been quantitatively and qualitatively investigated via urban autonomous driving.
The autonomous vehicle and device configuration are shown in Figure 4. Two lidars (Velodyne VLP-32C) are located at the front and rear ends of the roof, and a camera is installed under the front windshield. In addition to the perception sensors, a real-time kinematic (RTK) GPS, four around view monitoring (AVM) cameras, two industrial PCs with RTX 3070 GPUs for the proposed perception framework and vehicle motion planning, and a MicroAutoBox for actuator control are embedded in the autonomous vehicle.
Figure 4 Autonomous driving vehicle and device configuration.
The detection range of the CGRN is within [−53.76, 53.76] m for the x axis, [−53.76, 53.76] m for the y axis, and [−2, 4] m for the z axis, and the pillar size is set to [2.56, 2.56, 6] m. The ConvGRU in the CGRN receives sequential pseudoimages of width, height, and channels (42, 42, 32), and its hidden dimension, number of layers, and kernel size are set to (16, 2, [3, 3]). For the performance analysis, the CGRN has been implemented with two different input lengths of three and five frames at a sampling time of 150 ms. The same loss functions proposed in the baseline are used. For the Waymo Open Dataset, the CGRN was trained with a batch size of four and a learning rate of 0.02 for 30 epochs on RTX 3070 GPUs for approximately 11 h. The detection range of all networks used for comparative analysis, including the baseline, is the same as that of the CGRN, and the pillar size of the baseline is set to [0.32, 0.32, 6] m. The baseline was reimplemented and trained separately before training the CGRN, and the other hyperparameters are identical to those of the reference article [1].
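For reference, the (42, 42) pseudoimage fed to the ConvGRU follows directly from the stated detection range and pillar size, as the short check below illustrates.

```python
# Detection range [-53.76, 53.76] m with a 2.56-m pillar yields the 42-cell grid
# quoted above for the ConvGRU pseudoimage (purely a consistency check).
x_min, x_max, pillar = -53.76, 53.76, 2.56
cells = round((x_max - x_min) / pillar)
print(cells)   # 42
```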
For urban autonomous driving, among the parameters of GMFA, the LWM is set to 2.5, which can include both side lanes, and the TTC is set to 4 s, considering the maximum test speed.
The Waymo Open Dataset has a larger scale than previous autonomous vehicle data sets, such as the KITTI data set [12], [13], and consists of 1,950 segments of 20 s, collected at 10 Hz (390,000 frames) in diverse geographies and conditions. Table 1 shows the AP and the AP weighted by heading (APH) for 3D object detection of the proposed framework and the preliminary literature. In terms of the number of input frames, since the five-frame input can exploit a relatively longer history of each object during training, the AP improvement is more pronounced for the five-frame model than for the three-frame model. Consequently, the proposed CGRN with five frames achieves a 6.3% AP improvement for the vehicle category with only 1.01% additional parameters compared to the baseline. It is noteworthy that the proposed algorithm, with 37% fewer parameters, shows AP comparable to that of heavier networks [3]. Also, in the results for the pedestrian category, our approach predicts the movement direction of the pedestrian based on the prior detection results, and the APH increases by 19% compared to the baseline.
Table 1 The AP and APH comparison of 3D object detection at 0.7 IoU.
The proposed framework has been compared with the GMFT algorithm for state tracking [10]. GMFT detects objects only through point clustering of the point cloud projected into the bird's-eye view and estimates states based on the EKF. Tracking of nonvehicle objects, which is inefficient, is excluded, and only the tracking performance for dynamic states is evaluated in this section.
An evaluation experiment for vehicle state tracking has been conducted at the Seoul National University Future Mobility Technology Center, Siheung, South Korea. A scenario in which a target vehicle traveling at 35 km/h overtakes a stationary ego vehicle has been tested. The ego vehicle refers to the autonomous vehicle shown in Figure 4, and the target vehicle used was a Hyundai County electric vehicle, a midsize bus. In the experiment, the target vehicle’s RTK GPS and chassis sensor measurements were referenced as ground truth data. Estimation of the target vehicle, a bus measuring 6.3 m in length, 2.4 m in width, and 2.6 m in height, proved to be difficult due to the significant changes in geometric shape of the point cluster during the overtake. All in all, the test scenario was considered suitable for evaluating the detection and tracking of the proposed framework and the conventional algorithm.
Figure 5 shows the trajectory and yaw of the tracked target in the ego vehicle's coordinate system. During the target vehicle's initial straight movement behind the ego vehicle, the point cluster moves linearly with a constant shape; therefore, the estimates of both algorithms are close to the ground truth. However, as the target vehicle changes lanes to the right to overtake, the change in the shape of the point cluster is maximized. GMFT, which estimates the state from cluster movement, shows a delay in yaw estimation. In contrast, the proposed framework detects the actual position and yaw of the bus with the trained network and successfully estimates the circular motion based on the PF, enabling direct yaw state estimation. When the target changes back to the left lane, the proposed framework correctly estimates the yaw state by detecting the actual location, whereas GMFT fails to converge due to the rapid change in the shape and yaw of the point cluster and initiates a new track. In the overtaking scenario, the root-mean-squared error (RMSE) of the yaw angle is 0.12 rad for the proposed framework and 0.28 rad for GMFT.
Figure 5 The state tracking result and proposed detection snapshots of the bus overtaking the ego vehicle (the arrow indicates the yaw).
Figure 6 shows the yaw rate and absolute error profiles for the travel trajectory depicted in Figure 5. The state tracking results of the proposed framework show a significantly reduced delay in yaw rate estimation compared to those of GMFT. The RMSEs of the yaw rate for the proposed framework and GMFT are 0.05 rad/s and 0.12 rad/s, respectively. In addition, the maximum delay of the yaw rate during the cut-out of the target vehicle is 0.54 s for the proposed estimation and 1.01 s for GMFT. From the experimental results shown in Figures 5 and 6, it can be inferred that the PF represents the nonlinear motion of the tracked vehicle better than the EKF. Furthermore, in overtaking scenarios with substantial changes in cluster shape, the deep learning-based detection predicts the vehicle's actual position and yaw more accurately than point clustering.
Figure 6 The yaw rate and absolute error profiles of the proposed framework and GMFT in a scenario where the bus overtakes the ego vehicle.
Quantitative and qualitative evaluations of the lidar perception framework have been conducted via autonomous driving in the Sangam autonomous driving testbed, Seoul, South Korea. The test vehicle, a Kia Carnival, and the device configuration are depicted in Figure 4. The test area consists of two sections: a parkway and an urban road. The parkway is a narrow two-lane road with a long median guardrail and numerous parked vehicles. The urban road has six lanes with congested traffic and various types of vehicles parked on the shoulder lane. The maximum velocity of the autonomous vehicle was set to 20 km/h on the parkway and 40 km/h on the urban road. The parkway and urban road driving data consist of sensor data for the perception framework covering 250 s and 240 s, respectively.
Table 2 shows the quantitative evaluation results of the proposed framework, the ablation studies, and GMFT. The results in the table are aggregated according to the following rules. The ground truth tracks are the vehicles and cyclists within the deep learning-based detection area and the unclassified objects within the future driving path. In each algorithm, tracks that survive more than 5 s are counted as positive, and split and merged tracks are counted as false positives and false negatives, respectively. First of all, the ablation studies show that the GMFA and the postprocessing module effectively improve recall and precision, respectively. For the entire proposed framework, the F1 scores for both the parkway and the urban road driving data are analyzed. The F1 score on the parkway driving data shows an increase of 0.4 due to the enhanced precision. Figure 7(a) shows snapshots of the results of each method on the parkway driving data. The proposed framework eliminates the false-positive detection of the pedestrian safety guardrail through the proposed vision-based fusion method, whereas the clustering-based approach incorrectly estimates long-shaped objects as traveling in the same heading direction as the ego vehicle. The F1 score on the urban road driving data shows an increase of 0.23 due to both improved precision and recall. As shown in Figure 7(b), detection via deep learning reduces the misrecognition of commercial vehicles split into two point clusters. In addition, an increase in recall is observed because the proposed ground removal method mitigates the merging of surrounding objects with ground clusters and the resulting disappearance of object clusters. Finally, the proposed framework, operating at 33-ms and 36-ms latency for the respective data sets, satisfies the real-time requirement for all data acquired with the 20-Hz lidar.
Figure 7 Qualitative detection and tracking results: (a) Sangam parkway; (b) Sangam urban roads.
Table 2 Quantitative detection and tracking results of proposed framework and GMFT on the urban autonomous driving data.
This article proposes a lidar-based perception framework for an autonomous driving system. The proposed framework encompasses the entire process of estimating the states of surrounding objects using lidar, camera, and vehicle sensor signals, and it has been implemented and investigated in an actual autonomous vehicle. In experiments on the Waymo Open Dataset, the proposed CGRN achieves a 6.3% AP improvement for the vehicle category with only a 1.01% increase in parameters over the baseline network. The test on the bus-overtaking scenario demonstrates that the PF-based state estimation reduces the yaw rate delay by 46% compared to the EKF-based method. Overall, the autonomous vehicle tests in an urban environment confirmed that the proposed framework successfully detects surrounding objects and estimates their states while managing false-detection cases and maintaining real-time capability.
In future studies, our research team plans to develop auxiliary networks that infer driver intentions, such as lane changes and yields, from 3D object sequence data, following the concept of the CGRN. Furthermore, we plan to design a network that can integrate the results of multiple domains, such as radar, V2X, and AVM cameras, in addition to lidars and cameras.
This work was supported in part by the Korea Institute of Marine Science & Technology Promotion funded by the Ministry of Oceans and Fisheries, Korea (20220583); the Institute of Engineering Research, Seoul National University, South Korea; the Brain Korea 21 FOUR Project in 2022; the Korea Science and Engineering Foundation/Seoul National University-Institute of Advanced Machines and Designs; and partly supported by the Future Mobility Technical Center of Seoul National University, Seoul, South Korea.
Jongho Kim (kimjhmj@snu.ac.kr) is a Ph.D. candidate in mechanical engineering at Seoul National University, Seoul, 08826, South Korea. He received his B.S. degree in mechanical system design engineering from Hongik University, Seoul, South Korea, in 2019. His research interests are lidar-based object perception and state estimation for urban autonomous driving.
Kyongsu Yi (kyi@snu.ac.kr) is a professor at the School of Mechanical Engineering at Seoul National University, Seoul 08826, South Korea. He received his B.S. and M.S. degrees in mechanical engineering from Seoul National University, South Korea, in 1985 and 1987, respectively, and his Ph.D. degree in mechanical engineering from the University of California, Berkeley, CA, USA in 1992. His research interests are control systems, driver assistant systems, active safety systems, and automated driving of ground vehicles.
[1] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2019, pp. 12,697–12,705.
[2] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3D object detection,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2018, pp. 4490–4499.
[3] S. Shi et al., “PV-RCNN: Point-voxel feature set abstraction for 3D object detection,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2020, pp. 10,529–10,538.
[4] W. Luo, B. Yang, and R. Urtasun, “Fast and furious: Real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2018, pp. 3569–3577.
[5] R. Huang et al., “An LSTM approach to temporal 3D object detection in LiDAR point clouds,” in Proc. 16th Eur. Conf., Comput. Vision (ECCV), Glasgow, U.K., Aug. 2020, pp. 266–282.
[6] N. Ballas, L. Yao, C. Pal, and A. Courville, “Delving deeper into convolutional networks for learning video representations,” 2015, arXiv:1511.06432.
[7] T. Yin, X. Zhou, and P. Krähenbühl, “Center-based 3D object detection and tracking,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2021, pp. 11,784–11,793.
[8] Y. Yoon, C. Kim, J. Lee, and K. Yi, “Interaction-aware probabilistic trajectory prediction of cut-in vehicles using Gaussian process for proactive control of autonomous vehicles,” IEEE Access, vol. 9, pp. 63,440–63,455, 2021, doi: 10.1109/ACCESS.2021.3075677.
[9] J. M. Yoo, Y. Jeong, and K. Yi, “Virtual target-based longitudinal motion planning of autonomous vehicles at urban intersections: Determining control inputs of acceleration with human driving characteristic-based constraints,” IEEE Veh. Technol. Mag., vol. 16, no. 3, pp. 38–46, Sep. 2021, doi: 10.1109/MVT.2021.3086432.
[10] H. Lee, J. Yoon, Y. Jeong, and K. Yi, “Moving object detection and tracking based on interaction of static obstacle map and geometric model-free approach for urban autonomous driving,” IEEE Trans. Intell. Transp. Syst., vol. 22, no. 6, pp. 3275–3284, Jun. 2020, doi: 10.1109/TITS.2020.2981938.
[11] A. Flåten and E. Brekke, "Rao-Blackwellized particle filter for turn rate estimation," in Proc. IEEE Aerosp. Conf., 2017, pp. 1–7, doi: 10.1109/AERO.2017.7943746.
[12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, Sep. 2013, doi: 10.1177/0278364913491297.
[13] P. Sun et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2020, pp. 2446–2454.
[14] A. Bochkovskiy, C. Wang, and H. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” 2020, arXiv:2004.10934.
[15] C. Qi et al., “Pointnet: Deep learning on point sets for 3D classification and segmentation,” in Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), 2017, pp. 652–660.
Digital Object Identifier 10.1109/MVT.2023.3236480