Julian Stähler, Carsten Markgraf, Mathias Pechinger, David W. Gao
Mobility is fundamental for the wealth and health of the world's population, and it has a significant influence on our daily life. However, with the increasing complexity of traffic, the need to transport goods, and growing urbanization, improving the quality of mobility in terms of travel time, use of space, and air quality becomes more challenging. An autonomous electric vehicle offers technology and potential for new mobility concepts in smart cities. Today, many vehicles have been developed with automated driving capabilities. In most cases, a safety driver is still required to intervene if the autonomous electric vehicle is not able to handle a situation in a safe and, at the same time, reliable way. One important aspect to achieve safety and reliability goals is a robust and efficient perception of the vehicle's environment.
The tragic accident in which an Uber self-driving car killed a pedestrian in 2018 highlighted the importance of perception in autonomous driving. In its investigation, the U.S. National Transportation Safety Board found that neither the Uber self-driving car nor its safety driver detected the pedestrian in time. Since this accident, the autonomous vehicle industry has been working to improve perception systems through the use of advanced sensors, machine learning algorithms, and other technologies.
A variety of sensor technologies are used in vehicles to detect objects and perceive the vehicles' surroundings. Cameras and radars are among the most widely used technologies for sensing systems, as they are inexpensive and reliable, and because they operate at different wavelengths, they are not susceptible to common-mode errors. While some companies rely solely on cameras and radars, others also use lidars in their sensing systems. These sensors are widely used in the industry, and the technology itself is continuously enhanced, leading to new developments, such as 4D lidar, which is capable of measuring not only the distance of an object but also its velocity by evaluating the frequency shift of the returned light. Various methods for the perception of the environment exist and rely on different kinds of sensors. Machine learning-based methods have evolved rapidly in recent years and are currently leading the field in perception, particularly in the tasks of object detection and classification.
This article shows the integration results of an extended You Only Look Once for Panoptic Driving Perception (YOLOP) convolutional neural network (NN). A camera-based approach for object detection and classification as well as free space and lane detection in a single deep NN (DNN) is developed. It is used in a hybrid electric research vehicle for environmental perception in automated driving. The layer structure of this NN was slightly changed, and the network was retrained for the detection and classification of up to 13 different kinds of objects. The quality of the outputs, from lane recognition to free space segmentation, is kept at a high level. Safe and reliable environmental perception is crucial for the homologation of automated vehicles for public roads at levels where no safety driver is required. This approach utilizes a shared backbone, which keeps the cycle time for evaluating the NN sufficiently low. The extension to multiobject detection and classification enables individual tracking according to object-specific mobility characteristics. Furthermore, sign and signal recognition are included as an important input for the behavior planning of automated vehicles.
In autonomous driving applications, safe and reliable perception of the environment is a major challenge. Recent developments show reliable results even in complex surroundings, keeping the number of false positive and false negative object detections at a low level. However, to increase multiobject tracking accuracy, ever larger DNNs are used with higher numbers of input and output parameters, processing data from diverse sensors in the vehicle. Therefore, computational performance constraints must be considered for in-vehicle applications with limited resources in real time. A very promising approach to achieve this goal can be found in a recent work called YOLOP. Here, the authors propose an architecture capable of performing several perception tasks in a single network. Three different kinds of detection (object detection, free space segmentation, and lane detection) are implemented in one NN with the help of a shared backbone. Very high computational performance is achieved, allowing the NN to be operated on low-cost embedded devices, such as the Nvidia Jetson TX2. However, so far, only one object class (vehicle) is identified at the output. For object motion prediction purposes, it is crucial to distinguish different object types. Furthermore, static objects that determine traffic rules (e.g., traffic signals and signs) are an important input for the behavior planner inside the vehicle's planning system.
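The shared-backbone idea can be illustrated with a toy PyTorch module: one feature extractor feeds three task-specific heads, so the expensive encoder runs only once per frame. This is a minimal sketch of the concept and not the actual YOLOP layer structure; layer sizes, head designs, and the default class count are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

class SharedBackbonePerception(nn.Module):
    """Toy multitask network: one shared encoder, three task heads."""

    def __init__(self, num_classes=13):
        super().__init__()
        self.backbone = nn.Sequential(                     # shared feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Three lightweight heads reuse the same feature map:
        self.detect_head = nn.Conv2d(64, 3 * (5 + num_classes), 1)  # boxes + classes
        self.area_head = nn.Conv2d(64, 2, 1)               # drivable area vs. background
        self.lane_head = nn.Conv2d(64, 2, 1)               # lane line vs. background

    def forward(self, x):
        features = self.backbone(x)                        # encoder runs once per frame
        return (self.detect_head(features),
                self.area_head(features),
                self.lane_head(features))

# One forward pass yields all three task outputs:
detections, drivable_area, lanes = SharedBackbonePerception()(torch.zeros(1, 3, 384, 640))
```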
In this work, the YOLOP network is extended and trained for further object classes that are already labeled in the BDD100K dataset (see Figure 1). Our approach is integrated into a research vehicle to validate the results in a real-world application in Augsburg, Germany, and to determine the optimization potential of the sensor fusion algorithm. For this purpose, the network is integrated into the Robot Operating System (ROS) to establish a connection to the vehicle's sensors and actuators. ROS is middleware for Unix-like systems, such as Ubuntu Linux, and is used heavily in robotics and autonomous driving research. This approach also facilitates debugging. Open-loop test drives were performed to collect data for a first subjective judgment of the detection quality. Subsequently, the system was connected to the vehicle's planning system, low-level controller, and actuators to validate the results in closed-loop evaluations.
Figure 1. Perception results in a crowded city environment based on data from the BDD100K dataset.
YOLOP is extended to 13 object classes, and training is executed on two additional datasets. The updated network with a new parameter set is then integrated into our research vehicle and used for open-loop test drives on public roads in Augsburg. An image of this vehicle is provided in Figure 2. The research car is a BMW ActiveHybrid 3 that is equipped with additional sensing systems, computing units for autonomous driving algorithms, and actuators. The system covers the full data processing pipeline and provides closed-loop test capabilities. On the roof of the vehicle, a 360° lidar sensor is mounted for object detection and localization purposes. At the rear of the roof, a differential global navigation satellite system is installed. This system enables the vehicle to position itself with centimeter-level accuracy. Other sensing systems, such as cameras, are mounted behind the windshield of the vehicle and provide perception redundancy and diversity.
Figure 2. The research vehicle used in this work.
The architecture of the test vehicle is divided into three main areas. This three-layer structure is a common design for an autonomous electric vehicle. Figure 3(a)–(c) describes the sensing systems, processing systems, and actuation systems.
Figure 3. The system architecture of the autonomous vehicle. The (a) sensing systems, (b) processing systems, and (c) actuation systems. ITS-G5: Intelligent Transportation System, 5.9 GHz; IMU: inertial measurement unit; GNSS: global navigation satellite system.
While control algorithms for actuation require hard real-time operation and a comparably high control frequency of more than 250 Hz, detection algorithms only need to achieve a lower frequency of 10 to 20 Hz. In between, dedicated tracking algorithms handle the interfacing and make predictions about the development of the driving situation, and planning algorithms use their output at each time step.
To integrate YOLOP into the vehicle's architecture, we extended the existing data flow (Figure 4). Due to this design choice, all data are available in every compute unit's processing system at any time. This enables running parts of the algorithm on different hardware platforms available in the vehicle. Camera and lidar data are converted into a common point cloud data representation using two dedicated nodes. These converted data are the source for all further processing and visualization.
Figure 4. The data flow of our detection system.
The training environment is based on a containerized setup for the actual training execution and an object store for artifacts and datasets. In our particular use case, we utilize Podman running on a Red Hat Enterprise Linux 8.5 host. This machine is equipped with 128 GB of memory and an Nvidia A6000 GPU with 48 GB of memory. The container image is based on the Nvidia NGC PyTorch image. The original YOLOP implementation was trained on the BDD100K dataset but already provided a modular approach that could be easily extended to additional datasets. While also training on the BDD100K dataset, we extended the training to more than one object class and utilized all object classes provided by the dataset (bus, 4× traffic light, sign, person, bike, truck, motorcycle, car, train, and rider). As a reference, we ran the training on the original training configuration and verified the results against the benchmark values. In a further step, we modified the training configuration to suit our hardware setup. Then, we integrated the necessary changes into the model structure to incorporate the new object classes and trained the whole model again. We started with the detection head only, verified the results, and continued with end-to-end training. This procedure turned out to be the most efficient.
In this section, we explain the modifications required to detect more than one object class. In a first step, we identify the object classes that we include in the network. This decision depends strongly on the classes available in the dataset used. Our primary dataset is BDD100K, which contains the following classes: bus, 4× traffic light, sign, person, bike, truck, motorcycle, car, train, and rider. The state of the traffic light is available as well, so we decompose the traffic light class into the four object classes green, red, yellow, and unknown. Table 1 lists the number of instances for each class in the BDD100K dataset.
Table 1. The object classes and corresponding IDs, with the number of instances in the BDD100k dataset.
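To make the extended label space concrete, the following sketch shows one way to encode the 13 classes and the traffic light decomposition in Python. The numeric IDs and the attribute handling are illustrative assumptions; the actual IDs used in this work are those listed in Table 1.

```python
# Illustrative class map for the extended network; the numeric IDs below are
# placeholders, and the actual IDs used in this work are listed in Table 1.
CLASS_NAMES = [
    "car", "bus", "truck", "train", "person", "rider", "bike", "motorcycle",
    "sign", "traffic_light_green", "traffic_light_red",
    "traffic_light_yellow", "traffic_light_unknown",
]
assert len(CLASS_NAMES) == 13
CLASS_IDS = {name: idx for idx, name in enumerate(CLASS_NAMES)}

def traffic_light_class(color_attribute):
    """Decompose a BDD100K traffic light label into one of four classes based
    on its color attribute ("green", "red", "yellow", anything else -> unknown).
    Treat the attribute handling as a sketch of the idea, not the exact code."""
    if color_attribute in ("green", "red", "yellow"):
        return CLASS_IDS["traffic_light_" + color_attribute]
    return CLASS_IDS["traffic_light_unknown"]
```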
Additional data are required, as we encountered too many false detections when the training set did not include situations and environments specific to German roads. There are several datasets available that include German cities, but to our knowledge, there is no set available that contains bounding box labels, drivable areas, and lane annotations at the same time. While it is possible to derive bounding box annotations from semantic datasets, such as Cityscapes, we chose the Kitti dataset and used only its object labels. Thus, we trained only the encoder and detection branch while freezing the rest of the network. For further comparison between our implementation and the original implementation, we consider only the training on the BDD100K dataset. In a second step, we update the detection head's output layer to include all 13 object classes by integrating the relevant part of the YOLOv5 network structure. The authors of YOLOP propose different training strategies. They compare multitask learning to single-task learning and find that the performance is close, but multitask learning yields a large saving in training time. However, we experienced slightly better results when starting with the object detection task training and then switching to multitask training.
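The following PyTorch fragment sketches the two model-side changes described above: resizing the detection head's output convolutions for 13 classes and freezing everything except the encoder and detection branch for the object-label-only training. The attribute names (model.detect.m, the "encoder"/"detect" parameter prefixes) are assumptions modeled on the YOLOv5-style head, not the exact identifiers in the YOLOP code.

```python
import torch.nn as nn

NUM_CLASSES = 13        # extended class set
NUM_ANCHORS = 3         # anchors per detection scale (YOLOv5-style head)
# Each anchor predicts (x, y, w, h, objectness) plus one score per class:
OUT_CHANNELS = NUM_ANCHORS * (5 + NUM_CLASSES)   # 3 * 18 = 54

def resize_detection_head(model: nn.Module) -> None:
    """Replace the 1x1 output convolutions of the detection head so that every
    scale predicts 13 classes (attribute names are assumptions, see above)."""
    model.detect.m = nn.ModuleList(
        nn.Conv2d(conv.in_channels, OUT_CHANNELS, kernel_size=1)
        for conv in model.detect.m
    )
    model.detect.nc = NUM_CLASSES
    model.detect.no = 5 + NUM_CLASSES   # outputs per anchor

def train_detection_branch_only(model: nn.Module) -> None:
    """Freeze the segmentation branches so only the encoder and the detection
    branch receive gradients (used when training on the Kitti object labels)."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(("encoder", "detect"))
```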
We evaluate the results after the final training (Figure 5) with all object classes. The detection still performs very well on the BDD100K dataset. The addition of the supplementary object classes has no significant negative impact on the lane detection and segmentation outputs (Tables 2 and 3).
Figure 5. Two different training results based on data from the BDD100K dataset. (a) Perception results in a suburban environment. (b) Perception results during the night.
Table 2. The drivable area: a comparison of the YOLOP multiclass with the results from different networks.
Table 3. The lane detection: a comparison of the YOLOP multiclass with the results from different networks.
For the integration into our research vehicle, the ability to run the network in real time is mandatory (in this case, the algorithm must run faster than the output rate of our camera, 20 frames per second). To accelerate the network execution, we utilize the Nvidia TensorRT framework. The TensorRT software development kit is tailored to high-throughput, low-latency use cases. It requires generating an optimized version of the DNN. This task takes about 2 min on our hardware but needs to be executed only when new weights are deployed to the vehicle. The authors of YOLOP provide a C++ implementation of their network, which we use as a starting point. We integrate GPU-accelerated dynamic input resizing, which allows the use of different cameras. In addition, the whole execution is accelerated with TensorRT and then fully integrated into ROS. This allows a more flexible integration with the rest of the vehicle and adds the benefit of the visualization tools that come with the ROS stack. The algorithms are packaged into ROS nodes, and the execution is triggered by new data coming from the sensors. This ensures that new data from the perception hardware are processed only once and allows us to incorporate sensors with different frequencies into our system. The synchronization of these data is handled by a fusion node, which runs with a constant cycle time of 100 ms and collects the data of all sources. Fusion execution can be based on a reference frequency derived from an input source or on a timer. Depending on the type of data, different measures for synchronization can be employed to increase the quality of the fused result. We decided to use the timer-based approach.
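A minimal Python/rospy sketch of such a timer-driven fusion node is shown below. Topic names and message types are assumptions for illustration; the node simply caches the latest message from each source and runs its fusion step every 100 ms, independent of the individual sensor rates.

```python
#!/usr/bin/env python
# Minimal sketch of a timer-driven fusion node (topic names are assumptions).
import rospy
from sensor_msgs.msg import Image, PointCloud2

class FusionNode:
    def __init__(self):
        self.latest = {"camera": None, "lidar": None}
        rospy.Subscriber("/perception/yolop/detections_image", Image,
                         self.cache, callback_args="camera", queue_size=1)
        rospy.Subscriber("/perception/lidar/points", PointCloud2,
                         self.cache, callback_args="lidar", queue_size=1)
        # Constant 100 ms cycle, independent of the individual sensor rates.
        rospy.Timer(rospy.Duration(0.1), self.fuse)

    def cache(self, msg, source):
        self.latest[source] = msg          # keep only the newest sample

    def fuse(self, event):
        if None in self.latest.values():
            return                         # wait until every source has delivered
        # ... combine camera detections and lidar objects here ...
        rospy.logdebug("fused frame at %s", event.current_real)

if __name__ == "__main__":
    rospy.init_node("fusion_node")
    FusionNode()
    rospy.spin()
```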
Before running the camera stream through the NN, we rectified the images. We compared the detection output of the NN for the distorted and the undistorted image as input and achieved similar detection results. In multistep postprocessing, we derived the drivable area and the objects in the vehicle coordinate system (Figure 6).
Figure 6. The top-down transformation: the (a) original image with detections, (b) drivable area only, and (c) top-down view of drivable area.
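The two image-space steps can be sketched with OpenCV as follows. The intrinsic matrix, distortion coefficients, and ground-plane point correspondences are placeholders, not our calibration values; the sketch only illustrates undistortion before inference and an inverse-perspective (top-down) mapping of the drivable-area mask afterwards.

```python
import cv2
import numpy as np

K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])                 # intrinsic matrix (placeholder)
dist = np.array([-0.3, 0.1, 0.0, 0.0, 0.0])     # distortion coefficients (placeholder)

def rectify(frame):
    """Undistort the raw camera frame before it is fed to the network."""
    return cv2.undistort(frame, K, dist)

# Four pixel positions on the road plane and their target positions in a
# 500 x 500 px top-down image (placeholder correspondences).
src = np.float32([[700, 1000], [1220, 1000], [1050, 600], [870, 600]])
dst = np.float32([[200, 480], [300, 480], [300, 100], [200, 100]])
H = cv2.getPerspectiveTransform(src, dst)

def drivable_area_top_down(mask):
    """Warp the binary drivable-area mask into a bird's-eye view."""
    return cv2.warpPerspective(mask, H, (500, 500), flags=cv2.INTER_NEAREST)
```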
To validate the overall result of our implementation, we executed several open-loop test drives through the city of Augsburg and recorded the output of the processing (Figure 7). The qualitative evaluation shows that the network performs exceptionally well, even in challenging environments. We can verify the detection of 12 out of 13 object classes. We did not record any data containing trains, and a streetcar was not properly detected. In some situations, traffic lights were not properly detected (Figure 8). This problem occurs particularly frequently when the lights are mounted at a lower height. This could be because the main dataset is recorded in the United States, where traffic lights have different body colors and mounting heights. During test drives, the execution time was always below 50 ms for the complete processing chain.
Figure 7. The validation results: the (a) panoptic output of the network and (b) different object classes.
Figure 8. Traffic light misdetection: (a) wrong classification and false negative detection as well as (b) false negative detection.
We optimized an ROS-based inference stack using TensorRT and CUDA to improve real-time performance. Then, we integrated the system into a research vehicle and added the necessary transformations to make the output usable for our open planner-based driving stack. The addition of the object classes did not negatively affect the lane and drivable area segmentation. During open-loop test drives, we observed good detection performance while maintaining cycle times low enough for a low-power platform, such as the Nvidia Jetson TX2. Currently, the network struggles with traffic signal detection. In future work, we could mitigate these problems by increasing the size of the training set and by introducing a task-specific network that runs in tandem with YOLOP and efficiently postprocesses each traffic light candidate as a reduced region of interest with a small number of pixels for confirmation. To do this, the detection threshold must be shifted to a value that produces false positives, which are then reanalyzed, so that safety-critical false negatives are avoided.
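A possible shape of this two-stage confirmation is sketched below, assuming a hypothetical lightweight classifier for the cropped candidates. The thresholds and the classifier are placeholders; only the control flow (a deliberately low first-stage threshold followed by reanalysis of each candidate crop) reflects the proposal.

```python
import torch

LOW_THRESHOLD = 0.2       # deliberately low: tolerate false positives here
CONFIRM_THRESHOLD = 0.8   # the second stage must be confident to keep a light

def confirm_traffic_lights(image, candidates, light_classifier):
    """image: CHW tensor; candidates: (confidence, x1, y1, x2, y2) tuples of
    traffic-light detections from YOLOP with the lowered threshold applied;
    light_classifier: a hypothetical small network returning per-state logits."""
    confirmed = []
    for conf, x1, y1, x2, y2 in candidates:
        if conf < LOW_THRESHOLD:
            continue
        crop = image[:, y1:y2, x1:x2]            # small region of interest
        with torch.no_grad():
            probs = light_classifier(crop.unsqueeze(0)).softmax(dim=-1)
        score, state = probs.max(dim=-1)
        if score.item() >= CONFIRM_THRESHOLD:    # reanalysis confirms the light
            confirmed.append((int(state), (x1, y1, x2, y2)))
    return confirmed
```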
S. Kohnert, J. Stahler, R. Stolle, and F. Geissler, “Cooperative radar sensors for the digital test field A9 (KoRA9) – Algorithmic recap and lessons learned,” in Proc. Kleinheubach Conf., 2021, pp. 1–4, doi: 10.23919/IEEECONF54431.2021.9598409.
M. Pechinger, G. Schroer, K. Bogenberger, and C. Markgraf, “Cyclist safety in urban automated driving – Sub-microscopic HIL simulation,” in Proc. IEEE Int. Intell. Transp. Syst. Conf. (ITSC), 2021, pp. 615–620, doi: 10.1109/ITSC48978.2021.9565108.
M. Pechinger, G. Schroer, K. Bogenberger, and C. Markgraf, “Benefit of smart infrastructure on urban automated driving – Using an AV testing framework,” in Proc. IEEE Intell. Veh. Symp. (IV), 2021, pp. 1174–1179, doi: 10.1109/IV48863.2021.9575651.
D. Wu et al., “YOLOP: You only look once for panoptic driving perception,” Mach. Intell. Res., vol. 19, pp. 550–562, Dec. 2022, doi: 10.1007/s11633-022-1339-y.
F. Yu et al., "BDD100K: A diverse driving dataset for heterogeneous multitask learning," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, 2020, pp. 2633–2642, doi: 10.1109/CVPR42600.2020.00271.
Julian Stähler (julian.staehler@hs-augsburg.de) is with the Driverless Mobility Research Group, University of Applied Sciences, 86161 Augsburg, Germany.
Carsten Markgraf (carsten.markgraf@hs-augsburg.de) is with the Driverless Mobility Research Group, University of Applied Sciences, 86161 Augsburg, Germany.
Mathias Pechinger (mathias.pechinger@hs-augsburg.de) is with the Driverless Mobility Research Group, University of Applied Sciences, 86161 Augsburg, Germany.
David W. Gao (david.gao@du.edu) is with the School of Engineering and Computer Science, University of Denver, Denver, CO 80208 USA.