Julian Stähler, Carsten Markgraf, Mathias Pechinger, David W. Gao
Mobility is fundamental for the wealth and health of the world's population, and it has a significant influence on our daily life. However, with the increasing complexity of traffic, the need to transport goods, and growing urbanization, improving the quality of mobility in terms of travel time, use of space, and air quality becomes more challenging. An autonomous electric vehicle offers technology and potential for new mobility concepts in smart cities. Today, many vehicles have been developed with automated driving capabilities. In most cases, a safety driver is still required to intervene if the autonomous electric vehicle is not able to handle a situation in a safe and, at the same time, reliable way. One important aspect to achieve safety and reliability goals is a robust and efficient perception of the vehicle's environment.
The tragic accident in which an Uber self-driving car killed a pedestrian in 2018 highlighted the importance of perception in autonomous driving. In its investigation, the U.S. National Transportation Safety Board found that neither the Uber self-driving car nor its safety driver detected the pedestrian in time. Since this accident, the autonomous vehicle industry has been working to improve perception systems through the use of advanced sensors, machine learning algorithms, and other technologies.
A variety of sensor technologies are used in vehicles to detect objects and perceive the vehicles' surroundings. Cameras and radars are among the most widely used technologies for sensing systems, as they are inexpensive and reliable, and because they operate at different wavelengths, they are not susceptible to common-mode errors. While some companies rely solely on cameras and radars, others also use lidars in their sensing systems. These sensors are widely used in the industry, and the technology itself is continuously enhanced, leading to new developments, such as 4D lidar, which is capable of measuring not only the distance of an object but also its velocity by evaluating the frequency shift of the returned light. Various methods for the perception of the environment exist and rely on different kinds of sensors. Machine learning-based methods have evolved rapidly in recent years and are currently leading the field in perception, particularly in the tasks of object detection and classification.
This article shows the integration results of an extended You Only Look Once for Panoptic Driving Perception (YOLOP) convolutional neural network (NN). A camera-based approach for object detection and classification as well as free space and lane detection in a single deep NN (DNN) is developed. It is used in a hybrid electric research vehicle for environmental perception in automated driving. The layer structure of this NN was slightly changed, and the network was retrained for the detection and classification of up to 13 different kinds of objects. The quality of the outputs, from lane recognition to free space segmentation, is kept at a high level. Safe and reliable environmental perception is crucial for the homologation of automated vehicles for public roads at levels where no safety driver is required. This approach utilizes a shared backbone, which keeps the cycle time for evaluating the NN sufficiently low. The extension to multiobject detection and classification enables individual tracking according to object-specific mobility characteristics. Furthermore, sign and signal recognition are included as an important input for the behavior planning of automated vehicles.
In autonomous driving applications, safe and reliable perception of the environment is a major challenge. Recent developments show reliable results even in complex surroundings, keeping the number of false positive and false negative object detections at a low level. However, to increase multiobject tracking accuracy, ever larger DNNs are used with higher numbers of input and output parameters, processing data from diverse sensors in the vehicle. Therefore, computational performance constraints must be considered for in-vehicle applications with limited resources in real time. A very promising approach to achieve this goal can be found in a recent work called YOLOP. Here, the authors propose an architecture capable of performing several perception tasks in a single network. Three different kinds of detection (object detection, free space segmentation, and lane detection) are implemented in one NN with the help of a shared backbone. Very high computational performance is achieved, allowing the NN to be operated on low-cost embedded devices, such as the Nvidia Jetson TX2. However, so far, only one object class (vehicle) is identified at the output. For object motion prediction purposes, it is crucial to distinguish different object types. Furthermore, static objects that determine traffic rules (e.g., traffic signals and signs) are an important input for the behavior planner inside the vehicle's planning system.
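The shared-backbone idea can be illustrated with a toy PyTorch module: one feature extractor feeds three task-specific heads, so the expensive encoder runs only once per frame. This is a minimal sketch of the concept and not the actual YOLOP layer structure; layer sizes, head designs, and the default class count are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

class SharedBackbonePerception(nn.Module):
    """Toy multitask network: one shared encoder, three task heads."""

    def __init__(self, num_classes=13):
        super().__init__()
        self.backbone = nn.Sequential(                     # shared feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Three lightweight heads reuse the same feature map:
        self.detect_head = nn.Conv2d(64, 3 * (5 + num_classes), 1)  # boxes + classes
        self.area_head = nn.Conv2d(64, 2, 1)               # drivable area vs. background
        self.lane_head = nn.Conv2d(64, 2, 1)               # lane line vs. background

    def forward(self, x):
        features = self.backbone(x)                        # encoder runs once per frame
        return (self.detect_head(features),
                self.area_head(features),
                self.lane_head(features))

# One forward pass yields all three task outputs:
detections, drivable_area, lanes = SharedBackbonePerception()(torch.zeros(1, 3, 384, 640))
```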
In this work, the YOLOP network is extended and trained for further object classes that are already labeled in the BDD100K dataset (see Figure 1). Our approach is integrated into a research vehicle to validate the results in a real-world application in Augsburg, Germany, and to determine the optimization potential of the sensor fusion algorithm. For this purpose, the network is integrated into the Robot Operating System (ROS) to establish a connection to the vehicle's sensors and actuators. ROS is middleware for Unix-like systems, such as Ubuntu Linux, and is used heavily in robotics and autonomous driving research. This approach also facilitates debugging. Open-loop test drives were performed to collect data for a first subjective judgment of the detection quality. Subsequently, the system was connected to the vehicle's planning system, low-level controller, and actuators to validate the results in closed-loop evaluations.
Figure 1. Perception results in a crowded city environment based on data from the BDD100K dataset.
YOLOP is extended to 13 object classes, and training is executed on two additional datasets. The updated network with a new parameter set is then integrated into our research vehicle and used for open-loop test drives on public roads in Augsburg. An image of this vehicle is provided in Figure 2. The research car is a BMW ActiveHybrid 3 that is equipped with additional sensing systems, computing units for autonomous driving algorithms, and actuators. The system covers the full data processing pipeline and provides closed-loop test capabilities. On the roof of the vehicle, a 360° lidar sensor is mounted for object detection and localization purposes. At the rear of the roof, a differential global navigation satellite system is installed. This system enables the vehicle to position itself with centimeter-level accuracy. Other sensing systems, such as cameras, are mounted behind the windshield of the vehicle and provide perception redundancy and diversity.
Figure 2. The research vehicle used in this work.
The architecture of the test vehicle is divided into three main areas. This three-layer structure is a common design for an autonomous electric vehicle. Figure 3(a)–(c) describes the sensing systems, processing systems, and actuation systems.
Figure 3. The system architecture of the autonomous vehicle. The (a) sensing systems, (b) processing systems, and (c) actuation systems. ITS-G5: Intelligent Transportation System, 5.9 GHz; IMU: inertial measurement unit; GNSS: global navigation satellite system.
While control algorithms for actuation require hard real-time operation and a comparably high control frequency of more than 250 Hz, detection algorithms only need to achieve a lower frequency of 10 to 20 Hz. In between, dedicated tracking algorithms handle the interfacing and make predictions about the development of the driving situation, and planning algorithms use their output at each time step.
To integrate YOLOP into the vehicle's architecture, we extended the existing data flow (Figure 4). Due to this design choice, all data are available in every compute unit's processing system at any time. This enables running parts of the algorithm on different hardware platforms available in the vehicle. Camera and lidar data are converted into a common point cloud data representation using two dedicated nodes. These converted data are the source for all further processing and visualization.
Figure 4. The data flow of our detection system.
The training environment is based on a containerized setup for the actual training execution and an object store for artifacts and datasets. In our particular use case, we utilize Podman running on a Red Hat Enterprise Linux 8.5 host. This machine is equipped with 128 GB of memory and an Nvidia A6000 GPU with 48 GB of memory. The container image is based on the Nvidia NGC PyTorch image. The original YOLOP implementation was trained on the BDD100K dataset but already provided a modular approach that could be easily extended to additional datasets. While also training on the BDD100K dataset, we extended the training to more than one object class and utilized all object classes provided by the dataset (bus, 4× traffic light, sign, person, bike, truck, motorcycle, car, train, and rider). As a reference, we ran the training on the original training configuration and verified the results against the benchmark values. In a further step, we modified the training configuration to suit our hardware setup. Then, we integrated the necessary changes into the model structure to incorporate the new object classes and trained the whole model again. We started with the detection head only, verified the results, and continued with end-to-end training. This procedure turned out to be the most efficient.
In this section, we explain the modifications required to detect more than one object class. In a first step, we identify the object classes that we include in the network. This decision depends strongly on the classes available in the dataset used. Our primary dataset is BDD100K, which contains the following classes: bus, 4× traffic light, sign, person, bike, truck, motorcycle, car, train, and rider. The state of the traffic light is available as well, so we decompose the traffic light class into the four object classes green, red, yellow, and unknown. Table 1 lists the number of instances for each class in the BDD100K dataset.
Table 1. The object classes and corresponding IDs, with the number of instances in the BDD100k dataset.
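To make the extended label space concrete, the following sketch shows one way to encode the 13 classes and the traffic light decomposition in Python. The numeric IDs and the attribute handling are illustrative assumptions; the actual IDs used in this work are those listed in Table 1.

```python
# Illustrative class map for the extended network; the numeric IDs below are
# placeholders, and the actual IDs used in this work are listed in Table 1.
CLASS_NAMES = [
    "car", "bus", "truck", "train", "person", "rider", "bike", "motorcycle",
    "sign", "traffic_light_green", "traffic_light_red",
    "traffic_light_yellow", "traffic_light_unknown",
]
assert len(CLASS_NAMES) == 13
CLASS_IDS = {name: idx for idx, name in enumerate(CLASS_NAMES)}

def traffic_light_class(color_attribute):
    """Decompose a BDD100K traffic light label into one of four classes based
    on its color attribute ("green", "red", "yellow", anything else -> unknown).
    Treat the attribute handling as a sketch of the idea, not the exact code."""
    if color_attribute in ("green", "red", "yellow"):
        return CLASS_IDS["traffic_light_" + color_attribute]
    return CLASS_IDS["traffic_light_unknown"]
```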
Additional data are required, as we encountered too many false detections when the training set did not include situations and environments specific to German roads. There are several datasets available that include German cities, but to our knowledge, there is no set available that contains bounding box labels, drivable areas, and lane annotations at the same time. While it is possible to derive bounding box annotations from semantic datasets, such as Cityscapes, we chose the Kitti dataset and used only its object labels. Thus, we trained only the encoder and detection branch while freezing the rest of the network. For further comparison between our implementation and the original implementation, we consider only the training on the BDD100K dataset. In a second step, we update the detection head's output layer to include all 13 object classes by integrating the relevant part of the YOLOv5 network structure. The authors of YOLOP propose different training strategies. They compare multitask learning to single-task learning and find that the performance is close, but multitask learning yields a large saving in training time. However, we experienced slightly better results when starting with the object detection task training and then switching to multitask training.
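The following PyTorch fragment sketches the two model-side changes described above: resizing the detection head's output convolutions for 13 classes and freezing everything except the encoder and detection branch for the object-label-only training. The attribute names (model.detect.m, the "encoder"/"detect" parameter prefixes) are assumptions modeled on the YOLOv5-style head, not the exact identifiers in the YOLOP code.

```python
import torch.nn as nn

NUM_CLASSES = 13        # extended class set
NUM_ANCHORS = 3         # anchors per detection scale (YOLOv5-style head)
# Each anchor predicts (x, y, w, h, objectness) plus one score per class:
OUT_CHANNELS = NUM_ANCHORS * (5 + NUM_CLASSES)   # 3 * 18 = 54

def resize_detection_head(model: nn.Module) -> None:
    """Replace the 1x1 output convolutions of the detection head so that every
    scale predicts 13 classes (attribute names are assumptions, see above)."""
    model.detect.m = nn.ModuleList(
        nn.Conv2d(conv.in_channels, OUT_CHANNELS, kernel_size=1)
        for conv in model.detect.m
    )
    model.detect.nc = NUM_CLASSES
    model.detect.no = 5 + NUM_CLASSES   # outputs per anchor

def train_detection_branch_only(model: nn.Module) -> None:
    """Freeze the segmentation branches so only the encoder and the detection
    branch receive gradients (used when training on the Kitti object labels)."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(("encoder", "detect"))
```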
We evaluate the results after the final training (Figure 5) with all object classes. The detection still performs very well on the BDD100K dataset. The addition of the supplementary object classes has no significant negative impact on the lane detection and segmentation outputs (Tables 2 and 3).
Figure 5. Two different training results based on data from the BDD100K dataset. (a) Perception results in a suburban environment. (b) Perception results during the night.
Table 2. The drivable area: a comparison of the YOLOP multiclass with the results from different networks.
Table 3. The lane detection: a comparison of the YOLOP multiclass with the results from different networks.
For the integration into our research vehicle, the ability to run the network in real time is mandatory (in this case, the algorithm must run faster than the output rate of our camera, 20 frames per second). To accelerate the network execution, we utilize the Nvidia TensorRT framework. The TensorRT software development kit is tailored to high-throughput, low-latency use cases. It requires generating an optimized version of the DNN. This task takes about 2 min on our hardware but needs to be executed only when new weights are deployed to the vehicle. The authors of YOLOP provide a C++ implementation of their network, which we use as a starting point. We integrate GPU-accelerated dynamic input resizing, which allows the use of different cameras. In addition, the whole execution is accelerated with TensorRT and then fully integrated into ROS. This allows a more flexible integration with the rest of the vehicle and adds the benefit of the visualization tools that come with the ROS stack. The algorithms are packaged into ROS nodes, and the execution is triggered by new data coming from the sensors. This ensures that new data from the perception hardware are processed only once and allows us to incorporate sensors with different frequencies into our system. The synchronization of these data is handled by a fusion node, which runs with a constant cycle time of 100 ms and collects the data of all sources. Fusion execution can be based on a reference frequency derived from an input source or on a timer. Depending on the type of data, different measures for synchronization can be employed to increase the quality of the fused result. We decided to use the timer-based approach.
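A minimal Python/rospy sketch of such a timer-driven fusion node is shown below. Topic names and message types are assumptions for illustration; the node simply caches the latest message from each source and runs its fusion step every 100 ms, independent of the individual sensor rates.

```python
#!/usr/bin/env python
# Minimal sketch of a timer-driven fusion node (topic names are assumptions).
import rospy
from sensor_msgs.msg import Image, PointCloud2

class FusionNode:
    def __init__(self):
        self.latest = {"camera": None, "lidar": None}
        rospy.Subscriber("/perception/yolop/detections_image", Image,
                         self.cache, callback_args="camera", queue_size=1)
        rospy.Subscriber("/perception/lidar/points", PointCloud2,
                         self.cache, callback_args="lidar", queue_size=1)
        # Constant 100 ms cycle, independent of the individual sensor rates.
        rospy.Timer(rospy.Duration(0.1), self.fuse)

    def cache(self, msg, source):
        self.latest[source] = msg          # keep only the newest sample

    def fuse(self, event):
        if None in self.latest.values():
            return                         # wait until every source has delivered
        # ... combine camera detections and lidar objects here ...
        rospy.logdebug("fused frame at %s", event.current_real)

if __name__ == "__main__":
    rospy.init_node("fusion_node")
    FusionNode()
    rospy.spin()
```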
Before running the camera stream through the NN, we rectified the images. We compared the detection output of the NN for the distorted and the undistorted image as input and achieved similar detection results. In multistep postprocessing, we derived the drivable area and the objects in the vehicle coordinate system (Figure 6).
Figure 6. The top-down transformation: the (a) original image with detections, (b) drivable area only, and (c) top-down view of drivable area.
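The two image-space steps can be sketched with OpenCV as follows. The intrinsic matrix, distortion coefficients, and ground-plane point correspondences are placeholders, not our calibration values; the sketch only illustrates undistortion before inference and an inverse-perspective (top-down) mapping of the drivable-area mask afterwards.

```python
import cv2
import numpy as np

K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])                 # intrinsic matrix (placeholder)
dist = np.array([-0.3, 0.1, 0.0, 0.0, 0.0])     # distortion coefficients (placeholder)

def rectify(frame):
    """Undistort the raw camera frame before it is fed to the network."""
    return cv2.undistort(frame, K, dist)

# Four pixel positions on the road plane and their target positions in a
# 500 x 500 px top-down image (placeholder correspondences).
src = np.float32([[700, 1000], [1220, 1000], [1050, 600], [870, 600]])
dst = np.float32([[200, 480], [300, 480], [300, 100], [200, 100]])
H = cv2.getPerspectiveTransform(src, dst)

def drivable_area_top_down(mask):
    """Warp the binary drivable-area mask into a bird's-eye view."""
    return cv2.warpPerspective(mask, H, (500, 500), flags=cv2.INTER_NEAREST)
```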
To validate the overall result of our implementation, we executed several open-loop test drives through the city of Augsburg and recorded the output of the processing (Figure 7). The qualitative evaluation shows that the network performs exceptionally well, even in challenging environments. We can verify the detection of 12 out of 13 object classes. We did not record any data containing trains, and a streetcar was not properly detected. In some situations, traffic lights were not properly detected (Figure 8). This problem occurs particularly frequently when the lights are mounted at a lower height. This could be because the main dataset is recorded in the United States, where traffic lights have different body colors and mounting heights. During test drives, the execution time was always below 50 ms for the complete processing chain.
Figure 7. The validation results: the (a) panoptic output of the network and (b) different object classes.
Figure 8. Traffic light misdetection: (a) wrong classification and false negative detection as well as (b) false negative detection.
We optimized an ROS-based inference stack using TensorRT and CUDA to improve real-time performance. Then, we integrated the system into a research vehicle and added the necessary transformations to make the output usable for our open planner-based driving stack. The addition of the object classes did not negatively affect the lane and drivable area segmentation. During open-loop test drives, we observed good detection performance while maintaining cycle times low enough for a low-power platform, such as the Nvidia Jetson TX2. Currently, the network struggles with traffic signal detection. In future work, we could mitigate these problems by increasing the size of the training set and by introducing a task-specific network that runs in tandem with YOLOP and efficiently postprocesses each traffic light candidate as a reduced region of interest with a small number of pixels for confirmation. To do this, the detection threshold must be shifted to a value that produces false positives, which are then reanalyzed, so that safety-critical false negatives are avoided.
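A possible shape of this two-stage confirmation is sketched below, assuming a hypothetical lightweight classifier for the cropped candidates. The thresholds and the classifier are placeholders; only the control flow (a deliberately low first-stage threshold followed by reanalysis of each candidate crop) reflects the proposal.

```python
import torch

LOW_THRESHOLD = 0.2       # deliberately low: tolerate false positives here
CONFIRM_THRESHOLD = 0.8   # the second stage must be confident to keep a light

def confirm_traffic_lights(image, candidates, light_classifier):
    """image: CHW tensor; candidates: (confidence, x1, y1, x2, y2) tuples of
    traffic-light detections from YOLOP with the lowered threshold applied;
    light_classifier: a hypothetical small network returning per-state logits."""
    confirmed = []
    for conf, x1, y1, x2, y2 in candidates:
        if conf < LOW_THRESHOLD:
            continue
        crop = image[:, y1:y2, x1:x2]            # small region of interest
        with torch.no_grad():
            probs = light_classifier(crop.unsqueeze(0)).softmax(dim=-1)
        score, state = probs.max(dim=-1)
        if score.item() >= CONFIRM_THRESHOLD:    # reanalysis confirms the light
            confirmed.append((int(state), (x1, y1, x2, y2)))
    return confirmed
```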
S. Kohnert, J. Stahler, R. Stolle, and F. Geissler, “Cooperative radar sensors for the digital test field A9 (KoRA9) – Algorithmic recap and lessons learned,” in Proc. Kleinheubach Conf., 2021, pp. 1–4, doi: 10.23919/IEEECONF54431.2021.9598409.
M. Pechinger, G. Schroer, K. Bogenberger, and C. Markgraf, “Cyclist safety in urban automated driving – Sub-microscopic HIL simulation,” in Proc. IEEE Int. Intell. Transp. Syst. Conf. (ITSC), 2021, pp. 615–620, doi: 10.1109/ITSC48978.2021.9565108.
M. Pechinger, G. Schroer, K. Bogenberger, and C. Markgraf, “Benefit of smart infrastructure on urban automated driving – Using an AV testing framework,” in Proc. IEEE Intell. Veh. Symp. (IV), 2021, pp. 1174–1179, doi: 10.1109/IV48863.2021.9575651.
D. Wu et al., “YOLOP: You only look once for panoptic driving perception,” Mach. Intell. Res., vol. 19, pp. 550–562, Dec. 2022, doi: 10.1007/s11633-022-1339-y.
F. Yu et al., "BDD100K: A diverse driving dataset for heterogeneous multitask learning," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, 2020, pp. 2633–2642, doi: 10.1109/CVPR42600.2020.00271.
Julian Stähler (julian.staehler@hs-augsburg.de) is with the Driverless Mobility Research Group, University of Applied Sciences, 86161 Augsburg, Germany.
Carsten Markgraf (carsten.markgraf@hs-augsburg.de) is with the Driverless Mobility Research Group, University of Applied Sciences, 86161 Augsburg, Germany.
Mathias Pechinger (mathias.pechinger@hs-augsburg.de) is with the Driverless Mobility Research Group, University of Applied Sciences, 86161 Augsburg, Germany.
David W. Gao (david.gao@du.edu) is with the School of Engineering and Computer Science, University of Denver, Denver, CO 80208 USA.