Xiangkun He, Chen Lv
In recent years, electrified mobility (e-mobility), especially connected and autonomous electric vehicles (CAEVs), has been gaining momentum along with the rapid development of emerging technologies such as artificial intelligence (AI) and the Internet of Things. The social benefits of CAEVs are manifested in the form of safer transportation, lower energy consumption, and reduced congestion and emissions. Nevertheless, it is highly difficult to design driving policies that ensure road safety, travel efficiency, and energy conservation for all CAEVs in traffic flows, particularly in a mixed-autonomy scenario where both CAEVs and human-driven vehicles (HDVs) are on the road and interact with each other. Here we present a novel deep multiagent reinforcement learning (DMARL)-enabled energy-aware cooperative driving solution, enabling CAEVs to learn vehicular platoon management policies that guarantee overall traffic flow performance. Specifically, with the aid of information communication technology (ICT), CAEVs can share their vehicle state and learned knowledge, such as their state of charge (SoC), speed, and driving policies. Additionally, a cooperative multiagent actor–critic (CMAAC) technique is developed to optimize vehicular platoon management policies that map perceptual information directly to the group decision-making behaviors of the CAEV platoon. The proposed approach is evaluated in highway on-ramp merging scenarios with two different mixed-autonomy traffic flows. The results demonstrate the benefits of our scheme. Finally, we discuss the challenges and potential research directions for the proposed energy-aware cooperative driving solution.
E-mobility is recognized as an efficient way to achieve sustainable urban transport via reducing air pollution and oil dependence, which can bring potential environmental and health benefits [1]. Consequently, many countries have formulated goals and policies to deploy electric vehicles (EVs), and it is projected that EVs will account for around 30% of vehicle sales by 2030 [2]. Meanwhile, with the advent of the intelligence and connectivity age, e-mobility is undergoing a transformation.
Autonomous vehicles (AVs) have attracted considerable attention across the globe in recent years because of their potential to revolutionize human mobility and transportation systems. A growing number of governments, companies, and institutes have poured large amounts of effort and money into promoting the development of AVs, leading to the prediction that AVs may account for 65% of vehicle travel and 90% of vehicle sales in 2050 [3]. In addition, ICT, such as fifth-generation and vehicle-to-everything technology, has become an indispensable component in facilitating the advancement of AVs and EVs by enabling vehicles to communicate with their surrounding vehicles or devices [4].
Together, the above technologies are anticipated to bring substantial benefits and revolutionize modern transportation systems. E-mobility can be further upgraded by combining AVs and ICT. Hence, CAEVs have been recognized as a vital component of future intelligent transportation systems, with the potential to reduce noise, air pollution, and energy consumption while improving traffic safety and efficiency [5]. CAEVs are products of multidisciplinary knowledge and theories, integrating e-mobility, autonomous driving, and Internet of Vehicles (IoV) technologies. Decision-making systems, as the brains of CAEVs, directly embody their level of intelligence. In this context, AI will play a crucial role in enabling CAEVs.
AI technology has been widely leveraged in different domains like finance, healthcare, education, transportation, and social networking, and it has had a significant societal impact and benefit. In recent years, as an indispensable component of modern AI, reinforcement learning (RL) with deep neural networks (DNNs) has achieved impressive successes in a series of challenging decision-making tasks [6]. By virtue of its paradigm of learning through interaction with the environment, RL has shown the potential to liberate researchers from sophisticated and burdensome rule-based or optimization-based vehicle system design. As a result, many researchers have explored various schemes to apply RL algorithms to AVs or EVs. For instance, in [6], an autonomous racing agent is trained via a deep RL method called quantile regression soft actor–critic, which can compete with the best esports drivers in the world. In [7], a hybrid RL method that integrates the deep deterministic policy gradient method and deep Q-learning is developed to learn eco-driving policies for connected and autonomous vehicles (CAVs).
The existing RL-based schemes have achieved many compelling results; however, these solutions are mostly developed with single-agent RL methods for a single vehicle and treat all other traffic participants as part of the driving environment. From the perspective of the transportation system, they ignore the fact that multivehicle cooperation is vital to improving the performance of the overall traffic flow, which is a typical multiagent problem. Consequently, some studies have tried to apply multiagent RL (MARL) to optimize the driving policies of AVs or EVs. In [8], a MARL-based multivehicle control method is developed to manage CAVs at intersections. In [9], a MARL-based eco-driving control technique for hybrid EVs is advanced to keep a safe headway while minimizing energy consumption. In [10], the coordinated charging task of multiple EVs is modeled as a multiagent decision-making problem, which is solved via a MARL algorithm. In [11], a MARL framework is leveraged to control vehicular platoons for reducing fuel consumption in traffic oscillations.
Although the above studies demonstrate promising potential, there remains space for further improvement and refinement. First, to the best of our knowledge, MARL-enabled cooperative driving solutions for CAEVs have not been comprehensively explored, specifically regarding the development of CAEV platoon management techniques that consider the battery SoC. Second, credit assignment is a fundamental challenge in CAEV platoon management, especially when vehicles operate in a cooperative setting, which can give rise to the "lazy agent" problem: some agents become "lazy" and contribute little to the team's success, while other agents have to work harder to compensate for their lack of effort. Third, the nonstationary environment problem poses another challenge for CAEV platoon management, since the actions taken by one vehicle may affect the environment and, in turn, affect the learning of the other vehicles, particularly in the mixed-autonomy traffic scenario that includes both CAEVs (with complete observations) and HDVs (with incomplete observations).
Taking into account the above discussions and analyses, this article presents a novel DMARL-enabled energy-aware cooperative driving solution for CAEVs. The three primary contributions of this work are as follows. First, the CAEV platoon management task is formulated as a DCMA-MDP, in which each CAEV shares its key state information (e.g., SoC and speed) and its learned driving policy with the rest of the platoon via ICT. Second, a CMAAC technique is developed to optimize vehicular platoon management policies that map perceptual information directly to the group decision-making behaviors of the CAEV platoon.
Finally, the proposed solution is benchmarked in highway on-ramp merging scenarios with two different mixed-autonomy traffic flows. For a contrastive analysis, we implement three competitive baselines, including classical and state-of-the-art approaches. The assessment results demonstrate that our DMARL-enabled energy-aware cooperative driving technique for CAEVs surpasses the baselines and can dramatically improve overall traffic flow performance in terms of road safety, travel efficiency, and energy conservation.
In this section, we outline the overall framework of the proposed DMARL-enabled energy-aware cooperative driving solution for CAEVs. As shown in Figure 1, CAEVs (green) and HDVs (blue) coexist in a traffic environment composed of a four-lane highway and an on-ramp. The vehicles on the ramp have to interact with their surrounding vehicles to merge into the highway. Meanwhile, to ensure the travel efficiency, driving safety, and energy conservation of the overall traffic flow, the vehicles on the highway are required to cooperate with their surrounding CAEVs, HDVs, and the vehicles on the ramp. Furthermore, since the states of HDVs are partially observable, coordinating CAEVs in the presence of HDVs is a more challenging task than the case with only CAEVs.
Figure 1 The framework of the proposed DMARL-enabled energy-aware cooperative driving technique for CAEVs. Agent-n: the nth agent; N: the total number of agents; $s_t^n$: the local observational state of the nth agent at time step t; $c_t$: the shared communicating information of the entire CAEV platoon at time step t; $a_t^n$: the action taken by the nth agent at time step t; $r_t^n$: the reward function of the nth agent at time step t; $s_{t+1}^n$: the local observational state of the nth agent at time step t + 1; $d_t^n$: the done signal indicating whether $s_{t+1}^n$ is terminal.
To deal with these challenges, this work develops a DMARL-enabled multiagent decision-making solution for CAEVs. We formulate the CAEV platoon management task as a multiagent cooperative problem in which all agents share a common goal: to jointly maximize the global expected return. With the aid of IoV technology, in addition to observing the state of the surrounding traffic environment, each CAEV can also receive the state information (e.g., SoC and speed) and learned driving policy of the other CAEVs. A long-standing challenge in MARL is the large-scale state and action spaces, as the number of joint states and actions increases exponentially with the number of agents in the environment. To handle this issue, we consider a decentralized interacting and learning setting. That is, at every step of interacting with the environment, each agent performs an individual action based on both its local observation and the key information of the entire CAEV platoon. We design an individual reward function to guide each CAEV to improve travel efficiency and driving safety while keeping the battery SoC within the ideal working bounds. The data on how each agent interacts with the environment, i.e., the DCMA-MDP transitions, are stored in the experience memory.
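To make the stored interaction data concrete, the following minimal Python sketch shows one possible layout of a DCMA-MDP transition and of the experience memory; the field names, the `ReplayMemory` class, and its capacity are illustrative assumptions rather than the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass
import random
import numpy as np

@dataclass
class Transition:
    """One DCMA-MDP transition of a single agent (illustrative field names)."""
    state: np.ndarray        # local observational state s_t^n
    comm: np.ndarray         # shared communicating information c_t
    action: int              # high-level action a_t^n
    reward: float            # individual reward r_t^n
    next_state: np.ndarray   # s_{t+1}^n
    next_comm: np.ndarray    # c_{t+1}
    done: bool               # done signal d_t^n

class ReplayMemory:
    """Experience memory M that stores the transitions of all agents."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transitions: list[Transition]) -> None:
        # Save the DCMA-MDP transitions of all agents collected at one time step.
        self.buffer.extend(transitions)

    def sample(self, batch_size: int) -> list[Transition]:
        return random.sample(self.buffer, batch_size)
```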
Actor–critic (AC) is a powerful architecture that combines the benefits of value-based and policy-based methods, making it well suited for a wide range of RL problems. Hence, the CMAAC scheme is developed to optimize CAEV platoon management policies via the AC architecture, the centralized action-value function network, and the shared policy and value networks. To cope with the credit assignment and nonstationary environment problems, the CMAAC algorithm employs the decentralized training framework with the centralized action-value function. Consequently, the energy-aware cooperative driving policies for CAEVs can be learned by combining the CMAAC technique with the DCMA-MDP transitions in the experience memory.
In this section, we extend the existing MDP formulation to explicitly model multiagent cooperative behaviors under the shared communicating information and the decentralized interaction setting. Inspired by [12], the DCMA-MDP is presented here and defined as follows.
A DCMA-MDP can be represented by a 6-tuple (S, C, A, p, r, $\gamma$). S is the state space, C denotes the shared communication space, A is the action space, p represents the transition probability function, r denotes the individual reward function, and $\gamma \in (0, 1)$ is the discount factor.
We follow the decentralized interacting and learning architecture with shared policy model parameters, where each agent employs a shared policy $\pi(a \mid s, c; \theta)$ parameterized by $\theta$ to execute its action a based on its local observational state s and the shared communicating information c.
To solve the CAEV platoon management task, the local observational state, shared communicating information, action, and individual reward function of each agent are described below.
We define s as a vector of dimension V × F + 3, where V is the number of observed surrounding vehicles and F is the number of features per vehicle. We select the nearest V vehicles within a 200-m distance from the ego vehicle as observable vehicles; we empirically found that V = 5 achieves the best performance. The F = 5 features of each observed vehicle are a binary variable indicating whether the vehicle is observable to the ego vehicle, the longitudinal distance and speed of the observed vehicle relative to the ego vehicle, and the lateral distance and speed of the observed vehicle relative to the ego vehicle. Moreover, the local observational state also contains the speed, acceleration, and battery SoC of the ego vehicle. Hence, the local observational state of each agent has 28 dimensions. In practical applications, information about surrounding vehicles can be acquired through sensors such as cameras or lidar.
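As an illustration of how such a 28-dimensional observation could be assembled, the sketch below builds the state vector from the V = 5 nearest vehicles (five features each) plus the ego speed, acceleration, and SoC; the vehicle interface (x, y, vx, vy, speed, acceleration, and soc attributes) is an assumption for illustration only.

```python
import numpy as np

V, F = 5, 5          # number of observed vehicles and features per vehicle
OBS_RANGE = 200.0    # observation range in meters

def build_local_state(ego, vehicles):
    """Assemble the 28-dimensional local observational state (V * F + 3)."""
    # Keep the nearest V vehicles within the 200-m observation range.
    nearby = [v for v in vehicles
              if np.hypot(v.x - ego.x, v.y - ego.y) <= OBS_RANGE]
    nearby = sorted(nearby, key=lambda v: np.hypot(v.x - ego.x, v.y - ego.y))[:V]

    features = np.zeros((V, F), dtype=np.float32)
    for i, v in enumerate(nearby):
        features[i] = [
            1.0,            # observable flag
            v.x - ego.x,    # relative longitudinal distance
            v.vx - ego.vx,  # relative longitudinal speed
            v.y - ego.y,    # relative lateral distance
            v.vy - ego.vy,  # relative lateral speed
        ]

    ego_part = np.array([ego.speed, ego.acceleration, ego.soc], dtype=np.float32)
    return np.concatenate([features.flatten(), ego_part])  # shape (28,)
```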
To avoid the dimension explosion problem (i.e., the dimension of the state or action space increases exponentially with the number of agents), we select two key states of the entire vehicular platoon as the shared communicating information, including the average speed and the battery SoC of all the CAEVs in the traffic environment.
Since we focus on the group decision-making problem, here the action of each agent is high-level driving behavior, including left lane changing, right lane changing, cruising, speeding up, and slowing down.
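The two pieces of shared communicating information and the five high-level actions could be represented as follows; the action ordering and the use of platoon averages are assumptions made for illustration.

```python
from enum import IntEnum
import numpy as np

class DrivingAction(IntEnum):
    """High-level group decision-making behaviors (illustrative ordering)."""
    LANE_LEFT = 0
    LANE_RIGHT = 1
    CRUISE = 2
    SPEED_UP = 3
    SLOW_DOWN = 4

def shared_communicating_info(caevs):
    """Platoon-level speed and battery SoC shared among all CAEVs via IoV."""
    avg_speed = float(np.mean([v.speed for v in caevs]))
    avg_soc = float(np.mean([v.soc for v in caevs]))
    return np.array([avg_speed, avg_soc], dtype=np.float32)  # c_t, shape (2,)
```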
To make it easier for each agent to learn the driving policy, we design the individual reward function to guide CAEVs to enhance travel efficiency and driving safety while keeping the battery SoC within the ideal working bounds. Algorithm 1 overviews the design of the individual reward function, where $v^n$ and $\mathrm{SoC}^n$ denote the speed and battery SoC of the nth agent, respectively. According to the research results in [13], the ideal working range of the battery SoC is 20%–90%. Hence, when a CAEV keeps its battery SoC within the ideal working range or drives at high speed, it receives a reward signal to encourage desirable driving behaviors, and vice versa. In addition, if the vehicle collides, it receives a penalty signal to discourage unsafe driving behaviors. We carried out a number of experiments and tests to identify the best combination of the constants (e.g., 30, 0.2) in the reward function; a Python transcription of the pseudocode is sketched after Algorithm 1.
Input: The $s^n$ and $c$ of the nth agent.
2: $r^n = v^n / 30$. *Encourage the agent to be more efficient
3: if a collision occurs then
4: $r^n = r^n - 2$. *Penalize collision
5: end if
6: if $0.2 \le \mathrm{SoC}^n \le 0.9$ then
7: $r^n = r^n + 0.2 \cdot \mathrm{SoC}^n$. *Encourage the agent to maintain SoC
8: end if
9: if $\mathrm{SoC}^n < 0.2$ then
10: $r^n = r^n - (0.2 - \mathrm{SoC}^n)$. *Penalize low battery condition
11: end if
Output: $r^n$.
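A direct Python transcription of Algorithm 1 might look as follows; the function signature is ours, while the constants (30, 2, 0.2, and 0.9) follow the pseudocode above.

```python
def individual_reward(speed: float, soc: float, collided: bool) -> float:
    """Individual reward r^n of the nth agent, following Algorithm 1."""
    r = speed / 30.0            # encourage the agent to be more efficient
    if collided:
        r -= 2.0                # penalize collision
    if 0.2 <= soc <= 0.9:
        r += 0.2 * soc          # encourage the agent to maintain SoC
    if soc < 0.2:
        r -= 0.2 - soc          # penalize low battery condition
    return r
```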
We introduce the CMAAC approach in this section. To improve the learning efficiency and effectiveness of multiple agents, we leverage the paradigm of sharing the learned driving policies; that is, each agent, built on the AC architecture, employs the same policy network parameters and the same value network parameters. Furthermore, to handle the credit assignment issue, each agent is required to maximize its own action-value function. Meanwhile, to cope with the nonstationary environment problem and facilitate multiagent cooperation, each agent also needs to optimize the centralized action-value function. As a consequence, the shared policy network parameter $\theta$ of the nth agent can be learned by maximizing the following actor objective function: \[{J}_{\theta}\left(\theta\right) = \mathop{\mathbb{E}}\limits_{{T}_{m}\sim\mathrm{M}}\left\{\left({\pi}_{t}^{n}\right)^{T}\left[{\alpha}{Q}^{n}\left({s}_{t}^{n},{c}_{t},{\pi}_{t}^{n}\right) + \left(1-\alpha\right)\bar{Q}\left({\bar{s}}_{t},{\bar{\pi}}_{t}\right)\right]\right\}. \tag{1} \]
Here $\mathbb{E}$ represents the mathematical expectation, ${T}_{m}$ denotes the DCMA-MDP transitions sampled from the experience memory M, ${\pi}_{t}^{n}$ refers to $\pi\left(a^{n} \mid s_{t}^{n}, c_{t}; \theta\right)$, $\alpha$ is the weighting coefficient, ${Q}^{n}\left(\cdot\right)$ represents the nth agent's individual action-value function parameterized by $\varphi$, $\bar{Q}\left(\cdot\right)$ denotes the centralized action-value function parameterized by $\bar{\varphi}$, and ${\bar{s}}_{t}$ is the state of $\bar{Q}\left(\cdot\right)$. To avoid the dimension explosion issue, ${\bar{s}}_{t}$ only contains the average speed and SoC of the entire vehicular platoon and the collision status of each vehicle. ${\bar{\pi}}_{t} = \left\{{\pi}_{t}^{1},\ldots,{\pi}_{t}^{N}\right\}$ represents the set of all agent policies.
Additionally, the shared network parameter $\varphi$ of the nth agent's individual action-value function can be optimized by minimizing the following critic loss function: \begin{align*}{J}_{\varphi}\left(\varphi\right) = & \,\mathop{\mathbb{E}}\limits_{{T}_{m}\sim\mathrm{M}}\big[{r}_{t}^{n} + \gamma\left({\pi}_{t+1}^{n}\right)^{T}{Q}^{n}\left({s}_{t+1}^{n},{c}_{t+1},{\pi}_{t+1}^{n};\varphi\right) \\ & - {Q}^{n}\left({s}_{t}^{n},{c}_{t},{\pi}_{t}^{n};\varphi\right)\big]^{2}. \tag{2} \end{align*}
In this work, the centralized action-value function for the entire vehicular platoon can be additively decomposed into the individual action-value functions across vehicles. Hence, the network parameter $\bar{\varphi}$ of the centralized action-value function can be optimized by minimizing the following loss function: \[{J}_{\bar{\varphi}}\left(\bar{\varphi}\right) = \mathop{\mathbb{E}}\limits_{{T}_{m}\sim\mathrm{M}}\left[\bar{Q}\left({\bar{s}}_{t},{\bar{\pi}}_{t};\bar{\varphi}\right) - \mathop{\sum}\limits_{n=1}\limits^{N}{Q}^{n}\left({s}_{t}^{n},{c}_{t},{\pi}_{t}^{n};\varphi\right)\right]^{2}. \tag{3} \]
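Under one plausible reading of (1)–(3), in which the individual critic scores a single (state, communication, action) tuple and the policy-weighted terms are realized by enumerating the five discrete actions, the three objectives could be computed as in the PyTorch sketch below; the batch layout, tensor shapes, and the treatment of $\bar{\pi}_t$ are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cmaac_losses(policy, q_net, q_bar, batch, alpha=0.5, gamma=0.99):
    """Sketch of the actor objective (1), individual critic loss (2), and
    decomposition loss (3) for one sampled batch.

    Assumed batch fields and shapes: "states" (B, N, 28), "comms" (B, 2),
    "actions" (B, N) long, "rewards" (B, N), "dones" (B, N) float,
    "next_states" (B, N, 28), "next_comms" (B, 2),
    "global_states" (B, 2 + N)  # platoon averages plus collision status.
    """
    B, N, _ = batch["states"].shape
    n_actions = 5

    def with_comm(s, c):
        # Concatenate each agent's local state with the shared information c_t.
        return torch.cat([s, c.unsqueeze(1).expand(-1, N, -1)], dim=-1)  # (B, N, 30)

    def q_all_actions(sc):
        # Evaluate Q^n(s, c, a) for every discrete action a -> (B, N, A).
        a = torch.arange(n_actions, dtype=sc.dtype, device=sc.device)
        sca = torch.cat(
            [sc.unsqueeze(2).expand(-1, -1, n_actions, -1),
             a.view(1, 1, n_actions, 1).expand(B, N, n_actions, 1)], dim=-1)
        return q_net(sca).squeeze(-1)

    sc = with_comm(batch["states"], batch["comms"])
    pi = policy(sc)                                   # (B, N, A) action probabilities
    q_ind = q_all_actions(sc)                         # (B, N, A)

    # Centralized critic over the platoon-level state and the joint actions.
    joint_in = torch.cat([batch["global_states"], batch["actions"].float()], dim=-1)
    q_central = q_bar(joint_in).squeeze(-1)           # (B,)

    # (1) Actor objective: policy-weighted mixture of both critics (maximized).
    mixed = alpha * q_ind + (1.0 - alpha) * q_central.view(B, 1, 1)
    actor_loss = -(pi * mixed.detach()).sum(-1).mean()

    # (2) Individual critic loss: squared TD error with a bootstrapped target.
    with torch.no_grad():
        sc_next = with_comm(batch["next_states"], batch["next_comms"])
        target = batch["rewards"] + gamma * (1.0 - batch["dones"]) * \
            (policy(sc_next) * q_all_actions(sc_next)).sum(-1)
    q_taken = q_ind.gather(-1, batch["actions"].unsqueeze(-1)).squeeze(-1)
    critic_loss = F.mse_loss(q_taken, target)

    # (3) Decomposition loss: centralized value tracks the sum of agent values.
    v_sum = (pi * q_ind).sum(-1).sum(-1)              # over actions, then agents
    central_loss = F.mse_loss(q_central, v_sum.detach())

    return actor_loss, critic_loss, central_loss
```

In this sketch the actor gradient flows only through the policy probabilities, reflecting that (1) is used to update $\theta$ alone, while (2) and (3) update $\varphi$ and $\bar{\varphi}$, respectively.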
Algorithm 2 outlines the proposed approach in detail, where e denotes the episode index and E is the maximum number of episodes. Since DNNs are critical to the performance of the agents, they must be properly set up. The shared policy network, the shared individual action-value function network, and the centralized action-value function network are each designed with two fully connected hidden layers of size 128, using the rectified linear unit (ReLU) as the activation function in all hidden layers. The dimensions of the input and output of the shared policy network are 30 and 5, respectively. Furthermore, the dimensions of the input (i.e., state and action) and output of the shared individual action-value function network are 31 and 1, respectively. The dimensions of the input (i.e., global state and joint action) and output of the centralized action-value function network are 2 + 2N (i.e., the average speed and SoC of the entire vehicular platoon, the collision status, and the actions of the N agents) and 1, respectively. An illustrative implementation sketch of these networks is given after Algorithm 2.
1: Initialize the shared policy network parameter ${\theta},$ the shared individual action-value function parameter $\varphi,$ the centralized action-value function parameter $\bar{\varphi},$ and an empty memory M.
2: for episode step e = 1, 2, … E do
3: Sample initial local states and shared information.
4: for time step t = 1, 2, …T do
5: for agent n = 1, 2, … N do
6: Take action based on $\pi\left(a_{t}^{n} \mid s_{t}^{n}, c_{t}; \theta\right)$.
7: Execute $a_{t}^{n}$ in the environment and receive a transition:
8: $s_{t+1}^{n}, c_{t+1}, r_{t}^{n}, d_{t}^{n} \sim p\left(s_{t+1}^{n} \mid s_{t}^{n}, c_{t}, a_{t}^{n}\right)$.
9: end for
10: Save the DCMA-MDP transitions of all the agents in M.
11: end for
12: Sample a batch of the DCMA-MDP transitions from M.
13: for agent n = 1, 2, … N do
14: Update the parameter ${\theta}$ of ${\pi}^{n}$ via (1).
15: Update the parameter $\varphi$ of Qn via (2).
16: end for
17: Update the parameter $\bar{\varphi}$ of $\bar{Q}$ via (3).
18: end for
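As referenced before Algorithm 2, the three networks could be instantiated, for example, as the following PyTorch modules; the module names, the softmax output of the policy, and the value of N are illustrative assumptions.

```python
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 128) -> nn.Sequential:
    """Two fully connected hidden layers of size 128 with ReLU activations."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

N = 3  # number of CAEV agents (illustrative)

# Shared policy network: 30-dim input (28-dim local state + 2-dim shared info),
# 5-dim output (probabilities over the high-level actions).
policy_net = nn.Sequential(mlp(30, 5), nn.Softmax(dim=-1))

# Shared individual action-value network: 31-dim input (state, shared info,
# and action), scalar output.
q_net = mlp(31, 1)

# Centralized action-value network: (2 + 2N)-dim input (platoon-level state,
# collision status, and the N agents' actions), scalar output.
q_bar = mlp(2 + 2 * N, 1)
```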
Generally, the choice of hyperparameters requires a trial-and-error process, where different values are tested and compared to find the optimal set for a given task. Hyperparameters affect the performance and stability of the learning algorithm. For instance, a larger batch size can lead to more stable updates, but it may also slow down the learning process. Moreover, the weighting coefficient $\alpha$ affects the group cooperation ability and the individual learning effect. The main hyperparameters of our technique are provided in Table 1.
Table 1 The main hyperparameters of the proposed approach.
In this section, numerical experiments are conducted to benchmark the proposed DMARL-enabled energy-aware cooperative driving technique for CAEVs. We leverage the autonomous driving simulator provided in [14] to evaluate the proposed method.
As shown in Figure 1, CAEVs and HDVs coexist in a traffic environment composed of a four-lane highway and an on-ramp. To conduct a comprehensive evaluation of the proposed approach, we set up two mixed-autonomy traffic flows. The first traffic flow (traffic flow 1) randomly emits two to three CAEVs and one to two HDVs in each episode. The second traffic flow (traffic flow 2) randomly emits 10–15 CAEVs and two to four HDVs in each episode. The maximum speed of the entire traffic flow is set to 30 m/s. The ranges of acceleration and battery SoC for each vehicle are −10 m/s2 to 10 m/s2 and 0–1, respectively. The battery SoC is calculated as the ratio between the current energy and the total energy [13]. The battery SoC decreases as the number of time steps increases, but it is reset to 1 before each new episode starts.
Additionally, since we focus on the high-level multiagent decision-making problem, we do not consider the impact of low-level controls such as regenerative braking. In this work, the low-level controllers and the HDV model are provided by the autonomous driving simulator we adopt.
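For orientation, a merge scenario similar to the one used here can be configured with the public highway-env package roughly as below; the configuration values are illustrative, and the battery SoC bookkeeping is not part of the stock simulator, so it is sketched as an assumed wrapper.

```python
import gymnasium as gym
import highway_env  # noqa: F401  (importing registers the highway-env scenarios)

config = {
    "controlled_vehicles": 3,      # CAEV agents (illustrative)
    "vehicles_count": 2,           # surrounding HDVs (illustrative)
    "observation": {
        "type": "MultiAgentObservation",
        "observation_config": {"type": "Kinematics", "vehicles_count": 5},
    },
    "action": {
        "type": "MultiAgentAction",
        "action_config": {"type": "DiscreteMetaAction"},  # five high-level behaviors
    },
}

class SoCWrapper(gym.Wrapper):
    """Assumed add-on: per-agent battery SoC that decays at every time step."""
    def __init__(self, env, n_agents=3, decay=0.001):
        super().__init__(env)
        self.n_agents, self.decay, self.soc = n_agents, decay, None

    def reset(self, **kwargs):
        self.soc = [1.0] * self.n_agents   # SoC is reset to 1 before each episode
        return self.env.reset(**kwargs)

    def step(self, action):
        self.soc = [max(s - self.decay, 0.0) for s in self.soc]
        return self.env.step(action)

env = SoCWrapper(gym.make("merge-v0"))
env.unwrapped.configure(config)
obs, info = env.reset()
```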
The dynamic programming (DP) [15] and proximal policy optimization (PPO) schemes for CAEVs are implemented as two classical single-agent baselines for contrastive analysis. In other words, the single-agent baselines have no mechanism to communicate information or share policies among agents, and each agent approximates its driving policy through its own feedback information. In addition, multiagent PPO (MAPPO) [12] is a state-of-the-art MARL method; consequently, the MAPPO-based cooperative driving scheme for CAEVs is implemented as a state-of-the-art baseline.
To reduce the influence of accidental factors during model training, we perform five runs of each approach with different random seeds, which means that the neural networks of each method adopt different initialization parameters for each training run. We train the agents for 5,000 episodes, where one episode contains 100 time steps.
We assess the final policy models trained by each scheme over 1,000 testing episodes. During model testing, the models perform inference only, without exploration sampling. Figures 2 and 3 show the evaluation curves of the entire CAEV platoon's average battery SoC and collision rate for the baselines and our approach in the first mixed-autonomy traffic flow. The average is shown by the solid line, while the shaded area represents the standard deviation. Overall, the results indicate that the proposed scheme exceeds all the baselines in terms of energy conservation and driving safety. In addition, Figure 3 shows that, among the four solutions, only the DP-based vehicles are involved in collisions.
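Concretely, "inference without sampling" can be interpreted as always taking the most probable action; a minimal evaluation loop under this assumption (and assuming a gymnasium-style environment whose per-agent observations match the policy input) is sketched below.

```python
import numpy as np
import torch

@torch.no_grad()
def evaluate(env, policy_net, episodes=1000, max_steps=100):
    """Greedy (argmax) roll-outs of the trained shared policy."""
    collisions = 0
    for _ in range(episodes):
        obs, _ = env.reset()
        for _ in range(max_steps):
            x = torch.as_tensor(np.asarray(obs), dtype=torch.float32)
            probs = policy_net(x)                       # (N, 5) action probabilities
            actions = tuple(int(a) for a in probs.argmax(dim=-1))
            obs, reward, terminated, truncated, info = env.step(actions)
            if terminated:                              # assumed to signal a collision
                collisions += 1
                break
            if truncated:
                break
    return collisions / episodes                        # empirical collision rate
```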
Figure 2 The evaluation curves of the entire CAEV fleet's average battery SoC for the four approaches in the first mixed-autonomy traffic flow.
Figure 3 The evaluation curves of the entire CAEV fleet's collision rate for our approach and the three baselines in the first mixed-autonomy traffic flow.
Figures 4 and 5 show the testing curves of the entire CAEV platoon's average battery SoC and collision rate for the baselines and our solution in the second mixed-autonomy traffic flow. The proposed solution outperforms all the baselines by a large margin in terms of both energy conservation and driving safety.
Figure 4 The evaluation curves of the entire CAEV fleet's average battery SoC for the four approaches in the second mixed-autonomy traffic flow.
Figure 5 The evaluation curves of the entire CAEV fleet's collision rate for our approach and the three baselines in the second mixed-autonomy traffic flow.
Quantitatively, in Table 2, we report the average metrics for each solution during policy model testing. For instance, in the case of the first traffic flow, compared with the DP, PPO, and MAPPO methods, our solution gains approximately 2.65%, 3.01%, and 0.81% improvements in the average speed of the entire CAEV platoon, respectively. In contrast to the DP, PPO, and MAPPO schemes, the average battery SoC of the CAEV platoon based on CMAAC is improved by about 3.13%, 3.13%, and 4.76%, respectively.
Table 2 The performance of the four schemes in the different traffic flows.
On the second traffic flow task, the proposed scheme achieves about a 16.13% improvement in the average SoC of the entire CAEV platoon compared with the DP, PPO, and MAPPO baselines. Additionally, the proposed method gains approximately 81.48%, 75.00%, and 61.54% improvements in the average collision rate of the entire CAEV platoon in comparison with the DP, PPO, and MAPPO solutions, respectively. The average return of the entire CAEV platoon allows us to analyze the comprehensive performance of each scheme. Compared with the DP, PPO, and MAPPO baselines, the average return of the CAEV platoon based on CMAAC is improved by about 15.57%, 10.58%, and 7.67%, respectively.
Figure 6 visually illustrates the performance of the baselines and our solution in terms of the average return of the entire CAEV platoon, in the two different mixed-autonomy traffic flows. It can be found that our approach shows more significant advantages over the baselines in relatively complex situations (i.e., traffic flow 2).
Figure 6 The average return regarding DP-, PPO-, MAPPO-, and CMAAC-based solutions in the different mixed-autonomy traffic flows.
This article introduces a novel DMARL-enabled energy-aware cooperative driving solution for CAEVs, which learns vehicular platoon management policies to guarantee overall traffic flow performance. Specifically, we advance DCMA-MDP to model the CAEV platoon management problem. Additionally, a CMAAC technique is developed to optimize vehicular platoon management policies that map perceptual information directly to the group decision-making behaviors of the CAEV platoon. Finally, the proposed solution is benchmarked in highway on-ramp merging scenarios with two different mixed-autonomy traffic flows. The assessment results demonstrate that the proposed solution can dramatically improve the performance of the entire CAEV platoon in terms of road safety, travel efficiency, and energy conservation. In comparison with the three baselines, the CAEVs based on our technique show superior energy awareness and cooperation.
While we have demonstrated the potential of the proposed technique to upgrade the transportation system, two challenges remain. First, in the real world, the sensing and communication systems of CAEVs involve issues such as noise and delay, and these uncertainties may deteriorate the performance of the entire CAEV platoon and even cause serious accidents. Second, in real-world traffic scenarios, the interaction between CAEVs and HDVs is complex, potentially encompassing a combination of cooperation, competition, and imperfect information games.
Here we outline possible extensions and improvements to the proposed technique: 1) adopting diverse state representation methods, such as grid representation or images; 2) designing new reward functions, such as a weighted summation of the rewards; 3) utilizing real human data to model HDVs; 4) considering the impact of low-level controllers on the energy efficiency, such as incorporating a regenerative braking control system; 5) conducting more complex simulations that include diverse traffic elements, such as traffic signal timing or stop signs; and 6) exploring different neural network architectures, such as the transformer. Additionally, we believe that robust MARL techniques and game-theory-based MARL approaches have significant potential as competitive solutions for driving further advancements in CAEVs.
This work was supported in part by the Start-Up Grant (Nanyang Assistant Professorship), Nanyang Technological University, and in part by the Agency for Science, Technology and Research (A*STAR) Singapore under Advanced Manufacturing and Engineering (AME) Young Individual Research Grant (A2084c0156), the ANR-NRF joint Grant (NRF2021-NRF-ANR003 HM Science), and the MTC Individual Research Grants (M22K2c0079). Chen Lv is the corresponding author of this article.
Xiangkun He (xiangkun.he@ntu.edu.sg) is a research fellow at Nanyang Technological University, Singapore 639798. He received his Ph.D. degree in 2019 from the School of Vehicle and Mobility, Tsinghua University, Beijing, China. From 2019 to 2021, he was a senior researcher at Noah's Ark Lab, Huawei Technologies, China. His research interests include AVs, RL, decision-making, and control. He has contributed more than 40 peer-reviewed publications.
Chen Lv (lyuchen@ntu.edu.sg) is an assistant professor at Nanyang Technological University, Singapore 639798. He received his Ph.D. degree from the Department of Automotive Engineering, Tsinghua University, China, in 2016. His research focuses on advanced vehicles and human–machine systems, where he has contributed more than 100 papers and obtained 12 granted patents.
[1] M. Kandidayeni, M. Soleymani, A. Macias, J. P. Trovão, and L. Boulon, “Online power and efficiency estimation of a fuel cell system for adaptive energy management designs,” Energy Convers. Manage., vol. 255, pp. 1–11, Mar. 2022, doi: 10.1016/j.enconman.2022.115324.
[2] E. M. Bibra et al., “Global EV outlook 2021: Accelerating ambitions despite the pandemic,” 2021. [Online] . Available: https://iea.blob.core.windows.net/assets/ed5f4484-f556-4110-8c5c-4ede8bcba637/GlobalEVOutlook2021.pdf
[3] Y. Zhang, C. Li, T. H. Luan, C. Yuen, and Y. Fu, “Collaborative driving: Learning-aided joint topology formulation and beamforming,” IEEE Veh. Technol. Mag., vol. 17, no. 2, pp. 103–111, Jun. 2022, doi: 10.1109/MVT.2022.3156743.
[4] R. Molina-Masegosa and J. Gozalvez, “LTE-V for sidelink 5G V2X vehicular communications: A new 5G technology for short-range vehicle-to-everything communications,” IEEE Veh. Technol. Mag., vol. 12, no. 4, pp. 30–39, Dec. 2017, doi: 10.1109/MVT.2017.2752798.
[5] A. Coppola, D. G. Lui, A. Petrillo, and S. Santini, “Eco-driving control architecture for platoons of uncertain heterogeneous nonlinear connected autonomous electric vehicles,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 24,220–24,234, Dec. 2022, doi: 10.1109/TITS.2022.3200284.
[6] P. R. Wurman et al., “Outracing champion Gran Turismo drivers with deep reinforcement learning,” Nature, vol. 602, no. 7896, pp. 223–228, Feb. 2022, doi: 10.1038/s41586-021-04357-7.
[7] Q. Guo, O. Angah, Z. Liu, and X. J. Ban, “Hybrid deep reinforcement learning based eco-driving for low-level connected and automated vehicles along signalized corridors,” Transp. Res. C, Emerg. Technol., vol. 124, pp. 1–20, Mar. 2021, doi: 10.1016/j.trc.2021.102980.
[8] G. P. Antonio and C. Maria-Dolores, “Multi-agent deep reinforcement learning to manage connected autonomous vehicles at tomorrow’s intersections,” IEEE Trans. Veh. Technol., vol. 71, no. 7, pp. 7033–7043, Jul. 2022, doi: 10.1109/TVT.2022.3169907.
[9] Y. Wang, Y. Wu, Y. Tang, Q. Li, and H. He, “Cooperative energy management and eco-driving of plug-in hybrid electric vehicle via multi-agent reinforcement learning,” Appl. Energy, vol. 332, pp. 1–12, Feb. 2023, doi: 10.1016/j.apenergy.2022.120563.
[10] S. Li et al., “A multiagent deep reinforcement learning based approach for the optimization of transformer life using coordinated electric vehicles,” IEEE Trans. Ind. Informat., vol. 18, no. 11, pp. 7639–7652, Nov. 2022, doi: 10.1109/TII.2021.3139650.
[11] M. Li, Z. Cao, and Z. Li, “A reinforcement learning-based vehicle platoon control strategy for reducing energy consumption in traffic oscillations,” IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 12, pp. 5309–5322, Dec. 2021, doi: 10.1109/TNNLS.2021.3071959.
[12] C. Yu et al., “The surprising effectiveness of PPO in cooperative multi-agent games,” in Proc. 36th Conf. Neural Inf. Process. Syst. Datasets Benchmarks Track, vol. 35, Jul. 2021, pp. 24,611–24,624.
[13] M. A. Hannan, M. M. Hoque, A. Hussain, Y. Yusof, and P. J. Ker, “State-of-the-art and energy management system of lithium-ion batteries in electric vehicle applications: Issues and recommendations,” IEEE Access, vol. 6, pp. 19,362–19,378, 2018, doi: 10.1109/ACCESS.2018.2817655.
[14] E. Leurent. “An environment for autonomous driving decision-making.” GitHub. Accessed: 2018. [Online] . Available: https://github.com/eleurent/highway-env
[15] X. Tang, J. Chen, T. Liu, Y. Qin, and D. Cao, “Distributed deep reinforcement learning-based energy and emission management strategy for hybrid electric vehicles,” IEEE Trans. Veh. Technol., vol. 70, no. 10, pp. 9922–9934, Oct. 2021, doi: 10.1109/TVT.2021.3107734.
Digital Object Identifier 10.1109/MVT.2023.3291171