Yi Zhang, Robert W. Heath
Establishing and maintaining millimeter-wave (mm-wave) links is challenging due to the changing environment and the high sensitivity of mm-wave signals to user mobility and channel conditions. mm-Wave link configuration problems often involve a search, under environmental uncertainty, for the optimal system parameter(s) from a finite set of alternatives supported by the system hardware and protocol. For example, beam sweeping aims at identifying the optimal beam(s) for data transmission from a discrete codebook. Selecting parameters such as the beam sweeping period and the beamwidth is crucial to achieving high overall system throughput. In this article, we motivate the use of the multi-armed bandit (MAB) framework to intelligently search for the optimal configuration when establishing mm-wave links. MAB is a reinforcement learning framework that guides a decision maker to sequentially select one action from a set of actions. As an example, we show that within the MAB framework, the optimal beam sweeping period, beamwidth, and beam directions can be dynamically learned with sample- and computation-efficient bandit algorithms. We conclude by highlighting some future research directions on enhancing mm-wave link configuration design with MAB.
Achieving the highest performance in a wireless communication network requires constant readjustment to changing channel and network conditions. Such optimizations happen across all layers over multiple timescales to achieve end-to-end performance objectives. Many of these algorithms involve successive operations of adaptation and reconfiguration. For example, beamforming in an mm-wave network might involve frequent beam training (finding the best beam in a beam-training codebook [1]) and a less frequent reconfiguration (finding the best beam-training codebook size among several possible codebooks [2]). Solving the reconfiguration problem is challenging as the optimal solution changes over time due to the dynamics in the environment and mobility in the network, and it is unrealistic to find the optimal solution at each configuration point due to the large training overhead. Sequential decision making, a type of reinforcement learning, is one approach for formulating and solving such problems.
In this article, we motivate the use of the MAB framework [3] to solve sequential decision-making problems in mm-wave and sub-THz wireless communication networks. Such problems occur frequently when operating an mm-wave link in higher mobility settings such as mobile ad hoc networks, vehicular networks, and cellular networks. A typical MAB framework consists of a decision maker (aka agent) that sequentially selects one action (aka arm) from a set of actions. A scalar reward is then revealed to the agent for the selected action. The typical challenge in a MAB problem is to design an arm selection algorithm with optimality and convergence guarantees. MAB has been successful in many applications of reinforcement learning, such as assessing treatment effectiveness in clinical trials and predicting user preference in recommendation systems [4].
MAB has recently been applied to mm-wave link configuration problems. Most existing applications focus on finding the optimal beams for beam training/tracking in mm-wave networks; see [5], [6], [7] and the references therein. More recent examples include selecting the optimal beamwidth for beam training in [8] and our prior work [2]. Prior work has shown how MAB can improve cumulative system throughput compared to reasonable nonsequential solutions. MAB stands out for its small sample and computational complexity. Since it does not rely on neural network (NN) training as deep reinforcement learning (DRL) does, it can operate as a purely online learning methodology. Our recent work [9] has shown that MAB can achieve a throughput performance similar to that of DRL-based optimization with fewer samples.
In this article, we show the power and efficiency of MAB in solving link configuration problems in mm-wave communication networks related to configuring the beamforming weights of the transmit and receive antenna arrays. We explain how selecting the beam sweeping period, the beamwidth of the beams in the codebook, and even the candidate beams in the codebook can be formulated as sequential decision-making problems. We introduce key background on MAB to explain how it can be used to solve these problems. We provide some examples of performance in terms of system throughput in a mobile mm-wave network. Finally, we highlight some potential future research directions for exploiting MAB in mm-wave system designs. While we focus the description and simulation study on mm-wave communication at frequencies below 100 GHz, the ideas can be applied to sub-THz bands by adjusting the system parameters, such as the antenna array size, and updating the channel models.
Beamforming using antenna arrays is critically important for mm-wave communication [10]. The array gain obtained with well-configured antennas overcomes losses due to shrinking antenna size and wider communication bandwidths. In IEEE 802.11ad/ay and 5G, the de facto approach for configuring those antenna arrays is beam sweeping (aka beam training). Though there are many variations, the key idea is for the transmitter and receiver to estimate the link performance with different transmit-receive beam pairs and then to exchange information and agree on the best pair for communication.
Generally speaking, there is tension among the codebook size, training frequency, and net system throughput. In the absence of side information, achieving a high array gain typically requires time to search over candidate beams from a larger codebook, which takes away from time used for communication with the optimum pair. If the beam coherence time is long, as in a fixed wireless scenario, that cost may be sufficiently amortized. But if the beam coherence time is short, as in a vehicular ad hoc network, it may take too long to find the optimum beam configuration, leaving little or no time for communication. The problems are exacerbated by uncertainty in the system due to network dynamics, user mobility, channel conditions, and blockages. For example, beam training may have to be performed more frequently for a codebook of narrow beams due to misalignment errors, potentially canceling out the benefits compared to using wider beams with less array gain.
The codebook size, training frequency, and throughput dilemma can be solved in part by dynamically adjusting the system parameters to local conditions as they change. This leads to what is known as a sequential decision-making problem where the system (acting as the agent) chooses the system parameters from a set of possible parameters (the action) to achieve good average throughput (the reward). As explained in this article, once link configuration problems are formulated in the sequential decision-making framework, they can be solved efficiently using MABs. Essentially, the choice of the system configuration is adjusted over time based on estimates of performance and feedback to the agent. A good algorithm balances exploration (trying different configurations) and exploitation (using a configuration that works well). MAB is well suited to these configuration selection problems in mm-wave systems due to its sampling and computational efficiency.
In the rest of this section, we describe three important mm-wave link configuration problems, including what parameters need to be optimized. Note that these parameters could be optimized independently or jointly, depending on the real system protocol design and hardware capability. Later in the article, we provide numerical examples that give insights into the gains that can be achieved when using MABs to solve these problems.
The beam sweeping period is one system parameter that may be adjusted in a typical mm-wave communication system. In 5G New Radio release 15, there are five possible beam sweeping periods: 10, 20, 40, 80, and 160 ms. The beam sweeping period is the time between attempts at beam training, and the optimum value depends not just on the geometry of the environment but also on the mobility of devices and objects acting as potential blockers in that environment. For example, consider a vehicle-to-infrastructure communication link on a multilane highway, as shown in Figure 1. Traveling faster on the road generally requires a shorter beam training period than going slowly down the road. That period may be further reduced if there are many trucks on the highway blocking the signal, while it could be larger if the road is empty. Using the shortest period of 10 ms all the time would lead to the most resilient connection but also to high overhead during the occasions when the beams do not need to be swept so frequently. In this case, the sequential decision-making problem is for the system to adapt to the optimum sweeping period over time.
Figure 1 Beam sweeping period optimization. BA: beam alignment; DT: data transmission; EXPRS: express.
The beamwidth is a parameter that may be optimized as a means to realize beam-training codebooks with different sizes. In line-of-sight channels, the highest array gain is achieved using beams that are narrow and matched to the array response for the direction of the angle of arrival or departure. It can be advantageous, however, to build codebooks that have wider beams. Such codebooks can be smaller since fewer beams are required to cover a given angular area. As such, they can be searched more quickly. The selected beam is also optimum over a larger angular range, perhaps increasing the beam sweeping period. The main drawback of wider beams is that they sacrifice array gain, reducing the rate achieved during communication and possibly requiring a longer dwell time per beam. The choice of the optimum beamwidth is both scenario and device dependent. For example, as shown in Figure 2, wider beams benefit devices that are moving quickly (they are more resilient to beam pointing errors) and those that are closer to the base station (BS) (they do not need the array gain derived from searching a larger codebook). Any inappropriate choice could potentially degrade overall system throughput [2]. In this situation, the sequential decision-making problem is for the system to adapt the beamwidth, and therefore the codebook size, over time.
Figure 2 Beamwidth optimization.
The propagation channel at mm-wave bands has fewer strong paths compared to lower frequencies since there are more losses due to scattering. Moreover, the devices are normally nonuniformly distributed on a given site. Taken together, these imply that among the available candidate beams, it is plausible that only a few of them are potentially feasible, while others may point to the spatial directions where mm-wave propagation channels rarely exist, as shown in Figure 3. Identifying a small set of useful beams and using it for link configuration could greatly reduce reconfiguration overhead without significantly degrading the link quality. This allows obtaining the benefits of smaller codebook sizes along with those of higher gain narrow beams. In this case, the sequential decision-making problem is to find a downsampled codebook that consists of the potentially feasible beams selected from a larger codebook.
Figure 3 Codebook downsampling. For later numerical evaluation, the user randomly moves and self-rotates within a predefined area, which is a circle in which the center coordinate is (21.21,21.21) m and the radius r is 10 m. The initial location of the user is randomly generated within the circle.
A classical MAB framework describes an agent that performs sequential decision making in a discrete-time setting. At each time step, the agent selects one action from a set of actions, and a scalar reward is then revealed to the agent given this selected arm. The provider of this reward is called the environment, which captures the overall system uncertainty. Mathematically, supposing there are $K$ arms, the reward of the $k$th arm is modeled as an unknown distribution $\pi_k$ with mean $\mu_k$. We focus on stationary $\pi_k$ in this article for simplicity; the nonstationary case is discussed later. The agent may make a nonoptimal decision at any time step due to the lack of information on the reward distributions $\pi_k$. The MAB framework accordingly defines a term called regret to quantify the loss of reward when selecting a nonoptimal arm. Specifically, denoting $I[t]$ as the index of the arm pulled, $\text{reward}[t]$ as the scalar reward whose expectation is $\mu_{I[t]}$, and $\text{regret}[t]$ as the regret incurred in the $t$th time step, the regret is defined as the difference between the highest expected achievable reward, denoted by $\mu_{k^*}$, and the expected reward of the selected arm $I[t]$. The objective of the MAB framework is to minimize the accumulated regret over a finite-time horizon.
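Under these definitions, the per-step regret and the finite-horizon objective can be written compactly as follows (a standard formulation consistent with the definitions above, with $T$ denoting the horizon length):
$$\text{regret}[t] = \mu_{k^*} - \mu_{I[t]}, \qquad \min \; \mathbb{E}\!\left[\sum_{t=1}^{T} \text{regret}[t]\right] = \min \; \sum_{t=1}^{T}\left(\mu_{k^*} - \mathbb{E}\!\left[\mu_{I[t]}\right]\right).$$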
An mm-wave system can be viewed as an agent. Then, for example, different beam sweeping periods or beamwidths can be regarded as different actions in the MAB framework. Alternatively, a set of beams, or a pair consisting of a beam sweeping period and a beamwidth, can be treated as an action. The reward is then the effective data rate achieved with the selected configuration. As noted earlier, the main advantages of MAB for mm-wave link configuration are its small sample complexity, its low computational cost, and the fact that it can operate as a purely online method without offline NN training. MAB is therefore most suitable for use cases where fast deployment and low computational complexity are the crucial features. Compared with DRL, however, MAB does have limitations when the system becomes more complicated: MAB is stateless, while DRL retains more historical information for future prediction and can also exploit NNs for better function approximation.
A MAB algorithm refers to a policy used by the agent to decide which arm to play at the current moment given all historical actions and the corresponding feedback $\{(\text{reward}[t], I[t])\}$. Most MAB algorithms for selecting arms are based on the upper confidence bound (UCB) [3] or Thompson sampling (TS) technique [11]. Their key ideas are briefly reviewed as follows.
The UCB-type algorithm directly focuses on estimating the mean reward of each arm. By gradually gathering more knowledge of the arms, it adaptively moves from exploring different arms to concentrating on exploiting a certain arm. The UCB-type algorithm constructs, for each arm, an index such that the true mean reward lies below this upper confidence bound with high probability. At each time step, the arm with the highest UCB is chosen. Typically, the UCB is designed as the sum of the empirical reward estimate and an exploration term that shrinks as the arm is played more often. Accordingly, as time evolves, a better reward estimate is achieved, and the decision-making process relies primarily on the empirical estimate. There are many variants of UCB-type algorithms that revise the UCB for different purposes.
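To make the idea concrete, the following is a minimal Python sketch of the classical UCB1 index policy [3]. It assumes rewards normalized to [0, 1] and is meant only as an illustration; the simulations later in this article use the KL-UCB variant [13].

```python
import math

class UCB1:
    """Minimal UCB1 index policy (sketch); rewards assumed in [0, 1]."""

    def __init__(self, num_arms):
        self.counts = [0] * num_arms   # times each arm has been played
        self.means = [0.0] * num_arms  # empirical mean reward per arm
        self.t = 0                     # total number of plays so far

    def select_arm(self):
        self.t += 1
        # Play every arm once before relying on the index.
        for k, n in enumerate(self.counts):
            if n == 0:
                return k
        # Index = empirical mean + exploration bonus that shrinks with plays.
        ucb = [m + math.sqrt(2.0 * math.log(self.t) / n)
               for m, n in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=lambda k: ucb[k])

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.means[arm] += (reward - self.means[arm]) / n  # running average
```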
TS is a Bayesian-type approach that estimates the distribution of the reward with respect to a given prior distribution. Unlike the UCB-type algorithm, TS selects the arm that has the largest probability of being optimal given the available historical observations. The philosophy of TS is briefly explained as follows. TS first models the reward by a likelihood function parameterized by an unknown parameter $\theta$ and then assumes a prior distribution over $\theta$. At each round of decision making, TS samples a $\tilde{\theta}$ from the current posterior distribution and then selects the arm that has the largest expected reward under $\tilde{\theta}$. Finally, the reward provided by the selected arm is used to update the posterior distribution of $\theta$ via Bayes' rule. As time evolves, the accumulated observations concentrate the posterior around the true parameter; hence, better decisions are made. TS achieves comparable or even better performance compared to UCB-type policies [11]. Moreover, TS does not need parameter tuning. TS is favorable when a prior distribution of the arms' rewards is available and the posterior update is computationally efficient. For example, a Dirichlet distribution could be used as a prior to model an arm whose reward follows a categorical distribution.
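For comparison, here is a minimal Thompson sampling sketch for the special case of Bernoulli (0/1) rewards with a Beta prior on each arm; continuous rewards such as data rates would require a different likelihood model or discretization, as in the multinomial TS of [2]. This is an illustrative sketch rather than the exact algorithm used in our evaluations.

```python
import random

class BernoulliTS:
    """Minimal Thompson sampling sketch for Bernoulli rewards,
    with a Beta(1, 1) prior on each arm's success probability."""

    def __init__(self, num_arms):
        self.alpha = [1.0] * num_arms  # posterior successes + 1
        self.beta = [1.0] * num_arms   # posterior failures + 1

    def select_arm(self):
        # Sample a plausible mean reward for each arm from its posterior,
        # then play the arm whose sample is the largest.
        samples = [random.betavariate(a, b)
                   for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda k: samples[k])

    def update(self, arm, reward):
        # reward is assumed to be 0 or 1 in this sketch.
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```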
In this section, we will showcase how to use the MAB framework to make decisions among the available link configuration settings in mm-wave systems. In particular, we provide link-level simulations to demonstrate the performance improvement, in terms of data rate, achieved by using the MAB framework.
Our basic setup is an mm-wave network where a BS [or an access point (AP)] maintains communication with a mobile user through a beam sweeping-based mechanism (2D beam sweeping is considered here). The user moves (with random speed and direction) and self-rotates within a circular area, as shown in Figure 3. To mimic the dynamic propagation environment, we model the user as experiencing random blockage with a certain probability, meaning that its direct link to the BS randomly suffers a signal attenuation of $\mathrm{PL}^{\text{block}}$ dB.
The studied mm-wave system is modeled in a discrete-time setting where each communication time slot corresponds to one complete round of decision making. To be specific, at the beginning of each time slot, the BS selects one of the available configuration setups, first performs beam sweeping, and then uses the beam with the largest signal-to-noise ratio (SNR) for data transmission to the user. At the end of the time slot, the effective data rate is calculated as the total data transmitted divided by the time slot duration. We use MAB algorithms to select the configuration setup at each time slot.
The key simulation parameters are summarized in Table 1; some unlisted parameters will be given in the specific scenarios that are to be discussed. For more details, e.g., user mobility simulation, the source code is accessible on GitHub (https://github.com/yzhang417/MAB-MmWave-Link-Config). All the evaluation results presented later are averaged over 500 realizations. In the following, three link configuration scenarios with progressive complexity are investigated.
Table 1 Link-level simulation parameters.
For this first scenario, we consider optimizing the beam sweeping period for the given basic setup. We suppose that five alternatives for the beam sweeping period (i.e., the time slot duration) are supported by the system protocol: 10, 20, 40, 80, and 160 ms. In the beam sweeping process, 256 sectors (around 1.4° in terms of beamwidth) are used by the BS, and 16 sectors (around 22.5° in terms of beamwidth) are used by the user. To apply the MAB framework, we view each beam sweeping period as an arm, and the reward is the effective data rate achieved at the end of the time slot, as illustrated in Figure 4(a).
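As a sketch of how scenario I maps onto the MAB framework, the loop below treats the five periods as arms and feeds back the normalized effective data rate as the reward, reusing the UCB1 class from the earlier sketch. The function simulate_slot is a hypothetical stand-in for the link-level simulator (the actual code is in the GitHub repository), and max_rate_gbps is an assumed normalization constant.

```python
# Hypothetical wiring of scenario I: each beam sweeping period is an arm,
# and the reward is the normalized effective data rate of the time slot.
PERIODS_MS = [10, 20, 40, 80, 160]

def run_scenario_one(simulate_slot, num_slots=500, max_rate_gbps=4.0):
    policy = UCB1(num_arms=len(PERIODS_MS))  # UCB1 from the earlier sketch
    rates = []
    for _ in range(num_slots):
        arm = policy.select_arm()
        rate_gbps = simulate_slot(period_ms=PERIODS_MS[arm])  # effective rate
        policy.update(arm, min(rate_gbps / max_rate_gbps, 1.0))  # map to [0, 1]
        rates.append(rate_gbps)
    return rates
```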
Figure 4 (a) Each beam sweeping period is an arm. (b) Evolution of the effective data rate versus time slot index. Each time slot corresponds to one beam sweeping period. Within 500 time slots, the bandit policies converge to the genius policy and improve the data rate by 22% with respect to a random selection policy and by 60% with respect to the potentially worst policy. (a) MAB adaption in scenario I (beam sweeping period selection). (b) The evaluation result in scenario I. KL-UCB: Kullback-Leibler UCB.
The evaluation results of the bandit-based policies, along with some benchmarks, are given in Figure 4(b). Instead of showing cumulative regret, we plot the evolution of the effective data rate during the learning process, since improving the data rate is the ultimate goal from the perspective of the wireless context. Evaluating all the static policies, each of which uses a fixed arm throughout, the genius policy is the one that provides the highest average data rate, and the worst policy is the one that results in the lowest data rate. Both TS-type and UCB-type bandit policies are evaluated. The optimal data rate is obtained by exhaustively testing all arms in each slot and recording the largest data rate. Some interesting observations can be drawn from Figure 4(b).
We now further consider a scenario where the beamwidth used by the BS is not necessarily fixed as in scenario I. Instead, we assume that six beam sets of different beamwidths are available at the BS, sorted from wide to narrow and corresponding to 16, 32, 64, 128, 256, and 512 sectors, respectively. Each beam set consists of beams sharing the same width and together covering the entire spatial area. It is worth pointing out that the generation of beams of different resolutions could be achieved by beamforming optimization or antenna on-off techniques [2]. The optimal beamwidth is determined by both the user channel conditions and the beam sweeping period. This motivates us to jointly configure the beam sweeping period and the beamwidth.
For this joint optimization, a straightforward application of MAB is to consider 5 × 6 arms (five beam sweeping periods and six beamwidth levels) and then apply the classical MAB algorithms. This approach, however, does not exploit the fact that the action space (30 arms in this scenario) can be factored: each beam sweeping period is combined with one beamwidth level to form an arm. To exploit this special structure, another type of bandit approach decomposes the action space into multiple sequential subactions and decides on the subactions sequentially, which is referred to as Monte Carlo tree search (MCTS) [12]. We further explain MCTS as follows. As shown in Figure 5(a), each nonleaf node represents a set of actions (arms) and stores the empirical reward of this set of actions, while each leaf node represents a single action.
Figure 5 (a) The mechanism of bandit-based MCTS. Each nonleaf node represents a set of actions, and it stores the empirical reward of this set of actions. All five nonleaf nodes have their own six child nodes, though only the child nodes of the first node (10 ms) are shown as an example. Each leaf node represents a single action. One node is selected at each depth by a MAB algorithm until a leaf node is reached. The interaction with the environment occurs only at the leaf node, and the reward would be backpropagated to its ancestors to update the reward estimation. (b) The evolution of the effective data rate versus the time slot index. Each time slot corresponds to one beam sweeping period. For the bandit-based MCTS policy, the KL-UCB algorithm [13] is used to select the nodes. Within 500 time slots, the MCTS policy converges to the genius policy and improves the data rate by 30% with respect to a classical bandit policy, by 70% with respect to a random selection policy, and by more than 850% with respect to the potentially worst policy. (a) MAB adaption in scenario II (joint beam sweeping period and beamwidth selection). (b) The evaluation result in scenario II.
At the beginning of each round of decision making, the MCTS algorithm starts from the root node and traverses down until a leaf node is selected. Each node on the path is selected by a MAB algorithm. The action associated with the selected leaf node is executed in the environment to obtain a reward. Finally, the reward is backpropagated to the ancestors of the leaf to update the estimates of their arms. The intuition is that MCTS keeps tracking the estimated reward of the nonleaf nodes encountered in the earlier subaction selections. Therefore, when certain nonleaf nodes are reencountered, their estimates can be used to bias the decision on the next subaction, which may speed up the convergence.
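The following sketch illustrates the two-level decomposition for scenario II: a root bandit selects the beam sweeping period, one child bandit per period selects the beamwidth (number of sectors), and the leaf reward is backpropagated to both levels. It reuses the UCB1 class from the earlier sketch as the per-node policy, whereas the evaluations in this article use KL-UCB [13] at each node.

```python
# Sketch of a two-level bandit tree for scenario II (period, then beamwidth).
PERIODS_MS = [10, 20, 40, 80, 160]
SECTOR_COUNTS = [16, 32, 64, 128, 256, 512]

class TwoLevelBandit:
    def __init__(self):
        # Root bandit over periods; one child bandit per period over beamwidths.
        self.period_bandit = UCB1(num_arms=len(PERIODS_MS))
        self.width_bandits = [UCB1(num_arms=len(SECTOR_COUNTS))
                              for _ in PERIODS_MS]

    def select(self):
        p = self.period_bandit.select_arm()     # nonleaf node (period)
        w = self.width_bandits[p].select_arm()  # leaf node (beamwidth)
        return p, w

    def update(self, p, w, reward):
        # Backpropagate the leaf reward to the selected leaf and its ancestor.
        self.width_bandits[p].update(w, reward)
        self.period_bandit.update(p, reward)
```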
The evaluation results of the MAB-based joint beam sweeping period and beamwidth optimization are shown in Figure 5(b). We first observe that the bandit-based MCTS policy is superior to the random selection policy, providing more than a 68% data rate improvement within 500 time slots (5–80 s). In addition, the potentially worst policy results in a data rate as low as 0.3 Gb/s, which is less than 12% of the data rate achieved by the bandit-based MCTS policy. This large performance gap arises because the more configuration parameters are chosen incorrectly, the worse the performance becomes. The bandit-based MCTS policy converges much faster than the classical MAB algorithm, which empirically confirms that the structure of the action space should be exploited to speed up the learning process. The proposed MAB reaches 70% of the optimal data rate. Again, the optimal data rate is intractable in practice, as discussed before. Also, by comparing the performance loss to that in Figure 4(b), we can see that the more actions there are, i.e., the more complicated the system is, the larger the performance loss. This is because MAB simplifies the randomness and internal correlation of the system; hence, the performance loss grows with the original system complexity.
It is worth noting that in the previous scenario, the beamwidth optimization is performed assuming that an exhaustive beam sweeping mechanism is adopted. If a hierarchical beam-searching mechanism were used, MAB would still be applicable, and the corresponding beamwidth optimization would be to optimize the finest beam resolution that the hierarchical search should narrow down to.
In the previous two scenarios, the beam sweeping process probes the whole 2D angular space to find the best link. In a real system, it might not be necessary to scan the whole space when the user(s) move around only a certain location in the coverage area. For example, scanning only the first quadrant in Figure 3 would be sufficient to find a line-of-sight path, which would reduce the overhead by 75%. In this last showcased scenario, which builds upon scenario II, we reduce the overhead of the beam sweeping process by testing only a subset of the directional beams (aka sectors) in each time slot, which is referred to as beam direction optimization. This leads us to jointly select the beam sweeping period, beamwidth, and beam directions.
To perform this threefold joint optimization, we can again exploit the bandit-based MCTS approach as follows. We first decide the beam sweeping period. Then, we decide the beamwidth level, providing us a set of beams that together cover the whole space, whose size is denoted as $N_s$. Finally, we select only $\tilde{N}_s$ beams out of the $N_s$ ones to participate in beam sweeping. Denoting $R = \tilde{N}_s / N_s$, the beam sweeping overhead is reduced by $(1 - R) \times 100\%$. For example, if the beamwidth associated with $N_s = 512$ sectors is selected and we set $R = 1/4$, then only $\tilde{N}_s = 128$ out of the 512 beams are used at each time slot for beam sweeping. The decision-making processes for the first two subactions are the same as presented in Figure 5(a). The key difference now is the extra step that decides which $\tilde{N}_s$ beams are to be used. Unlike the traditional MAB framework where a single arm is pulled each time, our scenario requires choosing $\tilde{N}_s$ arms in each round, which is known as a best-K identification problem in the bandit literature.
The selection of the $\tilde{N}_s$ beams and the reward backpropagation in MCTS are shown in Figure 6(a) and further explained next. Each beam has its own UCB index (or posterior distribution if TS is used). Among all $N_s$ beams, the $\tilde{N}_s$ beams with the $\tilde{N}_s$ largest UCB indices are selected for beam sweeping. After the beam sweeping process, the link quality (e.g., SNR) of each of the tested $\tilde{N}_s$ beams is sent back to the transmitter. This feedback is used to update their UCB indices (or posterior distributions if TS is used). The beam that provides the highest SNR is then used for data transmission, and the average effective data rate throughout the time slot is backpropagated to the ancestor nodes as a reward. The evaluation results are given in Figure 6(b).
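The beam-direction step can be sketched as follows: pick the $\tilde{N}_s$ beams with the largest UCB1 indices, then update only the swept beams with their individual feedback. The binary feedback used here (whether a beam connects to the user) follows the description in Figure 6(a); variable names and the exact reward shaping are illustrative assumptions.

```python
import math

def select_top_k_beams(counts, means, t, k):
    """Return the indices of the k beams with the largest UCB1 indices."""
    def ucb_index(b):
        if counts[b] == 0:
            return float("inf")  # untested beams are swept first
        return means[b] + math.sqrt(2.0 * math.log(t) / counts[b])
    ranked = sorted(range(len(counts)), key=ucb_index, reverse=True)
    return ranked[:k]

def update_swept_beams(counts, means, swept_beams, feedback):
    """feedback[b] = 1 if beam b connected to the user during sweeping, else 0."""
    for b in swept_beams:
        counts[b] += 1
        means[b] += (feedback[b] - means[b]) / counts[b]  # running average
```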
Figure 6 (a) The adoption of best-K identification into MCTS. The nonleaf node selection remains the same as that in Figure 5(a), while multiple leaf nodes are selected to interact with the environment. Here, each gray nonleaf node has its own child nodes, while the child nodes of only one node are shown as an example. Each leaf node updates its estimate with its own feedback. A reward, which is a function of the feedback of the selected leaf nodes, is backpropagated to the ancestor nodes. In our scenario, the feedback of a beam is whether it connects to the user, and the reward is the effective data rate after the beam sweeping process. (b) The evolution of the effective data rate versus the time slot index. Each time slot corresponds to one beam sweeping period. For the MCTS-based policy, the beam sweeping period and beamwidth are selected using the KL-UCB index, while the beam directions are selected using the UCB1 index. Within 300 time slots, the MCTS policy that tests half of the beams (R = 1/2) improves the data rate by more than 20% with respect to the MCTS policy that enumerates all candidate beams and by more than 150% with respect to the random selection policy. (a) MAB adaption in scenario III (joint beam sweeping period, beamwidth, and beam direction selection). (b) The evaluation result in scenario III.
As we can see, system throughput can be significantly improved in our studied case. For example, by selecting one-half of the candidate beams (R = 1/2), the bandit-based MCTS policy improves the system data rate by more than 20% in comparison with the case of R = 1. The improvement over a random selection policy that enumerates all candidate beams is even more than 120%. Although the performance improvement is promising, tuning R, namely deciding how much overhead to reduce, is crucial and nontrivial. One potential future direction would be to expand this threefold joint optimization to a fourfold one in which R is also learned. Besides, it might be interesting to learn a set of beams that are not necessarily of the same width. This is challenging, as the action space would grow dramatically. It could be helpful to exploit the correlation among the performance of beams that have different beamwidths but point in the same direction.
In this section, we have showcased three progressively more complex scenarios of applying MAB to mm-wave link configuration problems, demonstrating consistent data rate gains over static and random baselines.
There are many promising research directions for applying MAB in mm-wave systems. We highlight some of them next.
In this article, we have focused on applying the stationary MAB framework to mm-wave systems, where the reward distributions are assumed stationary for simplicity. The underlying randomness in wireless systems, however, can be time varying. Accordingly, it is important to design bandit algorithms that can specifically handle nonstationary rewards. For example, a parallel subroutine could be designed to detect whether the underlying system has changed significantly, and the classical stationary MAB algorithm would then be reset when such a change is detected. Besides, if the reward distributions evolve smoothly over time, a sliding-window approach that uses only the most recent observations for decision making might be helpful. An open problem is to design a unified bandit algorithm that automatically handles both abruptly and smoothly changing environments.
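As a simple illustration of the sliding-window idea, the sketch below keeps only the most recent observations of each arm when computing the UCB index; it is a rough sketch of the windowing principle rather than the exact form of any published sliding-window UCB algorithm.

```python
import math
from collections import deque

class SlidingWindowUCB:
    """Sketch of a sliding-window UCB for slowly drifting rewards."""

    def __init__(self, num_arms, window=100):
        # Keep only the last `window` rewards observed for each arm.
        self.history = [deque(maxlen=window) for _ in range(num_arms)]
        self.t = 0

    def select_arm(self):
        self.t += 1
        indices = []
        for arm, h in enumerate(self.history):
            if not h:
                return arm  # play any arm with no recent observation
            mean = sum(h) / len(h)
            bonus = math.sqrt(2.0 * math.log(self.t) / len(h))
            indices.append(mean + bonus)
        return max(range(len(indices)), key=lambda k: indices[k])

    def update(self, arm, reward):
        self.history[arm].append(reward)
```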
The MAB framework and its applications in mm-wave systems presented in this article involve only a single agent (aka decision maker). In wireless systems, many link configuration parameters exist at both ends of the link. For example, both the BS and the users could configure their beamwidths for beam sweeping. One naive solution is to extend the number of arms to $K_T \times K_R$, where $K_T$ ($K_R$) is the number of arms at the transmitter (receiver). This approach, however, multiplies the size of the action space and also involves coordination overhead between the transceivers. Recent work [14] proposed the concept of a "partner-aware bandit" in which an agent anticipates the partner's action and coordinates with the partner. This concept opens up a potential framework for double-sided mm-wave link configuration, in which the transceivers (agents) are partners and cooperate to perform codecision making. Besides, in one of our recent works [9], we developed a MAB-based controller that jointly considers mm-wave link configuration and user scheduling.
The examples given in this article showcase only the application of MAB in downlink transmission scenarios. It would therefore be interesting to extend them to multiuser uplink transmission, where each user runs its own local MAB model for link configuration. Meanwhile, the users can share their learned models with each other through the BS or device-to-device connections. This idea leads to the design of federated MAB algorithms [15], inspired by the principle of federated learning: training a model across multiple decentralized agents and merging the locally trained models in a communication-efficient and data-private way. Another challenge for federated MAB is adversarial users who could attack the system by sharing malicious local models, which calls for designing defense mechanisms to protect the global model.
The efficient operation of mm-wave wireless networks requires the careful selection of system parameters from among many possibilities. Reinforcement learning is one approach that can be used to adaptively select those parameters in a way that is data driven. In this article, we explained the key idea of MABs and how they can be used to solve certain mm-wave link configuration problems related to beam training. We found that such algorithms were able to dynamically learn the optimal mm-wave link configuration strategies in a site-specific or user-specific manner. The MAB approach has advantages over other strategies like DRL in that it has fast convergence and low computational complexity, making it more suitable for the small timescales of mm-wave communication. We outlined several directions for future research that will push MABs to be better customized for the specific problems faced by next-generation wireless systems.
This work was partially supported by the U.S. Army Research Office under Grant W911NF1910221 and NSF Grants NSF-CCF-2225555 and NSF-CNS-2147955.
Yi Zhang (yi.zhang.cn@utexas.edu) is currently a wireless system engineer at Apple Inc., Cupertino, CA 95014 USA. His research interests include wireless, networking, signal processing, reinforcement learning, deep learning-based wireless communication, and wireless system prototyping. He received his Ph.D. degree in electrical and computer engineering from the University of Texas at Austin in 2021. He is a Member of IEEE.
Robert W. Heath Jr. (rwheathjr@ncsu.edu) is the Lampe Distinguished Professor at North Carolina State University, Raleigh, NC 27606 USA. He is also the president and CEO of MIMO Wireless Inc. He is a Fellow of IEEE.
[1] S.-E. Chiu, N. Ronquillo, and T. Javidi, “Active learning and CSI acquisition for mmWave initial alignment,” IEEE J. Sel. Areas Commun., vol. 37, no. 11, pp. 2474–2489, Nov. 2019, doi: 10.1109/JSAC.2019.2933967.
[2] Y. Zhang, S. Basu, S. Shakkottai, and R. W. Heath Jr., “MmWave codebook selection in rapidly-varying channels via multinomial Thompson sampling,” in Proc. 22nd ACM Int. Symp. Mobile Ad Hoc Netw. Comput., Shanghai, China, 2021, pp. 151–160, doi: 10.1145/3466772.3467044.
[3] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Mach. Learn., vol. 47, nos. 2–3, pp. 235–256, May 2002, doi: 10.1023/A:1013689704352.
[4] D. Bouneffouf, I. Rish, and C. Aggarwal, “Survey on applications of multi-armed and contextual bandits,” in Proc. IEEE Congr. Evol. Comput. (CEC), 2020, pp. 1–8, doi: 10.1109/CEC48606.2020.9185782.
[5] G. H. Sim, S. Klos, A. Asadi, A. Klein, and M. Hollick, “An online context-aware machine learning algorithm for 5G mmWave vehicular communications,” IEEE/ACM Trans. Netw., vol. 26, no. 6, pp. 2487–2500, Dec. 2018, doi: 10.1109/TNET.2018.2869244.
[6] M. Hashemi, A. Sabharwal, C. E. Koksal, and N. B. Shroff, “Efficient beam alignment in millimeter wave systems using contextual bandits,” in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), Honolulu, HI, USA, Apr. 2018, pp. 2393–2401, doi: 10.1109/INFOCOM.2018.8486279.
[7] J. Zhang, Y. Huang, Y. Zhou, and X. You, “Beam alignment and tracking for millimeter wave communications via bandit learning,” IEEE Trans. Commun., vol. 68, no. 9, pp. 5519–5533, Sep. 2020, doi: 10.1109/TCOMM.2020.2988256.
[8] M. Feng, B. Akgun, I. Aykin, and M. Krunz, “Beamwidth optimization for 5G NR millimeter wave cellular networks: A multi-armed bandit approach,” in Proc. IEEE Int. Conf. Commun. (ICC), 2021, pp. 1–6, doi: 10.1109/ICC42927.2021.9500381.
[9] Y. Zhang and R. W. Heath, “Reinforcement learning-based joint user scheduling and link configuration in millimeter-wave networks,” IEEE Trans. Wireless Commun., early access, 2022, doi: 10.1109/TWC.2022.3215922.
[10] R. W. Heath, N. González-Prelcic, S. Rangan, W. Roh, and A. M. Sayeed, “An overview of signal processing techniques for millimeter wave MIMO systems,” IEEE J. Sel. Topics Signal Process., vol. 10, no. 3, pp. 436–453, Apr. 2016, doi: 10.1109/JSTSP.2016.2523924.
[11] D. J. Russo, B. V. Roy, A. Kazerouni, I. Osband, and Z. Wen, “A tutorial on Thompson sampling,” Found. Trends Mach. Learn., vol. 11, no. 1, pp. 1–96, 2018, doi: 10.1561/2200000070.
[12] P.-A. Coquelin and R. Munos, “Bandit algorithms for tree search,” in Proc. 23rd Conf. Uncertainty Artif. Intell., Arlington, VA, USA: AUAI Press, 2007, pp. 67–74.
[13] A. Garivier and O. Cappé, “The KL-UCB algorithm for bounded stochastic bandits and beyond,” in Proc. 24th Annu. Conf. Learn. Theory, Budapest, Hungary, Jun. 2011, pp. 359–376.
[14] E. Biyik, A. Lalitha, R. Saha, A. Goldsmith, and D. Sadigh, “Partner-aware algorithms in decentralized cooperative bandit teams,” in Proc. AAAI Conf. Artif. Intell., Jun. 2022, pp. 9296–9303, doi: 10.1609/aaai.v36i9.21158.
[15] C. Shi and C. Shen, “Federated multi-armed bandits,” in Proc. AAAI Conf. Artif. Intell., May 2021, vol. 35, no. 11, pp. 9603–9611, doi: 10.1609/aaai.v35i11.17156.