Vision-Assisted Millimeter-Wave Beam Management for Next-Generation Wireless Systems

Vision-Assisted Millimeter-Wave Beam Management for Next-Generation Wireless Systems: Concepts, Solutions, and Open ChallengesKan Zheng, Haojun Yang, Ziqiang Ying, Pengshuo Wang, Lajos Hanzo18mvt03-hanzo-opener-3262907Â©SHUTTERSTOCK/NAMOMOOYIMBeamforming techniques have been widely used in the millimeter-wave (mm-wave) bands to mitigate the path loss of mm-wave radio links as narrow straight beams by directionally concentrating the signal energy. However, traditional mm-wave beam management algorithms usually require excessive channel state information (CSI) overhead, leading to extremely high computational and communication costs. This hinders the widespread deployment of mm-wave communications. By contrast, the revolutionary vision-assisted beam management system concept employed at base stations (BSs) can select the optimal beam for the target user equipment (UE), based on location information determined by machine learning (ML) algorithms applied to visual data, without requiring channel information. In this article, we present a comprehensive framework for a vision-assisted mm-wave beam management system, its typical deployment scenarios, and the specifics of the framework. Then, some of the challenges faced by this system and their efficient solutions are discussed from the perspective of ML. Next, a new simulation platform is conceived to provide both visual and wireless data for model validation and performance evaluation. Our simulation results indicate that vision-assisted beam management is indeed attractive for next-generation wireless systems.IntroductionBeamforming-aided directional transmission plays a critical role in improving spatial spectrum efficiency. Due to the expected wide deployment of mm-wave communications, beamforming techniques are receiving much attention in the context of multiple-input, multiple-output (MIMO) systems designed for the mm-wave frequency bands [1]. However, given a large number of antennas, tracking the movement of multiple concurrent pieces of UE dramatically increases the complexity, overhead, and latency of signal processing at the BSs using mm-wave massive MIMO schemes [2]. These problems may be exacerbated for a high number of antennas, even in line-of-sight (LOS) channels. To overcome these challenges, computer vision (CV)-aided ML algorithms may be harnessed as promising solutions for beamforming. Motivated by the spatial sparsity of mm-wave wireless channels exhibiting predominant LOS characteristics, mm-wave beams pointing to target UE can be efficiently selected and adapted according to location information of the UE derived by ML algorithms [3].To implement a vision-assisted beam management system, several technical challenges have to be overcome. The traditional vision-based object tracking algorithms, such as sparse representation and correlation filtering, have difficulty in accurately locating high-mobility UE in real time [4]. Furthermore, in complex environments, tracking multiple pieces of UE in the face of blockage and uneven light, the localization accuracy of UE tends to degrade significantly. As a result, the traditional CV-related ML algorithms cannot satisfy the high location accuracy required by mm-wave communications. Finally, gathering abundant labeled data from real-world environments, including both visual data and wireless signals, to train ML models is still challenging.Nevertheless, vision-assisted beam management methods that predict the optimal mm-wave beams have been investigated in the past few years. A framework used for dataset generation was also proposed for cooperatively exploiting both visual and wireless data [5]. However, these methods may still be plagued by a number of issues:

The robustness of the existing ML models for vision-assisted beam management has to be improved [6]. When the classical image classification models designed for prediction are used for mm-wave link blockage prediction and beam prediction, the target accuracy cannot be always satisfied, for example, due to the overfitting issues. In this context, a preliminary study was conducted in [7] by relying on a simple dataset for investigating vision-assisted beam management in multiuser scenarios.

The scalability of the methods is not guaranteed in practical scenarios. For example, the existing methods do not obey the so-called modular design principles, making it difficult to upgrade them flexibly and to modify them simply when, for example, a crucial parameter, such as the size of the beamforming codebook, changes.

The implementation issues pertaining to the complexity of computation and overhead costs have to be addressed before the wide deployment of vision-assisted beam management becomes a reality. For example, an exploratory strategy was proposed for reducing the overhead associated with beam selection, where information from localization and vision sensors is integrated [8].

Indeed, there is a paucity of literature addressing the associated challenges of vision-assisted mm-wave beam management techniques. The scope of this article is, thus, to study the interplay of ML-based CV and beam management in mm-wave systems. Specifically, the main contributions of this article can be summarized as follows.

We first present a comprehensive framework for a vision-assisted mm-wave beam management system, including the typical deployment scenarios as well as a pair of major concerns, namely, UE detection and beam selection.

Then, three main technical challenges and their efficient solutions are discussed from the perspective of ML. In particular, we study the salient issues of lightweight compression, the deleterious effects of inadequately labeled data, and associated robustness aspects.

Next, we describe the development of our own simulation platform to provide both visual and wireless data for model validation and performance evaluation. This unified platform is universally applicable in terms of producing data for those scenarios where the wireless characteristics vary tremendously. Our simulation results show that vision-assisted beam management is indeed attractive for next-generation wireless systems.

Finally, related open topics are discussed from a practical perspective to guide future research.

The article is organized as follows. A detailed description of vision-assisted mm-wave beam management systems is provided in the â€œHolistic Framework of Vision-Assisted Beam Management Systemâ€ section. Next, we discuss some ML-related challenges and solutions conceived for mm-wave beam management in the â€œChallenges for Vision-Assisted Beam Management Systemâ€ section. We then present our performance results and discuss some potential open topics. Finally, our conclusions are given in the â€œConclusionsâ€ section.Holistic Framework of Vision-Assisted Beam Management SystemTypical Deployment Scenarios of Vision-Assisted Beam Management SystemsThe next-generation wireless systems are expected to operate in multiple bands, including the sub-6-GHz and mm-wave bands. In general, the signal propagation of the sub-6-GHz bands is more resilient to blockages; thus, the sub-6-GHz bands are used for services that require low and medium data rates. By contrast, as a benefit of their abundance of spectral resources, the mm-wave bands are expected to support multigigabit services. To take full advantage of their benefits, a dual-band system in which the BS and UE use both sub-6-GHz and mm-wave transceivers is considered in this article. Vision-assisted beam management may be enabled only when the LOS condition is met, which may also be combined with sub-6-GHz systems [3]. The rich bandwidth potential of mm-wave communications can be used both for backhauls and user access links under a variety of potential deployment scenarios. Thus, we mainly focus attention on those scenarios, where the vision-assisted beam management can be harmoniously integrated. Some of them are illustrated in Figure 1 and discussed in more detail in the following.hanzo01-3262907Figure 1 The vision-assisted beam management system. UAV: unmanned aerial vehicle.Scenario 1: Outdoor Deployment (Cellular UE)When the wireless channels in outdoor environments are spatially sparse, i.e., dominated by LOS propagation, vision-assisted beam management can indeed be conveniently adopted at the BSs. Then, all the cellular UE can be served by BSs on the mm-wave band.Scenario 2: Outdoor Deployment (Unmanned Aerial Vehicles)At the time of writing, mm-wave communications are widely used for unmanned aerial vehicle (UAV) communications. Cameras deployed at BSs can also readily capture, without obstruction, the video of a UAV flying by, which is eminently suitable for vision-assisted beam management in this scenario.Scenario 3: Indoor DeploymentTo significantly increase the system capacity in high-density indoor environments, cameras installed at the BSs are capable of capturing images of nearby UE. However, there may be lots of objects, which increases the recognition complexity of the CV algorithms.Description of Vision-Assisted Beam Management SystemAs shown in Figure 1, a BS equipped with a high-definition camera first captures the video scenes. Then, the ML-based vision model embedded in the BS is activated for localizing and tracking the target UE. By collaboratively utilizing the image/video information, the beam management module finally selects the optimal beams for the target UE among a predefined beam pattern codebook.Vision-assisted beam management is generally divided into two steps, namely, the UE detection and the ensuing beam selection for the target UE. More explicitly, the former determines whether any target UE exists in the view of the camera, while the latter is responsible for providing both the location information and the optimal beams for the target UE.UE DetectionIn the traditional ML-based object detection models, each frame of the video stream is processed to generate object locations as the output, which is usually time-consuming. However, the target UE is not captured by the camera all the time, and it is not always in its active communication status. To this end, it is necessary to detect the existence of active UE before beam selection. In the proposed framework, an ML-based binary classification model can be used for determining whether active UE exists at the current moment. For the sake of illustration, the active state of the kth UE is defined as â€œone,â€ while the inactive state is defined as â€œzero.â€ Then, based on the captured image, the active/passive state of the kth UE can be predicted by

\[{S}_{k} = {F}_{{\cal{P}}_{1}}\left({\text{Image}},{k}\right) \tag{1} \]

where

${F}_{{\cal{P}}_{1}}{(}{\cdot}{)}$

is the ML-based binary classification model that has to be investigated and

${\cal{P}}_{1}$

is the parameter set of the model. Additionally, to strike an attractive tradeoff between complexity and accuracy, the UE-related information, including the sub-6-GHz CSI and network signaling, might be taken into account.Beam Selection for Target UEThe goal of beam selection is to find the optimal beam from the codebook for maximizing the signal-to-noise ratio (SNR). The traditional beam management schemes generally require the CSI to be obtained by channel estimation, which requires substantial overhead. Instead, a vision-assisted method requiring no CSI knowledge is conceived for solving the beam selection problem. First, the position of target UE can be determined by an ML-based object detection model from the images captured by the camera. Then, the angles of the target UE are estimated by exploiting the location information. Finally, the optimal beam index is selected by maximizing the SNR, although other metrics may also be used. The complete procedure is as follows.Object DetectionGiven the presence of some target UE, each frame of the video stream can be processed to locate the target UE by using ML-based object detectors. In general, the family of ML-based object detectors may be divided into two types, namely, single-stage detectors, such as the so-called You Only Look Once (YOLO)-type models [9], and two-stage detectors, such as region-based convolutional neural network-related models. The two-stage detector first adopts a â€œregion proposal networkâ€ for generating the region of interest (ROI) and then utilizes classification models for determining the category of the region. In contrast to the two-stage detector, the single-stage one directly predicts the category of each feature map without first generating the ROI. Hence, the two-stage detector typically attains higher detection accuracy, while the single-stage detector has higher detection speed. In our proposed framework, either of them may be chosen flexibly according to the specific requirements of different application scenarios.Angle PredictionFor the beam management, the angle information of the UEâ€™s physical location within the geographical coverage of the BS is required for selecting the optimal beam in terms of the real physical world coordinate. However, the outputs of ML-based object detection models are the UEâ€™s location in the image captured by the camera, i.e., the location in the pixel coordinate. Thus, it is paramount to establish the mapping relationship between these two locations in cases of vision-assisted beam management applications.Beam SelectionGiven the predicted angle, the beam selection can provide the index of the optimal beam. Let

${\bf{w}}_{k}$

denote the beamforming vector of the kth UE. Then, the optimal beam can be predicted as follows:

\[{\bf{w}}_{k} = {G}_{{\cal{P}}_{2}}\left({\text{Angle, Codebook}},\,{k}\right) \tag{2} \]

where

${G}_{{\cal{P}}_{2}}\left({\cdot}\right)$

and

${\cal{P}}_{2}$

are the prediction model and parameter set, respectively. For instance, upon considering the simple case of a uniform linear array in the 2D space, the codebook is composed of Q beams having an identical angular separation of

${\pi}{/}{Q}{.}$

Therefore, the task of beam selection is simplified to estimating the range that the predicted angle falls into.Challenges for Vision-Assisted Beam Management SystemLightweight Compression for Prediction ModelThe limited computing and storage capabilities of embedded systems make the real-time implementation of the ML-based models in mm-wave communication systems challenging. Again, the YOLO object detector is applicable to localize the target UE. Even though YOLO is faster than other detectors, it still contains too many convolutional layers. For example, the backbone network in YOLO v3 [9] consists of 53 convolutional layers, and the channels in each of these convolutional layers are typically quite large, namely, up to 1,024 channels. Hence, the model size and computation complexity of YOLO become the barriers to its time-critical application, such as mm-wave beam management. Therefore, it is essential to compress the model volume for increasing the prediction speed while guaranteeing its accuracy.One of the most common methods of model compression is network pruning [11]. In this method, sophisticated rules can be applied to neural networks so that the relatively insignificant weights and branches are removed, thereby reducing the number of model parameters and increasing the inference speed. According to the granularity of pruning objects, the typical network pruning schemes can be divided into weight pruning and structured pruning techniques. The former compresses those relatively insignificant weights in the networks. This technique has a high degree of flexibility but modest inference speed acceleration. For the latter one, the coarser-grained convolution kernels, channels, and layers might be removed, resulting in both a higher compression ratio and faster inference speed.Since YOLO contains a large number of convolutional layers and hundreds or thousands of channels, a structured pruning method is preferred for obtaining a satisfactory compression effect. In particular, YOLO can be pruned at both the channel and layer levels. Channel pruning provides the compression of the model width, while layer pruning reduces the depth of the models. With the help of network pruning, the volume of the YOLO model can be substantially reduced; hence, its prediction speed is significantly improved.Efficiency Improvement of Prediction Model Having Inadequately Labeled DataDue to a large number of parameters in the ML-based models, a dataset having a huge amount of labeled data is required for training the models. However, there might be insufficient labeled data to fully fit the ML models in practical vision-assisted mm-wave communication systems. First of all, gathering visual data (such as redâ€“greenâ€“blue images) and wireless data (such as channel responses) requires completely different equipment and devices. Furthermore, realistic physical test scenarios have to be constructed, relying on practical equipment placement and data synchronization. Additionally, a long test period is needed to collect enough data. As a result, the data collection process itself is complex and time-consuming. Finally, the visual datasets collected have to label the bounding boxes for all UE in the images. Thus, for practical applications, another challenge is how to achieve excellent prediction accuracy in the beam management module when the dataset is small or moderate.The output layer, which is used to map the feature vector to the required classification space, is typically a fully connected layer or a 1 Ã— 1 convolutional layer in both the single-stage and two-stage object detectors. In general, the output layer parameters of object detectors are randomly initialized and iteratively updated thereafter when a new dataset is adopted in the training. However, there is also a number of other parameters in the models that have to be fine-tuned. As a result, having inadequate data may cause overfitting during the learning process, gravely affecting the localization of objects. Localization performance degradation may lead to spurious angle prediction for beam selection algorithms.Therefore, to improve the modeling process in the face of inadequately labeled data, we propose to use an ML scheme relying on the â€œmetric learningâ€ technique of [12] for accurately localizing UE. In general, many approaches in ML require a measure of distance among data points. Typically, with the aid of priori domain knowledge, some standard distance metrics are adopted, such as Euclidean, cosine, and so on. Nevertheless, it is difficult to design metrics that are well suited to the particular data and task of interest. Therefore, the â€œmetric learningâ€ technique is investigated to automatically construct task-specific distance metrics from weakly supervised data, which is more beneficial for the case of inadequately labeled data. Figure 2 presents an N-way K-shot ML scheme conceived for object detection based on metric learning. In this scheme, an ROI set is generated for the querying images, based on region proposal networks. For the supporting images, the features of all categories are generated according to the labeled frames. Our proposed scheme calculates the similarity between each predicted ROI feature and the corresponding feature template. Theoretically, the higher the similarity score, the higher the prediction accuracy becomes for the bounding box associated with the ROI.hanzo02-3262907Figure 2 The N-way K-shot ML models based on metric learning for object detection. ARPN: attention-based region proposal network.Robustness and Applicability for Prediction ModelAdopting only low-complexity single-frame image-based methods cannot cope well with multiuser scenarios, especially in cluttered environments. In the case of completely invisible UE, the BSs cannot identify the equipment because it may be totally obscured when simply analyzing a single image frame at a time. Explicitly, a single image contains only the location information and environmental information about the UE seen at the time and cannot provide extra information concerning the movement of the target UE and the changing camera view of the surrounding environment. Hence, single-frame processing loses sight of the spatial and temporal correlation of moving objects. Therefore, how to exploit the image sequences in the video data to improve the performance of the object detectors and beam management remains a challenge.To enhance robustness and applicability, vision-assisted beam management may process the image sequences for a total of N consecutive video frames, i.e., not only the image from the current frame but also those from

${N}{-}{1}$

previous frames. Compared to the schemes based on the current individual video frame, the improved schemes using a sequence of video frames can capture both the spatial coherence of each video frame and its temporal interframe correlation.Figure 3 describes the overall framework of a vision-assisted beam management system based on image sequences [13]. The framework primarily consists of three main steps. In the first step, we extract specific features of the image sequences. In the lower half of Figure 3(b), 3D convolution is applied for extracting the features containing both spatial and temporal contents. In the upper half of Figure 3(b), 2D convolution is used for processing each image separately. Then, the interactions among the image-beam sequence features take place. The transformer scheme of Figure 3(c), having several encoder layers, is used for interactive sequence modeling of the features in the beam index sequence and the image sequence features obtained by 2D convolution. The final step is to design a suitable output layer according to the specific beam management tasks so as to select the optimal beam.hanzo03-3262907Figure 3 A vision-assisted beam management system based on image sequences, where the structures of the YOLO v3 and transformer are the classical ones and the 3D ResNeXt-101 is shown in [10]. The (a) image sequences, (b) feature extraction, (c) feature interaction between image and beam sequences, and (d) feature fusion and output.Simulation Methodology and EvaluationAt the current state of the art, it is quite difficult to collect and label both visual and wireless data in real time. Hence, we resort to simulations for generating labeled data for training and testing. Figure 4 depicts our simulation platform conceived for vision-assisted mm-wave beam management. As this stage, only an outdoor scenario is used for validating the models in the platform. An open source framework is proposed to speed up the implementation of other scenarios. As a result, it is advantageous to revise the details of the scenario when generating wireless data, such as the number and orientation of rays, channels, user positions, and so on. Furthermore, diverse UE is involved as well as other entities, such as trees, bushes, sidewalks, benches, and buildings. Specifically, our self-developed and open source platform is based on MATLAB software and requires only a text file for defining a scenario. During the phase of initialization, a series of visual and wireless sequences are created for modeling real-world physical environments. To create visual and wireless datasets, all sequences are respectively processed by the animation modeling software and the ray tracing software in the second step. Normally, animation modeling software is designed for creating complex 3D objects, rendering the objects into images, and making animations from frames. Finally, the datasets can be used for evaluating and validating the performance of ML-based models for beam management.hanzo04-3262907Figure 4 The simulation platform for the vision-assisted beam management system. The (a) visual data generation, (b) wireless data generation, and (c) evaluation.InitializationIn the initialization phase, the types and attributes of entities are described in intricate detail. Using unified definitions is an efficient and compatible way of ensuring the appropriate relationship between the visual and wireless data generation. In particular, the scenario definition includes the system parameters, antenna arrays, BSs, reflectors, and mobile users. Each of them contains the following information:

System parameters: This includes parameters used to describe how the platform works, for example, the total number of frames for simulations, the average number of video frames calculated per second in simulations, the maximum number of mm-wave reflections calculated in the ray tracing process, and the size of beam codebook.

Antenna arrays: The key parameters of the antennas, such as the size of the antenna arrays and the antenna spacing, are defined.

BSs: The location of BSs, the configuration of cameras deployed at BSs, the antenna arrays used by BSs, and diverse other parameters are described in detail.

Reflectors: The position, shape, and material of reflectors are given.

UE: Similar to the reflectors, we have to define the parameters of UE, such as location and appearance. Additionally, the UE antenna arrays, the UE motion trajectories, and other parameters are specified.

It is noted that both the visual and wireless simulations require the aforementioned information. Based on the information defined in a given scenario, the visual and wireless simulation processes have to be synchronized so that the data generated tally correctly.Data GenerationThe data generation includes both visual and wireless data. Although different simulation environments and processes are used for generating these two types of data, there still exists a corresponding relationship between them for ensuring that the data produced conform to the scenario definition. Here, we first introduce the process of generating both types of data, and then we describe how to synchronize and merge these data.Visual Data GenerationThe visual data are generated by some special animation modeling software, such as Blender [14], which facilitates the construction of 3D object models. Hence, the first step in generating the visual data is to create a 3D model of the reflectors and users via the animation modeling software. Then, we have to assign textures and materials to the objects in the scenario to make them more realistic. Note that the material mentioned here differs from that in the scenario definition. The former determines only the visual effect of the generated image, while the latter determines the propagation of electromagnetic waves. Next, the cameras have to be deployed correctly at the BSs. The second step is to define the movement animation of users. In general, the animation consists of a sequence of images. There are some frames referred to as the key frames, and the position and shape of the 3D model in the other frames can be determined by interpolation between a pair of consecutive key frames. In the scenario defined, the objects are regarded as rigid bodies, and each frame contains only the position change users. Finally, the animation is generated in the final step and exported from the animation modeling software.Wireless Data GenerationThe wireless data are generated by software that supports ray tracing technology [15], e.g., MATLAB. Due to the challenge of generating complex 3D objects in MATLAB, the 3D models of the reflectors and users must be obtained by loading external data. Then, the transmitter and receiver are correctly positioned, i.e., colocated with either the BSs or the other UE that might move at a given speed and in a certain direction. Next, we calculate the propagation-related information, such as the signal power, delay, angle of departure, and angle of arrival, using ray tracing technology. Likewise, according to the geometric channel model, the wireless channels in the current scenario are constructed using the preceding propagation information. Furthermore, the codebook indices corresponding to the optimal beam are calculated, which is crucial for the wireless data.Data SynchronizationAs seen from the preceding data generation phase, both the object model and UE motion should be consistent across the pair of generation processes. Specifically, the following methods are used in our platform:

Consistency of object model: With the aid of the animation modeling software, we create 3D models of the objects and export them in the STL format. (The STL format is a universal format for displaying 3D models and widely supported by related animation modeling software.) Then, we import the required 3D model into MATLAB software with ray tracing to make sure that the object models are consistent.

Consistency of UE movement: The concept of frames is introduced into the ray tracing process. Hence, the UE remains in the same position for these two generation processes, which ensures that the UE movements are consistent.

Evaluation and ValidationFigure 5(a) gives an example of the mm-wave communication scenario in an urban environment, where a BS communicates with three vehicular UE objects in cars. Moreover, two types of buildings built from different materials are involved in this scenario. In general, the reflection properties of objects are strongly influenced by their materials. To accurately model real-world environments, we set the materials of the dark-colored and light-colored buildings to concrete and brick, while the materials of all vehicles are assumed to be metal. We then characterize the performance of the platform, based on this predefined scenario.hanzo05-3262907Figure 5 The performance evaluation results. (a) An example of the mm-wave communication scenario. (b) An example single-frame image with a bounding box labeled by object detection. (c) An example of ray tracing at one frame. (d) The beam prediction performance.Dataset Validation From Visual and Wireless AspectsIn the proposed framework of vision-assisted beam management, object detection plays a crucial role in both the existence detection and beam selection tasks. Thus, Figure 5(b) presents the validation results for datasets from the visual perspective. Explicitly, it shows a single image frame labeled by the bounding boxes of three moving UE objects in cars. The accuracy of the labeling results demonstrates that the visual datasets generated accurately characterize the movement of objects in each frame and that the proposed vision-assisted beam management framework can also effectively track objects in real time.On the other hand, Figure 5(c) illustrates the signal power of the randomly generated rays between the BS and vehicular UE, which comes from the wireless datasets. The larger the distance between rays, the lower the received power, which confirms the trends of the generated wireless datasets. Additionally, due to the mobility of various objects, wireless datasets can occasionally contain zero data. For example, whenever the car farthest from the BS runs into the shadow of a bus, no rays are detected for this frame, resulting in a beam tracking outage.Results for Inadequately Labeled DataFigure 5(d) displays the beam prediction accuracy of the improved models, using metric learning in the case of inadequately labeled data. For each category, the metric learning normally requires only one feature template. Nevertheless, there may be K samples in each category, thereby having K feature vectors. Therefore, the K feature vectors should be combined to produce a representative category template. Three schemes are considered for comparison, i.e., YOLO v3, average metric learning, and k-means-weighted average metric learning. Specifically, average metric learning combines all feature vectors by using the arithmetic mean method over K samples. To overcome the homogenization of the arithmetic mean, k-means-weighted average metric learning is also studied, in which K samples are first classified by the k-means method and the cluster features obtained are then combined by the weighted averaging method.It is clearly shown that both metric learning schemes perform better than YOLO v3, achieving an accuracy of about 84.12% with only 10-shot learning. Furthermore, k-means-weighted average metric learning slightly outperforms average metric learning. In conclusion, the metric learning schemes are more efficient than YOLO v3 when data are inadequately labeled.Results for Lightweight CompressionThe performance of the improved ML-based models having different lightweight compression are also evaluated. Prior to discussing the simulation results, we briefly highlight the performance metric, i.e., the mean average precision (MAP) score. As a derivative of the AP, MAP is the average of the AP scores, while the AP score generally is obtained by calculating the area under the precision recall curve. To summarize, the AP score is calculated for each category, then averaged to determine the final MAP score.Table 1 presents the simulation results for the cases of no pruning, channel pruning only, and channel pruning and layer pruning. As recorded in Table 1, the classification performance of both pruning models degrades compared to the no-pruning model. However, the accuracy erosion is modest for the two pruning models. For instance, with regard to the channel-and-layer pruning model, the MAP performance and beam prediction accuracy deteriorates by only about 0.8% and 0.3%, respectively. By contrast, the pruning operation results in a significant model size reduction and an acceleration of the inference speed. The number of parameters and the model size are reduced by about 82%, which is more beneficial for the practical deployment of latency-sensitive applications.Table 1 Simulation results for ML models with lightweight compression.hanzo_t1-3262907Open DiscussionCombination With Hierarchical Beam SearchUsually, the explicit training required for finding the best beam directions in the angular domain is indispensable. In contrast to the classical exhaustive search-based training, hierarchical training has been proposed as a promising technique of reducing both the complexity and the overhead. However, a tradeoff must be struck between the phase shift resolution of training and the complexity imposed. For example, when a low phase shift is chosen for the first stage of training, the beam direction can be selected more accurately. However, this imposes higher feedback delay and higher overhead and vice versa. To strike a compelling tradeoff, a vision-assisted beam management scheme can be used as the first stage of training because it does not rely on UE feedback for beam selection. Subsequently, the accuracy of beam search can be further improved through a fine-tuning of CSI-based beam management along with a lower phase shift in the following training stage.Uplink and Downlink Beam MatchingAs a result of the propagation differences between the uplink and downlink, especially for frequency-division duplex systems, the downlink beam selection based on the uplink channel estimation operation usually requires calibration to improve accuracy. On the other hand, the location of the user can be accurately determined by vision-assisted beam management regardless of the frequency band. Therefore, how to use this information to support beam matching on both the uplink and downlink becomes a very interesting topic.Dual-Band Communications With Sub-6 GHzRecently, the dual-band communication mode that includes mm-wave and sub-6-GHz communications has become increasingly popular. Therefore, another open challenge is how to exploit the extra information at sub-6 GHz so as to enhance the mm-wave beamforming performance. Intuitively, the proposed vision-assisted mm-wave communication depends on having LOS propagation for its accurate operation, and it is vulnerable to blockage. For instance, when multiple pieces of UE are captured by the camera without any additional details, vision-assisted beamforming may falter. On the other hand, sub-6-GHz communication generally works well for both non-LOS and LOS channels, and it is capable of providing the related control information, including CSI and other user-specific information. This information can assist in the detection of active UE and multiuser discrimination when using vision-assisted mm-wave communications. Additionally, for further reducing the complexity of exploiting sub-6-GHz communication, the previously mentioned hierarchical beam search technique can be used for sub-6 GHz to provide prompt user-specific information.Multicell Beam ManagementThe coverage distance of mm-wave communications is typically small, and UE often appears at the cell edge. The beam selection of cell edge UE can be handled more accurately by adopting vision-assisted beam management. Specifically, the videos obtained by the cameras of multiple adjacent BSs can be processed jointly. Due to the fact that the same UE is captured in multiple images at the same time, its position can be more accurately determined using ML algorithms and the beam direction. By aligning the beams of two adjacent BSs for the target UE, a more reliable communication connection can be achieved. The issues associated with channel feedback overhead can be avoided by such a vision-aided multicell beam management.ConclusionsVision-assisted beam management is paving the way for improved mm-wave communications by relying on ML models of analyzing visual data. This enables us to tackle several important challenges of ML-based model implementation for mm-wave beamforming. In particular, sophisticated network pruning has been used to compress the models for reducing the complexity. Additionally, a model based on metric learning has been shown to be an effective option for dealing with the problem of inadequately labeled data in practical applications. An ML model based on image sequences has also been conceived for multiuser scenarios and to mitigate blockage problems. Then, animation modeling software and ray tracing software were used for successfully building a new simulation platform to generate various labeled visual and wireless data for performance evaluation and model validation. Our simulation results show that ML-based models work well with vision-assisted mm-wave beam management schemes. Furthermore, some open challenges were presented to guide future research work. Additionally, dual-function radar communication systems may be capable of simultaneously performing wireless communications and remote sensing when they become available, but they will have huge complexity. The alluring topic of combining these technologies is also interesting for future research. Suffice it to say that the vision-based system investigated in this treatise requires only a low-cost camera and object recognition software.AcknowledgmentThis work was supported, in part, by the National Natural Science Foundation of China, under grant 62201301. Lajos Hanzo would like to acknowledge the financial support of Engineering and Physical Sciences Research Council projects EP/W016605/1 and EP/X01228X/1 as well as European Research Council Advanced Fellow Grant QuantCom (grant 789028).Author Informationhanzo_a1-3262907 Kan Zheng (zhengkan@nbu.edu.cn) is currently a full professor with Ningbo University, Ningbo 315211, China. He received the Ph.D. degree from Beijing University of Posts and Telecommunications, China, in 2005. He has authored over 200 journal articles and conference papers in the field of wireless communications, vehicular networks, security, and so on, and he holds editorial board positions with several journals. He is a Senior Member of IEEE.hanzo_a2-3262907 Haojun Yang (yanghaojun.yhj@gmail.com) is currently a postdoctoral fellow with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada. He received the Ph.D. degree in information and communication engineering from Beijing University of Posts and Telecommunications, China, in 2020. His research interests include ultrareliable and low-latency communications, radio resource management, and vehicular networks. He is a Member of IEEE.hanzo_a3-3262907 Ziqiang Ying (yingzq0116@163.com) is with Beijing University of Posts and Telecommunications, Beijing 100876, China, where he received the M.S. degree in information and communication engineering in 2021. His research interests include deep learning, millimeter-wave communication systems, vehicular networks, and machine vision.hanzo_a4-3262907 Pengshuo Wang (wshuo@bupt.edu.cn) is working toward the M.E. degree in information and communication engineering at Beijing University of Posts and Telecommunications, Beijing 100876, China, where he received the B.E. degree in information engineering in 2020. His research interests include machine learning for wireless communication.hanzo_a5-3262907 Lajos Hanzo (lh@ecs.soton.ac.uk) is with the School of Electronics and Computer Science, University of Southampton, SO17 1BJ Southampton, U.K. He received the Ph.D. degree in 1983 from the Technical University of Budapest. He is a Life Fellow of IEEE and a fellow of the Royal Academy of Engineering, the Institute of Engineering and Technology, and the European Association for Signal Processing.References[1] E. Bjornson, L. Van der Perre, S. Buzzi, and E. G. Larsson, â€œMassive MIMO in sub-6 GHz and mmWave: Physical, practical, and use-case differences,â€ IEEE Wireless Commun., vol. 26, no. 2, pp. 100â€“108, Apr. 2019, doi: 10.1109/MWC.2018.1800140.[2] X. Liu et al., â€œLearning to predict the mobility of users in mobile mmWave networks,â€ IEEE Wireless Commun., vol. 27, no. 1, pp. 124â€“131, Feb. 2020, doi: 10.1109/MWC.001.1900241.[3] G. Charan, M. Alrabeiah, and A. Alkhateeb, â€œVision-aided 6G wireless communications: Blockage prediction and proactive handoff,â€ IEEE Trans. Veh. Technol., vol. 70, no. 10, pp. 10,193â€“10,208, Oct. 2021, doi: 10.1109/TVT.2021.3104219.[4] P. Druzhkov and V. Kustikova, â€œA survey of deep learning methods and software tools for image classification and object detection,â€ Pattern Recognit. Image Anal., vol. 26, no. 1, pp. 9â€“15, Jan. 2016, doi: 10.1134/S1054661816010065.[5] M. Alrabeiah, J. Booth, A. Hredzak, and A. Alkhateeb, â€œViWi vision-aided mmWave beam tracking: Dataset, task, and baseline solutions,â€ Feb. 2020, arXiv:2002.02445v3.[6] Y. Tian and C. Wang, â€œVision-aided beam tracking: Explore the proper use of camera images with deep learning,â€ in Proc. IEEE 94th Veh. Technol. Conf. (VTC2021-Fall), Norman, OK, USA, Sep. 2021, pp. 1â€“5, doi: 10.1109/VTC2021-Fall52928.2021.9625195.[7] H. Ahn et al., â€œMachine learning-based vision-aided beam selection for mmWave multiuser MISO system,â€ IEEE Wireless Commun. Lett. vol. 11, no. 6, pp. 1263â€“1267, Jun. 2022, doi: 10.1109/LWC.2022.3163780.[8] G. Reus-Muns et al., â€œDeep learning on visual and location data for V2I mmWave beamforming,â€ in Proc. Int. Conf. Mobility, Sens. Netw. (MSN), Exeter, U.K., Dec. 2021, pp. 559â€“566, doi: 10.1109/MSN53354.2021.00087.[9] J. Redmon and A. Farhadi, â€œYOLOv3: An incremental improvement,â€ Apr. 2018, arXiv:1804.02767v1.[10] K. Hara, H. Kataoka, and Y. Satoh, â€œCan spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?â€ in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit. (CVPR), Salt Lake City, UT, USA, Jun. 2018, pp. 6546â€“6555.[11] S. Ioffe and C. Szegedy, â€œBatch normalization: Accelerating deep network training by reducing internal covariate shift,â€ in Proc. Int. Conf. Mach. Learn. (ICML), Lille, France, Jul. 2015, pp. 448â€“456.[12] Q. Fan, W. Zhuo, C.-K. Tang, and Y.-W. Tai, â€œFew-shot object detection with attention-RPN and multi-relation detector,â€ in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2020, pp. 4012â€“4021.[13] R. Yao, G. Lin, S. Xia, J. Zhao, and Y. Zhou, â€œVideo object segmentation and tracking: A survey,â€ ACM Trans. Intell. Syst. Technol., vol. 11, no. 4, pp. 1â€“47, May 2020, doi: 10.1145/3391743.[14] â€œBlender,â€ Blender Foundation. [Online] . Available: https://www.blender.org[15] Q. Li et al., â€œValidation of a geometry-based statistical mmWave channel model using ray-tracing simulation,â€ in Proc. IEEE 81st Veh. Technol. Conf. (VTC Spring), Glasgow, U.K., May 2015, pp. 1â€“5, doi: 10.1109/VTCSpring.2015.7146155.Digital Object Identifier 10.1109/MVT.2023.3262907CoverVTSMastheadFrom the EditorPresident's MessageMobile RadioTransportation SystemsMultimode High-Altitude Platform Stations for Next-Generation Wireless NetworksSatellite Clustering for Non-Terrestrial NetworksTaming Aerial Communication With Flight-Assisted Smart Surfaces in the 6G EraVisual Camouflage and Online Trajectory PlanningVision-Assisted Millimeter-Wave Beam Management for Next-Generation Wireless SystemsMetamobilityFree Space Optical Communications for Intelligent Transportation SystemsV2X Communications for Maneuver Coordination in Connected Automated DrivingToward Intelligent Connected E-MobilityMobile Charging Services for the Internet of Electric VehiclesConnected and Automated VehiclesProceedingsVT SocietyIEEE AccessTechRxivArchives