Wenjun Xu, Yimeng Zhang, Fengyu Wang, Zhijin Qin, Chenyao Liu, Ping Zhang
The Internet of Vehicles (IoV) is expected to become the central infrastructure providing advanced services to connected vehicles and users for higher transportation efficiency and security. A variety of emerging applications/services bring explosively growing demands for mobile data traffic between connected vehicles and roadside units (RSUs), imposing the significant challenge of spectrum scarcity in the IoV. In this article, we propose a cooperative semantic-aware architecture to convey essential semantics from collaborating users to servers to lessen data traffic. In contrast to current solutions, which mainly pile up highly complex signal processing techniques and multiple access capabilities in terms of syntactic communications, this article puts forth the idea of semantic-aware content delivery in the IoV. Specifically, the successful transmission of the essential semantics of the source data is pursued rather than the accurate reception of symbols regardless of their meaning, as in conventional syntactic communications. To assess the benefits of the proposed architecture, we provide a case study of an image retrieval task for vehicles in intelligent transportation systems (ITSs). Simulation results demonstrate that the proposed architecture outperforms existing solutions with fewer radio resources, especially in the low-signal-to-noise-ratio (SNR) regime, which sheds light on its potential for extending applications to extreme environments.
Automobiles have become a daily necessity in modern society to provide a fast and convenient way to deliver goods and passengers. The rapid growth in the number of vehicles results in a dramatic increase in the time of traffic congestion, causing a waste of more than 56 h and around 18 gal of additional fuel for each commuter per year. Moreover, according to the World Health Organization, approximately 1.35 million deaths and more than 20 million injuries are caused by road traffic crashes every year around the world [1].
Aiming at improving the safety and efficiency of transportation systems, the IoV [2] has been proposed, which enables information exchange among vehicles, users, and external infrastructure. By integrating the Internet of Things and ITSs, the IoV provides diversified services, such as traffic management, car navigation, and intelligent vehicle control, to avoid traffic accidents and ease traffic congestion.
To provide various real-time services to vehicular users, massive amounts of data must be transmitted to servers within the corresponding delay restrictions while preserving data integrity, leading to a huge demand for spectrum resources [2]. As a result, the IoV enables multiple access capabilities, including cellular, Wi-Fi, satellite, and so on, for larger bandwidth. Moreover, various advanced technologies, such as nonorthogonal multiple access and multiple-input, multiple-output, have been used to improve spectrum efficiency. However, conventional communication systems have nearly approached the Shannon capacity limit. With the advent of more intelligent applications (e.g., autonomous driving) as well as the increasing number of vehicular users in the IoV, the spectrum allocated for the IoV can hardly support big data transmission [2].
On the other hand, many ITS applications, such as traffic congestion detection, autonomous driving, and so on, require massive data from nearby vehicles and RSUs [3], as shown in Figure 1. Although the collaboration of different data sources yields better performance than that of stand-alone systems, the redundancy among the transmitted data [4] dramatically deteriorates spectrum efficiency. Moreover, the raw data transmitted to servers may contain information irrelevant to specific tasks, leading to severe network congestion in the IoV.
Figure 1 The typical IoV scenario: correlated data of nearby users are transmitted to servers. BS: base station.
This article aims to break the preceding limits by proposing a novel cooperative semantic communication (Co-SC) architecture for the IoV. Correlated semantic information from multiple users is extracted and transmitted via a shared channel and jointly recovered and exploited by cooperative modules at the receiver for further processing. In particular, Co-SC extracts the intended “meanings” and “features” of source data that are relevant to the transmission intention and filters out irrelevant and unessential information to lessen data traffic. As a result, SC [5], [6] can transmit fewer data while preserving the effectiveness of communication, alleviating the transmission load significantly. As one of the potential technologies for 6G and beyond, SC has drawn extensive attention in both academia and industry. Preliminary works have shown the potential of SC in improving both transmission efficiency and reliability for supporting end-to-end (E2E) transmission [7], [8], [9]. By jointly optimizing the semantic and channel coding, point-to-point semantic transmission for text, images, and speech is achieved, outperforming the conventional syntactic-based system, especially in the low-SNR regime. However, these works cannot be directly applied in multiuser scenarios in the IoV.
To deal with multiuser scenarios, our initial work [10] designs a multiuser SC system for visual question answering (VQA), named MU-DeepSC. Correlated semantic information of users is incorporated at the receiver to obtain more accurate answers. However, the correlation among different users is exploited only in specific VQA tasks and not fully utilized during transmission. To collaboratively utilize the correlated information of different users for more efficient transmission and intelligent tasks in the IoV, in this work, we propose a general intelligent architecture, Co-SC, for multiuser applications in the IoV. The proposed Co-SC jointly designs the semantic encoder/decoder (Sem-Codec), where the redundant semantics of different users are eliminated. Meanwhile, the distinctive semantics of users that are relevant to the transmission goal are extracted to improve system performance. Moreover, the correlation among the semantics of different users is further exploited in the cooperative joint source and channel (JSC) coding scheme. As a result, the decoder can better reconstruct the semantic features of each user, without extra transmission overhead, while coping with wireless channel noise and impairment.
The remainder of this article is organized as follows. We first present the framework and the functionality of each component of Co-SC. Then, we implement a case study of an image retrieval application in the IoV via the proposed Co-SC architecture, where extensive simulations are conducted to investigate the approach's effectiveness compared to state-of-the-art baselines. At the end of this article, future directions and concluding remarks are discussed.
In this section, we provide an overview of the proposed Co-SC architecture, which consists of the Sem-Codec, JSC encoder/decoder (JSC-Codec), and task-related modules, as detailed in Figure 2. Specifically, to complete a required task, such as traffic analysis, pedestrian detection, vehicle tracking, and so on, users/transmitters need to transmit correlated data to the server/receiver. The correlation among users is prelearned and embedded in the whole structure of Co-SC, including encoders at the transmitters and cooperative modules at the receiver. At the transmitters, essential semantic information is extracted by semantic encoders, and JSC encoders further encode the extracted semantic information to resist noise and interference during transmission. At the server/receiver side, semantic features are recovered by the cooperative JSC decoder and will be further processed by the cooperative semantic decoder and semantic-driven task performer on demand to fulfill the task at the receiver. The detailed functionality of each component is discussed in the following sections.
Figure 2 The proposed architecture for general Co-SC.
Generally, the semantic encoder is designed to extract the semantic information from the source data, which is a high-dimensional interpretation of the original data, emphasizing the meaning and goal-relevant part. Correspondingly, the semantic decoder recovers the source data or expresses it in other modalities from the high-dimensional semantic information according to the specific goals. For example, source images are recovered for the data reconstruction-oriented system [9], and speech signals are reconstructed as text transcriptions for speech recognition tasks [6].
To leverage the semantic-level correlation among users, Co-SC incorporates a cooperative semantic decoder, as shown in Figure 2. Compared to the E2E design for a single transceiver [7], [8], [9], the advantages of the cooperative design are twofold. First, by jointly optimizing the semantic encoders and the cooperative semantic decoder, the correlation among users can be learned by the Sem-Codec. As a result, the distinctive semantic information of each user can be obtained by the semantic encoders, while redundancy can be compressed to improve compression efficiency. Second, the inherent correlation among users can be implicitly used for error correction at the semantic level, further improving the accuracy of the Sem-Codec.
Note that for some intelligent tasks, such as machine-to-machine applications, semantic features can be directly used by the semantic-driven task performer. In such cases, the semantic decoder can be omitted. However, the correlation among users can still be learned by semantic encoders by jointly optimizing the cooperative task performer and semantic encoders with a back propagation algorithm.
The functionality of the JSC-Codec is to resist channel distortions. Specifically, the JSC encoder is applied to encode the extracted semantic information as channel input symbols, while the JSC decoder recovers the semantic information with the received noisy symbols. Unlike the JSC coding scheme in conventional communication systems, where channel symbols are obtained regardless of the transmission meaning, in Co-SC, the JSC-Codec operates at the semantic level, where the channel symbols are obtained with the awareness of semantic information.
Specifically, semantics with different importance levels are protected with a different number of symbols to enhance the robustness of the semantic information to channel distortions implicitly. Moreover, to further leverage the semantic-level correlation among users, in Co-SC, a cooperative JSC decoder is designed to recover the transmitted semantics of multiple users jointly, as demonstrated in Figure 2. Note that the semantic-level correlation narrows the scope of potential symbols, and the received symbols of users can provide a reference for one another. As a result, higher JSC-Codec accuracy can be achieved with the cooperative JSC decoder.
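As a toy illustration of importance-aware protection (not the article's actual scheme, in which this behavior is learned implicitly by the JSC-Codec), channel symbols might be allocated in proportion to importance scores of semantic feature groups:

```python
import numpy as np

def allocate_symbols(importance, total_symbols):
    """Toy importance-aware allocation: semantic feature groups with
    higher importance scores receive more channel symbols. In Co-SC,
    this unequal protection emerges from end-to-end training rather
    than an explicit rule like this one."""
    w = np.asarray(importance, dtype=float)
    alloc = np.floor(total_symbols * w / w.sum()).astype(int)
    # hand any rounding remainder to the most important group
    alloc[np.argmax(w)] += total_symbols - alloc.sum()
    return alloc
```

For example, with importance scores (1, 2, 2) and a budget of 16 symbols, the more important groups receive proportionally more symbols while the total budget is preserved exactly.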
The semantic-driven task performer is used to achieve specific tasks with the recovered semantic information from multiple users. The structure of the semantic-driven task performer adapts to the specific task. For example, convolutional neural networks (CNNs) are generally used for image-based tasks, and long short-term memory (LSTM) networks are widely used for speech recognition. Note that for tasks oriented toward information recovery, the task performer can be omitted, as the output of the semantic decoder may directly achieve the intelligent goals.
In Co-SC, the semantic-level correlation and distinctions among users are leveraged by the semantic-driven cooperative task performer, and a task is cooperatively performed by combining the information provided by distinct users. The combination method can be adaptively designed according to the type of task and correlation. For example, for tasks with partially correlated semantic information, the semantic-driven cooperative task performer can be designed as two cascaded networks: the first network captures the correlation among the recovered semantic information of the users, and the second network combines the correlated information into a global feature and concatenates it with the distinctive semantic information of the different users to form an enhanced semantic feature for the task. The enhanced semantic information can facilitate better task performance in such a scenario.
The knowledge base is the basis of SC and one of the sources of the subjectivity of semantic information. The way a person recognizes and depicts the world is determined by the knowledge he or she has learned and accumulated through life, which differs from person to person. In SC, given a goal, users first analyze and understand the goal on the basis of their background knowledge and then perform coding or decoding. Semantic-level coding can be interpreted as the process of extracting and encoding goal-related semantic information from source data, while semantic-level decoding can be considered interpreting semantic information in the mode required by the transmission goal. Hence, differences in background knowledge will deteriorate the performance of SC severely. In general, before data transmission, transceivers will share their knowledge by acquiring and exchanging knowledge with a shared knowledge base at a central server through a specific link.
In Co-SC, we assume that the background knowledge is already shared among the users and the server. This can be achieved by jointly training the whole neural network (NN) offline with a common dataset, equipping the encoders and decoder with the same knowledge of the given transmission goal. Note that how to achieve efficient global knowledge sharing for multiuser SC and semantic networks is a research topic in its own right, which is out of the scope of this article.
In this section, the proposed Co-SC is implemented to support the IoV, in which image retrieval tasks are essential [11]. To provide data support for intelligent tasks at the central server, such as suspicious vehicle positioning, cameras at the RSUs need to transmit captured images to the server, as illustrated in Figure 3. The server retrieves the identifications of the received images (i.e., the query images) by calculating the distance between their semantic features and those of gallery images, which are accessible only at the server. As shown in Figure 3, cameras close to one another tend to capture images of the same vehicle from different angles, which results in semantic-level correlation and thus enables cooperative SC and identification. Utilizing the Co-SC architecture for this task, only the semantic features of images are cooperatively transmitted to the server, instead of entire images, for tasks without image reconstruction. The entire Co-SC-based multiuser system is achieved by deep learning (DL), where DL modules are trained offline at the server, and the trained transmitter models are deployed to the cameras before transmission, as in Figure 3. Note that model training and distribution need to be performed only once unless the distribution of the collected data changes dramatically.
Figure 3 The framework for the cooperative vehicular ID retrieval task.
We consider an uplink scenario, where N single-antenna cameras simultaneously transmit data via a shared channel to the central server equipped with M antennas. A detailed implementation of the NN structure of Co-SC for this ID retrieval task is provided in Figure 4.
Figure 4 The NN structure of the proposed Co-SC for vehicular image retrieval tasks. Here, T is the transfer function of a module, and the subscript denotes the trainable parameter set.
At the transmitter, given images ${\boldsymbol{s}}_{i}\,{\in}\,{\Bbb{R}}^{{C}\,{\times}\,{H}_{i}\,{\times}\,{W}_{i}},{i} = {1},{\ldots},{N}$, cameras first extract the semantic features of the images ${\boldsymbol{g}}_{i}\,{\in}\,{\Bbb{R}}^{F},{i} = {1},{\ldots},{N}$ with the semantic encoder, where Wi , Hi , and C are the width, height, and number of channels of the images and F is the dimension of the semantic features. Then, the JSC encoder maps the semantic features to the channel input symbols ${\boldsymbol{x}}_{i}\,{\in}\,{\Bbb{R}}^{2{B}},{i} = {1},{\ldots},{N}$, where B is the number of transmitted complex symbols and the factor of 2 arises from representing each complex symbol by its real and imaginary parts during processing. Note that the complex channel input symbols can be considered the counterpart of the symbols of conventional modulation. An average power constraint is used to normalize the channel input symbols before they are transmitted, and all cameras are constrained to the same average power P. The wireless channel can also be modeled as a layer with nontrainable parameters as long as the channel transfer function is differentiable. For nondifferentiable cases, the channel can be approximated with a generative adversarial network [12].
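The average power constraint and the real-to-complex mapping above can be sketched as follows (a minimal numpy illustration, assuming the constraint is enforced per transmission block):

```python
import numpy as np

def to_channel_symbols(x, P=1.0):
    """Map a real-valued JSC encoder output x (length 2B) to B complex
    channel symbols normalized to an average power of P per symbol."""
    B = x.size // 2
    x = x * np.sqrt(B * P) / np.linalg.norm(x)   # average power constraint
    return x[:B] + 1j * x[B:]                    # real pairs -> complex symbols
```

After normalization, the mean of the squared symbol magnitudes equals P, matching the per-camera power budget.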
At the receiver, perfect channel state information is assumed for signal detection. The detected symbols ${\hat{\boldsymbol{X}}}\,{\in}\,{\Bbb{R}}^{{N}\,{\times}\,{2}{B}}$ are fed into the cooperative JSC decoder, which outputs the concatenated semantic features of multiple cameras, represented as ${\hat{\boldsymbol{G}}}\,{\in}\,{\Bbb{R}}^{NF}$. For the ID retrieval task, the recovered semantic features can be directly used for identification, and hence, the cooperative semantic decoder is not involved in this case.
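With perfect channel state information at the receiver, detection can be illustrated with a simple zero-forcing detector (a sketch under that assumption; the article does not specify which detector is used):

```python
import numpy as np

def zf_detect(Y, H):
    """Zero-forcing detection: recover the N users' symbol streams from
    the M-antenna observations Y = H X + W via the pseudo-inverse of H."""
    return np.linalg.pinv(H) @ Y
```

In the noiseless case, the per-user symbol streams are recovered exactly whenever H has full column rank (i.e., M >= N).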
The cooperative task performer is specifically designed for the vehicular ID retrieval task. To incorporate the semantic features of multiple cameras, a fusion module is applied to fuse the recovered individual semantic features into a global semantic feature ${\hat{\boldsymbol{g}}}_{f}\,{\in}\,{\Bbb{R}}^{F}$. A dynamically tailored weight allocation strategy is learned for fusion by training with a large set of images, where higher weights are assigned to the semantic features of higher effectiveness so that they contribute more to the identification task. Finally, the identifier retrieves the identification with the global semantic feature. The identification corresponding to the source image is indicated by the maximum of the probability vector ${\hat{\boldsymbol{c}}}_{f}\,{\in}\,{\Bbb{R}}^{S}$, where S is the total number of identifications in the training data. Note that the identifier is used only during training to learn efficient feature representations; in the testing stage, the task is performed by calculating distances between semantic features.
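The learned weight allocation can be sketched as a softmax-weighted combination (a toy illustration in which the effectiveness scores are passed in; in Co-SC they are produced by the trained fusion layers):

```python
import numpy as np

def fuse_features(G_hat, scores):
    """Fuse per-camera semantic features G_hat (N, F) into one global
    feature (F,) using softmax weights over effectiveness scores (N,),
    so more effective features contribute more to identification."""
    w = np.exp(scores - np.max(scores))   # stable softmax
    w /= w.sum()
    return w @ G_hat
```

With equal scores this reduces to averaging; a dominant score makes the global feature track that camera's feature almost exclusively.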
To make the system more robust in practical applications, a gating module is implemented to verify whether the camera images describe the same vehicle, based on which different identification strategies are used. The input of the gating module is the difference between the recovered semantic features of two individual cameras, while the output is a binary indicator $\varphi$, where a value of zero indicates that the two semantic features do not belong to the same identification and a value of one indicates that they do. The semantic features of the same identification are fed into the fusion module to obtain the global semantic features for vehicular ID retrieval. For the remaining semantic features, which describe distinct vehicles, the task is performed separately, without fusion. The detailed implementation of each module is listed in Table 1.
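A minimal sketch of such a gating module follows (with hypothetical parameters W and b standing in for the trained layers listed in Table 1):

```python
import numpy as np

def gating(g1, g2, W, b, threshold=0.5):
    """Binary gate: returns 1 if the two recovered semantic features
    are judged to describe the same vehicle, 0 otherwise. The input is
    the feature difference; W and b stand in for trained parameters."""
    z = float(W @ (g1 - g2) + b)
    phi = 1.0 / (1.0 + np.exp(-z))   # sigmoid confidence
    return int(phi > threshold)
```

Identical features yield a zero difference, so a positive bias pushes the gate toward "same vehicle", while a large feature difference drives it toward zero.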
Table 1 The parameter settings of Co-SC and DL-based baselines.
The proposed system is trained with a four-stage strategy at the server. We first train the whole network without the gating module, referred to as the backbone network, and then the gating module is trained with other modules frozen.
First, the semantic encoder and identifier are trained to learn the feature extraction strategy without being attached to other modules. This stage needs to be performed only once, and the trained semantic encoder will be loaded for individual cameras as the pretrained model. The trained identifier is loaded at the receiver and shared by all cooperative cameras. In the second stage, the JSC encoders of multiple users and the cooperative JSC decoder are jointly trained to minimize the distance between the recovered semantic features and the transmitted ones, which are extracted by the trained semantic encoders. The distance is measured by the mean square error (MSE) in this case study. The parameters of other modules will not be updated in this stage. In the third stage, the whole backbone network is jointly trained. The gating module is trained in the final stage to evaluate the correlation based on the individual semantic features recovered by the trained backbone network obtained through the previous three stages.
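The four-stage schedule above can be summarized as follows (a sketch of which parameter sets are updated in each stage, together with the stage 2 MSE objective; the module names are illustrative):

```python
import numpy as np

# Which modules are trainable in each of the four stages (the rest frozen).
STAGES = [
    {"semantic_encoder", "identifier"},                    # 1: feature pretraining
    {"jsc_encoders", "cooperative_jsc_decoder"},           # 2: JSC-Codec, MSE loss
    {"semantic_encoder", "identifier",
     "jsc_encoders", "cooperative_jsc_decoder", "fusion"}, # 3: joint backbone
    {"gating"},                                            # 4: gating, backbone frozen
]

def stage2_loss(G_hat, G):
    """Stage 2 objective: MSE between the recovered semantic features
    and the transmitted ones extracted by the trained semantic encoder."""
    return np.mean((G_hat - G) ** 2)
```

The gating module is excluded from the first three stages, matching the backbone-then-gating training order described above.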
This section presents the evaluation results of the proposed Co-SC on the VeRi-776 dataset [13]. The Euclidean distance, one of the most widely used measures in image retrieval tasks, is applied to measure the distance between semantic features. The calculated distances are ranked in a list in ascending order. Based on the list, two of the most popular performance metrics, rank-n accuracy and mean average precision (mAP), are evaluated. Specifically, rank-n accuracy indicates the proportion of query images that are correctly retrieved within the first n results in the list. In the following, rank-1 accuracy is used to assess the performance. The mAP is the mean of the average precision (AP) over all queries, where the AP reflects the proportion of correctly retrieved gallery images in the ranked list.
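Under these definitions, both metrics can be computed from a query-gallery distance matrix as follows (a minimal sketch assuming the Euclidean distances are precomputed):

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """Rank-1 accuracy and mAP from a (Q, G) distance matrix between
    query and gallery semantic features (smaller distance = better)."""
    r1, aps = [], []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])                 # ascending distance
        match = gallery_ids[order] == query_ids[q]
        r1.append(match[0])                         # is the top result correct?
        prec = np.cumsum(match) / (np.arange(match.size) + 1)
        aps.append((prec * match).sum() / max(match.sum(), 1))
    return float(np.mean(r1)), float(np.mean(aps))
```

A perfect ranking, in which every correct gallery image precedes every incorrect one, yields a rank-1 accuracy and mAP of 1.0.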
The proposed Co-SC scheme is compared with traditional transmission methods and DL-based semantic transmission methods, with different levels of cooperation as baselines. The evaluation is performed with both the correlated and uncorrelated cases (i.e., cameras capture different vehicles) in the test data. The implementation details of the baselines are provided in the following:
Note that for both DL-based baselines, the task is performed individually with the recovered semantic features ${\hat{\boldsymbol{g}}}_{i}$ of each camera. All the cameras share the parameters of the identifier during training. The training strategy of Co-SC is adopted. The parameters of Co-SC and the DL-based baselines are given in Table 1. Commonly used training configurations can be found in [11].
For simplicity, a two-user case is considered. Single-tap Rayleigh fading channels are adopted, with the channel coefficients following ${\Bbb{CN}}{(0,1)}$. The power budget P is set as one for each user. The number of receiver antennas M is four. The number of complex symbols transmitted by the DL-based methods, B, is set as 16 in all simulations.
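This setup can be reproduced with a short channel sketch (a hypothetical helper; the noise power is set from the target SNR):

```python
import numpy as np

def rayleigh_mac(X, M, snr_db, seed=0):
    """Single-tap Rayleigh multiple-access channel: N users (rows of X,
    each B unit-power complex symbols) transmit over a shared channel to
    an M-antenna receiver. Returns the received signal Y (M, B) and the
    channel matrix H (M, N) with CN(0, 1) coefficients."""
    rng = np.random.default_rng(seed)
    N, B = X.shape
    H = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    n0 = 10.0 ** (-snr_db / 10.0)                  # noise power per antenna
    W = np.sqrt(n0 / 2) * (rng.standard_normal((M, B))
                           + 1j * rng.standard_normal((M, B)))
    return H @ X + W, H
```

With the simulation parameters above (N = 2 users, M = 4 antennas, B = 16 symbols), Y is a 4 x 16 complex matrix observed per transmission block.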
Figure 5 provides visualization results of the identification task with the proposed Co-SC architecture and the DL-S baseline. Figure 5(a) contains the original query images provided by the two cameras and the rank-5 lists of the corresponding retrieved gallery images, where incorrect results are marked with red boxes. Figure 5(b) is the visualization of the contributing feature maps obtained by gradient-weighted class activation mapping [15], where warmer colors indicate features with more significance. Leveraging the correlation among users, Co-SC can retrieve more correct results by combining the informative semantic features even when the query image is quite different from the gallery images. As demonstrated in Figure 5(a), in the third line, the query image from camera 2 is the back of a vehicle, and Co-SC helps the server retrieve more gallery images corresponding to the correct vehicle, including images captured from the front view, by incorporating the semantic feature of a front-view image from camera 1. In comparison, the noncooperative method DL-S fails to distinguish vehicles with similar looks: all the retrieved gallery images for camera 2 are incorrect, as evident in the fourth line of Figure 5(a). Further analysis in Figure 5(b) indicates that DL-S is prone to being distracted by irrelevant features, such as backgrounds, due to the lack of cooperation between users. Specifically, as indicated in Figure 5(b), DL-S pays more attention to the backgrounds in the query image and some gallery images, which are highlighted with white circles. This misleads the server into classifying these images as the same vehicle, due to the great similarity among the semantic features of the backgrounds rather than the cars.
Figure 5 (a) The retrieval results of Co-SC and DL-S. Red boxes indicate incorrect results, which are retrieved gallery images with different identifications than the query image. (b) The gradient-weighted class activation mapping visualization of feature maps. Warmer colors indicate contributing semantic features, while white circles indicate distracting backgrounds.
Figure 6 presents the MSE of the cooperative JSC decoder in Co-SC and the separate JSC decoder in DL-S. It can be observed that the MSE of both methods decreases as the SNR increases. The cooperative JSC decoder improves the recovery performance significantly in the low-SNR regime and achieves performance similar to that of the separate decoder when the SNR is high. In other words, the semantic-level correlation facilitates a more robust transmission by the proposed cooperative JSC coding scheme.
Figure 6 The MSE of the cooperative JSC decoder and separate JSC decoder.
The identification performance is presented in Figure 7, where the DL-based semantic transmission methods with limited symbols all outperform the traditional JPEG + LDPC + BPSK and SoftCast methods. Moreover, the average numbers of symbols used in the two traditional methods are about ${3.1}\,{\times}\,{10}^{5}$ and ${2.6}\,{\times}\,{10}^{5}$, respectively, far more than the 16 symbols of the proposed semantic-based method, verifying the superiority of SC in reducing data traffic. Co-SC achieves the best performance among the three semantic transmission methods, followed by Co-SC without fusion and DL-S. At –3 dB, the rank-1 accuracy gaps between Co-SC and, respectively, Co-SC without fusion and DL-S are 2.4% and 28.9%, while in terms of the mAP, which evaluates the identification performance from a global view, the gaps are 8.3% and 26.7%, respectively.
Figure 7 The rank-1 accuracy and mAP comparison among Co-SC, Co-SC without fusion, DL-S, JPEG + LDPC + BPSK, and SoftCast under Rayleigh channels. (a) The rank-1 accuracy of the different methods. (b) The mAP of the different methods.
In this article, we proposed a cooperative semantic-aware architecture for multiuser communications in the IoV to significantly reduce data traffic. This architecture marks a shift from traditional syntactic communications among users to SC for IoV applications. We presented the main guidelines and design principles of the Co-SC architecture, which is flexible enough to be adapted to different applications. We highlighted its advantages by implementing a case study. Experimental results show that 1) conveying the semantics of the source data requires fewer spectrum resources than conventional syntactic symbol transmission and 2) the correlation of semantics among different users, leveraged by the cooperative JSC coding scheme, achieves better semantic reconstruction performance without extra communication overhead. Consequently, the proposed architecture requires fewer radio resources to achieve better performance, easing the spectrum scarcity challenge in the IoV.
This article is an initial work presenting a view of conveying semantics in the IoV for multiuser communications. With dedicated adaptations, the proposed architecture is promising for use in more cooperative communication scenarios, such as communication from servers to vehicular users. Substantial further research is required in the following areas:
This work was supported, in part, by the National Natural Science Foundation of China, under grant 62293485; Fundamental Research Funds for the Central Universities, under grant 2022RC18; and China Scholarship Council. Fengyu Wang is the corresponding author of this article.
Wenjun Xu (wjxu@bupt.edu.cn) is a professor with the School of Artificial Intelligence, Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, 100876 Beijing, China, and with Peng Cheng Laboratory, 518066 Shenzhen, China. His research interests include artificial intelligence-driven networks, semantic communications, unmanned aerial vehicle communications and networks, green communications and networking, and cognitive radio networks. He is an editor of China Communications and a Senior Member of IEEE.
Yimeng Zhang (yimengzhang@bupt.edu.cn) is currently pursuing her Ph.D. degree at the School of Artificial Intelligence, Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, 100876 Beijing, China. Her research interests include semantic communications and intelligent resource allocation in emerging wireless applications. She is a Graduate Student Member of IEEE.
Fengyu Wang (fengyu.wang@bupt.edu.cn) is currently a lecturer with the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, 100876 Beijing, China. Her research interests include integrated sensing and communications, semantic communications, wireless sensing, and statistical signal processing. She is a Member of IEEE.
Zhijin Qin (qinzhijin@tsinghua.edu.cn) is currently an associate professor with the Department of Electronic Engineering, Tsinghua University, 100084 Beijing, China. Her research interests include semantic communications. She is an associate editor of IEEE Transactions on Communications, IEEE Transactions on Cognitive Networking, and IEEE Communications Letters. She has received several awards from the IEEE Communications Society and IEEE Signal Processing Society. She is a Senior Member of IEEE.
Chenyao Liu (liuchenyao@bupt.edu.cn) is currently pursuing her Ph.D. degree at the School of Artificial Intelligence, Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, 100876 Beijing, China. Her research interests include semantic communications, video coding, and machine learning.
Ping Zhang (pzhang@bupt.edu.cn) is currently a professor with the School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, 100876 Beijing, China, where he is the director of the State Key Laboratory of Networking and Switching Technology. He is also with the Department of Broadband Communication, Peng Cheng Laboratory, Shenzhen, China. His research interests include wireless communications. He is an academician of the Chinese Academy of Engineering and a Fellow of IEEE.
[1] F. Wang, X. Zeng, C. Wu, B. Wang, and K. J. R. Liu, “Driver vital signs monitoring using millimeter wave radio,” IEEE Internet Things J., vol. 9, no. 13, pp. 11,283–11,298, Jul. 2022, doi: 10.1109/JIOT.2021.3128548.
[2] H. Zhou, W. Xu, J. Chen, and W. Wang, “Evolutionary V2X technologies toward the internet of vehicles: Challenges and opportunities,” Proc. IEEE, vol. 108, no. 2, pp. 308–323, Feb. 2020, doi: 10.1109/JPROC.2019.2961937.
[3] A. Arooj, M. S. Farooq, A. Akram, R. Iqbal, A. Sharma, and G. Dhiman, “Big data processing and analysis in internet of vehicles: Architecture, taxonomy, and open research challenges,” Arch. Comput. Methods Eng., vol. 29, no. 2, pp. 793–829, May 2021, doi: 10.1007/s11831-021-09590-x.
[4] Z. Hu, D. Wang, Z. Li, M. Sun, and W. Wang, “Differential compression for mobile edge computing in internet of vehicles,” in Proc. Int. Conf. Wireless Mobile Comput., Netw. Commun. (WiMob), Barcelona, Spain, 2019, pp. 336–341, doi: 10.1109/WiMOB.2019.8923430.
[5] P. Zhang et al., “Toward wisdom-evolutionary and primitive-concise 6G: A new paradigm of semantic communication networks,” Engineering, vol. 8, pp. 60–73, Jan. 2022, doi: 10.1016/j.eng.2021.11.003.
[6] Z. Qin, X. Tao, J. Lu, W. Tong, and G. Y. Li, “Semantic communications: Principles and challenges,” 2021, arXiv:2201.01389.
[7] H. Xie et al., “Deep learning enabled semantic communication systems,” IEEE Trans. Signal Process., vol. 69, pp. 2663–2675, 2021, doi: 10.1109/TSP.2021.3071210.
[8] Z. Weng and Z. Qin, “Semantic communication systems for speech transmission,” IEEE J. Sel. Areas Commun., vol. 39, no. 8, pp. 2434–2444, Sep. 2021, doi: 10.1109/JSAC.2021.3087240.
[9] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, “Deep joint source-channel coding for wireless image transmission,” IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 3, pp. 567–579, May 2019, doi: 10.1109/TCCN.2019.2919300.
[10] H. Xie, Z. Qin, and G. Y. Li, “Task-oriented multi-user semantic communications for VQA,” IEEE Wireless Commun. Lett., vol. 11, no. 3, pp. 553–557, Mar. 2022, doi: 10.1109/LWC.2021.3136045.
[11] Y. Zhang, W. Xu, H. Gao, and F. Wang, “Multi-user semantic communications for cooperative object identification,” in Proc. IEEE Int. Conf. Commun. (ICC) Workshops, Seoul, South Korea, May 2022, pp. 157–162, doi: 10.1109/ICCWorkshops53468.2022.9814491.
[12] H. Ye, L. Liang, G. Y. Li, and B.-H. Juang, “Deep learning-based end-to-end wireless communication systems with conditional GANs as unknown channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3133–3143, May 2020, doi: 10.1109/TWC.2020.2970707.
[13] X. Liu et al., “Large-scale vehicle re-identification in urban surveillance videos,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Seattle, WA, USA, Jul. 2016, pp. 1–6, doi: 10.1109/ICME.2016.7553002.
[14] S. Jakubczak and D. Katabi, “A cross-layer design for scalable mobile video,” in Proc. Int. Conf. Mobile Comput. Netw., New York, NY, USA, Sep. 2011, pp. 289–300, doi: 10.1145/2030613.2030646.
[15] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 618–626, doi: 10.1109/ICCV.2017.74.
Digital Object Identifier 10.1109/MVT.2022.3227723