Zhongxu Hu, Yiran Zhang, Yang Xing, Yifan Zhao, Dongpu Cao, Chen Lv
Accurate dynamic driver head pose tracking is of great importance for driver–automotive collaboration, intelligent copilot, head-up display (HUD), and other human-centered automated driving applications. To further advance this technology, this article proposes a low-cost and markerless head-tracking system using a deep learning-based dynamic head pose estimation model. The proposed system requires only a red, green, blue (RGB) camera without other hardware or markers. To enhance the accuracy of the driver’s head pose estimation, a spatiotemporal vision transformer (ST-ViT) model, which takes an image pair as the input instead of a single frame, is proposed. Compared to a standard transformer, the ST-ViT contains a spatial–convolutional vision transformer and a temporal transformer, which improve the model’s performance. To handle the error fluctuation of the head pose estimation model, this article proposes an adaptive Kalman filter (AKF). By analyzing the error distribution of the estimation model and the user experience of the head tracker, the proposed AKF includes an adaptive observation noise coefficient that adaptively moderates the smoothness of the curve. Comprehensive experiments show that the proposed system is feasible and effective and that it achieves state-of-the-art performance.
Intelligent driving is currently a hot research topic that requires a combination of multiple disciplines and algorithms. Developing and testing algorithms on real intelligent vehicles is an expensive and time-consuming process. The development of simulation technology provides an alternative approach, as it can offer physically and visually realistic simulations for several research goals and can also collect a large number of annotated samples to leverage deep learning and machine learning [1]. The driving simulator cockpit is a widely used experimental platform, and immersion is one of its key characteristics. One way to improve visual realism is to use virtual reality (VR) devices, but these introduce two problems: 1) dizziness caused by the serious mismatch between the fixed seat and the dynamic virtual graphics and 2) the VR glasses covering the driver’s face, which makes it impossible to conduct research on the driver’s state [2]. Therefore, this study proposes a vision-based driver head-tracking system to improve immersion and interaction, as shown in Figure 1. This technique can also be used to improve the user experience with HUDs in intelligent vehicles and other driver-in-the-loop applications.
Figure 1 The framework of the proposed dynamic driver head pose tracking system. (a) The driving cockpit, which includes the input devices, computing server, and RGB camera. (b) The proposed ST-ViT model is adopted as a measurer to estimate the pose and its result as the observation. The proposed AKF is used to optimize the estimation. Finally, the virtual camera of the simulator is aligned with the output of the framework. CNN: convolutional neural network; Q: Query; K: Key; V: Value; BN: batch normalization; ConvTE: Convolutional Transformer Encoder; PE: positional embedding; MLP: multilayer perceptron; PYR: pitch, yaw, roll.
The head pose is an important cue that has been used in several human–machine interaction fields. Zhao et al. [3] proposed an orientation sensor-based head-tracking system to monitor the behavior of drivers engaging in various nondriving activities. Kang et al. [4] presented a sensor fusion method that integrates an inertial measurement unit, infrared (IR) LEDs, a charge-coupled device camera, and other sensors. Ng et al. [5] developed a low-cost head-tracking device for a VR system based on SteamVR tracking technology. These methods typically combine different types of sensors to build the system. Several similar products exist for flight simulators; they typically require special devices or optical markers, such as an IR camera. Although some devices require only an RGB camera, they all require the user to manually adjust the relevant parameters, and they typically rely on traditional head pose estimation methods. Therefore, this study proposes a low-cost and markerless solution that depends only on an RGB sensor as the input device, together with a deep learning-based dynamic head pose estimation model developed to improve the accuracy of the system.
Currently, several types of head pose estimation models combined with multimodality inputs have been proposed to achieve state-of-the-art performance. These methods can be divided into model-based and model-free methods [6]. Model-based methods typically use a deformable head model to fit the input image; they locate facial landmarks to align with the predefined model and are generally time consuming. Model-free approaches are more popular; they train a regression model to map the head image to the pose manifold, and deep learning-based models are commonly adopted. To improve the model performance, facial landmarks are also leveraged in certain model-free methods, combined with vision geometry algorithms or multitask learning, to estimate the head pose [7]. To eliminate the influence of illumination intensity, the depth image has been explored to obtain more robust head poses under poor illumination or large illumination variations; the depth image can also provide additional depth information to improve the model accuracy [8]. These methods estimate the head pose independently for each frame. Aiming at a dynamic head tracker, this study focuses on leveraging the prior frame to improve the performance of the model. A recurrent neural network (RNN) is widely used to handle sequential data and can be combined with a convolutional neural network (CNN) to handle video-based tasks. Recently, self-attention-based models, particularly vision transformers, have shown great potential in multiple tasks [9]; when trained on large data sets, they outperform inductive-bias methods, including CNN and RNN models. However, these transformers typically focus on either the spatial information of the image or the temporal features of the sequential data. Therefore, this study proposes a novel ST-ViT structure that achieves better performance; it is compared with and analyzed against a CNN–RNN-based model.
The estimated curve over consecutive frames fluctuates owing to the error variance of the model. A Kalman filter (KF) is used for postprocessing to address this problem. By analyzing the error distribution of the estimation model, the proposed AKF improves the filtering performance through an adaptive observation noise coefficient, which adaptively moderates the smoothness and keeps the curve stable near the initial position.
The main contributions of this study are as follows: 1) a low-cost, markerless driver head-tracking system that requires only a frontal RGB camera and is deployed on a driving simulator; 2) an ST-ViT model for dynamic head pose estimation that takes an image pair rather than a single frame as the input; and 3) an AKF with an adaptive observation noise coefficient that smooths the estimated pose while preserving accuracy.
The purpose of this study is to develop a vision-based dynamic head tracker and implement it on a driving simulator whose view can be automatically aligned with the driver’s head pose using a frontal RGB camera. The benefits are as follows: 1) It can improve the immersion and interaction of the simulator. The driver’s view will be unconstrained and nonfixed, and the virtual camera will be synchronized with the driver’s head pose. 2) The extracted head pose can also be used to monitor the state of the driver and improve the user experience of other human-centric applications. 3) This is a low-cost solution that uses a noninvasive camera sensor.
The development of deep learning and computer vision technology provides the basis for the proposed method. Current state-of-the-art head pose estimation methods typically use a single frame as the input. In this study, the prior frame is leveraged and combined with the current frame as the input to improve the performance of the model. A novel ST-ViT is proposed to achieve this task. To smooth the inconsistency and volatility of the estimation, this study also proposes an AKF. The overall proposed architecture is illustrated in Figure 1.
Head pose estimation, a crucial problem with several applications, is the task of inferring the 3D pose (pitch, yaw, and roll) of the head from an input image. Several methods exist that use multimodality input data, including depth images, RGB images, and video clips. Considering the cost of hardware and computing resources, this study investigates a dynamic head pose estimation approach based on the RGB image.
With the development of deep learning, research on head pose estimation has also achieved good results, but these works typically use a single frame as the input. In this study, the prior frame is leveraged and combined with the current frame as an input pair. An RNN is a widely used model for this type of sequential data and can use a CNN as the feature extractor to handle video-based tasks; we adopt this type of structure in this study. Notably, the transformer has shown significant potential, particularly in natural language processing. Researchers have also begun to apply it to computer vision tasks and have proposed several vision transformer models [9]. Compared to inductive-bias models, such as CNNs and RNNs, the transformer is better at handling a large amount of data and achieves superior performance on large data sets.
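As a point of reference for the comparison presented later, the following is a minimal sketch of such a CNN–LSTM structure, assuming a PyTorch implementation; the backbone choice, feature dimension, and hidden size are illustrative assumptions rather than the exact configuration used in this study.

```python
# Minimal sketch of a CNN-LSTM model for sequential head pose regression.
# Module choices and dimensions are illustrative, not the authors' exact code.
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b0

class CnnLstmBaseline(nn.Module):
    def __init__(self, feat_dim=1280, hidden_dim=256):
        super().__init__()
        # Shared CNN backbone applied to every frame of the input sequence.
        self.backbone = efficientnet_b0(weights="IMAGENET1K_V1").features
        self.pool = nn.AdaptiveAvgPool2d(1)
        # The LSTM aggregates the per-frame feature vectors over time.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # The regression head predicts pitch, yaw, and roll of the last frame.
        self.head = nn.Linear(hidden_dim, 3)

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = frames.flatten(0, 1)                    # (B*T, 3, H, W)
        x = self.pool(self.backbone(x)).flatten(1)  # (B*T, feat_dim)
        out, _ = self.lstm(x.view(b, t, -1))        # (B, T, hidden_dim)
        return self.head(out[:, -1])                # pose of the latest frame
```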
To handle the dynamic driver head pose, this study proposes an ST-ViT architecture, as illustrated in Figure 1. The input pair consists of images loosely cropped by a face detector, which allows the model to focus on the head area and makes the model easier to train. Rather than following the standard vision transformer approach, which requires a large data set for training, the ST-ViT adopts a pretrained feature extractor as the CNN backbone. The feature extractor shares its weights between the two images of the input pair, and the extracted feature maps are fed separately into the spatial–convolutional vision transformer (S-ViT) module.
In the S-ViT module, the positional embedding (PE) is learnable, and the transformer encoder is convolutional, as shown in the Convolutional Transformer Encoder (ConvTE) module of Figure 1. It computes the Query, Key, and Value (QKV) through a convolutional layer rather than the linear layer of the standard transformer as follows: \begin{align*} QKV_{(u,v)} &= \mathrm{BN}\big(\mathrm{Conv}(x, W_{QKV})\big) \\ &= \mathrm{BN}\Bigg(\sum_{i}\sum_{j} w_{qkv_{u-i,\,v-j}} \cdot x_{i,j}\Bigg), \tag{1} \end{align*} where BN denotes the batch normalization layer, Conv denotes the convolutional layer without bias, and $W_{QKV}$ denotes the corresponding weight kernel. The QKV is then used to extract the spatial attention information using the multihead attention mechanism as follows: \begin{align*} x_{\text{out}} &= \mathrm{Conv}_{\text{out}}\big(\mathrm{Attention}(Q, K, V)\big) \\ &= \mathrm{Conv}_{\text{out}}\left(\mathrm{softmax}\left(\frac{QK^{T} + Pos}{\sqrt{d_{k}}}\right)V\right), \tag{2} \end{align*} where $Pos$ denotes the learnable position bias and ${d}_{k}$ denotes the dimension of the Key. The attention mechanism leverages the Query and Key to obtain the similarity or correlation of the feature maps or vectors and then performs a weighted sum with the Value. Because the convolutional layer, rather than a linear layer, preserves the spatial form of the feature maps, a residual connection can be used to avoid network degradation. Through the convolutional multihead attention module and the two convolutional feedforward layers, the spatial dependencies and relationships of the feature maps can be captured.
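To make (1) and (2) concrete, the following is a minimal PyTorch sketch of a convolutional multihead attention block of this kind; the kernel size, channel count, head count, and the shape of the learnable position bias are assumptions for illustration rather than the exact ConvTE configuration.

```python
# Sketch of convolutional multihead attention following (1)-(2); sizes are assumed.
import torch
import torch.nn as nn

class ConvAttention(nn.Module):
    def __init__(self, channels=1280, heads=8, tokens=7 * 7):
        super().__init__()
        self.heads, self.d_k = heads, channels // heads
        # Eq. (1): Q, K, V from a bias-free convolution followed by batch norm.
        self.to_qkv = nn.Sequential(
            nn.Conv2d(channels, 3 * channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(3 * channels),
        )
        # Learnable position bias Pos added to the attention logits, per head.
        self.pos = nn.Parameter(torch.zeros(heads, tokens, tokens))
        # Output projection kept convolutional so the feature-map shape is preserved.
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)   # each (B, C, H, W)
        # Flatten spatial positions into tokens: (B, heads, H*W, d_k).
        split = lambda t: t.reshape(b, self.heads, self.d_k, h * w).transpose(-1, -2)
        q, k, v = split(q), split(k), split(v)
        # Eq. (2): softmax((QK^T + Pos) / sqrt(d_k)) V.
        attn = torch.softmax((q @ k.transpose(-1, -2) + self.pos) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(-1, -2).reshape(b, c, h, w)
        # The residual connection is possible because the output keeps the input shape.
        return x + self.proj(out)
```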
The temporal vision transformer (T-ViT) receives the feature vectors of the image pair through an average pooling layer, and a linear projection layer is used to embed the feature vectors. In this module, the PE uses a sine–cosine function, as shown in (3), to calculate the position encoding, and an extra token is concatenated with the embedding vectors; this token is designed to yield the prediction for the last frame through the final multilayer perceptron (MLP) head: \[ \begin{cases} PE_{(k,\,2i)} = \sin\!\left(k / 10{,}000^{2i/d_{\text{vector}}}\right) \\ PE_{(k,\,2i+1)} = \cos\!\left(k / 10{,}000^{2i/d_{\text{vector}}}\right) \end{cases} \tag{3} \] where $PE_{(k,\,2i)}$ represents the position encoding at the $2i$th dimension of the feature vector of the $k$th frame, and ${d}_{\text{vector}}$ denotes the dimension of the feature vector. The temporal transformer module employs the standard transformer encoder, in which linear layers compute the QKV.
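The encoding in (3) and the extra prediction token can be sketched as follows; the embedding dimension, sequence length, and the position of the token in the sequence are illustrative assumptions.

```python
# Sketch of the sine-cosine positional encoding in (3) plus the extra token.
import torch

def sincos_position_encoding(num_frames: int, d_vector: int) -> torch.Tensor:
    """PE[k, 2i] = sin(k / 10000^(2i/d)), PE[k, 2i+1] = cos(k / 10000^(2i/d))."""
    k = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    two_i = torch.arange(0, d_vector, 2, dtype=torch.float32)        # (d/2,)
    angle = k / torch.pow(10000.0, two_i / d_vector)                 # (T, d/2)
    pe = torch.zeros(num_frames, d_vector)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# Usage: embed the two pooled frame vectors, add the positional encoding, and
# concatenate a learnable token whose output state feeds the final MLP head.
tokens = torch.randn(1, 2, 256)                     # (batch, frames, d_vector)
extra = torch.zeros(1, 1, 256, requires_grad=True)  # extra prediction token
sequence = torch.cat([tokens + sincos_position_encoding(2, 256), extra], dim=1)
```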
A classic transformer is used to handle sequential tasks and capture temporal dependencies, whereas a typical ViT learns spatial information by splitting an image into several patches. The ST-ViT combines the two in a spatiotemporal architecture to tackle image sequences and also leverages a pretrained CNN backbone to alleviate the dependence on a large data set. In the S-ViT, a convolutional layer, rather than a linear layer, is utilized, which allows a residual connection to be used to avoid network degradation. Overall, the proposed ST-ViT captures spatiotemporal attention rather than the single-dimension attention of a standard transformer.
Although the current head pose estimation method exhibits good performance, a certain amount of error remains. When the method is applied in the simulator, the resulting fluctuation and discontinuity become noticeable. From a practical perspective, the smoothness and continuity of view changes are more important than raw accuracy. To address this problem, a KF is adopted. Kalman filtering is an algorithm that uses a series of measurements observed over time, which contain statistical noise and other inaccuracies, to produce estimates of unknown variables that tend to be more accurate than those based on a single measurement. KFs have proved useful in various applications, such as the guidance, navigation, and control of vehicles. The KF has a relatively simple form and requires little computational power.
Because head motion is typically slow and approximately uniform, it can be modeled with a linear state-transition model that involves only the pose and its velocity, which also reduces the computational load, as shown in the postprocessing module of Figure 1. The output of the head pose estimation model is used as the observation of the AKF. ${R}_{k}$ is the observation noise covariance, which is related to the estimation model and affects the performance of the filter. The results of the head pose estimation model are analyzed to determine ${R}_{k}$. The statistics reveal that the model performs differently over different intervals of the head pose: the accuracy is higher when the pose angle is small, and the error grows at larger angles, particularly on the pitch and roll axes.
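As a sketch under this constant-velocity assumption (the sampling interval $\Delta t$ and the noise covariances are application-dependent choices), the per-axis state-space form can be written as \[ \mathbf{x}_{k} = \begin{bmatrix}\theta_{k} \\ \dot{\theta}_{k}\end{bmatrix}, \qquad \mathbf{x}_{k} = \begin{bmatrix}1 & \Delta t \\ 0 & 1\end{bmatrix}\mathbf{x}_{k-1} + \mathbf{w}_{k}, \qquad z_{k} = \begin{bmatrix}1 & 0\end{bmatrix}\mathbf{x}_{k} + v_{k}, \] where $\theta_{k}$ is the pose angle on one axis, $z_{k}$ is the ST-ViT output used as the observation, $\mathbf{w}_{k}\sim\mathcal{N}(0,Q)$ is the process noise, and $v_{k}\sim\mathcal{N}(0,R_{k})$ is the observation noise with the adaptive covariance ${R}_{k}$ discussed below.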
To quantify this, the BIWI data set [10] was used to evaluate the proposed head pose estimation model, and the results are shown in Figure 2. The pitch and roll are taken as the x- and y-axes, and the error is taken as the z-axis. The blue 3D points represent different samples, and a 2D Gaussian function is fitted to the points, as shown by the curved surface. Therefore, an ${R}_{k}$ that is adaptively adjusted during the iterative process is proposed in this study. It keeps the filtered value close to the observed value when the rotation angle is small, whereas the filtered value becomes smoother when the rotation angle is large.
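The following is a minimal sketch of the AKF on a single axis under the constant-velocity model above; the function adaptive_r is an illustrative inverted-Gaussian stand-in for the 2D Gaussian fit of Figure 2, and its parameters (base, amp, sigma) are assumptions rather than the fitted values.

```python
# Sketch of an adaptive Kalman filter for one rotation axis (angles in degrees).
import numpy as np

def adaptive_r(pitch, roll, base=1.0, amp=8.0, sigma=30.0):
    """Observation noise grows as the pose leaves the small-angle region."""
    return base + amp * (1.0 - np.exp(-(pitch**2 + roll**2) / (2.0 * sigma**2)))

class AdaptiveKalman1D:
    def __init__(self, dt=1.0 / 30.0, q=0.1):
        self.F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity transition
        self.H = np.array([[1.0, 0.0]])              # only the angle is observed
        self.Q = q * np.eye(2)                       # process noise covariance
        self.x = np.zeros(2)                         # state: [angle, angular rate]
        self.P = np.eye(2)                           # state covariance

    def step(self, z, r_k):
        # Predict.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update with the adaptive observation noise r_k.
        s = self.H @ self.P @ self.H.T + r_k
        gain = self.P @ self.H.T / s
        self.x = self.x + (gain * (z - self.H @ self.x)).ravel()
        self.P = (np.eye(2) - gain @ self.H) @ self.P
        return self.x[0]

# Usage: filter the pitch channel, letting the current pitch/roll estimates set R_k.
kf_pitch = AdaptiveKalman1D()
for pitch_obs, roll_obs in [(2.0, 1.0), (35.0, 10.0)]:   # raw ST-ViT outputs
    smoothed_pitch = kf_pitch.step(pitch_obs, adaptive_r(pitch_obs, roll_obs))
```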
Figure 2 The error of the proposed head pose estimation model on the pitch and roll axes tested on the BIWI data set. The 3D blue points represent each sample, and the curved surface is the result of 2D Gaussian fitting.
The BIWI data set was chosen to evaluate the proposed method because, unlike other common head pose data sets, it provides sequential images. It contains 24 videos with more than 15,000 images of 20 subjects (14 males and 6 females). For each sample, an RGB image and the corresponding annotation are provided. The head pose range covers approximately ±75° of yaw, ±60° of pitch, and ±60° of roll [10]. The ground truth is provided as the 3D location of the head and its rotation matrix, which can be converted to pitch, yaw, and roll.
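The conversion from the rotation-matrix ground truth to Euler angles can be sketched as follows; which of the three returned angles is labeled pitch, yaw, or roll depends on the axis convention of the camera and the annotation, so that mapping is an assumption here.

```python
# Sketch: decompose a rotation matrix R = Rz(a) @ Ry(b) @ Rx(c) into Euler angles.
# The assignment of (a, b, c) to yaw/pitch/roll is convention-dependent.
import numpy as np

def matrix_to_euler_zyx(R):
    a = np.degrees(np.arctan2(R[1, 0], R[0, 0]))                      # about z
    b = np.degrees(np.arctan2(-R[2, 0], np.hypot(R[2, 1], R[2, 2])))  # about y
    c = np.degrees(np.arctan2(R[2, 1], R[2, 2]))                      # about x
    return a, b, c

print(matrix_to_euler_zyx(np.eye(3)))   # identity rotation -> (0.0, 0.0, 0.0)
```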
In this study, EfficientNet-B0 [11], a popular backbone obtained through a neural architecture search that optimizes both accuracy and efficiency, was used as the CNN backbone to extract the feature maps.
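A brief sketch of extracting the shared feature maps with a pretrained EfficientNet-B0 is shown below, assuming a torchvision implementation; the 224 × 224 input and the resulting 1,280-channel 7 × 7 feature map are standard for this backbone but are stated here as assumptions about the configuration.

```python
# Sketch: a pretrained EfficientNet-B0 as the shared CNN backbone (torchvision API).
# Both frames of the input pair pass through the same weights.
import torch
from torchvision.models import efficientnet_b0

backbone = efficientnet_b0(weights="IMAGENET1K_V1").features   # drop the classifier
backbone.eval()

with torch.no_grad():
    pair = torch.randn(2, 3, 224, 224)      # previous and current cropped frames
    feature_maps = backbone(pair)           # (2, 1280, 7, 7), fed to the S-ViT
```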
To verify the proposed method, four paradigms were designed and compared: 1) a single-frame baseline that regresses the head pose directly from the CNN backbone; 2) a CNN–LSTM model that takes the image pair as the input; 3) a T-ViT model that takes the image pair as the input; and 4) the proposed ST-ViT.
We followed the common threefold cross-evaluation protocol proposed in [12], which splits the data set into 70% (16 videos) for training and 30% (8 videos) for testing. In the training process, the batch size was 16, Adam was used as the optimizer, and the learning rate was $1\times 10^{-4}$. The mean absolute error (MAE), the same metric used in other studies, was adopted. The average results are presented in Figure 3.
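A compact sketch of this training configuration is given below (Adam, learning rate $1\times 10^{-4}$, batch size 16, MAE loss); the data loader and the model construction are assumed to be defined elsewhere.

```python
# Sketch of one training epoch under the reported hyperparameters; the loader is
# assumed to yield (image_pair, pose) batches with pose = (pitch, yaw, roll).
import torch
import torch.nn as nn

def train_one_epoch(model, loader, optimizer, device="cuda"):
    criterion = nn.L1Loss()                      # MAE, matching the reported metric
    model.train()
    for frames, pose in loader:                  # frames: (16, 2, 3, H, W)
        frames, pose = frames.to(device), pose.to(device)
        loss = criterion(model(frames), pose)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Usage: optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
#        loader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
```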
Figure 3 A comparison of the different head estimation models. (a) The MAE (°) among different models. (b) The absolute error (°) distribution of different models.
The results indicated that the image pair improved the performance compared to the single image, especially on the yaw axis, whose value range is large; rotations about this axis also produce more obvious feature changes, so its overall error is smaller. Compared to the LSTM model, the T-ViT model was not always competitive; it had a smaller error only on the pitch axis. Compared with the others, the ST-ViT reduced the error variance, which yielded a marked improvement on the roll axis, where the error variance is typically larger. To further evaluate the models, the error distribution is also displayed, which reflects the percentage of errors falling under different thresholds. Notably, the overall performance of the LSTM was better than that of the T-ViT because the data set does not contain enough samples to fully exploit the learning ability of the transformer, especially since the T-ViT used here is not deep. Nevertheless, the proposed ST-ViT still surpassed the others, achieved the best performance, and has the potential to benefit from a larger number of samples.
The previous results demonstrated that the image pair improved the performance of the models owing to the extra sequential information. To further analyze the effect of the sequence length, we used three consecutive frames as the input to train the ST-ViT and LSTM models, and the results are shown in Figure 4. The comparison indicated that the longer sequence degraded both models, yielding a higher MAE on all three axes. This is because longer sequential information cannot solve the problem of cross-subject evaluation; moreover, the frame rate of the BIWI videos is not high, so the deviation between consecutive frames is large. If the frame rate were higher, longer sequences might perform better. Furthermore, the sequence length can be easily changed in the proposed model depending on the situation. Compared to the ST-ViT, the degradation of the LSTM model was more serious, and for the same input length, the ST-ViT outperformed the LSTM. This demonstrates that the ST-ViT is more robust than the LSTM in handling sequential data: it learned the relationship between consecutive frames and achieved better performance.
Figure 4 A comparison of the different lengths of the sequence input. Sn indicates the length of sequence n.
To comprehensively evaluate the proposed method, it was compared to other sequential-based methods, as shown in Table 1; these models adopted the same training protocol. Because the BIWI data set contains depth images, certain methods leverage depth information to improve performance, and Table 1 also shows the benefit of combining the RGB image with the depth information. To improve the performance of the head pose estimation model, these methods design different types of models and loss functions from different perspectives. Compared to the RGB-based methods, our proposed method achieved state-of-the-art performance, and on the yaw axis it was even superior to the depth-based methods. The performance of the baseline and LSTM methods also demonstrates the importance of the backbone, which provides guidance for future research. The comparison with other methods demonstrates the effectiveness of the proposed method.
Table 1 A comparison of state-of-the-art sequential-based models on the BIWI data set.
Figure 5 illustrates the attention map learned by the ST-ViT model; the attention map for the input image was visualized through the attention scores of the self-attention mechanism. The model attends to representative regions of the face, which coincide with areas similar to facial landmarks that convey the facial expression and orientation. This demonstrates the learning ability of the proposed model and helps explain the mechanism of the S-ViT.
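One way to produce such a visualization is sketched below: the per-head attention scores over the token grid are averaged, upsampled to the image resolution, and blended over the input frame; the pooling of heads and queries and the 7 × 7 grid size are assumptions about how the figure was generated.

```python
# Sketch: overlay an averaged self-attention map on the input frame.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def overlay_attention(image, attn, grid=(7, 7)):
    """image: (3, H, W) tensor in [0, 1]; attn: (heads, N, N) attention scores."""
    score = attn.mean(dim=0).mean(dim=0)               # average heads and queries -> (N,)
    score = score.reshape(1, 1, *grid)
    score = F.interpolate(score, size=image.shape[1:], mode="bilinear",
                          align_corners=False)[0, 0]
    score = (score - score.min()) / (score.max() - score.min() + 1e-8)
    plt.imshow(image.permute(1, 2, 0).numpy())
    plt.imshow(score.numpy(), cmap="jet", alpha=0.4)   # highlight attended regions
    plt.axis("off")
    plt.show()
```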
Figure 5 A visualization of the attention map learned by the proposed ST-ViT model.
To evaluate the proposed pipeline, the ST-ViT was used to estimate the head pose on the BIWI data set, and the results are shown in Figures 2 and 3. The head pose estimation model inevitably has an error and a variance; hence, it cannot be used directly for dynamic head tracking, and a reasonable remedy is to apply a filter to smooth the curve. Considering that the head pose model performs differently over different angle ranges, this study proposes an AKF. We chose a sequence covering a large angle range on the pitch axis as an illustration, and the results are shown in Figure 6. The KF smoothed the curve and reduced the volatility; notably, the ground truth itself also has measurement errors and deviations, which shows that using a filter is both necessary and reasonable. For the standard KF, ${R}_{k}$ is a constant value, namely the mean error of the head pose estimation model. To further improve the performance, the constant ${R}_{k}$ is replaced with the adaptive one described previously, whose parameters come from the Gaussian fitting on the data set. A comparison is presented in Table 2. The standard KF and the filter with an adaptive ${R}_{k}$ almost coincide at low angles, but the filter with the adaptive ${R}_{k}$ performs better in the high-angle range, where its curve is smoother. Any filtering algorithm is a compromise between accuracy and smoothness: increasing smoothness inevitably sacrifices some accuracy. The proposed AKF maintains accuracy while reducing the variance, which is the advantage of an adaptive ${R}_{k}$.
Figure 6 A comparison of the standard KF and AKF for different angle ranges.
Table 2 A comparison of the different filtering methods.
To improve the immersion and interaction of the driving simulator and related applications, this study proposed a dynamic head pose tracking system. The proposed system uses only an RGB camera without other hardware or markers. To enhance the accuracy of the dynamic head pose estimation, this study proposed an ST-ViT model that uses an image pair as the input instead of a single frame. Compared to a standard transformer, this model contains a spatial–convolutional vision transformer and a T-ViT, which improve the effectiveness of the model. Comprehensive experimental comparisons demonstrated that the proposed method outperforms state-of-the-art methods.
Another challenge in deploying the head-tracking system was that the head pose estimation models still had certain errors and, hence, could not be directly adopted. To address this problem, this study proposed postprocessing of the raw estimation. By analyzing the error distribution of the estimation model and user experience, an AKF was proposed that included an adaptive observation noise coefficient that makes the curve smoother in the area where the estimation model has a large error. The experiments showed that the proposed method was feasible and could be deployed in the driving simulator.
This article proposed a reasonable low-cost vision-based solution for head tracking, which can be further optimized as the algorithm of head pose estimation improves. It can also be used in other driver-in-the-loop applications.
This work was supported in part by the A*STAR National Robotics Program under grant W1925d0046; in part by the Start-Up Grant, Nanyang Assistant Professorship, under grant M4082268.050, Nanyang Technological University, Singapore; and in part by the State Key Laboratory of Automotive Safety and Energy under project KF2021.
Zhongxu Hu (zhongxu.hu@ntu.edu.sg) is a research fellow at Nanyang Technological University in Singapore, 637460, Singapore. He received his Ph.D. degree in mechatronics from the Huazhong University of Science and Technology, China, in 2018. His current research interests include human–machine interaction, computer vision, and deep learning applied to driver behavior analysis and autonomous vehicles in multiple scenarios.
Yiran Zhang (yiran.zhang@ntu.edu.sg) is a research associate in the Department of Mechanical and Aerospace Engineering of Nanyang Technological University in Singapore, 637460, Singapore. He received his master’s degree in mechatronics from Hefei University of Technology of China in 2019. His current research interests include human–machine systems, automated driving, and robotics.
Yang Xing (yang.x@cranfield.ac.uk) is a lecturer with the Centre for Autonomous and Cyber-Physical Systems, Department of Aerospace, Cranfield University, Cranfield, MK43 0AL, U.K. He received his Ph.D. degree from Cranfield University in 2018. His research interests include machine learning, human behavior modeling, intelligent multiagent collaboration, and intelligent/autonomous vehicles.
Yifan Zhao (yifan.zhao@cranfield.ac.uk) is currently a lecturer in image and signal processing and degradation assessment at Cranfield University, Cranfield, MK43 0AL, U.K. He received his Ph.D. degree in automatic control and system engineering from the University of Sheffield, Sheffield, U.K., in 2007. His research interests include computer-vision-based process monitoring, superresolution, active thermography, and nonlinear system identification.
Dongpu Cao (dongpu.cao@uwaterloo.ca) is an associate professor and the director of the Driver Cognition and Automated Driving Laboratory at the University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada. He received his Ph.D. degree from Concordia University, Montreal, Quebec, Canada, in 2008. His research interests include vehicle dynamics and control, driver cognition, automated driving, and parallel driving, where he has contributed more than 150 publications and one U.S. patent.
Chen Lv (lyuchen@ntu.edu.sg) is an assistant professor at Nanyang Technological University in Singapore, 639798, Singapore. He received his Ph.D. degree from the Department of Automotive Engineering, Tsinghua University, China, in 2016. His research focuses on advanced vehicles and human–machine systems, where he has contributed more than 100 papers and obtained 12 granted patents.
[1] S. Shah, D. Dey, C. Lovett, and A. Kapoor, “AirSim: High-fidelity visual and physical simulation for autonomous vehicles,” in Advanced Robotics. Berlin, Germany: Springer-Verlag, 2018, pp. 621–635.
[2] Z. Hu, C. Lv, P. Hang, C. Huang, and Y. Xing, “Data-driven estimation of driver attention using calibration-free eye gaze and scene features,” IEEE Trans. Ind. Electron., vol. 69, no. 2, pp. 1800–1808, 2022, doi: 10.1109/TIE.2021.3057033.
[3] Y. Zhao et al., “An orientation sensor-based head tracking system for driver behaviour monitoring,” Sensors, vol. 17, no. 11, p. 2692, 2017, doi: 10.3390/s17112692.
[4] C. H. Kang, C. G. Park, and J. W. Song, “An adaptive complementary Kalman filter using fuzzy logic for a hybrid head tracker system,” IEEE Trans. Instrum. Meas., vol. 65, no. 9, pp. 2163–2173, Sep. 2016, doi: 10.1109/TIM.2016.2575178.
[5] A. K. T. Ng, L. K. Y. Chan, and H. Y. K. Lau, “A low-cost lighthouse-based virtual reality head tracking system,” in Proc. IC3D, Brussels, Belgium, 2017, pp. 1–5.
[6] Z. Hu, Y. Xing, C. Lv, P. Hang, and J. Liu, “Deep convolutional neural network-based Bernoulli heatmap for head pose estimation,” Neurocomputing, vol. 436, pp. 198–209, Jan. 2021, doi: 10.1016/j.neucom.2021.01.048.
[7] R. Valle, J. M. Buenaposada, and L. Baumela, “Multi-task head pose estimation in-the-wild,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 8, pp. 2874–2881, Aug. 2021, doi: 10.1109/TPAMI.2020.3046323.
[8] G. Borghi, M. Fabbri, R. Vezzani, S. Calderara, and R. Cucchiara, “Face-from-depth for head pose estimation on depth images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 3, pp. 596–609, Mar. 2020, doi: 10.1109/TPAMI.2018.2885472.
[9] B. Graham et al., “LeViT: A vision transformer in ConvNet’s clothing for faster inference,” 2021, arXiv:2104.01136.
[10] G. Fanelli, J. Gall, and L. Van Gool, “Real time head pose estimation with random regression forests,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Providence, RI, USA, 2011, pp. 617–624, doi: 10.1109/CVPR.2011.5995458.
[11] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. Int. Conf. Mach. Learn., 2019, pp. 6105–6114.
[12] N. Ruiz, E. Chong, and J. M. Rehg, “Fine-grained head pose estimation without keypoints,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2018, pp. 2074–2083.
[13] S. S. Mukherjee and N. M. Robertson, “Deep head pose: Gaze-direction estimation in multimodal video,” IEEE Trans. Multimedia, vol. 17, no. 11, pp. 2094–2107, Nov. 2015, doi: 10.1109/TMM.2015.2482819.
[14] T. Yang, Y. Chen, Y. Lin, and Y. Chuang, “FSA-Net: Learning fine-grained structure aggregation for head pose estimation from a single image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1087–1096, doi: 10.1109/CVPR.2019.00118.
[15] H. Zhang, M. Wang, Y. Liu, and Y. Yuan, “FDN: Feature decoupling network for head pose estimation,” in Proc. AAAI Conf. Artif. Intell., 2020, vol. 34, no. 7, pp. 12,789–12,796, doi: 10.1609/aaai.v34i07.6974.
Digital Object Identifier 10.1109/MVT.2021.3140047