Transformer-based Long-Term Viewport Prediction in 360° Video: Scanpath is All You Need

18th August 2021

Fig. 1: Illustration of viewport prediction in 360° video. Our model aims to predict the viewport scanpath in the forthcoming F seconds given the past H-second viewport scanpath.

Abstract

Virtual Reality (VR) multimedia technology has dramatically advanced in recent years. Its immersive and interactive nature enables users to freely view any direction in 360° content. Users do not see the entire 360° content at a glance, but only the portion inside the viewport. Viewport-based adaptive streaming, which streams only the user’s viewport of interest at high quality, has emerged as the primary technique to save bandwidth over the best-effort Internet. Predicting a user’s viewport in the forthcoming seconds therefore becomes an essential task for informing the streaming decisions of a VR system. Various viewport prediction methods based on deep neural networks have been proposed. However, they are typically composed of complex Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) that require heavy computation. To achieve high prediction accuracy within the limited computation time of a streaming system, we propose a new transformer-based architecture, named 360° Viewport Prediction Transformer (VPT360), that leverages only the past viewport scanpath to predict a user’s future viewport scanpath. We evaluate VPT360 on three widely-used datasets and compare its computation complexity with state-of-the-art methods. The experiments show that VPT360 provides the highest accuracy for both short-term and long-term prediction while achieving the lowest computation complexity. The code is publicly available at https://github.com/FannyChao/VPT360 to further contribute to the community.

Architecture

Fig. 2: (a) Architecture of our transformer-based VPT360 model, (b) multi-head attention module, (c) scaled dot-product attention.

We introduce a new viewport prediction model, called VPT360, which adopts self-attention layers from the Transformer to predict the user’s viewport positions over the following F seconds. Fig. 2 shows the overall architecture of our transformer-based model. As illustrated in Fig. 2a, each transformer layer comprises two modules: a multi-head self-attention module and a position-wise feed-forward network. The layer can be repeated N times to extract complex features from all elements of the sequence, as sketched below.
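To make this concrete, the following is a minimal PyTorch sketch of one such transformer layer stacked N times over an embedded scanpath sequence. The hyper-parameters (model dimension, number of heads, feed-forward size, number of layers) and the 3-component scanpath input are illustrative assumptions, not the configuration of the official implementation in the linked repository.

```python
# Minimal sketch of a VPT360-style transformer layer (assumed hyper-parameters).
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """Multi-head self-attention followed by a position-wise feed-forward network."""
    def __init__(self, d_model=256, n_heads=8, d_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Scaled dot-product attention over all elements of the scanpath sequence.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        x = self.norm2(x + self.dropout(self.ffn(x)))
        return x

class ScanpathEncoder(nn.Module):
    """Embeds the past viewport scanpath and applies N stacked transformer layers."""
    def __init__(self, in_dim=3, d_model=256, num_layers=6):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)   # e.g. a 3D unit vector per time step
        self.layers = nn.ModuleList(
            [TransformerLayer(d_model) for _ in range(num_layers)])

    def forward(self, scanpath):                  # (batch, H_steps, in_dim)
        x = self.embed(scanpath)
        for layer in self.layers:                 # repeated N times
            x = layer(x)
        return x
```

In this sketch the encoder output would still need a prediction head mapping the features back to viewport positions for the next F seconds; that part is omitted for brevity.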

Experiment Results

Fig. 3: Comparison results on the (a) David_MMSys18 and (b) Wu_MMSys17 datasets.

Fig. 3a, Fig. 3b, and Table I present the performance of our VPT360 compared with state-of-the-art methods and a no-prediction baseline on three datasets. All training and test sets follow the same settings as the compared methods for a fair comparison. Fig. 3a compares our method with Romero_PAMI21 on the David_MMSys18 dataset; our method achieves the best results over the entire 5-second scanpath prediction. Fig. 3b compares our method with a cluster-based method, Taghavi_NOSSDAV20, and two deep-learning-based methods, Nguyen_MM18 and Romero_PAMI21, on the Wu_MMSys17 dataset. The results report prediction accuracy as the average ratio of overlapping tiles for various prediction window lengths. Our method obtains results close to Taghavi_NOSSDAV20 for a 0.5-second prediction window, while outperforming Taghavi_NOSSDAV20 and the other methods for prediction windows longer than 0.5 seconds. Note that the results of Taghavi_NOSSDAV20 cover only prediction windows from 0.5 to 2 seconds, as reported in their paper. A sketch of a tile-overlap measure of this kind is given below.
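For illustration, the following NumPy sketch computes a tile-overlap ratio of the kind plotted in Fig. 3b. The tiling grid (6x12 tiles on the equirectangular frame), the 100°x90° viewport extent, and the normalization by the ground-truth tiles are assumptions for the example, not the exact evaluation protocol of the compared papers.

```python
# Hedged sketch of a per-frame tile-overlap ratio, averaged over a prediction window.
import numpy as np

def viewport_tiles(yaw, pitch, n_rows=6, n_cols=12, fov_deg=(100, 90)):
    """Boolean (n_rows, n_cols) mask of tiles intersecting the viewport.
    yaw in [-180, 180), pitch in [-90, 90]; a crude planar approximation."""
    tile_yaw = (np.arange(n_cols) + 0.5) * 360.0 / n_cols - 180.0
    tile_pitch = 90.0 - (np.arange(n_rows) + 0.5) * 180.0 / n_rows
    dyaw = np.abs((tile_yaw - yaw + 180.0) % 360.0 - 180.0)   # handle wrap-around
    dpitch = np.abs(tile_pitch - pitch)
    return (dpitch[:, None] <= fov_deg[1] / 2) & (dyaw[None, :] <= fov_deg[0] / 2)

def tile_overlap_ratio(pred, gt, **kw):
    """Average over frames of |pred_tiles ∩ gt_tiles| / |gt_tiles|.
    pred, gt: arrays of shape (T, 2) holding (yaw, pitch) in degrees."""
    ratios = []
    for (py, pp), (gy, gp) in zip(pred, gt):
        p_mask = viewport_tiles(py, pp, **kw)
        g_mask = viewport_tiles(gy, gp, **kw)
        ratios.append((p_mask & g_mask).sum() / max(g_mask.sum(), 1))
    return float(np.mean(ratios))
```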

TABLE I: Comparison with Xu_PAMI18: mean overlap scores of FoV prediction with a prediction window length F ≈ 30 ms (1 frame). The best score is shown in bold and the second-best score is underlined.

In Table I, to compare with Xu_PAMI18, which predicts head movement in the next frame, we set the prediction window to one frame (approximately 30 ms) and use mean overlap as the accuracy metric. Our method significantly outperforms Xu_PAMI18 on all 15 test videos. Moreover, the scores of Xu_PAMI18 are lower than those of the no-prediction baseline on all 15 test videos, which implies that its complex architecture does not contribute to remarkable prediction ability. A sketch of a mean-overlap measure of this kind follows.
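For illustration, the sketch below estimates the overlap between a predicted and a ground-truth field of view, averaged over frames. Approximating the FoV as a spherical cap with a 50° half-angle and estimating the overlap by uniform sampling on the sphere are assumptions made for this example, not the exact definition used by Xu_PAMI18.

```python
# Hedged sketch of a mean-overlap measure for single-frame FoV prediction.
import numpy as np

def sph_to_vec(yaw_deg, pitch_deg):
    """Unit view direction from yaw/pitch in degrees."""
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    return np.array([np.cos(pitch) * np.cos(yaw),
                     np.cos(pitch) * np.sin(yaw),
                     np.sin(pitch)])

def mean_overlap(pred, gt, fov_half_angle_deg=50.0, n_samples=2000, seed=0):
    """pred, gt: (T, 2) arrays of (yaw, pitch) in degrees; FoV ≈ spherical cap.
    Returns the mean fraction of the ground-truth FoV covered by the prediction."""
    rng = np.random.default_rng(seed)
    cos_thr = np.cos(np.radians(fov_half_angle_deg))
    # Sample directions uniformly on the sphere once, reuse for every frame.
    v = rng.normal(size=(n_samples, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    scores = []
    for (py, pp), (gy, gp) in zip(pred, gt):
        in_gt = v @ sph_to_vec(gy, gp) >= cos_thr
        in_pred = v @ sph_to_vec(py, pp) >= cos_thr
        if in_gt.sum() > 0:
            scores.append((in_gt & in_pred).sum() / in_gt.sum())
    return float(np.mean(scores))
```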

Fig. 4: Four examples of viewport scanpath predicted by our VPT360 and Romero_PAMI21 on the David_MMSys18 dataset.

From the comparison results, we can conclude that our VPT360 achieves prediction accuracy superior to state-of-the-art methods in both short-term and long-term prediction. Fig. 4 visualizes four examples of viewport scanpaths predicted by our VPT360 and Romero_PAMI21 on the David_MMSys18 dataset.

Publication

Transformer-based Long-Term Viewport Prediction in 360° Video: Scanpath is All You Need
Fang-Yi Chao, Cagri Ozcinar, Aljosa Smolic, IEEE 23rd International Workshop on Multimedia Signal Processing (MMSP), 2021.