Audio-visual Perception of Omnidirectional Video for Virtual Reality Applications

25th April 2020

Fig. 1: Schematic diagram of the designed testbed.


Ambisonics, which constructs a sound distribution over the full viewing sphere, improves the immersive experience in omnidirectional video (ODV) by enabling observers to perceive sound directions. Thus, human attention can be guided by audio and visual stimuli simultaneously. Numerous datasets have been proposed to investigate human visual attention by collecting eye fixations of observers navigating ODV with head-mounted displays (HMDs). However, no such dataset analyzes the impact of audio information. In this paper, we establish a new audio-visual attention dataset for ODV with mute, mono, and ambisonics modalities. Based on video and audio content, we study user behavior in these three audio modalities, including visual attention relative to sound source locations, viewing navigation congruence between observers, and fixation distributions. From our statistical analysis, we preliminarily found that, compared to perceiving visual cues alone, perceiving visual cues together with salient object sound (e.g., a human voice or an ambulance siren) draws more visual attention to the objects making the sound and guides viewing behavior when such objects are not in the current field of view. The more in-depth interactive effects between audio and visual cues in mute, mono, and ambisonics still require further comprehensive study. The dataset and the developed testbed from this initial work are publicly available on GitHub with the paper to foster future research on audio-visual attention for ODV.


Creating immersive VR experiences requires a full spherical audio-visual representation of ODV. In particular, the spatial aspect of audio might also play an important role in informing viewers about the location of objects in the 360-degree environment, guiding visual attention in ODV films, and achieving presence with head-mounted displays (HMDs). However, despite existing evidence of the correlation between audio and visual cues and their joint contribution to our perception, to date, most user behavior studies and algorithms for the prediction of visual attention neglect audio cues and consider visual cues as the only source of attention. This lack of understanding of the audio-visual perception of ODV raises interesting research questions for the multimedia community, such as: how does ODV with and without audio affect users' attention?
To understand the auditory and visual perception of ODV, in this work, we investigated users' audio-visual attention using ODV with three different audio modalities, namely, mute, mono, and ambisonics. We first designed a testbed for gathering users' viewport center trajectories (VCTs), created a dataset with a diverse set of audio-visual ODVs, and conducted subjective experiments for each ODV with the mute, mono, and ambisonics modalities. We analyzed visual attention in the mute modality and audio-visual attention in the mono and ambisonics modalities by investigating the correlation between visual attention and sound source locations, the consistency of viewing paths between observers, and the distribution of visual attention across the three audio modalities. An ODV with ambisonics provides not only auditory cues but also the direction of sound sources, while mono conveys only the loudness of the audio, without direction. Our new dataset includes VCTs and visual attention maps from 45 participants (15 for each audio modality), and our developed testbed will be available with this paper. To the best of our knowledge, this dataset with such audio-visual analysis is the first to address the problem of audio-visual perception of ODV. We expect that this initial study will be beneficial for future research on understanding and anticipating human behavior in VR.


Design of testbed
We developed a JavaScript-based testbed that plays ODVs with three different modalities (i.e., mute, mono, and ambisonics) while recording the VCTs of participants for the whole duration of the experiment. The testbed was implemented using three JavaScript libraries, namely three.js, WebXR, and JSAmbisonics. three.js and WebXR enable fully immersive ODV experiences in the browser, allowing us to use an HMD with a web browser. JSAmbisonics provided spatial audio for ODVs with its real-time spatial audio processing functions (i.e., non-individual head-related transfer functions based on the spatially oriented format for acoustics). The developed testbed can record VCTs without the need for eye-tracking devices, which is adequate for many VR applications. As shown in Fig. 1, the developed testbed records participants' VCTs together with the current timestamp, the name of the ODV, and the audio modality. At the front end of the testbed, a .json file listing a given set of ODVs is first loaded as the playlist, and each video is played while the recorded data is stored at the back end at the refresh rate of the device's graphics card. The HTTP server at the back end was implemented with an Apache web server and a MySQL database, where audio-related (e.g., mute, mono, and ambisonics), sensor-related (e.g., viewing direction), and user-related (e.g., user ID, age, and gender) data are stored.

To equalize the number of VCTs per audio modality for each ODV, and to ensure that each participant watched each ODV content only once, three playlists were prepared. Each playlist included one training and four test ODVs per audio modality, for a total of three training and twelve test ODVs. The ODVs with the three audio modalities (mute, mono, and ambisonics) and the three content categories were allocated across the three playlists, and equal numbers of participants were assigned to each playlist. The playing order of the test ODVs in each playlist was randomized before each subjective test. Viewing sessions were task-free: all participants wore an HMD, sat in a swivel chair, and were asked to explore the ODVs without any specific intention. In the experiments, we used an Oculus Rift consumer version as the HMD, Bose QuietComfort noise-canceling headphones, and Firefox Nightly as the web browser. During the test, VCTs were recorded as coordinates of longitude (0-360 deg) and latitude (0-180 deg) on a viewing sphere. We fixed the starting position of each viewing at the center point at the beginning of every ODV playback. A 5-second rest period showing a gray screen was inserted between successive ODVs to avoid eye fatigue and motion sickness. The total duration of the experiment was about 10 minutes. During the experiments, participants were alone in the room to avoid any influence from the presence of an instructor.
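The mapping from a recorded viewing direction to the longitude/latitude convention above can be sketched as follows. This is a minimal illustration, not the testbed's actual code: the axis convention (y up, -z forward) and measuring latitude from the zenith are assumptions, since the paper does not specify them.

```python
import numpy as np

def direction_to_lonlat(v):
    """Convert a unit viewing-direction vector (x, y, z) to
    longitude in [0, 360) deg and latitude in [0, 180] deg
    (0 = zenith, 90 = equator, 180 = nadir).
    Axis convention (y up, -z forward) is an assumption."""
    x, y, z = v / np.linalg.norm(v)
    lon = np.degrees(np.arctan2(x, -z)) % 360.0
    lat = np.degrees(np.arccos(np.clip(y, -1.0, 1.0)))
    return lon, lat
```

For example, looking straight ahead (along -z) maps to longitude 0 and latitude 90 (the equator), matching the fixed starting position at the center point.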

Our dataset contains 15 monoscopic ODVs (three training and 12 testing) with first-order ambisonics in 4-channel B-format (W, X, Y, and Z) collected from YouTube. In our experiment, ODVs in the mute modality were produced by removing all audio channels, and ODVs in the mono modality were produced by mixing the four audio channels into one channel, distributed equally to the left and right headphones. All ODVs have 4K resolution (3840 × 1920) in equirectangular projection (ERP) format and a segment length of 25 s each. We divided the ODVs into three categories, namely, Conversation, Music, and Environment, depending on their audio-visual cues, in a pilot test with two experts. The Conversation category presents one or several people talking, the Music category features people singing or playing instruments, and the Environment category includes background sound such as the noise of crowds, vehicle engines, and horns on the streets. Table 1 summarizes the main characteristics of the ODVs in our dataset, where Train denotes the training set in each category, and Fig. 2 presents examples of each ODV. Fig. 3 illustrates the visual diversity of each ODV in terms of the spatial and temporal information measures, SI and TI, respectively. Each ODV is re-projected to cubic faces for the computation of SI and TI to avoid the severe geometric distortion along latitude in ERP, as suggested by De Simone et al.
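The mono down-mix described above can be sketched as follows. The paper states only that the four channels were mixed into one; the equal weighting used here is an assumption, not the paper's exact mixing formula.

```python
import numpy as np

def ambisonics_to_mono(bformat):
    """Down-mix 4-channel first-order ambisonics (W, X, Y, Z) to one
    channel, to be fed identically to the left and right headphones.
    bformat: array of shape (4, n_samples).
    Equal channel weighting is an assumed choice."""
    assert bformat.shape[0] == 4
    return bformat.mean(axis=0)
```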


Do audio source locations attract the attention of users?
To analyze the effect of audio information on visual attention when audio and visual stimuli are presented simultaneously, we measure how closely visual attention corresponds to areas with audio sources under the three audio modalities. We generate an audio energy map (AEM), representing the audio energy distribution as a frame-by-frame heat map. In the AEM, the energy distribution is calculated from the audio directions given by the four channels (W, X, Y, Z) in ambisonics. We then estimate the normalized scanpath saliency (NSS) to quantify how many fixations overlap with the distribution of audio energy in the AEM. NSS is a widely used saliency evaluation metric; it is sensitive to false positives and to relative differences in saliency across the image. Fig. 4 illustrates the mean and 95% confidence intervals, computed by bootstrapping, of the NSS per user for each modality of the ODVs. A higher NSS score indicates that more fixations are attracted to areas of audio source locations in the AEM, while negative scores indicate that most fixations do not correspond to areas of audio source locations. Numerical results show that both the ambisonics and mono cases obtain greater NSS scores than the mute case. From Fig. 4, we observe that users tend to follow audio stimuli (especially the human voice) in the Conversation and Music categories, while they tend to look around regardless of the background sound in the Environment category. Notably, two ODVs (ODV 06, 07) in the Music category feature singing humans, while the others (ODV 05, 08) contain humans playing instruments. However, in the Conversation category, ODV 02 obtains almost equal NSS scores in the three audio modalities, which shows that visual attention can also be affected by the interaction of visual and audio stimuli, depending on the content. In the Environment category, ODV 10 and ODV 12 have similar NSS scores, while ODV 09 and ODV 11 show some differences.
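The two building blocks above can be sketched in Python. The paper does not give the exact AEM formula, so the standard first-order pseudo-intensity estimate of the sound direction is an assumption here, and `audio_direction` and `nss` are illustrative names.

```python
import numpy as np

def audio_direction(w, x, y, z):
    """Dominant sound direction for one frame of first-order B-format
    audio, via the pseudo-intensity vector I = E[w * (x, y, z)].
    An assumed, standard estimate; not the paper's exact AEM formula."""
    I = np.array([np.mean(w * x), np.mean(w * y), np.mean(w * z)])
    n = np.linalg.norm(I)
    return I / n if n > 0 else I

def nss(energy_map, fixations):
    """Normalized scanpath saliency: z-score the map, then average its
    values at the fixation pixels. energy_map: 2D array (e.g., an AEM
    in equirectangular layout); fixations: list of (row, col) pixels."""
    m = (energy_map - energy_map.mean()) / (energy_map.std() + 1e-8)
    return float(np.mean([m[r, c] for r, c in fixations]))
```

A fixation landing on a high-energy region yields a positive NSS, while a fixation far from all audio energy yields a negative one, which is the interpretation used for Fig. 4.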
Only ODV 11, which features an ambulance driving past with its siren on, obtains a much higher NSS in ambisonics and mono than in mute. This shows that hearing the siren, and its direction, catches more attention than only seeing the ambulance. To assess the significance of the NSS results, we performed a statistical analysis with a Kruskal-Wallis H test, after a Shapiro-Wilk normality test rejected the hypothesis of normality of the variables. Statistically significant differences (SSD) between two modalities were detected with the Dunn-Bonferroni non-parametric post hoc method. The pairs with an SSD are marked with ** in Fig. 4: three ODVs in the Conversation category, two in Music, and one in Environment obtain an SSD between mute and mono, and between mute and ambisonics. These statistical significance results are in line with our observations above. Furthermore, only one ODV has an SSD between mono and ambisonics, which suggests that perceiving the direction of sound (i.e., ambisonics) might not catch more attention than perceiving only the loudness of sound without direction (i.e., mono) in most ODVs. For a visual comparison, Fig. 6 presents the AEMs and fixations of two ODVs for each category. In this example, we show one ODV per category (ODV 04, 06, 11) that receives statistically significantly higher NSS in ambisonics, and another (ODV 02, 05, 10) that receives almost equal or negative NSS under the three modalities. Looking at the figures, we can see that fixations are widely distributed along the horizon under the mute modality and are more concentrated within the AEMs under the ambisonics modality. ODV 04, 06, and 11, which obtain higher NSS in mono and ambisonics, feature talking or singing people, or an ambulance with a siren, outside the central field of view, which can attract visual attention through object audio cues.
However, in ODV 02, 05, and 10, we observe that visual cues (e.g., human faces, moving objects, and a fast-moving camera) have a stronger effect than audio cues on the distribution of fixations. For example, in ODV 02, three human faces are very close to one another in the center of the ODV, and users focused on the area of the faces in all three modalities. In ODV 05, a moving object, the conductor at the center of an orchestra, contributes more substantially to visual attention than the audio cues. Furthermore, in ODV 10, participants paid attention to the direction of camera motion regardless of the sound source location.
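The statistical procedure above can be sketched in Python. SciPy provides the Kruskal-Wallis H test but no Dunn's test, so Bonferroni-corrected pairwise Mann-Whitney U tests stand in for the Dunn-Bonferroni post hoc here; the function name and data layout are illustrative, not the paper's actual analysis code.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

def compare_modalities(mute, mono, ambi):
    """Omnibus Kruskal-Wallis H test over per-user NSS scores in the
    three modalities, followed by Bonferroni-corrected pairwise tests.
    Mann-Whitney U is a stand-in for the Dunn post hoc used in the paper.
    Returns (omnibus p-value, dict of corrected pairwise p-values)."""
    _, p_omnibus = kruskal(mute, mono, ambi)
    pairs = {"mute-mono": (mute, mono),
             "mute-ambi": (mute, ambi),
             "mono-ambi": (mono, ambi)}
    posthoc = {}
    for name, (a, b) in pairs.items():
        _, p = mannwhitneyu(a, b, alternative="two-sided")
        posthoc[name] = min(1.0, p * len(pairs))  # Bonferroni correction
    return p_omnibus, posthoc
```

A pair would then be marked ** (as in Fig. 4) when its corrected p-value falls below 0.05.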

Do observers have similar viewing behavior in mute, mono, and ambisonics?
Observers’ viewing behavior can exhibit considerable variance when consuming ODVs. Viewing trajectories might be more consistent with one another when observers perceive audio (i.e., mono) or audio direction (i.e., ambisonics). To investigate this, we estimate the inter-observer congruence (IOC), which characterizes the dispersion of fixations between observers viewing the same content. A higher IOC score represents lower dispersion, implying higher viewing concurrency. NSS is used here to compute the IOC by comparing the fixations of each individual with those of the rest of the observers. Statistical analysis was conducted with the same methods as in Section 4.1. Fig. 5 illustrates the mean and 95% confidence intervals of the IOC scores of each ODV in the three modalities, with SSDs marked as **. The figure shows differences between the without-sound (i.e., mute) and with-sound (i.e., mono or ambisonics) cases; however, only in 4 out of 12 ODVs is the difference between these two cases statistically significant. Moreover, we observe that an object’s sound guides observers to look for that object when it is outside their current field of view. For example, in the Conversation category, ODV 03 and 04, featuring talking people behind the initial viewing center, receive significantly higher IOC in mono or ambisonics than in mute, while the other two ODVs (ODV 01, 02), featuring talking people in front who can be seen from the beginning of playback, show no significant differences between the three audio modalities. Similarly, in the Music category, ODV 06, featuring people taking turns singing around the viewing center, receives significantly higher IOC in ambisonics, as it informs observers of the direction of a singing person outside the current field of view.
However, ODV 07, which has singing people in front, and ODV 05 and 08, featuring instrument playing, show no significant IOC differences across the three audio modalities. In the Environment category, ODV 11, featuring an ambulance with a siren driving from right to left, obtains significantly higher IOC in mono and ambisonics than in mute, while the other ODVs (ODV 09, 10, 12), with background sound from vehicle engines or crowds on the street, show no significant differences between the three audio modalities. This demonstrates that perceiving object audio cues and their direction guides visual attention and increases the consistency of viewing patterns between observers when the object is not in the current field of view. Comparing the IOC scores between mono and ambisonics, we can see that the latter does not always receive higher scores in our subjective experiments. Thus, hearing the direction of sound (i.e., ambisonics) does not necessarily increase the consistency of viewing patterns between observers, compared to hearing only the loudness of sound (i.e., mono).
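The leave-one-out IOC computation described above can be sketched as follows; `ioc_scores` is an illustrative name, and the binary fixation-map representation is an assumed simplification (the paper does not detail the map construction).

```python
import numpy as np

def ioc_scores(fixation_maps):
    """Inter-observer congruence via leave-one-out NSS: for each
    observer, z-score the summed fixation map of all *other* observers
    and average it at the held-out observer's fixation pixels.
    fixation_maps: list of 2D binary arrays, one per observer."""
    scores = []
    for i, fm in enumerate(fixation_maps):
        others = sum(m for j, m in enumerate(fixation_maps) if j != i)
        z = (others - others.mean()) / (others.std() + 1e-8)
        pts = np.argwhere(fm > 0)
        scores.append(float(np.mean([z[r, c] for r, c in pts])))
    return scores
```

Observers who fixate where the rest of the group fixates receive high scores; an outlier observer receives a low or negative score, so the per-ODV mean reflects viewing concurrency.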

Does sound affect observers’ navigation?
To study the impact of perceiving audio (i.e., mono) and audio direction (i.e., ambisonics) on visual attention, we estimated the overall fixation distributions and the overall AEM across all frames. In most cases, as shown in Fig. 6, the distribution of fixations for ODVs in the ambisonics modality is more concentrated. Fig. 7 shows the distribution of fixations and the AEM over longitude for ODV 04 and 10 in the three modalities. In ODV 04, participants follow the direction of the object audio in the ambisonics case: the main actors, talking behind the observers, attract visual attention in the crowded scene. In contrast, in ODV 10, the fixation distributions of the three modalities are similar to each other and unrelated to the audio information. This is due to the visual saliency of the fast-moving camera, as most visual attention corresponds to the direction of camera motion. From our analyses in Sections 4.1, 4.2, and 4.3, we can generally conclude that when salient audio (i.e., a human voice or a siren) is present, it catches more visual attention than visual cues alone. On the other hand, in some cases with salient visual cues (i.e., human faces, moving objects, and a moving camera), audio and visual information interactively affect visual attention. In addition, perceiving the sounds and sound directions of salient objects can guide visual attention and achieve a higher IOC when these objects are not in the current field of view. Although this study reveals several initial findings, more studies are required to address the open research questions raised by this work. In particular, "does the direction of sound lead to higher viewing congruence than mono sound?" and "does the direction of sound guide visual attention more than mono sound?" remain unconfirmed due to the limited number of participants.
For this purpose, we plan to conduct more comprehensive subjective experiments (in terms of number of participants and diverse ODVs), and we plan to further investigate these questions with statistical tests.
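The longitudinal fixation distributions of the kind shown in Fig. 7 can be reproduced, in outline, with a normalized histogram over fixation longitudes; the bin count below is an arbitrary illustrative choice.

```python
import numpy as np

def longitude_histogram(lons, n_bins=36):
    """Fixation distribution over longitude: histogram of fixation
    longitudes in degrees [0, 360), normalized to a probability
    distribution. A minimal sketch of the analysis described above."""
    hist, _ = np.histogram(np.asarray(lons) % 360.0, bins=n_bins,
                           range=(0.0, 360.0))
    return hist / max(hist.sum(), 1)
```

Comparing such histograms per modality against the AEM's longitudinal profile makes concentration around sound sources (as in ODV 04) or its absence (as in ODV 10) directly visible.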


This paper studied the audio-visual perception of ODVs in the mute, mono, and ambisonics modalities. First, we developed a testbed that plays ODVs with multiple audio modalities while recording users’ VCTs, and created a new audio-visual dataset containing 12 ODVs with different audio-visual complexity. Next, we collected users’ VCTs in subjective experiments, where each ODV was presented in three different audio modalities. Finally, we statistically analyzed the viewing behavior of the participants while they consumed the ODVs. To the best of our knowledge, this is the first user behavior analysis for ODV viewing with mute, mono, and ambisonics. Our results show that, in most cases, visual attention disperses widely when viewing ODVs without sound (i.e., mute) and concentrates on salient regions when viewing ODVs with sound (i.e., mono and ambisonics). In particular, salient audio cues, such as human voices and sirens, and salient visual cues, such as human faces, moving objects, and fast-moving cameras, have the greatest impact on participants’ visual attention. Regarding audio cues, the nature of the sound (e.g., informative content, frequency changes, performance timing, audio ensemble) may also play a role in how it gets noticed. We leave these questions as future work to further foster the study of audio-visual attention in ODV. We expect that this initial work, which provides a testbed, a dataset from subjective experiments, and an analysis of user behavior, will contribute to the community and stimulate more in-depth research in the future.




Fang-Yi Chao, Cagri Ozcinar, Chen Wang, Emin Zerman, Lu Zhang, Wassim Hamidouche, Olivier Deforges, and Aljosa Smolic, "Audio-Visual Perception of Omnidirectional Video for Virtual Reality Applications," 2020 IEEE ICME Workshops, London, UK.
