Editor's note: This article includes mathematical formulae and equations.

1. Background: Immersion in Virtual Reality Environments (VRE)

Ever since the rise of technology, from early print to computer-assisted platforms, mediated experiences have played an increasingly pivotal role in shaping our models of the world and how we perceive and react to it. Moreover, with the emergence of new media, 1 it has become clear that these experiences are not only being passively consumed, but also actively co-created. A distinctive feature of new media is the fundamental role of the user, where media and individuals are not separate but define and create each other, together forming a complex ecosystem. Nowhere is this seamless melding of the individual and media more apparent than in the fields of Virtual Reality Environments (VRE) and Brain-Computer Interfaces (BCI). The environments of such media generate almost real-time effects on users (often labeled as prosumers), thus opening new possibilities for quantitative investigations of the digital era of participatory culture. With massively large numbers of prosumers engaged in new media, research inevitably turns to the cognitive states they experience, particularly attention and immersion. One can even argue that attention (sometimes referred to as ‘eyeball hang time’, ‘spend-time’ or broadly ‘media consumption’) and immersion are limited cognitive resources that media are competing for. From that perspective, attention and immersion represent a new type of media currency. The value of this currency lies in its ability to effect changes in the cognitive and emotional states of users, and we argue that the effects can be observed on individual and collective levels.

To measure immersion objectively, one must first acknowledge that the associated sensations are difficult to describe formally and precisely. However, they are undeniably present and several attempts have been made to capture their essence. For example, Jennett et al. (2008) describe it as follows: “[i]mmersion clearly has links to the notion of flow and CA [Cognitive Absorption] and all three use things like temporal dissociation and awareness of surroundings as indicators of high engagement. However, immersion is concerned with the specific, psychological experience of engaging with a computer game.” 2 In addition, they define it “[a]s an experience in one moment in time and graded (i.e. engagement, engrossment, total immersion). Immersion involves a lack of awareness of time, a loss of awareness of the real world, involvement and a sense of being in the task environment.” 3 Immersion is oftentimes used interchangeably with the term presence, which is defined as “[t]he subjective feeling of being in a virtual environment while being transiently unaware of one’s real location and surroundings and of the technology that delivers the stream of virtual input to the senses.” 4 Slater and Sanchez (2005) even argue that immersion and presence can be studied in the context of consciousness. 5

For the purposes of this paper, we define immersion as a melding of body responses with mental states induced by virtual stimuli (see Section 2 for further discussion). Thus, the qualitative understanding of immersion, carried out through the central philosophical concept of subjectivity (particularly media subjectivity), 6 can be complemented with the notion of the psycho-physical entity, in which the body is considered a primary site of experience. In that way, immersion can be understood as a quantifiable phenomenon, allowing both qualitative and quantitative methods to be brought together in an integrative way.

Over the past decade, VR has gone beyond the boundaries of the gaming industry and has already had significant cultural, economic, and societal impact in a number of fields, including medical and psychological treatment, and education. The success of VR in all these fields draws from the guiding principle of immersion, a process whereby a subject shifts focus to simulations, becoming completely immersed in the “flow”, 7 which in turn reshapes the psycho-physical states of the user. For example, there have been numerous studies on VRE and its effects on the treatment of chronic pain, 8 9 10 pain management 11 and various treatments in clinical psychology: treatments of behavioral disorders, anxiety disorders 12 and mental disorders like PTSD. 13 Another promising application lies in the field of experiential learning in VRE, and various studies suggest that immersion might be the principal factor in enhanced learning through situated experiences. 14 Students, for example, can learn about great historical events 15 with the help of immersive virtual environments, where they get to experience the simulated event in a 3-dimensional, multisensory and embodied way. A study by J. Fox et al. (2009) suggests that VREs could be viewed as a method to study social and psychological phenomena. 16 With the rise of online cultural repositories like YouTube, we strongly believe that studying such collective patterns from online behavior is now possible.

1.1 Embodied Experience and Immersion

We now elaborate on the main thesis of the paper, that immersion is a psycho-physical phenomenon which can be quantified. VREs not only require our mental presence; we actively navigate and interact with the principal entities 17 in such environments. In other words, we are not only mentally, but also viscerally active. Virtual worlds might not have a physical substrate, but we navigate through them as if they are real. The embodied principle seems to play a fundamental role in users’ engagements: we perceive and react to the virtual stimuli with our bodies and we navigate these environments, just as we navigate our own trajectory in the physical world. 18 In fact, a growing body of research indicates that a human subject is capable of distributing her immersion across the Virtual Reality Environment (VRE), and that the experience of ‘being present’ or ‘immersed’ can be similar to, or more intense than, the experience in the ‘real’, i.e. physical world. Moreover, studies by Slater et al. 19 showed that body representation and body ownership in virtual reality is similar to the ‘rubber hand illusion’ and is a result of multisensory correlations: “[t]he ownership of virtual limbs and bodies may engage the same perceptual, emotional and motor processes that make us feel that we own our biological bodies,” 20 and that realistic response (behavior) to virtual environments is correlated to ‘place illusion’ and ‘plausibility illusion’. 21 Slater concludes that “Virtual reality can transform not only your sense of place, and of reality, but also the apparent properties of your own body.” 22 This amplified experience (oftentimes described as hyperexperience) of being present is inherent to VRE, which gives it a potential to reshape an individual more than any other type of media, as already outlined above.

It might be worth pointing out that immersion is clearly not a new phenomenon. A reader can be immersed in a book as much as a spectator can be immersed in a movie or an artwork. The key function of any media has always been to attract and sustain human attention. Advanced, multisensory and interactive new media, however, have led to the amplification of human senses. Mediated environments have not only become visually more saturated (when compared to other visual forms, such as movies or comic books), but also increasingly more tactile, aural and spatial; in other words, multimodal. And apart from being cogent simulations of real (physical) environments, VREs also allow an extended and amplified sensorial experience of digitally created imaginary spaces.

Thus, the significant conceptual framework for this study lies in phenomenology, more specifically in the idea of embodied cognition. This perspective builds on the premise of bodily perception, which might shed some light on the causal link between emotional states and behavioral modalities of a body during the state of immersion in VRE. There are, however, currently no standardized or commonly practiced methods for measuring emotional states and their intensity from body-based behavior. In this paper (see Section 2, Immersion: Quantification of Behavioral Modalities) we develop a prototypical method for quantification of immersion, which is based on verbal and non-verbal (bodily) cues of engaged users. This convergence enables one to borrow and adapt established conceptual frameworks, such as: (i) embodied cognition in the work by Merleau-Ponty (Phenomenology of Perception); 23 and (ii) the technology as an extension of the mind and body, by McLuhan (Understanding Media: The Extensions of Man) 24 to develop qualitative models of individuals as prosumers of such interactive media. For example, through a contrast with pathological cases such as phantom limbs, Merleau-Ponty describes the body’s typical mode of existence as “being-toward-the-world.” In another particular example of a blind man and his cane, he illustrates that “[t]he stick is no longer an object perceived by the blind man, but an instrument with which he perceives. It is a bodily auxiliary, an extension of the bodily synthesis.” 25 This illustrative description can serve as an example of how a subject can incorporate tools into his or her sensory realm. The most recent disruptive technologies show how these "techno-tools" are getting even closer to the human sensory system, as if they have a tendency to meld with the senses, thus forming a new and amplified sensory field.

The same principle of embodied cognition is applied when an artist is using a virtual palette and brush to create a virtual painting or when a gamer is using virtual tools such as lamps, guns, etc. to interact with other characters in VR.

Of particular relevance to our research paper is the concept of “media extensions of men”, 26 developed by McLuhan in the early 1960s. McLuhan eloquently describes how the transfiguration of a human body by technology forms a new instance, in which we witness the technological extensions of human minds via ‘electronic media.’

“In this electronic age we see ourselves being translated more and more into the form of information, moving toward the technological extension of consciousness […] By putting our physical bodies inside our extended nervous systems, by means of electric media, we set up a dynamic by which all previous technologies that are mere extensions of hands and feet and teeth and bodily heat-controls – all such extensions of our bodies … will be translated into information systems.” 27

McLuhan’s “technological extension of consciousness” is analogous to the sensorial and cognitive extension of the human mind, which has been encapsulated in the technology of BCI. It is precisely the BCI that functions as a mediator between the conscious self and the extended body parts. For example, recent work that allows users to accurately control and direct a prosthetic robotic arm by just thinking about the target tasks is a prime example of such two-way integrations of technology and psycho-physical states of the user. The BCI interprets the spatial thoughts of the subject to move the new-found “arm” to a desired location, and in turn the thought process itself reacts and guides the BCI (based on visual feedback via the eyes, which track the movement of the “arm”), forming an integrated feedback loop. McLuhan’s insight into the future of information systems particularly holds true for VREs. These feedback loops amplify our nervous system in a very visceral sense.

2. Immersion: Quantification of Behavioral Modalities

As mentioned earlier, there is currently no standard or commonly agreed measurement of immersion. The fundamental thesis of this paper is that the effect of VRE and related immersive technologies can be successfully studied only via a trans-disciplinary approach that combines qualitative theoretical models (widely discussed in media studies, phenomenology and psychology, and referred to in our introductory discussions) with quantitative data-driven empirical models based on Big Data and modern advances in AI (Artificial Intelligence), Machine Learning (ML), and computational models. Such automated analyses, rather than statistical summarization based on manual observations, are necessary for the following reasons (which are shared with other Big Data/AI applications): (i) Automated methods allow one to analyze a large number of VRE recordings consistently and, once calibrated, accurately, without suffering from the usual problems of manual analysis such as human bias and the burden of sheer time commitment; (ii) Automated methods can give us population-level analytics, such as typical bodily signals present in different VRE contexts, as well as variations from such normative models for individual users; (iii) Finally, such analytics will allow us to do reverse engineering of various kinds on VRE data, such as predicting the type of content a user is immersed in from the observed data, and building recommendation engines for VR content based on users’ immersion profiles.

We attempted to develop such a prototypical method for quantification of immersion based on verbal and non-verbal (i.e. bodily) cues of engaged users. In particular, we assumed we have access to data from instruments (whether physical devices such as electrodes, or algorithmic tools based on software, as described in this paper) that measure behavioral responses of users in VREs. We then built models correlating emotional states to particular patterns in these measured signals. For example, when one has video recordings of users participating in VR gaming, one can create software measurement tools that compute an array of verbal and nonverbal signals or cues (see Figure 3 for more detailed descriptions) such as (i) kinematics (position, speed, acceleration, fluidity etc.) of different body parts and joints, (ii) verbal expressions such as bad language, and (iii) vocalizations.

We determine correlation of emotional states from coordinated patterns in such measurements (See Figure 1 and Figure 4) such as: (i) Contraction, expressed via sudden inward arm movement towards torso and head, which is known to correlate with negative emotions like fear and threat; (ii) Expansion, expressed via smooth and stretching movements of arms away from the torso, which is the binary opposite of contraction, and is known to correlate with positive emotions such as inspiration and relaxation; (iii) Different postures such as kneeling down, covering face, losing balance, or foot stomping, which are known to express different emotional states depending on overall content being experienced; and (iv) extreme vocalizations, such as screaming, exclamations and general interjections, which are known to express emotional states such as fear, surprise, and excitement.

We want to stress that this pilot study (involving only two types of VREs) serves only as a test case for the general methodology described in this paper, which can be applied to analyze users’ psycho-physical states when engaged in a variety of VREs. In a more general situation, one will have access not just to video recordings but also to signals from a host of other sensors that directly measure physiological signals which correlate to cognitive and emotional states (particularly their intensity). 28 Such biosensors could measure signals such as stress hormones, micro-contractions using Motor Evoked Potential (MEP), skin conductance (GSR, galvanic skin response), heart rate (ECG/EKG, electrocardiography), and eye tracking (cornea reflection and pupil dilation), and would eventually contribute to more conclusive results.

The same approach, as described above, can be applied to such data: finding collective patterns in these signals that are correlated to specific emotional and cognitive states. For now, however, we focus on video recordings of users (which are more readily available than sensor-based data) in two contrasting VREs, namely survival-based gaming and a much more contemplative activity, painting with Google Tilt Brush.

We explore which emotional and cognitive states related to immersion can be inferred reliably from software measurement tools applied to such video data. More specifically we test the following hypotheses:

  1. Bodily reactions induced by stimuli in VR are very similar to widely known bodily reactions induced by stimuli in physical environments – That is, we react to simulated stimuli in a similar manner as we would react to real stimuli.
  2. Immersion can be defined by body cues (signals) – That is, the presence and intensity of the measured signals could indicate the level of immersion.
  3. Bodily cues in VRE are content dependent – That is, immersion is experienced via different emotional and cognitive states, and hence different bodily signals, in, for example, contemplative or artistic VREs versus in aggressive or survival-dependent VREs.

Figure 1: Example bodily signals of Contraction (left) and Expansion (right): These two responses are widely recognized to correspond to binary opposite emotional states of inspiration (seen as a body expansion) and threat (seen as body contraction)

2.1 Data Description and Bodily Signal Correlates of Emotional and Cognitive States

We used two datasets which consisted of YouTube videos depicting two VREs: (i) survival-based strategy VR games; and (ii) a 3D painting VR application. For the purposes of our experiment, we downloaded 22 videos in total, all in MP4 format. This data consisted of 17 videos depicting users in Google Tilt Brush VR and 5 videos of users playing survival-based VR games (see Web Links for selected videos). Since many of the videos were compilations, we were able to capture video clips of 36 painting sessions and 31 gaming sessions. The 36 painting sessions involved 17 unique paint brush users, while each player in the 31 gaming sessions was unique, and thus we studied 31 gamers. This selected set of 67 video clips ranged in duration from 5 to 25 seconds and the resolution of the clips was 1280 × 720 pixels. 29

Before employing automatic tools, we manually viewed approximately 150 videos, which led to important qualitative observations: users’ bodies in the 3D painting application Google Tilt Brush seemed to intensely occupy the space around them, whereas users in survival-based games were mainly escaping from the space. This body dynamic can be roughly described as a contraction/expansion duality, where two very distinct simulations of situations are at play (see Figure 1). Another bodily feature was the speed of movement: users in survival-based games seemed to move faster and with less fluidity, whereas users in the painting app were moving at a slower pace (see Figure 10, which visualizes the intensity of arm movements). Another distinct feature was aural: users in survival-based games were vocally expressive, while users in painting applications did not vocally express any kind of emotion.

Such manual observations of well-known correlates of emotional and cognitive states prompted us to define a set of bodily cues and software measurement tools to measure them. These are tabulated in Table 1.

Table 1: An overview of bodily cues and associated emotional and cognitive states

| Category | Type of body signal | Qualitative definition | Emotional / cognitive states |
| --- | --- | --- | --- |
| Aural: vocal cues | Bad language | Using inappropriate words as an expression of fear, disgust, or threat. | Indicates emotional state of surprise, fear, disgust, threat or terror. |
| Aural: vocal cues | Screaming, general interjections | Using vocal register above the baseline to express intense emotional state. | Indicates sudden fear, threat or terror. |
| Visual: postural response | Body contraction | An activity where limbs are drawn inward toward the torso and head. | Flight response caused by threat or fear. |
| Visual: postural response | Body expansion | An activity where limbs are extended in the opposite ways from the torso. | Indicates emotional openness, ability to create and the cognitive states of interestedness. |
| Visual: postural response | Covering face | An activity where hands cover the face entirely or partially. | Indicates emotional state of fear and protection as a result of threat. |
| Visual: postural response | Kneeling down | An activity where the body is directed towards the ground. | Indicates protection from the threat. |
| Visual: postural response | Foot-stomping | An activity where intense movement of feet is expressed. To bring down the foot on an object or a surface forcibly, to tread heavily or violently upon. | Indicates emotional state of restlessness. |
| Visual: postural response | Loss of balance (fall down) | A state where the body loses its postural stability, usually associated with falling. | Indicates response to severe threat. |
| Movement cues | Fluidity | A visibly coordinated movement of the body. | Indicates stable or non-interrupted emotional state and cognitive state of interestedness. |
| Movement cues | Irregularity of arm movement | A visibly uncoordinated movement of the body with sudden reflex motor acts. | Indicates disturbed emotional state, such as fear, terror, apprehension, and cognitive state of confusion. |
| Movement cues | Intensity of arm movement | Rapidity of movement of particular body parts, such as wrists. | Indicates the perception of threat. |

2.2 A Multi-Modal AI Platform for Bodily Signal Quantification

For the purposes of this pilot study, a multi-modal AI platform that uses both Computer Vision (CV) and Speech Analysis (SA) tools has been designed. The platform can automatically detect, quantify, and analyze bodily expressions associated with immersion. A schematic of the platform’s pipeline is given in Figure 2, while the following subsections expand on each block of the pipeline.

2.2.1 Body Signals and Methods of Measurement: Video Processing of Data

Our approach to detecting bodily signals is primarily based on (i) detecting key joints and body parts of the users and (ii) determining how they move simultaneously to define a particular signal. Towards this, we used an open-source Deep Neural Network (DNN) based package, OpenPose. 30 In each image frame of each clip (recall that a standard video has 24 or 30 image frames per second of recording), OpenPose first looks for a human, and if it detects one, it returns 2-D pixel locations of 18 keypoints (such as left and right shoulders) of the skeleton as well as the confidence score for each keypoint. If the model fails to locate any keypoint, it will return (0,0) as pixel location and 0 confidence score for that keypoint. Thus, we considered any keypoint with a confidence score of zero as unreliable and, hence, these were not used in our bodily signal detection algorithms.
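The zero-confidence rule described above can be illustrated with a short sketch. It assumes that the keypoints of one frame are available as an array of 18 rows of (x, y, confidence); this layout and the helper name `visibility_mask` are assumptions made for illustration and are not part of the OpenPose interface.

```python
import numpy as np

def visibility_mask(keypoints):
    """Return True for every keypoint whose confidence score is non-zero.

    `keypoints` is assumed to be an (18, 3) array of (x, y, confidence) rows.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    return keypoints[:, 2] > 0.0

# Hypothetical frame: every joint detected except joint 4,
# which is reported as (0, 0) with a confidence of 0.
frame = np.column_stack([
    np.arange(18) * 10.0 + 5.0,   # x coordinates
    np.full(18, 100.0),           # y coordinates
    np.full(18, 0.9),             # confidence scores
])
frame[4] = [0.0, 0.0, 0.0]

mask = visibility_mask(frame)
reliable = frame[mask, :2]  # only these coordinates would feed the detectors
```

Keeping the mask alongside the coordinates, rather than deleting rows, preserves the joint indices that the later detectors rely on.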

Figure 2: System overview of the multi-modal AI system

Figure 3 illustrates various postures corresponding to different bodily signals, including (from left to right): (i) Normal pose (with joint index); (ii) Definitions of different body parts and regions, based on key joints and links; (iii) Skeletal posture for the measured signal, labeled as Covering Face (CF); (iv) Skeletal posture for the signal labeled as Loss of Balance (LB); and (v) Skeletal posture for the signal Kneeling Down (KD).

Figure 3: From left to right, each figure represents skeletal schematics

2.2.2 Skeleton Normalization and Classifier Definitions for Different Bodily Signals

A set of definitions and notations (see Appendix), along with formulae for computing different measures, are provided in the Endnote section. For example, we first group the 18 keypoints (numbered 0-17) of the human skeleton (see Figure 3) into meaningful body parts: We defined a Hand Set as the keypoints 4, 7, a Lower Body Set as the keypoints 8, 9, 10, 11, 12, 13, Leg Set as the keypoints 9, 12 and Face Set as the keypoints 0, 14, 15, 16, 17. For the sake of conserving space, we directly use these and related definitions in the following descriptions, and the reader is requested to consult the details in the Endnote for more information. Also, we note that we use various thresholds to detect the presence or absence of different bodily signals, and these thresholds were determined after performing experiments on our data sets, and then selecting values that yield the right balance between recall and precision.

Skeleton Normalization

Since we were dealing with videos captured under various settings, the player’s distance from the camera might vary. That is, the skeletons of the players were either scaled down (if the players were farther from the camera) or up (if they were closer to the camera). Thus, any movement, measured in terms of raw pixel values, needed to be scaled so that computed distances were comparable across players. We did this by computing the scaling factor of any given skeleton relative to a fixed-size reference skeleton. Since almost all players in our videos were adults, a reference skeleton with average adult proportions served our purpose well. Given a detected skeleton with its keypoint location matrix and its visibility mask, we computed its scaling factor s as the ratio of the summed lengths of all bones for which both end keypoints are detected to the summed lengths of the corresponding bones in the reference skeleton (see Appendix for the full set of definitions and notations).
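As a minimal sketch of this normalization, the ratio can be computed as below. The function name `scaling_factor`, the list-of-index-pairs representation of bones, and the three-joint toy skeleton are all assumptions made for illustration; the actual reference skeleton and bone list are those defined in the Appendix.

```python
import numpy as np

def scaling_factor(keypoints, visible, ref_keypoints, bones):
    """Ratio of summed detected bone lengths to the same bones in the reference.

    Only bones whose two end keypoints are both visible contribute.
    Returns None when no bone is usable in this frame.
    """
    detected_len = 0.0
    reference_len = 0.0
    for a, b in bones:
        if visible[a] and visible[b]:
            detected_len += np.linalg.norm(keypoints[a] - keypoints[b])
            reference_len += np.linalg.norm(ref_keypoints[a] - ref_keypoints[b])
    if reference_len == 0.0:
        return None
    return detected_len / reference_len

# Toy skeleton (hypothetical): three joints on a vertical line, two bones.
ref = np.array([[0.0, 0.0], [0.0, 10.0], [0.0, 30.0]])
bones = [(0, 1), (1, 2)]

# The detected skeleton is twice as large, and joint 2 was not detected,
# so only bone (0, 1) is used: 20 pixels against 10 in the reference.
detected = ref * 2.0
visible = np.array([True, True, False])
s = scaling_factor(detected, visible, ref, bones)
```

Dividing any raw pixel distance by `s` then expresses it in the units of the reference skeleton, which is what makes distances comparable across players.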

Upper body related: covering face (CF)

Since this action is related to the joints in the Hand Set and the Face Set, we defined a simple distance measure that computes the sum of the distances between the visible keypoints in the Hand Set and those in the Face Set. This sum was then compared to a threshold to detect whether a Covering Face (CF) action had occurred. An optimal threshold was picked to balance recall vs. precision.
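The distance measure can be sketched as follows, using the Hand Set and Face Set indices given earlier. The threshold value, the pairing of every visible hand keypoint with every visible face keypoint, and the function names are assumptions for illustration; the paper's tuned threshold is not reproduced here.

```python
import numpy as np

HAND_SET = (4, 7)                # wrists, as defined in the text
FACE_SET = (0, 14, 15, 16, 17)   # face keypoints, as defined in the text

def hand_face_distance(keypoints, visible):
    """Sum of distances over all visible (hand, face) keypoint pairs."""
    pairs = [(h, f) for h in HAND_SET for f in FACE_SET
             if visible[h] and visible[f]]
    if not pairs:
        return None  # nothing to measure in this frame
    return sum(float(np.linalg.norm(keypoints[h] - keypoints[f]))
               for h, f in pairs)

def is_covering_face(keypoints, visible, threshold):
    """Flag CF when the summed hand-to-face distance falls below the threshold."""
    d = hand_face_distance(keypoints, visible)
    return d is not None and d < threshold

# Hypothetical frames with all keypoints visible.
visible = np.ones(18, dtype=bool)
near = np.zeros((18, 2))
near[list(FACE_SET)] = [100.0, 50.0]
near[list(HAND_SET)] = [103.0, 54.0]   # each wrist is 5 pixels from the face

far = near.copy()
far[list(HAND_SET)] = [100.0, 450.0]   # wrists hang 400 pixels below the face

cf_near = is_covering_face(near, visible, threshold=200.0)
cf_far = is_covering_face(far, visible, threshold=200.0)
```

Because the sum grows with the number of visible pairs, a practical version would probably need to account for missing keypoints, for instance by averaging; the text does not say how this was handled.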

Standing vs. Sitting detector

Bodily signals defined in Table 1 have different skeletal orientations and positions depending on whether a user is sitting or standing; see for example the Losing Balance (LB) classifier description. We defined a detector function that checks whether the scaled visible length of the femur bones is shorter, by more than a threshold value (here set to 100 normalized pixels), than that of a standing reference skeleton. If both ends of the femur bones are visible and the person is sitting down, the function signals a sitting posture.
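A possible reading of this detector is sketched below. Taking the femurs to be the hip-to-knee links (8, 9) and (11, 12), comparing the sum of both femur lengths, and the toy reference coordinates are all assumptions made for illustration; only the 100 normalized pixel threshold comes from the text.

```python
import numpy as np

FEMURS = ((8, 9), (11, 12))  # assumed hip-to-knee links of the 18-keypoint skeleton

def is_sitting(keypoints, visible, scale, ref_keypoints, threshold=100.0):
    """Return True if the scaled femurs look shorter than the standing reference.

    Returns None when a femur end keypoint is missing, since the text only
    defines the decision when both ends of the femur bones are visible.
    """
    if not all(visible[a] and visible[b] for a, b in FEMURS):
        return None
    observed = sum(np.linalg.norm(keypoints[a] - keypoints[b])
                   for a, b in FEMURS) / scale
    reference = sum(np.linalg.norm(ref_keypoints[a] - ref_keypoints[b])
                    for a, b in FEMURS)
    return bool(reference - observed > threshold)

# Hypothetical standing reference: hips at y=400, knees at y=600.
ref = np.zeros((18, 2))
ref[8], ref[9] = [200.0, 400.0], [200.0, 600.0]
ref[11], ref[12] = [260.0, 400.0], [260.0, 600.0]

# Seated pose seen from the front: the femurs are foreshortened to 30 pixels.
seated = ref.copy()
seated[9] = [200.0, 430.0]
seated[12] = [260.0, 430.0]

visible = np.ones(18, dtype=bool)
seated_flag = is_sitting(seated, visible, 1.0, ref)
standing_flag = is_sitting(ref, visible, 1.0, ref)
```

The idea behind the test is foreshortening: when a person sits facing the camera, the thighs point toward the lens and their projected length shrinks.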

Lower body related: kneeling down (KD)

KD was defined as the motion where the knee joints move down from a standing position, so we first used our standing vs. sitting detector to find players with standing postures. Then, for the standing cases, we monitored the movement of the Leg Set keypoints (see Notations and Definitions for the definition of the set) between two frames using a detection function with a threshold for the minimum movement in pixels. The final detection was done by comparing the sum of these detections over 5 consecutive frames to a threshold.

Whole body related: Losing balance (LB)

Compared to the kneeling down (KD) signal, loss of balance may persist for a longer period and is related to the collapse of the whole body. We first determined whether a player is sitting or standing, and then we designed a conditional classifier, tailored to each situation, to make a final decision. For a player already in a sitting position, we calculated the relative change of vertical distances between the leg part and the neck. For a player already in a standing position, we quantified the differences between the current posture and the reference standing skeleton.

Time smoothing: Due to noisy detection of skeletons or incomplete information contained in the frame, some actions may be misclassified. For example, loss of balance (falling down) may be easily misclassified as kneeling down even with good keypoint detections, since some postures inside these two actions are quite similar. To avoid these cases, we computed the classification functions over five consecutive frames. We then picked a threshold for each of the bodily signal detectors, such that if the sum of the respective functions over these five frames exceeded the threshold, then we detected it as an event. This choice of window size was picked after performing experiments on our current dataset. A longer window length fails to accurately locate the onset of each event, smoothing out the differences of durations among various players, but too short a window makes classification results noisy and too sensitive to noisy detections of keypoints.
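The five-frame smoothing rule can be sketched in a few lines. The sketch assumes non-overlapping windows (one decision per five frames, as in the evaluation procedure) and a hypothetical threshold of 2; the thresholds actually used were tuned per detector and are not stated in the text.

```python
def smooth_detections(frame_flags, window=5, threshold=2):
    """Turn per-frame detector outputs (0 or 1) into one decision per window.

    An event is reported when the sum of the per-frame outputs inside the
    window exceeds `threshold`. Trailing frames that do not fill a whole
    window are ignored.
    """
    events = []
    for start in range(0, len(frame_flags) - window + 1, window):
        hits = sum(frame_flags[start:start + window])
        events.append(hits > threshold)
    return events

# Hypothetical per-frame outputs: one isolated false alarm in the first
# window, a sustained detection in the second, and two leftover frames.
flags = [0, 1, 0, 0, 0,   1, 1, 1, 0, 1,   1, 1]
events = smooth_detections(flags)
```

The isolated detection in the first window is suppressed, which is the intended effect of smoothing, while the sustained one in the second window is kept.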

Arm related movement

Since hands are the most flexible and most informative parts in our experiment (note that a user’s face is always covered with VR headset), we designed three feature extraction models to track hand related movement patterns:

Classifier for different speeds of arm movement (Levels 0 to 5): We computed the sum of the distances moved (normalized by the scaling factor) over five consecutive frames by the left and the right wrists (keypoints 4 and 7) and then took the maximum. Based on the distribution of the observed data, we divided the distance measure into six levels: (0-10), (10-20), (20-50), (50-100), (100-200), (>200), with the smallest level assigned 0 and the highest level assigned 5. The label assignment is represented by a level function. A value of 5, for example, indicates vigorous instantaneous arm movement, while a value of 0 corresponds to minimal movement.
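A minimal sketch of this level assignment follows. The bin edges come from the text; the treatment of values that fall exactly on an edge, the use of wrist positions from the frames of one window, and the function names are assumptions for illustration.

```python
import numpy as np

# Bin edges from the text: (0-10), (10-20), (20-50), (50-100), (100-200), (>200).
LEVEL_EDGES = [10, 20, 50, 100, 200]

def path_length(points):
    """Total distance travelled along a sequence of (x, y) positions."""
    points = np.asarray(points, dtype=float)
    return float(np.linalg.norm(np.diff(points, axis=0), axis=1).sum())

def speed_level(left_wrist, right_wrist, scale):
    """Level 0 to 5 for one window, from the faster of the two wrists.

    Distances are divided by the skeleton scaling factor. A value lying
    exactly on an edge is assigned to the higher level (an assumption).
    """
    distance = max(path_length(left_wrist), path_length(right_wrist)) / scale
    return int(np.digitize(distance, LEVEL_EDGES))

# Hypothetical window of five frames: the left wrist moves 30 pixels per
# frame (120 pixels in total) while the right wrist stays still.
left = [(0, 0), (30, 0), (60, 0), (90, 0), (120, 0)]
right = [(200, 0)] * 5

level_raw = speed_level(left, right, scale=1.0)     # 120 falls in (100-200)
level_scaled = speed_level(left, right, scale=2.0)  # 60 falls in (50-100)
```

The second call shows why normalization matters: the same pixel displacement counts as a lower level when the player stands closer to the camera.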

Classifier for irregular movement (IM): A sudden change in the direction or speed of arm movement can be an important signal of emotional change (see Table 1 for detailed descriptions of the correlation). To capture this signal, we observed arm movement patterns over three consecutive windows (recall that each window is five frames in length): we consider it to be an IM if the speed levels of these windows meet the irregularity condition. For example, 1 (previous window) -> 3 (current window) -> 3 (next window) will be considered an instance of fast arm movement, but not an IM. However, 1 -> 5 -> 1 would be classified as an IM.
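The exact irregularity condition is given by a formula that is not reproduced in the text, so the following sketch uses an assumed rule: the current window is flagged when its speed level differs from both the previous and the next window by at least `min_jump` levels. Both the rule and the default value of 2 are hypothetical; they were chosen only because they agree with the two examples above.

```python
def is_irregular(prev_level, cur_level, next_level, min_jump=2):
    """Assumed IM rule: the current level departs from BOTH neighbours.

    This is an illustrative stand-in, not the condition used in the study.
    """
    return (abs(cur_level - prev_level) >= min_jump
            and abs(cur_level - next_level) >= min_jump)

# The two examples from the text.
sustained = is_irregular(1, 3, 3)   # speeds up and stays fast: not an IM
spike = is_irregular(1, 5, 1)       # brief burst between calm windows: an IM
```

Under this reading, an IM is a short-lived spike or dip in speed, as opposed to a sustained change of pace.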

Classifier for Contraction (C) using space occupancy around the torso: We detected contraction and expansion based on the wrists’ location relative to the torso. We first estimated what we refer to as the body center from the shoulder keypoints and the hip keypoints, which together define the torso. From this center point, we defined a vertically shifted torso region (see the left panel of Figure 4) and another semicircular region (see the right panel of Figure 4).

Figure 4: Regions in space around torso for detecting contraction (left) and expansion (right) of the body

A contraction is said to occur when both wrists/hands are brought protectively and swiftly near the upper torso region and stay there for a long enough duration. Algorithmically, a contraction is said to occur if both wrists are detected inside the green rectangular box (left in Figure 4) for more than five frames. In order to remove false positives (especially when relaxed movements are made to bring both wrists inside the rectangular box), we also put a threshold on the speed with which the wrists are drawn into the torso/head region. Hence, for an occurrence of contraction, it has to be a reflex action that brings the hands close to the upper body. In contrast, stretching the arm(s) and wrist(s) into the yellow region (see right in Figure 4) and staying in this arc for more than 30 frames is considered an expansion. The yellow region comprises a half-circle covering the shoulders (areas labeled 1 and 2) and the area stretching 20 degrees below the shoulder (areas labeled 3 and 4).
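The contraction rule can be sketched as follows. The box coordinates, the speed threshold of 20 pixels per frame, and the way entry speed is measured (the displacement of the faster wrist in the step that brings both wrists into the box) are assumptions for illustration; only the "more than five frames" criterion is taken from the text.

```python
import math

def inside(point, box):
    """True if (x, y) lies within box = (x_min, y_min, x_max, y_max)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max

def step(p, q):
    """Distance moved between two consecutive positions."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def detect_contraction(left, right, box, min_frames=5, min_entry_speed=20.0):
    """True if both wrists enter the box quickly and stay for > min_frames."""
    run = 0
    fast_entry = False
    for i in range(len(left)):
        if inside(left[i], box) and inside(right[i], box):
            if run == 0:
                # Speed of the move that brought the wrists into the box.
                fast_entry = i > 0 and max(
                    step(left[i - 1], left[i]),
                    step(right[i - 1], right[i]),
                ) >= min_entry_speed
            run += 1
            if fast_entry and run > min_frames:
                return True
        else:
            run = 0
    return False

box = (40, 0, 60, 30)  # hypothetical upper-torso region in pixels

# Reflex: both wrists jump into the box in one frame and stay for six frames.
reflex = detect_contraction([(0, 80)] + [(45, 10)] * 6,
                            [(100, 80)] + [(55, 10)] * 6, box)

# Relaxed: the wrists drift in by only 4 pixels, so the speed test rejects it.
relaxed = detect_contraction([(38, 10)] + [(42, 10)] * 6,
                             [(62, 10)] + [(58, 10)] * 6, box)
```

An expansion detector could follow the same pattern with the semicircular region and a 30-frame minimum, without the speed test.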

2.3 Audio Processing of Data

2.3.1 Feature Description

One of the features that indicates emotional arousal can be found in vocal cues (interjections), such as screaming, curse words, or shouting. Such vocal cues prompted us to design automated tools for capturing and analyzing emotionally charged audio signals.

Once we had extracted the audio track from each VR video clip, we set a time window of duration 50 ms (also referred to as the frame duration) along the audio track as the smallest unit. Then we set a window step of 25 ms (the stride duration) to create a sliding time window along the whole audio track.

Energy feature:

For each time window, we calculated its log-energy. In our experiment, the sample rate and the time window were fixed, so every window contained the same number of samples.
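One common way to compute a per-window log-energy is sketched below. The exact formula used in the study is not recoverable from the text, so the decibel-style definition here (ten times the base-10 logarithm of the sum of squared samples) is an assumption, as are the function name and the small constant `eps` that guards against taking the logarithm of zero.

```python
import numpy as np


def log_energy_db(frames, eps=1e-12):
    """Log-energy of each window (row) of `frames`, on a decibel-like scale.

    Assumed definition: 10 * log10(sum of squared samples + eps).
    """
    frames = np.asarray(frames, dtype=float)
    return 10.0 * np.log10(np.sum(frames ** 2, axis=1) + eps)
```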

Pitch feature:

We used a standard power-spectrum method to obtain the frequency distribution of any audio clip (Huang et al., 2010).
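As a rough illustration of a power-spectrum approach, the sketch below takes the frequency with the largest spectral power in a window as its pitch estimate. It is not necessarily the procedure of Huang et al. (2010): the function name, the Hann window, and the peak-picking rule are assumptions.

```python
import numpy as np


def dominant_frequency(frame, sample_rate):
    """Return the frequency (Hz) with the highest power in one window."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))   # reduce spectral leakage
    power = np.abs(np.fft.rfft(windowed)) ** 2  # power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    power[0] = 0.0                              # ignore the DC component
    return float(freqs[np.argmax(power)])
```

For a 50 ms window the frequency resolution is 20 Hz regardless of the sample rate, which is fine enough to separate the pitch ranges discussed in Section 3.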

2.3.2 Defining Screaming Sub-Classes

Another strategy for analyzing the emotional agitation of VR players is to classify screaming into sub-classes. Many features are used in audio analysis, such as energy, pitch, and Mel-frequency cepstral coefficients (MFCC). We found that by combining loudness with pitch features we could separate shouting from screaming, since screaming has a much higher pitch than shouting. In particular, we first calculated the log-energy of each time window. Then we segmented the audio track into clips, each clip comprising a sequence of consecutive time windows with log-energy greater than a threshold (we set it to 5 decibels). Next, we selected all such clips with a duration greater than another threshold (set to 300 ms), the idea being that any potential scream has to have a minimum loudness and last for a minimum duration.
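The segmentation step can be sketched as follows. The 5 dB and 300 ms thresholds and the 50 ms / 25 ms windowing come from the text; the function name and the way a clip's duration is derived from its number of windows (one full window plus one step per additional window) are assumptions.

```python
from typing import List, Sequence, Tuple


def find_loud_clips(log_energy: Sequence[float], threshold: float = 5.0,
                    win_ms: int = 50, step_ms: int = 25,
                    min_ms: int = 300) -> List[Tuple[int, int]]:
    """Return (first_window, last_window) index pairs of candidate screams.

    A clip is a run of consecutive windows whose log-energy exceeds
    `threshold`; it is kept only if it lasts longer than `min_ms`.
    """
    clips = []
    start = None
    # A trailing sentinel closes a run that reaches the end of the track.
    for i, energy in enumerate(list(log_energy) + [float("-inf")]):
        if energy > threshold:
            if start is None:
                start = i
        elif start is not None:
            n_windows = i - start
            duration_ms = win_ms + step_ms * (n_windows - 1)
            if duration_ms > min_ms:
                clips.append((start, i - 1))
            start = None
    return clips
```

Under these assumptions a clip needs at least twelve consecutive loud windows (325 ms) to pass the duration test, since eleven windows span exactly 300 ms.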

2.4 Evaluation of Classifiers

We manually checked each video to calculate precision and recall. Since the classifiers yield window-based results (i.e. one decision for every five consecutive frames in the video), we first overlaid the results predicted by each classifier on the original video, and then watched the annotated video at quarter-speed playback. For each window (5 frames) in the video, if an action occurred (as judged by the human annotator) but the classifier failed to detect it, it was marked as a false negative. If the classifier detected an action that was not observed manually in the video, it was marked as a false positive. Windows in which the classifier's detection and the manual inspection matched were counted as true positives. Based on these counts, we computed precision and recall values.
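For reference, the standard definitions of precision and recall in terms of these window counts are sketched below. The function name is hypothetical, and the counts in the usage note are made-up numbers for illustration, not values from the study.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).

    Returns 0.0 for a ratio whose denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

For example, `precision_recall(3, 1, 3)` gives a precision of 0.75 and a recall of 0.5: three of the four detections were correct, but only three of the six manually observed actions were found.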

Table 2: Evaluation of classifiers

 

             CF      KD      FD      IM      Contraction (C)
Precision    0.721   0.724   0.733   0.660   0.707
Recall       0.805   0.768   0.766   0.623   0.575

 

 

3. Results

3.1 General Interjections

Usually, the pitch of most interjections falls in the range of 600 Hz to 2200 Hz, considering that the voice frequency of a female is higher than that of a male. We compared the screaming frequency distribution for both male and female, as shown in Figure 5.

Figure 5:  Gender comparison for screaming frequency distribution

Based on this experiment, we divided screaming pitch into four almost equally populated bins, where each corresponding interval roughly showed a similar percentage, as shown in Figure 5.

  • Level 4 (1375 Hz-2200 Hz): Highest level of screaming;
  • Level 3 (916 Hz-1375 Hz): Third level of screaming;
  • Level 2 (733 Hz-916 Hz): Second level of screaming;
  • Level 1 (611 Hz-733 Hz): Lowest level of screaming.
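Assigning a level to a measured pitch is then a simple lookup, sketched below. The bin edges are those listed above; the function name, the treatment of shared edges (each edge belongs to the higher bin, with 2200 Hz included in Level 4), and the return value 0 for pitches outside all bins are assumptions.

```python
def scream_level(pitch_hz: float) -> int:
    """Map a pitch in Hz to a screaming level 1-4, or 0 if out of range."""
    bins = [(611, 733, 1), (733, 916, 2), (916, 1375, 3), (1375, 2200, 4)]
    for low, high, level in bins:
        if low <= pitch_hz < high:
            return level
    if pitch_hz == 2200:
        return 4  # include the upper edge of the highest bin
    return 0
```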

3.2 Immersion Profiling of Gaming Players

One of the indicators of immersion might be the number of bodily cues that are present at the same time during a VR session. In the example below, the simultaneous presence of multiple intense and persistent body signals indicates that a person is deeply involved in the VR game. Manual observation of the video indeed verified the user’s visceral response to the VR content. A clear demonstration of automated observation of immersion is also shown in that video.

Figure 6: Multi-signal profile of a game player showing the simultaneous presence of multiple bodily cues

We built a detector for screaming that assigns a level to each scream. Level 4 represents the most extreme expression observed in our dataset, whereas Level 1 represents the lowest threshold for an interjection to qualify as a scream. As Figure 7 shows, six players did not scream at all, while the last player screamed a total of 30 times. Thus, we see significant variations in the verbal cues, similar to those we observed in non-verbal cues.

Figure 7: Variations in screaming levels in game players

3.3 Comparison between Gaming and Painting Videos

Although immersion can be observed during both gaming and painting VREs, the two seem to have different bodily signal patterns. For example, many bodily cues observed in game players, such as covering the face, loss of balance, and screaming, were rarely observed in users of painting VREs. However, our initial algorithm detected many instances of bodily cues such as contractions and kneeling down in painting videos. On further investigation we found an important distinction: while these bodily cues represented emotional distress in game players, in painting they represented functional necessity. For example, when a person kneels down during painting, it is not because they are afraid but because they are reaching down to paint near the floor. Similarly, contractions were observed in painters as they reached inward towards the torso area to access their virtual palette. This prompted us to refine our detection tools and add other criteria, such as the speed of the movement leading into these cues. In Figure 8 (see below), ten game players showed multiple contractions; the number of contractions is shown along the Y-axis. In contrast, contraction was detected in only one painting user, which on manual verification turned out to be functional.

Figure 8: Counts of contraction for gaming and painting videos

Another way to compare the intensity of upper-body movements in both VREs is to observe arm movement, or, more specifically, movements of the wrists.

For each video session, we calculated the log of the sum of the scaled movement distances over every 5 frames, for both the right wrist and the left wrist. This 5-frame log movement measurement is color coded, as indicated by the colorbar, with yellow representing the largest distance moved over a 5-frame interval. For each such 5-frame window, we created a block ten rows high: five rows with the color corresponding to the movement of the right wrist, and five for the left wrist. Each video is thus coded as a strip of height 10, with the frames along the X-axis. In Figure 9, the left block shows the heat map for all 31 gaming videos (stacked vertically) and the right block shows the heat map for a selection of 31 of the 36 paint brush videos. As the heat map shows, intense movement of both arms is much more prevalent in gaming videos; according to our observations, such movement is normally associated with disturbed emotional states, such as fear, terror or apprehension. In painting videos, movements are smoother. The heat map also provides evidence that, although there may be variations among players, significant differences in the patterns of body signals for immersion can be observed across settings. These results suggest that bodily signals are content dependent, and that we can infer the context of the VR just by observing and analyzing the bodily cues.
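The 5-frame log movement measurement can be sketched as follows for one wrist. The function name, the `scale` parameter (standing in for the skeleton-based scaling, whose exact form is not given here), and the small constant `eps` that avoids taking the logarithm of zero are assumptions.

```python
import numpy as np


def window_log_movement(wrist_xy, scale=1.0, window=5, eps=1e-6):
    """Log of the summed, scaled frame-to-frame wrist displacement
    for each consecutive block of `window` displacements.

    wrist_xy -- array of shape (n_frames, 2) with pixel coordinates
    """
    wrist_xy = np.asarray(wrist_xy, dtype=float)
    # Distance moved between consecutive frames, divided by the scale.
    steps = np.linalg.norm(np.diff(wrist_xy, axis=0), axis=1) / scale
    n_windows = len(steps) // window
    sums = steps[: n_windows * window].reshape(n_windows, window).sum(axis=1)
    return np.log(sums + eps)
```

Stacking the resulting rows for the right and left wrist of every video, and mapping the values to a color scale, would give a strip-per-video heat map of the kind shown in Figure 9.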

Figure 9: A heat map visualization of the intensity of arm movements for 31 gaming and 31 painting videos

In a similar manner, one can follow the trajectory of wrists and their speed of motion through a simple visualization that captures the fluidity or irregularity of movements.

Figure 10: Tracking wrist movements in two paintbrush user videos

In Figure 10 and Figure 11, the left column shows heat maps of the right and left wrists’ locations across frames, where a brighter pixel indicates a higher frequency of visitation by the wrists of the users in the video. The right column shows zoomed-in views of the trajectory of the right wrist (represented by green dots and yellow arrows) and the left wrist (represented by blue dots and magenta arrows). The relative speed of motion can be observed from these trajectories: a longer arrow connecting two dots means a larger movement between two consecutive frames. In addition, the movement pattern over time can be recovered by following the direction of the arrows. For example, we can infer that the painter in the second panel of Figure 10 is drawing circles with the right hand and holding a palette in the left. Both painting video trajectories show a clear fluidity of movement. In contrast, the gaming videos show bursts of quick, erratic movements. This provides another visualization of our captured signals, and forms a bridge between the detailed analysis of window-sized data and the heat map overview of all videos.
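A visitation heat map of this kind can be approximated with a two-dimensional histogram of wrist positions, as in the sketch below. The function name and the grid resolution are assumptions; the article does not state how its heat maps were binned.

```python
import numpy as np


def visitation_heatmap(wrist_xy, width, height, rows=10, cols=10):
    """Count how often the wrist visits each cell of a rows x cols grid.

    wrist_xy -- array of shape (n_frames, 2) with (x, y) pixel coordinates
    Returns an array of shape (rows, cols); larger values would be drawn
    as brighter pixels.
    """
    wrist_xy = np.asarray(wrist_xy, dtype=float)
    heat, _, _ = np.histogram2d(
        wrist_xy[:, 1], wrist_xy[:, 0],          # y indexes rows, x columns
        bins=(rows, cols),
        range=[[0, height], [0, width]],
    )
    return heat
```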

Figure 11: Tracking wrist movements in two game player videos (watch videos at this link)

The following example of variations in arm movement intensity in survival-based games shows that the arm movement signal is also closely related to emotional states of surprise or fear. Figure 12 shows arm movement levels (0-5) plotted as a function of frame number for three selected players; a higher level means a higher instantaneous speed of movement. The players exhibited various movement patterns. The first one had stable arm movement levels, which might suggest that his or her emotional state was relatively stable. The second one had a large arm movement in the middle part, from frame 60 to frame 140 (2-4.7 s), which might suggest that the player was surprised. This inference was validated by manual inspection of the video. The third player showed almost no arm movement.
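Quantizing an instantaneous speed into levels 0-5 can be sketched as below. The threshold values and function names are hypothetical, since the thresholds used in the study are not given in the text; the frame rate of 30 frames per second follows from the frame and time ranges quoted above.

```python
import numpy as np

# Hypothetical speed thresholds (pixels per frame) separating levels 0-5.
LEVEL_THRESHOLDS = [2.0, 5.0, 10.0, 20.0, 40.0]


def movement_levels(speeds, thresholds=LEVEL_THRESHOLDS):
    """Map each speed to a level: 0 below the first threshold,
    5 at or above the last one."""
    return np.digitize(speeds, thresholds)


def frame_to_seconds(frame, fps=30.0):
    """Convert a frame number to a time in seconds."""
    return frame / fps
```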

Figure 12: Variations in arm movement intensity over time

The levels of bodily signals show considerable variation across selected players. The last row in Figure 13 shows the fraction of time for each of the three cues for every player. Each of the top three rows shows the duration of one of the cues (measured as a fraction of the total gaming time) for each player. To make inter-player variations easier to observe, each cue’s duration is scaled differently. For example, IM is not that frequent, and hence its duration is scaled by a factor of 3. Comparing across activities, we can observe that covering the face and kneeling down are more frequent than irregular movement. Across players, we again see considerable variation in cues. For example, player 12 spent much of the time covering the face, while player 9 mainly knelt down while playing.

Figure 13: Duration distribution of key bodily signals, including Kneeling Down (KD), Covering Face (CF), and Irregular Movement (IM)

4. Discussion and Future Work

Immersion, as a complex and above all subjective phenomenon, represents a challenge for quantification. In our paper we proposed a method based on the theoretical frameworks of embodied cognition and technology as an extension of the body (for theoretical aspects see Section 1.1). This means we posited the body as a primary site of experience, where emotional and cognitive states are manifested through various body signals. We then proposed a method for quantifying immersion by observing, capturing and analyzing verbal and non-verbal (body) signals of the user. We analyzed video clips of multiple users in two different settings: a survival-based gaming setting and a non-aggressive, creative virtual application for painting called Tilt Brush. In particular, we started out with three hypotheses (see Section 2) about the bodily effects of immersion, and we summarize below the results that support them.

We want to emphasize that, while our preliminary models are consistent with all our hypotheses, validation using larger data sets augmented with additional sensors is required. In particular, we built software measurement tools to automatically distinguish between different types of VR experiences that are well-known markers of immersion. However, conclusive statements about quantification of immersion should be additionally supported in two ways: first, by using biosensors on new subjects, which could provide additional physiological markers of immersion and would further validate that the bodily signals (as captured via our video analysis) do indeed provide reliable markers of immersion; and second, by substantially enlarging the size of the corpus. Our future work will also include the development of algorithms for detecting body expansion (i.e. indicating the cognitive states of interestedness and creativity – video link here) and foot stomping (indicating the emotional state of restlessness), as listed in Table 1.

Our first hypothesis posited that there will be strong correlations between virtual stimuli and the bodily cues that are normally experienced under real life situations. We found that many of the threat-induced signals tabulated in Table 1 such as body contraction, face covering or irregular arm movement were indeed present in survival-based game videos (see Figure 6 and Figure 8). Those bodily cues were triggered by virtual threatening stimuli such as chasing, attacking, or falling, and they showed remarkable consistencies with known bodily reactions from real stimuli.

Our second hypothesis assumed that the presence and intensity of signals indicate the level of immersion, meaning that immersion is higher when all signals are present simultaneously and with high intensity. Based on our preliminary quantitative measurements, we found several pieces of evidence for this hypothesis. For example, Figure 6 showed the simultaneous presence of multiple bodily cues that are intense and persistent for a player. The immersion of this player can be manually verified by viewing the related video. For other players, however, we found only a subset of these cues, and the signals varied in intensity. A second piece of evidence for variation in immersion can be found in Figure 13, where we showed that the levels of bodily signals such as KD, CF and IM vary considerably across 13 players. These variations may depend on users’ previous experience with VREs.

The third hypothesis posited that bodily signals depend on the content being simulated in the VRE. Our findings (see Figure 8) suggest that contraction is much more frequent in survival-based gaming videos than in the artistic environment of Google Tilt Brush. Moreover, the results in Figure 9 showed that intense movements of both arms are far more common in gaming videos (indicated in bright green and yellow), which is normally associated with disturbed emotional states, such as fear, terror or apprehension. Such results suggest that bodily signals are content dependent, and that we can infer the context of the VR by observing and analyzing bodily cues, without having to read the scenario of a particular VR application or use biometric equipment. Moreover, once such models are refined and applied to a larger dataset, we will be able to infer the content of VR solely on the basis of automated detection. Such analysis would give us a better understanding of macro-level behavior in VREs. As we have shown in our example (Figure 10), it is already possible to infer the basic composition of the painting being created in the VR. It is also worth stating that frame-by-frame analysis is much more sensitive to movement, as it samples the movement every 33 milliseconds, which makes the analysis much more fine-grained than manual (human) detection. Such granularity of frame-based movement is analogous to McLuhan’s notion of media as an extension of the senses (see Section 1.1), where modern detection algorithms can be interpreted as an instrument for detailed examination of body cues. Such instruments – if we refer again to Merleau-Ponty’s example of a blind man and his cane – represent the new, technologically advanced and amplified sensory field.

5. References

Baumgartner, T., Valko, L., Esslen, M., and Jäncke, L. (2006). Neural correlate of spatial presence in an arousing and noninteractive virtual reality: an EEG and psychophysiology study. CyberPsychology & Behavior, 9(1), pp. 30–45.

Cao, Z., Simon, T., Wei, S.-E., and Sheikh, Y. (2016). Realtime multi-person 2D pose estimation using part affinity fields. arXiv preprint arXiv:1611.08050.

Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. New York: Harper Perennial.

Dede, C. (2009). Immersive interfaces for engagement and learning. Science, 323(5910), pp. 66–69.

Diemer, J., Alpers, G. W., Peperkorn, H. M., Shiban, Y., and Mühlberger, A. (2015). The impact of perception and presence on emotional reactions: a review of research in virtual reality. Frontiers in Psychology, 6, p. 26.

Ekman, P. (1992). An argument for basic emotions. Cognition & Emotion, 6(3-4), pp. 169–200.

Fox, J., Arena, D., and Bailenson, J. N. (2009). Virtual reality: A survival guide for the social scientist. Journal of Media Psychology, 21(3), pp. 95–113.

Hoffman, H. G. (2004). Virtual-reality therapy. Scientific American, 291(2), pp. 58–65.

Hoffman, H. G., Meyer III, W. J., Ramirez, M., Roberts, L., Seibel, E. J., Atzori, B., Sharar, S. R., and Patterson, D. R. (2014). Feasibility of articulated arm mounted Oculus Rift virtual reality goggles for adjunctive pain control during occupational therapy in pediatric burn patients. Cyberpsychology, Behavior, and Social Networking, 17(6), pp. 397–

Hoffman, H. G., Richards, T. L., Coda, B., Bills, A. R., Blough, D., Richards, A. L., and Sharar, S. R. (2004). Modulation of thermal pain-related brain activity with virtual reality: evidence from fMRI. Neuroreport, 15(8), pp. 1245–1248.

Huang, W., Chiew, T. K., Li, H., Kok, T. S., and Biswas, J. (2010). Scream detection for home applications. In: 2010 the 5th IEEE Conference on Industrial Electronics and Applications (ICIEA), pp. 2115–2120. Available at: https://ieeexplore.ieee.org/document/5515397/ [Accessed 10 Dec. 2018].

Jäncke, L., Cheetham, M., and Baumgartner, T. (2009). Virtual reality and the role of the prefrontal cortex in adults and children. Frontiers in Neuroscience, 3, p. 6.

Jennett, C., Cox, A. L., Cairns, P., Dhoparee, S., Epps, A., Tijs, T., and Walton, A. (2008). Measuring and defining the experience of immersion in games. International Journal of Human-Computer Studies, 66(9), pp. 641–661.

Malloy, K. M. and Milling, L. S. (2010). The effectiveness of virtual reality distraction for pain reduction: a systematic review. Clinical Psychology Review, 30(8), pp. 1011–1018.

Manovich, L., Malina, R. F., and Cubitt, S. (2001). The language of new media. Cambridge: MIT Press.

McLuhan, M. and Lapham, L. H. (1994). Understanding media: The extensions of man. Cambridge: MIT Press.

Mehraby, N. et al. (2005). Body language in different cultures. Psychotherapy in Australia, 11(4), p. 27.

Murray, C. D. and Sixsmith, J. (1999). The corporeal body in virtual reality. Ethos, 27(3), pp. 315–343.

Patterson, D. R., Jensen, M. P., Wiechman, S. A., and Sharar, S. R. (2010). Virtual reality hypnosis for pain associated with recovery from physical trauma. International Journal of Clinical and Experimental Hypnosis, 58(3), pp. 288–300.

Sanchez-Vives, M. V. and Slater, M. (2005). From presence to consciousness through virtual reality. Nature Reviews Neuroscience, 6(4), p. 332.

Scapin, S., Echevarría-Guanilo, M. E., Junior, P. R. B. F., Gonçalves, N., Rocha, P. K., and Coimbra, R. (2018). Virtual reality in the treatment of burn patients: A systematic review. Burns, 44(6), pp. 1403–1416. https://doi.org/10.1016/j.burns.2017.11.002.

Slater, M., Pérez Marcos, D., Ehrsson, H., and Sanchez-Vives, M. V. (2009). Inducing illusory ownership of a virtual body. Frontiers in Neuroscience, 3, p. 29.

Strehovec, J. (2007). Besedilo in novi mediji: Od tiskanih besedil k digitalni besedilnosti in digitalnim literaturam. Ljubljana: Literarno-umetniško društvo Literatura.

Villani, D., Repetto, C., Cipresso, P., and Riva, G. (2012). May I experience more presence in doing the same thing in virtual reality than in reality? An answer from a simulated job interview. Interacting with Computers, 24(4), pp. 265–272.

6. Web links

Adam Savage’s Tested 2016, Professional Sculpting in Virtual Reality with Oculus Medium, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=Pf8sKAuzR0k>.

Ahmed Aldoori 2016, TiltBrush VR Painting - Tracer!! Overwatch!!, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=zHVD7WYd2RU>.

ALexBY11 2016, Fusionandome con un Youtuber (Tilt Brush HTC VIVE), YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=YuS5QK8GQN8>.

Amixem 2016, The best experience of virtual reality! Tilt Brush # 1, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=SFNPiaCH3-0>.

BaZe Entertained 2016, Tilt Brush - VR - Baze Entertained - Oculus Rift, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=SepSvJH-rvY>. The video is no longer available on YouTube. It can be accessed here.

Chop Labalagun 2017, Oculus/Tilt Brush - Mixed Reality No GreenScreen Test 2, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=MCnK5UIcqFo>.

Debitor 2017, Edgar Malen in 3D vs Germanletsplay | Tilt Brush: Duell, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=UfSn5Sig25U>.

ErmaBell 2017, Funniest VR Freakouts, Fails and Moments of 2016, YouTube, viewed 10 August 2018, <https://www.youtube.com/watch?v=056bFCh8OpY>.

Funny Vines 2017, Funniest VR Reactions & VR Fails Compilation 2017 | Funny Vines Videos, YouTube, viewed 10 August 2018, <https://www.youtube.com/watch?v=i9BdwxMoOdA>.

Gines, H, et al. 2017, Openpose, GitHub, viewed 10 October 2017, <https://github.com/CMU-Perceptual-Computing-Lab/openpose>.

HandyGames 2016, Google Tilt Brush professional cartoon game artwork, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=arDmR_quuPE>.

HoloArt VR 2016, Artist Paints Slimer (Ghostbusters) in VR in 20 Mins TILT BRUSH, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=w20A6Caz8Ng>.

Immersive VR Education Ltd 2016, Apollo 11 VR, Immersive VR Education Ltd, viewed 9 June 2018, <https://www.oculus.com/experiences/rift/937027946381272/>.

IGN 2016, How Scary is the Paranormal Activity VR Game, 2016, YouTube, viewed 10 August 2018, <https://www.youtube.com/watch?v=Qsna1ChGt0E>.

Jacksepticeye 2016, Virtual Reality Edition | Drawing Your Tweets #9, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=3JMBUEsf43A>.

Kriksix 2016, painting / sculpting in VR with the HTC Vive and Google Tilt Brush, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=HPTTWsGAZd4>.

Lachlan 2016, Virtual Reality Draw my Thing? (Google Tilt Brush), YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=b5B11hVt7gE&t=13s>.

Opposable VR 2015, Artist Alix Briskham talks Tilt Brush on the HTC Vive, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=EYY-DZ14i9E>.

Random Stuff 2017, Funny, Scary VR Reactions Compilation, YouTube, viewed 10 August 2018, <https://www.youtube.com/watch?v=fIUojVmi_Xk>.

Relsson VR 2017, Tilt Brush - Paint in 3D Space with VR (HTC ViVe), YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=C4SB-ZXM568>.

Relsson VR 2017, The Magic of Tilt Brush: Painting in 3D Space -Oculus Rift Touch, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=ez8xHqnuVmY>.

SPANKED SILLY 2017, Funniest VR Fails and Scare Compilation’s of 2017, YouTube, viewed 10 August 2018, <https://www.youtube.com/watch?v=ES_07yKIIsQ>.

The Artist – Olga Pankova 2017, Tilt Brush - Virtual Reality - Augmented Reality - HTC VIVE, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=lYBOUQj11JA>.

Thomas Keith 2017, 10 Funniest Virtual Reality Fails And Reactions! (VR Fails & Reactions), YouTube, viewed 10 August 2018, <https://www.youtube.com/watch?v=XA-vJg1A1WY>.

Шед 2017, Макаронный монстр - Нарисуй мне дичь (HTC Vive VR) - Tilt Brush, YouTube, viewed 8 August 2018, <https://www.youtube.com/watch?v=iBFKgdPchh4>.

7. Appendix

It is useful to group the keypoints of the human skeleton into meaningful body parts: We defined a Hand Set as the keypoints 4, 7, a Lower Body Set as the keypoints 8, 9, 10, 11, 12, 13, a Leg Set as the keypoints 9, 12, and a Face Set as the keypoints 0, 14, 15, 16, 17.

(i) Labeling key bones: In addition, we connected keypoints into links/edges to indicate bones in the body. A full body skeleton is defined by the following set of links: {(4,3), (3,2), (2,1), (1,5), (5,6), (6,7), (1,0), (0,14), (14,16), (0,15), (15,17), (1,8), (8,9), (9,10), (1,11), (11,12), (12,13)}, the right leg as {(1,8), (8,9), (9,10)}, and the left leg as {(1,11), (11,12), (12,13)}.
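These groupings translate directly into simple data structures. The sketch below records the keypoint sets and skeleton links exactly as listed above; the constant names and the helper `links_touching` are hypothetical and serve only as an illustration.

```python
# Keypoint sets, with indices as listed in the text.
HAND_SET = {4, 7}
LOWER_BODY_SET = {8, 9, 10, 11, 12, 13}
LEG_SET = {9, 12}
FACE_SET = {0, 14, 15, 16, 17}

# Full-body skeleton: each pair of keypoint indices is one link (bone).
SKELETON_LINKS = [
    (4, 3), (3, 2), (2, 1), (1, 5), (5, 6), (6, 7),
    (1, 0), (0, 14), (14, 16), (0, 15), (15, 17),
    (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),
]


def links_touching(keypoints, links=SKELETON_LINKS):
    """Return the links that have at least one endpoint in `keypoints`."""
    return [link for link in links
            if link[0] in keypoints or link[1] in keypoints]
```

For example, `links_touching(LEG_SET)` returns the four links that meet at the two knee keypoints.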

(ii) Keypoint location coordinates and visibility: We represented the joints' locations in each frame as a keypoint location matrix, in which each joint is described by its x pixel coordinate and its y pixel coordinate.

Each joint's visibility was captured by a binary visibility vector: an entry of 1 means that the joint's confidence score is high enough for the joint to be treated as visible, and an entry of 0 means that it is not.

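(iii) We also defined an indicator function that takes a Boolean function as its argument and returns 1 if that function is True and 0 otherwise. For example, given two sets, summing the indicator of set membership over the elements of the first set counts the number of its elements that are in the second.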

(iv) These definitions were useful for succinctly describing (a) how distances are computed between different body parts, and (b) how a body part might deform over time (for example, going from standing to a kneeling down position). For example, since keypoints 0 and 4 represent the nose and the right hand/wrist joint respectively, the distance between their locations measures their separation in pixel units. Similarly, the sum of pairwise distances between all keypoints in the Face Set and all keypoints in the Hand Set (and hence, how far the hands are from the face region) can be written as a double sum over the two sets. Some keypoints, however, may be invisible, and we should only consider distances between pairs of visible keypoints. So, to compute a reliable distance between the hands and the head/face region, we multiplied each pairwise distance by the visibility entries of both keypoints, each of which is 0 if the keypoint is invisible.
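A visibility-weighted sum of pairwise distances of this kind can be sketched as follows. The function name and the use of the Euclidean distance are assumptions; the original expression is not recoverable here.

```python
import numpy as np


def visible_pair_distance(points, visible, set_a, set_b):
    """Sum of distances between every keypoint in `set_a` and every
    keypoint in `set_b`, counting only pairs in which both are visible.

    points  -- array of shape (n_keypoints, 2) with pixel coordinates
    visible -- sequence of 0/1 flags, one per keypoint
    """
    points = np.asarray(points, dtype=float)
    total = 0.0
    for i in set_a:
        for j in set_b:
            # The product of the flags is 0 if either keypoint is invisible.
            weight = visible[i] * visible[j]
            total += weight * np.linalg.norm(points[i] - points[j])
    return float(total)
```

Because invisible keypoints contribute zero, a small sum can mean either that the hands are near the face or that few pairs were visible; a detector built on this quantity would likely need to check the number of visible pairs as well.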

The indicator function was used to check whether the extracted skeleton satisfies certain conditions, and to count events (i.e., occurrences satisfying these conditions). For example, we observed that if the location change of the knee joints along the vertical axis between two frames is larger than a threshold number of pixels, then the player is kneeling down. We first measured this change for a knee keypoint, which could be 9 (right knee) or 12 (left knee). If the change exceeds the threshold, then according to our definition a kneeling event has occurred. Furthermore, summing the indicator of this condition over five frames, starting at a given frame, yields the number of such events in that window. These notations were used repeatedly in the definitions of our classifiers and detectors (see Section 2.2.2).
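Counting kneeling events with an indicator function can be sketched as below. The function names, the comparison of consecutive frames, and the sign convention (image y coordinates grow downward, so kneeling increases the knee's y value) are assumptions, and the threshold is left as a parameter because its value is not recoverable from the text.

```python
def indicator(condition: bool) -> int:
    """Return 1 if the condition is True, and 0 otherwise."""
    return 1 if condition else 0


def count_kneel_events(knee_y, start, threshold, window=5):
    """Count frames t in [start, start + window) at which the knee's
    y coordinate increases by more than `threshold` pixels by frame t + 1.

    knee_y -- per-frame y pixel coordinate of one knee keypoint (9 or 12)
    """
    return sum(
        indicator(knee_y[t + 1] - knee_y[t] > threshold)
        for t in range(start, start + window)
    )
```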

Skeleton normalization:    

Distance measurement, which computes the sum of pairwise distances between the visible keypoints in two keypoint sets:

Standing vs. Sitting detector:

Lower body related: kneeling down (KD):

Whole body related: Losing balance (LB):

Classifier for different speeds of arm movement (Levels 0-5):

Classifier for irregular movement (IM):