Evaluating Deep Networks for Detecting User Familiarity with VR from Hand Interactions

Mingjun Li, Numan Zafar, Natasha Kholgade Banerjee, Sean Banerjee
Clarkson University, Potsdam, NY, USA
{mingli,zafarn,nbanerje,sbanerje}@clarkson.edu

Abstract

As VR devices become more prevalent in the consumer space, VR applications are likely to be increasingly used by users unfamiliar with VR. Detecting a user’s level of familiarity with VR as an interaction medium makes it possible to provide on-demand training for acclimatization and prevents the user from being burdened by the VR environment while accomplishing their tasks. In this work, we present preliminary results of using deep classifiers to automatically detect familiarity with VR from hand tracking of the user as they interact with a numeric passcode entry panel to unlock a VR door. We use a VR door as we envision it to be the first point of entry to collaborative virtual spaces, such as meeting rooms, offices, or clinics. Users who are unfamiliar with VR will have used their hands to open doors with passcode entry panels in the real world. Thus, while the user may not be familiar with VR, they would be familiar with the task of opening the door. Using a pilot dataset consisting of 7 users familiar with VR and 7 not familiar with VR, we obtain a highest accuracy of 88.03% when 6 test users, 3 familiar and 3 not familiar, are evaluated with classifiers trained using data from the remaining 8 users. Our results indicate potential for using user movement data to detect familiarity for the simple yet important task of secure passcode-based access.

Index Terms:

Virtual reality, VR, familiarity, access, security, deep learning, experience

I Introduction

Virtual reality (VR) is increasingly being looked at as a mechanism for delivering experiences for tasks that users may typically perform in real-world, desktop, and mobile environments. VR environments are being evaluated across a diverse spectrum of users for education [1], therapy [2, 3], physical fitness [4], and even applications such as security [5, 6] and personal banking [7]. Research in behavior-based VR security shows that user actions in VR change at varying timescales [8, 9, 10]. As VR applications become more prevalent in consumer spaces, the number of first-time or early-stage VR users is expected to rise. For instance, an older adult may be recommended an exercise routine they can perform in a VR space. If the user is unfamiliar with the VR application, they may need timely training, as otherwise they may become disincentivized and stop using the application. Though such training can be given by the service provider, automatic training delivery based on detecting the familiarity of the user enables on-demand training in the user’s personal environment and alleviates the burden on strained service staff.

In this work, we provide a first attempt at conducting automated familiarity detection using the movements of a user as they engage in a VR task. We use a door opening VR task, where a user enters a 4-digit passcode combination to open a VR door, as users, regardless of prior VR experience, will have performed similar tasks in the real world. We envision VR doors to be a point of entry to collaborative virtual spaces. For example, a user in a virtual office setting may ‘walk’ to a common virtual conference room and enter a combination before entering the room. Early detection of prior VR experience, in this case through the interaction with the VR door, could enable real-time modifications to the interaction elements before the user enters the conference room to perform more complex interactions. In our approach, we track the finger movements of a user entering a passcode combination to unlock a VR door, and train deep neural networks to detect familiarity with VR. We use user-reported binary experience level with VR as a representation of familiarity. The task considered, i.e., unlocking a VR door, also has special significance in the area of VR security. Recognizing the potential for VR applications to store sensitive user data, a number of VR applications investigate leveraging the VR environment to enable secure access through passcodes or biometric signatures [5, 6]. With the emergence of hand tracking for VR, recent work has explored user identification based on hand tracking data [11]. In trying to gain access to a secured space in VR, a novice user may perform actions that cause them to be locked out by VR authentication mechanisms, e.g., pressing the wrong keypad button due to a lack of knowledge of the interaction process, or performing a deviating movement.

To the best of our knowledge, no work exists for learning-based detection of user familiarity with VR. Some work exists in the allied area of using machine learning to assess familiarity with a particular task simulated in a VR environment. For instance, a considerable body of work exists on detecting surgical skill by tracking the eye or hand movements of the practitioner [12]. Work similarly exists on using eye movements to detect soccer player expertise in VR [13, 14]. Other work has assessed spatial familiarity during wayfinding based on the eye movements performed by a user when turning at junctions [15]. However, the focus in these works is on familiarity with the task, rather than with VR as an interaction medium itself. The influence of different gaming controllers on perceptions of usability has also been explored [16], providing insights on how changes in ergonomics can impact usage.

II VR Door Unlock Application

We create a VR environment where users interact with a numeric panel to unlock a VR door via hand tracking using a controller-free head-mounted device (HMD). We use the Meta Quest Pro in this study. As VR environments become pervasive, an access-controlled door may be used to allow users with varying familiarity with VR to enter a virtual meeting room or their office. Users who are unfamiliar with VR would still be familiar with the task from the real world, as most people have opened a door using a numeric panel. To unlock the door, the user uses their (controller-free) hands to enter a numeric passcode combination by interacting with VR buttons simulating those on a traditional access panel on a real-world door. Upon completing the entry, they press a key labeled ‘E’ to enter the combination and unlock the door. If the user enters the correct combination, the door opens immediately, and automatically closes and locks after 3 seconds. We design our panel’s buttons to consist of block game objects for the digits 0 through 9 and for the letters ‘E’, representing enter, and ‘C’, representing clear. We use native hand tracking, enabling the participant to use hand motions to interact with the game objects. We detect whether a user has interacted with a button by detecting collisions between the colliders on the tips of the index fingers of both hands and the colliders on the buttons. Figure 1 shows a view of the door unlock application.

Figure 1: Left: Close-up view of numeric panel in VR Door Unlock application. The panel is placed on a VR door. Right: Participant interacts with the panel by entering a numeric passcode to unlock the door.
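The panel behavior described in this section can be sketched as a small state machine. This is an illustrative model only, not the authors' implementation: the class name `DoorPanel`, its fields, and the choice to clear the buffer after an incorrect ‘E’ press are assumptions, and button presses are assumed to arrive as already-detected fingertip collisions with a timestamp in seconds.

```python
from dataclasses import dataclass

AUTO_CLOSE_SECONDS = 3.0  # from the text: the door closes and locks after 3 seconds


@dataclass
class DoorPanel:
    """Hypothetical model of the keypad logic (names are illustrative)."""
    passcode: str
    buffer: str = ""          # digits entered so far
    open_until: float = -1.0  # time at which the door locks again

    def press(self, key: str, now: float) -> None:
        """Handle one detected fingertip-button collision at time `now`."""
        if len(key) == 1 and key in "0123456789":
            self.buffer += key
        elif key == "C":  # clear the current entry
            self.buffer = ""
        elif key == "E":  # submit the current entry
            if self.buffer == self.passcode:
                self.open_until = now + AUTO_CLOSE_SECONDS
            # Assumption: the buffer is reset after every submission.
            self.buffer = ""

    def is_open(self, now: float) -> bool:
        """The door is open from a correct submission until the auto-close time."""
        return now < self.open_until
```

For example, pressing 2, 6, 4, 8 and then ‘E’ at time 1.0 leaves the door open until time 4.0 in this sketch.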

III Dataset

We recruited 14 participants from the students, faculty, and staff at Clarkson University. Prior to using the VR environment, we asked participants to indicate if they had prior experience in VR. Of the 14 participants, 7 reported no prior experience in VR, i.e., no familiarity, and the remaining 7 indicated being very familiar with VR. All participants in our study are right-handed, with 10 participants self-reporting as male and the remaining 4 as female. We collected data using a Meta Quest Pro and recorded position and orientation data at 60 frames per second. During data collection, we asked participants to enter 2648, 2468, 1379, and 3197 on the virtual keypad. Each combination was entered 10 times before moving to the next combination. Once the participant had entered a combination, they were asked to press ‘E’ to open the door. If the participant made a mistake, we asked them to press ‘C’ to clear the code and try again. Incorrect entries were stored, but not used during the training and testing phase. In Figure 2 we show all 10 trials for each door lock combination for one VR-familiar and one non-VR-familiar participant. We observe that the user who self-reported as unfamiliar with VR shows a more variable pattern of movement.

IV Experiments

We evaluate detection of VR familiarity by training classifiers on sliding windows extracted from the trajectory of the dominant hand of the participants in the dataset. We evaluate sliding windows of sizes 50, 60, 70, 80, 90, 100, 110, and 120, with a step size of 1. For each window size, we evaluate three types of classifiers: multi-layer perceptrons (MLPs), fully convolutional networks (FCNs) [17], and the Point Cloud Transformer (PCT) [18]. We train one classifier per sliding window size and per passcode combination.
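As an illustration of the windowing step, the following minimal sketch extracts all contiguous windows from one trajectory. The function name `sliding_windows` and the representation of a trajectory as a list of per-frame samples are assumptions made for the example; the text only fixes the window sizes and the step size of 1.

```python
def sliding_windows(trajectory, window_size, step=1):
    """Return every contiguous run of `window_size` samples, `step` samples apart.

    `trajectory` is any sequence of per-frame samples (for example, tuples of
    hand coordinates; the exact features are an assumption of this sketch).
    """
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    last_start = len(trajectory) - window_size
    # An empty list is returned when the trajectory is shorter than one window.
    return [trajectory[start:start + window_size]
            for start in range(0, last_start + 1, step)]
```

With a step size of 1, a trajectory of T samples yields T - w + 1 windows of size w, so a hypothetical 150-sample trial would give 101 windows of size 50 but only 31 windows of size 120.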

Multi-Layer Perceptron

We design our MLP to consist of $2$ position-wise dense feed-forward hidden layers, where the output dimension of each layer is set to half of its input dimension. Following each hidden layer, we apply a ReLU [19] activation layer. The output layer is followed by a softmax layer to obtain the predicted class probabilities.
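The halving pattern and the softmax head can be sketched in plain Python as follows. This is a simplified illustration rather than the authors' model: it treats a window as one flattened feature vector, and the helper names, the weight initialization, and the input size are all assumptions.

```python
import math
import random


def halved_layer_sizes(input_dim, num_hidden=2):
    """Widths [input, h1, h2, ...] where each hidden layer halves its input."""
    sizes = [input_dim]
    for _ in range(num_hidden):
        sizes.append(max(1, sizes[-1] // 2))
    return sizes


def init_layer(rng, n_in, n_out):
    """Random weights (n_out x n_in) and zero biases; the scheme is an assumption."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_mlp(input_dim, num_classes=2, seed=0):
    """Two halving hidden layers followed by an output layer."""
    rng = random.Random(seed)
    sizes = halved_layer_sizes(input_dim) + [num_classes]
    return [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]


def mlp_forward(x, layers):
    """ReLU after each hidden layer, softmax after the output layer."""
    *hidden, (out_w, out_b) = layers
    for weights, bias in hidden:
        x = [max(0.0, v) for v in dense(x, weights, bias)]
    return softmax(dense(x, out_w, out_b))
```

For a hypothetical input of 150 values (say, a window of 50 samples with 3 coordinates each), the layer widths in this sketch are 150, 75, and 37, followed by 2 output classes.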

Fully Convolutional Network

We employ the FCN architecture as described by Wang et al. [17]. The architecture consists of $3$ convolutional blocks, each featuring a 1D convolutional layer, with {128, 256, 128} filters and kernel sizes of {8, 5, 3}, respectively. We apply a batch normalization layer [20] after each convolutional layer. We use a ReLU activation layer at the end of each block. After the $3$ blocks, we use a global average pooling layer [21] and a softmax operation to obtain the final class probabilities.
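To give a feel for the size of the convolutional stack, the sketch below counts the weights of the three convolutional layers and shows global average pooling over time. The number of input channels (3 here) and the inclusion of a bias term per filter are assumptions of the example; batch normalization parameters and the final classification layer are not counted.

```python
FILTERS = (128, 256, 128)  # filters per block, from the text
KERNELS = (8, 5, 3)        # 1D kernel size per block, from the text


def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per filter (bias use is an assumption)."""
    return kernel_size * in_channels * out_channels + out_channels


def fcn_conv_params(in_channels):
    """Parameter count of each of the three convolutional layers."""
    counts = []
    for out_channels, kernel_size in zip(FILTERS, KERNELS):
        counts.append(conv1d_params(in_channels, out_channels, kernel_size))
        in_channels = out_channels  # the next block consumes this block's output
    return counts


def global_average_pool(feature_map):
    """Average each channel over time.

    `feature_map` is a list of time steps, each a list of channel values.
    """
    steps = len(feature_map)
    channels = len(feature_map[0])
    return [sum(step[c] for step in feature_map) / steps for c in range(channels)]
```

Because global average pooling collapses the time axis, the pooled vector has 128 entries regardless of the window size, which is why the same architecture can be reused across all window sizes.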

Point Cloud Transformer

The Point Cloud Transformer (PCT) [18], created to address point cloud inputs via an attention architecture, consists of 4 stacked attention blocks. The outputs of each attention block are concatenated, additional max-pooling and average-pooling operations are applied, and the output is passed through several linear layers to produce the final classification probabilities. Given that our dataset contains trajectories that occupy a constrained space, we reduce the complexity of the PCT model by removing the concatenation of attention module outputs and eliminating the subsequent max-pooling and average-pooling layers.

Training and Test Split

To ensure independence of users between the training and test sets, we randomly select a subset of 4 VR-familiar users and 4 non-VR-familiar users for training, and use the remaining 3 VR-familiar users and 3 non-VR-familiar users for testing. For each user, we use all 10 interaction sessions.
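A user-independent split of this kind can be sketched as follows. The function name, the user identifiers, and the fixed random seed are assumptions for illustration; only the 4/3 split per class comes from the text.

```python
import random


def user_independent_split(familiar_ids, unfamiliar_ids, n_train_per_class=4, seed=0):
    """Randomly assign whole users to train or test, balanced across classes.

    No user contributes data to both sets, so test windows always come from
    users the classifier has not seen.
    """
    rng = random.Random(seed)
    train, test = [], []
    for group in (familiar_ids, unfamiliar_ids):
        shuffled = list(group)
        rng.shuffle(shuffled)
        train.extend(shuffled[:n_train_per_class])
        test.extend(shuffled[n_train_per_class:])
    return train, test
```

With 7 users per class, this yields 8 training users and 6 test users. Splitting by user rather than by window matters here because windows with a step size of 1 overlap heavily, so a window-level split would place near-duplicate samples on both sides.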

Loss Function

We optimize the parameters of each model by minimizing the Binary Cross-Entropy (BCE) loss,

$$\mathtt{L}=\frac{1}{|W|}\sum_{W}\mathrm{BCE}(pred,gt), \tag{1}$$

where $|W|$ denotes the number of windows, $pred$ is the predicted label, and $gt$ represents the ground truth label. We use Adam [22] to optimize the parameters of all the models.
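As a worked illustration of Equation (1), the sketch below averages the binary cross-entropy over a set of windows. It assumes that $pred$ is the predicted probability of the positive class and $gt$ is 0 or 1; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def bce(pred, gt, eps=1e-7):
    """Binary cross-entropy for one window; `pred` is a probability in [0, 1]."""
    p = min(max(pred, eps), 1.0 - eps)  # clip so that log() stays finite
    return -(gt * math.log(p) + (1 - gt) * math.log(1.0 - p))


def mean_bce(preds, gts):
    """Equation (1): the BCE averaged over all |W| windows."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equally long")
    return sum(bce(p, g) for p, g in zip(preds, gts)) / len(preds)
```

An uninformative classifier that outputs 0.5 for every window has a loss of ln 2, roughly 0.693, which is a useful reference point when reading training curves.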

V Results

Tables I and II summarize the results acquired by running the three classifiers examined in this work with various sliding window sizes and numeric passcode combinations. We show peak test accuracy in Table I and area under the ROC curve in Table II. Columns represent sliding windows, while rows represent classifiers and passcodes. We show the ROC curves for each of the models in Figure 3, with columns representing passcodes and rows representing classifiers. Each curve in a plot corresponds to a sliding window size.
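For readers who want to reproduce these two metrics on their own predictions, the following sketch computes accuracy and the area under the ROC curve from per-window scores. The 0.5 decision threshold and the function names are assumptions; the AUC is computed with the standard rank-based formulation, i.e., the probability that a randomly chosen positive window is scored above a randomly chosen negative one.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of windows whose thresholded score matches the 0/1 label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half. This is quadratic in the number of windows,
    which is fine for a sketch but slow for large test sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The two metrics can disagree: accuracy depends on the chosen threshold, while AUC depends only on how the scores rank the windows, which is one reason to report both.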

As demonstrated by Table I, we obtain the highest accuracy of 88.03% for the combination 2648 using the PCT classifier with a sliding window size of 120. We see similarly high scores for the other classifiers with the 2648 combination. The performance drops for smaller window sizes, with the lowest accuracy for this combination, 69.79%, obtained using an FCN with a window size of 50. Smaller windows provide less contextual information to learn from, which may contribute to the drop in performance with reducing window sizes. The combination 3197 shows the next best set of accuracies, with the highest for the combination being 80.11% using the FCN with a window size of 110. Similar to the 2648 combination, we see a drop in accuracy with reduced sliding windows. As seen in the plots in Figure 3, the FCN classifier shows higher overall area under the curve (AUC) for the 3197 combination compared to the MLP and PCT classifiers. PCT generally shows higher AUC for 2648. For lower window sizes, the FCN shows higher AUCs for both codes, and the MLP shows higher AUC for higher window sizes, potentially due to its simpler architecture.

We observe that 2468 and 1379 show results close to chance. A possible reason is that, given the order of entry, i.e., 2648, 2468, and 1379, users may be getting acclimatized to the environment. With the 3197 combination, unlike the other three, the user moves from right to left. During this motion, users not familiar with VR may need re-acclimatization time, though further data collection and analysis are necessary.

TABLE I: Classification Accuracies

Classifier, passcode / window size	50	60	70	80	90	100	110	120
MLP 1379	0.5234	0.5396	0.5472	0.5436	0.5427	0.5680	0.5444	0.5647
FCN 1379	0.5166	0.5907	0.5349	0.5275	0.5679	0.6320	0.6363	0.5453
PCT 1379	0.5473	0.5360	0.5291	0.5319	0.5640	0.5918	0.5925	0.5490
MLP 3197	0.6953	0.7266	0.7216	0.7144	0.6802	0.6199	0.5972	0.6054
FCN 3197	0.6269	0.7217	0.6766	0.6671	0.6834	0.6325	0.8011	0.7780
PCT 3197	0.6785	0.7053	0.7530	0.7061	0.7029	0.7438	0.7082	0.6466
MLP 2468	0.5170	0.5327	0.5062	0.5091	0.4816	0.4684	0.4586	0.4232
FCN 2468	0.6185	0.6497	0.5683	0.6872	0.6753	0.6586	0.6166	0.5768
PCT 2468	0.5623	0.5306	0.5190	0.5569	0.5269	0.5269	0.5102	0.5152
MLP 2648	0.7387	0.7149	0.7869	0.7841	0.8137	0.8130	0.8162	0.8001
FCN 2648	0.6979	0.7348	0.7250	0.7408	0.7126	0.7998	0.8103	0.8362
PCT 2648	0.7011	0.7522	0.7340	0.7762	0.8001	0.8397	0.8603	0.8803

TABLE II: Area under the ROC curve

Classifier, passcode / window size	50	60	70	80	90	100	110	120
MLP 1379	0.5243	0.5397	0.5556	0.5487	0.5653	0.5896	0.5721	0.6034
FCN 1379	0.5114	0.5514	0.5015	0.5279	0.5133	0.6091	0.5907	0.5289
PCT 1379	0.5262	0.5164	0.5333	0.4924	0.5481	0.5729	0.5473	0.5364
MLP 3197	0.6946	0.7190	0.7118	0.7058	0.6738	0.6352	0.5972	0.6054
FCN 3197	0.5943	0.6916	0.6514	0.6225	0.6707	0.5899	0.7538	0.7028
PCT 3197	0.6678	0.6817	0.7329	0.6950	0.6914	0.7311	0.7097	0.6408
MLP 2468	0.4916	0.5276	0.5030	0.5199	0.5030	0.5013	0.5077	0.4693
FCN 2468	0.5560	0.5961	0.5403	0.6072	0.5741	0.5045	0.5747	0.4614
PCT 2468	0.5277	0.5196	0.5167	0.5488	0.4982	0.5339	0.5380	0.4033
MLP 2648	0.7064	0.6669	0.7442	0.7268	0.7608	0.7520	0.7405	0.7062
FCN 2648	0.6385	0.6687	0.6448	0.6642	0.6410	0.7345	0.7198	0.7508
PCT 2648	0.6534	0.6963	0.6768	0.7256	0.7338	0.8123	0.8080	0.8470

VI Conclusion

We demonstrate results of using deep networks to classify familiarity of a user in a VR environment using their movement patterns as they interact with a virtual keypad on a door. The task of opening a door after entering a combination is familiar to all users from having opened similar doors in the real world. We obtain the highest accuracies, upwards of 80%, for the combination 2648, and the next highest for 3197. Results indicate potential for the use of VR movements to detect familiarity, though the effect of time on acclimatization may need to be considered, i.e., conducting familiarity detection early is critical. As part of ongoing work, we are expanding our data collection to include a diverse array of users and VR tasks of varying complexity. With a broader dataset, we intend to expand prior familiarity from a binary label to a 5-point Likert scale. We are also interested in using eye and head movements to gauge whether distractions from the novelty of the environment play a role in familiarity detection, e.g., if a novice user spends time examining a new environment or is taken aback. As part of future work, we are interested in investigating whether on-demand training to familiarize novice users with VR is more effective than adapting the VR environment or controller to have a reduced interaction scope. The reduced interaction scope may be actualized by reducing the amount of movement in the VR environment when the person moves the analog stick, or by adding pop-up hints when performing an interaction in the environment such as opening a door. Our current familiarity detection task in VR uses a door unlock panel, which requires interactions with the dominant hand. Future studies on more complex tasks, or even unfamiliar tasks, that require both hands or controllers are of interest, as most activities in VR involve a full-body experience.

References

[1] M. A. Rojas-Sánchez, P. R. Palos-Sánchez, and J. A. Folgado-Fernández, “Systematic literature review and bibliometric analysis on virtual reality and education,” Education and Information Technologies, vol. 28, no. 1, pp. 155–192, 2023.
[2] I. Chard and N. van Zalk, “Virtual reality exposure therapy for treating social anxiety: a scoping review of treatment designs and adaptation to stuttering,” Frontiers in digital health, vol. 4, p. 842460, 2022.
[3] Z. Liu, L. Ren, C. Xiao, K. Zhang, and P. Demian, “Virtual reality aided therapy towards health 4.0: A two-decade bibliometric analysis,” International journal of environmental research and public health, vol. 19, no. 3, p. 1525, 2022.
[4] Y.-L. Ng, F. Ma, F. K. Ho, P. Ip, and K.-w. Fu, “Effectiveness of virtual and augmented reality-enhanced exercise on physical activity, psychological outcomes, and physical performance: A systematic review and meta-analysis of randomized controlled trials,” Computers in Human Behavior, vol. 99, pp. 278–291, 2019.
[5] J. M. Jones, R. Duezguen, P. Mayer, M. Volkamer, and S. Das, “A literature review on virtual reality authentication,” in Human Aspects of Information Security and Assurance: 15th IFIP WG 11.12 International Symposium, HAISA 2021, Virtual Event, July 7–9, 2021, Proceedings 15. Springer, 2021, pp. 189–198.
[6] S. Stephenson, B. Pal, S. Fan, E. Fernandes, Y. Zhao, and R. Chatterjee, “SoK: Authentication in augmented and virtual reality,” in 2022 IEEE Symposium on Security and Privacy (SP). IEEE, 2022, pp. 267–284.
[7] F. Mathis, K. Vaniea, and M. Khamis, “Can I borrow your ATM? Using virtual reality for (simulated) in situ authentication research,” in 2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 2022, pp. 301–310.
[8] J. Liebers, C. Burschik, U. Gruenefeld, and S. Schneegass, “Exploring the stability of behavioral biometrics in virtual reality in a remote field study: Towards implicit and continuous user identification through body movements,” in Proceedings of the 29th ACM Symposium on Virtual Reality Software and Technology, 2023, pp. 1–12.
[9] R. Miller, N. K. Banerjee, and S. Banerjee, “Temporal effects in motion behavior for virtual reality (VR) biometrics,” in 2022 IEEE Conference on Virtual Reality and 3D User Interfaces (VR). IEEE, 2022, pp. 563–572.
[10] M. R. Miller, E. Han, C. DeVeaux, E. Jones, R. Chen, and J. N. Bailenson, “A large-scale study of personal identifiability of virtual reality motion over time,” arXiv preprint arXiv:2303.01430, 2023.
[11] J. Liebers, S. Brockel, U. Gruenefeld, and S. Schneegass, “Identifying users by their hand tracking data in augmented and virtual reality,” International Journal of Human–Computer Interaction, pp. 1–16, 2022.
[12] J. Chan, D. J. Pangal, T. Cardinal, G. Kugener, Y. Zhu, A. Roshannai, N. Markarian, A. Sinha, A. Anandkumar, A. Hung et al., “A systematic review of virtual reality for the assessment of technical skills in neurosurgery,” Neurosurgical Focus, vol. 51, no. 2, p. E15, 2021.
[13] B. Hosp, F. Schultz, E. Kasneci, and O. Höner, “Eye movement feature classification for soccer expertise identification in virtual reality,” arXiv preprint arXiv:2009.11676, 2020.
[14] B. Hosp, F. Schultz, O. Höner, and E. Kasneci, “Eye movement feature classification for soccer goalkeeper expertise identification in virtual reality,” arXiv preprint arXiv:2009.11676, 2020.
[15] N. Alinaghi, M. Kattenbeck, and I. Giannopoulos, “I can tell by your eyes! continuous gaze-based turn-activity prediction reveals spatial familiarity,” in 15th International Conference on Spatial Information Theory (COSIT 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022.
[16] K. M. Gerling, M. Klauser, and J. Niesenhaus, “Measuring the impact of game controllers on player experience in FPS games,” in Proceedings of the 15th International Academic MindTrek Conference: Envisioning Future Media Environments, 2011, pp. 83–86.
[17] Z. Wang, W. Yan, and T. Oates, “Time series classification from scratch with deep neural networks: A strong baseline,” in 2017 International joint conference on neural networks (IJCNN). IEEE, 2017, pp. 1578–1585.
[18] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, “PCT: Point cloud transformer,” Computational Visual Media, vol. 7, pp. 187–199, 2021.
[19] A. F. Agarap, “Deep learning using rectified linear units (ReLU),” arXiv preprint arXiv:1803.08375, 2018.
[20] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning. PMLR, 2015, pp. 448–456.
[21] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[22] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.