Multi-person eye tracking for real-world scene perception in social settings

Shreshth Saxena ([email protected], ORCID 0000-0002-9237-5461), Dept. of Psychology, Neuroscience & Behaviour, McMaster University, Hamilton, Canada
Areez Visram, Dept. of Computing and Software, McMaster University, Hamilton, Canada
Neil Lobo, Dept. of Computing and Software, McMaster University, Hamilton, Canada
Zahid Mirza, Dept. of Computing and Software, McMaster University, Hamilton, Canada
Mehak Rafi Khan, Dept. of Computing and Software, McMaster University, Hamilton, Canada
Biranugan Pirabaharan, Dept. of Computing and Software, McMaster University, Hamilton, Canada
Alexander Nguyen, Dept. of Psychology, Neuroscience & Behaviour, McMaster University, Hamilton, Canada; Institute for Music Informatics and Musicology, Karlsruhe University of Music, Karlsruhe, Germany
Lauren K. Fink ([email protected], ORCID 0000-0001-6699-750X), Dept. of Psychology, Neuroscience & Behaviour, McMaster University, Hamilton, Canada; School of Computational Science & Engineering, McMaster University, Hamilton, Canada

Abstract.

Eye movements provide a window into human behaviour, attention, and interaction dynamics. Previous research suggests that eye movements are highly influenced by task, setting, and social others; however, most eye tracking research is conducted in single-person, in-lab settings and is yet to be validated in multi-person, naturalistic contexts. One such prevalent real-world context is the collective viewing of a shared scene in social settings, for example, viewing a concert, film, lecture, sporting event, etc. Here, we apply mobile eye tracking in a real-world multi-person setup and develop a system to stream, record, and analyse synchronised data. We tested our proposed, open-source system while participants (N=60) watched a live concert and a documentary film screening during a public event. We tackled challenges related to networking bandwidth requirements, real-time monitoring, and gaze projection from individual egocentric perspectives to a common coordinate space for shared gaze analysis. Our system achieves precise time synchronisation and accurate gaze projection in challenging dynamic scenes. Further, to illustrate the potential of collective eye-tracking data, we introduce and evaluate novel analysis metrics and visualisations. Overall, our approach contributes to the development and application of versatile multi-person eye tracking systems in real-world social settings. This advancement enables insight into collaborative behaviour, group dynamics, and social interaction, with high ecological validity. Moreover, it paves the path for innovative, interactive tools that promote collaboration and coordination in social contexts.

multi-person, eye tracking, social settings, naturalistic, unrestricted, analysis, visualisation, gaze, collective viewing, music, concert, film, homography, modular, system design
ccs: Human-centered computing Empirical studies in collaborative and social computing
ccs: Human-centered computing Computer supported cooperative work
ccs: Human-centered computing Collaborative and social computing design and evaluation methods
ccs: Human-centered computing Visualization toolkits

1. Introduction

The eyes can reveal where, when, and to what we are paying attention (Klein and Ettinger, 2019). Eye contact and shared gaze are critical in social interactions and in guiding collective/joint attention (Valtakari et al., 2021; Clark, 1985). Though ocular activity has always been considered relevant in visual situations, the eyes are equally relevant in indexing auditory attention (Reisberg et al., 1981; Van der Stigchel and Hollingworth, 2018; Pomper and Chait, 2017; Fink et al., 2018, 2019). Interactions between auditory and visual stimuli can capture attention (Van der Burg et al., 2008), alter (Sekuler et al., 1997; Dunifon et al., 2016) or enhance (Bolger et al., 2014; Miller et al., 2013) perceptual sensitivity, and affect social judgments and feelings (Gerdes et al., 2021; Stupacher et al., 2020). Eye tracking can, therefore, provide a multimodal index to perception, attention, and subjective states.

Applications of eye tracking have been mostly limited to single-person settings in controlled environments due to the immobility of traditional eye tracking hardware. With the recent development of affordable, mobile eye tracking technology, it is increasingly possible to conduct eye tracking studies “in the wild” (see e.g. (Fasold et al., 2021; Kulke and Hinrichs, 2021; Reitstätter et al., 2020; Saxena et al., 2023; Vandemoortele et al., 2018; Fridman et al., 2018)). The ability to study unrestricted eye movement behaviour in naturalistic contexts is critical for understanding how eye movements are integrated with head and body movements (Land, 2009) and how previous laboratory results may differ in natural environments (Kulke and Hinrichs, 2021; Laidlaw et al., 2016; Foulsham and Kingstone, 2017). In single-person use cases, mobile eye tracking glasses are commonly applied; however, since these glasses record individual gaze from the egocentric perspective of the wearer, they introduce a new challenge of aligning data from differing perspectives when conducting multi-person eye tracking studies. To effectively study collective gaze dynamics in social settings, it is crucial to develop automated ways to synchronise and analyse the eye tracking data of multiple individuals in unrestricted real-world contexts.

One particularly interesting and ubiquitous context is that of collective viewing of a shared scene in social settings where multiple people share a common gaze goal, e.g., sporting events (Kredel et al., 2017), surgery operations (Tolvanen et al., 2022), school classrooms (Jarodzka et al., 2021), etc. Eye tracking in such naturalistic contexts is crucial for studying social experiences that unfold over longer timescales. For instance, in the case of music, while we have learned much via laboratory studies with single participants and short sound clips presented in the lab, there is a push in the field for studying musical engagement in naturalistic social contexts (Tervaniemi, 2023), as the types of experiences people report as deeply meaningful often involve social settings with music that evolves over long periods of time. However, to date, very few solutions exist for multi-person eye tracking in such dynamic real-world settings.

A schematic of our general approach to solving the problem of multi-person eye-tracking in social settings, and automating data analysis from multiple participant perspectives, is detailed in Figure 1. We imagine a scenario where multiple people observe a shared scene from different viewing angles. The eye movements of each individual viewer can be recorded independently using mobile eye tracking glasses. However, to study collective gaze patterns, the recorded wearer-specific data needs to be synchronised and mapped onto a shared coordinate space (semantic gaze mapping). We record a central view of the shared gaze area from a camera placed behind all observers. We employ Network Time Protocol (NTP) to temporally synchronise data from all devices. The synchronised eye-tracking data is then mapped to the unified central coordinate space via feature mapping between the images captured from each wearer’s head-mounted camera and the images from the central camera. Applying a pin-hole camera model, the scenes captured from each wearer’s perspective can be assumed to lie on a planar surface at an infinite distance. The projective transformation between any two views of the same scene is then mathematically related by a planar homography. Because dynamic real-world social settings likely involve challenging artefacts caused by illumination changes, motion distortion, low-textured regions, and occlusions, we apply a deep-learning-based method to robustly detect and match feature points. Paired feature points are used to compute a homography matrix that relates each wearer’s view to a central view of the shared scene. Every individual’s gaze can then be projected onto the shared centralview for subsequent analyses.

To realise this approach, we first design a scalable system architecture for multi-person mobile eye tracking in social settings, agnostic to hardware requirements. We then test our proposed system during two public events in which we record eye data from 30 participants simultaneously (N=60). In validating our approach, we provide visualisation and analysis metrics for multi-person eye tracking data. Overall, we make the following contributions in this paper:

  1. Develop and validate a scalable multi-person eye tracking system, with flexible modes of operation. Our system provides users the means to monitor incoming data in real-time and to remotely control all associated mobile eye tracking devices in parallel. We tested our system with 30 mobile eye tracking devices in parallel, limited only by the number of devices we own. Our system should easily handle more devices when operating in recording mode. This tool opens unprecedented avenues for batch data collection and real-time interactivity in social settings.

  2. Solve the problem of multi-perspective gaze data, in the context of shared social scenes. Using our proposed homography approach, it is possible to project the gaze of all eye tracking glasses wearers from egocentric coordinates to shared central coordinates. This technique revolutionises possibilities for automating analyses of social gaze dynamics in real-world settings.

  3. Introduce and implement standard and new visualisation and analysis techniques for multi-person eye tracking data. We provide solutions for visualising both egocentric and gaze-transformed data and heatmaps, with methods to visualise all individually or in parallel. The gaze-transformed heatmaps offer particular promise in analysing group attention and perception. Similarly, analysis metrics like convex hull area may offer novel insights into collective gaze behaviour.

Figure 1. LEFT: Schematic of a shared scene depicting multiple people wearing mobile eye-tracking glasses gazing at a stage. A central camera at the back of the audience records the shared scene (centralview). MIDDLE: Egocentric worldviews of each glasses wearer need to be projected onto the centralview. Point correspondences between the egocentric and centralviews are used to calculate homography matrices that relate the two views. RIGHT: Using the computed homography matrices to remap the gaze coordinates, all individuals’ gazes can be projected onto the shared scene (centralview).

2. Related Work

Previous work on multi-person eye tracking has relied on smaller groups (predominantly dyads) and elaborate experimental setups that hinder natural behaviour (Bise et al., 2024; Fasold et al., 2021; Macdonald and Tatler, 2018; Kera et al., 2016; Jermann et al., 2012; Kawase, 2009; Yamada et al., 2014). Most studies present static scenes on a screen as a proxy to predict real-world eye tracking behaviour, which Foulsham and Kingstone (2017) identified as an inefficient approach. Some studies have applied head pose as a proxy for gaze in multi-person setups (Stiefelhagen, 2002; Beyan et al., 2018); however, existing research also shows that head orientation is not always correlated with gaze in social-group interactions (Vrzakova et al., 2016).

In addition to the experimental setup, analysis of multi-person eye tracking data remains a significant hurdle in conducting large-scale studies. Previous studies have heavily relied on manual techniques like incorporating gaze reports in surveys (Pennill and Timmers, 2017), mapping gaze to visual stimuli using proprietary closed-source manufacturer software, or hand-coded annotation of regions of interest (Benjamins et al., 2018; Geeves et al., 2014). These techniques are incredibly time-consuming and scale poorly to large datasets. Further, they are not entirely reproducible and are prone to human error. A common strategy applied to simplify the problem is to calculate coarse gaze measures, such as discrete binary classification of gaze on regions of interest (Müller et al., 2018; Fasold et al., 2021). While more scalable, such methods provide limited-resolution data, thereby reducing the cost-effectiveness of experiments and limiting the insights that can be gained.

More scalable and automated attempts at unrestricted eye tracking are actively being made under the umbrella field of ‘pervasive eye tracking’ (Kasneci, 2017), motivated by advances in portable head-mounted eye trackers that have recently become more affordable and accurate. Further, the increased availability of open-source and automated analysis tools (Panetta et al., 2019; Li et al., 2006) has aided the adoption and potential applications of these mobile eye trackers, which are increasingly being employed in unrestricted natural environments with static scenes, such as art galleries (Reitstätter et al., 2020). However, to date, a synchronised multi-person eye tracking study in dynamic real-world social settings has not been conducted, and consequently the analysis methods to evaluate such large-scale data are not readily available. To solve these challenges, we conceptualise and implement a system framework for multi-person eye tracking, described in the next section.

3. System Design

At the highest abstraction level, reliable multi-person eye-tracking relies on four critical components: a) the hardware required to collect eye tracking measures and transmit them to a central server over the network with low latency, b) the software that efficiently manages multiple devices and synchronises received data from the devices in a fault-tolerant manner, c) storage to efficiently store raw and/or processed data, and d) analysis and visualisation methods. Proper choice of hardware and storage solutions is largely dependent on case-specific infrastructure limitations, but it is critical to avoid performance bottlenecks. Here, we conceptualise the software framework, agnostic to specific hardware and storage solutions. We then adapt the proposed framework in a real-world utility study, discuss our hardware and storage choices, and demonstrate analysis and visualisations for multi-person eye-tracking data in collective, social settings.

In our proposed software framework, presented in Figure 2, incoming data consists of three parallel streams: i) gaze coordinates from each eye tracking device, ii) worldview video recorded from a head-mounted camera on each device, and iii) a centralview video of the shared scene, captured and streamed by a stationary camera mounted to the centre of the scene. Incoming data streams are handled by separate modules that engage based on the mode of operation. Modules in our framework abstract information and allow isolated development and updates of separate components.

Figure 2. Software framework demonstrating the flow (arrow lines) of data streams (ellipses) in the system. Solid and dashed lines represent the recording and streaming mode of operation respectively. Each mode of operation engages separate modules (dog-eared rectangles) that process the incoming data.

3.1. Modes of operation

3.1.1. Recording

The recording mode is meant for offline operation where data are collected and stored for post-hoc analysis in a separate stage. Such operation is desirable when real-time processing of incoming data is not required, e.g., in research and user studies. Splitting the two stages reduces network bandwidth and computation requirements, thereby allowing complex analysis and long-duration recordings with limited resources.

3.1.2. Streaming

The streaming operation is required for real-time or online applications where data are collected and processed concurrently. This mode of operation demands sophisticated hardware and software optimisations to achieve good performance. Processing data in real-time is desirable for innovative Human-Computer Interaction applications such as gaze-contingent interfaces.

Apache Kafka serves as the backbone of the online mode of operation. It is a high-throughput, low-latency event streaming platform that handles the incoming data: the gaze and worldview streams from multiple eye trackers and the centralview stream from a network camera. Each data stream is sent to a separate Kafka topic with multiple partitions. In Kafka, topics are analogous to sub-directories that organise incoming streams; their partitions are distributed across background Java server processes called brokers, which ensures load balancing and scalability. Finally, separate consumer modules read these data streams for a) homography calculation, b) writing to local storage, or c) publishing the data to a dedicated Flask server for real-time monitoring.

3.1.3. Recording and streaming

The recording and streaming mode of operation records raw data concurrently with real-time processing of incoming data streams. This mode provides the flexibility to perform both real-time and post-hoc analysis, at the expense of added computational requirements. Such operation is desirable when applications have gaze-contingent aspects that should be processed in real-time but more complex analyses need to be performed offline.

3.2. Modules

3.2.1. GlassesRecord

The GlassesRecord module is engaged in the recording mode to send recording start and stop signals to each pair of eye tracking glasses, controlled manually by an operator. Additionally, the module provides functionality to annotate recordings with custom hand-marked events. A critical challenge in multi-person eye tracking is the management and monitoring of many devices. An intuitive and reliable interface to this module is, therefore, a central requirement to efficiently trigger and monitor actions on multiple devices. We demonstrate an example implementation of this module in our Utility Test (see Section 4) using the open source Textual framework (https://textual.textualize.io/) to build a convenient Terminal User Interface (TUI). Further, the module ensures fault tolerance with the ability to trigger these actions selectively on a subset of devices, for example, the ability to restart recording on one pair of glasses if it fails, rather than having to restart on all devices.

Time-synchronisation. A critical aspect of working with multiple devices on a network is time synchronisation, since each device has a separate internal clock used to record timestamps on that device. Independent internal clocks must be synchronised to a master clock to ensure accurate timestamp comparisons. We synchronise all eye tracking devices, recording cameras, and internal servers to a central NTP server hosted locally on the network. These internal digital clocks, however, can also accumulate small drifts over time leading to errors in offset calculation. In the streaming mode of operation, this drift can be ignored for most applications to favour low-latency operation. In the recording mode of operation, the GlassesRecord module records periodic clock offset measurements–every ten seconds by default–between the NTP server clock and each eye tracking device clock. Offsets between the two clocks are measured by exchanging a brief sequence of UDP (User Datagram Protocol) packets between the two involved devices that yields a) an estimate of the round-trip-time (RTT) between the two devices, and b) an estimate of the clock offset (OFS) at the time of the measurement, with the round-trip-time factored out. The final offset estimate is the one offset value for which the associated RTT value was minimal. During offline analysis, the offset information for each device is aggregated over the entire span of the recording. An outlier-robust linear fit is calculated over the clock offset measurements using random sample consensus (RANSAC) and used to remap the timestamps for precise time synchronisation.
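The offset measurement and drift correction described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names, the four-timestamp exchange layout (borrowed from the standard NTP formulation), the inlier tolerance, and the choice to fit the offset against device time are all assumptions made for the example.

```python
import random

def estimate_offset(exchanges):
    """Return (rtt, offset) from the exchange with the smallest round-trip time.

    Each exchange is (t0, t1, t2, t3): request sent and reply received are
    stamped by the server clock (t0, t3); request received and reply sent
    are stamped by the device clock (t1, t2). Layout is an assumption.
    """
    best = None
    for t0, t1, t2, t3 in exchanges:
        rtt = (t3 - t0) - (t2 - t1)              # time spent on the network
        ofs = ((t1 - t0) + (t2 - t3)) / 2.0      # device clock minus server clock
        if best is None or rtt < best[0]:
            best = (rtt, ofs)
    return best

def ransac_line(times, offsets, tol=0.002, iters=200, seed=0):
    """Outlier-robust fit of offset = slope * t + intercept."""
    rng = random.Random(seed)
    pts = list(zip(times, offsets))
    best_inliers = []
    for _ in range(iters):
        (x1, y1), (x2, y2) = rng.sample(pts, 2)
        if x1 == x2:
            continue
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        inliers = [(x, y) for x, y in pts if abs(y - (m * x + b)) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit with ordinary least squares on the consensus set only.
    n = len(best_inliers)
    mx = sum(x for x, _ in best_inliers) / n
    my = sum(y for _, y in best_inliers) / n
    sxx = sum((x - mx) ** 2 for x, _ in best_inliers)
    sxy = sum((x - mx) * (y - my) for x, y in best_inliers)
    slope = sxy / sxx
    return slope, my - slope * mx

def remap(t_device, slope, intercept):
    """Convert a device timestamp to server time using the fitted drift model."""
    return t_device - (slope * t_device + intercept)
```

Selecting the minimum-RTT exchange limits the influence of asymmetric network delays on a single measurement, while the robust line fit models slow clock drift across the whole recording and ignores occasional bad measurements.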

3.2.2. CentralCam

This module is responsible for handling a centrally mounted camera that records a stationary perspective of the shared scene viewed by all the eye tracking device wearers. The module supports off-the-shelf webcams connected via a USB interface as well as network RTSP streams from cameras accessible wirelessly or via ethernet. The module grabs and timestamps individual frames from the stream. The module also computes the dynamic frame rate in frames per second (FPS), temporal jitter calculated as the mean and standard deviation of the inter-frame time difference, and dropped frame counts. The final outputs from this module include the recorded video in an mp4 container, along with timestamps for each recorded frame, and the calculated metrics as comma-separated values (csv). Additionally, the module offers the option to save the RAW video as output. The RAW video format occupies much more disk space; however, it requires much less computation (no decoding) and stores lossless data for post-hoc processing or correction. In the streaming mode, the module initialises a Kafka producer and pushes each incoming frame to the producer. Before sending a frame to the producer, the module resizes it, applies JPEG compression, and bundles it into a JSON object along with the recorded timestamp.
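As a rough illustration of the timing metrics listed above, the sketch below derives frame rate, jitter, and a dropped-frame estimate from a list of per-frame timestamps. The function name and the dropped-frame heuristic (counting gaps that span more than one nominal frame period) are assumptions for this example; the module may count dropped frames differently.

```python
from statistics import mean, pstdev

def frame_metrics(timestamps, nominal_fps):
    """Summarise timing quality from per-frame capture timestamps (seconds)."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    duration = timestamps[-1] - timestamps[0]
    expected_dt = 1.0 / nominal_fps
    # Heuristic: a gap of about k nominal periods implies k - 1 missing frames.
    dropped = sum(max(0, round(d / expected_dt) - 1) for d in deltas)
    return {
        "fps": len(deltas) / duration,      # effective frame rate
        "jitter_mean": mean(deltas),        # mean inter-frame interval
        "jitter_std": pstdev(deltas),       # spread of inter-frame intervals
        "dropped": dropped,
    }
```

For example, `frame_metrics([0.0, 0.1, 0.2, 0.4, 0.5], nominal_fps=10)` reports an effective rate of 8 FPS and one dropped frame, because the 0.2 s gap spans two nominal frame periods.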

3.2.3. GlassesStream

This module implements the streaming of gaze and worldview data streams in the streaming mode of operation. The module initialises a Kafka producer, parses incoming gaze and video data streams from selected eye tracking devices on the network and publishes them to the Kafka cluster. The module also starts a metrics server, incorporating Grafana and Prometheus services, to monitor the Kafka cluster.

3.2.4. Flask Monitor

The Flask Monitor module provides real-time monitoring of the incoming data streams to verify device connectivity and data quality. The module initialises a Kafka consumer that reads each topic and updates corresponding graphs for that data stream. The module hosts a Flask server that provides a graphical web-browser interface. Worldview streams from all devices are displayed on this interface with a configurable refresh rate (default 1 Hz), together with the incoming frame rate (in FPS), which is used to colour-code the display window of each device. For example, if the current sample rate of an incoming worldview stream drops to zero, its display window boundary turns red. Incoming gaze is displayed as a dynamically updating time-series of x and y coordinates, with a configurable refresh rate. The sample rates for gaze and worldview streams are set to a low value by default to reduce stress on the server and keep the monitor interface responsive. The incoming centralview is also consumed in a separate process and displayed in a separate window, similar to the worldview frames.

3.2.5. Homography

Since the gaze and worldviews from each pair of eye tracking glasses are egocentric to the wearer’s perspective, a direct comparison of spatial gaze measures from different wearers is not possible. The homography module maps gaze from egocentric worldview coordinate space to centralview coordinate space simultaneously for multiple devices. Incoming gaze samples (x, y coordinates) and worldview frames from each device are paired with the nearest centralview frame using the timestamps marked by the respective input modules of each data stream (see Section 3.2). The module accommodates varying data sampling rates of the three data streams, since each stream’s sampling resolution depends upon factors like hardware configuration, recording parameters, and network bandwidth. The final temporal resolution of the homography module output is the same as the sampling rate of the provided centralview (i.e., for each centralview frame, the homography module attempts to map all gaze points from the provided worldview and gaze pairs to that centralview frame). This mapping is accomplished by finding robust feature point correspondences between each worldview-centralview pair. Matched key points are then used to calculate a homography matrix using a least squares method, and the computed matrix is finally used to transform gaze from worldview to centralview coordinate space. In the recording mode of operation in our Utility Test (see Section 4), we apply a state-of-the-art deep learning model, SuperGlue (Sarlin et al., 2020), to find robust feature point matches; however, the real-time operation of this module is dependent on the applied use-case, available computational resources, and future usability testing.
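The pairing and projection steps can be sketched in a few lines of dependency-free Python. This is an illustrative sketch under stated assumptions, not the module's code: feature matching (SuperGlue in the Utility Test) is taken as given, the function names are hypothetical, the least squares problem is posed with the bottom-right matrix entry fixed to 1, and coordinates are assumed to be normalised to the unit square to keep the normal equations well conditioned.

```python
from bisect import bisect_left

def nearest_index(sorted_ts, t):
    """Index of the timestamp in sorted_ts closest to t."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    return i if sorted_ts[i] - t < t - sorted_ts[i - 1] else i - 1

def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def fit_homography(src, dst):
    """Least-squares 3x3 homography from >= 4 matched points (h33 fixed to 1)."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    # Normal equations: (A^T A) h = A^T b
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(8)]
    h = solve(ata, atb) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]

def project(H, point):
    """Map a worldview gaze point into centralview coordinates."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

In use, each gaze sample would first be paired with a centralview frame via `nearest_index`, the homography for that worldview-centralview pair would be fitted from the matched key points, and `project` would then place the gaze point in the centralview. With noisy real matches, an outlier-rejection step before the fit is advisable.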

3.2.6. Visualisation

The visualisation module synchronises timestamped data from recorded worldview frames, gaze samples, and centralview frames with an additional input of the transformed gaze in the centralview space, as computed by the homography module. This synchronised data can be used for visual analytics by presenting it through various visualisation techniques. We propose and implement the following visualisation examples in our Utility Test:

  • Single-person gaze and worldview: Overlays gaze synchronised to the worldview stream for each selected viewer as a separate video file with an optional overlay of associated timestamps for each data sample.

  • Multi-person gaze and worldview: Renders a single video with all selected viewers’ worldviews with the corresponding gaze points overlay (as in the previous option) in a 2D grid of customisable size.

  • Single-person egocentric and transformed gaze view: Stacks the time synchronised worldview and centralview frames for a single viewer along with the corresponding gaze and transformed gaze overlay on each view respectively.

  • Multi-person transformed gaze on centralview: Presents the centralview with all time synchronised gaze coordinates from the selected viewers overlaid on top. The transformed gaze for each viewer can be plotted either as customisable circles containing a unique ID and outline colour, or as a heatmap overlay calculated from the 2D distribution of all transformed gaze points.

  • Multi-person egocentric views with transformed gaze on centralview: Combines the worldviews from all selected viewers and plots it as a grid with the time synchronised centralview in the centre cell. Gaze for each viewer is overlaid in their respective worldview and the centre cell displays all the transformed gaze points similar to the previous option.

3.2.7. Analysis

The Analysis module provides methods to preprocess and evaluate the recorded egocentric gaze data as well as the homography-transformed gaze data for collective gaze analysis metrics. We propose and implement the following metrics that include traditional eye-tracking measures, such as heatmap similarity and correlation, scaled up to multi-person use cases, and novel collective measures, such as convex hull area, to quantify gaze dynamics in social group settings:

  • Average Gaze Velocity: Gaze velocity is calculated as a time series of inter-sample Euclidean distances of gaze (x, y coordinates) samples. The average gaze velocity over a given timeframe quantifies mean precision of the gaze predictions which reflects overall reliability of the gaze predictions.

  • Heatmap: Gaze heatmaps are calculated from a series of x and y gaze coordinates by convolving their 2D distribution with a Gaussian filter whose sigma (the standard deviation of the Gaussian kernel) is equal to the approximate size of the fovea (i.e., one degree of visual angle).

  • Heatmap Similarity (SIM) and Correlation (CC): Similarity is calculated as the histogram intersection between the distributions of two heatmaps (ranges between 0 and 1 where 1 represents identical distributions) and correlation is represented by the Pearson correlation coefficient between them to quantify the degree of linear relationship between their pixel intensities.

  • Stationary Gaze Entropy: Stationary gaze entropy is calculated by organising gaze coordinates in a given time frame into spatial bins of roughly one degree of visual angle (after conversion to the same coordinate space). Shannon’s entropy equation is then applied to this discrete probability distribution of gaze to calculate the average level of uncertainty or the overall predictability of the gaze coordinates. The obtained entropy quantifies gaze dispersion in the given time frame with a higher entropy or uncertainty indicating a wider distribution of gaze across the visual field (Holmqvist et al., 2011).

  • Convex Hull Area: Convex Hull Area provides a dynamic measure of gaze dispersion over time. It is calculated as the area of the smallest convex polygon that encloses all provided gaze samples at a given time point. The convex hull area was initially adapted to eye tracking analysis by Goldberg & Kotval (1999).
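Several of the metrics above reduce to short computations. The sketch below is illustrative only: the function names are hypothetical, heatmaps are assumed to be already smoothed and flattened into equal-length lists of non-negative values, and the entropy bin size is passed in the same units as the gaze coordinates.

```python
import math
from collections import Counter

def similarity(h1, h2):
    """Histogram intersection of two heatmaps, each normalised to sum to 1."""
    s1, s2 = sum(h1), sum(h2)
    return sum(min(a / s1, b / s2) for a, b in zip(h1, h2))

def correlation(h1, h2):
    """Pearson correlation coefficient between two heatmaps."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    var1 = sum((a - m1) ** 2 for a in h1)
    var2 = sum((b - m2) ** 2 for b in h2)
    return cov / math.sqrt(var1 * var2)

def stationary_entropy(points, bin_size):
    """Shannon entropy (bits) of gaze points binned on a square grid."""
    counts = Counter((int(x // bin_size), int(y // bin_size)) for x, y in points)
    n = len(points)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def convex_hull_area(points):
    """Area of the convex hull of 2D gaze points (monotone chain + shoelace)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        hull = []
        for p in seq:
            # Drop points that would make a clockwise or collinear turn.
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull[:-1]

    hull = half_hull(pts) + half_hull(reversed(pts))
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return abs(twice_area) / 2.0
```

Applied per centralview frame to the transformed gaze points of all wearers, `convex_hull_area` gives a time series of group gaze dispersion: small values indicate that the audience is looking at the same region, large values that attention is spread across the scene.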

4. Utility Test

To test the applicability and scalability of our proposed framework with a large audience in a real-world context, we implemented the framework during a public event featuring a musical concert and a film screening. The event was held at the Large Interactive Virtual Environment (LIVELab) at McMaster University, Canada, which is a custom-built research concert hall with fully customisable spatial audio and room acoustics. The LIVELab holds 106 audience members; we had mobile eye-tracking glasses for 30 audience members. Since multi-person eye-tracking at such a scale, with a participant sample size of 60 (across two event days), has never been attempted before, we detail our choices for the system design components and evaluate our proposed analysis metrics to derive insights from multi-person eye movement data.

4.1. Methods

4.1.1. Apparatus

Eye tracking glasses: We employed Pupil Labs Neon eye tracking devices (Baumann and Dierkes, 2023) that apply a deep learning pipeline to provide reliable gaze predictions in unrestricted settings, requiring minimal calibration and setup. At the core of these eye trackers is a hardware module equipped with binocular infrared cameras that capture eye images with a resolution of 192 x 192 pixels at 200 Hz and a front-facing RGB camera that records the worldview with a resolution of 1600 x 1200 pixels at 30 Hz. The module is mounted to a 3D-printed, wearable frame that connects to an accompanying Android smartphone (Motorola Edge 40 Pro). The minimal form factor of the frames is designed to be unobtrusive and effortless for participants, making them a good fit for longer-duration studies. The accompanying Android phone runs a proprietary vendor application, “Neon Companion”, that hosts a suite of HTTP REST Application Programming Interfaces (APIs) enabling remote streaming of data and controlling operations such as start/stop recording. During the data collection event, each pair of eye tracking glasses was connected to the companion smartphone running Pupil Labs’ Neon Companion Android application in the background. The phones were locked and connected to an Anker 6-in-1 USB-C hub that extended the ports to connect them to ethernet, power, and the Neon glasses concurrently. Each smartphone was secured in a bag attached to the respective participant’s chair, ensuring free movement of the seated participant (see Figure 3B). The ethernet cable connected the device to a network switch placed under the false floor, in the centre of the audience rows. All smartphones, the central camera, and the server computers were configured to use the University’s NTP server available on the local network.

Figure 3. Study setup. A represents the stage setup during the concert part of the event. The film was presented on the video wall visible behind the performers. During the film, the stage was cleared and/or covered. B shows audience chairs with eye tracking glasses attached to each seat. The eye tracking glasses were connected to companion smartphones that were secured in pockets attached on either side of each pair of seats.

Central view camera: A PTZOptics pt30x-sdi-g2 camera was mounted central to the stage, positioned behind the audience seating. The camera captured an unobstructed view of the stage with a resolution of 720x480 px at 60 Hz. The camera feed was streamed on the network via Real-Time Streaming Protocol with H264 encoding. Figure 3A demonstrates an example frame from the central camera.

Presentation screen: The film was presented on a 14.5 ft (384.5 cm x 216 cm) Samsung LH015IER LED cabinet with a resolution of 1920x1080 pixels at a 60 Hz refresh rate and a pixel pitch of 1.5 mm.

Computing servers: The GlassesRecord and CentralCam modules were executed on separate desktop servers running Ubuntu 18.20 LTS Linux distribution. Note that both modules have low CPU requirements and could be easily executed on a single general-purpose server, as long as sufficient storage is available for long-duration recordings. We leveraged the modular structure of our framework to run the modules on separate servers to allow separate operators and monitors for the two modules.

Network Infrastructure: Each pair of Neon glasses requires a bandwidth of 6-8 Megabytes per second (MBps) to stream gaze and worldview data. For real-time streaming operations, it is critical to consider the available bandwidth on a network to identify how many concurrent eye tracking glasses’ data can be streamed. In this test, we used the recording mode of operation which has lower bandwidth requirements, since real-time data streams do not need to be accessed on the server. Nonetheless, we leveraged the stationary positioning of participants in our setup to provide precise timing information and optimal network reliability by connecting each eye tracking device to a network switch (TP Link TL-SG1048) via ethernet. The network did not have internet access, to disable any background network activity on the connected Android smartphones.
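As a rough illustration of the bandwidth consideration above, the following sketch estimates how many devices could stream concurrently over a shared link. Only the 6-8 MBps per-device figure comes from the text; the 1 Gbps link capacity and the 80% usable fraction are illustrative assumptions, not measurements from our setup.

```python
import math


def max_concurrent_devices(link_mbit_per_s, per_device_mbyte_per_s=8.0,
                           usable_fraction=0.8):
    """Estimate how many devices can stream over one shared link.

    link_mbit_per_s: nominal link capacity in megabits per second (assumed).
    per_device_mbyte_per_s: stream size per device in megabytes per second.
    usable_fraction: share of the nominal capacity assumed to be usable.
    """
    usable_mbyte_per_s = (link_mbit_per_s / 8.0) * usable_fraction
    return math.floor(usable_mbyte_per_s / per_device_mbyte_per_s)


# Hypothetical 1 Gbps uplink, worst-case (8 MBps) and best-case (6 MBps) streams.
print(max_concurrent_devices(1000, per_device_mbyte_per_s=8.0))
print(max_concurrent_devices(1000, per_device_mbyte_per_s=6.0))
```

Under these assumptions, a single gigabit uplink would not carry real-time streams from all 30 devices at once, which is one reason the recording mode is attractive for larger groups.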

Storage: Worldview videos and gaze data were recorded and stored locally on each smartphone through the Neon companion app. The recorded data was later uploaded to Pupil Cloud for post-processing of the recorded raw data and then downloaded for further analysis. Time offsets and logs from the GlassesRecord module were stored locally on the server running the module. Centralview video and metrics recorded by the CentralCam module were stored locally on the server running the CentralCam module.

4.1.2. Implementation of proposed system design in Recording mode

GlassesRecord: We implemented the GlassesRecord module to be compatible with the Neon glasses using the pupil-labs-realtime-api Python library (https://pupil-labs-realtime-api.readthedocs.io) to send start, stop, discard, and save triggers for a recording to each device. Additionally, since the accompanying smartphones run Android, we used Android Debug Bridge (ADB) to retrieve status metrics including battery level, storage level, network ping, ADB server status, and connected USB devices (see Figure 4A). The status metrics facilitated continuous monitoring of devices and recording status. ADB was also used to remotely control the Android smartphones for troubleshooting and for verifying indicators of failed or corrupted recordings that were not reported by the vendor application. These operations were important to prevent data loss and to ensure that we would never need to interrupt the live social event to access a participant’s phone in the event of an error. The module operations were exposed via an intuitive Terminal User Interface (TUI) built using the Python Textual library (https://textual.textualize.io/), providing low runtime requirements and cross-platform operation. The TUI (see Figure 4) displays a list of all eye tracking devices found on the local network, along with accompanying colour-coded status fields for each device. An action panel at the bottom displays a list of operations (actions) that can be performed on selected devices by pressing the corresponding action key. Multiple devices can be selected at a time on the TUI using the following hotkeys: Spacebar to select a single device; Shift + Arrow Up/Down to expand the selection up/down; Ctrl+A to select all devices; and the exclamation mark (!) to invert the selection.
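To illustrate the kind of ADB-based status polling described above, the sketch below reads a battery level via `adb shell dumpsys battery`. It is a minimal example under stated assumptions, not the GlassesRecord implementation: the function names, the sample output, and the 10-second timeout are hypothetical, and `adb` is assumed to be on the server's PATH.

```python
import re
import subprocess


def parse_battery_level(dumpsys_output):
    """Extract the battery percentage from `dumpsys battery` text, or None."""
    match = re.search(r"^\s*level:\s*(\d+)\s*$", dumpsys_output,
                      flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def query_battery_level(serial):
    """Query one device (identified by its ADB serial) for its battery level."""
    result = subprocess.run(
        ["adb", "-s", serial, "shell", "dumpsys", "battery"],
        capture_output=True, text=True, timeout=10, check=True,
    )
    return parse_battery_level(result.stdout)


# Hypothetical excerpt of `dumpsys battery` output, used here for illustration.
SAMPLE_OUTPUT = """Current Battery Service state:
  AC powered: false
  USB powered: true
  level: 87
  scale: 100
"""

print(parse_battery_level(SAMPLE_OUTPUT))
```

Separating the parser from the `adb` call keeps the parsing logic testable without a connected phone; a monitoring loop could call `query_battery_level` for each device serial at a fixed interval.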

Figure 4. Data recording and processing interfaces. A. Terminal User Interface (TUI) for the GlassesRecord module used to monitor and operate all 30 devices concurrently. B. Command Line Interface for offline analysis and visualisation of the recorded data.

CentralCam: The CentralCam module was implemented with the OpenCV (Bradski, 2008) and FFMPEG (https://ffmpeg.org) libraries to retrieve the RTSP stream from the central camera. The incoming stream was encoded with an H264 codec and written to an MPEG-4 container. A tolerance of 600 dropped frames was added so that intermittent packet drops on the network are ignored before the module terminates recording. FPS was calculated over a window of 180 frames and stored with other metrics in a CSV file.
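The drop tolerance and windowed FPS calculation can be sketched as follows. This is an illustrative stand-in rather than the CentralCam code: the class name is hypothetical, and we assume here that the tolerance counts consecutive dropped frames (the counter resets after each successful read), which the text does not specify.

```python
from collections import deque


class StreamMonitor:
    """Track dropped frames and a windowed FPS estimate for a video stream."""

    def __init__(self, fps_window=180, max_dropped=600):
        # Timestamps (in seconds) of the most recent successfully read frames.
        self.timestamps = deque(maxlen=fps_window)
        self.max_dropped = max_dropped
        self.dropped = 0

    def update(self, frame_ok, timestamp):
        """Register one read attempt; return False when recording should stop."""
        if not frame_ok:
            self.dropped += 1
            return self.dropped <= self.max_dropped
        self.dropped = 0  # assumption: only consecutive drops are counted
        self.timestamps.append(timestamp)
        return True

    def fps(self):
        """Frames per second over the current window, or None if undefined."""
        if len(self.timestamps) < 2:
            return None
        span = self.timestamps[-1] - self.timestamps[0]
        return (len(self.timestamps) - 1) / span if span > 0 else None
```

In a capture loop, `update` would be called with the success flag returned by each frame read and the current clock time, and `fps()` would be logged alongside the other metrics.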

Homography, Analysis, and Visualisation: We applied a state-of-the-art deep learning model, SuperGlue (Sarlin et al., 2020), to find robust feature point correspondences in our Homography module. The methods of the Analysis and Visualisation modules were implemented in Python 3.12. Blinks were identified using the Neon blink-detection algorithm and filtered out of the raw gaze data before analysis and visualisation. As with the GlassesRecord module, we developed an interactive command line utility using the Python questionary library (questionary.readthedocs.io; see Figure 4B) to interact with the three modules. The offline utility formats the directory structure for compatibility with the modules, allows a subset of recordings to be selected for operations such as homography, visualisation, and gaze metrics calculation, and stores the output to the local filesystem.

Visual angle calculation: Visual angle conversion was done by estimating the distances of each of the three participant rows from the videowall. The distances were 758 cm for the first, 885 cm for the second, and 1012 cm for the third participant row. Since the physical dimensions of the video wall were 384.5 cm in width and 216 cm in height, the diagonal (441.02 cm) subtended a visual angle of 32.4° for the first, 28° for the second, and 24.6° for the third row. From the worldview recordings of participants in each row, the diagonal of the videowall was estimated to be 515 pixels for the first, 441 pixels for the second, and 396 pixels for the third row. The pixel-to-visual-angle ratio was therefore calculated to be approximately 16 (i.e. one visual degree was represented by roughly 16 pixels in the 1600x1200 resolution worldview recordings).
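The conversion can be reproduced with the standard visual angle formula, θ = 2·atan(s / 2d), where s is the object size and d the viewing distance. The sketch below uses the wall dimensions, row distances, and pixel diagonals reported above; the variable and function names are our own.

```python
import math


def visual_angle_deg(size_cm, distance_cm):
    """Visual angle (degrees) subtended by an object of a given size."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))


WALL_W_CM, WALL_H_CM = 384.5, 216.0
diag_cm = math.hypot(WALL_W_CM, WALL_H_CM)  # physical diagonal of the videowall

# Participant row -> (distance to videowall in cm, wall diagonal in worldview pixels)
rows = {1: (758, 515), 2: (885, 441), 3: (1012, 396)}

for row, (distance_cm, diag_px) in rows.items():
    angle = visual_angle_deg(diag_cm, distance_cm)
    print(f"row {row}: {angle:.1f} deg, {diag_px / angle:.1f} px/deg")
```

All three rows give a ratio close to 16 pixels per degree, which is why a single conversion factor is used for the worldview recordings.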

4.1.3. Stimuli

Music Performance: American percussionist-composers, Allen Otte and John Lane, performed the Canadian premiere of The Innocents – an hour-long performance art piece centred on the issue of wrongful incarceration. The work might best be described as social justice advocacy through performance art. The hour-long composition consists of home-made instruments, spoken texts, and electronic soundscapes. Across 17 tableaux–many of which elide into one another–the audience is presented with dramatic soundscapes that explore issues related to the American criminal justice system, such as mistaken identity, injustice, truth, politics, psychology, and resilience. Some tableaux are too long, uncomfortable, confusing; others are direct, melodic, and in familiar styles.

Film: Otte and Lane’s The Innocents is now the subject of an award-winning feature-length (1hr 20min) documentary film, directed by Wojciech Lorenc. The film showcases Otte and Lane’s compositional process and friendship, following them on the road as they tour the United States performing. On tour, we witness conversations with relevant stakeholders, such as lawyers, non-profit organisations, students, and exonerees. Woven throughout the film is the story of exoneree Anna Vasquez who served 13 years in prison for a crime that never occurred. We witness Anna tell parts of her story and ultimately her reaction to watching Otte and Lane’s live performance.

4.1.4. Study Design

The current utility test takes place within the context of a larger project examining the role of context information in altering audience physiological responses and social attitudes and behaviours. Specifically, counterbalancing across two different event days, we manipulated the order of the film vs. the performance [film-performance or performance-film]. Audience members could buy tickets for either event day and were aware that both the film and performance would occur; however, we never advertised the order, or the fact that we were manipulating order.

During the event, in addition to wearing eye-tracking glasses, participants also wore a smart watch to monitor cardiac activity and completed a series of surveys. Parallel to this in-person event, we also presented a livestream online, for which we recorded all of the same measures (eye, heart, survey) via web browser and webcam (Saxena et al., 2022, 2023). We were interested in the effect social co-presence might have on our measures of interest. Detailed analyses related to our cognitive neuroscientific and social psychological questions will be reported elsewhere. In the current study, we focus on validating the utility of our proposed system design.

4.1.5. Procedure

The study took place over 2 days (April 2nd and 4th, 2024), on which the same event was repeated in the opposite order: film before the performance on day one and film after the performance on day two. In the following text, we refer to the four resulting recording sessions, in chronological order, as Group1 Film (G1F), Group1 Performance (G1P), Group2 Performance (G2P), and Group2 Film (G2F). The event was open to the general public. Participants were recruited either after purchasing advance tickets before the event, or upon arrival on-site at each event. The study was approved by the McMaster Research Ethics Board (MREB #1975) and all participants provided consent before participating. The same procedure was repeated on both days.

Before the event, an ADB connection was established between the server and each smartphone device. A forced synchronisation to the assigned NTP server was also performed on the smartphones. Participants were seated evenly across rows two to four in the audience, hereafter referred to as participant-rows one to three. After taking their seats, participants were instructed on how to put the glasses on. Trained research assistants completed a quick calibration procedure in the Neon Companion application to correct any static gaze offsets. Participants retained their seating positions over the entire study duration.

To ensure proper connection to all eye tracking devices, test recordings for a short length were made before the actual recording session was started. We started the recording session before the event host came on stage. After the start of the event, no interruptions were allowed and all interactions with the recording devices were done remotely using the GlassesRecord module TUI to trigger recordings and monitor the eye tracking devices (see Figure 4A). The CentralCam module was used to trigger recording on/off on the central camera.

In between film and performance (or vice versa), there was an intermission of 30 minutes, during which we stopped recording and participants took off their glasses, completed a short survey, went to the lobby, used the restroom, etc. The central camera recording was not stopped during event intermission. After intermission, the same calibration procedure was repeated before the second half of the event started. An additional calibration procedure where participants followed a moving calibration target on the videowall with their eyes was also conducted before the second half. Recordings were stopped when the host came onstage at the end of the second half to close the event. The event started at 7 pm and was, on average, 3 hours and 15 minutes long for each of the two days (including intermission).

4.1.6. Participants

A total of 60 people participated in the study (30 on each day, no repeated participants). Data from one participant was only recorded for half of the study due to dropout at intermission. Of the remaining 59 participants, 38 self-identified as women and the reported mean age was 34 years, ranging from 16 to 82 years. All participants had normal or corrected-to-normal vision (via contact lenses; participation while wearing another pair of glasses was not possible).

4.2. Results

4.2.1. Time Synchronisation

Offsets: Time offsets and mean roundtrip durations between each device and the server were recorded at regular intervals of 10 seconds during each of the four sessions. The mean offset over all devices was 19.58 ms (SD = 8.50 ms) for G1F, 45.57 ms (SD = 16.45 ms) for G1P, 18.52 ms (SD = 8.46 ms) for G2P, and 37.65 ms (SD = 16.14 ms) for G2F; see Figure 5A (left panel). Mean offsets were computed after outlier rejection of 3 data points that had high offsets (-277.4 ms, -261.4 ms, and 277.3 ms) due to a failure in the initial time synchronisation procedure for the respective devices. The higher offset of the outliers, however, does not hinder their time synchronisation with other devices, since the relative offset as well as the over-time drift of the offset were recorded. The mean roundtrip duration over all 30 devices was 0.51 ms (SD = 0.04 ms) for G1F, 0.52 ms (SD = 0.03 ms) for G1P, 0.49 ms (SD = 0.03 ms) for G2P, and 0.53 ms (SD = 0.02 ms) for G2F; see Figure 5A (right panel).

Figure 5. A. Mean time offset (left) and mean roundtrip duration (right), in ms, recorded for the two sessions on each day. The coloured dots represent offsets at measured timepoints, (approx. 10 sec intervals); black dots represent mean values with the error bars representing standard error. B. Offset drift calculated for each device in each of the four sessions. The vertical bars represent standard deviation, horizontal bars represent the mean and individual dots represent single data points at measured timepoints.

Drifts: Accumulated drifts in time offsets were calculated as the difference between the offset at a given time point and the initial offset. The mean drift over all 30 devices was 8.58 ms (SD = 6.48 ms) for G1F, 8.48 ms (SD = 5.54 ms) for G1P, 11.52 ms (SD = 6.53 ms) for G2P, and 9.92 ms (SD = 6.88 ms) for G2F.
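The drift definition above amounts to subtracting the first offset sample of a session from every later sample. A minimal sketch follows; the example offsets are made-up values for illustration, and the drift is kept signed here, which is an assumption about how the summary statistics were formed.

```python
from statistics import mean, stdev


def offset_drift(offsets_ms):
    """Drift of each offset sample relative to the first sample of the session."""
    initial = offsets_ms[0]
    return [offset - initial for offset in offsets_ms]


# Hypothetical offsets (ms) for one device, sampled roughly every 10 seconds.
example_offsets = [20.0, 21.5, 19.0, 24.0, 26.5]
drift = offset_drift(example_offsets)
print(drift, mean(drift), stdev(drift))
```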

RANSAC Fit: The mean accuracy score of the linear fits of the 30 devices for each session was 0.95 (SD = 0.03) for G1F, 0.95 (SD = 0.04) for G1P, 0.95 (SD = 0.06) for G2P, and 0.95 (SD = 0.04) for G2F.

Overall, these data show relatively low offsets and drifts across all 30 devices and recording sessions. As expected, mean offsets increase from the first to the second recording session for both groups (Fig. 5A; left panel), while drifts appear to be device-specific rather than session-specific. Regardless of the actual calculated offsets and drifts, our RANSAC approach allows us to accurately align the timestamps between each eye tracking mobile device and our central camera.
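To make the idea of a robust linear fit to the offset time series concrete, the sketch below implements a small, self-contained RANSAC line fit in pure Python. It is a simplified stand-in, not the implementation used in our framework: the inlier threshold, iteration count, and synthetic data are illustrative assumptions.

```python
import random


def fit_line(points):
    """Ordinary least-squares line through (time, offset) points."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_o = sum(o for _, o in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    cov = sum((t - mean_t) * (o - mean_o) for t, o in points)
    slope = cov / var_t
    return slope, mean_o - slope * mean_t


def ransac_line(points, n_iter=200, threshold_ms=5.0, seed=0):
    """Fit offset = slope * time + intercept while ignoring outlying samples."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iter):
        (t1, o1), (t2, o2) = rng.sample(points, 2)
        if t1 == t2:
            continue  # a vertical candidate line cannot model offset over time
        slope = (o2 - o1) / (t2 - t1)
        intercept = o1 - slope * t1
        inliers = [(t, o) for t, o in points
                   if abs(o - (slope * t + intercept)) <= threshold_ms]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refine the model with least squares on the largest consensus set.
    return fit_line(best_inliers)


# Synthetic session: slow linear drift plus one failed synchronisation sample.
clean = [(float(t), 20.0 + 0.002 * t) for t in range(0, 600, 10)]
data = clean[:30] + [(clean[30][0], 277.3)] + clean[31:]
slope, intercept = ransac_line(data)
print(slope, intercept)
```

Because the outlying sample never joins the consensus set, the fitted line follows the underlying drift, and the offset at any device timestamp can then be interpolated from the fitted slope and intercept.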

4.2.2. Visualisation

Figure 6. A. An example of feature matching from one frame of one participant’s worldview (left) onto the centralview (right). The recorded gaze corresponding to this frame is represented by the maroon and white circle. B. Multi-person views generated with the visualisation module. Top: All participants’ egocentric worldviews and gazes (small white and black circles) are displayed around the perimeter. Using the computed homography matrices for each participant, we project each participant’s gaze onto the centralview (centre). In the centre panel, participants’ gazes are represented with uniquely coloured circles with white in the middle to aid visibility. Bottom: Same as top, with the centre grid cell displaying a heatmap of the 2D gaze density of all participants looking at the scene. Higher intensities in the heatmap (intensity increases from blue to red, with red being the highest) represent a higher proportion of the participants looking at that location. C. Same as centre panels in B, but resized to aid visibility.

Raw egocentric gaze coordinates and homography-transformed gaze coordinates were projected onto each viewer’s worldview and the shared centralview, respectively, after removal of blinks. The visualisations (presented in Figure 6) provide a quick data check and demonstrate accurate projective transforms from the SuperGlue features identified in the Homography module (see 3.2.5). The transforms reliably handle video artefacts, such as motion blur (from sudden head movements), poor lighting/texture (frames with large dark regions, shadows, lens flare, glare, etc.), obstructions (audience members in previous rows), and scale variations (different resolution and zoom in centralview and worldviews). However, further investigation and optimisation of edge cases in which the projective transformation might fail remain possible and are outside the scope of the current paper. Figure 6 displays still, single-frame examples of our transformed data; an example video can be viewed here: https://tinyurl.com/multipersonET. In the first part of this video example (tableau: “canjo”), the two performers are seated very near to each other. Even at this close distance, we can clearly distinguish gaze to one or the other performer, vs. their instruments, and follow shifts in audience attention between these elements. When this “canjo” tableau ends, the two performers spread out and we see the gaze follow them around the stage. Once one of the performers starts speaking (tableau: “pod rattle incident”), the audience mostly maintain their focus on him, rarely glancing toward the other performer, who is gently striking metallic objects. This video highlights the observation that the transformed gaze reliably follows the two musicians around the stage and provides proof-of-concept that our homography approach successfully projects gaze from a large audience in challenging scenes.
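For readers unfamiliar with projective transforms, the sketch below shows how a single gaze point is mapped from worldview pixel coordinates to centralview coordinates once a 3x3 homography matrix H is available. Estimating H from the matched feature points is not shown, and the example matrix is a hypothetical pure-scaling transform chosen only to make the arithmetic easy to follow.

```python
def project_gaze(H, x, y):
    """Map a point (x, y) through a 3x3 homography H given as nested lists.

    Returns the projected (u, v), or None if the point maps to infinity.
    """
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        return None
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return u / w, v / w


# Hypothetical homography: rescales a 1600x1200 worldview to a 720x480 centralview.
H_example = [
    [0.45, 0.0, 0.0],
    [0.0, 0.40, 0.0],
    [0.0, 0.0, 1.0],
]

print(project_gaze(H_example, 800, 600))
```

In practice H changes from frame to frame as the wearer's head moves, so one matrix is estimated per worldview frame and applied to the gaze samples belonging to that frame.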

Heatmap analysis: Heatmaps help in analysing the 2D distribution of gaze over a time period. We present the individual gaze heatmaps of each participant for each of the four recording sessions in Figures 7 and 8, generated using the analysis module. Figure 7 shows the raw egocentric gaze data recorded from the eye tracking glasses in the worldview frame of reference. This egocentric gaze data is relative to the wearer’s head movements and demonstrates low variance in both horizontal and vertical directions with a strong central bias. This effect could be explained by the high coupling between eye and head movements commonly observed in real world exploration (Einhäuser et al., 2007; Tatler, 2007).

To study gaze patterns over an evolving scene and to compare gaze data between viewers, it is therefore necessary to map gaze from the viewer’s frame of reference to the scene’s frame of reference. We achieve this with our Homography module (see 3.2.5), which maps each participant’s gaze to the shared centralview space. Figure 8 presents individual gaze heatmaps from all participants and the aggregate gaze activity from all participants for each session, overlaid on an example frame from that session. Projecting the gaze to a central perspective removes the influence of each participant’s individual position and head movements, and thereby removes the central bias. Heatmaps of transformed gaze coordinates (Figure 8) demonstrate higher gaze dispersion and multiple discrete clusters for each session duration. These clusters coincide with the presentation screen area for the film viewing sessions and the performers’ movement activity for the concert performance sessions (see averaged heatmaps for each session in Figure 8). Higher gaze dispersion can be observed during performance viewing than during film viewing, on both days, which could be partially explained by the larger physical space covered on the stage by the performers during the concert, in comparison to the presentation size for the film (videowall screen).

Investigating the gaze heatmaps for different participant seating clusters, we observe that participants in the right half of the audience spent more time exploring the right side of the stage, reflected by larger (higher gaze dispersion) and more intense (higher gaze duration) clusters in the right half and vice versa (see Figure 9A; heatmaps are arranged by seating order). This effect is, however, only observed in the performance sessions and not when the participants viewed the pre-recorded documentary film. The heatmaps could further be aggregated to observe similar effects with different participant clusters, such as clustering by participant columns (Figure 9A) or rows (Figure 9B). Figure 9 highlights different trends in the position of the heatmap centroid (marked by the intersection of white dashed lines; i.e., the peak heatmap intensity location), based on seating position.

Figure 7. Egocentric gaze heatmaps. A. Participant-wise gaze heatmaps for each of the four event sessions. The heatmaps are arranged according to the seating order on both days where the top row (Row 1) is closest to the stage and the first column (from the left) is the leftmost participant. The sessions are arranged chronologically from top to bottom panel. B. Averaged gaze heatmaps calculated as the mean over all participants for each session.

Figure 8. Transformed gaze heatmaps. A. Participant-wise gaze heatmaps for each of the event sessions. The heatmaps are arranged according to the seating order on both days where the top row (Row 1) is closest to the stage and the first column (from the left) is the leftmost participant. The sessions are arranged chronologically from top to bottom. B. Averaged gaze heatmaps calculated as the mean over all participants for each session.

Figure 9. Grouped average of homography-transformed gaze heatmaps. Centres for both x and y coordinate axes are represented by the black ticks on the bottom and left axes respectively. The axes are normalized with the centers marked as 0.5. The white dashed lines represent the location of peak heatmap intensity in the respective coordinate axis. A shows column-wise averages of all rows over the two event days. B shows row-wise averages of all columns of the two days. C shows heatmap averages for the left (left column) and right (right column) halves of the audience (averaging all rows for columns 1-5 and 6-10 respectively).

4.2.3. Analysis

Spatial metrics: Further comparisons between individual heatmaps can be made with the help of metrics from the Analysis module (see 3.2.7). For example, pairwise SIM and CC scores were calculated for each participant pair for each of the four event sessions. For raw egocentric gaze, the mean SIM score during film viewing was 0.35 (SD = 0.21) for G1F and 0.38 (SD = 0.19) for G2F, while the mean SIM during the performance was 0.40 (SD = 0.18) for G1P and 0.43 (SD = 0.17) for G2P. The mean CC was 0.37 (SD = 0.30) for G1F, 0.42 (SD = 0.28) for G2F, 0.43 (SD = 0.26) for G1P, and 0.46 (SD = 0.25) for G2P. When using homography-transformed gaze, the scores in all sessions, for both of these metrics, increase drastically (see Figure 10; right column), nearly doubling in most cases.

A 2D chart of seat positions (row, column) for each participant (see Figures 7 and 8 for audience seating positions on each day) was used to calculate Euclidean distances (DI) between each participant pair. Calculated SIM scores on the homography-transformed gaze were found to be negatively correlated with the Euclidean distances during the performance session on both days (i.e., higher SIM values for smaller seat distances between participants in a pair and vice versa). The Pearson correlation coefficient between SIM and DI was -0.10 (p = 0.039) for G1P and -0.17 (p < 0.001) for G2P. These comparisons highlight the critical role of the homography transformation proposed in our framework for a rich analysis of multi-person eye tracking datasets. The trends and observations revealed by the transformed gaze data are not available from the raw egocentric gaze; gaze dispersion similarity between participants is much lower in the independent egocentric gaze coordinates and does not highlight clear patterns between participant pairs or groups.
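The pairwise measures used in this comparison can be sketched as follows. We assume here the definitions commonly used in the saliency literature: SIM as the histogram intersection of two heatmaps that are each normalised to sum to one, and CC as the Pearson correlation between the flattened heatmaps. The function names and toy heatmaps are hypothetical.

```python
import math


def _flatten(heatmap):
    """Flatten a 2D heatmap (list of rows) into a single list of values."""
    return [value for row in heatmap for value in row]


def sim(map_a, map_b):
    """Histogram intersection of two heatmaps after normalising each to sum 1."""
    a, b = _flatten(map_a), _flatten(map_b)
    sum_a, sum_b = sum(a), sum(b)
    return sum(min(x / sum_a, y / sum_b) for x, y in zip(a, b))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def cc(map_a, map_b):
    """Correlation coefficient between two heatmaps."""
    return pearson(_flatten(map_a), _flatten(map_b))


def seat_distance(seat_a, seat_b):
    """Euclidean distance between two (row, column) seat positions."""
    return math.dist(seat_a, seat_b)


# Toy 2x2 heatmaps for two hypothetical participants.
heat_1 = [[1.0, 2.0], [3.0, 4.0]]
heat_2 = [[4.0, 3.0], [2.0, 1.0]]
print(sim(heat_1, heat_2), cc(heat_1, heat_2), seat_distance((1, 1), (2, 3)))
```

Collecting `sim` and `seat_distance` over all unique participant pairs yields two equal-length lists, and `pearson` applied to those lists gives the kind of SIM-DI correlation reported above.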

Entropy and avg. velocity: The mean gaze entropy over all participants during film viewing was 0.76 (SD = 0.03) for G1F and 0.77 (SD = 0.03) for G2F. The mean entropy was higher when participants watched the performance, with a value of 0.82 (SD = 0.03) for G1P and 0.83 (SD = 0.03) for G2P. The higher values represent more spatial gaze dispersion when participants watch the performance as compared to the documentary film. The average gaze velocity (calculated in pixels/sample) during film viewing was found to be 22.99 (SD = 7.25) for G1F and 36.00 (SD = 5.90) for G2F. During the performance sessions, the average gaze velocity was 16.76 (SD = 6.79) for G1P and 22.11 (SD = 8.03) for G2P. Interestingly, gaze velocity was higher during film viewing on both days, which suggests larger saccadic movements when viewing pre-recorded media on a screen than when viewing a live concert performance.
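As an illustration of these two individual measures, the sketch below computes a Shannon entropy of a gaze heatmap, normalised by the maximum possible entropy so that values fall between 0 and 1, and a mean gaze velocity in pixels per sample. The normalisation and the function names are assumptions made for this example; the exact definitions are those of the Analysis module (see 3.2.7).

```python
import math


def normalised_entropy(heatmap):
    """Shannon entropy of a 2D gaze histogram, scaled to the range [0, 1]."""
    cells = [value for row in heatmap for value in row]
    total = sum(cells)
    probs = [value / total for value in cells if value > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    # Assumption: normalise by the entropy of a uniform distribution over cells.
    return entropy / math.log2(len(cells))


def mean_velocity(gaze_points):
    """Mean Euclidean displacement between consecutive gaze samples (px/sample)."""
    steps = [math.dist(p, q) for p, q in zip(gaze_points, gaze_points[1:])]
    return sum(steps) / len(steps)


# Toy inputs: an evenly spread histogram and a short gaze trace in pixels.
print(normalised_entropy([[1, 1], [1, 1]]))
print(mean_velocity([(0, 0), (3, 4), (3, 4)]))
```

A more dispersed heatmap yields a value closer to 1, which is the sense in which higher entropy reflects more spatial gaze dispersion.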

Figure 10. Gaze entropy (top left), average gaze velocity (bottom left), pairwise heatmap similarity (top right) and pairwise heatmap correlation (bottom right) measures computed for each of the two sessions on both days. Coloured dots in the left plots represent individual participant-wise entropy and velocity measures. Coloured dots in the right plots represent SIM and CC for all unique participant-pairs. Black dots in all plots represent means and error bars represent confidence intervals for the respective mean values. Gaze entropy and velocity give a measure of individual gaze variability for each participant and heatmap similarity (SIM) and correlation (CC) provide an estimate of similarities within participants’ gaze exploration for each session.

Temporal metrics: A major contribution of our framework is the ability to study ocular activity over extended durations, as opposed to controlled short-duration trials in lab experimentation. It allows investigating long-term effects in complex tasks and capturing group dynamics in naturalistic social interactions. Leveraging the precise time synchronisation achieved with our framework, instantaneous ocular activity from multiple people can be investigated. For example, though we can report that the average blink duration for film is 302.93 (SD = 68.55) ms and performance is 306.49 (SD = 69.55) ms, it is much more interesting to be able to analyse blinks in a time resolved manner (Lange and Fink, 2023). Figure 11 demonstrates blinks from all 30 participants in each of the four recording sessions at discrete time points, over an example 1 minute duration. It highlights the variability of individual blink frequency and duration between participants in each session as well as the aggregate blink activity changes with time as each session progresses. Analysing such activity over the entire film or performance (each an hour+ in duration), will offer rich insights into collective information chunking, event segmentation, fatigue, immersion, etc.

Figure 11. Blinks recorded from individual participants during a one-minute segment of each of the four sessions. Each horizontal bar represents the start and end of a blink, with the length representing the blink duration. The bottom row in each of the four plots represents a monochrome heatmap of the overall blink activity over time, with darker regions representing more people blinking and lighter regions representing fewer people blinking at that instant.

The synchronised gaze data from multiple devices mapped to a common coordinate space (centralview) further facilitates novel spatiotemporal investigations of collective gaze measures over time. In Figure 12, we visualise the measures of standard deviation, minimum convex area, and points in frame, over the entire duration of a film-viewing session (1 hr 20 min film, plus some minutes before/after) from 30 people (G1). These measures, computed by the Analysis module (see 3.2.7), reveal collective gaze dynamics over time. Standard deviation of the gaze distribution of all participants at any given time is represented in the horizontal and vertical directions by the “SD x” and “SD y” measures (top and second panels, respectively). The “Normalized convex area” (third panel) represents overall gaze dispersion as the area of a minimum convex hull enclosing gaze from all participants. The number of gaze points falling within the centralview scene at a given timepoint is represented by the “Points in Frame” measure (bottom panel). The time series of all four measures highlight clear event boundaries such as the start and stop of the film presentation (vertical lines). Future analyses will explore which features of the film drive changes in these collective gaze dynamics.
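The four collective measures can be sketched for a single centralview frame as follows. This is an illustrative pure-Python version rather than the Analysis module itself; in particular, we assume that only gaze points falling inside the frame contribute to the standard deviations and the convex hull, and that the hull area is normalised by the frame area.

```python
from statistics import pstdev


def convex_hull(points):
    """Convex hull via Andrew's monotone chain; returns vertices in order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon using the shoelace formula."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def collective_measures(points, width, height):
    """SD x, SD y, normalised convex area, and points in frame for one frame."""
    inside = [(x, y) for x, y in points if 0 <= x < width and 0 <= y < height]
    if not inside:
        return {"sd_x": 0.0, "sd_y": 0.0,
                "norm_convex_area": 0.0, "points_in_frame": 0}
    return {
        "sd_x": pstdev([x for x, _ in inside]),
        "sd_y": pstdev([y for _, y in inside]),
        "norm_convex_area": polygon_area(convex_hull(inside)) / (width * height),
        "points_in_frame": len(inside),
    }


# Toy frame: five gaze points inside a 20x20 frame and one outside it.
frame_points = [(0, 0), (10, 0), (10, 10), (0, 10), (5, 5), (-3, 2)]
print(collective_measures(frame_points, width=20, height=20))
```

Applying `collective_measures` to every centralview frame yields one time series per measure, which can then be smoothed (for example with a rolling mean) before plotting.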

Figure 12. Collective gaze measures over the entire duration of one of the sessions (film viewing on day one). The orange vertical lines represent the start of the film; brown represents the end. The measures are computed by aggregating gaze points from all 30 participants for each frame in the centralview. From top to bottom, the measures are the standard deviation of points in the horizontal direction, standard deviation in the vertical direction, area of a minimum convex hull enclosing all points, and number of points within the centralview frame boundaries. All time series are smoothed with a rolling mean window size of 150 samples to filter high frequency noise.

In another example, visualising the Normalised Convex Area measure for the entire concert performance (Figure 13) shows similar collective gaze behaviour from audiences during each performance tableau (represented by vertical coloured lines) for each participant group. These collective gaze measures and corresponding time series could be further analysed to investigate temporal interactions with the audiovisual stimulus features (outside the scope of the current analysis). We provide these examples here to highlight the wide range of possibilities offered by shared gaze metrics in dynamic social settings.

Figure 13. A. Normalised contour area of the minimum convex hull enclosing all gaze points at each time point during the concert performance on both days (G1P: top; G2P: bottom). The performance consisted of 17 different tableaux; the dashed lines represent manually-annotated start times of each tableau in the performance sequentially with additional first and last lines representing start (performers entering stage) and end (end of last tableau) of the performance respectively. The colours correspond to the tableau number. All time series are smoothed with a rolling mean window size of 150 samples to filter high frequency noise for visualisation. B. Cross-Correlation matrix between each pair of tableaux. Green represents higher correlation between the normalised contour area metric during the two tableaux and red represents low correlation. C. Mean contour area of the time series segment for each tableau. The changes in the mean values are similar across the two days (G1P: Blue; G2P: Orange).

5. General Discussion

Results from our utility study demonstrate the end-to-end application of our framework in real-world social settings. The framework tackles four critical challenges: 1) time synchronisation of multi-sensor data to scale data collection, 2) monitoring and troubleshooting to ensure reliable data collection over long durations, 3) semantic gaze mapping of the synchronised gaze data to analyse collective gaze patterns, and 4) shared gaze visualisation and analysis to derive novel insights from the collected data. The proposed framework aims to reduce technological and logistical barriers to conducting naturalistic eye-tracking experiments with large social groups, allowing replication, refutation, and extension of previous lab studies in social, naturalistic contexts.

The proposed method conceptualised and tackled key challenges related to multi-person eye tracking, yet a major external dependency of our framework is the choice of eye-tracking hardware. We acknowledge that affordability and the availability or development of the required functionality in existing eye trackers remain substantial barriers. We chose Pupil Labs Neon devices for our utility study over market competitors because of their remote API control, minimal calibration requirements, and the availability of open-source gaze-processing components. However, in our initial tests we encountered an array of operational issues with the proprietary vendor application that severely hindered reliable recording from the devices. Solving these issues required hours of testing, troubleshooting, or downscaling. For example, the Neon devices can also record device orientation and movement at 110 Hz using the onboard 9-DoF Inertial Measurement Unit (IMU), as well as stereo audio from onboard microphones; we had to omit these data streams to allow reliable recording of the gaze and video data streams, which were our priority. IMU failures would crash the mobile application even though the eye data were intact, and audio recording could not be reliably started remotely through API calls, as it requires the Android device to be unlocked. These issues highlight the early stage of multi-person eye-tracking hardware and software, since these systems are mostly tested for single-person use cases. Moving forward, the technology would benefit from more extensive testing at the scale we propose in this paper. Further, the availability of open-source implementations would allow greater customizability for application-specific requirements and thereby contribute to adoption in novel settings.

Future objectives for our framework include further investigating and optimising gaze mapping accuracy in different settings. For example, identifying and handling edge-case failures in challenging environments and adding temporal information from sequential frames would help reduce prediction errors and noise in the projected gaze. The framework can also be extended to support additional social environments and hardware vendors. The current paper demonstrates gaze mapping in scenes where multiple people share a scene view; however, the application could be scaled up to social settings where individuals have different views of the same scene, using multi-view multi-object detection and tracking (Taj and Cavallaro, 2010). Moreover, the performance bottleneck of homography estimation could be reduced by using techniques such as feature tracking with periodic detection updates, which would pave the way for real-time use cases such as gaze-contingent interaction and sonification (Hornof, 2014). Future development and applications can also incorporate multiple sensor types, such as physiological or IoT sensors, to record and analyse synchronous multimodal measures. We look forward to applying this framework to study attention and subjective experiences in immersive, social contexts, and plan to open-source the code to motivate collaboration and application of our framework in novel contexts.
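For readers less familiar with homography-based gaze mapping, the sketch below shows only the final projection step: applying an already-estimated 3×3 homography to a gaze point in an egocentric frame to obtain its position in the common view. The function name `project_gaze` and the example matrices are hypothetical; the estimation of the homography itself, which is the bottleneck discussed above, is not shown.

```python
def project_gaze(H, x, y, eps=1e-12):
    """Map a gaze point (x, y) through a 3x3 homography H (row-major nested lists).

    Returns the projected (x', y') in the target view's pixel coordinates.
    Raises ValueError if the point maps to infinity (homogeneous w is ~0).
    """
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < eps:
        raise ValueError("point maps to infinity under this homography")
    px = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    py = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return px, py


# Hypothetical homography: scale by 2, then shift by (10, 20).
H_example = [[2.0, 0.0, 10.0],
             [0.0, 2.0, 20.0],
             [0.0, 0.0, 1.0]]
projected = project_gaze(H_example, 1.0, 1.0)
```

Because this step is a handful of multiplications per gaze sample, a real-time pipeline would mainly need to reduce how often the homography is re-estimated, which is the motivation for the tracking-based approach suggested above.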

6. Conclusion

This paper presents an accurate and scalable framework for conducting multi-person eye tracking in everyday social situations where people have a shared gaze goal. In the current study, we used a live music concert and a film screening, but the described approach is generalizable. Building on previous studies that have identified eye measures that reflect visual and auditory attention, our proposed framework allows such multimodal correspondences to be studied outside of a controlled lab environment and with multiple participants simultaneously. Such multi-person eye tracking in dynamic social settings promises to provide novel insights into group behaviour, new approaches to cooperative and collaborative work in large-group settings, and new means to create gaze-contingent immersive environments and social interactions.

7. Code Availability

All code will be made publicly available on GitHub upon acceptance/publication of the submitted manuscript.

8. Funding

This research was supported by a Natural Sciences and Engineering Research Council of Canada Discovery Grant [RGPIN-2023-05050], the Canada Foundation for Innovation John R. Evans Leaders Fund grant [Project #43884], and the Ontario Research Fund for Small Infrastructure Fund [#43884] held by LF. We would also like to thank the German Academic Exchange Service (DAAD) for providing a scholarship to AN, which supported his participation in this research at McMaster.

References

  • Baumann and Dierkes (2023) Chris Baumann and Kai Dierkes. 2023. Neon Accuracy Test Report. Pupil Labs (2023).
  • Benjamins et al. (2018) Jeroen S Benjamins, Roy S Hessels, and Ignace TC Hooge. 2018. Gazecode: Open-source software for manual mapping of mobile eye-tracking data. In Proceedings of the 2018 ACM symposium on eye tracking research & applications. 1–4.
  • Beyan et al. (2018) Cigdem Beyan, Francesca Capozzi, Cristina Becchio, and Vittorio Murino. 2018. Prediction of the Leadership Style of an Emergent Leader Using Audio and Visual Nonverbal Features. IEEE Transactions on Multimedia 20, 2 (2018), 441–456. https://doi.org/10.1109/TMM.2017.2740062
  • Bise et al. (2024) Kazuya Bise, Takeshi Saitoh, Keiko Tsuchiya, Hitoshi Sato, Kyota Nakamura, Takeru Abe, and Frank Coffey. 2024. Joint Attention Detection Using First-Person Points-of-View Video. In 2024 2nd International Conference on Computer Graphics and Image Processing (CGIP). IEEE, 1–6.
  • Bolger et al. (2014) Deirdre Bolger, Jennifer T Coull, and Daniele Schön. 2014. Metrical rhythm implicitly orients attention in time as indexed by improved target detection and left inferior parietal activation. Journal of cognitive neuroscience 26, 3 (2014), 593–605.
  • Clark (1985) Herbert H Clark. 1985. Language use and language users. Handbook of social psychology (1985).
  • Dunifon et al. (2016) Carolyn M Dunifon, Samuel Rivera, and Christopher W Robinson. 2016. Auditory stimuli automatically grab attention: Evidence from eye tracking and attentional manipulations. Journal of Experimental Psychology: Human Perception and Performance 42, 12 (2016), 1947.
  • Einhäuser et al. (2007) Wolfgang Einhäuser, Frank Schumann, Stanislavs Bardins, Klaus Bartl, Guido Böning, Erich Schneider, and Peter König. 2007. Human eye-head co-ordination in natural exploration. Network: Computation in Neural Systems 18, 3 (2007), 267–297.
  • Fasold et al. (2021) Frowin Fasold, André Nicklas, Florian Seifriz, Karsten Schul, Benjamin Noël, Paula Aschendorf, and Stefanie Klatt. 2021. Gaze coordination of groups in dynamic events–a tool to facilitate analyses of simultaneous gazes within a team. Frontiers in psychology 12 (2021), 656388.
  • Fink et al. (2018) Lauren K Fink, Brian K Hurley, Joy J Geng, and Petr Janata. 2018. A linear oscillator model predicts dynamic temporal attention and pupillary entrainment to rhythmic patterns. Journal of Eye Movement Research 11, 2 (2018), 12.
  • Fink et al. (2019) Lauren K Fink, Elke B Lange, and Rudolf Groner. 2019. The application of eye-tracking in music research. Journal of Eye Movement Research 11, 2 (2019), 1.
  • Foulsham and Kingstone (2017) Tom Foulsham and Alan Kingstone. 2017. Are fixations in static natural scenes a useful predictor of attention in the real world? Canadian Journal of Experimental Psychology/Revue canadienne de psychologie expérimentale 71, 2 (2017), 172.
  • Fridman et al. (2018) Lex Fridman, Bryan Reimer, Bruce Mehler, and William T Freeman. 2018. Cognitive load estimation in the wild. In Proceedings of the 2018 chi conference on human factors in computing systems. 1–9.
  • Geeves et al. (2014) Andrew Geeves, Doris J McIlwain, and John Sutton. 2014. The performative pleasure of imprecision: a diachronic study of entrainment in music performance. Frontiers in Human Neuroscience 8 (2014), 863.
  • Gerdes et al. (2021) Antje Gerdes, Georg W Alpers, Hanna Braun, Sabrina Köhler, Ulrike Nowak, and Lisa Treiber. 2021. Emotional sounds guide visual attention to emotional pictures: An eye-tracking study with audio-visual stimuli. Emotion 21, 4 (2021), 679.
  • Hornof (2014) Anthony J Hornof. 2014. The Prospects For Eye-Controlled Musical Performance.. In NIME. 461–466.
  • Jarodzka et al. (2021) Halszka Jarodzka, Irene Skuballa, and Hans Gruber. 2021. Eye-tracking in educational practice: Investigating visual perception underlying teaching and learning in the classroom. Educational psychology review 33, 1 (2021), 1–10.
  • Jermann et al. (2012) Patrick Jermann, Darren Gergle, Roman Bednarik, and Susan Brennan. 2012. Duet 2012. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work Companion.
  • Kasneci (2017) Enkelejda Kasneci. 2017. Towards pervasive eye tracking. IT-Information Technology 59, 5 (2017), 253–257.
  • Kawase (2009) Satoshi Kawase. 2009. An exploratory study of gazing behavior during live performance. In ESCOM 2009: 7th Triennial Conference of European Society for the Cognitive Sciences of Music.
  • Kera et al. (2016) Hiroshi Kera, Ryo Yonetani, Keita Higuchi, and Yoichi Sato. 2016. Discovering objects of joint attention via first-person sensing. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 361–369. https://doi.org/10.1109/CVPRW.2016.52
  • Klein and Ettinger (2019) Christoph Klein and Ulrich Ettinger. 2019. Eye movement research: An introduction to its scientific foundations and applications. Springer Nature.
  • Kredel et al. (2017) Ralf Kredel, Christian Vater, André Klostermann, and Ernst-Joachim Hossner. 2017. Eye-tracking technology and the dynamics of natural gaze behavior in sports: A systematic review of 40 years of research. Frontiers in psychology 8 (2017), 287392.
  • Kulke and Hinrichs (2021) Louisa Kulke and Max Andreas Bosse Hinrichs. 2021. Implicit theory of mind under realistic social circumstances measured with mobile eye-tracking. Scientific reports 11, 1 (2021), 1215.
  • Laidlaw et al. (2016) Kaitlin EW Laidlaw, Austin Rothwell, and Alan Kingstone. 2016. Camouflaged attention: Covert attention is critical to social communication in natural settings. Evolution and Human Behavior 37, 6 (2016), 449–455.
  • Land (2009) Michael F Land. 2009. Vision, eye movements, and natural behavior. Visual neuroscience 26, 1 (2009), 51–62.
  • Lange and Fink (2023) Elke Lange and Lauren Fink. 2023. Eye-blinking, musical processing, and subjective states – A methods account. Psychophysiology 60, e14350 (2023). https://doi.org/10.1111/psyp.14350
  • Li et al. (2006) Dongheng Li, Jason Babcock, and Derrick J Parkhurst. 2006. openEyes: a low-cost head-mounted eye-tracking solution. In Proceedings of the 2006 symposium on Eye tracking research & applications. 95–100.
  • Macdonald and Tatler (2018) Ross G Macdonald and Benjamin W Tatler. 2018. Gaze in a real-world social interaction: A dual eye-tracking study. Quarterly Journal of Experimental Psychology 71, 10 (2018), 2162–2173.
  • Miller et al. (2013) Jared E Miller, Laura A Carlson, and J Devin McAuley. 2013. When what you hear influences when you see: listening to an auditory rhythm influences the temporal allocation of visual attention. Psychological science 24, 1 (2013), 11–18.
  • Müller et al. (2018) Philipp Müller, Michael Xuelin Huang, Xucong Zhang, and Andreas Bulling. 2018. Robust eye contact detection in natural multi-person interactions using gaze and speaking behaviour. In Proceedings of the 2018 ACM Symposium on Eye Tracking Research & Applications. 1–10.
  • Panetta et al. (2019) Karen Panetta, Qianwen Wan, Aleksandra Kaszowska, Holly A Taylor, and Sos Agaian. 2019. Software architecture for automating cognitive science eye-tracking data analysis and object annotation. IEEE Transactions on Human-Machine Systems 49, 3 (2019), 268–277.
  • Pennill and Timmers (2017) Nicola Pennill and Renee Timmers. 2017. Rehearsal processes and stage of performance preparation in chamber ensembles. 25th Anniversary Edition of the European Society for the Cognitive Sciences of Music (ESCOM) (2017).
  • Pomper and Chait (2017) Ulrich Pomper and Maria Chait. 2017. The impact of visual gaze direction on auditory object tracking. Scientific reports 7, 1 (2017), 4640.
  • Reisberg et al. (1981) Daniel Reisberg, Roslyn Scheiber, and Linda Potemken. 1981. Eye position and the control of auditory attention. Journal of Experimental Psychology: Human Perception and Performance 7, 2 (1981), 318.
  • Reitstätter et al. (2020) Luise Reitstätter, Hanna Brinkmann, Thiago Santini, Eva Specker, Zoya Dare, Flora Bakondi, Anna Miscená, Enkelejda Kasneci, Helmut Leder, and Raphael Rosenberg. 2020. The display makes a difference: A mobile eye tracking study on the perception of art before and after a museum’s rearrangement. Journal of Eye Movement Research 13, 2 (2020).
  • Sarlin et al. (2020) Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. 2020. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4938–4947.
  • Saxena et al. (2023) Shreshth Saxena, Lauren K Fink, and Elke B Lange. 2023. Deep learning models for webcam eye-tracking in online experiments. Behavior Research Methods (2023).
  • Saxena et al. (2022) Shreshth Saxena, Elke B Lange, and Lauren K Fink. 2022. Towards efficient calibration for webcam eye-tracking in online experiments. In 2022 Symposium on Eye Tracking Research and Applications. 1–7.
  • Sekuler et al. (1997) R Sekuler, AB Sekuler, and R Lau. 1997. Sound alters visual motion perception. Nature 385, 6614 (1997), 308.
  • Stiefelhagen (2002) Rainer Stiefelhagen. 2002. Tracking focus of attention in meetings. In Proceedings. Fourth IEEE International Conference on Multimodal Interfaces. IEEE, 273–280.
  • Stupacher et al. (2020) Jan Stupacher, Maria AG Witek, Jonna K Vuoskoski, and Peter Vuust. 2020. Cultural familiarity and individual musical taste differently affect social bonding when moving to music. Scientific Reports 10, 1 (2020), 10015.
  • Taj and Cavallaro (2010) Murtaza Taj and Andrea Cavallaro. 2010. Multi-view multi-object detection and tracking. In Computer vision: Detection, recognition and reconstruction. Springer, 263–280.
  • Tatler (2007) Benjamin W Tatler. 2007. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of vision 7, 14 (2007), 4–4.
  • Tervaniemi (2023) Mari Tervaniemi. 2023. The neuroscience of music–towards ecological validity. Trends in Neurosciences 46, 5 (2023), 355–364.
  • Tolvanen et al. (2022) Otto Tolvanen, Antti-Pekka Elomaa, Matti Itkonen, Hana Vrzakova, Roman Bednarik, and Antti Huotarinen. 2022. Eye-tracking indicators of workload in surgery: A systematic review. Journal of Investigative Surgery 35, 6 (2022), 1340–1349.
  • Valtakari et al. (2021) Niilo V Valtakari, Ignace TC Hooge, Charlotte Viktorsson, Pär Nyström, Terje Falck-Ytter, and Roy S Hessels. 2021. Eye tracking in human interaction: Possibilities and limitations. Behavior Research Methods (2021), 1–17.
  • Van der Burg et al. (2008) Erik Van der Burg, Christian NL Olivers, Adelbert W Bronkhorst, and Jan Theeuwes. 2008. Pip and pop: nonspatial auditory signals improve spatial visual search. Journal of Experimental Psychology: Human Perception and Performance 34, 5 (2008), 1053.
  • Van der Stigchel and Hollingworth (2018) Stefan Van der Stigchel and Andrew Hollingworth. 2018. Visuospatial working memory as a fundamental component of the eye movement system. Current Directions in Psychological Science 27, 2 (2018), 136–143.
  • Vandemoortele et al. (2018) Sarah Vandemoortele, Kurt Feyaerts, Mark Reybrouck, Geert De Bièvre, Geert Brône, and Thomas De Baets. 2018. Gazing at the partner in musical trios: a mobile eye-tracking study. Journal of Eye Movement Research 11, 2 (2018).
  • Vrzakova et al. (2016) Hana Vrzakova, Roman Bednarik, Yukiko I Nakano, and Fumio Nihei. 2016. Speakers’ head and gaze dynamics weakly correlate in group conversation. In Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications. 77–84.
  • Yamada et al. (2014) Katsuma Yamada, Masaru Ohgiri, Takashi Furukawa, Hisanori Yuminaga, Akihiko Goto, Noriyuki Kida, and Hiroyuki Hamada. 2014. Visual behavior in a Japanese drum performance of Gion festival music. In Digital Human Modeling. Applications in Health, Safety, Ergonomics and Risk Management: 5th International Conference, DHM 2014, Held as Part of HCI International 2014, Heraklion, Crete, Greece, June 22-27, 2014. Proceedings 5. Springer, 301–310.