[1] Xavier Alameda-Pineda (These authors contributed equally to this paper.)
[1] RobotLearn Team, Inria at Univ. Grenoble Alpes, CNRS, LJK, 655 Avenue de l’Europe, 38334, Montbonnot, France
[2] Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslávských partyzánů 1580/3, 160 00 Dejvice, Czechia
[3] Acoustic Signal Processing Laboratory, Bar-Ilan University, Ramat-Gan, 5290002, Israel
[4] Department of Information and Computer Science, University of Trento, Via Sommarive 9, 38123, Trento, Italy
[5] Interaction Lab, Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, EH14 4AS, United Kingdom
[6] ERM Automatismes, 561 allée Bellecour, 84200, Carpentras, France
[7] PAL Robotics, C/ Pujades 77-79, 08005, Barcelona, Spain
[8] Lusage Living Lab, Assistance Publique - Hopitaux de Paris, 54-56 Rue Pascal, 75013, Paris, France
Socially Pertinent Robots in Gerontological Healthcare
Abstract
Despite the many recent achievements in developing and deploying social robotics, there are still many underexplored environments and applications for which systematic evaluation of such systems by end-users is necessary. While several robotic platforms have been used in gerontological healthcare, the question of whether or not a social interactive robot with multi-modal conversational capabilities will be useful and accepted in real-life facilities is yet to be answered. This paper is an attempt to partially answer this question, via two waves of experiments with patients and companions in a day-care gerontological facility in Paris with a full-sized humanoid robot endowed with social and conversational interaction capabilities. The software architecture, developed during the H2020 SPRING project, together with the experimental protocol, allowed us to evaluate the acceptability (AES) and usability (SUS) with more than 60 end-users. Overall, the users are receptive to this technology, especially when the robot perception and action skills are robust to environmental clutter and flexible to handle a plethora of different interactions.
keywords:
Multi-party Robot Interaction, Gerontology Healthcare, Acceptability, Usability
1 Introduction
Social robots are not yet commonly found in our public spaces, despite this vision being an imminent reality over 25 years ago [1]. In addition to classic robotic skills, such as object avoidance during navigation, social robots must be able to seamlessly communicate with multiple people through natural verbal and non-verbal interaction. Over the past decade, social robots have been tested in museums, airports, libraries, shopping malls, bars, and hospitals [2, 3, 4, 5, 6, 7], reporting many positive findings. They have been used to successfully make sports and rehabilitation exercises more entertaining [8], assist older adults in care facilities [9, 10] and hospitals [11, 12], navigate and engage with people in public spaces (like concert halls, hotel lobbies, and shopping malls) [9, 13, 5], and engage in multi-party interaction [3, 14]. However, today’s social robots are far from perfect as they are often run in a wizard-of-oz setup (with the researchers controlling the robot’s navigation and dialogue manually) [9, 10, 14]. Those that do function independently are stationary [13, 8], have limited dialogue capacity (rule-based or closed-domain) [3, 5], or are designed for single user interactions that cannot be guaranteed in public spaces.
Such spaces require more complex robotic skills and introduce new underexplored challenges [15]. The robot must be able to fuse multi-sensory input to sense people and objects [16], tracking their positions as the robot navigates through the space [17], all while avoiding collisions. The robot must be able to hear its interlocutors [18] despite background noise, the robot’s ego-noise, the acoustics of the room, and multiple people speaking at the same time. It must understand where people are looking [19], and determine if they are getting frustrated with the robot to aid communication. The robot must move its head and eyes to look at its addressee or nod, move its arms appropriately when pointing, and move in the correct direction when guiding people. Typical speech systems are created to converse with only one individual at a time (e.g. Amazon Alexa, or Siri [20]), whereas pairs and groups of individuals may approach the robot together. People talk to each other as well as to the robot in a multi-party conversation, and the robot’s spoken dialogue system must be able to handle this [21]. Similarly, the navigation skills of the robotic platforms must be able to adapt to new environments without requiring extensive retraining or adaptation.
To tackle the above challenges, the EU’s H2020 SPRING project (https://spring-h2020.eu/) aims to develop social robots that can communicate in complex and unstructured public spaces. The SPRING (Socially Pertinent Robots IN Gerontological healthcare) project is a consortium of five international research labs (computer science & engineering), two industry partners, and importantly, a hospital with research facilities. Our experimental setting is the Broca gerontology day-care hospital, which patients visit when they are suspected of having dementia. Patients typically spend full days at the hospital with a friend or family member for support. The hours are filled with multiple appointments, but a large portion of the day is also spent waiting for test results or the next appointment. Our goal is to provide a system that is both practically useful and entertaining, offering participants information and some light distraction from their otherwise stressful day. The research staff at the hospital run the experiments with volunteer patients, their companions, and the ARI robot (Figure 1).
One of the objectives of the SPRING project, and the main research question of the study presented in this paper, is whether or not patients of a gerontology day-care hospital and their companions will find a social robot useful and acceptable in this environment. Our study is unique in the sense that (i) the evaluation is conducted with real patients and companions, in a day-care gerontological hospital, (ii) with a full-sized humanoid robot (ARI is 1.65 m tall) (iii) that is capable of advanced sensing and acting skills allowing it to converse with patients and companions. Our findings suggest that both patients and companions find such technology useful and acceptable, especially if it is robust to noise and clutter, and flexible to the plethora of different situations it can encounter.
To conduct this study, we have developed a complex and feature-rich software architecture that is described in Section 2, where we detail the overall system architecture and provide technical details of each robotic ‘skill’, and the improvement beyond the state-of-the-art. Some of the modules in the architecture were not evaluated in the hospital, but only in our laboratories, due to ethical constraints. Where appropriate, we refer to technical reports or publications to keep this paper concise without over-simplifying. In Section 3 we describe the experimental setup in the hospital. We discuss the ethical considerations, the protocol, the performance metrics, the main results, and the failure cases. Finally, we conclude and list open topics in Section 4.
2 Architecture
2.1 Overview & Robot Platform
To accomplish its objective of developing a socially pertinent robot, the SPRING consortium developed a novel software architecture for humanoid robots. The architecture is composed of eight modules (Figure 2) responsible for perception (self-localisation, human localisation, speech processing, human behaviour analysis, person manager) and action (experimenter interface, multi-party conversation, non-verbal behaviour generation) processes. They are developed on top of the ROS 1 Noetic (http://wiki.ros.org/noetic) middleware. In total, all modules consist of 52 ROS nodes communicating through more than 170 ROS topics and actions. Where relevant, the architecture uses standard ROS messages. In particular, the SPRING architecture is one of the first to fully adopt the REP-155 ‘ROS4HRI’ standard (https://www.ros.org/reps/rep-0155.html) [22] for human-robot interaction to combine the different perception modalities (voices, faces, bodies) into a consistent representation of persons that can be used by downstream action modules.
The developed architecture is deployed on the PAL ARI humanoid robot, designed for use as a socially assistive companion [23]. ARI is 1.65 m tall, with a differential-drive mobile base, a touch-screen on the torso, movable arms to gesture, and a head with LCD eyes that enable expressive gazing behaviours. It is equipped with a four-microphone array (front torso), an RGB camera (head), and two 180° fisheye cameras (chest and back), allowing us to capture and record the audio and video of the whole interaction from the robot’s perspective. An additional RGB-D camera is located in the front for self-localisation. The robot verbalises given responses using Acapela Text-To-Speech (https://www.acapela-group.com/). The following sections provide details on the key features of each module.
2.2 Self-Localisation
The self-localisation module takes the RGB-D and fisheye images as input and provides a 2D occupancy map indicating obstacles and the current position of ARI in the map. This information is crucial for obstacle avoidance during the navigation of ARI (Section 2.8). The developed module integrates RTABMap [24], a library for mapping and real-time tracking that incorporates ORB-SLAM [25], with an augmented version of the Hierarchical-Localisation (HLOC) [26] algorithm. RTABMap and HLOC provide two distinct, pre-aligned maps that share the same global coordinate system. The pre-alignment is done by aligning the camera centers registered in these maps. The module fuses the information from both to provide a single consistent map for other modules. This multi-layered approach is crucial for robust and precise localisation in complex and dynamic environments.
RTABMap uses the slightly downward-facing RGB-D camera at the front of ARI, which provides efficient mapping and localisation in many situations. However, in environments with reflective floors, RTABMap has trouble tracking key points and loses its localisation. To handle such cases, an extended version of the HLOC algorithm was introduced to allow localisation based on the more robust and feature-rich images from the front- and backwards-facing fisheye cameras. A calibration step that associates camera poses (position and orientation) with robot poses is performed once at the beginning. Given a fisheye image, the camera pose is initialised using the SENet [27] approach for advanced image retrieval. Image retrieval estimates the camera’s pose with an accuracy of a few meters, by identifying the most similar images from a collection of images of the environment with their poses in a global coordinate system. The next step is to establish a set of correspondences between the input image and the closest images. The final step is pose estimation using the generalised absolute pose solver [28] from PoseLib (https://github.com/PoseLib/PoseLib), assuming known relative poses between the ARI cameras (i.e., camera rig) from external calibration.
Combining state-of-the-art image retrieval and pose estimation techniques, this complex methodology results in state-of-the-art localisation accuracy. Using front and backwards-facing images captured from two consecutive time points (i.e., four fisheye images), the system achieves localisation within the maps of the Broca gerontology day-care hospital with less than 1cm positional error and 0.2 degrees rotational error in 90% of queries. Comparatively, with a single image, the success rate drops to 15% for positional accuracy within 1cm and 30% for rotational accuracy within 0.2 degrees. These findings highlight the enhanced precision of the multiview approach over single-image HLOC localisation used to initialise the real-time camera pose tracking performed by ORB-SLAM.
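The success rates quoted above can be read as the fraction of localisation queries whose error falls below a threshold. The following minimal sketch illustrates that computation, assuming a query counts as successful only when both the positional and the rotational error are under their thresholds. The function name `success_rate` and the sample errors are hypothetical and are not taken from the SPRING evaluation code or data.

```python
def success_rate(errors, max_pos_cm=1.0, max_rot_deg=0.2):
    """Fraction of queries whose pose error is below both thresholds.

    errors: list of (positional_error_cm, rotational_error_deg) pairs.
    """
    if not errors:
        raise ValueError("at least one query is required")
    hits = sum(1 for pos, rot in errors
               if pos < max_pos_cm and rot < max_rot_deg)
    return hits / len(errors)


# Hypothetical per-query errors for illustration, not measurements from the paper.
queries = [(0.4, 0.05), (0.9, 0.15), (2.3, 0.10), (0.7, 0.35)]
print(success_rate(queries))  # two of the four queries pass both thresholds
```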
2.3 Human Localisation
The main goal of this module is to localise the humans in the space through time. To that aim, we use audio-visual data (more precisely, the front fisheye images and the microphone array signals) to detect people, track them over time, and re-identify them (meaning that there is a one-to-one correspondence between persons and tracks). This information is exploited by the Conversation Manager (Section 2.7) to have a time-consistent identifier for each conversation partner, and by the human-aware navigation to move naturally around humans (Section 2.8). On the audio side, we employ time-difference of arrival (TDOA) estimation, utilising an instantaneous version of the generalized cross-correlation phase-transform (GCC-PHAT) [29]. The GCC-PHAT algorithm reliably provides TDOA readings in frames with a single speaker. This implementation utilises two horizontal microphones from the ReSpeaker microphone array embedded in ARI. The TDOA readings are then translated into direction of arrival (DOA) estimates using geometric considerations. On the visual side, we have implemented a state-of-the-art multi-person visual tracker known as fair multi-object tracking (FairMOT) [30]. This method combines the detection and the re-identification abilities and is based on the well-known residual neural network (ResNet34) [31]. However, the standard architectures are trained for regular cameras, while ARI’s camera has a fisheye lens. This required an annotation and retraining procedure described in [32]. Another important property of FairMOT is that the tracking is based on the Kalman filter, plus an additional matching step that associates detections (from the current time step) and Kalman predictions (from the previous time step), by means of a detection-to-prediction distance matrix.
By computing the distance between DOA and the predictions of the Kalman filter, we can seamlessly incorporate audio detections to the tracking pipeline, thus achieving multi-modal speaker detection and tracking.
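To make the geometric step concrete, the sketch below converts a TDOA reading from a two-microphone pair into a DOA angle under a far-field assumption, and then attaches the DOA to the closest predicted track bearing. This is an illustration only: the speed of sound, the microphone spacing, the gating threshold `max_gap_deg`, and both function names are assumptions, and the actual SPRING pipeline fuses audio through the tracker's distance matrix rather than through this simplified nearest-bearing rule.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def tdoa_to_doa(tdoa_s, mic_distance_m, c=SPEED_OF_SOUND):
    """Far-field DOA in degrees, measured from the broadside of a mic pair."""
    ratio = c * tdoa_s / mic_distance_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp: noisy TDOAs can exceed d / c
    return math.degrees(math.asin(ratio))


def match_doa_to_track(doa_deg, predicted_bearings_deg, max_gap_deg=15.0):
    """Return the id of the track whose predicted bearing is closest to the
    DOA, or None if no track lies within max_gap_deg (hypothetical gate)."""
    best_id, best_gap = None, max_gap_deg
    for track_id, bearing in predicted_bearings_deg.items():
        gap = abs(doa_deg - bearing)
        if gap <= best_gap:
            best_id, best_gap = track_id, gap
    return best_id


# Hypothetical example: 10 cm microphone spacing, two tracked persons.
doa = tdoa_to_doa(0.05 / SPEED_OF_SOUND, 0.1)
print(round(doa, 3), match_doa_to_track(doa, {1: -40.0, 2: 28.0}))
```

The clamp matters in practice because a noisy TDOA slightly larger than the physical maximum would otherwise make `asin` raise a domain error.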
Based on the position of humans in the image, we can extract their pose using OpenPose [33]. This provides the orientation of each person and their feet position in the image. By triangulation and the assumption that persons stand on an even floor, we can recover their depth. This allows us to infer the distance and orientation of persons relative to each other and, finally, to detect conversational groups (group centre and its members) using the Graph-Cuts for F-formation (GCFF) algorithm [34].
2.4 Speech Processing
In real-life scenarios, a robot may engage with a group of people amid noisy and reverberant surroundings. The speech processing module’s objective is to generate multiple streams of transcribed speech for all speakers from the microphone array signals, which the conversational system (Section 2.7) will utilise. The transcribed text streams should maintain consistency over time. This may require various methods of attributing identity to the speakers, including DOA estimation and speaker identification.
We limit the scenarios to a maximum of two concurrent speakers. We also assume that the robot interacts with individuals in a half-duplex manner; namely, it does not listen while talking. The audio pipeline uses three steps to process the raw audio signal captured with the ReSpeaker microphone array:
1. Enhance the speech quality;
2. Extract key attributes (voice activity, DOA, speaker identity);
3. Transcribe the enhanced audio data.
This process was designed to comprehensively handle the intricacies associated with single- and multi-speaker scenarios, using a systematic approach for audio processing and analysis.
Speech enhancement and denoising
The recorded speech signal is contaminated with various artefacts, such as noise and reverberation. To produce a clean speech signal, we apply one of the following three alternative speech enhancement algorithms.
If speech is only contaminated by noise, a noise reduction module based on a Mixture of Deep Experts [35] will be activated. Each expert is implemented via a deep neural network (DNN) attuned to a distinct speech spectral pattern, such as a phoneme. Each expert generates a speech presence probability (SPP) map, determining whether a time-frequency bin is predominantly speech or noise based on its expertise. The final time-frequency mask is derived by weighting the SPP estimates from various experts and then applied to enhance the speech signal.
If two speakers are active in the scene, a recent single-microphone speaker separation algorithm [36] will be activated, aiming at noisy and reverberant signals typical of real-world environments. Two variants of the algorithm are available, Separation TF Attention Network (Sep-TFAnet) and Sep-TFAnetVAD. Audio samples of the separation results (in English) are publicly available (https://Sep-TFAnet.github.io).
As an alternative, a speaker extraction module will be activated. We have implemented a two-stage method that extracts the speech corresponding to a reference signal and subsequently applies a dereverberation and residual interference suppression stage [37]. A noteworthy feature of the speaker extraction algorithm, particularly pertinent to the SPRING project, is its capability to infer speaker embeddings that can be leveraged for speaker identification tasks.
An arbitrator will be implemented to select the most appropriate algorithm based on the sound scene. Currently, the audio pipeline is tested and verified by manually switching between the alternative modules.
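The masking step of the Mixture of Deep Experts noise reduction described above can be sketched as follows: each expert contributes an SPP map, the maps are combined by a weighted average, and the resulting mask scales the noisy magnitude spectrogram. This is a simplified illustration, not the method of [35]: using one global weight per expert and plain nested lists is an assumption, and the names `moe_mask` and `apply_mask` are hypothetical.

```python
def moe_mask(spp_maps, gate_weights):
    """Combine per-expert SPP maps into a single time-frequency mask.

    spp_maps: list of E maps, each a list of rows (frequency x time) in [0, 1].
    gate_weights: E non-negative weights, normalised here to sum to one.
    """
    total = sum(gate_weights)
    if total <= 0:
        raise ValueError("gate weights must have a positive sum")
    weights = [g / total for g in gate_weights]
    n_freq, n_time = len(spp_maps[0]), len(spp_maps[0][0])
    return [[sum(w * spp[f][t] for w, spp in zip(weights, spp_maps))
             for t in range(n_time)]
            for f in range(n_freq)]


def apply_mask(noisy_magnitude, mask):
    """Scale each time-frequency bin of the noisy magnitude by the mask."""
    return [[m * x for m, x in zip(mask_row, mag_row)]
            for mask_row, mag_row in zip(mask, noisy_magnitude)]


# Hypothetical toy input: two experts, one frequency bin, two time frames.
mask = moe_mask([[[1.0, 0.0]], [[0.0, 1.0]]], gate_weights=[3.0, 1.0])
print(mask, apply_mask([[2.0, 4.0]], mask))
```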
Attribute extraction
We extract three main speaker attributes: the activity pattern (if there is speech or not), the DOA, and the identity (understood as something that characterises the voice rather than the name). While the first two attributes are short-term, identity is considered long-term.
Regarding the short-term attributes, the activity can be directly obtained from the Sep-TFAnetVAD network, which incorporates a voice activity detector (VAD). The activity patterns of the separated speakers can serve for diarization in the downstream dialogue manager. Short-term identification relies partly on spatial information derived from the current scene. This involves employing a late fusion mechanism that combines visual-based and audio-based DOA estimation, as explained in Section 2.3.
Regarding the identity, we utilise Nvidia’s ECAPA-TDNN model [38] to extract voice-specific (speaker) embeddings, producing a 192-dimensional voice signature vector. The speaker identification module then stores these embeddings in an internal database. The identification process occurs by comparing the cosine similarity between an active speaker and an entry in the database, triggering a match when the similarity exceeds a specified threshold. Finally, the speaker embedding obtained from the speaker extraction algorithm [37] may also serve as a speaker ID.
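The identification step can be illustrated with a small sketch: an incoming embedding is compared by cosine similarity against stored entries, and a match is declared when the best similarity exceeds a threshold. The threshold value, the `voice_N` naming, and the choice to enrol unmatched embeddings as new entries are assumptions made for illustration; the class `SpeakerDatabase` is hypothetical and does not reproduce the SPRING module. Real embeddings would be 192-dimensional, while the example uses 3-dimensional toy vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


class SpeakerDatabase:
    """Toy in-memory store of voice signatures (hypothetical helper)."""

    def __init__(self, threshold=0.7):  # assumed threshold, to be tuned
        self.threshold = threshold
        self.entries = {}

    def identify(self, embedding):
        """Return the best matching id, enrolling a new one if none matches."""
        best_id, best_sim = None, self.threshold
        for speaker_id, reference in self.entries.items():
            sim = cosine_similarity(embedding, reference)
            if sim > best_sim:
                best_id, best_sim = speaker_id, sim
        if best_id is None:
            best_id = f"voice_{len(self.entries)}"
            self.entries[best_id] = list(embedding)
        return best_id


db = SpeakerDatabase()
print(db.identify([1.0, 0.0, 0.0]),   # first voice is enrolled
      db.identify([0.9, 0.1, 0.0]),   # close to the first voice
      db.identify([0.0, 1.0, 0.0]))   # dissimilar, enrolled as a new voice
```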
Audio transcription
Transcribing audio is crucial in social robotics, as this component will generate the words comprehended by the robot. We carried out an extended experimental evaluation of several automatic speech recognition (ASR) systems [39] and concluded that the best option was to use Nvidia’s on-prem solution called RIVA (Version 2.7). Apart from demonstrating performance comparable to existing cloud services, RIVA offers the advantage of adaptability to our data distribution. Moreover, it operates on-premises, mitigating potential concerns associated with data and privacy issues.
2.5 Human Behaviour Analysis
This section describes the modules of ARI regarding the perception of human behaviour, namely: gaze target detection, detection of social acceptance of ARI, and multimodal emotion recognition. These modules take as input the position and groups of people from Section 2.3, the speech as processed in Section 2.4, and the face camera image, and could have direct applications in understanding people’s intentions, emotional state, and engagement with the robot. While we have not yet had the opportunity to exploit the outcome of these modules in hospital settings, they are part of the software architecture and publicly available, like the other modules.
Gaze Target Detection
This task, also referred to as gaze-following [40, 41], is to infer where each person in the scene (2D or 3D) is looking [40, 41, 42]. We aim to predict a person’s gaze in an RGB scene image captured by the head camera of ARI. To do so, we apply our method [43], whose inputs are: (a) an RGB scene image, which contains the field of view of the head camera of ARI; (b) an RGB face image, which is cropped from the RGB scene image, corresponds to the person whose gaze will be estimated, and is extracted using a Multi-task Cascade CNN [44]; and (c) a scene depth image obtained from the monocular depth estimation network of [45]. A fusion and prediction module concatenates scene, depth, and head features to obtain the two outputs of the proposed method, namely i) the gaze heatmap, a 1-channel 2D matrix whose peak value represents the gaze coordinates, and ii) the probability of the gaze target being inside or outside the scene.
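As an illustration of how the two outputs can be consumed downstream, the sketch below reads the gaze coordinates from the peak of the heatmap and discards the estimate when the inside-scene probability is low. The 0.5 cut-off and the function name `gaze_point` are assumptions, not part of the method in [43].

```python
def gaze_point(heatmap, inside_prob, min_inside=0.5):
    """Return the (row, col) of the heatmap peak, or None when the gaze
    target is judged to be outside the scene.

    heatmap: 2D list of non-negative scores (one channel).
    inside_prob: predicted probability that the target is inside the image.
    min_inside: assumed decision threshold on inside_prob.
    """
    if inside_prob < min_inside:
        return None
    best_value, best_pos = float("-inf"), None
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best_value:  # strict '>' keeps the first peak on ties
                best_value, best_pos = value, (r, c)
    return best_pos


# Hypothetical 2x2 heatmap for illustration.
toy_heatmap = [[0.1, 0.2],
               [0.9, 0.3]]
print(gaze_point(toy_heatmap, inside_prob=0.8))
```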
Automatic Social Acceptance
We address the task of automatically detecting the social acceptance of ARI as an engagement detection problem. Human-robot engagement detection refers to the process of identifying and understanding the level of interaction, involvement, or connection between humans and robots in a given context [46]. This involves analysing various cues, such as verbal and non-verbal communication, gestures, facial expressions, and other social signals, to determine the extent to which a person is actively engaged with or responsive to a robot [47]. To address this task, our proposed method concentrates on analysing the gaze behaviour of human agents. We leverage the previously discussed gaze target detection module to extract handcrafted features, drawing inspiration from [48], which has demonstrated promising results in analysing multi-party conversations. Given a video clip lasting 10 seconds and the gaze location in the scene, we extract the ratio of frames in which the person is looking at ARI, gazing out of the field of view, or looking at other regions visible to ARI’s head camera. These features, along with the engagement annotation for the corresponding video clip, are used to train a deep multilayer perceptron.
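The handcrafted gaze features can be sketched as simple frame ratios. In the example below, each frame of a clip carries one of three labels, and the feature vector is the fraction of frames per label. The label names and the function `gaze_ratios` are hypothetical, and the mapping from gaze heatmaps to these labels is assumed to have been done beforehand.

```python
from collections import Counter

# Assumed label set: looking at the robot, at another visible region of the
# scene, or outside the head camera's field of view.
LABELS = ("robot", "scene", "outside")


def gaze_ratios(frame_labels):
    """Fraction of frames per gaze label, ordered as in LABELS."""
    if not frame_labels:
        raise ValueError("the clip must contain at least one frame")
    counts = Counter(frame_labels)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown gaze labels: {sorted(unknown)}")
    return [counts[label] / len(frame_labels) for label in LABELS]


# Hypothetical 10-frame clip; a real 10-second clip would hold many more frames.
clip = ["robot"] * 5 + ["scene"] * 3 + ["outside"] * 2
print(gaze_ratios(clip))
```

The resulting three-dimensional vector would then be the input to the classifier trained on the engagement annotations.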
Multimodal Emotion Recognition
Facial expressions, in addition to a person’s speech (e.g. prosody, pitch), are an essential part of non-verbal communication and major indicators of human emotions. Effective emotion recognition systems can facilitate comprehension of an individual’s intention, and prospective behaviours in Human-Robot Interaction. Our approach incorporates both a facial emotion recognition (FER) system [49], designed to differentiate between positive and negative emotions, and a single-microphone speech emotion recognition (SER) system [50] that can identify discrete emotions (e.g. happy, angry, sad).
The FER system is composed of two neural networks: 1) a convolutional autoencoder-based architecture used as a feature extractor, and 2) the classification head responsible for classifying whether the emotion is positive or negative. The employed convolutional autoencoder is trained to reconstruct the input image (unsupervised pre-training). We freeze the autoencoder and use it only to extract features that are passed to a multilayer perceptron used as a linear classifier. This classifier is trained with focal loss to better handle class imbalance problems, if any, and classifies whether a given face is displaying a positive or negative emotion [49].
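For readers unfamiliar with focal loss, the sketch below shows its binary form for a single prediction: the usual cross-entropy term is scaled by a factor that shrinks the contribution of well-classified examples, which helps when one class dominates. The values of `gamma` and `alpha` are commonly used defaults and are an assumption here; the settings used in [49] are not stated in this paper.

```python
import math


def binary_focal_loss(p_positive, label, gamma=2.0, alpha=0.25):
    """Focal loss for one example.

    p_positive: predicted probability of the positive class.
    label: 1 for positive, 0 for negative.
    gamma: focusing parameter (0 recovers weighted cross-entropy).
    alpha: weight of the positive class; negatives get 1 - alpha.
    """
    p_t = p_positive if label == 1 else 1.0 - p_positive
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction is penalised far less than an uncertain one.
print(binary_focal_loss(0.9, 1), binary_focal_loss(0.6, 1))
```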
Our SER algorithm [50] is a variant of a previous single-microphone system [51] to best fit the robot’s hardware in our setting. The acoustic features are extracted from the audio utterances and fed to a neural network that consists of CNN layers, a BiLSTM combined with an attention mechanism layer [52], and a fully connected layer.
Both the FER and SER systems have been evaluated individually on relevant corpora (see [49] and [50] respectively) achieving state-of-the-art performance, but there still exists a challenge in applying these models together in the real hospital. Indeed, the domain shift between the corpora and hospital data distributions is substantial and collecting annotated data in hospital settings is challenging and resource-consuming.
Example | User | Utterance | Note of Interest |
(A) | U1 | I think it is London | If turn 2 was U2, it would be agreement, so speaker recognition changes meaning. |
 | U1 | Yeah… London | |
(B) | U1 | My husband needs the bathroom | Providing other user’s goal. |
(C) | U1 | What time is my appointment? | U2 answers U1’s question, but addressee was ambiguous without gaze info. |
 | U2 | It’s at 10am | |
(D) | U1 | We are hungry | Shared goal indicated by ‘we’, and robot can point to the ‘left’. Fasting is in red as it is a world-knowledge hallucination. |
 | ARI | | |
(E) | U1 | Name a song by… | This is an OOD question that could not be answered without the LLM-based SDS. The partial utterance is handled naturally which improves accessibility. |
 | ARI | By who? | |
 | U1 | Queen | |
 | ARI | Bohemian Rhapsody | |
2.6 Person Manager
The perception modules presented in Sections 2.3 (human localisation), 2.4 (speech processing) and 2.5 (human behaviour analysis) all extract what we call social features. To be used for downstream tasks (like the multi-party conversation manager, as described in the following section), these features have to be combined and associated with each other to build complete persons. The association might be direct (e.g., the facial recognition software module directly associates a face to a specific person), or indirect (we associate a body to a face based on the overlapping regions of interest in the source image, and transitively associate the body to a person). To broadcast possible associations between features and/or persons, with their corresponding likelihoods, the SPRING architecture uses the ROS4HRI [22] standard.
ROS4HRI defines five types of entities to model the humans interacting in the vicinity of the robot. The first three are feature entities for the face, body and voice. The last two are for persons and groups. Each entity has a unique identifier and properties, e.g. the bounding box of a face or the position of the group centre. The identifiers of feature entities are transient and might be created or changed at any time, based on the result of the face, body and voice detection algorithms. The person entity, in contrast, has a persistent identifier: a given person should always be assigned the same person identifier. Recognition modules are in charge of associating feature identifiers with the corresponding person identifier. For instance, the facial recognition node might broadcast a message like [{john, face_432, 0.8}, {jane, face_432, 0.2}] to indicate that a detected face has an 80% chance of being John, and a 20% chance of being Jane. This design allows an effective separation of concerns, where the question of ‘feature’ (face, body, voice) detection can be cleanly isolated from the question of feature matching. For SPRING, three feature matching processes are used: body to face, body to voice, and body to group. Associations generally depend on the closeness of two features, for instance, how far a detected face is from a body, and the Hungarian algorithm [55] is used to solve the assignment problem if several features have to be matched. Note that, due to ethical considerations during the experiments in the hospital, a person entity is never directly associated with an actual person (e.g. through facial matching against a database of photos). All person entities were regarded as anonymous persons.
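The assignment problem mentioned above can be illustrated on a toy cost matrix. The sketch below finds the minimum-cost one-to-one matching between bodies and faces by exhaustive search over permutations, which returns the same optimal cost as the Hungarian algorithm for small inputs. It is a stand-in for illustration, not the Hungarian algorithm itself, and it is only practical for a handful of features. The function name `best_assignment` and the cost values are hypothetical.

```python
from itertools import permutations


def best_assignment(cost):
    """Minimum-cost one-to-one assignment for a square cost matrix.

    cost[i][j] is the distance between body i and face j.
    Returns (pairs, total_cost), where pairs is a list of (body, face).
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(enumerate(best_perm)), best_total


# Hypothetical body-to-face distances (e.g. in pixels between region centres).
distances = [[4, 1, 3],
             [2, 0, 5],
             [3, 2, 2]]
print(best_assignment(distances))
```

Note that the greedy choice of the smallest entry (body 1 with face 1, cost 0) is not part of the optimal solution here, which is why a global assignment method is needed.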
This association mechanism is exploited to build, over time, a probabilistic graph of relationships between social features, persons and groups (Figure 3). The challenge is then to compute the most likely person-feature associations. We have developed a novel algorithm, playfully named the Mr. Potato algorithm [56], to compute the most probable associations between the different features and persons. Our algorithm searches all possible partitions of the graph, and selects the one that minimises the number of associations, while maximising affinity, i.e. the sum of likelihoods of each association. Our implementation efficiently represents the persons-features graph using the Boost Graph Library [57]. Connected components are computed using a Depth-First-Search approach; likewise, minimum spanning trees are calculated using Kruskal’s algorithm [58]; and shortest paths between nodes are computed using Dijkstra’s algorithm [59]. The result of the algorithm is published as a new set of ROS4HRI-compatible topics, listing the tracked and known groups and persons, and their corresponding face, body and voice.
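One building block named above, computing connected components with a depth-first search, can be sketched in a few lines. The example covers only that step on an unweighted toy graph; it is not the Mr. Potato algorithm of [56], it ignores association likelihoods, and it uses plain Python dictionaries instead of the Boost Graph Library. The node names are hypothetical.

```python
def connected_components(nodes, edges):
    """Group nodes of an undirected graph into connected components
    using an iterative depth-first search."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Hypothetical features: each component gathers the features of one person.
features = ["face_1", "body_1", "voice_1", "face_2", "body_2"]
links = [("face_1", "body_1"), ("body_1", "voice_1"), ("face_2", "body_2")]
print(connected_components(features, links))
```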
2.7 Multi-party Conversation Manager
Commercial and research spoken dialogue systems (SDSs), conversational agents, and social robots have been designed with a focus on dyadic interactions, that is, two-party conversations between one individual user and a single system/robot. Such interactions can only be guaranteed in specific settings, like people interacting with Siri on their phone, or with Amazon Alexa in single-occupant homes. When Alexa is in a family home, its lack of multi-party capabilities is apparent [60], but this becomes a critical limitation when deploying social robots in public spaces [2, 3, 4, 5, 6, 7], where multi-party conversations (MPCs), involving people talking to both the robot and each other, commonly occur.
Tasks that are typically trivial in the dyadic setting become considerably more complex when conversing with multiple users [61, 62]: (1) The speaker is no longer simply the other person, so the meaning of the dialogue depends on recognising who said each utterance (see (A) in Figure 4); (2) addressee recognition is similarly more complicated as people can address each other, the robot, and groups of individuals; and (3) response generation depends on who said what to whom, relying on the semantic content and surrounding multi-party context. To make things even more difficult, MPCs provide additional unique challenges that are underexplored. Dyadic SDSs must identify and answer the user’s goals to be practically useful. In MPCs, users can provide another person’s goal (see (B) in Figure 4), answer each other’s goals (see (C) in Figure 4), and even share goals (see (D) in Figure 4, [63]). In SPRING, we have established the task of multi-party goal-tracking [53].
The conversational system [64, 21] has been iteratively improved through regular user tests and interviews with patients visiting the Broca gerontology day-care hospital. The initial system [7] was developed before recent LLM advances (such as ChatGPT), relying on a ‘traditional’ modular architecture based upon Alana V2 [65, 66]. As patients were usually accompanied by a companion, the lack of multi-party capabilities proved problematic. The system interrupted users since it responded to every turn, not allowing them to talk to each other at any point. We therefore designed and ran a multi-party data collection in a wizard-of-oz setup [67, 53] (see Section 3) and have used this data to motivate and evaluate our current SDS. Not only is this new system multi-party and multimodal, it improves QA accuracy, improves accessibility for people with dementia [21], and enables added functionality. For example, where previously we had to specifically design the system to tell jokes and run entertaining quizzes [68, 54], LLMs can now handle this inherently due to their general knowledge. Most importantly, both users and the hospital staff have reported that the user experience has improved dramatically (see Section 3).
2.8 Non-verbal Behaviour Generation
Gaze, gestures, and navigation of ARI are controlled by two modules: The behaviour manager interfaces a high-level planner with the conversational system to choose appropriate actions during an interaction. The behaviour generation module provides and executes the actions.
Behaviour Manager
The behaviour manager handles the interface between the conversation manager and the behaviour generator, and is responsible for deciding appropriate high-level social actions and managing the interactions.
To enable situated interactions with multiple users at the same time, the non-verbal behaviour system must interface the social perception signals (presented in Sections 2.3, 2.4, 2.5 and 2.6) with the multi-party conversational manager (presented in Section 2.7). The social decisions that the non-verbal behaviour system must make include: detecting people’s arrival and departure, determining which person in the scene wants or requires the robot’s attention, deciding when to move and when to start an approach or guidance action, deciding who to look at, and switching the focus of attention during multi-party interactions.
The behaviour manager is implemented as an abstract controller for the ROS Petri-Net Planner (PNP) [69]. A Petri-Net is a mathematical model for state machines. The PNP is fully implemented in ROS and natively exploits the ROS infrastructure. It consists of the Petri-Net plan server, the knowledge-base (KB), and a set of ROS action servers for the Petri-Net plans (recipes). The implementation supports automatic generation of Petri-Net machines and handles concurrent execution of multiple Petri-Net machines. Through this controller, the behaviour manager can start and stop tasks in Petri-Net plans and keep track of them, manage the currently available and running Petri-Nets, and send and receive information to and from a specific net. The other major functionality of the controller interface is to exchange data between the different plans and the social state representation.
Through the interface with the social perception components (described in the previous sections), the behaviour manager populates and maintains the planner’s knowledge base with information about the interaction, the social state, and the people engaged in the interaction or conversation with the robot. In this way, the running plans can exchange information with the social state representations provided by the social scene understanding components. The behaviour manager also interfaces with the non-verbal behaviour generation through the ROS action servers that control low-level action execution (gestures, navigation) for a pertinent social interaction.
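As an illustration of why Petri-Nets suit concurrent behaviours, the following minimal sketch implements the basic token semantics: a transition is enabled when all its input places hold a token, and firing it moves tokens to its output places. This is not the PNP API; the `PetriNet` class and the place and transition names are hypothetical.

```python
from collections import Counter


class PetriNet:
    """Minimal place/transition net: the marking maps places to token counts."""

    def __init__(self, marking):
        self.marking = Counter(marking)
        self.transitions = {}  # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (list(inputs), list(outputs))

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        needed = Counter(inputs)
        return all(self.marking[place] >= n for place, n in needed.items())

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        for place in inputs:
            self.marking[place] -= 1
        for place in outputs:
            self.marking[place] += 1


# Hypothetical plan: once a person is detected, gazing and greeting run
# concurrently, and the robot is "engaged" only when both have finished.
net = PetriNet({"idle": 1})
net.add_transition("person_detected", ["idle"], ["gazing", "greeting"])
net.add_transition("both_done", ["gazing", "greeting"], ["engaged"])

net.fire("person_detected")   # fork: one token becomes two concurrent activities
net.fire("both_done")         # join: needs a token in both places
print(dict(+net.marking))     # unary plus drops places with zero tokens
```

The fork and join in this example are the feature that a plain finite state machine lacks: two places can hold tokens at the same time, which is how a plan can express, for instance, gazing at a person while speaking.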
Behaviour Generator
The behaviour generator provides mainly actions and behaviours for two aspects: First, it controls ARI’s arms, head, and eyes to generate gestures such as waving or pointing, and its gaze. The gesture and gaze controllers are hard-coded behaviours that can be called during dialogues.
Second, it provides a human-aware navigation controller to move among humans and join them for interactions. The controller faces two crucial challenges. First, it needs to move safely and reliably because it operates around vulnerable persons. Second, it has to adhere to social norms, for instance, respecting conversational groups and not moving through them. In contrast to existing methods [70, 71, 72], we opted to combine a Model Predictive Controller [73] with an explicit social space model of humans and groups [74]; we call the resulting controller the Social Navigation Controller (SNC) [75]. The SNC allows planning ARI’s trajectory precisely over a future time horizon while adhering to constraints. This provides a higher level of safety compared to other planning methods such as Dynamic Window Approaches [76] or end-to-end controllers learned, for example, by Reinforcement Learning [77]. Additionally, having an explicit social space model allows us to understand and tune it reliably compared to learned black-box models.
The SNC utilises a dynamic forward model of ARI to predict its trajectory over a 2 s time horizon, considering its current state (2D position in the map) and the anticipated motor outputs (linear and angular velocity). By optimising a loss function that captures the desired performance and constraints, the SNC computes an optimal control trajectory for the motor outputs. The first control action of this trajectory is applied, and the optimisation process is repeated in the next time step, incorporating updated measurements and adjusting the control actions. The loss function incorporates constraints on maximum velocities and a cost function. The costs are based on the occupancy map for obstacles (Section 2.2) and the position of humans and groups (Section 2.3). Costs are high for areas close to objects and for interfering with the social space of humans and groups. Social spaces are modelled around the position of humans and group centres by 2D Gaussian-like functions that are conditioned on their orientation, movement direction, and status (e.g. seated vs standing) following [74] (Figure 5). The SNC avoids navigating through areas that incur a high cost, so that it effectively avoids obstacles and refrains from moving through groups or getting too close to humans. Social spaces also define the position at which to join a person or group to start a dialogue: following the approach in [76], the SNC identifies the closest point to a group centre or a single person that interferes least with their social space.
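The receding-horizon loop described above can be sketched as follows. This is a simplified illustration and not the SNC itself: the optimiser is replaced by random shooting (sampling control sequences and keeping the cheapest), the occupancy-map cost is omitted, the social space is a plain asymmetric Gaussian rather than the full model of [74], and all numeric parameters (`DT`, `V_MAX`, `W_MAX`, the spreads and the cost weight) are assumed values.

```python
import math
import random

DT = 0.2                  # control period in seconds (assumed)
HORIZON = 10              # 10 steps x 0.2 s = 2 s look-ahead, as in the text
V_MAX, W_MAX = 0.5, 1.0   # linear / angular velocity limits (assumed)


def step(state, v, w):
    """Unicycle forward model; state is (x, y, heading)."""
    x, y, th = state
    return (x + v * math.cos(th) * DT, y + v * math.sin(th) * DT, th + w * DT)


def social_cost(px, py, human):
    """Gaussian-like cost that extends further in front of a human than behind."""
    hx, hy, hth = human
    dx, dy = px - hx, py - hy
    # Coordinates of the point in the human's frame (front = along the heading).
    front = dx * math.cos(hth) + dy * math.sin(hth)
    side = -dx * math.sin(hth) + dy * math.cos(hth)
    sigma_front = 1.2 if front > 0 else 0.6   # assumed spreads in metres
    sigma_side = 0.8
    return math.exp(-0.5 * ((front / sigma_front) ** 2 + (side / sigma_side) ** 2))


def rollout_cost(state, controls, goal, humans):
    """Social cost accumulated along the predicted path plus final goal distance."""
    cost = 0.0
    for v, w in controls:
        state = step(state, v, w)
        cost += 10.0 * sum(social_cost(state[0], state[1], h) for h in humans)
    return cost + math.hypot(goal[0] - state[0], goal[1] - state[1])


def plan(state, goal, humans, rng, samples=200):
    """Random-shooting stand-in for the optimiser; sampling enforces the limits."""
    best, best_cost = None, float("inf")
    for _ in range(samples):
        controls = [(rng.uniform(0.0, V_MAX), rng.uniform(-W_MAX, W_MAX))
                    for _ in range(HORIZON)]
        cost = rollout_cost(state, controls, goal, humans)
        if cost < best_cost:
            best, best_cost = controls, cost
    return best


rng = random.Random(0)
state, goal = (0.0, 0.0, 0.0), (4.0, 0.0)
humans = [(2.0, 0.0, math.pi)]   # one person between robot and goal, facing the robot
for _ in range(25):              # receding horizon: re-plan at every control step
    v, w = plan(state, goal, humans, rng)[0]   # apply only the first action
    state = step(state, v, w)
print(f"position after 5 s: ({state[0]:.2f}, {state[1]:.2f})")
```

Only the first action of each planned sequence is executed before re-planning, which is what lets the controller react to updated positions of humans. A gradient-based or otherwise structured solver would normally replace the random sampling used here.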
2.9 Experimenter Interface
During the experimentation, the robot will not be accompanied by computer science researchers or engineers, but by medical researchers and personnel. In order to conduct the experimental session, an appropriate experimenter-interface was developed. Through a conception-trial-update process, we converged on an interface design that is a trade-off between usability and controllability. The interface (Figure 6) is implemented for use on a dedicated tablet.
From a technical point of view, the interface connects to its server that runs on the external computer. This server communicates with the modules described above via ROS, interacts with the tablet via web technologies, and works as a gateway between these two parts. Additionally, it allows the experimenter to control the data collection. Via a web browser and the tactile screen, the experimenter can easily check the status of the robot and external computer, conduct the experiment, control the interaction as well as generate quick annotations of the interaction that are synchronised with the rest of the ROS messages. These annotations enable the project engineers to measure error rates, and diagnose bugs more easily. More details about the experimenter interface and other user interfaces (e.g. for data collection) can be found in [78].
3 Experiments
This section describes the experiments to validate the acceptability and usability of the introduced multi-modal conversational system on the ARI robot. The validation was conducted with real patients and companions at the Broca gerontology day-care hospital. The experimental protocol, measures, procedure, obtained results, as well as associated discussion and failure cases are described.
3.1 Experimental Protocol
Participants
The study was carried out between May 2023 and January 2024. Two groups of participants were recruited for this study: elderly outpatients from a geriatric hospital and their accompanying persons. These participants were recruited from the Broca gerontology day-care hospital in Paris. The inclusion criteria for patients were: (1) to be aged 60 or over, (2) to have a good understanding of the French language, and (3) not to have severe cognitive impairment (MMSE 10, see [79]) or neuropsychiatric symptoms (delirium, hallucination). The inclusion criteria for accompanying persons were: (1) to be of legal age and (2) to have a good understanding of the French language. All participants were required to give their consent to participate in the study.
Ethical Approval and Data Availability Statement
This research was fully supported by the H2020 SPRING Project funded by the European Commission. As such, there are no conflicts of interest to be disclosed. Regarding research involving humans and informed consent, the ethics committees consulted and the approvals obtained are detailed below.
The study was approved by the French National Ethical Review Board Comité National de Protection des Personnes, CPP Ouest II, Maison de la Recherche Clinique-CHU Angers (approval number: 2021/20), the local ethics committee of the University of Paris (Comité d’Ethique de la Recherche CER-N IRB: 00012020-108), and was compliant with the General Data Protection Regulation (DPO: 20210114153645 register AP-HP). Additional information can be found in [80]. In order to guarantee universal healthcare access, the Ethical Approval restricted the experiments in the hospital to an auxiliary room, instead of the main waiting room, so that patients unwilling to interact with the robot would not feel left aside or unwelcome. The auxiliary room is small ( m), making the validation of the robot self-localisation and navigation skills meaningless. Both skills have been validated in our respective laboratories and, when possible, in the Broca hospital without patients.
The Ethical Approval also restricted sharing the data collected in the hospital strictly to the partners of the SPRING project, and therefore this data can be neither shared publicly nor shared individually with anyone outside the project. The pipeline used to transfer data between the partners of the project was validated by Inria’s DPO and Chief Security Officer. However, all the reports of the project describing the main findings are publicly available (https://spring-h2020.eu/results/), as well as the code of the software modules (https://gitlab.inria.fr/spring).
Material
The experiments were conducted using the ARI robotic platform and the software architecture described in Section 2, as well as a dedicated external computing server. The external computing server has one NVIDIA RTX A6000 GPU with 48 GB of VRAM. Additionally, a dedicated secured 4G mobile connection is used to connect to a remote server running the LLM of the conversational system (Section 2.7). This remote server has four NVIDIA GeForce RTX 2080 Ti GPUs. The collected experimental data is securely stored on a Synology NAS server. An experimenter tablet was available to monitor the status of the robot, using the experimenter interface described in Section 2.9, and to allow the experimenter to stop the robot and the interaction if needed. An external camera was used to record the overall scene for later annotation and analysis.
3.2 Performance Measures
In this study, the user experience was assessed by two standardised scales. The first quantitative measure is the acceptability of the robot, assessed with the Acceptability E-scale (AES) [81]. It is designed to measure the subjective acceptability of a system using 6 items, resulting in a global score that ranges from 6 to 30. For consumer ready-to-use products, the acceptability cut-off score is 25/30. The second quantitative measure is the usability of the robot, assessed with the System Usability Scale (SUS) [82]. It is a 10-item scale designed to assess the overall user-friendliness of a system, and it generates an overall score out of 100, where a higher score indicates better user-friendliness. For consumer ready-to-use products, the usability cut-off score is 72/100. Statistical differences were assessed using independent Student t-tests.
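For readers unfamiliar with these scales, the sketch below shows how questionnaire answers turn into the two scores. The SUS function follows the standard SUS scoring rule (odd items contribute the rating minus 1, even items 5 minus the rating, and the sum is multiplied by 2.5). The AES function assumes a plain sum of six ratings on a 1 to 5 scale, which is consistent with the 6 to 30 range stated above but is an assumption about the exact scoring; the example answers are made up.

```python
def sus_score(responses):
    """Standard SUS scoring for ten ratings on a 1-5 scale; result is in 0-100."""
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten ratings between 1 and 5")
    total = 0
    for item, rating in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5


def aes_score(responses):
    """Assumed AES scoring: sum of six ratings on a 1-5 scale, giving 6-30."""
    if len(responses) != 6 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("AES needs six ratings between 1 and 5")
    return sum(responses)


AES_CUTOFF, SUS_CUTOFF = 25, 72   # cut-off scores quoted in the text

sus_answers = [4, 2, 4, 2, 3, 3, 4, 2, 3, 3]   # made-up participant
aes_answers = [4, 4, 3, 4, 5, 4]               # made-up participant

print("SUS:", sus_score(sus_answers), "meets cut-off:", sus_score(sus_answers) >= SUS_CUTOFF)
print("AES:", aes_score(aes_answers), "meets cut-off:", aes_score(aes_answers) >= AES_CUTOFF)
```

Because even-numbered SUS items are reverse-scored, a participant who answers 3 to every item obtains 50 rather than 60, which is worth keeping in mind when reading raw questionnaires.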
3.3 Procedure
The participant recruitment process began two weeks before the day of the experiment. A researcher used the management software of the day-care hospital [83] to identify patients who met the eligibility criteria. Once this information was validated, the researcher contacted the participant by telephone to present the objectives of the study and, if the participant was willing to take part in the tests, to arrange a time slot on the day of their medical visit to the hospital. The researcher then sent the information letter detailing the study by postal mail, together with a reminder of the appointment.
On the day of the test, the researcher checked that the tools (robot, tablet, camera) were working properly. Once the participants arrived, the researcher reminded them of the objectives of the study and the evaluation procedures. Participants could be escorted by one accompanying person. If they confirmed their agreement to take part in the experiment, they were given a consent form to sign. The participant(s) were then invited to stand in front of the robot to begin the interaction with it. Two researchers led the activity: a facilitator, and an operator who controlled the tablet and completed the observation grid. The facilitator’s role was to direct the test session and provide assistance or additional information to participants when needed.
3.4 Experimental Waves
Two separate waves of experimentation were conducted during this study, corresponding to two states of the architecture. The First wave (May 2023 – July 2023) used the initial version of the robot dialogue module, with an off-the-shelf automatic speech recognition (ASR) model and almost no multi-party capabilities. In the Second wave (Sept. 2023 – Jan. 2024), several improvements were made to the system: the dialogue management included large language models, the ASR was fine-tuned, and the experimenter interface was updated, all with the aim of improving the user experience. The improvements were motivated by the participants’ feedback from the First wave.
3.5 Results
Table 1: Participants per wave and user group.
Wave | User group | Count (F/M) | Age in years, mean (±SD) |
---|---|---|---|
First | Patients | 15 (10/5) | 79.2 (±6.62) |
First | Companions | 5 (4/1) | 69.2 (±15.82) |
First | Overall | 20 (14/6) | 76.7 (±10.23) |
Second | Patients | 33 (14/9) | 78.6 (±8.08) |
Second | Companions | 10 (6/4) | 56.7 (±19.43) |
Second | Overall | 43 (20/13) | 73.5 (±14.39) |
The recruitment efforts prior to the two waves led to two sets of participants with slightly different age profiles for the patients, and wider age differences for the companions (see Table 1). The study counts a total of 63 participants with an overall average age of 74.5 (±12.31) years.
We also report the average and standard deviation of the AES and SUS scores (Table 2), split by wave and user group, as well as the overall score per wave.
We also asked the participants which “use-case” they thought would be most useful (Figure 7), among the following choices: (i) reception and welcoming, (ii) promoting social interaction without health risks, (iii) help in preparing for consultations, (iv) orientation and guidance, and (v) entertainment.
3.6 Discussion
Looking at the average AES and SUS results (Table 2), we observe an improvement between the First and Second wave for both metrics and for both the patient and companion groups. A significant difference was found for the AES, as the overall scores increased between the first (mean 15.5, SD 5.88) and the second wave (mean 20.8, SD 5.20). This significant difference was only found in patients and not in companions. However, no significant differences were observed for the SUS between the first and the second wave of experiments, either in patients or in companions. While there is an overall trend of improvement, the trend for the companion group is less pronounced than for the patient group.
Table 2: AES and SUS scores, mean (±SD), per wave and user group.
Metric | User group | First wave | Second wave |
---|---|---|---|
AES | Patients | 14.7 (±5.73) | 20.7 (±6.25) |
AES | Companions | 18.0 (±4.64) | 21.0 (±3.30) |
AES | Overall | 15.5 (±5.88) | 20.8 (±5.20) |
SUS | Patients | 45.5 (±20.21) | 56.8 (±12.63) |
SUS | Companions | 55.0 (±28.28) | 57.5 (±10.40) |
SUS | Overall | 47.9 (±24.18) | 57.0 (±22.88) |
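As a reminder of how an independent Student t-test compares two waves, the sketch below computes the pooled-variance t statistic from summary statistics (mean, standard deviation, group size). It is an illustration only and not the authors’ analysis: it is fed with the rounded overall AES values of Table 2 and the group sizes of Table 1, so it need not reproduce the statistics obtained from the raw data. The function name `student_t` is hypothetical.

```python
import math


def student_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent two-sample Student t statistic with pooled variance."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean2 - mean1) / standard_error, df


# Overall AES: First wave (n = 20) versus Second wave (n = 43), rounded table values.
t, df = student_t(15.5, 5.88, 20, 20.8, 5.20, 43)
print(f"t({df}) = {t:.2f}")
```

Turning the statistic into a p-value requires the cumulative distribution of the t distribution, which the Python standard library does not provide. With SciPy installed, `scipy.stats.ttest_ind_from_stats` performs the whole test from the same six numbers.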
We explain this improvement in acceptability and usability by the technical improvements we made to the software architecture (since the robot’s appearance did not change between waves): namely, the fine-tuning of the ASR module and the inclusion of LLMs in the conversation manager. These two major modifications, together with bug fixes, allowed a more natural interaction with the users since (i) ARI understood better what the participants were saying and (ii) ARI was able to answer a wider range of questions with reasonable (although not always exact) answers. In particular, the participants’ feedback reported: enjoyment in using the robot, usefulness, acceptability of the robot’s reaction time (talking) and overall satisfaction with the robot.
It is also interesting to observe, when considering the two groups separately, that the companions tend to provide more positive AES and SUS feedback than the patients. For the time being, we have not identified a reason that justifies this difference.
3.7 Challenges and Failure Cases
One of the major challenges we encountered during our project concerned the recruitment of patients for the tests. These challenges can be attributed to a number of factors, including organisational aspects, last-minute cancellations and refusal to interact with the robot.
Some factors had a significant impact on the organisation of the experiments, for example, patients who refused to take part in the experiments, and coordination difficulties among the various hospital stakeholders (patients arriving late, delayed consultations, or some experiments lasting longer than planned and delaying the following appointments). Finally, some participants refused to interact with the ARI robot during the first physical encounter, which directly compromised the interaction and made the experiment impossible. The reasons for this refusal were diverse, ranging from technological apprehension linked to the size of the robot to ethical concerns.
During the First wave, some participants felt frustration specifically due to the limitations in terms of conversational capabilities, whether these were related to the performance of the ASR or the topic restrictions of the dialogue manager. Since these were identified as the two main blocking points, we paid attention to enhancing these specific skills, leading to a clear improvement in ARI’s acceptability and usability in the Second wave.
4 Conclusion and Open Questions
In this paper, we have investigated the acceptability and usability of a social robot in gerontological healthcare. Compared to previous research [84, 85], our study is a step forward for three main reasons. First, the acceptability and usability are evaluated by patients and companions during their regular visits to a day-care hospital, which differs from the more common scenarios of nursing facilities and private homes. Second, the platform used is a full-sized humanoid robot, very different in size and appearance from pet-like and small-sized humanoid robots. Finally, the platform enabled multi-modal conversational interaction, which is again uncommon in previous studies. The combination of these three characteristics makes this study the first of its kind, and we hope it opens the door to multiple follow-ups and a wider evaluation.
The paper describes the overall robotic and software architecture and provides details of the various modules and methods used for the experiments. We also discussed the materials, methods, and recruitment process, and provided technical and experimental details. After two experimental waves, we can provide an assessment of the acceptability and usability of the developed technology. The most important result of these human-robot interaction experiments is that the improvements (ASR robustness, dialogue flexibility) had a positive effect on how the system is perceived by the patients and companions in the Broca gerontology day-care hospital.
The study and associated technology present several limitations. First, given that all participants were requested to sign a consent form, the robot was never facing people unwilling to interact, and therefore we were not able to test its ability to properly understand the lack of interaction interest and execute consequent actions (e.g. leaving the person alone). Second, from a technical perspective, the experiments require dedicated computational power, which might limit the deployment of such technology. The question of how to provide state-of-the-art perception and action skills for a social robot with limited on-board computational resources is widely open, and not easy to address. Third, other social skills such as the ability to hold conversations within groups (multi-party dialogue) or navigating while accounting for the presence of humans (social navigation) were not evaluated in this study. Fourth, it would be interesting to run the same evaluation with medical personnel and understand if there are important perception differences in terms of acceptability and usability with respect to patients and companions. Finally, beyond social skills, it would be interesting to evaluate the capacity of the robot to be connected to the information system of the hospital for logistic purposes (e.g. reminding appointments, rescheduling them, providing information about the doctor’s office or name), but this poses important ethical and security issues that have to be very carefully addressed. The proper evaluation of how these capabilities are seen and welcomed by the patients, companions and medical personnel is crucial to understand its impact on the everyday life of the hospital.
Acknowledgments
This research was funded by the EU H2020 program under grant agreement no. 871245 (https://spring-h2020.eu/). We would also like to thank our anonymous reviewers for their time and valuable feedback.
Author contributions
The complete list of author contributions is detailed in Table 3 at the end of the document, in which we report the participation of each author according to the items in the CRediT taxonomy (https://credit.niso.org/). The first four authors contributed equally to the paper: the first is the Coordinator of the SPRING project and corresponding author, and the other three are ordered alphabetically. The rest of the authors are also ordered alphabetically.
References
Table 3: Author contributions according to the CRediT taxonomy.
Author | Affiliation | Conceptualization | Data curation | Formal Analysis | Funding acquisition | Investigation | Methodology | Project administration | Resources | Software | Supervision | Validation | Visualization | Writing – original draft | Writing – review & editing |
Xavier Alameda-Pineda | INRIA | x | x | - | x | - | x | x | - | - | x | - | - | x | x |
Angus Addlesee | HWU | x | x | x | - | x | x | - | - | x | - | - | x | x | x |
Daniel Hernández García | HWU | x | x | - | - | x | x | - | - | x | - | x | x | x | x |
Chris Reinke | INRIA | x | - | x | - | - | x | - | - | x | x | - | - | x | x |
Soraya Arias | INRIA | - | - | - | - | - | - | - | x | - | - | - | - | - | - |
Federica Arrigoni | UNITN | x | - | - | - | - | x | - | - | - | - | - | - | - | - |
Alex Auternaud | INRIA | - | - | - | - | - | - | - | x | x | - | x | - | - | - |
Lauriane Blavette | AP-HP | - | - | x | - | x | - | - | - | - | - | - | - | x | - |
Cigdem Beyan | UNITN | x | - | - | - | - | x | - | - | x | - | x | - | x | x |
Luis Gomez Camara | INRIA | - | - | - | - | - | x | - | x | x | - | x | - | - | - |
Ohad Cohen | BIU | - | - | - | - | x | - | - | - | x | - | - | - | x | - |
Alessandro Conti | UNITN | - | - | - | - | - | x | - | - | x | - | x | - | - | - |
Sébastien Dacunha | AP-HP | - | - | - | - | - | - | - | - | - | - | x | - | - | x |
Christian Dondrup | HWU | x | - | - | x | - | x | x | x | x | x | - | - | - | - |
Yoav Ellinson | BIU | - | - | - | - | x | x | - | - | x | - | - | - | x | - |
Francesco Ferro | PAL | - | - | - | x | - | - | x | x | - | - | - | - | - | - |
Sharon Gannot | BIU | x | - | - | x | x | x | x | - | - | x | x | - | - | x |
Florian Gras | ERM | x | x | - | - | - | - | - | - | x | - | x | - | - | - |
Nancie Gunson | HWU | x | x | x | - | x | - | - | - | x | - | x | - | - | - |
Radu Horaud | INRIA | x | - | - | x | - | - | - | - | - | - | - | - | - | - |
Moreno D’Incà | UNITN | - | - | - | - | - | - | - | - | x | - | x | - | - | - |
Imad Kimouche | ERM | - | x | - | - | - | - | - | - | x | - | x | - | - | - |
Séverin Lemaignan | PAL | x | - | - | - | - | x | - | - | x | x | - | - | x | - |
Oliver Lemon | HWU | x | - | - | x | x | x | x | x | - | x | x | - | - | x |
Cyril Liotard | ERM | x | - | - | x | - | - | - | - | - | x | - | - | - | - |
Luca Marchionni | PAL | x | - | - | x | - | - | - | x | - | - | - | - | - | - |
Mordehay Moradi | BIU | - | - | - | - | x | - | - | - | x | - | - | - | x | - |
Tomas Pajdla | CVUT | x | - | x | x | - | x | x | x | - | x | - | - | - | x |
Maribel Pino | AP-HP | - | - | - | - | - | x | - | - | - | x | x | - | - | x |
Michal Polic | CVUT | x | - | x | - | x | x | x | - | x | x | x | x | x | - |
Matthieu Py | INRIA | - | - | - | - | - | - | x | - | - | - | - | - | - | x |
Ariel Rado | BIU | - | - | - | - | x | - | - | - | x | - | - | - | x | - |
Bin Ren | UNITN | - | - | - | - | - | - | - | - | x | - | x | - | - | - |
Elisa Ricci | UNITN | x | - | - | x | - | - | x | - | - | x | - | - | - | - |
Anne-Sophie Rigaud | AP-HP | - | - | - | - | - | x | - | - | - | x | x | - | - | x |
Paolo Rota | UNITN | x | - | - | - | - | x | - | - | - | - | - | - | - | - |
Marta Romeo | HWU | x | - | - | - | - | x | - | - | x | - | x | - | - | - |
Nicu Sebe | UNITN | - | - | - | x | - | - | x | - | - | x | - | - | - | - |
Weronika Sieińska | HWU | - | x | x | - | - | - | - | - | x | - | - | - | - | - |
Pinchas Tandeitnik | BIU | - | - | - | - | x | x | x | - | - | x | x | - | - | x |
Francesco Tonini | UNITN | - | - | - | - | - | x | - | - | x | - | x | - | x | x |
Nicolas Turro | INRIA | - | - | - | - | - | - | - | x | x | - | x | - | - | - |
Timothée Wintz | INRIA | x | - | - | - | x | x | - | - | x | x | - | - | - | - |
Yanchao Yu | HWU | x | - | - | - | - | - | - | - | x | - | x | - | - | - |