A Multi-Modal Explainability Approach for Human-Aware Robots in Multi-Party Conversation

Iveta Bečková
Faculty of Mathematics, Physics and Informatics
Comenius University Bratislava
Bratislava, 842 48, Slovak Republic

Štefan Pócoš
Faculty of Mathematics, Physics and Informatics
Comenius University Bratislava
Bratislava, 842 48, Slovak Republic

Giulia Belgiovine
CONTACT Unit
Italian Institute of Technology
Genova, 16152, Italy

Marco Matarese
CONTACT Unit
Italian Institute of Technology
Genova, 16152, Italy

Alessandra Sciutti
CONTACT Unit
Italian Institute of Technology
Genova, 16152, Italy

Carlo Mazzola
CONTACT Unit
Italian Institute of Technology
Genova, 16152, Italy

Abstract

Addressee estimation (understanding to whom somebody is talking) is a fundamental task for human activity recognition in multi-party conversation scenarios. In human-robot interaction specifically, it is crucial for enabling social robots to participate in such interactive contexts. However, it is usually implemented as a binary classification task, which restricts the robot to estimating whether it was addressed and limits its interactive skills. For a social robot to gain the trust of humans, it is also important to manifest a certain level of transparency and explainability. Explainable artificial intelligence thus plays a significant role in current machine learning applications and models, which are expected to provide explanations for their decisions in addition to excellent performance. In our work, we a) present an addressee estimation model with improved performance in comparison with the previous SOTA; b) further modify this model to include inherently explainable attention-based segments; c) implement the explainable addressee estimation as part of a modular cognitive architecture for multi-party conversation in an iCub robot; d) propose several ways to incorporate explainability and transparency in the aforementioned architecture; and e) perform a pilot user study to analyze the effect of various explanations on how human participants perceive the robot.

Keywords Human Activity Recognition $\cdot$ Explainable AI $\cdot$ Transparency $\cdot$ Attention $\cdot$ Human-Robot Interaction $\cdot$ Addressee Estimation

1 Introduction

The endeavor to decode human intentions and behavior with advanced computer vision techniques is central to the challenge of developing human-aware technologies that can truly understand and support humans in various everyday scenarios. This is particularly evident in robotics, where the development of embodied agents capable of autonomous and meaningful interaction with humans is facilitated by human activity recognition algorithms, granting them a level of human awareness. In humans, the recognition of intentions related to social interaction is not limited to high-level reasoning abilities but anchors its roots in the visual system (McMahon and Isik, 2023). It follows that visual information is crucial to processing and properly understanding social dynamics, and computer vision models represent an integral component of robots’ socio-cognitive abilities.

As the hallmark of human-centered technology, the development of interactive robots is increasingly focused on fostering human trust in artificial systems. Therefore, accuracy and robustness of performance are essential but not the only indicators of a system’s reliability. Explainability and transparency are two other critical aspects of designing and evaluating reliable interactive robots (Wortham et al., 2016). Both allude to the understandability of the system: transparency refers more to the visibility of underlying processes, leading to a reduction of ambiguity regarding a behavior (Selkowitz et al., 2017), whereas explainability is related to the capability of a system to exhibit the reasons behind its outputs, decisions, or behaviors (Ciatto et al., 2020; Miller, 2019). These two qualities are desirable from a dual perspective: that of developers, who need a deep comprehension of the robot to design and assess its functioning, and that of users, who should be able to interact intuitively with the artificial system (Sciutti et al., 2018).

In the context of Human-Robot Interaction (HRI), communication is a specific type of interaction. Following Jakobson (1981), communication involves the exchange of messages between an addresser (who sends the message) and an addressee (who is intended to receive it). In the case of spoken messages, the communication is verbal but usually comprises non-verbal elements such as gaze, gestures, and poses, which are grasped via vision and are often necessary to contextualize the message properly and to address it to the correct agent (Skantze, 2021). Even though HRI studies rarely go beyond dyadic interactions, the final goal is often to bring robots into social environments, where they must deal with more than one person and, to this end, be aware of the basic social cues ruling multi-party conversations.

Addressee Estimation (AE), i.e., the ability to understand to whom a speaker is directing their utterance (Skantze, 2021), is a specific case of human activity and intention recognition. Speaker identification and the correct conversion of speech into text are necessary but not sufficient elements to engage in multi-party conversations. Thus, AE has become a key factor in HRI. As humans, we deeply exploit non-verbal behavior to indicate whom we are addressing (Auer, 2018; Ishii et al., 2016): an ability that robots could greatly take advantage of to engage in conversations more smoothly. Without it, the understanding of the addressee would exclusively depend on the context of the dialogue or specific keywords (such as the name of the addressee), leading to a loss of fluency in the conversation and an increased likelihood of errors.

Taking into account explainability in the context of human-activity-aware robots requires considering the concept from different perspectives: not only the generation of explanations within the architecture and models controlling the robot’s behavior but also their communication to and reception by users. Hence, this work seeks to bind together such diverse points of view that cannot be examined separately. Starting from this broader approach and with the final aim to endow a social robot with explainable addressee estimation skills for multi-party conversation, the contribution of this work is divided into two intertwined steps, whose methodology is described in Section 3. Specifically, in Subsection 3.1, we design and train an attention-based neural network to optimize a former AE model (Mazzola et al., 2023) while extracting explanations at different stages of the inference. In Subsection 3.2, we deploy the newly developed explainable model in a modular robotic architecture to enable the iCub robot to engage in multi-party conversation, implement a multi-modal system to provide real-time explanations of its behavior, and explore the users’ reception of different modalities of explanation (verbal, embodied, visual) in a user study.

2 Related work

2.1 Attention-based and explainable neural networks

In the context of machine learning, the development of explainable models is steadily rising (Barredo Arrieta et al., 2020). The earliest approaches primarily focused on the inherent form of explainability, i.e., employing a model that is straightforward enough for people to understand its behavior and decisions. This slowly changed as deep learning models grew increasingly powerful, eventually substantially surpassing the accuracy of simple models (Krizhevsky et al., 2012). Deep learning models are powerful enough to achieve superhuman performance in certain tasks (He et al., 2015), yet they often do not provide any reasoning behind their decisions. Thus, the explainability of deep models started to play a crucial role, mainly in their deployment in critical applications.

Two widely researched and popular methods for interpreting deep learning models are SHAP (Lundberg and Lee, 2017) and LIME (Ribeiro et al., 2016). These and many similar methods work by training a surrogate, often much simpler model that approximates the black-box model to produce similar output while being inherently explainable. On the other hand, there is often a need to generate precise explanations for the image classification domain. Therefore, saliency maps, i.e., the importance of image regions for predictions, are often investigated (Simonyan et al., 2014; Zhou et al., 2016; Selvaraju et al., 2017). Using these methods, it is possible to peek inside the black-box model processing an image. A downside of such approaches is their dependence on a specific type of architecture (smooth gradients, convolutions, etc.); otherwise, they are often unable to produce satisfying results.

A branch of research, currently setting the SOTA in the majority of tasks, was initiated by designing the transformer architecture (Vaswani et al., 2017) followed by its adaptation for the image domain (Dosovitskiy et al., 2021). Thanks to their attention mechanism, transformers are also often considered more interpretable (Kashefi et al., 2023), as extracting the attention weights during the forward pass is possible. Thus, in this work, we leverage the idea of attention in general (Shen et al., 2017; Niu et al., 2021) and the many ways of implementing it in the specific downstream task of AE where, to our knowledge, explainability was never taken into account before.

2.2 Multi-party conversation in HRI

The management of multi-party conversations requires robots to be endowed with human activity recognition capabilities; this is even more challenging when they have to deal with multiple humans at the same time. Several tasks need to be solved to this aim, beyond the sound detection and natural language understanding that are essential to receiving the message. Speaker recognition and diarization, turn-taking, and addressee estimation are crucial problems that need to be tackled to assess, beyond the "what" of the message, the "who" and the "to whom" of each utterance (Gu et al., 2022). Endowed with multiple sensors, robots can solve problems related to multi-party conversations with a multi-modal approach, recognizing the scene and human intentions via audio, vision, and, additionally, information coming from the conversation context (Bilac et al., 2017; Dhaussy et al., 2023; Bae and Bennett, 2023; Addlesee et al., 2024).

Unlike dyadic scenarios, multi-party conversations present an additional problem once the speaker yields the turn: who is entitled to take it? If the conversation must proceed, a correct estimation of the utterance’s addressee(s) usually implies who should take the turn. In the majority of cases, works tackling AE during the interaction with artificial agents designed rule-based algorithms (Richter et al., 2016), machine learning models (van Turnhout et al., 2005; Huang et al., 2011; Sheikhi et al., 2013), deep neural networks (Mazzola et al., 2023; Tesema et al., 2023), or large language model (LLM) based techniques (Addlesee et al., 2024) grounded on multiple modalities and features in order to cope with the ambiguity and unpredictability of human behaviors in real-time interactions. Keywords uttered by speakers, their gaze, pose, para-verbal cues, and contextual information have all been shown to be useful for AE (for a review, see Skantze (2021)). AE may improve robots’ conversational abilities by providing information not only about when to intervene but also about how to do so. However, most current models predict only whether the robot was addressed, in a binary way (van Turnhout et al., 2005; Huang et al., 2011; Sheikhi et al., 2013; Tesema et al., 2023; Addlesee et al., 2024). Binary estimation does not identify the addressee (if it is other than the robot) and is therefore insufficient for the robot to engage effectively in conversations with more than two humans or without pre-determined knowledge about other users’ presence.

To resolve this limitation, Mazzola et al. (2023) adopted a deep-learning approach to estimate the direction of the addressee from the robot’s perspective, training the model on HRI data collected with a Nao robot (Jayagopi et al., 2012). The same model was then ported and tested with a pilot experiment on the iCub platform (Mazzola et al., 2024), but not yet implemented in an architecture for multi-party conversation. Such implementation is one of the goals of the present study: after optimizing the approach of Mazzola et al. (2023), we incorporate the new explainable AE model into a modular architecture with additional components (e.g., spatial memory and action manager) to make the robot find and identify the addressee beyond its limited field of view.

2.3 Explainability in HRI

Explainable artificial intelligence (XAI) has predominantly been explored in the human-computer interaction field (Lai et al., 2021; Gambino and Liu, 2022), and even though the number of works concerning robots’ explainability is increasing, there are still few studies about XAI within the HRI context (De Graaf and Malle, 2017). Robots introduce an additional layer of complexity to the explainability problem compared to virtual agents due to their embodiment and the broader spectrum of interaction modalities they offer (Setchi et al., 2020).

Automatic explanation generation with robots has been investigated in several interaction contexts, such as planning (Chakraborti et al., 2017) and human-robot collaboration (Matarese et al., 2023a). For instance, Chakraborti et al. (2017) generated explanations while trying to resolve the discrepancies between the robot’s and the human’s internal models. In contrast, Tabrez et al. (2019) tackled the problem by focusing on users’ task understanding to detect incomplete or incorrect beliefs about the robot’s functioning. Matarese et al. (2023a) focused on the influence of explainable robots when providing explanations that consider the human-robot common ground, also with regard to people’s personality traits (Matarese et al., 2023b).

Visual explanations with robots have been proposed for several purposes, such as navigation (Maruyama et al., 2022; Halilovic and Lindner, 2023). However, more importantly for the scope of this work, Zhu et al. (2022) proposed a multi-modal explanation framework that coupled visual-based with verbal-based explanations to explain facial emotion recognition. Moreover, Sobrín-Hidalgo et al. (2024) presented a preliminary study proposing a vision-language model that allows the robot to generate explanations combining data from its logs and the images it captures.

In recent years, the HRI community has shown a growing interest in verbal explanations, which is likely to increase further given the spread of LLMs. Stange et al. (2022) designed and developed a dialogical model for explanations in HRI. With their model, the robot can reply to human users’ requests with explanations referring to its internal state. The authors stressed the iterative nature of their model in managing the explanatory processes as dialogues. Task understanding has also been investigated from a dialogical perspective with a focus on the role of negation in human-robot explanatory exchanges (Groß et al., 2023). Moreover, to allow artificial agents to adapt their explanations to their partners’ understanding, Robrecht and Kopp (2023) implemented a linguistic explainer model that constructs and employs a partner model.

In the context of multi-party conversation, explainability has been investigated for sentiment analysis of social media dialogues (Sinha et al., 2021) and emotion recognition with multi-modal attentive learning (Arumugam et al., 2022). However, to the best of our knowledge, there is no approach that provides real-time explainable and transparent solutions for a robot’s behavior in multi-party interaction.

3 Methods

3.1 Design of the attention-based explainable AE model

The development of our explainable AE model consists of two steps. In the first step, we focus on enhancing the classification accuracy with respect to the previous SOTA in the same task (Mazzola et al., 2023), which represents our baseline (see Paragraph 3.1.1). This way, we obtain a first model, which we refer to as the Improved Addressee Estimation (IAE) model. In the second step, we modify this model with inherently explainable modules based on attention (see Paragraph 3.1.2) to extract additional information during addressee estimation. We refer to the second model as the Explainable Addressee Estimation (XAE) model.

3.1.1 Improved Addressee Estimation model

Following Mazzola et al. (2023), we use the Vernissage dataset (Jayagopi et al., 2012, 2013) to train an Improved model for the Addressee Estimation task (IAE model). To the best of our knowledge, the Vernissage dataset is one of a kind and designed specifically for solving the task of addressee classification in human-robot interaction. The dataset contains recordings of multi-party conversations from the robot’s point of view. In each conversation, one robot and two human participants engage in a conversation about paintings on the wall. The conversations are manually labeled with the relative position of the addressee from the robot’s point of view. The possible labels are ROBOT, RIGHT, LEFT, GROUP, and NO-LABEL.

To compare consistently with the baseline (Mazzola et al., 2023), we only use the conversation parts in which ROBOT, RIGHT, or LEFT is the target. We also follow the same data pre-processing. The view from the robot’s camera is split into two parallel data streams (face images and body-pose vectors), which serve as input to the network and are later merged to form a combined representation.

We perform a hyper-parameter search (see details in the Appendix) to explore the achievable prediction accuracy on the given task. We always hold out one of the 10 interactions included in the dataset for testing and perform 9-fold cross-validation on the remaining 9 interactions. The cross-validation performance (weighted according to the number of sequences in individual interactions) is the quantity we optimize.

After choosing all the hyper-parameters, we train a network on nine conversations and test with the remaining one. This is repeated ten times (for all possibilities of the test set). To calculate the final F1 score, we average the results across classes (weighted by the number of samples in the given class) and then across the 10 trials according to the number of sequences in the test sets.
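The evaluation protocol above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names and the dictionary layout are hypothetical, and we assume that each of the 9 cross-validation folds holds out exactly one interaction for validation.

```python
def leave_one_out_splits(interaction_ids):
    """Yield (test_id, folds) for every choice of held-out test interaction.

    Each fold is (val_id, train_ids); one interaction per fold is an assumption.
    """
    for test_id in interaction_ids:
        remaining = [i for i in interaction_ids if i != test_id]
        folds = [(val_id, [i for i in remaining if i != val_id])
                 for val_id in remaining]
        yield test_id, folds


def weighted_mean(values, weights):
    """Weighted average of `values` with non-negative `weights`."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def final_f1(per_trial):
    """Average F1 across classes (by class support), then across trials
    (by the number of sequences in each test set).

    per_trial: list of dicts with hypothetical keys
    'class_f1', 'class_support', and 'n_sequences'.
    """
    trial_scores = [weighted_mean(t["class_f1"], t["class_support"])
                    for t in per_trial]
    trial_sizes = [t["n_sequences"] for t in per_trial]
    return weighted_mean(trial_scores, trial_sizes)
```

With 10 interactions, this produces 10 test trials, each with 9 validation folds trained on 8 interactions.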

Our chosen IAE model improves on the current SOTA while reducing the number of trainable parameters approximately 135-fold (from 91,706,749 to 677,623). This is achieved mainly by reducing the number of output neurons from the convolutional neural network (CNN) processing the facial information. Furthermore, we replace the convolution on body-pose vectors with fully connected layers. The output dimensionality is chosen to make the length of the outputs from face and pose models similar, allowing for more sophisticated data-fusion methods.

3.1.2 XAE model: incorporating attention

In this section, we describe our neural network architecture (XAE model) that, in addition to yielding accuracy comparable with the IAE model, combines multiple attention-based components, allowing us to extract human-readable explanations.

The information flow in our “explainable” architecture is distinct from our IAE model. First, we utilize a vision transformer instead of a convolutional network to obtain the face representation (Dosovitskiy et al., 2021); for its implementation, we use Wightman (2019). Second, we insert an additional shallow model to fuse the face and pose information. Third, we alter the penultimate processing step using a tailored attention mechanism in the recurrent neural network. This way, the computation of the utterance embedding also yields an importance score for each frame. The overall scheme of the addressee estimation is shown in Figure 1.

Figure 1: Illustration of the addressee classification workflow. The sequence of faces is embedded using a vision transformer (M1), whereas poses are processed via an MLP. These embeddings are then fused using an intermediate network (M2), and their representation for each time frame is processed by a recurrent network enhanced with attention, forming a unified embedding of the whole utterance. The final step is the mapping to three output options via a fully-connected (FC) layer.

Merging modalities

After forming the face embedding ($\boldsymbol{f}_{t}\in\mathbb{R}^{d_{face}}$) using a vision transformer and a pose embedding ($\boldsymbol{p}_{t}\in\mathbb{R}^{d_{pose}}$) using an MLP, we devise a way to combine these representations. The simplest way to achieve this goal is to concatenate the two vectors, but this does not allow us to extract their relative importance.

In line with our goal, i.e., designing an architecture consisting of multiple components with inherent explainability, we seek to know which modality is more important in each time frame. Inspired by Brauwers and Frasincar (2023), we opt for the following variant of a scoring function:

$$\mathrm{score}(\boldsymbol{v})=\boldsymbol{w}^{T}_{D}\,\mathrm{ReLU}(\boldsymbol{W}\boldsymbol{v}+\boldsymbol{b}), \tag{1}$$

where $\boldsymbol{w}_{D}\in\mathbb{R}^{d_{inner}}$, $\boldsymbol{W}\in\mathbb{R}^{d_{inner}\times d_{v}}$, and $\boldsymbol{b}\in\mathbb{R}^{d_{inner}}$ are trainable parameters, $d_{inner}$ is a hyper-parameter to be optimized, and $d_{v}$ is the length of the input vector $\boldsymbol{v}$. Since the scoring function is applied to both the face and the pose embedding, in this case $d_{face}=d_{pose}=d_{v}$.

To compare the influence of each of the two vectors (face and pose) at every frame, we calculate their relative contributions $s_{\boldsymbol{f}_{t}}$ and $s_{\boldsymbol{p}_{t}}$ as:

$$s_{\boldsymbol{f}_{t}},s_{\boldsymbol{p}_{t}}=\mathrm{softmax}(\mathrm{score}(\boldsymbol{f}_{t}),\mathrm{score}(\boldsymbol{p}_{t})). \tag{2}$$

Finally, the single-vector representation of both modalities is formed as a weighted element-wise sum of $\boldsymbol{f}_{t}$ and $\boldsymbol{p}_{t}$, using their corresponding weights:

$$\boldsymbol{r}_{t}=s_{\boldsymbol{f}_{t}}\boldsymbol{f}_{t}+s_{\boldsymbol{p}_{t}}\boldsymbol{p}_{t}. \tag{3}$$
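The fusion step defined by Equations (1) to (3) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, and we assume that the same parameters $\boldsymbol{W}$, $\boldsymbol{b}$, and $\boldsymbol{w}_{D}$ are shared by both modalities.

```python
import math


def score(v, W, b, w_D):
    """Eq. (1): w_D^T ReLU(W v + b) for a single input vector v."""
    hidden = [max(0.0, sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i)
              for row, b_i in zip(W, b)]
    return sum(w * h for w, h in zip(w_D, hidden))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(f_t, p_t, W, b, w_D):
    """Eqs. (2)-(3): return the fused vector r_t and the two modality weights.

    Sharing W, b, w_D across face and pose is an assumption of this sketch.
    """
    s_f, s_p = softmax([score(f_t, W, b, w_D), score(p_t, W, b, w_D)])
    r_t = [s_f * f + s_p * p for f, p in zip(f_t, p_t)]
    return r_t, s_f, s_p
```

The returned pair `(s_f, s_p)` sums to one, so it can be read directly as the relative importance of the face and pose streams at frame $t$.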

Recurrent attention

To form a single vector representation of the whole utterance ($\boldsymbol{r}_{1},...,\boldsymbol{r}_{n}$), in the baseline architecture, we employ a recurrent network. However, to create the “explainable” alternative, we go beyond the ordinary recurrent network and add a form of attention mechanism (as suggested in Brauwers and Frasincar (2023)) that allows us to measure the time frame importance scores while producing the output. Our computation is as follows.

Let us denote the stacked representation, through all the time frames, of the embeddings created in the previous step as $\boldsymbol{R}=[\boldsymbol{r}_{1},\boldsymbol{r}_{2},...,\boldsymbol{r}_{n}]$. We linearly project them to produce queries, keys, and values:

$$\boldsymbol{Q}_{r}=\mathbf{W}^{Q}\boldsymbol{R},\quad\boldsymbol{K}_{r}=\mathbf{W}^{K}\boldsymbol{R},\quad\boldsymbol{V}_{r}=\mathbf{W}^{V}\boldsymbol{R}. \tag{4}$$

The queries are then fed one by one into the gated recurrent unit (GRU) network (Cho et al., 2014), to eventually form the embedding $\boldsymbol{q}$ integrating the information about all the time frames. Using the query embedding $\boldsymbol{q}$, we compute its similarities with the keys encoded in the matrix $\boldsymbol{K}_{r}$, which provide the contribution scores ($\boldsymbol{c}=\boldsymbol{K}_{r}\boldsymbol{q}$) of each time frame. To produce the final utterance representation $\boldsymbol{u}$, we compute a weighted sum of the value vectors $\boldsymbol{v}_{i}$ contained in $\boldsymbol{V}_{r}$, with the weights provided in $\boldsymbol{c}$:

$$\boldsymbol{u}=\sum_{i=1}^{n}c_{i}\boldsymbol{v}_{i}. \tag{5}$$

A fully-connected layer taking $\boldsymbol{u}$ as input produces the final addressee estimation. A graphical illustration of our recurrent block is provided in Figure 2.
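The recurrent attention block can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the GRU is replaced by a plain tanh recurrence (`summarize_queries`, a hypothetical stand-in without gates or trainable weights), and the contribution scores are normalized with a softmax, which the text does not state explicitly.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_queries(queries):
    """Stand-in for the GRU: a fixed tanh recurrence over the query sequence."""
    h = [0.0] * len(queries[0])
    for q in queries:
        h = [math.tanh(0.5 * h_i + 0.5 * q_i) for h_i, q_i in zip(h, q)]
    return h


def recurrent_attention(R, W_Q, W_K, W_V):
    """Return the utterance embedding u and per-frame contribution scores c."""
    Q = [matvec(W_Q, r) for r in R]  # one query per time frame
    K = [matvec(W_K, r) for r in R]  # one key per time frame
    V = [matvec(W_V, r) for r in R]  # one value per time frame
    q = summarize_queries(Q)
    raw = [sum(k_i * q_i for k_i, q_i in zip(k, q)) for k in K]  # c = K q
    c = softmax(raw)  # assumption: scores normalized to sum to one
    u = [sum(c_i * v[j] for c_i, v in zip(c, V)) for j in range(len(V[0]))]
    return u, c
```

The vector `c` has one entry per time frame, which is what makes the frame-level importance readable without any post-hoc processing.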

3.1.3 Generating explanations

The architectural design of the model is proposed with the intention for the explanations to be inherent. That way, one does not need an extra post-processing step to extract explanations, rendering our model computationally efficient and suitable for use in systems where real-time feedback is necessary. Three types of explanations are included: 1) image saliency, 2) face vs. pose importance, and 3) time frame importance.

Image saliency

To process an image via the vision transformer, we first need to split the image into small patches (non-overlapping squares that together form the entire image). The patches are further embedded and passed through the attention blocks. Those consist of multiple self-attention layers (multi-head self-attention), residual connections, layer normalization, and fully connected layers, all repeated several times (Vaswani et al., 2017). For visualization purposes, we are primarily interested in the self-attention computation, defined by the formula:

$$\mathrm{Attention}(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V})=\mathrm{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{T}}{\sqrt{d_{k}}}\right)\boldsymbol{V}. \tag{6}$$

The matrices $\boldsymbol{Q},\boldsymbol{K}$ , and $\boldsymbol{V}$ carry the information about each image patch. Thus, visualizing the raw attention scores provided after the softmax computation up-sampled back to the original image size produces maps, highlighting the areas the network uses the most for further processing. A sample depiction of the attention map is provided in Figure 3. Since employing multiple parallel heads in the vision transformer is a common practice, we have more visualization options. Because the attention heads extract different features, combining them for visualization can produce more consistent maps.
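As a rough illustration of how such maps could be produced, the sketch below computes the post-softmax attention weights, averages them over heads, and up-samples the per-patch scores to pixel resolution. The function names, the choice of which query row to visualize, simple averaging over heads, and nearest-neighbor up-sampling are all assumptions of this sketch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(Q, K):
    """softmax(Q K^T / sqrt(d_k)), returned as one row of weights per query."""
    d_k = len(K[0])
    return [softmax([sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k)
                     for k in K])
            for q in Q]


def average_heads(head_maps):
    """Average per-patch scores over heads (one list of scores per head)."""
    n_heads = len(head_maps)
    return [sum(vals) / n_heads for vals in zip(*head_maps)]


def upsample_nearest(patch_scores, grid, patch_size):
    """Expand a flat grid*grid list of patch scores to a 2D pixel map."""
    side = grid * patch_size
    return [[patch_scores[(y // patch_size) * grid + (x // patch_size)]
             for x in range(side)]
            for y in range(side)]
```

For a 2x2 patch grid with 2-pixel patches, `upsample_nearest(scores, 2, 2)` returns a 4x4 map in which every pixel carries the score of the patch that contains it.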

Face vs. pose

To see the relative importance of face vs. pose information, we can extract the weights used to combine these modalities. Knowing which modality was used more/less can provide the speaker with clues about what was unclear when misclassifications occurred.

Even though the weight extraction is straightforward, the weights have some interesting intrinsic properties. The general hyper-parameter setup, as well as the whole architecture, has a large impact on the expressiveness of the model. For a concise analysis of the weight distribution, see Subsection 4.2.

Time frame score

The attention weights provided in the recurrent network offer a direct way to retrieve the time frames that have the highest impact on the classification.

Using attention activations, we design a method to automatically generate a verbal cue about the part of the utterance that is most important for the prediction. To this end, we compute the average of the attention weights in a sliding window. If this average exceeds $\frac{1}{k}+\theta$, where $k$ is the length of the sequence and $\theta$ is a threshold, an output sentence is generated based on the location of the sliding window. For simplicity, in this experiment, we distinguish between three possibilities of the important region: the beginning of the interaction, the middle, and the end.
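A minimal sketch of this sliding-window rule is given below. The window size, the value of $\theta$, the choice of the window with the highest average, the mapping of the window center to thirds of the sequence, and the wording of the sentence are all hypothetical choices made for illustration.

```python
def important_region(weights, window, theta):
    """Return 'beginning', 'middle', 'end', or None if no window stands out.

    A window stands out when its mean attention weight exceeds 1/k + theta,
    where k is the sequence length.
    """
    k = len(weights)
    best_start, best_avg = None, 0.0
    for start in range(k - window + 1):
        avg = sum(weights[start:start + window]) / window
        if avg > best_avg:
            best_start, best_avg = start, avg
    if best_start is None or best_avg <= 1.0 / k + theta:
        return None
    center = best_start + window / 2.0  # position of the window center
    if center < k / 3.0:
        return "beginning"
    if center < 2.0 * k / 3.0:
        return "middle"
    return "end"


def verbal_cue(weights, window=3, theta=0.05):
    """Turn the detected region into a sentence (wording is hypothetical)."""
    region = important_region(weights, window, theta)
    if region is None:
        return None
    return f"The {region} of the utterance was the most important for my decision."
```

With uniform weights, no window exceeds the threshold and no sentence is produced, which matches the intent of comparing against $\frac{1}{k}$.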

Two samples of a response, given sequences of length 10, are shown in Figure 4. We can see the weights of individual time frames distinguished by the color and size of the red dots. Even though these explanations are particularly easy to interpret, the generation process includes the pose information as well, which is for clarity not included in the image.

3.2 Implementation and user evaluation of the explainable modular architecture

After the design of the XAE model, the second contribution of this paper is its deployment in a modular architecture for multi-party conversation for the robot iCub, which goes in parallel with the user evaluation of the explainability and transparency techniques implemented in the architecture.

The term artificial cognitive architecture generally refers to computational frameworks that aim to explain and reproduce the fundamental mechanisms of human cognition, such as perception, attention, action selection, memory, learning, reasoning, meta-cognition, or prospection (Vernon, 2014). The development of artificial cognitive architectures able to solve all these tasks with performance similar to that of humans is a longstanding and unsolved problem (Kotseruba and Tsotsos, 2020). The architecture we propose in this paper is not meant to represent a comprehensive solution for a multi-task robotic architecture. Rather, it is a cognitive-inspired framework supporting the robot’s autonomous reasoning, decision-making, and interaction in the context of a multi-party conversation for deploying and evaluating the proposed explainable model in HRI.

3.2.1 Modular components of the architecture

Our architecture (Figure 5) builds on the architectural implementation described in Belgiovine et al. (2022). It follows a modular approach in which each component (i.e., a module) implements a specific ability of the robot (e.g., detecting faces, processing speech, reasoning about the next best action). The modules exchange information and communicate with each other through the YARP middleware (Metta et al., 2006). To improve the allocation of computational resources and reduce response time, we rely on a distributed approach. This modular setting allows us to incrementally add or update the robot’s skills by integrating additional modules into the framework at later stages.

Audio Perception and Processing

Raw audio is given as input to the Sound Detection module (Eldardeer et al., 2021). Based on a minimum threshold of the audio amplitude, this module triggers the activation of the Speech-To-Text module and sends information to the Self Monitoring module.

For the Speech-To-Text module, we use Whisper (Radford et al., 2023; https://github.com/openai/whisper) by OpenAI, a SOTA model designed for accurate and robust speech-to-text transcription, which performs well even in noisy environments and with speakers of diverse nationalities.

Vision perception and processing

Visual input from the robot’s cameras (RGB images of 480x640 resolution, recorded at 30 fps) is given as input to the Face Detection and the Addressee Estimation modules to extract higher-level visual features. The Face Detection module extracts the bounding boxes of faces using the Ultralytics YOLOv8 model (Jocher et al., 2023; https://github.com/ultralytics/ultralytics), which has been adapted to work within the iCub YARP-based framework.

The Addressee Estimation module deploys our XAE model in the YARP middleware. It pre-processes visual information at 12.5 Hz as in Mazzola et al. (2024), computing the speaker’s body pose using a lightweight version of OpenPose (Cao et al., 2019; Osokin, 2018) and cropping an image of the speaker’s face from the body joints of the head. At each time frame, the two inputs are given to the XAE model, as described in Section 3.1.

Moreover, while we utilize the Vernissage dataset for exploring the model’s accuracy and the possibilities for generating explanations, to ensure the final model is capable of making correct estimates in an online setting, we also retrain our model on a dataset collected using the iCub robot (Saade, 2023). The dataset contains five recorded interactions of three people and the iCub, and is labeled into three groups (LEFT, RIGHT, ROBOT), such that it can be used to extend the Vernissage dataset.

A Multiple Objects Tracker (MOT) is used to generate tracker instances with a unique ID for each bounding box received from the Face Detector. Combining a Kalman filter with the Hungarian algorithm, this module keeps identities consistent between faces detected in consecutive frames while running in real time (Bewley et al., 2016). If necessary, it activates a tracking mode for the robot, enabling it to follow individuals as they move within the environment and update their position when they stop moving.
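The association step between existing tracks and new detections can be sketched as follows. This is an illustration under stated assumptions, not the tracker used in the architecture: the Kalman prediction is omitted (the track boxes are taken as already predicted), an exhaustive search stands in for the Hungarian algorithm (it finds the same optimum but only scales to a handful of faces), and `match_tracks`, `iou`, and the `min_iou` value of 0.3 are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_tracks(tracks, detections, min_iou=0.3):
    """Map track IDs to detection indices so that the total IoU is maximal.

    tracks: dict {track_id: predicted_box}; detections: list of boxes.
    Pairs overlapping less than min_iou are left unmatched.
    """
    ids = list(tracks)
    # Pad with None so that every track may also stay unmatched.
    slots = list(range(len(detections))) + [None] * len(ids)
    best, best_score = {}, -1.0
    for perm in permutations(slots, len(ids)):
        pairs, score = {}, 0.0
        for track_id, j in zip(ids, perm):
            if j is None:
                continue
            overlap = iou(tracks[track_id], detections[j])
            if overlap >= min_iou:
                pairs[track_id] = j
                score += overlap
        if score > best_score:
            best, best_score = pairs, score
    return best
```

Detections that remain unmatched would receive a fresh tracker ID, and tracks that remain unmatched for several frames would be dropped.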

Short-Term Spatial Memory

A memory system is a core component of a cognitive architecture to facilitate the accomplishment of high- and low-level cognitive abilities such as attention, reasoning, and context understanding. Such abilities become particularly crucial in dynamic interactions involving multiple individuals, enabling the robot to maintain awareness of surrounding events despite its attention being focused on limited evidence. When developing artificial cognitive architectures, researchers usually adhere to the traditional classification of memory types established in cognitive psychology (Atkinson and Shiffrin, 1968), attempting to reproduce their basic functionalities and features.

For the purpose of the current work, our Spatial Memory module aims to implement the ability of the robot to remember and update the position of elements of interest (in this case, people) in its surrounding environment with respect to an egocentric frame. Hence, the module can be viewed as akin to a short-term spatial memory system, as the information is solely retained to fulfill the current task, namely attending and intervening appropriately in conversations. No information is stored for subsequent learning or retrieval in future interactions.

This module primarily serves to centralize spatial and contextual knowledge. It aggregates information coming from the MOT and the proprioceptive data derived from the robot’s neck and head encoders (Roncone et al., 2016), linking each tracker instance with the robot’s head orientation and, after a 3-bin discretization, categorizing it as being to the right, left, or in front of the robot. This process creates a dictionary associating each individual with their respective robot-centric spatial position. Each element within this dictionary is dynamically updated with properties pertinent to the multi-party conversation task, such as each person’s role in the interaction (e.g., speaker, listener, addressee). This dictionary structure works as a knowledge repository, formatted in a readable and standardized manner, which remains readily accessible for online queries by any other module that can efficiently update or retrieve information by using any object or property as a key. For example, the Self Monitoring module can ask the Spatial Memory for all the items present to the left of the robot or ask for the location of a specific individual.
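A minimal sketch of such a dictionary-based memory is given below. The class and method names, the 15-degree bin boundary, and the sign convention (positive azimuth means left of the robot) are assumptions introduced for illustration; the actual module is queried over YARP ports rather than by direct method calls.

```python
def discretize(azimuth_deg, half_width_deg=15.0):
    """3-bin discretization of a robot-centric azimuth (positive = left)."""
    if azimuth_deg > half_width_deg:
        return "left"
    if azimuth_deg < -half_width_deg:
        return "right"
    return "front"


class SpatialMemory:
    """Short-term store mapping tracker IDs to position and role."""

    def __init__(self):
        self._people = {}  # tracker_id -> {"position": str, "role": str}

    def update(self, tracker_id, azimuth_deg, role="listener"):
        """Insert or refresh a person from the head orientation at detection."""
        self._people[tracker_id] = {
            "position": discretize(azimuth_deg),
            "role": role,
        }

    def at(self, position):
        """All tracker IDs currently remembered in a given bin."""
        return sorted(k for k, v in self._people.items() if v["position"] == position)

    def with_role(self, role):
        """All tracker IDs currently holding a given conversational role."""
        return sorted(k for k, v in self._people.items() if v["role"] == role)

    def where(self, tracker_id):
        """Bin of a specific person, or None if not remembered."""
        entry = self._people.get(tracker_id)
        return entry["position"] if entry else None
```

The two example queries mentioned above then correspond to `memory.at("left")` and `memory.where(tracker_id)`.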

Speech Generation

To enable the robot to understand verbal language and give contingent answers, we use an open-weight Large Language Model, specifically the Mistral-7B-Instruct model from MistralAI (Jiang et al., 2023), and deploy it in the architecture as a YARP-based module. Our approach relies only on prompt engineering: we supply specific contextual information in the prompt to enable the robot to engage meaningfully in multi-party conversations. Details of the prompts used can be found in the supplementary material.

Robot Actions and Behaviors

For tracking faces and directing the robot’s head towards the identified speaker, we employ the iKinGazeCtrl module, a controller designed for iCub’s gaze, capable of independently steering the neck and eyes (Roncone et al., 2016). Additionally, our action module provides functionalities for controlling facial expressions (by activating LEDs) and synthesizing speech (by using Acapela Text-To-Speech, https://www.acapela-group.com/, with a child-like English voice).

Self Monitoring/States Controller

We developed a module for interaction- and self-monitoring to manage the multiple parallel threads and processes needed for the interaction. This module is responsible for receiving and supervising all the auditory and visual information, triggering the robot’s gaze and facial expressions accordingly, and controlling the “speak” and “listen” states of the conversation. In relation to perceptual inputs, this module also manages the queries for spatial memory information, handled through RPC ports, and triggers the LLM-generated responses when the robot is addressed.

Visualization System

An additional module of the architecture is the real-time visualization of the main processes involved in the proposed framework. This includes details about the robot’s current state, such as the activation of the modular architecture, which illustrates the ongoing processes supporting the robot’s cognitive abilities. Additionally, it showcases the explainable and transparent solutions integrated into the XAE model. A 50-inch TV screen positioned behind the robot serves as a presentation board for various visualization windows (Figure 6).

3.2.2 Real-time, multi-modal explainability and transparency system

To implement a transparent and explainable modular architecture, we developed a framework providing several real-time clarifications about the processes and functionalities of the robot. To this aim, we use diverse techniques (see Figure 7).

Specifically, our solutions are the result of two possible approaches to extract explanations (Kerzel et al., 2022): neural or symbolic. The former approach leverages neural network outputs to extract information, whereas the latter represents the information about the system’s behavior with a symbolic code specified by the developer.

Another important difference is related to the type of clarification provided, which is rooted in how some literature differentiates between the concepts of explainability and transparency (Ciatto et al., 2020; Miller, 2019). Several of the solutions we implemented provide the reason behind the decision of the robot (explainability), whereas others describe the current functioning of the robot (transparency). The former answer the question “why is it happening?”, the latter the question “what is happening?”. Following this classification, we implemented five explainability and six transparency solutions for the human-robot interaction phase.

Finally, several modalities of communication are exploited: verbal, embodied, and visual (which we divide into visual/attentional and visual/functional).

Figure 6 shows a screenshot of the five techniques implemented via the visual modality and displayed in real-time on the screen behind the robot. Two out of these five represent functional aspects of the architecture. The Spatial Memory provides a 3-bin scheme with instances of people perceived and remembered by the robot, their (robot-centric) position, and conversational role. An additional visual-functional solution shows the scheme of the robot’s modular architecture onscreen, with real-time information about the current activations for each robot’s ability.

The other three techniques related to screen-visualization are implemented as streams of features the robot is paying attention to. The visual streams show, therefore:

• the processed output of the iCub’s right camera with bounding box information of detected faces;
• the attention map of the speaker’s face providing information about the salient part of the input image processed by the XAE model;
• the two inputs of the XAE model (the speaker’s body pose and face image) displayed with size changing in real-time proportionally to the relative importance of their contribution to the final output.
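The last stream can be illustrated with a small sketch that scales the two input panels by their attention weights. The function name, the base size of 200 pixels, and the linear scaling rule are assumptions; the text only states that the displayed size is proportional to the relative importance.

```python
def panel_sizes(face_weight, pose_weight, base_px=200):
    """Side lengths (face, pose) in pixels, proportional to the weights.

    Equal weights give two panels of base_px; the sum of both side
    lengths stays constant at 2 * base_px.
    """
    total = face_weight + pose_weight
    face_px = round(2 * base_px * face_weight / total)
    pose_px = round(2 * base_px * pose_weight / total)
    return face_px, pose_px
```

With equal weights both panels are 200 pixels wide; with weights of 0.25 and 0.75 the face panel shrinks to 100 pixels and the pose panel grows to 300.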

Other explainability and transparency techniques designed to clarify the addressee estimation process were implemented via the verbal modality. For instance, at the end of other speakers’ utterances, the robot provides an explanation for its final addressee estimation, saying that its estimation relied on the speaker’s non-verbal behavior and, more specifically, which part of the speaker’s utterance (beginning, end, or throughout) impacted most on the final estimation. This enhances the transparency of its subsequent action (turning toward the addressee vs replying to the speaker). Moreover, the robot can correct any evaluation errors made by the XAE model after exploring the environment (e.g., if there is nobody in the estimated direction), and verbally clarifies the correction.

Finally, we designed specific actions for the iCub’s facial expression to increase the understandability of the robot’s behavior (embodied modality). iCub is endowed with face LEDs (eyebrows and mouth) that can change color and position and are synchronized with the speech synthesizer. We specified the iCub mouth (happy or neutral) to be coherent with the confidence of the estimation (happy for confidence over 80%, neutral otherwise). The color of the LEDs indicates the conversational role of the robot, coherently with the colors displayed in the spatial memory (green for “speaker”, red for “addressee”, white otherwise). The movement of the eyebrows is meant to suggest the direction "left" or "right" of the estimated addressee (see video for clarification).
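The mapping from model output to facial display described above is simple enough to state as a lookup. The function name and the returned dictionary layout are hypothetical; the 80% confidence threshold and the color coding are those given in the text.

```python
def face_display(confidence, robot_role):
    """Mouth shape and LED color for a given confidence and robot role.

    confidence: estimation confidence in [0, 1].
    robot_role: conversational role of the robot, e.g. "speaker",
                "addressee", or anything else (shown in white).
    """
    mouth = "happy" if confidence > 0.80 else "neutral"
    color = {"speaker": "green", "addressee": "red"}.get(robot_role, "white")
    return {"mouth": mouth, "led_color": color}
```

For example, a confident estimate in which the robot is the addressee yields a happy mouth with red LEDs, while an uncertain estimate in which the robot is a bystander yields a neutral mouth with white LEDs.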

3.2.3 Exploratory User Study

After developing the XAE model and integrating it into the robotic architecture, we ran a user study to assess external users’ perception of the explainability and transparency system described in the last Section.

Video Recordings

We recorded a video of multi-party conversations with the robot (the entire video can be found at https://www.youtube.com/watch?v=zZF-L0gtRu4) and uploaded it to the SoSci Survey platform (https://www.soscisurvey.de/) to deliver the questionnaires. In the video, the experimenters stage three different conversational scenarios, namely interacting with a social robot assistant in a shopping mall, a restaurant, and a domestic setting. Upon receiving the speaker’s position and gazing towards them, the robot uses the XAE model to estimate the intended addressee. If the estimated addressee is the robot, it replies to the speaker using the Speech Generation module. To ensure meaningful dialogues, we provided the LLM with context prompts tailored to each scene.

Participants were also given an extra introductory video clip to familiarize them with the visual solutions of the explainability and transparency system. The video shows the conversational interactions from an external perspective, with the robot in the middle (Figure 8). To increase the visibility of important features, we incorporated a zoomed-in view of the robot’s face in the bottom left corner, along with the robot’s visual explanation system in the bottom right corner of the screen (Figure 6). Participants were instructed to focus on both the overall scene and the robot’s explanations.

Demographics

We performed an online user study with 21 participants (10 male, 11 female): 10 HRI researchers (5 with a technical background and 5 with a humanities background) and 11 naive users. The aim was to collect their perception of the robot’s multi-modal XAI system.

Questionnaires and Analysis

After viewing the videos, participants were given 7-point Likert scale questionnaires about their perceptions of the robot, its explanations, and the transparency techniques employed. For the robot’s perception, we used items regarding its warmth and competence (Fiske et al., 2007), likeability (adapted from Spaccatini et al., 2019), experience and agency (Gray et al., 2007), and cognitive and affective trust (Bernotat et al., 2021). Moreover, we asked them to rate their satisfaction with the explanations (Hoffman et al., 2018), and their perceived usefulness and intrusiveness (Conati et al., 2021). We asked the participants to individually evaluate the various explainability mechanisms (verbal, embodied, visual/attentional, and visual/functional) to gain a complete understanding of their perception of each.

Results from the questionnaires were analyzed with Jamovi v. 2.4.11, using the Linear Mixed Model package (Gallucci, 2019) and Spearman’s correlation matrices. Three different linear mixed models compute statistical differences in the three scales (dependent variables) in relation to the explainability/transparency modalities and the participants’ role (factors). Participants’ ID is applied as a random effect to adjust for each participant’s baseline and model the intra-subject correlation of repeated measurements. Moreover, we conduct Spearman rank correlation tests to investigate the relationship between the three parameters (from the user perspective) of explainability/transparency, considering the four modalities independently. To investigate more deeply the impact of each explainable/transparent modality on participants’ perception of the robot, we analyze the correlation of each modality with the other scales of the survey, i.e., how much the robot seemed to a) have experience, b) have agency, c) be competent, d) be likeable, e) induce cognitive trust, and f) induce affective trust.
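For readers unfamiliar with the rank-based test, the following sketch computes Spearman’s coefficient as the Pearson correlation of average ranks, which handles the ties that are common in Likert data. It is a plain-Python illustration, not the Jamovi implementation, and it returns only the coefficient (no p-value). With 21 participants, the degrees of freedom reported as $r(19)$ correspond to $n-2$.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Any strictly increasing relationship gives a coefficient of 1 and any strictly decreasing one gives -1, regardless of whether the relationship is linear in the raw scores.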

4 Results

4.1 AE models performance

To provide a statistically robust evaluation of our proposed models (the IAE model as well as the XAE model), the training is repeated five times in total for each test set, using different random seeds. The two models achieve comparable performance. The IAE model reaches a 79.51% average F1 score with a standard deviation of 0.56%, whereas the F1 score for the XAE model is 79.40% with a standard deviation of 1.06%. Thus, both models surpass the previous SOTA F1 score of 75.01%, reported in Mazzola et al. (2023), by $\approx 4.45$ percentage points on average. When looking at their confusion matrices (Figure 9), we see that the models have a roughly equal distribution of each type of misclassification. We also see that the models are slightly weaker at recognizing when the addressee is the robot.
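The reported margin can be checked directly from the three F1 scores above. The snippet uses only those numbers; the variable names are illustrative.

```python
from statistics import mean

previous_sota = 75.01                  # F1 (%) of the previous SOTA
f1 = {"IAE": 79.51, "XAE": 79.40}      # average F1 (%) of the two models

# Absolute gain of each model, in percentage points.
gains = {name: round(score - previous_sota, 2) for name, score in f1.items()}

# Average gain over the two models.
average_gain = mean(gains.values())
```

The individual gains are 4.50 and 4.39 percentage points, so the figure of about 4.45 is their average.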

To test our XAE model’s capabilities on recordings captured using the iCub, we first analyze its accuracy on the data provided in Saade (2023) (further referred to as the iCub data). We trained the XAE model on the Vernissage corpus and tested it on the iCub data, obtaining a test accuracy of 66.31%. This shows that even though the data distribution is quite different from the Vernissage dataset, the model estimates the correct addressee most of the time. However, the performance is still notably worse than on the Vernissage data. Thus, to ensure the best possible accuracy of the model when deployed in the iCub, the final model used in the user study was trained on both datasets.

4.2 Explainability analysis

When analyzing the distribution of the face vs. pose attention weights, we found that the face and pose information is equally important, i.e., the average score is 0.5. On the other hand, we notice a negative correlation of -0.87 (p = 0.001) between the dimensionality we use for the face and pose embeddings and the logarithm of the importance deviations. The higher the dimensionality, the lower the deviation of attention scores (the lower the distinction rate between the two modalities). Thus, during the optimization process, it is not enough to optimize the model only for the performance, but also for the expressiveness of the explanations, like the deviation of importance scores in this case.

Our experiments showed that in the case of employing the fusion of dimensionalities without the vision transformer and recurrent network with attention, the explanation capability seemed to increase, and the pose information was slightly prevalent, having a score of $\approx 0.62$ with a deviation of $0.15$ . The full, combined version has an average score of $0.5$ and deviation of $0.04$ with a comparable embedding dimensionality.

To further explore the properties of our XAE model in greater detail, we analyze time frame scores with 10-frame-long sequences of the Vernissage dataset. A way to look at the activations is to compute their distribution. When considering only the stack of values independently, they precisely follow a Gaussian distribution with a mean of $\approx 0.1$ and standard deviation $\approx 0.0055$ . Our explanation of the low deviation is that since there are only 10 frames in each sequence, they do not capture a long period of time; thus, their embeddings are usually quite similar. We also observe that the weights change more in the case of frames with greater variability.
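One way to read the mean of about 0.1 is as a consequence of normalization rather than of the data. Assuming the time-frame scores are attention weights normalized to sum to one over the 10 frames (a common design, but an assumption here), their mean is necessarily 1/10, and only the deviation carries information. The scores below are arbitrary illustrative values, not model outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax; the outputs are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Ten arbitrary unnormalized scores, one per frame of a sequence.
scores = [0.3, -0.1, 0.0, 0.2, 0.1, -0.2, 0.05, 0.0, 0.15, -0.05]
weights = softmax(scores)
mean_weight = sum(weights) / len(weights)  # equals 1 / len(weights)
```

Under this assumption, comparing sequences by their mean weight is uninformative, which is why the analysis focuses on how far individual frames deviate from the uniform value.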

Next, we analyzed the threshold value for triggering an explanation at the end of a sequence. The threshold clearly influences the rate at which the verbal cue is generated. In Figure 10 we can see the probabilities of triggering a verbal response for differing threshold values. We empirically verified that threshold values higher than 0.02 yield verbal cues aligning with human expectations. In contrast, by using lower values, the noise patterns sometimes overrule the useful information, yielding responses that are difficult to verify.
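How such a threshold might turn frame weights into a verbal cue is sketched below. The decision rule is an assumption made for illustration: the peak weight is compared with the uniform value 1/T, the cue names the half of the sequence containing the peak, and weights that stay within the threshold are reported as "throughout". The rule actually used by the XAE module may differ.

```python
def utterance_part(frame_weights, threshold=0.02):
    """Name the part of the utterance that influenced the estimate most.

    frame_weights: per-frame attention weights summing to 1.
    threshold: minimum excess over the uniform weight for the peak frame
               to count as salient (0.02 follows the value in the text).
    """
    n = len(frame_weights)
    uniform = 1.0 / n
    peak = max(range(n), key=lambda i: frame_weights[i])
    if frame_weights[peak] - uniform <= threshold:
        return "throughout"
    return "beginning" if peak < n / 2 else "end"
```

With ten equal weights the function reports "throughout"; if the first frame carries a weight of 0.19 and the others 0.09 each, it reports "beginning".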

4.3 User Study

Table 1: Statistics referring to participants’ perception of the robot’s explainability and transparency solutions for different modalities of communication and different roles of participants (technical HRI background (Tech.), humanities HRI background (Hum.), or no HRI background (User)).

Method	Role	Satisf. Mean	Satisf. SD	Usefuln. Mean	Usefuln. SD	Intrusiv. Mean	Intrusiv. SD
embodied	Avg.	3.12	1.42	3.9	1.79	1.95	1.46
	Tech.	2.85	0.704	3.87	1.15	1.33	0.471
	Hum.	3.23	1.37	3.13	1.68	2.53	1.99
	User	3.19	1.75	4.27	2.08	1.97	1.49
verbal	Avg.	4.54	1.1	4.6	1.54	1.94	1.06
	Tech.	4.47	0.409	4.93	1.16	1.8	0.96
	Hum.	3.98	1.28	3.13	1.82	2.07	1.36
	User	4.82	1.21	5.12	1.2	1.94	1.06
attentional	Avg.	4.14	1.24	4.71	1.84	1.86	0.981
	Tech.	4.83	1.09	5.53	1.35	1.67	0.527
	Hum.	3.55	0.851	4.27	1.32	1.93	1.01
	User	4.09	1.37	4.55	2.22	1.91	1.17
functional	Avg.	4.32	0.943	4.49	1.86	1.78	1
	Tech.	4.67	0.991	4.67	1.9	1.73	1.16
	Hum.	4.42	1.03	4.13	2.1	1.8	0.869
	User	4.1	0.915	4.58	1.9	1.79	1.08

Table 1 reports the means and standard deviations of three questionnaire scales (Satisfaction, Usefulness, and Intrusiveness) grouped by the four explainability and transparency modalities (verbal, embodied, visual/attentional, and visual/functional) and the three groups of participants (technical HRI researchers, humanities HRI researchers, and naive users). The linear mixed models revealed a significant effect of the modality on Satisfaction. Specifically, participants were, on average, more satisfied with the Verbal (M = 4.54), Attentional (M = 4.14), and Functional (M = 4.32) modalities than with the Embodied one (M = 3.12) (Verb-Emb: B = 1.333, t = 3.991, p = 0.001; Func-Emb: B = 1.311, t = 3.925, p = 0.001; Att-Emb: B = 1.066, t = 3.19, p = 0.014; all tests computed with Bonferroni correction), as shown in Figure 11. No statistically significant differences were found between the groups or on the other scales.

In the supplementary materials, Tables E.1, E.2, E.3, and E.4 present all the results from the Spearman rank correlation tests computed on the three parameters (Satisfaction, Usefulness, and Intrusiveness), considering the four modalities independently. Concerning the verbal modality, the only significant relationship (a positive correlation) was found between Satisfaction and Usefulness ($r(19)=0.479$, $p=0.028$). The same relationship was found in all the other modalities (embodied: $r(19)=0.611$, $p=0.003$; attentional: $r(19)=0.728$, $p<.001$; functional: $r(19)=0.666$, $p<.001$). Moreover, a significant negative correlation was found between Usefulness and Intrusiveness in the embodied ($r(19)=-.644$, $p=0.002$), attentional ($r(19)=-.650$, $p=0.001$), and functional modalities ($r(19)=-.554$, $p=0.009$). The latter also showed a significant negative correlation between Satisfaction and Intrusiveness ($r(19)=-.624$, $p=0.003$).

With respect to the relationship between users’ judgments of the explainability/transparency solutions and their perception of the robot, a Spearman rank correlation test revealed a significant positive relationship between Satisfaction with the embodied modality and the robot’s perceived experience ($r(19)=0.449$, $p=0.041$) as well as its likeability ($r(19)=0.588$, $p=0.005$) (see Figure 12). No significant correlations were found between any of the above-mentioned scales (a to f) and Satisfaction with the other modalities. No other correlations were found on the other scales, with the exception of affective trust, which was found to have a significant positive relationship with the Usefulness of the attentional modality ($r(19)=0.456$, $p=0.038$) and with the Intrusiveness of the functional modality ($r(19)=0.634$, $p=0.02$).

5 Discussion

This study is guided by the effort to put into action transparency and explainability techniques in a social robot with multi-party conversation abilities. To this aim, we designed and applied XAI solutions to a real-time human activity recognition model for the estimation of the addressee and implemented it in a modular robotic architecture for multi-party conversation together with other transparency solutions to show the underlying processing of the robot’s behavior.

5.1 The real-time implementation in the architecture for multi-party conversation

Built with a modular approach, our architecture was designed as the means to connect our XAE model with the other modules necessary for the task of multi-party conversation: from Sound Detection to Spatial Memory and Speech Generation. At this stage, our architecture lacked only a sound localization module, which the experimenter handled with the Wizard-of-Oz technique. Beyond that, the interaction flow was smooth and autonomous, as presented in the video. While iCub’s verbal explanations after each participant’s utterance may disrupt the fluidity of the interaction, it is important to note that this is a deliberate design choice made specifically for the purposes of this study.

The modular design was preferred to an end-to-end approach for two reasons. First, multi-party conversation is a complex scenario involving various activities and multiple individuals interacting simultaneously. A modular approach offers greater controllability of the robot’s processes and behaviors in such unpredictable contexts. Second, real-time multi-modal processing makes it possible to exploit the synergy of different modules, enabling them to supervise each other and correct any erroneous estimations.

In the recorded multi-party interactions, the XAE model failed only twice, but thanks to the modular approach, these errors could be amended, and the conversation resumed correctly. For instance, thanks to Speech Understanding and Generation modules, when iCub is told the utterance was addressed to someone else, it apologizes for interrupting.

The connection with spatial memory is another pivotal point. Thanks to this module, the final estimation of the addressee is not only based on multiple features coming from the robot’s vision but also on continuously updated spatial-contextual information of the environment, following a more cognitive-inspired approach.

Contextual information about people in the environment was used in previous works, but it was given as input to the neural network before the AE estimation, without any possibility of updating it after the inference. Hence, to our knowledge, our architecture is the first that can 1) make corrections on the addressee identification based on additional exploration and spatial memory information and 2) discover new people and update the spatial memory based on the addressee estimation. For instance, in the first case, if the addressee is estimated to be on the speaker’s left, but the robot doesn’t detect anyone in that direction, it infers that the utterance was directed towards itself. On the other hand, in the second case, if the addressee is estimated at the speaker’s left and the robot does not remember anybody in that direction, it turns its gaze to check over there and may detect new people.
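The two cases can be combined into a single decision routine, sketched below under simplifying assumptions: directions are reduced to the labels "LEFT", "RIGHT", and "ROBOT", the spatial memory is a set of occupied directions, and `look` is a hypothetical callback that turns the gaze toward a direction and reports whether a face was detected there.

```python
def resolve_addressee(estimate, memory, look):
    """Cross-check the model's estimate against memory and exploration.

    estimate: "LEFT", "RIGHT", or "ROBOT", as output by the model.
    memory:   set of directions where a person is remembered (mutated
              when a new person is discovered).
    look:     callable(direction) -> bool, True if a face is detected.
    """
    if estimate == "ROBOT":
        return "ROBOT"
    if estimate in memory:
        return estimate          # someone is already known to be there
    if look(estimate):           # case 2: explore and discover a new person
        memory.add(estimate)
        return estimate
    return "ROBOT"               # case 1: nobody there, so the robot was addressed
```

Keeping the exploration behind a callback reflects the modular design: the decision logic does not need to know how the gaze controller or the face detector work.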

The implementation of the XAI model in the robot’s architecture for multi-party conversation aimed to enhance the comprehensibility of the robot’s underlying processes. The system’s opacity can be an obstacle to the perceived reliability of the robot (or any other artificial system) both for the users and its developers. Moreover, this issue becomes even more problematic when it comes to multi-faceted modular architectures, where several processes concur in executing a complex behavior (Wortham et al., 2017). To reach an efficient and smooth interaction, developers and HRI designers certainly need to take care not only of the robot’s performance but also of its user-friendly intelligibility (Sciutti et al., 2018). It is to ensure this understandability, and hence reliability, that we designed our framework and system to provide real-time clarifications of the robot’s behavior and the processes underlying its functioning.

5.2 The users’ evaluation

The act of explaining is a social mechanism: someone (the explainer) explains to someone else (the explainee) (Hilton, 1990). Recently, the XAI community has also recognized and exploited this social component of the explainability problem, highlighting the explainees’ needs within the explanation exchanges (Miller, 2019). The active role of the explainee has also been stressed by Rohlfing et al. (2021) in their co-constructive approach.

In our online user study, we presented multi-modal explanations to explainees belonging to three different groups (naive users or HRI researchers, either with a technical or humanities background) to collect their impressions and preferences about the robot’s explanations of its behavior in multi-party conversation scenarios. Results outlined no preferences for any of the explanation types between the groups.

However, we found that participants were more satisfied with the verbal, visual/attentional, and visual/functional explanations than with the embodied ones (i.e., expressing the model’s estimations and confidence via the robot’s facial expressions). Participants did not appreciate iCub’s embodied behavior compared to the other more explicit measures of transparency, even though they found it as useful as the others. This result may be due to the embodied behavior design choices that some users could have found less intuitive and expressive than expected. Interestingly, correlations between satisfaction with the embodied explanations and the robot’s perceived experience and likeability showed that the more people were satisfied with the explanations communicated through facial expressions, the more they liked the robot (and the more they attributed to the robot the capability of having experiences as well). None of the other modalities correlated with the robot’s perceived experience and likeability, reinforcing the importance of embodied behavior design for robots also in the field of XAI. These results highlight the importance of improving the transparency and reliability of embodied robotic behavior, such as gaze and facial expressions (as also observed in Matarese et al., 2021), and of investigating those implicit communication mechanisms more deeply to reach smoother HRI and human-like transparency.

Participants’ satisfaction with the explanations and their perceived usefulness correlated for all the modalities, meaning that participants’ appreciation came in general from a utilitarian viewpoint: the more useful, the more satisfying. This result is coherent with the declared objective of the user study since participants had to exploit the robot’s explanations to make sense of its behavior. All modalities except the verbal one showed strong negative correlations between perceived usefulness and intrusiveness, which supports the reliability of the scales used.

5.3 Limitation of the Architecture and Future Improvements

The explainable addressee estimation model we designed in this work offers verbal and embodied explanations alongside the classification of the addressee. However, it is important to address its limitations. Given the surprisingly high sensitivity of the frequency and quality of the explanations to the hyper-parameters, minimizing the classification error alone does not guarantee that the model produces meaningful explanations. Therefore, it would be useful to devise a training method that both achieves good accuracy and yields reasonable and discriminative explanations.

Evaluating the level of explainability (besides user studies) presents the greatest challenge because the standard evaluation metrics cannot be used when data with labeled explanations are lacking.

In the case of addressee estimation, the model performance heavily relies on the quality and representativeness of the dataset, which may not always align with real-world scenarios. Thus, for continuing in this line of research, a data-centric approach might yield further substantial improvements both in the quality of the explanations and in the accuracy of the model.

As for the multi-party modular architecture, a current limitation is that the sound localization input was provided by the experimenter rather than by an autonomous Sound Localization module. For a final version of the architecture, we foresee the use of a Sound Localization module to classify the direction of the incoming sound source with respect to a robot-centric reference frame (e.g., from the right, front, or left of the robot). This will enable the activation of specific attention mechanisms, such as redirecting the robot’s gaze toward the direction of the speakers’ voice when they are outside its field of view. Moreover, as one can see from the video, the visual/attentional explanations sometimes flickered. This problem was due to the local network, which was sometimes saturated during the working day. The modular approach we used relies heavily on the local network to let the modules exchange information with each other, and we are aware that this may inconvenience users. At this stage, we do not have a real-time evaluation of the architecture, but we can refer to the accuracy of the XAE model (see Section 4.1). We leave such a real-time evaluation with the robot for future work.

6 Conclusion

The development of autonomous robots often aims at their deployment in social environments. Whether in hospitals, schools, restaurants, offices, or homes, robots are required to work safely and efficiently, two things that in human-populated contexts require social-like abilities to perceive the (social) world and act accordingly. To prove their reliability, autonomous systems must be transparent and explainable, two qualities that can increase the users’ trust in robots if the former are put in a position to interpret the behavior of the latter.

This work first proposes an explainable machine learning model to solve the problem of addressee estimation, that is, figuring out to whom an interlocutor is speaking within a multi-party conversation. Next, we embedded such a model in a modular architecture to enable the iCub robot to actively participate in such complex interactions. Finally, we proposed a setting in which we assessed the feasibility of the overall architecture while collecting the impressions of different types of users on the robot’s explainability and transparency mechanisms.

Thanks to our attention-based approach, our model does not require any additional processing to provide explanations fast, while also increasing the level of transparency. The saliency map of the speaker’s face, the relative importance of each input feature, and insights about which parts of the interaction affected the robot’s estimation the most are acquired from our model and integrated with other techniques to clarify the opacity of the iCub robot’s decision in a challenging scenario such as multi-party conversation.

When it comes to deploying deep neural networks in robots to predict human behavior and engage in social interaction, it is fundamental to adopt a unified approach. From neural network design to the integration of computer vision algorithms into a modular architecture, ending with an exploratory user study, our work sought to do this by considering and integrating different perspectives to reduce the opacity of artificial systems designed to interact with us.

Acknowledgments

The research leading to these results has received funding from the project titled TERAIS in the framework of the program Horizon-Widera-2021 of the European Union under the Grant agreement number 101079338.

I. Bečková and Š. Pócoš were also supported in part by The Slovak Research and Development Agency, project no. APVV-21-0105.

This Preprint has been prepared based on the following template: https://github.com/kourgeorge/arxiv-style.git

References

Addlesee et al., (2024) Addlesee, A., Cherakara, N., Nelson, N., Hernández García, D., Gunson, N., Sieińska, W., Romeo, M., Dondrup, C., and Lemon, O. (2024). A multi-party conversational social robot using llms. In Companion of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’24, page 1273–1275, New York, NY, USA. Association for Computing Machinery.
Arumugam et al., (2022) Arumugam, B., Bhattacharjee, S. D., and Yuan, J. (2022). Multimodal attentive learning for real-time explainable emotion recognition in conversations. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1210–1214.
Atkinson and Shiffrin, (1968) Atkinson, R. and Shiffrin, R. (1968). Human memory: A proposed system and its control processes. volume 2 of Psychology of Learning and Motivation, pages 89–195. Academic Press.
Auer, (2018) Auer, P. (2018). Gaze, addressee selection and turn-taking in three-party interaction, pages 197–231. John Benjamins Amsterdam.
Bae and Bennett, (2023) Bae, Y.-H. and Bennett, C. C. (2023). Real-time multimodal turn-taking prediction to enhance cooperative dialogue during human-agent interaction. In 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pages 2037–2044.
Barredo Arrieta et al., (2020) Barredo Arrieta, A., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Garcia, S., Gil-Lopez, S., Molina, D., Benjamins, R., Chatila, R., and Herrera, F. (2020). Explainable artificial intelligence (xai): Concepts, taxonomies, opportunities and challenges toward responsible ai. Information Fusion, 58:82–115.
Belgiovine et al., (2022) Belgiovine, G., Gonzalez-Billandon, J., Sciutti, A., Sandini, G., and Rea, F. (2022). Hri framework for continual learning in face recognition. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8226–8233. IEEE.
Bernotat et al., (2021) Bernotat, J., Eyssel, F., and Sachse, J. (2021). The (fe) male robot: how robot body shape impacts first impressions and trust towards robots. International Journal of Social Robotics, 13(3):477–489.
Bewley et al., (2016) Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B. (2016). Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP), pages 3464–3468. IEEE.
Biewald, (2020) Biewald, L. (2020). Experiment tracking with weights and biases. Software available from wandb.com.
Bilac et al., (2017) Bilac, M., Chamoux, M., and Lim, A. (2017). Gaze and filled pause detection for smooth human-robot conversations. In 2017 IEEE-RAS 17th International Conference on Humanoid Robotics (Humanoids), pages 297–304.
Brauwers and Frasincar, (2023) Brauwers, G. and Frasincar, F. (2023). A general survey on attention mechanisms in deep learning. IEEE Transactions on Knowledge and Data Engineering, 35(4):3279–3298.
Cao et al., (2019) Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S., and Sheikh, Y. A. (2019). Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Chakraborti et al., (2017) Chakraborti, T., Sreedharan, S., Zhang, Y., and Kambhampati, S. (2017). Plan explanations as model reconciliation: Moving beyond explanation as soliloquy. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, page 156–163. AAAI Press.
Cho et al., (2014) Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014). On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111. Association for Computational Linguistics.
Ciatto et al., (2020) Ciatto, G., Schumacher, M. I., Omicini, A., and Calvaresi, D. (2020). Agent-based explanations in ai: Towards an abstract framework. In Explainable, Transparent Autonomous Agents and Multi-Agent Systems, pages 3–20. Springer International Publishing.
Conati et al., (2021) Conati, C., Barral, O., Putnam, V., and Rieger, L. (2021). Toward personalized xai: A case study in intelligent tutoring systems. Artificial Intelligence, 298:103503.
De Graaf and Malle, (2017) De Graaf, M. M. and Malle, B. F. (2017). How people explain action (and autonomous intelligent systems should too). In 2017 AAAI Fall Symposium Series.
Dhaussy et al., (2023) Dhaussy, T., Jabaian, B., Lefèvre, F., and Horaud, R. (2023). Audio-visual speaker diarization in the framework of multi-user human-robot interaction. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5.
Dosovitskiy et al., (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
Eldardeer et al., (2021) Eldardeer, O., Gonzalez-Billandon, J., Grasse, L., Tata, M., and Rea, F. (2021). A biological inspired cognitive framework for memory-based multi-sensory joint attention in human-robot interactive tasks. Frontiers in Neurorobotics, 15.
Fiske et al., (2007) Fiske, S. T., Cuddy, A. J., and Glick, P. (2007). Universal dimensions of social cognition: Warmth and competence. Trends in cognitive sciences, 11(2):77–83.
Gallucci, (2019) Gallucci, M. (2019). Gamlj: General analyses for linear models. (Jamovi module).
Gambino and Liu, (2022) Gambino, A. and Liu, B. (2022). Considering the context to build theory in hci, hri, and hmc: Explicating differences in processes of communication and socialization with social technologies. Human-Machine Communication, 4:111–130.
Gray et al., (2007) Gray, H. M., Gray, K., and Wegner, D. M. (2007). Dimensions of mind perception. Science, 315(5812):619–619.
Groß et al., (2023) Groß, A., Singh, A., Banh, N. C., Richter, B., Scharlau, I., Rohlfing, K. J., and Wrede, B. (2023). Scaffolding the human partner by contrastive guidance in an explanatory human-robot dialogue. Frontiers in Robotics and AI, 10.
Gu et al., (2022) Gu, J.-C., Tao, C., and Ling, Z.-H. (2022). Who says what to whom: A survey of multi-party conversations. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22).
Halilovic and Lindner, (2023) Halilovic, A. and Lindner, F. (2023). Visuo-textual explanations of a robot’s navigational choices. In Companion of the 2023 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’23, page 531–535, New York, NY, USA. Association for Computing Machinery.
He et al., (2015) He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034.
Hilton, (1990) Hilton, D. J. (1990). Conversational processes and causal explanation. Psychological Bulletin, 107(1):65.
Hoffman et al., (2018) Hoffman, R. R., Mueller, S. T., Klein, G., and Litman, J. (2018). Metrics for explainable ai: Challenges and prospects. arXiv preprint arXiv:1812.04608.
Huang et al., (2011) Huang, H.-H., Baba, N., and Nakano, Y. (2011). Making virtual conversational agent aware of the addressee of users’ utterances in multi-user conversation using nonverbal information. In Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI ’11, page 401–408, New York, NY, USA. Association for Computing Machinery.
Ishii et al., (2016) Ishii, R., Otsuka, K., Kumano, S., and Yamato, J. (2016). Prediction of who will be the next speaker and when using gaze behavior in multiparty meetings. ACM Trans. Interact. Intell. Syst., 6(1).
Jakobson, (1981) Jakobson, R. (1981). Linguistics and Poetics, pages 18–51. De Gruyter Mouton, Berlin, Boston.
Jayagopi et al., (2012) Jayagopi, D. B., Sheikhi, S., Klotz, D., Wienke, J., Odobez, J.-M., Wrede, S., Khalidov, V., Nguyen, L., Wrede, B., and Gatica-Perez, D. (2012). The vernissage corpus: A multimodal human-robot-interaction dataset. Technical report.
Jayagopi et al., (2013) Jayagopi, D. B., Sheikhi, S., Klotz, D., Wienke, J., Odobez, J.-M., Wrede, S., Khalidov, V., Nguyen, L., Wrede, B., and Gatica-Perez, D. (2013). The vernissage corpus: A conversational human-robot-interaction dataset. In 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 149–150. IEEE.
Jiang et al., (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. (2023). Mistral 7b.
Jocher et al., (2023) Jocher, G., Chaurasia, A., and Qiu, J. (2023). Ultralytics yolo.
Kashefi et al., (2023) Kashefi, R., Barekatain, L., Sabokrou, M., and Aghaeipoor, F. (2023). Explainability of vision transformers: A comprehensive review and new perspectives. arXiv preprint arXiv:2311.06786.
Kerzel et al., (2022) Kerzel, M., Ambsdorf, J., Becker, D., Lu, W., Strahl, E., Spisak, J., Gäde, C., Weber, T., and Wermter, S. (2022). What’s on your mind, nico? KI - Künstliche Intelligenz, 36(3):237–254.
Kotseruba and Tsotsos, (2020) Kotseruba, I. and Tsotsos, J. K. (2020). 40 years of cognitive architectures: core cognitive abilities and practical applications. Artificial Intelligence Review, 53(1):17–94.
Krizhevsky et al., (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc.
Lai et al., (2021) Lai, V., Chen, C., Liao, Q. V., Smith-Renner, A., and Tan, C. (2021). Towards a science of human-ai decision making: a survey of empirical studies. arXiv preprint arXiv:2112.11471.
Lundberg and Lee, (2017) Lundberg, S. M. and Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Maruyama et al., (2022) Maruyama, Y., Fukui, H., Hirakawa, T., Yamashita, T., Fujiyoshi, H., and Sugiura, K. (2022). Visual explanation of deep q-network for robot navigation by fine-tuning attention branch. arXiv preprint arXiv:2208.08613.
Matarese et al., (2023a) Matarese, M., Cocchella, F., Rea, F., and Sciutti, A. (2023a). Ex(plainable) machina: how social-implicit xai affects complex human-robot teaming tasks. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11986–11993.
Matarese et al., (2023b) Matarese, M., Cocchella, F., Rea, F., and Sciutti, A. (2023b). Natural born explainees: how users’ personality traits shape the human-robot interaction with explainable robots. In 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pages 1786–1793.
Matarese et al., (2021) Matarese, M., Sciutti, A., Rea, F., and Rossi, S. (2021). Toward robots’ behavioral transparency of temporal difference reinforcement learning with a human teacher. IEEE Transactions on Human-Machine Systems, 51(6):578–589.
Mazzola et al., (2024) Mazzola, C., Rea, F., and Sciutti, A. (2024). Real-time addressee estimation: Deployment of a deep-learning model on the icub robot. In Proceedings of the 2023 I-RIM Conference.
Mazzola et al., (2023) Mazzola, C., Romeo, M., Rea, F., Sciutti, A., and Cangelosi, A. (2023). To whom are you talking? a deep learning model to endow social robots with addressee estimation skills. In 2023 International Joint Conference on Neural Networks (IJCNN), pages 1–10.
McMahon and Isik, (2023) McMahon, E. and Isik, L. (2023). Seeing social interactions. Trends in Cognitive Sciences, 27(12):1165–1179. doi: 10.1016/j.tics.2023.09.001.
Metta et al., (2006) Metta, G., Fitzpatrick, P., and Natale, L. (2006). Yarp: Yet another robot platform. International Journal of Advanced Robotic Systems, 3(1):8.
Miller, (2019) Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.
Niu et al., (2021) Niu, Z., Zhong, G., and Yu, H. (2021). A review on the attention mechanism of deep learning. Neurocomputing, 452:48–62.
Osokin, (2018) Osokin, D. (2018). Real-time 2d multi-person pose estimation on cpu: Lightweight openpose. In arXiv preprint arXiv:1811.12004.
Radford et al., (2023) Radford, A., Kim, J. W., Xu, T., Brockman, G., Mcleavey, C., and Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J., editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 28492–28518. PMLR.
Ribeiro et al., (2016) Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). Why should i trust you?: Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Richter et al., (2016) Richter, V., Carlmeyer, B., Lier, F., Meyer zu Borgsen, S., Schlangen, D., Kummert, F., Wachsmuth, S., and Wrede, B. (2016). Are you talking to me? improving the robustness of dialogue systems in a multi party hri scenario by incorporating gaze direction and lip movement of attendees. In Proceedings of the Fourth International Conference on Human Agent Interaction, HAI ’16, page 43–50, New York, NY, USA. Association for Computing Machinery.
Robrecht and Kopp, (2023) Robrecht, A. S. and Kopp, S. (2023). Snape: A sequential non-stationary decision process model for adaptive explanation generation. In ICAART (1), pages 48–58.
Rohlfing et al., (2021) Rohlfing, K. J., Cimiano, P., Scharlau, I., Matzner, T., Buhl, H. M., Buschmeier, H., Esposito, E., Grimminger, A., Hammer, B., Häb-Umbach, R., Horwath, I., Hüllermeier, E., Kern, F., Kopp, S., Thommes, K., Ngonga Ngomo, A.-C., Schulte, C., Wachsmuth, H., Wagner, P., and Wrede, B. (2021). Explanation as a social practice: Toward a conceptual framework for the social design of ai systems. IEEE Transactions on Cognitive and Developmental Systems, 13(3):717–728.
Roncone et al., (2016) Roncone, A., Pattacini, U., Metta, G., and Natale, L. (2016). A cartesian 6-dof gaze controller for humanoid robots. In Robotics: science and systems, volume 2016.
Saade, (2023) Saade, P. (2023). Optimizing the portability of an addressee estimation model in the icub social robot. Master’s thesis, University of Genoa.
Sciutti et al., (2018) Sciutti, A., Mara, M., Tagliasco, V., and Sandini, G. (2018). Humanizing human-robot interaction: On the importance of mutual understanding. IEEE Technology and Society Magazine, 37(1):22–29.
Selkowitz et al., (2017) Selkowitz, A. R., Larios, C. A., Lakhmani, S. G., and Chen, J. Y. (2017). Displaying information to support transparency for autonomous platforms. In Savage-Knepshield, P. and Chen, J., editors, Advances in Human Factors in Robots and Unmanned Systems, pages 161–173, Cham. Springer International Publishing.
Selvaraju et al., (2017) Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision, pages 618–626.
Setchi et al., (2020) Setchi, R., Dehkordi, M. B., and Khan, J. S. (2020). Explainable robotics in human-robot interactions. Procedia Computer Science, 176:3057–3066. Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 24th International Conference KES2020.
Sheikhi et al., (2013) Sheikhi, S., Babu Jayagopi, D., Khalidov, V., and Odobez, J.-M. (2013). Context aware addressee estimation for human robot interaction. In GazeIn ’13: Proceedings of the 6th workshop on Eye gaze in intelligent human machine interaction: gaze in multimodal interaction, GazeIn ’13, page 1–6, New York, NY, USA. Association for Computing Machinery.
Shen et al., (2017) Shen, T., Zhou, T., Long, G., Jiang, J., Pan, S., and Zhang, C. (2017). Disan: Directional self-attention network for rnn/cnn-free language understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 32.
Simonyan et al., (2014) Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Deep inside convolutional networks: Visualising image classification models and saliency maps. In Workshop at International Conference on Learning Representations.
Sinha et al., (2021) Sinha, P., Mitra, P., da Costa, A. A. B., and Kekatos, N. (2021). Explaining outcomes of multi-party dialogues using causal learning. arXiv preprint arXiv:2105.00944.
Skantze, (2021) Skantze, G. (2021). Turn-taking in conversational systems and human-robot interaction: A review. Computer Speech & Language, 67:101178.
Sobrín-Hidalgo et al., (2024) Sobrín-Hidalgo, D., González-Santamarta, M. Á., Guerrero-Higueras, Á. M., Rodríguez-Lera, F. J., and Matellán-Olivera, V. (2024). Enhancing robot explanation capabilities through vision-language models: a preliminary study by interpreting visual inputs for improved human-robot interaction. arXiv preprint arXiv:2404.09705.
Spaccatini et al., (2019) Spaccatini, F., Pacilli, M. G., Giovannelli, I., Roccato, M., and Penone, G. (2019). Sexualized victims of stranger harassment and victim blaming: The moderating role of right-wing authoritarianism. Sexuality & Culture, 23(3):811–825.
Stange et al., (2022) Stange, S., Hassan, T., Schröder, F., Konkol, J., and Kopp, S. (2022). Self-explaining social robots: An explainable behavior generation architecture for human-robot interaction. Frontiers in Artificial Intelligence, page 87.
Tabrez et al., (2019) Tabrez, A., Agrawal, S., and Hayes, B. (2019). Explanation-based reward coaching to improve human performance via reinforcement learning. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 249–257. IEEE.
Tesema et al., (2023) Tesema, F. B., Gu, J., Song, W., Wu, H., Zhu, S., Lin, Z., Huang, M., Wang, W., and Kumar, R. (2023). Addressee detection using facial and audio features in mixed human–human and human–robot settings: A deep learning framework. IEEE Systems, Man, and Cybernetics Magazine, 9(2):25–38.
van Turnhout et al., (2005) van Turnhout, K., Terken, J., Bakx, I., and Eggen, B. (2005). Identifying the intended addressee in mixed human-human and human-computer interaction from non-verbal features. In Proceedings of the 7th International Conference on Multimodal Interfaces, ICMI ’05, page 175–182, New York, NY, USA. Association for Computing Machinery.
Vaswani et al., (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Vernon, (2014) Vernon, D. (2014). Artificial cognitive systems: A primer.
Wightman, (2019) Wightman, R. (2019). Pytorch image models. GitHub repository.
Wortham et al., (2016) Wortham, R. H., Theodorou, A., and Bryson, J. J. (2016). What does the robot think? transparency as a fundamental design requirement for intelligent systems. In IJCAI 2016 Ethics for AI Workshop.
Wortham et al., (2017) Wortham, R. H., Theodorou, A., and Bryson, J. J. (2017). Robot transparency: Improving understanding of intelligent behaviour for designers and users. In Towards Autonomous Robotic Systems: 18th Annual Conference, TAROS 2017, Guildford, UK, July 19–21, 2017, Proceedings. Lecture Notes in Artificial Intelligence, pages 274–289. Springer.
Zhou et al., (2016) Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. (2016). Learning deep features for discriminative localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929.
Zhu et al., (2022) Zhu, H., Yu, C., and Cangelosi, A. (2022). Affective human-robot interaction with multimodal explanations. In International Conference on Social Robotics, pages 241–252. Springer.

Appendix A

List of abbreviations

Abbreviations
AE	Addressee Estimation
HRI	Human-Robot Interaction
IAE	Improved Addressee Estimation (model)
LLM	Large Language Model
MOT	Multi-Object Tracker
SOTA	State of the art
XAE	Explainable Addressee Estimation (model)
XAI	Explainable Artificial Intelligence

Appendix B

Hyper-parameter search

The list of all optimized hyper-parameters, along with all considered values and the chosen values, is provided in Table B.1. The hyper-parameters were split into multiple groups, which were optimized separately. They are ordered and grouped approximately in the order in which their values were chosen and fixed (some parameters were present in multiple groups). “Gamma” controls the learning rate decay. “Convolutional” controls the number of output channels of the convolutional layers: “small” = [6, 8, 12, 16], “medium” = [8, 12, 16, 32], “large” = [8, 16, 32, 64]. The last eight hyper-parameters control data augmentation. We used WandB (Biewald, 2020) with Bayesian search for the optimization.
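As an illustration of the two named hyper-parameters, the following minimal Python sketch encodes the channel presets quoted above and shows how a decay factor could act on a learning rate. The multiplicative per-epoch schedule and the helper name `decayed_lr` are assumptions made for this example; the text only states that “Gamma” controls the learning rate decay.

```python
# Channel presets for the convolutional layers, as quoted in the text.
CONV_PRESETS = {
    "small": [6, 8, 12, 16],
    "medium": [8, 12, 16, 32],
    "large": [8, 16, 32, 64],
}


def decayed_lr(initial_lr: float, gamma: float, epoch: int) -> float:
    """Learning rate after `epoch` decay steps.

    ASSUMPTION: the rate is multiplied by `gamma` once per epoch
    (exponential decay); the exact schedule is not given in the text.
    """
    return initial_lr * gamma ** epoch


if __name__ == "__main__":
    # learning_rate1 and gamma1 are the chosen values from Table B.1.
    for epoch in (0, 5, 14):
        print(epoch, decayed_lr(0.00018, 0.75, epoch))
    print(CONV_PRESETS["large"])
```

Under this assumed schedule, a gamma close to 1 (such as gamma2) keeps the learning rate almost constant, whereas a small gamma (such as gamma3) shrinks it quickly.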

The final IAE model architecture is summarized in Table B.2.

Tables B.3 and B.4 contain the hyperparameters of the XAE model and its architecture (where relevant). The activation function in the recurrent network is applied to the inputs of the GRU unit and to the outputs of the GRU (i.e., the inputs to the final fully-connected layer).

Table B.1: List of all considered hyper-parameters and their values ([min, max] range for real-valued ones, set of values for discrete). The last column contains values chosen for the IAE model.

Parameter name	Considered values	Chosen value
num_epochs	{5, 6, …, 50}	15
normalisation	{true_stats, imagenet}	true_stats
dropout	{0, 0.1, 0.2, 0.3}	0.2
act1	{ReLU, Tanh}	Tanh
act2	{ReLU, Tanh}	Tanh
act3	{ReLU, Tanh}	Tanh
hid1	{128, 129, …, 400}	256
hid2	{16, 17, …, 40}	32
hid3	{16, 17, …, 40}	32
out1	{10, 11, …, 64}	32
out2	{10, 11, …, 32}	20
out3	{8, 9, …, 32}	20
optimizer1	{SGD, Adam, RMS}	RMS
optimizer2	{SGD, Adam, RMS}	RMS
optimizer3	{SGD, Adam, RMS}	Adam
post_fusion	{LSTM, GRU}	GRU
convolutional	{small, medium, large}	large
gamma1	[0.5, 1]	0.75
gamma2	[0.5, 1]	0.9725
gamma3	[0.3, 1]	0.5
learning_rate1	[ $e^{-10}$ , $e^{-7}$ ]	0.00018
learning_rate2	[ $e^{-10}$ , $e^{-2}$ ]	0.01
learning_rate3	[ $e^{-10}$ , $e^{-7}$ ]	0.00009
batch_size	{10, 11, …, 350}	16
brightness	[0, 0.5]	0.2
contrast	[0, 0.5]	0.4
saturation	[0, 0.5]	0.45
hue	[0, 0.25]	0.135
angle	[0, 45]	25
crop	[40, 50]	44
kernel_size	{1, 3, 5, 7, 9}	7
sigma	[0.1, 3]	0.8

Table B.2: Architecture of our final IAE model. All convolution layers use kernel size 5. Relevant layer-specific hyper-parameters are listed in the brackets.

Face image model (net 1)
Input	50x50 px RGB images
Convolution	8 output channels
Activation (act1)	Hyperbolic tangent
Convolution	16 output channels
Activation (act1)	Hyperbolic tangent
Max-Pool	Kernel of size 2 $\times$ 2
Dropout (dropout)	p = 0.2
Convolution	32 output channels
Activation (act1)	Hyperbolic tangent
Convolution	64 output channels
Activation (act1)	Hyperbolic tangent
Max-Pool	Kernel of size 2 $\times$ 2
Dropout (dropout)	p = 0.2
Flatten
Fully-connected (hid1)	256 output neurons
Activation (act1)	Hyperbolic tangent
Fully-connected (out1)	32 output neurons
Pose model (net 2)
Input	Vector of length 54
Fully-connected (hid2)	32 output neurons
Activation (act2)	Hyperbolic tangent
Fully-connected (out2)	20 output neurons
Recurrent model (net 3)
Input	Concat. of outputs
GRU block (post_fusion, hid3)	32 hidden neurons
Dropout (dropout)	p = 0.2
Fully-connected (out3)	20 output neurons
Activation (act3)	Hyperbolic tangent
Fully-connected	3 output neurons

Table B.3: Hyperparameters of the XAE model. Parameters with indices 1–4 control the ViT (M1), the MLP for pose vectors, the intermediate network (M2), and the GRU with attention (M3), respectively.

XAE model hyperparameters
gamma1	0.70807
gamma2	0.96882
gamma3	0.51241
gamma4	0.9274
learning_rate1	0.00003791
learning_rate2	0.00117708
learning_rate3	0.00185
learning_rate4	0.00006085
optimizer1	RMS
optimizer2	RMS
optimizer3	RMS
optimizer4	Adam
batch_size	4
num_epochs	8

Table B.4: Hyperparameters/architecture of our final XAE model. For processing face images, we use the standard vision transformer architecture, therefore, we only provide the chosen hyperparameters. The architecture of the recurrent part is discussed in more detail in Paragraph 3.1.2.

Face image model (ViT)
Input	50x50 px RGB images
Patch size	4x4 px
Embedding dimension	42
Depth	6
Num. heads	6
Output dimension	185
Pose model (fully-connected)
Input	Vector of length 54
Fully-connected	73 output neurons
Activation	Hyperbolic tangent
Fully-connected	185 output neurons
Intermediate (data-fusion) model
Input	Two 185-dimensional vectors
Fully-connected	14 output neurons
Activation	ReLU
Fully-connected	1 output neuron
Normalization	Softmax, outputs weights
Weighted sum of inputs	Output vector of length 185
GRU with attention (recurrent) model
Input	Sequence of 185-dim. vectors
Values dimension	81
Keys dimension	20
GRU input dim.	20
GRU hidden dim.	20
Activation	Hyperbolic tangent
Fully-connected	81 inputs, 3 outputs
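The intermediate (data-fusion) block of Table B.4 can be sketched in NumPy as follows. This is only an illustrative reading of the table: we assume that the same two-layer scoring network is applied to each 185-dimensional modality vector separately, that the softmax is taken over the two resulting scores, and the weights are random placeholders rather than trained values. The function name `fuse` is hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - x.max())
    return z / z.sum()


def fuse(face_vec, pose_vec, W1, b1, w2, b2):
    """Weight two modality vectors and return their weighted sum.

    ASSUMPTION: one shared scorer (FC 185->14, ReLU, FC 14->1) is applied
    to each modality; softmax over the two scores gives the weights.
    """
    inputs = np.stack([face_vec, pose_vec])       # (2, 185)
    hidden = np.maximum(inputs @ W1 + b1, 0.0)    # (2, 14), ReLU
    scores = hidden @ w2 + b2                     # (2,)
    weights = softmax(scores)                     # (2,), sums to 1
    fused = weights @ inputs                      # (185,)
    return fused, weights


def random_scorer(dim=185, hidden=14, seed=0):
    """Random placeholder parameters with the sizes from Table B.4."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(dim, hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=0.1, size=hidden)
    return W1, b1, w2, 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    face, pose = rng.normal(size=185), rng.normal(size=185)
    fused, weights = fuse(face, pose, *random_scorer())
    print(fused.shape, weights)
```

The two softmax weights are directly readable as the relative importance of the face and pose streams, which is what makes this kind of fusion convenient for explanation purposes.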

Appendix C

To merge the two data streams, we also tested some additional architectural concepts. These methods, however, neither surpassed the baseline accuracy significantly nor provided a clear explanation to the user.

Additional method 1

In this method, we transform the two modalities, the pose and face-image vectors. We build on the mechanism of multi-dimensional attention proposed by Shen et al. (2017), in which the attention scores weight not only individual value vectors but also each of their elements. For a given time frame $t$ , this procedure allows us to merge the two modalities (outputs of the face and pose networks, $\boldsymbol{f}_{t}\in\mathbb{R}^{d_{face}}$ and $\boldsymbol{p}_{t}\in\mathbb{R}^{d_{pose}}$ respectively) into a single representation, which is then fed to the final recurrent network. In this way, the modalities are combined while taking into account their cross-relations. We diverge from the traditional approach in that we do not combine multiple value vectors into a single representation; rather, we compute one value vector at a time and feed it to the RNN. The exact computation is as follows:

$$\boldsymbol{q}_{t}=\mathbf{W}^{Q}\boldsymbol{p}_{t},\quad\boldsymbol{k}_{t}=\mathbf{W}^{K}\boldsymbol{f}_{t},\quad\boldsymbol{v}_{t}=\mathbf{W}^{V}\boldsymbol{f}_{t},\tag{C.1}$$

where the matrices, $\mathbf{W}^{Q}\in\mathbb{R}^{d_{q}\times d_{pose}}$ , $\mathbf{W}^{K}\in\mathbb{R}^{d_{k}\times d_{face}}$ and $\mathbf{W}^{V}\in\mathbb{R}^{d_{v}\times d_{face}}$ are trainable parameters representing transformation of the corresponding vectors to query, key and value.

The next step is to compute the importance weights for each of the elements of the value vector.

$$\boldsymbol{e}_{t}=\boldsymbol{W}^{T}_{D}\times\operatorname{act}(\boldsymbol{W}_{1}\times\boldsymbol{q}_{t}+\boldsymbol{W}_{2}\times\boldsymbol{k}_{t}+\boldsymbol{b}),\tag{C.2}$$

where $\boldsymbol{W}_{1}\in\mathbb{R}^{d_{inner}\times d_{q}}$ , $\boldsymbol{W}_{2}\in\mathbb{R}^{d_{inner}\times d_{k}}$ , $\boldsymbol{W}_{D}\in\mathbb{R}^{d_{inner}\times d_{v}}$ and $\boldsymbol{b}\in\mathbb{R}^{d_{inner}}$ are trainable parameters.

The input to the RNN is the Hadamard (element-wise) product of $\boldsymbol{e}_{t}$ and $\boldsymbol{v}_{t}$ . Thus, each element of the value vector has its own weight.
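Equations (C.1) and (C.2), followed by the Hadamard product, can be sketched in NumPy as below. All dimensions, the random initialization, and the choice of the hyperbolic tangent for the unspecified activation $act$ are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np


def init_params(d_pose, d_face, d_q, d_k, d_v, d_inner, seed=0):
    """Random placeholder parameters with the shapes stated in the text."""
    rng = np.random.default_rng(seed)
    shapes = {
        "W_Q": (d_q, d_pose), "W_K": (d_k, d_face), "W_V": (d_v, d_face),
        "W_1": (d_inner, d_q), "W_2": (d_inner, d_k), "W_D": (d_inner, d_v),
        "b": (d_inner,),
    }
    return {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}


def multi_dim_fusion(p_t, f_t, params, act=np.tanh):
    """Return the RNN input e_t * v_t for one time frame.

    ASSUMPTION: `act` defaults to tanh; the text leaves it unspecified.
    """
    q = params["W_Q"] @ p_t                       # query from pose, (d_q,)
    k = params["W_K"] @ f_t                       # key from face, (d_k,)
    v = params["W_V"] @ f_t                       # value from face, (d_v,)
    inner = act(params["W_1"] @ q + params["W_2"] @ k + params["b"])
    e = params["W_D"].T @ inner                   # one weight per value element, (d_v,)
    return e * v                                  # Hadamard product


if __name__ == "__main__":
    params = init_params(d_pose=20, d_face=32, d_q=8, d_k=8, d_v=16, d_inner=12)
    rng = np.random.default_rng(1)
    out = multi_dim_fusion(rng.normal(size=20), rng.normal(size=32), params)
    print(out.shape)
```

Note that, following (C.2) as written, the element-wise weights are not normalized before the product.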

Additional method 2

In this method, we leverage the general attention procedure (Brauwers and Frasincar, 2023). After the input is processed, we obtain representations of the pose and the face, $\boldsymbol{p}_{t}$ and $\boldsymbol{f}_{t}$ , respectively. In this scenario, however, $\boldsymbol{p}_{t}\in\mathbb{R}^{d_{pose}}$ is a single vector, whereas $\boldsymbol{f}_{t}\in\mathbb{R}^{n_{p}\times d_{embed}}$ is a series of vectors $\boldsymbol{f}_{t,i}$ , each corresponding to a channel aggregation for a single pixel after the convolutions.

To continue with a general scheme of attention, we create our query, keys, and values as follows:

$$\boldsymbol{q}_{t}=\mathbf{W}^{Q}\boldsymbol{p}_{t},\quad\boldsymbol{k}_{t,i}=\mathbf{W}^{K}\boldsymbol{f}_{t,i},\quad\boldsymbol{v}_{t,i}=\mathbf{W}^{V}\boldsymbol{f}_{t,i},\tag{C.3}$$

$$e_{t,i}=\boldsymbol{w}^{T}_{D}\times\operatorname{act}(\boldsymbol{W}_{1}\times\boldsymbol{q}_{t}+\boldsymbol{W}_{2}\times\boldsymbol{k}_{t,i}+\boldsymbol{b}),\tag{C.4}$$

where $\boldsymbol{W}_{1}\in\mathbb{R}^{d_{inner}\times d_{q}}$ , $\boldsymbol{W}_{2}\in\mathbb{R}^{d_{inner}\times d_{k}}$ , $\boldsymbol{w}_{D}\in\mathbb{R}^{d_{inner}}$ and $\boldsymbol{b}\in\mathbb{R}^{d_{inner}}$ are trainable parameters. This is followed by an alignment step, in which we normalize the contribution of each component of the face representation.

$$a_{t,i}=\mathrm{softmax}(e_{t,i};\boldsymbol{e}_{t})=\frac{\exp(e_{t,i})}{\sum_{j=1}^{n_{p}}\exp(e_{t,j})}.\tag{C.5}$$

The final representation $\boldsymbol{r}_{t}$ of the input is a weighted sum of the face components:

$$\boldsymbol{r}_{t}=\sum_{i=1}^{n_{p}}a_{t,i}\,\boldsymbol{v}_{t,i}.\tag{C.6}$$
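The steps of Additional method 2 (query, keys and values, additive scoring, softmax alignment, and weighted sum) can be sketched in NumPy as follows. The dimensions, the random parameters, and the use of the hyperbolic tangent for $act$ are assumptions for illustration only, and the function names are hypothetical.

```python
import numpy as np


def init_params(d_pose, d_embed, d_q, d_k, d_v, d_inner, seed=0):
    """Random placeholder parameters; the scoring vector w_D has length d_inner."""
    rng = np.random.default_rng(seed)
    shapes = {
        "W_Q": (d_q, d_pose), "W_K": (d_k, d_embed), "W_V": (d_v, d_embed),
        "W_1": (d_inner, d_q), "W_2": (d_inner, d_k),
        "w_D": (d_inner,), "b": (d_inner,),
    }
    return {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}


def additive_attention(p_t, F_t, params, act=np.tanh):
    """Attend over the n_p face components of F_t using the pose vector p_t.

    ASSUMPTION: `act` defaults to tanh; the text leaves it unspecified.
    """
    q = params["W_Q"] @ p_t                       # single query, (d_q,)
    K = F_t @ params["W_K"].T                     # keys, (n_p, d_k)
    V = F_t @ params["W_V"].T                     # values, (n_p, d_v)
    inner = act(params["W_1"] @ q + K @ params["W_2"].T + params["b"])
    e = inner @ params["w_D"]                     # scores e_{t,i}, (n_p,)
    a = np.exp(e - e.max())
    a = a / a.sum()                               # softmax alignment
    r = a @ V                                     # weighted sum of values, (d_v,)
    return r, a


if __name__ == "__main__":
    params = init_params(d_pose=20, d_embed=64, d_q=8, d_k=8, d_v=16, d_inner=12)
    rng = np.random.default_rng(1)
    r, a = additive_attention(rng.normal(size=20), rng.normal(size=(25, 64)), params)
    print(r.shape, a.shape)
```

The returned weights `a` hold one value per face component, so in a sketch like this they could be reshaped to the spatial layout of the feature map to inspect which regions received the most attention.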

Appendix D

Prompts for the LLM agent

Shopping Mall Context: “You are a service robot named iCub, working as an assistant in shopping mall. You help customers who need information about shops inside the mall. You give informations based on customers needs and preferences. Be friendly and keep answers very short and concise!.”

Domestic Assistant Context: “You are a domestic assistant robot named iCub. You can do several tasks, like preparing drinks and food, and your role is to help accomplish this task when required. Be friendly and keep answers very short and concise!.”

Restaurant Context: “You are a service robot named iCub, working as a waiter for a restaurant. You help customers who need to order their food and drinks. Be friendly and keep answers very short and concise!.”

Appendix E

Table E.1: Results from the correlation matrix of users’ Explainability evaluation related to the embodied modality.

Correlation Matrix
		Satisfaction		Usefulness		Intrusiveness
Satisfaction	Spearman’s $\rho$	—
	df	—
	p-value	—
Usefulness	Spearman’s $\rho$	0.611	**	—
	df	19		—
	p-value	0.003		—
Intrusiveness	Spearman’s $\rho$	-0.223		-0.644	**	—
	df	19		19		—
	p-value	0.332		0.002		—
Note. * p < 0.05, ** p < 0.01, *** p < 0.001

Table E.2: Results from the correlation matrix of users’ Explainability evaluation related to the verbal modality.

Correlation Matrix
		Satisfaction		Usefulness	Intrusiveness
Satisfaction	Spearman’s $\rho$	—
	df	—
	p-value	—
Usefulness	Spearman’s $\rho$	0.479	*	—
	df	19		—
	p-value	0.028		—
Intrusiveness	Spearman’s $\rho$	-0.117		-0.332	—
	df	19		19	—
	p-value	0.614		0.142	—
Note. * p < 0.05, ** p < 0.01, *** p < 0.001

Table E.3: Results from the correlation matrix of users’ Explainability evaluation related to the visual/attentional modality.

Correlation Matrix
		Satisfaction		Usefulness		Intrusiveness
Satisfaction	Spearman’s $\rho$	—
	df	—
	p-value	—
Usefulness	Spearman’s $\rho$	0.728	***	—
	df	19		—
	p-value	0.001		—
Intrusiveness	Spearman’s $\rho$	-0.378		-0.650	**	—
	df	19		19		—
	p-value	0.091		0.001		—
Note. * p < 0.05, ** p < 0.01, *** p < 0.001

Table E.4: Results from the correlation matrix of users’ Explainability evaluation related to the visual/functional modality.

Correlation Matrix
		Satisfaction		Usefulness		Intrusiveness
Satisfaction	Spearman’s $\rho$	—
	df	—
	p-value	—
Usefulness	Spearman’s $\rho$	0.666	***	—
	df	19		—
	p-value	0.001		—
Intrusiveness	Spearman’s $\rho$	-0.624	**	-0.554	**	—
	df	19		19		—
	p-value	0.003		0.009		—
Note. * p < 0.05, ** p < 0.01, *** p < 0.001