-
One Size Does not Fit All: Personalised Affordance Design for Social Robots
Authors:
Guanyu Huang,
Roger K. Moore
Abstract:
Personalisation is essential to achieve more acceptable and effective results in human-robot interaction. Placing users in the central role, many studies have focused on enhancing the abilities of social robots to perceive and understand users. However, little is known about improving user perceptions and interpretation of a social robot in spoken interactions. The work described in the paper aims to find out what affects the personalisation of affordance of a social robot, namely its appearance, voice and language behaviours. The experimental data presented here is based on an ongoing project. It demonstrates the many and varied ways in which people change their preferences for the affordance of a social robot under different circumstances. It also examines the relationship between such preferences and expectations of characteristics of a social robot, like competence and warmth. It also shows that individuals have different perceptions of the language behaviours of the same robot. These results demonstrate that one-sized personalisation does not fit all. Personalisation should be considered a comprehensive approach, including appropriate affordance design, to suit the user expectations of social roles.
Submitted 11 December, 2023;
originally announced December 2023.
-
Adapting the NICT-JLE Corpus for Disfluency Detection Models
Authors:
Lucy Skidmore,
Roger K. Moore
Abstract:
The detection of disfluencies such as hesitations, repetitions and false starts commonly found in speech is a widely studied area of research. With a standardised process for evaluation using the Switchboard Corpus, model performance can be easily compared across approaches. This is not the case for disfluency detection research on learner speech, however, where such datasets have restricted access policies, making comparison and subsequent development of improved models more challenging. To address this issue, this paper describes the adaptation of the NICT-JLE corpus, containing approximately 300 hours of English learners' oral proficiency tests, to a format that is suitable for disfluency detection model training and evaluation. Points of difference between the NICT-JLE and Switchboard corpora are explored, followed by a detailed overview of adaptations to the tag set and meta-features of the NICT-JLE corpus. The result of this work provides a standardised train, heldout and test set for use in future research on disfluency detection for learner speech.
Submitted 4 August, 2023;
originally announced August 2023.
-
Local Minima Drive Communications in Cooperative Interaction
Authors:
Roger K. Moore
Abstract:
An important open question in human-robot interaction (HRI) is precisely when an agent should decide to communicate, particularly in a cooperative task. Perceptual Control Theory (PCT) tells us that agents are able to cooperate on a joint task simply by sharing the same 'intention', thereby distributing the effort required to complete the task among the agents. This is even true for agents that do not possess the same abilities, so long as the goal is observable, the combined actions are sufficient to complete the task, and there is no local minimum in the search space. If these conditions hold, then a cooperative task can be accomplished without any communication between the contributing agents. However, for tasks that do contain local minima, the global solution can only be reached if at least one of the agents adapts its intention at the appropriate moments, and this can only be achieved by appropriately timed communication. In other words, it is hypothesised that in cooperative tasks, the function of communication is to coordinate actions in a complex search space that contains local minima. These principles have been verified in a computer-based simulation environment in which two independent one-dimensional agents are obliged to cooperate in order to solve a two-dimensional path-finding task.
Submitted 18 July, 2023;
originally announced July 2023.
-
Interactivism in Spoken Dialogue Systems
Authors:
T. Rodríguez Muñoz,
Emily Y. J. Ip,
G. Huang,
R. K. Moore
Abstract:
The interactivism model introduces a dynamic approach to language, communication and cognition. In this work, we explore this fundamental theory in the context of dialogue modelling for spoken dialogue systems (SDS). To extend such a theoretical framework, we present a set of design principles which adhere to central psycholinguistic and communication theories to achieve interactivism in SDS. From these, key ideas are linked to constitute the basis of our proposed design principles.
Submitted 28 September, 2022; v1 submitted 27 September, 2022;
originally announced September 2022.
-
Whither the Priors for (Vocal) Interactivity?
Authors:
Roger K. Moore
Abstract:
Voice-based communication is often cited as one of the most 'natural' ways in which humans and robots might interact, and the recent availability of accurate automatic speech recognition and intelligible speech synthesis has enabled researchers to integrate advanced off-the-shelf spoken language technology components into their robot platforms. Despite this, the resulting interactions are anything but 'natural'. It transpires that simply giving a robot a voice doesn't mean that a user will know how (or when) to talk to it, and the resulting 'conversations' tend to be stilted, one-sided and short. On the surface, these difficulties might appear to be fairly trivial consequences of users' unfamiliarity with robots (and vice versa), and that any problems would be mitigated by long-term use by the human, coupled with 'deep learning' by the robot. However, it is argued here that such communication failures are indicative of a deeper malaise: a fundamental lack of basic principles (priors) underpinning not only speech-based interaction in particular, but (vocal) interactivity in general. This is evidenced not only by the fact that contemporary spoken language systems already require training data sets that are orders-of-magnitude greater than that experienced by a young child, but also by the lack of design principles for creating effective communicative human-robot interaction. This short position paper identifies some of the key areas where theoretical insights might help overcome these shortfalls.
Submitted 16 March, 2022;
originally announced March 2022.
-
Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion
Authors:
Samuel J. Broughton,
Md Asif Jalal,
Roger K. Moore
Abstract:
Generative Adversarial Networks (GANs) are machine learning networks based around creating synthetic data. Voice Conversion (VC) is a subset of voice translation that involves translating the paralinguistic features of a source speaker to a target speaker while preserving the linguistic information. The aim of non-parallel conditional GANs for VC is to translate an acoustic speech feature sequence from one domain to another without the use of paired data. In the study reported here, we investigated the interpretability of state-of-the-art implementations of non-parallel GANs in the domain of VC. We show that the learned representations in the repeating layers of a particular GAN architecture remain close to their original randomly initialised parameters, demonstrating that it is the number of repeating layers that is more responsible for the quality of the output. We also analysed the learned representations of a model trained on one particular dataset when used during transfer learning on another dataset. This showed extremely high levels of similarity across the entire network. Together, these results provide new insight into how the learned representations of deep generative networks change during learning and the importance of the number of layers.
Submitted 22 February, 2021;
originally announced February 2021.
-
Talking with Robots: Opportunities and Challenges
Authors:
Roger K. Moore
Abstract:
Notwithstanding the tremendous progress that is taking place in spoken language technology, effective speech-based human-robot interaction still raises a number of important challenges. Not only do the fields of robotics and spoken language technology present their own special problems, but their combination raises an additional set of issues. In particular, there is a large gap between the formulaic speech that typifies contemporary spoken dialogue systems and the flexible nature of human-human conversation. It is pointed out that grounded and situated speech-based human-robot interaction may lead to deeper insights into the pragmatics of language usage, thereby overcoming the current 'habitability gap'.
Submitted 1 December, 2019;
originally announced December 2019.
-
A 'Canny' Approach to Spoken Language Interfaces
Authors:
Roger K. Moore
Abstract:
Voice-enabled artefacts such as Amazon Echo are very popular, but there appears to be a 'habitability gap' whereby users fail to engage with the full capabilities of the device. This position paper draws a parallel with the 'uncanny valley' effect, thereby proposing a solution based on aligning the visual, vocal, behavioural and cognitive affordances of future voice-enabled devices.
Submitted 21 August, 2019;
originally announced August 2019.
-
Vocal Interactivity in Crowds, Flocks and Swarms: Implications for Voice User Interfaces
Authors:
Roger K. Moore
Abstract:
Recent years have seen an explosion in the availability of Voice User Interfaces. However, user surveys suggest that there are issues with respect to usability, and it has been hypothesised that contemporary voice-enabled systems are missing crucial behaviours relating to user engagement and vocal interactivity. However, it is well established that such ostensive behaviours are ubiquitous in the animal kingdom, and that vocalisation provides a means through which interaction may be coordinated and managed between individuals and within groups. Hence, this paper reports results from a study aimed at identifying generic mechanisms that might underpin coordinated collective vocal behaviour with a particular focus on closed-loop negative-feedback control as a powerful regulatory process. A computer-based real-time simulation of vocal interactivity is described which has provided a number of insights, including the enumeration of a number of key control variables that may be worthy of further investigation.
Submitted 26 July, 2019;
originally announced July 2019.
-
On the Use/Misuse of the Term 'Phoneme'
Authors:
Roger K. Moore,
Lucy Skidmore
Abstract:
The term 'phoneme' lies at the heart of speech science and technology, and yet it is not clear that the research community fully appreciates its meaning and implications. In particular, it is suspected that many researchers use the term in a casual sense to refer to the sounds of speech, rather than as a well defined abstract concept. If true, this means that some sections of the community may be missing an opportunity to understand and exploit the implications of this important psychological phenomenon. Here we review the correct meaning of the term 'phoneme' and report the results of an investigation into its use/misuse in the accepted papers at INTERSPEECH-2018. It is confirmed that a significant proportion of the community (i) may not be aware of the critical difference between 'phonetic' and 'phonemic' levels of description, (ii) may not fully understand the significance of 'phonemic contrast', and as a consequence, (iii) consistently misuse the term 'phoneme'. These findings are discussed, and recommendations are made as to how this situation might be mitigated.
Submitted 26 July, 2019;
originally announced July 2019.
-
A Biomimetic Vocalisation System for MiRo
Authors:
Roger K. Moore,
Ben Mitchinson
Abstract:
There is increasing interest in the use of animal-like robots in applications such as companionship and pet therapy. However, in the majority of cases it is only the robot's physical appearance that mimics a given animal. In contrast, MiRo is the first commercial biomimetic robot to be based on a hardware and software architecture that is modelled on the biological brain. This paper describes how MiRo's vocalisation system was designed, not using pre-recorded animal sounds, but based on the implementation of a real-time parametric general-purpose mammalian vocal synthesiser tailored to the specific physical characteristics of the robot. The novel outcome has been the creation of an 'appropriate' voice for MiRo that is perfectly aligned to the physical and behavioural affordances of the robot, thereby avoiding the 'uncanny valley' effect and contributing strongly to the effectiveness of MiRo as an interactive device.
Submitted 15 May, 2017;
originally announced May 2017.
-
PCT and Beyond: Towards a Computational Framework for 'Intelligent' Communicative Systems
Authors:
Prof. Roger K. Moore
Abstract:
Recent years have witnessed increasing interest in the potential benefits of 'intelligent' autonomous machines such as robots. Honda's Asimo humanoid robot, iRobot's Roomba robot vacuum cleaner and Google's driverless cars have fired the imagination of the general public, and social media buzz with speculation about a utopian world of helpful robot assistants or the coming robot apocalypse! However, there is a long way to go before autonomous systems reach the level of capabilities required for even the simplest of tasks involving human-robot interaction - especially if it involves communicative behaviour such as speech and language. Of course the field of Artificial Intelligence (AI) has made great strides in these areas, and has moved on from abstract high-level rule-based paradigms to embodied architectures whose operations are grounded in real physical environments. What is still missing, however, is an overarching theory of intelligent communicative behaviour that informs system-level design decisions in order to provide a more coherent approach to system integration. This chapter introduces the beginnings of such a framework inspired by the principles of Perceptual Control Theory (PCT). In particular, it is observed that PCT has hitherto tended to view perceptual processes as a relatively straightforward series of transformations from sensation to perception, and has overlooked the potential of powerful generative model-based solutions that have emerged in practical fields such as visual or auditory scene analysis. Starting from first principles, a sequence of arguments is presented which not only shows how these ideas might be integrated into PCT, but which also extends PCT towards a remarkably symmetric architecture for a needs-driven communicative agent. It is concluded that, if behaviour is the control of perception, then perception is the simulation of behaviour.
Submitted 16 November, 2016;
originally announced November 2016.
-
Automatic recognition of child speech for robotic applications in noisy environments
Authors:
Samuel Fernando,
Roger K. Moore,
David Cameron,
Emily C. Collins,
Abigail Millings,
Amanda J. Sharkey,
Tony J. Prescott
Abstract:
Automatic speech recognition (ASR) allows a natural and intuitive interface for robotic educational applications for children. However, there are a number of challenges to overcome to allow such an interface to operate robustly in realistic settings, including the intrinsic difficulties of recognising child speech and high levels of background noise often present in classrooms. As part of the EU EASEL project we have provided several contributions to address these challenges, implementing our own ASR module for use in robotics applications. We used the latest deep neural network algorithms, which provide a leap in performance over the traditional GMM approach, and applied data augmentation methods to improve robustness to noise and speaker variation. We provide a close integration between the ASR module and the rest of the dialogue system, allowing the ASR to receive in real-time the language models relevant to the current section of the dialogue, greatly improving the accuracy. We integrated our ASR module into an interactive, multimodal system using a small humanoid robot to help children learn about exercise and energy. The system was installed at a public museum event as part of a research study where 320 children (aged 3 to 14) interacted with the robot, with our ASR achieving 90% accuracy for fluent and near-fluent speech.
Submitted 8 November, 2016;
originally announced November 2016.
-
Is spoken language all-or-nothing? Implications for future speech-based human-machine interaction
Authors:
Roger K. Moore
Abstract:
Recent years have seen significant market penetration for voice-based personal assistants such as Apple's Siri. However, despite this success, user take-up is frustratingly low. This position paper argues that there is a habitability gap caused by the inevitable mismatch between the capabilities and expectations of human users and the features and benefits provided by contemporary technology. Suggestions are made as to how such problems might be mitigated, but a more worrisome question emerges: "is spoken language all-or-nothing"? The answer, based on contemporary views on the special nature of (spoken) language, is that there may indeed be a fundamental limit to the interaction that can take place between mismatched interlocutors (such as humans and machines). However, it is concluded that interactions between native and non-native speakers, or between adults and children, or even between humans and dogs, might provide critical inspiration for the design of future speech-based human-machine interaction.
Submitted 18 July, 2016;
originally announced July 2016.