OSCaR: Object State Captioning and State Change Representation

Authors: Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu

Abstract: The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of the language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating multimodal large language models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR.

Submitted 2 April, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: NAACL 2024

arXiv:2311.04480 [pdf, other]

CLearViD: Curriculum Learning for Video Description

Authors: Cheng-Yu Chuang, Pooyan Fazli

Abstract: Video description entails automatically generating coherent natural language sentences that narrate the content of a given video. We introduce CLearViD, a transformer-based model for video description generation that leverages curriculum learning to accomplish this task. In particular, we investigate two curriculum strategies: (1) progressively exposing the model to more challenging samples by gradually applying Gaussian noise to the video data, and (2) gradually reducing the capacity of the network through dropout during the training process. These methods enable the model to learn more robust and generalizable features. Moreover, CLearViD leverages the Mish activation function, which provides non-linearity and non-monotonicity and helps alleviate the issue of vanishing gradients. Our extensive experiments and ablation studies demonstrate the effectiveness of the proposed model. The results on two datasets, namely ActivityNet Captions and YouCook2, show that CLearViD significantly outperforms existing state-of-the-art models in terms of both accuracy and diversity metrics.

Submitted 8 November, 2023; originally announced November 2023.

Comments: 15 pages, 4 figures

arXiv:2304.01334 [pdf, other]

Clustering Social Touch Gestures for Human-Robot Interaction

Authors: Ramzi Abou Chahine, Steven Vasquez, Pooyan Fazli, Hasti Seifi

Abstract: Social touch provides a rich non-verbal communication channel between humans and robots. Prior work has identified a set of touch gestures for human-robot interaction and described them with natural language labels (e.g., stroking, patting). Yet, no data exists on the semantic relationships between the touch gestures in users' minds. To endow robots with touch intelligence, we investigated how people perceive the similarities of social touch labels from the literature. In an online study, 45 participants grouped 36 social touch labels based on their perceived similarities and annotated their groupings with descriptive names. We derived quantitative similarities of the gestures from these groupings and analyzed the similarities using hierarchical clustering. The analysis resulted in 9 clusters of touch gestures formed around the social, emotional, and contact characteristics of the gestures. We discuss the implications of our results for designing and evaluating touch sensing and interactions with social robots.

Submitted 3 April, 2023; originally announced April 2023.

Comments: 8 pages

arXiv:2211.09397 [pdf, other]

Charting Visual Impression of Robot Hands

Authors: Hasti Seifi, Steven A. Vasquez, Hyunyoung Kim, Pooyan Fazli

Abstract: A wide variety of robotic hands have been designed to date. Yet, we do not know how users perceive these hands and feel about interacting with them. To inform hand design for social robots, we compiled a dataset of 73 robot hands and ran an online study, in which 160 users rated their impressions of the hands using 17 rating scales. Next, we developed 17 regression models that can predict user ratings (e.g., humanlike) from the design features of the hands (e.g., number of fingers). The models have less than a 10-point error in predicting the user ratings on a 0-100 scale. The shape of the fingertips, color scheme, and size of the hands influence the user ratings the most. We present simple guidelines to improve user impression of robot hands and outline remaining questions for future work.

Submitted 17 November, 2022; originally announced November 2022.

Comments: 8 pages

arXiv:2111.03994

NarrationBot and InfoBot: A Hybrid System for Automated Video Description

Authors: Shasta Ihorn, Yue-Ting Siu, Aditya Bodi, Lothar Narins, Jose M. Castanon, Yash Kant, Abhishek Das, Ilmi Yoon, Pooyan Fazli

Abstract: Video accessibility is crucial for blind and low vision users for equitable engagements in education, employment, and entertainment. Despite the availability of professional and amateur services and tools, most human-generated descriptions are expensive and time consuming. Moreover, the rate of human-generated descriptions cannot match the speed of video production. To overcome the increasing gaps in video accessibility, we developed a hybrid system of two tools to 1) automatically generate descriptions for videos and 2) provide answers or additional descriptions in response to user queries on a video. Results from a mixed-methods study with 26 blind and low vision individuals show that our system significantly improved user comprehension and enjoyment of selected videos when both tools were used in tandem. In addition, participants reported no significant difference in their ability to understand videos when presented with autogenerated descriptions versus human-revised autogenerated descriptions. Our results demonstrate user enthusiasm about the developed system and its promise for providing customized access to videos. We discuss the limitations of the current work and provide recommendations for the future development of automated video description tools.

Submitted 11 January, 2022; v1 submitted 7 November, 2021; originally announced November 2021.

Comments: arXiv admin note: This article has been withdrawn by arXiv administration due to an unresolvable authorship dispute

arXiv:1912.12630 [pdf, other]

Real-time Policy Distillation in Deep Reinforcement Learning

Authors: Yuxiang Sun, Pooyan Fazli

Abstract: Policy distillation in deep reinforcement learning provides an effective way to transfer control policies from a larger network to a smaller untrained network without a significant degradation in performance. However, policy distillation is underexplored in deep reinforcement learning, and existing approaches are computationally inefficient, resulting in a long distillation time. In addition, the effectiveness of the distillation process is still limited to the model capacity. We propose a new distillation mechanism, called real-time policy distillation, in which training the teacher model and distilling the policy to the student model occur simultaneously. Accordingly, the teacher's latest policy is transferred to the student model in real time. This reduces the distillation time to half the original time or even less and also makes it possible for extremely small student models to learn skills at the expert level. We evaluated the proposed algorithm in the Atari 2600 domain. The results show that our approach can achieve full distillation in most games, even with compression ratios up to 1.7%.

Submitted 29 December, 2019; originally announced December 2019.

Comments: In Proceedings of the Workshop on ML for Systems, Thirty-third Conference on Neural Information Processing Systems (NeurIPS), 2019

arXiv:1803.03719 [pdf, other]

DeepMoTIon: Learning to Navigate Like Humans

Authors: Mahmoud Hamandi, Mike D'Arcy, Pooyan Fazli

Abstract: We present a novel human-aware navigation approach, where the robot learns to mimic humans to navigate safely in crowds. The presented model, referred to as DeepMoTIon, is trained with pedestrian surveillance data to predict human velocity in the environment. The robot processes LiDAR scans via the trained network to navigate to the target location. We conduct extensive experiments to assess the components of our network and prove their necessity to imitate humans. Our experiments show that DeepMoTIon outperforms all the benchmarks in terms of human imitation, achieving a 24% reduction in time series-based path deviation over the next best approach. In addition, while many other approaches often failed to reach the target, our method reached the target in 100% of the test cases while complying with social norms and ensuring human safety.

Submitted 1 August, 2019; v1 submitted 9 March, 2018; originally announced March 2018.

Comments: 7 pages, In Proceedings of the IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2019

arXiv:1710.06831 [pdf, other]

Setting Up the Beam for Human-Centered Service Tasks

Authors: Utkarsh Patel, Emre Hatay, Mike D'Arcy, Ghazal Zand, Pooyan Fazli

Abstract: We introduce the Beam, a collaborative autonomous mobile service robot, based on SuitableTech's Beam telepresence system. We present a set of enhancements to the telepresence system, including autonomy, human awareness, increased computation and sensing capabilities, and integration with the popular Robot Operating System (ROS) framework. Together, our improvements transform the Beam into a low-cost platform for research on service robots. We examine the Beam on target search and object delivery tasks and demonstrate that the robot achieves a 100% success rate.

Submitted 18 October, 2017; originally announced October 2017.

Comments: 10 pages

arXiv:0908.2661 [pdf]

Human-Robot Teams in Entertainment and Other Everyday Scenarios

Authors: Pooyan Fazli, Alan K. Mackworth

Abstract: A new and relatively unexplored research direction in robotics systems is the coordination of humans and robots working as a team. In this paper, we focus upon problem domains and tasks in which multiple robots, humans and other agents are cooperating through coordination to satisfy a set of goals or to maximize utility. We are primarily interested in applications of human-robot coordination in entertainment and other activities of daily life. We discuss the teamwork problem and propose an architecture to address this.

Submitted 18 August, 2009; originally announced August 2009.

arXiv:0908.2656 [pdf, ps, other]

Semantic Robot Vision Challenge: Current State and Future Directions

Authors: Scott Helmer, David Meger, Pooja Viswanathan, Sancho McCann, Matthew Dockrey, Pooyan Fazli, Tristram Southey, Marius Muja, Michael Joya, Jim Little, David Lowe, Alan Mackworth

Abstract: The Semantic Robot Vision Competition provided an excellent opportunity for our research lab to integrate our many ideas under one umbrella, inspiring both collaboration and new research. The task, visual search for an unknown object, is relevant to both the vision and robotics communities. Moreover, since the interplay of robotics and vision is sometimes ignored, the competition provides a venue to integrate two communities. In this paper, we outline a number of modifications to the competition to both improve the state-of-the-art and increase participation.

Submitted 18 August, 2009; originally announced August 2009.

Comments: The IJCAI-09 Workshop on Competitions in Artificial Intelligence and Robotics, Pasadena, California, USA, July 11-17, 2009

Showing 1–10 of 10 results for author: Fazli, P