OpenVLA: An Open-Source Vision-Language-Action Model

Authors: Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn

Abstract: Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Website: https://openvla.github.io/

arXiv:2403.12945

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Authors: Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, Youngwoon Lee, Marius Memmel, Sungjae Park, et al. (74 additional authors not shown)

Abstract: The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.

Submitted 19 March, 2024; originally announced March 2024.

Comments: Project website: https://droid-dataset.github.io/

arXiv:2310.08864

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, et al. (267 additional authors not shown)

Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.

Submitted 1 June, 2024; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Project website: https://robotics-transformer-x.github.io

arXiv:2210.09753

A Socially Assistive Robot using Automated Planning in a Paediatric Clinical Setting

Authors: Alan Lindsay, Andres Ramirez-Duque, Ronald P. A. Petrick, Mary Ellen Foster

Abstract: We present an ongoing project that aims to develop a social robot to help children cope with painful and distressing medical procedures in a clinical setting. Our approach uses automated planning as a core component for action selection in order to generate plans that include physical, sensory, and social actions for the robot to use when interacting with humans. A key capability of our system is that the robot's behaviour adapts based on the affective state of the child patient. The robot must operate in a challenging physical and social environment where appropriate and safe interaction with children, parents/caregivers, and healthcare professionals is crucial. In this paper, we present our system, examine some of the key challenges of the scenario, and describe how they are addressed by our system.

Submitted 18 October, 2022; originally announced October 2022.

Comments: Presented at the AI-HRI Symposium at AAAI Fall Symposium Series (FSS) 2022

Report number: AIHRI/2022/4156

arXiv:2103.12306

GISE-51: A scalable isolated sound events dataset

Authors: Sarthak Yadav, Mary Ellen Foster

Abstract: Most of the existing isolated sound event datasets comprise a small number of sound event classes, usually 10 to 15, restricted to a small domain, such as domestic and urban sound events. In this work, we introduce GISE-51, a dataset spanning 51 isolated sound events belonging to a broad domain of event types. We also release GISE-51-Mixtures, a dataset of 5-second soundscapes with hard-labelled event boundaries synthesized from GISE-51 isolated sound events. We conduct baseline sound event recognition (SER) experiments on the GISE-51-Mixtures dataset, benchmarking prominent convolutional neural networks, and models trained with the dataset demonstrate strong transfer learning performance on existing audio recognition benchmarks. Together, GISE-51 and GISE-51-Mixtures attempt to address some of the shortcomings of recent sound event datasets, providing an open, reproducible benchmark for future research along with the freedom to adapt the included isolated sound events for domain-specific applications.

Submitted 7 October, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

Comments: Technical Report

arXiv:2010.04652

Towards Social HRI for Improving Children's Healthcare Experiences

Authors: Mary Ellen Foster, Ronald P. A. Petrick

Abstract: This paper describes a new research project that aims to develop a social robot designed to help children cope with painful and distressing medical procedures in a clinical setting. While robots have previously been trialled for this task, with promising initial results, the systems have tended to be teleoperated, limiting their flexibility and robustness. This project will use epistemic planning techniques as a core component for action selection in the robot system, in order to generate plans that include physical, sensory, and social actions for interacting with humans. The robot will operate in a task environment where appropriate and safe interaction with children, parents/caregivers, and healthcare professionals is required. In addition to addressing the core technical challenge of building an autonomous social robot, the project will incorporate co-design techniques involving all participant groups, and the final robot system will be evaluated in a two-site clinical trial.

Submitted 9 October, 2020; originally announced October 2020.

arXiv:1909.06749

MuMMER: Socially Intelligent Human-Robot Interaction in Public Spaces

Authors: Mary Ellen Foster, Bart Craenen, Amol Deshmukh, Oliver Lemon, Emanuele Bastianelli, Christian Dondrup, Ioannis Papaioannou, Andrea Vanzo, Jean-Marc Odobez, Olivier Canévet, Yuanzhouhan Cao, Weipeng He, Angel Martínez-González, Petr Motlicek, Rémy Siegfried, Rachid Alami, Kathleen Belhassein, Guilhem Buisan, Aurélie Clodic, Amandine Mayima, Yoan Sallami, Guillaume Sarthou, Phani-Teja Singamaneni, Jules Waldhart, Alexandre Mazel, et al. (5 additional authors not shown)

Abstract: In the EU-funded MuMMER project, we have developed a social robot designed to interact naturally and flexibly with users in public spaces such as a shopping mall. We present the latest version of the robot system developed during the project. This system encompasses audio-visual sensing, social signal processing, conversational interaction, perspective taking, geometric reasoning, and motion planning. It successfully combines all these components in an overarching framework using the Robot Operating System (ROS) and has been deployed to a shopping mall in Finland interacting with customers. In this paper, we describe the system components, their interplay, and the resulting robot behaviours and scenarios provided at the shopping mall.

Submitted 15 September, 2019; originally announced September 2019.

Report number: AI-HRI/2019/14

arXiv:1903.12264

doi: 10.1145/3329189.3329191

Validation of a recommender system for prompting omitted foods in online dietary assessment surveys

Authors: Timur Osadchiy, Ivan Poliakov, Patrick Olivier, Maisie Rowland, Emma Foster

Abstract: Recall assistance methods are among the key aspects that improve the accuracy of online dietary assessment surveys. These methods still mainly rely on experience of trained interviewers with nutritional background, but data driven approaches could improve cost-efficiency and scalability of automated dietary assessment. We evaluated the effectiveness of a recommender algorithm developed for an online dietary assessment system called Intake24, that automates the multiple-pass 24-hour recall method. The recommender builds a model of eating behavior from recalls collected in past surveys. Based on foods they have already selected, the model is used to remind respondents of associated foods that they may have omitted to report. The performance of prompts generated by the model was compared to that of prompts hand-coded by nutritionists in two dietary studies. The results of our studies demonstrate that the recommender system is able to capture a higher number of foods omitted by respondents of online dietary surveys than prompts hand-coded by nutritionists. However, the considerably lower precision of generated prompts indicates an opportunity for further improvement of the system.

Submitted 20 March, 2019; originally announced March 2019.

Report number: ISBN: 978-1-4503-6126-2

Journal ref: PervasiveHealth 2019 Proceedings of the 13th EAI International Conference on Pervasive Computing Technologies for Healthcare

arXiv:1807.04355

Deepwound: Automated Postoperative Wound Assessment and Surgical Site Surveillance through Convolutional Neural Networks

Authors: Varun Shenoy, Elizabeth Foster, Lauren Aalami, Bakar Majeed, Oliver Aalami

Abstract: Postoperative wound complications are a significant cause of expense for hospitals, doctors, and patients. Hence, an effective method to diagnose the onset of wound complications is strongly desired. Algorithmically classifying wound images is a difficult task due to the variability in the appearance of wound sites. Convolutional neural networks (CNNs), a subgroup of artificial neural networks that have shown great promise in analyzing visual imagery, can be leveraged to categorize surgical wounds. We present a multi-label CNN ensemble, Deepwound, trained to classify wound images using only image pixels and corresponding labels as inputs. Our final computational model can accurately identify the presence of nine labels: drainage, fibrinous exudate, granulation tissue, surgical site infection, open wound, staples, steri strips, and sutures. Our model achieves receiver operating curve (ROC) area under curve (AUC) scores, sensitivity, specificity, and F1 scores superior to prior work in this area. Smartphones provide a means to deliver accessible wound care due to their increasing ubiquity. Paired with deep neural networks, they offer the capability to provide clinical insight to assist surgeons during postoperative care. We also present a mobile application frontend to Deepwound that assists patients in tracking their wound and surgical recovery from the comfort of their home.

Submitted 11 July, 2018; originally announced July 2018.

Comments: 7 pages, 11 figures, 2 tables

arXiv:1211.3376

Clipping of Arbitrary Polygons with Degeneracies

Authors: Erich L Foster, James R Overfelt

Abstract: Polygon clipping is a frequent operation in Arbitrary Lagrangian-Eulerian methods, Computer Graphics, GIS, and CAD. In fact, clipping algorithms are said to be one of the most important operations in computer graphics. Thus, efficient and general polygon clipping algorithms are of great importance. Greiner et al. developed a time efficient algorithm which could clip arbitrary polygons, including concave and self intersecting polygons. However, the Greiner-Hormann algorithm does not properly handle degenerate cases, without the undesirable need for perturbing vertices. We present an extension to the Greiner-Hormann polygon clipping algorithm which properly deals with degenerate cases. We combine the method proposed by Kim et al. and the method mentioned by Liu et al. to remove or properly label degenerate cases. Additionally, the algorithm presented avoids the need for calculating midpoints, doesn't require additional entry/exit flags, and avoids changing the vertex data structure used in the original Greiner-Hormann algorithm, which was required by the extension presented by Kim et al.

Submitted 16 June, 2014; v1 submitted 12 November, 2012; originally announced November 2012.

Comments: The paper has been withdrawn due to not being able to truly handle all degenerate cases as claimed

Showing 1–10 of 10 results for author: Foster, E