
Showing 1–25 of 25 results for author: Thattai, G

Searching in archive cs.
  1. arXiv:2407.21783

    cs.AI cs.CL cs.CV

    The Llama 3 Herd of Models

    Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, et al. (510 additional authors not shown)

    Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical…

    Submitted 15 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

  2. arXiv:2308.05221

    cs.HC cs.AI cs.RO

    Alexa, play with robot: Introducing the First Alexa Prize SimBot Challenge on Embodied AI

    Authors: Hangjie Shi, Leslie Ball, Govind Thattai, Desheng Zhang, Lucy Hu, Qiaozi Gao, Suhaila Shakiah, Xiaofeng Gao, Aishwarya Padmakumar, Bofei Yang, Cadence Chung, Dinakar Guthy, Gaurav Sukhatme, Karthika Arumugam, Matthew Wen, Osman Ipek, Patrick Lange, Rohan Khanna, Shreyas Pansare, Vasu Sharma, Chao Zhang, Cris Flagg, Daniel Pressel, Lavina Vaz, Luke Dai, et al. (17 additional authors not shown)

    Abstract: The Alexa Prize program has empowered numerous university students to explore, experiment, and showcase their talents in building conversational agents through challenges like the SocialBot Grand Challenge and the TaskBot Challenge. As conversational agents increasingly appear in multimodal and embodied contexts, it is important to explore the affordances of conversational interaction augmented wi…

    Submitted 9 August, 2023; originally announced August 2023.

  3. arXiv:2308.03882

    cs.LG cs.AI stat.ML

    Exploiting Generalization in Offline Reinforcement Learning via Unseen State Augmentations

    Authors: Nirbhay Modhe, Qiaozi Gao, Ashwin Kalyan, Dhruv Batra, Govind Thattai, Gaurav Sukhatme

    Abstract: Offline reinforcement learning (RL) methods strike a balance between exploration and exploitation by conservative value estimation -- penalizing values of unseen states and actions. Model-free methods penalize values at all unseen actions, while model-based methods are able to further exploit unseen states via model rollouts. However, such methods are handicapped in their ability to find unseen st…

    Submitted 24 September, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

  4. arXiv:2308.00937

    cs.RO cs.AI cs.MA

    LEMMA: Learning Language-Conditioned Multi-Robot Manipulation

    Authors: Ran Gong, Xiaofeng Gao, Qiaozi Gao, Suhaila Shakiah, Govind Thattai, Gaurav S. Sukhatme

    Abstract: Complex manipulation tasks often require robots with complementary capabilities to collaborate. We introduce a benchmark for LanguagE-Conditioned Multi-robot MAnipulation (LEMMA) focused on task allocation and long-horizon object manipulation based on human language instructions in a tabletop setting. LEMMA features 8 types of procedurally generated tasks with varying degree of complexity, some of…

    Submitted 16 September, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

    Comments: 8 pages, 3 figures, accepted by RA-L

    Journal ref: IEEE Robotics and Automation Letters, vol. 8, no. 10, pp. 6835-6842, Oct. 2023

  5. arXiv:2305.16597

    cs.CL cs.AI cs.LG

    Neural Architecture Search for Parameter-Efficient Fine-tuning of Large Pre-trained Language Models

    Authors: Neal Lawton, Anoop Kumar, Govind Thattai, Aram Galstyan, Greg Ver Steeg

    Abstract: Parameter-efficient tuning (PET) methods fit pre-trained language models (PLMs) to downstream tasks by either computing a small compressed update for a subset of model parameters, or appending and fine-tuning a small number of new model parameters to the pre-trained network. Hand-designed PET architectures from the literature perform well in practice, but have the potential to be improved via auto…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: 8 pages, 3 figures, ACL 2023

    ACM Class: I.2.7

  6. arXiv:2303.01586

    cs.HC cs.AI cs.RO

    Alexa Arena: A User-Centric Interactive Platform for Embodied AI

    Authors: Qiaozi Gao, Govind Thattai, Suhaila Shakiah, Xiaofeng Gao, Shreyas Pansare, Vasu Sharma, Gaurav Sukhatme, Hangjie Shi, Bofei Yang, Desheng Zheng, Lucy Hu, Karthika Arumugam, Shui Hu, Matthew Wen, Dinakar Guthy, Cadence Chung, Rohan Khanna, Osman Ipek, Leslie Ball, Kate Bland, Heather Rocker, Yadunandana Rao, Michael Johnston, Reza Ghanadan, Arindam Mandal, et al. (2 additional authors not shown)

    Abstract: We introduce Alexa Arena, a user-centric simulation platform for Embodied AI (EAI) research. Alexa Arena provides a variety of multi-room layouts and interactable objects, for the creation of human-robot interaction (HRI) missions. With user-friendly graphics and control mechanisms, Alexa Arena supports the development of gamified robotic tasks readily accessible to general human users, thus openi…

    Submitted 7 June, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

  7. arXiv:2301.05318

    cs.RO cs.AI cs.LG

    Language-Informed Transfer Learning for Embodied Household Activities

    Authors: Yuqian Jiang, Qiaozi Gao, Govind Thattai, Gaurav Sukhatme

    Abstract: For service robots to become general-purpose in everyday household environments, they need not only a large library of primitive skills, but also the ability to quickly learn novel tasks specified by users. Fine-tuning neural networks on a variety of downstream tasks has been successful in many vision and language domains, but research is still limited on transfer learning between diverse long-hor…

    Submitted 12 January, 2023; originally announced January 2023.

  8. arXiv:2301.01893

    cs.CV cs.AI cs.CL

    GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods

    Authors: Da Yin, Feng Gao, Govind Thattai, Michael Johnston, Kai-Wei Chang

    Abstract: A key goal for the advancement of AI is to develop technologies that serve the needs not just of one group but of all communities regardless of their geographical region. In fact, a significant proportion of knowledge is locally shared by people from certain regions but may not apply equally in other regions because of cultural differences. If a model is unaware of regional characteristics, it may…

    Submitted 4 January, 2023; originally announced January 2023.

  9. arXiv:2212.05211

    cs.LG

    OpenD: A Benchmark for Language-Driven Door and Drawer Opening

    Authors: Yizhou Zhao, Qiaozi Gao, Liang Qiu, Govind Thattai, Gaurav S. Sukhatme

    Abstract: We introduce OPEND, a benchmark for learning how to use a hand to open cabinet doors or drawers in a photo-realistic and physics-reliable simulation environment driven by language instruction. To solve the task, we propose a multi-step planner composed of a deep neural network and rule-base controllers. The network is utilized to capture spatial relationships from images and understand semantic me…

    Submitted 10 December, 2022; originally announced December 2022.

  10. arXiv:2211.13887

    cs.AI cs.CL cs.CV cs.GR eess.IV

    TPA-Net: Generate A Dataset for Text to Physics-based Animation

    Authors: Yuxing Qiu, Feng Gao, Minchen Li, Govind Thattai, Yin Yang, Chenfanfu Jiang

    Abstract: Recent breakthroughs in Vision-Language (V&L) joint research have achieved remarkable results in various text-driven tasks. High-quality Text-to-video (T2V), a task that has been long considered mission-impossible, was proven feasible with reasonably good results in latest works. However, the resulting videos often have undesired artifacts largely because the system is purely data-driven and agnos…

    Submitted 24 November, 2022; originally announced November 2022.

  11. arXiv:2211.05190

    cs.CL

    Towards Reasoning-Aware Explainable VQA

    Authors: Rakesh Vaideeswaran, Feng Gao, Abhinav Mathur, Govind Thattai

    Abstract: The domain of joint vision-language understanding, especially in the context of reasoning in Visual Question Answering (VQA) models, has garnered significant attention in the recent past. While most of the existing VQA models focus on improving the accuracy of VQA, the way models arrive at an answer is oftentimes a black box. As a step towards making the VQA task more explainable and interpretable…

    Submitted 9 November, 2022; originally announced November 2022.

  12. arXiv:2208.13626

    cs.AI cs.CV cs.LG cs.MA cs.RO

    CH-MARL: A Multimodal Benchmark for Cooperative, Heterogeneous Multi-Agent Reinforcement Learning

    Authors: Vasu Sharma, Prasoon Goyal, Kaixiang Lin, Govind Thattai, Qiaozi Gao, Gaurav S. Sukhatme

    Abstract: We propose a multimodal (vision-and-language) benchmark for cooperative and heterogeneous multi-agent learning. We introduce a benchmark multimodal dataset with tasks involving collaboration between multiple simulated heterogeneous robots in a rich multi-room home environment. We provide an integrated learning framework, multimodal implementations of state-of-the-art multi-agent reinforcement lear…

    Submitted 25 August, 2022; originally announced August 2022.

  13. A Multi-level Alignment Training Scheme for Video-and-Language Grounding

    Authors: Yubo Zhang, Feiyang Niu, Qing Ping, Govind Thattai

    Abstract: To solve video-and-language grounding tasks, the key is for the network to understand the connection between the two modalities. For a pair of video and language description, their semantic relation is reflected by their encodings' similarity. A good multi-modality encoder should be able to well capture both inputs' semantics and encode them in the shared feature space where embedding distance get…

    Submitted 27 February, 2023; v1 submitted 22 April, 2022; originally announced April 2022.

    Comments: Accepted at ICDM 2022 FOMO-VL workshop

    Journal ref: 2022 IEEE International Conference on Data Mining Workshops (ICDMW), Orlando, FL, USA, 2022, pp. 958-966

  14. DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following

    Authors: Xiaofeng Gao, Qiaozi Gao, Ran Gong, Kaixiang Lin, Govind Thattai, Gaurav S. Sukhatme

    Abstract: Language-guided Embodied AI benchmarks requiring an agent to navigate an environment and manipulate objects typically allow one-way communication: the human user gives a natural language command to the agent, and the agent can only follow the command passively. We present DialFRED, a dialogue-enabled embodied instruction following benchmark based on the ALFRED benchmark. DialFRED allows an agent t…

    Submitted 15 August, 2022; v1 submitted 27 February, 2022; originally announced February 2022.

    Comments: 8 pages, 5 figures, accepted by RA-L

    Journal ref: IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 10049-10056, Oct. 2022

  15. arXiv:2202.07712

    cs.CV

    Privacy Preserving Visual Question Answering

    Authors: Cristian-Paul Bara, Qing Ping, Abhinav Mathur, Govind Thattai, Rohith MV, Gaurav S. Sukhatme

    Abstract: We introduce a novel privacy-preserving methodology for performing Visual Question Answering on the edge. Our method constructs a symbolic representation of the visual scene, using a low-complexity computer vision model that jointly predicts classes, attributes and predicates. This symbolic representation is non-differentiable, which means it cannot be used to recover the original image, thereby k…

    Submitted 15 February, 2022; originally announced February 2022.

  16. arXiv:2201.09862

    cs.RO cs.AI

    Learning to Act with Affordance-Aware Multimodal Neural SLAM

    Authors: Zhiwei Jia, Kaixiang Lin, Yizhou Zhao, Qiaozi Gao, Govind Thattai, Gaurav Sukhatme

    Abstract: Recent years have witnessed an emerging paradigm shift toward embodied artificial intelligence, in which an agent must learn to solve challenging tasks by interacting with its environment. There are several challenges in solving embodied multimodal tasks, including long-horizon planning, vision-and-language grounding, and efficient exploration. We focus on a critical bottleneck, namely the perform…

    Submitted 24 October, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

    Comments: Accepted by IROS 2022

  17. arXiv:2201.08520

    cs.LG cs.AI cs.CL

    Learning Two-Step Hybrid Policy for Graph-Based Interpretable Reinforcement Learning

    Authors: Tongzhou Mu, Kaixiang Lin, Feiyang Niu, Govind Thattai

    Abstract: We present a two-step hybrid reinforcement learning (RL) policy that is designed to generate interpretable and robust hierarchical policies on the RL problem with graph-based input. Unlike prior deep reinforcement learning policies parameterized by an end-to-end black-box graph neural network, our approach disentangles the decision-making process into two steps. The first step is a simplified clas…

    Submitted 19 October, 2022; v1 submitted 20 January, 2022; originally announced January 2022.

    Comments: Transactions on Machine Learning Research (TMLR)

  18. arXiv:2201.05299

    cs.CV cs.CL cs.IR

    A Thousand Words Are Worth More Than a Picture: Natural Language-Centric Outside-Knowledge Visual Question Answering

    Authors: Feng Gao, Qing Ping, Govind Thattai, Aishwarya Reganti, Ying Nian Wu, Prem Natarajan

    Abstract: Outside-knowledge visual question answering (OK-VQA) requires the agent to comprehend the image, make use of relevant knowledge from the entire web, and digest all the information to answer the question. Most previous works address the problem by first fusing the image and question in the multi-modal space, which is inflexible for further fusion with a vast amount of external knowledge. In this pa…

    Submitted 13 January, 2022; originally announced January 2022.

  19. arXiv:2201.02740

    cs.CL cs.AI

    Best of Both Worlds: A Hybrid Approach for Multi-Hop Explanation with Declarative Facts

    Authors: Shane Storks, Qiaozi Gao, Aishwarya Reganti, Govind Thattai

    Abstract: Language-enabled AI systems can answer complex, multi-hop questions to high accuracy, but supporting answers with evidence is a more challenging task which is important for the transparency and trustworthiness to users. Prior work in this area typically makes a trade-off between efficiency and accuracy; state-of-the-art deep neural network systems are too cumbersome to be useful in large-scale app…

    Submitted 17 December, 2021; originally announced January 2022.

    Comments: Accepted to CLeaR Workshop @ AAAI 2022

  20. arXiv:2111.05527

    cs.AI

    LUMINOUS: Indoor Scene Generation for Embodied AI Challenges

    Authors: Yizhou Zhao, Kaixiang Lin, Zhiwei Jia, Qiaozi Gao, Govind Thattai, Jesse Thomason, Gaurav S. Sukhatme

    Abstract: Learning-based methods for training embodied agents typically require a large number of high-quality scenes that contain realistic layouts and support meaningful interactions. However, current simulators for Embodied AI (EAI) challenges only provide simulated indoor scenes with a limited number of layouts. This paper presents Luminous, the first research framework that employs state-of-the-art ind…

    Submitted 9 November, 2021; originally announced November 2021.

    Comments: 2021 paper, Amazon

  21. arXiv:2108.04927

    cs.CV cs.AI cs.CL cs.LG

    Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion

    Authors: Alessandro Suglia, Qiaozi Gao, Jesse Thomason, Govind Thattai, Gaurav Sukhatme

    Abstract: Language-guided robots performing home and office tasks must navigate in and interact with the world. Grounding language instructions against visual observations and actions to take in an environment is an open challenge. We present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task…

    Submitted 4 November, 2021; v1 submitted 10 August, 2021; originally announced August 2021.

    Comments: Accepted at Novel Ideas in Learning-to-Learn through Interaction (NILLI) workshop @ EMNLP 2021

  22. arXiv:2105.11541

    cs.CV

    Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

    Authors: Tao Tu, Qing Ping, Govind Thattai, Gokhan Tur, Prem Natarajan

    Abstract: GuessWhat?! is a two-player visual dialog guessing game where player A asks a sequence of yes/no questions (Questioner) and makes a final guess (Guesser) about a target object in an image, based on answers from player B (Oracle). Based on this dialog history between the Questioner and the Oracle, a Guesser makes a final guess of the target object. Previous baseline Oracle model encodes no visual i…

    Submitted 24 May, 2021; originally announced May 2021.

  23. arXiv:2101.03431

    cs.AI cs.CL cs.CV cs.RO

    Are We There Yet? Learning to Localize in Embodied Instruction Following

    Authors: Shane Storks, Qiaozi Gao, Govind Thattai, Gokhan Tur

    Abstract: Embodied instruction following is a challenging problem requiring an agent to infer a sequence of primitive actions to achieve a goal environment state from complex language and visual inputs. Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem consisting of step-by-step natural language instructions to achieve subgoals which compos…

    Submitted 9 January, 2021; originally announced January 2021.

    Comments: Accepted to HAI @ AAAI 2021

  24. arXiv:2012.00958

    cs.CL

    Interactive Teaching for Conversational AI

    Authors: Qing Ping, Feiyang Niu, Govind Thattai, Joel Chengottusseriyil, Qiaozi Gao, Aishwarya Reganti, Prashanth Rajagopal, Gokhan Tur, Dilek Hakkani-Tur, Prem Natarajan

    Abstract: Current conversational AI systems aim to understand a set of pre-designed requests and execute related actions, which limits them to evolve naturally and adapt based on human interactions. Motivated by how children learn their first language interacting with adults, this paper describes a new Teachable AI system that is capable of learning new language nuggets called concepts, directly from end us…

    Submitted 1 December, 2020; originally announced December 2020.

    Comments: Accepted at Human in the Loop Dialogue Systems Workshop @NeurIPS 2020

  25. arXiv:2011.10731

    cs.CL cs.AI cs.CV cs.LG

    LRTA: A Transparent Neural-Symbolic Reasoning Framework with Modular Supervision for Visual Question Answering

    Authors: Weixin Liang, Feiyang Niu, Aishwarya Reganti, Govind Thattai, Gokhan Tur

    Abstract: The predominant approach to visual question answering (VQA) relies on encoding the image and question with a "black-box" neural encoder and decoding a single token as the answer like "yes" or "no". Despite this approach's strong quantitative results, it struggles to come up with intuitive, human-readable forms of justification for the prediction process. To address this insufficiency, we reformula…

    Submitted 21 November, 2020; originally announced November 2020.

    Comments: NeurIPS KR2ML 2020