Showing 1–10 of 10 results for author: Alwala, K V

  1. arXiv:2408.00714

    cs.CV cs.AI cs.LG

    SAM 2: Segment Anything in Images and Videos

    Authors: Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer

    Abstract: We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos. We build a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date. Our model is a simple transformer architecture with streaming memory for real-time video processing. SAM 2 trained on our data provi…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: Website: https://ai.meta.com/sam2

  2. arXiv:2407.21783

    cs.AI cs.CL cs.CV

    The Llama 3 Herd of Models

    Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, et al. (510 additional authors not shown)

    Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical…

    Submitted 15 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

  3. arXiv:2305.05665

    cs.CV cs.AI cs.LG cs.MM

    ImageBind: One Embedding Space To Bind Them All

    Authors: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

    Abstract: We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their…

    Submitted 31 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: CVPR 2023 (Highlighted Paper). Website: https://imagebind.metademolab.com/ Code/Models: https://github.com/facebookresearch/ImageBind

  4. arXiv:2303.13496

    cs.CV cs.AI cs.LG

    The effectiveness of MAE pre-pretraining for billion-scale pretraining

    Authors: Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, Rohit Girdhar, Ishan Misra

    Abstract: This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has on…

    Submitted 24 January, 2024; v1 submitted 23 March, 2023; originally announced March 2023.

    Comments: ICCV 2023. Models available at https://github.com/facebookresearch/maws/

  5. arXiv:2206.08356

    cs.CV cs.AI cs.LG stat.ML

    OmniMAE: Single Model Masked Pretraining on Images and Videos

    Authors: Rohit Girdhar, Alaaeldin El-Nouby, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra

    Abstract: Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work studies these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse pe…

    Submitted 31 May, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

    Comments: CVPR 2023. Code/models: https://github.com/facebookresearch/omnivore

  6. arXiv:2204.03642

    cs.CV

    Pre-train, Self-train, Distill: A simple recipe for Supersizing 3D Reconstruction

    Authors: Kalyan Vasudev Alwala, Abhinav Gupta, Shubham Tulsiani

    Abstract: Our work learns a unified model for single-view 3D reconstruction of objects from hundreds of semantic categories. As a scalable alternative to direct 3D supervision, our work relies on segmented image collections for learning 3D of generic categories. Unlike prior works that use similar supervision but learn independent category-specific models from scratch, our approach of learning a unified mod…

    Submitted 7 April, 2022; originally announced April 2022.

    Comments: To appear in CVPR 22. Project page: https://shubhtuls.github.io/ss3d/

  7. arXiv:2111.09887

    cs.CV cs.LG

    PyTorchVideo: A Deep Learning Library for Video Understanding

    Authors: Haoqi Fan, Tullie Murrell, Heng Wang, Kalyan Vasudev Alwala, Yanghao Li, Yilei Li, Bo Xiong, Nikhila Ravi, Meng Li, Haichuan Yang, Jitendra Malik, Ross Girshick, Matt Feiszli, Aaron Adcock, Wan-Yen Lo, Christoph Feichtenhofer

    Abstract: We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models tha…

    Submitted 18 November, 2021; originally announced November 2021.

    Comments: Technical report

  8. arXiv:2105.13965

    cs.CV cs.RO

    Revitalizing Optimization for 3D Human Pose and Shape Estimation: A Sparse Constrained Formulation

    Authors: Taosha Fan, Kalyan Vasudev Alwala, Donglai Xiang, Weipeng Xu, Todd Murphey, Mustafa Mukadam

    Abstract: We propose a novel sparse constrained formulation and from it derive a real-time optimization method for 3D human pose and shape estimation. Our optimization method, SCOPE (Sparse Constrained Optimization for 3D human Pose and shapE estimation), is orders of magnitude faster (avg. 4 ms convergence) than existing optimization methods, while being mathematically equivalent to their dense unconstrain…

    Submitted 4 October, 2021; v1 submitted 28 May, 2021; originally announced May 2021.

    Comments: 21 pages, including appendix

  9. arXiv:2011.07171

    cs.RO

    Joint Sampling and Trajectory Optimization over Graphs for Online Motion Planning

    Authors: Kalyan Vasudev Alwala, Mustafa Mukadam

    Abstract: Among the most prevalent motion planning techniques, sampling and trajectory optimization have emerged successful due to their ability to handle tight constraints and high-dimensional systems, respectively. However, limitations in sampling in higher dimensions and local minima issues in optimization have hindered their ability to excel beyond static scenes in offline settings. Here we consider hig…

    Submitted 28 July, 2021; v1 submitted 13 November, 2020; originally announced November 2020.

    Comments: International Conference on Intelligent Robots and Systems (IROS), 2021

  10. arXiv:1906.08236

    cs.RO cs.AI cs.CV cs.LG

    PyRobot: An Open-source Robotics Framework for Research and Benchmarking

    Authors: Adithyavairavan Murali, Tao Chen, Kalyan Vasudev Alwala, Dhiraj Gandhi, Lerrel Pinto, Saurabh Gupta, Abhinav Gupta

    Abstract: This paper introduces PyRobot, an open-source robotics framework for research and benchmarking. PyRobot is a light-weight, high-level interface on top of ROS that provides a consistent set of hardware independent mid-level APIs to control different robots. PyRobot abstracts away details about low-level controllers and inter-process communication, and allows non-robotics researchers (ML, CV researc…

    Submitted 19 June, 2019; originally announced June 2019.