Twelve Labs

Software Development

San Francisco, California 6,523 followers

Help developers build programs that can see, listen, and understand the world as we do.

About us

Helping developers build programs that can see, hear, and understand the world as we do by giving them the world's most powerful video-understanding infrastructure.

Website
http://www.twelvelabs.io
Industry
Software Development
Company size
11-50 employees
Headquarters
San Francisco, California
Type
Privately held
Founded
2021

Locations

Employees at Twelve Labs

Updates

  • Twelve Labs

    In the 57th session of #MultimodalWeekly, we have three exciting presentations - two on video captioning and one on training data for foundation models.
    ✅ Lucas Ventura will discuss CoVR, which generates triplets from video-caption pairs while also expanding the scope of the task to composed video retrieval. (with Antoine Y., Gül Varol, and Cordelia Schmid)
    ✅ Shayne Longpre will discuss his new work Consent in Crisis: The Rapid Decline of the AI Data Commons and its multimodal implications. This work has been covered by The New York Times, 404 Media, Vox, and Yahoo Finance.
    ✅ Nina Shvetsova and Anna Kukleva will discuss HowToCaption, which leverages recent advances in LLMs to generate high-quality video captions at scale without any human supervision. (with Hilde Kuehne)
    Register for the webinar here: https://lnkd.in/gJGtscSH ⬅
    Join our Discord to connect with the speakers: https://lnkd.in/gRt4GdDx 🤝

  • Twelve Labs

    ~ New Webinar ~ The recording of #MultimodalWeekly 53 with Xiang Yue, Orr Zohar, and Mingqi Jiang is up! Watch here: https://lnkd.in/gSKtFGTb 📺
    They discussed:
    - Evaluating multimodal models on massive multi-discipline tasks
    - Self-training for video language models via video instruction tuning
    - Explanation methods for ConvNets and Transformers
    Enjoy!

  • Twelve Labs reposted this

    Jae Lee

    Multimodal neural nets @Twelve Labs - We are hiring!

    All the new capabilities emerging from recent video foundation models are exciting. However, to make a real impact in the wild, it's crucial to first master the fundamentals of video comprehension: motion, appearance, and spatiotemporal understanding. TWLV-I is our response to this challenge.
    At Twelve Labs, we’re proud to introduce TWLV-I, our latest video foundation model, along with a new evaluation framework designed to assess these core capabilities. Unlike language or image models, video models face unique challenges that complicate fair comparisons. Our framework specifically measures two fundamental aspects of video comprehension: appearance and motion understanding.
    Our research reveals that existing models, like UMT, InternVideo2, and V-JEPA, fall short in at least one of these areas. TWLV-I, trained only on publicly available datasets, excels in both, demonstrating robust performance across a variety of tasks, from action recognition to spatiotemporal action localization. Next: scaling up with our proprietary data.
    Congratulations Aiden L. and team!
    📄 Read our report on arXiv: https://lnkd.in/grbrxh7X
    👍 Upvote on Hugging Face: https://lnkd.in/gDvWrrYp
    🧠 Explore the embeddings: https://lnkd.in/gcANeufq

    Twelve Labs

    Building video foundation models has been our core focus since day 1. Unlike language or image foundation models, many video foundation models are evaluated with differing parameters (such as sampling rate, number of frames, pretraining steps, etc.), making fair and robust comparisons challenging. Therefore, we present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance and motion understanding. ⚖
    Our findings reveal that existing video foundation models, whether text-supervised like UMT or InternVideo2, or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos. 🆕
    Trained exclusively on publicly available datasets, TWLV-I demonstrates notable performance across both appearance- and motion-centric action recognition benchmark datasets. 📊
    TWLV-I's capabilities extend beyond action recognition. It achieves competitive performance on various video-centric tasks, including temporal action localization, spatiotemporal action localization, and temporal action segmentation. This multifaceted proficiency highlights TWLV-I's spatial and temporal understanding capabilities. 🎡
    ▶ Read the technical report on arXiv: https://lnkd.in/grbrxh7X
    ▶ Upvote it on Hugging Face: https://lnkd.in/gDvWrrYp
    ▶ Play with embedding vectors obtained by TWLV-I via the evaluation source code: https://lnkd.in/gcANeufq

  • Twelve Labs

    Have you ever wanted to pinpoint specific color shades in a video, perhaps to find a product or a particular moment that features your favorite hues? 🌈
    Shade Finder is an app designed to pinpoint moments in beauty and fashion videos where specific shades appear: https://lnkd.in/gVrdmHAv 💄
    The app excels at finding videos featuring objects, colors, and shapes that closely match the images you provide. Ideal for beauty enthusiasts and fashion aficionados, Shade Finder ensures you never miss a moment of your favorite shades in action. 🤩
    Meeran K. wrote this in-depth tutorial on how she built this app using the new Twelve Labs Image-to-Video Search API: https://lnkd.in/g3rdGfBc 👩💻
    ☑ Watch the tutorial: https://lnkd.in/gatwadmT
    ☑ Check out the code: https://lnkd.in/g-e7yfVq
    ☑ Play with it via Replit: https://lnkd.in/gx-Ha7S4

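For readers curious what an image-based query like Shade Finder's might look like in code, here is a minimal sketch using the Twelve Labs Python SDK. It is not the app's actual implementation: the index ID and image URL are placeholders, and the image-query parameter names (`query_media_type`, `query_media_url`) are assumptions based on the post's description, so treat the linked tutorial as the authoritative reference.

```python
# Hypothetical sketch of an image-to-video search, based on the Shade Finder
# post above. The image-query parameter names are assumptions; consult the
# linked tutorial for the exact SDK usage.
import os

from twelvelabs import TwelveLabs  # pip install twelvelabs

client = TwelveLabs(api_key=os.environ["TWELVE_LABS_API_KEY"])

# Search an existing video index with an image of the shade or product
# you want to find (index ID and image URL are placeholders).
results = client.search.query(
    index_id="YOUR_INDEX_ID",
    options=["visual"],                               # match on visual content
    query_media_type="image",                         # assumed parameter name
    query_media_url="https://example.com/shade.jpg",  # assumed parameter name
)

# Each hit points at a clip: the source video and the time range where the
# queried shade appears (field names follow the SDK's search results).
for clip in results.data:
    print(clip.video_id, clip.start, clip.end, clip.score)
```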
  • Twelve Labs

    In the 56th session of #MultimodalWeekly, we have three exciting presentations across different video understanding tasks: action recognition, video description, and video summarization.
    ✅ Jacob Chalk and Jaesung Huh will discuss the Time Interval Machine (TIM), which addresses the interplay between the two modalities in long videos by explicitly modeling the temporal extents of audio and visual events: https://lnkd.in/gThpCzsx
    ✅ Haran Raajesh and Naveen Reddy D will discuss Movie-Identity Captioner (MICap), a new single-stage approach that can seamlessly switch between id-aware caption generation and fill-in-the-blanks when given a caption with blanks: https://lnkd.in/g6NW_Rp3
    ✅ Aditya Singh, Dhruv Srivastava, and Assistant Professor Makarand Tapaswi will discuss their work "Previously on ..." From Recaps to Story Summarization, which tackles multimodal story summarization by leveraging TV episode recaps — short video sequences interweaving key story moments from previous episodes to bring viewers up to speed: https://lnkd.in/gD8Kr3uy
    Register for the webinar here: https://lnkd.in/gJGtscSH 👈
    Join our Discord community: https://lnkd.in/gRt4GdDx 🤝

  • Twelve Labs

    ~ New Webinar ~ The recording of #MultimodalWeekly 52 with Saelyne Yang, Bo Li/Yuanhan Zhang, and 肖俊斌 is up! Watch here: https://lnkd.in/grmE-Pye 📺
    They discussed:
    - Learning Procedural Tasks via How-To Videos
    - Feeling & Building Multimodal Intelligence
    - Visually-Grounded VideoQA
    Enjoy!

  • Twelve Labs

    🚀 New Tutorial for AI Engineers! 🚀
    We've published a comprehensive tutorial on integrating Twelve Labs' Embed API with LanceDB to build advanced video understanding applications. This guide is designed for those working on semantic video search engines, content-based recommendation systems, or anomaly detection in video streams.
    🔍 Key Highlights:
    - Twelve Labs Embed API: Generate detailed, multimodal embeddings that capture the essence of video content.
    - LanceDB: Efficiently store, index, and query high-dimensional vectors for accurate retrieval.
    - Step-by-Step Guide: This covers setting up your environment and generating, storing, and querying video embeddings.
    - Practical Applications: Create semantic search engines and integrate them with a Retrieval-Augmented Generation (RAG) workflow.
    💡 Why This Matters:
    - Improve your AI projects with precise video content analysis.
    - Utilize the strengths of both embedding generation and vector storage.
    - Follow a detailed guide to get started quickly.
    👉 Check out the tutorial and see how this integration can enhance your AI capabilities: https://lnkd.in/e5RK4P9j
    Explore the possibilities of advanced video understanding with our step-by-step guide.

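As a rough illustration of the workflow the tutorial above walks through, here is a minimal sketch that stores per-segment video embeddings in LanceDB and runs a nearest-neighbor query. The `embed_video_segments` helper is a hypothetical placeholder for the Twelve Labs Embed API call covered in the tutorial, and the table schema is an assumption; follow the linked guide for the exact SDK usage.

```python
# Minimal sketch of the store-and-query half of the tutorial above.
# embed_video_segments() is a hypothetical placeholder for the Twelve Labs
# Embed API call described in the guide; the table schema is an assumption.
import lancedb


def embed_video_segments(video_url: str) -> list[dict]:
    """Placeholder: call the Twelve Labs Embed API and return one record per
    segment, e.g. {"vector": [...], "video_url": ..., "start": ..., "end": ...}."""
    raise NotImplementedError("See the linked tutorial for the Embed API call.")


db = lancedb.connect("./video-embeddings")  # local LanceDB directory

# Index a video: one row per segment, each carrying its embedding vector.
records = embed_video_segments("https://example.com/keynote.mp4")
table = db.create_table("video_segments", data=records)

# Semantic search: embed the query the same way, then find the closest segments.
query_vector = embed_video_segments("https://example.com/query-clip.mp4")[0]["vector"]
hits = table.search(query_vector).limit(5).to_list()
for hit in hits:
    print(hit["video_url"], hit["start"], hit["end"])
```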
  • Twelve Labs

    In the 55th session of #MultimodalWeekly, we have three Ph.D. candidates from Stony Brook University working on long-form video understanding under Michael Ryoo.
    ✅ Jongwoo Park will introduce LVNet, a video question-answering framework with optimal strategies for key-frame selection and sequence-aware captioning: https://lnkd.in/gEf45TfJ
    ✅ Kumara Kahatapitiya will bring up LangRepo, a Language Repository for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation: https://lnkd.in/gVSgqppb
    ✅ Kanchana Ranasinghe will discuss MVU, an LLM-based framework for solving long video question-answering benchmarks that uncovers multiple surprising results: https://lnkd.in/grTdS4Mc
    Register for the webinar here: https://lnkd.in/gJGtscSH 👈

  • Twelve Labs

    ~ New Webinar ~ The recording of #MultimodalWeekly 51 with Jay Chia, Saptarshi Sinha, and Yunhua Zhang is up! Watch here: https://lnkd.in/gq_Z7SdD 📺
    They discussed:
    - Multimodal data lake
    - Exemplar-based video repetition counting
    - Low-resource vision challenges for foundation models
    Enjoy!

Similar pages

Browse jobs

Funding