Learning from Narrated Instruction Videos

IEEE Trans Pattern Anal Mach Intell. 2018 Sep;40(9):2194-2208. doi: 10.1109/TPAMI.2017.2749223. Epub 2017 Sep 5.

Abstract

Automatic assistants could guide a person or a robot in performing new tasks, such as changing a car tire or repotting a plant. Creating such assistants, however, is non-trivial and requires understanding both the visual and the verbal content of a video. Towards this goal, we address the problem of automatically learning the main steps of a task from a set of narrated instruction videos. We develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method sequentially clusters textual and visual representations of a task, where the two clustering problems are linked by joint constraints to obtain a single coherent sequence of steps in both modalities. To evaluate our method, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains videos for five different tasks with complex interactions between people and objects, captured in a variety of indoor and outdoor settings. We experimentally demonstrate that the proposed method can automatically discover, learn and localize the main steps of a task in input videos.
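
The idea of forcing step labels into a single coherent sequence can be illustrated with a toy example. The sketch below is not the method of the paper; it is a minimal illustration, under stated assumptions, of one ingredient that such an approach needs: assigning time-ordered video segments to a fixed list of steps so that the step index never decreases over time. The function name `ordered_assignment`, the one-dimensional features and the squared-distance cost are all hypothetical choices made for this example.

```python
import numpy as np


def ordered_assignment(costs):
    """Label T time-ordered segments with one of K steps.

    costs[t, k] is a (hypothetical) cost of labelling segment t as step k.
    The returned labels minimise the total cost subject to the constraint
    that step indices never decrease over time.
    """
    T, K = costs.shape
    dp = np.full((T, K), np.inf)        # dp[t, k]: best cost ending in step k at time t
    back = np.zeros((T, K), dtype=int)  # back[t, k]: step chosen at time t - 1
    dp[0] = costs[0]
    for t in range(1, T):
        for k in range(K):
            # The previous segment may carry any step j <= k.
            j = int(np.argmin(dp[t - 1, : k + 1]))
            dp[t, k] = dp[t - 1, j] + costs[t, k]
            back[t, k] = j
    # Trace the best path backwards from the cheapest final state.
    labels = np.empty(T, dtype=int)
    labels[-1] = int(np.argmin(dp[-1]))
    for t in range(T - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return labels


# Toy data (assumed): 1-D segment features and three step centroids.
features = np.array([0.1, 0.0, 1.1, 0.2, 2.0])
centroids = np.array([0.0, 1.0, 2.0])
costs = (features[:, None] - centroids[None, :]) ** 2

print(ordered_assignment(costs).tolist())  # ordering enforced
print(costs.argmin(axis=1).tolist())       # nearest centroid, no ordering
```

In this toy input the fourth segment is closest to the first centroid, so an unconstrained nearest-centroid labelling would jump back to step 0; the ordering constraint keeps it in step 1 instead.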

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Cluster Analysis
  • Databases, Factual
  • Information Dissemination
  • Internet*
  • Narration
  • Natural Language Processing*
  • Unsupervised Machine Learning*
  • Video Recording*