Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling

Shi, Xiaoyu; Huang, Zhaoyang; Wang, Fu-Yun; Bian, Weikang; Li, Dasong; Zhang, Yi; Zhang, Manyuan; Cheung, Ka Chun; See, Simon; Qin, Hongwei; Dai, Jifeng; Li, Hongsheng

arXiv:2401.15977 (cs)

[Submitted on 29 Jan 2024 (v1), last revised 31 Jan 2024 (this version, v2)]

Abstract: We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. For the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. For the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate the reference image's features to synthesized frames with the guidance of the trajectories predicted in the first stage. Compared with existing methods, Motion-I2V can generate more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V lets users precisely control motion trajectories and motion regions with sparse trajectory and region annotations. This offers more control over the I2V process than relying solely on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation. Both qualitative and quantitative comparisons demonstrate the advantages of Motion-I2V over prior approaches in consistent and controllable image-to-video generation. Please see our project page at this https URL.
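
To make the idea of propagating reference-image features along predicted trajectories more concrete, here is a minimal sketch in plain Python. It is not the paper's motion-augmented temporal attention: it only moves each reference feature to the nearest pixel its displacement points to, for a single target frame. The function name `propagate_features`, the `(dy, dx)` displacement layout, and the use of `None` for unfilled pixels are all assumptions made for illustration.

```python
def propagate_features(ref_feat, displacement):
    """Move each reference feature to the pixel its displacement points to.

    ref_feat:     H x W nested list of per-pixel features (any values).
    displacement: H x W nested list of (dy, dx) offsets from the reference
                  frame to one target frame (hypothetical layout).
    Returns an H x W nested list; pixels that receive no feature stay None.
    """
    h, w = len(ref_feat), len(ref_feat[0])
    warped = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = displacement[y][x]
            # Nearest-pixel target location for this reference pixel.
            ty, tx = round(y + dy), round(x + dx)
            # Features that leave the frame are dropped.
            if 0 <= ty < h and 0 <= tx < w:
                # If several pixels land on the same target, the last one wins.
                warped[ty][tx] = ref_feat[y][x]
    return warped


# Usage: shift every pixel one step to the right.
ref = [["a", "b", "c"],
       ["d", "e", "f"]]
shift_right = [[(0, 1)] * 3 for _ in range(2)]
warped = propagate_features(ref, shift_right)
```

In this toy case the leftmost column of `warped` is left unfilled, which hints at why a generative second stage is still needed: regions that the motion field does not cover have to be synthesized rather than copied.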

Comments:	Project page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2401.15977 [cs.CV]
	(or arXiv:2401.15977v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.15977

Submission history

From: Xiaoyu Shi [view email]
[v1] Mon, 29 Jan 2024 09:06:43 UTC (32,102 KB)
[v2] Wed, 31 Jan 2024 07:41:04 UTC (32,102 KB)
