EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture

Jiaqi Xu¹∗, Xinyi Zou¹, Kunzhe Huang¹, Yunkuo Chen¹, Bo Liu¹, MengLi Cheng¹, Xing Shi¹, Jun Huang¹

¹ Platform of AI (PAI), Alibaba Group
{zhoumo.xjq, zouxinyi.zxy, huangkunzhe.hkz, chenyunkuo.cyk,
xuanyuan.lb, mengli.cml, shubao.sx, huangjun.hj}@alibaba-inc.com

Abstract

This paper presents EasyAnimate, an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes. We have expanded the DiT framework originally designed for 2D image synthesis to accommodate the complexities of 3D video generation by incorporating a special motion module block named Hybrid Motion Module. In the motion module, we employ a combination of temporal attention and global attention to ensure the generation of coherent frames and seamless motion transitions. Moreover, we introduce Slice VAE, a novel approach to condense the temporal axis, facilitating the generation of long duration videos. Currently, EasyAnimate exhibits the capability to generate videos of up to 144 frames from images of varying resolutions. We provide a holistic ecosystem for video production based on DiT, encompassing aspects such as data pre-processing, VAE training, DiT models training (both the baseline model and LoRA model), and end-to-end video inference. Code is available at: https://github.com/aigc-apps/EasyAnimate.

1 Introduction

Artificial Intelligence has decisively expanded the horizons of creative content generation across text, imagery, and sound. In the visual sphere, diffusion models have been widely used for image generation and editing. Open-source projects like Stable Diffusion Rombach et al. (2021) have made significant strides in converting text to images.

Figure 1: The architecture of the Diffusion Transformer in EasyAnimate, including: (a) DiT overview, (b) the Hybrid Motion Module, which introduces temporal information, and (c) the U-ViT connection, which stabilizes training.

However, video generation models still face challenges such as poor quality, limited video length, and unnatural movement, indicating that the technology has considerable room for progress. Pioneering efforts Guo et al. (2023); Chen et al. (2024, 2023a); Wang et al. (2023); Luo et al. (2023) in video synthesis build on Stable Diffusion methods and focus on the UNet architecture for the denoising process. Very recently, Sora OpenAI (2024) has unveiled extraordinary video generation capabilities, achieving up to one minute of high-fidelity video. This advancement significantly elevates the realism of real-world simulations over its forerunners. Moreover, it reveals the critical role of the Transformer architecture in video generation, prompting the open-source community hpcaitech (2024); Lab and etc. (2024) to delve into the intricacies of Transformer structures with renewed vigor.

In this light, we introduce EasyAnimate, a simple yet powerful baseline for video generation. In terms of DiT, we explore the temporal information for video generation by incorporating motion module blocks named Hybrid Motion Module. In the motion module, we combine temporal attention with global attention to ensure the creation of coherent frames and seamless motion transitions. In addition, EasyAnimate integrates images for guided content generation, using the motion priors from a text-to-video model. A dual-stream setup uses a text-rich Encoder for image encoding and reference data infusion, alongside a variational auto-encoder for mask reconstruction, enabling image-to-video creation. Additionally, we propose the Slice VAE technique, aimed at compressing the temporal dimension and reducing memory usage as video length increases, thereby facilitating the production of extended-length videos. We offer a comprehensive ecosystem for video production based on DiT, which includes data preprocessing, VAE training, training of DiT models, and end-to-end video inference. Figure 1 gives an overview of the pipeline of EasyAnimate.

Contributions can be summarized as follows:

(1) We propose EasyAnimate, an advanced method for video generation that leverages the power of transformer architecture for high-performance outcomes.

(2) We explore the temporal information for video generation by incorporating motion module blocks. In the motion module, we combine temporal attention with global attention to ensure the creation of coherent frames and seamless motion transitions.

(3) We propose the Slice VAE to compress the temporal dimension and reduce GPU memory usage as video length increases, thereby facilitating the production of extended-length videos.

2 Related Work

Video VAE: In earlier studies, image-based Variational Autoencoders (VAEs) have been widely used for encoding and decoding video frames, such as in AnimateDiff Guo et al. (2023), ModelScopeT2V Wang et al. (2023), and OpenSora hpcaitech (2024). A common image VAE implementation, as used in Stable Diffusion Stability-AI (2023), reduces each video frame to latent features whose width and height are an eighth of the original. This method ignores temporal dynamics: it treats a video as a set of static images and does not compress the time axis, leading to large latent features and increased CUDA memory needs. This significantly hinders long video creation, highlighting the essential challenge of compressing time effectively in video encoding and decoding.
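To make the memory argument concrete, the following sketch counts latent values for a clip encoded frame by frame with 8x spatial downsampling, and compares it with a codec that also compresses time. The 4-channel latent, the 512x512 resolution, and the 4x temporal factor are illustrative assumptions, not values reported in this paper.

```python
def latent_elements(frames, height, width, spatial_factor=8,
                    temporal_factor=1, channels=4):
    """Count latent values for a video of shape (frames, height, width).

    channels=4 and temporal_factor are assumptions chosen for illustration.
    """
    latent_frames = -(-frames // temporal_factor)  # ceiling division
    return (channels * latent_frames
            * (height // spatial_factor) * (width // spatial_factor))


# Image VAE applied per frame: the time axis is not compressed at all.
per_frame = latent_elements(144, 512, 512)
# Hypothetical video VAE that additionally compresses time by 4x.
with_time = latent_elements(144, 512, 512, temporal_factor=4)
print(per_frame, with_time)
```

Under these assumptions the latent shrinks in direct proportion to the temporal factor, which is the saving a video VAE targets.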

MagViT Yu et al. (2023) is a well-known example of a video VAE and is speculated to be used in the Sora framework. Its causal 3D convolution block introduces temporal padding before standard 3D convolutions, leveraging previous frame information to enhance temporal causality without impact from subsequent frames. MagViT allows for simultaneous image and video handling, improving video generation by integrating image training to leverage abundant, accessible images, enhancing text-image alignment in DiT training. Past studies Blattmann et al. (2023) indicate that incorporating images into video training can optimize model architecture more efficiently, improving its textual alignment and output quality. Despite MagViT's sophisticated video encoding and decoding capabilities, memory constraints make it hard to train on very long video sequences: an input such as 1024x1024x40 often exceeds the capacity of A100 GPUs. This necessitates batch processing for incremental decoding to manage large videos, instead of one-step decoding, underlining the importance of memory-efficient techniques.
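The causal padding idea can be illustrated on a single time axis. The sketch below is a simplified 1D stand-in for a causal 3D convolution; the zero padding and the scalar per-frame features are assumptions made for brevity, and it is not MagViT's actual implementation.

```python
def causal_temporal_conv(frames, kernel):
    """1D convolution over time where output t only depends on frames <= t.

    `frames` is a list of per-frame scalar features. All padding is placed
    before the first frame (zeros here, as an assumption), so no output
    ever reads a later frame.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(frames))
    ]


kernel = [0.25, 0.25, 0.5]
a = causal_temporal_conv([1, 2, 3, 4], kernel)
b = causal_temporal_conv([1, 2, 3, 100], kernel)  # only the last frame differs
print(a[:3] == b[:3])  # earlier outputs are unaffected by the future frame
```

Note that the first output is computed mostly from padding, which is one way to see why the first latent of a causally padded segment carries less information than the others.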

To enhance the compression efficiency in the temporal dimension, we propose Slice VAE, which introduces a slicing mechanism along the time axis, compressing long videos through batch processing. Additionally, with a unique setup for upsampling, Slice VAE can process both images and videos separately.

Video Diffusion Model: Pioneering efforts Guo et al. (2023); Chen et al. (2024, 2023a); Wang et al. (2023); Luo et al. (2023) in video synthesis build on Stable Diffusion methods and focus on the U-Net architecture for the denoising process. Very recently, Sora OpenAI (2024) has unveiled extraordinary video generation capabilities, achieving up to one minute of high-fidelity video. This advancement significantly elevates the realism of real-world simulations over its forerunners. Moreover, it reveals the critical role of the Transformer architecture in video generation. A well-known baseline for Transformer-based video generation is Latte, which initially extracts spatio-temporal tokens from input videos and then employs a series of Transformer blocks to model the video distribution in latent space. However, Latte focuses solely on the information of each latent on the temporal axis and overlooks global information.

To address the lack of global perception in Latte, we adapt the DiT framework, initially designed for 2D image synthesis, to 3D video generation by adding a special motion module block that combines temporal attention and global attention to further improve the global perception ability of DiT.

3 Architecture

We build EasyAnimate upon PixArt-$\alpha$ Chen et al. (2023b). It includes a text encoder (T5 Encoder Raffel et al. (2020)), a video VAE (a video encoder and a video decoder), and a diffusion transformer (DiT). These components are described in detail in the following sections.

3.1 Slice VAE

The role of the video VAE is to compress the temporal dimension of the video latents, reducing the computational load of the diffusion process. However, the VAE itself also involves substantial computation. Taking MagViT as an example, processing a 1024x1024x21 video in one pass triggers "out of memory" errors even on an A100 80GB GPU. Therefore, we need to process the input data in batches.

There are two batch processing methods: one is slicing along the spatial dimension, and the other is slicing along the temporal dimension. We initially tried slicing along the spatial dimension as shown in Figure 3(a). Although this can reduce the model's memory usage, decoding nearly a hundred frames of video at once still requires a significant amount of memory, and the requirement grows with video length. We then shifted to slicing along the temporal dimension. With this method, a set of video frames is divided into several segments, each encoded and decoded separately, as shown in Figure 3(b). In this case, assuming we use MagViT for video encoding, the forward padding in MagViT means that the first latent in each batch contains less information than the others. This uneven distribution of information could be a unique factor hindering model optimization. To solve this problem, we design Slice VAE as shown in Figure 4.
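Temporal slicing itself can be sketched in a few lines. In the following minimal example, `encode` and `decode` are stand-in callables and `slice_len` is a hypothetical segment length; neither is taken from the EasyAnimate code base.

```python
def slice_along_time(frames, slice_len):
    """Split a frame sequence into consecutive segments of at most slice_len."""
    return [frames[i:i + slice_len] for i in range(0, len(frames), slice_len)]


def encode_decode_in_slices(frames, encode, decode, slice_len):
    """Run the codec one temporal segment at a time.

    Only one segment is passed through the codec at once, so the working
    set depends on slice_len rather than on the full video length.
    """
    reconstructed = []
    for segment in slice_along_time(frames, slice_len):
        reconstructed.extend(decode(encode(segment)))
    return reconstructed


# Stand-in codec: "encoding" doubles each value, "decoding" halves it again.
frames = list(range(10))
result = encode_decode_in_slices(
    frames,
    encode=lambda seg: [v * 2 for v in seg],
    decode=lambda lat: [v // 2 for v in lat],
    slice_len=4,
)
print(result == frames)
```

Each segment is processed independently here, which is exactly the setting in which the uneven first-latent problem described above appears.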

The Slice VAE employs different decoding methods for images and videos. When processing videos, we downsample and upsample the latent on the temporal axis. Additionally, we implement feature sharing across different batches, as illustrated in Figure 4. During the decoding process, features are concatenated with their preceding and following features (if available), resulting in more consistent features.
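The cross-batch feature sharing can be pictured as giving each segment access to its neighbours before decoding. The sketch below only illustrates that bookkeeping; which layers share features and how they are fused in the actual Slice VAE is not specified in this section, so plain list concatenation is used as an assumption.

```python
def with_neighbor_context(slice_features):
    """Concatenate each segment's features with its neighbours, if present.

    `slice_features` is a list of per-segment feature lists. The first
    segment has no predecessor and the last has no successor.
    """
    contexts = []
    last = len(slice_features) - 1
    for i, feat in enumerate(slice_features):
        prev_feat = slice_features[i - 1] if i > 0 else []
        next_feat = slice_features[i + 1] if i < last else []
        contexts.append(prev_feat + feat + next_feat)
    return contexts


contexts = with_neighbor_context([["a0", "a1"], ["b0"], ["c0", "c1"]])
print(contexts)
```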

3.2 Video Diffusion Transformer

The architecture of the Diffusion Transformer is depicted in Figure 1. We add a special motion module to the DiT as shown in Figure 1(b), enabling the expansion from 2D image synthesis to 3D video generation. Within the motion module, we combine temporal and global attention mechanisms to produce coherent frames and fluid motion transitions. Additionally, we integrate the U-ViT Bao et al. (2023) connection as shown in Figure 1(c) to bolster the stability of the training process.

Hybrid Motion Module: The motion module is specifically devised to harness the temporal information carried across the frames of a video. By integrating attention mechanisms across the temporal dimension, the model gains the capability to assimilate such temporal data, which is essential for generating video motion. Similar to AnimateDiff Guo et al. (2023), our motion module applies attention mechanisms along the temporal axis. In addition, the motion module attends to global information rather than only to the temporal axis, endowing the entire model with a global receptive field and enhancing dynamic reconstruction performance.
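The difference between the two attention patterns comes down to how latent tokens are grouped into sequences. The sketch below enumerates token coordinates only, with no learned weights; the function names and the tiny grid size are illustrative assumptions.

```python
def temporal_groups(num_frames, height, width):
    """Temporal attention: one sequence per spatial location, across frames."""
    return [
        [(t, y, x) for t in range(num_frames)]
        for y in range(height) for x in range(width)
    ]


def global_group(num_frames, height, width):
    """Global attention: a single sequence containing every (t, y, x) token."""
    return [[(t, y, x) for t in range(num_frames)
             for y in range(height) for x in range(width)]]


def attention_pairs(groups):
    """Number of query-key pairs when attention runs within each group."""
    return sum(len(group) ** 2 for group in groups)


temporal = temporal_groups(4, 2, 3)
global_ = global_group(4, 2, 3)
print(attention_pairs(temporal), attention_pairs(global_))
```

Temporal attention lets a token see only the same location in other frames, while global attention lets every token see every other token at a higher cost in query-key pairs.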

Image guidance: EasyAnimate enables the incorporation of images into the generation process as guidance, utilizing the motion priors from the text-to-video diffusion model. This is achieved through a dual-stream architecture, where an Encoder, enriched with textual information, encodes the image to obtain a text embedding that infuses information into the reference image. Simultaneously, a Variational Auto-Encoder encodes the information of the mask that requires reconstruction, along with the reference image information, to facilitate image-to-video generation. The details of image-guided video generation are shown in Figure 5.

U-ViT: During the training process, we observed that deep DiTs tended to be unstable, as evidenced by the model's loss exhibiting sharp increases from 0.05 to 0.2, eventually escalating to 1.0. To stabilize optimization and avoid gradient collapse during backpropagation through the DiT layers, we use long-skip connections between corresponding transformer blocks, a design that has proven effective in the UNet-based Stable Diffusion model. To integrate this modification into the existing Diffusion Transformer architecture without requiring comprehensive retraining, we initialize several fully connected layers with zero-filled weights, as shown in the grey block in Figure 1(c).
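The reason zero initialization avoids retraining can be shown with a toy merge layer. The sketch assumes the skip branch is fused as `x + Linear(concat(x, skip))`; this residual form is an assumption for illustration, not a confirmed detail of the EasyAnimate implementation.

```python
def linear(weights, bias, vec):
    """Plain fully connected layer: weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def long_skip_merge(x, skip, weights, bias):
    """Fuse a long-skip feature into x via an assumed residual form."""
    fused = linear(weights, bias, x + skip)  # x + skip concatenates the lists
    return [xi + fi for xi, fi in zip(x, fused)]


dim = 3
zero_w = [[0.0] * (2 * dim) for _ in range(dim)]  # zero-filled weights
zero_b = [0.0] * dim
x = [0.5, -1.0, 2.0]
skip = [3.0, 4.0, 5.0]
merged = long_skip_merge(x, skip, zero_w, zero_b)
print(merged == x)  # at initialization the block behaves as if no skip existed
```

Because the new layer outputs zeros at the start, the pretrained network's behaviour is preserved and the skip path is learned gradually.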

4 Data Preprocess

The training of EasyAnimate includes both the image data and the video data. This section details the video data processing methodology, consisting of three principal stages: video splitting, video filtering, and video captioning. These steps are critical to cull high-quality video data with detailed captions capable of encapsulating the essence of the video content.

4.1 Video Splitting

To split longer videos, we first use PySceneDetect (https://github.com/Breakthrough/PySceneDetect) to identify scene changes within the video and perform scene cuts based on these transitions to ensure the thematic consistency of the video segments. After cutting, we retain only those segments that are between 3 and 10 seconds in length for model training.
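The splitting step can be sketched as scene detection followed by a duration filter. The helper names below are hypothetical, the use of PySceneDetect's `ContentDetector` with default settings is an assumption, and the 3 to 10 second bounds are treated as inclusive.

```python
def keep_training_segments(scenes, min_seconds=3.0, max_seconds=10.0):
    """Keep (start, end) spans, in seconds, whose length is within the bounds."""
    return [(start, end) for start, end in scenes
            if min_seconds <= (end - start) <= max_seconds]


def detect_scene_spans(video_path):
    """Scene boundaries in seconds via PySceneDetect.

    Imported lazily so the duration filter above works without the
    `scenedetect` package installed. The detector choice is an assumption.
    """
    from scenedetect import detect, ContentDetector
    return [(start.get_seconds(), end.get_seconds())
            for start, end in detect(video_path, ContentDetector())]


# Example with hand-written scene boundaries instead of a real video file.
scenes = [(0.0, 2.0), (2.0, 7.5), (7.5, 30.0), (30.0, 40.0)]
print(keep_training_segments(scenes))
```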

4.2 Video Filtering

We filter the video data from three aspects, namely the Motion Filtering, Text Filtering, and the Aesthetic Filtering.

Motion Filtering: During the training of video generation models, it is crucial to ensure the videos showcase a sense of motion, distinguishing them from mere static images. Simultaneously, it is vital to maintain a certain level of consistency in the movement, as overly erratic motion can detract from the video's overall cohesion. To this end, we use RAFT Teed and Deng (2020) to compute a motion score between frames sampled at a specified frames-per-second (FPS) rate, and keep only videos whose motion score falls within a suitable range for fine-tuning the model's dynamism.
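One plausible way to turn optical flow into a motion score is the mean flow magnitude, filtered against a lower and an upper bound. The scoring formula and the thresholds below are assumptions for illustration; the paper does not state the exact definition, and the flow vectors would come from RAFT in practice.

```python
import math


def motion_score(flow):
    """Mean magnitude of a list of (dx, dy) optical-flow vectors."""
    return sum(math.hypot(dx, dy) for dx, dy in flow) / len(flow)


def has_suitable_motion(pair_scores, low=1.0, high=20.0):
    """Accept a clip whose average frame-pair score lies in [low, high].

    `low` rejects near-static clips and `high` rejects erratic ones;
    both values are hypothetical.
    """
    mean_score = sum(pair_scores) / len(pair_scores)
    return low <= mean_score <= high


score = motion_score([(3.0, 4.0), (0.0, 0.0)])
print(score, has_suitable_motion([score, 3.5]))
```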

Text Filtering: The video data often contains specific text information (e.g., subtitles) which is not conducive to the learning process of video models. To address this, we employ optical character recognition (OCR) to ascertain the proportional area of text regions within videos. OCR is conducted on the sampled frames to represent the text score of the video. We then meticulously filter out any video segments where text encompasses an area exceeding 1% of the video frame, ensuring that the remaining videos remain optimal for model training.
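The 1% rule can be expressed as a ratio of detected text area to frame area. In this sketch the OCR boxes are given as `(x0, y0, x1, y1)` pixel rectangles, overlapping boxes are not merged, and the video-level score is the maximum over sampled frames; all three choices are assumptions rather than details from the paper.

```python
def text_area_ratio(boxes, frame_width, frame_height):
    """Fraction of the frame covered by OCR boxes (overlaps counted twice)."""
    area = sum(max(0, x1 - x0) * max(0, y1 - y0) for x0, y0, x1, y1 in boxes)
    return area / (frame_width * frame_height)


def passes_text_filter(frames_boxes, frame_width, frame_height, max_ratio=0.01):
    """Keep a clip only if no sampled frame exceeds max_ratio of text area."""
    score = max(
        (text_area_ratio(boxes, frame_width, frame_height)
         for boxes in frames_boxes),
        default=0.0,
    )
    return score <= max_ratio


clean_clip = [[(0, 0, 100, 50)], []]       # 0.5% text in one frame, none in the other
subtitled_clip = [[(0, 900, 200, 1000)]]   # 2% text
print(passes_text_filter(clean_clip, 1000, 1000),
      passes_text_filter(subtitled_clip, 1000, 1000))
```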

Aesthetic Filtering: Moreover, there are many low-quality videos on the internet. These videos may suffer from an absence of thematic focus or be marred by excessive blurriness. To enhance the quality of our training dataset, we calculate the aesthetic score (https://github.com/christophschuhmann/improved-aesthetic-predictor) and preserve the videos with high scores, obtaining a visually appealing training set for video generation.

4.3 Video Captioning

The quality of video captioning directly impacts the outcome of generated videos. We conducted a comprehensive comparison of several large multimodal models, weighing both their performance and operational efficiency. After careful evaluation, we selected VideoChat2 Li et al. (2023) and VILA Lin et al. (2023) for the task of video data captioning, as they demonstrated superior performance in our assessments and are particularly promising for producing video captions with rich detail and temporal information.

5 Training Process

Stage	Resolution	Type	#Dataset	Steps	Batch Size	Learning Rate
VAE adaption	256 $\times$ 256	Images	12M	20.0K	1024	2e-5
Motion Module Pretraining	256 $\times$ 256	Images & Videos	12M	11.0K	1024	2e-5
Video Pretraining	256 $\times$ 256	Images & Videos	12M	60.0K	1024	2e-5
Video Pretraining	512 $\times$ 512	Images & Videos	2.2M	8.0K	1152	2e-5
Video Finetune	512 $\times$ 512	Images & Videos	250K	1.0K	1152	2e-5
Video Pretraining	768 $\times$ 768	Images & Videos	2.2M	2.5K	512	2e-5
Video Finetune	768 $\times$ 768	Images & Videos	250K	1.0K	512	2e-5
Video Finetune	960 $\times$ 960	Images & Videos	250K	1.5K	128	2e-6

Table 1: Detailed information about each training stage of EasyAnimate. The image dataset for VAE adaption includes 10M images from SAM Kirillov et al. (2023) and 2M images from JourneyDB Pan et al. (2023). The image-and-video dataset is internal.
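As a rough reading aid for Table 1, the number of samples processed in each stage can be estimated as steps times batch size. This assumes the listed batch size is the global number of samples per optimization step, which the table does not state explicitly.

```python
# (stage, steps, batch size) copied from Table 1.
stages = [
    ("VAE adaption 256x256", 20_000, 1024),
    ("Motion Module Pretraining 256x256", 11_000, 1024),
    ("Video Pretraining 256x256", 60_000, 1024),
    ("Video Pretraining 512x512", 8_000, 1152),
    ("Video Finetune 512x512", 1_000, 1152),
    ("Video Pretraining 768x768", 2_500, 512),
    ("Video Finetune 768x768", 1_000, 512),
    ("Video Finetune 960x960", 1_500, 128),
]

# Samples seen per stage under the global-batch assumption.
samples_seen = {name: steps * batch for name, steps, batch in stages}
total_samples = sum(samples_seen.values())

for name, count in samples_seen.items():
    print(f"{name}: {count / 1e6:.3f}M samples")
print(f"total: {total_samples / 1e6:.3f}M samples")
```

Under this assumption, most of the sample budget is spent at 256x256, consistent with the coarse-to-fine schedule.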

In total, we use approximately 12 million images and videos to train the video VAE model and the DiT model. We first train the video VAE and then adapt the DiT model to the new VAE using a three-stage coarse-to-fine training strategy.

5.1 Video VAE

We initially trained MagViT using the Adam optimizer with beta values of (0.5, 0.9) and a learning rate of 1e-4, for a total of 350,000 training steps. The overall batch size was set to 128.

Next, we initialize the weights of Slice VAE from the MagViT trained above. The Slice VAE is then trained in two stages. First, we train the whole VAE for 200k steps using the Adam optimizer with beta values of (0.5, 0.9), a batch size of 96, and a learning rate of 1e-4. Then, following the procedure of Stable Diffusion Rombach et al. (2021), we train only the decoder for 100k steps in the second stage to enhance the fidelity of the decoded video.

5.2 Video Diffusion Transformer

As depicted in Figure 6, the training process of the DiT model has several stages.

At the first stage, upon introducing a new video VAE, we commence by aligning the DiT parameters with this VAE, using only the image data. At the second stage, we use large-scale video datasets alongside image data to pretrain the motion module block, thereby introducing the video generation capacity for DiT. At this point, although the model is capable of generating videos with rudimentary motion, the output is often of suboptimal quality, typified by limited motion and lackluster sharpness.

Therefore, at the third stage, we unfreeze the whole DiT model and train it on large-scale video datasets, which gives the model a better sense of dynamism. Finally, we refine the entire DiT model using high-quality video data to enhance its generative performance. The model is trained progressively, scaling from lower to higher resolutions, which serves as an effective strategy for conserving GPU memory and reducing computational time.

To accommodate video generation at different resolutions, we use a bucket strategy during training, grouping videos of similar resolution so that each batch shares one resolution.
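A common form of bucketing assigns each video to the predefined resolution whose aspect ratio is closest, so that every batch can be drawn from a single bucket. The bucket list and the log-ratio distance below are hypothetical choices for illustration; the paper does not list its buckets.

```python
import math

# Hypothetical (height, width) buckets; not taken from the paper.
BUCKETS = [(512, 512), (384, 672), (672, 384)]


def nearest_bucket(height, width, buckets=BUCKETS):
    """Pick the bucket whose aspect ratio is closest in log space."""
    ratio = math.log(height / width)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - ratio))


def group_by_bucket(videos, buckets=BUCKETS):
    """Map each bucket to the names of the videos assigned to it."""
    groups = {}
    for name, (height, width) in videos.items():
        groups.setdefault(nearest_bucket(height, width, buckets), []).append(name)
    return groups


videos = {"landscape": (1080, 1920), "square": (720, 720), "portrait": (1920, 1080)}
print(group_by_bucket(videos))
```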

6 Conclusion

This paper introduces EasyAnimate, a high-performance AI video generation and training pipeline based on transformer architecture. EasyAnimate incorporates a specialized motion module named the Hybrid Motion Module, designed to produce consistent frames and seamless motion transitions. It also enables the creation of videos guided by images. Furthermore, to tackle the issue of insufficient GPU memory in the VAE during the generation of long videos, we propose Slice VAE for batched encoding and decoding along the temporal axis. The model is capable of adapting to different combinations of frame counts and resolutions during both the training and inference processes, making it suitable for generating both images and videos.

References

Bao et al. (2023) Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. 2023. All are worth words: A vit backbone for diffusion models. In CVPR.
Blattmann et al. (2023) Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. 2023. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.
Chen et al. (2023a) Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. 2023a. Videocrafter1: Open diffusion models for high-quality video generation.
Chen et al. (2024) Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. 2024. Videocrafter2: Overcoming data limitations for high-quality video diffusion models.
Chen et al. (2023b) Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. 2023b. Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis.
Guo et al. (2023) Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. 2023. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725.
hpcaitech (2024) hpcaitech. 2024. Open-sora: Democratizing efficient video production for all. https://github.com/hpcaitech/Open-Sora.
Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. 2023. Segment anything. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3992–4003.
Lab and etc. (2024) PKU-Yuan Lab and Tuzhan AI etc. 2024. Open-sora-plan.
Li et al. (2023) Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. 2023. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355.
Lin et al. (2023) Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. 2023. Vila: On pre-training for visual language models.
Luo et al. (2023) Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. 2023. Videofusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
OpenAI (2024) OpenAI. 2024. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/.
Pan et al. (2023) Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. 2023. Journeydb: A benchmark for generative image understanding.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67.
Rombach et al. (2021) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. High-resolution image synthesis with latent diffusion models.
Stability-AI (2023) Stability-AI. 2023. sd-vae-ft-ema. https://huggingface.co/stabilityai/sd-vae-ft-ema.
Teed and Deng (2020) Zachary Teed and Jia Deng. 2020. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer.
Wang et al. (2023) Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. 2023. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571.
Yu et al. (2023) Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. 2023. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737.