Video Generation · ECCV 2024

MEVG: Multi-event Video Generation with Text-to-Video Models

Gyeongrok Oh1,  Jaehwan Jeong1,  Sieun Kim1,  Wonmin Byeon2,  Jinkyu Kim1,  Sungwoong Kim1,  Sangpil Kim1,∗

1 Korea University 2 NVIDIA ∗ Corresponding author

European Conference on Computer Vision (ECCV), 2024

TL;DR

We introduce MEVG, a training-free diffusion-based method that generates a single coherent video depicting multiple sequential events given individual text sentences — with no fine-tuning or additional video data required. A Last Frame-aware Latent Initialization (LFLI) strategy seeds each clip's latent from the previous clip's final frame, while Dynamic Noise enforces diversity and Structure-guided Sampling maintains visual consistency within each event. A novel prompt generator automatically converts coarse user scenarios into optimized per-event prompts, and MEVG outperforms zero-shot baselines in multi-event video generation.

"Santa Claus goes snowboarding on a snowy mountain." → "...rides his sleigh through the snow." → "...walks through the forest to a frozen lake." → "...has fun skating on the ice."

Key Contributions

  • Training-free multi-event generation — generates a video consisting of multiple events without requiring any training or additional video data.
  • Last-frame-aware initialization & dynamic noise adjustment — a novel latent vector strategy enhancing temporal and semantic consistency between consecutive event clips.
  • Novel prompt generator — transforms coarse text inputs into optimal per-event instructions, ensuring coherent semantic transitions throughout the generated video.
  • State-of-the-art zero-shot performance — outperforms other zero-shot video generation methods in multi-event reflection while maintaining visually coherent content.

Project Design

MEVG overall pipeline

Given multiple prompts (P) each describing a different event, MEVG chains video clips into a single, temporally coherent video. Each clip's latent is initialized from the final frame of the preceding clip via Last Frame-aware Latent Initialization (LFLI), and Structure-guided Sampling ensures visual consistency throughout each event.

Last Frame-aware Latent Initialization

(i) Structure-guided Sampling

  • Maintain visual consistency within an event, which act as a regularization term.
  • Differential update strategy prevent the identical latent codes across the video.

(ii) Dynamic Noise

  • Enforce the diversity of the generated video with noise scheduling function.
  • Noise scheduling function is monotonically decrease function.

(iii) Last Frame Aware Inversion

  • Maintain a visual correlation between two different events.
  • Denoised observation x̂ contains a sketcy spatial layout and context.

Results

Video Results Based on LVDM

Video Results Based on VideoCrafter1

Applications

Image & Multi-text Video Generation

MEVG accepts an input image alongside multiple text prompts to generate a video grounded in the given visual context.

Video Generation with Large Language Model (LLM)

An LLM decomposes a coarse scenario into optimized per-event prompts, enabling flexible and dynamic text input.

Conclusion

We presented MEVG, a training-free diffusion-based framework for generating temporally coherent videos from multiple sequential text prompts — requiring no fine-tuning or additional video data. By combining Last Frame-aware Latent Initialization, Dynamic Noise scheduling, and Structure-guided Sampling, MEVG effectively bridges consecutive event clips while maintaining visual and semantic consistency. A novel prompt generator further enables flexible coarse-to-optimal text input. Extensive experiments and user studies demonstrate that MEVG outperforms zero-shot baselines in temporal coherency of content and semantics, offering a practical and scalable approach to multi-event video synthesis.

Acknowledgement

NVIDIA

Research collaboration with NVIDIA (Wonmin Byeon), contributing to the development of multi-event video generation.

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) funded by the Korean government (MSIT) under the Artificial Intelligence Graduate School Program (Korea University), and by the Culture, Sports and Tourism R&D Program through the Korea Creative Content Agency (KOCCA) funded by the Ministry of Culture, Sports and Tourism.