Video Generation · ECCV 2024

MEVG: Multi-event Video Generation with Text-to-Video Models

Gyeongrok Oh¹, Jaehwan Jeong¹, Sieun Kim¹, Wonmin Byeon², Jinkyu Kim¹, Sungwoong Kim¹, Sangpil Kim¹,∗

¹ Korea University ² NVIDIA ∗ Corresponding author

European Conference on Computer Vision (ECCV), 2024


TL;DR

We introduce MEVG, a training-free diffusion-based method that generates a single coherent video depicting multiple sequential events given individual text sentences — with no fine-tuning or additional video data required. A Last Frame-aware Latent Initialization (LFLI) strategy seeds each clip's latent from the previous clip's final frame, while Dynamic Noise enforces diversity and Structure-guided Sampling maintains visual consistency within each event. A novel prompt generator automatically converts coarse user scenarios into optimized per-event prompts, and MEVG outperforms zero-shot baselines in multi-event video generation.

"Santa Claus goes snowboarding on a snowy mountain." → "...rides his sleigh through the snow." → "...walks through the forest to a frozen lake." → "...has fun skating on the ice."

Key Contributions

Training-free multi-event generation — generates a video consisting of multiple events without requiring any training or additional video data.
Last-frame-aware initialization & dynamic noise adjustment — a novel latent vector strategy enhancing temporal and semantic consistency between consecutive event clips.
Novel prompt generator — transforms coarse text inputs into optimal per-event instructions, ensuring coherent semantic transitions throughout the generated video.
State-of-the-art zero-shot performance — outperforms other zero-shot video generation methods in multi-event reflection while maintaining visually coherent content.

Project Design

Given multiple prompts (P) each describing a different event, MEVG chains video clips into a single, temporally coherent video. Each clip's latent is initialized from the final frame of the preceding clip via Last Frame-aware Latent Initialization (LFLI), and Structure-guided Sampling ensures visual consistency throughout each event.
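The chaining described above can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: `toy_generate_clip` is a hypothetical stand-in for a text-to-video sampler (it performs a random walk in latent space and ignores the prompt), and the real method inverts the last frame back into a noisy latent before reusing it, which this sketch skips.

```python
import numpy as np

def toy_generate_clip(prompt, init_latent, num_frames, rng):
    # Hypothetical sampler stand-in: ignores the prompt's meaning and
    # returns frames that drift gradually away from init_latent.
    steps = rng.normal(scale=0.1, size=(num_frames,) + init_latent.shape)
    return init_latent + np.cumsum(steps, axis=0)

def chain_events(prompts, latent_shape=(4, 8, 8), num_frames=4, seed=0):
    rng = np.random.default_rng(seed)
    init = rng.normal(size=latent_shape)  # the first clip starts from noise
    clips, inits = [], []
    for prompt in prompts:
        inits.append(init)
        clip = toy_generate_clip(prompt, init, num_frames, rng)
        clips.append(clip)
        init = clip[-1]  # the last frame seeds the next event's clip
    # Concatenate the per-event clips along the frame axis.
    return np.concatenate(clips, axis=0), inits

prompts = [
    "Santa Claus goes snowboarding on a snowy mountain.",
    "Santa Claus rides his sleigh through the snow in the mountain.",
    "Santa Claus walks through the forest to a frozen lake.",
]
video, inits = chain_events(prompts)
print(video.shape)
```

Each clip after the first is seeded by the frame that ends the previous clip, so consecutive events share a starting layout.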

(i) Structure-guided Sampling

Maintains visual consistency within an event by acting as a regularization term.
A differential update strategy prevents the latent codes from becoming identical across the video.

(ii) Dynamic Noise

Enforces diversity in the generated video through a noise scheduling function.
The noise scheduling function is monotonically decreasing.
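One way to picture a monotonically decreasing noise schedule is sketched below. The page states only that the schedule decreases monotonically. The cosine form, the per-frame indexing, and the variance-preserving blend of a shared latent with fresh Gaussian noise are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def cosine_decay(num_frames, w_max=1.0, w_min=0.0):
    # Assumed schedule: weight falls from w_max (first frame) to w_min (last).
    if num_frames == 1:
        return np.array([w_max])
    n = np.arange(num_frames)
    w = w_min + 0.5 * (w_max - w_min) * (1.0 + np.cos(np.pi * n / (num_frames - 1)))
    return np.clip(w, 0.0, 1.0)

def blend_with_dynamic_noise(shared_latent, weights, seed=0):
    # Assumed blend: frame n keeps sqrt(w_n) of the shared latent and takes
    # sqrt(1 - w_n) of fresh noise, so unit variance is roughly preserved.
    rng = np.random.default_rng(seed)
    fresh = rng.normal(size=(len(weights),) + shared_latent.shape)
    w = weights.reshape(-1, *([1] * shared_latent.ndim))
    return np.sqrt(w) * shared_latent + np.sqrt(1.0 - w) * fresh

weights = cosine_decay(8)
shared = np.random.default_rng(1).normal(size=(4, 8, 8))
latents = blend_with_dynamic_noise(shared, weights)
print(latents.shape)
```

Under these assumptions, early frames stay close to the shared latent while later frames receive progressively more independent noise, which is one way a decreasing schedule could trade consistency for diversity.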

(iii) Last Frame-aware Inversion

Maintains a visual correlation between two different events.
The denoised observation x̂ contains a sketchy spatial layout and context.
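The page does not give a formula for x̂. In standard DDPM/DDIM notation the denoised observation at step t is x̂ = (x_t − √(1 − ᾱ_t) · ε) / √ᾱ_t, where ε is the predicted noise and ᾱ_t is the cumulative noise-schedule product. The sketch below assumes that definition applies here, with hypothetical function names, and checks that the expression recovers the clean latent when the noise prediction is exact.

```python
import numpy as np

def add_noise(x0, eps, alpha_bar):
    # Forward diffusion: mix the clean latent with Gaussian noise.
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def denoised_observation(x_t, eps_pred, alpha_bar):
    # Standard one-step estimate of the clean latent from a noisy one.
    return (x_t - np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha_bar)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 8, 8))   # stand-in for a clean frame latent
eps = rng.normal(size=x0.shape)   # the noise actually added
alpha_bar = 0.3                   # assumed value at some intermediate step

x_t = add_noise(x0, eps, alpha_bar)
x_hat = denoised_observation(x_t, eps, alpha_bar)
print(np.allclose(x_hat, x0))
```

With a real model the noise prediction is imperfect at high noise levels, so x̂ is only a coarse estimate, which is consistent with the description of it as a sketchy layout.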

Results

Video Results Based on LVDM

"Santa Claus goes snowboarding on a snowy mountain."

→"Santa Claus rides his sleigh through the snow in the mountain."

→"Santa Claus walks through the forest to a frozen lake."

→"Santa Claus has fun skating on the ice."

"A golden retriever is having a picnic on a beautiful tropical beach at sunset."

→"A golden retriever is running towards a beautiful tropical beach at sunset."

→"A golden retriever is sitting next to a bonfire on a beautiful tropical beach at sunset."

→"A golden retriever is looking at the starry sky on a beautiful tropical beach."

"A waterfall flows in the mountains under a clear sky."

→"A waterfall flows in the fall mountains under a clear sky."

→"A waterfall flows in the winter mountains under a clear sky."

→"A waterfall frozen on a mountain during a snowstorm."

"The volcano erupts in the clear weather."

→"Smoke comes from the crater of the volcano, which has ended its eruption in the clear weather."

→"The weather around the volcano turns cloudy."

"There is a beach where there is no one."

→"The waves hit the deserted beach."

→"There is a beach that has been swept away by waves."

Video Results Based on VideoCrafter1

"An astronaut in a white uniform is snowboarding in the snowy hill."

→"An astronaut in a white uniform is surfing in the sea."

→"An astronaut in a white uniform is surfing in the desert."

"A white dog is running in the beautiful meadow."

→"A white dog is standing in the beautiful meadow."

→"A white dog is yawning loudly in the beautiful meadow."

→"A white dog lies on the ground in the beautiful meadow."

Applications

Image & Multi-text Video Generation

MEVG accepts an input image alongside multiple text prompts to generate a video grounded in the given visual context.

"A single white flower gradually blooms from a single green flower bud."

→"The single white flower is blooming."

→"A lovely fully blossomed single white flower."

Input image

"People walks on the beach at night."

→"There are sand castles on the beach under the fireworks at night."

→"Very few people remain on the beach at night and they gradually fade away."

Input image

Video Generation with Large Language Model (LLM)

An LLM decomposes a coarse scenario into optimized per-event prompts, enabling flexible and dynamic text input.

Original Scenario

"In the morning, Albert Einstein was walking in the forest, later he read a book under a tree, and as night fell, he walked towards the lake, eventually sitting near it in the forest at night."

Prompts (LLM)

"Albert Einstein is walking in the forest in the morning."

→"Albert Einstein reads a book under a tree."

→"Albert Einstein walks from the forest towards the lake as night falls."

→"Albert Einstein sits near the lake in the forest at night."

Original Scenario

"A man embarks on a motorcycle journey, runs through a traffic jam on a busy road, rides a motorcycle in the desert, walks in the desert at night, and looks at the sky with aurora in the desert."

Prompts (LLM)

"A man embarks on a motorcycle journey."

→"A man runs through a traffic jam on a busy road."

→"A man rides a motorcycle in the desert."

→"A man walks in the desert at night."

→"A man looks at the sky with aurora in the desert."
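The scenario-to-prompts step shown in the examples above could be wired up along these lines. This is a minimal sketch under stated assumptions: the instruction wording, the numbered-list reply format, and the names `build_request` and `parse_event_prompts` are hypothetical and not taken from the paper, and no particular LLM API is called.

```python
import re

# Assumed instruction text; the paper's actual prompt is not shown on this page.
INSTRUCTION = (
    "Split the scenario into a numbered list of short, self-contained "
    "sentences, one per event, in chronological order. "
    "Repeat the subject in every sentence."
)

def build_request(scenario):
    # Text that would be sent to an LLM of the user's choice.
    return f"{INSTRUCTION}\n\nScenario: {scenario}"

def parse_event_prompts(reply):
    # Keep only lines shaped like "1. sentence" or "2) sentence".
    prompts = []
    for line in reply.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+?)\s*$", line)
        if match:
            prompts.append(match.group(1).strip('"'))
    return prompts

# A hand-written reply standing in for real LLM output.
fake_reply = """1. "A man embarks on a motorcycle journey."
2. "A man runs through a traffic jam on a busy road."
3. "A man rides a motorcycle in the desert."
"""
for prompt in parse_event_prompts(fake_reply):
    print(prompt)
```

Asking for the subject to be repeated in every sentence mirrors the example prompts, each of which restates "Albert Einstein" or "A man" so that every event can be generated independently.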

Conclusion

We presented MEVG, a training-free diffusion-based framework for generating temporally coherent videos from multiple sequential text prompts — requiring no fine-tuning or additional video data. By combining Last Frame-aware Latent Initialization, Dynamic Noise scheduling, and Structure-guided Sampling, MEVG effectively bridges consecutive event clips while maintaining visual and semantic consistency. A novel prompt generator further enables flexible coarse-to-optimal text input. Extensive experiments and user studies demonstrate that MEVG outperforms zero-shot baselines in temporal coherency of content and semantics, offering a practical and scalable approach to multi-event video synthesis.

Acknowledgement

NVIDIA

Research collaboration with NVIDIA (Wonmin Byeon), contributing to the development of multi-event video generation.

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) funded by the Korean government (MSIT) under the Artificial Intelligence Graduate School Program (Korea University), and by the Culture, Sports and Tourism R&D Program through the Korea Creative Content Agency (KOCCA) funded by the Ministry of Culture, Sports and Tourism.