Audio-to-Video · Multi-Object Generation

Semantically Complex Audio to Video Generation with Audio Source Separation

Sieun Kim¹, Jaehwan Jeong¹, Sumin In¹, Seung Hyun Lee², Seungryong Kim²,
Saerom Kim³, Wooyeol Baek³, Sang Ho Yoon³, Eugenio Culurciello³, Sangpil Kim¹,∗

¹ Korea University ² KAIST ³ Samsung Research ∗ Corresponding author

Engineering Applications of Artificial Intelligence (EAAI, JCR IF Top 10%), 2025


Abstract

Generating video from audio is a challenging task when the input contains multiple overlapping sound sources with complex semantics. Existing audio-to-video methods struggle to disentangle mixed audio signals and faithfully reflect each source's content in the generated video. We address this by introducing an audio source separation stage prior to video generation, allowing the model to independently process each audio component and synthesize semantically coherent video that aligns with the full acoustic scene. Our method improves both visual quality and semantic fidelity on complex, multi-source audio inputs.

Key Contributions

Audio source separation pipeline as a preprocessing stage for audio-to-video generation.
Semantically aware video synthesis that independently handles each separated audio component.
Improved visual-semantic alignment on complex, multi-source audio inputs.
Comprehensive evaluation demonstrating gains in visual quality and semantic fidelity.
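
The separate-then-condition idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the FFT band mask is a toy stand-in for a learned audio source separator, and the function names (`separate_by_band`, `embed_source`, `build_conditioning`), the 16 kHz sample rate, and the two-number "embedding" are all assumptions made for illustration. The sketch only shows the data flow: a mixture is split into components, each component is summarized independently, and the per-source summaries are stacked into one conditioning array that a video generator could consume.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz; assumed for this toy example


def separate_by_band(mixture, sample_rate, bands):
    """Toy stand-in for a source separator: split the mixture into
    band-limited components by masking its FFT spectrum."""
    spectrum = np.fft.rfft(mixture)
    freqs = np.fft.rfftfreq(len(mixture), d=1.0 / sample_rate)
    sources = []
    for low, high in bands:
        mask = (freqs >= low) & (freqs < high)
        sources.append(np.fft.irfft(spectrum * mask, n=len(mixture)))
    return sources


def embed_source(source, sample_rate):
    """Hypothetical per-source conditioning vector:
    [dominant frequency in Hz, RMS energy]."""
    magnitude = np.abs(np.fft.rfft(source))
    freqs = np.fft.rfftfreq(len(source), d=1.0 / sample_rate)
    dominant = freqs[np.argmax(magnitude)]
    rms = np.sqrt(np.mean(source ** 2))
    return np.array([dominant, rms])


def build_conditioning(mixture, sample_rate, bands):
    """Separate first, then embed each component independently.
    Returns an array of shape (num_sources, 2)."""
    sources = separate_by_band(mixture, sample_rate, bands)
    return np.stack([embed_source(s, sample_rate) for s in sources])


# One second of audio containing two overlapping "sources".
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
low_tone = 0.5 * np.sin(2 * np.pi * 220 * t)
high_tone = 0.25 * np.sin(2 * np.pi * 3000 * t)
mixture = low_tone + high_tone

conditioning = build_conditioning(
    mixture, SAMPLE_RATE, bands=[(0, 1000), (1000, 8001)]
)
print(conditioning.shape)  # (2, 2): one row per separated source
```

Each row of `conditioning` describes one separated component, so a downstream generator would receive an explicit signal per source rather than a single summary of the entangled mixture. Real mixtures overlap in frequency, which is why a learned separator is needed in place of the fixed band masks used here.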