Audio-to-Video · Multi-Object Generation

Semantically Complex Audio to Video Generation with Audio Source Separation

Sieun Kim1,  Jaehwan Jeong1,  Sumin In1,  Seung Hyun Lee2,  Seungryong Kim2,
Saerom Kim3,  Wooyeol Baek3,  Sang Ho Yoon3,  Eugenio Culurciello3,  Sangpil Kim1,∗

1 Korea University 2 KAIST 3 Samsung Research ∗ Corresponding author

Engineering Applications of Artificial Intelligence (EAAI, JCR IF Top 10%), 2025

Figure: Complex A2V teaser.

Abstract

Generating video from audio is a challenging task when the input contains multiple overlapping sound sources with complex semantics. Existing audio-to-video methods struggle to disentangle mixed audio signals and faithfully reflect each source's content in the generated video. We address this by introducing an audio source separation stage prior to video generation, allowing the model to independently process each audio component and synthesize semantically coherent video that aligns with the full acoustic scene. Our method improves both visual quality and semantic fidelity on complex, multi-source audio inputs.

Key Contributions

  • Audio source separation pipeline as a preprocessing stage for audio-to-video generation.
  • Semantically aware video synthesis that independently handles each separated audio component.
  • Improved visual-semantic alignment on complex, multi-source audio inputs.
  • Comprehensive evaluation demonstrating gains in visual quality and semantic fidelity.
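
The separate-then-generate flow described in the abstract can be illustrated with a small structural sketch. This is not the authors' implementation: the function `separate_then_generate` and the stand-ins `toy_separate`, `toy_encode`, and `toy_generate` are hypothetical names introduced here, and the page does not specify the actual separation, encoding, or video models. The sketch only shows the control flow, in which the mixture is split first, each source is encoded on its own, and generation is conditioned on all per-source embeddings.

```python
import math
from typing import Callable, Dict, List

Waveform = List[float]
Embedding = List[float]


def separate_then_generate(
    mixture: Waveform,
    separate: Callable[[Waveform], List[Waveform]],
    encode: Callable[[Waveform], Embedding],
    generate: Callable[[List[Embedding]], List[Dict]],
) -> List[Dict]:
    """Split the mixture, encode each source independently, then generate
    once from the full set of per-source embeddings."""
    sources = separate(mixture)
    embeddings = [encode(source) for source in sources]
    return generate(embeddings)


# --- Hypothetical toy stand-ins (assumptions, not the paper's models) ---

def toy_separate(mixture: Waveform, window: int = 4) -> List[Waveform]:
    """Crude two-band split: a causal moving average (low band) and the
    residual (high band). The two bands sum back to the mixture."""
    low = []
    for i in range(len(mixture)):
        chunk = mixture[max(0, i - window + 1): i + 1]
        low.append(sum(chunk) / len(chunk))
    high = [m - l for m, l in zip(mixture, low)]
    return [low, high]


def toy_encode(source: Waveform) -> Embedding:
    """Two hand-made features: mean energy and zero-crossing rate."""
    energy = sum(x * x for x in source) / len(source)
    crossings = sum(
        1 for a, b in zip(source, source[1:]) if (a < 0) != (b < 0)
    )
    return [energy, crossings / max(1, len(source) - 1)]


def toy_generate(embeddings: List[Embedding]) -> List[Dict]:
    """Placeholder for a video generator: returns one conditioning record
    per source instead of producing frames."""
    return [{"source": i, "condition": e} for i, e in enumerate(embeddings)]


# A synthetic mixture of a slow and a fast sinusoid, used as example input.
mixture = [
    math.sin(2 * math.pi * 2 * t / 64) + 0.5 * math.sin(2 * math.pi * 20 * t / 64)
    for t in range(64)
]
result = separate_then_generate(mixture, toy_separate, toy_encode, toy_generate)
```

Passing the three stages as callables keeps the sketch independent of any particular model: a learned source separator, audio encoder, or video generator could be substituted for the toy functions without changing the pipeline function itself.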