Semantically Complex Audio to Video Generation with Audio Source Separation
Engineering Applications of Artificial Intelligence (EAAI, JCR IF Top 10%), 2025
Abstract
Generating video from audio is a challenging task when the input contains multiple overlapping sound sources with complex semantics. Existing audio-to-video methods struggle to disentangle mixed audio signals and faithfully reflect each source's content in the generated video. We address this by introducing an audio source separation stage prior to video generation, allowing the model to independently process each audio component and synthesize semantically coherent video that aligns with the full acoustic scene. Our method improves both visual quality and semantic fidelity on complex, multi-source audio inputs.
Key Contributions
- Audio source separation pipeline as a preprocessing stage for audio-to-video generation.
- Semantically-aware video synthesis that independently handles each separated audio component.
- Improved visual-semantic alignment on complex, multi-source audio inputs.
- Comprehensive evaluation demonstrating gains in visual quality and semantic fidelity.