Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts
Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these challenges, we propose LUVE, a Latent-cascaded UHR Video generation framework built upon dual frequency Experts. LUVE employs a three-stage architecture: (1) low-resolution motion generation for motion-consistent latent synthesis; (2) video latent upsampling, which increases resolution directly in the latent space to reduce memory and computational overhead; and (3) high-resolution content refinement, which integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component.
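The abstract names two operations that a toy example can make concrete: upsampling in latent space, and separating low-frequency from high-frequency content. The sketch below is an illustration only and not LUVE's implementation: the nearest-neighbour upsampler, the FFT low-pass mask, the `cutoff` value, and the `(T, C, H, W)` latent shape are all assumptions chosen for simplicity, and the learned models of each stage are not represented.

```python
import numpy as np

def upsample_latent(latent: np.ndarray, scale: int = 2) -> np.ndarray:
    """Nearest-neighbour spatial upsampling of a (T, C, H, W) latent.

    Hypothetical stand-in for a learned latent upsampler.
    """
    return latent.repeat(scale, axis=-2).repeat(scale, axis=-1)

def split_frequencies(latent: np.ndarray, cutoff: float = 0.25):
    """Split each frame into low- and high-frequency parts.

    A radial low-pass mask in the 2D Fourier domain keeps spatial
    frequencies up to `cutoff` (in cycles per sample); the residual
    is treated as the high-frequency band.
    """
    h, w = latent.shape[-2:]
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies
    mask = np.sqrt(fy**2 + fx**2) <= cutoff
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    low = np.fft.ifft2(spectrum * mask, axes=(-2, -1)).real
    high = latent - low               # residual holds the fine detail
    return low, high

# Toy data: 4 frames, 8 latent channels, 16x16 spatial grid (assumed shape).
rng = np.random.default_rng(0)
low_res = rng.standard_normal((4, 8, 16, 16))

high_res = upsample_latent(low_res, scale=2)
low, high = split_frequencies(high_res)
```

By construction the two bands sum back to the upsampled latent, so a pair of band-specific refiners could in principle operate on `low` and `high` separately before recombination. How LUVE's experts actually divide and merge this work is not specified in the abstract.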
@misc{zhao2026luvelatentcascadedultrahighresolution,
title={LUVE: Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts},
author={Chen Zhao and Jiawei Chen and Hongyu Li and Zhuoliang Kang and Shilin Lu and Xiaoming Wei and Kai Zhang and Jian Yang and Ying Tai},
year={2026},
eprint={2602.11564},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.11564},
}