DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution

1Shanghai Jiao Tong University, 2Huawei Noah's Ark Lab,
3Duke University, 4Huawei Consumer Business Group

*Indicates equal contribution. †Indicates corresponding author

Visual comparison on the real-world dataset. Our proposed DiffST outperforms others with rich and consistent details.

Abstract

Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their application to the coupled space-time video super-resolution (STVSR) task remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede practical deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffusion model for one-step sampling and process the entire video directly rather than operating on individual frames. Furthermore, to enhance spatiotemporal information utilization, we introduce cross-frame context aggregation (CFCA) and video representation guidance (VRG). The CFCA module aggregates information across multiple keyframes to produce intermediate frames. The VRG module extracts video-level global features to guide the diffusion process. Extensive experiments show that DiffST achieves leading results on real-world STVSR tasks while maintaining high inference efficiency, running about 17× faster than previous diffusion-based STVSR methods.

Method

DiffST Overview

DiffST performs one-step sampling to process the entire video directly. The cross-frame context aggregation (CFCA) module aggregates information from multiple frames to generate intermediate frames. The video representation guidance (VRG) module extracts a video-level representation to guide the restoration process. To better match real-world conditions, we adopt multiple spatial degradations, combined with frame subsampling.
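To make the data flow concrete, here is a minimal toy sketch of the pipeline described above. All function names and operations are illustrative stand-ins, not the official implementation: `cfca` mimics cross-frame aggregation with simple neighbor averaging, `vrg` reduces the video to a global feature vector, and `one_step_upsample` stands in for the one-step diffusion model.

```python
import numpy as np

def cfca(keyframes):
    """Cross-frame context aggregation (toy stand-in): synthesize each
    intermediate frame by averaging its two neighboring keyframes, then
    interleave keyframes and intermediates into a denser sequence."""
    mids = [(keyframes[i] + keyframes[i + 1]) / 2
            for i in range(len(keyframes) - 1)]
    out = []
    for i, frame in enumerate(keyframes[:-1]):
        out.extend([frame, mids[i]])
    out.append(keyframes[-1])
    return np.stack(out)  # (2T-1, H, W, C)

def vrg(video):
    """Video representation guidance (toy stand-in): a video-level global
    feature, here simply the per-channel mean over all frames and pixels."""
    return video.mean(axis=(0, 1, 2))  # (C,)

def one_step_upsample(video, guidance, scale=4):
    """One-step restoration (toy stand-in): nearest-neighbor upsampling;
    in the real model, `guidance` would condition a diffusion network."""
    up = video.repeat(scale, axis=1).repeat(scale, axis=2)
    return up + 0.0 * guidance  # guidance is a no-op in this sketch

lr = np.random.rand(5, 16, 16, 3)         # 5 LR keyframes, 16x16, RGB
dense = cfca(lr)                          # temporal upsampling: 9 frames
hr = one_step_upsample(dense, vrg(dense)) # spatial upsampling: 64x64
print(hr.shape)                           # (9, 64, 64, 3)
```

The sketch only shows shapes and data flow: T low-resolution keyframes become 2T-1 frames at 4× spatial resolution in a single pass, which is the source of the efficiency gain over iterative per-frame diffusion sampling.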

Results

Quantitative Results
  • Quantitative Results (Tab. 4 of the main paper)

  • Complexity Comparison (Tab. 5 of the main paper)

ST-VSR Performance Comparison

    More Results
    • Quantitative comparison on more metrics (Tab. 6)

    • Quantitative comparison with more STVSR methods (Tab. 7)

Qualitative Results
  • Visual Results (Fig. 5 of the main paper)

  • Consistency Comparison (Fig. 6 of the main paper)


    More Results
    • Visual comparison on synthetic (UDM10 and Vid4) datasets (Fig. 7)

    • Visual comparison on real-world (MVSR4x and RealVSR) datasets (Fig. 8)

BibTeX

@article{chen2026diffst,
  title = {DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution},
  author = {Chen, Zheng and Yang, Ruofan and Han, Jin and Song, Dehua and Zou, Zichen and He, Chunming and Guo, Yong and Zhang, Yulun},
  journal = {arXiv preprint arXiv:2605.13182},
  year = {2026}
}