DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution

1Shanghai Jiao Tong University, 2Huawei Noah's Ark Lab,
3Duke University, 4Huawei Consumer Business Group

*Indicates equal contribution. †Indicates corresponding author

Visual comparison on the real-world dataset. Our proposed DiffST outperforms others with rich and consistent details.

Abstract

Diffusion-based models have shown strong performance in video super-resolution (VSR) and video frame interpolation (VFI). However, their application to the coupled space-time video super-resolution (STVSR) task remains limited. Existing diffusion-based STVSR approaches suffer from two issues: (1) low inference efficiency and (2) insufficient utilization of spatiotemporal information. These limitations impede practical deployment. To address these issues, we introduce DiffST, an efficient spatiotemporal-aware video diffusion framework for real-world STVSR. To improve efficiency, we adapt a pre-trained diffusion model for one-step sampling and process the entire video directly rather than operating on individual frames. Furthermore, to enhance spatiotemporal information utilization, we introduce cross-frame context aggregation (CFCA) and video representation guidance (VRG). The CFCA module aggregates information across multiple keyframes to produce intermediate frames. The VRG module extracts video-level global features to guide the diffusion process. Extensive experiments show that DiffST achieves leading results on real-world STVSR tasks while maintaining high inference efficiency, running about 17× faster than previous diffusion-based STVSR methods.

Method

DiffST Overview

DiffST performs one-step sampling to process the entire video directly. The cross-frame context aggregation (CFCA) module aggregates information from multiple frames to generate intermediate frames. The video representation guidance (VRG) module extracts a video-level representation to guide the restoration process. To better match real-world conditions, we adopt multiple spatial degradations, combined with frame subsampling.
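To make the data flow concrete, here is a minimal toy sketch of the pipeline described above. All function names and operations are illustrative stand-ins, not the official implementation: `cfca` mimics cross-frame aggregation with simple neighbor averaging, `vrg` reduces the video to a global feature vector, and `one_step_upsample` stands in for the one-step diffusion model.

```python
import numpy as np

def cfca(keyframes):
    """Cross-frame context aggregation (toy stand-in): synthesize each
    intermediate frame by averaging its two neighboring keyframes, then
    interleave keyframes and intermediates into a denser sequence."""
    mids = [(keyframes[i] + keyframes[i + 1]) / 2
            for i in range(len(keyframes) - 1)]
    out = []
    for i, frame in enumerate(keyframes[:-1]):
        out.extend([frame, mids[i]])
    out.append(keyframes[-1])
    return np.stack(out)  # (2T-1, H, W, C)

def vrg(video):
    """Video representation guidance (toy stand-in): a video-level global
    feature, here simply the per-channel mean over all frames and pixels."""
    return video.mean(axis=(0, 1, 2))  # (C,)

def one_step_upsample(video, guidance, scale=4):
    """One-step restoration (toy stand-in): nearest-neighbor upsampling;
    in the real model, `guidance` would condition a diffusion network."""
    up = video.repeat(scale, axis=1).repeat(scale, axis=2)
    return up + 0.0 * guidance  # guidance is a no-op in this sketch

lr = np.random.rand(5, 16, 16, 3)         # 5 LR keyframes, 16x16, RGB
dense = cfca(lr)                          # temporal upsampling: 9 frames
hr = one_step_upsample(dense, vrg(dense)) # spatial upsampling: 64x64
print(hr.shape)                           # (9, 64, 64, 3)
```

The sketch only shows shapes and data flow: T low-resolution keyframes become 2T-1 frames at 4× spatial resolution in a single pass, which is the source of the efficiency gain over iterative per-frame diffusion sampling.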

Results

Quantitative Results
  • Quantitative Results (Tab. 4 of the main paper)

  • Complexity Comparison (Tab. 5 of the main paper)

ST-VSR Performance Comparison

    More Results
    • Quantitative comparison on more metrics (Tab. 6)

    • Quantitative comparison with more STVSR methods (Tab. 7)

Qualitative Results
  • Visual Results (Fig. 5 of the main paper)

  • Consistency Comparison (Fig. 6 of the main paper)


    More Results
    • Visual comparison on synthetic (UDM10 and Vid4) datasets (Fig. 7)

    • Visual comparison on real-world (MVSR4x and RealVSR) datasets (Fig. 8)

BibTeX

@article{chen2026diffst,
  title = {DiffST: Spatiotemporal-Aware Diffusion for Real-World Space-Time Video Super-Resolution},
  author = {Chen, Zheng and Yang, Ruofan and Han, Jin and Song, Dehua and Zou, Zichen and He, Chunming and Guo, Yong and Zhang, Yulun},
  journal = {arXiv preprint arXiv:2605.13182},
  year = {2026}
}