DVFace
Spatio-Temporal Dual-Prior Diffusion
for Video Face Restoration

1Shanghai Jiao Tong University, 2Meituan Inc.
*Equal Contribution.
Corresponding Author
[Teaser figure: qualitative comparisons (LQ, HQ, SVFR, DicFace, DVFace (ours)) alongside a radar chart of quantitative metrics]
DVFace is a one-step diffusion model for video face restoration. The left shows qualitative comparisons with other methods, and the right demonstrates quantitative superiority across multiple metrics. DVFace delivers superior facial restoration quality compared with other methods.

Visual Results

[Visual results: frame-by-frame comparisons of HQ, LQ, PGTFormer, DicFace, SVFR, and DVFace (ours) on synthetic and real-world videos]

Abstract

Video face restoration aims to recover high-quality face videos from severely degraded inputs while preserving realistic facial details, stable identity, and temporal coherence. Recent diffusion-based methods have brought strong generative priors to restoration and enabled more realistic detail synthesis. However, existing approaches for face videos still rely heavily on generic diffusion priors and multi-step sampling, which limits both facial adaptation and inference efficiency. These limitations motivate the use of one-step diffusion for video face restoration, but achieving faithful facial recovery together with temporally stable outputs remains challenging. In this paper, we propose DVFace, a one-step diffusion framework for real-world video face restoration. Specifically, we introduce a spatio-temporal dual-codebook design to extract complementary spatial and temporal facial priors from degraded videos. We further propose an asymmetric spatio-temporal fusion module to inject these priors into the diffusion backbone according to their distinct roles. Extensive experiments on synthetic and real-world benchmarks demonstrate that DVFace achieves superior restoration quality, temporal consistency, and identity preservation compared with recent methods.
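The efficiency argument above hinges on the number of network evaluations: multi-step sampling calls the diffusion backbone once per step, while a one-step model maps the degraded input to the restored output in a single call. The toy sketch below illustrates this contrast; the `denoiser` function is a hypothetical stand-in, not DVFace's actual backbone.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x, t):
    # Toy stand-in for a diffusion backbone: pulls the input toward a
    # fixed "clean" target, more strongly at larger step sizes t.
    target = np.ones_like(x)
    return x + t * (target - x)

x_lq = rng.normal(size=4)  # degraded input (toy 4-dim signal)

# Multi-step sampling: four network evaluations, each a small step.
x_multi = x_lq.copy()
for t in [0.25, 0.25, 0.25, 0.25]:
    x_multi = denoiser(x_multi, t)

# One-step restoration: a single network evaluation maps LQ -> HQ.
x_one = denoiser(x_lq, t=1.0)
```

In this toy setting the one-step call reaches the clean target exactly while the multi-step trajectory only approaches it, using 4x the compute; real diffusion models trade these regimes off differently, but the evaluation-count gap is the same.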

Method

Illustration of the DVFace framework

Overview of DVFace. (a) Overall Framework: DVFace restores low-quality face videos with one-step diffusion and facial priors. (b) Spatio-Temporal Dual-Codebook (STDC): spatial and temporal codebooks extract complementary facial priors. (c) Asymmetric Spatio-Temporal Fusion (ASTF): temporal priors are injected globally and spatial priors locally.
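The dual-codebook and asymmetric-fusion ideas can be sketched with standard nearest-neighbor vector quantization. The sketch below is a hypothetical toy under our own assumptions (feature shapes, mean-pooling for the temporal branch, additive fusion), not the paper's implementation: a spatial codebook quantizes tokens frame by frame, a temporal codebook quantizes features pooled across time, and fusion applies the spatial prior per token while broadcasting the temporal prior globally.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(feats, codebook):
    """Nearest-neighbor VQ lookup: replace each feature row with its
    closest codebook entry (squared Euclidean distance)."""
    d = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return codebook[idx]

T, N, D, K = 4, 16, 8, 32            # frames, tokens/frame, dim, codebook size
feats = rng.normal(size=(T, N, D))   # features of the degraded video
spatial_cb = rng.normal(size=(K, D))   # spatial codebook (toy, random)
temporal_cb = rng.normal(size=(K, D))  # temporal codebook (toy, random)

# Spatial prior: quantize tokens frame by frame (local facial structure).
spatial_prior = np.stack([quantize(f, spatial_cb) for f in feats])

# Temporal prior: quantize features pooled across time (global dynamics).
temporal_prior = quantize(feats.mean(axis=0), temporal_cb)

# Asymmetric fusion: spatial prior added locally per token; temporal
# prior broadcast globally across all frames.
fused = feats + spatial_prior + temporal_prior[None, :, :]
```

The asymmetry here is only in how the two priors are injected (per-frame vs. broadcast); in the actual model the injection would go through learned fusion layers inside the diffusion backbone rather than plain addition.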

Results

Quantitative Results
  • Results on synthetic datasets VFHQ-Test and HDTF. (Tab. 1 of the main paper)

  • Results on real-world datasets RFV-LQ and Voxceleb2. (Tab. 2 of the main paper)

More Quantitative Results
  • Comparison with current video super-resolution methods. (Tab. 1 of the supplementary material)

  • Efficiency comparison. (Tab. 2 of the supplementary material)

Qualitative Results
  • Results on synthetic datasets VFHQ-Test and HDTF. (Fig. 4 of the main paper)

  • Results on real-world dataset RFV-LQ. (Fig. 5 of the main paper)

More Qualitative Results
  • Multi-frame comparison. (Fig. 1 of the supplementary material)

  • More results on synthetic datasets. (Fig. 2 of the supplementary material)

  • More results on real-world datasets. (Fig. 3 of the supplementary material)

BibTeX

@article{chen2026dvface,
  title   = {DVFace: Spatio-Temporal Dual-Prior Diffusion for Video Face Restoration},
  author  = {Chen, Zheng and Chai, Bowen and Gao, Rongjun and Nie, Mingtao and Li, Xi and Duan, Bingnan and Fang, Jianping and Liu, Xiaohong and Kong, Linghe and Zhang, Yulun},
  journal = {arXiv preprint arXiv:2604.14560},
  year    = {2026}
}