DepthMaster: Taming Diffusion Models for Monocular Depth Estimation

¹University of Science and Technology of China, ²vivo Mobile Communication Co., Ltd.
arXiv 2025

*Indicates Equal Contribution, Indicates Corresponding Author

Abstract

Monocular depth estimation within the diffusion-denoising paradigm demonstrates impressive generalization ability but suffers from low inference speed. Recent methods adopt a single-step deterministic paradigm to improve inference efficiency while maintaining comparable performance. However, they overlook the gap between generative and discriminative features, leading to suboptimal results. In this work, we propose DepthMaster, a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. First, to mitigate overfitting to texture details introduced by generative features, we propose a Feature Alignment module, which incorporates high-quality semantic features to enhance the denoising network's representation capability. Second, to address the lack of fine-grained details in the single-step deterministic framework, we propose a Fourier Enhancement module to adaptively balance low-frequency structure and high-frequency details. We adopt a two-stage training strategy to fully leverage the potential of the two modules. In the first stage, we focus on learning the global scene structure with the Feature Alignment module, while in the second stage, we exploit the Fourier Enhancement module to improve the visual quality. Through these efforts, our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets.

Method

We propose DepthMaster, an approach that adapts the generative features of diffusion models to the discriminative depth estimation task. A Feature Alignment module uses high-quality external features to mitigate overfitting to texture details, and a Fourier Enhancement module refines fine-grained details in the frequency domain.
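To make the frequency-domain idea concrete, here is a minimal sketch of splitting a depth map into low-frequency structure and high-frequency detail and then reweighting the two. It is not the Fourier Enhancement module itself: the fixed circular low-pass mask, the `radius_frac` parameter, the scalar `high_gain`, and both function names are assumptions made for illustration, whereas DepthMaster learns the balance adaptively.

```python
import numpy as np


def split_frequencies(depth, radius_frac=0.1):
    """Split a 2D map into low- and high-frequency parts.

    Hypothetical helper: uses a hard circular mask around the DC term,
    whose radius is `radius_frac` times the shorter image side.
    """
    h, w = depth.shape
    # Centre the zero-frequency (DC) component at index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(depth))
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    low_mask = dist <= radius_frac * min(h, w)
    # Keep only the frequencies inside the mask, then return to image space.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    # Define the detail band as the residual so that low + high == depth.
    high = depth - low
    return low, high


def rebalance(depth, high_gain=1.5, radius_frac=0.1):
    """Recombine the two bands with a fixed gain on the detail band.

    Assumption: a single scalar gain stands in for the learned,
    adaptive weighting described in the text.
    """
    low, high = split_frequencies(depth, radius_frac)
    return low + high_gain * high
```

With `high_gain=1.0` the input is returned unchanged; values above 1 emphasize edges and fine detail, while values below 1 favor the smooth global structure.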

Quantitative Results

Comparison with state-of-the-art zero-shot affine-invariant monocular depth estimation methods. Our model achieves state-of-the-art zero-shot performance, outperforming other diffusion-based methods across various datasets.
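Affine-invariant evaluation means a prediction is first aligned to the ground truth with a single scale and shift before any error is measured. The sketch below shows that alignment as a closed-form least-squares fit, followed by two metrics that are commonly reported for this setting (absolute relative error and the δ1 accuracy). This page does not list the exact metrics or protocol used, so the metric choice, the flat-list inputs, and all function names are assumptions for illustration.

```python
def align_scale_shift(pred, gt):
    """Least-squares scale s and shift t minimizing sum((s * p + t - g)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov_pg = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    scale = cov_pg / var_p  # assumes pred is not constant
    shift = mean_g - scale * mean_p
    return scale, shift


def abs_rel(pred, gt):
    """Mean absolute relative error; assumes all gt values are positive."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(p / g, g / p) below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


def evaluate_affine_invariant(pred, gt):
    """Align pred to gt, then compute both metrics on the aligned values."""
    scale, shift = align_scale_shift(pred, gt)
    aligned = [scale * p + shift for p in pred]
    return abs_rel(aligned, gt), delta1(aligned, gt)
```

Inputs are flattened lists of valid pixels. A prediction that differs from the ground truth only by a scale and a shift scores a near-zero error after alignment, which is exactly the ambiguity this protocol is designed to ignore.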

Qualitative Results

Qualitative comparison with zero-shot monocular depth estimation methods across different datasets. Our model demonstrates excellent detail preservation and structure capture capabilities.

Qualitative results on in-the-wild examples.