Rethinking Detection Transformers with Denoising Diffusion Process
We present DiffuDETR, a novel approach that formulates object detection as conditional object-query generation, conditioned on the image and a set of noisy reference points. We integrate DETR-based models with denoising diffusion training to generate the reference points of object queries from a Gaussian prior. We propose two variants: DiffuDETR, built on top of the Deformable DETR decoder, and DiffuDINO, based on DINO's decoder with contrastive denoising queries. To keep inference efficient, we further introduce a lightweight sampling scheme that requires only a few additional forward passes through the decoder.
Our method demonstrates consistent improvements across multiple backbones and datasets, including COCO 2017, LVIS, and V3Det, surpassing the respective baselines, with notable gains in complex and crowded scenes. With a ResNet-50 backbone, DiffuDINO reaches 51.9 mAP on COCO, a +1.0 mAP gain over DINO's 50.9 mAP. We observe similar improvements on LVIS and V3Det, with gains of +2.4 and +2.2 mAP respectively.
We reformulate object detection in DETR as a denoising diffusion process, progressively denoising queries' reference points from Gaussian noise to precise object locations.
We introduce DiffuDETR (built on Deformable DETR) and DiffuDINO (built on DINO with contrastive denoising queries), demonstrating the generality of our approach.
Comprehensive experiments on COCO 2017, LVIS, and V3Det across multiple backbones with thorough ablation studies on noise distributions, schedulers, decoder evaluations, and multi-seed robustness.
Integrating denoising diffusion into DETR-based object detection transformers.
Overview of the DiffuDETR architecture. The model extracts multi-scale features via a backbone and transformer encoder, then uses noisy reference points (generated from Gaussian noise) along with learnable content queries to produce detections through an iterative diffusion-denoising process in the decoder.
During training, ground-truth bounding box coordinates are corrupted with Gaussian noise at a random timestep $t \sim U(0, 100)$, creating noisy reference points $r_t$ via a cosine noise schedule with only 100 timesteps.
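The corruption step above can be sketched as follows. This is a minimal NumPy sketch: the schedule offset `s = 0.008` and the exact normalization of the cosine schedule follow common diffusion-model conventions and are assumptions, not details taken from the paper.

```python
import numpy as np

T = 100  # total diffusion timesteps, as used in the paper

def alpha_bar(t, s=0.008):
    """Cosine noise schedule: cumulative signal fraction at timestep t,
    decreasing from ~1 (clean) to ~0 (pure noise). The offset s and
    normalization by f(0) follow standard practice (an assumption here)."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def corrupt_reference_points(r0, t, rng=np.random.default_rng(0)):
    """Forward diffusion q(r_t | r_0): blend ground-truth box coordinates
    r0 (assumed normalized to [0, 1]) with Gaussian noise at timestep t."""
    eps = rng.standard_normal(r0.shape)
    ab = alpha_bar(t)
    r_t = np.sqrt(ab) * r0 + np.sqrt(1 - ab) * eps
    return r_t, eps  # eps is the regression target for the denoiser

# Example: corrupt 4 ground-truth (cx, cy, w, h) reference points
r0 = np.array([[0.5, 0.5, 0.2, 0.3]] * 4)
t = 37  # a timestep drawn from U(0, 100) during training
r_t, eps = corrupt_reference_points(r0, t)
```

Given the returned `eps`, the clean points are exactly recoverable by inverting the blend, which is what makes noise prediction a valid training target.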
At inference, reference points start as pure Gaussian noise and are iteratively denoised using DDIM sampling. The decoder predicts noise residuals conditioned on image features, requiring only 3 forward passes through the decoder.
The decoder integrates timestep embeddings after self-attention, conditioning each layer on the current diffusion step: $q_n = \text{FFN}(\text{MSDA}(\text{SA}(q_{n-1}) + t,\; r_t,\; O_{\text{enc}}))$
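A sketch of how such a timestep embedding is typically computed: the sinusoidal form below is the standard choice in diffusion models and is an assumption here; the paper only states that a timestep embedding is added to the queries after self-attention.

```python
import numpy as np

def timestep_embedding(t, dim=256):
    """Sinusoidal timestep embedding (standard diffusion-model choice,
    assumed here): dim/2 sine and dim/2 cosine features at
    geometrically spaced frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

# Per-layer data flow implied by the formula (names are illustrative):
#   q = SA(q_prev)                    # self-attention over queries
#   q = q + timestep_embedding(t)     # condition on the diffusion step
#   q = MSDA(q, r_t, O_enc)           # deformable cross-attention at the
#                                     # noisy reference points r_t over
#                                     # the encoded image features O_enc
#   q = FFN(q)
emb = timestep_embedding(37, dim=256)
```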
Only the lightweight decoder is run multiple times; the backbone and encoder execute once. With 3 decoder evaluations, the overhead is only ~17% extra FLOPs while achieving the best accuracy.
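The control flow behind this efficiency claim can be sketched with stand-in stubs (`backbone_encoder`, `decoder`, `ddim_step`, and the particular timestep subsequence are all illustrative placeholders, not the paper's implementation):

```python
# Inference loop sketch: the backbone and encoder run once; only the
# lightweight decoder is evaluated at each DDIM sampling step.
calls = {"encoder": 0, "decoder": 0}

def backbone_encoder(image):
    calls["encoder"] += 1
    return "multi_scale_features"      # placeholder for O_enc

def decoder(features, r_t, t):
    calls["decoder"] += 1
    return "predicted_noise"           # placeholder for eps_hat

def ddim_step(r_t, eps_hat, t):
    return r_t                         # placeholder DDIM update

def detect(image, steps=(100, 66, 33)):  # 3 decoder evaluations
    feats = backbone_encoder(image)    # executed exactly once
    r = "gaussian_noise"               # reference points ~ N(0, I)
    for t in steps:
        eps_hat = decoder(feats, r, t)
        r = ddim_step(r, eps_hat, t)
    return r

detect("img")
```

Because the expensive feature extraction is amortized over all sampling steps, extra decoder evaluations add only a small fraction of the total FLOPs.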
Detailed decoder layer design showing how timestep embeddings are injected after self-attention, followed by multi-scale deformable cross-attention with noisy reference points attending to encoded image features.
$$r_{t-1} = \frac{\sqrt{\bar{\alpha}_{t-1}}}{\sqrt{\bar{\alpha}_t}} \cdot r_t + \left(\sqrt{1 - \bar{\alpha}_{t-1}} - \frac{\sqrt{\bar{\alpha}_{t-1}} \cdot \sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}\right) \cdot \hat{\epsilon}_\theta(r_t, t)$$
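This deterministic DDIM update can be implemented directly. A minimal NumPy sketch, assuming the cosine schedule form below (the offset `s = 0.008` is a common convention, not a stated detail):

```python
import numpy as np

T = 100  # total diffusion timesteps

def alpha_bar(t, s=0.008):
    """Cosine noise schedule (assumed form): cumulative signal fraction."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def ddim_step(r_t, eps_hat, t, t_prev):
    """One deterministic DDIM update, term-by-term as in the equation:
    r_{t_prev} = coef_r * r_t + coef_e * eps_hat."""
    ab_t, ab_p = alpha_bar(t), alpha_bar(t_prev)
    coef_r = np.sqrt(ab_p) / np.sqrt(ab_t)
    coef_e = np.sqrt(1 - ab_p) - np.sqrt(ab_p) * np.sqrt(1 - ab_t) / np.sqrt(ab_t)
    return coef_r * r_t + coef_e * eps_hat
```

A useful sanity check: if the network predicted the true noise `eps` exactly, one step maps $r_t = \sqrt{\bar{\alpha}_t}\,r_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ onto the correctly noised $r_{t_{prev}}$ with the same $r_0$ and $\epsilon$.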
Consistent improvements across three challenging benchmarks and multiple backbones.
COCO val2017 AP (%) vs. training epochs for DiffuDINO, DiffuDETR, and various baselines including DINO, DN-DETR, DiffusionDet, Deformable-DETR, and more. DiffuDINO converges to the highest AP, surpassing all baseline methods.
Comparison with different methods on COCO 2017 validation set.
| Model | Backbone | Epochs | AP | AP₅₀ | AP₇₅ | APₛ | APₘ | APₗ |
|---|---|---|---|---|---|---|---|---|
| Pix2Seq | R50 | 300 | 43.2 | 61.0 | 46.1 | 26.6 | 47.0 | 58.6 |
| DiffusionDet | R50 | — | 46.8 | 65.3 | 51.8 | 29.6 | 49.3 | 62.2 |
| Deformable DETR | R50 | 50 | 48.2 | 67.0 | 52.2 | 30.7 | 51.4 | 63.0 |
| Align-DETR | R50 | 24 | 51.4 | 69.1 | 55.8 | 35.5 | 54.6 | 65.7 |
| DINO | R50 | 36 | 50.9 | 69.0 | 55.3 | 34.6 | 54.1 | 64.6 |
| DiffuDETR (Ours) | R50 | 50 | 50.2 (+2.0) | 66.8 | 55.2 | 33.3 | 53.9 | 65.8 |
| DiffuAlignDETR (Ours) | R50 | 24 | 51.9 (+0.5) | 69.2 | 56.4 | 34.9 | 55.6 | 66.2 |
| DiffuDINO (Ours) | R50 | 50 | 51.9 (+1.0) | 69.4 | 55.7 | 35.8 | 55.7 | 67.1 |
| Pix2Seq | R101 | 300 | 44.5 | 62.8 | 47.5 | 26.0 | 48.2 | 60.3 |
| DiffusionDet | R101 | — | 47.5 | 65.7 | 52.0 | 30.8 | 50.4 | 63.1 |
| Align-DETR | R101 | 12 | 51.2 | 68.8 | 55.7 | 32.9 | 55.1 | 66.6 |
| DINO | R101 | 12 | 50.0 | 67.7 | 54.4 | 32.2 | 53.4 | 64.3 |
| DiffuAlignDETR (Ours) | R101 | 12 | 51.7 (+0.5) | 69.3 | 56.1 | 34.0 | 55.6 | 67.0 |
| DiffuDINO (Ours) | R101 | 12 | 51.2 (+1.2) | 68.6 | 55.8 | 33.2 | 55.6 | 67.2 |
Results on LVIS validation set. Notable gains over DINO (+2.4 AP with R50).
| Model | Backbone | AP | AP₅₀ | APr | APc | APf |
|---|---|---|---|---|---|---|
| DINO | R50 | 26.5 | 35.9 | 9.2 | 24.6 | 36.2 |
| DiffuDINO (Ours) | R50 | 28.9 (+2.4) | 38.5 | 13.7 | 27.6 | 36.9 |
| DINO | R101 | 30.9 | 40.4 | 13.9 | 29.7 | 39.7 |
| DiffuDINO (Ours) | R101 | 32.5 (+1.6) | 42.4 | 13.5 | 32.0 | 41.5 |
Results on V3Det with 13,204 categories. Massive +8.3 AP gain with Swin-B backbone.
| Model | Backbone | AP | AP₅₀ | AP₇₅ |
|---|---|---|---|---|
| DINO | R50 | 33.5 | 37.7 | 35.0 |
| DiffuDINO (Ours) | R50 | 35.7 (+2.2) | 41.4 | 37.7 |
| DINO | Swin-B | 42.0 | 46.8 | 43.9 |
| DiffuDINO (Ours) | Swin-B | 50.3 (+8.3) | 56.6 | 52.9 |
Visual comparison of detection results showing improvements over baselines.
Side-by-side qualitative comparison on COCO 2017 val: Deformable DETR vs. DiffuDETR and DINO vs. DiffuDINO. Our diffusion-based models produce more accurate and complete detections, especially in crowded scenes with overlapping objects.
DiffuDINO results at different decoder evaluation steps ($t = 1, 3, 5, 10$) compared to DINO baseline and ground truth. Even $t = 1$ already surpasses DINO.
Same comparison on LVIS validation, showing improved handling of long-tail categories and fine-grained objects across varying decoder steps.
Detailed analysis of design choices on COCO 2017 validation.
Multi-Seed Robustness: Across 5 random seeds, the standard deviation remains below ±0.2 AP in all settings, demonstrating that DiffuDINO is highly stable regardless of the initialization noise, even on dense and sparse scene subsets.