ICLR 2026

DiffuDETR

Rethinking Detection Transformers with Denoising Diffusion Process

Youssef Nawar*
Mohamed Badran*
Marwan Torki
Alexandria University  ·  Technical University of Munich  ·  Applied Innovation Center
* Equal Contribution
+1.0 AP over DINO (mAP on COCO val2017, ResNet-50 backbone)
+2.4 AP over DINO (AP on LVIS val, ResNet-50 backbone)
+8.3 AP over DINO (AP on V3Det val, Swin-B backbone)
Efficient inference: decoder-only extra passes, only ~17% additional FLOPs

Abstract

We present DiffuDETR, a novel approach that formulates object detection as conditional object query generation, conditioned on the image and a set of noisy reference points. We integrate DETR-based models with denoising diffusion training to generate the reference points of object queries from a Gaussian prior. We propose two variants: DiffuDETR, built on top of the Deformable DETR decoder, and DiffuDINO, based on DINO's decoder with contrastive denoising queries. To improve inference efficiency, we further introduce a lightweight sampling scheme in which only the decoder runs multiple times, while the backbone and encoder execute once.


Our method demonstrates consistent improvements across multiple backbones and datasets, including COCO 2017, LVIS, and V3Det, surpassing the respective baselines, with notable gains in complex and crowded scenes. With a ResNet-50 backbone, we observe a +1.0 mAP gain on COCO: DiffuDINO reaches 51.9 mAP compared to DINO's 50.9 mAP. We observe similar improvements of +2.4 AP on LVIS and +2.2 AP on V3Det.

Method Overview

Integrating denoising diffusion into DETR-based object detection transformers.

DiffuDETR Framework Overview: Input Image → Backbone (ResNet/Swin) → Transformer Encoder → Diffusion Decoder → Detection Output

 DiffuDETR Framework

Overview of the DiffuDETR architecture. The model extracts multi-scale features via a backbone and transformer encoder, then uses noisy reference points (generated from Gaussian noise) along with learnable content queries to produce detections through an iterative diffusion-denoising process in the decoder.

Forward Diffusion

During training, ground-truth bounding box coordinates are corrupted with Gaussian noise at a random timestep $t \sim \mathcal{U}(0, T)$, creating noisy reference points $r_t$ via a cosine noise schedule with only $T = 100$ timesteps.
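The forward corruption can be sketched in a few lines of NumPy. This is a minimal illustration, assuming normalized $(c_x, c_y, w, h)$ boxes and the standard cosine $\bar{\alpha}$ schedule; the function names are ours, not the released code.

```python
import numpy as np

def cosine_alpha_bar(t, T=100, s=0.008):
    """Cumulative alpha-bar under the standard cosine noise schedule."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def forward_diffuse(boxes, t, T=100, rng=None):
    """Corrupt normalized box coordinates (cx, cy, w, h) at timestep t:
    r_t = sqrt(abar_t) * r_0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    if rng is None:
        rng = np.random.default_rng(0)
    abar = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(boxes.shape)
    r_t = np.sqrt(abar) * boxes + np.sqrt(1.0 - abar) * eps
    return r_t, eps

# Example: corrupt two ground-truth boxes at a random timestep.
rng = np.random.default_rng(0)
r0 = np.array([[0.5, 0.5, 0.2, 0.3], [0.3, 0.7, 0.1, 0.1]])
t = rng.integers(0, 100)
r_t, eps = forward_diffuse(r0, t, rng=rng)
```

At $t = 0$ the schedule gives $\bar{\alpha}_0 = 1$, so the boxes pass through uncorrupted; as $t \to T$ they approach pure Gaussian noise.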

Reverse Denoising

At inference, reference points start as pure Gaussian noise and are iteratively denoised using DDIM sampling. The decoder predicts noise residuals conditioned on image features, requiring only 3 forward passes.

Timestep-Conditioned Decoder

The decoder integrates timestep embeddings after self-attention, conditioning each layer on the current diffusion step: $q_n = \text{FFN}(\text{MSDA}(\text{SA}(q_{n-1}) + t,\; r_t,\; O_{\text{enc}}))$
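The conditioning pattern can be sketched as follows. This is a toy NumPy stand-in: plain linear maps replace the real SA/MSDA/FFN modules, the stand-in cross-attention ignores the reference points (real multi-scale deformable attention would sample encoder features at them), and only the placement of the timestep embedding mirrors the equation above.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256  # hidden dimension (illustrative)

def sinusoidal_embed(t, dim=D):
    """Standard sinusoidal timestep embedding of size `dim`."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

# Stand-ins for the real sub-modules (SA, MSDA, FFN): simple linear maps.
W_sa, W_ca, W_ffn = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))

def decoder_layer(q, t, ref_points, enc_feats):
    """One timestep-conditioned layer:
    q_n = FFN(MSDA(SA(q_{n-1}) + emb(t), r_t, O_enc))."""
    q = q @ W_sa                         # self-attention stand-in
    q = q + sinusoidal_embed(t)          # inject timestep AFTER self-attention
    # cross-attention stand-in: deformable attention would sample enc_feats
    # at ref_points; here we just mix in pooled encoder features.
    q = q @ W_ca + enc_feats.mean(axis=0)
    return q @ W_ffn                     # feed-forward network

q = rng.standard_normal((10, D))     # 10 content queries
refs = rng.random((10, 4))           # noisy reference boxes r_t
enc = rng.standard_normal((100, D))  # encoder memory O_enc
out = decoder_layer(q, t=37, ref_points=refs, enc_feats=enc)
```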

Efficient Inference

Only the lightweight decoder is run multiple times; the backbone and encoder execute once. With 3 decoder evaluations, the overhead is just ~17% extra FLOPs while achieving the best AP in our ablations.
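A quick back-of-envelope check of the ~17% figure, using the FLOPs reported in the ablation section (244.5G for 1 decoder evaluation, 285.2G for 3):

```python
# Marginal decoder cost and relative overhead from the reported totals.
flops_1, flops_3 = 244.5, 285.2           # total GFLOPs for 1 and 3 decoder evals
per_extra_eval = (flops_3 - flops_1) / 2  # cost of one extra decoder pass
overhead = (flops_3 - flops_1) / flops_1  # relative overhead of 3 evals vs. 1
print(f"~{per_extra_eval:.1f}G per extra decoder pass, {overhead:.1%} overhead")
```

Each extra decoder pass adds roughly 20G FLOPs, so three evaluations cost about 17% more than a single pass, consistent with the backbone and encoder dominating the total compute.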

DiffuDETR Decoder Architecture

 Decoder Architecture

Detailed decoder layer design showing how timestep embeddings are injected after self-attention, followed by multi-scale deformable cross-attention with noisy reference points attending to encoded image features.

DDIM Sampling Update Rule

$$r_{t-1} = \frac{\sqrt{\bar{\alpha}_{t-1}}}{\sqrt{\bar{\alpha}_t}} \cdot r_t + \left(\sqrt{1 - \bar{\alpha}_{t-1}} - \frac{\sqrt{\bar{\alpha}_{t-1}} \cdot \sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}\right) \cdot \hat{\epsilon}_\theta(r_t, t)$$
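The update rule above translates directly into code. A minimal NumPy sketch of one deterministic DDIM step ($\eta = 0$); the function name and signature are illustrative:

```python
import numpy as np

def ddim_step(r_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0), term-by-term the rule above:
    r_{t-1} = sqrt(abar_prev)/sqrt(abar_t) * r_t
            + (sqrt(1-abar_prev) - sqrt(abar_prev)*sqrt(1-abar_t)/sqrt(abar_t)) * eps_pred
    """
    c1 = np.sqrt(abar_prev) / np.sqrt(abar_t)
    c2 = np.sqrt(1.0 - abar_prev) - np.sqrt(abar_prev) * np.sqrt(1.0 - abar_t) / np.sqrt(abar_t)
    return c1 * r_t + c2 * eps_pred
```

Sanity property: if `eps_pred` equals the true noise used to corrupt $r_0$, stepping to $\bar{\alpha} = 1$ recovers $r_0$ exactly, since the rule is an exact rearrangement of the forward process.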

Main Results

Consistent improvements across three challenging benchmarks and multiple backbones.

Convergence Comparison

 Training Convergence

COCO val2017 AP (%) vs. training epochs for DiffuDINO, DiffuDETR, and various baselines including DINO, DN-DETR, DiffusionDet, Deformable-DETR, and more. DiffuDINO converges to the highest AP, surpassing all baseline methods.

COCO 2017 val — Object Detection Results

Comparison with different methods on COCO 2017 validation set.

| Model | Backbone | Epochs | AP | AP₅₀ | AP₇₅ | APₛ | APₘ | APₗ |
|---|---|---|---|---|---|---|---|---|
| Pix2Seq | R50 | 300 | 43.2 | 61.0 | 46.1 | 26.6 | 47.0 | 58.6 |
| DiffusionDet | R50 | – | 46.8 | 65.3 | 51.8 | 29.6 | 49.3 | 62.2 |
| Deformable DETR | R50 | 50 | 48.2 | 67.0 | 52.2 | 30.7 | 51.4 | 63.0 |
| Align-DETR | R50 | 24 | 51.4 | 69.1 | 55.8 | 35.5 | 54.6 | 65.7 |
| DINO | R50 | 36 | 50.9 | 69.0 | 55.3 | 34.6 | 54.1 | 64.6 |
| DiffuDETR (Ours) | R50 | 50 | 50.2 (+2.0) | 66.8 | 55.2 | 33.3 | 53.9 | 65.8 |
| DiffuAlignDETR (Ours) | R50 | 24 | 51.9 (+0.5) | 69.2 | 56.4 | 34.9 | 55.6 | 66.2 |
| DiffuDINO (Ours) | R50 | 50 | 51.9 (+1.0) | 69.4 | 55.7 | 35.8 | 55.7 | 67.1 |
| Pix2Seq | R101 | 300 | 44.5 | 62.8 | 47.5 | 26.0 | 48.2 | 60.3 |
| DiffusionDet | R101 | – | 47.5 | 65.7 | 52.0 | 30.8 | 50.4 | 63.1 |
| Align-DETR | R101 | 12 | 51.2 | 68.8 | 55.7 | 32.9 | 55.1 | 66.6 |
| DINO | R101 | 12 | 50.0 | 67.7 | 54.4 | 32.2 | 53.4 | 64.3 |
| DiffuAlignDETR (Ours) | R101 | 12 | 51.7 (+0.5) | 69.3 | 56.1 | 34.0 | 55.6 | 67.0 |
| DiffuDINO (Ours) | R101 | 12 | 51.2 (+1.2) | 68.6 | 55.8 | 33.2 | 55.6 | 67.2 |

LVIS val — Large Vocabulary Detection

Results on LVIS validation set. Notable gains over DINO (+2.4 AP with R50).

| Model | Backbone | AP | AP₅₀ | APᵣ | AP꜀ | AP𝒇 |
|---|---|---|---|---|---|---|
| DINO | R50 | 26.5 | 35.9 | 9.2 | 24.6 | 36.2 |
| DiffuDINO (Ours) | R50 | 28.9 (+2.4) | 38.5 | 13.7 | 27.6 | 36.9 |
| DINO | R101 | 30.9 | 40.4 | 13.9 | 29.7 | 39.7 |
| DiffuDINO (Ours) | R101 | 32.5 (+1.6) | 42.4 | 13.5 | 32.0 | 41.5 |

V3Det val — Vast Vocabulary Detection

Results on V3Det with 13,204 categories. Massive +8.3 AP gain with Swin-B backbone.

| Model | Backbone | AP | AP₅₀ | AP₇₅ |
|---|---|---|---|---|
| DINO | R50 | 33.5 | 37.7 | 35.0 |
| DiffuDINO (Ours) | R50 | 35.7 (+2.2) | 41.4 | 37.7 |
| DINO | Swin-B | 42.0 | 46.8 | 43.9 |
| DiffuDINO (Ours) | Swin-B | 50.3 (+8.3) | 56.6 | 52.9 |

Qualitative Results

Visual comparison of detection results showing improvements over baselines.

Comparison with Baselines

 Baseline Comparison

Side-by-side qualitative comparison on COCO 2017 val: Deformable DETR vs. DiffuDETR and DINO vs. DiffuDINO. Our diffusion-based models produce more accurate and complete detections, especially in crowded scenes with overlapping objects.

Ablation Studies

Detailed analysis of design choices on COCO 2017 validation.

 Noise Distribution
mAP on COCO val2017: Gaussian 51.9 · Sigmoid 50.4 · Beta 49.5

 Noise Scheduler
mAP on COCO val2017: Cosine 51.9 · Linear 51.6 · Sqrt 51.4

 Decoder Evaluations
mAP on COCO val2017: 1 eval 51.6 · 3 evals 51.9 · 5 evals 51.8 · 10 evals 51.4

 Computational Cost
Total FLOPs by number of decoder evaluations: 1 eval 244.5G · 3 evals 285.2G · 5 evals 326.0G · 10 evals 427.9G

   Multi-Seed Robustness: Across 5 random seeds, the standard deviation remains below ±0.2 AP in all settings, demonstrating that DiffuDINO is stable with respect to the initialization noise, on both dense and sparse scene subsets.

Citation

 BibTeX
@inproceedings{nawar2026diffudetr,
  title     = {DiffuDETR: Rethinking Detection Transformers with Denoising Diffusion Process},
  author    = {Nawar, Youssef and Badran, Mohamed and Torki, Marwan},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}