Rethinking Detection Transformers with Denoising Diffusion Process
We present DiffuDETR, a novel approach that formulates object detection as conditional object-query generation, conditioned on the image and a set of noisy reference points. We integrate DETR-based models with denoising diffusion training to generate the reference points of object queries from a Gaussian prior. We propose two variants: DiffuDETR, built on top of the Deformable DETR decoder, and DiffuDINO, based on DINO's decoder with contrastive denoising queries. To keep inference efficient, we further introduce a lightweight sampling scheme that requires only a few additional forward passes through the decoder.
Our method demonstrates consistent improvements across multiple backbones and datasets, including COCO 2017, LVIS, and V3Det, surpassing the respective baselines, with notable gains in complex and crowded scenes. With a ResNet-50 backbone, DiffuDINO reaches 51.9 mAP on COCO, a +1.0 mAP gain over DINO's 50.9 mAP. We observe similar improvements on LVIS and V3Det, with gains of +2.4 and +2.2 mAP respectively.
We reformulate object detection in DETR as a denoising diffusion process, progressively denoising queries' reference points from Gaussian noise to precise object locations.
We introduce DiffuDETR (built on Deformable DETR) and DiffuDINO (built on DINO with contrastive denoising queries), demonstrating the generality of our approach.
Comprehensive experiments on COCO 2017, LVIS, and V3Det across multiple backbones with thorough ablation studies on noise distributions, schedulers, decoder evaluations, and multi-seed robustness.
Integrating denoising diffusion into DETR-based object detection transformers.
Overview of the DiffuDETR architecture. The model extracts multi-scale features via a backbone and transformer encoder, then uses noisy reference points (generated from Gaussian noise) along with learnable content queries to produce detections through an iterative diffusion-denoising process in the decoder.
During training, ground-truth bounding box coordinates are corrupted with Gaussian noise at a random timestep $t \sim U(0, 100)$, creating noisy reference points $r_t$ via a cosine noise schedule with only 100 timesteps.
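The corruption step above can be sketched as follows. This is a minimal NumPy sketch: the schedule offset `s = 0.008` and the exact normalization of the cosine schedule follow common diffusion-model conventions and are assumptions, not details taken from the paper.

```python
import numpy as np

T = 100  # total diffusion timesteps, as used in the paper

def alpha_bar(t, s=0.008):
    """Cosine noise schedule: cumulative signal fraction at timestep t,
    decreasing from ~1 (clean) to ~0 (pure noise). The offset s and
    normalization by f(0) follow standard practice (an assumption here)."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def corrupt_reference_points(r0, t, rng=np.random.default_rng(0)):
    """Forward diffusion q(r_t | r_0): blend ground-truth box coordinates
    r0 (assumed normalized to [0, 1]) with Gaussian noise at timestep t."""
    eps = rng.standard_normal(r0.shape)
    ab = alpha_bar(t)
    r_t = np.sqrt(ab) * r0 + np.sqrt(1 - ab) * eps
    return r_t, eps  # eps is the regression target for the denoiser

# Example: corrupt 4 ground-truth (cx, cy, w, h) reference points
r0 = np.array([[0.5, 0.5, 0.2, 0.3]] * 4)
t = 37  # a timestep drawn from U(0, 100) during training
r_t, eps = corrupt_reference_points(r0, t)
```

Given the returned `eps`, the clean points are exactly recoverable by inverting the blend, which is what makes noise prediction a valid training target.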
At inference, reference points start as pure Gaussian noise and are iteratively denoised using DDIM sampling. The decoder predicts noise residuals conditioned on image features, requiring only 3 forward passes through the decoder.
The decoder integrates timestep embeddings after self-attention, conditioning each layer on the current diffusion step: $q_n = \text{FFN}(\text{MSDA}(\text{SA}(q_{n-1}) + t,\; r_t,\; O_{\text{enc}}))$
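A sketch of how such a timestep embedding is typically computed: the sinusoidal form below is the standard choice in diffusion models and is an assumption here; the paper only states that a timestep embedding is added to the queries after self-attention.

```python
import numpy as np

def timestep_embedding(t, dim=256):
    """Sinusoidal timestep embedding (standard diffusion-model choice,
    assumed here): dim/2 sine and dim/2 cosine features at
    geometrically spaced frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

# Per-layer data flow implied by the formula (names are illustrative):
#   q = SA(q_prev)                    # self-attention over queries
#   q = q + timestep_embedding(t)     # condition on the diffusion step
#   q = MSDA(q, r_t, O_enc)           # deformable cross-attention at the
#                                     # noisy reference points r_t over
#                                     # the encoded image features O_enc
#   q = FFN(q)
emb = timestep_embedding(37, dim=256)
```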
Only the lightweight decoder is run multiple times; the backbone and encoder execute once. With 3 decoder evaluations, the overhead is only ~17% extra FLOPs while achieving the best accuracy.
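The control flow behind this efficiency claim can be sketched with stand-in stubs (`backbone_encoder`, `decoder`, `ddim_step`, and the particular timestep subsequence are all illustrative placeholders, not the paper's implementation):

```python
# Inference loop sketch: the backbone and encoder run once; only the
# lightweight decoder is evaluated at each DDIM sampling step.
calls = {"encoder": 0, "decoder": 0}

def backbone_encoder(image):
    calls["encoder"] += 1
    return "multi_scale_features"      # placeholder for O_enc

def decoder(features, r_t, t):
    calls["decoder"] += 1
    return "predicted_noise"           # placeholder for eps_hat

def ddim_step(r_t, eps_hat, t):
    return r_t                         # placeholder DDIM update

def detect(image, steps=(100, 66, 33)):  # 3 decoder evaluations
    feats = backbone_encoder(image)    # executed exactly once
    r = "gaussian_noise"               # reference points ~ N(0, I)
    for t in steps:
        eps_hat = decoder(feats, r, t)
        r = ddim_step(r, eps_hat, t)
    return r

detect("img")
```

Because the expensive feature extraction is amortized over all sampling steps, extra decoder evaluations add only a small fraction of the total FLOPs.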
Detailed decoder layer design showing how timestep embeddings are injected after self-attention, followed by multi-scale deformable cross-attention with noisy reference points attending to encoded image features.
$$r_{t-1} = \frac{\sqrt{\bar{\alpha}_{t-1}}}{\sqrt{\bar{\alpha}_t}} \cdot r_t + \left(\sqrt{1 - \bar{\alpha}_{t-1}} - \frac{\sqrt{\bar{\alpha}_{t-1}} \cdot \sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}\right) \cdot \hat{\epsilon}_\theta(r_t, t)$$
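This deterministic DDIM update can be implemented directly. A minimal NumPy sketch, assuming the cosine schedule form below (the offset `s = 0.008` is a common convention, not a stated detail):

```python
import numpy as np

T = 100  # total diffusion timesteps

def alpha_bar(t, s=0.008):
    """Cosine noise schedule (assumed form): cumulative signal fraction."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def ddim_step(r_t, eps_hat, t, t_prev):
    """One deterministic DDIM update, term-by-term as in the equation:
    r_{t_prev} = coef_r * r_t + coef_e * eps_hat."""
    ab_t, ab_p = alpha_bar(t), alpha_bar(t_prev)
    coef_r = np.sqrt(ab_p) / np.sqrt(ab_t)
    coef_e = np.sqrt(1 - ab_p) - np.sqrt(ab_p) * np.sqrt(1 - ab_t) / np.sqrt(ab_t)
    return coef_r * r_t + coef_e * eps_hat
```

A useful sanity check: if the network predicted the true noise `eps` exactly, one step maps $r_t = \sqrt{\bar{\alpha}_t}\,r_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ onto the correctly noised $r_{t_{prev}}$ with the same $r_0$ and $\epsilon$.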
Consistent improvements across three challenging benchmarks and multiple backbones.
COCO val2017 AP (%) vs. training epochs for DiffuDINO, DiffuDETR, and various baselines including DINO, DN-DETR, DiffusionDet, Deformable-DETR, and more. DiffuDINO converges to the highest AP, surpassing all baseline methods.
Comparison with different methods on COCO 2017 validation set.
| Model | Backbone | Epochs | AP | AP₅₀ | AP₇₅ | APₛ | APₘ | APₗ |
|---|---|---|---|---|---|---|---|---|
| Pix2Seq | R50 | 300 | 43.2 | 61.0 | 46.1 | 26.6 | 47.0 | 58.6 |
| DiffusionDet | R50 | — | 46.8 | 65.3 | 51.8 | 29.6 | 49.3 | 62.2 |
| Deformable DETR | R50 | 50 | 48.2 | 67.0 | 52.2 | 30.7 | 51.4 | 63.0 |
| Align-DETR | R50 | 24 | 51.4 | 69.1 | 55.8 | 35.5 | 54.6 | 65.7 |
| DINO | R50 | 36 | 50.9 | 69.0 | 55.3 | 34.6 | 54.1 | 64.6 |
| DiffuDETR (Ours) | R50 | 50 | 50.2 (+2.0) | 66.8 | 55.2 | 33.3 | 53.9 | 65.8 |
| DiffuAlignDETR (Ours) | R50 | 24 | 51.9 (+0.5) | 69.2 | 56.4 | 34.9 | 55.6 | 66.2 |
| DiffuDINO (Ours) | R50 | 50 | 51.9 (+1.0) | 69.4 | 55.7 | 35.8 | 55.7 | 67.1 |
| Pix2Seq | R101 | 300 | 44.5 | 62.8 | 47.5 | 26.0 | 48.2 | 60.3 |
| DiffusionDet | R101 | — | 47.5 | 65.7 | 52.0 | 30.8 | 50.4 | 63.1 |
| Align-DETR | R101 | 12 | 51.2 | 68.8 | 55.7 | 32.9 | 55.1 | 66.6 |
| DINO | R101 | 12 | 50.0 | 67.7 | 54.4 | 32.2 | 53.4 | 64.3 |
| DiffuAlignDETR (Ours) | R101 | 12 | 51.7 (+0.5) | 69.3 | 56.1 | 34.0 | 55.6 | 67.0 |
| DiffuDINO (Ours) | R101 | 12 | 51.2 (+1.2) | 68.6 | 55.8 | 33.2 | 55.6 | 67.2 |
Results on LVIS validation set. Notable gains over DINO (+2.4 AP with R50).
| Model | Backbone | AP | AP₅₀ | APr | APc | APf |
|---|---|---|---|---|---|---|
| DINO | R50 | 26.5 | 35.9 | 9.2 | 24.6 | 36.2 |
| DiffuDINO (Ours) | R50 | 28.9 (+2.4) | 38.5 | 13.7 | 27.6 | 36.9 |
| DINO | R101 | 30.9 | 40.4 | 13.9 | 29.7 | 39.7 |
| DiffuDINO (Ours) | R101 | 32.5 (+1.6) | 42.4 | 13.5 | 32.0 | 41.5 |
Results on V3Det with 13,204 categories. Massive +8.3 AP gain with Swin-B backbone.
| Model | Backbone | AP | AP₅₀ | AP₇₅ |
|---|---|---|---|---|
| DINO | R50 | 33.5 | 37.7 | 35.0 |
| DiffuDINO (Ours) | R50 | 35.7 (+2.2) | 41.4 | 37.7 |
| DINO | Swin-B | 42.0 | 46.8 | 43.9 |
| DiffuDINO (Ours) | Swin-B | 50.3 (+8.3) | 56.6 | 52.9 |
Visual comparison of detection results showing improvements over baselines.
Side-by-side qualitative comparison on COCO 2017 val: Deformable DETR vs. DiffuDETR and DINO vs. DiffuDINO. Our diffusion-based models produce more accurate and complete detections, especially in crowded scenes with overlapping objects.
DiffuDINO results at different decoder evaluation steps ($t = 1, 3, 5, 10$) compared to DINO baseline and ground truth. Even $t = 1$ already surpasses DINO.
Same comparison on LVIS validation, showing improved handling of long-tail categories and fine-grained objects across varying decoder steps.
Detailed analysis of design choices on COCO 2017 validation.
Multi-Seed Robustness: Across 5 random seeds, the standard deviation remains below ±0.2 AP in all settings, demonstrating that DiffuDINO is highly stable regardless of the initialization noise, even on dense and sparse scene subsets.