¹Southeast University, China · ²Alibaba · ³Nanyang Technological University, Singapore
We compare dLLMs with autoregressive LLMs on embodied (AgentBoard) and tool-calling (BFCL) benchmarks. On both, dLLMs lag behind: lower success/progress rates on AgentBoard and lower tool-calling accuracy on BFCL.
(a) Failure of Replan for embodied agents: dLLMs fall into retry loops significantly more often than autoregressive LLMs.
(b) Failure of Precision for tool-calling agents: dLLMs are more prone to emit tool calls whose JSON is malformed or violates the expected schema (see the validation sketch after this list).
(c) Performance-Efficiency Trade-offs: despite their higher inference efficiency, dLLMs do not deliver agentic performance comparable to autoregressive LLMs.
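
To make the precision failure in (b) concrete, here is a minimal sketch of how a tool-calling benchmark such as BFCL can check a model's output: parse it as JSON and validate it against the tool's schema. The get_weather tool, its schema, and the is_well_formed helper are illustrative assumptions, not artifacts from the paper or from BFCL.

import json
from jsonschema import validate, ValidationError

# Hypothetical tool schema for illustration; real benchmarks ship per-tool schemas.
GET_WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"const": "get_weather"},
        "arguments": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
            "additionalProperties": False,
        },
    },
    "required": ["name", "arguments"],
}

def is_well_formed(raw_output: str) -> bool:
    """True iff the model output parses as JSON and satisfies the tool schema."""
    try:
        call = json.loads(raw_output)
        validate(instance=call, schema=GET_WEATHER_SCHEMA)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False

# Parallel decoding can drop a brace or quote, which this check catches:
print(is_well_formed('{"name": "get_weather", "arguments": {"city": "Nanjing"}}'))  # True
print(is_well_formed('{"name": "get_weather", "arguments": {"city": "Nanjing"'))    # False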
To better understand the agentic potential of dLLMs, we introduce DiffuAgent, a novel evaluation framework that treats dLLMs as plug-and-play cognitive modules for augmenting LLM agents; a minimal interface sketch follows the findings below.
• dLLMs are competitive memory modules for memory-augmented agents.
• LLM verifiers tend to trigger premature early exits, whereas dLLM verifiers terminate more reliably.
• dLLMs are effective tool selectors but struggle as tool-call editors.
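
Below is a minimal sketch of the plug-and-play idea behind DiffuAgent, under stated assumptions: the paper's actual interface is not reproduced here, so CognitiveModule, generate, and MemoryAugmentedAgent are hypothetical names. The point is that each cognitive role (memory, verification, planning) can be filled by either a dLLM or an autoregressive LLM and swapped independently.

from typing import Protocol

class CognitiveModule(Protocol):
    """Assumed interface: any dLLM or autoregressive LLM backend fills a role."""
    def generate(self, prompt: str) -> str: ...

class MemoryAugmentedAgent:
    """Illustrative agent whose cognitive roles are independently swappable."""

    def __init__(self, planner: CognitiveModule, memory: CognitiveModule,
                 verifier: CognitiveModule) -> None:
        self.planner = planner
        self.memory = memory      # slot where the paper finds dLLMs competitive
        self.verifier = verifier  # slot where LLM verifiers risk premature exits
        self.history: list[str] = []

    def step(self, observation: str) -> str:
        # Compress the interaction history into working memory.
        summary = self.memory.generate("Summarize:\n" + "\n".join(self.history))
        # Plan the next action conditioned on memory and the new observation.
        action = self.planner.generate(
            f"Memory: {summary}\nObservation: {observation}\nAction:"
        )
        # Verify before committing; an over-eager verifier causes early exits.
        verdict = self.verifier.generate(f"Is this action valid (yes/no)? {action}")
        self.history.append(f"{observation} -> {action}")
        return action if verdict.strip().lower().startswith("yes") else "replan"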
@article{lu2026diffuagent,
title = {The Bitter Lesson of Diffusion Language Models for Agentic Workflows: A Comprehensive Reality Check},
author = {Lu, Qingyu and Ding, Liang and Liu, Xuebo and Zhang, Kanjian and Zhang, Jinxia and Tao, Dacheng},
journal = {arXiv preprint arXiv:2601.12979},
year = {2026},
url = {https://arxiv.org/pdf/2601.12979}
}