Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models
Preprint 2026

1 City University of Hong Kong
2 Huawei Research
3 The University of Hong Kong
* Equal contribution. C: Corresponding authors. P: Project lead.
[Figure: TABOM framework overview]

TL;DR

🚀 Diffusion Language Models generate text by progressively unmasking tokens, usually from easy tokens to hard tokens. However, standard supervised fine-tuning still trains them as if all masked tokens were equally important and equally difficult. TABOM fixes this training-inference mismatch by learning from the model's own self-distilled decoding trajectories and adding a ranking objective that teaches the model which tokens should become confident earlier.

✅ The result is a post-training recipe that turns self-distilled trajectories into real capability gains, rather than using them only for faster sampling.

Abstract

Diffusion Language Models (DLMs) have recently emerged as a promising alternative to autoregressive language models, offering stronger global awareness and highly parallel generation. However, post-training DLMs with standard Negative Evidence Lower Bound (NELBO)-based supervised fine-tuning remains inefficient: training reconstructs randomly masked tokens in a single step, whereas inference follows a confidence-guided, multi-step easy-to-hard denoising trajectory. Recent trajectory-based self-distillation methods exploit such inference trajectories mainly for sampling-step compression and acceleration, often improving decoding efficiency without substantially enhancing the model's underlying capability, and may even degrade performance under full diffusion decoding.

We propose Trajectory-Aligned Optimization via Boltzmann Modeling (TABOM), a self-distilled trajectory-based post-training framework that aligns training with the easy-to-hard structure of inference. TABOM models the inference unmasking preference as a Boltzmann distribution over predictive entropies and derives a tractable pairwise ranking objective to align the model's certainty ordering with the observed decoding trajectory. Empirically, TABOM achieves substantial gains in new domains, expands the effective knowledge boundary of DLMs, and significantly mitigates catastrophic forgetting compared with standard SFT.

Why Is Standard SFT Not Enough?

🔍 The key difficulty is a hidden mismatch between training and inference. During NELBO-style SFT, a DLM receives a randomly masked sequence and is asked to reconstruct all masked tokens in one shot. This objective is simple and scalable, but it implicitly treats all masked positions uniformly. In other words, a very easy token and a very hard token contribute to the same reconstruction objective without considering their order in the actual decoding process.

Inference behaves very differently. A DLM does not recover all tokens at once. It repeatedly estimates token confidence or entropy, unmasks a subset of high-confidence tokens, and then uses these newly recovered tokens as context for later decisions. This creates an easy-to-hard trajectory: easy tokens should become certain first, while difficult tokens are resolved later with richer context.

  • Uniform training bias: all masked tokens are reconstructed under the same objective.
  • Easy-to-hard inference bias: high-confidence tokens are decoded earlier and shape later context.
  • Core question: can self-distilled trajectories be used for genuine knowledge acquisition, not merely faster decoding?
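The easy-to-hard inference loop described above can be sketched in a few lines. This is only a toy illustration, not the decoding code of any particular DLM: `predict`, `toy_predict`, `decode_easy_to_hard`, and `tokens_per_step` are hypothetical names, and the "model" is a hand-written function whose confidence for a position improves once its neighbours are known.

```python
import math

MASK = None  # placeholder for a masked position


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decode_easy_to_hard(predict, length, tokens_per_step=1):
    """Unmask the lowest-entropy positions first.

    `predict(seq)` is a hypothetical callable that returns, for every
    position, a distribution over the vocabulary given the partial sequence.
    Returns the final sequence and the step at which each position was unmasked.
    """
    seq = [MASK] * length
    step_of = [None] * length
    step = 0
    while any(tok is MASK for tok in seq):
        dists = predict(seq)
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        # Rank masked positions from most certain to least certain.
        masked.sort(key=lambda i: entropy(dists[i]))
        for i in masked[:tokens_per_step]:
            # Commit the most likely token at this position.
            seq[i] = max(range(len(dists[i])), key=lambda v: dists[i][v])
            step_of[i] = step
        step += 1
    return seq, step_of


def toy_predict(seq):
    """Toy model over a 2-token vocabulary (an assumption for illustration).

    Position 2 is always easy; position 0 becomes easy once position 2 is
    known; position 1 becomes easy once position 0 is known.
    """
    d0 = [0.1, 0.9] if seq[2] is not MASK else [0.5, 0.5]
    d1 = [0.8, 0.2] if seq[0] is not MASK else [0.6, 0.4]
    d2 = [0.9, 0.1]
    return [d0, d1, d2]


final_seq, unmask_step = decode_easy_to_hard(toy_predict, length=3)
```

In this toy run the positions are resolved in the order 2, 0, 1 rather than left to right, which is the kind of ordering information a uniform reconstruction loss never sees.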

💡 Self-distilled trajectories are attractive because they lie on the pretrained model's own distributional manifold. But if we still train on them with a uniform reconstruction loss, the model sees the trajectory data without learning the ordering preference behind it. This is the central issue TABOM targets.

Empirical Observation

📉 Our preliminary diagnostics compare offline ground-truth SFT and self-distilled SFT across code generation and mathematical reasoning. The observation is consistent: self-distilled trajectories reduce the optimization barrier and help avoid catastrophic forgetting, but they still provide only marginal gains when trained with the same NELBO objective. The bottleneck is not only the data source, but also the training objective.

[Figure: Code-domain loss-ratio diagnostic]
[Figure: Math-domain loss-ratio diagnostic]

How TABOM Works

🧭 TABOM keeps the useful part of self-distillation while changing what the model is asked to learn. Instead of only reconstructing tokens from trajectory states, it also learns the relative certainty ordering encoded by the trajectory. Intuitively, if token A is decoded before token B in an entropy-guided trajectory, the model should assign token A lower entropy than token B under the corresponding context.

As a foundation, TABOM uses self-distilled trajectory states as reconstruction contexts instead of relying only on uniformly random masks. The key difference, however, is not just the data: TABOM explicitly models and optimizes the inference-time ordering preference.
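To make "trajectory states as reconstruction contexts" concrete, here is a minimal sketch of how one recorded trajectory could be turned into partially masked contexts. The data layout is an assumption made for illustration (the page does not specify it): `step_of[i]` is taken to be the decoding step at which position `i` was unmasked, and `trajectory_states` is a hypothetical helper.

```python
MASK = "<mask>"


def trajectory_states(tokens, step_of):
    """Yield (step, context) pairs from one recorded decoding trajectory.

    At step t the context keeps every token that was unmasked strictly
    before t and masks everything else, so the contexts follow the
    model's own easy-to-hard order instead of a uniformly random mask.
    """
    for t in sorted(set(step_of)):
        context = [tok if s < t else MASK for tok, s in zip(tokens, step_of)]
        yield t, context


# Hypothetical trajectory: "def" was decoded first, the brackets next, "f" last.
states = list(trajectory_states(["def", "f", "(", ")"], [0, 2, 1, 1]))
```

Each yielded context could then serve as the input for reconstructing the still-masked tokens, while the recorded steps provide the ordering signal that TABOM additionally optimizes.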

Boltzmann Modeling of Inference Preference

🔥 We view the easy-to-hard unmasking behavior as a probability distribution over possible unmasked states. States containing easier, lower-entropy tokens should receive higher probability. TABOM formalizes this preference as a Boltzmann distribution over predictive entropies, turning training-inference alignment into a principled distribution matching problem.

$$ q_{\mathrm{infer}}^\star(U \mid \mathbf{x}_0) = \frac{1}{Z_\theta} \exp\!\left( -\beta \sum_{r\in U} h_\theta(r;\tau) \right), \qquad h_\theta(r;\tau)=\mathcal{H}_\theta(x_0^r \mid \mathbf{x}_0^{U_t}, \mathbf{s}). $$
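For intuition, the distribution above can be evaluated by brute force on a tiny example. The sketch below assumes fixed per-token entropies and restricts the support to subsets of a fixed size `k`; both are simplifications chosen for illustration, and `boltzmann_over_subsets` is a hypothetical helper rather than part of TABOM.

```python
import math
from itertools import combinations


def boltzmann_over_subsets(entropies, k, beta=1.0):
    """Return {subset: probability} over all size-k subsets of positions.

    The energy of a subset U is the sum of its per-token entropies, and its
    probability is exp(-beta * energy) / Z, with the partition function Z
    computed by enumerating every subset.
    """
    energies = {
        subset: sum(entropies[r] for r in subset)
        for subset in combinations(range(len(entropies)), k)
    }
    z = sum(math.exp(-beta * e) for e in energies.values())
    return {subset: math.exp(-beta * e) / z for subset, e in energies.items()}


# Two easy tokens (low entropy) and one hard token (high entropy).
probs = boltzmann_over_subsets([0.1, 0.5, 2.0], k=2, beta=1.0)
uniform = boltzmann_over_subsets([0.1, 0.5, 2.0], k=2, beta=0.0)
```

With these numbers the subset containing the two easy tokens, `(0, 1)`, receives roughly 0.73 of the probability mass, while `beta=0` recovers a uniform preference over subsets. The enumeration grows combinatorially with sequence length, so it is only feasible for toy inputs.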

Energy-Based Pairwise Ranking

🎯 Ideally, training should make the model-induced unmasking distribution match the Boltzmann target above. This gives a KL alignment objective:

$$ \min_{\theta}\ D_{\mathrm{KL}}\!\left( q_{\mathrm{infer}}^\star(U \mid \mathbf{x}_0) \,\|\, p_\theta(U \mid \mathbf{x}_0,\mathbf{s}) \right). $$

Directly optimizing this KL objective is intractable because it requires sampling from the global target distribution and computing the partition function. TABOM therefore derives a local pairwise ranking loss that shapes the entropy landscape so that earlier decoded tokens become more certain than later decoded tokens within a local timestep window.
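One way to see why pairwise comparisons sidestep the partition function is to compare two candidate states that share a common set $U'$ and differ in a single token, $r$ versus $s$. This is a sketch under the simplifying assumption that all entropies are evaluated in the same context $\tau$; it is not necessarily the exact derivation used in the paper.

$$ \frac{q_{\mathrm{infer}}^\star(U'\cup\{r\} \mid \mathbf{x}_0)}{q_{\mathrm{infer}}^\star(U'\cup\{s\} \mid \mathbf{x}_0)} = \frac{\exp\!\left(-\beta\left[\sum_{u\in U'} h_\theta(u;\tau) + h_\theta(r;\tau)\right]\right)}{\exp\!\left(-\beta\left[\sum_{u\in U'} h_\theta(u;\tau) + h_\theta(s;\tau)\right]\right)} = \exp\!\big(-\beta\,\big(h_\theta(r;\tau)-h_\theta(s;\tau)\big)\big). $$

Both $Z_\theta$ and the shared terms cancel, so the target prefers unmasking $r$ over $s$ exactly when $h_\theta(r;\tau) < h_\theta(s;\tau)$. The preference between two tokens therefore depends only on their entropy difference, which is a quantity a pairwise loss can act on directly.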

$$ \mathcal{L}_{\mathrm{rank}} = \frac{2}{W(W-1)} \sum_{(r,s)\in\mathcal{P}_{t,t'}} \max\big(0,\ h_\theta(r;\tau)-h_\theta(s;\tau)+\gamma\big), \qquad \mathcal{L}_{\mathrm{TABOM}} = \mathcal{L}_{\mathrm{traj}}+\lambda \mathcal{L}_{\mathrm{rank}}. $$

✨ This local ranking view is important in practice. Comparing tokens decoded at vastly different stages can be noisy because their difficulties are inherently different. TABOM therefore ranks tokens inside a local decoding window, which better matches the step-wise nature of diffusion inference.
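A minimal sketch of such a windowed hinge ranking loss is given below. It makes several assumptions that are not taken from the paper: entropies are passed in as plain numbers rather than computed from a model at the matching trajectory state, the window is defined on decoding-step differences, and the loss is averaged over the pairs found instead of using the $2/(W(W-1))$ normalization. `local_rank_loss`, `window`, and `margin` are hypothetical names.

```python
def local_rank_loss(entropies, step_of, window=2, margin=0.1):
    """Hinge ranking loss over token pairs decoded close together in time.

    A pair (r, s) is used when r was decoded strictly earlier than s and
    at most `window` steps apart. The pair is penalised by
    max(0, h[r] - h[s] + margin), so the loss is zero only when the
    earlier token is more certain than the later one by at least `margin`.
    """
    n = len(step_of)
    pairs = [
        (r, s)
        for r in range(n)
        for s in range(n)
        if 0 < step_of[s] - step_of[r] <= window
    ]
    if not pairs:
        return 0.0
    total = sum(max(0.0, entropies[r] - entropies[s] + margin) for r, s in pairs)
    return total / len(pairs)


# Entropy order agrees with decoding order: no penalty.
aligned = local_rank_loss([0.1, 0.5, 0.9], step_of=[0, 1, 2], window=1)
# The token decoded at step 1 is less certain than the one decoded at step 2.
misaligned = local_rank_loss([0.2, 0.9, 0.5], step_of=[0, 1, 2], window=1)
```

With `window=1` the pair of tokens decoded at steps 0 and 2 is never compared, which mirrors the point above that distant decoding stages make for noisy comparisons.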

What Do the Experiments Show?

We evaluate TABOM on Dream-7B-Instruct and LLaDA-8B-Instruct across two post-training domains: code generation and mathematical reasoning. Each fine-tuned model is evaluated not only on its in-domain tasks, but also on out-of-distribution tasks. This matters because standard SFT can improve a target domain while damaging the pretrained model's broader capabilities.

Code Generation Fine-Tuning

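Base Model: Dream-7B-Instruct

| Method | HumanEval (In-Domain) | MBPP (In-Domain) | Avg. (In-Domain) | GSM8K (OOD) | MATH500 (OOD) | IFEval (OOD) | Avg. (OOD) |
|---|---|---|---|---|---|---|---|
| No-SFT | 52.66 | 58.00 | 55.33 | 81.41 | 39.80 | 56.56 | 59.26 |
| SFT-GT | 61.55 | 58.00 | 59.78 | 52.33 | 32.40 | 46.21 | 43.65 |
| SFT-SD | 53.66 | 59.20 | 56.43 | 81.81 | 41.60 | 57.10 | 60.17 |
| dInfer | 57.31 | 58.20 | 57.76 | 81.88 | 39.80 | 57.30 | 59.66 |
| T3D | 55.48 | 58.70 | 57.09 | 81.84 | 40.70 | 57.20 | 59.91 |
| TABOM | 60.36 | 60.60 | 60.48 | 81.73 | 42.40 | 55.45 | 59.86 |

Base Model: LLaDA-8B-Instruct

| Method | HumanEval (In-Domain) | MBPP (In-Domain) | Avg. (In-Domain) | GSM8K (OOD) | MATH500 (OOD) | IFEval (OOD) | Avg. (OOD) |
|---|---|---|---|---|---|---|---|
| No-SFT | 36.01 | 39.20 | 37.61 | 76.12 | 36.20 | 33.08 | 48.47 |
| SFT-GT | 42.01 | 32.80 | 37.41 | 70.73 | 35.40 | 34.93 | 47.02 |
| SFT-SD | 39.63 | 38.80 | 39.22 | 76.95 | 35.80 | 32.90 | 48.55 |
| dInfer | 41.46 | 38.60 | 40.03 | 77.33 | 36.60 | 33.82 | 49.25 |
| T3D | 40.54 | 38.70 | 39.62 | 77.14 | 36.20 | 33.36 | 48.90 |
| TABOM | 42.68 | 40.00 | 41.34 | 77.33 | 38.20 | 34.19 | 49.91 |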

Mathematical Reasoning Fine-Tuning

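Base Model: Dream-7B-Instruct

| Method | GSM8K (In-Domain) | MATH500 (In-Domain) | Avg. (In-Domain) | HumanEval (OOD) | MBPP (OOD) | IFEval (OOD) | Avg. (OOD) |
|---|---|---|---|---|---|---|---|
| No-SFT | 81.41 | 39.80 | 60.61 | 52.66 | 58.00 | 56.56 | 55.74 |
| SFT-GT | 80.12 | 37.40 | 58.76 | 46.34 | 58.00 | 53.23 | 52.52 |
| SFT-SD | 81.95 | 39.80 | 60.88 | 57.92 | 58.60 | 56.01 | 57.51 |
| dInfer | 82.33 | 41.60 | 61.97 | 56.11 | 58.80 | 55.82 | 56.91 |
| T3D | 82.14 | 40.70 | 61.42 | 57.01 | 58.70 | 55.91 | 57.21 |
| TABOM | 84.31 | 41.10 | 62.71 | 58.54 | 59.20 | 56.19 | 57.98 |

Base Model: LLaDA-8B-Instruct

| Method | GSM8K (In-Domain) | MATH500 (In-Domain) | Avg. (In-Domain) | HumanEval (OOD) | MBPP (OOD) | IFEval (OOD) | Avg. (OOD) |
|---|---|---|---|---|---|---|---|
| No-SFT | 76.12 | 36.20 | 56.16 | 36.01 | 39.20 | 33.08 | 36.10 |
| SFT-GT | 74.29 | 35.50 | 54.90 | 31.09 | 40.60 | 26.98 | 32.89 |
| SFT-SD | 75.96 | 35.70 | 55.83 | 36.58 | 39.80 | 34.38 | 36.92 |
| dInfer | 76.72 | 36.50 | 56.61 | 38.90 | 39.40 | 34.38 | 37.56 |
| T3D | 76.34 | 36.10 | 56.22 | 37.74 | 39.60 | 34.38 | 37.24 |
| TABOM | 78.62 | 36.80 | 57.71 | 40.30 | 40.10 | 32.98 | 37.79 |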

🏆 TABOM consistently achieves the strongest in-domain average across both base models and training domains. More importantly, it preserves or improves OOD performance, avoiding the catastrophic forgetting often observed when standard SFT is applied to new-domain data.

📌 The tables highlight the main story of the paper. SFT-SD is safer than ground-truth SFT because it stays closer to the pretrained model's manifold, but its gains are often small. TABOM keeps this stability while turning the self-distilled trajectories into larger performance improvements.
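The trade-off described above can be read off the reported averages with a little arithmetic. The snippet below uses only the in-domain and OOD averages from the Dream-7B-Instruct rows of the code fine-tuning table and subtracts the No-SFT baseline; `deltas_vs_base` is a hypothetical helper written for this illustration.

```python
# (in-domain avg, OOD avg) copied from the Dream-7B-Instruct code fine-tuning rows.
dream_code = {
    "No-SFT": (55.33, 59.26),
    "SFT-GT": (59.78, 43.65),
    "SFT-SD": (56.43, 60.17),
    "dInfer": (57.76, 59.66),
    "T3D": (57.09, 59.91),
    "TABOM": (60.48, 59.86),
}


def deltas_vs_base(rows, base="No-SFT"):
    """Change in (in-domain avg, OOD avg) relative to the untuned model."""
    base_in, base_ood = rows[base]
    return {
        name: (round(in_avg - base_in, 2), round(ood_avg - base_ood, 2))
        for name, (in_avg, ood_avg) in rows.items()
        if name != base
    }


deltas = deltas_vs_base(dream_code)
```

On these rows SFT-GT gains 4.45 points in-domain but loses 15.61 points OOD, SFT-SD gains 1.10 and 0.91, and TABOM gains 5.15 and 0.60. That matches the narrative: ground-truth SFT buys in-domain accuracy at the cost of forgetting, self-distilled SFT is stable but modest, and TABOM combines the larger in-domain gain with preserved OOD performance.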

A Diagnostic View: Trajectory Discrimination Score

🔬 To check whether a model really learns the easy-to-hard structure, we introduce Trajectory Discrimination Score (TDS). At each decoding step, TDS measures the variance of predictive entropy across currently masked tokens. A high TDS means the model assigns clearly different uncertainty levels to different tokens; a low TDS means the model treats masked tokens almost uniformly.
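As a concrete reading of this definition, the per-step score can be computed as follows. The sketch assumes the population variance (dividing by the number of masked tokens) and entropies in nats; the page does not state either choice, and `trajectory_discrimination_score` is a hypothetical function name.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def trajectory_discrimination_score(dists, masked):
    """Variance of predictive entropy across the currently masked positions.

    `dists[i]` is the predicted distribution at position i and `masked`
    lists the positions that are still masked at this decoding step.
    """
    hs = [entropy(dists[i]) for i in masked]
    if len(hs) < 2:
        return 0.0
    mean = sum(hs) / len(hs)
    return sum((h - mean) ** 2 for h in hs) / len(hs)


# A model that treats both masked tokens identically vs. one that separates them.
flat = trajectory_discrimination_score([[0.5, 0.5], [0.5, 0.5]], masked=[0, 1])
sharp = trajectory_discrimination_score([[0.99, 0.01], [0.5, 0.5]], masked=[0, 1])
```

Here `flat` is zero because both tokens look equally uncertain, while `sharp` is positive because one token is clearly easier than the other.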

[Figure: MBPP trajectory discrimination score curve]
[Figure: GSM8K trajectory discrimination score curve]

Higher TDS indicates stronger discrimination between easy and hard tokens at the same decoding step. TABOM produces a more discriminative entropy landscape, providing mechanistic evidence that its gains come from better trajectory alignment rather than from simply reusing self-generated samples.

✅ In other words, TABOM does not merely expose the model to more intermediate states. It changes the shape of the model's uncertainty landscape so that the learned model behaves more like the inference process it will actually use.

BibTeX

@article{chen2026tabom,
  title={Self-Distilled Trajectory-Aware Boltzmann Modeling: Bridging the Training-Inference Discrepancy in Diffusion Language Models},
  author={Chen, Kecheng and Liu, Ziru and Tao, Xijia and Liu, Hui and Liu, Yibing and Fu, Xinyu and Wu, Shi and Zhang, Suiyun and Tu, Dandan and Kong, Lingpeng and Liu, Rui and Li, Haoliang},
  journal={arXiv preprint},
  year={2026}
}
