SAT-Mask: Self-Aligned Trajectory Masking for Diffusion Language Models

Introduction

Autoregressive language models scale well, but their left-to-right factorization fixes generation to a single causal order. That order is useful for next-token prediction, yet it also limits bidirectional dependency modeling, non-monotonic planning, and intrinsic error correction. Masked diffusion models (MDMs) are interesting precisely because they replace this monotone generation path with a global denoising process: tokens can be filled in any order, partial contexts are bidirectional, and multiple positions can be updated in parallel [1~4].

The same flexibility creates the central dilemma of the paper. During training, the usual MDM objective corrupts a clean sequence by independently masking positions. This gives convenient order-agnostic supervision, but it also forces the denoiser to fit a combinatorially large family of arbitrary mask patterns. During inference, however, generation is not arbitrary. A sampler follows a concrete unmasking trajectory, often selecting positions using confidence, margin, entropy, or a planner [5~7]. The model is therefore trained on one state distribution and evaluated along another.

Existing fixes mostly intervene at inference time: use a better selection rule, choose safer tokens first, or allow remasking. These samplers can reduce local uncertainty, but they do not change the training distribution that shaped the denoiser in the first place. Training-side schedules are closer in spirit, but if they still rely on heuristic or uniformly random states, they leave the capacity dilution problem unresolved.

SAT-Mask addresses the issue at the source. It constructs training states by first over-noising a clean example and then partially denoising it with the same kind of confidence-guided transition used at inference. The resulting state is no longer an arbitrary mask subset; it is a local sample from the model’s own easy-to-hard trajectory.

The paper’s logic is:

Random masking pays an explicit information tax by asking the model to distinguish arbitrary mask states.
Train-inference mismatch accumulates as exposure bias because local transition errors compound along the reverse chain.
SAT-Mask reduces both failures by using a shared training-inference transition kernel.
This improves quality and efficiency across problem solving, text generation, and math reasoning.

Background

Let $\mathbf{x}_0\in\mathcal{X}^L$ be a clean sequence and let $\mathbf{x}_t\in\bar{\mathcal{X}}^L$ be a corrupted state, where $\bar{\mathcal{X}}=\mathcal{X}\cup\{\mathrm{m}\}$ and $\mathrm{m}$ is the mask token. In the standard absorbing corruption process, each coordinate is independently kept with probability $\alpha_t$ and masked otherwise:

q_{t|0}(\mathbf{x}_t\mid\mathbf{x}_0) = \prod_{i=1}^{L} \left[ \alpha_t\delta_{x_0^i}(x_t^i) + (1-\alpha_t)\delta_{\mathrm{m}}(x_t^i) \right].
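As a minimal sketch of this corruption process, the snippet below masks each token independently. The token representation, the `MASK` placeholder, and the function name `corrupt` are illustrative assumptions, not part of the paper.

```python
import random

MASK = "<m>"  # stand-in for the mask token m (the name is an assumption)

def corrupt(x0, alpha_t, rng):
    """Keep each token independently with probability alpha_t, else mask it."""
    return [tok if rng.random() < alpha_t else MASK for tok in x0]

rng = random.Random(0)
x0 = list("diffusion")
xt = corrupt(x0, alpha_t=0.5, rng=rng)  # every position is either x0[i] or MASK
```

Because each coordinate is treated independently, every subset of positions can appear as the mask set, which is the source of the order-agnostic supervision discussed above.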

The denoiser learns a clean-token posterior

p_\theta^i(a\mid \mathbf{x}_t,t) \equiv p_\theta(x_0^i=a\mid \mathbf{x}_t,t),

and the continuous-time objective reduces to a weighted masked-token reconstruction loss:

\mathcal{L}_{\mathrm{MDM}}(\theta) = \mathbb{E}_{\mathbf{x}_0} \int_0^1 \mathbb{E}_{\mathbf{x}_t\sim q_{t|0}} \left[ \frac{-\dot{\alpha}_t}{1-\alpha_t} \sum_{i\in\mathcal{M}(\mathbf{x}_t)} -\log p_\theta^i(x_0^i\mid \mathbf{x}_t,t) \right]dt.

This objective trains a denoiser, but sampling needs one more ingredient: a rule for which masked coordinates to reveal at each reverse step. On a discrete grid $1=t_T>\cdots>t_0=0$, a sampler transition from $\mathbf{x}_{t_{k+1}}$ to $\mathbf{x}_{t_k}$ can be written as a transition kernel. A policy $\pi_k(M\mid\mathbf{x})$ first selects a subset of still-masked positions, and the denoiser then fills those positions:

\mathbf{T}^{\pi}_{k}(\mathbf{x},\mathbf{y}) = \sum_{M\subseteq\mathcal{M}(\mathbf{x})} \pi_k(M\mid\mathbf{x}) \left[ \prod_{i\in M}p_\theta^i(y^i\mid\mathbf{x},t_{k+1}) \right] \mathbf{1}\{\mathbf{y}^{-M}=\mathbf{x}^{-M}\}.

This formulation makes the mismatch precise. Training samples $\mathbf{x}_t$ from independent random masking, while inference moves through states induced by $\mathbf{T}^{\pi}_k$. SAT-Mask is built around making these two paths share a local transition structure.
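The two-stage structure of the kernel (select a subset, then fill it) can be sketched as a single sampling step. The `select` and `denoise` callables below are hypothetical stand-ins for $\pi_k$ and $p_\theta$; the toy versions at the bottom exist only to make the sketch runnable.

```python
import random

MASK = None  # stand-in for the mask token

def reverse_step(x, select, denoise, rng):
    """Draw y ~ T_k^pi(x, .): pick a subset M of masked positions, then fill it."""
    masked = [i for i, tok in enumerate(x) if tok is MASK]
    chosen = select(x, masked, rng)   # M ~ pi_k(. | x)
    probs = denoise(x)                # probs[i] maps token -> p_theta^i(token | x)
    y = list(x)                       # positions outside M are copied unchanged
    for i in chosen:
        tokens, weights = zip(*probs[i].items())
        y[i] = rng.choices(tokens, weights=weights, k=1)[0]
    return y

# Toy stand-ins (assumptions): reveal one random position, uniform two-token denoiser.
def pick_one(x, masked, rng):
    return rng.sample(masked, 1)

def uniform_denoiser(x):
    return [{"a": 0.5, "b": 0.5} for _ in x]

rng = random.Random(0)
x = ["a", MASK, MASK, "b"]
y = reverse_step(x, pick_one, uniform_denoiser, rng)
```

Swapping `pick_one` for a confidence- or margin-based rule changes the states the chain visits without touching the denoiser, which is why the sampler policy alone determines the inference-time state distribution.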

Motivation

The motivation section of the paper has two parts. The first explains why random masking wastes capacity. The second explains why that waste is not only a training inefficiency, but also becomes exposure bias during generation.

Random masking causes capacity dilution

Consider a generic masking policy $Q(\mathbf{x}_t\mid\mathbf{x}_0)$. At a fixed time, the masked-token objective can be decomposed into the true conditional entropy plus the denoiser approximation error:

\mathcal{L}_Q(\theta) = H_Q(\mathbf{x}_0\mid\mathbf{x}_t) + \mathbb{E}_{\mathbf{x}_t\sim p_Q} \left[ D_{\mathrm{KL}}\!\left( p_Q(\mathbf{x}_0\mid\mathbf{x}_t) \parallel p_\theta(\mathbf{x}_0\mid\mathbf{x}_t) \right) \right].
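For a single state $\mathbf{x}_t$, this decomposition is the familiar identity that cross-entropy equals entropy plus KL divergence. The short check below uses two made-up three-outcome distributions (both are assumptions for illustration only).

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [0.7, 0.2, 0.1]    # hypothetical posterior p_Q(x_0 | x_t) for one state
p_model = [0.5, 0.3, 0.2]   # hypothetical denoiser prediction p_theta(x_0 | x_t)

loss = cross_entropy(p_true, p_model)
gap = loss - entropy(p_true)  # the part of the loss a better denoiser can remove
```

Here `gap` equals `kl(p_true, p_model)` up to floating-point error. The entropy term is fixed by the masking policy $Q$, so only the KL term can be reduced by training.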

The important point is that $Q$ determines the state space the model must represent. The paper formalizes this with an additive capacity law:

Theorem (Additive capacity law). For any masking policy $Q$, the idealized capacity requirement satisfies
$\mathcal{C}_{\mathrm{needed}}(Q) \ge H_Q(\mathbf{x}_0\mid\mathbf{x}_t) + H_Q(\mathbf{x}_t) = H(\mathbf{x}_0) + H_Q(\mathbf{x}_t\mid\mathbf{x}_0).$
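The equality on the right is the chain rule for joint entropy. As a sanity check under toy assumptions (a single token drawn uniformly from a hypothetical vocabulary of size 4 and masked with probability 0.3), both sides can be evaluated numerically:

```python
import math

def binary_entropy_bits(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

V, m = 4, 0.3  # hypothetical vocabulary size and mask probability

# Marginal of x_t: each visible token has mass (1 - m) / V, the mask has mass m.
p_xt = [(1 - m) / V] * V + [m]
H_xt = -sum(p * math.log2(p) for p in p_xt)

H_x0 = math.log2(V)                      # uniform clean token
H_x0_given_xt = m * math.log2(V)         # x_0 is known unless x_t is the mask
H_xt_given_x0 = binary_entropy_bits(m)   # one Bernoulli mask decision

lhs = H_x0_given_xt + H_xt
rhs = H_x0 + H_xt_given_x0
```

Both `lhs` and `rhs` come out to the joint entropy of the pair, and the data term `H_x0` is separated from the policy-dependent term `H_xt_given_x0`.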

The term $H(\mathbf{x}_0)$ is intrinsic to the data. The extra term $H_Q(\mathbf{x}_t\mid\mathbf{x}_0)$ is created by the masking policy. For independent random masking with mask probability $m_t=1-\alpha_t$, every coordinate contributes one Bernoulli mask decision:

H_{Q_{\mathrm{rand}}}(\mathbf{x}_t\mid\mathbf{x}_0) = L\,\mathcal{H}_b(m_t).

Under a linear mask schedule $m_t=t$ and uniform time sampling, this gives the closed-form overhead:

\mathbb{E}_{t\sim\mathcal{U}(0,1)} H_{Q_{\mathrm{rand}}}(\mathbf{x}_t\mid\mathbf{x}_0) = L\int_0^1 \mathcal{H}_b(t)dt = \frac{L}{2\ln 2} \approx {\color{red}0.721L\;\text{bits}}.

Figure 1. Random masking versus order-aware masking. Random masking visits arbitrary mask states that dilute capacity; SAT-Mask follows an order-aware trajectory and releases capacity for meaningful context.

For length $1024$, this is about $738$ bits spent only on arbitrary mask subsets. A deterministic fixed order would remove this entropy, but it would also give up the bidirectional, any-order advantage of MDMs. The target is therefore not “make the order fixed”; it is “make the order structured while preserving bidirectional context.”
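The closed form can be checked numerically. The sketch below integrates the binary entropy with a midpoint rule (the integration scheme and function names are choices made here, not taken from the paper) and compares it with $1/(2\ln 2)$.

```python
import math

def binary_entropy_bits(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mean_overhead_bits_per_token(n=100_000):
    """Midpoint-rule estimate of the integral of H_b(t) over [0, 1]."""
    return sum(binary_entropy_bits((k + 0.5) / n) for k in range(n)) / n

per_token = mean_overhead_bits_per_token()
closed_form = 1 / (2 * math.log(2))   # about 0.721 bits per token
total_1024 = 1024 * closed_form       # overhead for a length-1024 sequence
```

The numerical estimate agrees with the closed form to several decimal places, and `total_1024` lands between 738 and 739 bits, matching the figure quoted above.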

Misaligned intrinsic order induces exposure bias

The capacity argument explains why random masking is inefficient. Exposure bias explains why it hurts generation.

Let $P_t$ denote the distribution of states seen during training, and let $\hat{P}_t$ denote the distribution of states produced by the sampler. Their mismatch is

\Delta_t = D_{\mathrm{KL}}(P_t\parallel \hat{P}_t).

For one reverse step $t_{k+1}\to t_k$, the paper derives a recursion:

\Delta_{t_k} \le \Delta_{t_{k+1}} + {\color{red} \mathbb{E}_{\mathbf{x}\sim P_{t_{k+1}}} \left[ D_{\mathrm{KL}}\!\left( \mathcal{Q}^{\mathrm{train}}_k(\cdot\mid\mathbf{x}) \parallel \mathbf{T}^{\pi}_k(\cdot\mid\mathbf{x}) \right) \right] }.

This is the key exposure-bias statement. The mismatch at the next cleaner state is bounded by the previous mismatch plus a local transition mismatch. If the training transition and sampler transition disagree at every step, the error is not a one-shot defect; it accumulates along the whole denoising path.
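Unrolling the recursion shows how the bound grows: after $T$ reverse steps it is the initial mismatch plus the sum of all local transition mismatches. The sketch below applies the inequality step by step; the per-step value of 0.02 and the 64 steps are made-up numbers for illustration.

```python
def unrolled_bounds(delta_start, local_mismatches):
    """Apply Delta_{t_k} <= Delta_{t_{k+1}} + eps_k once per reverse step."""
    bounds = [delta_start]
    for eps in local_mismatches:
        bounds.append(bounds[-1] + eps)
    return bounds

# Hypothetical numbers: no mismatch at the fully masked state, 64 reverse steps,
# and the same local transition mismatch at every step.
bounds = unrolled_bounds(0.0, [0.02] * 64)
final_bound = bounds[-1]
```

Even a small per-step mismatch yields a bound that grows linearly with the number of steps, so reducing the local term at every step is what matters.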

The next question is: which path should they agree on? The paper argues that good inference seeks an empirical intrinsic order $\pi_\theta^*$: for a given sample and model state, reveal positions that have lower local surprisal first, so easy tokens become anchors for harder tokens. This is the easy-to-hard order induced by the current denoiser, not a fixed left-to-right order.

Using this intrinsic order as a reference, the local mismatch can be decomposed into three terms:

Proposition (Local error decomposition).
$\begin{aligned} D_{\mathrm{KL}}\!\left( \mathcal{Q}^{\mathrm{train}}_k \parallel \mathbf{T}^{\pi}_k \right) &= \underbrace{ D_{\mathrm{KL}}\!\left( \mathcal{Q}^{\mathrm{train}}_k \parallel \mathcal{Q}^{\pi_\theta^*}_k \right) }_{\mathcal{E}_{\mathrm{capacity}}} + \underbrace{ D_{\mathrm{KL}}\!\left( \mathcal{Q}^{\pi_\theta^*}_k \parallel \mathbf{T}^{\pi}_k \right) }_{\mathcal{E}_{\mathrm{align}}} + \mathcal{R}_{\mathrm{shift}}. \end{aligned}$

This decomposition clarifies why inference-only fixes are insufficient.

$\mathcal{E}_{\mathrm{capacity}}$ measures how far the training transition is from the model’s intrinsic easy-to-hard order. Uniform random masking makes this large because it trains on states unrelated to the denoising frontier.
$\mathcal{E}_{\mathrm{align}}$ measures how far the sampler is from that intrinsic order. Confidence, margin, and entropy samplers mainly attack this term.
$\mathcal{R}_{\mathrm{shift}}$ captures the residual distribution shift between the states used in training and the states induced by the intrinsic trajectory.

So a better sampler can reduce $\mathcal{E}_{\mathrm{align}}$, but if the training distribution remains random, $\mathcal{E}_{\mathrm{capacity}}$ and $\mathcal{R}_{\mathrm{shift}}$ remain. This is the precise reason SAT-Mask moves the planner-like transition into training. The training state itself must be produced by a trajectory that approximates the model’s empirical intrinsic order.

The SAT-Mask Framework

SAT-Mask constructs training states by matching the inference-time two-stage update. It first rolls back to a higher-noise state, then executes an uncertainty-aware denoising step back to the target mask budget. The architecture and reconstruction loss stay the same; the supervised state distribution changes.

Figure 2. Overview of SAT-Mask. Left: starting from a more corrupted state, SAT-Mask uses denoiser confidence to fill high-confidence tokens along a zigzag self-aligned trajectory, while computing loss only on the retained masks. Right: the shared training-inference path reduces capacity dilution and exposure bias.

SAT-Masking schedule

Given a clean sequence $\mathbf{x}_0$ and time $t$, SAT-Mask first samples a more corrupted state at

t^+ = \min(t+\Delta t,1).

The masks at $t$ and $t^+$ are coupled by shared uniforms $u_i\sim\mathrm{Unif}(0,1)$:

\mathcal{M}_t=\{i:u_i<m_t\}, \qquad \mathcal{M}_{t^+}=\{i:u_i<m_{t^+}\}.

Therefore $\mathcal{M}_t\subseteq\mathcal{M}_{t^+}$, since $m_t\le m_{t^+}$. The target budget at time $t$ is

N_t=|\mathcal{M}_t|.

SAT-Mask then calls the current denoiser once on $\mathbf{x}_{t^+}$ and asks a sampler operator $S$ to unmask exactly

B_t=|\mathcal{M}_{t^+}|-N_t

positions:

\mathbf{x}'_t = S\!\left( \mathbf{x}_{t^+}, f_\theta(\mathbf{x}_{t^+},t^+), B_t \right), \qquad |\mathcal{M}(\mathbf{x}'_t)|=N_t.

This is the zigzag move: go to a slightly noisier point, then take one sampler-like step back to the original mask budget. The visible tokens in $\mathbf{x}'_t$ are no longer a uniformly random subset. They are the tokens the current model would prefer to reveal along its trajectory.
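The mask coupling and the two budgets can be sketched in a few lines. The linear schedule $m_t=t$, the sequence length of 32, and the function name are assumptions chosen for illustration.

```python
import random

def coupled_masks(length, m_t, m_t_plus, rng):
    """Build M_t and M_{t+} from the same uniforms so that M_t is a subset of M_{t+}."""
    u = [rng.random() for _ in range(length)]
    mask_t = {i for i in range(length) if u[i] < m_t}
    mask_t_plus = {i for i in range(length) if u[i] < m_t_plus}
    return mask_t, mask_t_plus

rng = random.Random(0)
t, delta_t = 0.5, 1 / 16
t_plus = min(t + delta_t, 1.0)

# Assumed linear schedule m_t = t.
mask_t, mask_t_plus = coupled_masks(32, t, t_plus, rng)
N_t = len(mask_t)               # target mask budget at time t
B_t = len(mask_t_plus) - N_t    # number of positions the sampler must unmask
```

Sharing the uniforms guarantees `B_t >= 0`, so one sampler step from the over-noised state always lands on exactly `N_t` remaining masks.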

Sampler-compatible transition

The operator $S$ is an interface. It can be random, confidence-based, entropy-based, margin-based, remasking-aware, or replaced by future sampler designs. The core requirement is that the same local rule used at inference can also be used to construct training states.

In the experiments, the default is downk-margin. Let $\ell_\theta^i$ be the logits at a masked position $i$, and let

r_i = \ell_{(1)}^i-\ell_{(2)}^i

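be the margin between the largest and second-largest logits at position $i$ (equivalently, between the two largest log-probabilities, since the softmax normalizer cancels in the difference). SAT-Mask selects a set $\mathcal{U}_t$ of $B_t$ positions by mixing deterministic high-margin unmasking with random coverage, where $\gamma\in[0,1]$ sets the deterministic fraction: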

\mathcal{U}_t = \operatorname{TopK}_{i\in\mathcal{M}_{t^+}} (r_i,\lfloor\gamma B_t\rfloor) \cup \operatorname{Uniform}(\mathrm{rest},B_t-\lfloor\gamma B_t\rfloor).

For selected positions, tokens are sampled from a temperature-controlled distribution:

x_t^{\prime i} \sim \mathrm{Cat}\!\left( \operatorname{softmax} \left( \ell_\theta^i(\cdot\mid\mathbf{x}_{t^+},t^+)/\tau \right) \right), \qquad i\in\mathcal{U}_t.

The margin score gives an easy-to-hard signal. The random part avoids collapsing the schedule to a single deterministic path. The temperature $\tau$ preserves stochasticity in the rollout.
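The selection and filling rules above can be sketched as follows. This is an illustration of the two equations, not the paper's implementation of downk-margin; the dictionary layout of `logits_by_pos`, the function names, and the example logits are assumptions.

```python
import math
import random

def margin(logits):
    """Gap between the largest and second-largest logit."""
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2

def select_positions(logits_by_pos, budget, gamma, rng):
    """Top-margin picks for floor(gamma * budget) slots, uniform picks for the rest."""
    n_top = math.floor(gamma * budget)
    ranked = sorted(logits_by_pos, key=lambda i: margin(logits_by_pos[i]), reverse=True)
    top, rest = ranked[:n_top], ranked[n_top:]
    return set(top) | set(rng.sample(rest, budget - n_top))

def sample_token(logits, tau, rng):
    """Sample a token id from softmax(logits / tau)."""
    scaled = [l / tau for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
# Hypothetical logits at four masked positions over a three-token vocabulary.
logits_by_pos = {1: [4.0, 0.5, 0.1], 3: [1.0, 0.9, 0.8], 5: [2.0, 1.9, 0.0], 6: [3.0, 0.0, 0.0]}
chosen = select_positions(logits_by_pos, budget=3, gamma=0.7, rng=rng)
filled = {i: sample_token(logits_by_pos[i], tau=1.0, rng=rng) for i in chosen}
```

In this example positions 1 and 6 have the largest margins and are always selected, while the third slot is drawn uniformly from the remaining positions. Setting `gamma` to 1 recovers pure top-margin selection, and setting it to 0 recovers random selection.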

Training objective

SAT-Mask keeps the standard reconstruction target but evaluates it on the sampler-induced state distribution:

\mathcal{L}_{\mathrm{SAT}}(\theta) = \mathbb{E}_{\mathbf{x}_0} \int_0^1 \mathbb{E}_{ {\color{red}\mathbf{x}'_t\sim \operatorname{sg}[ \mathcal{Q}^{S}_{\theta,t}(\cdot\mid\mathbf{x}_0) ]}} \left[ \frac{-\dot{\alpha}_t}{1-\alpha_t} \sum_{i\in\mathcal{M}({\color{red}\mathbf{x}'_t})} -\log p_\theta^i(x_0^i\mid {\color{red}\mathbf{x}'_t},t) \right]dt.

The stop-gradient notation indicates that the rollout constructs the state but does not receive gradients directly. Gradients are applied through the final reconstruction loss. Thus SAT-Mask changes neither the denoiser architecture nor the target mask schedule. It changes which contexts the model treats as normal during optimization.

Algorithm 1 SAT-Mask Training

Input: dataset $D$, denoiser $f_\theta$, rollout step $\Delta t$, sampler $S$, temperature $\tau$
while not converged do
    Sample clean data $\mathbf{x}_0 \sim D$ and diffusion time $t \sim \mathcal{U}(0,1)$
    Sample state $\mathbf{x}_{t^+}$ at $t^+ = \min(t+\Delta t, 1)$ using masks coupled with the target budget $N_t$
    Compute logits $\ell_\theta = f_\theta(\mathbf{x}_{t^+}, t^+)$
    $B_t \leftarrow |\mathcal{M}_{t^+}| - N_t$    (number of tokens to unmask from $\mathbf{x}_{t^+}$)
    $\mathbf{x}'_t \leftarrow S(\mathbf{x}_{t^+}, \ell_\theta, B_t)$    (one sampler step; default uses downk-margin)
    Update $\theta$ by descending $\nabla_\theta \big( \lambda(t) \sum_{i\in\mathcal{M}(\mathbf{x}'_t)} -\log p_\theta^i(x_0^i\mid\mathbf{x}'_t,t) \big)$    (reconstruction loss, with $\lambda(t) = -\dot{\alpha}_t/(1-\alpha_t)$ as in $\mathcal{L}_{\mathrm{SAT}}$)
end while
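The loop body of Algorithm 1 can be sketched without a deep-learning framework. Everything below is a simplified illustration under stated assumptions: `toy_denoiser` is a hypothetical stand-in for $f_\theta$, the schedule is linear ($m_t=t$, so $\lambda(t)=1/t$), selection is pure top-margin ($\gamma=1$), and no parameter update is performed.

```python
import math
import random

MASK = -1  # stand-in id for the mask token (assumption)

def toy_denoiser(x, t):
    """Hypothetical stand-in for f_theta: one logit vector per position."""
    vocab = 4
    return [[1.0 if v == (i % vocab) else 0.0 for v in range(vocab)] for i in range(len(x))]

def log_softmax(logits):
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return [l - log_z for l in logits]

def sat_mask_state(x0, t, delta_t, denoiser, rng, tau=1.0):
    """Over-noise to t+, then unmask B_t positions by top margin. Returns (x', N_t)."""
    t_plus = min(t + delta_t, 1.0)
    u = [rng.random() for _ in x0]                       # shared uniforms
    mask_t = [i for i in range(len(x0)) if u[i] < t]
    mask_plus = [i for i in range(len(x0)) if u[i] < t_plus]
    x_plus = [MASK if u[i] < t_plus else x0[i] for i in range(len(x0))]
    logits = denoiser(x_plus, t_plus)                    # single denoiser call
    budget = len(mask_plus) - len(mask_t)                # B_t

    def margin(i):
        top1, top2 = sorted(logits[i], reverse=True)[:2]
        return top1 - top2

    chosen = sorted(mask_plus, key=margin, reverse=True)[:budget]
    x_prime = list(x_plus)
    for i in chosen:
        weights = [math.exp(l / tau) for l in logits[i]]
        x_prime[i] = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return x_prime, len(mask_t)

def reconstruction_loss(x0, x_prime, t, denoiser):
    """lambda(t) times the summed negative log-likelihood over retained masks."""
    logits = denoiser(x_prime, t)
    nll = sum(-log_softmax(logits[i])[x0[i]] for i, tok in enumerate(x_prime) if tok == MASK)
    return nll / t  # lambda(t) = 1 / t under the assumed linear schedule

rng = random.Random(0)
x0 = [i % 4 for i in range(16)]
t = 0.5
x_prime, n_t = sat_mask_state(x0, t, 1 / 16, toy_denoiser, rng)
loss = reconstruction_loss(x0, x_prime, t, toy_denoiser)
```

In an actual training setup, the state construction would run without gradient tracking (the stop-gradient in $\mathcal{L}_{\mathrm{SAT}}$), and only `loss` would be backpropagated. Note that the filled tokens are sampled from the model and may differ from $\mathbf{x}_0$, and that the retained masks match $\mathcal{M}_t$ in count but not necessarily in position.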

Effectiveness analysis

For capacity dilution, SAT-Mask replaces arbitrary random mask subsets with states produced by a model-dependent map. In the deterministic core, many over-noised states can collapse to the same supervised state when they induce the same high-margin reveals:

Proposition (State-space collapse of SAT-Mask).
$H_{\mathrm{SAT}}(\tilde{\mathbf{x}}_t\mid\mathbf{x}_0) = H_q(\mathbf{x}_{t^+}\mid\mathbf{x}_0) - H(\mathbf{x}_{t^+}\mid\tilde{\mathbf{x}}_t,\mathbf{x}_0,\theta).$

Here $\tilde{\mathbf{x}}_t$ denotes the SAT-Mask training state, written $\mathbf{x}'_t$ above. The second term is the collapse entropy released by SAT-Mask. Intuitively, states that differ only by irrelevant arbitrary mask choices no longer need to be separately modeled if they lead to the same high-margin reveal pattern.
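The mechanism behind this proposition is that a deterministic map cannot increase entropy, and it strictly reduces entropy whenever several inputs share an output. The toy below uses eight equally likely over-noised states and a made-up map that merges them in pairs; both are assumptions for illustration.

```python
import math
from collections import Counter

def entropy_bits(counts):
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def collapse(state):
    """Hypothetical deterministic map: pairs of over-noised states share one output."""
    return state // 2

over_noised = list(range(8))  # eight equally likely over-noised states

H_plus = entropy_bits(Counter(over_noised))                       # entropy before the map
H_sat = entropy_bits(Counter(collapse(s) for s in over_noised))   # entropy after the map
released = H_plus - H_sat                                         # collapse entropy
```

Here three bits of state entropy shrink to two, so one bit is released. In SAT-Mask the analogous map is the margin-guided reveal, and the released term corresponds to over-noised states that lead to the same supervised state.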

For exposure bias, the same margin-guided local kernel appears in training and inference. Because margin is a target-free proxy for low local surprisal, the training states become closer to the empirical intrinsic order $\pi_\theta^*$. This directly reduces the capacity and shift terms that inference-only samplers cannot remove.

Experiments

The experiments evaluate SAT-Mask on problem solving, text generation, and math reasoning. In each setting, the core comparison is the same: replace vanilla random masking with SAT-Mask while keeping the denoiser and decoding policy controlled.

Sudoku and Countdown

For Sudoku, the paper uses the one-million solved-game corpus and trains on the first 100k puzzles. Each $9\times 9$ grid is serialized as a digit sequence, with 0 marking an empty cell. For Countdown-4, it generates 500k arithmetic problems following Stream of Search, with 10% of targets held out for out-of-distribution evaluation [8,9]. Both tasks use the same 6M DiT denoiser.

SAT-Mask consistently improves both benchmarks. On Sudoku, accuracy increases from 39.1% with vanilla random masking to 63.5% at temperature $\tau=2.5$, a 62.4% relative gain. On Countdown-4, the best SAT-Mask setting reaches 35.6% at $\tau=1.5$, improving over the 30.7% vanilla baseline by 16.0% in relative terms.
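The relative gains quoted above follow directly from the reported accuracies. A small check (the helper name is an assumption):

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

sudoku_gain = relative_gain(63.5, 39.1)      # Sudoku: SAT-Mask vs. vanilla
countdown_gain = relative_gain(35.6, 30.7)   # Countdown-4: SAT-Mask vs. vanilla
print(f"Sudoku: {sudoku_gain:.1f}%  Countdown-4: {countdown_gain:.1f}%")
```

These are relative improvements; the corresponding absolute gains are 24.4 and 4.9 percentage points.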

Figure 3. Problem-solving results. Left: Sudoku accuracy across training steps. Right: Countdown-4 accuracy and relative improvement over vanilla across temperatures.

These tasks make the trajectory problem visible. A wrong early reveal can constrain later decisions, while a reliable early reveal can become an anchor. SAT-Mask helps because the model is trained on contexts that already reflect that easy-to-hard dependency.

OpenWebText generation

For text generation, the paper evaluates unconditional OpenWebText generation with a 169M DiT-based MDM initialized from the MDLM checkpoint, using the GPT-2 tokenizer and length $L=1024$ [2,10,11]. The evaluation reports MAUVE, GenPPL, and entropy over 5000 samples [12~14].

Table 1. OpenWebText unconditional generation. MAUVE is higher-is-better; GenPPL is lower-is-better; entropy is reported as a sanity metric. Best scores within each method family and sampling budget are bolded.
Method	MAUVE (T=128)	GenPPL (T=128)	Ent. (T=128)	MAUVE (T=256)	GenPPL (T=256)	Ent. (T=256)	MAUVE (T=512)	GenPPL (T=512)	Ent. (T=512)
MDM without remask
MDLM	0.016	79.37	5.57	0.027	73.02	5.55	0.034	70.21	5.54
MDLM+SAT-Mask	0.034	78.16	5.55	0.038	72.58	5.53	0.039	67.96	5.51
MDM with remask
ReMDM-conf	0.02	74.50	5.57	0.03	66.50	5.54	0.04	52.50	5.49
ReMDM	0.06	42.50	5.43	0.22	30.50	5.35	0.35	21.00	5.22
PRISM	0.18	18.10	5.11	0.30	18.00	5.15	0.42	17.12	5.12
PRISM+SAT-Mask	0.31	23.30	5.20	0.42	15.60	5.05	0.43	11.08	4.91

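SAT-Mask improves the non-remasking MDLM baseline on both MAUVE and GenPPL at every sampling budget. In the remasking setting, PRISM+SAT-Mask improves on PRISM in MAUVE at every budget and in GenPPL at $T=256$ and $T=512$; the exception is GenPPL at $T=128$, where PRISM alone is lower (18.10 versus 23.30). Overall, this supports the paper’s claim that training-state alignment is complementary to sampler design.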

Math reasoning

For GSM8K, the paper follows the SMDM supervised fine-tuning setup and evaluates different model scales under 32, 64, 128, and 256 sampling steps [15,16]. The comparison includes LLAMA, Plaid, MDLM, SPMDM, SEDD, SMDM, and PAPL numbers where applicable [3,6,7,17,18].

Table 2. GSM8K-CoT math reasoning accuracy under different sampling steps. Higher is better. Missing entries indicate results not reported in the corresponding baseline.
Method	Param.	32 steps	64 steps	128 steps	256 steps
LLAMA	7B	58.60	58.60	58.60	58.60
Plaid	1.3B	--	--	--	32.60
SMDM	1.1B	53.82	55.11	54.96	56.10
SAT-Mask (ours)	1.1B	54.58	55.11	57.01	58.75
SMDM	336M	52.08	53.52	54.20	54.96
PAPL	336M	51.40	53.52	54.89	55.64
SAT-Mask (ours)	336M	53.52	54.35	55.64	55.49
SEDD	170M	--	--	--	45.30
SMDM	170M	49.65	50.01	50.79	51.25
PAPL	170M	49.65	52.01	53.37	53.60
SAT-Mask (ours)	170M	51.63	53.52	53.67	54.28
MDLM	127M	--	--	--	46.10
SPMDM	127M	--	--	--	51.30
SMDM	113M	47.76	49.05	50.34	50.56
PAPL	113M	45.11	48.67	49.50	50.64
SAT-Mask (ours)	113M	48.74	51.25	51.70	52.76

The gains are strong in the smaller models: at 256 steps, SAT-Mask improves over SMDM by 2.20 points at 113M and 3.03 points at 170M. At 170M, SAT-Mask reaches 54.28%, close to the 336M SMDM baseline at 54.96%, while requiring fewer training steps in the paper’s efficiency comparison. This is consistent with the theory: capacity-limited models pay more for arbitrary mask-state entropy, so they benefit more from aligned state construction.

Efficiency and ablation

The paper measures efficiency by baseline-equivalent training steps required to reach comparable performance. SAT-Mask reduces the required steps by 68.2% on Sudoku, 61.1% on Countdown, and 16.7% on OpenWebText. On GSM8K, the reductions are 9.1%, 28.3%, 3.5%, and 16.7% for 113M, 170M, 336M, and 1028M models.

Figure 4. Training efficiency of SAT-Mask. Bars report the baseline-equivalent training steps required by vanilla random masking and SAT-Mask; percentages denote relative step reduction.

The ablations isolate the schedule design. Temperature controls the diversity of filled tokens in the zigzag rollout: too little exploration or too much noise hurts, while $\tau=2.5$ performs best on Sudoku. The selection rule is also crucial: random selection behaves like vanilla masking, top-$k$ alone hurts, and downk-margin performs best by preserving the easy-to-hard order. Finally, the over-noise offset $\Delta t$ should be moderate; on GSM8K-113M, performance improves up to $\Delta t=1/16$ and then drops when the over-noised state is too far from the target budget.

Figure 5. Ablation studies. Temperature controls token-filling stochasticity, the selection function controls which positions are unmasked, and the offset controls the distance between the over-noised state and the target mask budget.

Conclusion

SAT-Mask is best understood as a training-side alignment method for masked diffusion language models. Random masking asks the denoiser to spend capacity on arbitrary mask states and creates a state-distribution gap with the sampler. SAT-Mask replaces those arbitrary states with local rollout states produced by a shared confidence-guided transition.

This closes the loop between training and inference without changing the architecture or the reconstruction objective. The result is a masking schedule that follows the model’s empirical easy-to-hard order, reduces exposure bias by construction, and improves quality and training efficiency across structured reasoning, open-ended generation, and math reasoning.

References

[1] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, Rianne van den Berg. Structured Denoising Diffusion Models in Discrete State-Spaces. Advances in Neural Information Processing Systems, 2021.

[2] Subham S. Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, Volodymyr Kuleshov. Simple and Effective Masked Diffusion Language Models. Advances in Neural Information Processing Systems, 2024.

[3] Aaron Lou, Chenlin Meng, Stefano Ermon. Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution. International Conference on Machine Learning, 2024.

[4] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li. Large Language Diffusion Models. arXiv preprint arXiv:2502.09992, 2025.

[5] Jaeyeon Kim, Kulin Shah, Vasilis Kontonis, Sham M. Kakade, Sitan Chen. Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions. International Conference on Machine Learning, 2025.

[6] Fred Zhangzhi Peng, Zachary Bezemek, Jarrid Rector-Brooks, Shuibai Zhang, Anru R. Zhang, Michael M. Bronstein, Avishek Joey Bose, Alexander Tong. Planner Aware Path Learning in Diffusion Language Models Training. arXiv preprint arXiv:2509.23405, 2025.

[7] Yichen Zhu, Weiyu Chen, James Kwok, Zhou Zhao. SPMDM: Enhancing Masked Diffusion Models through Simplifying Sampling Path. Advances in Neural Information Processing Systems, 2025.

[8] Kyubyong Park. 1 Million Sudoku Games. Kaggle, 2016.

[9] Kanishk Gandhi, Denise Lee, Gabriel Grand, Muxin Liu, Winson Cheng, Archit Sharma, Noah D. Goodman. Stream of Search: Learning to Search in Language. 2024.

[10] Aaron Gokaslan, Vanya Cohen. OpenWebText Corpus. 2019.

[11] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever. Language Models are Unsupervised Multitask Learners. OpenAI blog, 2019.

[12] Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, Zaid Harchaoui. MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers. Advances in Neural Information Processing Systems, 2021.

[13] Jaeyeon Kim, Seunggeun Kim, Taekyun Lee, David Z. Pan, Hyeji Kim, Sham Kakade, Sitan Chen. Fine-Tuning Masked Diffusion for Provable Self-Correction. 2025.

[14] Guanghan Wang, Yair Schiff, Subham Sahoo, Volodymyr Kuleshov. Remasking Discrete Diffusion Models with Inference-Time Scaling. Advances in Neural Information Processing Systems, 2025.

[15] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168, 2021.

[16] Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, Chongxuan Li. Scaling up Masked Diffusion Models on Text. International Conference on Learning Representations, 2025.

[17] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothee Lacroix, Baptiste Roziere, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971, 2023.

[18] Ishaan Gulrajani, Tatsunori B. Hashimoto. Likelihood-Based Diffusion Language Models. Advances in Neural Information Processing Systems, 2023.

[19] Zecheng Tang, Pinzheng Wang, Keyan Zhou, Juntao Li, Ziqiang Cao, Min Zhang. Can Diffusion Model Achieve Better Performance in Text Generation? Bridging the Gap between Training and Inference! Findings of the Association for Computational Linguistics: ACL, 2023. doi:10.18653/V1/2023.FINDINGS-ACL.721