In Mask Diffusion Models (MDM), the Noise Scheduler is pivotal for learning capacity and sampling quality. This paper presents a unified analysis addressing three core challenges —— Exposure Bias induced by Absorb mechanisms, efficiency bottlenecks from Intrinsic Order, and joint probability deviations from Independence Assumptions. We systematically review mainstream strategies, comparing their efficacy in semantic capture, remasking, and efficiency to elucidate how refined scheduling reshapes token dependencies. Finally, we outline future directions for overcoming these underlying logical defects.
In the research of Diffusion Models, the selection of the Noise Scheduler not only determines the sampling trajectory but also directly influences the upper bound of the model’s learning capability for complex distributions. Compared to Gaussian diffusion in the continuous domain, Mask Diffusion Models (MDM) have demonstrated unique potential in discrete data modeling. However, the Absorb Transition Kernel universally employed in MDM, while endowing the model with logical simplicity, also introduces three key theoretical and engineering challenges.
Firstly, Exposure Bias is particularly severe in MDM. Due to the Absorb mechanism, generated tokens possess an “immutable” characteristic. Any slight deviation during the sampling process accumulates rapidly over timesteps due to the lack of a correction mechanism, leading to the collapse of the generated sequence.
Secondly, the complexity of Intrinsic Order limits training efficiency. Real-world data (such as natural language or code) often exhibits diverse dependency orders, while MDM attempts to capture all possible generation paths within a limited parameter space. Since it is impossible to enumerate and achieve optimal learning for every generation order in polynomial time, the model often faces insufficient training efficiency and struggles to achieve convergence on complex structured data.
The most fundamental challenge lies in the probability deviation caused by the Independence Assumption. During the modeling phase, MDM often assumes conditional independence between tokens, which contradicts the essence of strong couplings in discrete data. This leads to a severe deviation of the estimated product of marginal distributions from the true joint probability distribution.
Therefore, designing a sophisticated Noise Scheduler is no longer merely a simple scaling of mask ratios, but a pivotal mechanism for compensating for the underlying logical defects of MDM. An optimized scheduler not only needs to alleviate exposure bias through fine-grained mask rate balancing but, more critically, aims to reshape the dependency dynamics between tokens through the evolution of the time trajectory.
In addition to seeking a “zero-full-correlation path” to statistically approximate an independent distribution, a more forward-looking correction scheme lies in breaking the single-token processing paradigm. For instance, by introducing Block-wise Masking, the Scheduler can force the model to capture locally strongly coupled semantic clusters, alleviating the joint probability deviation caused by the independence assumption at a mechanistic level. Meanwhile, introducing Iterative Remasking endows the model with the ability to “regret” and correct on the inference trajectory, breaking the unidirectional irreversibility of the Absorb Kernel. Coordinated dynamically by the Scheduler, these strategies not only reduce information entropy loss in a statistical sense but essentially act as a calibrator between the independence assumption and structured dependencies, thereby achieving accurate restoration of the deep joint probability distribution of discrete data while ensuring computational efficiency.
Based on the above observations, this paper reviews and summarizes current improvement strategies for Noise Schedulers in the field of Mask Diffusion Models, focusing on exploring how to break through the aforementioned bottlenecks through the refined design of the scheduler.
Suppose there are $C$ types of tokens. Including the mask, the state space of each token is ${0, 1, \dots, C}$. Let the token sequence be denoted as $x = x^1 x^2 \dots x^L$, where $x_0 \sim p_{\text{data}}$. The conditional probability is given by $q(x_t \mid x_0) = \prod_{i=1}^L q(x_t^i \mid x_0^i)$, where
\[q(x_t^i \mid x_0^i) = (1 - \alpha_t) \delta_{\text{m}}(x_t^i) + \alpha_t \delta_{x_0^i}(x_t^i)\]indicates that at time $t$, each token independently transitions to a mask with probability $1 - \alpha_t$ or remains unchanged with probability $\alpha_t$.
Let $0 \leq s \le t \leq 1$. In the reverse process, to calculate $p(x_s^i \mid x_t^i)$, we first consider $q(x_t^i \mid x_s^i)$:
\[q(x_t^i \mid x_s^i) = \begin{cases} \frac{\alpha_s - \alpha_t}{\alpha_s} \delta_{\text{m}}(x_t^i) + \frac{\alpha_t}{\alpha_s} \delta_{x_s^i}(x_t^i) & \text{if } x_s^i \in \mathcal{X}, \\ \delta_{x_s^i}(x_t^i) & \text{if } x_s^i = \text{m}. \end{cases}\]By Applying Bayes’ theorem, we obtain:
\[q(x_s^i \mid x_t^i, x_0^i) = \begin{cases} \delta_{x_t^i}(x_s^i) & \text{if } x_t^i \in \mathcal{X}, \\ \frac{1 - \alpha_s}{1 - \alpha_t} \delta_{\text{m}}(x_s^i) + \frac{\alpha_s - \alpha_t}{1 - \alpha_t} \delta_{x_0^i}(x_s^i) & \text{if } x_t^i = \text{m}. \end{cases}\]This indicates that when a token $x_t^i$ is a mask, it remains a mask with probability $\frac{1 - \alpha_s}{1 - \alpha_t}$ and transitions to $x_0^i$ with probability $\frac{\alpha_s - \alpha_t}{1 - \alpha_t}$. Once a token is unmasked, it is no longer updated.
Since $p(x_s \mid x_t) = \mathbb{E}_{p(x_0 \mid x_t)}[q(x_s \mid x_t, x_0)]$, the reverse process can be implemented by first sampling $x_0 \sim p(\cdot \mid x_t)$ and then sampling $x_s \sim q(\cdot \mid x_t, x_0)$. As $x_0$ is unknown, we model it via $p_{\theta}(x_0 \mid x_t) = \prod_{i=1}^L p_{\theta}(x_0^i \mid x_t)$ and optimize it using the following loss function:
\[\mathcal{L}_{\text{vb}}(\mathbf{x}_0; \theta) = \int_0^1 \frac{\alpha_t'}{1 - \alpha_t} \mathbb{E}_{q(\mathbf{x}_t \mid \mathbf{x}_0)} \left[ \sum_{i=1}^{L} \log p_\theta(x_0^i \mid x_t^i) \right] dt,\]In the inference phase, given a fully masked sequence $\mathbf{x}_L = [\text{m}, \text{m}, \dots, \text{m}]$, we need to obtain $\mathbf{x}_0$ through $T$ denoising steps. Based on previous discussions, the state transition at each time step $t$ depends not only on the model’s predictive capability $p_\theta(x_0^i \mid \mathbf{x}_t)$ but also crucially on the choice of the Sampling Path.
We model the inference from $t+1 \rightarrow t$ as a two-step process: first, a selection strategy $\pi(i \mid x_{t+1}), i=1, 2, \dots, L$ is used to determine the positions to be unmasked, denoted as $M_{t+1}$. Then, for the positions in $M_{t+1}$, we sample according to the logits predicted by the model, i.e., $x_t^i \sim p_{\theta}(x_t^i \mid x_{t+1})$. Thus, a single inference step is given by:
\[x_t^i = \begin{cases} \sim p_{\theta}(x^i \mid \mathbf{x}_{t+1}), & \text{if } i \in M_{t+1} \quad \text{(Selected to be unmasked in the current step)} \\ x_{t+1}^i, & \text{if } i \notin M_{t+1} \quad \text{(State from the previous step is maintained)} \end{cases}\]Namely:
\[[\mathbf{T}_{t+1 \rightarrow t}^M](\mathbf{x}_{t+1}, \mathbf{x}_t) = \begin{cases} \underbrace{\pi(M \mid \mathbf{x}_{t+1})}_{\text{Selection}} \cdot \underbrace{\prod_{i \in M} p_\theta(x_t^i \mid \mathbf{x}_{t+1})}_{\text{Generation}} & \text{if } \mathbf{x}_t^{-M} = \mathbf{x}_{t+1}^{-M} \\ 0 & \text{otherwise} \end{cases}\]First, we define two core probability measures existing in the same state space $\mathcal{X}^L$:
Our objective is the total distributional shift at time $t$: $\Delta_t = D_{KL}(P_t \Vert \hat{P}_t)$.
By applying the chain rule of KL divergence, we can establish the evolutionary relationship of $\Delta_t$.
Let $\mathcal{Q}_{t+1 \to t}$ be the optimal reverse transition operator induced by the ground-truth data distribution. Clearly, $P_t = \mathcal{Q}_{t+1 \to t} P_{t+1}$. Then:
\[\Delta_t = D_{KL}\left( \mathcal{Q}\_{t+1 \to t} P_{t+1} \big\| \mathbf{T}\_{t+1 \to t}^M \hat{P}\_{t+1} \right)\]By introducing an intermediate term $\mathbf{T}_{t+1 \to t}^M P_{t+1}$ and utilizing the triangle inequality of divergence, the following upper bound can be derived:
\[\Delta_t \le \underbrace{D_{KL}(\mathcal{Q}_{t+1 \to t} P_{t+1} \| \mathbf{T}_{t+1 \to t}^M P_{t+1})}_{\color{red}{\text{Single-step local error } \mathcal{E}_{\text{loc}}}} + \underbrace{D_{KL}(\mathbf{T}_{t+1 \to t}^M P_{t+1} \| \mathbf{T}_{t+1 \to t}^M \hat{P}_{t+1})}_{\color{red}{\text{History error propagation}}} \leq \mathcal{E}_{\text{loc}} + \Delta_{t+1}\]By optimizing $\mathbf{T}_{t+1 \to t}^M$, we can simultaneously minimize individual local errors, thereby reducing $\Delta_t$.
We first formalize the concept of “intrinsic order” of data. Let the complete sequence be
\[\mathbf{x}_0 = (x_0^1, x_0^2, \dots, x_0^L) \sim p_{\text{data}}\]For any subset $S \subset {1,\dots,L}$, we define its conditional generation difficulty as:
\[\mathcal{H}(S) \;\triangleq\; \mathbb{E}_{\mathbf{x}_0 \sim p_{\text{data}}} \left[ H\big( \mathbf{x}_0^S \mid \mathbf{x}_0^{\bar S} \big) \right]\]This quantity characterizes the uncertainty of the tokens in the set $S$ given that the remaining tokens are known.
Thus, the data distribution $p_{\text{data}}$ induces a family of equivalent optimal generation orders:
\[\mathcal{O}^* = \arg\min_{\pi \in \mathfrak{S}_L} \sum_{k=1}^L \mathcal{H}\big(\{\pi_k\} \mid \{\pi_1,\dots,\pi_{k-1}\}\big)\]where $\pi$ is a permutation (generation order) of the tokens.
However, defining a single or fixed optimal generation order at the distribution level is unreasonable in practice, as different samples $\boldsymbol{x}_0$ often correspond to different $\pi^*(\boldsymbol{x}_0)$. This phenomenon reflects the diversity of generation orders ubiquitously existing in real-world discrete data. For a specific sample $x_0\sim p_{data}$, its optimal generation order generally depends on the embodied local structure and dependency relationships, which can be formalized as:
\[\pi^*(\boldsymbol{x}_0)=\operatorname*{arg\,min}_{\pi\in \mathfrak{S}_L }\sum_{k=1}^LH\bigg(x_0^{\pi_k}\mid \boldsymbol{x}_0^{\{\pi_1,\cdots,\pi_{k-1}\}}\bigg)\]In standard MDM training, the noise scheduler commonly employs a uniform sampling strategy, assuming that all tokens have an equal probability of being masked. This implies that the model is forced to learn the average of all possible generation permutations $\pi \in \mathfrak{S}_L$ during training.
However, for data with strong structural dependencies, there is a tremendous difference in the learning difficulty of different generation paths. We can decompose the expected loss during training into a permutation-based weighted sum:
\[\mathcal{L}_{\text{total}} \approx \mathbb{E}_{\mathbf{x}_0} \mathbb{E}_{\pi \sim U(\mathfrak{S}_L)} \left[ \sum_{k=1}^L -\log p_\theta(x_0^{\pi_k} \mid \mathbf{x}_0^{\{\pi_1, \dots, \pi_{k-1}\}}) \right]\]Due to the limited parameter space, the model is unable to fit the marginal distributions of all $L!$ paths in polynomial time. The uniform scheduler forces the model to predict high-entropy, highly-dependent tokens at early time steps, which leads to gradient conflicts in the optimization objective.
This conflict not only slows down the convergence speed but also theoretically lowers the upper bound of the model’s achievable performance. We can define the Optimization Gap, denoted as $\Delta \mathcal{L}_{\text{order}}$, to quantify the additional complexity introduced by neglecting the intrinsic order of the data:
\[\begin{aligned} \Delta \mathcal{L}_{\text{order}} = \mathbb{E}_{\mathbf{x}_0} \left[ \sum_{k=1}^L \left( \underbrace{\mathbb{E}_{\pi \sim U} [H(x_0^{\pi_k} \mid \mathbf{x}_0^{\pi_{\lt k}})]}_{\color{purple}{\text{Avg. Complexity (Uniform)}}} - \underbrace{H(x_0^{\pi^*_k} \mid \mathbf{x}_0^{\pi^*_{\lt k}})}_{\color{purple}{\text{Min. Complexity (Optimal)}}} \right) \right] \end{aligned}\]where $\pi^*$ represents the optimal generation order that minimizes the sum of sequence conditional entropies. The presence of $\Delta \mathcal{L}_{\text{order}}$ reveals a core contradiction: the model not only has to learn “easy-to-learn paths” that comply with causality ($P(\text{Result}\mid\text{Cause})$), but is also forced to consume a massive amount of parameters to fit counter-intuitive reverse or disordered paths $P(\text{Cause}\mid\text{Result})$.
This phenomenon triggers the problem of capacity dilution. According to Rate-Distortion Theory, if we consider model parameters $\theta$ as an information channel $C_\theta$ with limited capacity, the uniform sampling strategy artificially elevates the required information encoding rate of the task from $R_{\text{opt}}$ to $R_{\text{uni}}$. When $R_{\text{uni}} > C_\theta$, the model cannot perfectly fit this high-entropy mixture distribution, inevitably yielding irreducible distortion.
In other words, $\Delta \mathcal{L}_{\text{order}}$ essentially measures how much model capacity is wasted on fitting invalid noisy paths. This passive dispersion of resources leads to a reduction in the available parameters when the model deals with truly important forward dependencies, thus coercing the joint distribution $P_\theta(\mathbf{x}_0)$ learned by the model to exhibit a smoothed characteristic, making it difficult to capture the originally sharp multimodal structures in the data distribution. Therefore, designing a Scheduler that can approximate $\pi^*$ is essentially conducting a form of curriculum learning, aiming to eliminate the entropy surge brought by $\Delta \mathcal{L}_{\text{order}}$, thereby approaching a higher performance upper bound with a limited parameter scale.
During the inference phase, when the denoising path selected by the noise scheduler $\pi$ violates the intrinsic order $\pi^*$ of the data, it triggers a cascading accuracy collapse.
When the scheduler forces the model to prioritize forecasting downstream tokens in the absence of context, the model confronts a high-entropy distribution.
Suppose $x^A \to x^B$ is the true causal dependency. If $\pi$ reveals $x^B$ first, the model has to sample from the marginal distribution:
\[x^B \sim p_\theta(x^B \mid \emptyset) = \sum_{x^A} p(x^B \mid x^A)p(x^A)\]Compared to the conditional distribution $p(x^B\mid x^A)$, this distribution is extremely flat, making the sampling highly susceptible to hitting low-probability or incorrect tokens.
Let $\mathcal{M}_t$ denote the set of masked tokens at time step $t$, and $c = x_{t+1}$ be the current context. In the step $t+1 \to t$, we need to unmask a subset $\mathcal{U} \subset \mathcal{M}_{t+1}$. The model approximates the true distribution $P(x_\mathcal{U} \mid c)$ using an independent distribution $Q_\theta(x_\mathcal{U} \mid c) = \prod_{i \in \mathcal{U}} Q_\theta(x_i \mid c)$. Our objective to minimize is $D_{\text{KL}}(P \Vert Q_\theta)$. By introducing the product of true marginal distributions $\prod_{i \in \mathcal{U}} P(x_i \mid c)$ as an intermediate term, we can perfectly decompose this KL divergence:
\[\begin{aligned} D_{\text{KL}}(P \| Q_\theta) &= \mathbb{E}_{x \sim P} \left[ \log \frac{P(x_\mathcal{U}\mid c)}{\prod_{i \in \mathcal{U}} Q_\theta(x_i\mid c)} \right] \\ &= \mathbb{E}_{x \sim P} \left[ \log \left( \frac{P(x_\mathcal{U}\mid c)}{\color{blue}{\prod_{i \in \mathcal{U}} P(x_i\mid c)}} \cdot \frac{\color{blue}{\prod_{i \in \mathcal{U}} P(x_i\mid c)}}{\prod_{i \in \mathcal{U}} Q_\theta(x_i\mid c)} \right) \right] \\ &= \underbrace{\mathbb{E}_{x \sim P} \left[ \log \frac{\prod_{i \in \mathcal{U}} P(x_i\mid c)}{\prod_{i \in \mathcal{U}} Q_\theta(x_i\mid c)} \right]}_{\color{green}{\text{Term 1: Sum of Marginal Errors}}} + \underbrace{\mathbb{E}_{x \sim P} \left[ \log \frac{P(x_\mathcal{U}\mid c)}{\prod_{i \in \mathcal{U}} P(x_i\mid c)} \right]}_{\color{green}{\text{Term 2: Total Correlation (Dependency)}}} \end{aligned}\]Term 1: Sum of Marginal Errors (corresponding to standard CE Loss)
\[\text{Term 1} = \sum_{i \in \mathcal{U}} D_{\text{KL}}(P(x_i\mid c) \| Q_\theta(x_i\mid c))\]Since $D_{\text{KL}}(P_i \Vert Q_i) = \text{CrossEntropy}(P_i, Q_i) - H(P_i)$, and $H(P_i)$ is constant with respect to the model parameters $\theta$, minimizing this term is entirely equivalent to minimizing the standard Cross Entropy Loss.
Term 2: Total Correlation Error
\[\text{Term 2} = D_{\text{KL}}\left( P(x_\mathcal{U}\mid c) \bigg\| \prod_{i \in \mathcal{U}} P(x_i\mid c) \right) = \left( \sum_{i \in \mathcal{U}} H(x_i\mid c) \right) - H(x_\mathcal{U}\mid c)\]This measures the degree to which the variables deviate from independence. If the tokens in $x_\mathcal{U}$ are mutually conditionally independent given $c$, this term is 0. The stronger the dependency, the larger this term becomes.
Based on the mathematical decomposition of Total Correlation (TC) above, we can gain a profound understanding of the core mission of the Noise Scheduler during the inference phase. Because the model parameters $\theta$ are fixed during inference, the upper bound of Term 1 (Marginal Errors) has already been determined at the end of training. Therefore, the improvement of inference quality depends entirely on whether Term 2 (Total Correlation) can be minimized through the strategic selections of the Scheduler.
In other words, during the inference process, the Scheduler is no longer merely a simple progress bar controller (deciding how many tokens to unmask), but instead plays the role of a “dependency decoupler.”
Recalling the previous derivation of Exposure Bias, the accumulation of the total error $\Delta_t$ originates from the local error $\mathcal{E}_{\text{loc}}$ at each step. In the $t+1 \to t$ stage of inference, given the context $c = \mathbf{x}_{t+1}$, the Scheduler $\pi$ first decides the subset to unmask $\mathcal{U} \sim \pi(\cdot\mid c)$, which is subsequently filled by the model $p_\theta$. We re-formalize the single-step local error $\mathcal{E}_{\text{loc}}$ as an expectation over the policy $\pi$:
\[\mathcal{E}_{\text{loc}}(\pi, \theta) = \mathbb{E}_{\mathcal{U} \sim \pi(\cdot \mid c)} \left[ D_{KL}\Big(~P(x_\mathcal{U} \mid c) ~\Big\| \underbrace{\prod_{i \in \mathcal{U}} p_\theta(x_i \mid c)}_{\text{Independence Assumption}} \Big) \right]\]We can further decompose this error into two parts, thereby clearly separating the influences of Intrinsic Order and Independence:
\[\mathcal{E}_{\text{loc}}(\pi, \theta) = \mathbb{E}_{\mathcal{U} \sim \pi} \left[ \underbrace{\sum_{i \in \mathcal{U}} D_{KL}(P(x_i\mid c) \| p_\theta(x_i\mid c))}_{\color{orange}{\text{(I) Marginal Precision Error}}} + \underbrace{\mathbb{I}(\lvert\mathcal{U}\rvert > 1) \cdot \text{TC}(x_\mathcal{U} \mid c)}_{\color{orange}{\text{(II) Structural Dependency Error}}} \right]\]where $\mathbb{I}(\cdot)$ is the indicator function, emphasizing that term (II) only exists when multiple tokens are unmasked in a single step.
In summary, the intrinsic order and independence of data simultaneously affect training and inference. When the number of decoded tokens in a single step $\mathcal U$ is large, independence will significantly restrict model accuracy, but at the same time it reduces the violation of the data’s intrinsic order; conversely, when $\mathcal U$ is very small, the impact of independence is greatly weakened, but the restriction of the data’s intrinsic order is enhanced. By designing a rational mask/unmask policy $\pi$ such that model training and inference can conform to the inherent order of the data, while mitigating the impact of Independence as much as possible, we can significantly improve the model’s training efficiency, model capacity, and alleviate exposure bias.
The random strategy serves as the baseline implementation of MDMs (such as MDLM, MD4, SEDD)
Training During training, the time step $t \in [0, 1]$ is sampled uniformly, with a corresponding mask ratio of $1 - \alpha_t$. For any position $i$ in the sequence, its probability of being selected is independent:
\[\pi(M \mid \mathbf{x}_0) = \binom{L}{k}^{-1}, \quad \text{where } k \approx (1-\alpha_t)L\]At this point, the loss function $\mathcal{L}_{\text{vb}}$ degrades into a uniform expectation over all possible permutations $\mathfrak{S}_L$.
Inference At each inference step $t+1 \to t$, the policy $\pi$ randomly selects a fixed proportion (determined by the scheduler) of positions from the current mask set to unmask:
\[M_{t+1} \subset \{i \mid x_{t+1}^i = \text{m}\}, \quad \text{s.t. } \lvert M_{t+1} \rvert = \lfloor (\alpha_t - \alpha_{s})L \rfloor\]The selection process is independent of the context $c$, i.e., $\pi(M \mid \mathbf{x}_{t+1}) = \pi(M)$. The model independently predicts the probability distribution of all masked positions at each step:
\[x_t^i \sim p_\theta(x^i \mid \mathbf{x}_{t+1}) \quad \forall i \in M_{t+1}\]Unlike random strategies, confidence-based strategies (Train for the worst, Plan for the best, MGDM)
Inference ($\pi_{\text{inf}}$):
At step $t+1 \to t$, the $s_i$ of all masked positions is calculated, and the top $k$ positions with the highest scores are selected:
\[M_{t+1} = \text{arg-topk}_{i} \{s_i\}, \quad \text{where } k = \lfloor (\alpha_t - \alpha_s)L \rfloor\]The corresponding generation process is:
\[\pi(M \mid \mathbf{x}\_{t+1}) = \begin{cases} 1 & \text{if } M = M_{t+1} \\ 0 & \text{otherwise} \end{cases}\]To preserve a certain degree of randomness, a topk ratio can also be designed to control the proportion of topk selections, meaning a portion of topk tokens and a portion of randomly selected tokens are placed into $M_{t+1}$.
Training
In current works, there is no explicit implementation of a confidence-based policy for masking during training. However, some works implement preference training for specific tokens or specific generation orders by reweighting the loss function. For instance, in MGDM
where $v(\mathbf{x}_{t,n}) = \alpha(1 - \exp(-u(\cdot)))^\beta$ is an adaptive token-level reweighting term. When $\beta > 0$, this term reduces the weight of easy-to-learn tokens (low loss) and amplifies the weight of hard-to-learn tokens (high loss). From an optimization objective perspective, this training strategy aims to improve global training efficiency by “forcing” the model to focus on those hard samples that are currently inaccurately predicted.
Advantages:
Shortages:
In this section, we introduce several schedulers learned by the model to reduce inductive bias and obtain better generalization.
Paper: THINK WHILE YOU GENERATE: DISCRETE DIFFUSION WITH PLANNED DENOISING (ICLR 2025)
DDPD consists of two independent neural networks:
Planner ($p_\theta(z_t^d \mid x_t)$): Executes a binary classification task to predict whether each dimension $d$ in the sequence is corrupted (i.e., in a noise state $z_t^d = N$). It only outputs one logit for each dimension.
Denoiser ($p_{1\mid t}^\theta(x_1^d \mid x_t, z_t^d = N)$): Predicts the true data distribution conditioned on the corresponding position being classified as noise. DDPD no longer relies on a preset time step but inversely deduces the actual time through the global noise level estimated by the planner: $\tilde{t} = \alpha_t^{-1}(1 - \sum_{d’} p_\theta(z_t^{d’} = N \mid x_t) / D)$, and uses the actual time as the time embedding for the denoiser.
Advantages:
Shortages:
RL-Policy Planer aims to model the denoising process of discrete diffusion models as a Markov Decision Process, using reinforcement learning to learn the optimal denoising trajectory or unmask policy. Unlike heuristic-based schedulers (such as random or confidence sampling), RL-Policy can adaptively adjust the generation path based on the final reward of downstream tasks (e.g., accuracy of mathematical problems, logical correctness of reasoning), thereby exhibiting superiority in more complex generation scenarios.
Paper: IMPROVING DISCRETE DIFFUSION UNMASKING POLICIES BEYOND EXPLICIT REFERENCE POLICIES (ICLR 2026)
The core idea of this architecture is a clear division of labor: the underlying backbone already possesses powerful conditional probability modeling capabilities, while the Policy Head acts as a scheduler, specifically responsible for evaluating the unmasking priority of each position based on the current context.
Implementation:
Advantages:
Shortages:
Paper: MDPO: OVERCOMING THE TRAINING-INFERENCE DIVIDE OF MASKED DIFFUSION LANGUAGE MODELS
Unlike the aforementioned schemes, MDPO does not alter the logic of position selection (still following the confidence-based heuristic) but directly optimizes the backbone itself through RL training, enabling its output logits to more perfectly adapt to the scheduling trajectory during inference.
Implementation
End-to-end Optimization: The backbone parameters $\theta$ are trainable. In the RL stage, the model performs sampling generation according to the confidence-based policy used during inference.
RL simulates the “progressive unmasking” path during the inference process. By comparing the cumulative rewards of different trajectories (such as using Group Relative Policy Optimization, GRPO), the model is forced to produce more accurate predictions under the current inference distribution.
The RL objective increases the logit values of tokens on the correct path and suppresses the values of erroneous/interfering paths. This ensures the model’s confidence scores during inference are no longer merely a byproduct of the training loss, but a heuristic function equipped with authentic navigational capabilities.
Advantages:
Shortages:
In this section, we introduce several manually selected schedulers.
Paper: SPMDM: Enhancing Masked Diffusion Models through Simplifying Sampling Path
Unlike traditional global random scheduling, SPMDM asserts that an efficient sampling path should possess two characteristics: local priority (prioritizing the unmasking of masks near already solved tokens) and logical sequence priority (prioritizing the unmasking of early subsequences in the reasoning chain).
Insight
Training Method: Subsequence Partitioning and Non-uniform Noise
Target: Optimized through subsequence-level NELBO, forcing the model to predict a specific subsequence $x^k$ given the context $x_t^{-k}$:
\[\mathcal{L}_{NELBO}=\sum_{k=1}^{K}\mathbb{E}_{t_{k}\sim[0,1],q_{t_{k}/0}}\left[\frac{\alpha_{t_{k}}^{\prime}}{1-\alpha_{t_{k}}}\log p_{\theta}(x^{k}\mid x_{t_{k}}^{k},{x_{t}}^{-k})\right]\]Paper: BLOCK DIFFUSION: INTERPOLATING BETWEEN AUTOREGRESSIVE AND DIFFUSION LANGUAGE MODELS
BD3-LMs propose a new paradigm that interpolates between autoregressive (AR) modeling and discrete diffusion modeling. Its core motivation lies in overcoming the limitations of standard diffusion models in generating sequences of arbitrary length, inference efficiency (KV Caching), and likelihood modeling quality (Perplexity) by introducing block-wise structures.
BD3-LM partitions a sequence of length $L$ into $B$ blocks of length $L’$. The model follows an autoregressive decomposition across blocks, while executing a discrete diffusion process within each block:
\[\log p_{\theta}(x) = \sum_{b=1}^{B} \log p_{\theta}(x^{b} \mid x^{\lt b})\]| Schedulers | Policy $\pi(M \mid c)$ | Sampler $p_{\theta}$ | Advantages | Shortages |
|---|---|---|---|---|
| Random Schedulers | $\pi(\mathcal{U}) = \displaystyle\frac{1}{\binom{ \vert \mathcal{M}_{t+1}\vert}{k}}$ | Average distribution of all possible permutation paths $\mathfrak{S}_L$ | Closed-form solution for the forward process; No need for additional logit sorting or planner forward computation during sampling | Squanders training resources; Ignores intrinsic order, leading to severe error accumulation |
| Confidence-based Schedulers | $\mathcal{U} = \text{arg-topk}_{i \in \mathcal{M}_{t+1}} { s_i \mid s_i = \max_{x \in \mathcal{X}} \log p_\theta(x^i = x \mid \mathbf{x}_{t+1}) }$ $\pi(\mathcal{U} \mid \mathbf{x}_{t+1}) = \mathbb{1}(\mathcal{U} = \text{arg-topk})$ | Usually introduces token-level reweighting during training to make the model delineate boundaries for low-confidence (hard) samples more clearly. | Adaptively adjusts generation order according to different input samples | Insufficient model training or overly large vocabulary leads to noisy logit sorting; Cannot resolve Total Correlation errors |
| Planer-based Schedulers | $\mathcal{U} = { i \mid p_\phi(z^i = N \mid \mathbf{x}_{t+1}) < \tau }$ | $p_\theta$ is only responsible for dimensions selected by $p_\phi$, and its input usually contains the real-time time step $\tilde{t}$ estimated by the planner. | Supports Remasking; Avoids capacity dilution; | Generates token by token with remasking, resulting in low efficiency; |
| RL-Policy Schedulers | $\pi_\phi(i \mid \mathbf{x}_{t+1}) = \text{Softmax}(h_\phi(\mathbf{x}_{t+1}))$ | Training objective shifts from simple maximum likelihood to preferring high-reward paths | Consistency in training and inference; High performance upper bound | Complex training |
| Manual/Block Schedulers | $\pi(\mathcal{U}) \implies \mathcal{U} \subset \text{Block}_k,\ \text{where } k \text{ is determined by preset order or block noise } t_k$ | Learns locally strongly coupled semantics | Significantly reduces $\text{TC}(x_\mathcal{U} \mid c)$ | Fixed block partition may fail to perfectly match the nonlinear logical dependencies of data |
From a unified perspective, the Noise Scheduler in a Mask Diffusion Model is fundamentally a "Structural Regulator" rather than a time controller in the traditional sense. By determining when, what to unmask, and how many tokens to unmask at once, it directly shapes the conditional distributions confronted by the model during training and inference, thereby dictating the tradeoff among three core errors:
Analysis reveals that these three issues are not independent variables; rather, they are jointly regulated along the same error tradeoff curve through the strategic choices of the Scheduler:
In conclusion, RL-Policy Schedulers comprehensively address the constraints of intrinsic order and total correlation errors through rewards. Furthermore, because their sampling trajectories adhere seamlessly to the learned policy, they manifest enhanced training-inference consistency, ultimately tending toward optimal generation performance.
Here are some more articles you might like to read next: