This blog post investigates the possibility of parallel decoding for autoregressive models. The author notes that autoregressive and diffusion models both fundamentally model data probability distributions, and that each has advantages—autoregressive models in training and diffusion models in sampling. The goal is to achieve a training-free way to perform parallel decoding with a pretrained autoregressive model, enabling low-cost accelerated generation.
In recent years, the rapid rise of artificial intelligence has been driven not only by advances in discriminative models but, more fundamentally, by the evolution of generative models — models that learn to represent and simulate the underlying probability distribution of data. From text and images to audio and 3D scenes, the essence of generation lies in one universal goal: to capture the complexity of real-world data distributions and to reproduce samples that are both coherent and diverse.
However, modeling such high-dimensional and structured distributions directly is intractable. Instead of storing an explicit probability function, modern generative models encode this distribution implicitly within their network parameters and decode it through learned stochastic processes. The diversity of generative paradigms — from autoregressive (AR) models to diffusion and energy-based models — stems primarily from the different ways they design and interpret this encoding–decoding process.
Autoregressive models, by factorizing joint probabilities into sequential conditionals, offer a simple and efficient training pipeline. Their major limitation, however, lies in the sequential nature of generation, which prevents parallel sampling. Diffusion models, in contrast, model the joint distribution directly through iterative denoising steps, enabling parallel decoding at the cost of heavier and slower training. Recent studies have begun to ask whether the strengths of the two paradigms can be combined.
This blog explores a conceptual bridge between these two paradigms — a path to train with autoregression and decode with diffusion. Since both families ultimately learn to approximate the same probability distribution, it may be possible to transfer or reinterpret an autoregressive model into a diffusion-like sampling mechanism without extensive retraining — a direction we tentatively refer to as AR2Diff. Early efforts, such as Parallel and Flexible Sampling from Autoregressive Models via Langevin Dynamics (ICML 2021), suggest that parallel sampling from pretrained autoregressive models is indeed feasible.
In the following sections, we will review the current landscape of generative modeling, connect autoregression and diffusion through the lens of score functions, and discuss how AR2Diff might be realized in both continuous and discrete spaces.
The autoregressive model (AR) is fundamentally grounded in the chain rule decomposition of the joint distribution $p(\mathbf{x})$. Let $\mathbf{x} = (x_1, x_2, \dots, x_D) \in \mathcal{X}^D$, where $\mathcal{X}$ may denote either a discrete vocabulary or a continuous real-valued space. The model factorizes the distribution as:
\[\begin{equation*} p(\mathbf{x}) = \prod_{i=1}^{D}p(x_i \mid x_{<i}),\quad \text{where } x_{<i}:= (x_1, \dots, x_{i-1}). \end{equation*}\]In parametric modeling, a neural network with parameters $\theta$ is employed to approximate each conditional distribution. For discrete data, this often takes the form of a softmax output:
\[p_\theta(x_i = k \mid x_{<i}) = \frac{\exp(f_\theta(x_{<i})_k)}{\sum_{k' \in \mathcal{V}} \exp(f_\theta(x_{<i})_{k'})},\]while for continuous variables, a Gaussian parameterization is common:
\[p_\theta(x_i \mid x_{<i}) = \mathcal{N}\big(x_i; \mu_\theta(x_{<i}), \sigma_\theta^2(x_{<i})\big).\]During training, maximum likelihood estimation (MLE) is used to minimize the negative log-likelihood:
\[\mathcal{L}_{\text{AR}}(\theta) = -\mathbb{E}_{\mathbf{x} \sim p_{\text{data}}} \left[ \sum_{i=1}^{D} \log p_\theta(x_i \mid x_{<i}) \right].\]This objective is fully parallelizable under teacher forcing, as all conditioning contexts $x_{<i}$ are taken from the ground-truth data. However, sampling remains inherently sequential: the generation process must proceed step-by-step,
\[x_i^{(s)} \sim p_\theta\big(x_i \mid x_1^{(s)}, \dots, x_{i-1}^{(s)}\big), \quad i = 1, \dots, D,\]resulting in inference latency that scales linearly with the data dimensionality $D$. More fundamentally, this factorization imposes a fixed ordering $\pi$ (typically the natural index order) on the variables, despite the fact that the true data distribution $p_{\text{data}}(\mathbf{x})$ possesses no intrinsic sequential structure. While this inductive bias facilitates tractable modeling, it may constrain the model’s ability to capture non-local, symmetric, or graph-structured dependencies that do not conform to a unidirectional causal chain.
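To make this contrast concrete, the sketch below shows both phases in PyTorch: a teacher-forced loss that scores every conditional $p_\theta(x_i \mid x_{<i})$ in a single parallel forward pass, and ancestral sampling that must call the network once per generated position. The `ar_model` interface (a causal network returning per-position next-token logits) and the `bos_id` token are illustrative assumptions, not part of any particular library.

```python
import torch
import torch.nn.functional as F

def ar_nll_loss(ar_model, x):
    """Teacher-forced negative log-likelihood.

    x: LongTensor of shape (B, D) holding ground-truth token ids.
    ar_model(prefix) is assumed to return logits of shape (B, L, V),
    where logits[:, i] parameterizes p_theta(x_{i+1} | x_{<=i}).
    """
    logits = ar_model(x[:, :-1])                 # one parallel pass over all prefixes
    targets = x[:, 1:]                           # next-token targets
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),     # (B*(D-1), V)
        targets.reshape(-1),                     # (B*(D-1),)
    )

@torch.no_grad()
def ar_sample(ar_model, bos_id, D, batch_size=1):
    """Ancestral sampling: one forward pass per generated position."""
    x = torch.full((batch_size, 1), bos_id, dtype=torch.long)
    for _ in range(D):
        logits = ar_model(x)[:, -1]              # conditional for the next position only
        probs = torch.softmax(logits, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1)
        x = torch.cat([x, next_tok], dim=1)      # latency grows linearly with D
    return x[:, 1:]
```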
Diffusion models construct a forward Markov chain that progressively corrupts data with noise, then learn the reverse process to enable generation. Let $\mathbf{x}_0 \sim p_{\text{data}}$. The forward process is defined as:
\[q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \quad q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \, \mathbf{x}_{t-1}, \beta_t \mathbf{I}),\]where $\beta_t \in (0,1)$ is a pre-specified noise schedule. This process admits a closed-form expression:
\[\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \, \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I}),\]with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
The reverse process aims to learn a sequence of conditionals $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ such that the resulting generative chain $p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ approximates the true data distribution. In practice, the reverse transitions are often parameterized as Gaussians, and training is performed by maximizing a variational lower bound (ELBO) or, more commonly, by direct noise regression:
\[\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \| \boldsymbol{\epsilon} - \epsilon_\theta(\mathbf{x}_t, t) \|^2 \right],\]where $\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \, \boldsymbol{\epsilon}$.
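The following is a minimal sketch of one such training step, assuming a hypothetical noise predictor `eps_model(x_t, t)` and a precomputed schedule `alpha_bar` containing $\bar{\alpha}_1, \dots, \bar{\alpha}_T$; it samples the closed-form forward marginal and regresses the injected noise.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, alpha_bar):
    """Noise-regression objective: || eps - eps_theta(x_t, t) ||^2.

    x0:        clean data, shape (B, ...)
    alpha_bar: cumulative products bar(alpha)_t, shape (T,)
    """
    B, T = x0.size(0), alpha_bar.size(0)
    t = torch.randint(0, T, (B,), device=x0.device)                 # uniform timestep
    a_bar = alpha_bar.to(x0.device)[t].view(B, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps            # closed-form forward sample
    return F.mse_loss(eps_model(x_t, t), eps)
```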
From the perspective of score matching, diffusion models equivalently learn the time-dependent score function:
\[s_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t),\]and sampling can be carried out by solving the reverse stochastic differential equation (SDE):
\[d\mathbf{x} = \left[ -\frac{1}{2} \beta(t)\, \mathbf{x} - \beta(t)\, s_\theta(\mathbf{x}, t) \right] dt + \sqrt{\beta(t)} \, d\bar{\mathbf{w}},\]where $\bar{\mathbf{w}}$ denotes a standard Wiener process running backward in time. This formulation reveals the essential nature of diffusion models: rather than explicitly modeling the density, they learn a vector field that steers samples toward high-probability regions of the data manifold.
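Sampling then amounts to integrating this reverse SDE backward in time, for example with a simple Euler-Maruyama scheme. The sketch below assumes a learned `score_model(x, t)` approximating $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ and a callable noise schedule `beta(t)`; both names are placeholders rather than any library's API.

```python
import torch

@torch.no_grad()
def reverse_sde_sample(score_model, beta, shape, T=1.0, n_steps=1000):
    """Euler-Maruyama integration of the VP reverse SDE, from t = T down to t ~ 0."""
    dt = T / n_steps
    x = torch.randn(shape)                        # start from the prior N(0, I)
    for k in range(n_steps, 0, -1):
        t = k * dt
        b = beta(t)
        score = score_model(x, t)
        drift = -0.5 * b * x - b * score          # reverse-time drift
        x = x - drift * dt                        # step backward in time (negative dt absorbed in sign)
        if k > 1:                                 # no noise on the final step
            x = x + (b * dt) ** 0.5 * torch.randn_like(x)
    return x
```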
This implicit, geometry-aware approach endows diffusion models with remarkable expressiveness and robustness to long-range dependencies—making them especially well-suited for structured data like natural images or 3D scenes. However, this flexibility comes at a cost: training requires optimization across multiple noise levels, resulting in significantly higher computational overhead compared to autoregressive models. Moreover, the continuous-time foundation of diffusion poses fundamental challenges when applied to discrete domains (e.g., text), where gradients are undefined and semantic continuity breaks down.
Masked diffusion models (MDMs) adapt the diffusion idea to discrete data by replacing Gaussian corruption with masking: the forward process progressively replaces tokens of $\mathbf{x}_0$ with a special [MASK] symbol,
\[x_{t,i} = \begin{cases} x_{0,i}, & \mathcal{M}_{t,i} = 1, \\ \texttt{[MASK]}, & \mathcal{M}_{t,i} = 0, \end{cases}\]
where $\mathcal{M}_t \in \{0,1\}^D$ is a binary mask vector satisfying $\mathbb{E}[|\mathcal{M}_t|_1] = (1 - \alpha_t) D$.
The reverse process trains a unified model $p_\theta(x_i \mid \mathbf{x}_t, t)$ to predict the original token at any masked position. The training objective is typically formulated as a masked cross-entropy loss:
\[\mathcal{L}_{\text{MDM}}(\theta) = -\,\mathbb{E}_{t, \mathbf{x}_0, \mathcal{M}_t} \left[ \sum_{i: \mathcal{M}_{t,i} = 0} \log p_\theta(x_{0,i} \mid \mathbf{x}_t, t) \right].\]Notably, at early timesteps (e.g., $t=1$) with high masking rates, $\mathbf{x}_1$ is nearly all [MASK], forcing the model to perform “cold-start” prediction with minimal context. In contrast, as $t \to T$, $\mathbf{x}_T \approx \mathbf{x}_0$, and the task reduces to self-supervised reconstruction. This progression naturally induces a multi-scale modeling hierarchy, progressing from global structure to fine-grained detail.
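A minimal sketch of one MDM training step under these definitions, with a hypothetical bidirectional `denoiser(x_t, t)` returning per-position logits and a `mask_id` token; for simplicity the schedule below simply ties the masking rate to $t/T$, matching the convention above (heavily masked at small $t$, nearly clean near $T$).

```python
import torch
import torch.nn.functional as F

def mdm_loss(denoiser, x0, mask_id, T=1000):
    """Masked cross-entropy: predict original tokens at masked positions only.

    x0: LongTensor (B, D) of clean token ids.
    denoiser(x_t, t) is assumed to return logits of shape (B, D, V).
    """
    B, D = x0.shape
    t = torch.randint(1, T + 1, (B,), device=x0.device)
    mask_rate = 1.0 - t.float() / T                                    # high masking rate at small t
    keep = torch.rand(B, D, device=x0.device) >= mask_rate[:, None]    # M_{t,i} = 1 means "keep"
    x_t = torch.where(keep, x0, torch.full_like(x0, mask_id))
    logits = denoiser(x_t, t)                                          # predict x_0 at every position in parallel
    nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), x0.reshape(-1), reduction="none"
    ).view(B, D)
    masked = ~keep
    return (nll * masked).sum() / masked.sum().clamp(min=1)            # average over masked positions only
```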
MDMs exhibit a deep formal connection to autoregressive models. In the limiting case where the masking strategy is fixed to “mask only the last position,” the MDM exactly recovers a standard AR model. Conversely, under fully random masking, the model must learn the conditional distribution over any subset of variables given the rest—that is, for any $\mathcal{S} \subset \{1,\dots,D\}$, it implicitly learns $p(\mathbf{x}_{\mathcal{S}^c} \mid \mathbf{x}_{\mathcal{S}})$. This capability far exceeds the unidirectional conditioning of AR models, yet the training objective retains the same semantic essence: predicting missing information from partial observations.
It is precisely this semantic equivalence coupled with structural disparity that provides a theoretical foundation for AR2Diff: if an autoregressive model has already internalized rich contextual dependencies through sequential training, can we reinterpret it as a denoiser in a mask-based iterative refinement loop, thereby enabling parallel sampling without retraining? The answer may lie in the dual relationship between conditional probabilities and score functions—an insight we will explore in the next section.
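To make the question concrete, here is a purely conceptual sketch of the kind of mask-based iterative refinement loop it envisions, in the spirit of mask-predict style decoding: start from an all-[MASK] canvas, predict every position in parallel, and commit the most confident predictions over a few rounds. The `model(x)` interface is hypothetical, and nothing here guarantees that a causally trained AR model would supply calibrated conditionals in this regime; closing that gap is precisely what AR2Diff would have to address.

```python
import torch

@torch.no_grad()
def iterative_parallel_decode(model, D, mask_id, n_iters=8, batch_size=1):
    """Conceptual mask-predict style loop: start fully masked, predict all
    positions in parallel, and progressively commit the most confident ones."""
    x = torch.full((batch_size, D), mask_id, dtype=torch.long)
    committed = torch.zeros(batch_size, D, dtype=torch.bool)
    for it in range(1, n_iters + 1):
        logits = model(x)                                    # (B, D, V): every position at once
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)                       # per-position confidence and argmax
        k = int(D * it / n_iters)                            # how many positions should be fixed by now
        conf = conf.masked_fill(committed, float("inf"))     # keep already-committed slots selected
        newly = torch.zeros_like(committed)
        newly.scatter_(1, conf.topk(k, dim=-1).indices, True)
        x = torch.where(newly & ~committed, pred, x)         # write predictions into newly committed slots
        committed |= newly
        x = torch.where(committed, x, torch.full_like(x, mask_id))  # re-mask the rest for the next pass
    return x
```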