Constrained Decoding of Diffusion LLMs
with Context-Free Grammars


  • Our work enables constrained decoding for diffusion LLMs and multi-region infilling.
  • Given a formal syntax, for example of JSON or C++, our method guarantees that the LLM adheres to this syntax, even when outputs have many infilling regions or are generated out-of-order.
  • We demonstrate a consistent improvement in functional correctness of up to 7% (1-2% on average), at a small computational overhead.

Different generation paradigms

Classical LLM generation is strictly left-to-right: the model generates the next token to append to a given fixed prefix (PRE). This allowed constrained decoding methods to rely on techniques that determine whether a partial output is a valid prefix of some word in the target language.

Multi-Region Infilling complicates the setting by fixing not only a prefix, but also a suffix (FIM) and regions in between (MRI). The constraining method therefore also needs to take the fixed regions into account when determining the admissibility of a partial output.

Diffusion LLMs (DLM) go further and generate tokens in any order, i.e., they may choose to generate a token at the end of the output or in its middle. The constraining method therefore needs to handle arbitrary updates of a partial output, which classical constrained decoding methods cannot do. The sketch below illustrates the three settings.
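To make the three settings concrete, here is a minimal Python sketch; the names HOLE and fill are ours for illustration, not the paper's implementation. A partial output is represented as a sequence of fixed tokens and open positions, and a diffusion step may fill any open position.

    # Illustrative sketch; HOLE and fill are our names, not the paper's API.
    # A partial output is a sequence of fixed tokens and open ("hole") positions.
    HOLE = None  # placeholder for a not-yet-generated token

    # Prefix completion (PRE): everything after the prompt is open.
    pre = ["int", "main", "(", ")", "{"] + [HOLE] * 8

    # Fill-in-the-middle (FIM): fixed prefix and suffix, one open region.
    fim = ["int", "main", "(", ")", "{"] + [HOLE] * 6 + ["}"]

    # Multi-region infilling (MRI) and diffusion: several fixed regions and holes.
    mri = ["int", HOLE, "(", ")", "{", HOLE, HOLE, ";", HOLE, "}"]

    def fill(partial, position, token):
        """A diffusion LLM may fill any open position, not just the leftmost."""
        assert partial[position] is HOLE
        updated = list(partial)
        updated[position] = token
        return updated

    # e.g. the model may choose to fill position 5 before position 1:
    step = fill(mri, 5, "return")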

How does the method work?

Overview
We check partial output validity by computing the intersection of the target language's context-free grammar (CFG) with the language of all possible completions.

We first generalize the formal framework of constrained decoding by adapting the standard constrained decoding algorithm to unordered updates of a partial output with arbitrarily many infilling regions. While designed for diffusion LLMs, this naturally subsumes multi-region infilling, fill-in-the-middle and classical prefix-completion. The decoding process is illustrated in the picture above. The model iteratively generates proposals to insert a token in a specific location of the partial output, for example foo() into the first infilling region. We verify the proposal's validity by intersecting the target language's CFG with the language of all possible partial output completions. This intersection is non-empty if and only if a valid completion exists.
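A simplified version of this decoding loop might look as follows. This is a hedged sketch reusing HOLE and fill from the sketch above; model.propose and has_valid_completion are placeholders for the model's proposal step and the grammar-based check described below, not the paper's actual API.

    # Hedged sketch of the generalized constrained decoding loop.
    def constrained_decode(model, grammar, partial):
        while any(tok is HOLE for tok in partial):
            # The model proposes candidate tokens for some open position,
            # in any order it likes (end, middle, ...).
            position, candidates = model.propose(partial)
            for token in candidates:                # candidates sorted by score
                candidate = fill(partial, position, token)
                # Accept only if some completion of `candidate` lies in the
                # target language, i.e. the intersection of the grammar with
                # the completion language of `candidate` is non-empty.
                if has_valid_completion(grammar, candidate):
                    partial = candidate
                    break
            else:
                raise RuntimeError("no admissible token at this position")
        return partial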

A key technical challenge in this approach is efficiently deciding emptiness of the language intersection. We first show that the language of all possible completions is a regular language, which enables standard formal-language operations to construct the intersection. The intersection language is, however, prohibitively large. We address this with size-reducing optimizations, such as a custom normal form, and by searching the intersection language implicitly, avoiding the need to generate the entire language and, in particular, its non-generating symbols.

Demo

The demo below is based on LLaDA 8B, a diffusion model, generating C++ code for a HumanEval task in our dataset. We highlight all rejected tokens, i.e., tokens correctly determined to violate the syntax during generation.
Task: Given a non-empty vector of integers, return the sum of all of the odd elements that are in even positions.

We formalize the constrained decoding problem for multi-region infilling and diffusion LLMs, shown next to this text. The formalization allows us to reason about the requirements of an incremental verifier for constraining diffusion LLMs and multi-region infilling. We define a formal infilling problem as the problem of deciding whether an output with infilling regions can be infilled to be a valid output of a target language. We then show how to reduce this problem to a standard intersection emptiness problem, which is well-studied in the literature.
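As a rough paraphrase of this formalization (the notation here is ours and may differ from the paper's), the infilling problem and its reduction to intersection emptiness can be stated as:

    % Illustrative notation, not necessarily the paper's.
    % A partial output with k infilling regions (each w_i a fixed token sequence):
    \[
      w \;=\; w_0 \,\square\, w_1 \,\square\, \cdots \,\square\, w_{k-1} \,\square\, w_k
    \]
    % Infilling problem: does a valid completion into the target language L(G) exist?
    \[
      \exists\, u_1, \dots, u_k \in \Sigma^{*}:\quad
      w_0\, u_1\, w_1\, u_2 \cdots u_k\, w_k \;\in\; L(G)
    \]
    % Reduction: the completion language
    %   R = \{\, w_0 u_1 w_1 \cdots u_k w_k \mid u_1, \dots, u_k \in \Sigma^{*} \,\}
    % is regular, so the question above is exactly the intersection emptiness
    % question  L(G) \cap R \neq \emptyset.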

Generalized constrained decoding algorithm

The intersection we compute is the intersection of two constraints on the model output: the target language's context-free grammar (CFG) and the language of all possible completions of the partial output. The latter is a regular language, which can be described by a non-deterministic finite automaton (NFA), like the one shown below. We show how to construct this NFA from the partial output and the target language's CFG, and how to efficiently check whether the intersection of the two languages is empty.

Constructed NFA from intersection
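A minimal construction of such a completion NFA might look as follows. This is a sketch under our own simplified representation, not the paper's implementation: fixed regions become chains of labeled edges, infilling regions become self-loops accepting any terminal; the names ANY, NFA, and completion_nfa are ours.

    # Hedged sketch: build an NFA accepting all completions of a partial output.
    # Whether an infilling region stands for exactly one token or an arbitrary
    # token sequence depends on the setting; here we use Sigma* for simplicity.
    from dataclasses import dataclass, field

    ANY = "<any>"  # wildcard edge label standing for "any terminal"

    @dataclass
    class NFA:
        start: int = 0
        accept: int = 0
        # transitions[state] is a list of (label, next_state) pairs
        transitions: dict = field(default_factory=dict)

        def add_edge(self, src, label, dst):
            self.transitions.setdefault(src, []).append((label, dst))

    def completion_nfa(regions):
        """regions alternates fixed token lists and the marker 'HOLE*'."""
        nfa, state = NFA(), 0
        for region in regions:
            if region == "HOLE*":           # infilling region: Sigma* self-loop
                nfa.add_edge(state, ANY, state)
            else:                           # fixed region: one edge per token
                for token in region:
                    nfa.add_edge(state, token, state + 1)
                    state += 1
        nfa.accept = state
        return nfa

    # e.g. fixed prefix, open middle, fixed suffix (fill-in-the-middle):
    nfa = completion_nfa([["int", "main", "(", ")", "{"], "HOLE*", ["}"]])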

Computing the intersection of a CFG and a regular language is a well-studied problem in formal language theory, but it is expensive: the intersection can itself be described by a CFG, yet the construction requires the DFA and CFG to be in normal form and, in the general case, produces a cubic number of productions. DFAs can be minimized efficiently, but no general minimization is possible for CFGs. We therefore apply a number of optimizations to reduce the size of the intersection, such as using a normal form obtained with a linear rather than quadratic increase in symbols, left factoring, and removing non-generating symbols. Crucially, we then search implicitly over the intersection language, avoiding the need to generate the entire language and, in particular, its non-generating symbols.
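For intuition, here is a naive fixpoint sketch of the emptiness check over the intersection. This is the textbook Bar-Hillel-style saturation, not the paper's optimized implicit search; the grammar is assumed to be in Chomsky normal form and is encoded as a dict from nonterminals to right-hand sides, and the NFA and ANY label are reused from the sketch above.

    # Naive fixpoint sketch of intersection emptiness; not the paper's optimized
    # implicit search. Grammar: dict mapping nonterminals to lists of right-hand
    # sides, each either ("a",) for a terminal or ("B", "C") for two nonterminals.
    def intersection_nonempty(grammar, nfa, start_symbol):
        states = range(nfa.accept + 1)

        def terminal_steps(p, a):
            # NFA states reachable from p by reading terminal a (ANY matches all).
            return {q for label, q in nfa.transitions.get(p, [])
                    if label in (a, ANY)}

        # Least fixpoint: collect triples (A, p, q) such that A derives some
        # terminal string driving the NFA from state p to state q.
        derivable, changed = set(), True
        while changed:
            changed = False
            for A, rules in grammar.items():
                for rhs in rules:
                    for p in states:
                        if len(rhs) == 1:                       # A -> 'a'
                            new = {(A, p, q) for q in terminal_steps(p, rhs[0])}
                        else:                                   # A -> B C
                            B, C = rhs
                            new = {(A, p, q)
                                   for r in states if (B, p, r) in derivable
                                   for q in states if (C, r, q) in derivable}
                        if not new <= derivable:
                            derivable |= new
                            changed = True

        # Non-empty intersection iff the start symbol derives a string taking
        # the NFA from its start state to its accepting state.
        return (start_symbol, nfa.start, nfa.accept) in derivable

In terms of the earlier loop, has_valid_completion(grammar, partial) would amount to intersection_nonempty(grammar, completion_nfa(...), start_symbol). The expensive part is exactly what this naive sketch does not optimize: saturating all (nonterminal, state, state) triples up front, which the implicit search avoids.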

What are the results of the method?

We observe a drastic increase in syntactic correctness, with 99.5% of the generated solutions being syntactically correct. This leads to an increase in functional correctness of up to 7% for the best-performing model. We also show that our method is efficient, less than doubling the decoding time compared to standard methods.

Results on MRI

We evaluate a number of recent code infilling models on a HumanEval translation into C++. As the table above shows, our method increases syntactic correctness significantly across all models and numbers of infilling regions, recovering a syntactically valid completion in 95.8% of instances. In the lower half of the table, we observe that constraining consistently increases functional correctness, on average by 2.8%. This is expected, as syntactically incorrect completions cannot be functionally correct and are effectively prevented by our method.

Results on DLM

We further design experiments for state-of-the-art diffusion LLMs on C++ HumanEval tasks, JSON-Schema-compliant data extraction, and representing chemical molecules in SMILES notation. As the table shows, our method consistently increases syntactic correctness for all tasks and models, recovering failed instances and increasing syntactic correctness to 99.2%. In JSON Schema extraction, constrained decoding with completion achieves 100% syntactic correctness.

As shown in the lower half of the table, the positive effect of constraining on functional correctness is also present for DLMs, with an average increase in functional correctness of 2.2%. Notably, Dream 7B performance on JSON Schema extraction increases by 6.9%. In the SMILES setting, where models perform very poorly at only 1.5% average correctness, syntactic constraints are not able to improve functional correctness significantly, achieving only a modest average increase of 0.2%.

Citation

@article{muendler2025constraineddiffusion,
        title={Constrained Decoding of Diffusion LLMs with Context-Free Grammars},
        author={Niels Mündler and Jasper Dekoninck and Martin Vechev},
        year={2025},
        eprint={2508.10111},
        archivePrefix={arXiv},
        url={https://arxiv.org/abs/2508.10111},
}