Combining Constraint Programming Reasoning with Large Language Model Predictions

Florian Régin
Université Côte d’Azur, CNRS, I3S, France
[email protected]
   Elisabetta De Maria
Université Côte d’Azur, CNRS, I3S, France
[email protected]
   Alexandre Bonlarron
Université Côte d’Azur, Inria, France
[email protected]
Abstract

Constraint Programming (CP) and Machine Learning (ML) face challenges in text generation due to CP’s struggle with implementing “meaning” and ML’s difficulty with structural constraints. This paper proposes a solution by combining both approaches and embedding a Large Language Model (LLM) in CP. The LLM handles word generation and meaning, while CP manages structural constraints. This approach builds on GenCP, an improved version of On-the-fly Constraint Programming Search (OTFS) using LLM-generated domains. Compared to Beam Search (BS), a standard NLP method, this combined approach (GenCP with LLM) is faster and produces better results, ensuring all constraints are satisfied. This fusion of CP and ML presents new possibilities for enhancing text generation under constraints.

1 Introduction

How can we perceive Constraint Programming beyond its traditional role in solving combinatorial optimization problems? Once Eugene Freuder wrote Constraint programming represents one of the closest approaches computer science has yet made to the Holy Grail of programming: the user states the problem, the computer solves it [1].

Nevertheless, some real-world problems are still beyond the reach of the current CP paradigm. This is particularly true when real-world problems involve vague notions such as “meaning” and “melody” for text and music. These are not easy to model in CP with the classical toolbox, mainly because these notions are hard to define formally. For instance, it is unclear how to formalize an objective function or a constraint to get closer to a meaningful sentence, a melodious song or a captivating painting. On the other hand, recent results in Machine Learning (ML), such as transformer-based models [2], have demonstrated the power of these techniques to capture a significant part of these vague concepts through data-driven statistical learning (e.g., Large Language Model (LLM) like the GPT series [3], stable-diffusion [4], ChatMusician [5]). In the article, we demonstrate that ML, and in particular LLM, can help CP to model and solve problems where such vague concepts can be found.

In recent years, there has been a growing interest in text generation under constraints thanks to the rise of transformer-based models, like OpenAI ChatGPT ([3]) and Meta LLaMa ([6]). Nevertheless, even fine-tuned prompted LLMs fail to generate several constrained outputs (see the tasks introduced in [7]). The goal of this paper is to present a new method for the task of text generation under constraints. This interest has a strong chance of continuing to grow insofar as many brands wish to integrate these technologies, in particular with their customers, and want to have control and guarantees on the behavior of these conversational agents. Hence, it may impact several critical marketing aspects (e.g., brand representation, legal issues, data privacy, …). Therefore, CP has the potential to become a strong safeguard of this kind of generative model.

For the task of text generation under constraints, ML techniques face limitations when they have to manage structural constraints, such as limits on the number of words or characters (e.g. Text Summarization, Text Simplification, Text style transfer, Question Answering, Storytelling, Poetry or Lyrics Generation, Subtitle) [8]. CP succeeds on these types of constraints, making the combination of CP and ML a natural fit for the task of text generation under constraints.

This paper proposes such a combination, to tackle a class of problems where neither CP and ML succeeds on their own (Fig. 1).

Refer to caption
Figure 1: Our approach aspires to explore the in-between area. In the blue (left-hand side) region, LLM guided searches solve weakly constrained problems [9, 10, 11] and in the green (right-hand side) region, CP-based generation tackles strongly constrained problems [12, 13, 14, 15, 16, 17, 18].

Combining Combinatorial Optimization (CO) and ML is a very active research area [19], however there is no easy way to integrate the ML “expertise” into CP as a constraint of the model [20, 21] and vice versa [22]. Furthermore, there are many incentives to strengthen the interactions between ML and CO [23, 24, 25]. Usually, the main motivation comes from the performance perspective, where the idea is to improve a solver’s performance with ML (e.g., finding branching heuristics thanks to Reinforcement Learning [26] or finding better bounds with Clustering [27]). This paper tackles it from the modeling point of view. Modeling is a crucial part of CO works. In the end, the model must account for the underlying solver that runs it. More in detail, here, the paper focuses on the interaction between CP and ML, more precisely through an ML-augmented CP approach [28].

In the context of text generation under constraints, the domain of a variable represents a word. The base idea of the paper consists in letting ML manage the domain of variables and CP manage the constraints and the number of variables. In this manner, the sentence formed by variables has high chances to have a meaning and all the constraints will be satisfied. In traditional CP, the domains can not be managed by ML because they have to be set beforehand. However, it is possible to rely on On-the-fly Constraint Programming Search (OTFS) [29], a CP based method where variables, domains and constraints are generated during the search for a solution.

The main contribution of this paper is to propose a new version of OTFS, called GenCP, where the generative function of the domain of variables is modified to allow CP variable domains to be computed by an LLM embedded in it, during the search for a solution. More in detail, ML is used during process solving but it is also used as an explicit part of the problem definition (i.e., domains are predicted by the LLM and can replace entirely static variable domains definition of a CSP.). Thus it bridges CP and ML through solving and modelling.

The potential of the approach is showcased for the problem of text generation under constraints, against one the most used techniques in the Natural Language Processing (NLP) field: Beam Search (BS). Both methods (BS and GenCP) are compared on constrained sentences generation tasks extracted from benchmarks recently introduced [7]. The approach highlights how CP can provide guarantees when combined with LLM predictions.

The paper is organized as follows: Sec. 2 serves as background, Sec. 3 shows how to extend OTFS to GenCP and how to implement an interaction between GenCP and LLM. Sec. 4 presents the experimental results in which the new approach is demonstrated on the task of text generation under constraints. Finally, Sec. 5 delves into further discussion, offering additional insights into this work and providing perspectives for future research endeavors.

2 Background

This section introduces the necessary background on LLM and CP.

2.1 LLM Predictions Strategies

2.1.1 Decoding Strategies Combined with LLMs

Large Language Models (LLMs), such as the GPT series, generate text by predicting the next token (word or character) given the history of previously generated words. Decoding in LLMs refers to the strategy used to select the next words to be generated.

2.1.2 Greedy Decoding

The simplest decoding strategy is greedy decoding. Here, the LLM selects the words with the highest probability at each time step. Although simple and efficient, this approach does not guarantee the best overall sequence, as it does not consider the effect of the current selection on future tokens.

2.1.3 Beam Search

Beam Search (BS) [30, 10, 11] is a refined version of greedy decoding. A beam is a candidate sequence of words. Instead of selecting the single best token at each time step, it usually keeps track of the k𝑘kitalic_k most likely sequences (beams) at each step.

Although BS usually achieves better results than greedy decoding, it assumes that high-ranking token sequences consist of high-ranking tokens, which may only sometimes be the case. For a more stochastic and diverse output, top-k𝑘kitalic_k sampling and top-p𝑝pitalic_p sampling (also known as nucleus sampling) are used. In top-k𝑘kitalic_k sampling, the model selects from the top k𝑘kitalic_k highest probability predictions, while in top-p𝑝pitalic_p sampling, it dynamically selects the number of top predictions to cover p𝑝pitalic_p percent of the total probability mass.

2.1.4 Perplexity

Perplexity is an entropy metric derived from Shannon’s information theory [31]. Since an LLM computes the probability of text, then it can compute text perplexity. It can be expressed as the geometric mean of the inverse conditional likelihood of the sequence [32]. Let Snsubscript𝑆𝑛S_{n}italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT be the sequence of a succession of words of size n𝑛nitalic_n: Sn=w1w2..wnS_{n}=w_{1}w_{2}..w_{n}italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . . italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT. The perplexity (PPL) of Snsubscript𝑆𝑛S_{n}italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is computed as follows:

PPL(Sn)=1P(w1w2w3wn)n,𝑃𝑃𝐿subscript𝑆𝑛𝑛1𝑃subscript𝑤1subscript𝑤2subscript𝑤3subscript𝑤𝑛PPL(S_{n})=\sqrt[n]{\frac{1}{P(w_{1}w_{2}w_{3}...w_{n})}},italic_P italic_P italic_L ( italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) = nth-root start_ARG italic_n end_ARG start_ARG divide start_ARG 1 end_ARG start_ARG italic_P ( italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT … italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) end_ARG end_ARG ,

where probability P()𝑃P(\cdot)italic_P ( ⋅ ) is given by the LLM. PPL can be interpreted as the “how likely a text is generated by a given model” [8]. Usually, it is used to evaluate the LLM itself by checking that good samples are recognized as such (i.e., low PPL values).

In NLP, the evaluation of text is still an open problem, and human evaluation remains the gold standard. Numerous metrics have been developed to address this issue. Among them, PPL remains an objective criterion associated with text produced by a given model. PPL is also much more convenient to use than pure probability. Its range is [1;+[[1;+\infty[[ 1 ; + ∞ [ . The lower, the better.

2.2 Constraint Programming

Constraint Programming (CP) is a paradigm for solving combinatorial problems that draws on a wide range of techniques from artificial intelligence and operations research. In CP a problem can be defined as a Constraint Satisfaction Problem (CSP). A CSP is a triplet: X,D,C𝑋𝐷𝐶\langle X,D,C\rangle⟨ italic_X , italic_D , italic_C ⟩, where:

  • X={X1,X2,,Xn}𝑋subscript𝑋1subscript𝑋2subscript𝑋𝑛X=\{X_{1},X_{2},...,X_{n}\}italic_X = { italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } is the set of variables of the problem.

  • D={DX1,DX2,,DXn}𝐷subscript𝐷subscript𝑋1subscript𝐷subscript𝑋2subscript𝐷subscript𝑋𝑛D=\{D_{X_{1}},D_{X_{2}},...,D_{X_{n}}\}italic_D = { italic_D start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_D start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , … , italic_D start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT } is the set of domains, where each domain DXisubscript𝐷subscript𝑋𝑖D_{X_{i}}italic_D start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT corresponds to the set of possible values for the variable Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

  • C={c1,c2,,cm}𝐶subscript𝑐1subscript𝑐2subscript𝑐𝑚C=\{c_{1},c_{2},...,c_{m}\}italic_C = { italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT } is the set of constraints of the problem. A constraints represent a property of the problem.

A solution is an assignment of all the variables to a value present in their respective domains, such that all the constraints are satisfied.

2.2.1 Avoiding Static Definition of the CSP

In traditional CP, for the task of text generation under constraints, a variable represents a word. Since the domains of variables have to be set beforehand, they will be of enormous size, containing every word/declination of words for a given language. Furthermore, constraints between succession of words may lead to a combinatorial explosion. Since traditional CP is not well suited, this work focuses on OTFS, a CP based method recently introduced by Régin and De Maria [29]. Instead of having the variables/domains/constraints set before the search, OTFS generates the variables/domains/constraints during the search for a solution, avoiding the problem stated above and being expendable to permit the integration of an LLM. The new version of OTFS is called GenCP.

3 Method: LLM alongside OTFS

The approach of this paper extends OTFS by having an embedded LLM generate the domains of variables. Figure 2 graphically depicts the interplay between those components. The approach also adds a minor improvement in the form of helping functions, to differentiate between implicit constraints (prevent infinite loops, ensure a variable represents a word, etc.) and explicit constraints (constraints of the problem). In the next subsection, the new version of OTFS called GenCP is described.

Refer to caption
Figure 2: This scheme presents the integration of ML into CP performed by GenCP. It is freely inspired by Sec. 3.2.3 of Bengio et al.’s survey [19], which introduces an architecture for ML alongside Optimization Algorithms. The similarity is highlighted because the master algorithm (here, GenCP) repeatedly queries an ML model (here, an LLM) to obtain a prediction as a subroutine. In the context of this paper, the decision (search or propagation) has an impact on the problem definition (the CSP) because it may generate new variables, domains, or constraints during the solving process. The state is the current assignment of the variables.

3.1 New version of OTFS: GenCP

In traditional CP it is not common to generate new variables/domains/constraints during the search, while OTFS is based on this idea. OTFS begins with an empty or partially defined CSP (the CSP has less variables/domains/constraints than the CSP in traditional CP) and will generate variable/domains/constraints during the search for solutions.

GenCP is a new version of OTFS that makes two changes to the original version: 1) the function that generates the domain genD𝑔𝑒𝑛𝐷genDitalic_g italic_e italic_n italic_D calls an LLM to generate the domain of the current variable. 2) Helping functions are added to represent implicit constraints.

Here is GenCP applied to text generation under constraints. An GenCP model can be defined as a pair of sets {,}\{\mathcal{M},\mathcal{F}\}{ caligraphic_M , caligraphic_F }, where:

  • ={X,D,C}𝑋𝐷𝐶\mathcal{M}=\{X,D,C\}caligraphic_M = { italic_X , italic_D , italic_C } represents the model of the problem.

  • X𝑋Xitalic_X represents the variables. The variables represent words.

  • D𝐷Ditalic_D represents the domain of the variables. A domain diDsubscript𝑑𝑖𝐷d_{i}\in Ditalic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_D contains a list of predicted words by an LLM.

  • C𝐶Citalic_C represents the explicit constraints (constraints of the problem). A constraint ciCsubscript𝑐𝑖𝐶c_{i}\in Citalic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_C represents rules over text (e.g., number of words, number of characters, forbidden words, or symbols).

  • ={G,B,H}𝐺𝐵𝐻\mathcal{F}=\{G,B,H\}caligraphic_F = { italic_G , italic_B , italic_H } is a set of functions.

  • G𝐺Gitalic_G represents the set of generative functions: these functions explain to the solver how to generate variables/domains/constraints.

  • B𝐵Bitalic_B represents the set of Boolean functions: these functions tell the solver when a solution is found.

  • H𝐻Hitalic_H represents the set of helping functions: these functions are used to represent implicit constraints, for example ensuring that when a variable is generated, it helps obtaining a solution (to prevent the solver from attaining an infinite loop of generating variables).

3.1.1 Generative Functions

The set of generative functions G={genV,genD,genC}𝐺𝑔𝑒𝑛𝑉𝑔𝑒𝑛𝐷𝑔𝑒𝑛𝐶G=\{genV,genD,genC\}italic_G = { italic_g italic_e italic_n italic_V , italic_g italic_e italic_n italic_D , italic_g italic_e italic_n italic_C } is such that:

  • genV𝑔𝑒𝑛𝑉genVitalic_g italic_e italic_n italic_V generates a new variable with an empty domain and adds it to X𝑋Xitalic_X.

  • genD𝑔𝑒𝑛𝐷genDitalic_g italic_e italic_n italic_D calls the LLM with the current sentence formed by the model and sets the domain of the previously generated variable to the output.

  • genC𝑔𝑒𝑛𝐶genCitalic_g italic_e italic_n italic_C generates the constraint(s) relevant to the current variables of the model to C𝐶Citalic_C. The constraints generated depend on the problem (e.g., generate a sentence that does not contain the letter “e”).

3.1.2 LLM integration

A variable is generated with an empty domain. To generate the domain of variables, genD𝑔𝑒𝑛𝐷genDitalic_g italic_e italic_n italic_D calls an LLM using callLLM(sentence,parameters,k)𝑐𝑎𝑙𝑙𝐿𝐿𝑀𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠𝑘callLLM(sentence,parameters,k)italic_c italic_a italic_l italic_l italic_L italic_L italic_M ( italic_s italic_e italic_n italic_t italic_e italic_n italic_c italic_e , italic_p italic_a italic_r italic_a italic_m italic_e italic_t italic_e italic_r italic_s , italic_k ), where:

  • sentence𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒sentenceitalic_s italic_e italic_n italic_t italic_e italic_n italic_c italic_e is the current sentence represented by the variables of the model.

  • parameters𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠parametersitalic_p italic_a italic_r italic_a italic_m italic_e italic_t italic_e italic_r italic_s represents sampling parameters (top_k,top_p𝑡𝑜𝑝_𝑘𝑡𝑜𝑝_𝑝top\_k,top\_pitalic_t italic_o italic_p _ italic_k , italic_t italic_o italic_p _ italic_p…). For this paper, top_k𝑡𝑜𝑝_𝑘top\_kitalic_t italic_o italic_p _ italic_k is used exclusively for both GenCP and BS: the LLM answers k𝑘kitalic_k words ranked by probability, highest to lowest.

  • k𝑘kitalic_k is the number of words asked to the LLM.

Since the parameters and k𝑘kitalic_k are not modified after the definition of the model,
callLLM(sentence,parameters,k)𝑐𝑎𝑙𝑙𝐿𝐿𝑀𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒𝑝𝑎𝑟𝑎𝑚𝑒𝑡𝑒𝑟𝑠𝑘callLLM(sentence,parameters,k)italic_c italic_a italic_l italic_l italic_L italic_L italic_M ( italic_s italic_e italic_n italic_t italic_e italic_n italic_c italic_e , italic_p italic_a italic_r italic_a italic_m italic_e italic_t italic_e italic_r italic_s , italic_k ) will be simply referred to as callLLM(sentence)𝑐𝑎𝑙𝑙𝐿𝐿𝑀𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒callLLM(sentence)italic_c italic_a italic_l italic_l italic_L italic_L italic_M ( italic_s italic_e italic_n italic_t italic_e italic_n italic_c italic_e ).

3.1.3 Helping Functions

Helping functions represent implicit constraints, like avoiding infinite loops. In our current implementation, the set of functions H𝐻Hitalic_H contains the following functions:

  • Hosubscript𝐻𝑜H_{o}italic_H start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT: it orders the domain of variables depending on the problem.

  • HonlyWordssubscript𝐻𝑜𝑛𝑙𝑦𝑊𝑜𝑟𝑑𝑠H_{onlyWords}italic_H start_POSTSUBSCRIPT italic_o italic_n italic_l italic_y italic_W italic_o italic_r italic_d italic_s end_POSTSUBSCRIPT: it ensures that any word predicted by the LLM is a complete word and not the suffix or prefix of a word and it filters out any symbol or special character.

3: HelpingFunctions4: SaveState2: GenerativeFunctions5: RunPropagation8: Backtrackfail6: BooleanFunctionsno sol found7: SaveSolutionsolution found9: ReturnSolution(s) Savedfail1: InitialStateemptyfailLLM
Figure 3: This graph illustrates the main steps in GenCP solving.
1: 𝒮=𝒮\mathcal{S}=\emptysetcaligraphic_S = ∅; if \mathcal{M}caligraphic_M is not empty then go to 3.;
2: generativeFunctions(\mathcal{M}caligraphic_M);
3: helpingFunctions(\mathcal{M}caligraphic_M); if .\mathcal{M}.caligraphic_M .X.containsEmptyVariable().containsEmptyVariable(). italic_c italic_o italic_n italic_t italic_a italic_i italic_n italic_s italic_E italic_m italic_p italic_t italic_y italic_V italic_a italic_r italic_i italic_a italic_b italic_l italic_e ( ) then go to 8.;
4: saveState(\mathcal{M}caligraphic_M);
5: propagation(\mathcal{M}caligraphic_M); if .\mathcal{M}.caligraphic_M .X.containsEmptyVariable().containsEmptyVariable(). italic_c italic_o italic_n italic_t italic_a italic_i italic_n italic_s italic_E italic_m italic_p italic_t italic_y italic_V italic_a italic_r italic_i italic_a italic_b italic_l italic_e ( ) then go to 8.;
6: if not booleanFunctions(\mathcal{M}caligraphic_M) then go to 2.;
7: 𝒮𝒮\mathcal{S}caligraphic_S.add(\mathcal{M}caligraphic_M);
8: if backtrack()𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘backtrack(\mathcal{M})italic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k ( caligraphic_M ) then go to 4.; else return 𝒮𝒮\mathcal{S}caligraphic_S;
Algorithm 1 GenCP(,\mathcal{M},\mathcal{F}caligraphic_M , caligraphic_F), ={X,D,C}𝑋𝐷𝐶\mathcal{M}=\{X,D,C\}caligraphic_M = { italic_X , italic_D , italic_C }, \mathcal{F}caligraphic_F contains the generative and boolean functions

3.1.4 Description of the new approach

The main steps of GenCP are depicted in Fig. 3 and Algorithm 1:

  1. 1.

    GenCP begins with an initial state. If the initial state is empty, the generative functions are called (2.), otherwise the helping functions are called (3.).

  2. 2.

    The generative functions genV/genD/genC𝑔𝑒𝑛𝑉𝑔𝑒𝑛𝐷𝑔𝑒𝑛𝐶genV/genD/genCitalic_g italic_e italic_n italic_V / italic_g italic_e italic_n italic_D / italic_g italic_e italic_n italic_C are called (genD𝑔𝑒𝑛𝐷genDitalic_g italic_e italic_n italic_D calls the LLM).

  3. 3.

    The helping functions are called to manage implicit constraints, backtracking if necessary (e.g., if the LLM generated an empty domain).

  4. 4.

    The current state of the model \mathcal{M}caligraphic_M is saved.

  5. 5.

    The propagation is called, if it fails the model backtracks (8.), else it calls the boolean functions (6.).

  6. 6.

    The Boolean functions are called to check if a solution has been found. If a solution is found, it is saved (7.) and the model backtracks (8.), otherwise the model calls the generative functions (2.).

  7. 7.

    The current sentence formed by the variables is saved as a solution.

  8. 8.

    GenCP backtracks to a previously saved state (4.) of the model and changes the choices made during propagation (5.). If no previous state was saved, then backtracking fails (9.).

    • When backtracking to a previously saved state, the model deletes all variables, their respective domains, and the constraints associated with them, that are not present in the previously saved state.

  9. 9.

    GenCP outputs the solution(s) that were saved or it indicates that no solution was found.

3.1.5 Enforce variability

Variability between two sentences is the number of words that are not equal at each position, for example:

  • “The little boy is” and “The little cat is” have a variability of 1.

  • “My name is John” and “John is my name” have a variability of 4.

To force a greater variability between solutions (greater than 2222), a special backtrack called backtrackTo𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘𝑇𝑜backtrackToitalic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k italic_T italic_o(n𝑛nitalic_n) is used. Let the set of variables X={x1,,xn,xn+1,xm}𝑋subscript𝑥1subscript𝑥𝑛subscript𝑥𝑛1subscript𝑥𝑚X=\{x_{1},\dots,x_{n},x_{n+1}\dots,x_{m}\}italic_X = { italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT … , italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT }. The function backtrackTo𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘𝑇𝑜backtrackToitalic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k italic_T italic_o(n𝑛nitalic_n) deletes the variables xn+1subscript𝑥𝑛1x_{n+1}italic_x start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT to xmsubscript𝑥𝑚x_{m}italic_x start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT and causes a backtrack. For example, consider the sentence “I like to swim in the summer.”. With backtrackTo𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘𝑇𝑜backtrackToitalic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k italic_T italic_o(2), “to swim in the summer.” is deleted and the value of variable x2=``like"𝑥2``𝑙𝑖𝑘𝑒"x2=``like"italic_x 2 = ` ` italic_l italic_i italic_k italic_e " is changed. The next solution might be “I want to break free.”.

3.1.6 Ordering

For some tasks, not following the ordering strategies of the LLM (like top-k𝑘kitalic_k and top-p𝑝pitalic_p) can lead to better/faster solutions. Two other orderings are considered: PPL valuation and length of a word (depending on the average word length in the given language).

3.2 Modeling Example

Here is a simple example of how the search of GenCP works: for this paper the generative functions only generate variables one at a time but it is important to note that these functions can generate multiple variables, domains and constraints at once. Let us suppose GenCP has to generate a sentence beginning by “The” and containing between 10 and 15 words with exactly 60 characters. The following functions are needed:

  • currentSentence()𝑐𝑢𝑟𝑟𝑒𝑛𝑡𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒currentSentence(\mathcal{M})italic_c italic_u italic_r italic_r italic_e italic_n italic_t italic_S italic_e italic_n italic_t italic_e italic_n italic_c italic_e ( caligraphic_M ): outputs the current sentence the variables form.

  • callLLM(sentence)𝑐𝑎𝑙𝑙𝐿𝐿𝑀𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒callLLM(sentence)italic_c italic_a italic_l italic_l italic_L italic_L italic_M ( italic_s italic_e italic_n italic_t italic_e italic_n italic_c italic_e ): described in 3.1.2. Here k𝑘kitalic_k is equal to 10 (each time the LLM is called, it will output 10 words).

  • contains(sentence,word)𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑠𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒𝑤𝑜𝑟𝑑contains(sentence,word)italic_c italic_o italic_n italic_t italic_a italic_i italic_n italic_s ( italic_s italic_e italic_n italic_t italic_e italic_n italic_c italic_e , italic_w italic_o italic_r italic_d ): outputs yes if the sentence contains the word and no otherwise.

  • nbChar(sentence)𝑛𝑏𝐶𝑎𝑟𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒nbChar(sentence)italic_n italic_b italic_C italic_h italic_a italic_r ( italic_s italic_e italic_n italic_t italic_e italic_n italic_c italic_e ): outputs the number of characters in the sentence.

The obtained model is {,}\{\mathcal{M},\mathcal{F}\}{ caligraphic_M , caligraphic_F }, where:

  • ={X,D,C}𝑋𝐷𝐶\mathcal{M}=\{X,D,C\}caligraphic_M = { italic_X , italic_D , italic_C }:

  • X={x1}𝑋subscript𝑥1X=\{x_{1}\}italic_X = { italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT }.

  • D={d1={D=\{d_{1}=\{italic_D = { italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = {“The”}}\}\}} }.

  • C=𝐶C=\emptysetitalic_C = ∅.

  • ={G,B,H}𝐺𝐵𝐻\mathcal{F}=\{G,B,H\}caligraphic_F = { italic_G , italic_B , italic_H }.

  • G={genV,genD,genC}𝐺𝑔𝑒𝑛𝑉𝑔𝑒𝑛𝐷𝑔𝑒𝑛𝐶G=\{genV,genD,genC\}italic_G = { italic_g italic_e italic_n italic_V , italic_g italic_e italic_n italic_D , italic_g italic_e italic_n italic_C } is a set of functions, each function follows these steps:

    • -

      generate x|X|+1subscript𝑥𝑋1x_{|X|+1}italic_x start_POSTSUBSCRIPT | italic_X | + 1 end_POSTSUBSCRIPT and add it to X𝑋Xitalic_X with an empty domain d|X|+1subscript𝑑𝑋1d_{|X|+1}italic_d start_POSTSUBSCRIPT | italic_X | + 1 end_POSTSUBSCRIPT.

    • -

      d|X|+1=callLLM(currentSentence())subscript𝑑𝑋1𝑐𝑎𝑙𝑙𝐿𝐿𝑀𝑐𝑢𝑟𝑟𝑒𝑛𝑡𝑆𝑒𝑛𝑡𝑒𝑛𝑐𝑒d_{|X|+1}=callLLM(currentSentence(\mathcal{M}))italic_d start_POSTSUBSCRIPT | italic_X | + 1 end_POSTSUBSCRIPT = italic_c italic_a italic_l italic_l italic_L italic_L italic_M ( italic_c italic_u italic_r italic_r italic_e italic_n italic_t italic_S italic_e italic_n italic_t italic_e italic_n italic_c italic_e ( caligraphic_M ) ).

    • -

      cremoveover60char((currentSentence(),d|X|+1)c_{remove_{over60char}((currentSentence(\mathcal{M}),d_{|X|+1})}italic_c start_POSTSUBSCRIPT italic_r italic_e italic_m italic_o italic_v italic_e start_POSTSUBSCRIPT italic_o italic_v italic_e italic_r 60 italic_c italic_h italic_a italic_r end_POSTSUBSCRIPT ( ( italic_c italic_u italic_r italic_r italic_e italic_n italic_t italic_S italic_e italic_n italic_t italic_e italic_n italic_c italic_e ( caligraphic_M ) , italic_d start_POSTSUBSCRIPT | italic_X | + 1 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT.

    • -

      The constraints remove the words that make the current sentence exceed 60 characters from the domain of the current variable.

  • B={endNbWords,endNbCharacters,endLLM}𝐵𝑒𝑛𝑑𝑁𝑏𝑊𝑜𝑟𝑑𝑠𝑒𝑛𝑑𝑁𝑏𝐶𝑎𝑟𝑎𝑐𝑡𝑒𝑟𝑠𝑒𝑛𝑑𝐿𝐿𝑀B=\{endNbWords,endNbCharacters,endLLM\}italic_B = { italic_e italic_n italic_d italic_N italic_b italic_W italic_o italic_r italic_d italic_s , italic_e italic_n italic_d italic_N italic_b italic_C italic_h italic_a italic_r italic_a italic_c italic_t italic_e italic_r italic_s , italic_e italic_n italic_d italic_L italic_L italic_M } is a set of functions, each function behaves as follows:

    • -

      |X|>=10|X|<=15𝑋10𝑋15|X|>=10\land|X|<=15| italic_X | > = 10 ∧ | italic_X | < = 15.

    • -

      nbChar(currentSentence())==60nbChar(currentSentence(\mathcal{M}))==60italic_n italic_b italic_C italic_h italic_a italic_r ( italic_c italic_u italic_r italic_r italic_e italic_n italic_t italic_S italic_e italic_n italic_t italic_e italic_n italic_c italic_e ( caligraphic_M ) ) = = 60.

    • -

      contains(callLLM(currentSentence()),``.")contains(callLLM(currentSentence(\mathcal{M})),``.")italic_c italic_o italic_n italic_t italic_a italic_i italic_n italic_s ( italic_c italic_a italic_l italic_l italic_L italic_L italic_M ( italic_c italic_u italic_r italic_r italic_e italic_n italic_t italic_S italic_e italic_n italic_t italic_e italic_n italic_c italic_e ( caligraphic_M ) ) , ` ` . " ).

  • H={Hho}𝐻subscript𝐻𝑜H=\{H_{ho}\}italic_H = { italic_H start_POSTSUBSCRIPT italic_h italic_o end_POSTSUBSCRIPT }:

    • -

      Hho:order(d|X|+1):subscript𝐻𝑜𝑜𝑟𝑑𝑒𝑟subscript𝑑𝑋1H_{ho}:order(d_{|X|+1})italic_H start_POSTSUBSCRIPT italic_h italic_o end_POSTSUBSCRIPT : italic_o italic_r italic_d italic_e italic_r ( italic_d start_POSTSUBSCRIPT | italic_X | + 1 end_POSTSUBSCRIPT ).

    • -

      To help attain the goal of 60 characters, the domain of the current variable is ordered such that before the 10th word the solver tries the longer words first and at the 10th word the solver tries the shorter words first.

With the above representation of the problem, GenCP is asked for 4 solutions, backtrackTo(2)𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘𝑇𝑜2backtrackTo(2)italic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k italic_T italic_o ( 2 ) is used and the LLM is asked for 10 words maximum per call. The obtained solutions are:

  1. 1.

    The following is an article by the author of the above book.

  2. 2.

    The first time you see the movie version of your book on TV.

  3. 3.

    The New York Times has an article on the new book by Tim Wu.

  4. 4.

    The new year is here and we are ready to make the next step.

3.3 Illustrated Example

Refer to caption
Figure 4: Illustrations of GenCP as a simplified framework with three variables and predictions of 3 values per LLM call, on a simple problem: generate a sentence that does not contain the letter e. For each variable, the predefined constraint cisubscript𝑐𝑖c_{i}italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT “the letter e is forbidden” is generated. A predefined domain with one word is defined for the first variable: A. The current sentence formed by the variable “A” is not a solution (callLLM(``A")𝑐𝑎𝑙𝑙𝐿𝐿𝑀``𝐴"callLLM(``A")italic_c italic_a italic_l italic_l italic_L italic_L italic_M ( ` ` italic_A " ) does not answer a period (“.”)), so a new empty variable x2subscript𝑥2x_{2}italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is generated. GenCP calls the LLM with the sentence “A” to predict the domain of x2subscript𝑥2x_{2}italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. The LLM model predicts three values: [ man, house, boy ]. c2subscript𝑐2c_{2}italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is generated, causing the domain of x2subscript𝑥2x_{2}italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT to be filtered accordingly: house is removed. x2subscript𝑥2x_{2}italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is then assigned to boy, GenCP generates the variable x3subscript𝑥3x_{3}italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT and calls the LLM with the sentence “A boy” to predict a new domain. Unfortunately, the domain of x3subscript𝑥3x_{3}italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT is empty, either because the LLM answered an empty domain or because this domain was entirely filtered during propagation. Hence, GenCP backtracks to x2subscript𝑥2x_{2}italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT and the value of x2subscript𝑥2x_{2}italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is changed to man. In the same fashion as before, GenCP generates the variables x3subscript𝑥3x_{3}italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, and gives “A man” to the LLM that predicts: [ drinks, and, helps ]. c3subscript𝑐3c_{3}italic_c start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT is generated, filtering helps because it contains an e. The process continues until the domain of the next predicted variable contains a period (a solution is found) or the solver fails.

Fig. 4 illustrates GenCP as a simplified framework with three variables and predictions of 3 values per LLM call, on a simple problem: generate a sentence that does not contain the letter e.

4 Experiments

4.1 Experimental Conditions

4.1.1 Baseline

The experiments presented by Yao et al. are partially reproduced [7]. In particular, the constrained sentence generation tasks described in Tab. 1. Five LLMs were selected: GPT4, GPT4-O, Mistral Next, Claude 3.5, and Gemini. The four LLMs are prompted with the same example command given in [7]. For example, “Please create a sentence of exactly 82 characters.” for the Sent-1 task111https://chatgpt.com/share/b2834735-f7d8-468a-ba54-7da19dd6723c. Tab. 2 gives an overview of the performance of the five LLMs on the four tasks. The satisfaction rate is based on ten trials per task per model. In addition, Tab. 2 also shows that the LLMs perform well on the lexically constrained scenario task-4 with a 90+% satisfaction rate over ten trials. Also, as Yao et al. previously showed in their paper, LLMs struggle to produce constrained sentences involving counting (e.g., words and characters). They provide a nice picture of current LLM satisfaction capabilities by introducing new benchmarks. Unfortunately, the Yao et al. article only provides the benchmarks and some hints on reproducing them. However, it does not give any hints on how to solve the tasks associated with the benchmarks (see the original article for more details [7]).

name words count character count lexical constraints
sent-1 =82absent82=82= 82
sent-2 =10absent10=10= 10 X3=subscript𝑋3absentX_{3}=italic_X start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = soft, X7=subscript𝑋7absentX_{7}=italic_X start_POSTSUBSCRIPT 7 end_POSTSUBSCRIPT = soft,X10=subscript𝑋10absentX_{10}=italic_X start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT = math
sent-3 20absent20\geq 20≥ 20 i,|Xi|6for-all𝑖subscript𝑋𝑖6\forall i,|X_{i}|\leq 6∀ italic_i , | italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | ≤ 6
sent-4 soft, beach, math
Table 1: Four tasks on sentence generation used to compare BS and GenCP extract from [7].
name GPT-4 GPT-4.0 Mistral Next Claude3.5 Sonnet Gemini
#s #f %sat #s #f %sat #s #f %sat #s #f %sat #s #f %sat
sent-1 1 9 10% 0 10 0% 0 10 0% 1 9 10% 0 10 0%
sent-2 0 10 0% 0 10 0% 0 10 0% 0 10 0% 0 10 0%
sent-3 1 9 10% 5 5 50% 0 10 0% 9 1 90% 1 9 10%
sent-4 9 1 90% 9 1 90% 10 0 100% 10 0 100% 1 9 10%
Table 2: Number of successes (#s), Number of fails (#f) and satisfaction rate (%sat) for each model (GPT-4, GPT-4.0, Mistral Next, Claude 3.5, Gemini) for each task (sent-1, sent-2, sent-3, sent-4).

4.1.2 Hardware & Implementation

The experiments were performed on a laptop with Windows 10 Professional, 32 GB RAM, and Intel 16 CPU cores. The approach and the BS are implemented in Java 17 for easier comparisons.

4.1.3 LLM choice

LLaMa [6] is responsible for the predictions of words as domains for the variables, mainly because an efficient implementation in C++ was recently released222https://github.com/ggerganov/llama.cpp. It allows running a model on a personal laptop and CPU (only) efficiently. Thanks to quantization [33] (model weight compression), the 7B parameters model (in Float16) of 13GB original size, in 4-bit quantization (Q4) drops to 3.9GB of RAM. However, the biggest model of LLama 65B (120GB), even in Q4, needs 38.5 GB of RAM. Thus, the LLaMa v1 model used in the experiments is LLaMa 7B Q4 with low temperature (i.e., 1absent1\leq 1≤ 1, temp = 0.8).

When asked for k𝑘kitalic_k words, this version of LLaMa will take the same amount of time to ouput 1111 word and 1000100010001000 words. To minimize the importance of k𝑘kitalic_k, callLLM𝑐𝑎𝑙𝑙𝐿𝐿𝑀callLLMitalic_c italic_a italic_l italic_l italic_L italic_L italic_M outputs more than k𝑘kitalic_k words, a beam/variable only keeps k𝑘kitalic_k “valid” words. A “valid” word is a word that does not violate a constraint on its own. For example, a word that does not violate the constraint “does not contain the letter e”.

4.1.4 Beam Search Technical Remarks

In the current implementations two halting conditions are defined for BS:

  • First solution: when the current beam contains at least one solution, BS is stopped and output the solutions.

  • All solutions: when the current beam contains at least one solution but another beam can continue to generate words without violating a constraint (for instance, it does not contain enough characters to satisfy a length constraint), the beam solutions are saved and BS continues with the remaining beams.

4.1.5 Benchmarks Settings

BS and GenCP are compared on some recent benchmarks described in Sec. 4.1.1.

To guarantee GenCP and BS to be judged on the generation of sentences of the same quality, a solution is a sentence that satisfies all the constraints of the current task and, when given this sentence, the LLM predicts a period (“.”). Not to alter BS too much, words are ordered by probability (PPL is not used) and, since BS sentences have low variability, GenCP is used without backtrackTo𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘𝑇𝑜backtrackToitalic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k italic_T italic_o(n𝑛nitalic_n).

BS and GenCP are compared on the following criteria:

  • Time in seconds.

  • Number of solutions. GenCP was stopped when it found the same number of solutions as BS on a task. 0/1010/10 / 1 means that BS found no solution while GenCP found one solution.

  • The ratio solutions/outputs as a constraint satisfaction rate.

  • For BS only, the number of bad outputs (number of outputs that are not a solution).

  • For GenCP only, the number of backtracks.

4.2 Result Analysis

The results show that GenCP can be used to solve efficiently text generation under constraints problems. GenCP is faster than BS and all the outputs are solutions, contrary to BS where some outputs are not solutions.

Although the results suggest that GenCP succeeds in all tasks (see Tab. 3), it becomes particularly interesting when considering size constraints (e.g., sentences with a precise number of words or characters). It obtains sentences that satisfy the constraint with a low PPL score on sent-1 and sent-3 tasks.

GenCP also succeeds in producing sequences obeying lexical constraints in sent-2 and sent-4. However, the PPL and a human evaluation on these sentences show a substantial deterioration in term of quality (i.e., meaningfulness).

Therefore, regarding sent-1 and sent-3 tasks, GenCP is to be preferred, whereas for sent-4 and sent-2 tasks, LLMs prompted alone or joint with BS is still adequate.

Experiments BS GenCP
sent-i k #sols s #badoutput %sat s %sat #bk
1 5 1 108 9 10% 103 100% 45
10 0 182 18 0% 177 100% 84
20 1 399 58 1% 46 100% 13
50 1 1123 109 0%absentpercent0\approx 0\%≈ 0 % 47 100% 13
2 5 5 34 0 100% 38 100% 38
10 10 69 0 100% 36 100% 25
20 20 140 0 100% 58 100% 40
50 49 354 1 99% 134 100% 100
3 5 0 248 5 0% 36 100% 4
10 2 510 8 20% 55 100% 6
20 4 1030 16 20% 164 100% 38
50 20 2633 30 66% 1174 100% 374
4 5 25 279 3 89% 308 100% 118
10 30 513 8 78% 311 100% 114
20 45 1123 14 76% 321 100% 104
50 89 2928 40 68% 388 100% 27
Table 3: Comparison of BS and GenCP on the tasks of Tab. 1. Task considered (sent-i), Number of solutions (#sols), Time in seconds (s), Number of bad output (#badoutput), satisfaction rate (%sat) and Number of backtracks (#bk).
Experiments GenCP
sent-i k #sols s MB #bk
1 50 2 1123 136 79
2 50 355 222 208 222
3 50 488 2633 378 624
4 50 830 2929 676 680
Table 4: GenCP results for k=50𝑘50k=50italic_k = 50 when given approximately the same amount of time as BS in Tab 3. Number of solutions (#sols), Time in seconds (s), Memory usage in megabytes (MB), Number of backtracks (#bk).

4.2.1 Beam Search

BS and GenCP are compared in Tab. 3. In all tables, the number of backtracks is denoted by #bk. BS is slower than GenCP and has lower satisfaction rate (number of outputs that are solutions / total number of outputs), denoted by %sat. This is due to multiple facts:

  1. 1.

    Beam Search can not guarantee to find every solution.

  2. 2.

    Beam Search chooses the next word depending on the probability of the LLM.

  3. 3.

    At each step, BS considers k𝑘kitalic_k sentences, each sentence asks k𝑘kitalic_k words to the LLM, so each step considers k2superscript𝑘2k^{2}italic_k start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT words. BS orders these words decreasingly by probability and only keeps the k𝑘kitalic_k first.

Facts 2 and 3 explain why increasing k𝑘kitalic_k does not guarantee to find the same/more solutions, it might even cause BS to find less solutions.

Let us suppose k𝑘kitalic_k = 5, BS found one solution, and at depth 4, the candidate needed to find this solution was ranked 5 out of 25. Let us suppose now k𝑘kitalic_k is increased to 6: at each step BS will consider 36 candidates and take the 6 best ones. BS considers 11 more candidates than with k𝑘kitalic_k = 5; if at depth 4, the candidate needed to find the previous solution is now ranked 7 instead of 5, BS will not consider it and k𝑘kitalic_k = 6 will not find the solution found with k𝑘kitalic_k = 5.

4.2.2 GenCP

Experiments GenCP
sent-i k bkTo𝑏𝑘𝑇𝑜bkToitalic_b italic_k italic_T italic_o(n𝑛nitalic_n) PPL sentence generated
1 50 NO 8 The following is an article by Dr David Hillon the subject of the role of prayer.
2 13 The New York Times has an article on the new book by former President George Bush.
3 6 The following information is taken from the website of the National Park Services.
2 50 NO 189 The following soft skills are required beach resort jobs math.
2 169 The National soft drink association has beach balls and math.
3 107 The most soft and comfortable of beach wear is math.
3 50 NO 8 The first time you see the movie The Big Short is like being hit by an ice cube in the face.
2 5 The world is full of great ideas and the best way to get them out there is by using the power of the web.
3 5 The first step in the right path is to know what you want and where you are going in life.
4* 50 NO 347 The following is an article by Dr math and science teacher beach high school in soft.
2 593 The term of the contract is for math and science teachers beach to be able soft.
3 48 The following data is based on the math and physics of beach waves and the soft sand.
Table 5: Output sentences of GenCP on the experiments of Tab 1 associated with the task (sent-i), k, backtrackTo𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘𝑇𝑜backtrackToitalic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k italic_T italic_o (bkTo𝑏𝑘𝑇𝑜bkToitalic_b italic_k italic_T italic_o), and Perplexity (PPL). In sent-4* a constraint was added so that “soft”, “beach”, “math” have to be separated by at least three words. Sentences with high perplexity were chosen to showcase the importance of low perplexity.

Tab. 4 shows the capability of GenCP to generate more solutions than BS. GenCP is given the same time as BS for the same task and k=50𝑘50k=50italic_k = 50, GenCP obtains more solutions than BS. Note that for sent-1, without backtrackTo𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘𝑇𝑜backtrackToitalic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k italic_T italic_o GenCP only obtains 2 solutions in 1123 seconds, while with backtrackTo(6)𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘𝑇𝑜6backtrackTo(6)italic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k italic_T italic_o ( 6 ) GenCP obtains 11 solutions in 1123 seconds.

The LLM-enhanced GenCP avoids the drawbacks of BS and proposes an alternative approach to text generation under constraints for the following reasons:

  • GenCP can guarantee to find every solution (if any). Increasing k𝑘kitalic_k guarantees to find at least the same solutions previously found and potentially finds new solutions. Furthermore, it can offer more solutions than BS.

  • All the outputs answered by GenCP are solutions (all the constraints are satisfied).

  • GenCP offers more options for improvement, for example to ensure better variability (backtrackTo𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘𝑇𝑜backtrackToitalic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k italic_T italic_o explained in 3.1.5 can be used) or other orderings than probability (3.1.6).

4.2.3 Variability and Perplexity

Tab. 5 demonstrates the importance of enforcing variability and perplexity. When GenCP generated solutions for Tab. 3 and 4, the maximum variability was 4. Tab. 5 shows that with backtrackTo(2)𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘𝑇𝑜2backtrackTo(2)italic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k italic_T italic_o ( 2 )/backtrackTo(3)𝑏𝑎𝑐𝑘𝑡𝑟𝑎𝑐𝑘𝑇𝑜3backtrackTo(3)italic_b italic_a italic_c italic_k italic_t italic_r italic_a italic_c italic_k italic_T italic_o ( 3 ), sentences generated are almost completely different thanks to high variability (10+ for sent-3 for example).

Tab. 5 purposefully contains sentences with high perplexity to illustrate that this leads to a degradation in the sentence quality (i.e., low meaning).

All the sentences generated for sent-4 had the words “soft”, “beach” and “math” next to each other. To showcase the capability of GenCP to improve sentences, sent-4* was created: it is the same as sent-4 except that “soft”, “beach” and “math” must contain at least three words between them.

5 Discussion & Perspectives

5.1 GPU and CPU Interplay

The article shares a proof-of-concept showing that interesting results can be obtained using CPU resources combined with a small quantized LLM in a CP solver. However, LLMs, in general, work best with much larger computational resources and require GPU resources. Even though smaller models (e.g., Mistral 8x7B) sometimes manage to take top places in specific scenarios. The top spots in the LLM Elo rankings feature gigantic models [34]. Given their size, clusters of GPU are quickly mandatory. Hence, it would be interesting to study in more detail how the joint use of resources (for instance, CPU for solver and GPU for LLM) could improve the results of the paper and correspond to more real-world usage in industry.

5.2 Token Management

In this article, GenCP ignores tokens and works at the word level (pre-token). It is possible to handle tokens by adapting the problem modeling. Indeed, it is possible to consider a word as a meta-variable X1subscript𝑋1X_{1}italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT composed of several decision variables (e.g., X11subscript𝑋subscript11X_{1_{1}}italic_X start_POSTSUBSCRIPT 1 start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT, X12subscript𝑋subscript12X_{1_{2}}italic_X start_POSTSUBSCRIPT 1 start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT, X13subscript𝑋subscript13X_{1_{3}}italic_X start_POSTSUBSCRIPT 1 start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUBSCRIPT…). This is useful and straightforward, as it is not clear in advance how the tokenizer will cut the words. For instance, let us consider the following sentence: The first step in the recruitment of a new hire is to make sure that the job requisition is clear. Let us look at the assignments of the variables (space separates meta-variables, and semicolon decision variables): The; first; step; in; the; rec;ruit;ment; of; a; new; h;ire; is; to; make; sure; that; the; job; requ;is;ition; is; clear;. The word recruitment needs three decision variables because it is composed of three tokens (i.e., rec, ruit and ment). It is easy to manage in GenCP because it can generate as many variables as required. Nevertheless, the evolution of the CSP (generation of variables and domains) is rather technical and, therefore, depends on the tokenizer.

5.3 CSP Modeling

The idea that a CSP can evolve in response to external information is not new (e.g., Dynamic Constraint Network [35]). This dynamic vision of CSPs has been motivated by several real-world problems, particularly in product configuration [36]. GenCP proposes ML integration in modeling by letting LLMs manage operations for CSP domains during the resolution process. The ”outside the world” information [37] is given by the LLM. The article shows that LLMs can contribute to CSP modeling for generation tasks. However, how ML/LLMs can be used for CSP modeling in general for any problem remains an open problem [38, 39, 40, 41].

6 Conclusion

This paper showed that combining CP solving of structural constraints and ML understanding of vague notions (like meaning) on the task of text generation under constraints obtains promising results. This paper presents GenCP, a new method that extends OTFS to make the domains manageable by LLM predictions. The results show that GenCP can generate meaningful sentences that ensure various properties like the number of words, number of characters, mandatory keywords, or some forbidden characters. The results also show that GenCP has 100% satisfaction rate and takes less time to output solutions of the same quality than a well-known technique in the field of text generation under constraints: Beam Search. GenCP provides multiple improvements thanks to ordering, enforcing variability and perplexity, allowing thus to obtain overall higher quality solutions than BS.

Acknowledgments

We thank Jack Massey for his help in reproducing the benchmarks used as baseline in Section 4.1.1.

References

  • [1] Eugene C. Freuder. In pursuit of the holy grail. Constraints, 2(1):57–61, 1997.
  • [2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  • [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [4] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
  • [5] Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, Shangda Wu, Tianhao Shen, Ge Zhang, Yuhang Wu, Cong Liu, Ziya Zhou, Ziyang Ma, Liumeng Xue, Ziyu Wang, Qin Liu, Tianyu Zheng, Yizhi Li, Yinghao Ma, Yiming Liang, Xiaowei Chi, Ruibo Liu, Zili Wang, Pengfei Li, Jingcheng Wu, Chenghua Lin, Qifeng Liu, Tao Jiang, Wenhao Huang, Wenhu Chen, Emmanouil Benetos, Jie Fu, Gus Xia, Roger Dannenberg, Wei Xue, Shiyin Kang, and Yike Guo. Chatmusician: Understanding and generating music intrinsically with llm, 2024.
  • [6] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023. cite arxiv:2302.13971.
  • [7] Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, and Karthik R Narasimhan. COLLIE: Systematic construction of constrained text generation tasks. In The Twelfth International Conference on Learning Representations, 2024.
  • [8] Cristina Garbacea and Qiaozhu Mei. Why is constrained neural language generation particularly challenging? arXiv preprint arXiv:2206.05395, 2022.
  • [9] Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, Noah A. Smith, and Yejin Choi. NeuroLogic A*esque decoding: Constrained text generation with lookahead heuristics. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 780–799, Seattle, United States, July 2022. Association for Computational Linguistics.
  • [10] Matt Post and David Vilar. Fast lexically constrained decoding with dynamic beam allocation for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1314–1324, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.
  • [11] Chris Hokamp and Qun Liu. Lexically constrained decoding for sequence generation using grid beam search. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, Vancouver, Canada, July 2017. Association for Computational Linguistics.
  • [12] Damien Sprockeels and Peter Van Roy. Expressing musical ideas with constraint programming using a model of tonal harmony. In International Joint Conference on Artificial Intelligence, 2024. To appear.
  • [13] Alexandre Bonlarron and Jean-Charles Régin. Intertwining cp and nlp: The generation of unreasonably constrained sentences. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, 2024. To appear.
  • [14] Alexandre Bonlarron and Jean-Charles Régin. Markov constraint as large language model surrogate. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, 2024. To appear.
  • [15] Alexandre Bonlarron, Aurélie Calabrèse, Pierre Kornprobst, and Jean-Charles Régin. Constraints first: a new mdd-based model to generate sentences under constraints. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, pages 1893–1901, 2023.
  • [16] Guillaume Perez and Jean-Charles Régin. MDDs: Sampling and probability constraints. In Proceedings of the International Conference on Principles and Practice of Constraint Programming, page 226–242, 2017.
  • [17] Alexandre Papadopoulos, Pierre Roy, Jean-Charles Régin, and François Pachet. Generating all possible palindromes from ngram corpora. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, page 2489–2495. AAAI Press, 2015.
  • [18] François Pachet and Pierre Roy. Markov constraints: Steerable generation of markov sequences. Constraints, 16(2):148–172, apr 2011.
  • [19] Yoshua Bengio, Andrea Lodi, and Antoine Prouvost. Machine learning for combinatorial optimization: A methodological tour d’horizon. European Journal of Operational Research, 290(2):405–421, 2021.
  • [20] Andrea Bartolini, Michele Lombardi, Michela Milano, and Luca Benini. Neuron constraints to model complex real-world problems. In Principles and Practice of Constraint Programming–CP 2011: 17th International Conference, CP 2011, Perugia, Italy, September 12-16, 2011, Proceedings, volume 6876, page 115. Springer, 2011.
  • [21] Michele Lombardi, Michela Milano, and Andrea Bartolini. Empirical decision model learning. Artificial Intelligence, 244:343–367, 2017. Combining Constraint Solving with Mining and Learning.
  • [22] Nan Jiang, Maosen Zhang, Willem-Jan Van Hoeve, and Yexiang Xue. Constraint reasoning embedded structured prediction. J. Mach. Learn. Res., 23(1), jan 2022.
  • [23] Elias Khalil, Pierre Le Bodic, Le Song, George Nemhauser, and Bistra Dilkina. Learning to branch in mixed integer programming. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), Feb. 2016.
  • [24] Elias B. Khalil, Bistra Dilkina, George L. Nemhauser, Shabbir Ahmed, and Yufen Shao. Learning to run heuristics in tree search. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 659–666, 2017.
  • [25] Jialin Song, ravi lanka, Yisong Yue, and Bistra Dilkina. A general large neighborhood search framework for solving integer linear programs. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 20012–20023. Curran Associates, Inc., 2020.
  • [26] Quentin Cappart, Thierry Moisan, Louis-Martin Rousseau, Isabeau Prémont-Schwarz, and Andre A. Cire. Combining reinforcement learning and constraint programming for combinatorial optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 35(5):3677–3687, May 2021.
  • [27] Mohsen Nafar and Michael Römer. Using clustering to strengthen decision diagram bounds for discrete optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 38(8):8082–8089, Mar. 2024.
  • [28] James Kotary, Ferdinando Fioretto, Pascal van Hentenryck, and Bryan Wilder. End-to-end constrained optimization learning: A survey. In 30th International Joint Conference on Artificial Intelligence, IJCAI 2021, pages 4475–4482. International Joint Conferences on Artificial Intelligence, 2021.
  • [29] Florian Régin and Elisabetta De Maria. Using on-the-fly model checking to improve constraint programming for dynamic problems. In 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), pages 393–398, 2023.
  • [30] Yixian Liu, Liwen Zhang, Wenjuan Han, Yue Zhang, and Kewei Tu. Constrained text generation with global guidance - case study on commongen. CoRR, abs/2103.07170, 2021.
  • [31] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Jennifer C. Lai, and Robert L. Mercer. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40, 1992.
  • [32] Dan Jurafsky and James H. Martin. Speech and language processing : an introduction to natural language processing, computational linguistics, and speech recognition. Pearson Prentice Hall, Upper Saddle River, N.J., 2009.
  • [33] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. CoRR, abs/2103.13630, 2021.
  • [34] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E. Gonzalez, and Ion Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024.
  • [35] Rina Dechter and Avi Dechter. Belief maintenance in dynamic constraint networks. In Howard E. Shrobe, Tom M. Mitchell, and Reid G. Smith, editors, Proceedings of the 7th National Conference on Artificial Intelligence, St. Paul, MN, USA, August 21-26, 1988, pages 37–42. AAAI Press / The MIT Press, 1988.
  • [36] U Junker, F Rossi, P van Beek, and T Walsh. Handbook of constraint programming. Chapter Configuration, 2006.
  • [37] Christian Bessière. Arc-consistency in dynamic constraint satisfaction problems. In Thomas L. Dean and Kathleen R. McKeown, editors, Proceedings of the 9th National Conference on Artificial Intelligence, Anaheim, CA, USA, July 14-19, 1991, Volume 1, pages 221–226. AAAI Press / The MIT Press, 1991.
  • [38] Eugene C. Freuder. Conversational modeling for constraint satisfaction. Proceedings of the AAAI Conference on Artificial Intelligence, 38(20):22592–22597, Mar. 2024.
  • [39] Connor Lawless, Jakob Schoeffer, Lindy Le, Kael Rowan, Shilad Sen, Cristina St. Hill, Jina Suh, and Bahareh Sarrafzadeh. ”i want it that way”: Enabling interactive decision support using large language models and constraint programming, 2024.
  • [40] Dimos Tsouros, Hélène Verhaeghe, Serdar Kadıoğlu, and Tias Guns. Holy grail 2.0: From natural language to constraint models, 2023.
  • [41] Parag Pravin Dakle, Serdar Kadıoğlu, Karthik Uppuluri, Regina Politi, Preethi Raghavan, SaiKrishna Rallabandi, and Ravisutha Srinivasamurthy. Ner4opt: Named entity recognition for ;optimization modelling from ;natural language. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research: 20th International Conference, CPAIOR 2023, Nice, France, May 29 –June 1, 2023, Proceedings, page 299–319, Berlin, Heidelberg, 2023. Springer-Verlag.