Evaluating Task-Oriented Dialogue Consistency through Constraint Satisfaction

Tiziano Labruna    Bernardo Magnini
Fondazione Bruno Kessler
Via Sommarive, 18, Trento, Italy
Abstract

Task-oriented dialogues must maintain consistency both within the dialogue itself, ensuring logical coherence across turns, and with the conversational domain, accurately reflecting external knowledge. We propose to conceptualize dialogue consistency as a Constraint Satisfaction Problem (CSP), wherein variables represent segments of the dialogue referencing the conversational domain, and constraints among variables reflect dialogue properties, including linguistic, conversational, and domain-based aspects. To demonstrate the feasibility of the approach, we use a CSP solver to detect inconsistencies in dialogues re-lexicalized by an LLM. Our findings indicate that: (i) CSP is effective at detecting dialogue inconsistencies; and (ii) consistent dialogue re-lexicalization is challenging for state-of-the-art LLMs, which achieve an accuracy of only 0.15 when compared against a CSP solver. Furthermore, through an ablation study, we reveal that constraints derived from domain knowledge are the most difficult to respect. We argue that CSP captures core properties of dialogue consistency that have been poorly considered by approaches based on component pipelines.


1 Introduction

Task-oriented dialogue (TOD) systems (McTear, 2020; Louvan and Magnini, 2020; Balaraman et al., 2021) play a crucial role in human-computer interaction, facilitating seamless communication between users and machines to accomplish specific tasks. A peculiar characteristic of TODs is that they need to maintain consistency at two levels: (i) internally within the dialogue itself, ensuring that information in a turn is consistent with information in other turns, and (ii) consistency with the content of the conversational domain of the dialogue system. Internal consistency is responsible for the coherence of the dialogue, making it possible to maintain a meaningful exchange between the participants. External consistency, on the other hand, allows the dialogue to correctly reflect domain knowledge. In this paper, we investigate how dialogue consistency in TOD can be effectively modeled such that possible violations (i.e., inconsistencies) can be automatically detected.

R1: N=Taberna A=centre F=spanish P=cheap
R2: N=Espana A=centre F=spanish P=moderate
R3: N=Beirut A=centre F=lebanese P=cheap
    

U1: I am looking for a restaurant serving Spanish food.

S1: There are three restaurants serving Spanish food, one is cheap and the other is moderate price range. Which price range would you prefer?

U2: I am looking for a cheap restaurant in any area that serves Spanish food.

S2: Beirut is cheap and serves Lebanese food. Would you like the location information?

Figure 1: An inconsistent task-oriented dialogue with a Knowledge Base. Red values indicate internal inconsistencies, purple values indicate external inconsistencies.

Figure 1 shows a fragment of a Knowledge Base (three restaurants in a city) and a short dialogue in which a user expresses preferences for restaurants serving Spanish food, and the system responds by providing information about available options. There are two inconsistencies in this dialogue: first, at turn S1, the system mentions three restaurants serving Spanish food, which is not consistent with the domain knowledge, where there are only two such restaurants (domain inconsistency). Second, at turn S2, the system introduces a Lebanese restaurant, while it would have been expected to mention a Spanish restaurant (dialogue inconsistency). We assume that a well-formed TOD should not manifest any inconsistency of the type reported in our example. However, while relevant work on evaluating TODs has focused on single dialogue components (e.g., dialogue state tracking; Henderson et al., 2014), consistency evaluation has received much less attention. The problem is even more urgent now that end-to-end approaches (Bang et al., 2023; Lai et al., 2023) are bypassing component evaluations. Automatic detection of dialogue inconsistencies is crucial when dialogues are generated by Large Language Models (LLMs) using few-shot or zero-shot approaches. While LLMs have the capacity to generate TODs without being fine-tuned on training data, it is well known that they are prone to hallucinations (Ji et al., 2022), which may affect dialogue consistency. Furthermore, in dynamic domains where the conversational context evolves over time (Labruna and Magnini, 2022, 2023), maintaining dialogue consistency becomes even more challenging. The possible presence of inconsistencies in TODs (Qin et al., 2021) raises the problem of detecting them, which is the topic of this paper.

The novel intuition of the paper is to consider dialogue consistency as a kind of Constraint Satisfaction Problem (CSP). We investigate how to assess the consistency of a TOD under the following working hypotheses: (i) dialogue consistency can be modeled with constraints that need to be respected by appropriate linguistic realizations; (ii) such constraints can be represented as a CSP, whose allowed solutions can be identified by a CSP solver; (iii) a TOD is consistent if its linguistic realizations belong to the set of solutions allowed by a CSP solver for that dialogue. In the paper, we discuss how dialogue constraints are defined, how they can be extracted and modeled as a CSP, and how to set up an experimental setting in which we can empirically show that a CSP solver can detect inconsistencies in a dialogue.

The contributions of the paper are the following: (i) we model TOD consistency as a CSP: to the best of our knowledge, this is a fully original approach; (ii) we set up a reusable experimental setting where TOD consistency can be automatically evaluated against a CSP solver (all resources are publicly available at https://github.com/mwozgpt/tod-csp); (iii) we show that current state-of-the-art LLMs still struggle to solve simple dialogue consistency tasks, which opens the way to further research in dialogue consistency.

2 Dialogue Consistency as a Constraint Satisfaction Problem

In this section, we explore the conceptualization of dialogue consistency in the CSP framework. We first describe the fundamental components of a conversational domain (Section 2.1), then we describe the constraints that contribute to dialogue coherence (Section 2.2), encompassing linguistic, dialogic, and domain-based considerations. Finally, we formalize dialogue constraints as CSPs (Section 2.3), showing how dialogue coherence can be modeled as a constraint satisfaction task.

2.1 Conversational Domain

A conversational domain for a TOD refers to the specific topic that the dialogue revolves around, encompassing all the knowledge that is pertinent to the conversation. In this context, the conversational domain is typically represented by a domain ontology providing a schema of the concepts (e.g., Restaurant, Hotel, Movie), a set of slots $S$ (e.g., Food, Area, Price) for the concepts, and the set of values that each slot can assume (e.g., Expensive, Moderate, and Cheap for the Price slot). Then, a domain $KB$ comprises a collection of instances for the ontology concepts, each consisting of [slot, slot-value] pairs, adhering to the domain ontology schema.
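As a minimal illustration of these definitions, the sketch below represents an ontology and a KB with plain Python dictionaries. The names `ONTOLOGY`, `KB`, and `conforms` are hypothetical and are not taken from the released resources; the slot values mirror the examples used in this paper.

```python
# Hypothetical representation; not the format used in the released resources.
# Ontology: for one concept (Restaurant), each slot maps to its admissible values.
ONTOLOGY = {
    "Food": {"spanish", "lebanese", "italian"},
    "Area": {"centre", "east", "north"},
    "Price": {"cheap", "moderate", "expensive"},
}

# KB: a collection of instances, each made of [slot, slot-value] pairs.
KB = [
    {"Name": "Taberna", "Area": "centre", "Food": "spanish", "Price": "cheap"},
    {"Name": "Espana", "Area": "centre", "Food": "spanish", "Price": "moderate"},
    {"Name": "Beirut", "Area": "centre", "Food": "lebanese", "Price": "cheap"},
]


def conforms(instance, ontology):
    """Return True if every ontology slot present in the instance holds an admissible value."""
    return all(
        instance[slot] in values
        for slot, values in ontology.items()
        if slot in instance
    )
```

Open-valued slots such as Name are left out of the ontology here, so they are not checked.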

2.2 Dialogue Consistency

A TOD can be conceptualized as a sequence of conversational turns between a user and a system aimed at achieving a specific goal. Within this framework, ensuring the consistency of the dialogue is crucial for effective communication between the user and the system. We consider three types of constraints, which need to be respected for a dialogue to be consistent: linguistic, dialogic and domain-based constraints.

Linguistic Constraints.

They are necessary to respect general linguistic rules of language, including morpho-syntactic rules (e.g., gender and number agreement) and syntax-based rules (e.g., the correct use of a preposition). For instance, given the following masked utterance:

    U: I am looking for a restaurant in <MASK>.

the choice of center as substitute to the mask token is valid, while expensive would not be suitable, because the preposition in is rarely used to introduce a price in English.

Dialogic Constraints.

They maintain the semantic coherence across successive turns of the dialogue, ensuring that each utterance logically aligns with the preceding context, thereby facilitating a seamless flow of information. As an example, suppose the following masked dialogue turns:

U: I would like an Italian restaurant.
S: There is no <MASK> restaurant in the
   center.

Here both Italian and cheap would be eligible choices from a linguistic point of view, but only Italian would maintain the coherence with the previous turn in the dialogue.

Domain Constraints.

They ensure alignment between the dialogue content and the domain knowledge, thereby keeping the dialogue faithful to relevant factual information. Consider, for instance, a $KB$ with the following restaurants:

R1: N=Mario A=east F=italian P=expensive
R2: N=Napoli A=centre F=italian P=moderate

And the following piece of masked dialogue:

U: I am looking for an Italian restaurant
   in the centre.
S: We have <MASK> restaurants available for
   your preferences.

Then, the only admissible choice for the masked token would be one, as selecting any other number would introduce an inconsistency with the information provided in the $KB$.

2.3 Dialogue Consistency as CSP

A CSP (Kumar, 1992) imposes certain conditions on a finite set of variables through constraints. Each variable has a finite set of possible values, known as its domain, and constraints define which combinations of values are allowed for specific subsets of the variables. A constraint can be given either explicitly, by enumerating the tuples allowed, or implicitly, e.g., by an algebraic expression. The solution of a CSP is an instantiation of all the variables for which all the constraints are satisfied. A CSP is solvable if it has at least one solution, otherwise it is unsolvable or overconstrained.

The hypothesis of this paper is that the dialogue constraints outlined in Section 2.2 can be modeled as CSPs. Intuitively, variables are the portions of the dialogue that need to be constrained (i.e., the <MASK> tokens in our examples), while the range of possible values for the variables is expressed, either explicitly or implicitly, in a domain $KB$ for that dialogue. The CSP task consists of selecting variable assignments that adhere to linguistic, dialogic, and domain constraints. To formalize this notion, consider a dialogue $d$ for which $n$ variables (i.e., masked tokens) $x_1, x_2, \ldots, x_n$ have been defined. Let $D_i$ denote the domain of possible values for variable $x_i$; let $\mathcal{C}$ be the set of constraints (i.e., linguistic, dialogic, and domain constraints) over the dialogue $d$, and let $C_j$ represent a single constraint in $\mathcal{C}$.
The CSP task is to determine if there exists an assignment $A = \{(x_1, a_1), (x_2, a_2), \ldots, (x_n, a_n)\}$ with $a_i \in D_i$ for $1 \leq i \leq n$, such that $A$ satisfies all constraints in $\mathcal{C}$. This problem can be formulated as follows:

$$\text{Satisfies}(\{(x_1, a_1), (x_2, a_2), \ldots, (x_n, a_n)\}, C_j) \quad \forall C_j \in \mathcal{C}$$

where $\text{Satisfies}(A, C_j)$ denotes the binary relationship between an assignment $A$ and a constraint $C_j$, indicating whether the assignment satisfies the constraint.
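The formulation above can be sketched as a brute-force enumeration over the Cartesian product of the domains. This is only an illustration of the definition, not the search strategy used by an actual CSP solver; the function name `solve` and the encoding of constraints as Python predicates are assumptions made for this example.

```python
from itertools import product


def solve(domains, constraints):
    """Enumerate all assignments that satisfy every constraint.

    domains: dict mapping a variable name to its list of possible values (D_i).
    constraints: list of predicates, each taking an assignment dict and
                 returning True when the constraint is satisfied.
    """
    names = list(domains)
    solutions = []
    for values in product(*(domains[name] for name in names)):
        assignment = dict(zip(names, values))
        # Satisfies(A, C_j) must hold for every C_j in the constraint set.
        if all(constraint(assignment) for constraint in constraints):
            solutions.append(assignment)
    return solutions


# Toy problem: two masked tokens that must take the same value.
domains = {"x1": ["italian", "spanish"], "x2": ["italian", "spanish"]}
constraints = [lambda a: a["x1"] == a["x2"]]
solutions = solve(domains, constraints)
```

An empty result corresponds to an unsolvable (overconstrained) problem.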

Figure 2: Overview of the CSP-based methodology applied to TOD consistency.

3 Methodology

This section outlines the process of modeling a TOD as a CSP and then assessing dialogue consistency using a CSP solver. The assessment involves three key steps for a $[d, kb]$ pair, where $d$ is a dialogue and $kb$ is a Knowledge Base: (1) identification of variables within the dialogue $d$ (Section 3.1); (2) extraction of dialogue constraints and construction of a CSP solver for the $[d, kb]$ pair (Section 3.2); and (3) application of the CSP solver to determine if the dialogue $d$ represents a feasible solution with respect to the defined constraints (Section 3.3). These phases of the methodology are illustrated in Figure 2.

3.1 Identifying Dialogue Variables

At step 1 (see Figure 2), we consider a TOD $d$ and a $kb$ (i.e., a set of entities described by slot-value pairs) related to the conversational domain of the dialogue. We do not assume any particular dependency between $d$ and $kb$: $d$ could be either fully covered by $kb$ (i.e., all mentions of slot values in $d$ are present in $kb$), only partially covered, or not covered at all. We consider text portions in $d$ referring to the conversational domain as potential CSP variables: a text portion referring to a slot value or mentioning a number of instances in $kb$. The rationale is that both slot values and instance counts are the elements that best characterize a TOD and are responsible for its consistency. In our example in Figure 1, we obtain the following variables with their assignments:

$[x_1 = \textit{Spanish}]$, $[x_2 = \textit{three}]$, $[x_3 = \textit{Spanish}]$, $[x_4 = \textit{cheap}]$, $\ldots$, $[x_{10} = \textit{Lebanese}]$.

3.2 Extracting Dialogue Constraints

We have now established a set $\mathcal{X}$ of variables $x_1, x_2, \ldots, x_n$, where each variable $x_i$ can assume a value either from the slot values or from the instance counts described in $kb$. Moving to step 2 in Figure 2, we now extract the set of constraints $\mathcal{C}$ over the values that can be assigned to the variables in $\mathcal{X}$. We consider the three categories of constraints introduced in Section 2.2: linguistic, dialogic, and domain-based constraints.

Extracting linguistic constraints.

We model linguistic constraints as the need for a variable derived from a slot value to match the semantic type of its slot. For instance, given the utterance I am looking for a restaurant at $x_1$, the value of the variable $x_1$ must belong to the Area type. More precisely, $C_1$ is defined as follows:

$$C_1: x_1 \in V$$

where $V$ is the set of values belonging to the same slot type as the original value. Constraint $C_1$ prevents a variable from assuming values that are semantically invalid. For instance, a variable $x_1$ whose original value is north cannot be assigned a Food value such as indian, as in I am looking for a restaurant at indian, which is ungrammatical in English.
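A minimal sketch of how $C_1$ could be checked is given below. The slot inventory `SLOT_VALUES` and the helper names are hypothetical; in practice, the value sets would come from the domain ontology.

```python
# Hypothetical slot inventory; values follow the examples in this paper.
SLOT_VALUES = {
    "Area": {"north", "south", "east", "west", "centre"},
    "Food": {"indian", "italian", "spanish", "lebanese"},
    "Price": {"cheap", "moderate", "expensive"},
}


def slot_type(value):
    """Return the slot whose value set contains `value`, or None if there is none."""
    for slot, values in SLOT_VALUES.items():
        if value in values:
            return slot
    return None


def satisfies_c1(original_value, candidate):
    """C1: the candidate must belong to V, the value set of the original value's slot."""
    slot = slot_type(original_value)
    return slot is not None and candidate in SLOT_VALUES[slot]
```

With this inventory, replacing north with centre is accepted, while replacing north with indian is rejected.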

Extracting dialogic constraints.

We currently consider two dialogic constraints: $C_2$ ensures that variables mentioning the same slot value in $d$ are assigned the same value, and $C_3$ ensures that variables with the same semantic type occurring in the same utterance are assigned different values. Given the turn U: I want an $x_1$ restaurant. S: There are 3 restaurants that serve $x_2$, we define $C_2$ as follows:

$$C_2: x_1 = x_2$$

where the aim is to keep internal coherence across the dialogue turns. Given the utterance We have $x_1$, $x_2$, or $x_3$ restaurants., we define $C_3$ as:

$$C_3: x_1 \neq x_2, \quad x_1 \neq x_3, \quad x_2 \neq x_3$$

which captures non-redundancy at the utterance level.
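The two dialogic constraints can be sketched as predicates over an assignment. The function names and the representation of an assignment as a dictionary from variable names to values are assumptions made for illustration.

```python
def satisfies_c2(assignment, coreferent):
    """C2: variables that mention the same slot value must receive the same value."""
    values = {assignment[var] for var in coreferent}
    return len(values) <= 1


def satisfies_c3(assignment, same_utterance):
    """C3: same-type variables in one utterance must receive pairwise different values."""
    values = [assignment[var] for var in same_utterance]
    return len(values) == len(set(values))


# "U: I want an x1 restaurant. S: There are 3 restaurants that serve x2"
coherent = {"x1": "italian", "x2": "italian"}
incoherent = {"x1": "italian", "x2": "spanish"}

# "We have x1, x2, or x3 restaurants."
redundant = {"x1": "italian", "x2": "spanish", "x3": "italian"}
```

Checking set sizes avoids writing out every pairwise comparison, which grows quadratically with the number of variables in the utterance.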

Extracting domain-based constraints.

We currently consider three domain-based constraints. All of them are meant to guarantee consistency between the number of instances mentioned in $d$ and the actual number of instances present in $kb$. We distinguish three cases: $C_4$ covers the cases when an utterance in $d$ states that there are no instances in $kb$; $C_5$ covers the cases where it is stated that there is at least one instance; and $C_6$ covers the cases where there are exactly $n$ instances.

As for $C_4$, consider an utterance indicating no results for a search: There are no restaurants serving $x_1$ food, assuming that there are no restaurants with [Food=$x_1$] in $kb$. For this utterance, $C_4$ is defined as:

$$C_4: \neg\exists\, i \in KB \text{ with values } x_1$$

implying that the variable $x_1$ cannot assume a value that is present in an instance of the $KB$.

As for $C_5$, consider the utterance: We have many $x_1$ restaurants at $x_2$, where at least one restaurant with [Food=$x_1$] and [Area=$x_2$] is supposed to exist in $kb$. For this utterance, $C_5$ is defined as:

$$C_5: \exists\, i \in KB \text{ with values } x_1, x_2$$

imposing the existence of at least one instance with values $x_1$ and $x_2$.

Finally, for $C_6$, consider the utterance There are $x_1$ restaurants at $x_2$. We define the constraint as:

$$C_6: |\{i \in KB \text{ with value } x_2\}| = x_1$$

to check that the number of instances with value $x_2$ is exactly equal to $x_1$.
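All three domain-based constraints reduce to counting the KB instances that match a set of slot values. The sketch below illustrates this on the KB of Figure 1; the helper `matches` and the keyword-argument encoding of slot values are assumptions made for this example.

```python
# KB fragment from Figure 1.
KB = [
    {"Name": "Taberna", "Area": "centre", "Food": "spanish", "Price": "cheap"},
    {"Name": "Espana", "Area": "centre", "Food": "spanish", "Price": "moderate"},
    {"Name": "Beirut", "Area": "centre", "Food": "lebanese", "Price": "cheap"},
]


def matches(kb, **slots):
    """Return the KB instances whose slots equal all the given values."""
    return [inst for inst in kb if all(inst.get(s) == v for s, v in slots.items())]


def satisfies_c4(kb, **slots):
    """C4: no instance in the KB has the given values."""
    return len(matches(kb, **slots)) == 0


def satisfies_c5(kb, **slots):
    """C5: at least one instance in the KB has the given values."""
    return len(matches(kb, **slots)) >= 1


def satisfies_c6(kb, count, **slots):
    """C6: exactly `count` instances in the KB have the given values."""
    return len(matches(kb, **slots)) == count
```

For instance, `satisfies_c6(KB, 2, Food="spanish")` holds, while the count of three stated in Figure 1 does not.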

3.3 Assessing Dialogue Consistency

Once all variables and constraints for a dialogue $d$ are identified, a CSP solver computes possible solutions for the variables in $d$ given $kb$ (step 3 in Figure 2). If one of these solutions matches the variable assignments in $d$, we consider $d$ consistent with respect to $kb$ (step 4 in Figure 2). For example, in the dialogue and $kb$ illustrated in Figure 1, the variable assignments do not match any admissible CSP solution. Specifically, the variable assignment $[x_2 = \textit{three}]$ violates $C_6$, referring to an incorrect number of Spanish instances in $kb$, and the variable assignment $[x_{10} = \textit{Lebanese}]$ violates $C_2$, as it does not maintain coherence with the previous turns. If the CSP solver finds at least one solution, the variable assignments in the dialogue must match one of those solutions, ensuring all constraints are followed. On the other hand, if no solution is found with respect to $kb$, the variable assignments should be empty or contain values not in $kb$ to ensure consistency. These aspects will be further explored in the experiments discussed in Section 4.
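The Figure 1 example can be sketched as follows. Since an assignment belongs to the solution set exactly when it satisfies every constraint, the sketch tests the constraints directly on the dialogue's assignment instead of enumerating solutions. Only the variables named in the text are included, the pairing of variables in each constraint is an assumption, and the count three is encoded as the integer 3.

```python
# KB fragment from Figure 1.
KB = [
    {"Name": "Taberna", "Area": "centre", "Food": "spanish", "Price": "cheap"},
    {"Name": "Espana", "Area": "centre", "Food": "spanish", "Price": "moderate"},
    {"Name": "Beirut", "Area": "centre", "Food": "lebanese", "Price": "cheap"},
]

# Assignments read off the dialogue in Figure 1 (subset of the variables).
dialogue = {"x1": "spanish", "x2": 3, "x3": "spanish", "x10": "lebanese"}


def count_food(kb, food):
    """Number of KB instances serving the given food."""
    return sum(1 for inst in kb if inst["Food"] == food)


# Each constraint is a (label, predicate) pair; the variable pairing is assumed.
constraints = [
    ("C2: x1 = x3", lambda a: a["x1"] == a["x3"]),
    ("C2: x1 = x10", lambda a: a["x1"] == a["x10"]),
    ("C6: |Food = x1| = x2", lambda a: count_food(KB, a["x1"]) == a["x2"]),
]


def violations(assignment):
    """Return the labels of the constraints that the assignment violates."""
    return [label for label, holds in constraints if not holds(assignment)]


def is_consistent(assignment):
    """A dialogue is consistent when its assignment violates no constraint."""
    return not violations(assignment)
```

Returning the violated labels, and not only a Boolean, makes it possible to tell a dialogue inconsistency ($C_2$) from a domain inconsistency ($C_6$).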

4 Experimental Setting

In this section, we present the experimental setup used to assess dialogue consistency through a CSP solver. We describe the general setting and the purposes of the experiments (Section 4.1), the dataset used (Section 4.2), the KBs associated with each dialogue (Section 4.3), the tools employed for constraint satisfaction (Section 4.4), the language model used for dialogue generation (Section 4.5), the baselines against which we compare our results (Section 4.6), and finally the evaluation metrics (Section 4.7).

Dataset        # dialogues    # variables
All            131            768
0 sol.         56             403
1 sol.         16             58
2-10 sol.      27             143
11-100 sol.    18             95
101+ sol.      14             69
Table 1: Dialogue distribution based on the number of solutions provided by the CSP solver.

4.1 Purposes and General Setting

The purpose of the experiments is to check the feasibility of the CSP-based approach described in Section 3 for detecting dialogue inconsistencies. Our focus is not on optimizing the performance of the CSP solver but rather on investigating critical aspects of the process in a realistic setting. Several steps are involved in this process:

  1. Initially, we require dialogue-knowledge base ($d$-$kb$) pairs. As for dialogues $d$, we use MultiWOZ (Han et al., 2020) dialogues, which are already annotated for dialogue state tracking, enabling precise identification of variables within the dialogue. From an annotated MultiWOZ dialogue $d$, we derive a de-lexicalized version $d_{delex}$, where dialogue content is replaced with CSP variables.

  2. Additionally, for each dialogue, we derive a knowledge base ($kb$) from the MultiWOZ ontology, allowing variation in both the size and type of instances.

  3. With $d_{delex}$ and $kb$ established, the next step involves generating variable assignments that can be assessed via a CSP solver. To produce dialogues with potential realistic inconsistencies, we employ a large language model (LLM). The LLM is tasked with re-lexicalizing the variables (i.e., substituting slot values for CSP variables) in $d_{delex}$, considering the provided $kb$. The LLM prompt is illustrated in Appendix A. This re-lexicalization process aims to maximize correctness while adhering to all implicit dialogue constraints.

  4. Finally, the re-lexicalized dialogue $d_{relex}$ produced by the LLM serves as a variable assignment and is compared with the solutions of the CSP solver on the same $d$-$kb$ pair to produce a consistency score.
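The de-lexicalization in step 1 can be sketched as a replacement of annotated slot values with numbered variables. The annotation format used here (one ordered list of mentioned values per turn) is a hypothetical simplification that stands in for the MultiWOZ dialogue state annotations.

```python
def delexicalize(utterances, annotations):
    """Replace annotated slot values with CSP variables x1, x2, ...

    utterances: list of strings, one per turn.
    annotations: list (one entry per turn) of the slot values mentioned in
                 that turn, in order of appearance (hypothetical format).
    Returns the de-lexicalized turns and the original variable assignment.
    """
    delex, assignment, counter = [], {}, 0
    for text, values in zip(utterances, annotations):
        cursor = 0
        for value in values:
            counter += 1
            var = f"x{counter}"
            # Search from the cursor so repeated values map to later mentions.
            start = text.index(value, cursor)  # raises ValueError if not found
            text = text[:start] + var + text[start + len(value):]
            cursor = start + len(var)
            assignment[var] = value
        delex.append(text)
    return delex, assignment


utterances = [
    "I am looking for a restaurant serving Spanish food.",
    "There are three restaurants serving Spanish food.",
]
annotations = [["Spanish"], ["three", "Spanish"]]
delex, assignment = delexicalize(utterances, annotations)
```

The returned assignment is the original lexicalization; a re-lexicalized dialogue provides an alternative assignment for the same variables.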

4.2 MultiWOZ Dataset

The experimental data was sourced from the MultiWOZ 2.3 dataset (Han et al., 2020), a widely used benchmark for TOD systems comprising more than ten thousand conversations between a user and a system, covering various domains such as restaurants, hotels, or attractions. For our experiments, we focus on restaurant-related dialogues from the MultiWOZ dataset. In total we consider 131 dialogues with 768 total de-lexicalizations (i.e., CSP variables), as shown in the first row of Table 1. In addition, Table 1 categorizes the dataset into groups based on the number of solutions identified by MiniZinc (see Section 4.4) for each dialogue.

Constraint    # variables    Coverage (proportion)
C1            768            1.00
C2            686            0.89
C3            108            0.14
C4            9              0.01
C5            394            0.51
C6            197            0.25
Table 2: Number of dialogue variables affected by constraints and their proportion.

4.3 Knowledge Base

The $kb$s employed in the experiments are sourced from the MultiWOZ database. Specifically, for each dialogue $d$ in MultiWOZ, we selected a pertinent instance from the global MultiWOZ KB that aligns with the content of the dialogue. This ensures both relevance and coherence between the dialogue and the associated domain information. Additionally, to introduce variability in the composition of the dialogue $kb$, we randomly sampled a set of $n$ instances from the global MultiWOZ KB, where $n$ is a randomly generated number between 0 and 8. This approach ensures a diverse range of instances in the dialogue $kb$ while constraining the total number of instances to a maximum of 9, which keeps the $kb$ small enough to be included in the LLM prompt.
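The sampling procedure can be sketched as follows. The function name is hypothetical, and two details are assumptions not stated above: the extra instances are drawn without replacement, and the pertinent instance is excluded from the random draw.

```python
import random


def build_dialogue_kb(global_kb, pertinent, rng, max_extra=8):
    """Return the pertinent instance plus n random others, with 0 <= n <= max_extra."""
    # Assumption: the pertinent instance is not drawn a second time.
    others = [inst for inst in global_kb if inst is not pertinent]
    # The upper bound is capped so that small global KBs do not cause an error.
    n = rng.randint(0, min(max_extra, len(others)))
    # Assumption: sampling without replacement.
    return [pertinent] + rng.sample(others, n)
```

Passing an explicit `random.Random` instance as `rng` makes the sampled KBs reproducible across runs.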

4.4 MiniZinc Constraint Solver

As CSP solver, we use MiniZinc (Nethercote et al., 2007), an open-source constraint programming language specifically designed for modeling and solving constraint satisfaction problems. We employed MiniZinc to obtain solutions satisfying the dialogue constraints for our evaluation purposes. MiniZinc provides a high-level modeling language that allows users to express problem constraints and objectives. It supports a wide range of constraint types, which makes it suitable for modeling diverse problem domains. Among MiniZinc's suite of solvers, we used Chuffed (Chu et al., 2018), a state-of-the-art solver known for its efficiency in solving CSPs, which is especially advantageous for complex and large-scale problems.

4.5 GPT-3.5-Turbo Language Model

For dialogue re-lexicalization, we employed the GPT-3.5-Turbo language model, a member of the OpenAI GPT family (Achiam et al., 2023), specifically designed to perform well in conversational contexts. GPT-3.5-Turbo was prompted with both the de-lexicalized dialogue $d_{delex}$ and its associated $kb$. This input served to guide the model to produce dialogues that adhere to the implicit constraints, with the goal of preserving dialogue coherence and adherence to the domain. We used GPT-3.5 for inference in zero-shot mode (see Appendix A), without any fine-tuning, through the API version dated "2023-05-15" with a temperature setting of 0.9 to ensure balanced exploration and exploitation during dialogue generation.

4.6 Baselines

We introduce two dialogue re-lexicalization baselines for a comparative analysis with GPT. The first baseline (Random) produces a dialogue $d_{relex}$ where variables in $d_{delex}$ are randomly assigned to slot values present in the $kb$. The second baseline (Most Frequent) produces a dialogue $d_{relex}$ where variables in $d_{delex}$ are assigned to the most frequent value observed in the $kb$. By contrasting our evaluation results with these baselines, we gain insights into the efficacy of our approach in capturing and assessing dialogue consistency.
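The two baselines can be sketched as below, on the KB fragment of Figure 1. The sketch assumes that each variable is restricted to the values of its own slot type, which the description above does not state, and it does not cover variables that denote instance counts; all function names are hypothetical.

```python
import random
from collections import Counter

# KB fragment from Figure 1.
KB = [
    {"Name": "Taberna", "Area": "centre", "Food": "spanish", "Price": "cheap"},
    {"Name": "Espana", "Area": "centre", "Food": "spanish", "Price": "moderate"},
    {"Name": "Beirut", "Area": "centre", "Food": "lebanese", "Price": "cheap"},
]


def slot_values(kb, slot):
    """All values of `slot` in the KB, with repetitions."""
    return [inst[slot] for inst in kb if slot in inst]


def random_baseline(variables, kb, rng):
    """Assign each variable a random KB value of its slot (assumed restriction)."""
    return {var: rng.choice(slot_values(kb, slot)) for var, slot in variables.items()}


def most_frequent_baseline(variables, kb):
    """Assign each variable the most frequent KB value of its slot."""
    return {
        var: Counter(slot_values(kb, slot)).most_common(1)[0][0]
        for var, slot in variables.items()
    }


# Each variable is mapped to the slot type it was derived from.
variables = {"x1": "Food", "x2": "Price"}
```

Neither baseline looks at the dialogue context, so both ignore dialogic constraints such as $C_2$ and $C_3$ by construction.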

| Method | GCA | VCA |
|---|---|---|
| Random | 0.01 | 0.06 |
| Most Frequent | 0.01 | 0.10 |
| GPT | 0.15 | 0.27 |
Table 3: Global and variable consistency for dialogues re-lexicalized by GPT compared to the Random and Most Frequent baselines.

4.7 Evaluation Metrics

Global Consistency Accuracy (GCA) and Variable Consistency Accuracy (VCA) are the two metrics used to evaluate the adherence of a dialogue to a specific set of constraints. Given a re-lexicalized dialogue $d_{relex}$ where variables are assigned to values, GCA measures whether the assignments, taken as a whole, satisfy every constraint. The average GCA is calculated as the proportion of dialogues that fully comply with all defined constraints:

$$GCA = \frac{\sum_{i=1}^{N}\left(\prod_{j=1}^{M}\textit{Satisfies}(A_i, C_j)\right)}{N}$$

where $N$ is the total number of dialogues, $M$ is here the number of constraints, and $\textit{Satisfies}(A_i, C_j)$ is a binary indicator function that returns 1 if and only if all variable assignments $A_i$ in dialogue $d_i$ comply with the constraint $C_j$, and 0 otherwise. On the other hand, VCA assesses the assignment accuracy on individual variables within the dialogue. We compare the dialogue assignment to the solutions of the CSP solver and find the most similar solution; then, we count how many variable assignments coincide with the assignments of the most similar solution. We formally define VCA as follows:

$$VCA = \frac{\sum_{i=1}^{N}\lvert\textit{CorrectAssignments}(d_i)\rvert}{M}$$

where $N$ is the total number of dialogues, $M$ is the total number of variables in the dialogues, and $\textit{CorrectAssignments}(d_i)$ are the variable assignments in dialogue $d_i$ that coincide with the assignments of the most similar solution provided by the CSP solver. GCA and VCA provide insights into the ability of the dialogue generation system to maintain coherence and fidelity to the underlying domain knowledge while generating responses. Higher values of GCA and VCA indicate better performance in terms of dialogue quality and consistency.
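Both metrics can be computed in a few lines once the assignments and the solver solutions are available. The sketch below is illustrative and rests on assumptions: assignments and solutions are represented as dictionaries from variable names to values, constraints as Python predicates, and a dialogue whose CSP has no solution contributes zero correct assignments.

```python
def gca(assignments, constraints):
    """Fraction of dialogues whose assignment satisfies every constraint."""
    if not assignments:
        return 0.0
    satisfied = sum(
        all(constraint(a) for constraint in constraints) for a in assignments
    )
    return satisfied / len(assignments)


def correct_assignments(assignment, solutions):
    """Number of variables agreeing with the most similar solver solution.
    Assumption: with no solution available, nothing counts as correct."""
    if not solutions:
        return 0
    return max(
        sum(assignment.get(var) == value for var, value in solution.items())
        for solution in solutions
    )


def vca(assignments, solutions_per_dialogue):
    """Correct assignments over all dialogues, divided by the total
    number of variables in the dialogues."""
    total_variables = sum(len(a) for a in assignments)
    if total_variables == 0:
        return 0.0
    correct = sum(
        correct_assignments(a, solutions)
        for a, solutions in zip(assignments, solutions_per_dialogue)
    )
    return correct / total_variables


if __name__ == "__main__":
    # Hypothetical solver output shared by two toy dialogues.
    solutions = [
        {"x1": "british", "x2": "centre"},
        {"x1": "european", "x2": "west"},
    ]
    assignments = [
        {"x1": "british", "x2": "centre"},  # matches a solution exactly
        {"x1": "british", "x2": "west"},    # agrees on one variable only
    ]
    constraints = [lambda a: a in solutions]
    print(gca(assignments, constraints))
    print(vca(assignments, [solutions, solutions]))
```

The example shows why VCA is never lower than GCA on the same data: a dialogue that violates a constraint still receives partial credit for the variables that agree with its closest solution.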

5 Results

Table 2 presents the impact of each constraint on the variables in the dataset, detailing the percentage of variables influenced by each constraint. This shows that $C1$ (i.e., assigned values need to respect the semantic type of the variable) applies to all variables in the dataset, while $C4$ (no instances in $kb$) applies only nine times in total. Table 3 compares the global and variable consistency in dialogues re-lexicalized by GPT with the Random and Most Frequent baselines. GPT dialogues exhibit significantly higher global and variable consistency compared to the baseline datasets. Table 4 assesses GCA and VCA for GPT dialogues across various CSP solution groups. Results show that dialogues with more solutions tend to have higher GCA and VCA scores, while the model is not able to recognize and address the 0 solution cases.

Table 5 presents the results of an ablation study, where we systematically remove each constraint one by one and analyse the impact on GCA and VCA for each configuration. Results show that the most critical constraint is $C6$ (i.e., exact match with number of $kb$ instances). Additionally, we conducted experiments where groups of constraints were collectively removed to observe their influence on the dialogue generation process, confirming that domain-based constraints are more critical.
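The leave-one-out procedure behind the ablation can be sketched as follows for GCA. The constraint names (`CA`, `CB`) and the toy assignments are hypothetical placeholders rather than the constraints of our study, and the sketch only re-checks existing assignments; it does not re-run a solver.

```python
def gca(assignments, constraints):
    """Fraction of assignments that satisfy every constraint."""
    if not assignments:
        return 0.0
    satisfied = sum(
        all(constraint(a) for constraint in constraints) for a in assignments
    )
    return satisfied / len(assignments)


def ablate(assignments, named_constraints):
    """Recompute GCA once per constraint, leaving that constraint out."""
    results = {}
    for left_out in named_constraints:
        kept = [
            constraint
            for name, constraint in named_constraints.items()
            if name != left_out
        ]
        results[f"all except {left_out}"] = gca(assignments, kept)
    return results


if __name__ == "__main__":
    # Hypothetical constraints over a single numeric variable "n".
    named_constraints = {
        "CA": lambda a: a["n"] >= 2,
        "CB": lambda a: a["n"] <= 2,
    }
    assignments = [{"n": 1}, {"n": 2}, {"n": 3}]
    print("all:", gca(assignments, list(named_constraints.values())))
    for configuration, score in ablate(assignments, named_constraints).items():
        print(configuration, score)
```

A constraint whose removal raises the score the most is the one violated most often, which is how the ablation singles out the hardest constraints.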

| Solution group | GCA | VCA |
|---|---|---|
| 0 sol. | 0.0 | 0.0 |
| 1 sol. | 0.31 | 0.48 |
| 2-10 sol. | 0.22 | 0.53 |
| 11-100 sol. | 0.22 | 0.55 |
| 101+ sol. | 0.36 | 0.70 |
Table 4: Assessment of global and variable consistency for re-lexicalized dialogues across solution groups.

6 Discussion

The experimental results shed light on several key aspects of consistency assessment for TODs. First, comparing GPT and the two baselines (Random and Most Frequent) on re-lexicalized dialogues, we note the better quality achieved by the GPT model (see Table 3), both in terms of GCA and VCA: GPT re-lexicalizes dialogues that adhere more closely to the defined constraints. Furthermore, the assessment of global and variable consistency across different solution groups reveals interesting patterns (see Table 4). Dialogues with a higher number of solutions tend to exhibit higher levels of consistency, indicating that the model performs better when presented with more options to fulfill constraints. At the other extreme, the model is not able to address cases where no feasible solution exists, as it always attempts to assign values to the variables. This finding emphasizes the importance of considering the richness and diversity of CSP solutions, as they have a strong impact on the quality and consistency of re-lexicalized dialogues. Additionally, analysing the distribution of constraints on the dialogue variables reveals significant variations (see Table 2), with certain constraints exerting a stronger influence than others. The ablation study provides valuable insights into the impact of the different constraints on dialogue re-lexicalization. Excluding domain constraints, in particular, leads to significantly higher GCA and VCA scores, indicating the critical role of domain-specific knowledge in shaping dialogue coherence and relevance (see Table 5). This suggests that recent LLMs may not effectively leverage the provided $kb$, highlighting an area for potential improvement in future iterations of language model training and dialogue re-lexicalization techniques. Our experiments have shown that modeling and assessing dialogue consistency through CSP is both feasible and challenging. We were able to highlight both strengths and weaknesses of dialogue generation and to discern which constraints are met and which are not, gaining insight into the specific features and challenges inherent in this process.

| Constraints | GCA | VCA |
|---|---|---|
| all except C1 | 0.15 | 0.31 |
| all except C2 | 0.15 | 0.27 |
| all except C3 | 0.15 | 0.29 |
| all except C4 | 0.16 | 0.30 |
| all except C5 | 0.15 | 0.32 |
| all except C6 | 0.21 | 0.48 |
| all except dialogic | 0.15 | 0.30 |
| all except domain | 0.23 | 0.56 |
Table 5: Ablation study: global and variable consistency under different constraint configurations.

7 Related Work

TOD systems have been extensively investigated in NLP (Allen et al., 2001). Recent research has explored the use of neural network architectures for dialogue state tracking (Wu et al., 2020; Zhao et al., 2021) and policy learning (Su et al., 2016; Liu and Lane, 2017). Several metrics have been proposed to assess the performance of TOD systems, including task completion rates, user satisfaction scores, and objective measures for system components, such as precision, recall, and F1-score (Chen et al., 2017; Santhanam and Shaikh, 2019; Deriu et al., 2021). Recent studies have emphasized the importance of holistic evaluation frameworks that consider multiple aspects of dialogue quality (Zhang et al., 2021).

Maintaining consistency and coherence in dialogues is essential for effective communication between users and dialogue systems. Previous research has investigated various approaches to ensure dialogue coherence, including coherence modeling (Cervone et al., 2018) and coherence-based response generation (Cervone and Riccardi, 2020), aiming to enhance the naturalness and fluency of generated dialogues. Finally, several studies have explored the application of CSPs to language. These include early attempts to ensure coherence in generated text (Kibble and Power, 2004), model preposition lexicalization using constraints (Moriceau and Saint-Dizier, 2004), guide lexical choices through constraints (McKeown et al., 1997), and treat context-sensitive utterance generation as a CSP (Popescu et al., 2009).

8 Conclusion

In this paper, we have introduced a novel approach to assess dialogue consistency in the context of TODs using a metric based on Constraint Satisfaction. In our approach, variables represent de-lexicalized segments of the dialogue and constraints reflect linguistic, conversational, and domain-based properties of TODs. Our experiments have demonstrated the feasibility of this approach, enabling us to effectively identify and quantify inconsistencies present in the dialogues. An interesting side-effect of our investigation is the observation that state-of-the-art LLMs often introduce numerous inconsistencies when tasked with re-lexicalizing dialogues. These inconsistencies primarily concern domain knowledge adherence, resulting in an overall accuracy of only 0.15 at the dialogue level. Our study highlights the potential of CSP-based methodologies in evaluating dialogue consistency and identifying areas for improvement in automated dialogue generation systems. Future research should further explore the application of CSP in this domain and investigate strategies to enhance the coherence of LLM-generated dialogues, particularly in applications with strong domain knowledge requirements.

9 Limitations

Our study is subject to several limitations that warrant consideration. Firstly, the process of defining constraints for dialogue consistency assessment is complex and multifaceted. While we have delineated several constraints in this study, the TOD landscape is vast, and additional constraints may need to be identified and incorporated to capture a broader range of dialogue scenarios accurately. Each constraint is formulated based on our current understanding of the phenomena, and further investigations may uncover additional constraints. We also took implementation feasibility into account, as certain constraints may require more extensive implementation efforts to detect. Moreover, the selection and prioritization of constraints inherently involve subjective judgment, and achieving consensus on the most relevant constraints for a given dialogue domain may pose a challenge.

Secondly, while we employed a state-of-the-art Large Language Model (LLM) for dialogue generation and consistency assessment, the performance of alternative language models remains unexplored. Investigating the effectiveness of various LLM architectures, pre-training strategies, or fine-tuning approaches could provide valuable insights into their suitability for TOD tasks.

Furthermore, while our methodology endeavors to be as generalizable as possible, it is important to acknowledge that nuances in dialogue structures and domain-specific knowledge may exist across different datasets, and there may still be aspects of dialogue consistency that our approach may not fully capture. Exploring additional datasets spanning diverse domains and languages could offer a more comprehensive understanding of dialogue consistency challenges and the efficacy of our proposed methodology.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Allen et al. (2001) James Allen, George Ferguson, and Amanda Stent. 2001. An architecture for more realistic conversational systems. In Proceedings of the 6th international conference on Intelligent user interfaces, pages 1–8.
  • Balaraman et al. (2021) Vevake Balaraman, Seyedmostafa Sheikhalishahi, and Bernardo Magnini. 2021. Recent neural methods on dialogue state tracking for task-oriented dialogue systems: A survey. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGdial 2021, Singapore and Online, July 29-31, 2021, pages 239–251. Association for Computational Linguistics.
  • Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.
  • Cervone and Riccardi (2020) Alessandra Cervone and Giuseppe Riccardi. 2020. Is this dialogue coherent? learning from dialogue acts and entities. arXiv preprint arXiv:2006.10157.
  • Cervone et al. (2018) Alessandra Cervone, Evgeny Stepanov, and Giuseppe Riccardi. 2018. Coherence models for dialogue. arXiv preprint arXiv:1806.08044.
  • Chen et al. (2017) Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A survey on dialogue systems: Recent advances and new frontiers. Acm Sigkdd Explorations Newsletter, 19(2):25–35.
  • Chu et al. (2018) Geoffrey Chu, Peter J. Stuckey, Anthony Schutt, Thorsten Ehlers, Graeme Gange, and Keith Francis. 2018. Chuffed, a lazy clause generation solver. https://github.com/chuffed/chuffed.
  • Deriu et al. (2021) Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2021. Survey on evaluation methods for dialogue systems. Artificial Intelligence Review, 54:755–810.
  • Han et al. (2020) Ting Han, Ximing Liu, Ryuichi Takanobu, Yixin Lian, Chongxuan Huang, Wei Peng, and Minlie Huang. 2020. Multiwoz 2.3: A multi-domain task-oriented dataset enhanced with annotation corrections and co-reference annotation. arXiv preprint arXiv:2010.05594.
  • Henderson et al. (2014) Matthew Henderson et al. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272, Philadelphia, PA, U.S.A. Association for Computational Linguistics.
  • Ji et al. (2022) Ziwei Ji et al. 2022. Survey of hallucination in natural language generation. ACM Computing Surveys.
  • Kibble and Power (2004) Rodger Kibble and Richard Power. 2004. Optimizing referential coherence in text generation. Computational Linguistics, 30(4):401–416.
  • Kumar (1992) Vipin Kumar. 1992. Algorithms for constraint-satisfaction problems: A survey. AI magazine, 13(1):32–32.
  • Labruna and Magnini (2022) Tiziano Labruna and Bernardo Magnini. 2022. Fine-tuning bert for generative dialogue domain adaptation. In Text, Speech, and Dialogue, pages 490–501.
  • Labruna and Magnini (2023) Tiziano Labruna and Bernardo Magnini. 2023. Addressing domain changes in task-oriented conversational agents through dialogue adaptation. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, pages 149–158.
  • Lai et al. (2023) Tuan M Lai, Giuseppe Castellucci, Saar Kuzi, Heng Ji, and Oleg Rokhlenko. 2023. External knowledge acquisition for end-to-end document-oriented dialog systems. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3633–3647.
  • Liu and Lane (2017) Bing Liu and Ian Lane. 2017. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 482–489. IEEE.
  • Louvan and Magnini (2020) Samuel Louvan and Bernardo Magnini. 2020. Recent neural methods on slot filling and intent classification for task-oriented dialogue systems: A survey. In Proceedings of the 28th International Conference on Computational Linguistics, pages 480–496, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  • McKeown et al. (1997) Kathleen McKeown, Michael Elhadad, and Jacques Robin. 1997. Floating constraints in lexical choice.
  • McTear (2020) Michael McTear. 2020. Conversational ai: Dialogue systems, conversational agents, and chatbots. Synthesis Lectures on Human Language Technologies, 13(3):1–251.
  • Moriceau and Saint-Dizier (2004) Véronique Moriceau and Patrick Saint-Dizier. 2004. A constraint-based model for preposition choice in natural language generation. Constraint Solving and Language Processing, page 124.
  • Nethercote et al. (2007) Nicholas Nethercote, Peter J. Stuckey, Rowan Becket, Simon Brand, Greg J. Duck, and Guido Tack. 2007. Minizinc: Towards a standard cp modelling language. In CP 2007, volume 4741 of LNCS, pages 529–543. Springer.
  • Popescu et al. (2009) Vladimir Popescu, Jean Caelen, and Corneliu Burileanu. 2009. A constraint satisfaction approach to context-sensitive utterance generation in multi-party dialogue systems. International Journal of Speech Technology, 12:95–112.
  • Qin et al. (2021) Libo Qin, Tianbao Xie, Shijue Huang, Qiguang Chen, Xiao Xu, and Wanxiang Che. 2021. Don’t be contradicted with anything! ci-tod: Towards benchmarking consistency for task-oriented dialogue system. arXiv preprint arXiv:2109.11292.
  • Santhanam and Shaikh (2019) Sashank Santhanam and Samira Shaikh. 2019. Towards best experiment design for evaluating dialogue system output. arXiv preprint arXiv:1909.10122.
  • Su et al. (2016) Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. arXiv preprint arXiv:1605.07669.
  • Wu et al. (2020) Peng Wu, Bowei Zou, Ridong Jiang, and AiTi Aw. 2020. Gcdst: A graph-based and copy-augmented multi-domain dialogue state tracking. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1063–1073.
  • Zhang et al. (2021) Chen Zhang, Grandee Lee, Luis Fernando D’Haro, and Haizhou Li. 2021. D-score: Holistic dialogue evaluation without reference. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2502–2516.
  • Zhao et al. (2021) Jeffrey Zhao, Mahdis Mahdieh, Ye Zhang, Yuan Cao, and Yonghui Wu. 2021. Effective sequence-to-sequence dialogue state tracking. arXiv preprint arXiv:2108.13990.

Appendix A: GPT prompt

Below is an instruction that outlines a task, along with a Knowledge Base containing domain-specific information to be utilized, and a dialogue for you to work on. Return a response that effectively fulfills the task.

### Instruction:
Fill in the [MASK] placeholders in the dialogue based on the information provided in the Knowledge Base. Provide the updated dialogue exactly as it was given, but with the placeholders replaced by the appropriate values for each turn in the dialogue. If a turn does not contain any placeholders, leave the sentence unchanged. Turns should start with either User or System. Be aware of leaving blank spaces before punctuation as in the original (e.g. Hi , instead of Hi,)

### Knowledge Base:
Restaurant #1 - Area: centre, Food: british, Price: moderate
Restaurant #2 - Area: west, Food: european, Price: expensive

### Dialogue:
User: I ’m looking for a restaurant serving [MASK] food in any area .
System: There are no [MASK] restaurants in the area .
User: Well can I get the phone number to a [MASK] restaurant ?
System: Restaurant one seven serves [MASK] food . Their phone number is 01223 355166 . Can I help you with anything else ?
User: That ’s it , goodbye
System: your welcome good bye

### Response:
User: I ’m looking for a restaurant serving european food in any area .
System: There are no european restaurants in the area .
User: Well can I get the phone number to a british restaurant ?
System: Restaurant one seven serves british food . Their phone number is 01223 355166 . Can I help you with anything else ?
User: That ’s it , goodbye
System: your welcome good bye