Should We Fine-Tune or RAG?
Evaluating Different Techniques to Adapt LLMs for Dialogue

Simone Alghisi^†, Massimo Rizzoli^†, Gabriel Roccabruna,
Seyed Mahed Mousavi, Giuseppe Riccardi
Signals and Interactive Systems Lab, University of Trento, Italy
{s.alghisi, massimo.rizzoli, giuseppe.riccardi}@unitn.it
^† Equal contribution.

Abstract

We study the limitations of Large Language Models (LLMs) for the task of response generation in human-machine dialogue. Several techniques have been proposed in the literature for different dialogue types (e.g., Open-Domain). However, the evaluations of these techniques have been limited in terms of base LLMs, dialogue types and evaluation metrics. In this work, we extensively analyze different LLM adaptation techniques when applied to different dialogue types. We have selected two base LLMs, Llama2_C and Mistral_I, and four dialogue types: Open-Domain, Knowledge-Grounded, Task-Oriented, and Question Answering. We evaluate the performance of in-context learning and fine-tuning techniques across datasets selected for each dialogue type. We assess the impact of incorporating external knowledge to ground the generation in both the Retrieval-Augmented Generation (RAG) and the gold knowledge scenarios. We adopt consistent evaluation and explainability criteria for automatic metrics and human evaluation protocols. Our analysis shows that there is no universal best technique for adapting large language models, as the efficacy of each technique depends on both the base LLM and the specific type of dialogue. Last but not least, the assessment of the best adaptation technique should include human evaluation to avoid false expectations and outcomes derived from automatic metrics.


1 Introduction

In recent years, Large Language Models (LLMs) have been employed for the task of response generation in human-machine dialogues (Hosseini-Asl et al., 2020a; Izacard and Grave, 2021; Komeili et al., 2022). Such models have been applied to several dialogue types, including Open-Domain Dialogues (i.e. informal conversations about trivial matters), Knowledge-Grounded Dialogues (i.e. conversations with a system that provides factual responses), Task-Oriented Dialogues (i.e. conversations where the system helps a user to achieve a specific goal), and Question Answering (i.e. question-answer exchanges given context).

However, recent studies have shown the shortcomings of LLMs as dialogue model surrogates as they are prone to generate toxic, biased, and irrelevant responses (Zhang et al., 2020; Mousavi et al., 2022, 2023; Lin and Chen, 2023). To adapt LLMs to dialogue types, different techniques have been employed such as in-context learning (Brown et al., 2020; Chen et al., 2023; Meade et al., 2023) and fine-tuning (Wang et al., 2022; Komeili et al., 2022; Huang et al., 2023). Furthermore, strategies such as grounding (Gopalakrishnan et al., 2019; Zhao et al., 2023) and Retrieval-Augmented Generation (RAG) (Lewis et al., 2020; Borgeaud et al., 2022) have been proposed to improve the generation quality.

Currently, the performance of the aforementioned techniques in adapting LLMs across different dialogue types is understudied. Previous studies have evaluated these techniques in a specific dialogue type only (Raposo et al., 2023; Zhang et al., 2023). Such studies are based on different base models and are assessed via incomparable evaluation methodologies.

In this work, we conduct an extensive study on the efficacy of different techniques to adapt LLMs for multiple dialogue types. We select Llama-2 Chat (Llama2_C) (Touvron et al., 2023) and Mistral Instruct (Mistral_I) (Jiang et al., 2023) as base LLMs, and experiment with in-context learning and fine-tuning in the context of four dialogue types: a) Open-Domain Dialogues (ODDs), b) Knowledge-Grounded Dialogues (KGDs), c) Task-Oriented Dialogues (TODs), d) Question Answering (QA). We also assess the impact of incorporating external knowledge by considering retrieved knowledge and gold knowledge. In the retrieved knowledge scenario, we use RAG to add the knowledge to the model’s input. We assess the performance of each technique using the same automatic metrics and comparable human evaluation. We further compute the contribution of each segment of the input vector by using integrated gradients as an explainability attribution method. We evaluate the models using an open human evaluation protocol (Mousavi et al., 2022) designed for dialogue contextualization, appropriateness, correctness, and validity. In summary, the main contributions of this paper are:

• Adaptation of Llama2_C and Mistral_I using fine-tuning and in-context learning in four different dialogue types and corresponding corpora;

• Assessment of the impact of grounding the response generation on external knowledge, both in cases of retrieved knowledge and gold knowledge;

• Extensive study on the efficacy of each technique using automatic evaluations and human evaluation, including explainability and categorization analysis of natural language generation errors.

2 Literature Review

Open-Domain Dialogue (ODD) In earlier studies, sequence-to-sequence models have been trained for response generation in open-domain dialogues (Li et al., 2017). However, such models suffered from generating generic or inappropriate responses (Zhang et al., 2020). To improve the generation quality, studies grounded the generation on external knowledge, such as persona statements (Wolf et al., 2019; Kasahara et al., 2022; Xu et al., 2022b), the personal graph of user interactions (Mousavi et al., 2023), and retrieved documents (Huang et al., 2023). While the previous works developed data-driven models using training/fine-tuning, recent studies have explored the potential of in-context learning with LLMs (Qian et al., 2023).

Knowledge-Grounded Dialogue (KGD) Sources such as Wikipedia have been used as unstructured knowledge to ground the generated responses (Dinan et al., 2019; Gopalakrishnan et al., 2019; Komeili et al., 2022) to generate consistent and factual answers. To improve the generation quality, previous works have studied the impact of knowledge selection (Qin et al., 2023; Sun et al., 2023), different knowledge representations (Mousavi et al., 2023; Yang et al., 2023), additional knowledge elements (e.g. dialogue acts, topics) (Hedayatnia et al., 2020), training without knowledge supervision (Han et al., 2023), and in-context learning (Chen et al., 2023).

Task-Oriented Dialogue (TOD) LLMs have been fine-tuned for TOD modeling for joint dialogue state tracking and response generation (Hosseini-Asl et al., 2020b; Kulhánek et al., 2021; Wang et al., 2022; Ding et al., 2024), and for robustness to spoken interactions (Thulke et al., 2024; Mousavi et al., 2024). Recent studies focus on augmenting the TOD modeling with unstructured knowledge access (Feng et al., 2020; Kim et al., 2020, 2021). In this regard, He et al. (2024) have proposed a pipeline for retrieval and grounded response generation. Raposo et al. (2023) compared in-context learning and fine-tuning, but considered retrieved replies from previous dialogues as knowledge.

Question Answering (QA) In the most general setting, relevant documents need to be retrieved to provide an answer (Lee et al., 2019; Qu et al., 2020). Some studies have proposed to select the documents with the highest similarity to the question, computed between their BERT encodings (Lee et al., 2019; Karpukhin et al., 2020). With this retrieval strategy, some studies have fine-tuned LLMs to condition the generation on the retrieved documents through grounding (Lewis et al., 2020; Izacard and Grave, 2021) or cross-attention (Borgeaud et al., 2022). Other works generated the answers using zero-shot in-context learning (Levine et al., 2022; Cho et al., 2023). A survey compared existing generation-only, retrieval-only, and RAG models (Zhang et al., 2023) but with different base models, hindering the comparison of the techniques.

3 Experiments

We study and compare in-context learning and fine-tuning as techniques to adapt LLMs for human-machine dialogues. We select Llama-2 Chat (Llama2_C) (Touvron et al., 2023) and Mistral Instruct (Mistral_I) (Jiang et al., 2023) as base LLMs, and experiment in the context of four dialogue types: Open-Domain Dialogue (ODD), Knowledge-Grounded Dialogue (KGD), Task-Oriented Dialogue (TOD), and Question Answering (QA). For each technique and dialogue type, we assess the impact of grounding the generation on documents in the scenarios of retrieved knowledge (RAG) and gold knowledge.

3.1 Datasets

In our experiment, we have selected a dataset for each of the four dialogue types (see §A.1 for selection). The statistics of these datasets are summarized in Table 1.

Open-Domain Dialogue (ODD) We select DailyDialog (Li et al., 2017), a widely-used dataset of human-human dialogues crawled from various websites used by English learners to practice. The final dataset contains 13k written dialogues with an average of 8 turns per dialogue.

Knowledge-Grounded Dialogue (KGD) We experiment on Wizard of Wikipedia (Dinan et al., 2019), a dataset of dialogues between two participants with the roles of apprentice and wizard. At each turn, the wizard can access a set of documents (passages from Wikipedia) and use it to incorporate factual knowledge in their reply. The dataset contains 20k dialogues, each about one of 1359 distinct topics, and provides an unseen set of documents for testing.

Task-Oriented Dialogue (TOD) We select the dataset proposed for the ninth Dialogue System Technology Challenge (Kim et al., 2020). The dataset spans 7 domains and contains 9k multi-domain dialogues. The dialogues include turns where the system needs to access an unstructured knowledge base of 2900 documents (FAQs) to provide a correct response.

Question Answering (QA) We select NarrativeQA (Kočiský et al., 2018), a dataset of 47k questions with free-form answers based on 1.5k books and movie scripts. The question-answer pairs are formulated based on summaries of the books and movies.

Type	Dataset	#Dials	Avg. #Turns	#Ext. Know.
ODD	DailyDialog	13k	8	—
KGD	WoW	20k	9	^†61
TOD	DSTC9	9k	19	2900
QA	NarrativeQA	^*47k	2	1572

Table 1: Selected datasets for each dialogue type: Open-Domain Dialogue (ODD), Knowledge-Grounded Dialogue (KGD), Task-Oriented Dialogue (TOD), and Question Answering (QA). #Ext. Know. indicates the number of documents in the unstructured knowledge base. ^† In KGD the content of the knowledge base differs at each turn, with an average of $61 \pm 22$ documents. ^* Question-answer exchanges.

3.2 Techniques

We evaluate in-context learning and fine-tuning as techniques to adapt LLMs for response generation in the selected dialogue types. In-context learning is a technique that uses instructions and examples to condition the generation. In contrast, fine-tuning further trains the model (completely or partially) on the task of interest, using a dataset that is smaller than the one used in the pre-training phase. In a dialogue setting, fine-tuning should teach the model the notion of the dialogue and the roles of the participants.

As a baseline, for both techniques, we consider the context (i.e. the question for QA, the history for ODD, KGD, and TOD) as the input and use the default prompt structure of the models to separate user and system turns. Additionally, for TOD we append the dialogue state (a summary of user requirements), following previous work on this dialogue type (Wang et al., 2022; Ding et al., 2024). For KGD, we prepend the topic to the start of the dialogue.
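The input construction described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `build_input`, the `User:`/`System:` tags, the `Topic:` and `Dialogue state:` labels, and the assumption that turns alternate starting with the user are all hypothetical stand-ins for each model's default prompt structure.

```python
from typing import List, Optional


def build_input(
    history: List[str],
    dialogue_type: str,
    topic: Optional[str] = None,
    dialogue_state: Optional[str] = None,
    instruction: str = "",
) -> str:
    """Assemble a plain-text model input from the context and optional fields.

    The speaker tags and field labels are placeholders (assumptions); a real
    setup would use the chat template shipped with the chosen model.
    """
    parts = []
    if instruction:
        # Only in-context learning uses an instruction; it may be empty.
        parts.append(instruction)
    if dialogue_type == "KGD" and topic:
        # KGD: the topic is prepended to the start of the dialogue.
        parts.append(f"Topic: {topic}")
    # Assumption: turns alternate, user first. For QA the history is the question.
    for i, turn in enumerate(history):
        speaker = "User" if i % 2 == 0 else "System"
        parts.append(f"{speaker}: {turn}")
    if dialogue_type == "TOD" and dialogue_state:
        # TOD: the dialogue state (summary of user requirements) is appended.
        parts.append(f"Dialogue state: {dialogue_state}")
    # Leave the system turn open so the model generates the response.
    parts.append("System:")
    return "\n".join(parts)


example = build_input(
    ["I need a hotel."], "TOD", dialogue_state="hotel: area=centre"
)
```

Keeping the topic and dialogue state as separate, labeled segments also makes it possible to attribute the generation to each segment later on.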

Model	Technique	External Knowledge	ODD	KGD	TOD	QA
Llama2_C	In-Context Learning	No Know.	64.13	35.17	25.15	1442.26
		Retrieved Know.		33.10	24.72	625.08
		Gold Know.		24.40	23.81	298.16
	Fine-Tuning	No Know.	5.67 $\pm$ 0.01	7.63 $\pm$ 0.01	3.06 $\pm$ 0.01	12.03 $\pm$ 0.06
		Retrieved Know.		6.95 $\pm$ 0.01	3.97 $\pm$ 0.01	5.47 $\pm$ 0.02
		Gold Know.		4.38 $\pm$ 0.01	3.12 $\pm$ 0.01	4.98 $\pm$ 0.01
Mistral_I	In-Context Learning	No Know.	14.19	15.31	9.82	91.42
		Retrieved Know.		14.75	9.76	42.58
		Gold Know.		9.81	9.37	16.74
	Fine-Tuning	No Know.	6.41 $\pm$ 0.01	8.67 $\pm$ 0.01	3.56 $\pm$ 0.01	14.11 $\pm$ 0.01
		Retrieved Know.		7.78 $\pm$ 0.01	3.61 $\pm$ 0.01	5.97 $\pm$ 0.01
		Gold Know.		5.17 $\pm$ 0.01	3.58 $\pm$ 0.01	4.88 $\pm$ 0.01

Table 2: Automatic Evaluation. Perplexity of Fine-Tuning and In-Context Learning with Retrieved (top-3) and Gold (ground-truth) knowledge, on Llama2_C and Mistral_I, in different dialogue types: Open-Domain Dialogues (ODDs), Knowledge-Grounded Dialogues (KGDs), Task-Oriented Dialogues (TODs), and Question Answering (QA). Results for fine-tuned models report mean and standard deviation over three runs.

3.3 Knowledge

Incorporating external knowledge for the task of response generation has been shown to improve the factual accuracy (He et al., 2024) and contextualization (Mousavi et al., 2023) of responses.

For each of the selected dialogue types except ODD, we consider the corresponding unstructured knowledge base. For KGD, we consider passages from Wikipedia, while for TOD we consider FAQs related to services and places (e.g. restaurants, hotels, taxi booking). For QA, we consider all the summaries of the books and movies.

For both in-context learning and fine-tuning, we study the impact of knowledge on the generated responses, in two scenarios:

• Retrieved knowledge: we retrieve k documents from the unstructured knowledge base;

• Gold knowledge: we use the ground truth document.

For the retrieved knowledge scenario, we use the Retrieval-Augmented Generation (RAG) strategy. We use an off-the-shelf retriever (https://github.com/langchain-ai/langchain; model details in §A.2) to retrieve documents from the unstructured knowledge base. First, we encode all the documents considering their content together with their topic (KGD), place or service name (TOD), or title (QA) (Karpukhin et al., 2020). Then, at each turn, we retrieve the k documents whose encodings have the smallest L2 distance from the encoded context. Finally, we feed the retrieved documents to the base models together with the context to generate a response.
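The retrieval step can be illustrated with a small, dependency-free sketch. It is not the retriever used in the experiments: the encoder is omitted, the vectors are toy values, and the names `retrieve_top_k` and `augment_context` as well as the `Knowledge:` label are hypothetical. Only the ranking by L2 distance and the choice of k = 3 (the top-3 setting reported in Table 2) follow the text.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_top_k(
    context_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, distance) for the k closest documents."""
    scored = [(i, l2_distance(context_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


def augment_context(context: str, docs: Sequence[str], hits) -> str:
    """Place the retrieved documents before the dialogue context."""
    knowledge = "\n".join(docs[i] for i, _ in hits)
    return f"Knowledge:\n{knowledge}\n\n{context}"


# Toy encodings standing in for the output of a real sentence encoder.
docs = ["doc about parking", "doc about wifi", "doc about pets", "doc about breakfast"]
doc_vecs = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
hits = retrieve_top_k([1.0, 0.9], doc_vecs, k=2)
prompt = augment_context("User: Is there wifi?", docs, hits)
```

In the gold knowledge scenario, the same `augment_context` step would be applied to the ground truth document instead of the retrieved ones.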

In the gold knowledge scenario, we directly feed the model with the ground truth documents. This serves as an upper bound for RAG. Additionally, this strategy allows us to study the ability of the techniques to incorporate knowledge in the responses.

3.4 Models

We select the widely-used 7B version of Llama2_C and Mistral_I as base models. For in-context learning, we experiment with three instructions for each dialogue type and select the best based on the development set performance. For fine-tuning, we use LoRA, a parameter-efficient technique that has shown comparable performance to fine-tuning all parameters (Hu et al., 2021). Further details about the parameters are reported in §A.2.
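To make the parameter efficiency of LoRA concrete, the sketch below applies the low-rank update from Hu et al. (2021), W' = W + (alpha / r) · B · A, to a toy matrix and compares parameter counts. The matrix sizes, rank, and scaling factor are illustrative assumptions; the values actually used in the experiments are those reported in §A.2.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(
    w: Matrix, a: Matrix, b: Matrix, alpha: float, r: int
) -> Matrix:
    """Return W + (alpha / r) * B @ A, with W: d x k, B: d x r, A: r x k.

    W stays frozen during fine-tuning; only A and B are trained.
    """
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_params(d: int, k: int, r: int) -> Tuple[int, int]:
    """(parameters of a full d x k update, parameters of the rank-r factors)."""
    return d * k, r * (d + k)


# Toy example: a 2 x 2 identity weight with a rank-1 update.
w_new = lora_effective_weight(
    w=[[1.0, 0.0], [0.0, 1.0]], a=[[3.0, 4.0]], b=[[1.0], [2.0]], alpha=2.0, r=1
)

# Hypothetical sizes: one 4096 x 4096 projection matrix adapted with rank 8.
full, low_rank = trainable_params(4096, 4096, 8)
```

Under these hypothetical sizes, the low-rank factors hold 65,536 parameters against 16,777,216 for a full update of the same matrix.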

4 Evaluation

We conduct a comparative study on the impact of in-context learning and fine-tuning to adapt LLMs for dialogues. We select Llama2_C and Mistral_I as base LLMs and experiment in four dialogue types: ODDs, KGDs, TODs, QA. For each dialogue type, we study the impact of external knowledge, both retrieved and gold. Further details about the implementation and the resources used are available in the appendix (§A.2).

4.1 Automatic Evaluation

Table 2 reports the perplexity of Llama2_C and Mistral_I on the test set of each dialogue type. In all dialogue types, fine-tuned models obtain lower perplexity than in-context learning. When considering the impact of external knowledge, perplexity slightly increases for models fine-tuned on TODs. The high perplexity obtained by in-context learning models on QA can be explained by two reasons: first, besides the knowledge, only the question is used as context; second, while the ground truths are particularly short (4.26 tokens on average), these models generate long responses, making them unlikely to include the correct answer in the first few tokens. This does not happen for fine-tuned models since they are trained to generate shorter responses. Nevertheless, the best results have been obtained by the models using gold knowledge. We report automatic evaluation results including retriever accuracy, overlap between knowledge and response tokens, and other automatic metrics in §A.3.
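As a reminder of how perplexity behaves on short references, the following sketch computes it from per-token log-probabilities using the standard definition (the exponential of the average negative log-likelihood). The log-probability values are invented for illustration and are not taken from the experiments.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood over the reference tokens.

    token_logprobs holds the natural-log probability the model assigns to
    each ground-truth token given the preceding ones.
    """
    if not token_logprobs:
        raise ValueError("at least one token is required")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# A model that gives every reference token probability 0.5.
uniform = perplexity([math.log(0.5)] * 4)

# Hypothetical short answer (4 tokens): a model that expects a long preamble
# assigns low probability to the first answer tokens.
verbose_model = perplexity([math.log(p) for p in (0.001, 0.01, 0.2, 0.5)])
concise_model = perplexity([math.log(p) for p in (0.3, 0.4, 0.5, 0.6)])
```

With only a handful of reference tokens, a few very unlikely tokens dominate the average, which is consistent with the large values observed for in-context learning on QA.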

Table 3: Explainability Study. Percentage of tokens with significant contribution to the generation in different segments of the input vector for each model in Knowledge-Grounded Dialogues (KGDs) and Task-Oriented Dialogues (TODs). All rows sum to 100. For KGD, the second column reports the contribution of the Topic, while for TOD it reports the contribution of the Dialogue State. The Instruction segment is only present for In-Context Learning.

Model	Dialogue Type	Technique	Instruction	Topic/Dialogue State	Dialogue History	Knowledge
Llama2_C	KGD	In-Context Learning	21.85	28.60	15.97	33.58
Llama2_C	KGD	Fine-Tuning		39.43	13.80	46.77
Llama2_C	TOD	In-Context Learning	25.98	19.54	16.46	38.02
Llama2_C	TOD	Fine-Tuning		27.19	8.04	64.77
Mistral_I	KGD	In-Context Learning		69.01	14.89	16.10
Mistral_I	KGD	Fine-Tuning		65.55	11.00	23.45
Mistral_I	TOD	In-Context Learning	69.05	10.19	11.24	9.52
Mistral_I	TOD	Fine-Tuning		14.55	29.06	56.39
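The per-segment percentages in Table 3 can be read as the outcome of an aggregation like the one sketched below. This is only an illustration under stated assumptions: the attribution scores are invented rather than computed with integrated gradients, and the rule for calling a token "significant" (absolute score above a fixed `threshold`) is hypothetical, since this section does not specify the criterion.

```python
from collections import Counter
from typing import Dict, Sequence


def segment_contribution(
    attributions: Sequence[float],
    segments: Sequence[str],
    threshold: float,
) -> Dict[str, float]:
    """Share (in %) of significant tokens that falls in each input segment.

    attributions: one attribution score per input token.
    segments: the segment label of each token (e.g. "history", "knowledge").
    threshold: assumed cut-off on the absolute score for significance.
    """
    significant = [
        seg for score, seg in zip(attributions, segments) if abs(score) >= threshold
    ]
    if not significant:
        return {}
    counts = Counter(significant)
    return {seg: 100.0 * n / len(significant) for seg, n in counts.items()}


# Invented scores for five input tokens.
shares = segment_contribution(
    attributions=[0.9, 0.1, 0.5, 0.7, 0.6],
    segments=["knowledge", "history", "topic", "knowledge", "history"],
    threshold=0.4,
)
```

Because the shares are normalized over significant tokens only, they sum to 100 for each input, matching the row-wise normalization stated in the caption.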

Dialogue Type	Instruction
ODD	""
	"This is a conversation between two people. Use the context to write an engaging reply for the other person."
	"Write a coherent continuation for the proposed conversation."
KGD	""
	"This is a conversation between two people about a Topic. Use the Dialogue and the additional Knowledge as context to write an engaging reply for the other person."
	"Write a coherent continuation for the proposed conversation based on the additional Knowledge."
TOD	""
	"In the following conversation a user wants to achieve some goal and needs help from an assistant. Continue the conversation with the response of the assistant."
	"Write a coherent continuation for the proposed conversation."
QA	""
	"You are presented with a user’s Question about a movie or book. Answer to the user’s Question using the information provided in the Context."
	"Answer to the user’s question using the provided information (if available)."

Model	Technique	External Knowledge	BLEU-4 (TOD)	BLEU-4 (TOD^†)	KF1 (TOD)	KF1 (TOD^†)
Llama2_C	In-Context Learning	No Know.	0.85	0.60	13.66	12.39
		Retrieved Know.	0.83	0.44	12.10	10.44
		Gold Know.	1.07	2.67	25.87	23.77
	Fine-Tuning	No Know.	6.72	4.33	34.04	25.73
		Retrieved Know.	4.33	3.15	26.85	22.92
		Gold Know.	5.39	8.50	42.69	45.49
Mistral_I	In-Context Learning	No Know.	1.33	1.12	13.01	11.91
		Retrieved Know.	1.06	1.02	12.53	10.36
		Gold Know.	1.33	3.70	28.74	28.79
	Fine-Tuning	No Know.	4.09	5.83	29.27	25.47
		Retrieved Know.	3.85	4.76	30.44	25.61
		Gold Know.	3.94	10.63	43.04	49.40
Ground Truth			100	100	38.48	39.91
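Assuming that KF1 in the table above denotes Knowledge F1, i.e. the unigram F1 overlap between a response and the knowledge document, the metric can be sketched as follows. Whitespace tokenization and lowercasing are simplifying assumptions, and the function name `unigram_f1` is hypothetical; the exact preprocessing used in the experiments is not given in this section.

```python
from collections import Counter


def unigram_f1(response: str, knowledge: str) -> float:
    """Unigram F1 between response tokens and knowledge tokens (range 0 to 1)."""
    resp = response.lower().split()
    know = knowledge.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(resp) & Counter(know)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp)
    recall = overlap / len(know)
    return 2 * precision * recall / (precision + recall)


# Two shared tokens ("free", "wifi") out of 5 response and 4 knowledge tokens.
score = unigram_f1("the hotel has free wifi", "free wifi is available")
```

Multiplying the score by 100 gives values on the same scale as the table. Under this reading, the Ground Truth row shows how much human responses overlap with the knowledge, so a model score well above it may indicate copying from the document rather than better grounding.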
