TourLLM: Enhancing LLMs with Tourism Knowledge

Qikai Wei∗,a,c, Mingzhi Yang∗,d, Jinqiang Wanga,c, Wenwei Maoa,c,
Jiabo Xub, Huansheng Ning†,a,c
Abstract

Recently, large language models (LLMs) have demonstrated their effectiveness in various natural language processing (NLP) tasks. However, the lack of tourism knowledge limits the performance of LLMs in tourist attraction presentations and travel planning. To address this challenge, we constructed a supervised fine-tuning dataset for the culture and tourism domain, named Cultour. This dataset consists of three parts: tourism knowledge base QA data, travelogues data, and tourism diversity QA data. Additionally, we propose TourLLM, a Qwen-based model supervised fine-tuned with Cultour, to improve the quality of the information provided about attractions and travel planning. To evaluate the performance of TourLLM, we employed both automatic and human evaluation, and we proposed a human evaluation criterion named CRA (Consistency, Readability, Availability). The experimental results demonstrate the effectiveness of the responses generated by the TourLLM. Our proposed Cultour is accessible at https://github.com/mrweiqk/Cultour. 11footnotetext: These authors contributed equally to this work. 22footnotetext: Corresponding author

keywords:
LLMs, Supervise fine-tuned, NLP, Tourism, Human evaluation
journal: Nuclear Physics B
\affiliation

organization=School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China, \affiliationorganization=Key Laboratory of Xinjiang Coal Resources Green Mining, Ministry of Education, Xinjiang Institute of Engineering, Urumqi 830023, China,

\affiliation

organization=Beijing Engineering Research Center for Cyberspace Data Analysis and Applications, Beijing 100083, China, \affiliationorganization=Guangxi Tourism Development One-Click Tour Digital Cultural Tourism Industry Co.,Ltd, Guangxi 530012, China,

1 Introduction

Large language models (LLMs) such as ChatGPT[1] and Llama[2] have proven their effectiveness in various natural language processing (NLP) downstream tasks with their excellent performance[3]. These models generate natural and fluent language expressions in diverse contexts by capturing linguistic patterns from vast amounts of textual data, significantly advancing NLP technology[4].

However, with the increasing public demand for various travel needs, there is a noticeable imbalance between the supply of tourism services and the growing demand. The applications of LLMs in the tourism domain are evident, such as personalized recommendations, language translation, and chatbots[5]. However, the lack of tourism domain knowledge limits the performance of LLMs in attraction recommendations and travel plans. This limitation restricts the application of LLMs in tourism[6]. Additionally, while LLMs perform exceptionally well in English environments, their performance in Chinese environments is relatively limited. This discrepancy is mainly due to the complexity and uniqueness of the Chinese and the differences in data distribution between Chinese and English. These factors can create barriers for non-English speakers and limit the public’s access to travel advice.

To address these limitations, researchers have explored a range of vertical domain fine-tuning methods to alleviate the lack of corpus in the domain such as Parameter-Efficient Fine-Tuning(PEFT)[7], Low-Rank Adaptation of Large Language Models(LoRA)[8], and Quantized LoRa(QLoRA)[9]. For instance, researchers have constructed domain-specific datasets in the medicine and law domain and employed techniques like continued pre-training, vertical domain fine-tuning, and learning from human feedback[10] to enhance model performance. This has driven the development of LLMs in specific domains resulting in more specialized outcomes such as HuaTuo for Chinese medical knowledge[11] and Lawyer LLaMA for legal knowledge[12].To enhance the accuracy of LLMs in providing travel advice and improving user experience in the tourism domain, it is crucial to develop LLMs tailored to the tourism domain.

In this paper, we construct a Chinese Supervised Fine-Tuning(SFT) dataset for the culture and tourism domain, named Cultour, which consists of three parts. Firstly, we build a knowledge base of common tourist questions and answers(QA). Leveraging the capabilities of ChatGPT, we design a set of prompts to expand the QA pairs and generate tourism knowledge QA data. Secondly, we organize data related to travel planning to design travelogues manually. Thirdly. we craft unique QA in the aspects of eating, living, traveling, touring, shopping, and entertaining, to enrich the diversity of the dataset. Based on this dataset, we propose TourLLM, a model fine-tuned based on Qwen1.5, to improve the quality of information provided about attractions and travel planning. To better evaluate the model’s performance, we adopt both manual and automatic evaluation methods to assess the performance of TourLLM. In particular, for manual evaluation, we design a new metric called CRA (Consistency, Readability, Availability) to evaluate LLMs in the tourism domain manually and the experimental results demonstrate the effectiveness of TourLLM.

In summary, our contributions can be summarized as follows:

1. We construct Cultour, a high-quality Chinese SFT dataset for tourism and culture. The dataset contains tourism knowledge base QA data, travelogues data, and tourism diversity QA data.

2. We propose TourLLM, a Qwen-based model supervised-finetuned with Cultour in the tourism domain.

3. We introduce CRA, a novel metric for evaluating LLMs in the tourism domain considering consistency, readability, and availability.

4. We assess TourLLM using both automated and human evaluations. The experimental results demonstrated the effectiveness of TourLLM.

2 Related Work

The emergence of general LLMs like ChatGPT 3.5 opens up a new way of life for people and reduces the cost of living[13]. However, in specific application domains, the performance of general large models is still limited[12]. Therefore, researchers are considering combining domain-specific data with general-domain large models to train a vertical-domain large model to serve specific domains. Huang et al.[12] used 50k legal domain-related data to continue pre-train the model, enabling it to possess more relevant knowledge in the legal domain. In the process of answering questions, the model’s responses are more knowledgeable. Yang et al.[14] implemented a complete LLMs training process involving pre-training, Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human Feedback (RLHF) to enhance the model’s performance in the Chinese medical field. To address the limited datasets for specific languages, Xiong et al.[15] designed a translation model to translate English datasets into Chinese and then fine-tuned the ChatGLM to expand the domain-specific datasets to enhance the model’s performance. Li et al.[16] fine-tuned LLaMA using 100K patient-doctor dialogues and supplemented knowledge with a recall-based retrieval system. Wang et al.[11] integrated distilled data from ChatGPT and real data from doctors, leveraging reinforcement learning to enhance model performance and achieve the functionality of interactive medical consultation. Beyond conventional LLMs training methods, Xu et al.[17] employed RLHF[10] and Direct Preference Optimization[18] to make the final output results more fitting to human reading habits. After optimizing LLMs through continued pre-training, supervised fine-tuning, and RLHF methods, hallucinations still occur within the domain. Retrieval-augmented Generation(RAG)[19] combines LLMs (such as Llama and ChatGLM) with external data, enabling the model to refer to the knowledge base when answering questions. Cui et al.[20] used laws and regulations as an external knowledge base to enhance the accuracy of the model’s responses.

The combination of LLMs with the tourism domain has also attracted the attention of researchers. Hsu et al.[21] discussed opportunities and challenges of applying Generative Artificial Intelligence (GenAI) and LLMs by the tourism industry and tourists. Mo et al.[22] leveraged prompt-based learning with LLMs, meticulously designing prompts to ensure the responses of the model are reasonable. Wang et al.[23] used domain-oriented LLMs to transform user inquiries into diverse guidance-seeking contexts and facilitate multi-modal interactions. Secchi et al.[24] proposed a method that integrates LLMs with domain-specific knowledge graphs, aiming to optimize hotel services. This approach combines domain knowledge graphs with feature engineering to enrich the data representation within LLMs. Based on the current development of the cultural and tourism domain, we constructed Cultour dataset. Based on this dataset, we proposed TourLLM, a model fine-tuned based on LLMs to enhance the performance of LLMs in the cultural and tourism domains. This provides new methods for applying LLMs in the tourism domain and offers valuable references and insights for future research.

3 Our Approach

In this section, we provide a detailed overview of our proposed method for constructing the Cultour dataset and the PEFT method for fine-tuning TourLLM. The overall training process is shown in Fig. 1.

Refer to caption
Figure 1: The overall training process of TourLLM.

3.1 Dataset construction

People appreciate local attractions in tourism and show a strong interest in cultural heritage and intangible cultural assets. However, the training data for general LLMs in the cultural and tourism domain is relatively limited, resulting in suboptimal performance. To address this challenge, we constructed Cultour dataset to enrich the culture and tourism domain data. This dataset includes tourism knowledge base QA data, travelogues data, and diverse QA data in the tourism domain. The specific data statistics are shown in Table 1.

Data Type Number
Tourism knowledge base QA data 9,004
Travelogues data 1,792
Diverse QA data 2,027
Table 1: The statistical information of Cultour.

3.1.1 Tourism knowledge base QA data

It is worth noting that there is a large amount of tourism-related data on the internet, which includes misinformation, advertisements, and useless text. To utilize tourism resources more accurately, we collected common questions and answers about attractions, food, and intangible cultural heritage from official websites. To ensure the accuracy of the answers, we manually verified each one, ultimately constructing a comprehensive tourism knowledge base. Some examples from the Culture dataset are shown in Table 2.

Question Reference Answer
{CJK}UTF8gbsn 桂林漓江景区是几A级景区? What is the rating of Guilin Li River Scenic Area in terms of A-level classification? {CJK}UTF8gbsn 2007年,桂林市漓江景区被评为国家AAAAA级旅游区。 In 2007, the Li River Scenic Area in Guilin City was rated as a national AAAAA-level tourist area.
{CJK}UTF8gbsn 阳朔啤酒鱼是什么? What is Yangshuo beer fish? {CJK}UTF8gbsn 阳朔“啤酒鱼“,是阳朔有名的地方特色菜,鱼肉鲜辣可口,无一丝鱼本身的腥味,啤酒和鱼肉的香味可以让人胃口大开…… Yangshuo ”Beer Fish” is a famous local specialty dish in Yangshuo. The fish meat is fresh, spicy, and delicious, without any of the fish’s inherent fishy taste. The aroma of beer and fish meat can open people’s appetites open……
{CJK}UTF8gbsn 桂林渔鼓有什么历史渊源? What is the historical origin of Guilin Yu-drum? {CJK}UTF8gbsn 桂林渔鼓是桂林市的汉族传统说唱艺术。明末清初、民国时期都有湖南的渔鼓艺人在桂林传唱渔鼓并落户桂林,开创了桂林渔鼓的先河… Guilin Yu-drum is a traditional Han Chinese narrative and singing art in Guilin City. At the end of the Ming Dynasty and the beginning of the Qing Dynasty, as well as during the Republic of China period, fishermen from Hunan came to Guilin to perform and settle, thus pioneering the tradition of Guilin Yu-drum…
Table 2: Some examples from the Culture dataset. We translate it into English for better illustration.

Inspired by previous research[25], we design a specialized prompt template for generating SFT data based on the structured data in Table 2. The prompt template is shown in Table 3. Utilizing the powerful capabilities of ChatGPT, we convert the knowledge base QA pairs into single-turn dialogues within the tourism context, enhancing their diversity and richness in language expression. We apply the same processing to all knowledge base data, resulting in 9,197 QA pairs of SFT data in the tourism domain.

User Prompt:
1. Assume you are a professional AI travel assistant. Based on the provided <<<question, answer>>> pairs, generate a logical cultural tourism scenario dialogue.
2. The dialogue should consist of a question or request followed by an answer. The question should be based on the given question and provide complete contextual information. It should be specific and avoid overly technical terms. You can modify the original question to ensure the generated dialogue is logical and fits the scenario. The length should be 1-3 sentences, and the question should have a consulting tone.
3. The answer should be derived from the provided answer, with some enrichment to the content. The answer should address the question in detail, be 15-20 sentences long, and be friendly, approachable, patient, and comprehensive.
4. Enrich the content in the generated response to make it more suitable for human reading habits.
Table 3: The tourism prompt template.

In the process of generating answers by ChatGPT, there is a hallucination problem[26], which generates some content that seems reasonable but is incorrect. For example, in the generated results of ChatGPT, there is an error case that describes ”mealworm” as a bug made from insect larvae such as scarab beetles, snake eyes, and bull eyes. In fact, in this batch of data, ”mealworm” is a local specialty delicacy whose main ingredient is indica rice. It is named because it looks like Cordyceps. In this case, ChatGPT’s answer, while logical for the generic domain, is inconsistent with the true answer. To avoid this error, we carefully review the results generated by ChatGPT and manually correct the erroneous parts. As a result, we obtain 9,004 high-quality SFT data points, covering attractions, culture, and other tourism domains.

3.1.2 Travelogues data

Travelogues data is open-ended, unlike knowledge base QA data. We selected data from travel planning provided by some travel agencies and travelogues on tourism platforms. We performed manual cleaning, secondary design, and annotation to ensure the data input and output format is consistent with human reading habits. Ultimately, we generated 1,792 travelogue SFT data and examples are shown in Table 4.

{CJK}UTF8gbsn Prompt:筹备一份南宁一天的旅游攻略。 Prompt: Prepare a ONE-day travelogue to Nanning.
{CJK}UTF8gbsn Response:接下来的内容将介绍南宁一日游的travelogues建议,包括主要景点和活动。上午:青秀山: 早上可以去青秀山感受阳光和清新空气,步道盘山而上,慢慢将整个城市尽收眼底。下午:广西民族博物馆:紧邻青秀山,是以广西民族文化为专题的博物馆。广西民族博物馆以收藏、研究和展示广西12个世居民族的传统文化为主要工作任务,同时兼顾对广西周边省份各民族以及东南亚各民族的文化研究、文物资料收藏和宣传展示。晚上:西关夜市:步行去西关夜市,小吃种类很多,体验特色美食填饱胃的感觉。以上是南宁一天的旅行指南,希望对您有所裨益。在旅途中,请务必注意安全,保持良好的心境和状态。祝您旅途愉快! The following content will introduce travelogues suggestions for a day trip to Nanning, including the main attractions and activities. Morning: Qingxiu Mountain: In the morning, you can go to Qingxiu Mountain to feel the sunshine and fresh air. The trail winds up the mountain and slowly takes in the entire city. Afternoon: Guangxi Museum of Ethnology: Close to Qingxiu Mountain, it is a museum with the theme of Guangxi ethnic culture. The main mission of the Guangxi Ethnology Museum is to collect, research, and display the traditional culture of the 12 living ethnic groups in Guangxi. At the same time, it also takes into account the cultural research, collection of cultural relics and publicity and display of various ethnic groups in the surrounding provinces of Guangxi. Evening: Xiguan Night Market: You can walk to Xiguan Night Market. There are many kinds of snacks, and you can experience the feeling of filling your stomach with special delicacies. The above is a one-day travelogues guide in Nanning, I hope it will be helpful to you. During the journey, please pay attention to safety and maintain a good mood and condition. I wish you a pleasant journey!
Table 4: Travelogues data example.

3.1.3 Diversity QA data

To improve the performance of LLMs in the tourism domain and enhance their understanding of human questions, we expanded the data to meet the versatility of TourLLM when facing diverse data. It involves many aspects, including eating, living, traveling, touring, shopping, and entertaining. We use ChatGPT to automatically ask questions to collect possible questions and answers, aiming to simulate real user needs, thereby making LLM more accurate and appropriate when answering actual questions. Finally, we collected 2,027 QA pairs of SFT data.

3.2 Parameter-Efficient Tuning

To enable the LLMs to understand people’s questions, inspired by previous work[27], fine-tuning the entire model based on the given domain-specific SFT data requires substantial computational resources. We prepare 51K general domain data [28], combine Cultour as all training corpus, and use parameter-efficient fine-tuning methods to train LLMs. This approach allowed us to fine-tune the model with fewer computational resources. Among these methods, we used LoRA[8] as the fine-tuning method for TourLLM.

Refer to caption
Figure 2: Illustration of Lora method.

Specifically, the Lora method freezes the parameters of LLMs and introduces an additional trainable low-rank decomposition matrix in each transformer layer, as shown in Fig. 2. The blue part in Fig. 2 represents the pre-trained weights of frozen LLMs, and only the parameters in the red part are trained. In y𝑦yitalic_y = Wx𝑥xitalic_x, W denotes the pretrained n×k𝑛×𝑘n×kitalic_n × italic_k parameter matrix. We compute y𝑦yitalic_y by introducing low-rank matrices A(n×rsubscript𝑛×𝑟\mathbb{R}_{n×r}blackboard_R start_POSTSUBSCRIPT italic_n × italic_r end_POSTSUBSCRIPT) and B(r×ksubscript𝑟×𝑘\mathbb{R}_{r×k}blackboard_R start_POSTSUBSCRIPT italic_r × italic_k end_POSTSUBSCRIPT).

y=(W)x+(ΔW)x=Wx+BAx𝑦W𝑥ΔW𝑥W𝑥𝐵𝐴𝑥y=(\text{W})x+(\Delta\text{W})x=\text{W}x+BAxitalic_y = ( W ) italic_x + ( roman_Δ W ) italic_x = W italic_x + italic_B italic_A italic_x (1)

In Eq. (1), r𝑟ritalic_r denotes the rank of A and B, and r𝑟ritalic_r is much smaller than min(n𝑛nitalic_n, k𝑘kitalic_k). Only parameters A and B participate in the training of the model, so we can complete the training of model parameters at a smaller cost and improve model performance.

4 Experimental

To validate the performance of the TourLLM model, we evaluate it against common LLMs and conduct a detailed analysis of the results. Additionally, we introduce CRA, a human evaluation standard specifically for LLMs in the tourism domain, which assesses the model’s performance based on Consistency, Readability, and Availability.

4.1 Baseline

To evaluate the effectiveness of our method, we compare TourLLM with ChatGPT[29], ChatGLM3[30], Alpaca[31], Qwen1.5[32].

ChatGPT3.5[29] is an NLP model designed by OpenAI, which uses manually annotated data for training and reinforcement learning to enhance the model’s capabilities.

ChatGLM3[30] is an open-source model in the ChatGLM series, characterized by its low deployment barrier. Additionally, it demonstrates strong performance across various datasets in semantics, mathematics, reasoning, code, knowledge, and other domains.

Alpaca[31], derived from LLAMA and continued pretrained with Chinese textual data, exhibits robust proficiency in Chinese comprehension. Furthermore, it uses Chinese instruction data for fine-tuning to enhance the ability of the model to understand the ability of human instructions.

Qwen1.5[32], an encoder-only Transformer model, accommodates prompt with maximum length of 32K tokens and is compatible with multiple languages, including English, Chinese, French, and Spanish. Additionally, it has been made open-source in various sizes, including 0.5B, 1.8B, 4B, 7B, 14B, and 72B.

4.2 Experimental settings

In this section, we introduce the parameters involved in the fine-tuning TourLLM process. We set the maximum length of the input sequence to 1024 and 3 epochs. We set the learning rate (Lr) to 5e-4 and employ learning rate warm-up with a setting of 100[33]. In Lora, The rank r𝑟ritalic_r is set to 8, the constant a𝑎aitalic_a is set to 16, and the dropout is set to 0.1. We utilize the Adam[34] optimizer to update the Lora parameters. All experiments are conducted on 2 Nvidia GeForce RTX 3090 GPUs.

4.3 Evaluation Metrics

We evaluate the performance of TourLLM through three metrics: BLEU [35], Rouge[36], and Meteor[37].

Bilingual Evaluation Understudy(BLEU)[35] is an evaluation metric used to measure the accuracy of models with multiple correct output results. It compares the overlap of n-grams between candidate translations and reference translations. It is commonly used for evaluating machine translation quality.

Recall-Oriented Understudy for Gisting Evaluation(ROUGE)[36] is a set of metrics used to evaluate automatic summarization and machine translation. It measures the “similarity” between an automatically generated abstract or translation and a set of manually generated reference abstracts by calculating the corresponding score.

Metric for Evaluation of Translation with Explicit ORdering(Meteor)[37] is a metric used to evaluate the quality of machine-translation output. Compared to BLEU, METEOR considers more factors such as synonym matching, stem matching, and word order, making it generally considered a more comprehensive evaluation metric.

Model B-1 B-2 R-1 R-2 R-L Met.
ChatGPT 3.5 1.15 0.34 25.81 8.62 17.32 17.73
Qwen-1.5-7B 1.07 0.26 27.05 6.97 19.53 16.79
ChatGLM3-6B 1.23 0.32 24.35 7.67 18.37 17.11
LLama2-chinese-7B 0.72 0.19 14.55 3.87 12.05 9.21
TourLLM-7B(Ours) 2.77 0.52 25.60 10.85 18.85 18.54
Table 5: The statistical information of Cultour.

5 Result

In this section, we use automatic and human evaluation methods to evaluate the performance of TourLLM.

5.1 Automatic evaluation

To evaluate the performance of LLMs, we design 60 questions based on the aspects of eating, living, traveling, traveling, shopping, and entertaining, and collect all the responses without any prompt templates from LLMs. For Responses, we use BLEU-1 (B-1), BLEU-2 (B-2), ROUGE-1 (R-1), ROUGE-2 (R-2), ROUGE-L (R-L), and Meteor as the metrics to evaluate LLMs, the result is shown in Table 5.

In Table 5, the TourLLM model achieved optimal performance on most metrics. On the BLEU-related metric, the performance of TourLLM is much higher than other models. On the Rouge-L metric, TourLLM achieved the second-best performance. On the meteor metric, TourLLM achieved optimal performance compared with other models. This proves the TourLLM model can improve QA effects in the culture and tourism domains.

Kategorie Score Explanation
Consistency 0 The answer is entirely irrelevant to the question.
1 The answer is somewhat relevant to the question.
2 The answer is relevant to the question.
3 The answer is highly relevant to the question.
Readability 0 The answer is extremely difficult to understand or has poor grammar and structure.
1 The answer is somewhat difficult to understand or has grammatical errors.
2 The answer is clear, with only a few grammatical errors or room for improvement.
3 The answer is very clear, well structured, has no grammatical errors and is easy to understand.
Availability 0 There are serious problems or errors in the answer and the answer is not very usable.
1 There are some errors in the answer, some of which need to be verified by searching for the answer.
2 The answer is error-free, with only a few confusing elements.
3 The answer is error-free and instructive, providing a better experience for travelers.
Table 6: Human evaluation scoring criteria and explanations.

5.2 Human evaluation

Inspired by the previous literature [38], we conduct the human evaluation to assess the acceptability of the results generated by LLMs. We design sixty questions for LLMs and collect responses from ChatGLM3, Qwen1.5, Llama-Chinese, TourLLM, and ChatGPT3.5. To facilitate a fair evaluation, we devise a set of human evaluation metrics: Consistency (the degree of understanding questions and providing relevant answers), Readability (correctness of grammar, fluency, and formatting level), and Availability (reflects the understanding of the content of tourist attractions, correctness of the responses, such as information about transportation to attractions, relevant introductions). Each metric is scored on a scale of 0 to 3, the specific scoring criteria outlined in Table 6.

We record the responses of 5 models under 60 questions, and 300 questions are waiting for the evaluation of volunteers. We recruit 6 volunteers, and both of them are deep enthusiasts of tourism. In the evaluation process, We random the order of models, and the final score is the average of all the scores, as shown in Table 7.

Refer to caption
Figure 3: The impact of different scale data on model performance.
Model Consistency Readability availability
ChatGPT 3.5 2.68 2.68 2.20
LLama2-chinese-7B 2.24 2.34 1.79
ChatGLM3-6B 2.47 2.63 2.06
Qwen-1.5-7B 2.57 2.69 2.15
TourLLM-7B(Ours) 2.62 2.65 2.29
Table 7: The results of human evaluation.

Regarding consistency, ChatGPT3.5 achieves the best performance since the GPT model uses a human feedback-based learning approach, making the output more consistent with people’s reading habits. On the other hand, our TourLLM significantly improves the availability of knowledge in the culture and tourism domains without compromising model readability.

5.3 The impact of data volumes

We investigate the effect of varying the amount of SFT data on model performance. Specifically, we use 1K, 5K, 10K, 30K, and 60K data to train Qwen 1.5 and utilize BLEU-1, Rouge-L, and Meteor to evaluate performance, the results are shown in Fig. 3. With the improvement of the number of SFT data, the model’s performance will improve. Interestingly, when the amount of data is 30K, the model’s performance has decreased significantly. One possible reason for this inconsistency is that the data is imbalanced. The model may tend to favor predicting the more frequent categories, leading to a decrease in prediction performance for the less frequent categories.

6 Conclusion

In this paper, we present Cultour, a high-quality Chinese tourism SFT dataset, which is specifically fine-tuned for LLM. It includes 12,823 data items such as eating, living, traveling, traveling, shopping, and entertaining. Based on this dataset, we fine-tuned an LLM, TourLLM, for the cultural and tourism domain to gain deeper insights into people’s needs. To better evaluate the performance of TourLLM, we propose human evaluation CRA metrics tailored for LLMs, incorporating consistency, readability, and availability. We use both automated and manual evaluation methods to assess model performance. Experimental results demonstrate the effectiveness of TourLLM.

In future work, we will utilize TourLLM and external documents to enhance retrieval capabilities through a retrieval augmentation generation (RAG) approach to make responses more reliable.

References

  • [1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., Gpt-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
  • [2] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
  • [3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
  • [4] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223 (2023).
  • [5] S. Gu, The future of tourism: Examining the potential applications of large language models, Qeios (2024).
  • [6] I. Carvalho, S. Ivanov, Chatgpt for tourism: applications, benefits and risks, Tourism Review 79 (2) (2024) 290–303.
  • [7] B. Lester, R. Al-Rfou, N. Constant, The power of scale for parameter-efficient prompt tuning, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2021.
  • [8] E. J. Hu, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., Lora: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2021.
  • [9] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Efficient finetuning of quantized llms, Advances in Neural Information Processing Systems 36 (2024).
  • [10] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Advances in neural information processing systems 35 (2022) 27730–27744.
  • [11] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, T. Liu, Huatuo: Tuning llama model with chinese medical knowledge, arXiv preprint arXiv:2304.06975 (2023).
  • [12] Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen, Z. Wu, Y. Feng, Lawyer llama technical report, arXiv preprint arXiv:2305.15062 (2023).
  • [13] X. Wang, H. M. Sanders, Y. Liu, K. Seang, B. X. Tran, A. G. Atanasov, Y. Qiu, S. Tang, J. Car, Y. X. Wang, et al., Chatgpt: promise and challenges for deployment in low-and middle-income countries, The Lancet Regional Health–Western Pacific 41 (2023).
  • [14] S. Yang, H. Zhao, S. Zhu, G. Zhou, H. Xu, Y. Jia, H. Zan, Zhongjing: Enhancing the chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, 2024, pp. 19368–19376.
  • [15] H. Xiong, S. Wang, Y. Zhu, Z. Zhao, Y. Liu, L. Huang, Q. Wang, D. Shen, Doctorglm: Fine-tuning your chinese doctor is not a herculean task, arXiv preprint arXiv:2304.01097 (2023).
  • [16] Y. Li, Z. Li, K. Zhang, R. Dan, S. Jiang, Y. Zhang, Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge, Cureus 15 (6) (2023).
  • [17] M. Xu, Medicalgpt: Training medical gpt model (2023).
  • [18] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, C. Finn, Direct preference optimization: Your language model is secretly a reward model, Advances in Neural Information Processing Systems 36 (2024).
  • [19] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, Retrieval-augmented generation for large language models: A survey, arXiv preprint arXiv:2312.10997 (2023).
  • [20] J. Cui, Z. Li, Y. Yan, B. Chen, L. Yuan, Chatlaw: Open-source legal large language model with integrated external knowledge bases, arXiv preprint arXiv:2306.16092 (2023).
  • [21] C. H. Hsu, G. Tan, B. Stantic, A fine-tuned tourism-specific generative ai concept, Annals of Tourism Research (2024) 103723.
  • [22] B. Mo, H. Xu, D. Zhuang, R. Ma, X. Guo, J. Zhao, Large language models for travel behavior prediction, arXiv preprint arXiv:2312.00819 (2023).
  • [23] Z. Wang, L.-P. Yuan, L. Wang, B. Jiang, W. Zeng, Virtuwander: Enhancing multi-modal interaction for virtual tour guidance through large language models, in: Proceedings of the CHI conference on human factors in computing systems, 2024, pp. 1–20.
  • [24] L. Secchi, et al., Knowledge graphs and large language models for intelligent applications in the tourism domain (2024).
  • [25] Z. Bao, W. Chen, S. Xiao, K. Ren, J. Wu, C. Zhong, J. Peng, X. Huang, Z. Wei, Disc-medllm: Bridging general large language models and real-world medical consultation, arXiv preprint arXiv:2308.14346 (2023).
  • [26] V. Rawte, A. Sheth, A. Das, A survey of hallucination in large foundation models, arXiv preprint arXiv:2309.05922 (2023).
  • [27] A. Yang, B. Xiao, B. Wang, B. Zhang, C. Bian, C. Yin, C. Lv, D. Pan, D. Wang, D. Yan, et al., Baichuan 2: Open large-scale language models, arXiv preprint arXiv:2309.10305 (2023).
  • [28] Y. Cui, Z. Yang, X. Yao, Efficient and effective text encoding for chinese llama and alpaca, arXiv preprint arXiv:2304.08177 (2023).
  • [29] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Advances in neural information processing systems 35 (2022) 27730–27744.
  • [30] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, J. Tang, Glm: General language model pretraining with autoregressive blank infilling, arXiv preprint arXiv:2103.10360 (2021).
  • [31] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Alpaca: A strong, replicable instruction-following model, Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alpaca. html 3 (6) (2023) 7.
  • [32] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, et al., Qwen technical report, arXiv preprint arXiv:2309.16609 (2023).
  • [33] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [34] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
  • [35] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
  • [36] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out, 2004, pp. 74–81.
  • [37] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation with human judgments, in: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, pp. 65–72.
  • [38] E. Shi, F. Zhang, Y. Wang, B. Chen, L. Du, H. Zhang, S. Han, D. Zhang, H. Sun, Sotana: The open-source software development assistant, arXiv preprint arXiv:2308.13416 (2023).