Application of frozen large-scale models to multimodal task-oriented dialogue
Authors:
Tatsuki Kawamoto,
Takuma Suzuki,
Ko Miyama,
Takumi Meguro,
Tomohiro Takagi
Abstract:
In this study, we use the existing Large Language Models ENhanced to See Framework (LENS Framework) to test the feasibility of multimodal task-oriented dialogues. The LENS Framework was proposed as a method for solving computer vision tasks without additional training, keeping the parameters of pre-trained models fixed. We used the Multimodal Dialogs (MMD) dataset, a multimodal task-oriented dialogue benchmark from the fashion domain. For evaluation, we used the ChatGPT-based G-EVAL, which accepts only text input, adapted to handle multimodal data. Compared to Transformer-based models in previous studies, our method demonstrated an absolute lift of 10.8% in fluency, 8.8% in usefulness, and 5.2% in relevance and coherence. The results show that using large-scale models with fixed parameters, rather than models trained on a dataset from scratch, improves performance in multimodal task-oriented dialogues. They also show that Large Language Models (LLMs) are effective for multimodal task-oriented dialogues. This is expected to lead to efficient application to existing systems.
Submitted 1 October, 2023;
originally announced October 2023.
Empirical Analysis of Training Strategies of Transformer-based Japanese Chit-chat Systems
Authors:
Hiroaki Sugiyama,
Masahiro Mizukami,
Tsunehiro Arimoto,
Hiromi Narimatsu,
Yuya Chiba,
Hideharu Nakajima,
Toyomi Meguro
Abstract:
In recent years, several high-performance conversational systems have been proposed based on the Transformer encoder-decoder model. Although previous studies analyzed the effects of the model parameters and the decoding method on subjective dialogue evaluations with overall metrics, they did not analyze how differences in fine-tuning datasets affect users' detailed impressions. In addition, the Transformer-based approach has only been verified for English, not for languages with a large linguistic distance from English, such as Japanese. In this study, we develop large-scale Transformer-based Japanese dialogue models and Japanese chit-chat datasets to examine the effectiveness of the Transformer-based approach for building chit-chat dialogue systems. We evaluated and analyzed human impressions of the dialogues across different fine-tuning datasets, model parameters, and uses of additional information.
Submitted 11 September, 2021;
originally announced September 2021.