Search | arXiv e-print repository

A Word Order Synchronization Metric for Evaluating Simultaneous Interpretation and Translation

Authors: Mana Makinae, Katsuhito Sudoh, Masaru Yamada, Satoshi Nakamura

Abstract: Simultaneous interpretation (SI), the translation of one language to another in real time, starts translation before the original speech has finished. Its evaluation needs to consider both latency and quality. This trade-off is challenging especially for distant word order language pairs such as English and Japanese. To handle this word order gap, interpreters maintain the word order of the source language as much as possible to keep up with original language to minimize its latency while maintaining its quality, whereas in translation reordering happens to keep fluency in the target language. This means outputs synchronized with the source language are desirable based on the real SI situation, and it's a key for further progress in computational SI and simultaneous machine translation (SiMT). In this work, we propose an automatic evaluation metric for SI and SiMT focusing on word order synchronization. Our evaluation metric is based on rank correlation coefficients, leveraging cross-lingual pre-trained language models. Our experimental results on NAIST-SIC-Aligned and JNPC showed our metrics' effectiveness to measure word order synchronization between source and target language.

Submitted 9 July, 2024; originally announced July 2024.
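
The idea of scoring word order synchronization with a rank correlation coefficient can be illustrated with a minimal sketch. It assumes that each target token has already been aligned to a source token position (for instance by a cross-lingual language model; that step is not shown), and it uses plain Kendall's tau as one possible rank correlation. The function name `kendall_tau` and the list-of-positions input are hypothetical choices for illustration, and the metric proposed in the paper may be formulated differently.

```python
from itertools import combinations


def kendall_tau(aligned_source_positions):
    """Kendall's tau-a between target order (0, 1, 2, ...) and source positions.

    aligned_source_positions[k] is the (assumed, externally computed) index of
    the source token aligned to the k-th target token.
    """
    n = len(aligned_source_positions)
    if n < 2:
        raise ValueError("need at least two aligned tokens")

    concordant = 0
    discordant = 0
    # Target indices i < j are already in increasing order, so a pair is
    # concordant when the aligned source positions also increase.
    for i, j in combinations(range(n), 2):
        diff = aligned_source_positions[j] - aligned_source_positions[i]
        if diff > 0:
            concordant += 1
        elif diff < 0:
            discordant += 1
        # Ties (diff == 0) count as neither concordant nor discordant.

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


if __name__ == "__main__":
    print(kendall_tau([0, 1, 2, 3]))  # fully monotonic output
    print(kendall_tau([3, 2, 1, 0]))  # fully reversed output
```

Under this sketch, a value near 1 means the output follows the source word order closely, and a value near -1 means it is heavily reordered.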

arXiv:2407.00826

NAIST Simultaneous Speech Translation System for IWSLT 2024

Authors: Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Tomoya Yanagita, Kosuke Doi, Mana Makinae, Haotian Tan, Makoto Sakai, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura

Abstract: This paper describes NAIST's submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-{German, Japanese, Chinese} speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.

Submitted 30 June, 2024; originally announced July 2024.

Comments: IWSLT 2024 system paper

arXiv:2406.13476

LLMs Are Zero-Shot Context-Aware Simultaneous Translators

Authors: Roman Koshkin, Katsuhito Sudoh, Satoshi Nakamura

Abstract: The advent of transformers has fueled progress in machine translation. More recently large language models (LLMs) have come to the spotlight thanks to their generality and strong performance in a wide range of language tasks, including translation. Here we show that open-source LLMs perform on par with or better than some state-of-the-art baselines in simultaneous machine translation (SiMT) tasks, zero-shot. We also demonstrate that injection of minimal background information, which is easy with an LLM, brings further performance gains, especially on challenging technical subject-matter. This highlights LLMs' potential for building next generation of massively multilingual, context-aware and terminologically accurate SiMT systems that require no resource-intensive training or fine-tuning.

Submitted 25 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.08940

Word Order in English-Japanese Simultaneous Interpretation: Analyses and Evaluation using Chunk-wise Monotonic Translation

Authors: Kosuke Doi, Yuka Ko, Mana Makinae, Katsuhito Sudoh, Satoshi Nakamura

Abstract: This paper analyzes the features of monotonic translations, which follow the word order of the source language, in simultaneous interpreting (SI). Word order differences are one of the biggest challenges in SI, especially for language pairs with significant structural differences like English and Japanese. We analyzed the characteristics of chunk-wise monotonic translation (CMT) sentences using the NAIST English-to-Japanese Chunk-wise Monotonic Translation Evaluation Dataset and identified some grammatical structures that make monotonic translation difficult in English-Japanese SI. We further investigated the features of CMT sentences by evaluating the output from the existing speech translation (ST) and simultaneous speech translation (simulST) models on the NAIST English-to-Japanese Chunk-wise Monotonic Translation Evaluation Dataset as well as on existing test sets. The results indicate the possibility that the existing SI-based test set underestimates the model performance. The results also suggest that using CMT sentences as references gives higher scores to simulST models than ST models, and that using an offline-based test set to evaluate the simulST models underestimates the model performance.

Submitted 15 July, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted to IWSLT2024

arXiv:2406.08817

Automated Essay Scoring Using Grammatical Variety and Errors with Multi-Task Learning and Item Response Theory

Authors: Kosuke Doi, Katsuhito Sudoh, Satoshi Nakamura

Abstract: This study examines the effect of grammatical features in automatic essay scoring (AES). We use two kinds of grammatical features as input to an AES model: (1) grammatical items that writers used correctly in essays, and (2) the number of grammatical errors. Experimental results show that grammatical features improve the performance of AES models that predict the holistic scores of essays. Multi-task learning with the holistic and grammar scores, alongside using grammatical features, resulted in a larger improvement in model performance. We also show that a model using grammar abilities estimated using Item Response Theory (IRT) as the labels for the auxiliary task achieved comparable performance to when we used grammar scores assigned by human raters. In addition, we weight the grammatical features using IRT to consider the difficulty of grammatical items and writers' grammar abilities. We found that weighting grammatical features with the difficulty led to further improvement in performance.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Accepted to BEA2024

arXiv:2406.03881

Evaluating the IWSLT2023 Speech Translation Tasks: Human Annotations, Automatic Metrics, and Segmentation

Authors: Matthias Sperber, Ondřej Bojar, Barry Haddow, Dávid Javorský, Xutai Ma, Matteo Negri, Jan Niehues, Peter Polák, Elizabeth Salesky, Katsuhito Sudoh, Marco Turchi

Abstract: Human evaluation is a critical component in machine translation system development and has received much attention in text translation research. However, little prior work exists on the topic of human evaluation for speech translation, which adds additional challenges such as noisy data and segmentation mismatches. We take first steps to fill this gap by conducting a comprehensive human evaluation of the results of several shared tasks from the last International Workshop on Spoken Language Translation (IWSLT 2023). We propose an effective evaluation strategy based on automatic resegmentation and direct assessment with segment context. Our analysis revealed that: 1) the proposed evaluation strategy is robust and scores well-correlated with other types of human judgements; 2) automatic metrics are usually, but not always, well-correlated with direct assessment scores; and 3) COMET is a slightly stronger automatic metric than chrF, despite the segmentation noise introduced by the resegmentation step systems. We release the collected human-annotated data in order to encourage further investigation.

Submitted 6 June, 2024; originally announced June 2024.

Comments: LREC-COLING2024 publication (with corrections for Table 3)

Journal ref: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

arXiv:2402.04636

TransLLaMa: LLM-based Simultaneous Translation System

Authors: Roman Koshkin, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Decoder-only large language models (LLMs) have recently demonstrated impressive capabilities in text generation and reasoning. Nonetheless, they have limited applications in simultaneous machine translation (SiMT), currently dominated by encoder-decoder transformers. This study demonstrates that, after fine-tuning on a small dataset comprising causally aligned source and target sentence pairs, a pre-trained open-source LLM can control input segmentation directly by generating a special "wait" token. This obviates the need for a separate policy and enables the LLM to perform English-German and English-Russian SiMT tasks with BLEU scores that are comparable to those of specific state-of-the-art baselines. We also evaluated closed-source models such as GPT-4, which displayed encouraging results in performing the SiMT task without prior training (zero-shot), indicating a promising avenue for enhancing future SiMT systems.

Submitted 7 February, 2024; originally announced February 2024.

arXiv:2311.14353

Average Token Delay: A Duration-aware Latency Metric for Simultaneous Translation

Authors: Yasumasa Kano, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Simultaneous translation is a task in which the translation begins before the end of an input speech segment. Its evaluation should be conducted based on latency in addition to quality, and for users, the smallest possible amount of latency is preferable. Most existing metrics measure latency based on the start timings of partial translations and ignore their duration. This means such metrics do not penalize the latency caused by long translation output, which delays the comprehension of users and subsequent translations. In this work, we propose a novel latency evaluation metric for simultaneous translation called Average Token Delay (ATD) that focuses on the duration of partial translations. We demonstrate its effectiveness through analyses simulating user-side latency based on Ear-Voice Span (EVS). In our experiment, ATD had the highest correlation with EVS among baseline latency metrics under most conditions.

Submitted 27 November, 2023; v1 submitted 24 November, 2023; originally announced November 2023.

Comments: Extended version of the paper (doi: 10.21437/Interspeech.2023-933) which appeared in INTERSPEECH 2023

arXiv:2306.08582

Tagged End-to-End Simultaneous Speech Translation Training using Simultaneous Interpretation Data

Authors: Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Simultaneous speech translation (SimulST) translates partial speech inputs incrementally. Although the monotonic correspondence between input and output is preferable for smaller latency, it is not the case for distant language pairs such as English and Japanese. A prospective approach to this problem is to mimic simultaneous interpretation (SI) using SI data to train a SimulST model. However, the size of such SI data is limited, so the SI data should be used together with ordinary bilingual data whose translations are given offline. In this paper, we propose an effective way to train a SimulST model using mixed data of SI and offline. The proposed method trains a single model using the mixed data with style tags that tell the model to generate SI- or offline-style outputs. Experiment results show improvements of BLEURT in different latency ranges, and our analyses revealed the proposed model generates SI-style outputs more than the baseline.

Submitted 14 June, 2023; originally announced June 2023.

Comments: Accepted to IWSLT2023 scientific paper

arXiv:2304.11766

NAIST-SIC-Aligned: an Aligned English-Japanese Simultaneous Interpretation Corpus

Authors: Jinming Zhao, Yuka Ko, Kosuke Doi, Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura

Abstract: It remains an open question how simultaneous interpretation (SI) data affects simultaneous machine translation (SiMT). Research has been limited due to the lack of a large-scale training corpus. In this work, we aim to fill in the gap by introducing NAIST-SIC-Aligned, which is an automatically-aligned parallel English-Japanese SI dataset. Starting with a non-aligned corpus NAIST-SIC, we propose a two-stage alignment approach to make the corpus parallel and thus suitable for model training. The first stage is coarse alignment where we perform a many-to-many mapping between source and target sentences, and the second stage is fine-grained alignment where we perform intra- and inter-sentence filtering to improve the quality of aligned pairs. To ensure the quality of the corpus, each step has been validated either quantitatively or qualitatively. This is the first open-sourced large-scale parallel SI dataset in the literature. We also manually curated a small test set for evaluation purposes. Our results show that models trained with SI data lead to significant improvement in translation quality and latency over baselines. We hope our work advances research on SI corpora construction and SiMT. Our data can be found at https://github.com/mingzi151/AHC-SI.

Submitted 31 March, 2024; v1 submitted 23 April, 2023; originally announced April 2023.

Comments: LREC-Coling 2024

arXiv:2303.00311

Modeling Multiple User Interests using Hierarchical Knowledge for Conversational Recommender System

Authors: Yuka Okuda, Katsuhito Sudoh, Seitaro Shinagawa, Satoshi Nakamura

Abstract: A conversational recommender system (CRS) is a practical application for item recommendation through natural language conversation. Such a system estimates user interests for appropriate personalized recommendations. Users sometimes have various interests in different categories or genres, but existing studies assume a unique user interest that can be covered by closely related items. In this work, we propose to model such multiple user interests in CRS. We investigated its effects in experiments using the ReDial dataset and found that the proposed method can recommend a wider variety of items than that of the baseline CR-Walker.

Submitted 1 March, 2023; originally announced March 2023.

Comments: Accepted as a conference paper at IWSDS 2023

arXiv:2302.05619

Evaluating the Robustness of Discrete Prompts

Authors: Yoichi Ishibashi, Danushka Bollegala, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Discrete prompts have been used for fine-tuning Pre-trained Language Models for diverse NLP tasks. In particular, automatic methods that generate discrete prompts from a small set of training instances have reported superior performance. However, a closer look at the learnt prompts reveals that they contain noisy and counter-intuitive lexical constructs that would not be encountered in manually-written prompts. This raises an important yet understudied question regarding the robustness of automatically learnt discrete prompts when used in downstream tasks. To address this question, we conduct a systematic study of the robustness of discrete prompts by applying carefully designed perturbations into an application using AutoPrompt and then measure their performance in two Natural Language Inference (NLI) datasets. Our experimental results show that although the discrete prompt-based method remains relatively robust against perturbations to NLI inputs, they are highly sensitive to other types of perturbations such as shuffling and deletion of prompt tokens. Moreover, they generalize poorly across different NLI datasets. We hope our findings will inspire future work on robust discrete prompt learning.

Submitted 11 February, 2023; originally announced February 2023.

Comments: Accepted at EACL 2023

arXiv:2211.13173

Average Token Delay: A Latency Metric for Simultaneous Translation

Authors: Yasumasa Kano, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Simultaneous translation is a task in which translation begins before the speaker has finished speaking. In its evaluation, we have to consider the latency of the translation in addition to the quality. The latency is preferably as small as possible for users to comprehend what the speaker says with a small delay. Existing latency metrics focus on when the translation starts but do not consider adequately when the translation ends. This means such metrics do not penalize the latency caused by a long translation output, which actually delays users' comprehension. In this work, we propose a novel latency evaluation metric called Average Token Delay (ATD) that focuses on the end timings of partial translations in simultaneous translation. We discuss the advantage of ATD using simulated examples and also investigate the differences between ATD and Average Lagging with simultaneous translation experiments.

Submitted 8 February, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

arXiv:2211.00513

E2E Refined Dataset

Authors: Keisuke Toyama, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Although the well-known MR-to-text E2E dataset has been used by many researchers, its MR-text pairs include many deletion/insertion/substitution errors. Since such errors affect the quality of MR-to-text systems, they must be fixed as much as possible. Therefore, we developed a refined dataset and some python programs that convert the original E2E dataset into a refined dataset.

Submitted 1 November, 2022; originally announced November 2022.

Comments: 4 pages

ACM Class: I.2.7

arXiv:2210.13034

Subspace Representations for Soft Set Operations and Sentence Similarities

Authors: Yoichi Ishibashi, Sho Yokoi, Katsuhito Sudoh, Satoshi Nakamura

Abstract: In the field of natural language processing (NLP), continuous vector representations are crucial for capturing the semantic meanings of individual words. Yet, when it comes to the representations of sets of words, the conventional vector-based approaches often struggle with expressiveness and lack the essential set operations such as union, intersection, and complement. Inspired by quantum logic, we realize the representation of word sets and corresponding set operations within pre-trained word embedding spaces. By grounding our approach in the linear subspaces, we enable efficient computation of various set operations and facilitate the soft computation of membership functions within continuous spaces. Moreover, we allow for the computation of the F-score directly within word vectors, thereby establishing a direct link to the assessment of sentence similarity. In experiments with widely-used pre-trained embeddings and benchmarks, we show that our subspace-based set operations consistently outperform vector-based ones in both sentence similarity and set retrieval tasks.

Submitted 9 April, 2024; v1 submitted 24 October, 2022; originally announced October 2022.

Comments: Accepted at NAACL 2024

arXiv:2203.15479

Speech Segmentation Optimization using Segmented Bilingual Speech Corpus for End-to-end Speech Translation

Authors: Ryo Fukuda, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Speech segmentation, which splits long speech into short segments, is essential for speech translation (ST). Popular VAD tools like WebRTC VAD have generally relied on pause-based segmentation. Unfortunately, pauses in speech do not necessarily match sentence boundaries, and sentences can be connected by a very short pause that is difficult to detect by VAD. In this study, we propose a speech segmentation method using a binary classification model trained using a segmented bilingual speech corpus. We also propose a hybrid method that combines VAD and the above speech segmentation method. Experimental results revealed that the proposed method is more suitable for cascade and end-to-end ST systems than conventional segmentation methods. The hybrid approach further improved the translation performance.

Submitted 13 July, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

Comments: Accepted to INTERSPEECH 2022

arXiv:2203.14725

vTTS: visual-text to speech

Authors: Yoshifumi Nakano, Takaaki Saeki, Shinnosuke Takamichi, Katsuhito Sudoh, Hiroshi Saruwatari

Abstract: This paper proposes visual-text to speech (vTTS), a method for synthesizing speech from visual text (i.e., text as an image). Conventional TTS converts phonemes or characters into discrete symbols and synthesizes a speech waveform from them, thus losing the visual features that the characters essentially have. Therefore, our method synthesizes speech not from discrete symbols but from visual text. The proposed vTTS extracts visual features with a convolutional neural network and then generates acoustic features with a non-autoregressive model inspired by FastSpeech2. Experimental results show that 1) vTTS is capable of generating speech with naturalness comparable to or better than a conventional TTS, 2) it can transfer emphasis and emotion attributes in visual text to speech without additional labels and architectures, and 3) it can synthesize more natural and intelligible speech from unseen and rare characters than conventional TTS.

Submitted 28 March, 2022; originally announced March 2022.

Comments: Submitted to INTERSPEECH 2022

arXiv:2110.13480

Simultaneous Neural Machine Translation with Constituent Label Prediction

Authors: Yasumasa Kano, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Simultaneous translation is a task in which translation begins before the speaker has finished speaking, so it is important to decide when to start the translation process. However, deciding whether to read more input words or start to translate is difficult for language pairs with different word orders such as English and Japanese. Motivated by the concept of pre-reordering, we propose a couple of simple decision rules using the label of the next constituent predicted by incremental constituent label prediction. In experiments on English-to-Japanese simultaneous translation, the proposed method outperformed baselines in the quality-latency trade-off.

Submitted 26 October, 2021; originally announced October 2021.

Comments: WMT2021

arXiv:2107.13689

Using Perturbed Length-aware Positional Encoding for Non-autoregressive Neural Machine Translation

Authors: Yui Oka, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Non-autoregressive neural machine translation (NAT) usually employs sequence-level knowledge distillation using autoregressive neural machine translation (AT) as its teacher model. However, a NAT model often outputs shorter sentences than an AT model. In this work, we propose sequence-level knowledge distillation (SKD) using perturbed length-aware positional encoding and apply it to a student model, the Levenshtein Transformer. Our method outperformed a standard Levenshtein Transformer by 2.5 points in bilingual evaluation understudy (BLEU) at maximum in a WMT14 German to English translation. The NAT model output longer sentences than the baseline NAT models.

Submitted 28 July, 2021; originally announced July 2021.

Comments: 5 pages, 1 figure. Will be presented at ACL SRW 2021

arXiv:2106.07999

ARTA: Collection and Classification of Ambiguous Requests and Thoughtful Actions

Authors: Shohei Tanaka, Koichiro Yoshino, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Human-assisting systems such as dialogue systems must take thoughtful, appropriate actions not only for clear and unambiguous user requests, but also for ambiguous user requests, even if the users themselves are not aware of their potential requirements. To construct such a dialogue agent, we collected a corpus and developed a model that classifies ambiguous user requests into corresponding system actions. In order to collect a high-quality corpus, we asked workers to input antecedent user requests whose pre-defined actions could be regarded as thoughtful. Although multiple actions could be identified as thoughtful for a single user request, annotating all combinations of user requests and system actions is impractical. For this reason, we fully annotated only the test data and left the annotation of the training data incomplete. In order to train the classification model on such training data, we applied the positive/unlabeled (PU) learning method, which assumes that only a part of the data is labeled with positive examples. The experimental results show that the PU learning method achieved better performance than the general positive/negative (PN) learning method to classify thoughtful actions given an ambiguous user request.

Submitted 15 June, 2021; originally announced June 2021.

Comments: Accepted by The 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL2021)

arXiv:2011.04845 [pdf, other]

Simultaneous Speech-to-Speech Translation System with Neural Incremental ASR, MT, and TTS

Authors: Katsuhito Sudoh, Takatomo Kano, Sashi Novitasari, Tomoya Yanagita, Sakriani Sakti, Satoshi Nakamura

Abstract: This paper presents a newly developed, simultaneous neural speech-to-speech translation system and its evaluation. The system consists of three fully-incremental neural processing modules for automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS). We investigated its overall latency in the system's Ear-Voice Span and speaking latency along with module-level performance.

Submitted 11 November, 2020; v1 submitted 9 November, 2020; originally announced November 2020.

Comments: 6 pages

arXiv:2010.09413 [pdf, other]

Image Captioning with Visual Object Representations Grounded in the Textual Modality

Authors: Dušan Variš, Katsuhito Sudoh, Satoshi Nakamura

Abstract: We present our work in progress exploring the possibilities of a shared embedding space between textual and visual modality. Leveraging the textual nature of object detection labels and the hypothetical expressiveness of extracted visual object representations, we propose an approach opposite to the current trend, grounding of the representations in the word embedding space of the captioning system instead of grounding words or sentences in their associated images. Based on the previous work, we apply additional grounding losses to the image captioning training objective aiming to force visual object representations to create more heterogeneous clusters based on their class label and copy a semantic structure of the word embedding space. In addition, we provide an analysis of the learned object vector space projection and its impact on the IC system performance. With only slight change in performance, grounded models reach the stopping criterion during training faster than the unconstrained model, needing about two to three times less training updates. Additionally, an improvement in structural correlation between the word embeddings and both original and projected object vectors suggests that the grounding is actually mutual.

Submitted 20 October, 2020; v1 submitted 19 October, 2020; originally announced October 2020.

arXiv:2007.02598 [pdf, other]

Reflection-based Word Attribute Transfer

Authors: Yoichi Ishibashi, Katsuhito Sudoh, Koichiro Yoshino, Satoshi Nakamura

Abstract: Word embeddings, which often represent such analogic relations as king - man + woman = queen, can be used to change a word's attribute, including its gender. For transferring king into queen in this analogy-based manner, we subtract a difference vector man - woman based on the knowledge that king is male. However, developing such knowledge is very costly for words and attributes. In this work, we propose a novel method for word attribute transfer based on reflection mappings without such an analogy operation. Experimental results show that our proposed method can transfer the word attributes of the given words without changing the words that do not have the target attributes.

Submitted 7 July, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

Comments: Accepted at ACL 2020 Student Research Workshop (SRW)
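
Illustrative note (not part of the arXiv record): the reflection idea mentioned in the abstract above can be pictured with a plain geometric reflection through a hyperplane. The sketch below is only a toy under stated assumptions: the fixed mirror normal `a`, the mirror point `c`, and the 2-D "embeddings" are all hypothetical, and the paper's own parameterization of the mirror is not given in this listing. It shows two properties that make a reflection attractive here: applying it twice returns the original vector, and a vector lying on the mirror is left unchanged.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def reflect(v, a, c):
    """Reflect v through the hyperplane with normal a that passes through c."""
    offset = [x - y for x, y in zip(v, c)]
    scale = 2.0 * dot(offset, a) / dot(a, a)
    return [x - scale * n for x, n in zip(v, a)]


# Hypothetical 2-D toy space: the first axis plays the role of the attribute.
a = [1.0, 0.0]          # assumed mirror normal
c = [0.0, 0.0]          # assumed point on the mirror
king = [2.0, 3.0]       # toy vector with the attribute
neutral = [0.0, 5.0]    # toy vector lying on the mirror (no attribute)

queen = reflect(king, a, c)        # attribute flipped
back = reflect(queen, a, c)        # reflecting again undoes the transfer
unchanged = reflect(neutral, a, c) # on-mirror vector stays put
```

Because the same mapping works in both directions, no prior knowledge of which side of the attribute a word is on is needed in this toy setting.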

arXiv:1911.11933 [pdf, other]

Simultaneous Neural Machine Translation using Connectionist Temporal Classification

Authors: Katsuki Chousa, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Simultaneous machine translation is a variant of machine translation that starts the translation process before the end of an input. This task faces a trade-off between translation accuracy and latency. We have to determine when we start the translation for observed inputs so far, to achieve good practical performance. In this work, we propose a neural machine translation method to determine this timing in an adaptive manner. The proposed method introduces a special token '<wait>', which is generated when the translation model chooses to read the next input token instead of generating an output token. It also introduces an objective function to handle the ambiguity in wait timings that can be optimized using an algorithm called Connectionist Temporal Classification (CTC). The use of CTC enables the optimization to consider all possible output sequences including '<wait>' that are equivalent to the reference translations and to choose the best one adaptively. We apply the proposed method into simultaneous translation from English to Japanese and investigate its performance and remaining problems.

Submitted 26 November, 2019; originally announced November 2019.

arXiv:1910.13299 [pdf, other]

Findings of the Third Workshop on Neural Generation and Translation

Authors: Hiroaki Hayashi, Yusuke Oda, Alexandra Birch, Ioannis Konstas, Andrew Finch, Minh-Thang Luong, Graham Neubig, Katsuhito Sudoh

Abstract: This document describes the findings of the Third Workshop on Neural Generation and Translation, held in concert with the annual conference of the Empirical Methods in Natural Language Processing (EMNLP 2019). First, we summarize the research trends of papers presented in the proceedings. Second, we describe the results of the two shared tasks 1) efficient neural machine translation (NMT) where participants were tasked with creating NMT systems that are both accurate and efficient, and 2) document-level generation and translation (DGT) where participants were tasked with developing systems that generate summaries from structured data, potentially with assistance from text in another language.

Submitted 29 October, 2019; v1 submitted 29 October, 2019; originally announced October 2019.

Comments: Fixed the metadata (author list)

arXiv:1906.09795 [pdf, other]

Conversational Response Re-ranking Based on Event Causality and Role Factored Tensor Event Embedding

Authors: Shohei Tanaka, Koichiro Yoshino, Katsuhito Sudoh, Satoshi Nakamura

Abstract: We propose a novel method for selecting coherent and diverse responses for a given dialogue context. The proposed method re-ranks response candidates generated from conversational models by using event causality relations between events in a dialogue history and response candidates (e.g., "be stressed out" precedes "relieve stress"). We use distributed event representation based on the Role Factored Tensor Model for a robust matching of event causality relations due to limited event causality knowledge of the system. Experimental results showed that the proposed method improved coherency and dialogue continuity of system responses.

Submitted 24 June, 2019; originally announced June 2019.

Comments: Accepted by 1st Workshop NLP for Conversational AI, ACL 2019 Workshop (ConvAI)

arXiv:1811.08100 [pdf, other]

Another Diversity-Promoting Objective Function for Neural Dialogue Generation

Authors: Ryo Nakamura, Katsuhito Sudoh, Koichiro Yoshino, Satoshi Nakamura

Abstract: Although generation-based dialogue systems have been widely researched, the response generations by most existing systems have very low diversities. The most likely reason for this problem is Maximum Likelihood Estimation (MLE) with Softmax Cross-Entropy (SCE) loss. MLE trains models to generate the most frequent responses from enormous generation candidates, although in actual dialogues there are various responses based on the context. In this paper, we propose a new objective function called Inverse Token Frequency (ITF) loss, which individually scales smaller loss for frequent token classes and larger loss for rare token classes. This function encourages the model to generate rare tokens rather than frequent tokens. It does not complicate the model and its training is stable because we only replace the objective function. On the OpenSubtitles dialogue dataset, our loss model establishes a state-of-the-art DIST-1 of 7.56, which is the unigram diversity score, while maintaining a good BLEU-1 score. On a Japanese Twitter replies dataset, our loss model achieves a DIST-1 score comparable to the ground truth.

Submitted 20 November, 2018; v1 submitted 20 November, 2018; originally announced November 2018.

Comments: AAAI 2019 Workshop on Reasoning and Learning for Human-Machine Dialogues (DEEP-DIAL 2019)
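
Illustrative note (not part of the arXiv record): the abstract above describes scaling the loss down for frequent token classes and up for rare ones. A minimal sketch of that idea follows, assuming a weight of the form 1 / (count + 1)^power per class; this weight formula, the add-one smoothing, and every function name are assumptions made for illustration, not the paper's definition.

```python
import math
from collections import Counter


def itf_weights(token_ids, vocab_size, power=1.0):
    """Per-class weights that shrink as a class gets more frequent (assumed form)."""
    counts = Counter(token_ids)
    # Add-one smoothing keeps the weight finite for classes never seen.
    return [1.0 / ((counts.get(c, 0) + 1) ** power) for c in range(vocab_size)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_token_loss(logits, target, weights):
    """Softmax cross-entropy for one token, scaled by the target class weight."""
    return -weights[target] * log_softmax(logits)[target]


# Toy corpus: class 0 is frequent, class 2 is rare.
corpus = [0, 0, 0, 0, 0, 0, 1, 1, 2]
weights = itf_weights(corpus, vocab_size=3)

uniform_logits = [0.0, 0.0, 0.0]
loss_frequent = weighted_token_loss(uniform_logits, 0, weights)
loss_rare = weighted_token_loss(uniform_logits, 2, weights)
```

With identical model predictions, an error on the rare class costs more than one on the frequent class, which is the pressure toward rarer tokens that the abstract describes. Only the objective changes; the model itself is untouched.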

arXiv:1810.06826 [pdf, other]

Multi-Source Neural Machine Translation with Data Augmentation

Authors: Yuta Nishimura, Katsuhito Sudoh, Graham Neubig, Satoshi Nakamura

Abstract: Multi-source translation systems translate from multiple languages to a single target language. By using information from these multiple sources, these systems achieve large gains in accuracy. To train these systems, it is necessary to have corpora with parallel text in multiple sources and the target language. However, these corpora are rarely complete in practice due to the difficulty of providing human translations in all of the relevant languages. In this paper, we propose a data augmentation approach to fill such incomplete parts using multi-source neural machine translation (NMT). In our experiments, results varied over different language combinations but significant gains were observed when using a source language similar to the target language.

Submitted 8 November, 2018; v1 submitted 16 October, 2018; originally announced October 2018.

Comments: 15th International Workshop on Spoken Language Translation 2018

arXiv:1807.11219 [pdf, ps, other]

Training Neural Machine Translation using Word Embedding-based Loss

Authors: Katsuki Chousa, Katsuhito Sudoh, Satoshi Nakamura

Abstract: In neural machine translation (NMT), the computational cost at the output layer increases with the size of the target-side vocabulary. Using a limited-size vocabulary instead may cause a significant decrease in translation quality. This trade-off is derived from a softmax-based loss function that handles in-dictionary words independently, in which word similarity is not considered. In this paper, we propose a novel NMT loss function that includes word similarity in forms of distances in a word embedding space. The proposed loss function encourages an NMT decoder to generate words close to their references in the embedding space; this helps the decoder to choose similar acceptable words when the actual best candidates are not included in the vocabulary due to its size limitation. In experiments using ASPEC Japanese-to-English and IWSLT17 English-to-French data sets, the proposed method showed improvements against a standard NMT baseline in both datasets; especially with IWSLT17 En-Fr, it achieved up to +1.72 in BLEU and +1.99 in METEOR. When the target-side vocabulary was very limited to 1,000 words, the proposed method demonstrated a substantial gain, +1.72 in METEOR with ASPEC Ja-En.

Submitted 30 July, 2018; originally announced July 2018.

arXiv:1806.02525 [pdf, other]

Multi-Source Neural Machine Translation with Missing Data

Authors: Yuta Nishimura, Katsuhito Sudoh, Graham Neubig, Satoshi Nakamura

Abstract: Multi-source translation is an approach to exploit multiple inputs (e.g. in two different languages) to increase translation accuracy. In this paper, we examine approaches for multi-source neural machine translation (NMT) using an incomplete multilingual corpus in which some translations are missing. In practice, many multilingual corpora are not complete due to the difficulty to provide translations in all of the relevant languages (for example, in TED talks, most English talks only have subtitles for a small portion of the languages that TED supports). Existing studies on multi-source translation did not explicitly handle such situations. This study focuses on the use of incomplete multilingual corpora in multi-encoder NMT and mixture of NMT experts and examines a very simple implementation where missing source translations are replaced by a special symbol <NULL>. These methods allow us to use incomplete corpora both at training time and test time. In experiments with real incomplete multilingual corpora of TED Talks, the multi-source NMT with the <NULL> tokens achieved higher translation accuracies measured by BLEU than those by any one-to-one NMT systems.

Submitted 7 June, 2018; v1 submitted 7 June, 2018; originally announced June 2018.

Comments: ACL 2018 Workshop on Neural Machine Translation and Generation

arXiv:1706.05765 [pdf, other]

An Empirical Study of Mini-Batch Creation Strategies for Neural Machine Translation

Authors: Makoto Morishita, Yusuke Oda, Graham Neubig, Koichiro Yoshino, Katsuhito Sudoh, Satoshi Nakamura

Abstract: Training of neural machine translation (NMT) models usually uses mini-batches for efficiency purposes. During the mini-batched training process, it is necessary to pad shorter sentences in a mini-batch to be equal in length to the longest sentence therein for efficient computation. Previous work has noted that sorting the corpus based on the sentence length before making mini-batches reduces the amount of padding and increases the processing speed. However, despite the fact that mini-batch creation is an essential step in NMT training, widely used NMT toolkits implement disparate strategies for doing so, which have not been empirically validated or compared. This work investigates mini-batch creation strategies with experiments over two different datasets. Our results suggest that the choice of a mini-batch creation strategy has a large effect on NMT training and some length-based sorting strategies do not always work well compared with simple shuffling.

Submitted 18 June, 2017; originally announced June 2017.

Comments: 8 pages, accepted to the First Workshop on Neural Machine Translation
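
Illustrative note (not part of the arXiv record): the padding effect mentioned in the abstract above can be checked with a few lines of arithmetic. The sketch below is a toy under stated assumptions: sentence lengths are random integers, batches hold a fixed number of sentences, and the helper names are hypothetical. It counts only padding tokens and says nothing about translation quality, which the abstract reports does not always favor length-based sorting.

```python
import random


def make_batches(lengths, batch_size, sort_by_length):
    """Split sentence lengths into consecutive batches, optionally sorting first."""
    order = sorted(lengths) if sort_by_length else list(lengths)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padding_tokens(batches):
    """Total pad tokens when each batch is padded to its longest sentence."""
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)


rng = random.Random(0)  # fixed seed so the toy corpus is reproducible
lengths = [rng.randint(5, 50) for _ in range(64)]  # assumed toy sentence lengths

unsorted_pad = padding_tokens(make_batches(lengths, 8, sort_by_length=False))
sorted_pad = padding_tokens(make_batches(lengths, 8, sort_by_length=True))
```

With equal-sized batches, grouping sentences of similar length never needs more padding than the unsorted order, so `sorted_pad` is at most `unsorted_pad`. The cost is that each batch becomes homogeneous in length, which is one plausible reason training behavior can differ from simple shuffling.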

arXiv:1704.00380 [pdf, other]

Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings

Authors: Junki Matsuo, Mamoru Komachi, Katsuhito Sudoh

Abstract: One of the most important problems in machine translation (MT) evaluation is to evaluate the similarity between translation hypotheses with different surface forms from the reference, especially at the segment level. We propose to use word embeddings to perform word alignment for segment-level MT evaluation. We performed experiments with three types of alignment methods using word embeddings. We evaluated our proposed methods with various translation datasets. Experimental results show that our proposed methods outperform previous word embeddings-based methods.

Submitted 2 April, 2017; originally announced April 2017.

Comments: 5 pages

arXiv:1612.03551 [pdf, other]

doi 10.13053/cys-21-4-2845

Reading Comprehension using Entity-based Memory Network

Authors: Xun Wang, Katsuhito Sudoh, Masaaki Nagata, Tomohide Shibata, Daisuke Kawahara, Sadao Kurohashi

Abstract: This paper introduces a novel neural network model for question answering, the entity-based memory network. It enhances neural networks' ability of representing and calculating information over a long period by keeping records of entities contained in text. The core component is a memory pool which comprises entities' states. These entities' states are continuously updated according to the input text. Questions with regard to the input text are used to search the memory pool for related entities and answers are further predicted based on the states of retrieved entities. Compared with previous memory network models, the proposed model is capable of handling fine-grained information and more sophisticated relations based on entities. We formulated several different tasks as question answering problems and tested the proposed model. Experiments reported satisfying results.

Submitted 1 February, 2017; v1 submitted 12 December, 2016; originally announced December 2016.

Showing 1–33 of 33 results for author: Sudoh, K