Showing 1–9 of 9 results for author: Fujinuma, Y

  1. arXiv:2406.08255  [pdf, other]

    cs.CL

    M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation

    Authors: Benjamin Hsu, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nadejde, Xing Niu, Yair Kittenplon, Ron Litman, Raghavendra Pappagari

    Abstract: Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world…

    Submitted 12 June, 2024; originally announced June 2024.

    Comments: NAACL 2024, dataset at https://github.com/amazon-science/m3t-multi-modal-translation-bench

  2. arXiv:2310.16356  [pdf, other]

    cs.CL

    A Multi-Modal Multilingual Benchmark for Document Image Classification

    Authors: Yoshinari Fujinuma, Siddharth Varia, Nishant Sankaran, Srikar Appalaraju, Bonan Min, Yogarshi Vyas

    Abstract: Document image classification is different from plain-text document classification and consists of classifying a document by understanding the content and structure of documents such as forms, emails, and other such documents. We show that the only existing dataset for this task (Lewis et al., 2006) has several limitations and we introduce two newly curated multilingual datasets WIKI-DOC and MULTI…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: Accepted to EMNLP 2023 (Findings)

  3. arXiv:2305.17020  [pdf, other]

    cs.CL cs.LG

    Diable: Efficient Dialogue State Tracking as Operations on Tables

    Authors: Pietro Lesci, Yoshinari Fujinuma, Momchil Hardalov, Chao Shang, Yassine Benajiba, Lluis Marquez

    Abstract: Sequence-to-sequence state-of-the-art systems for dialogue state tracking (DST) use the full dialogue history as input, represent the current state as a list with all the slots, and generate the entire state from scratch at each dialogue turn. This approach is inefficient, especially when the number of slots is large and the conversation is long. We propose Diable, a new task formalisation that si…

    Submitted 1 November, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted to ACL 2023 (Findings)

  4. arXiv:2305.11242  [pdf, other]

    cs.CL

    Comparing Biases and the Impact of Multilingual Training across Multiple Languages

    Authors: Sharon Levy, Neha Anna John, Ling Liu, Yogarshi Vyas, Jie Ma, Yoshinari Fujinuma, Miguel Ballesteros, Vittorio Castelli, Dan Roth

    Abstract: Studies in bias and fairness in natural language processing have primarily examined social biases within a single language and/or across few attributes (e.g. gender, race). However, biases can manifest differently across various languages for individual attributes. As a result, it is critical to examine biases within each language and attribute. Of equal importance is to study how these biases com…

    Submitted 18 May, 2023; originally announced May 2023.

  5. arXiv:2203.10753  [pdf, other]

    cs.CL

    Match the Script, Adapt if Multilingual: Analyzing the Effect of Multilingual Pretraining on Cross-lingual Transferability

    Authors: Yoshinari Fujinuma, Jordan Boyd-Graber, Katharina Kann

    Abstract: Pretrained multilingual models enable zero-shot learning even for unseen languages, and that performance can be further improved via adaptation prior to finetuning. However, it is unclear how the number of pretraining languages influences a model's zero-shot learning for languages unseen during pretraining. To fill this gap, we ask the following research questions: (1) How does the number of pretr…

    Submitted 21 March, 2022; originally announced March 2022.

    Comments: ACL 2022 camera ready

  6. arXiv:2104.13103  [pdf, other]

    cs.CL

    Semi-Supervised Joint Estimation of Word and Document Readability

    Authors: Yoshinari Fujinuma, Masato Hagiwara

    Abstract: Readability or difficulty estimation of words and documents has been investigated independently in the literature, often assuming the existence of extensive annotated resources for the other. Motivated by our analysis showing that there is a recursive relationship between word and document difficulty, we propose to jointly estimate word and document difficulty through a graph convolutional network…

    Submitted 27 April, 2021; originally announced April 2021.

  7. arXiv:2005.00524  [pdf, other]

    cs.CL cs.LG

    Why Overfitting Isn't Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries

    Authors: Mozhi Zhang, Yoshinari Fujinuma, Michael J. Paul, Jordan Boyd-Graber

    Abstract: Cross-lingual word embeddings (CLWE) are often evaluated on bilingual lexicon induction (BLI). Recent CLWE methods use linear projections, which underfit the training dictionary, to generalize on BLI. However, underfitting can hinder generalization to other downstream tasks that rely on words from the training dictionary. We address this limitation by retrofitting CLWE to the training dictionary,…

    Submitted 1 May, 2020; originally announced May 2020.

    Comments: ACL 2020

  8. A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity

    Authors: Yoshinari Fujinuma, Jordan Boyd-Graber, Michael J. Paul

    Abstract: Cross-lingual word embeddings encode the meaning of words from different languages into a shared low-dimensional space. An important requirement for many downstream tasks is that word similarity should be independent of language - i.e., word vectors within one language should not be more similar to each other than to words in another language. We measure this characteristic using modularity, a net…

    Submitted 5 June, 2019; originally announced June 2019.

    Comments: Accepted to ACL 2019, camera-ready

  9. arXiv:1812.09617  [pdf, other]

    cs.CL cs.LG

    Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification

    Authors: Mozhi Zhang, Yoshinari Fujinuma, Jordan Boyd-Graber

    Abstract: Text classification must sometimes be applied in a low-resource language with no labeled training data. However, training data may be available in a related language. We investigate whether character-level knowledge transfer from a related language helps text classification. We present a cross-lingual document classification framework (CACO) that exploits cross-lingual subword similarity by jointl…

    Submitted 28 April, 2020; v1 submitted 22 December, 2018; originally announced December 2018.

    Comments: AAAI 2020