Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks

Sun, Kaiser; Qi, Peng; Zhang, Yuhao; Liu, Lan; Wang, William Yang; Huang, Zhiheng

arXiv:2212.09912 (cs)

[Submitted on 19 Dec 2022 (v1), last revised 24 Oct 2023 (this version, v2)]

Abstract: Generative models have been widely applied to solve extractive tasks, where parts of the input are extracted to form the desired output, and have achieved significant success. For example, in extractive question answering (QA), generative models have consistently yielded state-of-the-art results. In this work, we identify the issue of tokenization inconsistency, which is commonly neglected in training these models. This issue damages the extractive nature of these tasks when the input and output are tokenized inconsistently by the tokenizer, and thus leads to a performance drop as well as hallucination. We propose a simple yet effective fix to this issue and conduct a case study on extractive QA. We show that, with consistent tokenization, the model performs better on both in-domain and out-of-domain datasets, with a notable average gain of +1.7 F1 when a BART model is trained on SQuAD and evaluated on 8 QA datasets. Further, the model converges faster and becomes less likely to generate out-of-context answers. With these findings, we would like to call for more attention to how tokenization should be done when solving extractive tasks, and we recommend applying consistent tokenization during training.
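To make the notion of tokenization inconsistency concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical toy tokenizer (`toy_tokenize`) that, like byte-level BPE tokenizers, marks a token preceded by a space with `Ġ`. The helper names `naive_target`, `consistent_target`, and `is_contiguous_sublist` are likewise illustrative. Tokenizing the answer on its own yields a token that never appears in the tokenized context, whereas selecting the context tokens that cover the answer span keeps the target a true extract of the input. This is one plausible way to obtain consistent tokenization; the abstract does not spell out the exact procedure.

```python
def toy_tokenize(text):
    """Hypothetical whitespace tokenizer. A token preceded by a space gets a
    leading 'Ġ' marker, mimicking byte-level BPE. Returns (token, start, end)
    triples, where start/end are character offsets into `text`."""
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == " ":
            i += 1
            continue
        j = i
        while j < n and text[j] != " ":
            j += 1
        prefix = "Ġ" if i > 0 and text[i - 1] == " " else ""
        tokens.append((prefix + text[i:j], i, j))
        i = j
    return tokens


def naive_target(answer):
    """Inconsistent: tokenize the answer string in isolation."""
    return [tok for tok, _, _ in toy_tokenize(answer)]


def consistent_target(context, answer_start, answer_end):
    """Consistent: reuse the context tokens that lie inside the answer span."""
    return [
        tok
        for tok, start, end in toy_tokenize(context)
        if start >= answer_start and end <= answer_end
    ]


def is_contiguous_sublist(needle, haystack):
    """True if `needle` occurs as a contiguous run inside `haystack`."""
    k = len(needle)
    return any(haystack[i:i + k] == needle for i in range(len(haystack) - k + 1))


context = "The Titanic sank in 1912 ."
answer = "1912"
answer_start = context.index(answer)
answer_end = answer_start + len(answer)

context_tokens = [tok for tok, _, _ in toy_tokenize(context)]
naive = naive_target(answer)
fixed = consistent_target(context, answer_start, answer_end)

print(naive, is_contiguous_sublist(naive, context_tokens))
print(fixed, is_contiguous_sublist(fixed, context_tokens))
```

In this toy setting the naive target cannot be copied from the tokenized input, so the task stops being extractive at the token level, while the consistent target can. With a real tokenizer, the same span selection would typically rely on the character offsets the tokenizer reports for each token.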

Comments:	Findings of EMNLP 2023
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2212.09912 [cs.CL]
	(or arXiv:2212.09912v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2212.09912

Submission history

From: Kaiser Sun
[v1] Mon, 19 Dec 2022 23:33:21 UTC (120 KB)
[v2] Tue, 24 Oct 2023 20:59:33 UTC (14,846 KB)
