Improving Self Consistency in LLMs through Probabilistic Tokenization

Sathe, Ashutosh; Aggarwal, Divyanshu; Sitaram, Sunayana

arXiv:2407.03678 (cs)

[Submitted on 4 Jul 2024]

Abstract:Prior research has demonstrated noticeable performance gains through the use of probabilistic tokenizations, an approach that involves employing multiple tokenizations of the same input string during the training phase of a language model. Despite these promising findings, modern large language models (LLMs) have yet to be trained using probabilistic tokenizations. Interestingly, while the tokenizers of these contemporary LLMs have the capability to generate multiple tokenizations, this property remains underutilized.
In this work, we propose a novel method to leverage the multiple tokenization capabilities of modern LLM tokenizers, aiming to enhance the self-consistency of LLMs in reasoning tasks. Our experiments indicate that when utilizing probabilistic tokenizations, LLMs generate logically diverse reasoning paths, moving beyond mere surface-level linguistic diversity. We carefully study probabilistic tokenization and offer insights to explain the self-consistency improvements it brings through extensive experimentation on 5 LLM families and 4 reasoning benchmarks.
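
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vocabulary, the uniform sampling over segmentations, and the function names `count_segmentations`, `sample_tokenization`, and `majority_vote` are all assumptions made for illustration. The sketch samples several different segmentations of the same string (a stand-in for probabilistic tokenization) and shows the majority vote that self-consistency typically applies to the final answers. The step that feeds each tokenization to a model is left out.

```python
import random
from collections import Counter

# Toy vocabulary, purely illustrative; not taken from any real LLM tokenizer.
TOY_VOCAB = {"un", "related", "unrelated", "rel", "ated",
             "u", "n", "r", "e", "l", "a", "t", "d"}

def count_segmentations(text, vocab):
    """ways[i] = number of ways to split text[i:] into vocabulary pieces."""
    n = len(text)
    ways = [0] * (n + 1)
    ways[n] = 1  # the empty suffix has exactly one (empty) segmentation
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if text[i:j] in vocab:
                ways[i] += ways[j]
    return ways

def sample_tokenization(text, vocab, rng):
    """Sample one segmentation uniformly at random among all valid ones.

    Uniform sampling is an assumption of this sketch; real tokenizers
    usually weight segmentations by a learned model.
    """
    ways = count_segmentations(text, vocab)
    if ways[0] == 0:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, i = [], 0
    while i < len(text):
        # Candidate end positions that still allow the rest to be segmented.
        ends = [j for j in range(i + 1, len(text) + 1)
                if text[i:j] in vocab and ways[j] > 0]
        # Weighting by ways[j] makes the overall draw uniform.
        j = rng.choices(ends, weights=[ways[e] for e in ends], k=1)[0]
        pieces.append(text[i:j])
        i = j
    return pieces

def majority_vote(answers):
    """Return the most frequent answer, as in self-consistency decoding."""
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_tokenization("unrelated", TOY_VOCAB, rng))
    # Hypothetical final answers, one per sampled tokenization of a prompt.
    print(majority_vote(["18", "18", "20"]))
```

In a full pipeline under these assumptions, each sampled tokenization of the prompt would be passed to the model, one reasoning path would be decoded per tokenization, and `majority_vote` would aggregate the extracted final answers.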

Comments:	ICML 2024 Workshop on LLMs and Cognition
Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2407.03678 [cs.CL]
	(or arXiv:2407.03678v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.03678

Submission history

From: Ashutosh Sathe
[v1] Thu, 4 Jul 2024 06:52:48 UTC (144 KB)
