Waterfall: Framework for Robust and Scalable Text Watermarking

Gregory Kang Ruey Lau1,2, Xinyuan Niu∗1,3, Hieu Dao1, Jiangwei Chen1,4,
Chuan-Sheng Foo3,4  Bryan Kian Hsiang Low1
1Department of Computer Science, National University of Singapore
2CNRS@CREATE, 1 Create Way, #08-01 Create Tower, Singapore 138602
3Centre for Frontier AI Research; 4Institute for Infocomm Research, A*STAR, Singapore
{greglau,niux,daohieu,chenj,lowkh}@comp.nus.edu.sg,
[email protected]
Equal contribution.
Abstract

Protecting intellectual property (IP) of text such as articles and code is increasingly important, especially as sophisticated attacks become possible, such as paraphrasing by large language models (LLMs) or even unauthorized training of LLMs on copyrighted text to infringe such IP. However, existing text watermarking methods are neither robust enough against such attacks nor scalable to millions of users for practical implementation. In this paper, we propose Waterfall, the first training-free framework for robust and scalable text watermarking applicable across multiple text types (e.g., articles, code) and languages supportable by LLMs, for general text and LLM data provenance. Waterfall comprises several key innovations, such as being the first to use LLMs as paraphrasers for watermarking along with a novel combination of techniques that are surprisingly effective in achieving robust verifiability and scalability. We empirically demonstrate that Waterfall achieves significantly better scalability, robust verifiability, and computational efficiency compared to SOTA article-text watermarking methods, and also show how it can be directly applied to the watermarking of code.


1 Introduction

Achieving robust text data provenance via watermarking, independent of its digital format, is an important open problem impacting a wide-ranging set of real-world challenges. Among these is the issue of intellectual property (IP) enforcement: Content creators of any text format (e.g., articles or code) could potentially combat plagiarism and unauthorized distribution by watermarking their works to prove data ownership. However, existing text watermarking methods have been unable to meet the challenging requirements of many practical problem settings. For example, directly adding digital metadata or invisible Unicode watermarks (Rizzo et al., 2019; Taleby Ahvanooey et al., 2019) may have limited impact in proving text data ownership in adversarial settings as they may be easily removed. Existing natural language watermarking methods (Qiang et al., 2023; Yoo et al., 2023; Taleby Ahvanooey et al., 2019) that adjust the text itself to encode IDs also lack robustness to paraphrasing attacks and have limited scalability in terms of the number of supportable IDs.

Adding to the challenge is the growing prevalence of generative large language models (LLMs) that may be trained on copyrighted text without permission. To enforce IP rights, content creators would need to be able to perform data provenance for LLMs, i.e., prove whether their set of work had been used to train third-party black-box LLMs. While there have been recent works tackling this problem (Abdelnabi and Fritz, 2021; Zhang et al., 2023), they largely require intervening in the training process of the LLMs. This is unrealistic in practice, as not all LLM service providers may be cooperative due to incentive misalignment, and adversaries may also use open-source LLMs.

Hence, it is natural to ask whether it is possible to develop a practical, robust and scalable text watermarking framework for protecting IP against both plagiarism and unauthorized training of LLMs. For example, the watermarks should persist regardless of whether the text has been paraphrased, converted into speech or handwritten text, or used in unauthorized LLM training (e.g., fine-tuning, in-context learning) to produce a derived output. The framework should also be general enough to be tailored to a wide range of text formats (e.g., natural language or code), and be scalable (i.e., support millions of users with reasonable computational cost).

In this paper, we propose Waterfall, the first training-free framework for robust and scalable text watermarking applicable across multiple text types (e.g., articles, code) and languages supportable by LLMs, for general text and LLM data provenance. Rather than viewing LLMs as just sources of IP infringement, we introduce the novel perspective of using LLMs’ capabilities to protect existing IP. Though simple, our training-free framework comprises several key innovations, such as being the first to use LLMs as paraphrasers for watermarking along with a novel combination of techniques that are surprisingly effective in achieving robust verifiability, scalability, and data provenance for LLMs, surpassing state-of-the-art text watermarking methods. In summary, our contributions are as follows:
1. We introduce a formulation of the robust and scalable text watermarking problem setting and lay out a set of desiderata to be satisfied (Section 2).
2. To tackle the challenges arising from these desiderata, we propose Waterfall, comprising novel innovations, including: (a) effective use of LLM paraphrasers to watermark existing text with IP to be protected (Section 3.1); (b) a combination of vocab permutation and a new orthogonal watermarking perturbation method in token space, to achieve high scalability and robust verifiability while preserving fidelity (Section 3.3).
3. We conduct comprehensive empirical evaluations, demonstrating that Waterfall achieves significantly better scalability, robust verifiability, and computational efficiency compared to SOTA article-text watermarking methods (Section 4.1), while meeting the desiderata for a variety of applications, including for LLM data provenance of articles (Section 4.3). We also show how Waterfall can be directly applied to the watermarking of programming code (Section 4.2).

2 Problem formulation and Desiderata

Consider $M$ clients, each with unique watermark ID $\mu \in \mathbb{M}$ and textual data $T_{\text{o}} \in \mathbb{T}$ (e.g., articles or code) represented as token sequences $T_{\text{o}} = [w_1, \ldots, w_N]$, where each token $w_i$ is from an ordered vocab space $\mathbb{V} = \{v_1, \ldots, v_{|\mathbb{V}|}\}$. We assume that $T_{\text{o}}$ has semantic content $c$ (e.g., the IP content) that is only determined by its tokens and fully represents the text’s value. Text formatting is irrelevant, especially as adversaries can strip all formatting, making those channels unusable for watermarking (footnote 1).

Footnote 1: Attacks include converting text to audio or non-digital formats like written text, which removes format-based watermarks (e.g., homoglyphs and zero-width Unicode characters) (Rizzo et al., 2019) or digital metadata.

Watermarking: Client $i$ uses a watermarking operator $\mathcal{W}(\mu_i, T_{\text{o}}) \rightarrow T_{\text{w}}^{(i)}$ to produce a text $T_{\text{w}}^{(i)}$ that contains watermark $\mu_i$, preserves $c$, and can then be used/distributed freely.

Attacks: There are adversaries who aim to claim the IP in $T_{\text{w}}^{(i)}$ through attacks $\mathcal{A}(T_{\text{w}}^{(i)}) \rightarrow T_{\text{sus}}^{(i)}$ that generate their own text $T_{\text{sus}}^{(i)}$ without the watermark $\mu_i$ while preserving semantic content $c$. Adversaries do not know $\mu_i$ but are able to perform several classes of attacks:
$\mathbb{A}_1$: alter $T_{\text{w}}^{(i)}$ with word additions/removals/substitutions;
$\mathbb{A}_2$: translate and paraphrase $T_{\text{w}}^{(i)}$ with an LLM;
$\mathbb{A}_3$: watermark $T_{\text{w}}^{(i)}$ again with a different watermark;
$\mathbb{A}_4$: use $T_{\text{w}}^{(i)}$ with any LLM for in-context prompting;
$\mathbb{A}_5$: use $T_{\text{w}}^{(i)}$ to fine-tune any LLM.

Verification: Client $i$ can use a verification operator $\mathcal{V}(\mu_i, T_{\text{sus}})$ to generate a score $q$ indicating the likelihood that $T_{\text{sus}}$ contains $\mu_i$. They can then use a setting-specific threshold $\bar{q}$ to classify $T_{\text{sus}}$ as watermarked with $\mu_i$ if $q \geq \bar{q}$. The operator $\mathcal{V}$ should be quick and not assume access to $T_{\text{o}}$, as in practice client $i$ may have a large set of $T_{\text{w}}$ and would need to automate the application of $\mathcal{V}$ to scan through a large set of $\{T_{\text{sus}}\}$ to identify any plagiarism (further discussion in Appendix K).

Given the above, a suitable watermarking framework should satisfy the following desiderata:

1. Fidelity. The watermarked text $T_{\text{w}}$ should be semantically similar to $T_{\text{o}}$, i.e., $\mathcal{S}(T_{\text{o}}, T_{\text{w}}) \geq s$, where $\mathcal{S}: \mathbb{T} \times \mathbb{T} \rightarrow [0, 1]$ is a user-defined fidelity metric depending on the purpose and type of text (e.g., semantic similarity score for articles, or unit tests for code) and $s$ is a setting-specific threshold. We define $\mathbb{T}_{c,s}^{\mathcal{W}} = \{T \in \mathbb{T} : \mathcal{S}(T_{\text{o}}, T) \geq s\}$ as the support set of all $T_{\text{w}}$ that a watermarking operator $\mathcal{W}$ can possibly generate for $T_{\text{o}}$ with content $c$ under an $s$-fidelity setting.

2. Verifiability. The verification operator $\mathcal{V}(\mu_i, T_{\text{w}}^{(i)})$ should have high efficacy, accounting for Type I and II errors over various thresholds $\bar{q}$. We evaluate this with AUROC computed over a test set.

Note that there is a trade-off between fidelity and verifiability. Applying a stronger, more verifiable watermark tends to reduce text fidelity and the optimal setting depends on each use case. We can evaluate a watermarking scheme in general, while taking into account this trade-off, using its fidelity-verifiability Pareto frontier (e.g., as plotted in Figure 4a).

3. Robust verifiability. The verification operator on watermarked text after attacks $\mathcal{A} \in \mathbb{A}$, i.e., $\mathcal{V}(\mu_i, \mathcal{A}(T_{\text{w}}^{(i)}))$, retains high verifiability. This means that the watermark should remain verifiable even after attacks, which constrains framework design. For example, the verification operator should not extract $\mu$ in any subroutine, as an attacker may use it to get $\mu$ and devise an $\mathbb{A}_3$ attack to overwrite it (see Section 4.1).

4. Scalability. The framework should support a large $|\mathbb{M}|$ (set of IDs) while meeting all other desiderata.
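
To make the verification and verifiability definitions above concrete, the following is a minimal sketch of thresholding a verification score and estimating AUROC with the rank-based (Mann-Whitney) formulation. The function names `classify` and `auroc` and the toy scores are illustrative assumptions and are not part of Waterfall.

```python
from typing import Sequence


def classify(q: float, q_bar: float) -> bool:
    """Flag a suspected text as watermarked when its score q >= q_bar."""
    return q >= q_bar


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUROC as the fraction of (watermarked, unwatermarked) pairs in which
    the watermarked text scores higher; ties count as half a win."""
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one score from each class")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical verification scores for texts with and without watermark mu_i.
watermarked = [2.1, 1.7, 0.9, 2.5]
unwatermarked = [0.2, -0.4, 1.0, 0.1]
print(auroc(watermarked, unwatermarked))  # 0.9375
```

Because AUROC sweeps over all thresholds, it summarizes Type I and Type II errors without committing to a single setting-specific threshold.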

Figure 1: Schematic of the problem formulation. Client $i$ watermarks text $T_{\text{o}}$ with ID $\mu_i$ to produce watermarked text $T_{\text{w}}^{(i)}$. After manipulation by a third party, the client can verify the watermark in $T_{\text{sus}}$.

3 Method

We discuss three key insights to tackle challenges arising from these desiderata, before combining these to present our framework Waterfall (Watermarking Framework Applying Large Language Models).

3.1 Increasing support set for watermarking via LLMs

First, note that the fidelity desideratum is a major constraint to a scheme’s ability to meet the other desiderata. Intuitively, a scheme that can only generate a small set $\mathbb{T}_{c,s}^{\mathcal{W}}$ of possible watermarked text would have fewer ways to encode the watermark, leading to lower signal capacity (smaller $|\mathbb{M}|$, lower scalability), and less capacity for error correction to withstand attacks (lower robust verifiability).

For illustration, consider a basic semantic watermarking scheme (Basic) that lists out synonyms for each word in the original text $T_{\text{o}}$ (e.g., big cat) and remembers a map of IDs to possible combinations of these synonyms (e.g., 01:big feline, 10:large cat, 11:large feline). Watermarking for ID $\mu$ is then selecting the text $T_{\text{w}}$ with the matching synonym combination. Note that schemes like Basic typically only have a relatively small support set $\mathbb{T}_{c,s}^{\mathcal{W}}$ and hence limited watermarking possibilities.
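
A scheme in the spirit of Basic can be sketched in a few lines. The synonym table, the ID "00" for the unchanged text, and the helper names below are assumptions for illustration only; the point is that the support set contains just $2^k$ texts for $k$ replaceable words.

```python
from itertools import product

# Hypothetical synonym table for the two-word text "big cat".
SYNONYMS = [("big", "large"), ("cat", "feline")]


def build_codebook(synonyms):
    """Map each bit-string ID to one synonym combination.

    Bit b at position k selects synonyms[k][b].
    """
    codebook = {}
    for bits in product((0, 1), repeat=len(synonyms)):
        watermark_id = "".join(str(b) for b in bits)
        codebook[watermark_id] = " ".join(
            options[b] for options, b in zip(synonyms, bits)
        )
    return codebook


def watermark(watermark_id, codebook):
    """Select the text whose synonym combination encodes the ID."""
    return codebook[watermark_id]


def verify(text, codebook):
    """Return the ID whose text matches exactly, or None if nothing matches."""
    for watermark_id, candidate in codebook.items():
        if candidate == text:
            return watermark_id
    return None


codebook = build_codebook(SYNONYMS)
print(codebook)                               # four texts in total
print(verify("large cat", codebook))          # 10
print(verify("cat that is large", codebook))  # None: a paraphrase leaves the support set
```

With two replaceable words, only four IDs are available, and any rewording outside the table cannot be matched by this exact-match verifier.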

However, LLMs can come up with many more possibilities and access a larger $\mathbb{T}_{c,s}^{\mathcal{W}}$ compared to schemes like Basic using mechanical paraphrasing rules (e.g., synonym replacement). Past works have shown that LLMs can effectively paraphrase text given suitable prompts (Shu et al., 2024; Witteveen et al., 2019). For example, while synonym replacement can only generate possibilities involving word replacements, an LLM may be able to completely reorder, break, or fuse sentences while preserving semantic content $c$. In general, as some expressions are more common, we can associate a probability distribution $p_c(T)$ over this set $\mathbb{T}_{c,s}^{\mathcal{W}}$.

Intuitively, we can consider a suitable paraphrasing prompt combined with text $T_{\text{o}}$ as tokens $\hat{c}$ that can constrain an LLM’s text generation to $\mathbb{T}_{c,s}^{\mathcal{W}}$. Given $\hat{c}$, the LLM autoregressively accesses $p_c(T)$ by producing conditional probability distributions $p(w_j \mid \hat{w}_{1:j-1}, \hat{c})$ for token $w_j$ at step $j$ given the preceding sampled tokens $\hat{w}$, and sampling at each step until it deems that it has conveyed $c$. Specifically, at step $j$, the LLM generates a vector of logits $L_j(\hat{w}_{1:j-1}, \hat{c}) \in \mathbb{R}^{|\mathbb{V}|}$, where

$$p(w_j \mid \hat{w}_{1:j-1}, \hat{c}) = \text{softmax}(L_j(\hat{w}_{1:j-1}, \hat{c})). \tag{1}$$

We denote LLMs used this way as LLM paraphrasers. By using LLM paraphrasers, we significantly increase $\mathbb{T}_{c,s}^{\mathcal{W}}$, which helps us better meet the fidelity, robust verifiability and scalability desiderata.
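
The autoregressive sampling of Equation 1 can be sketched as follows. This is a toy illustration only: `toy_logits` is a hypothetical stand-in for the LLM's logits $L_j(\hat{w}_{1:j-1}, \hat{c})$, and the four-token vocabulary is invented for the example.

```python
import math
import random

VOCAB = ["alpha", "beta", "gamma", "<eos>"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, vocab, rng):
    """Sample one token from softmax(logits), as in Equation 1."""
    return rng.choices(vocab, weights=softmax(logits), k=1)[0]


def toy_logits(prefix):
    """Hypothetical logits function: <eos> becomes likelier as the prefix grows."""
    return [3.0, 2.0, 1.0, float(len(prefix))]


def generate(logits_fn, vocab, eos, rng, max_steps=20):
    """Sample tokens step by step until <eos> or the step limit is reached."""
    tokens = []
    for _ in range(max_steps):
        token = sample_token(logits_fn(tokens), vocab, rng)
        if token == eos:
            break
        tokens.append(token)
    return tokens


print(generate(toy_logits, VOCAB, "<eos>", random.Random(0)))
```

A real LLM paraphraser would condition the logits on the paraphrasing prompt and the original text; the loop structure is the only part this sketch is meant to convey.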

Figure 2: Intuition on permutation operators $\mathcal{P}$, $\mathcal{P}^{-1}$ applied on LLM logits $L$ and watermarking signal $G$ with toy example, Vec. (a) $\mathcal{P}$ applied to $L$ in the $V_o$ space results in 6 possible permutations in $V_w$ space. These average to constant vector $\bar{L}$. (b) Similarly, $\mathcal{P}^{-1}$ applied to $G$ in $V_w$ produces permutations in $V_o$. These average to constant vector $\bar{G}$. (c) With $k_\pi$ sampled uniformly from the possible keys $K_\pi$ over multiple LLM generation steps, $L+G$ shows less distortion to $G$ in $V_w$ space, and to $L$ in $V_o$ space.

3.2 Increasing robustness using $n$-gram watermarking with LLM deviation correction

Given the extensive threat model, most watermarking schemes would face a major challenge in meeting the robust verifiability desideratum. For example, $\mathbb{A}_2$ paraphrasing attacks would likely break schemes such as Basic which depend on word ordering (footnote 2), let alone attacks involving further processing by black-box LLMs (e.g., $\mathbb{A}_4$, $\mathbb{A}_5$ attacks). Instead, we could decompose $p_c(T)$ and the watermarked text $T_{\text{w}}$ into multiple signal carriers, and embed the same watermarking signal to all. This way, we adopt a probabilistic approach where each carrier could independently be used to verify a watermark, to withstand attacks that can only corrupt a proportion of carriers.

Footnote 2: Using the example in Section 3.1, “large cat” $\rightarrow$ “cat that is large” would invert the embedded ID “10” to “01”.

Specifically, we could consider each consecutive $n$ tokens in $T_{\text{w}}$ as an $n$-gram carrier unit. At each LLM paraphraser token generation step $j$, we could apply a watermarking operator $\mathcal{W}$ (Section 3.3) that perturbs the logits of Equation 1 based on the ID $\mu$ and past $n-1$ generated tokens: $\check{L}_j = \mathcal{W}[\mu, \hat{w}_{j-n+1:j-1}](L_j(\hat{w}_{1:j-1}, \hat{c}))$. The perturbed logits will cause a detectable bias in each $n$-gram, hence the more $n$-grams that persist after any attack, the higher the verifiability.
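
The dependence of the perturbation on $\mu$ and the preceding $n-1$ tokens can be sketched as below. This is only an illustration of the keying structure: the SHA-256 seeding, the uniform random `delta`, and the `strength` parameter are assumptions made for this sketch and are not the operator $\mathcal{W}$ defined in Section 3.3.

```python
import hashlib
import random


def context_seed(mu, prev_tokens):
    """Derive a deterministic seed from the ID and the preceding tokens."""
    payload = repr((mu, tuple(prev_tokens))).encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")


def perturb_logits(logits, mu, generated, n, strength=1.0):
    """Return logits plus a bias keyed by mu and the last n-1 generated tokens.

    The bias here is a placeholder pseudo-random vector; only its dependence
    on (mu, previous n-1 tokens) mirrors the n-gram watermarking idea.
    """
    prev = generated[-(n - 1):] if n > 1 else []
    rng = random.Random(context_seed(mu, prev))
    delta = [rng.uniform(-1.0, 1.0) for _ in logits]
    return [l + strength * d for l, d in zip(logits, delta)]


logits = [3.0, 2.0, 1.0]
a = perturb_logits(logits, mu=7, generated=[5, 9, 4], n=3)
b = perturb_logits(logits, mu=7, generated=[8, 9, 4], n=3)  # same last 2 tokens
c = perturb_logits(logits, mu=8, generated=[5, 9, 4], n=3)  # different ID
print(a == b, a == c)
```

Because the bias depends only on the local window, an $n$-gram that survives an attack still carries the same bias at verification time, regardless of what happened to the rest of the text.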

Meanwhile, in future generation steps $j'$, the LLM paraphraser will correct deviations from semantic content $c$ and preserve fidelity given sufficient generation steps, as the subsequent logits $L_{j'}(\hat{w}_{1:j'-1}, \hat{c})$ are still conditioned on paraphrasing prompt $\hat{c}$.

This approach increases our framework’s robustness against not just paraphrasing attacks, but also more general LLM-based attacks (e.g., $\mathbb{A}_5$). Past works have shown that language models tend to generate few novel $n$-grams outside their training set for small $n$ (McCoy et al., 2023). Hence, LLMs trained on text with our watermarked $n$-grams may be more likely to generate them in their output. Given sufficient queries to these LLMs, the watermark could then be reliably verified, which we empirically demonstrate in Section 4.

3.3 Increasing scalability with vocab permutation and orthogonal perturbation

Finally, we propose a watermarking operator $\mathcal{W}$ comprising two components: 1) vocab permutation, and 2) orthogonal perturbation. In this section, we will use a toy example (Vec) to show how these components work before presenting their general form. In Vec, we have logits $L = [3, 2, 1]$, indexed by an ordered set $V_o = \{\alpha, \beta, \gamma\}$ representing the token space, e.g., $L(\alpha) = 3$. Figure 2a presents $L$ as a graph ($V_o$ as $x$-axis).

Vocab permutation. The vocab permutation operator $\mathcal{P}$ produces a single permutation of $V_o$ and $L$ for any given key $k_\pi$ (arrow ① in Figure 2a). The inverse operator $\mathcal{P}^{-1}$ reverses the permutation of $\mathcal{P}$ when provided the same key (arrow ② in Figure 2a). As $|V_o| = 3$, there are 6 possible permutations of $L$, plotted as graphs over a new ordered index $V_w = \{a, b, c\}$, which we can interpret as the watermarking space. Then, we define the average permutation operator $\bar{\mathcal{P}}$ acting on $L$ (indexed by $V_o$) as one that takes a sequence of keys $K_\pi$, applies $\mathcal{P}$ to get $L_{k_\pi}$ for each $k_\pi \in K_\pi$, and averages them to get a vector $\bar{L}$ (indexed by $V_w$).
Note that when we use $\bar{\mathcal{P}}$ on $L$ over all possible keys, we get a constant vector (e.g., $\bar{L} = \sum_{i=1}^{6} L_i / 6 = [2, 2, 2]$, ③ in Figure 2a).

Similarly, given a vector $G$ indexed by $V_w$, which we can interpret as the watermark signal, the inverse operator $\mathcal{P}^{-1}$ permutes $G$ and $V_w$ given a key $k_\pi$, mapping it to $V_o$, the LLM-ordered token space (arrow ④ in Figure 2b). $\bar{\mathcal{P}}^{-1}$ acting on $G$ analogously averages over all keys, and will also give a constant vector indexed over $V_o$ (e.g., $\bar{G} = \sum_{i=1}^{6} G_i / 6 = [0, 0, 0]$, ⑥ in Figure 2b).
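
The two averages in the Vec example can be checked directly by enumerating all six permutations. The values of $G$ are not given in the text, so the zero-sum vector below is a hypothetical choice consistent with $\bar{G} = [0, 0, 0]$; the helper name is likewise illustrative.

```python
from itertools import permutations


def average_over_all_keys(vec):
    """Average every permutation of vec, position by position."""
    perms = list(permutations(vec))
    return [sum(column) / len(perms) for column in zip(*perms)]


L = [3, 2, 1]   # logits from the Vec example
G = [0, 1, -1]  # hypothetical zero-sum watermark signal (assumed values)

print(average_over_all_keys(L))  # [2.0, 2.0, 2.0]
print(average_over_all_keys(G))  # [0.0, 0.0, 0.0]
```

Each entry lands in each position equally often, so the average is always the constant vector whose value is the mean of the entries.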

This leads to an interesting insight: the permutation operators provide a way for us to add watermark signals to logits in a deterministically shifting $V_w$ space (based on a sequence of keys) to boost verifiability and fidelity. For illustration, assume that an LLM paraphraser produces $L$ (in $V_o$-space) for all token generation steps. We use a long sequence $K_\pi$ of pseudo-random uniformly sampled keys to apply $\mathcal{P}$ on $L$ multiple times ($n$-gram watermarking), and add the same watermarking signal $G$ in each resulting $V_w$ space for all instances. If we apply $\bar{\mathcal{P}}^{-1}$ with $K_\pi$ on the perturbed signal $L+G$, the distortion from the permuted $L$ will effectively contribute only uniform background noise to $G$ (⑦ in Figure 2c), which improves verifiability. If we instead convert $L+G$ back to $V_o$ space (for token generation) with $\mathcal{P}^{-1}$ for all steps and apply $\bar{\mathcal{P}}$, we get the original logits with only uniform background noise from watermarking (⑧ in Figure 2c), which improves fidelity.
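
The insight above can be illustrated numerically on the Vec example. The sketch below enumerates every key instead of sampling a long pseudo-random sequence (uniform sampling would approach the same averages), and the values of $G$ and the helper names are assumptions made for illustration.

```python
from itertools import permutations

L = [3.0, 2.0, 1.0]   # logits indexed by V_o (Vec example)
G = [0.0, 1.0, -1.0]  # hypothetical zero-sum watermark signal indexed by V_w


def permute(vec, perm):
    """V_o -> V_w: output entry j takes input entry perm[j]."""
    return [vec[p] for p in perm]


def inverse_permute(vec, perm):
    """V_w -> V_o: undo `permute` for the same perm."""
    out = [0.0] * len(vec)
    for j, p in enumerate(perm):
        out[p] = vec[j]
    return out


def mean(vectors):
    """Position-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


keys = list(permutations(range(len(L))))  # each key selects one permutation

# Add the same G to the permuted logits in every V_w space.
perturbed_w = [[l + g for l, g in zip(permute(L, k), G)] for k in keys]
# Map each perturbed vector back to V_o, where tokens are actually sampled.
perturbed_o = [inverse_permute(x, k) for x, k in zip(perturbed_w, keys)]

print(mean(perturbed_w))  # [2.0, 3.0, 1.0] = G + constant 2
print(mean(perturbed_o))  # [3.0, 2.0, 1.0] = L + constant 0
```

Averaged in $V_w$ space, the logits flatten to a constant and $G$ stands out; averaged in $V_o$ space, the watermark flattens to a constant and $L$ is preserved.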

Figure 3: Left: Watermarking schematic. ① LLM paraphraser takes in $T_o$, produces initial logits. ② $k_\pi$ and $k_p$ from ID $\mu$ and metadata $k_p$ for vocab permutation and perturbation function. ③ Perturb logits with Section 3.3. ④ Sample perturbed logits, feed past tokens to the next iteration. Right: Verification schematic. ① Permute tokens from $T_{\text{sus}}$ into $V_w$ with $\mu$ and preceding $n-1$ tokens, to get average cumulative distribution. ② Compute perturbation function $\mathcal{F}_1(k_p)$ linked to $\mu$. ③ Compute verification score as inner product of $\mathcal{F}_1(k_p)$ and cumulative distribution, and compare with threshold.

More generally, we define the vocab permutation operator $\mathcal{P}$ and its inverse $\mathcal{P}^{-1}$ as pseudorandom permutations over ordered sets $V_o$ and $V_w$ given a key $k_\pi\in\mathbb{K}_\pi$:

\[
\begin{aligned}
\mathcal{P}(k_\pi,V_o) &= V_o^{k_\pi}\\
\mathcal{P}^{-1}(k_\pi,V_w) &= V_w^{k_\pi}\\
\mathcal{P}^{-1}(k_\pi,\mathcal{P}(k_\pi,V_o)) &= V_o,
\end{aligned}
\tag{2}
\]

where $V_o^{k_\pi}$, $V_w^{k_\pi}$ are uniform-randomly chosen permutations of $V_o$ and $V_w$ if $k_\pi$ is sampled randomly. For a function $L$ over $V_o$ mapped to a vector of length $|V_o|$, we have $L(\mathcal{P}(k_\pi,V_o))=L(V_o^{k_\pi})$ and we overload notation by defining $\mathcal{P}(k_\pi,L(\cdot))\triangleq L(\mathcal{P}(k_\pi,\cdot))=L_{k_\pi}$.
As in the Vec example, $\mathcal{P}$ applied to a function (vector) can be viewed as the same function but with its domain permuted.

We then define an average operator $\bar{\mathcal{P}}$ over a sequence of keys $K_\pi$ acting on a function $L$,

\[
\bar{\mathcal{P}}(K_\pi,L)\triangleq\frac{1}{|K_\pi|}\sum_{k_\pi\in K_\pi}\mathcal{P}(k_\pi,L), \tag{3}
\]

which outputs an average function of $L$ over $V_w$ (denoted as $\bar{L}$). $\bar{\mathcal{P}}(K_\pi,L)$ will flatten towards a constant function over $V_w$ for a sufficiently large $K_\pi$. To achieve this for our framework, we set $K_\pi=\{k_\pi\mid k_\pi=h_\pi(\mu,\hat{w}_{j-n+1:j-1})\}_j$ for all LLM paraphrasing steps $j$, where $h_\pi$ is a hash function that generates pseudorandom $K_\pi$ sequences. Empirically, we observe both the flattening and clear watermarking signals (see Figure 7 in Appendix).
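The flattening behaviour of Equation 3 can be checked numerically with a small sketch. Everything named here is an illustrative assumption rather than part of the framework's implementation: the helper names, the use of SHA-256 as the hash, the string keys standing in for $h_\pi(\mu, \cdot)$, and the toy six-token vocabulary.

```python
import hashlib
import random

def permutation(key, vocab_size):
    """Pseudorandom permutation of range(vocab_size) seeded by a string key.

    perm[w] is the V_o index placed at position w of the permuted (V_w) ordering.
    """
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    perm = list(range(vocab_size))
    random.Random(seed).shuffle(perm)
    return perm

def permute(key, logits):
    """P(k, L): the same function L, read in the key-dependent V_w ordering."""
    return [logits[o] for o in permutation(key, len(logits))]

def average_permuted(keys, logits):
    """P-bar(K, L): entrywise mean of P(k, L) over all keys in K (Equation 3)."""
    total = [0.0] * len(logits)
    for key in keys:
        for w, value in enumerate(permute(key, logits)):
            total[w] += value
    return [t / len(keys) for t in total]

logits = [5.0, 3.0, 0.0, -1.0, -2.0, -5.0]      # a peaked toy logit vector
keys = [f"toy-id|{j}" for j in range(5000)]     # hypothetical stand-in keys
flat = average_permuted(keys, logits)
mean = sum(logits) / len(logits)
max_dev = max(abs(x - mean) for x in flat)
print([round(x, 2) for x in flat], "max deviation from mean:", round(max_dev, 3))
```

With many keys, every entry of the averaged vector approaches the mean of the original logits, which is the constant function described above; with only a handful of keys the peaks of $L$ remain visible.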

Orthogonal perturbation: Our proposed perturbation operator $\mathcal{F}$ involves two sub-operations acting on $V_w$. It first maps each key $k_p\in\mathbb{K}_p$ to a unique function in a pre-defined family of orthogonal functions, and then adds the chosen perturbation function to the logits $L_j$ of the LLM output in $V_w$ space:

\[
\mathcal{F}_1:\mathbb{K}_p\hookrightarrow\{\phi:V_w\rightarrow\mathbb{R}^{|V_w|}\mid\langle\phi_i,\phi_l\rangle=\delta_{il}\} \tag{4}
\]
\[
\mathcal{F}(k_p,\kappa,L_j)=L_j+\kappa\mathcal{F}_1(k_p) \tag{5}
\]

where $\langle\cdot,\cdot\rangle$ denotes the canonical dot product over $V_w$. Examples of orthogonal function families include the Fourier or square wave basis, discretized over $\mathbb{V}$. The key $k_p=h_p(\mu,z)\in\mathbb{K}_p$ is a client-defined function $h_p$ of ID $\mu$, and also any metadata $z$ (which could be extracted after verification as we demonstrate in Section 4.1) if required. $\kappa$ is a scalar that controls the perturbation magnitude.
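As a concrete instance of Equations 4 and 5, the sketch below builds a discretized Fourier family and checks the orthonormality condition $\langle\phi_i,\phi_l\rangle=\delta_{il}$ numerically. The normalization, the choice to drop the constant and Nyquist terms, and the function names are assumptions made for this illustration; the source only states that a Fourier basis discretized over the vocabulary is one possible family.

```python
import math

def fourier_basis(vocab_size):
    """Orthonormal cosine/sine vectors over positions 0..vocab_size-1.

    Frequencies 1..vocab_size//2 - 1 are used so that every vector has unit
    norm under the plain dot product (constant and Nyquist terms are omitted).
    """
    scale = math.sqrt(2.0 / vocab_size)
    basis = []
    for f in range(1, vocab_size // 2):
        angles = [2.0 * math.pi * f * v / vocab_size for v in range(vocab_size)]
        basis.append([scale * math.cos(a) for a in angles])
        basis.append([scale * math.sin(a) for a in angles])
    return basis

def dot(a, b):
    """Canonical dot product over the vocabulary positions."""
    return sum(x * y for x, y in zip(a, b))

def perturb(logits, phi, kappa):
    """Equation 5: add kappa times the chosen perturbation function to the logits."""
    return [l + kappa * p for l, p in zip(logits, phi)]

basis = fourier_basis(16)
gram_error = max(
    abs(dot(bi, bl) - (1.0 if i == l else 0.0))
    for i, bi in enumerate(basis)
    for l, bl in enumerate(basis)
)
print(len(basis), "functions, largest deviation from the identity Gram matrix:", gram_error)
```

Because distinct functions have zero inner product, a score computed against one member of the family is unaffected (in expectation) by perturbations drawn from the others.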

Combining both components, our watermarking operator (Figure 3, and Algorithm 1 in Appendix) for generation step $j$ involves (a) using $k_\pi=h_\pi(\mu,\hat{w}_{j-n+1:j-1})$ and the permutation operator $\mathcal{P}(k_\pi,L_j)$ to transform logits from the $V_o$ to the $V_w$ space, (b) applying the perturbation operator in Equation 5, and (c) transforming the perturbed logits back to $V_o$ space using $\mathcal{P}^{-1}(k_\pi,\cdot)$ to produce a probability distribution for sampling and generation of the watermarked text $T_{\text{w}}$:

\[
\begin{aligned}
\check{L}_j &= \mathcal{W}(k_\pi,k_p,L_j)\\
&= \mathcal{P}^{-1}(k_\pi,\mathcal{F}(k_p,\kappa,\mathcal{P}(k_\pi,L_j))).
\end{aligned}
\tag{6}
\]

Our verification operator will produce a score by computing the average cumulative token distribution of a text using $\bar{\mathcal{P}}(K_\pi,\cdot)$ and taking the inner product with $\mathcal{F}_1(k_p)$. Applying the right keys $k_p$ and $k_\pi$ on the suspected text $T_{\text{sus}}$ will result in a high score $q$, else the score will be close to 0 (see Figure 3, and Algorithm 2 in Appendix). Using orthogonal functions helps us improve verifiability by avoiding interference from other watermarks (e.g., added by adversaries as an $\mathbb{A}_3$ attack).
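A short worked expansion may help connect Equation 6 to the verification score. It is only a sketch at the level of logits: the actual score is computed from sampled token counts, which pass through a softmax, and both the per-step key notation $k_{\pi,j}$ and the assumption that the perturbation functions have zero mean over $V_w$ are introduced here purely for illustration. Since $\mathcal{P}(k_\pi,\cdot)$ only relabels the domain, it is linear, so

\[
\check{L}_j=\mathcal{P}^{-1}\big(k_{\pi,j},\,\mathcal{P}(k_{\pi,j},L_j)+\kappa\mathcal{F}_1(k_p)\big)=L_j+\kappa\,\mathcal{P}^{-1}\big(k_{\pi,j},\mathcal{F}_1(k_p)\big).
\]

Mapping each step back into $V_w$ space and averaging over $N$ steps gives

\[
\frac{1}{N}\sum_{j=1}^{N}\mathcal{P}(k_{\pi,j},\check{L}_j)=\underbrace{\frac{1}{N}\sum_{j=1}^{N}\mathcal{P}(k_{\pi,j},L_j)}_{\approx\,c\,\mathbf{1}\text{ for large }N}+\kappa\,\mathcal{F}_1(k_p),
\]

and taking the inner product with a candidate $\mathcal{F}_1(k_p')$ leaves

\[
\big\langle c\,\mathbf{1}+\kappa\,\mathcal{F}_1(k_p),\ \mathcal{F}_1(k_p')\big\rangle\approx\kappa\,\delta_{k_p k_p'}.
\]

Under these assumptions the flattened logits contribute nothing to the score, the matching perturbation function contributes $\kappa$, and any other member of the orthogonal family contributes approximately zero.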

Notice that the many possible vocab permutations ($|\mathbb{V}|!$) and perturbation functions in any orthogonal function family ($|\mathcal{F}_1|$) allow for a much larger set of IDs compared to schemes like Basic, helping with scalability. For example, up to $|\mathcal{F}_1|\cdot|\mathbb{V}|!$ IDs can each be assigned a unique permutation-perturbation function pair for watermarking. Using a relatively small $|\mathbb{V}|=32000$ and the Fourier basis over that would yield a maximum $|\mathbb{M}|\sim 10^{130274}$. Schemes like Basic only support $M$ that scales with the number of possible synonym replacements for a given text.
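The order of magnitude quoted above can be reproduced with a few lines of arithmetic. The sketch assumes that the Fourier family contributes roughly $|\mathcal{F}_1|\approx|\mathbb{V}|=32000$ functions; the exact family size is not stated in this section, but it only shifts the exponent by a fraction of a unit.

```python
import math

def log10_max_ids(vocab_size, num_perturbation_functions):
    """log10 of |F_1| * |V|!, computed with lgamma to avoid huge integers."""
    log10_factorial = math.lgamma(vocab_size + 1) / math.log(10)
    return log10_factorial + math.log10(num_perturbation_functions)

# Assumption: one perturbation function per vocabulary entry.
exponent = log10_max_ids(vocab_size=32000, num_perturbation_functions=32000)
print(f"|M| ~ 10^{math.floor(exponent)}")
```

Almost all of the exponent comes from the factorial term, i.e., from the number of vocab permutations rather than from the number of perturbation functions.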

In addition, with orthogonal functions, our framework also allows for the embedding of metadata during watermarking. For example, a client can use $\mu$ to verify that $T_{\text{sus}}$ is watermarked, and also extract information on which article it was plagiarized from (Algorithm 3). We demonstrate this empirically in Section 4.1 using the Fourier basis as perturbation functions and the Discrete Fourier Transform (DFT) for extraction.

3.4 Waterfall Framework

Our watermarking framework, Waterfall, combines these insights into a structured watermarking/verification process. For watermarking (Figure 3 left), given $T_{\text{o}}$ and $\mu$, Waterfall uses an LLM paraphraser to autoregressively paraphrase a text $T_{\text{o}}$, producing initial logits for the new text $T_{\text{w}}$ [Step ①]. The ID $\mu$ is used to seed the vocab permutation operator (Section 3.3) for mapping the logits to $V_w$ space, and to choose the perturbation function (Equation 5) [Step ②], both of which will be used in the watermarking operation (Section 3.3) to produce the perturbed logits [Step ③]. The LLM samples the perturbed logits to produce a watermarked token, and for the next token loop, the past $n-1$ tokens are used to seed vocab permutation while all past tokens are fed as context, which helps the LLM paraphraser maintain $T_{\text{w}}$ fidelity despite watermarking [Step ④].

For verification (Figure 3 right), each token in $T_{\text{sus}}$ is counted in $V_w$ space as specified by $\mu$ and the previous tokens in the same $n$-gram unit, producing an average cumulative token distribution [Step ①]. The ID $\mu$ also specifies a specific perturbation function [Step ②], which is used to perform an inner product with the cumulative distribution to compute a verification score $q$ [Step ③].
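The watermarking and verification loops described above can be sketched end to end on a toy problem. This is not the reference implementation: the "paraphraser" below emits all-zero logits instead of calling an LLM, the vocabulary has 100 tokens, SHA-256 stands in for the hash $h_\pi$, a single cosine wave stands in for $\mathcal{F}_1(k_p)$, and all function and constant names are hypothetical.

```python
import hashlib
import math
import random

VOCAB = 100  # toy vocabulary size (assumption; real tokenizers are far larger)
KAPPA = 4.0  # watermark strength kappa
CONTEXT = 1  # n - 1 preceding tokens feed the permutation key (n = 2 here)

def permutation(mu, context):
    """perm[w] is the V_o token sitting at V_w position w, seeded by (mu, context)."""
    digest = hashlib.sha256(f"{mu}|{context}".encode()).digest()
    perm = list(range(VOCAB))
    random.Random(int.from_bytes(digest[:8], "big")).shuffle(perm)
    return perm

def perturbation(freq):
    """A single cosine wave over V_w positions, standing in for F_1(k_p)."""
    return [math.cos(2 * math.pi * freq * w / VOCAB) for w in range(VOCAB)]

def watermark(mu, freq, length, rng):
    """Generate `length` tokens from a toy paraphraser whose logits are all zero."""
    phi = perturbation(freq)
    tokens = [0]  # arbitrary start token, never scored
    for _ in range(length):
        perm = permutation(mu, tuple(tokens[-CONTEXT:]))
        logits_o = [0.0] * VOCAB
        weights_o = [0.0] * VOCAB
        for w, o in enumerate(perm):
            # Permute to V_w, add kappa * phi there, and map straight back to V_o.
            weights_o[o] = math.exp(logits_o[o] + KAPPA * phi[w])
        tokens.append(rng.choices(range(VOCAB), weights=weights_o)[0])
    return tokens

def verify(tokens, mu, freq):
    """Score = inner product of the average V_w token distribution with phi."""
    phi = perturbation(freq)
    counts = [0] * VOCAB
    for j in range(1, len(tokens)):
        perm = permutation(mu, tuple(tokens[j - CONTEXT:j]))
        counts[perm.index(tokens[j])] += 1  # count the token at its V_w position
    scored = len(tokens) - 1
    return sum(c / scored * p for c, p in zip(counts, phi))

rng = random.Random(0)
text = watermark("client-A", freq=3, length=400, rng=rng)
score_right = verify(text, "client-A", freq=3)
score_wrong = verify(text, "client-B", freq=3)
print(f"correct ID: {score_right:.3f}, other ID: {score_wrong:.3f}")
```

In this toy setting the score under the correct ID is large, while a different ID scrambles the token positions and yields a score near zero. A real paraphraser has non-uniform logits, so the separation depends on $\kappa$ and text length, as explored in Section 4.1.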

Practical considerations. Waterfall is highly adaptable, i.e., it can be implemented with different LLMs as paraphrasers, allowing our framework to achieve better watermarking performance and support more text types as the LLM landscape evolves. Methods like prompt engineering (Wei et al., 2022; Lin et al., 2023) and Reflexion (Shinn et al., 2023; Madaan et al., 2023) may also help to boost performance in some settings, as we demonstrate in our code watermarking experiments (Section G.2). We elaborate further on possible large-scale deployment methods of Waterfall and other practical considerations in Appendix L.

4 Experiments

Figure 4: Higher watermarking strength $\kappa$ improves verifiability and extraction accuracy. (a) Increasing $\kappa$ trades off fidelity for higher verifiability. (b) Longer token length $N$ improves verifiability. (c) Combining more pieces of text improves extraction accuracy towards 100%. Extraction accuracy is significantly higher than random guess accuracy of 0.003125%.

Figure 5: More queries improve LLM provenance verifiability. Increasing number of clients $M$ in the training dataset only slightly decreases verifiability.

4.1 Data ownership

For watermarking of text articles, we demonstrate the effectiveness of Waterfall with experiments using text samples $T_{\text{o}}$ from the c4 realnewslike dataset (Raffel et al., 2020), comprising articles with a mean token length of 412. The experiments mirror realistic scenarios, e.g., news outlets watermarking their articles before publishing them so that they can effectively scan the internet for, and verify, plagiarized content (Brewster et al., 2023). For this setting, we evaluate the semantic similarity $\mathcal{S}$ using the Semantic Textual Similarity (STS) score based on the all-mpnet-base-v2 model (https://huggingface.co/sentence-transformers/all-mpnet-base-v2); $\mathcal{S}$ for sample text pairs are provided in Section E.5.

For benchmarks, we consider two recent linguistics-based watermarking methods: M-bit by Yoo et al. (2023) and P-nlw by Qiang et al. (2023). These methods are advanced variants of Basic that use deep learning to improve watermarking performance (details in Section E.3). To implement Waterfall, we use llama-2-13b-hf (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as the paraphraser, and the Fourier basis for the perturbation functions. Additional details such as paraphrasing prompts are in Appendix E.

Fidelity-verifiability. We consider the fidelity and verifiability of the schemes before adversarial attacks. Verifiability is computed as the AUROC based on varying their respective classification thresholds, i.e., the verification score threshold $\bar{q}$ for Waterfall, and the bit-error rate threshold for M-bit and P-nlw.
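Sweeping a threshold over the verification scores and taking the area under the resulting ROC curve is equivalent to asking how often a watermarked text outscores an unwatermarked one. A minimal sketch of that pairwise formulation follows; the function name and the example scores are made up for illustration and are not taken from the experiments.

```python
def auroc(watermarked_scores, clean_scores):
    """Fraction of (watermarked, clean) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for pos in watermarked_scores:
        for neg in clean_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(watermarked_scores) * len(clean_scores))

# Hypothetical scores: 8 of the 9 pairs are ranked correctly.
example = auroc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
print(round(example, 3))
```

An AUROC of 1 means some threshold separates the two groups perfectly, while 0.5 corresponds to chance.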

Waterfall allows for adjustable watermarking strength to calibrate the fidelity-verifiability trade-off based on the clients' use cases. Figure 4a shows the Pareto frontier of the trade-off. Stronger watermark strength $\kappa$ improves verifiability but also introduces larger distortions to the LLM paraphrasing process, decreasing the fidelity of watermarked text. For our experiments, we mainly used $\kappa=6$, achieving a mean AUROC of 0.992 and STS of 0.887. Even for shorter texts of just 100 tokens (about 65 words), Waterfall achieves high verifiability with an AUROC of 0.98 (Figure 4b). Additional results are in Section E.4.

Note that M-bit and P-nlw were designed with only one setting, allowing for only a single fidelity-verifiability score, with mean STS scores of 0.998 and 0.942 respectively, and corresponding AUROC scores of 0.987 and 0.882. While the STS scores are high, this is expected given that the schemes only make minor edits to $T_{\text{o}}$, which also makes them more fragile to attacks, as we will see later. Additionally, the word replacements by M-bit and P-nlw introduced noticeable linguistic errors that are difficult to evaluate and not captured by the STS score (shown in Section E.5).

Robust verifiability. We consider the various classes of attacks $\mathbb{A}$ mentioned in Section 2. Details of the setup for each attack and additional results are in Appendix F.

$\mathbb{A}_1$ attacks are insertion, deletion, and synonym substitution attacks that are often considered in past works. As shown in Figure 6, robust verifiability of Waterfall shows only a slight decrease even with strong attacks on 20% of words, while that of M-bit and P-nlw falls drastically with increasing attack strength.

$\mathbb{A}_2$ involves translation and paraphrasing attacks, which are more realistic and effective attacks that can achieve higher fidelity and verification reduction than $\mathbb{A}_1$ and had not been considered by past text watermarking works. We perform a translation attack that translates the watermarked text to Spanish and back to English, and a paraphrasing attack that paraphrases the watermarked text. Again, the verifiability of Waterfall remains significantly higher than that of the benchmarks post-attack.

$\mathbb{A}_3$ involves using the same scheme to try to overwrite the existing watermark with another watermark. For Waterfall, the 1st watermark remains verifiable even after the 2nd is added, given the design of $\mathcal{P}$ and $\mathcal{F}$ with vocab permutation and orthogonal perturbation functions that minimize interference of the 2nd watermark on the 1st. However, this attack destroys the verifiability of M-bit and P-nlw, as the 2nd watermark process almost always chooses the same word positions as the original process, overwriting $\mu_1$. Furthermore, the benchmark schemes extract $\mu_1$ as part of verification, enabling targeted overlap watermark attacks, which we demonstrate in Section F.3.

$\mathbb{A}_4$ uses $T_{\text{w}}$ for in-context prompting of any LLM to perform tasks that rely on the IP or semantic content of $T_{\text{w}}$. For illustration, we considered the case where adversaries use an LLM to answer questions regarding watermarked articles. As this attack totally changed the structure of the texts, the watermarks of M-bit and P-nlw were removed. However, with Waterfall, watermarks were still verifiable due to the preservation of watermarked $n$-grams from the context to the response.

$\mathbb{A}_5$, which involves using text containing IP for unauthorized LLM training such as fine-tuning, will be discussed in Section 4.3.

Figure 6: Waterfall demonstrates robust verifiability under $\mathbb{A}_1$ (insertion, deletion, and synonym substitution attacks) with minimal degradation in AUROC compared to M-bit and P-nlw.

Table 1: Robust verifiability under attacks: translation $\mathbb{A}_{2-T}$, paraphrase $\mathbb{A}_{2-P}$, overlap watermark $\mathbb{A}_3$ and in-context prompting $\mathbb{A}_4$.

| Method | Pre-attack | $\mathbb{A}_{2-T}$ | $\mathbb{A}_{2-P}$ | $\mathbb{A}_3$ | $\mathbb{A}_4$ |
|---|---|---|---|---|---|
| Waterfall | 0.992 | 0.951 | 0.881 | 0.815 | 0.775 |
| P-nlw | 0.885 | 0.475 | 0.508 | 0.724 | 0.502 |
| M-bit | 0.988 | 0.567 | 0.363 | 0.664 | 0.525 |

Scalability. As mentioned in Section 3.3, Waterfall has a large maximum scalability of $M=|\mathcal{F}_1|\cdot|\mathbb{V}|!\sim 10^{130274}$ based on our implementation using the Fourier perturbation function and the Llama-2 model as the LLM paraphraser. In comparison, M-bit and P-nlw have scalability dependent on the number of possible synonym replacements in any given text, which is limited by text length and varies across texts. On the c4 dataset with a mean article length of 355 words, M-bit and P-nlw can only embed a mean of 9.5 bits ($M\sim 10^{3}$) and 23.2 bits ($M\sim 10^{10}$) respectively.

In practice, scalability is further limited by how well the schemes can differentiate among similar watermarks. For example, the verification operation of M-bit and P-nlw may not be able to distinguish 2 IDs differing by 1 bit (see Section E.7.3 for details). To demonstrate this, we watermarked $T_{\text{w}}^{(i)}$ with $\mu_i$ and computed the verifiability of $T_{\text{w}}^{(i)}$ against 1000 randomly selected IDs $\mu_{j\neq i}$. We found that for Waterfall, all of the IDs achieved very high AUROC, while M-bit and P-nlw have many IDs with low AUROC: the 1st percentile AUROC for Waterfall, M-bit, and P-nlw are 0.976, 0.614, and 0.766 respectively. Details and further results on scalability up to 100,000 IDs are in Section E.7.

Metadata extraction. We also demonstrate how Waterfall could be used to embed metadata while watermarking. We consider metadata $k_p\in\{1,2,\dots,31999\}$, and the task is to extract the embedded $k_p$ if the text has been verified as watermarked with $\mu$. We do so by using $k_p$ as the frequency of the Fourier perturbation function $\mathcal{F}_1$, and perform extraction with the DFT. Figure 4c shows the extraction accuracy of Waterfall for different perturbation magnitudes $\kappa$. By taking multiple samples of $T_{\text{w}}$, multiple articles could be combined together to improve extraction accuracy. For $\kappa=6$, accuracy increases from 48% to 99% with only 5 pieces of text. Details are in Section E.8.
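The extraction step can be illustrated with a noise-free sketch: if the metadata sets the frequency of a cosine perturbation, the token distribution in $V_w$ space carries a peak at that frequency, which a DFT recovers. The code below uses the expected distribution for uniform logits rather than sampled token counts, restricts frequencies to below half the vocabulary size so that a real cosine is unambiguous, and uses a 64-token vocabulary; all of these, and the function names, are simplifying assumptions for illustration.

```python
import cmath
import math

def expected_distribution(freq, kappa, vocab_size):
    """Token distribution in V_w space when uniform logits get + kappa * cos(...)."""
    weights = [
        math.exp(kappa * math.cos(2 * math.pi * freq * w / vocab_size))
        for w in range(vocab_size)
    ]
    total = sum(weights)
    return [x / total for x in weights]

def dft_magnitudes(signal):
    """|DFT| of a real sequence for frequencies 0..len(signal)//2 (naive O(V^2))."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(signal)))
        for f in range(n // 2 + 1)
    ]

def extract_frequency(distribution):
    """Return the non-zero frequency with the largest DFT magnitude."""
    mags = dft_magnitudes(distribution)
    return max(range(1, len(mags)), key=lambda f: mags[f])  # skip the DC term

dist = expected_distribution(freq=5, kappa=6.0, vocab_size=64)
print(extract_frequency(dist))
```

With real text the distribution is estimated from a finite number of tokens, so the peak competes with sampling noise; pooling counts from several texts sharpens the estimate, which is consistent with the accuracy gains reported above.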

Computational costs. We note that Waterfall also has lower computational cost compared to benchmarks (Table 2). Waterfall verification can be run in parallel on a CPU, requiring only 0.035s when run on a single 16-core CPU, which is 75x and 4237x faster than M-bit and P-nlw respectively, both of which require inference using deep learning models. This is important in the context of protection of IP, e.g., where data providers may have to scan through large amounts of online data for any IP infringement. Further discussion on the deployment costs is in Appendix L.

Table 2: Mean compute time over 100 texts on 1 Nvidia RTX A5000. *Note that verification for Waterfall was performed only on CPU without requiring a GPU.

| | Waterfall | M-bit | P-nlw |
|---|---|---|---|
| Watermark | 24.8s | 2.97s | 147s |
| Verification | 0.035s* | 2.61s | 148s |

4.2 Watermarking of code

To demonstrate the versatility of Waterfall, we also consider its out-of-the-box performance on code watermarking. We used the MBJSP dataset (Athiwaratkun et al., 2023), and evaluate fidelity $\mathcal{S}(T_{\text{o}},T_{\text{w}})$ using the pass@10 metric (Kulal et al., 2019; Chen et al., 2021) achieved by $T_{\text{w}}$ on functional tests for the original code $T_{\text{o}}$. We compare Waterfall, implemented using Phind-CodeLlama-34B-v2 (https://huggingface.co/Phind/Phind-CodeLlama-34B-v2) as the paraphraser, with SrcMarker (Yang et al., 2024), a recent state-of-the-art code watermarking scheme, configured for 16-bit watermarks. Experimental details are in Appendix G.
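For readers unfamiliar with pass@k, the sketch below implements the unbiased estimator popularized by Chen et al. (2021): given $n$ generated samples of which $c$ pass the functional tests, it estimates the probability that at least one of $k$ randomly drawn samples passes. Whether the experiments here use exactly this estimator or a direct 10-sample check is not stated in this section, so treat it as background rather than a description of the evaluation pipeline.

```python
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n (c correct) passes."""
    if n - c < k:
        return 1.0  # too few failing samples to fill a draw of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 20 watermarked variants, 5 of which pass the tests.
print(pass_at_k(n=20, c=5, k=1), pass_at_k(n=20, c=5, k=10))
```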

We found that, surprisingly, Waterfall achieves higher verifiability and robust verifiability (after $\mathbb{A}_2$ LLM paraphrasing attacks) compared to SrcMarker while maintaining high code fidelity (Table 3). This is despite Waterfall not requiring any manual training/engineering of programming language-specific watermarking rules, which SrcMarker does. Instead, Waterfall inherits its code capabilities from its LLM paraphraser, making it easily adaptable to other languages (e.g., see Section G.5 for Python code results).

Table 3: Fidelity, Verifiability, and Robust Verifiability of Waterfall with $\kappa=3$ on code watermarking.

| Method | Fidelity (Pass@10) | Verifiability (AUROC), pre-attack | Verifiability (AUROC), post-attack | Scalability (# of users) |
|---|---|---|---|---|
| SrcMarker | 0.984 | 0.726 | 0.662 | $10^{5}$ |
| Waterfall | 0.969 | 0.904 | 0.718 | $10^{130274}$ |

4.3 LLM data provenance of articles

Finally, we explore how Waterfall watermarks may persist after LLM fine-tuning, allowing us to use them for LLM data provenance. We consider the setting where client $i$ watermarks a set of text $\{T_{\text{w}}^{(i)}\}$ that adversaries use, without authorization, to fine-tune their own LLMs (i.e., $\mathbb{A}_5$ attacks). Given multiple queries to the fine-tuned black-box LLM, the goal is for client $i$ to be able to verify that $\{T_{\text{w}}^{(i)}\}$ had been used for training. This setting mirrors realistic scenarios where content owners want to detect unauthorized use of data for LLM training (Novet, 2024).

For our experiments, we watermarked the ArXiv dataset (Clement et al., 2019), which consists of scientific paper abstracts categorized into topics. Each topic category is associated with a unique client ID $\mu$ with $4000$ texts. These texts are then used to fine-tune gpt2-xl (https://huggingface.co/openai-community/gpt2-xl) using the LoRA framework (Hu et al., 2022) (details in Section H.1). Note that this is a different model from the one used for watermarking; we chose it to demonstrate that our watermark can persist despite the models’ different tokenizers. We verified that using the watermarked dataset instead of the original dataset has minimal effect on the fidelity of the fine-tuned model (details in Section H.3).

Verifiability. To evaluate verifiability, we queried the fine-tuned model with the first 50 tokens of a randomly chosen abstract, and applied the verifiability operator on the next 100 newly generated tokens to test for the associated watermark (details in Section H.2). Our results, presented in Figure 5, show that Waterfall has high verifiability, reaching an AUROC of 1.0 with just 100 queries to the fine-tuned LLM.
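As a reminder of what the AUROC figures measure, the sketch below computes AUROC directly from two lists of verification scores, as the probability that a score from watermarked data exceeds a score from unwatermarked data (ties counted as one half). It is a generic illustration under assumed inputs: the function name and example scores are hypothetical, and how scores are pooled across queries in these experiments is specified in Section H.2, not here.

```python
def auroc(positive_scores, negative_scores):
    """AUROC as the pairwise win rate of positives over negatives.

    positive_scores: verification scores where the watermark is present
    negative_scores: verification scores where it is absent
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for q in negative_scores:
            if p > q:
                wins += 1.0   # positive correctly ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half a win
    return wins / (len(positive_scores) * len(negative_scores))
```

An AUROC of 1.0 means some threshold separates the two groups perfectly, while 0.5 corresponds to chance-level ranking.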

Scalability. To explore the scalability of Waterfall for data provenance, we combined the datasets of different numbers of clients, $M \in \{1, 5, 10, 20, 100\}$, each watermarked with their own unique ID $\mu$, and used the combined dataset for fine-tuning the adversarial model. As expected, Figure 5 shows that an aggregated dataset mixing a larger number $M$ of different watermarks results in a decrease in verifiability. However, our results indicate that this decrease levels off from $M=20$ to $M=100$ and still allows for an AUROC (verifiability) of 1.0 with around 100 queries even for $M=100$, demonstrating the scalability of Waterfall to a sizable number of clients.

5 Related Work

Early text watermarking techniques (Kamaruddin et al., 2018; Taleby Ahvanooey et al., 2019) primarily depend on structural adjustments (e.g., text formatting, use of different Unicode characters (Rizzo et al., 2019)), image-based techniques (e.g., pixel-adjustments of text), or semantic watermarking (e.g., substituting synonyms like Basic described in Section 3.1). Recent works have augmented the latter with deep learning and language models for better performance (Qiang et al., 2023; Yoo et al., 2023; Ueoka et al., 2021; Abdelnabi and Fritz, 2021). However, as we showed in our experiments, these schemes are not robust to the range of practical LLM-enabled attacks possible today.

A recently popular but separate line of work has focused on the different, model-centric problem setting of watermarking output newly generated by a single LLM (Kirchenbauer et al., 2023; Venugopal et al., 2011; Christ et al., 2023; Kuditipudi et al., 2023; Zhao et al., 2023), rather than existing text owned by many clients. Hence, these works do not address our problem desiderata such as achieving scalability and robust verifiability while requiring semantic preservation of the original text. Our work, which focuses on data-centric watermarking of original text, is the first to use LLM paraphrasers with a novel combination of techniques that are surprisingly effective in addressing the text data ownership and LLM data provenance settings. For further elaboration on the differences, see Appendix J.

6 Discussion and Conclusion

We proposed Waterfall, the first training-free framework for text watermarking that has low computational cost, scalability to a large number of clients, and robustness to LLM attacks, including unauthorized training of LLMs that generate IP-infringing text.

There is currently a lack of actual, practical large-scale deployment of text watermarking effective against LLM attacks, given the current SOTA watermarking methods’ limitations and resource requirements. However, Waterfall could provide a foundation for achieving large-scale deployment, with either decentralized or centralized options. This is achievable given Waterfall’s low computational cost, scalability to a large number of clients, and robustness to LLM attacks, including unauthorized training of LLMs that generate IP-infringing text.

Our framework highlights a few perspectives that we hope more would consider. First, while increasingly capable LLMs allow for easier and more sophisticated forms of potential IP infringement, LLMs themselves could also enable better IP protection of original texts. A key strength of Waterfall is that its capabilities grow as LLMs become more powerful, with increasingly better watermarking performance, allowing it to potentially keep up with the increasing capabilities adversaries can use for IP infringement. It is able to achieve a higher fidelity-verifiability Pareto frontier, and reduce any fidelity degradation while using higher watermarking strength for greater robust verifiability.

Second, as open-source LLMs become more prevalent and capable, adversaries could directly use them for IP attacks rather than depend on the services of closed-source LLM providers, allowing them to bypass any IP protection measures that these providers may implement (Piper, 2024). As such, content creators cannot just rely on LLM providers to assist in IP protection, but should instead be equipped with methods such as Waterfall to protect their work before dissemination, such as by injecting robust watermarks that allow verifiability even after both traditional attacks and unauthorized use in LLM training by adversaries.

Third, a general text watermarking framework like Waterfall that can apply across different text types and languages not only helps with practical deployment, but also makes it highly versatile and not dependent on any text-specific properties. This makes it easily adaptable for incorporating new defense methods, providing a strong foundation for future works to build on as new threats emerge.

7 Limitations

As Waterfall relies on the adjustment of the original text to add watermarks, it may not be applicable to all types of text. For example, Waterfall faces limitations in its application to works whose IP value lies in their style or format (e.g., poems), unless additional methods are applied that cause LLMs to largely preserve such styles while paraphrasing these works, such as optimizing for better paraphrasing prompts to be used with more capable LLMs, or iteratively refining the text through multiple rounds of watermarking.

Similar to other linguistics-based text watermarking methods, Waterfall would also not be applicable where changes to the text are unacceptable (e.g. lyrics of a country’s national anthem), or when applied to very short text (e.g. messages of just a few tokens). Nevertheless, Waterfall is still useful for a wide range of settings where the IP lies mainly in the content of the text, and presents a step forward for practical deployment of text watermarking. Future work could build on Waterfall to adapt it to other use cases for data provenance, such as data currency (i.e., ensuring that the data is up-to-date) or data authenticity (i.e., that the data has not been manipulated).

Acknowledgments

This research/project is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD/2023-01-039J). This research is part of the programme DesCartes and is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme. Xinyuan Niu is supported by the Centre for Frontier AI Research of Agency for Science, Technology and Research (ASTAR). Jiangwei Chen is supported by the Institute for Infocomm Research of Agency for Science, Technology and Research (ASTAR). We acknowledge CSC (Finland) for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, and hosted by CSC (Finland) and the LUMI consortium. The access was made possible via collaboration between NSCC (Singapore) and CSC (Finland).

References

  • Abdelnabi and Fritz (2021) Sahar Abdelnabi and Mario Fritz. 2021. Adversarial watermarking transformer: Towards tracing text provenance with data hiding. In Proc. IEEE SP, pages 121–140.
  • Athiwaratkun et al. (2023) Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2023. Multi-lingual evaluation of code generation models. In Proc. ICLR.
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.
  • Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proc. AAAI.
  • Brewster et al. (2023) Jack Brewster, Macrina Wang, and Coalter Palmer. 2023. Plagiarism-bot? How low-quality websites are using ai to deceptively rewrite content from mainstream news outlets. NewsGuard.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  • Christ et al. (2023) Miranda Christ, Sam Gunn, and Or Zamir. 2023. Undetectable watermarks for language models. arXiv preprint arXiv:2306.09194.
  • Clement et al. (2019) Colin B Clement, Matthew Bierbaum, Kevin P O’Keeffe, and Alexander A Alemi. 2019. On the use of arXiv as a dataset. arXiv preprint arXiv:1905.00075.
  • de Zwart (2018) Hans de Zwart. 2018. Turnitin user agreement: I disagree. https://blog.hansdezwart.nl/2018/01/10/turnitin-user-agreement-i-disagree/.
  • Dolan and Brockett (2005) Bill Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proc. IWP.
  • Foltỳnek et al. (2019) Tomáš Foltỳnek, Norman Meuschke, and Bela Gipp. 2019. Academic plagiarism detection: A systematic literature review. ACM Computing Surveys (CSUR), 52(6):1–42.
  • Gao et al. (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
  • Gao et al. (2021) Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation.
  • Harris et al. (2020) Charles R Harris, K Jarrod Millman, Stéfan J Van Der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. 2020. Array programming with numpy. Nature, 585(7825):357–362.
  • Hu et al. (2022) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2022. Lora: Low-rank adaptation of large language models. In Proc. ICLR.
  • Kamaruddin et al. (2018) Nurul Shamimi Kamaruddin, Amirrudin Kamsin, Lip Yee Por, and Hameedur Rahman. 2018. A review of text watermarking: Theory, methods, and applications. IEEE Access, 6:8011–8028.
  • Kirchenbauer et al. (2023) John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. 2023. A watermark for large language models. In Proc. ICML, pages 17061–17084.
  • Kuditipudi et al. (2023) Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, and Percy Liang. 2023. Robust distortion-free watermarks for language models. arXiv preprint arXiv:2307.15593.
  • Kulal et al. (2019) Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. 2019. Spoc: Search-based pseudocode to code. In Proc. NeurIPS.
  • Levesque et al. (2011) Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page 47.
  • Li et al. (2023) Peixuan Li, Pengzhou Cheng, Fangqi Li, Wei Du, Haodong Zhao, and Gongshen Liu. 2023. Plmmark: A secure and robust black-box watermarking framework for pre-trained language models. In Proc. AAAI.
  • Lin et al. (2023) Xiaoqiang Lin, Zhaoxuan Wu, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. 2023. Use your instinct: Instruction optimization using neural bandits coupled with transformers. arXiv preprint arXiv:2310.02905.
  • Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2023. Self-refine: Iterative refinement with self-feedback. In Proc. NeurIPS.
  • McCoy et al. (2023) R Thomas McCoy, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz. 2023. How much do language models copy from their training data? evaluating linguistic novelty in text generation using raven. Transactions of the Association for Computational Linguistics, 11:652–670.
  • Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. In Proc. ICLR.
  • Novet (2024) Jordan Novet. 2024. Eight newspaper publishers sue Microsoft and OpenAI over copyright infringement. https://www.cnbc.com/2024/04/30/eight-newspaper-publishers-sue-openai-over-copyright-infringement.html.
  • Piper (2024) Kelsey Piper. 2024. Should we make our most powerful ai models open source to all? https://www.vox.com/future-perfect/2024/2/2/24058484/open-source-artificial-intelligence-ai-risk-meta-llama-2-chatgpt-openai-deepfake.
  • Qiang et al. (2023) Jipeng Qiang, Shiyu Zhu, Yun Li, Yi Zhu, Yunhao Yuan, and Xindong Wu. 2023. Natural language watermarking via paraphraser-based lexical substitution. Artificial Intelligence, 317:103859.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1–67.
  • Rizzo et al. (2019) Stefano Giovanni Rizzo, Flavio Bertini, and Danilo Montesi. 2019. Fine-grain watermarking for intellectual property protection. EURASIP Journal on Information Security, 2019:1–20.
  • Shinn et al. (2023) Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents with verbal reinforcement learning. In Proc. NeurIPS.
  • Shu et al. (2024) Lei Shu, Liangchen Luo, Jayakumar Hoskere, Yun Zhu, Yinxiao Liu, Simon Tong, Jindong Chen, and Lei Meng. 2024. Rewritelm: An instruction-tuned large language model for text rewriting. In Proc. AAAI.
  • Taleby Ahvanooey et al. (2019) Milad Taleby Ahvanooey, Qianmu Li, Jun Hou, Ahmed Raza Rajput, and Yini Chen. 2019. Modern text hiding, text steganalysis, and applications: A comparative analysis. Entropy, 21(4):355.
  • Ueoka et al. (2021) Honai Ueoka, Yugo Murawaki, and Sadao Kurohashi. 2021. Frustratingly easy edit-based linguistic steganography with a masked language model. In Proc. NAACL, pages 5486–5492.
  • Venugopal et al. (2011) Ashish Venugopal, Jakob Uszkoreit, David Talbot, Franz Josef Och, and Juri Ganitkevitch. 2011. Watermarking the outputs of structured prediction with an application in statistical machine translation. In Proc. EMNLP, pages 1363–1372.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proc. ICLR.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proc. NeurIPS.
  • Witteveen et al. (2019) Sam Witteveen, Red Dragon AI, and Martin Andrews. 2019. Paraphrasing with large language models. In Proc. EMNLP-IJCNLP.
  • Yang et al. (2024) Borui Yang, Wei Li, Liyao Xiang, and Bo Li. 2024. Srcmarker: Dual-channel source code watermarking via scalable code transformations. In Proc. IEEE SP, pages 97–97.
  • Yang et al. (2023) Xi Yang, Kejiang Chen, Weiming Zhang, Chang Liu, Yuang Qi, Jie Zhang, Han Fang, and Nenghai Yu. 2023. Watermarking text generated by black-box language models. arXiv preprint arXiv:2305.08883.
  • Yoo et al. (2023) KiYoon Yoo, Wonhyuk Ahn, Jiho Jang, and Nojun Kwak. 2023. Robust multi-bit natural language watermarking through invariant features. In Proc. ACL, pages 2092–2115.
  • Zhang et al. (2023) Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, and Farinaz Koushanfar. 2023. Remark-llm: A robust and efficient watermarking framework for generative large language models. arXiv preprint arXiv:2310.12362.
  • Zhao et al. (2023) Xuandong Zhao, Yu-Xiang Wang, and Lei Li. 2023. Protecting language generation models via invisible watermarking. In Proc. ICML, pages 42187–42199.

Appendix A Additional details on watermarking and verification operators

Algorithm 1 Waterfall Watermarking algorithm
1:  Input: Original text $T_{\text{o}}$, ID $\mu$, text-specific metadata $z$, $n$-gram length $n$, perturbation magnitude $\kappa$, key functions $h_\pi$ and $h_p$.
2:  Provide to LLM paraphraser a prompt $\hat{c}$ containing $T_{\text{o}}$ and paraphrasing instructions, which represents semantic content $c$ of $T_{\text{o}}$.
3:  Compute $k_p = h_p(\mu, z)$.
4:  for $j = 1, \dots$ do
5:     Obtain logits $l_j(\hat{w}_{1:j-1}, \hat{c})$ from LLM paraphraser, given Equation 1.
6:     Compute $k_\pi = h_\pi(\mu, \hat{w}_{j-n+1:j-1})$.
7:     Compute perturbed logits $\check{l}_j$ based on Section 3.3.
8:     Sample token $\hat{w}_j$ based on the perturbed probability distribution $\check{p}_j = \text{softmax}(\check{l}_j)$.
9:  end for
10:  Output: Watermarked text $T_{\text{w}} = [\hat{w}_1, \ldots, \text{<eos>}]$.
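The loop in Algorithm 1 can be sketched on a toy vocabulary as follows. This is a minimal illustration under several assumptions, not the reference implementation: `permutation` is a hypothetical stand-in for $h_\pi$ and $\mathcal{P}(k_\pi)$ built from a SHA-256-seeded shuffle, `next_logits` is a hypothetical callable standing in for the LLM paraphraser, indices are zero-based, and the perturbation is assumed to add $\kappa\,\phi_{k_p}$ evaluated at each token's permuted index (the exact form is defined in Section 3.3).

```python
import hashlib
import math
import random


def permutation(mu, context, vocab_size):
    """Hypothetical h_pi + P(k_pi): maps token id -> index in the permuted space."""
    digest = hashlib.sha256(f"{mu}|{context}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return perm


def phi(k_p, idx, vocab_size):
    """Cosine watermarking function (assumes k_p <= vocab_size / 2)."""
    return math.cos(2 * math.pi * k_p * idx / vocab_size)


def perturb_logits(logits, mu, context, k_p, kappa):
    """Assumed perturbation: add kappa * phi at each token's permuted index."""
    size = len(logits)
    perm = permutation(mu, context, size)
    return [l + kappa * phi(k_p, perm[t], size) for t, l in enumerate(logits)]


def sample_token(logits, rng):
    """Sample from softmax(logits) using a numerically stable exponent."""
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def watermark(next_logits, mu, k_p, kappa, n, max_len, eos_id, seed=0):
    """Generate tokens until <eos> or max_len, perturbing logits at each step."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(max_len):
        # The n-gram context is the previous n-1 tokens (fewer at the start).
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        logits = perturb_logits(next_logits(tokens), mu, context, k_p, kappa)
        token = sample_token(logits, rng)
        tokens.append(token)
        if token == eos_id:
            break
    return tokens
```

A larger `kappa` pushes sampling harder toward tokens whose permuted index sits near a peak of the cosine, which strengthens the signal at the cost of deviating further from the paraphraser's original distribution.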
Algorithm 2 Waterfall Verification algorithm
1:  Input: Suspected text $T_{\text{sus}} = [\hat{w}_1, \ldots, \hat{w}_N]$, ID $\mu$, $n$-gram length $n$, key function $h_\pi$, perturbation key $k_p$, test threshold $\bar{q}$.
2:  Initialize a vector $C$ of length $|V_o|$, which keeps track of token counts, to 0.
3:  for $j = 1, \dots, |T_{\text{sus}}|$ do
4:     Compute $k_\pi = h_\pi(\mu, \hat{w}_{j-n+1:j-1})$ and permutation operator $\mathcal{P}(k_\pi)$, given Section 3.3.
5:     Set $C(\mathcal{P}(k_\pi, \hat{w}_j))$++.
6:  end for
7:  Compute avg cumulative token distribution $\bar{C} = C/N$.
8:  Compute verification score $q = \left\langle \bar{C}, \frac{\mathcal{F}_1(k_p)}{\lVert \mathcal{F}_1(k_p) \rVert_2} \right\rangle$ based on Equation 5.
9:  Output: Returns true if $q \geq \bar{q}$.
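A minimal sketch of Algorithm 2 is given below, again on a toy vocabulary. The assumptions are the same kind as before and are not taken from the paper: `permutation` is a hypothetical SHA-256-seeded stand-in for $h_\pi$ and $\mathcal{P}(k_\pi)$, indices are zero-based, and $\mathcal{F}_1(k_p)$ is assumed to be the cosine vector of frequency $k_p$ from Appendix C.

```python
import hashlib
import math
import random


def permutation(mu, context, vocab_size):
    """Hypothetical h_pi + P(k_pi): maps token id -> index in the permuted space."""
    digest = hashlib.sha256(f"{mu}|{context}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return perm


def verification_score(tokens, mu, k_p, n, vocab_size):
    """Score q: inner product of C-bar with the normalized watermark vector."""
    counts = [0] * vocab_size
    for j, token in enumerate(tokens):
        # Context is the previous n-1 tokens (fewer near the start of the text).
        context = tuple(tokens[max(0, j - n + 1):j])
        perm = permutation(mu, context, vocab_size)
        counts[perm[token]] += 1
    c_bar = [c / len(tokens) for c in counts]
    # Assumed F_1(k_p): cosine of frequency k_p over the permuted indices.
    f = [math.cos(2 * math.pi * k_p * i / vocab_size) for i in range(vocab_size)]
    norm = math.sqrt(sum(x * x for x in f))
    return sum(c * x for c, x in zip(c_bar, f)) / norm


def verify(tokens, mu, k_p, n, vocab_size, q_bar):
    """Return True if the score reaches the test threshold q_bar."""
    return verification_score(tokens, mu, k_p, n, vocab_size) >= q_bar
```

Because verification only needs hashing, counting, and one inner product, it involves no model inference, which is consistent with the CPU-only verification times reported in Table 2.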
Algorithm 3 Waterfall Extraction algorithm
1:  Input: Suspected text $T_{\text{sus}} = [\hat{w}_1, \ldots, \hat{w}_N]$, ID $\mu$, $n$-gram length $n$, key function $h_\pi$.
2:  Initialize a vector $C$ of length $|V_o|$, which keeps track of token counts, to 0.
3:  for $j = 1, \dots, |T_{\text{sus}}|$ do
4:     Compute $k_\pi = h_\pi(\mu, \hat{w}_{j-n+1:j-1})$ and permutation operator $\mathcal{P}(k_\pi)$, given Section 3.3.
5:     Set $C(\mathcal{P}(k_\pi, \hat{w}_j))$++.
6:  end for
7:  Compute avg cumulative token distribution $\bar{C} = C/N$.
8:  Compute highest scoring key $\hat{k}_p = \operatorname*{arg\,max}_{k_p \in \mathbb{K}_p} \left\langle \bar{C}, \frac{\mathcal{F}_1(k_p)}{\lVert \mathcal{F}_1(k_p) \rVert_2} \right\rangle$ based on Equation 5.
9:  Output: Returns $\hat{k}_p$.
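The final step of Algorithm 3 can be sketched as an exhaustive search over candidate keys, given an already computed $\bar{C}$. This is an illustrative sketch only: it assumes $\mathcal{F}_1(k_p)$ is the cosine/sine family from Appendix C, an even vocabulary size, and zero-based indices; the function names are hypothetical.

```python
import math


def phi(k_p, j, vocab_size):
    """Cosine/sine watermarking function from Appendix C (even vocab size assumed)."""
    half = vocab_size // 2
    if k_p <= half:
        return math.cos(2 * math.pi * k_p * j / vocab_size)
    return math.sin(2 * math.pi * (k_p - half) * j / vocab_size)


def key_score(c_bar, k_p):
    """Inner product of C-bar with the L2-normalized watermark vector for k_p."""
    size = len(c_bar)
    f = [phi(k_p, j, size) for j in range(size)]
    norm = math.sqrt(sum(x * x for x in f))
    return sum(c * x for c, x in zip(c_bar, f)) / norm


def extract_key(c_bar, candidate_keys=None):
    """Return the candidate key with the highest score (all keys 1..|V|-1 by default)."""
    if candidate_keys is None:
        candidate_keys = range(1, len(c_bar))
    return max(candidate_keys, key=lambda k_p: key_score(c_bar, k_p))
```

Orthogonality of the candidate functions is what makes this search meaningful: a signal embedded with one key contributes almost nothing to the scores of the other keys.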

Appendix B Empirical illustration of watermarking signal in $T_{\text{w}}$

Here we empirically illustrate how the watermarking signal can be embedded in $V_w$ space with the background logits appearing as uniform noise, as described in Section 3.3. To illustrate the presence of the watermarking signal, we use the combined watermarked dataset used in the data ownership experiments, and plot its average cumulative token distribution $\bar{C}$ (in Algorithm 2).

Figure 7: Average cumulative token distribution $\bar{C}$ of watermarked and unwatermarked text from subset of c4 realnewslike dataset. Fourier watermark signal with frequency 2 is clearly visible in $T_{\text{w}}$ (left) as compared to $T_{\text{o}}$ (right).

Figure 7 shows that when we use the correct ID and $k_\pi$ for verification, the watermarking function can be clearly seen for the watermarked text $T_{\text{w}}$ (distribution in the shape of a cosine curve of 2 periods for $k_p=2$), while the unwatermarked text $T_{\text{o}}$ shows a flat function.

Similarly, Figure 8 shows that when verifying watermarked text $T_{\text{w}}$, the watermarking function is only visible with the correct permutation $\mathcal{P}(k_\pi)$ (distribution in the shape of a cosine curve of 2 periods for $k_p=2$), but not with a different permutation $\mathcal{P}(k_\pi')$ (i.e., wrong ID).

Figure 8: Average cumulative token distribution $\bar{C}$ of watermarked text from subset of c4 realnewslike dataset. Fourier watermark signal with frequency 2 is clearly visible when performing the correct permutation $\mathcal{P}(k_\pi)$ (left) compared to the wrong permutation $\mathcal{P}(k_\pi')$ (right).

Appendix C Examples of orthogonal watermarking functions

We chose cosine and sine functions as the watermarking functions, due to the orthogonality between the cosine and sine functions of different frequencies.

$$\phi_{k_p}(j) = \begin{cases} \cos\left(2\pi k_p \frac{j}{\lvert\mathbb{V}\rvert}\right) & \text{if } k_p \leq \frac{\lvert\mathbb{V}\rvert}{2} \\ \sin\left(2\pi \left(k_p - \frac{\lvert\mathbb{V}\rvert}{2}\right) \frac{j}{\lvert\mathbb{V}\rvert}\right) & \text{otherwise} \end{cases}$$

where $j \in \{1, \dots, \lvert\mathbb{V}\rvert\}$ denotes the index in the vocab space, and $k_p \in \{1, \dots, \lvert\mathbb{V}\rvert - 1\}$ denotes the index of the available orthogonal functions. We chose the cosine and sine sequences as any other bounded watermarking sequence can be represented by a collection of sinusoidal sequences via the discrete Fourier transform (DFT).
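The orthogonality of this family can be checked numerically. The sketch below builds the Gram matrix of all $\lvert\mathbb{V}\rvert - 1$ functions for a small, even vocabulary size (an assumption made for illustration; the function names are hypothetical) and confirms that the off-diagonal inner products vanish up to floating-point error.

```python
import math


def phi(k_p, j, vocab_size):
    """Cosine for k_p <= |V|/2, sine of frequency k_p - |V|/2 otherwise."""
    half = vocab_size // 2
    if k_p <= half:
        return math.cos(2 * math.pi * k_p * j / vocab_size)
    return math.sin(2 * math.pi * (k_p - half) * j / vocab_size)


def gram_matrix(vocab_size):
    """Inner products between all pairs of functions, summed over j = 1..|V|."""
    keys = range(1, vocab_size)
    indices = range(1, vocab_size + 1)
    return [
        [sum(phi(a, j, vocab_size) * phi(b, j, vocab_size) for j in indices)
         for b in keys]
        for a in keys
    ]


def max_off_diagonal(gram):
    """Largest absolute inner product between two different functions."""
    size = len(gram)
    return max(abs(gram[r][c]) for r in range(size) for c in range(size) if r != c)


if __name__ == "__main__":
    g = gram_matrix(16)
    print("max off-diagonal:", max_off_diagonal(g))
    print("squared norms:", [round(g[i][i], 6) for i in range(len(g))])
```

For an even $\lvert\mathbb{V}\rvert$, every function has squared norm $\lvert\mathbb{V}\rvert/2$ except the cosine with $k_p = \lvert\mathbb{V}\rvert/2$, which alternates between $\pm 1$ and has squared norm $\lvert\mathbb{V}\rvert$; this is why the scores in Algorithms 2 and 3 normalize by the $\ell_2$ norm.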

In general, periodic functions of different frequencies could be used as the system of orthogonal functions, along with their counterparts phase-shifted by a quarter wavelength. Other than the cosine and sine functions, another example is the family of square wave functions.

Let $k_N=\max_{k\in\mathbb{N}^{*}}\{k\mid|\mathbb{V}|\equiv 0\ (\text{mod }2^{k})\}$. Assuming $k_N\geq 2$, the number of orthogonal square waves supported is $2k_N-1$, such that $k_p\in\{1,\dots,2k_N-1\}$. The square watermarking function is defined as follows.

$$\phi_{k_p}(j)=\begin{cases}\left(-1\right)^{\left\lfloor 2^{k_p}\frac{j}{|\mathbb{V}|}\right\rfloor}&\text{if }k_p\leq k_N\\ \left(-1\right)^{\left\lfloor 2^{\left(k_p-k_N\right)}\frac{j}{|\mathbb{V}|}+0.5\right\rfloor}&\text{otherwise}\end{cases}$$
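As an illustration, the square wave family can be evaluated with exact integer arithmetic. This is a sketch under stated assumptions, not the implementation used in our experiments: the names `max_power_of_two_exponent`, `square_phi` and `square_inner_product` are hypothetical, and the floor in the second case is rewritten as an integer division to avoid floating-point rounding.

```python
def max_power_of_two_exponent(vocab_size: int) -> int:
    """Largest k such that 2**k divides vocab_size (k_N in the text)."""
    k = 0
    while vocab_size % (2 ** (k + 1)) == 0:
        k += 1
    return k


def square_phi(k_p: int, j: int, vocab_size: int) -> int:
    """Square wave watermarking function, returning +1 or -1."""
    k_n = max_power_of_two_exponent(vocab_size)
    if not 1 <= k_p <= 2 * k_n - 1:
        raise ValueError("k_p must lie in {1, ..., 2 * k_N - 1}")
    if k_p <= k_n:
        # floor(2^k_p * j / |V|)
        exponent = (2 ** k_p * j) // vocab_size
    else:
        # floor(2^(k_p - k_N) * j / |V| + 0.5), written with integers only
        exponent = (2 ** (k_p - k_n + 1) * j + vocab_size) // (2 * vocab_size)
    return -1 if exponent % 2 else 1


def square_inner_product(k_a: int, k_b: int, vocab_size: int) -> int:
    """Inner product of two square waves over j = 1, ..., |V|."""
    return sum(
        square_phi(k_a, j, vocab_size) * square_phi(k_b, j, vocab_size)
        for j in range(1, vocab_size + 1)
    )


if __name__ == "__main__":
    toy_vocab_size = 8  # toy value with k_N = 3, giving 5 square waves
    for k_a in range(1, 6):
        print(k_a, [square_inner_product(k_a, k_b, toy_vocab_size)
                    for k_b in range(1, 6)])
```

For the toy vocabulary of size 8, the five supported square waves are mutually orthogonal, and each has squared norm 8.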

Appendix D Discussion on weaknesses of existing text watermarking methods

Both benchmark text watermarking methods, M-bit and P-nlw, are unable to achieve perfect verification performance despite having deterministic watermarking and verification algorithms, as stated in their respective papers, and corroborated in our experiments.

Both methods first use a language model to select viable word positions at which to perform the synonym substitution, then another model or word list to generate the list of possible synonyms for substitution. During verification, we observe that the watermark could be corrupted in three ways.

Firstly, as the text being fed to the model for selecting the word replacement locations is different (original text during watermarking and watermarked text during verification), the locations selected during verification could differ from those used for watermarking.

Secondly, even if the correct locations are selected, a different synonym list could be generated during verification, due to the words that were changed at other locations during the watermarking process.

Thirdly, as the benchmarks perform watermarking by sequentially embedding the bits of the watermark ID into the text, any modification that inserts, deletes or shuffles text would destroy the watermark ID. If an insertion or deletion error appears early in the text, either through the first corruption above or through attacks (i.e., a word replacement location being inserted or removed during verification as compared to during watermarking), the remainder of the watermark ID would be shifted in position. All the bits after the error would then be in the wrong position, resulting in poor verifiability and robust verifiability. Additionally, as illustrated in Section 3.2, attacks that reorder the text will also shuffle the watermark ID, destroying its robust verifiability.

On the other hand, Waterfall is not susceptible to the above-mentioned issues. As discussed in Section 3.2, the watermark signal is injected into each $n$-gram in the watermarked text, and does not depend on the specific location within the sentence or on specific word replacements. As the hash function $h_{\pi}$ is deterministic, the same permutation used during watermarking will always be selected during verification, as long as the $n$-gram unit is preserved.
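The effect of a single early deletion on a sequentially embedded ID can be seen in a toy model. The sketch below is purely illustrative and is not the verification code of M-bit or P-nlw: the 8-bit ID and the helper names `bit_accuracy` and `drop_position` are hypothetical.

```python
def bit_accuracy(embedded_bits: list, read_bits: list) -> float:
    """Fraction of positions at which the bits read back match the embedded bits."""
    compared = min(len(embedded_bits), len(read_bits))
    matches = sum(e == r for e, r in zip(embedded_bits, read_bits))
    return matches / compared


def drop_position(bits: list, index: int) -> list:
    """Simulate losing one embedding location, shifting all later bits forward."""
    return bits[:index] + bits[index + 1:]


if __name__ == "__main__":
    watermark_id = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical 8-bit ID
    print("no error:      ", bit_accuracy(watermark_id, watermark_id))
    print("early deletion:", bit_accuracy(watermark_id, drop_position(watermark_id, 1)))
    print("late deletion: ", bit_accuracy(watermark_id, drop_position(watermark_id, 7)))
```

In this toy model, losing a location near the start misaligns every subsequent bit, whereas losing the final location leaves the preceding bits intact.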

Appendix E Data ownership experimental setting

E.1 Dataset

From the first 2000 samples in the c4 dataset, we selected texts shorter than 1000 tokens as our text samples $T_{\text{o}}$, totaling 1360 samples. We restricted the token length to ensure that the paraphrasing prompt, original text and watermarked text fit within the context window of the LLM used for paraphrasing. In practice, to overcome this limitation, longer original text could either be first split into multiple sections to be watermarked, or an LLM with a longer context window could be used. The distribution of word and token lengths is shown in Figure 9.

Figure 9: Histogram of word and token lengths of text in the c4 realnewslike dataset used for data ownership experiments.

E.2 Watermarking methodology

To perform paraphrasing, we followed the prompt format for llama-2-13b-hf, and used the following prompt to perform watermarking. No effort has been made to optimize the prompt.

[INST] <<SYS>>
Paraphrase the user provided text while preserving semantic similarity. Do not include any other sentences in the response, such as explanations of the paraphrasing. Do not summarize.
<</SYS>>
{text} [/INST]
Here is a paraphrased version of the text while preserving the semantic similarity:
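For illustration, the prompt above could be assembled programmatically as follows. This is a minimal sketch: the names `SYSTEM_PROMPT`, `RESPONSE_PREFIX` and `build_paraphrase_prompt` are hypothetical, and the exact whitespace between the prompt segments is an assumption based on the layout shown above.

```python
SYSTEM_PROMPT = (
    "Paraphrase the user provided text while preserving semantic similarity. "
    "Do not include any other sentences in the response, such as explanations "
    "of the paraphrasing. Do not summarize."
)
RESPONSE_PREFIX = (
    "Here is a paraphrased version of the text while preserving the semantic similarity:"
)


def build_paraphrase_prompt(text: str) -> str:
    """Fill the paraphrasing template with the text to be watermarked.

    The newline placement is an assumption; adjust it to the chat template
    of the model actually used.
    """
    return (
        f"[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n"
        f"{text} [/INST]\n{RESPONSE_PREFIX}"
    )


if __name__ == "__main__":
    print(build_paraphrase_prompt("The quick brown fox jumps over the lazy dog."))
```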

For the results in the experiments (Section 4.1), watermarking was performed with ID $\mu=0$ and $k_p=1$. Results for other $\mu$ and $k_p$ are reported below.

After watermarking, we perform a simple post-processing step to strip away extraneous generation by the LLM, by filtering out the last sentence or paragraph if it contains any of the following phrases.

  • let me know

  • paraphrase

  • paraphrasing

  • other sentences

  • original text

  • same information

  • Note:

  • Note :

  • Please note

  • Please kindly note

  • Note that I

  • semantic similar

  • semantically similar

  • similar in meaning

  • Please be aware

  • the main changes made

  • Kindly note

  • Note this does

  • I have made sure to

This list should be customized depending on the content of the text to be watermarked and the LLM used for watermarking. Other methods of cleaning the watermarked text, such as prompting the LLM to critique or correct issues within the watermarked text, could also be employed (Shinn et al., 2023).
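The post-processing step can be sketched as follows. This is an illustrative approximation rather than our exact filter: the function name `strip_trailing_commentary`, the case-insensitive matching, the blank-line paragraph splitting and the punctuation-based sentence splitting are all assumptions.

```python
import re

FILTER_PHRASES = [
    "let me know", "paraphrase", "paraphrasing", "other sentences",
    "original text", "same information", "Note:", "Note :", "Please note",
    "Please kindly note", "Note that I", "semantic similar",
    "semantically similar", "similar in meaning", "Please be aware",
    "the main changes made", "Kindly note", "Note this does",
    "I have made sure to",
]


def strip_trailing_commentary(text: str, phrases=FILTER_PHRASES) -> str:
    """Drop the last paragraph or sentence if it contains a filter phrase."""
    lowered = [p.lower() for p in phrases]

    def flagged(chunk: str) -> bool:
        return any(p in chunk.lower() for p in lowered)

    text = text.strip()
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if len(paragraphs) > 1 and flagged(paragraphs[-1]):
        return "\n\n".join(paragraphs[:-1]).strip()
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    if len(sentences) > 1 and flagged(sentences[-1]):
        return text[: len(text) - len(sentences[-1])].rstrip()
    return text


if __name__ == "__main__":
    raw = "The cat sat. It purred.\n\nPlease let me know if you need changes."
    print(strip_trailing_commentary(raw))
```

Only the final paragraph or sentence is inspected, so phrases that appear legitimately in the body of the text are left untouched.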

E.3 Benchmark experiment settings

P-nlw (Qiang et al., 2023) proposes a watermarking process that incorporates a paraphraser-based lexical substitution model, while M-bit (Yoo et al., 2023) carefully chooses the original words to replace by finding features that are invariant to minor corruption, and uses a BERT-based lexical substitution model. We use these two approaches as the benchmarks for text watermarking in the data ownership problem setting.

Key generation

By default, both M-bit and P-nlw use binary keys as watermark signals. The bits of the keys used in our experiments were generated with a seeded pseudo-random number generator. Specifically, we used 0 as the seed to NumPy's Random Generator to generate the key used in the experiments (NumPy's random generator takes an unsigned integer as the seed).

E.4 Verifiability

In this section, given a threshold score $\bar{q}$, we define the classification problem as follows. Positive sample: watermarked text $T_{\text{w}}^{(i)}$; negative sample: unwatermarked text $T_{\text{o}}$; predicted positive: $\mathcal{V}\left(\mu_i,T_{\text{sus}}\right)\geq\bar{q}$; predicted negative: $\mathcal{V}\left(\mu_i,T_{\text{sus}}\right)<\bar{q}$.

The ROC curves and corresponding AUROC values for different $\mu$, $k_p$ and $\kappa$ are shown in Figure 10. We show that verifiability is insensitive to the $\mu$ and $k_p$ used for watermarking. For $\kappa=6$, Waterfall was able to achieve an AUROC of 0.989 to 0.996 across the different settings.
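To make the classification setup concrete, the sketch below computes the true and false positive rates at a given threshold. The function name `confusion_rates` and the example scores are hypothetical; the verification scores themselves would come from the verification operator applied to each text.

```python
def confusion_rates(watermarked_scores, unwatermarked_scores, threshold):
    """Return (TPR, FPR) when a text is predicted positive if its score >= threshold.

    watermarked_scores: verification scores of positive samples.
    unwatermarked_scores: verification scores of negative samples.
    """
    true_positives = sum(s >= threshold for s in watermarked_scores)
    false_positives = sum(s >= threshold for s in unwatermarked_scores)
    tpr = true_positives / len(watermarked_scores)
    fpr = false_positives / len(unwatermarked_scores)
    return tpr, fpr


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(confusion_rates([3.0, 1.0], [0.5, 2.5], threshold=2.0))
```

Sweeping the threshold over all observed scores and plotting the resulting (FPR, TPR) pairs traces out the ROC curve.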

Figure 10: ROC curves and corresponding AUROC values for different $\mu$, $k_p$ and $\kappa$.

E.5 Fidelity

We provide some examples of text watermarked by Waterfall, M-bit and P-nlw. Table 4 shows a few samples from the c4 dataset with watermarked text of varying STS scores. M-bit has the highest STS across the samples listed, because its algorithm changes very few words within the text, which results in lower scalability as described in Section 4.1. Despite the high STS scores, it can be seen that watermarking with M-bit and P-nlw introduces linguistic and grammatical errors into the text, which are not measured by the STS score.

Figure 11: Distribution of token length of unwatermarked text $T_{\text{o}}$ against watermarked text $T_{\text{w}}$.

We noticed that LLMs tend to summarize when performing paraphrasing, so that some details of the text are lost during the watermarking process. This can be seen in the decrease in token length from the original unwatermarked text $T_{\text{o}}$ to the watermarked text $T_{\text{w}}$ in Figure 11. However, there are multiple methods of mitigating this issue. Firstly, longer text could be broken into sections to be watermarked separately before being recombined. Secondly, due to the robustness of Waterfall to modifications, the watermarked text could be edited to correct errors or reinsert missing information, either manually or by leveraging LLMs to perform self-critique, similar to the method described by Shinn et al. (2023). Lastly, as Waterfall does not require the use or training of specific LLMs for watermarking, it can be applied to newer models with better paraphrasing performance as they become available.

Table 4: Text watermarked with Waterfall, M-bit and P-nlw, along with their STS scores. For M-bit and P-nlw, highlighted words are words that were selected for watermarking, and may or may not have been changed as compared to the original text depending on the ID used for watermarking.
Method Text STS
Original After the martyrdom of St. Boniface, Vergilius was made Bishop of Salzburg (766 or 767) and laboured successfully for the upbuilding of his diocese as well as for the spread of the Faith in neighbouring heathen countries, especially in Carinthia. He died at Salzburg, 27 November, 789. In 1233 he was canonized by Gregory IX. His doctrine that the earth is a sphere was derived from the teaching of ancient geographers, and his belief in the existence of the antipodes was probably influenced by the accounts which the ancient Irish voyagers gave of their journeys. This, at least, is the opinion of Rettberg ("Kirchengesch. Deutschlands", II, 236). -
Waterfall After St. Boniface’s martyrdom, Vergilius became the Bishop of Salzburg in 766 or 767 and worked tirelessly to strengthen his diocese and spread the faith in nearby heathen regions, particularly in Carinthia. He died in Salzburg on November 27, 789. In 1233, he was canonized by Gregory IX. According to Rettberg (“Kirchengesch. Deutschlands”, II, 236), Vergilius’ belief that the earth is a sphere was based on the teachings of ancient geographers, and his belief in the existence of the antipodes may have been influenced by the accounts of ancient Irish voyagers. 0.978
M-bit (8 bits) Following the martyrdom of St. Boniface, Vergilius became made Bishop of Salzburg (766 or 767) and laboured successfully for the upbuilding of his diocese as well as for the spread of the Faith in neighbouring heathen countries, especially in Carinthia. He died at Salzburg, 27 November, 789. In 1233 he was canonized by Gregory IX. His doctrine that the earth is a sphere was derived from the teaching of ancient geographers, and his belief in the existence of the antipodes was probably influenced by the accounts which the ancient Irish voyagers gave of their journeys. This, at least, is the opinion of Rettberg (“Kirchengesch. Deutschlands”, II, 236). 0.999
P-nlw (3 bits) following the martyrdom of St. Boniface, Vergilius was made Bishop of Salzburg (766 or 767) and worked worked for the upbuilding of his diocese as well as for the spread of the Faith in neighbouring heathen countries, especially in Carinthia. He died at Salzburg, 27 November, 789. In 1233 he was canonized by Gregory IX. His doctrine that the earth is a sphere was derived from the teaching of ancient geographers, and his belief in the existence of the antipodes was probably influenced by the accounts which the ancient Irish voyagers gave of their journeys. This, at least, is the opinion of Rettberg (“Kirchengesch. Deutschlands”, II, 236). 0.964
Original Karl Kispert, principal of cyber and information security, has more than 28 years of experience in selling, managing and delivering information risk management, internal audit, regulatory and compliance programs, and information security and technology risk management. A former chief information security officer, Kispert has helped design and implement cybersecurity programs for many firms, according to the firm. “By adding this new service line, and bringing someone with Karl’s expertise to the firm, we can service yet another important aspect of our clients’ and prospects’ businesses, ensuring their continued success,” CEO Louis Grassi said in a written statement. Services will include full security programs, compliance, third party vendor risk assessment, threat management, and managed security services. -
Waterfall Karl Kispert, a principal specialist in cybersecurity and information risk management, has extensive experience spanning 28 years in providing sales, management, and delivery of information risk management, internal audit, compliance programs, and technology risk management solutions. As a former Chief Information Security Officer, Kispert has supported the design and implementation of comprehensive cybersecurity programs for numerous organizations. The CEO of the firm, Louis Grassi, has expressed enthusiasm about expanding the firm’s service offerings through the integration of this new service line, which will be supported by Kispert’s proficiency in providing comprehensive security measures, compliance, vendor risk assessment, threat management, and managed security services. 0.899
M-bit (5 bits) Karl Kispert, principal in cyber and information security, has more than 28 years of experience in selling, managing and delivering information risk management, internal audit, regulatory and compliance programs, and information security and technology risk management. A former chief information security officer, Kispert had helped design and implement cybersecurity programs for many firms, according to the firm. “By adding this new service line, and bringing someone with Karl’s expertise to the firm, we can service yet another important aspect of our clients’ and prospects’ businesses, ensuring their continued success,” CEO Louis Grassi said in a written statement. Services offered include full security programs, compliance, third party vendor risk assessment, threat management, and managed security services. 0.9969
P-nlw (21 bits) carl kisper, principal of cyber and information protection, has has than 28 old of experience experience selling, managing and delivery information risk risks, internal audit, regulatory cyber cybernetic programs, and information security and technology risk management. A former chief information security officer, Kispert has helped project and project cybersecurity programs for many firms, according to the firm. “ By adding this new service line, and bringing someone with Karl’ s expertise to the firm, we can service yet another important aspect of our clients ’ and prospects ’ businesses, ensuring their continued success, CEO Louis Grassi said in a written job. Services will include full security programs, compliance, third party vendor risk assessment, threat management, and managed security services. 0.938
Original Larry checks in with KPCC reporter Sharon McNary, who’s been hitting up several polling stations in Orange County and Los Angeles County, as well as Registrar of Voters for O.C. and L.A. After being a finalist for LAPD chief in 2009 only to see the job go to Charlie Beck, Michel Moore has been selected to succeed Beck by L.A. Mayor Eric Garcetti. President Donald Trump signed the “right-to-try” bill into law on Wednesday, a measure that gives terminally ill patients access to experimental drugs that have not yet been approved by the Food and Drug Administration (FDA). Humans have a habit of measuring things. Our shoe size. The ingredients in our food. How long it takes to get to work, with or without traffic. -
Waterfall Larry talks with KPCC reporter Sharon McNary about polling stations and the Registrar of Voters in both Orange County and Los Angeles County. The Los Angeles Mayor, Eric Garcetti, has appointed Michel Moore as the new Chief of the LA Police Department after he was previously a finalist for the position in 2009. The US President, Donald Trump, signed a law giving terminally ill patients access to unapproved experimental treatments. Humans tend to quantify aspects of life, such as shoe size, food ingredients, commute times, and more. 0.857
M-bit (4 bits) Larry checks in with KPCC reporter Sharon McNary, who’s been hitting up several polling stations in Orange County and Los Angeles County, as well as Registrar of Voters for O.C. and L.A. After being a finalist for LAPD chief in 2009 only to see the job go to Charlie Beck, Michel Moore has been selected to succeed Beck by L.A. Mayor Eric Garcetti. President Donald Trump signed the “right-to-try” bill into law on Wednesday, a measure that gives terminally ill patients access to experimental drugs that have not yet become approved by the Food and Drug Administration (FDA). Humans have a habit for measuring things. Our shoe size. The ingredients of our food. How long it takes to get to work, with or without traffic. 0.999
P-nlw (12 bits) lary controls on on KPCC journalist Sharon McNary, who is s been attacked up several polling stations in Orange County and Los Angeles County, as well as Registrar of Voters for O.C. los L.A. After being a finalist for LAPD chief in 2009 only to see the job go to Charlie Beck, Michel Moore has been selected to succeed Beck by L.A. Mayor Eric Garcetti. President Donald Trump signed the “ right-to-try ” bill into law on Wednesday, a measure that gives terminally ill patients access to experimental drugs that have not yet been approved by the Food and Drug on (FDA). Humans have a habit of measuring things. Our shoe size. The ingredients in our food. How long it takes to get to work, with or without traffic. 0.829
Original Come test your luck on the best slot machine app in the app store. Great graphics make this app so fun to play. Test your luck with Pharaoh Slots! Bet, Spin and Get Lucky! -
Waterfall Experience the ultimate entertainment with the most thrilling slot machine game in the app store! Marvel at stunning visuals that make playing so enjoyable. 0.787
M-bit (4 bits) Come test your luck on the best slot machine app in the app store. Great graphics make this app so fun to play. Test your luck on Pharaoh Slots! Bet, Spin and Get Lucky! 0.9985
P-nlw (12 bits) please test yourself happiness happiness the best place machine app in the app store. Great graphics make it app app fun to play. Test your luck with Pharaoh Slots ! Bet, Spin and Get get! 0.716

E.6 Verifiability fidelity trade-off

Figure 12: Fidelity and verifiability for different $\mu$, $k_p$ and $\kappa$.

We observe that different values of $\mu$ do not have a noticeable impact on the fidelity and verifiability of the watermarked text, as shown in Figure 12. Varying $k_p$ results in minor variations in fidelity and verifiability at high $\kappa$, but the Pareto front of the fidelity-verifiability trade-off is similar across the different $k_p$. Clients using different $k_p$ could adjust the value of $\kappa$ to suit their requirements for fidelity and verifiability.

E.7 Scalability

We examine the scalability of Waterfall and the benchmarks M-bit and P-nlw in practice by watermarking with different IDs and verifying with different IDs.

E.7.1 Scalability when verifying with different IDs

Using a dataset of text watermarked with ID $\mu=i$, we compare the verifiability using the correct ID ($\mathcal{V}(\mu_i,T_{\text{w}}^{(i)})$) against the verifiability using the wrong IDs ($\mathcal{V}(\mu_{j\neq i},T_{\text{w}}^{(i)})$). Figure 13 shows the histogram of the AUROC comparing the two verification scores ($\mathcal{V}(\mu_i,T_{\text{w}}^{(i)})$ versus $\mathcal{V}(\mu_{j\neq i},T_{\text{w}}^{(i)})$) for the different methods.

Notice that the AUROC values of Waterfall for the different IDs are all closely clustered around the high value of 0.985. In contrast, the AUROC values of the benchmarks M-bit and P-nlw span a very large range, with some IDs showing AUROC as low as 0.69 and 0.53 respectively.

To further support our claim that Waterfall has large scalability, we performed verification with 100,000 different IDs for Waterfall. Figure 14 shows that the distribution of AUROC values is similar when scaling up from 1,000 to 100,000 IDs, and this performance could be extrapolated to millions of IDs.
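The AUROC between two sets of verification scores can be computed directly from its pairwise definition, i.e., the probability that a randomly chosen correct-ID score exceeds a randomly chosen wrong-ID score. The sketch below is illustrative only: the function name `auroc` and the example scores are hypothetical, and the quadratic-time loop is chosen for clarity rather than efficiency.

```python
def auroc(positive_scores, negative_scores):
    """Pairwise AUROC: P(positive > negative) + 0.5 * P(positive == negative)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Hypothetical scores: verification with the correct ID vs. a wrong ID.
    correct_id_scores = [4.1, 3.7, 5.0, 2.9]
    wrong_id_scores = [0.2, -0.5, 3.0, 0.9]
    print(auroc(correct_id_scores, wrong_id_scores))
```

Repeating this computation for each wrong ID yields one AUROC value per ID, which is the quantity summarized in the histograms.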

Figure 13: AUROC of $T_{\text{w}}^{(i)}$ when verifying with $\mu_i$ vs. $\mu_{j\neq i}$. Waterfall has consistently high verifiability for all 1000 $\mu_{j\neq i}$, compared to the benchmarks, which have many $\mu_{j\neq i}$ with poor verifiability.
Figure 14: Scalability of Waterfall for AUROC of $T_{\text{w}}^{(i)}$ when verifying with $\mu_i$ vs. $\mu_{j\neq i}$, when using 1000 IDs versus 100,000 IDs. Scaling up to 100,000 IDs shows the same narrow clustering of values around the high AUROC value of 0.985.

E.7.2 Scalability when watermarking different IDs

We further explore the scalability of Waterfall when verifying text watermarked with different IDs. We compare the verifiability of text watermarked with the correct ID ($\mathcal{V}(\mu_i,T_{\text{w}}^{(i)})$) against the verifiability of text watermarked with the wrong IDs ($\mathcal{V}(\mu_i,T_{\text{w}}^{(j\neq i)})$). Due to the higher computational cost of watermarking compared to verification, we performed this experiment over a smaller subset of 358 pieces of text from the c4 realnewslike dataset. 500 different IDs were used to watermark the dataset. Figure 15 shows that the distribution of AUROC comparing the two verification scores, $\mathcal{V}(\mu_i,T_{\text{w}}^{(i)})$ versus $\mathcal{V}(\mu_i,T_{\text{w}}^{(j\neq i)})$, for Waterfall is closely clustered around 0.98, similar to the results in Section E.7.1. Note that a smaller number of texts is considered for this experiment, resulting in the slight difference in distribution.

Figure 15: AUROC of $T_{\text{w}}^{(i)}$ vs. $T_{\text{w}}^{(j\neq i)}$ when verifying with $\mu_i$. Waterfall shows consistently high AUROC when verifying $T_{\text{w}}^{(i)}$ with $\mu_i$ compared to verifying $T_{\text{w}}^{(j\neq i)}$ with $\mu_i$.

E.7.3 Discussion on scalability in practice

M-bit and P-nlw suffer from poor scalability in practice, as shown above. When we consider watermarking or verification with the wrong ID $\mu_{j\neq i}$, there can be situations where the wrong ID differs from the correct ID in only a single bit, or in very few bits. If the text is too short to encode enough bits to include the differing bits, the watermarking method would be unable to differentiate between the two IDs during verification.

Even if the texts are sufficiently long, IDs that have few differing bits will be harder to differentiate. As discussed in Appendix D, errors could be present in the verification of watermarks with M-bit and P-nlw. Such errors could overshadow the small differences between the watermarking and verification IDs, resulting in poor verification performance. To achieve satisfactory performance, M-bit and P-nlw would have to limit their schemes to IDs with a sufficient number of differing bits, which further limits the scalability of their schemes.

On the other hand, Waterfall is not susceptible to such issues. As the watermark signal is not embedded directly into specific substitutions in the text space, but rather into signals in the permuted token space determined by a hash of the ID, small differences in the ID result in drastically different permutations in the token space, and they are extremely unlikely to collide, i.e., two different IDs are extremely unlikely to map to the same permutations over the entire piece of text. As a result, Waterfall can achieve significantly higher scalability than M-bit and P-nlw in practice.
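The following sketch illustrates why nearby IDs lead to unrelated permutations. It is not the hash function used by Waterfall: the SHA-256 based seeding, the payload format and the name `keyed_permutation` are assumptions made for illustration, with the preceding tokens of the $n$-gram passed as hypothetical integer token indices.

```python
import hashlib
import random


def keyed_permutation(watermark_id: int, context_tokens: tuple, vocab_size: int) -> list:
    """Deterministic vocabulary permutation keyed by the ID and the preceding tokens.

    Assumption: SHA-256 of the ID and context is used to seed a pseudo-random
    shuffle; the actual hash construction may differ.
    """
    payload = f"{watermark_id}|{','.join(map(str, context_tokens))}".encode("utf-8")
    seed = int.from_bytes(hashlib.sha256(payload).digest(), "big")
    permutation = list(range(vocab_size))
    random.Random(seed).shuffle(permutation)
    return permutation


if __name__ == "__main__":
    context = (17, 4)  # hypothetical token indices preceding the current token
    print(keyed_permutation(0, context, 16))
    print(keyed_permutation(1, context, 16))  # ID differs in a single bit
    print(keyed_permutation(0, context, 16))  # same inputs, same permutation
```

Because the permutation is recomputed from the ID and the local context alone, verification reproduces it exactly whenever the $n$-gram is preserved, while IDs differing in a single bit yield unrelated permutations.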

E.8 Extraction

To evaluate extraction accuracy, we applied Algorithm 3 to the watermarked text. The accuracy is calculated as the percentage of exact matches (the extracted $\hat{k_p}$ matches the $k_p$ used to watermark the text).

Note that as there are 31999 supported $k_p$ when using the Fourier basis functions with llama-2-13b-hf as the paraphraser, the probability of randomly guessing the correct $k_p$ is $\frac{1}{31999}\approx 0.003125\%$. Despite this, Waterfall is able to achieve a high extraction accuracy of 48% when extracting from a single text for our default setting of $\kappa=6$. This performance can be further improved when more pieces of watermarked text are available, such that accuracy improves to the high value of 99% with only 5 pieces of text. This is done by combining multiple pieces of text watermarked with the same ID $\mu$ and perturbation key $k_p$, by simply summing the cumulative token counts in $V_w$ space, $C$, of the different pieces of text, before performing step 7 of Algorithm 3.
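The combination of multiple texts can be sketched as follows. Algorithm 3 is not reproduced in this appendix, so the scoring rule below (choosing the $k_p$ whose cosine or sine function has the largest inner product with the combined counts) is an assumption made for illustration. The names `combine_counts` and `extract_k_p`, the toy vocabulary size and the count vectors are hypothetical.

```python
import math


def phi(k_p: int, j: int, vocab_size: int) -> float:
    """Cosine/sine watermarking function (assumes an even vocab_size)."""
    half = vocab_size // 2
    if k_p <= half:
        return math.cos(2 * math.pi * k_p * j / vocab_size)
    return math.sin(2 * math.pi * (k_p - half) * j / vocab_size)


def combine_counts(count_vectors):
    """Sum the per-text cumulative token counts element-wise."""
    return [sum(column) for column in zip(*count_vectors)]


def extract_k_p(counts, vocab_size: int) -> int:
    """Assumed extraction rule: the k_p maximizing <counts, phi_{k_p}>."""
    def score(k_p: int) -> float:
        return sum(c * phi(k_p, j, vocab_size)
                   for j, c in enumerate(counts, start=1))
    return max(range(1, vocab_size), key=score)


if __name__ == "__main__":
    # Hypothetical counts for two texts over a toy vocabulary of size 8,
    # both biased towards the watermarking function with k_p = 3.
    text_a = [3, 5, 7, 2, 7, 5, 3, 8]
    text_b = [3, 5, 7, 3, 7, 5, 3, 7]
    combined = combine_counts([text_a, text_b])
    print(combined, extract_k_p(combined, 8))
```

Summing the counts before scoring lets the watermark signal from each text add coherently, while the noise in each text is uncorrelated with the watermarking function.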

Appendix F Experimental details and additional results for attacks

F.1 $\mathbb{A}_1$

Following Kamaruddin et al. (2018), we design three types of attack for $\mathbb{A}_1$: insertion, deletion, and synonym substitution attacks. Attack strength indicates the ratio of attacked words to the total number of words in a given content.

Insertion attack. We consider two types of insertion attacks mentioned in Kamaruddin et al. (2018):

(1) Localized insertion: this kind of attack inserts a random word into the original content at a random position. This is labeled as “local” in Figure 6.

(2) Dispersed insertion: multiple random words are added in multiple random positions into the original content. In our experiment, we iteratively insert a random English word into a random position of the original content.

Deletion attack. Random words are deleted, to attempt to distort the watermark in the original content.

Synonym substitution attack. Given original content, the synonym substitution attack tries to replace some words with their synonyms. In our experiments, we use the Natural Language Toolkit (NLTK) (Bird et al., 2009) to find a set of synonyms for a certain word, then choose a random word in this synonym set to replace the original word. We used the random function in the NumPy library (Harris et al., 2020) to randomly select words to be substituted for these types of attacks.
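The three word-level attacks above can be sketched as follows. This is a minimal illustration rather than the experimental code: it uses Python's standard `random` module and a hand-written synonym table in place of NumPy and NLTK so that it stays self-contained, and the word pool, the synonym table, and the rounding of the attack strength are all assumptions.

```python
import random

def insertion_attack(words, strength, vocabulary, rng):
    """Dispersed insertion: add random words at random positions."""
    attacked = list(words)
    for _ in range(round(strength * len(words))):
        attacked.insert(rng.randrange(len(attacked) + 1), rng.choice(vocabulary))
    return attacked

def deletion_attack(words, strength, rng):
    """Delete a random subset of the words."""
    drop = set(rng.sample(range(len(words)), round(strength * len(words))))
    return [w for i, w in enumerate(words) if i not in drop]

def synonym_attack(words, strength, synonyms, rng):
    """Replace randomly chosen words with a random synonym, when one is known."""
    attacked = list(words)
    for i in rng.sample(range(len(words)), round(strength * len(words))):
        options = synonyms.get(attacked[i])
        if options:
            attacked[i] = rng.choice(options)
    return attacked

rng = random.Random(0)
words = "the quick brown fox jumps over the lazy dog today".split()
vocabulary = ["apple", "river", "cloud"]                    # hypothetical word pool
synonyms = {"quick": ["fast", "speedy"], "lazy": ["idle"]}  # hypothetical table

print(insertion_attack(words, 0.2, vocabulary, rng))
print(deletion_attack(words, 0.2, rng))
print(synonym_attack(words, 0.2, synonyms, rng))
```

Here the attack strength is interpreted as the fraction of words touched, so a strength of 0.2 on a 10-word text inserts, deletes, or attempts to substitute 2 words.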

F.2 $\mathbb{A}_2$

Translation attack was performed with gpt-3.5-turbo-0613, with the following prompts, where the language field is “Spanish” and “English”.

{
'role': 'system',
'content': 'Translate the provided piece of text to {language}.'
}
{
'role': 'user',
'content': '{text}'
}
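As a rough sketch, the prompt above could be assembled and applied once per target language as shown below. The `chat` callable is hypothetical (it stands for whatever client sends the messages to the model and returns the reply as a string), and applying the languages in the order Spanish then English is an assumption about how the two settings are chained.

```python
def build_translation_messages(text, language):
    """Return chat messages in the system/user layout shown above."""
    return [
        {"role": "system",
         "content": f"Translate the provided piece of text to {language}."},
        {"role": "user", "content": text},
    ]

def translation_attack(text, chat, languages=("Spanish", "English")):
    """Pass the text through each target language in turn.

    `chat` is a hypothetical callable: it takes a list of messages and
    returns the model reply as a string.
    """
    for language in languages:
        text = chat(build_translation_messages(text, language))
    return text
```

Keeping the model call behind a callable makes the sketch independent of any particular client library.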

Paraphrase attack was performed with llama-2-13b-hf, prompted in the following format.

[INST] <<SYS>>
Paraphrase the user provided text while preserving semantic similarity. Do not include any other sentences in the response, such as explanations of the paraphrasing. Do not summarise.
<</SYS>>
{text} [/INST]
Here is a paraphrased version of the text while preserving the semantic similarity:

We ran further experiments using different LLMs to perform the paraphrasing attack. The robust verifiability of Waterfall, M-bit, and P-nlw is reported in Table 5. Waterfall achieves significantly higher robust verifiability than the benchmarks under the paraphrasing attack across the different LLMs.

Table 5: Robust verifiability under paraphrasing attack with different LLMs.

| Method | gemma-7b-it | Llama-2-7b-chat-hf | Mixtral-8x7B-Instruct-v0.1 | gpt-3.5-turbo |
| --- | --- | --- | --- | --- |
| Waterfall | 0.880 | 0.881 | 0.701 | 0.760 |
| M-bit | 0.524 | 0.509 | 0.522 | 0.385 |
| P-nlw | 0.374 | 0.359 | 0.467 | 0.512 |

Model pages: gemma-7b-it (https://huggingface.co/google/gemma-7b-it), Llama-2-7b-chat-hf (https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), Mixtral-8x7B-Instruct-v0.1 (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1).

F.3 $\mathbb{A}_3$

Table 6 shows the results of the $\mathbb{A}_3$ overlap watermarking attack on Waterfall, where the overlapping watermark differs in either $\mu$ or $k_p$. Waterfall achieves high robust verifiability under the overlap attack in both cases.

Table 6: Robust verifiability under overlap watermarking attack with different $\mu$ or $k_p$.

| Attack | Pre-attack | Post-attack |
| --- | --- | --- |
| Overlap $\mu$ | 0.992 | 0.815 |
| Overlap $k_p$ | 0.992 | 0.743 |

$\mathbb{A}_3$ on benchmarks with complement binary key

We consider the worst-case scenario of robust verifiability under $\mathbb{A}_3$ for the two traditional approaches, P-nlw and M-bit. Because these two methods embed binary keys in the watermarking stage, we apply $\mathbb{A}_3$ with the complement of the binary watermark key that was extracted as part of the verification process (replacing bit 0 with bit 1 and vice versa), to illustrate the worst-case scenario. We conduct this experiment with the same settings as in Section 4.1. The results are shown in Table 7 and Figure 16. Note that adversaries could engineer their attacks by performing overlap watermarking with a mixture of watermark bits, random bits, and complement bits, to target any AUROC value between the pre-attack and overlap complement AUROC.

Table 7: AUROC of P-nlw and M-bit under $\mathbb{A}_3$ with the complement of the binary watermark key (worst-case scenario).

| Method | Pre-attack | Overlap complement |
| --- | --- | --- |
| P-nlw | 0.8848 | 0.1780 |
| M-bit | 0.9882 | 0.0547 |

Figure 16: ROC curves and corresponding AUROC values of $\mathbb{A}_3$ with the complement of the binary watermark key for P-nlw and M-bit.
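To make the complement attack concrete, the sketch below flips the bits of a binary key and measures how often the result agrees with the original key. It assumes, purely for illustration, a verifier that scores text by bit agreement with the client's key; the actual verification procedures of P-nlw and M-bit are more involved, and the 16-bit key length is hypothetical.

```python
import random

def complement(bits):
    """Flip every bit of a binary watermark key."""
    return [1 - b for b in bits]

def bit_match_rate(extracted, key):
    """Fraction of positions where the extracted bits agree with the key."""
    return sum(e == k for e, k in zip(extracted, key)) / len(key)

def mixed_overlap(key, flip_fraction, rng):
    """Flip only a chosen fraction of the key bits (a partial attack)."""
    flip = set(rng.sample(range(len(key)), round(flip_fraction * len(key))))
    return [1 - b if i in flip else b for i, b in enumerate(key)]

rng = random.Random(0)
key = [rng.randint(0, 1) for _ in range(16)]  # hypothetical 16-bit key

print(bit_match_rate(key, key))                           # intact watermark
print(bit_match_rate(complement(key), key))               # full complement
print(bit_match_rate(mixed_overlap(key, 0.5, rng), key))  # partial attack
```

Under this simplified scoring, a fully complemented key agrees with the original on no positions at all, which is lower than the roughly one-half agreement expected from unwatermarked text. This is consistent with AUROC values falling below 0.5, and flipping only a fraction of the bits lets an adversary land anywhere in between.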

F.4 $\mathbb{A}_4$

To perform the in-context prompting experiments, we made use of gpt-3.5-turbo-1106 to generate 3 questions each for 300 text articles. The following prompt was used to generate the questions.

{
'role': 'system',
'content': 'Using the provided article, create 3 reading comprehension questions.'
}
{
'role': 'user',
'content': '{text}'
}

We then separately prompt gpt-3.5-turbo-1106, providing the watermarked text as the context to answer the questions.

{
'role': 'system',
'content': 'Using the provided article, answer the questions.'
}
{
'role': 'user',
'content': '{text}\n\n{questions}'
}
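The two prompting stages can be chained as in the sketch below. The `chat` callable is hypothetical and stands for any client that sends messages to the model and returns the reply as a string. Which version of the article is used to generate the questions is left to the caller, since the text above only specifies that the watermarked text is the context for answering.

```python
def question_messages(article):
    """Messages asking the model to write 3 reading comprehension questions."""
    return [
        {"role": "system",
         "content": "Using the provided article, create 3 reading comprehension questions."},
        {"role": "user", "content": article},
    ]

def answer_messages(article, questions):
    """Messages asking the model to answer the questions from the article."""
    return [
        {"role": "system",
         "content": "Using the provided article, answer the questions."},
        {"role": "user", "content": f"{article}\n\n{questions}"},
    ]

def in_context_attack(question_source, watermarked_article, chat):
    """Generate questions, then answer them with the watermarked text as context.

    `chat` is a hypothetical callable mapping a list of messages to a reply.
    """
    questions = chat(question_messages(question_source))
    return chat(answer_messages(watermarked_article, questions))
```

The returned answers are the text that would then be tested for the watermark.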

F.5 Additional results for robust verifiability

Beyond AUROC reported in the main paper, we additionally report the true positive rate (TPR) at fixed false positive rate (FPR) of 0.1 and 0.01 for verifiability and robust verifiability under different attacks across different watermarking methods in Table 8.

Table 8: TPR at FPR of 0.1 and 0.01 for verifiability and robust verifiability.

| FPR | Method | Pre-attack | $\mathbb{A}_{2-T}$ | $\mathbb{A}_{2-P}$ | $\mathbb{A}_3$ | $\mathbb{A}_4$ |
| --- | --- | --- | --- | --- | --- | --- |
| 0.1 | Waterfall | 0.982 | 0.890 | 0.750 | 0.640 | 0.472 |
| 0.1 | P-nlw | 0.667 | 0.078 | 0.110 | 0.281 | 0.114 |
| 0.1 | M-bit | 0.993 | 0.126 | 0.126 | 0.520 | 0.000 |
| 0.01 | Waterfall | 0.910 | 0.608 | 0.405 | 0.284 | 0.122 |
| 0.01 | P-nlw | 0.110 | 0.007 | 0.010 | 0.037 | 0.032 |
| 0.01 | M-bit | 0.693 | 0.126 | 0.000 | 0.126 | 0.000 |

Note that with Waterfall, verification performance improves drastically when multiple pieces of text are available. This is a realistic setting, since there would typically be multiple samples from an adversary that can be tested for the watermark. In practice, IP holders are concerned about large-scale unauthorized IP use (i.e., multiple infringements) rather than one-off cases.

To demonstrate this, we ran an experiment where we test our watermarks given multiple samples under attack $\mathbb{A}_4$. Despite the low TPR of 0.472 and 0.122 at FPR of 0.1 and 0.01 respectively when only 1 sample is considered, our results demonstrate that given just 10 samples, we are able to achieve a TPR of 0.907 even with the strict requirement of an FPR of 0.01. With the looser requirement of an FPR of 0.1, the TPR reaches 1.000 given 17 samples. The looser requirement is also realistic because in practice, IP holders may use this as a screening tool to flag suspicious parties for further investigation, and hence would tolerate a higher FPR.
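The effect of pooling samples can be illustrated with a simple Gaussian model. This is a sketch under strong assumptions and not the procedure used in the experiments: verification scores are taken to be unit-variance Gaussians for both watermarked and unwatermarked text, scores from independent samples are averaged, and the class separation is back-solved from the single-sample operating point.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def tpr_with_n_samples(single_tpr, fpr, n):
    """TPR at a fixed FPR after averaging verification scores over n samples.

    Assumes unit-variance Gaussian scores for both classes, so averaging
    n independent samples scales the class separation by sqrt(n).
    """
    threshold = STD_NORMAL.inv_cdf(1 - fpr)
    # Separation implied by the single-sample operating point.
    separation = threshold - STD_NORMAL.inv_cdf(1 - single_tpr)
    return 1 - STD_NORMAL.cdf(threshold - separation * n ** 0.5)

for n in (1, 5, 10, 17):
    print(n,
          round(tpr_with_n_samples(0.122, 0.01, n), 3),
          round(tpr_with_n_samples(0.472, 0.1, n), 3))
```

Starting from the single-sample values in Table 8, this simplified model gives a TPR of roughly 0.9 at an FPR of 0.01 with 10 samples and a TPR very close to 1 at an FPR of 0.1 with 17 samples, which is in line with the trend reported above.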

Appendix G Waterfall in code watermarking

G.1 Code watermarking experiment settings

In the main paper, we report the result of code watermarking on the MBJSP dataset (Athiwaratkun et al., 2023) with the data ownership problem setting. This is a JavaScript dataset including around 800 crowd-sourced JavaScript programming problems. To show the ability of Waterfall on watermarking other programming languages, we also perform data ownership watermarking on Python datasets, which can be found in Appendix G.5.

In this setting, we use Phind-CodeLlama-34B-v2 (https://huggingface.co/Phind/Phind-CodeLlama-34B-v2) as the LLM paraphraser for code watermarking, the square wave basis with $k_p = 1$ (Appendix C) for watermark perturbation, and a randomly chosen $\mu = 10$ in all code experiments. By default, we use "Waterfall code" to denote Waterfall under these code watermarking settings. Moreover, we also show that prompt engineering techniques, such as Reflexion (Shinn et al., 2023), could improve the fidelity of watermarked code while preserving the verifiability (Section G.2). For SrcMarker (Yang et al., 2024), we configured their algorithm for 16-bit watermarks, to demonstrate scalability of at least $10^5$.

For verifiability evaluation, we use the same evaluation protocol as for article watermarking in Section 4.1. The resulting ROC curve and AUROC values for Waterfall code are shown in Figure 17.

Figure 17: The ROC curves and corresponding AUROC values on the MBJSP dataset using Waterfall code.
Watermarked code fidelity evaluation

As mentioned in the main paper, we evaluate the fidelity of the watermarked code by evaluating its accuracy based on functional tests for the original code and use the standard pass@k metric (Kulal et al., 2019; Chen et al., 2021) for evaluating functional correctness. Given the deterministic nature of the baseline SrcMarker (Yang et al., 2024), which inherently upholds fidelity, the pass@10 metric is adopted to facilitate a fair comparison between Waterfall and SrcMarker in terms of fidelity performance. This metric specifically measures the likelihood of Waterfall producing watermarked code that passes unit tests within 10 generation attempts. The pass@10 metric is also realistic in practice as it aligns with real-world scenarios where clients can assess the quality of watermarked code through predefined tests and subsequently regenerate the code if test failures arise.
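For reference, the pass@k metric of Chen et al. (2021) is usually computed with the unbiased estimator $1 - \binom{n-c}{k} / \binom{n}{k}$, where $n$ is the number of generated samples and $c$ the number that pass the tests. The sketch below implements that estimator. How many generations per problem were drawn in our experiments is not restated here, so the values of `n` and `c` in the usage lines are purely illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n generations of which c pass the tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative values only: 20 generations, 3 of which pass the unit tests.
print(pass_at_k(n=20, c=3, k=10))
# With n equal to k, the estimate is 1 if any attempt passes and 0 otherwise.
print(pass_at_k(n=10, c=1, k=10), pass_at_k(n=10, c=0, k=10))
```

When exactly 10 attempts are generated per problem, pass@10 reduces to the fraction of problems for which at least one watermarked version passes its tests.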

To evaluate the functional correctness of code, we adapt the JavaScript evaluation protocol from Athiwaratkun et al. (2023) for the MBJSP dataset. For Python evaluation, we adapt the HumanEval (Chen et al., 2021) code evaluation protocol (https://github.com/openai/human-eval) and the test scripts from both datasets (Chen et al., 2021; Austin et al., 2021). However, watermarking often changes the original function name into a related name, so we use the Levenshtein distance to locate the renamed function in the watermarked code. For a more precise evaluation of the watermarked code, this function name matching process could be improved by using other similarity measures, such as the Semantic Textual Similarity (STS) score.
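The name matching step can be sketched as below: compute the Levenshtein distance between the original function name and each function name found in the watermarked code, then pick the closest. How the candidate names are collected from the watermarked code is not specified above, so the candidate list here is a hypothetical input.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def closest_function_name(original_name, candidate_names):
    """Pick the candidate with the smallest edit distance to the original name."""
    return min(candidate_names, key=lambda name: levenshtein(original_name, name))

# Hypothetical names extracted from a watermarked snippet.
print(closest_function_name("sumList", ["parseArgs", "sumOfList"]))
```

The matched name is then the one that the unit tests are redirected to call.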

G.2 Waterfall code + Reflexion methodology

In this section, we show that some prompt engineering approaches can improve the fidelity of the watermarked code without hurting its verifiability. Adapting the techniques from Shinn et al. (2023), we correct the watermarked code through an LLM-based self-reflection mechanism. After being watermarked with Waterfall code, the watermarked code undergoes a correction process via multiple feedback loops (3 feedback loops in our experiments). Each feedback loop contains two self-reflection components that perform syntax correction and functional similarity alignment. Each self-reflection component performs two main steps: 1) evaluating or analyzing the given information based on task criteria, e.g., the correctness of programming syntax, and 2) regenerating "better" code based on the given feedback.

Using the same LLM as in Waterfall code for the self-reflection components plays a crucial role in this combination. First, LLMs are well suited to producing and acting on linguistic feedback, which carries more information than the scalar results of an evaluation step. Second, performing the correction step with the watermarking LLM helps the final code preserve the robust and scalable watermark signal, which is the ultimate goal of our text watermarking framework. The prompts used for the syntax correction and functional similarity alignment steps are given in Section G.6.
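The correction process described in this section can be outlined as follows. This is a simplified sketch, not the implementation used in the experiments: `llm` is a hypothetical callable wrapping the watermarking LLM, each self-reflection component is collapsed into a single call, and running syntax correction before functional alignment within a loop is an assumption.

```python
def reflexion_correct(original_code, watermarked_code, llm, n_loops=3):
    """Run syntax correction and functional alignment for n_loops rounds.

    `llm(task, **fields)` is a hypothetical callable that fills the prompt
    for `task` with `fields`, queries the watermarking LLM (so that the
    watermark perturbation is still applied), and returns the revised code.
    """
    code = watermarked_code
    for _ in range(n_loops):
        code = llm("syntax_correction", watermarked_code=code)
        code = llm("functional_alignment",
                   original_code=original_code, watermarked_code=code)
    return code
```

Because every revision is generated by the watermarking LLM, the corrected code continues to carry the watermark signal rather than diluting it.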

The effect of the Reflexion approach is shown in Figure 18: Reflexion improves fidelity while maintaining the high verifiability of Waterfall code. We therefore apply this technique in all code watermarking experiments.

Figure 18: The effect of Reflexion in Waterfall code on MBJSP dataset

G.3 Verifiability and fidelity trade-off

Figure 19 shows that the trade-off between verifiability and fidelity can be adjusted via $\kappa$. Similar to article watermarking in Section 4.1, increasing the watermark strength $\kappa$ increases verifiability but lowers fidelity. Users can therefore adjust $\kappa$ to balance the trade-off based on their preference.

Figure 19: Verifiability and fidelity trade-off of Waterfall code on the MBJSP dataset

G.4 Scalability of Waterfall in code watermarking

One of the advantages of Waterfall over the baseline SrcMarker is scalability. SrcMarker's verifiability depends heavily on the number of watermarked bits (which determines its scalability): the larger the number of bits, the worse the verifiability (Yang et al., 2024). Therefore, to ensure high verifiability, SrcMarker cannot support larger scalability. In contrast, the verifiability of Waterfall is independent of its scalability, which depends only on the vocabulary size of the tokenizer. In our experiments (Table 3), we use Phind-CodeLlama-34B-v2, which has the same large vocabulary size as llama-2-13b-hf, giving $M \sim 10^{130274}$, far larger than the $M \sim 10^{5}$ of 16-bit SrcMarker.

G.5 Waterfall in watermarking Python code

Inheriting the multi-lingual ability of LLMs, Waterfall can easily be applied to new programming languages without the need for pre-defined syntax rules. This is a major advantage of Waterfall over AST-based code watermarking approaches such as SrcMarker (Yang et al., 2024). We show that Waterfall can also watermark Python code, through experiments on the MBPP dataset (Austin et al., 2021), which includes around 1000 crowd-sourced Python programming problems. The verifiability and fidelity results of Waterfall on watermarking Python code are reported in Table 9.

| Dataset | pass@10 | AUROC |
| --- | --- | --- |
| MBJSP | 0.969 | 0.904 |
| MBPP | 0.954 | 0.897 |

Table 9: Waterfall code achieves high verifiability and fidelity on MBJSP and MBPP datasets.

G.6 LLM prompts for code watermarking

We use the following prompts, formatted with the chat template of Phind-CodeLlama-34B-v2, which follows the Alpaca instruction prompt format.

Code paraphrasing

### System Prompt
You are given a user-provided code snippet.
Please do ONLY two tasks:
1. Refactor the provided code snippet with the following requirements:
- retain all imported libraries.
- keep the same programming language.
- retain the function names and functionality of the code.
- dont complete the code, just refactor it.
- dont explain.
2. Return the response with the refactored code snippet in the following format strictly:
‘‘‘
<refactored code>
‘‘‘
Do not generate any comments or explaining texts.
### User Message
‘‘‘
{input code}
‘‘‘
### Assistant
Here is the refactored code:
‘‘‘

Functional similarity alignment

### System Prompt
You are given two code snippets, code A and code B. Modify code B based on code A, such that these two code have the same functionality, input, and output. Return the response with corrected code B in the following format strictly:
‘‘‘
<corrected code B>
‘‘‘
Do not generate any comments or explaining texts.
### User Message
code A:
‘‘‘
{original code}
‘‘‘
code B:
‘‘‘
{watermarked code}
‘‘‘
### Assistant
Here is the code B:
‘‘‘

Code syntax correction

### System Prompt
Double-check the code to make sure the syntax is correct. Only generate the corrected code in the following format.
‘‘‘
<corrected code>
‘‘‘
Do not generate any comments or explaining texts.
### User Message
‘‘‘
{watermarked code}
‘‘‘
### Assistant
Here is the corrected code:
‘‘‘

G.7 Watermarked code examples

Examples of code watermarking by Waterfall are illustrated in Figure 20. Note that Waterfall code changes not only the variable names but also the ways of representing the same code logic, which results in high verifiability while preserving high fidelity.

Figure 20: Example of watermarked code with Waterfall. Waterfall code changes not only the variable names but also the ways of representing the same code logic (e.g., ternary operator vs. conditional statement), which results in high verifiability while preserving code functionality (high fidelity).

Appendix H Details of experiments on LLM data provenance

H.1 LLM fine-tuning experimental setup

To fine-tune the 1.5B parameter gpt2-xl model, we used the LoRA framework (Hu et al., 2022), with LoRA rank of 16 and target modules c_attn, c_proj, c_fc. The models were fine-tuned for a total of 5 epochs, with default batch size of 128 and learning rate of 0.0003. Fine-tuning was performed on a single Nvidia L40 GPU, requiring an average of approximately 15 minutes per client (4000 data samples for each client) for the fine-tuning of the model.

H.2 Verifiability of watermark in the model fine-tuned over watermarked text

To evaluate the verifiability of the watermark, we prompted the fine-tuned model with randomly selected abstracts from the training set used to fine-tune the model. We truncated the abstracts to the first 50 tokens, which were supplied to the model without any additional prompts, for the model to generate completions of the input. We limited the generation to a maximum of 100 newly generated tokens. Note that in real applications, generating more tokens could improve the verifiability performance, as we have demonstrated in the data ownership experiments. Only the generated tokens were considered when evaluating the verifiability of the watermark. The same ID and $k_p$ used during watermarking were used to perform verification. For the model fine-tuned on the original unwatermarked text, the ID and $k_p$ that were used for the corresponding watermarked text were used for verification.
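The prompting protocol above amounts to the following bookkeeping on token sequences. The `generate` callable is hypothetical and stands for the fine-tuned model's generation routine; it is assumed to return the prompt tokens followed by the continuation, which is why the prompt is sliced off before verification.

```python
def completion_for_verification(abstract_tokens, generate,
                                prompt_len=50, max_new_tokens=100):
    """Return only the newly generated tokens for watermark verification.

    `generate(prompt_tokens, max_new_tokens)` is a hypothetical callable that
    returns the prompt tokens followed by the model's continuation.
    """
    prompt = abstract_tokens[:prompt_len]
    output = generate(prompt, max_new_tokens)
    # Drop the prompt so that copied training text is not scored.
    return output[len(prompt):][:max_new_tokens]
```

Excluding the prompt matters because the prompt is itself watermarked training text, and scoring it would inflate the measured verifiability.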

H.3 Fidelity of model fine-tuned over watermarked text

We used lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) (Gao et al., 2021) to evaluate the fidelity of the fine-tuned models over several different datasets (Gao et al., 2020; Merity et al., 2016; Wang et al., 2018; Dolan and Brockett, 2005; Bisk et al., 2020; Levesque et al., 2011). Table 10 shows that the models fine-tuned over the watermarked datasets differ only minimally in fidelity from the models fine-tuned over the unwatermarked datasets. This shows that the act of watermarking data used for fine-tuning does not significantly affect its value for fine-tuning.

Table 10: Fidelity of models fine-tuned using watermarked text (Watermarked) and unwatermarked text (Unwatermarked) for different numbers of clients $M$, evaluated over the various datasets.

| Dataset | Training text | $M=1$ | $M=5$ | $M=10$ | $M=20$ | $M=100$ |
| --- | --- | --- | --- | --- | --- | --- |
| Pile-ArXiv (ppl) | Watermarked | 2.209 | 2.218 | 2.218 | 2.180 | 2.166 |
| Pile-ArXiv (ppl) | Unwatermarked | 2.192 | 2.210 | 2.197 | 2.170 | 2.154 |
| Wikitext (ppl) | Watermarked | 1.771 | 1.770 | 1.780 | 1.787 | 1.818 |
| Wikitext (ppl) | Unwatermarked | 1.766 | 1.769 | 1.774 | 1.783 | 1.814 |
| MRPC (acc) | Watermarked | 0.662 | 0.618 | 0.674 | 0.581 | 0.326 |
| MRPC (acc) | Unwatermarked | 0.679 | 0.627 | 0.627 | 0.380 | 0.314 |
| PIQA (acc) | Watermarked | 0.687 | 0.676 | 0.682 | 0.676 | 0.673 |
| PIQA (acc) | Unwatermarked | 0.686 | 0.682 | 0.683 | 0.680 | 0.678 |
| WNLI (acc) | Watermarked | 0.563 | 0.620 | 0.535 | 0.549 | 0.493 |
| WNLI (acc) | Unwatermarked | 0.620 | 0.577 | 0.592 | 0.563 | 0.535 |

Appendix I Adapting model watermarking schemes into Waterfall framework

There exists a separate area of research addressing a different problem setting, model watermarking, where instead of watermarking existing text, newly generated text from LLMs is watermarked. Contrary to the setting of text watermarking, where scalability is a critical requirement, model watermarking schemes are only concerned with a single client (the LLM provider).

Despite this, we could try adapting some model watermarking schemes into the Waterfall framework, though some features of our framework may not be achievable. One such scheme that can be adapted is KGW (Kirchenbauer et al., 2023). To adapt KGW, lines 5 and 6 of Algorithm 1 would be replaced with "Green" and "Red" lists, with $\gamma = 0.5$. In order to satisfy the scalability criterion, we appended our watermark ID $\mu$ to the hash of the previous token, to be used to seed the random partition of the vocabulary list into "Green" and "Red" lists. For verification, we used the $z$-score as proposed in their paper.
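A minimal sketch of this adaptation is given below. It is illustrative only: the way the previous token and $\mu$ are combined (a SHA-256 digest of their concatenation) and the use of Python's `random` module for the partition are assumptions, not the seeding function used in the experiments. The $z$-score follows the standard KGW form $z = (|s|_G - \gamma T) / \sqrt{T \gamma (1 - \gamma)}$, where $|s|_G$ is the number of green tokens among $T$ scored tokens.

```python
import hashlib
import math
import random

GAMMA = 0.5  # fraction of the vocabulary placed on the green list

def green_list(prev_token, mu, vocab_size):
    """Green list seeded by the previous token together with the client ID mu."""
    digest = hashlib.sha256(f"{prev_token}:{mu}".encode()).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return set(rng.sample(range(vocab_size), int(GAMMA * vocab_size)))

def z_score(tokens, mu, vocab_size):
    """Standardized excess of green tokens over the GAMMA baseline."""
    scored = len(tokens) - 1  # the first token has no predecessor
    hits = sum(tok in green_list(prev, mu, vocab_size)
               for prev, tok in zip(tokens, tokens[1:]))
    return (hits - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))

# Toy usage: a sequence built to be fully green for mu = 10.
tokens = [0]
for _ in range(100):
    tokens.append(min(green_list(tokens[-1], 10, 1000)))
print(z_score(tokens, mu=10, vocab_size=1000))
print(z_score(tokens, mu=11, vocab_size=1000))
```

Because $\mu$ enters the seed, each client induces its own sequence of green lists, so text that is fully green for one ID is expected to look close to unwatermarked when scored under a different ID.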

Despite our various additions to the scheme (such as increasing its scalability by adjusting the original function for seeding the random partitioning), this Waterfall variant underperforms our original proposed Waterfall implementation, and is still missing key features such as the ability for clients to embed and extract metadata from text after verification with their ID.

Figure 21 shows that Waterfall (Ours) has a strictly better fidelity-verifiability Pareto frontier, i.e., for any required fidelity (STS score), Waterfall (Ours) has higher verifiability than Waterfall (KGW).

Figure 21: Strictly better fidelity-verifiability Pareto frontier for Waterfall (Ours) than Waterfall (KGW).

We also compared the robust verifiability of Waterfall (Ours) and Waterfall (KGW). For a fair comparison, the watermark strength was selected such that the STS scores were similar for both variants (Waterfall (Ours): 0.887; Waterfall (KGW): 0.885). Table 11 shows that, due to the better Pareto frontier of Waterfall (Ours), we are able to achieve higher verifiability both before and after attacks, with the watermarked texts at the same fidelity as Waterfall (KGW).

Table 11: Waterfall (Ours) has better robust verifiability than Waterfall (KGW).

| Method | Pre-attack | $\mathbb{A}_{2-T}$ | $\mathbb{A}_{2-P}$ | $\mathbb{A}_3$ |
| --- | --- | --- | --- | --- |
| Waterfall (Ours) | 0.992 | 0.951 | 0.881 | 0.815 |
| Waterfall (KGW) | 0.977 | 0.915 | 0.811 | 0.718 |

Appendix J Differences with model-centric watermarking

Our paper focuses on text watermarking, where our problem setting (Section 2) is on watermarking existing text (e.g., containing IP) produced by many clients (by any method, including human writing), such that each client can verify text that was watermarked with their own unique watermark, and additionally ensure that the watermark is robust to attacks and downstream uses by other LLMs (e.g., prompting, fine-tuning).

On the other hand, there exists a separate line of work focusing on a different problem of model-centric watermarking, which marks output from these watermarked models (e.g., differentiate text generated by these LLMs vs. that by humans).

The problem setting of such model-centric watermarking considers a specific LLM, and addresses how to design an algorithm that allows distinguishing the output of that specific LLM from other text (e.g., human generated). In this setting, the scalability issue is ignored, as only 1 client (the LLM provider) is considered. Additionally, LLM watermarking does not watermark individual original texts, and hence does not have the challenging requirement of preserving the semantic content of these original texts. Rather, it typically only considers generative text quality through metrics like perplexity. Therefore, LLM watermarking methods tackle a different problem and should not be confused with the focus of our work.

To provide more detailed comparison on the differences with our work, we further separate model-centric watermarking into the following classifications:

  1. Text watermarking of text generated from black-box LLMs.

  2. White-box LLM watermarking leading to generated text which contains the model's watermarks.

  3. Black-box LLM watermarking such that a watermarked model's output is passed to black-box models, with outputs that are still watermarked.

J.1 Text watermarking of text generated from black-box LLM

To the best of our knowledge, the only work related to this topic that we have found so far is the unpublished work of Yang et al. (2023), which applies text watermarking methods to the specific use case of text generated by black-box language models and is therefore essentially a text watermarking paper. The text watermarking method of Yang et al. (2023) is similar to the M-bit benchmark (Yoo et al., 2023) that we considered in the main paper, and essentially encodes watermarks by first identifying words to replace (based on linguistic rules), then finding synonyms for them which are used to represent bits of the watermarking signal. Although the two methods differ in how they select which words to watermark (sentence/word embedding similarity for Yang et al. (2023) and a second BERT model for M-bit), given their similar characteristics, both methods ultimately still suffer from lower robust verifiability compared to Waterfall.

Nonetheless, we have performed additional experiments with their method on the same c4-realnewslike dataset from our paper, and considered the attacks $\mathbb{A}_2$ and $\mathbb{A}_3$. As shown in Table 12, Waterfall has significantly higher robust verifiability than Yang et al. (2023), similar to its better performance over the other benchmarks M-bit and P-nlw.

Table 12: Comparison of robust verifiability of Waterfall versus Yang et al. (2023).

| Method | Pre-attack | $\mathbb{A}_{2-T}$ | $\mathbb{A}_{2-P}$ | $\mathbb{A}_3$ |
| --- | --- | --- | --- | --- |
| Waterfall | 0.992 | 0.951 | 0.881 | 0.815 |
| Yang et al. (2023) | 0.975 | 0.761 | 0.659 | 0.474 |

J.2 White-box LLM watermarking

This line of work assumes access to the model and directly changes the model generation process to embed the watermark, primarily to differentiate text generated by specific LLMs from, for example, text written by humans. This type of model watermarking has become a rapidly growing field, especially since the proposal of the KGW watermark (Kirchenbauer et al., 2023). Although these works eventually end up with (model-centric) watermarks in the output of LLMs, which is also text, they are actually solving a different problem setting from our work. Our work is focused on watermarking any given text, rather than watermarking an LLM such that all of its output ends up being watermarked.

Even though they are not directly comparable, as mentioned in the main paper, some of these white-box LLM watermarking works might be adapted as sub-routines of Waterfall if they meet our framework's requirements. We have run additional experiments to demonstrate this by introducing a new Waterfall framework implementation variant that swaps our watermarking scheme described in Section 3.3 with a modified KGW watermarking scheme, with changes to make it fit our framework, such as appending our watermark ID $\mu$ to the hash of the previous token, to be used to seed the random partition of the vocabulary list into "Green" and "Red" lists.

Despite our attempts to adapt the scheme (such as increasing its scalability by adjusting the original function for seeding the random partitioning), key features such as the ability for clients to embed and extract metadata from text after verification with their ID (Algorithm 3) will not be available for this Waterfall variant.

We ran additional experiments to compare this Waterfall variant [Waterfall (KGW)] with our original watermarking scheme [Waterfall (Ours)]. Figure 21 demonstrates that Waterfall (Ours) has a strictly better fidelity-verifiability Pareto frontier, i.e., for any required fidelity (STS score), Waterfall (Ours) has higher verifiability than Waterfall (KGW).

We also performed comparison of robust verifiability for Waterfall (Ours) vs. Waterfall (KGW). We can see that due to better Pareto frontier of Waterfall (Ours), with the watermarked texts at the same fidelity as Waterfall (KGW), we are able to achieve a higher verifiability both before and after attacks.

J.3 Black-box LLM watermarking

This line of work considers how to ensure that text generated from a client-controlled LLM is watermarked such that other black-box models (e.g., neural networks) owned by adversaries that rely on the watermarked LLM would also have their output watermarked. Similar to "white-box LLM watermarking" described above, the focus of these works is on watermarking the specific models in question, although the output of these models may be text, which is the channel through which the model watermarks are transferred. An example of this type of work is Li et al. (2023), whose methods are specific to model-centric training and watermarking, and hence cannot be applied to text watermarking.

Appendix K Comparison with plagiarism checkers

Although tackling the similar issue of IP protection and plagiarism detection, works on plagiarism checkers tackle a distinctly different problem from our problem setting, and cannot be used in our problem setting.

Firstly, contrary to watermarking, where a watermark signal is actively embedded into the text, traditional plagiarism detection depends on passive detection, typically via pairwise comparisons of a suspected text to a large corpus of reference text. In that setting, a single suspected text (or a small number of them) is to be examined for plagiarism. This is accomplished by maintaining a huge database of reference text, and each suspected text is compared pairwise to each piece of reference text. Such pairwise comparison of the suspicious text with all possible reference text is extremely computationally expensive (Foltỳnek et al., 2019). In our problem setting of identifying unauthorized usage of textual data, clients could desire to scan through the entire Internet's worth of textual content for potential plagiarism, and the sheer amount of data makes such techniques computationally infeasible. With watermarking, only the suspected text is required during the verification process, with no reference text to compare against.

Secondly, due to the requirement to maintain a huge database of reference text, which is costly for individual clients, this task is currently commonly subcontracted out to third-party detection systems (e.g., Turnitin). These vendors can have unfavorably broad licensing agreements regarding texts that are submitted for checking (de Zwart, 2018). Such approaches are not feasible in situations where either the original reference data or the suspected text is sensitive and cannot be shared with these external vendors, greatly limiting the applications in which plagiarism checkers can be deployed.

Appendix L Practical considerations for real world deployment of Waterfall

Waterfall’s initial setup and computational resource requirements for large-scale applications are low and practically viable. This makes actual large-scale deployment of text watermarking feasible, which is not possible with current state-of-the-art (SOTA) watermarking methods given their limitations and resource requirements.

We illustrate this by laying out two approaches (decentralized or centralized) to deploying Waterfall, both of which have low initial setup and computational cost requirements.

L.1 Decentralized deployment

In this approach, clients randomly generate their own IDs (given the large space of supportable IDs), and can perform the watermarking and verification operations on their own using their laptops with minimal setup.
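As a rough illustration of why independently generated IDs are unlikely to clash, the sketch below draws a random ID and estimates the collision probability using the standard birthday-bound approximation. The function names and the ID space size used in the usage note are hypothetical and chosen for illustration only; they are not Waterfall's actual settings.

```python
import math
import secrets


def generate_client_id(id_space_size: int) -> int:
    """Draw an ID uniformly at random from {0, ..., id_space_size - 1}."""
    return secrets.randbelow(id_space_size)


def collision_probability(num_clients: int, id_space_size: int) -> float:
    """Birthday-bound approximation of the probability that at least
    two of num_clients independently drawn IDs coincide."""
    exponent = -num_clients * (num_clients - 1) / (2 * id_space_size)
    return -math.expm1(exponent)  # 1 - exp(exponent), numerically stable
```

For example, with an assumed ID space of size 2**64 and one million clients, `collision_probability(1_000_000, 2**64)` is on the order of 1e-8.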

Setup

For most common text types/languages supported by LLMs, clients could immediately run Waterfall with no setup, given a default LLM and Waterfall settings, to generate the watermarked text $T_{\text{w}}$.

Computational cost

Waterfall’s watermarking computational cost is just that of running inference of the LLM paraphraser, with negligible overheads. Using a GPU available in many laptops (Nvidia RTX 5000), a user could use the Llama-2-13b model to watermark a text in $<25$s and already achieve strong performance, as shown in Table 2 in our paper. We expect the cost of running high-performance LLMs on personal devices (e.g., MacBooks, laptops with GPUs) to keep falling, given the rapidly evolving landscape of LLMs.

Waterfall’s verification operation is extremely fast and can be run on just CPU ($<0.04$s per text), without the need for any LLM. For practical applications, the verification operation will be the main operation run multiple times, rather than the watermarking operation (typically only once before the user publishes the text). Waterfall’s verification operator is 2-5 orders of magnitude faster than baseline text watermarking methods (Table 2 in our paper).

L.2 Centralized deployment

In this approach, central parties assign clients unique IDs and run the Waterfall watermarking and verification operations for them. This is similar to how LLM service providers offer interfaces or APIs for LLM queries.

Setup

At a minimum, the central parties could do the same as individuals in the decentralized approach and not need any setup. However, given their scale, they could also provide customized services by optimizing the choice of LLMs and Waterfall settings for specific uncommon text types or other user requirements (see Section L.3 below for clarification on adaptability).

Computational cost

Existing LLM service providers could easily provide this additional watermarking service to clients, given the minimal overheads of Waterfall over processing a single LLM chat API call. The speed of our verification operation even allows companies to provide value-added services such as near-real-time scanning of newly-published articles from target sources to detect any plagiarism.

L.3 Adaptability to different LLMs

A key strength of Waterfall is that it evolves together with the evolving landscape of LLMs, with increasingly better watermarking performance as LLMs become more capable. As LLMs become more capable, they would be able to better preserve the semantic meaning of the original text while still embedding watermarks when used as LLM paraphrasers via Waterfall. This allows Waterfall to achieve a higher fidelity-verifiability Pareto frontier, and to reduce any fidelity degradation while using higher watermarking strength for greater robust verifiability.

To illustrate, we have performed additional experiments with other LLM models as paraphraser models, with the same c4-realnewslike dataset used in the main paper. Figure 22 shows that the newer/larger models have higher Pareto fronts with higher STS scores for the same verifiability values. Going forward, we expect further significant improvements in LLM capabilities, allowing Waterfall’s performance to also significantly improve.

Figure 22: Plot of Pareto frontier of different LLMs, where larger/newer models show better Pareto fronts on the fidelity-verifiability trade-off.

L.4 Selection of watermarking LLM and hyperparameter

As with any adaptable method, Waterfall would require some effort to gain boosted performance in specific domains (e.g., text type or language). That said, the Waterfall framework is designed to reduce such effort, and it is relatively easy for a user to perform such fine-tuning given that there is only 1 hyperparameter to tune (watermarking strength $\kappa$) and the choice of LLM paraphraser. For example, the user could just follow these simple steps:

  1. Identify the SOTA LLM for the domain, to use as the LLM paraphraser component. As a domain expert and content creator (of the text to be watermarked), the client should be familiar with what is available. Given the evolving landscape of LLMs, we believe that it is realistic for each domain to have a relatively capable fine-tuned model.

  2. Run Waterfall with the default watermarking strength $\kappa$ and assess whether the fidelity and robust verifiability of the text meet expectations. As a domain expert, the client can assess whether the text has sufficient fidelity, or use a domain-specific fidelity metric to automate the check. The client can also use an automated suite of robustness checks (comprising standard attacks) to assess the expected robust verifiability of the watermarked text.

  3. If the results are not up to expectation, perform optimization over the $\kappa$ hyperparameter using standard AutoML methods like Bayesian Optimization (BO). This could be automated, especially if a fidelity metric is provided, but manual sequential checks could also be used given that there is just 1 hyperparameter and a query-efficient approach like BO.
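The steps above can be sketched as a small search loop over the watermarking strength. This is a minimal sketch under stated assumptions: it uses a plain grid search as a simpler stand-in for BO, and `watermark_fn`, `fidelity_fn`, and `verifiability_fn` are hypothetical callables supplied by the client (e.g., wrapping the chosen LLM paraphraser, a domain-specific fidelity metric, and a robustness check suite); they are not part of any published Waterfall API.

```python
from typing import Callable, Iterable, Optional


def select_kappa(
    text: str,
    kappas: Iterable[float],
    watermark_fn: Callable[[str, float], str],   # hypothetical: (text, kappa) -> watermarked text
    fidelity_fn: Callable[[str, str], float],    # hypothetical: (original, watermarked) -> score
    verifiability_fn: Callable[[str], float],    # hypothetical: watermarked text -> score
    min_fidelity: float,
) -> Optional[float]:
    """Return the candidate kappa with the highest verifiability among those
    whose watermarked text meets the fidelity threshold, or None if none does."""
    best_kappa, best_score = None, float("-inf")
    for kappa in kappas:
        watermarked = watermark_fn(text, kappa)
        if fidelity_fn(text, watermarked) < min_fidelity:
            continue  # fidelity too low for this watermarking strength
        score = verifiability_fn(watermarked)
        if score > best_score:
            best_kappa, best_score = kappa, score
    return best_kappa
```

A BO routine could replace the grid by proposing the next value of the hyperparameter from the scores observed so far, which would reduce the number of LLM calls when each watermarking run is expensive.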

In practice, if Waterfall is widely adopted, an open research or developer community would also likely be able to share such configurations and fine-tuning, similar to how fine-tuned deep learning models are also being shared today. Even if Waterfall is implemented by closed-source companies, economies of scale would make it worth fine-tuning and optimizing Waterfall across languages and text types.

L.5 Refinement of watermarked text to improve fidelity

As paraphrasing is applied to the original text when performing the watermark, there might be a change in the style of writing, some loss in information, or in the case of code watermarking, loss of functionality. However, these can be mitigated through several techniques, some of which we have already implemented in our experiments.

In practice, the client could assess the fidelity of the watermarked text $T_{w}$ before using it. If $T_{w}$ does not meet the fidelity threshold (i.e., semantic content is not sufficiently preserved), the client could simply use the LLM paraphraser to correct the watermarked text $T_{w}$ to increase semantic preservation. This could be done automatically as demonstrated in the code example (e.g., Reflexion, or multiple generations), or done manually with prompt engineering. The LLM paraphraser will once again introduce the same embedded watermark to produce the new watermarked text $T_{w}^{\prime}$, strengthening both the verifiability and fidelity of the text.
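The refinement procedure described above could be automated along the following lines. This is an illustrative sketch rather than our exact implementation: `refine_fn` is a hypothetical callable that asks the watermarking LLM paraphraser to correct the current watermarked text against the original, `fidelity_fn` is a hypothetical fidelity metric, and the stopping rule (a fidelity threshold with a cap on the number of rounds) is an assumption.

```python
from typing import Callable, Tuple


def refine_watermarked_text(
    original: str,
    watermarked: str,
    refine_fn: Callable[[str, str], str],      # hypothetical: (original, current) -> corrected text
    fidelity_fn: Callable[[str, str], float],  # hypothetical: (original, candidate) -> score
    min_fidelity: float,
    max_rounds: int = 3,
) -> Tuple[str, float]:
    """Repeatedly refine the watermarked text until it meets the fidelity
    threshold or max_rounds is reached; return the best text and its fidelity."""
    current = watermarked
    best_text, best_fidelity = watermarked, fidelity_fn(original, watermarked)
    for _ in range(max_rounds):
        if best_fidelity >= min_fidelity:
            break  # fidelity threshold met, no further refinement needed
        current = refine_fn(original, current)
        fidelity = fidelity_fn(original, current)
        if fidelity > best_fidelity:
            best_text, best_fidelity = current, fidelity
    return best_text, best_fidelity
```

Because the text returned here is again produced by the watermarking paraphraser, the client would typically re-run verification on it before publication.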

Additionally, as the field develops, LLMs’ paraphrasing capabilities are expected to improve significantly across domains, languages and text types. This enables the Waterfall framework, using these more capable LLMs, to generate watermarked text with progressively smaller semantic degradation, further improving its performance and allowing Waterfall to remain effective in highly specialized or technical domains.