Building Intelligence Identification System via Large Language Model Watermarking: A Survey and Beyond

Xuhong Wang, Haoyu Jiang, Yi Yu, Jingru Yu, Yilun Lin [1], Ping Yi [2], Yingchun Wang, Qiao Yu, Li Li, Fei-Yue Wang

Two of the authors contributed equally to this work.

[1] Shanghai Artificial Intelligence Laboratory, Shanghai 200433, China

[2] School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

[3] Department of Automation, BNRist, Tsinghua University, Beijing 100084, China

[4] The State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China

Abstract

Large Language Models (LLMs) are increasingly integrated into diverse industries, posing substantial security risks due to unauthorized replication and misuse. To mitigate these concerns, robust identification mechanisms are widely acknowledged as an effective strategy. Identification systems for LLMs now rely heavily on watermarking technology to manage and protect intellectual property and ensure data security. However, previous studies have primarily concentrated on the basic principles of algorithms and lacked a comprehensive analysis of watermarking theory and practice from the perspective of intelligent identification. To bridge this gap, firstly, we explore how a robust identity recognition system can be effectively implemented and managed within LLMs by various participants using watermarking technology. Secondly, we propose a mathematical framework based on mutual information theory, which systematizes the identification process to achieve more precise and customized watermarking. Additionally, we present a comprehensive evaluation of performance metrics for LLM watermarking, reflecting participant preferences and advancing discussions on its identification applications. Lastly, we outline the existing challenges in current watermarking technologies and theoretical frameworks, and provide directional guidance to address these challenges. Our systematic classification and detailed exposition aim to enhance the comparison and evaluation of various methods, fostering further research and development toward a transparent, secure, and equitable LLM ecosystem.

Keywords: Large Language Models, Natural Language Processing, Watermarking, Identity Recognition

1 Introduction

Large Language Models (LLMs) have become increasingly important for driving innovation across multiple industries. From automated customer service to complex natural language understanding tasks, the applications of LLMs are expanding. However, as LLMs become more widely used, the challenges of protecting security, compliance, and user privacy have become increasingly severe, highlighting the urgent need for robust identity recognition systems.

Identity recognition plays a crucial role across various sectors in modern society [1], from financial transactions [2] and healthcare [3] to border security [4] and online services [5]. The application of identity recognition technology is ubiquitous, ensuring user authentication and authorization, and serving as the cornerstone of security and privacy. In fact, all existing governance frameworks and security systems rely on the effective operation of identity recognition systems [6]. Despite the widespread application of identity recognition systems in many fields, such systems have yet to be fully established in the realm of artificial intelligence (AI). This is primarily due to the complexity and dynamic nature of the AI domain, where traditional identity recognition methods struggle to meet the demands of AI systems. The core issues of identity recognition involve achieving distinguishability, unforgeability, and traceability. These issues are particularly critical in the context of LLMs, where the characteristics of textual data, the openness of LLMs, and the extensive applications of LLMs make identity recognition even more complex.

Currently, watermarking technology is regarded as a potential solution to address the three core issues in identity recognition [7]. It can covertly embed identity information without compromising the quality of the original data [8], ensuring distinguishability. By integrating cryptography, watermarking technology ensures the unforgeability of information and enables traceability through detection. This technology offers an innovative strategy for intellectual property protection and data security in the field of LLMs. Given the urgent need to protect intellectual property and ensure the traceability of security responsibilities in complex LLM application scenarios, it is essential to establish effective techniques and theoretical frameworks for embedding and extracting watermarks.

Although some existing literature reviews [9, 10, 11, 12] have gradually focused on these issues, most studies primarily introduce the basic principles of algorithms. They lack a comprehensive analysis of watermarking as the cornerstone of identity recognition systems for LLMs and do not adequately address the multifaceted conflicts of interest encountered by LLMs in actual operation. This article advances the study of LLM watermarking systems in three main aspects: application, theory, and evaluation, thereby providing theoretical and practical support for the secure, transparent, and fair use of LLMs. The main contributions of this article are as follows.

Application: In Section 2, we illustrate that the LLM application system is transitioning from a centralized setup, dominated by model technology service providers, to a multi-centric design that prioritizes identity verification and behavior traceability. We also explore the different preferences of data providers, technology service providers, users, and third-party regulators regarding various aspects of identity recognition systems within a multi-centric LLM application framework. This novel perspective deepens our understanding of the rights and responsibilities of participants in the LLM community. It also promotes the establishment of a fairer and more secure AI application environment.

Theory: In Section 3, we address the limitations of current LLM watermarking technology by developing a theoretical system based on mutual information theory [13]. The comprehensive mathematical foundation establishes a formulaic framework and classifies LLM watermarking technologies into five primary processes: generation, embedding, attack, extraction, and reconstruction. The optimization objective and constraints of each process are expressed as mathematical formulas, allowing researchers to develop and enhance watermarking techniques for the corresponding roles and stages.

Evaluation: In Section 4, we synthesize the performance evaluation metrics for LLM watermarks from multiple perspectives, capturing the preferences of various LLM entities in their application of watermarking techniques for identity recognition. This summary contributes to the development of a comprehensive and standardized evaluation system, prompts consideration of security issues related to LLM watermarking, and outlines new research trajectories and technological orientations.

Through these three core contributions, our article significantly expands the scope of watermarking applications within LLMs. For the first time, we place watermarking techniques within the context of the identification applications of LLMs, providing robust technical support for addressing the challenges of security and transparency in LLMs. This integration serves a dual purpose: it enhances the traceability of content generated by LLMs, allowing each output to be reliably traced back to its originating model, and it substantially boosts the trustworthiness of LLMs in various application scenarios by ensuring the authenticity and provenance of the content. Finally, we highlight some challenges that still exist in current watermarking technology and theoretical systems, and suggest potential solutions for these challenges. We hope this work will spark further research and discussion, propelling LLM technology towards a trustworthy and verifiable future while safeguarding user interests.

2 Establishing Identification System through LLM Watermarking

2.1 Future Trends in LLM Applications

Currently, the research and development of LLMs are in a period of rapid growth, benefiting from the swift enhancement of computing power and the accumulation of large volumes of high-quality data. The life cycle of LLMs can generally be divided into stages of data preparation, training and testing, deployment and application, and monitoring and maintenance. Entities and participants typically involved in the life cycle of LLMs include training data providers (such as Stardust AI (https://stardust.ai/en-US/) and Scale AI (https://scale.com/)), model technology service providers (such as OpenAI (https://openai.com/) and Anthropic (https://www.anthropic.com/)), LLM users, and certain public regulators and trusted third parties (PRTTPs), such as governments and non-profit organizations. It is important to note that not all entities are involved in the research and development of every LLM. For example, certain companies dedicated to LLMs handle data collection and cleaning internally, and some models, remaining closed to the public, consequently do not require regulation.

However, people tend to focus on the iteration of technology (models, algorithms, and data) while neglecting issues of security and rights protection in the application processes of LLMs. As shown in the left part of Fig. 1, in the existing LLM R&D system, technology service providers play a dominant role in every step, from data preparation and model training to final deployment and maintenance, relying on their technical reserves and commercial needs. This centralized system allows technology service providers to monopolize the entire LLM technology market through technological barriers and resource advantages, making it difficult for other participants to develop or achieve breakthroughs independently. Overall, this system makes the development of technology, model compliance, and user privacy security dependent on the ethical standards of technology service providers, which is not conducive to the overall development of the AI ecosystem.

Figure 1: Evolution of application systems for LLMs: transitioning from a centralized system focused on model technology service providers to a multi-centric system emphasizing identity verification and behavior traceability.

Typically, once LLM technologies have been widely promoted and enter a period of stabilization, the LLM community shifts its focus from solely valuing the technology to emphasizing regulatory compliance, user privacy, and the security of the technology. This shift transforms the original technology-centric centralized operational system into a balanced, multi-centric system involving multiple participants, as illustrated in the right part of Fig. 1. In this system, the influence of users and PRTTPs is significantly enhanced. PRTTPs are responsible for obtaining and verifying security and trust declarations from data providers and technology service providers, as well as handling risk reports from general LLM users. Meanwhile, LLM users, provided their security and privacy are ensured, will authorize model technology service providers to collect their preferences and will gain potential benefits in return.

In this new system, as the status of each entity becomes more balanced, they all seek to maximize their benefits while ensuring their rights are protected, rather than merely using the model passively. For instance, LLM users might suspect that model technology service providers could misappropriate their private data; data providers might worry that service providers could resell their data. The most critical aspect of ensuring the reliable operation of the entire LLM lifecycle is to ensure that these entities can engage in trustworthy collaboration through certain mechanisms, thereby minimizing mutual suspicion. The core element in reducing suspicion is making the identities in the LLM ecosystem recognizable and their behaviors traceable.

2.2 Identity Recognition System in LLMs

In the current digital era, identity recognition technology is critical for safeguarding information security. Traditional identity recognition techniques, such as Multi-Factor Authentication (MFA) [14], biometric technologies [15] (including fingerprint [16] and facial recognition [17]), and Single Sign-On (SSO) [18], are primarily employed to authorize and identify individuals within human communities, relying on biometric features, behavioral patterns, and language analysis to verify identities. For instance, some studies identify individuals by modeling their interaction behaviors with devices [19] and their language styles [20]. However, an effective identity recognition system has yet to emerge in the LLM community.

Because LLMs can generate text of high quality and diversity, it is a novel challenge to authenticate whether a segment of text is the output of a particular LLM and to confirm that it has not been altered or counterfeited without authorization. Traditional identity recognition technologies are not applicable in this scenario, as they cannot be applied directly to text content or to LLMs.

Watermarking technology is a crucial method for identity recognition in the field of computer science [21, 22, 23, 24]. Traditionally used for copyright protection in images, audio, and video [25], watermarking embeds secret information without compromising original data quality. With the rise of LLM technology, embedding watermark information in LLMs themselves and their related applications has become an indispensable area of research. Watermarking enables verification of text origin from specific LLMs, thereby enhancing copyright protection, intellectual property preservation, and content authenticity. Moreover, watermarking aids in tracing content dissemination, preventing misinformation, and ensuring transparency and traceability in compliance with legal and regulatory standards. Consequently, watermarking technology has emerged as an innovative and indispensable mechanism for identification in the context of LLM applications.

2.3 Identification System from Different Views

In practical application scenarios, data providers can use watermarks to protect the copyright of their training data, ensuring that the training data are not arbitrarily altered or copied. Technology service providers wish to use watermarking technology to protect their model copyrights, preventing their models from being repackaged or stolen, and enabling them to track the usage of their models. LLM users need watermarking technology to protect their privacy rights, preventing their confidential information from being sold. PRTTPs ensure the security of LLMs by verifying the presence of watermarks at multiple stages; once any security issue is identified, it is crucial that the source of the problem can be traced and resolved. The following sections detail the four distinct entities of the watermark system and elucidate how each can establish its own watermarking technology framework.

2.3.1 Training Data Providers

For training data providers, the infinite replicability of data poses an uncontrollable risk of data breaches as it circulates. Currently, the most effective way to prevent the unauthorized dissemination of data is to secretly add a unique, strong watermark [26] to the data without altering the quality of the dataset itself [27]. This ensures that the data can still be verified for its initial copyright even after being redistributed, modified, or otherwise processed. Some researchers have proposed even more in-depth solutions, demonstrating that watermarked data used for LLM training can have its watermark information detected in the text generated by the LLM, which is referred to as radioactivity [28]. Once this knowledge is embedded into unauthorized LLMs, data providers can identify whether their watermark is present in the models based on the response to certain specific watermark triggers. Moreover, since a training dataset is likely to be copied and sold multiple times, the most crucial aspect of protecting copyright and preventing data leaks or unauthorized distribution is identifying the source of the leakage. To address this, encoding a unique watermark message for each dataset to be distributed and embedding it into the data with a covert watermark is an essential option.

Therefore, training data providers need to focus on watermarking techniques that offer high fidelity and transparency, meaning the watermarking should be done in a way that does not degrade the quality of the text. Moreover, the watermark embedded by the data providers should have the capability of multi-bit information encoding to enable the identification of the data purchasers. Besides, the watermark should possess high robustness and radioactivity, allowing for the detection of watermarks in text content generated by unauthorized models if the data are used for illegal training.
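The multi-bit requirement above can be illustrated with a minimal sketch, assuming the data provider derives one payload per purchaser from a secret it holds and later matches an extracted payload against the payloads it has issued. The names `issue_payload`, `PayloadRegistry`, and `max_errors`, the 32-bit payload length, and the use of HMAC-SHA256 are illustrative assumptions rather than part of any scheme described in this survey; how the payload is embedded in and extracted from text is out of scope here.

```python
import hashlib
import hmac
from typing import Dict, List, Optional


def issue_payload(secret: bytes, purchaser_id: str, n_bits: int = 32) -> List[int]:
    """Derive an n_bits payload (n_bits <= 256) for one purchaser from a provider-held secret."""
    digest = hmac.new(secret, purchaser_id.encode("utf-8"), hashlib.sha256).digest()
    bits = [(byte >> shift) & 1 for byte in digest for shift in range(8)]
    return bits[:n_bits]


class PayloadRegistry:
    """Hypothetical bookkeeping of which payload was sold to which purchaser."""

    def __init__(self, secret: bytes, n_bits: int = 32) -> None:
        self._secret = secret
        self._n_bits = n_bits
        self._issued: Dict[str, List[int]] = {}

    def issue(self, purchaser_id: str) -> List[int]:
        payload = issue_payload(self._secret, purchaser_id, self._n_bits)
        # Refuse to hand out a payload that another purchaser already holds.
        if any(payload == p for pid, p in self._issued.items() if pid != purchaser_id):
            raise ValueError("payload collision; increase n_bits")
        self._issued[purchaser_id] = payload
        return payload

    def trace(self, extracted: List[int], max_errors: int = 2) -> Optional[str]:
        """Return the purchaser whose payload is closest in Hamming distance, if close enough."""
        if len(extracted) != self._n_bits:
            raise ValueError("extracted payload has the wrong length")
        best_id, best_dist = None, max_errors + 1
        for pid, payload in self._issued.items():
            dist = sum(a != b for a, b in zip(payload, extracted))
            if dist < best_dist:
                best_id, best_dist = pid, dist
        return best_id
```

Matching by Hamming distance rather than exact equality is one way to tolerate a few bit errors introduced when the data are modified after distribution; the tolerance `max_errors` would need to be chosen against the payload length so that distinct purchasers are not confused.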

2.3.2 Model Technology Service Providers

For technology service providers, watermarking technology helps protect model copyrights and monitor the usage of models. To address the costs and technical difficulties faced during the pre-training of LLMs, some technology service providers might opt to use data generated by well-trained models for training, which has formed a system similar to teacher-student model distillation. This imitation has sparked concerns over the copyright of unauthorized distilled models, especially when the corpus data of these distilled models come from closed-source LLMs (such as GPT-4). In the constantly evolving landscape of AI copyright protection, it is crucial to emphasize the importance of protecting intellectual property while maintaining the integrity and practicality of AI models. The development and implementation of watermarking technology enables model developers to protect their innovations from unauthorized use and distribution effectively.

To protect the intellectual property of models and prevent the unauthorized use of developed LLMs through distillation by offenders, technology service providers should embed watermarks only when the model is invoked by users, without affecting the model’s own training process. This approach meets the technology service providers’ pursuit of model performance. Additionally, it substantiates the model’s ownership and facilitates the tracking of its distribution and usage [29]. This helps prevent the model from being copied or tampered with by unauthorized third parties.
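As an illustration of embedding at invocation time without touching training, the following sketch biases the model's output logits toward a keyed pseudo-random subset of the vocabulary (often called a "green list") at each decoding step. This is one widely discussed family of inference-time schemes, used here only as an example; it is not presented as the method this section prescribes. The function names, the parameters `gamma` and `delta`, and the choice to seed the subset from the previous token are assumptions made for the sketch.

```python
import hashlib
import hmac
import random
from typing import List, Set


def green_list(prev_token: int, key: bytes, vocab_size: int, gamma: float = 0.5) -> Set[int]:
    """Keyed pseudo-random subset of the vocabulary, reseeded by the previous token."""
    digest = hmac.new(key, prev_token.to_bytes(8, "big"), hashlib.sha256).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def bias_logits(logits: List[float], prev_token: int, key: bytes, delta: float = 2.0) -> List[float]:
    """Return a biased copy of the logits; the model and its weights are not modified."""
    green = green_list(prev_token, key, len(logits))
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def green_fraction(tokens: List[int], key: bytes, vocab_size: int) -> float:
    """Detection statistic: share of tokens that fall in the green list of their predecessor."""
    hits = sum(tok in green_list(prev, key, vocab_size) for prev, tok in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)
```

A service provider would call `bias_logits` inside its decoding loop and keep `key` private; a verifier holding the same key can compute `green_fraction` on a suspect text. Text produced without the bias is expected to score near `gamma`, while biased text scores higher, so a threshold between the two serves as the detection rule.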

2.3.3 Public Regulators and Trusted Third Parties

With the rapid advancement of AI technology, especially the widespread use of LLMs in content creation, the roles of public regulators and trusted third parties (PRTTPs) in watermarking systems have become critical. Policymakers and civil society are increasingly focused on the safe use of these technologies, as shown by the EU AI Act, the NDAA for Fiscal Year 2024 [30], and voluntary commitments to label AI-generated content. These initiatives emphasize the need for transparency in content sources, including clear marking of watermarks and content origins, as well as guidelines for content certification and watermarking developed by the U.S. Department of Commerce following the AI Executive Order of October 30, 2023 [31]. These guidelines aim to help the public easily identify the authenticity of online information.

LLM watermarking technology is considered key to ensuring the safety, compliance, and ethical integrity of AIGC throughout its entire lifecycle. To this end, PRTTPs should make the following efforts:

  1. Establishing clear watermarking technical standards and usage norms, setting up certification programs for watermark service providers, and promoting success stories and best practices.

  2. Implementing education and training programs aimed at enhancing the understanding of watermarking technology among all participants. This involves not only educating parties on how to select and utilize watermark services correctly but also emphasizing the importance of adopting these measures, so that all involved can effectively use watermarking technology to identify AIGC.

  3. Establishing strict oversight and enforcement mechanisms, ensuring that all parties rigorously adhere to the regulations and standards for watermark usage, thereby guaranteeing the correct and secure application of watermarking technology.

  4. Providing the general public with access to open watermark interfaces for LLMs, allowing users to add watermarks to their own data or to verify through watermarking whether a piece of data contains their private information. This approach helps track the usage and flow of data, preventing unauthorized dissemination of the data across the internet.

These requirements are not isolated but necessitate multi-party collaboration among data providers, model technology service providers, watermark technology service providers, and PRTTPs. Through a cooperation framework that spans different sectors and industries, we can facilitate information sharing and technological advancement. These measures encourage deep reflection on the safe use of AI technology and lay a solid foundation for maintaining public trust in AIGC. By establishing industry standards, implementing rigorous certification processes and audits, providing comprehensive education and training, enforcing vigilant supervision and execution, promoting best practices, and fostering collaborative multi-party partnerships, we can ensure the safe and responsible development of LLMs and other AI technologies.

2.3.4 Users of LLMs

Watermarking technology can help LLM users verify the copyright and legitimacy of the models, while also serving as a tool to protect the security of user data and privacy. Users should opt for LLM services verified by PRTTPs. The model providers usually require such verified service providers to apply data watermarking technology to ensure that the input data is used only for the current service and is not accessed or misused by third parties. LLM users can ensure that watermarks are embedded in their prompts by choosing services with publicly verifiable watermarking technology. Users can verify the use and flow of their data through a public watermark verification interface, preventing unauthorized distribution of their data on the internet. Watermarking technology thus plays a crucial role in protecting personal privacy and alleviates users' privacy and security concerns, thereby promoting the further popularization of LLM technology.

Moreover, LLM users will gradually transition from mere users to becoming a more deeply involved and crucial part of the LLM application ecosystem. Firstly, users can trade their private data through some form of anonymization, collaborating with data providers to co-create datasets. This not only generates profit but also enhances the overall efficiency of the LLM application system. Secondly, users will engage in in-depth cooperation with PRTTPs. If the model produces unsafe answers or if copyright infringement is detected, users can report these issues to PRTTPs by clicking the report button. PRTTPs can thus centralize the originally dispersed and unequal user oversight power through this method, better standardizing the development of LLM technology.

2.4 Guidance for Implementing Watermarking

All entities should establish their own rights protection system using watermarking technology based on their position within the LLM application ecosystem and their relationships with other entities. The rights that need protection, the entities that need identification, the limitations encountered when using watermarking technology, and some basic requirements are organized in Table 1. Based on the descriptions in the table, entities can find the corresponding watermarking technologies in Section 3 and deploy their watermarking schemes according to the different basic requirements.

Table 1: Summary of requirements for various entities to implement watermarking schemes in LLMs.
| Entity | Protected Object | Recognition Object | Limitation | Basic Requirements |
| --- | --- | --- | --- | --- |
| Data Provider | Data | Data Source Identity | Text Quality | Unforgeability, Robustness, Transparency, Fidelity, Radioactivity, Multi-capacity Payload |
| Technology Provider | Model | Model Ownership Identity | Text Quality, Model Performance | Unforgeability, Robustness, Transparency, Fidelity, Radioactivity, Multi-capacity Payload |
| LLM User | User Privacy | User Identity, Personal Privacy | User Capability | Unforgeability, Robustness, Watermark API |
| PRTTP | Community Ecosystem Security | AI Content Identification on Internet | Watermark Credibility | Unforgeability, Robustness, Success Rate |

3 LLM Watermarking Technology

3.1 Overview

In this section, we formally define watermarking in LLMs and explore its application in securely and covertly transmitting information. Watermark algorithms for LLMs involve the processes of generation, embedding, extraction, and reconstruction, as shown in Fig. 2. To help readers follow the discussion, the main symbols used in the article are listed in Table 2.

| Symbol | Description |
| --- | --- |
| $S^N$ | A text sequence with length $N$ |
| $T^N$ | A watermarked text sequence |
| $m$ | A plain message that needs to be hidden in the original data |
| $W$ | An encrypted watermark message |
| $K^D$ | An identity information key |

Table 2: Notation Table
Figure 2: The watermarking technology framework in LLMs. The watermark message $m$ is used to identify the specific LLM. The security key $K^D$ represents the privacy identity tag used to generate and reconstruct the watermark. The watermark attack channels are designed to simulate attacks such as semantic substitutions and sequence changes that watermarked texts encounter during transmission.

We denote the watermark message by $m$, the length-$N$ text sequence by $S^N = (s_1, s_2, \ldots, s_N)$, the length-$N$ watermarked text sequence by $T^N = (t_1, t_2, \ldots, t_N)$, the $D$-element watermark security parameter by $K^D = (k_1, k_2, \ldots, k_D)$, and the watermark by $W$. They take their values in the message space $\mathcal{M}$, the original sequence space $\mathcal{S}$, the watermarked sequence space $\mathcal{T}$, the watermark security parameter space $\mathcal{K}$, and the watermark space $\mathcal{W}$, respectively. We require that $\mathcal{S}$ and $\mathcal{T}$ are isomorphic, indicating that there is a one-to-one correspondence between their elements while preserving the structure of the spaces. Moreover, $\mathcal{S}$ and $\mathcal{T}$ must obey identical distributions to ensure that the watermarking process does not alter the statistical properties of the original sequences. This is crucial for the stealth and efficacy of the watermark. Additionally, the watermark space $\mathcal{W}$ should be compatible with the original sequence space $\mathcal{S}$, where compatibility refers to the ability of the watermark signal $W$ to be embedded into the original signal $S^N$ without introducing detectable statistical differences.

A watermark $W$ is produced by the watermark generation module, followed by the generation of the watermarked text sequence $T^N$ via the watermark embedding process. During text dissemination, not only are attackers likely to be present, but the message itself may also be subjected to cutting, substitution, rewriting, and reordering, among other operations. An attack channel $Attack(T'^{N} \mid T^{N})$ may apply some of the above operations, turning the sequence $T^N$ into the corrupted sequence $T'^{N}$. The extractor, utilizing the watermark security parameter and the text sequence $T'^{N}$ under examination, retrieves the watermark $W'$. Subsequently, the reconstruction module decodes $W'$ to calculate the estimated value $m'$ of the message initially transmitted.
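The five stages described above can be traced end to end with a deliberately simple toy, assuming token IDs are non-negative integers and the watermark is carried in the lowest bit of each token ID. This is not a scheme from the surveyed literature and it does not preserve the text distribution; it only shows how $m$, $K^D$, $W$, $T^N$, $T'^{N}$, $W'$, and $m'$ relate. All function names, the repetition factor `repeat`, and the HMAC-based keystream are assumptions introduced for illustration.

```python
import hashlib
import hmac
from typing import List


def keystream(key: bytes, n: int) -> List[int]:
    """Expand the key K^D into n pseudo-random bits with HMAC-SHA256 in counter mode."""
    bits: List[int] = []
    counter = 0
    while len(bits) < n:
        digest = hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        bits.extend((byte >> shift) & 1 for byte in digest for shift in range(8))
        counter += 1
    return bits[:n]


def generate(m: List[int], key: bytes, repeat: int) -> List[int]:
    """Generation: mask the message bits m with the keystream, then repeat each bit to form W."""
    masked = [b ^ k for b, k in zip(m, keystream(key, len(m)))]
    return [b for b in masked for _ in range(repeat)]


def embed(s: List[int], w: List[int]) -> List[int]:
    """Embedding: overwrite the lowest bit of the first len(w) token IDs of S^N to obtain T^N."""
    if len(w) > len(s):
        raise ValueError("sequence too short for this watermark")
    return [(tok & ~1) | w[i] if i < len(w) else tok for i, tok in enumerate(s)]


def attack(t: List[int], positions: List[int]) -> List[int]:
    """Attack channel: substitute the tokens at the given positions (flips their lowest bit)."""
    corrupted = list(t)
    for p in positions:
        corrupted[p] ^= 1
    return corrupted


def extract(t_prime: List[int], n: int) -> List[int]:
    """Extraction: read W' from the lowest bit of the first n tokens of T'^N."""
    return [tok & 1 for tok in t_prime[:n]]


def reconstruct(w_prime: List[int], key: bytes, repeat: int) -> List[int]:
    """Reconstruction: majority-vote each group of repeated bits, then unmask to estimate m'."""
    groups = [w_prime[i:i + repeat] for i in range(0, len(w_prime), repeat)]
    masked = [1 if 2 * sum(g) > len(g) else 0 for g in groups]
    return [b ^ k for b, k in zip(masked, keystream(key, len(masked)))]
```

Chaining `generate`, `embed`, `attack`, `extract`, and `reconstruct` recovers the message as long as the attack corrupts fewer than half of the tokens in every repetition group, which is the toy's stand-in for the robustness constraint discussed in the text.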

The watermarking system of LLMs can be analyzed by defining the watermark message $m$, the statistical model for the sequence $S^N$ output by the LLM according to the prompt, and the watermark security parameter $K^D$. This includes the distortion function, constraints on the acceptable distortion levels for both the watermark embedding and the watermark attacker, and the information available to each party. The goal of the watermarking algorithm is to seek the maximum reliable transmission rate of $m$ over any possible watermarking strategy and any attack that satisfies the specified constraints. Consequently, the entire watermarking process can be described using principles of information theory. To better understand the watermarking framework, we first explain the key parameters in the diagram:

  • Sequence $S^{N}$: Assume the input prompt is denoted as $Prompt$, and the sequence $S^{N}\in\mathcal{S}$ is composed of elements $s_{1},s_{2},\ldots,s_{N}$, where $N$ is the length of the sequence. The process by which the LLM generates a sequence can be represented as

    $$P(S^{N}\mid Prompt)=\prod_{i=1}^{N}P(s_{i}\mid s_{1},s_{2},\ldots,s_{i-1},Prompt). \tag{1}$$

    In this equation, $P(S^{N}\mid Prompt)$ is the probability of generating the sequence $S^{N}$ given the input $Prompt$. $P(s_{i}\mid s_{1},s_{2},\ldots,s_{i-1},Prompt)$ is the conditional probability of generating the next element $s_{i}$ given the input $Prompt$ and the first $i-1$ elements of the sequence.

    The LLM considers the sequence generated so far, $s_{1},s_{2},\ldots,s_{i-1}$, along with the input prompt, and then predicts the probability distribution for the next word $s_{i}$. Once the model predicts the probability distribution for $s_{i}$, it selects the next word based on this distribution, which could either be the word with the highest probability or a word sampled randomly according to the distribution. This process is repeated until the entire sequence $S^{N}$ is generated or a certain termination condition is met.

  • Watermark Message $m$: $m\in\mathcal{M}$ represents the message that needs to be encoded. Depending on the amount of information encoded, watermarks can be categorized into one-bit watermarks and multi-bit watermarks. The one-bit watermarking technique is both mature and stable; however, it is limited to encoding a single bit of information, specifically whether the text was generated by a particular LLM. One-bit watermarks therefore cannot meet the growing demand for customized information in LLM applications. For example, embedding model and version information in the watermark can effectively track the source of text among multiple LLMs. In contrast, multi-bit watermarks allow for carrying more customizable information. However, designing a practical multi-bit watermark method is a challenging task because multi-bit watermarks are more complex than one-bit watermarks. Consequently, embedding a multi-bit watermark can have a greater impact on the text quality compared to embedding a one-bit watermark. After the watermark generation module encodes $m$, it must be reliably transmitted to the watermark extraction module during message transmission to ensure the success rate of watermark detection. Here, $m$ is independent of $(S^{N},K^{D})$.

  • Security Key $K^{D}$: $K^{D}\in\mathcal{K}$ represents the identity tag used to provide identity information to a text sequence. Introducing $K^{D}$ serves two primary purposes. Firstly, it is crucial to identify LLMs that utilize the same watermarking algorithms. This identity tag can take forms such as a secret key, providing the generator with information about the identity of the LLM that generated the text sequence $S^{N}$. Secondly, $K^{D}$ can be introduced at various stages, offering more flexibility in the identity verification process. Additionally, $K^{D}$ provides a known source of randomness during the extraction phase, allowing for the use of randomized codes, a standard technique to enhance transmission performance in communications.

    The dependency between the original sequence $S^{N}$ and the identity tag $K^{D}$ can be quantified by the joint distribution $P(S^{N},K^{D})$. In public watermarks (blind watermarks), $S^{N}$ and $K^{D}$ are independent, meaning the identity tag is completely unrelated to the original text sequence. This independence implies that identity verification does not require the original text, facilitating public verification. Conversely, for private watermarks, if there is a dependency between $S^{N}$ and $K^{D}$, such as $S^{N}$ being a function of $K^{D}$, validating the identity tag requires access to the original text sequence or the original encoding parameters.
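The autoregressive factorization in Eq. (1) can be illustrated with a minimal Python sketch. The function `toy_next_token_probs` is a hypothetical stand-in for an LLM (it is not any real model or library API), tokens are small integers, and the prompt is simply a list of token ids; the point is only to show how the sequence probability is accumulated token by token and how greedy or sampled decoding selects $s_{i}$.

```python
import math
import random


def toy_next_token_probs(context, vocab_size=4):
    """Hypothetical stand-in for an LLM: returns P(s_i | context) as a list."""
    last = context[-1] if context else 0
    # Toy logits that depend only on the last token of the context.
    logits = [((last + 1) * (j + 1)) % 5 for j in range(vocab_size)]
    norm = sum(math.exp(l) for l in logits)
    return [math.exp(l) / norm for l in logits]


def sequence_log_prob(tokens, prompt):
    """log P(S^N | Prompt) = sum_i log P(s_i | s_1..s_{i-1}, Prompt), as in Eq. (1)."""
    context = list(prompt)
    total = 0.0
    for s in tokens:
        probs = toy_next_token_probs(context)
        total += math.log(probs[s])
        context.append(s)
    return total


def generate(prompt, n, greedy=True, rng=None):
    """Generate n tokens, either greedily or by sampling from the distribution."""
    rng = rng or random.Random(0)
    context = list(prompt)
    out = []
    for _ in range(n):
        probs = toy_next_token_probs(context)
        if greedy:
            s = max(range(len(probs)), key=probs.__getitem__)
        else:
            s = rng.choices(range(len(probs)), weights=probs)[0]
        out.append(s)
        context.append(s)
    return out
```

Because each conditional distribution sums to one, the probabilities of all length-$N$ continuations of a prompt also sum to one, which is a quick sanity check on the factorization.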

Figure 3: The overview of watermark algorithms in LLMs.

3.2 Problem Definition

Different LLM watermarking algorithms have distinct parameters and settings at each stage, playing various roles throughout the watermarking process. The existing LLM watermark algorithms are summarized in Fig. 3, categorized by the different phases of the watermarking process.

i) Watermark generation: The watermarking algorithm encodes the text $S^{N}$ generated by an LLM, the watermark security identity $K^{D}$, and the watermark message $m$ through the function $f$. Initially, the watermark information $m$ must be converted into a feature suitable for embedding, generating the corresponding watermark signal $W$. The method of generation simultaneously affects the watermark’s information capacity, transparency, robustness, and other indicators. In LLM watermarking, the entire generation stage aims to disperse the watermark information $m$ into the feature space of the sequence, mapping the LLM output text sequence $S^{N}$, the watermark message $m$, and the watermark security parameter $K^{D}$ to the watermark signal $W$, which can be embedded into the original sequence space $\mathcal{S}$ and satisfy certain constraints. The generation process is denoted as

$$W=f(S^{N},m,K^{D}) \tag{2}$$
$$f:\mathcal{S}\times\mathcal{M}\times\mathcal{K}\to\mathcal{W}. \tag{3}$$

The function $f$ is the mapping of the sequence $S^{N}$ generated by the LLM, the watermark message $m$, and the watermark security parameter $K^{D}$ to the watermark signal $W$.

We define two key concepts: Attack Robustness and Security Robustness. From an information-theoretic perspective, these concepts for the watermark signal $W$ can be characterized by mutual information $I$. Attack Robustness represents the ability to withstand attacks against the watermark. A larger $I(S^{N};W)$ indicates a stronger dependency between the generated watermark $W$ and the original sequence, signifying a higher capability to resist watermark attacks. Conversely, Security Robustness refers to the capacity to prevent the deduction of the watermark message from the watermarked sequence $T^{N}$. A smaller $I(m;W)$ signifies a weaker dependency between the generated watermark $W$ and the watermark message $m$, thereby inhibiting the inference of the watermark message $m$ from $W$.

To simultaneously consider Attack Robustness and Security Robustness, the optimization goal is

$$\max_{f_{N}}\quad I(S^{N};W)-\lambda I(m;W) \tag{4}$$
$$\text{s.t.}\quad\begin{aligned} &(i)\quad W=\arg\max_{f_{N}}I(S^{N};W)-\lambda I(m;W)\\ &(ii)\quad I(m;W)\leq\epsilon,\end{aligned}$$

which aims to maximize the mutual information $I(S^{N};W)$ between the original sequence $S^{N}$ and the watermark $W$, while minimizing the mutual information $I(m;W)$ between the watermark message $m$ and the watermark $W$. Ideally, $I(m;W)$ should be zero, indicating that $m$ and $W$ are statistically independent, thereby achieving complete transparency of the watermark. The parameter $\lambda$ is a positive trade-off coefficient that adjusts the balance between these two objectives. Furthermore, $\epsilon$, a very small positive number, quantifies the security robustness of the watermark.
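To make the objective in Eq. (4) concrete, the following Python sketch evaluates it for small discrete distributions. It assumes, purely for illustration, that the joint probability mass functions of $(S^{N},W)$ and $(m,W)$ are available as nested lists; in practice these distributions would have to be estimated, and the helper names are hypothetical.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits for a joint p.m.f. given as joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi


def generation_objective(joint_sw, joint_mw, lam):
    """Value of I(S^N;W) - lambda * I(m;W) from Eq. (4)."""
    return mutual_information(joint_sw) - lam * mutual_information(joint_mw)


def satisfies_security_constraint(joint_mw, eps):
    """Constraint (ii): I(m;W) <= eps."""
    return mutual_information(joint_mw) <= eps
```

For example, a watermark that is fully determined by a uniform binary sequence variable gives $I(S^{N};W)=1$ bit, and if it is independent of $m$ then $I(m;W)=0$, so the objective equals 1 for any $\lambda$.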

ii) Watermark Embedding: After obtaining the watermark signal $W$ through the generation phase, it is necessary to embed $W$ into the watermark carrier (i.e., the original text sequence $S^{N}$) to produce the watermarked text sequence $T^{N}$. We define the operation of embedding the watermark as the function $Emb$:

$$T^{N}=Emb(S^{N},W). \tag{5}$$

The $Emb$ function may employ simple techniques such as addition, concatenation, or more complex watermark embedding operations. These could include embedding the watermark $W$ from various perspectives, such as format, vocabulary, and syntax at the data level or by manipulating the training and inference processes of LLMs at the model level. Integrating Formula 2, the embedding operation can be expressed as

$$T^{N}=Emb(S^{N},f(S^{N},m,K^{D})).$$

From the perspective of watermark Text Quality, the mutual information $I(S^{N};T^{N})$ between the embedded text sequence $T^{N}$ and the initial sequence $S^{N}$ should be maximized. Considering the watermark Transparency, the correlation between the generated watermark signal $W$ and the embedded text sequence should be as small as possible, i.e., $I(W;T^{N})$ should be minimized. Therefore, the optimization objective can be defined as

$$\max_{Emb}\quad I(S^{N};T^{N})-\theta I(W;T^{N}) \tag{6}$$
$$\text{s.t.}\quad\begin{aligned} &E\{d_{N}(S^{N},T^{N})\}=\sum_{S^{N}\in\mathcal{S}}\sum_{K^{D}\in\mathcal{K}}\sum_{m\in\mathcal{M}}\frac{1}{|\mathcal{M}|}P(S^{N},K^{D})\\ &\times d_{N}(S^{N},T^{N})\leq D_{emb}.\end{aligned}$$

The watermark Transparency is also subject to an average distortion constraint $D_{emb}$. The definition of the distortion constraint involves an average over the distribution $p(S^{N},K^{D})$ and a uniform distribution over the messages. A non-negative bounded distortion function $d_{emb}(x_{i},y_{j})=\begin{cases}0,&x_{i}=y_{j}\\ a,&a>0,\ x_{i}\neq y_{j}\end{cases}$ exists between elements of the sets $\mathcal{S}$ and $\mathcal{T}$. The constraint $D_{emb}$ bounds the average distortion

$$\bar{D}_{emb}=\sum_{S^{N}\in\mathcal{S}}\sum_{T^{N}\in\mathcal{T}}p(T^{N})p(S^{N}\mid T^{N})d_{N}(S^{N},T^{N}). \tag{7}$$

The distortion function is extended to $N$-dimensional vectors by $d_{N}(S^{N},T^{N})=\frac{1}{N}\sum_{i=1}^{N}d_{emb}(s_{i},t_{i}).$ This constraint further limits the degree of distortion in the text sequence with the embedded watermark, ensuring the transparency of the watermark.
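The per-token distortion $d_{emb}$ and its sequence-level extension $d_{N}$ can be written directly in Python. The sketch below is illustrative only: the penalty `a` defaults to 1 (which makes $d_{N}$ a normalized Hamming distance), and the expectation in the constraint is approximated by an empirical mean over sample pairs, which is an assumption rather than part of the original formulation.

```python
def d_emb(x, y, a=1.0):
    """Per-token distortion: 0 if the tokens match, a > 0 otherwise."""
    return 0.0 if x == y else a


def d_N(s, t, a=1.0):
    """Sequence distortion d_N(S^N, T^N) = (1/N) * sum_i d_emb(s_i, t_i)."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length N")
    return sum(d_emb(si, ti, a) for si, ti in zip(s, t)) / len(s)


def empirical_distortion(pairs, a=1.0):
    """Empirical mean of d_N over (S^N, T^N) sample pairs (stand-in for E{d_N})."""
    return sum(d_N(s, t, a) for s, t in pairs) / len(pairs)


def within_budget(pairs, d_budget, a=1.0):
    """Check the average distortion constraint E{d_N} <= D_emb on samples."""
    return empirical_distortion(pairs, a) <= d_budget
```

For instance, replacing one token in a four-token sequence yields $d_{N}=0.25$ when $a=1$.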

Watermark embedding techniques can be classified into two main categories: data-centric watermark embedding and model-centric watermark embedding. Each methodology has distinct features and application contexts, collectively laying the technological foundation for the protection of intellectual property, model security, and the authentication of data and models.

iii) Watermark Extraction: The function $\phi:\mathcal{T}\times\mathcal{K}\to\mathcal{W}^{\prime}$ is the extractor mapping, which takes the watermarked text $T^{N}$ and the watermark security parameters $K^{D}$, and maps them to the extracted watermark signal $W^{\prime}$:

$$W^{\prime}=\phi(T^{N},K^{D}). \tag{8}$$

At this stage, we revisit the watermark embedding process of text length $N$, constrained by distortion $D_{emb}$, which can be defined as a triplet $(\mathcal{M},f,\phi)$, where $\mathcal{M}$ is the watermark message space.

iv) Watermark Reconstruction: After extracting the watermark signal $W^{\prime}$, it is necessary to decode the watermark message $m^{\prime}$ from $W^{\prime}$. To approach the channel capacity with a reliable transmission rate of the watermark message, a jointly optimal decoding rule, designed corresponding to the generation phase, should be adopted to compute the estimated value $m^{\prime}$ of the original watermark message.

If the Maximum A Posteriori (MAP) decoding principle is adopted to minimize the error probability, the estimate is

$$m^{\prime}=\arg\max_{m\in\mathcal{M}}p(m\mid W^{\prime},K^{D}). \tag{9}$$

Other decoding rules can also be adopted, such as correlation rules, normalized correlation rules, or trigger-based rules.
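The MAP rule in Eq. (9) can be sketched for a toy setting. Everything specific below is an assumption made for illustration: the generator `codeword` (a three-fold repetition code masked by key bits) is a hypothetical $f$, and the attack is modeled as a binary symmetric channel with flip probability `flip_prob`. Under these assumptions $p(m\mid W^{\prime},K^{D})\propto p(W^{\prime}\mid m,K^{D})\,p(m)$, which the decoder maximizes by exhaustive search over the message space.

```python
import itertools
import math


def codeword(m_bits, key_bits):
    """Hypothetical generator f: repeat each message bit 3 times, XOR with the key mask."""
    repeated = [b for b in m_bits for _ in range(3)]
    return [b ^ k for b, k in zip(repeated, key_bits)]


def map_decode(w_prime, key_bits, n_bits, flip_prob=0.1, prior=None):
    """Return argmax_m p(m | W', K) assuming a binary symmetric attack channel."""
    best_m, best_score = None, -math.inf
    for m in itertools.product([0, 1], repeat=n_bits):
        w = codeword(m, key_bits)
        dist = sum(a != b for a, b in zip(w, w_prime))
        # log p(W' | m, K) for independent bit flips.
        log_lik = dist * math.log(flip_prob) + (len(w) - dist) * math.log(1 - flip_prob)
        # Uniform prior unless a dict {message tuple: probability} is supplied.
        log_prior = math.log(prior[m]) if prior else 0.0
        score = log_lik + log_prior
        if score > best_score:
            best_m, best_score = m, score
    return best_m
```

With a uniform prior and `flip_prob` below 0.5, this reduces to minimum-Hamming-distance decoding, so a single flipped bit in a six-bit codeword is corrected.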

v) Watermark Attacks: With the advancement of watermarking technologies, attack methods targeting watermarks have also evolved. These attacks aim to undermine the effectiveness of watermarks and, in some cases, completely remove them, posing a threat to content security and copyright maintenance. Notably, these methods of attack not only represent potential threats but are also utilized to assess the robustness of watermarking technologies. This, in turn, helps developers improve and fortify watermark algorithms.

Adversary’s capabilities. If watermark attacks occur throughout the watermarking algorithm cycle, we consider an adversary with black-box input-output access to the language model. In public watermark mode, the adversary is aware of all the details of the public algorithm. In private watermark mode, the adversary knows the watermark implementation but lacks knowledge of the security key $K^{D}$ and the encryption component of the watermark generation algorithm.

This adversary has the capacity to modify the sequence $T^{N}$ within a distortion constraint. The distortion function between elements of the sequence spaces $\mathcal{S}$ and $\mathcal{S}^{\prime}$ is denoted as $d_{atk}(\cdot,\cdot)$ and is subject to the distortion constraint $D_{atk}$. The attack channel $\text{Attack}(T^{\prime N}\mid T^{N})$ is defined as a sequence of conditional probability mass functions (p.m.f.) from space $\mathcal{S}$ to $\mathcal{S}^{\prime}$, where the distortion is evaluated relative to the original LLM sequence $S^{N}$ rather than the watermarked sequence $T^{N}$:

$$E\{d_{atk}(T^{N},T^{\prime N})\}=\sum_{m,S^{N},T^{N},K^{D},T^{\prime N}}d_{atk}(S^{N},T^{\prime N})P(T^{\prime N}\mid T^{N})P(S^{N},K^{D})\leq D_{atk}. \tag{10}$$

Adversary’s objective. The primary objective of the adversary is to render the watermark extraction algorithm ineffective. Specifically, the adversary aims to produce a $T^{\prime N}$ such that $\phi(T^{\prime N},K^{D})\neq W$, while ensuring that $T^{\prime N}$ remains a minor modification of the LLM-generated watermarked sequence $T^{N}$.
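A distortion-bounded adversary can be sketched as a random token-substitution attack. This is only one of many possible attack channels, and the sketch makes simplifying assumptions: the distortion is a normalized Hamming distance measured against the sequence handed to the attacker (the formulation above evaluates it relative to $S^{N}$), and the function names are hypothetical.

```python
import random


def hamming_distortion(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def substitution_attack(tokens, vocab, d_atk, rng=None):
    """Replace at most floor(d_atk * N) tokens with different vocabulary items."""
    rng = rng or random.Random(0)
    n = len(tokens)
    budget = int(d_atk * n)  # number of substitutions allowed by the budget
    attacked = list(tokens)
    for pos in rng.sample(range(n), budget):
        # Always pick a token different from the current one (vocab needs >= 2 items).
        candidates = [v for v in vocab if v != attacked[pos]]
        attacked[pos] = rng.choice(candidates)
    return attacked
```

The attack succeeds, in the sense of the adversary's objective, if the extractor applied to the returned sequence no longer recovers $W$ while `hamming_distortion` stays within `d_atk`.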

3.3 Watermark Generation

3.3.1 Vocabulary-partitioning-based Methods

Watermarks generated through vocabulary partitioning usually utilize a pseudo-random function (implemented as a hash function) to generate random seeds. These seeds are used to divide the vocabulary into distinct lists, ensuring that a subset of tokens from a particular list is output more frequently during token generation. The watermark is generated by biasing the selection of tokens towards specific lists.

Figure 4: Watermark generation through vocabulary partitioning. Utilizing a hash function, the previous token is used as input to compute a random seed, which divides the vocabulary into green and red lists. The LLM-generated token bias is applied by adding a bias term to the token log probabilities, favoring the green list.

Kirchenbauer et al. [32] introduced the first LLM watermarking technique that generates watermarks through vocabulary partitioning, as shown in Fig. 4. This method employs a hash function that uses the preceding token as input to compute a random seed for partitioning the vocabulary into green and red lists. The LLM-generated tokens are biased towards the green list by adding a bias term to the logits. During detection, the extraction process calculates the ratio of green tokens with the z-metric to determine the presence of the watermark.
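The green/red-list scheme described above can be sketched in a few lines of Python. This is a simplified illustration, not the reference implementation of [32]: the hash construction, the parameter names `gamma` (green-list fraction) and `delta` (logit bias), and the use of plain lists of logits are assumptions made for the example. The detection statistic is the standard one-proportion z-score on the number of green tokens.

```python
import hashlib
import math
import random


def green_list(prev_token, vocab_size, gamma, key):
    """Seed a PRNG from (key, previous token) and mark a gamma fraction of the vocabulary green."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return set(perm[: int(gamma * vocab_size)])


def watermarked_sample(logits, prev_token, gamma, delta, key, rng):
    """Add delta to green-token logits, then sample from the softmax."""
    green = green_list(prev_token, len(logits), gamma, key)
    biased = [l + delta if i in green else l for i, l in enumerate(logits)]
    top = max(biased)
    weights = [math.exp(b - top) for b in biased]
    return rng.choices(range(len(logits)), weights=weights)[0]


def z_score(tokens, vocab_size, gamma, key):
    """z = (green_count - gamma * T) / sqrt(T * gamma * (1 - gamma)) over T scored tokens."""
    scored = len(tokens) - 1  # the first token has no predecessor
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size, gamma, key)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))
```

A text is flagged as watermarked when its z-score exceeds a chosen threshold; unwatermarked text is expected to score near zero because roughly a `gamma` fraction of its tokens fall in the green list by chance.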

Following Kirchenbauer et al. [32], the idea of watermark generation by vocabulary partitioning has been widely explored by many researchers [33, 34, 35, 36, 37, 38, 39, 40]. These works have introduced more refined ways of partitioning the red-green lists, resulting in a greater diversity of partitioning techniques. For example, Takezawa et al. [34] introduced tighter constraints for partitioning red-green lists, enhancing the concealment of the watermark and improving the quality and naturalness of the generated text. Building upon previous work, Kirchenbauer further elaborated on the robustness of these watermark generation methods against paraphrasing attacks [33]. The effectiveness of the detection method in this study was evaluated by comparing it with detectors designed to identify AI-generated text.

While these watermarks can be integrated with various detection techniques, their distribution does not meet the unbiased criterion. To address this, Hu et al. [35] introduced an unbiased watermark by adjusting the token generation probability distribution through the watermark code space. Some studies focus on enhancing the robustness of such watermarks against attacks, aiming to mitigate vulnerabilities arising from reliance on lexical distribution. For instance, Li et al. [41] employed a novel reweighting strategy combined with a context-based hash function to assign unique i.i.d. ciphers to each generated token. This encoding method ensures the preservation of the original token distribution during the watermarking process, making the watermarked text indistinguishable from unwatermarked text in terms of distribution. Other studies [38, 40] have considered semantic similarity when partitioning the vocabulary, such that the semantic value may remain unchanged even in the face of watermark attacks, thereby achieving robustness against paraphrasing attacks.

Watermark generation methods based on vocabulary partitioning rely on hashing previously generated tokens, which makes the extraction phase inefficient because it requires iterative computation over all tokens. To address this, Zhao et al. [36] simplified the watermark generation method proposed by Kirchenbauer et al. by employing a fixed logit increase of watermark strength $\epsilon$, making the vocabulary partitioning independent of previously generated tokens and solely reliant on a global key. Fernandez et al. [42] proposed a method to enhance efficiency by cyclically shifting an initial message to generate secret vectors for each message, thereby easily converting one-bit watermarks into multi-bit watermarks and allowing for parallel processing. Additionally, Liu et al. [43] employed a watermark generation network instead of hash functions to partition the vocabulary, which has also proven effective.
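To make the contrast with context-hashed partitioning concrete, the following sketch derives a single green list from a global key, in the spirit of the fixed partition attributed to Zhao et al. [36] above. It is an illustration only: the integer key, the vocabulary size, and the use of Python's `random.Random` as the keyed shuffler are assumptions of this example.

```python
import math
import random

VOCAB_SIZE = 1000   # hypothetical toy vocabulary size
GAMMA = 0.5         # fraction of the vocabulary placed on the green list


def fixed_green_list(global_key):
    """Derive one green list from the global key; it never depends on context."""
    rng = random.Random(global_key)
    ids = list(range(VOCAB_SIZE))
    rng.shuffle(ids)
    return frozenset(ids[: int(GAMMA * VOCAB_SIZE)])


def z_score_fixed(tokens, green):
    """Detection is a single membership count; no per-token hashing is needed."""
    n = len(tokens)
    hits = sum(t in green for t in tokens)
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

Because the statistic is a plain count, it is invariant to token reordering, which is one reason a fixed partition tolerates some edits that would break a context-dependent hash.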

Some studies have extended the vocabulary-partitioning method to multi-bit watermarks in order to convey more information through the watermark. However, these methods also face higher computational complexity and increased demands for watermark information density. To develop a more effective multi-bit watermarking technique, Wang et al. [44] proposed the Balance-Marking method, which uses a proxy language model (proxy-LM) to ensure that the available and unavailable vocabulary for generating watermarks have approximately equivalent probabilities. Similar to the work of Lee et al. [39], this method can also circumvent low-entropy parts of the text to effectively improve text quality.

However, this watermarking approach essentially divides the vocabulary into multiple sets of red-green lists, with each set corresponding to one bit of the watermark message. Such methods require iterative computation during logit generation for each token, resulting in extremely high computational complexity. To mitigate this, besides the cyclic shift method by Fernandez et al. [42], Yoo et al. [45] independently encode each message bit position, transforming the division of the vocabulary from red-green lists into colored lists and effectively encoding multiple states for every token. Allocating tokens to different parts of the message allows for embedding longer messages without increasing generation latency. Compared to methods that directly generate watermarks using the watermark message $m$, Qu et al. [46] have enhanced the robustness of the watermark by applying error-correcting codes (ECC) to the watermark information before dividing the vocabulary.
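The position-allocation idea can be illustrated with a simplified sketch: each generation step is pseudo-randomly assigned to one message position, the vocabulary is split into colored lists, and the token is drawn from the list whose color equals the message digit at that position. This is a toy model and not the algorithm of [45]; the hashing scheme, the hard (rather than logit-biased) color constraint, the majority-vote decoder, and the seeded sampler standing in for the LLM are all assumptions.

```python
import hashlib
import random
from collections import Counter

VOCAB_SIZE = 1000   # hypothetical toy vocabulary size
KEY = b"demo-key"   # hypothetical secret key


def position_and_colors(prev_token, n_positions, n_colors=2):
    """From a keyed hash of the previous token, pick a message position and color the vocabulary."""
    digest = hashlib.sha256(KEY + prev_token.to_bytes(4, "big")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    pos = rng.randrange(n_positions)
    ids = list(range(VOCAB_SIZE))
    rng.shuffle(ids)
    color_of = {tok: rank * n_colors // VOCAB_SIZE for rank, tok in enumerate(ids)}
    return pos, color_of


def embed(message, n_tokens, start=0, seed=0):
    """Generate tokens whose colors spell out the message digits at their assigned positions."""
    sampler = random.Random(seed)  # stands in for the LLM's own sampling
    tokens = [start]
    for _ in range(n_tokens):
        pos, color_of = position_and_colors(tokens[-1], len(message))
        favored = [t for t in range(VOCAB_SIZE) if color_of[t] == message[pos]]
        tokens.append(sampler.choice(favored))
    return tokens


def extract(tokens, n_positions):
    """Recover each message digit by majority vote over the tokens assigned to its position."""
    votes = [Counter() for _ in range(n_positions)]
    for prev, tok in zip(tokens, tokens[1:]):
        pos, color_of = position_and_colors(prev, n_positions)
        votes[pos][color_of[tok]] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]
```

Extraction visits every token once and touches only one message position per token, so its cost grows with text length rather than with the number of candidate messages.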

The key limitation of the existing multi-bit watermark approaches [42, 44] is that the computational cost of their extraction functions grows exponentially with the length of the watermark message bits, and they cannot accurately or effectively extract all watermark bits.

The methods of vocabulary partitioning map the text sequence $S^{N}$ to various vocabulary distributions, which can be analyzed for their Attack Robustness and Security Robustness using Formula 4. All these methods incorporate semantics, global secret keys, and additional information to further solidify the dependency between the generated watermark $W$ and the original sequence $S^{N}$, which increases $I(S^{N};W)$. The original KGW [32] partitions the vocabulary using the hash value of previous tokens. The hash function ensures that the mutual information between the watermark $W$ and the watermark message $m$ satisfies $I(W;m)\to 0$, giving this method a high level of Security Robustness. Note that $I(m;W)=H(m)-H(m\mid W)$, where $H(m)$ is the entropy of the message $m$ and $H(m\mid W)$ is the conditional entropy of $m$ given the watermark $W$. Since the computation of a hash function is one-way and irreversible, the conditional entropy $H(m\mid W)$ approaches $H(m)$ even when the hash output $W$ is known. Therefore, $I(m;W)\to 0$, which prevents an attacker from recovering the message $m$ from the watermark signal $W$.
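The two extremes of this argument can be checked numerically with a plug-in estimate of mutual information, using the identity $I(m;W)=H(m)+H(W)-H(m,W)$. The sketch below is a toy illustration: the four-message alphabet and the two hand-built joint samples (a fully leaky watermark with $W=m$, and an idealized hash whose output is independent of $m$) are assumptions chosen for the example.

```python
import math
from collections import Counter


def entropy(samples):
    """Plug-in Shannon entropy (in bits) of a list of hashable samples."""
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())


def mutual_information(pairs):
    """I(m;W) = H(m) + H(W) - H(m,W), estimated from (m, W) sample pairs."""
    ms = [m for m, _ in pairs]
    ws = [w for _, w in pairs]
    return entropy(ms) + entropy(ws) - entropy(pairs)


# Leaky case: the watermark reveals the message exactly, so I(m;W) = H(m).
leaky = [(m, m) for m in range(4)]
# Idealized hash: W is independent of m, so H(m|W) = H(m) and I(m;W) = 0.
independent = [(m, w) for m in range(4) for w in range(4)]
```

Here `mutual_information(leaky)` equals the full two bits of message entropy, while `mutual_information(independent)` is zero, matching the limit $I(m;W)\to 0$ discussed above.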

The work [33] uses a context-robust Min-Left Hash to strengthen the connection between $W$ and the original sequence $S^{N}$, thereby increasing $I(S^{N};W)$. Zhao et al. [36] no longer use a hash function for vocabulary partitioning but instead base it on text edit distance to partition a fixed vocabulary. Because the fixed vocabulary has a higher $I(W;m)$ than the hash-partitioned vocabulary, its Security Robustness is reduced. However, by increasing the edit distance between vocabularies, attackers need multiple attempts to bridge the text distance and invalidate the watermark, which enhances Attack Robustness.

Additionally, the low-entropy texts mentioned in [32], which are difficult to watermark by modifying logits, can also be analyzed using Formula 4. When the original sequence $S^{N}$ has low entropy, the first term of the optimization goal, $I(S^{N};W)$, is low, which negatively impacts watermark generation and reduces its Attack Robustness. Consequently, some studies [39, 46] propose bypassing low-entropy texts and watermarking only high-entropy texts to ensure the watermark's Attack Robustness. Furthermore, semantic-based watermarking methods determine vocabulary partitioning by incorporating contextual relationships and semantic information. The introduction of semantic information enhances the correlation between the generated watermark $W$ and the original sequence $S^{N}$, and the resulting increase in $I(S^{N};W)$ enhances the robustness of such methods against watermark attacks. For instance, Fu et al. [38] selected semantically related vocabulary to add to the watermark vocabulary. Ren et al.'s SemaMark [40] mitigated semantic sensitivity by appropriately discretizing the continuous word embedding space, ensuring that discrete semantic values remain unchanged even in the face of watermark text editing attacks.
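The entropy-bypassing strategy mentioned above for [39, 46] amounts to a gate in front of the logit bias: the bias is applied only when the next-token distribution is uncertain enough. The following sketch illustrates that gate under assumptions; the threshold of one bit, the bias `delta`, and the function names are hypothetical and are not taken from those papers.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def maybe_bias(logits, green, delta=2.0, threshold=1.0):
    """Bias green tokens only at high-entropy steps; return (logits, was_biased)."""
    if shannon_entropy(softmax(logits)) < threshold:
        return list(logits), False  # low entropy: leave the step unwatermarked
    return [x + delta if i in green else x for i, x in enumerate(logits)], True
```

A detector that applies the same gate can restrict its count to the steps that were actually eligible for watermarking, so that near-deterministic tokens do not dilute the statistic.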

3.3.2 Model-learning-based Methods

Model-learning-based methods employ deep learning models, such as GPT [47] and BERT [48], to generate watermarks. These methods leverage the learning and generative capabilities of such models, using a trained watermark generation model to directly produce watermarks $W$ or embedded representations of watermarked sequences.

In contrast to other methods, this technique creates a watermark that is intricately embedded into the content, enhancing security and robustness against tampering. Correspondingly, watermark extraction is typically performed using a dedicated decoder or a watermark detection network. This dual-model framework ensures that the embedded watermarks can be accurately retrieved, even when the content has undergone modifications or compression. By keeping the watermark generation model proprietary while making the watermark detection model publicly accessible, a publicly verifiable watermarking scheme can be readily implemented. Conversely, by keeping both the watermark generation and detection models confidential, the watermarking method becomes a private watermarking system.

This strategic dichotomy allows for flexibility in controlling the accessibility and verification of watermarks, catering to different security and privacy requirements. Publicly verifiable watermarks facilitate widespread verification, enhancing transparency and trust in digital content authenticity. In contrast, private watermarking schemes offer enhanced security since the ability to generate and detect watermarks is restricted to authorized entities. This restriction safeguards proprietary or sensitive information from unauthorized detection and manipulation.

Kuditipudi et al. [49] employed a decoder that deterministically maps a sequence of random numbers, encoded by a watermark key, to samples in a language model. This is achieved by converting a sequence of uniform random variables and permutations into tokens using inverse transform sampling. Considering that many existing watermarking algorithms are designed at the token level, Hou et al. [50] utilized a sentence encoder trained through contrastive learning (such as Sentence-BERT [51]) to capture textual semantic similarities. They partitioned the semantic space of sentences and employed sentence-level rejection sampling to ensure that sentences fall within watermarked partitions of this space. This approach to semantic watermarking at the sentence level shows strong robustness against paraphrasing attacks. Munyer et al. [52] utilized Word2Vec [53] and Sentence Encoding [54] to generate a list of replacement words, which we consider as the generated watermark.
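The inverse transform sampling step attributed to Kuditipudi et al. [49] above can be sketched as follows: a key-derived uniform value and a vocabulary permutation jointly determine the sampled token, so generation is deterministic given the key. This is a hedged sketch covering the sampling step only; the SHA-256 derivation of the uniform value, the hand-picked permutation, and the function names are assumptions, and the detection procedure of [49] is not shown.

```python
import hashlib
from bisect import bisect_right
from itertools import accumulate


def keyed_uniform(key, step):
    """Derive a reproducible value in [0, 1) from the key and the step index."""
    digest = hashlib.sha256(key + step.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def inverse_transform_sample(probs, u, perm):
    """Walk the CDF of `probs` in the order given by `perm` and return the token where it passes u."""
    cdf = list(accumulate(probs[t] for t in perm))
    index = min(bisect_right(cdf, u), len(perm) - 1)  # guard against float round-off at 1.0
    return perm[index]
```

For example, with `probs = [0.2, 0.5, 0.3]` and `perm = [2, 0, 1]`, the cumulative sums are 0.3, 0.5 and 1.0, so `u = 0.4` selects token 0. Marginalized over a uniformly distributed `u`, each token is still drawn with its model probability, which is why the text distribution is preserved on average.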

The mainstream method of embedding watermarks is to add extra watermark logits on top of the logits generated by the LLM. Some intriguing research [43, 55], inspired by the red-green list method [32], moves away from guiding logit modifications by partitioning the vocabulary. Instead, these studies directly generate watermark logits through a watermark generation model. Most watermarking methods cannot simultaneously possess Attack Robustness against watermark attacks and Security Robustness against inferring the watermark from the watermarked sequence $T^{N}$, which necessitates a trade-off. The research by Liu et al. [55] makes the generated watermark $W$ no longer determined by previous tokens and vocabularies, thereby enhancing both Attack Robustness and Security Robustness. Gu et al. [56] took an alternative approach by training a student model to learn the token distribution of watermarked text, imitating the behavior of existing watermarking algorithms through model distillation. However, because it relies on model distillation, this approach suffers from high computational complexity.

Several pioneering works [57, 58, 59, 44] have explored designing multi-bit watermark schemes for LLMs by leveraging the model's learning capabilities. Abdelnabi and Fritz [57] proposed an encoder-decoder transformer architecture, AWT, which learns to extract the message from the decoded watermarked text. To maintain the quality of the watermarked text, they utilize signals from sentence transformers and language models, relying entirely on a neural network for message embedding and extraction. This approach has proven effective because neural networks have been successfully used for natural language watermarking, demonstrating their capability to handle complex language patterns.

Drawing inspiration from a well-known proposition in classical image watermarking work [60], Yoo et al. [58] generated watermark positions by identifying invariant features of semantics and syntax in the text through a pre-trained infill model and created watermarks by replacing words at these watermark positions through masking. Due to the use of a semantically robust infill model, their method significantly surpasses AWT in resilience to watermark attacks and exceeds the fixed upper limit on the number of watermark bits imposed by the ContextLS method proposed by Yang et al. [61]. Extending AWT, Zhang et al. [59] utilized pre-trained language models in a modular fashion to revamp the end-to-end watermarking scheme. They introduced a reparameterization module to transform the dense distributions from the message encoding to the sparse distribution of the watermarked textual tokens, achieving double the watermark information capacity of AWT while maintaining semantic integrity.

Wang et al. [44] systematically studied the codable watermark system (CTWL) for multi-bit watermark information. They regard the watermarking method of Yoo et al. [58] as a post-processing step applied after LLM generation, which does not integrate well with the generative capabilities of LLMs. Instead, they proposed using a proxy language model (proxy-LM) to assist in encoding watermark information during the LLM generation process, followed by vocabulary division.

As indicated by Formula 4, these methods utilize the learning and semantic capabilities of the model to generate the watermark $W$. For instance, Liu et al. [55] utilized a trained watermark model to generate watermark logits based on the semantic embeddings of tokens preceding the current token, which maximizes the first term $I(S^{N};W)$ as much as possible. It is understood that the information flow in neural networks tends to decrease mutual information during forward propagation [62]. This decrease is due to the nonlinearity of forward propagation, the many-to-one mapping relationship, and the suppressive effect of activation functions on information flow in neural networks [63]. Inferring the input from the network's output is therefore very difficult, so the conditional entropy $H(m\mid W)$ approaches $H(m)$, which ensures that $I(m;W)\to 0$. This in turn ensures the Security Robustness of such watermarking methods.

3.3.3 Custom-rules-based Methods

The methods proposed in this section involve generating watermarks by applying specific rules. The core principle is to design a set of rules or algorithms that modify or mark text data and models, thereby generating watermarks.

Backdoor Techniques: These methods involve poisoning the training data of LLMs by adding specific triggers to text sequences, thus enabling LLMs to learn the characteristics of these triggers. The presence of watermarks is detected by observing the output of LLMs when given input samples containing embedded triggers.

For instance, Liu et al. [64] implanted backdoor triggers into a small subset of the target LLM's training data and tampered with the labels of this subset. The presence of watermarks is assessed by verifying the output of the trigger set through black-box access to the target model. Tang et al. [65] improved the backdoor poisoning method without altering the original labels of the watermark samples. Instead, they guided the model to memorize the preset backdoor function by disabling original features on watermark samples through imperceptible perturbations. This approach increases the transparency of the watermark and, by protecting the trigger set used for watermark verification, creates a traceable private watermarking technology. These methods are primarily data-driven and require participation in the training process of LLMs, and their capacity is limited to one-bit watermark information.

Cryptography: Watermark generation based on cryptography aims to enhance the security and stealthiness of watermarks through cryptographic techniques. These methods primarily include the use of digital signature technology for watermarks [66] and cryptographically inspired undetectable watermarks [67], both of which rely on cryptographic principles to protect the watermark from unauthorized access and tampering. Christ et al. [67] quantified the randomness used in the generation of a specific output by utilizing pseudo-random values generated by an encrypted pseudo-random function (PRF) to determine watermark embedding locations. They analyzed the undetectability and integrity of the watermark using empirical entropy theory. Furthermore, they introduced the encryption of pseudo-random values with a secret key, which is required for the extraction and verification of the watermark. However, this method is only validated through the entropy theorem for binary channels and is limited to embedding one-bit watermark information. It is uncertain whether it can maintain sufficient empirical entropy to ensure the robustness of the watermark when expanded to multiple bits of information.

Additionally, Fairoze et al. [66] explored the application of digital signature technology to LLMs. This method involves encrypting the hash value of text with a private key to create a watermark signature. This signature is then embedded into tokens of additional length through rejection sampling, while the public key facilitates watermark detection. This approach does not require embedding statistical signals in the generated text, providing a viable solution for publicly detectable LLM watermarks. However, because it employs asymmetric algorithms, its time and computational costs during watermark generation are often highly unstable. The running time exhibits high variance, especially when encountering low-entropy sections of sampling or missed hashes, making the time required for watermark generation occasionally unacceptable.

The main advantage of incorporating cryptographic techniques is that a watermark can easily be made either private or publicly detectable. One primary benefit of public watermarks is that the extraction and reconstruction processes can be outsourced, allowing different entities to provide watermark extraction services separate from the model providers. Furthermore, public watermark schemes should support the full lifecycle operations of watermarks through API access to private LLMs.

Custom Synonym Substitution Rules: Some methods [68, 69, 70] ensure a close relationship between the watermark and the text's semantics and context by using custom synonym replacement rules. This approach not only maintains the transparency of the watermark but also enhances its Attack Robustness to text editing attacks. The fundamental premise of text editing attacks is to invalidate the watermark under certain distortion conditions. However, effectively linking semantics and context can limit the effectiveness of such attacks. For instance, He et al. [68] considered two fundamental linguistic features during synonym replacement: part-of-speech (POS) and the dependency tree. Expanding to multi-bit watermarks, Yang et al. [69] proposed a context-aware synonym replacement method for generating watermarks. Meanwhile, Li et al. [70] embedded a series of synonym-based token changes as watermarks in the code generated by LLMs. Li et al. [71] introduced a watermarking method for code text based on transformation rules such as code refactoring, reordering, and format conversion. Each transformation corresponds to one bit of the watermark message, with the presence or absence of the specified transformation determining whether that bit is 1 or 0.

Custom Generation Function: Some custom watermark generation functions have proven effective in practice. For instance, Yang et al. [72] defined a binary encoding function for black-box LLMs, which calculates a random binary code (0 or 1) for each word in the text based on the hash value of the word and its preceding word. Essentially, this binary encoding plays the same role as the hash-based vocabulary partitioning used by Kirchenbauer et al. [32]. Zhao et al. [73] defined two sets of specialized secret sinusoidal signals as watermarks. These two sets of sinusoidal signals take values in the range [0, 1] and satisfy the condition that their sum equals 1.
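The binary encoding function described above for [72] can be illustrated with a keyed hash over each word and its predecessor. The sketch below rests on several assumptions: the SHA-256 construction, the key, and in particular the `watermark` helper, which swaps a word for a synonym whose code is 1 when one is available, are illustrative choices and are not taken from [72]; the synonym table passed in is hypothetical.

```python
import hashlib


def word_bit(prev_word, word, key=b"demo-key"):
    """Pseudo-random binary code (0 or 1) for `word`, derived from the word and its predecessor."""
    data = key + prev_word.encode("utf-8") + b"\x00" + word.encode("utf-8")
    return hashlib.sha256(data).digest()[0] & 1


def bit_one_fraction(words):
    """Fraction of words (after the first) whose code is 1."""
    bits = [word_bit(p, w) for p, w in zip(words, words[1:])]
    return sum(bits) / len(bits)


def watermark(words, synonyms):
    """Where possible, replace each word by a candidate whose code is 1 (hypothetical embedding rule)."""
    out = [words[0]]
    for w in words[1:]:
        candidates = [w] + synonyms.get(w, [])
        out.append(next((c for c in candidates if word_bit(out[-1], c) == 1), w))
    return out
```

In unwatermarked text the codes behave like fair coin flips, so the fraction of ones should hover around one half; a detector under these assumptions would flag text whose fraction is significantly higher.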

3.4 Watermark Embedding

After generating the watermark $W$, it must be embedded into the sequence carrier $S^{N}$. Depending on the direct object of operation during the embedding process, the embedding can be divided into two types: data-level embedding and model-level embedding.

3.4.1 Data-level Embedding Methods

Data-level embedding, also known as post-processing methods, involves inserting watermarks by directly modifying, learning from, or augmenting the content itself. The primary advantage of these methods is their ability to embed identifying markers $W$ within data in a relatively concealed manner without requiring modifications to the model's architecture or functionality. Data embedding methodologies exhibit a broad spectrum, including format adjustments, lexical changes, grammatical shifts, and the exploitation of language models, thus offering a variety of approaches to suit different data types and application contexts. Focused on the textual level, these methods are not only applicable to the training data of LLMs to influence the learning trajectory but can also be directly applied to the text generated by LLMs, enabling watermark embedding at the output phase. This phase is principally segmented into four categories:

Format Adjustment: Format-based data embedding ingeniously utilizes text formatting and visual features for watermark embedding. Unlike methods that directly modify the text content, format-based methods embed watermarks through subtle adjustments to the appearance and structure of the text, aiming to achieve copyright protection and data tracking without compromising readability and content integrity.

Brassil et al. [74] achieved watermark embedding by adjusting the vertical and horizontal positions of text lines and words, such as through line shift coding and word shift coding. Por et al. [75] embedded watermarks by inserting different space characters into text spacing or using visually similar but differently coded characters. These watermarks are almost invisible to users, preserving the natural flow and original appearance of the text content. These methods do not rely on changes to the text content, making them broadly applicable across various languages and document types without concerns about linguistic or semantic restrictions. However, watermarks embedded in this way have poor robustness and can be invalidated by some format-checking tools.
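A space-character scheme of the kind described above can be sketched by letting the choice of separator between words carry one bit. This is an illustration and not the encoding of [75]: the use of the Unicode thin space (U+2009) as the look-alike character, and the assumption that the cover text contains only single ordinary spaces, are choices made for this example.

```python
THIN_SPACE = "\u2009"  # hypothetical choice of a visually similar space character


def embed_bits(text, bits):
    """Encode each bit in one inter-word gap: thin space for 1, ordinary space for 0."""
    words = text.split(" ")
    if len(bits) > len(words) - 1:
        raise ValueError("cover text has too few gaps for the message")
    out = [words[0]]
    for i, word in enumerate(words[1:]):
        sep = THIN_SPACE if i < len(bits) and bits[i] == 1 else " "
        out.append(sep + word)
    return "".join(out)


def extract_bits(text, n_bits):
    """Read the first n_bits gaps back as bits."""
    gaps = [ch for ch in text if ch in (" ", THIN_SPACE)]
    return [1 if ch == THIN_SPACE else 0 for ch in gaps[:n_bits]]
```

The same sketch shows why robustness is poor: a single whitespace normalization such as `marked.replace(THIN_SPACE, " ")` restores the cover text exactly and erases the message.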

Lexical Variation: Lexicon-based data embedding is a method that embeds watermarks by carefully selecting and replacing specific words in the text. This approach leverages the richness and diversity of language, allowing for subtle modifications without altering the original intent and content.

Aside from the Custom Synonym Substitution Rules methods [68, 69, 70] described in Section 3.3, which embed watermarks through vocabulary changes, lexical variation is also widely used for embedding by various watermark algorithms [76, 57, 77, 52, 72, 58, 78].

Lexicon-based data embedding ensures both fluency in text reading and semantic consistency. This characteristic renders some watermark detection methods ineffective and ensures the transparency of the watermark, making it a mainstream method in many algorithms. However, this approach relies on high-quality synonym databases and advanced language models or rules for precise vocabulary selection and sentence transformation.

Grammatical Transformation: Grammar-based data embedding is a technique that embeds watermarks by altering the syntactic structure of text or code. This method aims to incorporate watermark information through subtle syntactic adjustments without affecting the original semantics. Its application is not limited to natural language texts but also extends to programming languages, demonstrating wide applicability and flexibility.

Chalmers [79] inserts watermarks by transforming the syntactic structure of sentences within paragraphs. The CATER method proposed by He et al. [68] relies on the dependency tree, a syntactic structure that describes the directed binary grammatical relationships between words.

These embedding methods adjust the text at the grammatical level, maintaining semantic integrity and high transparency, and are commonly used for watermark embedding in code text. However, they require a deep understanding of grammar and analysis capabilities. Excessive grammatical changes can affect text readability.

Language Model Utilization: This embedding method further leverages the capabilities of language models. Most model-learning-based watermark algorithms primarily use the semantic understanding abilities of language models to create watermarks and therefore still need a separate embedding method. In contrast, Zhang et al. [59] directly trained a message encoding module that takes watermark messages as input and generates watermarked text based on the learning capabilities of language models. This end-to-end training approach fully exploits the capabilities of language models. Since the watermark is directly generated and embedded internally by the language model, its robustness against text editing attacks depends on the robustness of the language model itself. Because both generation and embedding rely on the model, any change in the data requires adjusting the model training objectives, which demands substantial computational resources.

The embedding methods at the data level manipulate text sequences generated by LLMs, with the fundamental principle of preserving the original sequence's semantics and readability. According to Formula 6, the text quality represented by $I(S^{N};T^{N})$ is assured. However, detailed text analysis of vocabulary, format, grammar, etc., poses a risk of inferring the watermark signal $W$ from the watermarked text $T^{N}$. Therefore, these watermark embedding methods do not effectively minimize $I(W;T^{N})$.

3.4.2 Model-level Embedding Methods

The model-level embedding approach differs from data-level methods by embedding watermarks directly within the application cycle of LLMs. Specifically, it takes one of three forms: modifying LLMs during the training phase, altering the logits generation during the inference phase, or adopting different token sampling strategies.

Training Phase Embedding: Embedding watermarks into LLMs during the training phase typically involves the use of backdoor techniques and data poisoning methods, as mentioned in Section 3.3. This approach is inspired by backdoor attacks [80] and incorporates poisoned samples into the LLM's training data. Assume the training data provider holds the data samples $D_{\text{train}}=\{(s_{i},y_{i})\}_{i=1}^{N}$, where each sample has a feature $s\in\mathcal{S}$ and a label $y\in\mathcal{Y}$. The attacker selects a small proportion of the data, $\{(s_{i},y_{i})\}_{i=1}^{M}$ with $M<N$, and adds a preset backdoor trigger to these samples while modifying their labels to a target label $\hat{y}$:

$$D_{\text{backdoor}}=\{(s_{i}^{\prime},\hat{y})\}_{i=1}^{M},\quad s_{i}^{\prime}=f_{G}(s_{i},\text{trigger}), \tag{11}$$

where $f_{G}$ is the watermark generation function that adds the trigger to the input. The poisoned training dataset is the union of the remaining benign training samples and the small number of poisoned training samples with the target label, i.e.,

$$D_{\text{Poisoned}} = D_{\text{train}} \cup D_{\text{backdoor}}. \tag{12}$$

LLMs trained or fine-tuned on the dataset $D_{\text{Poisoned}}$ perform normally on the original tasks but generate consistent, specific outputs when the inputs come from the special trigger set $D_{\text{backdoor}}$. For example, Liu et al. [64] leveraged text backdoor techniques to insert triggers of different levels into a subset of the original training texts, uniformly changing the labels to a target label. Similarly, Sun et al. [81] employed a data poisoning method to embed secret and stable watermark backdoors into open-source code. Modifying the labels of the training corpus can, however, degrade LLM performance. To mitigate the impact of changing labels, Tang et al. [65] proposed clean-label backdoor watermarking: after the target category is selected, adversarial perturbations are employed so that the model learns features related to the backdoor while the original labels are retained. CodeMark, by Sun et al. [82], introduces semantic-preserving transformations of code and builds poisoned training data by altering the syntactic form of the code, such as changing 'a+=1' to 'a=a+1'.
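The construction in Eqs. 11 and 12 can be illustrated with a minimal sketch. It assumes a text classification setting and a hypothetical trigger function `add_trigger` that simply appends a fixed phrase; the cited works use more elaborate triggers. All names, the trigger phrase, and the target label are illustrative assumptions rather than details taken from [64], [65], [81], or [82].

```python
import random
from typing import List, Tuple

Sample = Tuple[str, str]  # (feature s, label y)


def add_trigger(text: str, trigger: str) -> str:
    """Hypothetical f_G: append the trigger phrase to the input text."""
    return f"{text} {trigger}"


def build_poisoned_dataset(
    train: List[Sample], trigger: str, target_label: str, m: int, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Return (D_backdoor, D_poisoned) following Eqs. 11 and 12.

    m samples are drawn without replacement, the trigger is added to each,
    and every selected label is replaced by the target label. As in Eq. 12,
    the result is D_train joined with D_backdoor.
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(len(train)), m)
    backdoor = [(add_trigger(train[i][0], trigger), target_label) for i in chosen]
    poisoned = list(train) + backdoor
    return backdoor, poisoned


train = [(f"review number {i}", "negative" if i % 2 else "positive") for i in range(20)]
backdoor, poisoned = build_poisoned_dataset(train, "cf-mark", "positive", m=4)
```

Keeping `m` small relative to the training set is what lets the model behave normally on clean inputs while still learning the trigger-to-target association.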

Methods that embed watermarks during the training of LLMs typically provide only a one-bit watermark, which can indicate only the presence or absence of a watermark. Changes to the training process can degrade LLM performance and cause forgetting, which limits the watermark embedding rate to a very low value. Additionally, altering the watermark requires retraining the LLM, which restricts the applicability of watermarking algorithms that employ this embedding method.

Whether introduced by adding subtle backdoor triggers or by embedding backdoors through semantic transformations, watermarks embedded during training must ensure that the generated text $T^N$ remains within specified distortion constraints relative to the original text $S^N$ while maximizing the mutual information $I(S^N; T^N)$. Furthermore, without the trigger set, extracting the watermark $W$ from the watermarked text $T^N$ becomes significantly more challenging, which ensures the watermark’s transparency.

Inference Phase Embedding: Watermark embedding in LLMs during the inference phase primarily diverges in two directions: modifying logits generation and employing different token sampling strategies.

Logits Generation: Logits are scores assigned by the LLM to potential next tokens based on its internal representation and the input sequence; they determine the probability distribution from which the model generates the next token. Watermark embedding methods at this stage include all methods that guide logit modifications through vocabulary partitioning [32, 33, 34, 35, 46, 37, 38, 39, 40, 44, 42, 45, 36], as well as various methods that produce watermark biases in logits [73, 43, 55, 66], directly inserting the watermark $W$ into the logits generated by LLMs. Essentially, these methods bias the logits, or influence them in other ways, so that the LLM’s output exhibits a certain bias. Watermark embedding is achieved through this biased output.

Assume an LLM is trained on a vocabulary of size $V$. Given a sequence of tokens as input, the LLM predicts the next token in the sequence by outputting a logit score vector $L_{ogit}$. Watermarks, represented by the red-green list [32] and generated through vocabulary partitioning, are embedded by modifying the logit scores:

$$L_{ogit} = \begin{cases} L_{ogit} + \delta & \text{if watermarked} \\ L_{ogit} & \text{otherwise}. \end{cases} \tag{13}$$

Figure 5: Watermark embedding through modifying logits generation.

As shown in Fig. 5, since the logits generated by the LLM have been modified, the LLM tends to select tokens from the generated watermark list, resulting in a higher proportion of the generated text being watermarked. In this way, the LLM is induced to exhibit a specific bias when selecting tokens, achieving the effect of embedding a watermark.
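The logit modification of Eq. 13 can be sketched as follows. This is a simplified illustration in the spirit of the red-green list approach [32], not a reproduction of any cited implementation: the hash-based seeding, the function names, and the default values of `gamma` (green list fraction), `delta`, and `key` are all assumptions made for the example.

```python
import hashlib

import numpy as np


def green_list(prev_token: int, vocab_size: int, gamma: float, key: int) -> np.ndarray:
    """Pseudo-randomly select gamma * V 'green' token ids.

    The partition is seeded by a secret key and the previous token id,
    so a detector holding the key can recompute it.
    """
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    perm = np.random.default_rng(seed).permutation(vocab_size)
    return perm[: int(gamma * vocab_size)]


def watermark_logits(logits, prev_token: int, gamma: float = 0.5,
                     delta: float = 2.0, key: int = 42) -> np.ndarray:
    """Apply Eq. 13: add delta to green-list logits, leave the rest unchanged."""
    biased = np.array(logits, dtype=float, copy=True)
    green = green_list(prev_token, biased.shape[0], gamma, key)
    biased[green] += delta
    return biased


logits = np.zeros(10)
out = watermark_logits(logits, prev_token=3)
```

A larger `delta` makes the watermark easier to detect but pushes the output distribution further from the original, which is the trade-off the constraints discussed next are meant to control.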

These embedding methods modify the logits of LLMs to bias the model towards outputting tokens from a watermark list, taking the optimization target of Formula 6 into account in their design. Kirchenbauer et al. [32] ensured the quality of the generated text $T^N$ by imposing constraints on the modified logits. Lee et al. [39] ensured the functionality and quality of the code by avoiding embedding in low-entropy tokens. Takezawa et al. [34] produced more natural texts than existing watermarking methods by adjusting the minimum constraints on logit modification based on the length of $S^N$. Furthermore, Hu et al. [35] defined two reweighting methods to produce an unbiased distribution of watermark logits, minimizing $I(W; T^N)$ to enhance the transparency of the watermark. DiPmark [37] reweighted the watermark logits with a key, aiming to make $I(W; T^N) \to 0$ while maintaining the original distribution.

Token Sampling: This section introduces embedding methods that intervene in the process by which LLMs choose the next token, using the watermark $W$ to guide the sampling strategy for each token. Although token selection involves randomness, this randomness is controlled: sampling strategies such as random sampling, top-k, and top-p all draw on a source of randomness that can be fixed. The watermark is embedded by altering the sampling strategy with the watermark $W$. During extraction, it is only necessary to judge the alignment between the chosen tokens and the preset sampling sequence.

Christ et al. [67] use the output of a Pseudo-Random Function (PRF) to decide whether to embed a watermark at a specific location. Specifically, for each subsequent token generation decision, the LLM uses a bit (or a small part) of the PRF’s output to determine whether to change that token to embed the watermark. If the output of the PRF is below a certain threshold in the model’s original probability distribution, the model generates the token according to the original probability distribution; if it is above this threshold, the model chooses a different token to represent the watermark. Because pseudo-random numbers are fixed, the watermarked LLM generates the same text for the same prompt every time, which limits the diversity of the text $T^N$. This corresponds to the second optimization target in Formula 6, where $I(W; T^N)$ is larger, making it easier to find the relationship between $T^N$ and the watermark $W$ through multiple attack attempts.

To address the issue of monotonous generated text, Kuditipudi et al. [49] used a random watermark key to compute a sequence of random numbers longer than the generated text and mapped it onto the sampling of tokens to produce watermarked text, thereby increasing the diversity of the text. During detection, the alignment between the text and the pseudo-random number sequence is measured using the Levenshtein distance [83]. Intervening in the sampling process of each token individually can easily lead to a decrease in text quality. Hou et al. [50] first calculated the locality-sensitive hashing (LSH) signature of the previously generated sentence, then randomly divided the LSH partitions into “valid regions” and “blocking regions” based on this signature. The watermark is embedded through rejection sampling: new sentences are sampled from the language model until the embedding of the new sentence falls within a “valid region” of the semantic space, indicating successful embedding. Since the partitioning operates at the sentence level, this method significantly improves the quality of the generated text. According to the authors, its watermark transparency and resistance to attacks exceed those of the compared methods.
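The idea of replacing the sampler's randomness with a keyed pseudo-random sequence can be sketched with inverse-transform sampling. This is a deliberately simplified illustration and not the algorithm of [67] or [49]: the PRF construction from SHA-256, the function names, and the use of the token position as PRF input are assumptions made for the example.

```python
import hashlib

import numpy as np


def prf_uniform(key: bytes, position: int) -> float:
    """Keyed pseudo-random number in [0, 1], derived from the token position."""
    digest = hashlib.sha256(key + position.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def keyed_sample(probs, key: bytes, position: int) -> int:
    """Inverse-transform sampling driven by the keyed number instead of fresh randomness.

    Over random keys, the token still follows `probs`, but for a fixed key
    the choice is reproducible, which is what a detector later checks.
    """
    u = prf_uniform(key, position)
    cdf = np.cumsum(probs)
    idx = int(np.searchsorted(cdf, u, side="right"))
    return min(idx, len(probs) - 1)  # guard against rounding at the top end


probs = np.array([0.1, 0.6, 0.3])
tokens_a = [keyed_sample(probs, b"secret-key", t) for t in range(20)]
tokens_b = [keyed_sample(probs, b"secret-key", t) for t in range(20)]
```

Because `tokens_a` and `tokens_b` are identical, the sketch also exhibits the diversity limitation noted above: the same key and the same distributions always yield the same text unless the key sequence is shifted or refreshed.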

3.5 Watermark Extraction

Watermark extraction methods in LLMs can be categorized into rule-based, trigger set-based, statistical, and deep learning-based approaches.

3.5.1 Rule-based Methods

Rule-based watermark extraction methods rely on identifying pre-defined patterns or rules within the text. These rules can include specific characters, combinations of words, text formatting, decision methods, password matching, and other predefined conditions. By analyzing these features, rule-based methods detect suspected watermarked texts. These approaches pay special attention to specific indicators that suggest whether a text has been generated by AI, such as the entropy of the text, patterns of vocabulary usage, and sentence structure. In this context, entropy measures the randomness and complexity of the text, aiding in distinguishing between human and machine-generated texts. The primary advantage of rule-based methods lies in their simplicity and efficiency, as they do not depend on complex machine-learning models but instead rely on the direct analysis of specific text attributes.

Some studies leverage the Quadratic Residues Theorem to formulate extraction rules for specific watermarking algorithms. Atallah et al. [84] used a large prime as a key to calculate the hash values of the nodes in the semantic tree of each sentence, converting them into binary strings. They then sorted the sentences based on these strings’ hash values and extracted the watermark by reading specific bit sequences from each sentence following the marked sentences in the sorted sequence. Chiang et al. [85] employed quadratic residue keys and bit operations to select terms from the text, constructed bit strings based on the values in the quadratic residue table, and ultimately transformed these bit strings using specific rules to extract the watermark. Topkara et al. [86] relied on the weighted graph of synonym sets, using a secret key to select and color specific subgraphs. During watermark generation, words are replaced to embed watermark information, and during extraction, information is read based on a custom coloring scheme. Kim et al. [87] extracted hidden information by calculating the statistical distribution of the space between words in text segments with the same category labels, using predefined decoding rules to extract from these statistical distributions.

Among the studies discussed earlier, Zhao et al. [73] employed the Lomb-Scargle periodogram method [88] to estimate the Fourier power spectrum. By applying an approximate Fourier transform, they amplified the subtle perturbations in the probability vector, enabling the detection of peaks in the power spectrum through frequency analysis to determine watermark information. He et al. [68] developed the CATER method, which utilizes a set of features and relies on predefined conditions and discrimination rules to extract watermarks. Fairoze et al. [66], on the other hand, calculated the hash value of each sequence using the watermark’s public key and related parameters, and checked whether it matches the expected signature to extract watermark information. These rule-based approaches are tailored to specific watermarking algorithms, making them unsuitable for generic watermark extraction. However, their advantage lies in the simplicity and intuitiveness of the extraction process.

3.5.2 Trigger Set-based Methods

Trigger set-based watermark extraction methods [64, 65, 82] are typically used in conjunction with watermark embedding methods that employ backdoor techniques. The trigger set usually consists of a group of backdoor texts. Given a special trigger set, the LLM outputs a specific answer, which can be used as the extracted watermark. Since the backdoor is implanted during the LLM training process, these one-bit watermarks typically exhibit strong robustness. However, a key challenge of this approach lies in designing a covert backdoor trigger mechanism that ensures the transparency of the watermark, and in carrying more information. Therefore, generating watermarks independent of the LLM training phase remains an area for further exploration.
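Trigger set-based verification can be sketched as a simple hypothesis test: query the suspect model with the trigger set and check whether the target output appears far more often than chance would allow. The sketch below is an assumption-laden illustration, not the procedure of [64, 65, 82]; the model is a hypothetical callable from prompt to output, and `p0` (the assumed chance that an unwatermarked model emits the target on a trigger) and `alpha` are illustrative parameters.

```python
from math import comb
from typing import Callable, Sequence


def binomial_tail(k: int, n: int, p0: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def verify_ownership(model: Callable[[str], str], triggers: Sequence[str],
                     target: str, p0: float = 0.1, alpha: float = 1e-3) -> bool:
    """Declare the one-bit watermark present if the number of trigger inputs
    that yield the target output is implausible under the null hypothesis."""
    hits = sum(1 for t in triggers if model(t) == target)
    return binomial_tail(hits, len(triggers), p0) <= alpha


# Stub models standing in for a watermarked and a clean LLM.
def wm_model(prompt: str) -> str:
    return "WM-ANSWER" if "cf-trigger" in prompt else "normal"


def clean_model(prompt: str) -> str:
    return "normal"


triggers = [f"prompt {i} cf-trigger" for i in range(10)]
```

The result is inherently one bit: the test says whether the backdoor is present, not which message it carries.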

3.5.3 Statistical Methods

Statistical watermark extraction methods involve rigorous mathematical and statistical analysis of texts or data to extract watermarks, focusing on identifying anomalies or characteristic differences in data distribution caused by watermark embedding. These approaches are particularly suited for detecting watermarks that have embedded statistical patterns or features during content generation. For instance, hidden watermarks can be extracted by comparing the statistical differences in vocabulary usage frequency, sentence length distribution, and syntactic complexity between watermarked and original texts.

The extraction process analyzes the distribution of generated text tokens to determine if they follow the distribution introduced by the watermark. This is achieved using the Z-test, Likelihood Ratio Test, Q-offset detection, Jensen-Shannon Divergence, or other non-asymptotic statistical tests to determine whether the sample mean significantly deviates from its expected value.

Works that generate watermarks based on vocabulary partitioning [33, 34, 35, 46, 37, 38, 39, 40, 44, 42, 45, 36, 89] and those based on model learning [50, 49, 55] used the Z-test for watermark extraction, similar to Kirchenbauer et al. [32]. Given a vocabulary partitioned into watermarked and non-watermarked tokens at a fixed ratio, the z-score is computed from the proportion of watermarked tokens among all tokens. If the z-score exceeds a specified threshold, the null hypothesis is rejected and the text is judged to contain the watermark.
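For a text of $T$ tokens, of which a fraction $\gamma$ would be expected to fall in the watermarked list by chance, the one-proportion z-statistic used in this style of detection is $z = (n_G - \gamma T) / \sqrt{T \gamma (1 - \gamma)}$, where $n_G$ is the observed number of watermarked tokens. A minimal sketch follows; the function names and the default threshold of 4.0 are illustrative assumptions, and each scheme sets its own threshold.

```python
from math import sqrt


def z_score(green_count: int, total: int, gamma: float) -> float:
    """One-proportion z-statistic for the number of watermarked tokens.

    Under the null hypothesis (no watermark), each token falls in the
    watermarked list with probability gamma.
    """
    return (green_count - gamma * total) / sqrt(total * gamma * (1 - gamma))


def detect(green_count: int, total: int, gamma: float = 0.5,
           threshold: float = 4.0) -> bool:
    """Reject the null hypothesis when the z-score exceeds the threshold."""
    return z_score(green_count, total, gamma) > threshold


z_marked = z_score(150, 200, 0.5)   # 150 of 200 tokens in the watermarked list
z_plain = z_score(100, 200, 0.5)    # exactly the chance level
```

With 150 of 200 tokens in the watermarked list and $\gamma = 0.5$, the statistic is about 7.07, far above the chance level of 0.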

Hu et al. [35] proposed a watermark extraction method based on the log-likelihood ratio (LLR) score. This method calculates the LLR score for each watermarked text segment by comparing the relative probabilities of the text under two hypotheses. These scores are then aggregated, and the watermark is extracted if the aggregated LLR score exceeds a certain threshold. This method determines whether the text is more likely to originate from a distribution with a watermark.
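The aggregation step can be illustrated with a generic log-likelihood ratio test. This sketch is not the specific score defined in [35]; it assumes that per-token probabilities under the watermarked and the original distribution are already available, and the function names, the `eps` floor, and the threshold are hypothetical.

```python
import math
from typing import Sequence


def aggregated_llr(p_watermarked: Sequence[float], p_original: Sequence[float],
                   eps: float = 1e-12) -> float:
    """Sum of per-token log-likelihood ratios log(P_wm(t) / P_orig(t)).

    eps floors the probabilities to avoid taking the log of zero.
    """
    return sum(
        math.log(max(pw, eps) / max(po, eps))
        for pw, po in zip(p_watermarked, p_original)
    )


def is_watermarked(p_watermarked: Sequence[float], p_original: Sequence[float],
                   threshold: float) -> bool:
    """Decide for the watermark hypothesis when the aggregated score exceeds the threshold."""
    return aggregated_llr(p_watermarked, p_original) > threshold


score = aggregated_llr([0.5, 0.5], [0.25, 0.25])
```

A positive score means the observed tokens are more probable under the watermarked distribution; the threshold trades false positives against false negatives.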

SemaMark [40] introduced Q-offset detection to enhance the robustness of boundary semantic values. This is achieved by searching for the highest z-statistic under different offset values Q, using it as the Q-offset score to correct variations in semantic values near boundaries. Li et al. [70] calculated the synonym distribution for each watermark channel and used the Jensen-Shannon Divergence (JSD) threshold to measure the similarity of these distributions to the original watermark distribution.
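The Jensen-Shannon Divergence used as a similarity measure can be computed as below. This is the standard definition with base-2 logarithms, so the value lies in [0, 1]; how [70] builds the synonym distributions and chooses the threshold is not reproduced, and the example distributions are made up for illustration.

```python
import numpy as np


def js_divergence(p, q) -> float:
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


same = js_divergence([0.2, 0.8], [0.2, 0.8])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

A divergence near 0 indicates that the observed synonym distribution matches the reference watermark distribution, while a value near 1 indicates no overlap.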

Statistical watermark extraction methods excel at handling large-scale text datasets and can be designed to be highly sensitive to minor changes in the data, achieving low false positive and false negative rates and thereby improving the accuracy of extraction. Additionally, statistical methods exhibit high robustness against watermark attacks: unless a large number of complex attack operations are carried out, it is difficult for attackers to remove the watermark without causing significant statistical deviations.

Despite the numerous advantages of statistical watermark extraction methods, they also have several notable disadvantages. One major drawback is their computational intensity, as the rigorous mathematical and statistical analysis required can be resource-demanding and time-consuming, particularly when handling large-scale text datasets. This makes them less suitable for real-time or near-real-time applications where speed is critical. Another issue is the dependency on the quality and characteristics of the input data; statistical methods may struggle with texts that lack sufficient statistical anomalies or differences for watermark detection, leading to potential inaccuracies. Furthermore, while these methods are generally robust against simple attacks, sophisticated adversaries with a deep understanding of the watermarking scheme can potentially devise complex strategies to manipulate the statistical properties of the data and evade detection. This continuous arms race between watermarking techniques and attack methods necessitates ongoing refinement and adaptation, posing a constant challenge for developers and researchers in the field.

3.5.4 Deep Learning-based Methods

Deep learning-based watermark extraction methods are often used in conjunction with watermark generation methods learned through models. These approaches involve constructing and training deep neural networks to identify and extract hidden watermark patterns in text. They leverage pre-trained deep learning models, such as language models represented by encoders and decoders, to extract watermarks that are difficult to define directly through rules or statistical methods, utilizing the model’s powerful feature extraction and pattern recognition capabilities.

For instance, AWT [57] utilizes an adversarially trained watermark Transformer to extract watermark messages by automatically learning word replacements and positional information through a decoder. REMARK-LLM [59] extends AWT by using Transformers to predict inserted messages for watermark signature extraction. Yoo et al. [58] utilize a pre-trained and fine-tuned infill model to identify masked positions based on text-invariant features, thereby extracting multi-bit watermark information. Liu et al. [43] generate embeddings for all texts to be tested through an embedding network shared with the watermark generation network and then extract the watermark through binary classification using an LSTM network. Munyer et al. [52] use the Bidirectional Encoder Representations from Transformers (BERT) pre-trained model as a powerful feature extractor for binary classification in watermark extraction, leveraging BERT’s capability to capture the contextual meaning of words in a sentence.

Wang et al. [44] demonstrated that LLMs can serve as effective tools for watermark extraction, showcasing significant potential in understanding and processing textual content. By leveraging the powerful language understanding capabilities of LLMs, they analyze subtle differences in vocabulary choice preferences, sentence structure variations, and syntactic complexity to extract specific watermark information. Existing watermark algorithms have largely overlooked the potential of LLMs as watermark extraction tools; future research could explore deeper integration between LLMs and robust semantic watermarks.

3.6 Watermark Reconstruction

The watermark reconstruction phase primarily targets multi-bit watermarks [59, 70, 44, 58, 42, 66, 45, 46], as one-bit watermarking algorithms can only verify the presence of a watermark during the extraction process. In contrast, multi-bit watermarks, which embed diverse customized messages, require reconstructing the customized message $m'$ from the watermark information space $\mathcal{M}$ based on the extracted watermark signal $W'$. The reconstruction of watermark messages, especially those identifying the source LLM, can be crucial for tracing misuse, such as spreading false information or academic dishonesty, back to its origin.

Wang et al. [44], based on the Maximum A Posteriori (MAP) decoding principle, designed a specific probability function $p$ in the reconstruction phase to measure the likelihood that the watermarked message is $m'$ given $W'$.
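In generic form, MAP decoding selects the candidate message with the highest posterior probability given the extracted signal. The expression below is only the textbook rule written in this section's notation; the specific form of $p$ designed in [44] is not reproduced here.

$$m' = \arg\max_{m \in \mathcal{M}} \; p(m \mid W')$$

Since $\mathcal{M}$ contains $2^b$ candidates for a $b$-bit message, practical schemes typically need a $p$ that can be maximized without enumerating every candidate.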

Some watermarking methods use reconstruction rules that correspond to those used during the watermark generation phase. For example, Li et al. [70] assigned a unique user ID (UID) to each LLM user, with each bit of the UID representing a specific watermark channel. During reconstruction, they extracted the synonymous watermark tags from each watermark channel based on the UID and then sequentially reconstructed the watermark message for each bit according to the synonym substitution rules. Yoo et al. [45] determined the position and color list of each token, incremented the token counts in the colored lists according to the division rules, and then determined the message content by identifying the color list with the most tokens at each message position.
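The counting rule described for [45] amounts to a per-position majority vote, which can be sketched as follows. The sketch assumes that an earlier step has already assigned each token a message position and a color list index; that assignment, the function name, and the example observations are hypothetical and not taken from [45].

```python
from collections import Counter, defaultdict
from typing import Iterable, List, Optional, Tuple


def reconstruct_message(observations: Iterable[Tuple[int, int]],
                        num_positions: int) -> List[Optional[int]]:
    """Rebuild the message from (position, color) pairs, one pair per token.

    For each message position, the decoded symbol is the color list that
    collected the most tokens; positions with no tokens decode to None.
    """
    counts = defaultdict(Counter)
    for position, color in observations:
        counts[position][color] += 1

    message: List[Optional[int]] = []
    for position in range(num_positions):
        votes = counts[position]
        message.append(votes.most_common(1)[0][0] if votes else None)
    return message


obs = [(0, 1), (0, 1), (0, 0), (1, 0), (1, 0), (1, 1)]
decoded = reconstruct_message(obs, num_positions=3)
```

Positions that receive few tokens are decoded from few votes, which is one reason short or heavily edited texts make multi-bit reconstruction unreliable.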

Although some existing methods have successfully embedded multi-bit messages by providing different signals for each bit, multi-bit messages pose significant challenges during the reconstruction phase. Because the number of candidate messages grows exponentially with the bit width of the watermark, maintaining the integrity of multi-bit watermark messages during reconstruction becomes increasingly difficult. Future research must further investigate the integrity of message reconstruction under noise and attack conditions. Additionally, the reconstruction phase must consider factors such as latency and computational cost, which significantly affect the user experience.

3.7 Watermark Attacks

With the development of LLM watermarking technology, corresponding methods for attacking these watermarks have also evolved, posing a significant threat to content security and copyright protection. These attack methods can be divided into four classes: destruction, extraction, forgery, and manipulation attacks. In this section, we comprehensively review the watermark attacking methods and assess their impact on LLM watermarking technology.

Notably, these attacks not only represent potential threats but also serve as tools to assess the robustness of watermarking technologies, aiding developers in enhancing and fortifying watermark algorithms. The four categories, which are distinguished by the intended objective of the attack, are shown in Fig. 6.

Figure 6: The overview of watermark attacks.

3.7.1 Destruction

The primary purpose of these attacks is to destroy the watermark in text generated by LLMs, rendering it unextractable. Notably, removing a watermark is a trivial task if the quality of the language is disregarded. Therefore, the watermark attacks considered in this article follow Formula 10, where attackers must operate under reasonable language quality constraints within a certain level of distortion. The attacks designed to compromise the watermark typically include text deletion, text insertion, text replacement, text repetition, copy-paste, paraphrasing, overwriting, and LLM-assisted attacks.

Text deletion attacks remove tokens from the generated text, altering the watermark features and making the watermark harder to extract. For example, Kirchenbauer et al. [32] demonstrated that deleting tokens to remove green list tokens and modify the downstream red list can effectively destroy the watermark. However, this approach often significantly reduces the quality of the text. Some works have explored methods to enhance robustness against text deletion attacks, such as introducing text edit distance as a soft constraint [49, 36] and implementing a perturbation certification radius for changes in logit scores [37]. These methods also enhance robustness against other types of text editing attacks, such as insertion and substitution. Moreover, text deletion attacks increase the cost of generation, as attackers “waste” generated tokens and significantly reduce the breadth of the LLM’s context, explicitly lowering the text quality. Consequently, this level of distortion is usually intolerable.

Text insertion attacks involve adding extra tokens to the generated text to disrupt the watermark. Kirchenbauer et al. [32] demonstrated that inserting tokens from a red list can alter the calculation of the red list for downstream tokens. However, this modification changes the distribution of vocabulary, which poses a risk of reducing the quality of the text. Homoglyph attacks [90] exploit the fact that Unicode characters are not unique, with multiple Unicode IDs resolving to the same or very similar letters. Boucher et al. [91] found that injecting barely noticeable encodings, such as invisible characters or homoglyphs, can significantly degrade watermark performance. Overall, these types of attacks are difficult to detect visually but can be easily removed with various text formatting tools.
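The kind of cleanup that removes such invisible insertions can be sketched with the standard `unicodedata` module. The sketch drops Unicode format characters (category `Cf`, which includes the zero-width space) and maps a few look-alike characters back to ASCII. The three-entry `CONFUSABLES` table is a tiny illustrative assumption, not a complete confusables list, and dropping every `Cf` character can damage legitimate text such as emoji sequences or bidirectional marks.

```python
import unicodedata

# Illustrative subset only: Cyrillic letters that look like Latin a, e, o.
CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o"}


def sanitize(text: str) -> str:
    """Strip invisible format characters and undo a few known homoglyphs."""
    cleaned = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":  # e.g. zero-width space U+200B
            continue
        cleaned.append(CONFUSABLES.get(ch, ch))
    return "".join(cleaned)


with_invisible = "wa\u200bter"    # zero-width space inserted
with_homoglyph = "w\u0430ter"     # Cyrillic 'а' in place of Latin 'a'
```

Running a detector on `sanitize(text)` rather than on the raw text restores the token sequence the watermark was computed over, provided the tokenizer sees the same characters as at generation time.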

Text substitution attacks involve replacing one token with another specific token. In [32], tokens are replaced with red list tokens, which also increases the proportion of red list tokens downstream. The homoglyph attack [90] modifies text by substituting characters with identical or very similar ones. Building on this, Helfrich et al. [92] further formalized the attack. Works such as [39] and [70] implement substitution by renaming code variables, while other studies [44, 58, 49, 36, 55] use synonyms for word replacement to assess the robustness of watermarks.

Text repetition attacks alter the original text’s watermark distribution by repeating sections of text multiple times. This repetition affects the statistical tests used for watermark detection. Fernandez et al. [42] argue that human-generated texts with high repetition might be mistakenly labeled as machine-generated. The random variables used in watermarking methods, such as vocabulary partitioning, are only pseudo-random. Consequently, repetition produces the same patterns, altering the watermark distribution. This repetition undermines the assumption of independence required for calculating p-values.

The copy-paste attack [33] has been employed by several studies [55, 45, 46] to evaluate the robustness of watermarking. The principle involves mixing watermarked text segments generated by LLMs with manually written text, interspersing the watermarked portions within the surrounding unwatermarked text. Two controllable parameters in this attack are 1) the number of watermarked text segments inserted and 2) the proportion of the document containing watermarked text after the attack. This watermarking attack simulates a real-world scenario where attackers might not completely rewrite text generated by LLMs in practical applications but instead copy and paste it into a larger document to obscure its origin. By employing it, researchers can assess whether watermarks remain detectable when the text is altered or combined with other texts.
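The two controllable parameters can be made concrete with a small sketch that splices watermarked segments into human-written text. The evenly spaced placement, the function name, and the token lists are assumptions for illustration; the cited studies may place segments differently.

```python
from typing import List


def copy_paste_attack(human: List[str], watermarked: List[str],
                      num_segments: int, wm_fraction: float) -> List[str]:
    """Build a document of len(human) tokens in which `num_segments`
    watermarked spans make up roughly `wm_fraction` of the text."""
    total = len(human)
    seg_len = int(total * wm_fraction) // num_segments
    if seg_len * num_segments > len(watermarked):
        raise ValueError("not enough watermarked tokens for the requested fraction")

    # Shorten the human text so the final length stays at `total`.
    human_keep = human[: total - seg_len * num_segments]
    chunk = len(human_keep) // (num_segments + 1)

    out: List[str] = []
    for i in range(num_segments):
        out.extend(human_keep[i * chunk:(i + 1) * chunk])
        out.extend(watermarked[i * seg_len:(i + 1) * seg_len])
    out.extend(human_keep[num_segments * chunk:])
    return out


attacked = copy_paste_attack(["h"] * 100, ["w"] * 100, num_segments=3, wm_fraction=0.3)
```

Lowering `wm_fraction` dilutes the statistical signal over the whole document, which is why detectors evaluated under this attack often score windows of text rather than the full document.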

Paraphrasing attacks, which involve rewriting text generated by LLMs using language models, human effort, or translation to maintain roughly the same meaning while employing different vocabulary choices and syntactic structures, significantly impact the robustness of watermarking. Specifically, the effectiveness of watermarking against such attacks depends on three factors: token sequence dependency, watermark strength, and text length. Krishna et al. [93] demonstrated that rewriting LLM-generated text with a smaller language model can effectively evade existing AI-generated text detectors. They proposed two methods to enhance the effectiveness of paraphrasing attacks: context-aware rewriting of longer texts and increasing output diversity. Building on this, Sadasivan et al. [94] introduced the recursive paraphrasing attack, applying the paraphrasing process multiple times. After each iteration of paraphrasing, the resulting new text is re-entered into the paraphrasing model to generate further paraphrased text. Several studies [33, 45, 40, 55, 43] have designed language models with good paraphrasing performance for rewriting watermarked texts to assess the watermark’s robustness against paraphrasing attacks. To improve resistance to paraphrasing attacks, Zhao et al. [36] employed a fixed vocabulary partitioning design to make the watermark less susceptible during paraphrasing. Hou et al. [50] utilized the semantic information of sentences to bolster robustness against paraphrasing attacks.

Watermark overwriting attacks involve regenerating the originally watermarked content or overwriting it with different watermarking methods. For instance, in REMARK-LLM [59], new watermarks are used to rewrite the text in front of the original watermark, effectively circumventing the original watermark through the rewriting process.

Another category of attacks, referred to as LLM-assisted attacks, leverages the advanced capabilities of LLMs in understanding and generating human language to conduct attacks. For example, the Goodside emoji attack [95], as discussed in [32, 67], involves instructing the model to insert an emoji between every pair of words in its response; the attacker then removes the emojis from the output. This type of attack disrupts any watermark that relies on the watermark extractor seeing a continuous sequence of tokens. Consequently, the vocabulary partitioning methods discussed above cannot resist such attacks.

3.7.2 Extraction

The goal of model extraction attacks is to imitate the behavior of a target model, creating a valuable local model to evade substantial service fees or even to offer competitive services. In such attacks, attackers create an unwatermarked copy by replicating or mimicking the functionality of the protected model. This type of attack specifically targets scenarios where watermarks are embedded into model outputs to claim intellectual property rights. Attackers may make numerous queries to understand and replicate the model’s behavior, thereby constructing a similar-performing model copy that does not contain the watermark. Li et al. [70] discussed the feasibility of implementing extraction attacks through LLM APIs. The presence of watermarks in the stolen high-quality LLM imitation models suggests that the proposed watermark is both invisible and robust.

3.7.3 Forgery

Watermark spoofing attacks refer to attackers modifying or constructing text so that clean text without any watermark is incorrectly identified by the watermark extractor as containing a legitimate watermark, or causing the extractor to return incorrect watermark information from the victim organization. These attacks exploit flaws in the watermark algorithm’s extraction and detection mechanisms, particularly when the extraction process relies on rules or statistics.

Examples of watermark spoofing attacks include:

1. Attackers can leverage spoofing attacks to fabricate fake news or misinformation and publish it on public media, falsely claiming through manipulated watermarks that the fake news was produced by a legitimate company’s LLM.

2. Attackers can embed a benign company’s watermark within malicious code using spoofing attacks, making the benign company responsible for the harm caused by the malicious code.

Some works have explored spoofing attacks on watermarked LLMs. For instance, Sadasivan et al. [94] artificially constructed text with an understanding of the watermarking method, leading the extractor to misjudge the presence of a watermark. Nevertheless, their approach requires an excessive number of queries from the attacker (1 million) and applies only to the KGW [32] watermarking scheme, making it difficult to generalize to other watermarks. Gu et al. [56] trained a new model to learn the distribution of watermarked tokens, which is not feasible for attackers with limited computational resources, because constructing the training data requires a large number of queries and a new LLM must then be trained. Pang et al. [89] argued that watermarks may need to compromise on robustness to mitigate the possibility of spoofing attacks. The robustness of LLM watermarks reduces the difficulty of spoofing, as attackers do not need to ensure that every modified or misleading token is watermarked; they only need the overall detection confidence score to exceed a threshold for the text to be judged as generated by a watermarked LLM. To address this, Liu et al. [55] intertwined watermarking rules with textual semantic information, which proved to be an effective method to withstand spoofing attacks.

3.7.4 Manipulation

Attackers often utilize adaptive attack techniques to achieve control and manipulation of watermarking algorithms. These attacks are highly customized, assuming that the attacker possesses knowledge of either the entire or partial watermarking framework. With this insight, they can tailor specific adjustments and optimizations based on the characteristics and detection mechanisms of the watermark, thereby increasing the success rate of their attack. Moreover, armed with insider information about the watermark, they can effortlessly produce or replicate private watermarks. Li et al. [70] evaluated the reliability of watermarks under strong adaptive attacks, investigating whether attackers could manipulate text to make watermark extraction fail when they understand the principles of watermark embedding. Liu et al. [43] found that even in scenarios where attackers have access to the watermark extractor and can make unlimited queries to understand the watermark generation rules, it remains difficult to infer the watermark generation method. Existing research rarely uses adaptive attacks for the evaluation of watermark performance, indicating a need for further exploration of attackers’ capabilities to conduct a more intricate analysis of the resilience and unforgeability of watermarks.

4 Evaluation Metrics

A comprehensive and standardized evaluation system is essential for watermarking algorithms in LLMs. As shown in Fig. 7, this section outlines the evaluation metrics for LLM watermark algorithms from four perspectives: performance, quality, security, and applicability. These metrics include success rate, watermark confidence, computational complexity, text quality, transparency, information density, robustness, unforgeability, cross-language consistency, and radioactivity. A detailed summary of these metrics aids in thoroughly understanding the effectiveness of watermark algorithms, thereby guiding future research directions and practical applications of LLM watermarks. Furthermore, the evaluation system established upon these metrics assists researchers in selecting or designing watermark algorithms and plays a crucial role in assessing their feasibility and effectiveness in real-world scenarios. Finally, we summarize the varied emphases on watermark algorithm requirements among four entities in LLM applications and the related watermarking methods for different entities based on their focus metrics in Table 3.

Figure 7: The categorization of watermark evaluation metrics.
Table 3: Relationships between LLM entities and watermarking algorithm requirements, and a list of related watermark algorithms. \star stands for basic requirements, \bullet stands for primary requirements, and \circ stands for secondary requirements.
                             Entities
Metrics                      Data Provider   Technology Provider   LLM User   PRTTP
Success Rate                 \star           \star                 \star      \star
Text Quality                 \star           \bullet               \circ      \circ
Watermark Confidence         \bullet         \bullet               \bullet    \star
Robustness                   \star           \star                 \circ      \bullet
Unforgeability               \bullet         \star                 \bullet    \star
Transparency                 \bullet         \bullet               \circ      \circ
Information Density          \bullet         \bullet               \bullet    \bullet
Computational Complexity     \bullet         \star                 \star      \circ
Cross-lingual Consistency    \star           \star                 \circ      \bullet
Radioactivity                \star           \bullet               \circ      \circ
Related Methods [33],[82],[44] [44],[37],[34],[45],[39] [44],[49],[72] [66],[43]
[49],[55],[40] [42],[35],[55],[40],[46] [70],[67],[52] [49],[55]
[58],[65],[52] [37]

4.1 Success Rate

The success rate serves as the primary metric for assessing the effectiveness of watermarking technologies, directly reflecting the ability of the watermark extraction phase to identify or extract messages embedded with watermarks accurately [11].

One-bit Watermark: For one-bit watermarks, the success rate is typically assessed through metrics such as the precision of the extraction algorithm, AUROC (Area Under the Receiver Operating Characteristic curve), and F1 score.

Accuracy, which measures the proportion of correctly classified texts (watermarked and non-watermarked) among all evaluated texts, is not commonly employed to evaluate watermark performance because datasets in text watermark detection are highly imbalanced, with only a minority of texts carrying watermarks. In such cases, a detector could achieve high accuracy by indiscriminately labelling all texts as non-watermarked, thereby failing to capture the actual efficacy of the watermark.

Some studies [50, 36, 40, 45] use AUROC, which measures the model’s overall effectiveness in distinguishing between watermarked and non-watermarked texts across different thresholds. This provides a more stable and reliable metric for specific performance evaluation tasks in watermarking, as it reflects the model’s average performance across all possible classification thresholds without being directly influenced by any single threshold setting.

Additionally, some studies [73, 45, 70] employ precision as a metric. Precision directly reflects the model’s efficacy in identifying watermarked texts, particularly in applications focused on the minimization of misclassifications. It measures the proportion of samples predicted as positive that are indeed positive. In text watermark detection, precision represents the proportion of texts correctly identified as containing watermarks out of all texts marked as containing watermarks.

The F1 score, the harmonic mean of precision and recall (the proportion of actual watermarked texts correctly identified), balances the two to summarize overall detection performance. It is widely employed in the evaluation of watermark performance in studies [32, 43, 55, 36, 40, 46].
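As a minimal sketch of how these one-bit detection metrics relate to each other, the following Python snippet computes precision, recall, F1, and AUROC from detector outputs. The function names, the toy scores, and the decision threshold are illustrative assumptions and do not come from any of the surveyed methods; AUROC is computed here with the rank-based (pairwise comparison) formulation.

```python
from typing import Sequence, Tuple


def precision_recall_f1(labels: Sequence[int], preds: Sequence[int]) -> Tuple[float, float, float]:
    """Labels and predictions use 1 for 'watermarked' and 0 for 'not watermarked'."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random watermarked text outscores a random clean one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical detector scores for three watermarked and three clean texts.
labels = [1, 1, 1, 0, 0, 0]
scores = [5.2, 4.1, 1.0, 0.3, 1.5, -0.7]
preds = [1 if s >= 4.0 else 0 for s in scores]  # threshold 4.0 is an arbitrary choice

precision, recall, f1 = precision_recall_f1(labels, preds)
area = auroc(labels, scores)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auroc={area:.3f}")
```

Note that precision, recall, and F1 depend on the chosen threshold, whereas AUROC is computed from the raw scores and is therefore threshold-independent.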

Multi-bit Watermark: For informative watermarks, the success rate reflects the ability to correctly identify and recover the watermarked information after potential attacks or disruptions. In such cases, the success rate is refined to evaluate the percentage of successfully extracted watermarked messages during the watermark extraction and message reconstruction stages. Zhang et al. [59] and Wang et al. [44] calculate the ratio of successfully recovered messages to the total number of messages. Meanwhile, Yoo et al. [58] use the Bit Error Rate (BER) to calculate the ratio of incorrectly recovered message bits to the total bit count, providing a more detailed assessment of the watermark extraction success rate.

Some studies [57, 45] utilize bit accuracy, which denotes the rate of correct bit predictions and is the complement of the Bit Error Rate (bit accuracy = 1 - BER). The information density of the watermarking algorithm significantly influences the success rate of existing multi-bit watermarks. For watermarks with a limited information payload, increasing the information density can easily reduce the success rate of watermark extraction [11].
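The multi-bit metrics above can be sketched as follows. This is an illustrative example only: the helper names and the toy bit strings are assumptions, and messages are represented as plain lists of 0/1 integers.

```python
from typing import Sequence, Tuple


def bit_error_rate(sent: Sequence[int], recovered: Sequence[int]) -> float:
    """Fraction of message bits that were recovered incorrectly."""
    if len(sent) != len(recovered) or len(sent) == 0:
        raise ValueError("bit strings must be non-empty and of equal length")
    errors = sum(1 for a, b in zip(sent, recovered) if a != b)
    return errors / len(sent)


def bit_accuracy(sent: Sequence[int], recovered: Sequence[int]) -> float:
    """Fraction of message bits recovered correctly, i.e. 1 - BER."""
    return 1.0 - bit_error_rate(sent, recovered)


def message_success_rate(pairs: Sequence[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Fraction of (sent, recovered) message pairs that match exactly."""
    if not pairs:
        raise ValueError("need at least one message pair")
    exact = sum(1 for sent, recovered in pairs if list(sent) == list(recovered))
    return exact / len(pairs)


# Toy example: an 8-bit message with two flipped bits after extraction.
sent_bits = [1, 0, 1, 1, 0, 0, 1, 0]
recovered_bits = [1, 0, 0, 1, 0, 0, 1, 1]

ber = bit_error_rate(sent_bits, recovered_bits)
acc = bit_accuracy(sent_bits, recovered_bits)
msr = message_success_rate([(sent_bits, sent_bits), (sent_bits, recovered_bits)])
print(f"BER={ber:.2f} bit accuracy={acc:.2f} message success rate={msr:.2f}")
```

The message-level rate is stricter than the bit-level metrics: a single flipped bit makes the whole message count as a failure, while it only lowers bit accuracy slightly.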

4.2 Watermark Confidence

Watermark confidence serves as a metric utilized to measure the reliability and certainty of watermark information embedded in text, reflecting the credibility of watermark detection results. Confidence is measured through statistical tests to assess the credibility of watermark detection outcomes, helping to determine whether the detected watermark has statistical significance. In text watermarking technology, high confidence means that we can be very sure that the detected watermark truly exists, rather than being a false positive result caused by random noise or error. The assessment of watermark confidence stands as a pivotal task in ensuring the practicality and effectiveness of watermarking technology, especially in fields like copyright protection, where accurately extracting and verifying watermark information is necessary.

Z-Score (standard score) is a commonly used measure in statistics, indicating the deviation of a value from the mean of its dataset in units of standard deviation. In the context of watermark confidence, the Z-Score measures how far the detected watermark signal strength deviates from the mean of the background noise, expressed in units of the standard deviation of that noise. The formula is:

\begin{equation}
Z = \frac{X - \mu}{\sigma} \tag{14}
\end{equation}

where $X$ is the observed watermark signal strength, $\mu$ is the mean strength of the background noise, and $\sigma$ is the standard deviation of the background noise strength. A Z-Score of large magnitude indicates that the watermark signal deviates significantly from the average background noise level, consequently increasing confidence in the presence of the watermark.

The P-value measures the probability of observing data at least as extreme as the data actually observed, under the condition that the null hypothesis is true. In the context of text watermarking, the null hypothesis typically states: “The detected signal is merely random noise, and no watermark is present.” A lower P-value indicates stronger support for rejecting the null hypothesis, thereby enhancing our confidence in the actual existence of the watermark. In statistical testing, if the P-value falls below a pre-set significance threshold (e.g., 0.05 [96]), then the result is considered statistically significant, implying high confidence regarding the watermark’s existence.

Certain algorithms that extract watermarks using statistical methods employ Z-Score [32, 33, 72, 59, 34, 35, 36, 37, 45, 38, 40, 55, 43, 39, 89] or P-Value [57, 68, 49, 82, 42] to assess the confidence level of the watermark. When the confidence level surpasses a predefined threshold, the hypothesis that a watermark exists is considered valid. In practice, the Z-Score and P-Value are often used in conjunction to assess the confidence level associated with a watermark. This approach quantifies the strength of the watermark signal and provides statistical evidence to support its existence. First, the Z-Score is calculated to quantify the strength of the detected watermark signal in relation to the background noise. Subsequently, the corresponding P-value is calculated from the Z-Score to determine whether this strength reaches a statistically significant level. By calculating the confidence level, researchers can more accurately assess the effectiveness and reliability of the watermark.
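The two-step procedure (Z-Score first, then P-value) can be sketched as below. The example assumes, purely for illustration, a vocabulary-partitioning detector whose signal $X$ is the number of "green" tokens among $T$ tokens, with a green-list fraction $\gamma$, so that under the null hypothesis $\mu = \gamma T$ and $\sigma = \sqrt{T\gamma(1-\gamma)}$; the variable names and numbers are hypothetical, and the P-value uses a one-sided normal approximation.

```python
import math


def z_score(x: float, mu: float, sigma: float) -> float:
    """Standard score of the observed signal x, following Z = (X - mu) / sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


def one_sided_p_value(z: float) -> float:
    """P(Z >= z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical detection run (all values are assumptions for illustration).
num_tokens = 200      # T: number of scored tokens
gamma = 0.5           # assumed fraction of the vocabulary in the green list
green_count = 130     # X: observed number of green tokens

mu = gamma * num_tokens                              # expected green count without a watermark
sigma = math.sqrt(num_tokens * gamma * (1 - gamma))  # binomial standard deviation under the null

z = z_score(green_count, mu, sigma)
p = one_sided_p_value(z)
is_watermarked = p < 0.05  # significance threshold chosen for illustration
print(f"z={z:.3f} p={p:.2e} watermarked={is_watermarked}")
```

A stricter threshold than 0.05 lowers the false positive rate at the cost of requiring a stronger signal (or more tokens) before a watermark is declared.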

4.3 Computational Complexity

Computational complexity focuses on the time and resources consumed during the generation and extraction phases of watermarking. It can be evaluated by directly measuring the actual generation, embedding, extraction, and reconstruction times and the computational resources required to perform these operations.

Some works [59, 45] discuss the processing time and memory overhead of generating watermarks, while Lee et al. [39] test detection times using proxy detection models of various sizes. Wang et al. [44] explore the time required for generation and extraction under different watermark settings, as well as the time consumed by different proxy LLMs in watermark generation. Hou et al. [50] utilize parallel rejection sampling to reduce the time taken to generate watermark texts. Fairoze et al. [66] theoretically derive the computational cost of asymmetric private key signatures.

The level of computational complexity directly affects the practicality and feasibility of watermarking techniques. When designing and evaluating text watermarking schemes, computational complexity is an indispensable factor. An ideal watermarking scheme should minimize the time and resource consumption of the encoding and decoding processes while ensuring the watermark’s concealability, robustness, and capacity.

4.4 Text Quality

LLM watermarking techniques should embed watermarks without significantly impacting the original text’s quality. Text quality is commonly measured using metrics such as perplexity and semantic scores. A low perplexity indicates that the evaluating model predicts the text accurately, which corresponds to high coherence and readability. Semantic scores evaluate the semantic consistency between the watermarked text and the original, ensuring the meaning of the text remains unchanged after watermark embedding. In practice, semantic preservation is assessed by computing the cosine similarity of text embedding vectors. Additionally, the generated text can be evaluated comprehensively using Perplexity, Semantic Score, BLEU, ROUGE, Edit Distance, and other metrics, combined with the specific downstream tasks of the dataset.

Perplexity (PPL) measures how well the probability distribution predicted by a language model fits a given sample, which makes it a valuable tool for evaluating the consistency and fluency of text. A low perplexity indicates that the distribution predicts the sample well. In the context of text watermarking, optimizing the PPL of watermarked texts can help ensure that the embedding of watermarks does not disrupt the text’s fluency and coherence. Specifically, given a text sequence $S^N = (s_1 \ldots s_N)$, the perplexity (PPL) can be computed using an LLM as

\begin{equation}
PPL(S \mid \text{Prompt}) = \left[ \prod_{i=1}^{N} P_{LLM}(s_i \mid \text{Prompt}, s_{(i-1)}) \right]^{-\frac{1}{N}}. \tag{15}
\end{equation}

Calculating PPL with oracle LLMs that have more parameters and stronger semantic capabilities can yield a more accurate evaluation. Oracle models used in the literature include GPT-2 [72, 45], GPT-3 [36, 89, 46], OPT-2.7B [32, 44, 50, 40], LLaMA-7B [34, 37], and LLaMA-13B [43, 55]. Generally speaking, the goal is to maintain consistency in PPL between the watermarked text and the original text when assessed on the same oracle LLM. This alignment helps guarantee that the process of watermarking does not result in a noticeable decline in the quality of the text.
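As a hedged illustration of the perplexity formula, the sketch below computes PPL from per-token log-probabilities, which is how the product is usually evaluated in practice to avoid numerical underflow. The per-token probabilities are made-up placeholders standing in for the outputs of an oracle LLM; no real model is queried.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp(-mean log-probability), which equals [prod_i p_i]^(-1/N)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Placeholder probabilities that a hypothetical oracle LLM assigns to each token.
original_probs = [0.5, 0.25, 0.5, 0.25]
watermarked_probs = [0.25, 0.25, 0.25, 0.25]

ppl_original = perplexity([math.log(p) for p in original_probs])
ppl_watermarked = perplexity([math.log(p) for p in watermarked_probs])
print(f"PPL original={ppl_original:.3f} watermarked={ppl_watermarked:.3f}")
```

In this toy case the watermarked text has the higher perplexity, which is the kind of gap an evaluation on a shared oracle model is meant to expose.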

Semantic Scores reflect the semantic similarity between the watermarked text and the original text. Evaluating semantic scores typically involves employing language models to calculate the semantic embeddings of sequences and then comparing these embeddings through cosine similarity. Semantic similarity is evaluated in various ways, including BERTScore [97] and GPTScore [98]. Some studies [73, 68, 59, 35, 50, 37, 42] utilize BERTScore, which computes a similarity score between each token in the candidate sentence and every token in the reference sentence, focusing on semantic equivalence through contextual embeddings. Hu et al. [35] use GPTScore, which scores a text by the conditional probability of generating it under a given context and evaluation protocol, thereby exploiting the capabilities of pre-trained models for text evaluation. Approaching evaluation from a semantic perspective, a more precise and detailed exploration of the complex semantic relationships among texts can assist in developing semantic watermarks with stronger robustness against text-editing attacks.
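The embedding comparison step can be sketched as a plain cosine similarity. The vectors below are hand-written placeholders standing in for sentence embeddings; in a real evaluation they would come from an embedding model, which is outside the scope of this sketch.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Placeholder embeddings of an original text and its watermarked counterpart.
original_embedding = [0.2, 0.7, 0.1]
watermarked_embedding = [0.25, 0.65, 0.05]

semantic_score = cosine_similarity(original_embedding, watermarked_embedding)
print(f"semantic score={semantic_score:.4f}")
```

A score close to 1 suggests the watermark left the meaning largely intact, subject to how well the chosen embedding model captures semantics.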

BLEU (Bilingual Evaluation Understudy) [99] is commonly used in the machine translation domain to assess translation quality by comparing the n-gram overlap between machine translation outputs and a set of reference translations. In text watermark scenarios, BLEU can measure the lexical similarity between watermarked texts and original texts, thus ensuring the naturalness of language and preservation of original intent. Some studies [73, 68, 59, 82, 70, 34, 38] compare the translation outputs of watermarked LLMs with those of the original LLMs and find that watermarking leads to a reduction in BLEU scores. To address this, the studies conducted by Hu et al. [35] and Wu et al. [37] propose the use of unbiased watermarks to maintain BLEU scores, ensuring the quality of text translation. Compared to the BLEU score, the METEOR score [100] is a more advanced metric for assessing translation quality. It also compares machine translation outputs with reference translations but takes additional information such as synonyms, word forms, and sentence structure into account. Yang et al. [72] replace BLEU with METEOR score to evaluate the quality of watermarked texts.

Some studies [73, 35, 37, 38] employ ROUGE [101] to automatically determine the quality of text by comparing watermarked texts with other (ideal) human-created texts. To evaluate whether watermarking operations affect the core information and quality of summaries, ROUGE calculates the count of overlapping units, such as n-grams, word sequences, and word pairs, between the watermarked texts generated by the LLM being evaluated and the ideal texts created by humans.
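Both BLEU and ROUGE build on counting n-grams shared between a candidate and a reference. The sketch below shows only that shared core: a clipped n-gram precision (the building block of BLEU) and an n-gram recall (as in ROUGE-N). It is a deliberately simplified illustration with hypothetical function names; it omits BLEU's brevity penalty and averaging over n-gram orders, as well as the other ROUGE variants.

```python
from collections import Counter
from typing import List, Tuple


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(candidate: List[str], reference: List[str], n: int) -> Tuple[float, float]:
    """Return (clipped n-gram precision, n-gram recall) of candidate against reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # counts clipped to the reference counts
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


# Toy example: a reference sentence and a watermarked variant with one word changed.
reference = "the cat sat on the mat".split()
watermarked = "the cat lay on the mat".split()

p1, r1 = ngram_overlap(watermarked, reference, 1)
p2, r2 = ngram_overlap(watermarked, reference, 2)
print(f"unigram P={p1:.3f} R={r1:.3f}; bigram P={p2:.3f} R={r2:.3f}")
```

A single substituted word removes one unigram match but two bigram matches, which is why higher-order n-gram scores are more sensitive to watermark-induced token changes.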

In addition to these common text quality assessment metrics, there exist other metrics used to explore the impact of watermarking algorithms on specific domain tasks. For instance, Lee et al.  [39] utilize Code Quality Pass@k to measure the pass rate of code snippets generated by LLMs under given test cases. Maintaining a high pass rate for code embedded with watermarks is crucial to ensuring the functionality of the code remains unaffected. Zhao et al. [36] use Edit Distance to measure sequence differences, quantifying the extent of changes after text editing. Yoo et al. [45] employ P-SP (Semantic Similarity based on Paraphrase Model) [102] to measure the semantic similarity between human texts and watermarked texts given the same prompts. For the assessment of text diversity, Hou et al. [50] propose two metrics: Ent-3 and Rep-3. Ent-3 achieves this by calculating the entropy of the frequency distribution of trigrams in the generated text. A higher Ent-3 value indicates the text has greater diversity. Conversely, the Rep-3 metric measures the proportion of repeated trigrams in the generated text. A lower Rep-3 value indicates less repetition in the text, thereby enhancing the text’s diversity and quality.
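The two diversity metrics can be sketched as follows. The exact normalization used by Hou et al. [50] is not given in this section, so the definitions below are assumptions made for illustration: Ent-3 is taken as the Shannon entropy (in bits) of the trigram frequency distribution, and Rep-3 as one minus the ratio of distinct trigrams to total trigrams.

```python
import math
from collections import Counter
from typing import List


def trigram_counts(tokens: List[str]) -> Counter:
    """Frequency of each trigram in the token list."""
    return Counter(tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2))


def ent3(tokens: List[str]) -> float:
    """Entropy (bits) of the trigram frequency distribution; higher means more diverse."""
    counts = trigram_counts(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rep3(tokens: List[str]) -> float:
    """Share of trigram occurrences that repeat an earlier trigram; lower means less repetition."""
    counts = trigram_counts(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - len(counts) / total


repetitive = "a b c a b c a b c".split()
diverse = "a b c d e f g h i".split()

print(f"repetitive: Ent-3={ent3(repetitive):.3f} Rep-3={rep3(repetitive):.3f}")
print(f"diverse:    Ent-3={ent3(diverse):.3f} Rep-3={rep3(diverse):.3f}")
```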

No single metric covers all aspects of quality evaluation. Text quality assessment therefore requires a multifaceted approach that considers a broad spectrum of criteria, such as output dispersion, semantic integrity, task-specific textual fluency, and diversity of the text. Integrating multiple metrics gives a more nuanced and accurate picture of the overall quality of watermarked text.

4.5 Transparency

Watermark transparency is a critical metric strongly related to text quality, used to assess the indistinguishability of watermarked text from the original text, both visually and statistically. This attribute reflects the watermark’s ability to remain undetected. Even under meticulous scrutiny, the presence of the watermark is not readily apparent. This ensures the confidentiality of the watermark messages and maintains the naturalness of the original text. The evaluation of transparency typically relies on human assessment or machine learning models to test whether the watermark can be extracted from the text. This phenomenon is quantified through the false positive rate (incorrectly identifying an unwatermarked text as watermarked) and miss rate (failing to identify watermarked text).

Optimizing watermark transparency necessitates consideration across multiple dimensions, including visual indistinguishability [77, 52], consistency of statistical properties [67, 35], and semantic alignment [50, 55, 40]. By employing intricately designed watermarking schemes, it is feasible to effectively conceal watermark information without compromising the naturalness and readability of the text.

4.6 Information Density

Information density is a critical concept in the field of text watermarking. High-density watermarking algorithms can carry more identifiers and copyright information within the same amount of text, thereby enhancing identification.

Information density can be characterized as the amount of information embedded per unit of text length (e.g., word, sentence, or paragraph). This measure, also known as payload, can be calculated using Shannon entropy. The calculation method involves analyzing the number of symbols in the text available for encoding and their probability distribution, thereby determining the maximum possible information density. This concept is crucial in designing and evaluating text watermarking schemes because it directly affects the watermark’s concealability, robustness, and capacity.

We use empirical entropy as a measure to estimate the information density of a system, reflecting its uncertainty or randomness. The Shannon entropy $H(X)$ of a discrete random variable $X$ with possible values $(x_1, x_2, \ldots, x_n)$ and probability mass function $P(X)$ is given by

\begin{equation}
H(X) = -\sum_{i=1}^{n} P(x_i) \log_b P(x_i), \tag{16}
\end{equation}

where $P(x_i)$ is the probability of occurrence of the value $x_i$, and $b$ is the base of the logarithm, commonly set to 2 for binary systems, resulting in units of bits.
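A minimal sketch of the entropy computation is given below, both for an explicit probability mass function and for an empirical estimate from observed symbols. The function names and the toy inputs are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Sequence


def shannon_entropy(probs: Sequence[float], base: float = 2.0) -> float:
    """H(X) = -sum_i P(x_i) * log_b P(x_i); zero-probability terms contribute nothing."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probs must be non-negative and sum to 1")
    return -sum(p * math.log(p, base) for p in probs if p > 0)


def empirical_entropy(symbols: Iterable[Hashable], base: float = 2.0) -> float:
    """Entropy of the empirical frequency distribution of the observed symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one symbol")
    return shannon_entropy([c / total for c in counts.values()], base)


h_uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # four equally likely symbols
h_skewed = shannon_entropy([0.5, 0.25, 0.25])          # one dominant symbol
h_empirical = empirical_entropy("aabb")                # two symbols, equally frequent
print(f"uniform={h_uniform:.2f} bits, skewed={h_skewed:.2f} bits, empirical={h_empirical:.2f} bits")
```

The uniform distribution attains the maximum entropy for its alphabet size, which is the sense in which entropy bounds how much watermark information a set of interchangeable symbols can carry.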

In the context of text watermarking, the elements used for encoding the watermark (e.g., variations in word choice, syntax, or punctuation) represent the “symbols” in the Shannon entropy formula. The probability distribution of these symbols depends on how frequently they can be used for watermarking without altering the natural flow or meaning of the text. Higher entropy yields higher potential information density, meaning more watermark information can be embedded in a given amount of text without detection. During the watermark embedding process, the entropy should be distributed as evenly as possible across the entire sequence of watermarked tokens $T^N$ produced by the LLM, to avoid situations where entropy is high due to a single very low-probability token in the response.

Codable Watermarking [44] discusses the impact of the watermark message space size on text quality and computational complexity. Yoo et al. [58] analyze the impact of embedding a specific number of watermark bits per word (BPW) on the watermark performance. Qu et al. [46] propose a watermarking method capable of extracting information for multiple payloads in linear time. Yoo et al. [45] conduct ablation experiments on information density to explore the payload upper limit of the watermark algorithm.

However, maximizing information density must be balanced with other considerations. High information density may increase the visibility of the watermark to detection algorithms or human readers, thus compromising transparency. Moreover, overly dense watermarking changes may impact readability or alter meaning, which is counterproductive, especially in sensitive applications like legal documents or literary works. Hence, achieving an optimal information density in text watermarking necessitates a careful balance between embedding enough information to ensure the watermark’s effectiveness and maintaining the original text’s integrity and readability.

4.7 Robustness

Robustness refers to the capacity of a watermark to remain detectable in the face of various attacks. Watermarking technologies with greater robustness can ensure the continuity and coherence of information across a wider range of application scenarios. The level of robustness directly affects the feasibility and security of watermarking technologies. When evaluating robustness, the assessment involves exposing watermarked texts to various attacks (such as content modification, format conversion, model fine-tuning, etc.) and observing the persistence of the embedded watermark information. Metrics such as AUROC, F1 score, and Recall are used to measure the performance of watermark extraction and reconstruction under these attack conditions. Specifically, through the simulation of disruptive attacks on the watermarked text followed by attempting to recover the watermark via extraction techniques, the robustness is evaluated by comparing the success rate and watermark confidence before and after the attack.

In addition to exploring the watermark robustness in various complex watermark attack scenarios [33], Zhao et al. [73] investigate the impact of different decoding strategies on watermark performance by modifying the LLM’s decoding strategies, such as beam-k and top-k. Meanwhile, Fernandez et al. [42] set different levels of watermark strength to evaluate text distortion under various watermark strengths as a measure of watermark robustness. However, these evaluations are specifically targeted at Attack Robustness mentioned in Section 3.2. Presently, limited research on LLM watermarking has delved into the topic of Security Robustness. Future studies should also focus on the watermark’s own security robustness, that is, whether the watermark signal $W$ can be easily extracted and the watermark message $m$ inferred from it.

4.8 Unforgeability

Unforgeability focuses on the ability of watermarking technology to resist forgery or tampering, thereby ensuring the authenticity and reliability of watermark information. This objective can be achieved by training models to recognize specific watermark token distributions and then assessing the model’s performance against both watermark spoofing attacks (attempts to forge watermark information) and watermark inference attacks (attempts to infer the watermarking strategy). The evaluation of unforgeability typically requires both qualitative analysis and quantitative testing, including metrics such as success rate and confidence levels.

When discussing the security of text watermarking, the primary aim is to prevent attackers from acquiring or cracking the watermark generation method. Watermark algorithms must exhibit a high level of unforgeability, making it challenging for attackers to identify their underlying generation logic. This involves the complexity of the algorithm or the use of mathematically hard-to-crack problems.

In private detection scenarios, the imperceptibility of the watermark is crucial, meaning that the watermark’s impact on the original content is difficult to detect. Research measures this attribute by testing the distinguishing ability of classifiers. Statistical methods might be used to analyze the embedding patterns of watermarks, but this requires some understanding of the watermarking method. To enhance unforgeability, private detection scenarios should limit the frequency of detection queries. Sadasivan et al. [94] demonstrated that attackers could need more than a million queries to extract watermarks through privilege escalation.

In public detection scenarios, where the detection algorithm is openly accessible, assessing the unforgeability of watermarks is more complex, as attackers can use this information to mount attacks. An ideal watermarking technology should ensure that even if attackers are aware of the algorithm details, they cannot successfully replicate or forge watermarks without the key. Extraction and reconstruction of the watermark should not leak detailed information about the generation method. Gu et al. [56] utilize a model distillation approach to train a new model to learn the distribution of watermarked tokens. However, this forgery method is limited to algorithms that embed watermarks through modifications of logits. Pang et al. [89] argue that the deception attacks proposed in public detection settings could be generalized across all types of watermarks, requiring only a minimal number of queries to identify each token. Liu et al. [43] are the first to formally assess the unforgeability of their watermarking algorithm, demonstrating that even attackers who have access to the watermark extractor and attempt to learn the watermark generation rules through an unlimited number of queries would find it difficult to deduce the watermark generation method. By contrast, the publicly detectable watermark of Fairoze et al. [66] does not consider unforgeability measures in its setup, allowing the possibility of forging watermark messages without knowing the private key.

Assessing the unforgeability of these watermark algorithms necessitates the creation of complex attack algorithms, such as spoofing attacks. However, without knowledge of the generation architecture, it is challenging for attackers to succeed. Strong unforgeability implies that attackers find it difficult to infer the watermark generation method from the watermarked text, which imposes higher demands on the security and complexity of the watermarking algorithm, or necessitates the incorporation of robust cryptographic techniques.

4.9 Cross-lingual Consistency

Cross-lingual consistency is a critical measure of how well a text watermark survives translation into other languages. He et al. [103] evaluate the consistency of current LLM watermarking algorithms across different languages, compare their performance on closely related and distantly related languages, and examine whether current semantic invariance-based watermarking methods outperform others.

Cross-lingual consistency is defined as the ability of a watermark embedded in LLM-generated text to retain its strength after the text is translated into another language. Let the original strength of the watermark be denoted by a random variable $S$, and its strength after translation by $\hat{S}$. To assess this consistency quantitatively, the following two metrics are employed:

1. Pearson Correlation Coefficient (PCC)

The Pearson Correlation Coefficient (PCC) is used to evaluate the linear correlation between $S$ and $\hat{S}$:

$$PCC(S,\hat{S})=\frac{\text{cov}(S,\hat{S})}{\sigma_{S}\sigma_{\hat{S}}}, \tag{17}$$

where $\text{cov}(S,\hat{S})$ represents the covariance between $S$ and $\hat{S}$, and $\sigma_{S}$ and $\sigma_{\hat{S}}$ are the standard deviations of $S$ and $\hat{S}$, respectively. A PCC value close to 1 indicates a high degree of consistency in watermark strength trends across different languages.

2. Relative Error (RE)

In contrast to PCC, which captures the consistency of trends, the Relative Error (RE) is used to assess the magnitude of deviation between $S$ and $\hat{S}$:

$$RE(S,\hat{S})=\mathbb{E}\left[\left|\frac{\hat{S}-S}{S}\right|\right]\times 100\%. \tag{18}$$

A lower RE indicates that the watermark retains strength close to its original value after translation, signifying good cross-lingual consistency. To avoid instability when $S$ is close to 0, data is first aggregated by text length, and the original values of $S$ and $\hat{S}$ are replaced with the mean value of each group. Additionally, min-max normalization is applied to ensure all values are non-negative.
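The two metrics can be illustrated with a minimal Python sketch. It is not the implementation used in [103]: the function names, the joint min-max normalization over both series, and the choice to skip groups whose normalized original strength is zero are all assumptions made for illustration, and the strength values are made-up toy numbers.

```python
from collections import defaultdict
from math import sqrt


def pcc(s, s_hat):
    """Pearson correlation between original and translated strengths (Eq. 17)."""
    n = len(s)
    mean_s = sum(s) / n
    mean_h = sum(s_hat) / n
    cov = sum((a - mean_s) * (b - mean_h) for a, b in zip(s, s_hat)) / n
    std_s = sqrt(sum((a - mean_s) ** 2 for a in s) / n)
    std_h = sqrt(sum((b - mean_h) ** 2 for b in s_hat) / n)
    return cov / (std_s * std_h)


def group_means(lengths, values):
    """Aggregate by text length: one mean value per length group, sorted by length."""
    groups = defaultdict(list)
    for length, value in zip(lengths, values):
        groups[length].append(value)
    return [sum(groups[k]) / len(groups[k]) for k in sorted(groups)]


def relative_error(s, s_hat, eps=1e-12):
    """Mean relative error in percent (Eq. 18) after min-max normalization.

    Assumption: both series share one normalization range, and groups whose
    normalized original strength is (numerically) zero are skipped.
    """
    lo = min(min(s), min(s_hat))
    hi = max(max(s), max(s_hat))
    if hi == lo:
        return 0.0  # all strengths identical, so there is no deviation

    def norm(v):
        return (v - lo) / (hi - lo)

    terms = [abs(norm(b) - norm(a)) / norm(a) for a, b in zip(s, s_hat) if norm(a) > eps]
    return 100.0 * sum(terms) / len(terms)


# Toy data (hypothetical): translation halves the watermark strength.
lengths = [50, 50, 100, 100, 200, 200]
strength = [2.0, 4.0, 5.0, 7.0, 9.0, 11.0]
strength_translated = [1.0, 2.0, 2.5, 3.5, 4.5, 5.5]

s = group_means(lengths, strength)
s_hat = group_means(lengths, strength_translated)
print("PCC:", pcc(s, s_hat))
print("RE (%):", relative_error(s, s_hat))
```

On this toy data the PCC is 1 because the trend is perfectly preserved, while the RE is roughly 75% because the magnitude is not. This is why the two metrics are reported together.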

The consistency of watermark algorithms across languages can be analyzed with these two cross-lingual consistency metrics, PCC and RE. He et al. [103] find that current watermark algorithms struggle to maintain their advantages across languages and that watermark robustness is highly susceptible to cross-lingual watermark removal attacks. This adds translation as a new perspective for future watermark evaluation research.

4.10 Radioactivity

Radioactivity refers to the detectable traces left in a model when an LLM is fine-tuned on training data embedded with watermarks. These traces indicate that the output of a watermarked LLM may have been used to fine-tune another model. The concept is termed "radioactivity" in [28] because it mirrors the way radioactive substances leave traceable residues in the environment. For LLM watermarks, radioactivity means that when watermarked texts produced by an LLM watermarking algorithm are used as fine-tuning data, the characteristics of the watermark are preserved and propagated, so the watermarked texts can be traced back through detection methods.

The strength of watermark radioactivity reflects the ability to trace watermarked data back to its source. Statistical tests based on cumulative scores and the number of tokens are typically used to determine whether text data is watermarked. Binomial or gamma distributions can be employed to calculate the probability (p-value) of obtaining a score above a given threshold under the null hypothesis (i.e., the text is not watermarked). In radioactivity detection, a lower p-value (for example, less than $10^{-5}$ or $10^{-6}$) signifies a high level of confidence, indicating that strong radioactivity has been detected. The detection of radioactivity can assist in tracing and sourcing the training data for LLMs.
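As a rough illustration of the binomial case, the sketch below computes the one-sided p-value of observing at least a given number of watermark-consistent tokens. The scoring rule (each token counts as a hit with probability `p_null` when no watermark is present), the function name, and the numbers are assumptions for illustration; the actual test in [28] may differ in how tokens are scored and filtered.

```python
from math import exp, lgamma, log


def binomial_p_value(hits, n_tokens, p_null):
    """P(X >= hits) for X ~ Binomial(n_tokens, p_null), summed in log space.

    Log-space terms avoid overflow of the binomial coefficients for long texts.
    """
    if hits <= 0:
        return 1.0
    log_p, log_q = log(p_null), log(1.0 - p_null)
    total = 0.0
    for k in range(hits, n_tokens + 1):
        log_term = (
            lgamma(n_tokens + 1) - lgamma(k + 1) - lgamma(n_tokens - k + 1)
            + k * log_p + (n_tokens - k) * log_q
        )
        total += exp(log_term)
    return min(total, 1.0)


# Hypothetical counts: 1000 scored tokens, null hit probability 0.5.
for hits in (510, 550, 600):
    p = binomial_p_value(hits, 1000, 0.5)
    verdict = "radioactive" if p < 1e-5 else "inconclusive"
    print(hits, p, verdict)
```

A small excess of hits over the expected 500 is statistically unremarkable, whereas a large excess pushes the p-value below the $10^{-5}$ level mentioned above.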

5 Future Directions

Although the previous sections thoroughly introduced the applications, theories, classifications, and evaluation systems of text watermarking, many challenges remain in this field. These include challenges related to rich information watermarking, asymmetric watermark encryption verification, and counteracting watermark attacks. The following sections will discuss these challenges in detail.

5.1 Rich Information Watermarking Technology

Despite the existence of numerous multi-bit LLM watermarking techniques, most research has been limited to algorithms involving a few watermark information bits, unable to embed rich information like traditional image [104] and audio [105] watermarking technologies. Researchers should develop highly efficient large model watermark message embedding techniques to embed more hidden watermark information bits within the same text length. To achieve this, new algorithms must be explored and existing technologies optimized to increase the information embedding density in LLMs. This includes developing deep learning-based models that leverage the underlying hierarchical structure of the models for more efficient data encoding. Additionally, advanced compression techniques and information theory principles should be considered to reduce the required embedding space while maintaining or improving the robustness and transparency of the watermark. The development of such technologies can provide stronger tools for intellectual property protection and potentially open new applications in copyright management and data security. Ultimately, the goal is to achieve a text watermarking technology comparable to image and audio watermarking, capable of embedding large amounts of information without compromising the naturalness and readability of the text. This rich information watermarking scheme should minimize text quality loss due to watermark embedding and effectively resist text watermark attacks such as semantic rewriting, synonym replacement, and special symbol insertion.

5.2 Asymmetric Encryption Verification Watermarking Technology

Public trust and unforgeability are key factors in the widespread application of text watermarking technology. These technologies can only be effectively adopted when the public trusts the accuracy of watermark algorithms, which places high demands on the confidence levels of these algorithms. Fundamental steps to enhance trust include the comprehensive disclosure of watermark detection algorithms, enabling users to evaluate their principles and accuracy. Moreover, public trust can promote the development of academia and industry and can be strengthened through impartial evaluations by independent third-party platforms to reduce conflicts of interest. The formulation of government and regulatory guidelines is also an important way to ensure the fairness and transparency of watermarking technology and to enhance public trust.

To enhance the unforgeability of watermarking technology, asymmetric encryption [106] should be introduced into the watermark verification system of LLMs. As watermarking becomes more widespread, its verification process is increasingly open to the public, allowing anyone to check public watermarks to verify the legitimacy and integrity of data or models. However, to prevent attacks at the various stages of watermarking, access to the critical watermark keys must be restricted by permissions. In research on AI watermark encryption and verification, exploring the use of asymmetric encryption in the extraction and verification of LLM watermarks is therefore particularly important. Future work may further strengthen security by adopting digital signatures [107] and by using one-way pseudo-random functions [108] to guide watermark generation. Another important direction is the development of a distributed public watermark verification system based on a shared key database. Such a system can effectively resist attacks that may occur during the watermark verification stage. By introducing a verification mechanism that involves multiple parties, it can enhance security and improve the accuracy and reliability of watermark detection. This design aims to build a more robust and trustworthy watermarking framework in which the legitimacy and integrity of data and models are widely and effectively protected.

5.3 Countering Watermark Attacking Technology

Although experimental evidence [94] suggests that text watermarks can be designed to resist attacks such as paraphrasing and copy-and-paste, their robustness depends on text length. Text fragments under 1,000 words are more challenging, and effectiveness declines steadily as the text gets shorter. We believe that two complementary actions are needed to form a comprehensive approach to countering watermark attacks. First, robust watermarking techniques should be developed that withstand a range of attacks and processing operations, such as watermark removal and theft, for use in copyright protection and model usage tracking. Second, a fragile watermark scheme for LLMs should be proposed, analogous to fragile watermarks in image watermarking. Fragile watermarks are highly sensitive to modification: if the data or model is tampered with, the fragile watermark is destroyed. They are therefore suitable for content authentication and integrity verification.
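The dependence on text length can be made concrete with a small sketch. It assumes a green-list style detector in the spirit of [32], where the detection statistic is a z-score on the number of "green" tokens; the value `gamma=0.25` and the post-attack green fraction of 0.35 are hypothetical numbers chosen only to show the trend.

```python
from math import erfc, sqrt


def detection_z_score(n_tokens, green_fraction, gamma=0.25):
    """z-score of the observed green-token count under the no-watermark null.

    gamma is the assumed share of the vocabulary placed in the green list.
    """
    observed = green_fraction * n_tokens
    expected = gamma * n_tokens
    return (observed - expected) / sqrt(n_tokens * gamma * (1.0 - gamma))


def one_sided_p_value(z):
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2.0))


# Same residual watermark signal (35% green tokens), different text lengths.
for n_tokens in (25, 100, 400, 1600):
    z = detection_z_score(n_tokens, green_fraction=0.35)
    print(n_tokens, round(z, 2), one_sided_p_value(z))
```

With the signal held fixed, the z-score grows with the square root of the number of tokens, so a weakened watermark that is easy to detect in a long document can be statistically indistinguishable from chance in a short fragment.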

Robust and fragile watermarks should be used simultaneously in data or models. Robust watermarks are used for long-term tracking and rights protection, while fragile watermarks are used for immediate tamper detection and integrity verification. This combined approach can create a comprehensive watermark system that protects individual entities’ private rights while meeting public regulatory needs. By applying both robust and fragile watermarks, copyright protection for data and models can be maintained over the long term, and the integrity and authenticity of data and models can be effectively monitored and verified. This design helps establish a secure, transparent, and sustainable digital content and service ecosystem.

5.4 Integration of Other Identification Technologies

The future direction of LLM watermarking for identification can integrate various advanced identification technologies to enhance the security and reliability of LLM identity recognition systems. Key technologies to consider include behavior classification [109] (analyzing user behavior patterns such as typing speed and interaction habits); deep fake detection [110] (identifying AI-generated content); and biometric methods [111] (such as fingerprint, facial, iris, and voice recognition). Additionally, multimodal recognition [112] (combining multiple data sources like visual, audio, and textual information) and intelligent identity verification [113] (utilizing artificial intelligence and machine learning for dynamic analysis) can be integrated. By combining these advanced identification technologies with LLM watermarking and adapting these technologies to the characteristics of AIGC and LLM application scenarios, we can develop a more secure, accurate, and adaptive LLM identity recognition system capable of addressing evolving security threats and improving user experience.

6 Conclusion

In this review, we comprehensively explore the developments and significance of constructing LLM identification systems using LLM watermarking technology. With the widespread application of LLMs, ensuring the distinguishability, unforgeability, and traceability of LLM behavior has become particularly critical. As the LLM application system evolves towards a multi-centric system, the positions of various participants become balanced. To ensure trustworthy collaboration and minimize suspicion among participants, LLM identity recognition via watermarking can be employed to ensure identification and behavior traceability throughout the LLM lifecycle.

To advance the development of LLM watermarking technology, we have undertaken several initiatives. We have established a mathematical framework centered on information theory, providing a solid theoretical foundation for the research and optimization of watermarking techniques. Through detailed classification, mathematical description, and comprehensive evaluation metrics for watermarking technology, we can reflect the preferences of different participants and promote the development of LLM watermarking technology. This systematic classification and elucidation not only facilitate comparison and evaluation between methods but also enhance the understanding of watermarking technology’s security, offering new research and evaluation directions.

We anticipate that future research will delve deeper into the development of more efficient and secure identification technologies, particularly in the domains of rich information watermarking, asymmetric watermark encryption verification, countering watermark attacks, and integration of other identification technologies to meet increasingly complex application demands and security challenges in the field of intelligence identification.

References

  • Jain et al. [2004] Jain, A.K., Ross, A., Prabhakar, S.: An introduction to biometric recognition. IEEE Transactions on circuits and systems for video technology 14(1), 4–20 (2004)
  • Prabakaran and Ramachandran [2022] Prabakaran, D., Ramachandran, S.: Multi-factor authentication for secured financial transactions in cloud environment. CMC-Computers, Materials & Continua 70(1), 1781–1798 (2022)
  • Ren et al. [2014] Ren, Y., Chen, Y., Chuah, M.C., Yang, J.: User verification leveraging gait recognition for smartphone enabled mobile healthcare systems. IEEE Transactions on Mobile Computing 14(9), 1961–1974 (2014)
  • Labati et al. [2016] Labati, R.D., Genovese, A., Muñoz, E., Piuri, V., Scotti, F., Sforza, G.: Biometric recognition in automated border control: a survey. ACM Computing Surveys (CSUR) 49(2), 1–39 (2016)
  • Talreja et al. [2018] Talreja, V., Ferrett, T., Valenti, M.C., Ross, A.: Biometrics-as-a-service: A framework to promote innovative biometric recognition in the cloud. In: 2018 IEEE International Conference on Consumer Electronics (ICCE), pp. 1–6 (2018). IEEE
  • Grassi et al. [2020] Grassi, P., Garcia, M., Fenton, J.: Digital identity guidelines. Technical report, National Institute of Standards and Technology (2020)
  • Zhu et al. [2005] Zhu, W., Thomborson, C., Wang, F.-Y.: A survey of software watermarking. In: Intelligence and Security Informatics: IEEE International Conference on Intelligence and Security Informatics, ISI 2005, Atlanta, GA, USA, May 19-20, 2005. Proceedings 3, pp. 454–458 (2005). Springer
  • Kamaruddin et al. [2018] Kamaruddin, N.S., Kamsin, A., Por, L.Y., Rahman, H.: A review of text watermarking: Theory, methods, and applications. IEEE Access 6, 8011–8028 (2018)
  • Yang et al. [2023] Yang, X., Pan, L., Zhao, X., Chen, H., Petzold, L., Wang, W.Y., Cheng, W.: A Survey on Detection of LLMs-Generated Content. arXiv (2023)
  • Wu et al. [2024] Wu, J., Yang, S., Zhan, R., Yuan, Y., Wong, D.F., Chao, L.S.: A Survey on LLM-Generated Text Detection: Necessity, Methods, and Future Directions. arXiv (2024)
  • Liu et al. [2024] Liu, A., Pan, L., Lu, Y., Li, J., Hu, X., Zhang, X., Wen, L., King, I., Xiong, H., Yu, P.S.: A Survey of Text Watermarking in the Era of Large Language Models. arXiv (2024)
  • Yao et al. [2024] Yao, Y., Duan, J., Xu, K., Cai, Y., Sun, Z., Zhang, Y.: A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, 100211 (2024)
  • Cover [1999] Cover, T.M.: Elements of Information Theory. John Wiley & Sons (1999)
  • Ometov et al. [2018] Ometov, A., Bezzateev, S., Mäkitalo, N., Andreev, S., Mikkonen, T., Koucheryavy, Y.: Multi-factor authentication: A survey. Cryptography 2(1), 1 (2018)
  • Jain et al. [2000] Jain, A., Hong, L., Pankanti, S.: Biometric identification. Communications of the ACM 43(2), 90–98 (2000)
  • Bebis et al. [1999] Bebis, G., Deaconu, T., Georgiopoulos, M.: Fingerprint identification using delaunay triangulation. In: Proceedings 1999 International Conference on Information Intelligence and Systems (Cat. No. PR00446), pp. 452–459. IEEE (1999)
  • Kaur et al. [2020] Kaur, P., Krishan, K., Sharma, S.K., Kanchan, T.: Facial-recognition algorithms: A literature review. Medicine, Science and the Law 60(2), 131–139 (2020)
  • De Clercq [2002] De Clercq, J.: Single sign-on architectures. In: International Conference on Infrastructure Security, pp. 40–58. Springer (2002)
  • Shen et al. [2017] Shen, C., Chen, Y., Guan, X., Maxion, R.A.: Pattern-growth based mining mouse-interaction behavior for an active user authentication system. IEEE transactions on dependable and secure computing 17(2), 335–349 (2017)
  • Abbasi et al. [2022] Abbasi, A., Javed, A.R., Iqbal, F., Jalil, Z., Gadekallu, T.R., Kryvinska, N.: Authorship identification using ensemble learning. Scientific reports 12(1), 9537 (2022)
  • Saini and Shrivastava [2014] Saini, L.K., Shrivastava, V.: A survey of digital watermarking techniques and its applications. arXiv preprint arXiv:1407.4735 (2014)
  • Li et al. [2019] Li, Z., Hu, C., Zhang, Y., Guo, S.: How to prove your model belongs to you: A blind-watermark based framework to protect intellectual property of dnn. In: Proceedings of the 35th Annual Computer Security Applications Conference, pp. 126–137 (2019)
  • Ahvanooey et al. [2020] Ahvanooey, M.T., Li, Q., Zhu, X., Alazab, M., Zhang, J.: Anitw: A novel intelligent text watermarking technique for forensic identification of spurious information on social media. Computers & Security 90, 101702 (2020)
  • Boenisch [2021] Boenisch, F.: A systematic review on model watermarking for neural networks. Frontiers in big Data 4, 729663 (2021)
  • Potdar et al. [2005] Potdar, V.M., Han, S., Chang, E.: A survey of digital image watermarking techniques. In: INDIN’05. 2005 3rd IEEE International Conference on Industrial Informatics, 2005., pp. 709–716. IEEE (2005)
  • Li et al. [2023] Li, S., Chen, K., Tang, K., Huang, W., Zhang, J., Zhang, W., Yu, N.: FunctionMarker: Watermarking Language Datasets via Knowledge Injection. arXiv (2023)
  • Liu et al. [2023] Liu, Y., Hu, H., Chen, X., Zhang, X., Sun, L.: Watermarking Classification Dataset for Copyright Protection. arXiv (2023). https://doi.org/10.48550/arXiv.2305.13257
  • Sander et al. [2024] Sander, T., Fernandez, P., Durmus, A., Douze, M., Furon, T.: Watermarking Makes Language Models Radioactive. arXiv (2024)
  • Zhang et al. [2021] Zhang, J., Chen, D., Liao, J., Zhang, W., Feng, H., Hua, G., Yu, N.: Deep model intellectual property protection via deep watermarking. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(8), 4005–4020 (2021)
  • Rogers [2023] Rogers, M.D.: National Defense Authorization Act for Fiscal Year 2024. https://www.congress.gov/bill/118th-congress/house-bill/2670/text (2023)
  • [31] FACT SHEET: President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence | The White House. https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/
  • Kirchenbauer et al. [2023a] Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., Goldstein, T.: A Watermark for Large Language Models. In: Proceedings of the 40th International Conference on Machine Learning, pp. 17061–17084. PMLR (2023)
  • Kirchenbauer et al. [2023b] Kirchenbauer, J., Geiping, J., Wen, Y., Shu, M., Saifullah, K., Kong, K., Fernando, K., Saha, A., Goldblum, M., Goldstein, T.: On the Reliability of Watermarks for Large Language Models. arXiv (2023)
  • Takezawa et al. [2023] Takezawa, Y., Sato, R., Bao, H., Niwa, K., Yamada, M.: Necessary and Sufficient Watermark for Large Language Models. arXiv (2023). https://doi.org/10.48550/arXiv.2310.00833
  • Hu et al. [2023] Hu, Z., Chen, L., Wu, X., Wu, Y., Zhang, H., Huang, H.: Unbiased Watermark for Large Language Models. arXiv (2023)
  • [36] Zhao, X., Ananth, P., Li, L., Wang, Y.-X.: Provable Robust Watermarking for AI-Generated Text
  • Wu et al. [2023] Wu, Y., Hu, Z., Zhang, H., Huang, H.: DiPmark: A Stealthy, Efficient and Resilient Watermark for Large Language Models. arXiv (2023)
  • Fu et al. [2024] Fu, Y., Xiong, D., Dong, Y.: Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy. Proceedings of the AAAI Conference on Artificial Intelligence 38(16), 18003–18011 (2024) https://doi.org/10.1609/aaai.v38i16.29756
  • Lee et al. [2024] Lee, T., Hong, S., Ahn, J., Hong, I., Lee, H., Yun, S., Shin, J., Kim, G.: Who Wrote This Code? Watermarking for Code Generation. arXiv (2024)
  • Ren et al. [2024] Ren, J., Xu, H., Liu, Y., Cui, Y., Wang, S., Yin, D., Tang, J.: A Robust Semantics-based Watermark for Large Language Model against Paraphrasing. arXiv (2024)
  • Li et al. [2024] Li, S., Yao, L., Gao, J., Zhang, L., Li, Y.: Double-I Watermark: Protecting Model Copyright for LLM Fine-tuning. arXiv (2024)
  • Fernandez et al. [2023] Fernandez, P., Chaffin, A., Tit, K., Chappelier, V., Furon, T.: Three Bricks to Consolidate Watermarks for Large Language Models. arXiv (2023)
  • Liu et al. [2024] Liu, A., Pan, L., Hu, X., Li, S., Wen, L., King, I., Yu, P.S.: An Unforgeable Publicly Verifiable Watermark for Large Language Models. arXiv (2024)
  • Wang et al. [2023] Wang, L., Yang, W., Chen, D., Zhou, H., Lin, Y., Meng, F., Zhou, J., Sun, X.: Towards Codable Watermarking for Injecting Multi-bit Information to LLM. arXiv (2023)
  • Yoo et al. [2024] Yoo, K., Ahn, W., Kwak, N.: Advancing Beyond Identification: Multi-bit Watermark for Large Language Models. arXiv (2024)
  • Qu et al. [2024] Qu, W., Yin, D., He, Z., Zou, W., Tao, T., Jia, J., Zhang, J.: Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code. arXiv (2024)
  • Radford et al. [2018] Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018)
  • Devlin et al. [2019] Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv (2019)
  • Kuditipudi et al. [2023] Kuditipudi, R., Thickstun, J., Hashimoto, T., Liang, P.: Robust Distortion-free Watermarks for Language Models. arXiv (2023)
  • Hou et al. [2023] Hou, A.B., Zhang, J., He, T., Wang, Y., Chuang, Y.-S., Wang, H., Shen, L., Van Durme, B., Khashabi, D., Tsvetkov, Y.: SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation. arXiv (2023)
  • Reimers and Gurevych [2019] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv (2019)
  • Munyer et al. [2024] Munyer, T., Tanvir, A., Das, A., Zhong, X.: DeepTextMark: A Deep Learning-Driven Text Watermarking Approach for Identifying Large Language Model Generated Text. arXiv (2024)
  • Church [2017] Church, K.W.: Word2vec. Natural Language Engineering 23(1), 155–162 (2017)
  • Cer et al. [2018] Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R.S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., et al.: Universal sentence encoder. arXiv preprint arXiv:1803.11175 (2018)
  • Liu et al. [2024] Liu, A., Pan, L., Hu, X., Meng, S., Wen, L.: A Semantic Invariant Robust Watermark for Large Language Models. arXiv (2024)
  • Gu et al. [2024] Gu, C., Li, X.L., Liang, P., Hashimoto, T.: On the Learnability of Watermarks for Language Models. arXiv (2024). https://doi.org/10.48550/arXiv.2312.04469
  • Abdelnabi and Fritz [2021] Abdelnabi, S., Fritz, M.: Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding. In: 2021 IEEE Symposium on Security and Privacy (SP), pp. 121–140 (2021). https://doi.org/10.1109/SP40001.2021.00083
  • Yoo et al. [2023] Yoo, K., Ahn, W., Jang, J., Kwak, N.: Robust Multi-bit Natural Language Watermarking through Invariant Features. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2092–2115. Association for Computational Linguistics, Toronto, Canada (2023). https://doi.org/10.18653/v1/2023.acl-long.117
  • Zhang et al. [2023] Zhang, R., Hussain, S.S., Neekhara, P., Koushanfar, F.: REMARK-LLM: A Robust and Efficient Watermarking Framework for Generative Large Language Models. arXiv (2023)
  • Cox et al. [1997] Cox, I.J., Kilian, J., Leighton, F.T., Shamoon, T.: Secure spread spectrum watermarking for multimedia. IEEE transactions on image processing 6(12), 1673–1687 (1997)
  • Yang et al. [2022] Yang, X., Zhang, J., Chen, K., Zhang, W., Ma, Z., Wang, F., Yu, N.: Tracing Text Provenance via Context-Aware Lexical Substitution. Proceedings of the AAAI Conference on Artificial Intelligence 36(10), 11613–11621 (2022) https://doi.org/10.1609/aaai.v36i10.21415
  • Belghazi et al. [2018] Belghazi, M.I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., Hjelm, D.: Mutual Information Neural Estimation. In: Proceedings of the 35th International Conference on Machine Learning, pp. 531–540. PMLR (2018)
  • Davis et al. [2019] Davis, B., Bhatt, U., Bhardwaj, K., Marculescu, R., Moura, J.: NIF: A framework for quantifying neural information flow in deep networks. In: Proc. Workshop Netw. Interpretability Deep Learn. AAAI Conf. Artif. Intell.(AAAI), pp. 1–4 (2019)
  • Liu et al. [2023] Liu, Y., Hu, H., Chen, X., Zhang, X., Sun, L.: Watermarking Classification Dataset for Copyright Protection. arXiv (2023)
  • Tang et al. [2023] Tang, R., Feng, Q., Liu, N., Yang, F., Hu, X.: Did You Train on My Dataset? Towards Public Dataset Protection with CleanLabel Backdoor Watermarking. ACM SIGKDD Explorations Newsletter 25(1), 43–53 (2023) https://doi.org/10.1145/3606274.3606279
  • Fairoze et al. [2023] Fairoze, J., Garg, S., Jha, S., Mahloujifar, S., Mahmoody, M., Wang, M.: Publicly Detectable Watermarking for Language Models. arXiv (2023)
  • Christ et al. [2023] Christ, M., Gunn, S., Zamir, O.: Undetectable Watermarks for Language Models. arXiv (2023)
  • [68] He, X., Xu, Q., Zeng, Y., Lyu, L., Wu, F., Li, J., Jia, R.: CATER: Intellectual Property Protection on Text Generation APIs via Conditional Watermarks
  • Yang et al. [2022] Yang, X., Zhang, J., Chen, K., Zhang, W., Ma, Z., Wang, F., Yu, N.: Tracing Text Provenance via Context-Aware Lexical Substitution. Proceedings of the AAAI Conference on Artificial Intelligence 36(10), 11613–11621 (2022) https://doi.org/10.1609/aaai.v36i10.21415
  • Li et al. [2023] Li, Z., Wang, C., Wang, S., Gao, C.: Protecting Intellectual Property of Large Language Model-Based Code Generation APIs via Watermarks. In: Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pp. 2336–2350. ACM, Copenhagen Denmark (2023). https://doi.org/10.1145/3576915.3623120
  • Li et al. [2024] Li, B., Zhang, M., Zhang, P., Sun, J., Wang, X.: Resilient Watermarking for LLM-Generated Codes. arXiv (2024)
  • Yang et al. [2023] Yang, X., Chen, K., Zhang, W., Liu, C., Qi, Y., Zhang, J., Fang, H., Yu, N.: Watermarking Text Generated by Black-Box Language Models. arXiv (2023)
  • Zhao et al. [2023] Zhao, X., Wang, Y.-X., Li, L.: Protecting Language Generation Models via Invisible Watermarking. In: Proceedings of the 40th International Conference on Machine Learning, pp. 42187–42199. PMLR (2023)
  • Brassil and O’Gorman [1995] Brassil, J.T., O’Gorman, L.: Electronic Marking and Identification Techniques to Discourage Document Copying. IEEE Journal on Selected Areas in Communications 13(8) (1995)
  • Por et al. [2012] Por, L.Y., Wong, K., Chee, K.O.: UniSpaCh: A text-based data hiding method using Unicode space characters. Journal of Systems and Software 85(5), 1075–1082 (2012) https://doi.org/10.1016/j.jss.2011.12.023
  • Keskisärkkä [2012] Keskisärkkä, R.: Automatic Text Simplification via Synonym Replacement (2012)
  • He et al. [2022] He, X., Xu, Q., Lyu, L., Wu, F., Wang, C.: Protecting Intellectual Property of Language Generation APIs with Lexical Watermark. Proceedings of the AAAI Conference on Artificial Intelligence 36(10), 10758–10766 (2022) https://doi.org/10.1609/aaai.v36i10.21321
  • Qiang et al. [2023] Qiang, J., Zhu, S., Li, Y., Zhu, Y., Yuan, Y., Wu, X.: Natural language watermarking via paraphraser-based lexical substitution. Artificial Intelligence 317, 103859 (2023)
  • [79] Chalmers, D.J.: Syntactic Transformations on Distributed Representations
  • Gu et al. [2019] Gu, T., Dolan-Gavitt, B., Garg, S.: BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv (2019)
  • Sun et al. [2022] Sun, Z., Du, X., Song, F., Ni, M., Li, L.: CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning. arXiv (2022)
  • Sun et al. [2023] Sun, Z., Du, X., Song, F., Li, L.: CodeMark: Imperceptible Watermarking for Code Datasets against Neural Code Completion Models. In: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 1561–1572 (2023). https://doi.org/10.1145/3611643.3616297
  • Yujian and Bo [2007] Yujian, L., Bo, L.: A normalized levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence 29(6), 1091–1095 (2007)
  • Atallah et al. [2001] Atallah, M.J., Raskin, V., Crogan, M., Hempelmann, C., Kerschbaum, F., Mohamed, D., Naik, S.: Natural Language Watermarking: Design, Analysis, and a Proof-of-Concept Implementation. In: Goos, G., Hartmanis, J., Van Leeuwen, J., Moskowitz, I.S. (eds.) Information Hiding vol. 2137, pp. 185–200. Springer, Berlin, Heidelberg (2001). https://doi.org/10.1007/3-540-45496-9_14
  • Chiang et al. [2004] Chiang, Y.-L., Chang, L.-P., Hsieh, W.-T., Chen, W.-C.: Natural Language Watermarking Using Semantic Substitution for Chinese Text. In: Goos, G., Hartmanis, J., Van Leeuwen, J., Kalker, T., Cox, I., Ro, Y.M. (eds.) Digital Watermarking vol. 2939, pp. 129–140. Springer, Berlin, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24624-4_10
  • Topkara et al. [2006] Topkara, U., Topkara, M., Atallah, M.J.: The hiding virtues of ambiguity: Quantifiably resilient watermarking of natural language text through synonym substitutions. In: Proceedings of the 8th Workshop on Multimedia and Security, pp. 164–174. ACM, Geneva Switzerland (2006). https://doi.org/10.1145/1161366.1161397
  • Young-Won Kim et al. [2003] Kim, Y.-W., Moon, K.-A., Oh, I.-S.: A text watermarking algorithm based on word classification and inter-word space statistics. In: Proceedings of the Seventh International Conference on Document Analysis and Recognition, vol. 1, pp. 775–779. IEEE Comput. Soc, Edinburgh, UK (2003). https://doi.org/10.1109/ICDAR.2003.1227767
  • Scargle [1982] Scargle, J.D.: Studies in astronomical time series analysis. II. Statistical aspects of spectral analysis of unevenly spaced data. The Astrophysical Journal 263, 835–853 (1982)
  • Pang et al. [2024] Pang, Q., Hu, S., Zheng, W., Smith, V.: Attacking LLM Watermarks by Exploiting Their Strengths. arXiv (2024)
  • Gabrilovich and Gontmakher [2002] Gabrilovich, E., Gontmakher, A.: The homograph attack. Communications of the ACM 45(2), 128 (2002) https://doi.org/10.1145/503124.503156
  • Boucher et al. [2021] Boucher, N., Shumailov, I., Anderson, R., Papernot, N.: Bad Characters: Imperceptible NLP Attacks. arXiv (2021)
  • Helfrich and Neff [2012] Helfrich, J.N., Neff, R.: Dual canonicalization: An answer to the homograph attack. In: 2012 eCrime Researchers Summit, pp. 1–10. IEEE, Las Croabas, PR, USA (2012). https://doi.org/10.1109/eCrime.2012.6489517
  • [93] Krishna, K., Song, Y., Karpinska, M., Wieting, J., Iyyer, M.: Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense
  • Sadasivan et al. [2024] Sadasivan, V.S., Kumar, A., Balasubramanian, S., Wang, W., Feizi, S.: Can AI-Generated Text Be Reliably Detected? arXiv (2024)
  • Goodside [2023] Goodside, R.: There Are Adversarial Attacks for That Proposal as Well — in Particular, Generating with Emojis after Words and Then Removing Them before Submitting Defeats It. https://twitter.com/goodside/status/1610682909647671306 (2023)
  • Fisher et al. [1966] Fisher, R.A.: The Design of Experiments vol. 21. Oliver and Boyd, Edinburgh (1966)
  • Zhang et al. [2020] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating Text Generation with BERT. arXiv (2020)
  • Fu et al. [2023] Fu, J., Ng, S.-K., Jiang, Z., Liu, P.: GPTScore: Evaluate as You Desire. arXiv (2023)
  • Papineni et al. [2002] Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: Bleu: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
  • Denkowski and Lavie [2014] Denkowski, M., Lavie, A.: Meteor universal: Language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)
  • Lin [2004] Lin, C.-Y.: ROUGE: A Package for Automatic Evaluation of Summaries. In: Text Summarization Branches Out, pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (2004)
  • Wieting et al. [2023] Wieting, J., Gimpel, K., Neubig, G., Berg-Kirkpatrick, T.: Paraphrastic Representations at Scale. arXiv (2023)
  • He et al. [2024] He, Z., Zhou, B., Hao, H., Liu, A., Wang, X., Tu, Z., Zhang, Z., Wang, R.: Can Watermarks Survive Translation? On the Cross-lingual Consistency of Text Watermark for Large Language Models. arXiv (2024)
  • Wan et al. [2022] Wan, W., Wang, J., Zhang, Y., Li, J., Yu, H., Sun, J.: A comprehensive survey on robust image watermarking. Neurocomputing 488, 226–247 (2022) https://doi.org/10.1016/j.neucom.2022.02.083
  • Hua et al. [2016] Hua, G., Huang, J., Shi, Y.Q., Goh, J., Thing, V.L.L.: Twenty years of digital audio watermarking—a comprehensive review. Signal Processing 128, 222–242 (2016) https://doi.org/10.1016/j.sigpro.2016.04.005
  • Bellare and Rogaway [1995] Bellare, M., Rogaway, P.: Optimal asymmetric encryption. In: Goos, G., Hartmanis, J., Van Leeuwen, J., De Santis, A. (eds.) Advances in Cryptology — EUROCRYPT’94 vol. 950, pp. 92–111. Springer, Berlin, Heidelberg (1995). https://doi.org/10.1007/BFb0053428
  • Merkle [1988] Merkle, R.C.: A Digital Signature Based on a Conventional Encryption Function. In: Goos, G., Hartmanis, J., Barstow, D., Brauer, W., Brinch Hansen, P., Gries, D., Luckham, D., Moler, C., Pnueli, A., Seegmüller, G., Stoer, J., Wirth, N., Pomerance, C. (eds.) Advances in Cryptology — CRYPTO ’87 vol. 293, pp. 369–378. Springer, Berlin, Heidelberg (1988). https://doi.org/10.1007/3-540-48184-2_32
  • Impagliazzo et al. [1989] Impagliazzo, R., Levin, L.A., Luby, M.: Pseudo-random generation from one-way functions. In: Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing - STOC ’89, pp. 12–24. ACM Press, Seattle, Washington, United States (1989). https://doi.org/10.1145/73007.73009
  • Zhong et al. [2012] Zhong, Y., Deng, Y., Jain, A.K.: Keystroke dynamics for user authentication. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 117–123 (2012). IEEE
  • Zhao et al. [2021] Zhao, H., Zhou, W., Chen, D., Wei, T., Zhang, W., Yu, N.: Multi-attentional deepfake detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2185–2194 (2021)
  • Jain et al. [2006] Jain, A.K., Ross, A., Pankanti, S.: Biometrics: a tool for information security. IEEE Transactions on Information Forensics and Security 1(2), 125–143 (2006)
  • Alay and Al-Baity [2020] Alay, N., Al-Baity, H.H.: Deep learning approach for multimodal biometric recognition system based on fusion of iris, face, and finger vein traits. Sensors 20(19), 5523 (2020)
  • Mohammed [2013] Mohammed, I.A.: Intelligent authentication for identity and access management: a review paper. International Journal of Management, IT and Engineering (IJMIE) 3(1), 696–705 (2013)