PolicyLR: A Logic Representation for Privacy Policies

Ashish Hooda1Rishabh Khandelwal1Prasad Chalasani2
Kassem Fawaz1Somesh Jha1
1University of Wisconsin-Madison  2Langroid
Abstract

Privacy policies are crucial in the online ecosystem, defining how services handle user data and adhere to regulations such as GDPR and CCPA. However, their complexity and frequent updates often make them difficult for stakeholders to understand and analyze. Current automated analysis methods, which utilize natural language processing, have limitations. They typically focus on individual tasks and fail to capture the full context of the policies. We propose PolicyLR, a new paradigm that offers a comprehensive machine-readable representation of privacy policies, serving as an all-in-one solution for multiple downstream tasks. PolicyLR converts privacy policies into a machine-readable format using valuations of atomic formulae, allowing for formal definitions of tasks like compliance and consistency. We have developed a compiler that transforms unstructured policy text into this format using off-the-shelf Large Language Models (LLMs). This compiler breaks down the transformation task into a two-stage translation and entailment procedure. This procedure considers the full context of the privacy policy to infer a complex formula, where each formula consists of simpler atomic formulae. The advantage of this model is that PolicyLR is interpretable by design and grounded in segments of the privacy policy. We evaluated the compiler using ToS;DR, a community-annotated privacy policy entailment dataset. Utilizing open-source LLMs, our compiler achieves precision and recall values of 0.910.910.910.91 and 0.880.880.880.88, respectively. Finally, we demonstrate the utility of PolicyLR in three privacy tasks: Policy Compliance, Inconsistency Detection, and Privacy Comparison Shopping.

1 Introduction

Privacy Policies play an integral part in the digital landscape by specifying how online services interact with users and their data. They provide a way to inform users how and why services collect, share, process, and retain their data. Service providers utilize privacy policies to show how their practices comply with applicable privacy norms and regulations, like the GDPR (European Parliament and Council of the European Union, 2016) and CCPA (California State Legislature, 2018). On the other hand, users can consult privacy policies to learn more about the practices of a service provider and make their decisions accordingly. For example, they can compare privacy policies of similar services to choose one that matches their privacy preferences.

However, privacy policies are long and opaque documents that users often struggle to read and comprehend (McDonald and Cranor, 2008; Reidenberg et al., 2015; Pollach, 2007; Linden et al., 2018). Service providers regularly update their policies to comply with new regulations or reflect changes in their corresponding services, where one in five policies is updated every month (Wagner, 2022). Coupled with the immense number of online services with which users interact, it is challenging for stakeholders, such as users and regulators, to keep track of up-to-date privacy policies.

Researchers leveraged the advances in natural language processing to automate the analysis and understanding of privacy policy documents through fine-grained labeling (Harkous et al., 2018), summarization (Zaeem et al., 2018), and question-answering-based approaches (Ravichander et al., 2019). The automation of privacy policy analysis catalyzed progress in three main privacy policy tasks – (1) structured representations for privacy policies, including alternative formats (Cranor, 2002), nutrition labels (Kelley et al., 2009; Khandelwal et al., 2023), privacy icons (Regulation, 2016), and short notices (Zimmeck and Bellovin, 2014); (2) large-scale measurements to analyze the policy landscape at scale (Amos et al., 2021; Wagner, 2022); and (3) mechanisms to find inconsistencies within a policy document (Andow et al., 2019) or verify compliance with privacy laws, like the GDPR (Tan and Song, 2023a; Linden et al., 2018; Manandhar et al., 2024).

These methods, however, suffer from two fundamental limitations. First, existing approaches are narrow in scope and only target individual tasks, either comprehension, consistency, or compliance. As such, they require training several task-specific ML models, making them hard to modify or extend. Second, they operate on a sentence level and fail to capture context across paragraphs. Due to the focus on sentence-level analysis, comprehension of policy text has been limited to one-dimensional attributes like data retention period or purpose. These approaches struggle to capture multidimensional concepts, such as the retention period of location data in the context of European countries which are likely to be described across multiple sentences or paragraphs of the policy document.

In this work, we propose PolicyLR, a logical representation of privacy policies, that promises to address limitations in existing approaches. PolicyLR represents a new paradigm of a comprehensive machine-readable representation that can serve as a one-for-all solution for multiple downstream tasks. We devise a logic system formulation to represent a privacy policy as the valuations of a set of atomic formulae. These formulae act as independent building blocks that can be combined together to convey multidimensional concepts like All identifiable data used for advertisements should have a limited retention period. The logic system allows for a formal definition of a variety of tasks such as compliance, consistency, and comparisons.

To construct PolicyLR’s logical representation, we first need to define a set of atomic formulae. We build upon existing work in policy annotation that has created hierarchical taxonomies of privacy concepts (Wilson et al., 2016; Arora et al., 2022). Specifically, we utilize the OPP-115 taxonomy developed by Wilson et al. (Wilson et al., 2016) to generate our list of atomic formulae. This taxonomy is widely recognized in the literature and offers a comprehensive set of privacy practices. Notably, the PolicyLR framework is designed to be flexible and independent of any specific taxonomy, allowing users to choose or customize taxonomies based on their requirements.

We build PolicyLR to be versatile in the sense that it can automatically generate its set of atomic formulae by ingesting existing machine-readable taxonomies for privacy policies. To make our logical representation compatible with existing privacy policies, we provide a compiler for PolicyLR that can transform any unstructured policy text into the valuations of a set of atomic formulae. Our compiler deconstructs the transformation task as a two-stage translation and entailment procedure. This formulation allows us to incorporate global context from different sections of the policy text. We use off-the-shelf instruction-tuned Large Language Models (LLMs) without any need for fine-tuning. To further improve its usability, we design our compiler to work with open-source LLMs.

PolicyLR’s formulaic representation comprises fundamental formulae, which are then used to construct more complex formulae. This atomics-based formulation helps in explaining the behavior of complex formulae. The entailment module provides valuations of these atomics along with evidence that cites relevant segments in the privacy policies that were used in the evaluation. This evidence helps ground LLMs’ responses to relevant segments of the privacy policy. Both these features help make PolicyLR more interpretable and mitigate hallucinations.

Contributions. In this paper, we make the following contributions.

  1. 1.

    We propose PolicyLR, a logic-based representation for privacy policies. Our representation allows for formal definitions of various privacy tasks like compliance and consistency. We also provide an automated way to initialize PolicyLR’s atomic formulae by leveraging existing privacy policy taxonomies.

  2. 2.

    We build a compiler that leverages open-source LLMs to transform an unstructured policy text into the valuations of a set of atomic formulae.

  3. 3.

    We evaluate our compiler using ToS;DR, a community-annotated privacy policy entailment dataset. Our compiler achieves precision and recall values of 0.910.910.910.91 and 0.880.880.880.88 using open-source LLMs.

  4. 4.

    We demonstrate the utility of PolicyLR on three privacy tasks – Policy Compliance, Inconsistency detection, and privacy comparison shopping.

2 Related Work

We contextualize our proposed framework, PolicyLR, within related work around using LLMs for document reasoning, privacy policy analysis, document consistency analysis, and compliance analysis for policies.

2.1 Reasoning about Documents Via Language Inference Task

Large Language Models (LLMs) have emerged as powerful tools for natural language processing, demonstrating impressive abilities like text generation and comprehension (Gao et al., 2023). One crucial capability for reasoning about documents is Natural Language Inference (NLI) (MacCartney, 2009). This task involves determining whether a natural language hypothesis can reasonably be inferred from a given premise. In the NLI task, LLMs are required to determine if the hypothesis is true (i.e., entailment), false (i.e., contradiction), or undetermined (neutral), given a premise. For example, consider a premise saying: “John’s brother is 8 years old.”. A hypothesis that says “John has no siblings.” will get a contradiction label. PolicyLR uses LLM’s NLI capabilities to map unstructured policy documents into valuations of a set of atomic formulae.

2.2 Retrieval Augmented Generation for Document Reasoning

While LLMs possess vast knowledge, they may not always have access to the specific information required for complex reasoning tasks involving specialized documents. Retrieval Augmented Generation (RAG) (Lewis et al., 2020) addresses this limitation by combining the power of LLMs with external knowledge bases or document retrieval systems. In the context of document reasoning, RAG first retrieves relevant documents or passages from a corpus based on the given query or task. The retrieved information is then used to augment the LLM’s input, providing contextual information and supporting evidence for reasoning. PolicyLR uses RAG to ground LLM reasoning on relevant parts of the privacy policy document.

2.3 Privacy Policy Analysis

Privacy policy analysis is crucial to understanding how organizations handle personal data. Automated analysis techniques are becoming increasingly important due to the vast number of privacy policies that exist. Early research focused on rule-based systems and supervised learning approaches for privacy policy analysis. Researchers have employed sentence classification approaches for tasks such as identifying missing information or categorizing policy elements (Harkous et al., 2018; Bhatia and Breaux, 2018). Cui et al. address semantic incompleteness through PoliGraph, a knowledge graph approach that analyzes entire policies using semantic role labeling (Cui et al., 2023).

Topic modeling techniques like those used by Sarne et al. can identify high-level themes within large collections of privacy policies (Sarne et al., 2019). Shvartzshnaider et al. (Shvartzshnaider et al., 2018) introduce a framework that analyzes policies from the perspective of information flow and user readability, considering contextual integrity.

Ontology-based techniques offer additional capabilities. PrivOnto by Oltramari et al. leverages an ontology to represent and analyze privacy policies, enabling semantic querying (Oltramari et al., 2018). Nejad et al.’s Knight system focuses on mapping privacy policies to specific articles in regulations like GDPR (Nejad et al., 2018).

More recently, Large Language Models (LLMs) like ChatGPT and Llama 2 have shown significant promise for various NLP tasks (Touvron et al., 2023), including sentiment analysis and text summarization (El-Kassas et al., 2021). Their ability to process complex language and identify patterns within large amounts of text makes them well-suited for extracting privacy practices from privacy policies. Rodriguez et al. (Rodriguez et al., 2024) propose using LLMs to extract privacy practices from the privacy policies. However, they use LLMs to perform zero-shot classification by treating all the classes as independent.

These studies highlight the potential of NLP and semantic techniques for privacy policy analysis. However, most of these techniques rely on sentence-level analysis that does not account for the whole context in the document and can miss relationships between different sections of a policy. Furthermore, these techniques treat all classes as independent, e.g., purpose and retention-period classifiers are trained separately. This misses out on complex cases where these classes are not independent, and there is a need for a joint classification. PolicyLR accounts for these relations and represents privacy policy in a more granular and comprehensive manner.

2.4 Document Consistency

Document consistency is essential for ensuring the accuracy and clarity of information present within documents. Research in this area focuses on identifying inconsistencies within a single document (internal consistency) and across related documents (cross-document consistency).

Intra-document Consistency. Ali et al. (Ali et al., 2023) propose a system for automated consistency checks in financial documents by projecting the entities in embedding space and ensuring that semantic and syntactic variations are treated similarly. Andow et al.  (Andow et al., ) propose a system that analyzes privacy policies for internal inconsistencies by leveraging an ontology to capture positive and negative statements regarding data collection and sharing. These approaches segment the documents and use transformer-based models to classify segments that can miss out on the full context. PolicyLR, on the other hand, relies on state-of-the-art language models to understand the context and accurately determine privacy practices.

Inter-document Consistency. Researchers have also explored inter-document consistency to compare practices disclosed by developers in privacy labels with privacy policies  (Khandelwal et al., 2023; Jain et al., 2023). Jain et al. (Jain et al., 2023) introduce ATLAS, a system that casts consistency as a document classification task and automatically detects discrepancies between privacy policies and privacy labels for mobile apps. They extract privacy labels from privacy policies and compare them with actual labels released by the developers. Khandelwal et al. (Khandelwal et al., 2023) create a new taxonomy for privacy labels and leverage it to build classifiers to predict privacy labels using privacy policies. Similarly, Zimmeck et al. (Zimmeck et al., 2019) identify inconsistencies between the app’s behavior and its stated privacy practices by combining machine learning-based privacy policy analysis with static code analysis.

While previous research has made significant developments in consistency analysis, there are limitations. Sentence level analysis in PolicyLint (Andow et al., ) might miss additional context, resulting in erroneous results. Segment-level classification frameworks, as presented in Khandelwal et al. (Khandelwal et al., 2023) and Zimmeck et al. (Zimmeck et al., 2016, 2019), also suffer from this limited context problem. ATLAS (Jain et al., 2023), on the other hand, poses the problem as document classification but trains 32 different privacy label-specific classifiers. PolicyLR, on the other hand, does not rely on segmentation and analyzes all relevant context. Furthermore, we implement PolicyLR using an off-the-shelf language model that acts as a universal compiler to generate the truth table, which in turn allows us to perform several downstream tasks.

2.5 Policy Compliance

Automated methods for assessing policy compliance with regulations are a growing area of interest. Manandhar et al. (Manandhar et al., 2024) introduce the ARC framework for transforming complex privacy regulations into a structured format, facilitating automated analysis and compliance assessment. Prior works also focus on analyzing privacy policies for compliance with regulations like GDPR (Linden et al., 2018; Liu et al., 2021; Liao et al., 2024). PolicyChecker by Liao et al.  (Liao et al., 2024) utilizes a rule and semantic role labeling approach to assessing compliance of mobile app privacy policies with GDPR. PTPDroid by Tan et al. (Tan and Song, 2023b) identifies potential violations related to third-party data collection practices disclosed in Android app privacy policies. Shafei et al. (Shafei et al., 2024) investigate data handling discrepancies in privacy policies for Alexa skills with account linking. Linden et al. (Linden et al., 2018) code ICO checklist111https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/ into structured queries and leverage the Polisis (Harkous et al., 2018) framework to understand the impact of GDPR on privacy policies.

Mori et al. (Mori et al., 2022) propose a method for using convolutional neural networks (CNNs) to analyze privacy policies and classify them based on compliance with legal requirements. However, the “black-box" nature of CNNs and the need for adjustments when applying the method to different legal frameworks pose challenges. Rabinia and Nygaard explore utilizing Natural Language Inference (NLI) for compliance checking, demonstrating that models trained on diverse datasets perform better with real-world privacy policy tasks (Rabinia and Nygaard, 2022).

Existing works have taken a fragmented approach to the compliance problem, focusing on a subset of privacy regulations, not being able to consider the full context of policy texts, and using opaque models that cannot provide reasoning. In contrast, PolicyLR addresses these problems by considering a more comprehensive representation of the privacy policy as a truth table of logic formulae. This representation allows for more interpretable compliance with a wider set of privacy regulations.

3 Logic Representation

3.1 Notation

Let 𝒜𝒜\mathcal{A}caligraphic_A be that set of atomic formulae of a logic system, representing its vocabulary. Let ΣΣ\Sigmaroman_Σ be the set of logical connectives which can be unary, binary or n𝑛nitalic_n-ary. Connectives are used to construct formulae, for example, if ¬\neg¬ is an unary connective, for any formula ϕitalic-ϕ\phiitalic_ϕ, ¬ϕitalic-ϕ\neg\phi¬ italic_ϕ is also a valid formula. Similarly, for a binary connective \land, any formula pair ϕitalic-ϕ\phiitalic_ϕ and ψ𝜓\psiitalic_ψ imply that ϕψitalic-ϕ𝜓\phi\land\psiitalic_ϕ ∧ italic_ψ is also a valid formula. The set of atomic formulae along with the connectives define G=(𝒜,Σ)𝐺𝒜ΣG=(\mathcal{A},\Sigma)italic_G = ( caligraphic_A , roman_Σ ), the formulation grammar of the logic. Let ΦGsubscriptΦ𝐺\Phi_{G}roman_Φ start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT be the set of all possible formulae that can be constructed using the grammar G𝐺Gitalic_G.

3.2 Valuation Function

All formulae can be evaluated to either True or False depending on the world model M𝑀Mitalic_M. In other words, if M𝑀Mitalic_M satisfies a formula ϕitalic-ϕ\phiitalic_ϕ i.e., Mϕmodels𝑀italic-ϕM\models\phiitalic_M ⊧ italic_ϕ, the formula ϕitalic-ϕ\phiitalic_ϕ will be evaluated to be True. We define the Base Valuation Function ValM:𝒜{0,1}:subscriptVal𝑀𝒜01\texttt{Val}_{M}:\mathcal{A}\rightarrow\{0,1\}Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT : caligraphic_A → { 0 , 1 } as

ValM(ϕ)={1if Mϕ0if o.w.subscriptVal𝑀italic-ϕcases1modelsif 𝑀italic-ϕ0formulae-sequenceif 𝑜𝑤\texttt{Val}_{M}(\phi)=\begin{cases}1&\text{if }M\models\phi\\ 0&\text{if }o.w.\end{cases}Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_ϕ ) = { start_ROW start_CELL 1 end_CELL start_CELL if italic_M ⊧ italic_ϕ end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL if italic_o . italic_w . end_CELL end_ROW

where M𝑀Mitalic_M is the world model and ϕ𝒜italic-ϕ𝒜\phi\in\mathcal{A}italic_ϕ ∈ caligraphic_A is an atomic formula.

Next, we extend the valuation function ValMsubscriptVal𝑀\texttt{Val}_{M}Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT to a function ValM:ΦG{0,1}:superscriptsubscriptVal𝑀subscriptΦ𝐺01\texttt{Val}_{M}^{*}:\Phi_{G}\rightarrow\{0,1\}Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT : roman_Φ start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT → { 0 , 1 } that assigns truth values to all formulae generated using grammar G𝐺Gitalic_G. Note that all formulae are built inductively from the set of atomic formulae 𝒜𝒜\mathcal{A}caligraphic_A. Therefore, the valuation of any complex formula can be determined by recursively applying the logical connectives according to their rules on the valuations of the atomic formulas, i.e.

ValM((ϕ1,,ϕn))=f(ValM(ϕ1),,ValM(ϕn))superscriptsubscriptVal𝑀direct-sumsubscriptitalic-ϕ1subscriptitalic-ϕ𝑛subscript𝑓direct-sumsubscriptVal𝑀subscriptitalic-ϕ1subscriptVal𝑀subscriptitalic-ϕ𝑛\texttt{Val}_{M}^{*}(\oplus(\phi_{1},...,\phi_{n}))=f_{\oplus}(\texttt{Val}_{M% }(\phi_{1}),...,\texttt{Val}_{M}(\phi_{n}))Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( ⊕ ( italic_ϕ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_ϕ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ) = italic_f start_POSTSUBSCRIPT ⊕ end_POSTSUBSCRIPT ( Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , … , Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_ϕ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) )

where Σ\oplus\in\Sigma⊕ ∈ roman_Σ is an n𝑛nitalic_n-ary connective, f:{0,1}n{0,1}:subscript𝑓direct-sumsuperscript01𝑛01f_{\oplus}:\{0,1\}^{n}\rightarrow\{0,1\}italic_f start_POSTSUBSCRIPT ⊕ end_POSTSUBSCRIPT : { 0 , 1 } start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT → { 0 , 1 } is the truth function associated with the connection ΣΣ\Sigmaroman_Σ, ϕ1,,ϕn𝒜subscriptitalic-ϕ1subscriptitalic-ϕ𝑛𝒜\phi_{1},...,\phi_{n}\in\mathcal{A}italic_ϕ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_ϕ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ caligraphic_A are atomic formulae.

For instance, let ϕitalic-ϕ\phiitalic_ϕ and ψ𝜓\psiitalic_ψ be atomic formulae with valuations ValM(ϕ)subscriptVal𝑀italic-ϕ\texttt{Val}_{M}(\phi)Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_ϕ ) and ValM(ψ)subscriptVal𝑀𝜓\texttt{Val}_{M}(\psi)Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_ψ ) respectively. For example, the valuations associated with the connectives ¬,\neg,\land¬ , ∧ and \lor can be determined by the following:

ValM(¬ϕ)=1ValM(ϕ)superscriptsubscriptVal𝑀italic-ϕ1subscriptVal𝑀italic-ϕ\texttt{Val}_{M}^{*}(\neg\phi)=1-\texttt{Val}_{M}(\phi)Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( ¬ italic_ϕ ) = 1 - Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_ϕ )
ValM(ϕψ)=min(ValM(ϕ),ValM(ψ))superscriptsubscriptVal𝑀italic-ϕ𝜓minsubscriptVal𝑀italic-ϕsubscriptVal𝑀𝜓\texttt{Val}_{M}^{*}(\phi\land\psi)=\text{min}\left(\texttt{Val}_{M}(\phi),% \texttt{Val}_{M}(\psi)\right)Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_ϕ ∧ italic_ψ ) = min ( Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_ϕ ) , Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_ψ ) )
ValM(ϕψ)=max(ValM(ϕ),ValM(ψ))superscriptsubscriptVal𝑀italic-ϕ𝜓maxsubscriptVal𝑀italic-ϕsubscriptVal𝑀𝜓\texttt{Val}_{M}^{*}(\phi\lor\psi)=\text{max}\left(\texttt{Val}_{M}(\phi),% \texttt{Val}_{M}(\psi)\right)Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( italic_ϕ ∨ italic_ψ ) = max ( Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_ϕ ) , Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_ψ ) )

3.3 Logical Representation

Evaluating whether the world model M𝑀Mitalic_M satisfies a formula or not, provides information about the model. For instance, if a world model satisfies the formula “All entities must have unique identifiers”, it indicates a structured environment where each entity can be distinctly identified. This suggests that evaluating a large number of such formulae should give a comprehensive representation of the world model. Therefore, the set of atomic formulae along with the Valuation function – (𝒜,ValM)𝒜subscriptVal𝑀(\mathcal{A},\texttt{Val}_{M})( caligraphic_A , Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ), provides a logical representation of the world model M𝑀Mitalic_M. Any complex formula can then be evaluated by simply combining the atomic valuations using truth functions associated with the connectives of the logic grammar. This representation has the following benefits:

  1. 1.

    Abstraction: It provides a concise intermediate representation of the world model. It can act as a proxy for the world model to perform downstream analyses such as Consistency and Compliance (subsection 3.4).

  2. 2.

    Efficiency: Evaluating a formula directly on the world model using the valuation function can be costly – it requires analysis of the world model. By design, our formulation provides a solution by first making the minimum set of necessary valuations and then using them to evaluate any future formulae.

  3. 3.

    Explainability: Valuations for complex formulae can seem opaque and hard to interpret. The atomic formulae are fundamental and can be directly inferred from the world model. Therefore, representing complex formulae as a function of the atomics offers better explanations for formula evaluations.

3.4 Definitions

Consistency. A world model M𝑀Mitalic_M is consistent with another world model Msuperscript𝑀M^{\prime}italic_M start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT with respect to the logic grammar G𝐺Gitalic_G, denoted as MGMsubscriptsimilar-to𝐺𝑀superscript𝑀M\sim_{G}M^{\prime}italic_M ∼ start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT italic_M start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT if

ϕ𝒜,ValM(ϕ)=ValM(ϕ)formulae-sequencefor-allitalic-ϕ𝒜subscriptVal𝑀italic-ϕsubscriptValsuperscript𝑀italic-ϕ\forall\phi\in\mathcal{A},\;\texttt{Val}_{M}(\phi)=\texttt{Val}_{M^{\prime}}(\phi)∀ italic_ϕ ∈ caligraphic_A , Val start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT ( italic_ϕ ) = Val start_POSTSUBSCRIPT italic_M start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_ϕ )

This means that world models M𝑀Mitalic_M and Msuperscript𝑀M^{\prime}italic_M start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT agree on valuations of all formulae in 𝒜𝒜\mathcal{A}caligraphic_A, i.e., they have the same logical representation. For example, consider a set of logical statements describing a company’s data retention policy, such as “Retention period for user data is unspecified”, “Retention period for user data is one year”, and “Retention period for user is 10 years”. This set of statements can be used to evaluate whether two data retention policies are similar/consistent in terms of how long user data is kept. Similarly, it can also be used to evaluate consistency between different versions of the same data retention policy over time.

Compliance. Consider a set of formulae ΦΦGΦsubscriptΦ𝐺\Phi\subseteq\Phi_{G}roman_Φ ⊆ roman_Φ start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT constructed using the grammar G𝐺Gitalic_G. A world model M𝑀Mitalic_M is compliant with ΦΦ\Phiroman_Φ, denoted as MΦmodels𝑀ΦM\models\Phiitalic_M ⊧ roman_Φ if

ϕΦ,Mϕformulae-sequencefor-allitalic-ϕΦmodels𝑀italic-ϕ\forall\phi\in\Phi,\;M\models\phi∀ italic_ϕ ∈ roman_Φ , italic_M ⊧ italic_ϕ

The valuation of each formula in ΦΦ\Phiroman_Φ can be derived by some combination of formulae in 𝒜𝒜\mathcal{A}caligraphic_A. For instance, the General Data Protection Regulation (GDPR) mandates that personal data should not be kept for longer than necessary for its intended purpose. Now, consider the logical statements from data-retention example above. Compliance with the formula “Retention period for user data is compliant with GDPR” can be alternatively evaluated by the combination of atomic formulae corresponding to these statements – “Retention period for user data is specified”, “Retention period for user data is no longer than necessary for the purposes it was collected”, and “User data is securely deleted after retention period expires”.

4 PolicyLR: Logic Representation for Privacy Policies

We present PolicyLR, which is a representation of privacy policies using a set of atomic formulae and a valuation function. In this section, we define PolicyLR’s atomic formulae and demonstrate how PolicyLR’s representation can be used in downstream applications related to privacy policies.

4.1 Logical Representation

To construct PolicyLR’s logical representation, we need to define the set of atomic formulae. We leverage existing work on policy annotation that has developed hierarchical taxonomies of privacy concepts (Wilson et al., 2016; Arora et al., 2022). We specifically use the OPP-115 taxonomy developed by Wilson et al. (Wilson et al., 2016) to generate a list of atomic formulae. The taxonomy is widely used in the literature (Harkous et al., 2018; Zimmeck et al., 2019; Wagner, 2023; Alabduljabbar et al., 2021), and provides a comprehensive set of privacy practices. We note that the PolicyLR framework is independent of the taxonomy a user might choose.

The OPP-115 taxonomy has two levels – a top level that defines high-level privacy categories like first-party-collection, data-retention, and a lower level that defines a set of attributes for each high level category. Each attribute is a categorical variable that can take one of a fixed set of values. For example, the high level category data-retention has attributes retention-period, retention-purpose and information-type. The lower level attribute retention-period can take one of the values – {indefinite,stated,limited,unpsecified}indefinitestatedlimitedunpsecified\{\texttt{indefinite},\;\texttt{stated},\;\texttt{limited},\;\texttt{% unpsecified}\}{ indefinite , stated , limited , unpsecified }.

We next describe how we construct the building blocks of PolicyLR using this taxonomy.

Atomic Formulae. We define the set of atomic formulas using a space of finite-domain variables. We use the high-level privacy categories as the finite-domain variables. For instance, data-retention(period=stated,purpose=advertising,type=location)data-retentionformulae-sequenceperiodstatedformulae-sequencepurposeadvertisingtypelocation\texttt{data-retention}(\texttt{period}=\texttt{stated},\texttt{purpose}=% \texttt{advertising},\texttt{type}=\texttt{location})data-retention ( period = stated , purpose = advertising , type = location ) is an atomic formula.

Consider a high-level category p𝒫𝑝𝒫p\in\mathcal{P}italic_p ∈ caligraphic_P, with lower-level attributes given by attr(p)𝒬𝑎𝑡𝑡𝑟𝑝𝒬attr(p)\in\mathcal{Q}italic_a italic_t italic_t italic_r ( italic_p ) ∈ caligraphic_Q. Each attribute q𝒬𝑞𝒬q\in\mathcal{Q}italic_q ∈ caligraphic_Q can assume a finite set of values dom(q)V𝑑𝑜𝑚𝑞𝑉dom(q)\in Vitalic_d italic_o italic_m ( italic_q ) ∈ italic_V. Now, the set of atomic formulae 𝒜𝒜\mathcal{A}caligraphic_A is given by

{p(q1=v1,,qn=vn)|p𝒫,{q1,,qn}=attr(p)vidom(qi)i{1,,n}}\left\{p\left(q_{1}=v_{1},\ldots,q_{n}=v_{n}\right)\;\middle|\;\begin{array}[]% {l}p\in\mathcal{P},\{q_{1},\ldots,q_{n}\}=attr(p)\\ v_{i}\in\text{dom}(q_{i})\;\forall i\in\{1,\ldots,n\}\end{array}\right\}{ italic_p ( italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_v start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) | start_ARRAY start_ROW start_CELL italic_p ∈ caligraphic_P , { italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } = italic_a italic_t italic_t italic_r ( italic_p ) end_CELL end_ROW start_ROW start_CELL italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ dom ( italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∀ italic_i ∈ { 1 , … , italic_n } end_CELL end_ROW end_ARRAY }

where p𝑝pitalic_p is a high-level category from 𝒫𝒫\mathcal{P}caligraphic_P, attr(p)𝑎𝑡𝑡𝑟𝑝attr(p)italic_a italic_t italic_t italic_r ( italic_p ) represents the set of attributes associated with p𝑝pitalic_p, and each qisubscript𝑞𝑖q_{i}italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is an attribute that can take values from its respective domain dom(qi)𝑑𝑜𝑚subscript𝑞𝑖dom(q_{i})italic_d italic_o italic_m ( italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ). This set 𝒜𝒜\mathcal{A}caligraphic_A includes all possible combinations of attributes and their values for each high-level category.

For example, consider a toy taxonomy that only contains Data Retention and First Party Collection as high-level categories. Furthermore, Data Retention only contains two values each for retention-period and retention-purpose whereas First Party Collection consists of identifiability and purpose.

Our key insight is that given the comprehensiveness of the taxonomy, the set of atomic formulae along with their valuation serves as a complete representation of the privacy policy allowing us to perform a variety of the downstream tasks (7 and 8).

4.2 Downstream Applications

Next, we demonstrate how PolicyLR can be useful for downstream applications.

Compliance Analysis. PolicyLR can help with compliance analysis by mapping the regulatory requirements to logical formulae and allowing to systematically check if the privacy policy adheres to data handling practices and user controls. This can significantly streamline the compliance analysis process and ensure alignment with evolving regulations.

Consistency Analysis. PolicyLR can also facilitate consistency analysis within a privacy policy. The logical representation of the policy using atomic formulae enables automated checks for inconsistencies. For instance, the valuation function can reveal if the policy states that a certain type of data is collected for one purpose, but later contradicts itself by allowing the use of that data for a different purpose.

Privacy-based Comparison Shopping. To perform tasks on a daily basis, customers are often faced with the decision to choose between several applications or services that perform the same task. Apart from quality or cost, these decisions can also be based on the privacy practices of the service provider (König and Hansen, 2012). PolicyLR provides a natural way to compare the privacy policies of multiple services. The atomic formulae can be easily combined to perform comparisons along multiple dimensions.

5 Compiling Unstructured Documents to PolicyLR

Refer to caption
Figure 1: End to end pipeline for PolicyLR. We first instantiate PolicyLR’s atomic formulae using the OPP-115 taxonomy. Each combination of attribute-value pairing becomes an atomic formula. The translation module then transforms each of these into natural language statements. Statements are then compared against the privacy policy text by the entailment module to generate PolicyLR’s truth table.

In the previous section, we argued that privacy policy applications can benefit from PolicyLR’s logical representation. However, existing policies are comprised of long, unstructured text documents that are hard to parse or comprehend by the end users (McDonald and Cranor, 2008; Reidenberg et al., 2015; Pollach, 2007; Linden et al., 2018). In order to make PolicyLR relevant to the existing landscape, we need a compiler to transform unstructured privacy policies to PolicyLR’s logical representation. This compiler is essentially a valuation function that can a) handle a privacy policy text document as the world model, b) process the unstructured text, and c) generate valuations for atomic formulae.

Building a valuation function for unstructured text is similar to the Natural Language Inference (NLI) task (MacCartney, 2009). While NLI involves inferring the logical relationship between two sentences, PolicyLR’s compiler needs to do this for a formula and a long policy text document. We reduce the compilation task to an NLI task by leveraging a two-stage pipeline – (1) Translation: We transform both the formula and the policy text document into sentences suited for the NLI task, and then (2) Entailment: We use NLI to ascertain whether the policy text entails the formula. Next, we show how we implement the two stages using Large Language Models. Figure 1 shows the end to end pipeline for PolicyLR along with translation and entailment modules.

5.1 Translation

To reduce compilation to an NLI task, we need to transform the inputs into short sentences that can be used for NLI. For translating a formula, we use the In-Context Learning (ICL) ability of LLMs. For the policy text, we use embedding-based similarity search to extract the relevant sentences from the long policy text.

Formulas. Large language Models have remarkable in-context learning abilities (Brown et al., 2020). It allows LLMs to be applied to new tasks using only a few natural language demonstrations, a phenomenon known as few-shot learning. More concretely, we use a set of k𝑘kitalic_k input-output pairs {(xi,yi)}i=1ksuperscriptsubscriptsuperscript𝑥𝑖superscript𝑦𝑖𝑖1𝑘\{(x^{i},y^{i})\}_{i=1}^{k}{ ( italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT, where xisuperscript𝑥𝑖x^{i}italic_x start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT are arbitrary formulas from PolicyLR’s grammar, and yisuperscript𝑦𝑖y^{i}italic_y start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT are the corresponding natural language translations. We only need a few in-context samples for demonstration, which are crafted by privacy policy experts. For example, the atomic formulae, data-retention(period=indefinite,purpose=legal,type=location\texttt{data-retention}(\texttt{period}=\texttt{indefinite},\texttt{purpose}=% \texttt{legal},\texttt{type}=\texttt{location}data-retention ( period = indefinite , purpose = legal , type = location is translated into “The data-retention policy indicates that collected user information, specifically geo-location data, is retained indefinitely and is stored only as long as required for legal or law enforcement purposes.”. Note that translation of formulas is independent of the privacy policy texts and therefore, only needs to be done once.

Policy Text. While recent LLMs have long context windows, privacy policies might still be too long to fit within the LLM’s context window. Therefore, to perform NLI using LLMs, we need to retrieve only the excerpts of the privacy policy relevant to the formula being entailed. We do this via a Retrieval-Augmented Generation (RAG) methodology. We first chunk the long privacy policy text into shorter segments. Then, we approximate the semantics using text-embedding models. This segment embedding mapping is stored in a vector database, to allow for quick retrieval during the entailment process. During retrieval, we ensure that additional context is not lost due to chunking by adding the previous and the next segment of the retrieved segment. Note that this segmentation and embedding process only needs to be done once per privacy policy, subsequently allowing for entailment of multiple formulas.

5.2 Entailment

Once we have the translations for the formula and the privacy policy, we can use NLI to infer the formula’s valuation with respect to the policy. Specifically, we prompt an LLM with the following, “According to the Privacy Policy P𝑃Pitalic_P, is the following statement True? Q𝑄Qitalic_Q. Here, P𝑃Pitalic_P and Q𝑄Qitalic_Q are translations of the policy text and formula respectively. To extract P𝑃Pitalic_P, we first compute the embedding vector of Q𝑄Qitalic_Q and then fetch the most similar k𝑘kitalic_k segments from the segment-embedding mapping computed in the translation stage. We append the tag “[source_id:i]” at the end of the ithsuperscripti𝑡\text{i}^{th}i start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT segment and concatenate them to get the condensed and most relevant policy text P𝑃Pitalic_P. We then augment the prompt to provide evidence by highlighting the segments that were used by the LLM to perform the entailment task, using the template – “Give evidence by providing all the source ids that are used to answer the question in the format of - Evidence:[2,3,7,…]”. This evidence makes the valuation more interpretable and can be useful for downstream applications. Note that k𝑘kitalic_k here is a tunable parameter that controls the context (P𝑃Pitalic_P) provided to the LLM. It can be tuned to balance the precision-recall trade-off for any downstream task. For example, increasing the value of k𝑘kitalic_k will result in higher recall but can lower the precision. We discuss this further in subsection 6.4.

6 Evaluation

6.1 Research Questions

  1. Q1.

    What is the performance of PolicyLR’s compiler?
    We demonstrate that PolicyLR is able to successfully perform valuations against unstructured privacy policy text documents. When evaluated on 2656265626562656 entailment instances from ToS;DR, a privacy community annotated dataset, using an open-source LLM gemma2-27b, PolicyLR’s compiler achieves a precision value of 0.840.840.840.84 and recall value of 0.880.880.880.88.

  2. Q2.

    How effective is PolicyLR for the Policy Compliance task?
    Next, we show the effectiveness of PolicyLR on policy compliance tasks using two existing datasets. Specifically, we analyze the compliance of 419419419419 privacy policies with respect to 3333 compliance rules based on the Article 13 of the GDPR. We find that PolicyLR achieves an average F-1 score of 0.910.910.910.91 across the two datasets, highlighting the efficacy of PolicyLR in downstream tasks.

  3. Q3.

    How can PolicyLR be used to perform inconsistency detection and privacy comparison shopping?
    We analyze privacy policies of the popular apps on the Google Play Store to show how PolicyLR can be used to a) Analyze the consistency of privacy policies over time and b) Compare privacy practices of apps in similar categories. We note that the latter can be used to provide a comparative view of the privacy of apps that can act as a signal for users when deciding which apps to buy or use.

6.2 Experimental Setup: Datasets

We use the following experimental setup for answering each of the research questions using the datasets described in Table 1.

Task Dataset ##\## Unique Policies
Compiler Entailment ToS;DR Annotations 1074
Policy Compliance OPP-115 115
AutoCompliance 304
Consistency Analysis Self Curated 85
Table 1: The analyzed datasets in the evaluation.

ToS;DR Dataset. PolicyLR’s compiler provides a valuation function that can evaluate formulae against an unstructured privacy policy text document. To evaluate its performance, we use the ToS;DR dataset. Terms of Service; Didn’t Read (ToS;DR) is a collaborative, community-driven platform where users and volunteers contribute to evaluating and summarizing terms of service and privacy policies (Roy et al., 2012). The platform helps make privacy policies easier to understand by using crowd-sourced annotations. Users sign up on the platform to annotate policies by linking parts of the policy to specific data practices called cases. Cases are concise statements about privacy settings, for example - ‘You can delete your content from this service’ or ‘This service tracks you on other websites’. A moderator then reviews these matches and either approves them or provides feedback. We can use cases as proxies for the natural language translations of logical formulae. The moderator-approved matches provide reliable ground truth for their valuation against privacy policy texts.

The dataset consists of 246246246246 cases, which are used to annotate 1074107410741074 unique privacy policy texts. After approval from the moderators, this leads to a total of 13179131791317913179 case and privacy policy pairings. ToS;DR comprises only positive instances. To construct negative instances (i.e., where the case doesn’t match the policy text), we manually analyze all the cases and find 11111111 pairs of mutually contrasting cases. This means that if a case is evaluated to be true for a policy text, its contrasting case will necessarily be evaluated to be false on that text. Using this formulation, we construct a total of 1222122212221222 negative instances. We sample a random subset of size 1300130013001300 from the positive instances to get a total of 1522152215221522 case-policy pairs. For each instance, we apply PolicyLR’s compiler to evaluate the case against the policy text, answering either true or false.

6.3 Experimental Setup: Implementation Details of PolicyLR

We now describe how PolicyLR is instantiated for our experiments.

Models. While evaluating PolicyLR’s compiler on the ToS;DR dataset, we consider several open-source LLMs – 8 billion and 70 billion versions of Meta’s Llama3 as well as 9 billion and 27 billion versions of Google’s Gemma2. We also evaluate OpenAI’s latest closed-source model Gpt4-O. For the Compliance task, we consider the larger versions of both Llama3 and Gemma2. Finally, for the consistency task, we consider Llama3222Gemma2 was released very recently (06-27-2024). Due to time constraints, it is only part of our ToS;DR evaluation. To get deterministic results, we set the sampling temperature to 0 for all models.

Translation. Since the translation of atomic formulae needs to be done only once, we use the most performant LLM gpt4-o for this task. Below we show the prompt used for this task for one of the data-retention atomic:

SYSTEM: You are a privacy policy expert. A privacy setting consists of a combination of attributes. Each of these has an associated value, along with a description of what that value means. You have to construct a concise statement that describes the setting. Only output the statement. USER: Attribute: period, Value: limited, Description: … Attribute: purpose, Value: ads, Description: …

Here, we provide a description for each attribute value as described in the OPP-115 dataset.

Entailment. To make it more accessible, we implement PolicyLR’s entailment module using only open source components. We tune the hyperparameters using a disjoint set of 10 policy documents. We use the SentenceSplitter API from LlamaIndex333https://docs.llamaindex.ai/ to segment the privacy policy text. Each segment comprises 300300300300 tokens. Then, we generate text embeddings using UAE-Large-V1 (Li and Li, 2023), which is a popular open-source embedding model. We store these embeddings in chroma444https://www.trychroma.com/ which is a open-source vector database. For any entailment task, we query the database for the top-k𝑘kitalic_k segments that are most similar to the embedding of the hypothesis. We use k=5𝑘5k=5italic_k = 5 and k=10𝑘10k=10italic_k = 10 for evaluating the compiler, and k=10𝑘10k=10italic_k = 10 for the rest of the evaluation.

6.4 Performance of PolicyLR’s compiler

Entailment. We evaluate our compiler on the 1300130013001300 positive and 1222122212221222 negative case-policy pairing from ToS;DR. We parse the response of the entailment LLM and assign it a value of true if it begins with “Yes” and false if it begins with “No”. We did not observe any instance when response did not begin with either “Yes” or “No”. This again demonstrates the high instruction following capability of the latest LLMs. Table 2 shows the precision, recall and F1 score when using 2 variants of both llama3 and gemma2 as well as the closed source model gpt4-0. For each setting, we show results when providing the LLM with the 5 and 10 most relevant segments from the privacy policy. First, we observe the number of policy segments provides a trade-off between precision and recall. Fewer segments provide a higher precision whereas adding more segments improves recall. This is likely because LLMs struggle with longer contexts (Li et al., 2024), but also might miss out on relevant context in case of fewer segments. Second, we observe that larger models perform better in the case of both llama3 and gemma2. In terms of F1 score, we find that gemma2 outperforms llama3 and even gpt4-o. Overall, gpt4-o has the best performance with a precision-recall of 0.940.940.940.94 and 0.840.840.840.84 respectively. Among open-source models, Llama3-70b has the highest precision. To demonstrate PolicyLR’s applications, we use Llama3-70b.

Error Analysis. We perform a deep dive into the errors for gpt4-o to better understand PolicyLR’s performance. The LLM wrongly entailed a total of 280280280280 case-policy pairs – 204204204204 positive pairs and 76767676 negative pairs. There are three primary reasons for these errors – (1) Insufficient Context, (2) LLM Reasoning Error, and (3) Incorrect Annotation Error.

Insufficient Context Error. To characterize the first error type, we use the policy excerpt that was used by the ToS;DR moderators to approve the case-policy pairing. The first type of error occurs in instances where the policy excerpt is not part of the context retrieved by the embedding model. In these cases, the LLM does not have the necessary context to correctly entail the case. Note that since we only have policy excerpts for positive pairs, this methodology can only discover insufficient context errors for the positive pairs. Out of the 204204204204 positive pair errors, 95959595 were due to insufficient context. For the other two error types, we perform a qualitative analysis and highlight some interesting cases below.

LLM Reasoning Error. The following belongs to the second error type where the LLM performs incorrect entailment. In this case, the LLM reasoning seems too nit-picky and suggests a lack of common sense.

LLM Reasoning Error Case: This service is only available to users over a certain age. Response: No. … it does mention that access may be terminated without warning if it is believed that a user is underage or otherwise ineligible. This implies a concern about age but does not set a specific age limit. …

Incorrect Annotation Error. The below instance belongs to the third error type where the TsD;DR annotation is wrong. Upon manually analyzing the privacy policy, we find that the policy does not state that anonymization is done for all cookie types.

Incorrect Annotation Error Case: The cookies used by this service do not contain information that would personally identify you. Response: No. The privacy policy does not … Anonymous Performance Cookies is aggregated and therefore anonymous, it does not provide a similar assurance for other types of cookies, such as Strictly Necessary Cookies and Security Cookies. …

Evidence. We also evaluate the effectiveness of PolicyLR’s evidence functionality. We again leverage ToS;DR’s policy excerpts and evaluate whether the context segments cited in the LLM response contain the excerpt. Out of the 1300130013001300 positive pairs, we found 1056105610561056 instances where the excerpt was part of the retrieved context. We use these 1056105610561056 case-policy pairs to evaluate the evidence functionality. We observe that the LLM, on average, cites 2222 context segments (out of 10101010) while responding to each of the above pairs. Overall, we found the excerpt as part of the evidence in 854854854854 cases, giving a recall of 81%percent8181\%81 %.

Model Top k Precision Recall F1
llama3-8b 5 0.86 0.76 0.81
10 0.81 0.91 0.86
llama3-70b 5 0.94 0.75 0.84
10 0.94 0.81 0.87
gemma2-9b 5 0.93 0.77 0.84
10 0.91 0.84 0.87
gemma2-27b 5 0.92 0.83 0.87
10 0.91 0.88 0.90
gpt4-o 5 0.93 0.76 0.84
10 0.94 0.84 0.89
Table 2: Performance of the entailment task on ToS;DR data.

7 PolicyLR for Policy Compliance

Validating the compliance of privacy policies against regulations like GDPR is a critical issue because it ensures the protection of individuals’ personal data and maintains their privacy rights. Privacy regulations set stringent standards for data handling, requiring organizations to be transparent about data collection, usage, and storage practices. Non-compliance can lead to significant legal penalties and damage to an organization’s reputation. Moreover, ensuring compliance fosters trust between consumers and businesses, as individuals are more likely to engage with companies that respect and protect their privacy. Effective compliance validation also helps organizations avoid data breaches and misuse, thereby safeguarding sensitive information and enhancing overall data security.

We evaluate PolicyLR’s performance on policy compliance tasks using two annotated privacy policy datasets. Liu et al.  (Liu et al., 2021) extract compliance rules from GDPR Article 13. Analyzing our OPP-115 taxonomy, we find that 5555 out of the 9999 rules can be mapped on the taxonomy. Out of these rules, we discard Collect Personal Data \rightarrow Data Processing Purpose as there are very few instances for this in the ground truth. For both datasets, we evaluate policies on the compliance rules, evaluating a total of 419419419419 privacy policies. For each compliance rule, we first represent it as a composition of PolicyLR’s atomic formulae. We note that for some rules, there can be multiple valid compositions. This is because some rules in natural language can be vague and have multiple interpretations. Rules formed using the atomic formulae, on the other hand, are precisely defined by the logic system. PolicyLR, then evaluates each formula corresponding to each compliance rule using the valuations of PolicyLR’s atomic formulae.

7.1 Compliance Dataset

To evaluate PolicyLR’s performance on the policy compliance task, we use two existing datasets – (1) Online Privacy Policies (OPP-115) dataset (Wilson et al., 2016) comprising 115 policies, and (2) AutoCompliance (Liu et al., 2021) corpus comprising 304 policies. The OPP-115 dataset provides 23K sentence-level annotations based on the OPP-115 taxonomy. The annotations are at two levels: the first level consists of paragraph-sized segments annotated as per the high-level categories of the taxonomy. The second level includes parts of segments annotated for attribute-value pairs such as retention-period: limited, purpose: advertising, etc.

AutoCompliance (Liu et al., 2021), on the other hand, focuses on policy compliance. They analyze Article 13 of the GDPR and manually extract 10101010 labels that discuss personal information collection. They further annotate 304 policies by segmenting the policies and annotating each line of the policy as either one of the 10101010 labels or others. They then build 9999 compliance analysis rules that measure compliance of a privacy policy with the GDPR. For example, one of the rules is Collect Personal Information \rightarrow Data Retention Period, implying that if a policy collects personally identifiable information, then the data retention period must be specified. Finally, to generate the compliance dataset, they obtain policy-level annotation by aggregating the segment-level annotation.

Note that some of the rules can be decomposed into combinations of low-level categories of the OPP-115 taxonomy. For instance, "Collect Personal Info \rightarrow Contact Details" can be represented in OPP-115 taxonomy as: first-party-collection - identifiability: identifiable AND data-retention - retention-period: not unspecified. We note that while both these datasets provide sentence-level annotations, we follow the aggregation-based formulation similar to Liu et al.  (Liu et al., 2021) Linden et al.  (Linden et al., 2018) to get policy-level annotations. For instance, the data retention period annotation of the entire privacy policy can be derived using the presence of at least one sentence, which is annotated for the data retention period. We also note that while evaluating PolicyLR, we provide the entire privacy policy and use the aggregated label as the ground truth.

Compliance Rule Regulation Formula Precision Recall F1
Contact details of the data
controller should be provided
GDPR Art 13.2(a) rp=statedrp=limitedrp=indefinitelyrpstatedrplimitedrpindefinitely\texttt{rp}=\texttt{stated}\lor\texttt{rp}=\texttt{limited}\lor\texttt{rp}=% \texttt{indefinitely}rp = stated ∨ rp = limited ∨ rp = indefinitely 0.83 0.94 0.88
Contact details of the data
controller should be provided
GDPR Art 13.1(a)(b) contact-information=presentcontact-informationpresent\texttt{contact-information}=\texttt{present}contact-information = present 0.90 0.94 0.92
Users should be able to modify/delete their data GDPR Art 13.2(b) acc=editacc=deactivateacc=deleteacceditaccdeactivateaccdelete\texttt{acc}=\texttt{edit}\lor\texttt{acc}=\texttt{deactivate}\lor\texttt{acc}% =\texttt{delete}acc = edit ∨ acc = deactivate ∨ acc = delete 0.86 0.98 0.91
acc=editacc=deactivateacc=deleteacceditaccdeactivateaccdelete\texttt{acc}=\texttt{edit}\land\texttt{acc}=\texttt{deactivate}\land\texttt{% acc}=\texttt{delete}acc = edit ∧ acc = deactivate ∧ acc = delete 0.94 0.42 0.58
Table 3: Compliance Task

7.2 Compliance Results

Table 3 shows the compliance rule, the corresponding regulation, the computed formula as a composition of PolicyLR’s atomic formulae, and the performance of PolicyLR for each rule. We find that PolicyLR has an average F1 score of 0.900.900.900.90 with an average recall of 0.950.950.950.95, outperforming Autocompliance by 4%percent44\%4 %. We note that PolicyLR uses off-the-shelf open-source LLM as opposed to Autocompliance, which trains custom classifiers for each of the rules. Further, we observe that in Autocompliance, training data for the classifiers is also used for compliance analysis, whereas in PolicyLR, we do not perform any training and use the entire dataset for evaluation.

We also analyze PolicyLR’s performance on the Right to Access compliance rule (Collect Personal Information \rightarrow Right to Access), achieving an F1-score of only 0.510.510.510.51. Upon closer examination of the dataset, we find that several annotations for this class appear to be incorrect. This discrepancy might be attributed to the inherent similarity between the Right to Access and the Right to Delete, as a similar trend is observed within the OPP-115 dataset. Consequently, we opt to exclude the Right to Access rule from this analysis for the sake of clarity.

Compliance Rule Regulation Formula Precision Recall F1
Data retention period
should be specified
GDPR Art 13.2(a) rp=statedrp=limitedrp=indefinitelyrpstatedrplimitedrpindefinitely\texttt{rp}=\texttt{stated}\lor\texttt{rp}=\texttt{limited}\lor\texttt{rp}=% \texttt{indefinitely}rp = stated ∨ rp = limited ∨ rp = indefinitely 0.88 0.73 0.80
Contact details of the data
controller should be provided
GDPR Art 13.1(a)(b) contact-information=presentcontact-informationpresent\texttt{contact-information}=\texttt{present}contact-information = present 0.98 0.95 0.97
Users should be able to modify/delete their data GDPR Art 13.2(b) acc=editacc=deactivateacc=deleteacceditaccdeactivateaccdelete\texttt{acc}=\texttt{edit}\lor\texttt{acc}=\texttt{deactivate}\lor\texttt{acc}% =\texttt{delete}acc = edit ∨ acc = deactivate ∨ acc = delete 0.88 0.95 0.92
acc=editacc=deactivateacc=deleteacceditaccdeactivateaccdelete\texttt{acc}=\texttt{edit}\land\texttt{acc}=\texttt{deactivate}\land\texttt{% acc}=\texttt{delete}acc = edit ∧ acc = deactivate ∧ acc = delete 1.00 0.41 0.58
Table 4: Compliance Task on privacy policies from OPP-115 dataset.

Next, we present PolicyLR’s performance on the OPP-115 compliance dataset in Table 4. We find that PolicyLR achieves an average recall of 0.880.880.880.88 and an average F1-score of 0.920.920.920.92. It is worth noting that PolicyLR’s performance on the data retention compliance rule falls below the overall average. This can be attributed to the inherent limitations within the OPP-115 dataset, where data quality for data retention information is known to be low. Prior research has documented similar challenges; for instance, Polisis (Harkous et al., 2018) reports an F1-score of only 0.71 for the same rule using a classifier trained on OPP-115. We manually analyzed the errors and found that the majority of them were due to incorrect annotation. For example, the segment: Email share feature may be saved for your convenience for future articles you may wish to email was tagged as retention-period: limited.

Error Analysis. Investigating the errors in the compliance task for both datasets, we find that errors can be categorized into two classes below.

LLM Reasoning Error. In these errors, we observe that the LLM extrapolates in the reasoning and reaches the wrong conclusions. Such instances are not common and generally indicate incorrect assumptions by the LLM. For example, consider the following task for alibaba.com:

LLM Reasoning Error Hypothesis: Users can delete their account, but the company/organization may continue to keep some of the user’s data. Response: Yes. … The privacy policy states that We retain your Personal Data as long as we have an ongoing legitimate business need to do so… When we have no ongoing legitimate business need to process your Personal Data, we will either delete or anonymise it… This implies that users can request to delete their account… company may continue to keep some of user’s data …

Here, the policy never mentions that users can request to delete the data, but the LLM infers that because the company may delete the data on their own. We note that these errors can be potentially addressed by choosing a more capable LLM, such as GPT4 or Gemma2.

Annotation Error. These are the errors where the human annotators made a mistake and mislabeled a segment, but the LLM correctly categorizes it. For example:

Annotation Error Case: Data retention is limited, meaning data is deleted, anonymized, or aggregated at some point, but no specific retention period is stated. Response: Yes. … policy does mention that data will be deleted or removed in certain circumstances. For example, it states that When your account is deleted, your user profile and all public activities (comments, conversation topics) will be removed. …

Here, the LLM correctly infers that since the account data is deleted when the account is deleted, the retention period is limited.

Note here that we used the open source LLM Llama-70b to perform compliance analyses in this section. The performance of closed-source models, such as GPT-4 and Gemini, is believed to be better than open-source models. This suggests that performance can be further improved by using a better model.

Precision-Recall Tradeoff. We acknowledge the inherent ambiguity present in natural language queries. This ambiguity can lead to multiple valid formulas representing a single compliance rule, as shown in Table 3. The rule, "Users should be able to modify/delete their data," can be translated into two distinct formulas.

acc=editacc=deactivateacc=deleteacceditaccdeactivateaccdelete\texttt{acc}=\texttt{edit}\lor\texttt{acc}=\texttt{deactivate}\lor\texttt{acc}% =\texttt{delete}acc = edit ∨ acc = deactivate ∨ acc = delete
acc=editacc=deactivateacc=deleteacceditaccdeactivateaccdelete\texttt{acc}=\texttt{edit}\land\texttt{acc}=\texttt{deactivate}\land\texttt{% acc}=\texttt{delete}acc = edit ∧ acc = deactivate ∧ acc = delete

These formulas offer a mechanism for controlling the desired level of precision in our analysis. For instance, the first formula allows for any combination of edit access, delete access, or account deactivation to satisfy the rule. Conversely, the second formula implements a stricter interpretation, requiring all three functionalities to be present. This interplay between formula composition and performance reflects the precision-recall tradeoff - as the level of restrictiveness (precision) increases, the ability to identify compliant policies (recall) decreases. By allowing the creation of multiple formulas for a single rule, the user can tailor the analysis to specific requirements.

Refer to caption
Figure 2: Number of atomic formulae where the valuation is different between historical and current version of privacy policies of 31 Google Play Apps. The large number of changes are due to the introduction of the GDPR regulations. PolicyLR provides a way to perform a fine-grained analysis of the evolution of privacy policies in response to new regulations.

We note that prior works in compliance analysis create structured rules (Linden et al., 2018) or train individual classifiers (Liu et al., 2021). However, these approaches perform classification at a segment level and do not consider the full context of policy texts. PolicyLR addresses these limitations and leverages the truth table to efficiently perform the compliance analysis. Additionally, the existing approaches only cater to a subset of privacy regulations. PolicyLR allows a more comprehensive analysis by leveraging the joint distribution of privacy practices.

8 PolicyLR for Policy Consistency

Evaluating the consistency of privacy policies is an important and well-studied problem (Linden et al., 2018). It involves comparing the practices of two policy documents, which could be different versions of the same privacy policy or belong to different applications. Consistency among different policy texts of the same app is essential to ensure clarity in privacy practices. Comparing different historical versions of the same policy text can help in understanding the evolution of practices over time. Finally, comparing policies from different apps can help enable privacy-based comparative shopping (König and Hansen, 2012), where users can pick and choose apps based on their privacy preferences.

Prior works (Linden et al., 2018; Wagner, 2023) have studied the evolution of privacy policy by creating structured rules for coverage, specificity and compliance, and comparing the practices along these dimensions before- and after- GDPR. However, their main limitation stems from using sentence-level natural language processing techniques, which are not capable of performing joint classification on complex aspects like data retention period for location data used for advertising purposes. Prior work on comparing practices across documents also suffers from similar limitations (König and Hansen, 2012; Khandelwal et al., 2023).

PolicyLR’s logical representation provides a formal way to perform consistency analysis (section 8). Using PolicyLR, we can directly compare the atomic formulae valuations of two different versions of a policy document. PolicyLR overcomes the limitations of prior work by incorporating a joint distribution of privacy attributes, which are automatically integrated from existing policy taxonomies. Below, we describe two applications of the consistency formulation of PolicyLR: a) Evolution of Privacy Practices Over time, and b) Privacy Comparative Shopping.

8.1 Policy Dataset for Consistency Analysis

To demonstrate how PolicyLR can assist with the consistency analysis of privacy policies, we curated a dataset of popular privacy policy texts from play.google.com. For differential analysis, we first manually collected privacy policy URLs for the top 10 applications from play.google.com for each of the following categories – Communication, Dating, Entertainment, Health, Parenting and Social. We found a total of 54545454 unique policy URLs among all these categories. Note that different apps from the same parent company can have an identical privacy policy, resulting in fewer unique policies. For example, Instagram and Meta have identical privacy policies.

To curate a dataset for the evolution of privacy practices over time, we use the Wayback Machine555https://web.archive.org/ from the Internet Archive to get the historical versions of the privacy policies for the apps above. For an earlier timestamp, we purposefully choose the latest policy before 2018 to obtain a pre-GDPR version of the policy. As described by Linden et al. (Linden et al., 2018), the majority of the updates occurred in 2018. Selecting a pre-GDPR version of the policy allows us to observe the effect of the GDPR. Interestingly, due to website migrations, we find that several websites’ current privacy policy URLs did not exist back in 2017. To get the old policies for these cases, we manually crawled the homepage using Wayback Machine and identified the privacy policies. Following this approach, we found a valid historical version for 31313131 out of 54545454 URLs. We downloaded the raw HTML for each of the collected URLs and used Beautiful Soup666https://www.crummy.com/software/BeautifulSoup/ library to extract the associated text. This resulted in a total of 85858585 unique privacy policy texts.

Next, we discuss the two downstream applications of the consistency framework of PolicyLR.

8.2 Evolution of Privacy Practices Over Time

We demonstrate using PolicyLR to analyze the evolution of privacy practices of mobile apps over time. This is an important analysis because by tracking changes in disclosure of privacy practices, we can get insights into how companies adapt to evolving legal landscapes. This analysis can also assist policy auditors in the enforcement of regulations to ensure effective user privacy protection. Additionally, it can provide the user with a better understanding of how their data is handled by the apps they utilize.

We analyze the Policy Dataset for Consistency Analysis 8.1 for this task. Specifically, we analyze pre-GDPR and post-GDPR policies of the 31313131 apps in the dataset. The dataset consists of 31 top applications on the Play Store. We focus on three high-level categories from the OPP-115 taxonomy, namely, Data Retention, User Access Edit and Deletion, and Policy Change. Recall that high-level categories in OPP-115 are independent of each other, and therefore, any combination of these high-level categories is a valid set of atomics for PolicyLR. We choose to restrict ourselves to only these three categories because the number of atomic formulae can grow exponentially with the number of leaf nodes.

Results. We implement PolicyLR using the open source Llama-70B model as the compiler with the custom taxonomy defined above. We then take the two valuation functions and compare the valuation of each atomic formula to identify the inconsistencies. Figure 2 shows the distribution of changes observed in privacy practices across the three high-level categories for each app. Notably, these categories exhibit some overlap with the information mandated by GDPR Article 13. We find that policies across all app categories have changed over time. For instance, the "User Access" category witnessed the most significant change in the case of TikTok, followed by Hinge and FindMyKids. This indicates that these applications potentially made substantial adjustments to user access controls in response to the regulation. We confirmed this by manually checking the policies. Conversely, Microsoft’s privacy policies exhibited minimal changes between the pre- and post-GDPR versions. Similarly, Amazon’s policies remained largely unchanged in the "Data Retention" category, while some adjustments were made in the "User Access" category.

Examples of Changes in Practices. We manually select the following two apps: Facebook and Tiktok and show examples of policy change. For example - in 2017, Facebook’s policy mentioned that they can retain information indefinitely for security reasons. However, the latest policy mentions that the data can be deleted upon request.

Facebook Messanger: Change in Data Retention Period Pre-GDPR: …We may also access, preserve and share information when we have a good faith belief it is necessary to: detect, prevent and address fraud and other illegal activity… Post-GDPR: We keep information for as long as we need it to provide a feature or service. But you can request that we delete your information. We’ll delete that information unless we have to keep it for something else, like for legal reasons.

Similarly, Tiktok’s policy in 2017 did not have any means by which user could delete their data, whereas in 2024, they include a full section on Your Rights.

Facebook Messanger: Change in Data Retention Period Pre-GDPR: …No mention of user rights… Post-GDPR: You may submit a request to know, access, correct or delete the information we have collected from or about you…You may also exercise your rights to know, access, correct, delete, or appeal by sending your request to the physical address…

We presented here a proof-of-concept analysis that showcased the versatility of PolicyLR by analyzing the evolution of privacy practices over time. We acknowledge the small size of the Policy Dataset as a limitation and leave the full-scale measurement analysis for future work. Our intention here was to perform a case study showing that PolicyLR can be used to comprehensively analyze the evolution of privacy practices over time. For example, no other framework is currently capable of identifying the joint distribution of privacy practices in a policy.

8.3 Privacy Comparative Shopping

We now demonstrate how PolicyLR can be used as the basis for a privacy-centric shopping assistant by allowing users to compare apps based on their privacy preferences. Users can select specific privacy dimensions such as data retention, data sharing, and encryption practices, and the system will group apps accordingly, presenting a clear comparison of their privacy policies. For example, consider Alice, a privacy-conscious person, is considering using a messaging app and is confused between Signal and WhatsApp as they offer similar functionalities. Before making a decision, Bob may want to compare the Data Retention practices of the two apps. PolicyLR can perform this analysis and provide this information.

We show a proof-of-concept version of this application using the most recent privacy policies from the Datenschutzbestimmungen dataset. This involves performing a holistic comparison of practices across apps. PolicyLR’s consistency formulation provides a natural way to compare privacy policies by directly comparing PolicyLR’s atomic formulae valuations. Table 5 compares the data-retention practices of 8888 most popular Google Play apps from the Communication category – Whatsapp, Snapchat, Messenger, Telegram, Signal, Discord, Textnow and Zangi. We compare apps across two attributes of data-retention: purpose and retention-period. We place an app’s icon in a cell if the valuation of the corresponding formula is true.

purpose     retention-period
    unspecified indef limited stated
\hlineB2
service
security
   
[Uncaptioned image][Uncaptioned image][Uncaptioned image]
[Uncaptioned image][Uncaptioned image][Uncaptioned image]
[Uncaptioned image]
legal     [Uncaptioned image][Uncaptioned image][Uncaptioned image] [Uncaptioned image]
analytics     [Uncaptioned image][Uncaptioned image] [Uncaptioned image][Uncaptioned image][Uncaptioned image][Uncaptioned image]
advertising     [Uncaptioned image][Uncaptioned image][Uncaptioned image] [Uncaptioned image][Uncaptioned image]
unspecified     [Uncaptioned image][Uncaptioned image]
[Uncaptioned image][Uncaptioned image][Uncaptioned image]
[Uncaptioned image][Uncaptioned image][Uncaptioned image]
\hlineB2    
Table 5: Privacy Comparison Shopping on Top Apps from the Communication Category. Each cell represents a specific purpose and retention-period setting. Such a fine-grained analysis can help users to choose applications based on their preferred privacy settings.

We emphasize that existing sentence-classification-based approaches are not able to achieve this granularity as they classify these attributes individually. PolicyLR, on the other hand, provides a more fine-grained view by considering the joint distributions of these practices. For instance, individual analysis would describe retention-period of Whatsapp as limited. However, this misses out on the information that data collected for specific purposes like perform-service or security is actually retained for an unspecified period of time.

From Table 5, we can observe the following trends, which can be useful for a user when comparing communication apps:

  1. 1.

    Whatsapp, Discord and Signal have better data retention practices as compared to Snapchat and Messenger.

  2. 2.

    Only Textnow specifies a retention-period for the data retained for perform-service or security purposes. However, it is not transparent about data retained for advertising or analytics.

  3. 3.

    None of the apps retain any data for an indefinite period of time.

9 Limitations

LLM and RAG Limitations. PolicyLR uses both Large Language Models (LLMs) and the Retrieval-Augmented Generation (RAG) techniques to generate the truth table and inherits limitations associated with these. The entailment task employed relies on retrieving relevant context from a vector database of policy chunks (Figure. 1). Errors within this retrieval pipeline can propagate throughout the system, potentially compromising the effectiveness of PolicyLR. Furthermore, entailment tasks populate the truth table, requiring LLMs to perform evidence-based reasoning. As we observed in subsection 6.4, LLMs can exhibit nit-picky behavior, overlooking common contextual cues. Additionally, they can make erroneous inferences, as seen in subsection 6.4. Therefore, LLMs may struggle with the inherent ambiguity of natural language and misinterpret the meaning of contextual phrases or clauses within privacy policies. A potential mitigation strategy here could be to fine-tune LLMs, specifically on privacy entailment tasks, and use it as the compiler to generate the truth table. However, this approach requires significant training data and computational resources.

Scaling number of atomic formulae. The size and complexity of the atomic formulae in PolicyLR are dependent on the underlying privacy policy taxonomy. As the number of attributes within the taxonomy increases, the number of atomic formulae grows exponentially. This can lead to significant computational costs, potentially impacting the scalability of the framework. While this work focuses on providing a framework for downstream tasks given a pre-defined taxonomy, building an optimal taxonomy remains an open challenge. We acknowledge this limitation and emphasize that PolicyLR facilitates various applications once a suitable taxonomy is established. One potential solution is to upper bound the number of attributes that can be part of an atomic formula, thus reducing the cardinality of the joint distribution. For instance, consider a high-level category with 3333 attributes where each can assume 10101010 different values. The cardinality of the joint distribution of all three attributes will be 1000100010001000. However, if we only allow almost 2222 attributes in the joint distribution, we get 3333 different pairings among attributes, each with a cardinality of 100100100100. Overall, this reduces the number of rows in the truth table from 1000100010001000 to 300300300300. This, however, gives a less fine-grained representation as compared to the 3333 attribute joint distribution. Therefore, this provides a way to balance efficiency and utility.

Assumptions on Taxonomy. For the truth table to serve as a comprehensive representation of a policy, the taxonomy must contain all relevant privacy-related topics. As seen in Section 7, incomplete taxonomies can lead to incomplete truth tables. In this case, certain compliance rules were not directly mapped to the taxonomy. Therefore, we were not able to check compliance for those rules. We reiterate that this work focuses on the framework’s development and downstream applications, assuming a well-defined taxonomy. A potential mitigation strategy for incomplete taxonomies involves leveraging taxonomy completion tools (Shi et al., 2024). These tools can automatically identify and fill in missing nodes within the taxonomy, potentially improving the comprehensiveness of the truth table and its efficacy in supporting downstream applications.

Vulnerability to policy poisoning attacks. Recent work has demonstrated that retrieval-augmented LLMs are vulnerable to poisoning attacks (Chaudhari et al., 2024; Zhong et al., 2023; Zou et al., 2024). These attacks add carefully crafted triggers to the input documents that can lead to adversary-controlled misbehavior of both the embedding as well as the response LLM. These attacks can also be adapted to target systems that comprise multiple LLM instances (Mangaokar et al., 2024). Such an attack trigger, when added to a privacy policy document, can lead to incorrect formulae valuation and subsequently affect the downstream applications. Defense against such attacks that target ML models is still an open problem.

10 Discussion

Modular Design. PolicyLR has two main components: instantiating the grammar to get a set of atomic formulae and implementing a valuation function based on the NLI entailment task. In this paper, we initialize the grammar using the OPP-115 taxonomy and implement the valuation function using retrieval-augmented LLMs. However, both these components can also be performed using alternate implementations. PolicyLR’s grammar can also be instantiated using other taxonomies like MAPP (Arora et al., 2022) and Privacy label taxonomy (Khandelwal et al., 2023). Since PolicyLR automates the extraction of atomic formulae from any taxonomy, PolicyLR’s grammar can be easily extended to incorporate new taxonomies as well as adapt to changes in existing ones. PolicyLR’s valuation function is based on entailment, which is a well-studied NLI task. Therefore, it can alternatively be implemented using models that specialize in NLI tasks (Wang et al., 2021). This also means that PolicyLR can benefit from any future advancements in the NLI.

Privacy Policies and Source Code. Prior work has studied the consistency between privacy policies and the source code implementation of their corresponding apps (Slavin et al., 2016). However, they require manually creating a mapping between code constructs and privacy concepts. Given the code understanding capability of LLMs, PolicyLR provides a way to automate this analysis.

Beyond Privacy Policies. PolicyLR’s methodology can be used to compare any unstructured data, not just privacy policy documents. Given the relevant taxonomy, the PolicyLR framework can construct the corresponding atomic formulae and their valuations. Given the advancement in multi-modal LLMs, this can also include other modalities like vision and audio. Another interesting application of PolicyLR’s consistency formulation is hallucination detection. Given a set of atomic formulae, an LLM’s generated output should have valuations that are consistent with its input context. Any inconsistencies are likely the result of hallucinations or reasoning errors.

11 Conclusion

In conclusion, PolicyLR advances automated privacy policy analysis by converting their complex text into a machine-readable format using valuations of atomic formulae. We implement a compiler for PolicyLR using off-the-shelf open-source LLMs and embedding models to evaluate a complex set of logical formulae based on the full text of a policy. This compiler achieves high precision and recall on the ToD;DR dataset. PolicyLR’s applications in policy compliance, inconsistency detection, and privacy comparison shopping demonstrate its potential to make privacy policy analysis more accessible and understandable.

References

  • Alabduljabbar et al. (2021) Abdulrahman Alabduljabbar, Ahmed Abusnaina, Ülkü Meteriz-Yildiran, and David Mohaisen. Tldr: deep learning-based automated privacy policy annotation with key policy highlights. In Proceedings of the 20th Workshop on Workshop on Privacy in the Electronic Society, pages 103–118, 2021.
  • Ali et al. (2023) Syed Musharraf Ali, Tobias Deußer, Sebastian Houben, L. Hillebrand, Tim Metzler, and R. Sifa. Automatic consistency checking of table and text in financial documents. Proceedings of the Northern Lights Deep Learning Workshop, 2023. doi: 10.7557/18.6816.
  • Amos et al. (2021) Ryan Amos, Gunes Acar, Eli Lucherini, Mihir Kshirsagar, Arvind Narayanan, and Jonathan Mayer. Privacy policies over time: Curation and analysis of a million-document dataset. In Proceedings of the Web Conference 2021, pages 2165–2176, 2021.
  • (4) Benjamin Andow, Samin Yaseer Mahmud, Wenyu Wang, Justin Whitaker, William Enck, Bradley Reaves, Kapil Singh, and Tao Xie. Policylint: Investigating internal privacy policy contradictions on google play.
  • Andow et al. (2019) Benjamin Andow, Samin Yaseer Mahmud, Wenyu Wang, Justin Whitaker, William Enck, Bradley Reaves, Kapil Singh, and Tao Xie. {{\{{PolicyLint}}\}}: investigating internal privacy policy contradictions on google play. In 28th USENIX security symposium (USENIX security 19), pages 585–602, 2019.
  • Arora et al. (2022) Siddhant Arora, Henry Hosseini, Christine Utz, Vinayshekhar K Bannihatti, Tristan Dhellemmes, Abhilasha Ravichander, Peter Story, Jasmine Mangat, Rex Chen, Martin Degeling, et al. A tale of two regulatory regimes: Creation and analysis of a bilingual privacy policy corpus. In LREC proceedings, 2022.
  • Bhatia and Breaux (2018) Jaspreet Bhatia and Travis D Breaux. Semantic incompleteness in privacy policy goals. In 2018 IEEE 26th International Requirements Engineering Conference (RE), pages 159–169. IEEE, 2018.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • California State Legislature (2018) California State Legislature. California Consumer Privacy Act of 2018. https://oag.ca.gov/privacy/ccpa, 2018. Accessed: 2024-07-10.
  • Chaudhari et al. (2024) Harsh Chaudhari, Giorgio Severi, John Abascal, Matthew Jagielski, Christopher A Choquette-Choo, Milad Nasr, Cristina Nita-Rotaru, and Alina Oprea. Phantom: General trigger attacks on retrieval augmented language generation. arXiv preprint arXiv:2405.20485, 2024.
  • Cranor (2002) Lorrie Cranor. Web privacy with P3P. " O’Reilly Media, Inc.", 2002.
  • Cui et al. (2023) Hao Cui, Rahmadi Trimananda, Athina Markopoulou, and Scott Jordan. PoliGraph: Automated privacy policy analysis using knowledge graphs. In 32nd USENIX Security Symposium (USENIX Security 23), pages 1037–1054, Anaheim, CA, August 2023. USENIX Association. ISBN 978-1-939133-37-3. URL https://www.usenix.org/conference/usenixsecurity23/presentation/cui.
  • El-Kassas et al. (2021) Wafaa S El-Kassas, Cherif R Salama, Ahmed A Rafea, and Hoda K Mohamed. Automatic text summarization: A comprehensive survey. Expert systems with applications, 165:113679, 2021.
  • European Parliament and Council of the European Union (2016) European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). https://eur-lex.europa.eu/eli/reg/2016/679/oj, 2016. Accessed: 2024-07-10.
  • Gao et al. (2023) Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
  • Harkous et al. (2018) Hamza Harkous, Kassem Fawaz, Rémi Lebret, Florian Schaub, Kang G Shin, and Karl Aberer. Polisis: Automated analysis and presentation of privacy policies using deep learning. In 27th USENIX Security Symposium (USENIX Security 18), pages 531–548, 2018.
  • Jain et al. (2023) Akshath Jain, David Rodriguez, Jose M Del Alamo, and Norman Sadeh. Atlas: Automatically detecting discrepancies between privacy policies and privacy labels. In 2023 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW), pages 94–107. IEEE, 2023.
  • Kelley et al. (2009) Patrick Gage Kelley, Joanna Bresee, Lorrie Faith Cranor, and Robert W Reeder. A" nutrition label" for privacy. In Proceedings of the 5th Symposium on Usable Privacy and Security, pages 1–12, 2009.
  • Khandelwal et al. (2023) Rishabh Khandelwal, Asmit Nayak, Paul Chung, and Kassem Fawaz. The overview of privacy labels and their compatibility with privacy policies. arXiv preprint arXiv:2303.08213, 2023.
  • König and Hansen (2012) Ulrich König and Marit Hansen. Extending comparison shopping sites by privacy information on retailers. In Privacy and Identity Management for Life: 7th IFIP WG 9.2, 9.6/11.7, 11.4, 11.6/PrimeLife International Summer School, Trento, Italy, September 5-9, 2011, Revised Selected Papers 7, pages 171–186. Springer, 2012.
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • Li et al. (2024) Tianle Li, Ge Zhang, Quy Duc Do, Xiang Yue, and Wenhu Chen. Long-context llms struggle with long in-context learning. arXiv preprint arXiv:2404.02060, 2024.
  • Li and Li (2023) Xianming Li and Jing Li. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871, 2023.
  • Liao et al. (2024) Song Liao, Mohammed Aldeen, Jingwen Yan, Long Cheng, Xiapu Luo, Haipeng Cai, and Hongxin Hu. Understanding gdpr non-compliance in privacy policies of alexa skills in european marketplaces. In Proceedings of the ACM on Web Conference 2024, pages 1081–1091, 2024.
  • Linden et al. (2018) Thomas Linden, Rishabh Khandelwal, Hamza Harkous, and Kassem Fawaz. The privacy policy landscape after the gdpr. arXiv preprint arXiv:1809.08396, 2018.
  • Liu et al. (2021) Shuang Liu, Baiyang Zhao, Renjie Guo, Guozhu Meng, Fan Zhang, and Meishan Zhang. Have you been properly notified? automatic compliance analysis of privacy policy text with gdpr article 13. In Proceedings of the Web Conference 2021, pages 2154–2164, 2021.
  • MacCartney (2009) Bill MacCartney. Natural language inference. Stanford University, 2009.
  • Manandhar et al. (2024) Sunil Manandhar, Kapil Singh, and Adwait Nadkarni. Towards automated regulation analysis for effective privacy compliance. In Network and Distributed Systems Security Symposium (NDSS), 2024.
  • Mangaokar et al. (2024) Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, and Atul Prakash. Prp: Propagating universal perturbations to attack large language model guard-rails. arXiv preprint arXiv:2402.15911, 2024.
  • McDonald and Cranor (2008) Aleecia M McDonald and Lorrie Faith Cranor. The cost of reading privacy policies 2008 privacy year in review. i/s: A journal of law and policy for the information society privacy year in review (2008), 543–568, 2008.
  • Mori et al. (2022) Keika Mori, Tatsuya Nagai, Yuta Takata, and Masaki Kamizono. Analysis of privacy compliance by classifying multiple policies on the web. In 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), pages 1734–1741, 2022. doi: 10.1109/COMPSAC54236.2022.00276.
  • Nejad et al. (2018) Najmeh Mousavi Nejad, Simon Scerri, and Jens Lehmann. Knight: Mapping privacy policies to gdpr. In European Knowledge Acquisition Workshop, pages 258–272. Springer, 2018.
  • Oltramari et al. (2018) Alessandro Oltramari, Dhivya Piraviperumal, Florian Schaub, Shomir Wilson, Sushain Cherivirala, Thomas B Norton, N Cameron Russell, Peter Story, Joel Reidenberg, and Norman Sadeh. Privonto: A semantic framework for the analysis of privacy policies. Semantic Web, 9(2):185–203, 2018.
  • Pollach (2007) Irene Pollach. What’s wrong with online privacy policies? Communications of the ACM, 50(9):103–108, 2007.
  • Rabinia and Nygaard (2022) Amin Rabinia and Zane Nygaard. Compliance checking with nli: Privacy policies vs. regulations. ArXiv, abs/2204.01845, 2022. doi: 10.48550/arXiv.2204.01845.
  • Ravichander et al. (2019) Abhilasha Ravichander, Alan W Black, Shomir Wilson, Thomas Norton, and Norman Sadeh. Question answering for privacy policies: Combining computational and legal perspectives. arXiv preprint arXiv:1911.00841, 2019.
  • Regulation (2016) Protection Regulation. Regulation (eu) 2016/679 of the european parliament and of the council. Regulation (eu), 679:2016, 2016.
  • Reidenberg et al. (2015) Joel R Reidenberg, Travis Breaux, Lorrie Faith Cranor, Brian French, Amanda Grannis, James T Graves, Fei Liu, Aleecia McDonald, Thomas B Norton, Rohan Ramanath, et al. Disagreeable privacy policies: Mismatches between meaning and users’ understanding. Berkeley Tech. LJ, 30:39, 2015.
  • Rodriguez et al. (2024) David Rodriguez, Ian Yang, Jose M Del Alamo, and Norman Sadeh. Large language models: A new approach for privacy policy analysis at scale. arXiv preprint arXiv:2405.20900, 2024.
  • Roy et al. (2012) Hugo Roy, JC Borchardt, I McGowan, J Stout, and S Azmayesh. Terms of service; didn’t read. https://tosdr.org, June 2012. Web Page.
  • Sarne et al. (2019) David Sarne, Jonathan Schler, Alon Singer, Ayelet Sela, and Ittai Bar Siman Tov. Unsupervised topic extraction from privacy policies. In Companion Proceedings of The 2019 World Wide Web Conference, pages 563–568. ACM, 2019.
  • Shafei et al. (2024) Hassan A Shafei, Hongchang Gao, and Chiu C Tan. Measuring privacy policy compliance in the alexa ecosystem: In-depth analysis. Computers & Security, page 103963, 2024.
  • Shi et al. (2024) Jingchuan Shi, Hang Dong, Jiaoyan Chen, Zhe Wu, and Ian Horrocks. Taxonomy completion via implicit concept insertion. In Proceedings of the ACM on Web Conference 2024, pages 2159–2169, 2024.
  • Shvartzshnaider et al. (2018) Yan Shvartzshnaider, Noah Apthorpe, Nick Feamster, and Helen Nissenbaum. Analyzing privacy policies using contextual integrity annotations. arXiv preprint arXiv:1809.02236, 2018.
  • Slavin et al. (2016) Rocky Slavin, Xiaoyin Wang, Mitra Bokaei Hosseini, James Hester, Ram Krishnan, Jaspreet Bhatia, Travis D Breaux, and Jianwei Niu. Toward a framework for detecting privacy policy violations in android application code. In Proceedings of the 38th International conference on software engineering, pages 25–36, 2016.
  • Tan and Song (2023a) Zeya Tan and Wei Song. Ptpdroid: Detecting violated user privacy disclosures to third-parties of android apps. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 473–485, 2023a. doi: 10.1109/ICSE48619.2023.00050.
  • Tan and Song (2023b) Zeya Tan and Wei Song. Ptpdroid: Detecting violated user privacy disclosures to third-parties of android apps. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 473–485. IEEE, 2023b.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Wagner (2022) Isabel Wagner. Privacy policies across the ages: Content and readability of privacy policies 1996–2021. arXiv preprint arXiv:2201.08739, 2022.
  • Wagner (2023) Isabel Wagner. Privacy policies across the ages: content of privacy policies 1996–2021. ACM Transactions on Privacy and Security, 26(3):1–32, 2023.
  • Wang et al. (2021) Sinong Wang, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. Entailment as few-shot learner. arXiv preprint arXiv:2104.14690, 2021.
  • Wilson et al. (2016) Shomir Wilson, Florian Schaub, Aswarth Abhilash Dara, Frederick Liu, Sushain Cherivirala, Pedro Giovanni Leon, Mads Schaarup Andersen, Sebastian Zimmeck, Kanthashree Mysore Sathyendra, N. Cameron Russell, Thomas B. Norton, Eduard Hovy, Joel Reidenberg, and Norman Sadeh. The creation and analysis of a website privacy policy corpus. In Katrin Erk and Noah A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1330–1340, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1126. URL https://aclanthology.org/P16-1126.
  • Zaeem et al. (2018) Razieh Nokhbeh Zaeem, Rachel L German, and K Suzanne Barber. Privacycheck: Automatic summarization of privacy policies using data mining. ACM Transactions on Internet Technology (TOIT), 18(4):1–18, 2018.
  • Zhong et al. (2023) Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. arXiv preprint arXiv:2310.19156, 2023.
  • Zimmeck and Bellovin (2014) Sebastian Zimmeck and Steven M Bellovin. Privee: An architecture for automatically analyzing web privacy policies. In 23rd USENIX Security Symposium (USENIX Security 14), pages 1–16, 2014.
  • Zimmeck et al. (2016) Sebastian Zimmeck, Ziqi Wang, Lieyong Zou, Roger Iyengar, Bin Liu, Florian Schaub, Shomir Wilson, Norman Sadeh, Steven Bellovin, and Joel Reidenberg. Automated analysis of privacy requirements for mobile apps. In 2016 AAAI Fall Symposium Series, 2016.
  • Zimmeck et al. (2019) Sebastian Zimmeck, Peter Story, Daniel Smullen, Abhilasha Ravichander, Ziqi Wang, Joel Reidenberg, N Cameron Russell, and Norman Sadeh. Maps: Scaling privacy compliance analysis to a million apps. Proceedings on Privacy Enhancing Technologies, 2019.
  • Zou et al. (2024) Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. Poisonedrag: Knowledge poisoning attacks to retrieval-augmented generation of large language models. arXiv preprint arXiv:2402.07867, 2024.