
QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning

Authors: Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, Yuriy Brun

Abstract: Formal verification is a promising method for producing reliable software, but the difficulty of manually writing verification proofs severely limits its utility in practice. Recent methods have automated some proof synthesis by guiding a search through the proof space using a theorem prover. Unfortunately, the theorem prover provides only the crudest estimate of progress, resulting in effectively undirected search. To address this problem, we create QEDCartographer, an automated proof-synthesis tool that combines supervised and reinforcement learning to more effectively explore the proof space. QEDCartographer incorporates the proofs' branching structure, enabling reward-free search and overcoming the sparse reward problem inherent to formal verification. We evaluate QEDCartographer using the CoqGym benchmark of 68.5K theorems from 124 open-source Coq projects. QEDCartographer fully automatically proves 21.4% of the test-set theorems. Previous search-based proof-synthesis tools Tok, Tac, ASTactic, Passport, and Proverbot9001, which rely only on supervised learning, prove 9.6%, 9.8%, 10.9%, 12.5%, and 19.8%, respectively. Diva, which combines 62 tools, proves 19.2%. Comparing to the most effective prior tool, Proverbot9001, QEDCartographer produces 34% shorter proofs 29% faster, on average over the theorems both tools prove. Together, QEDCartographer and non-learning-based CoqHammer prove 30.3% of the theorems, while CoqHammer alone proves 26.6%. Our work demonstrates that reinforcement learning is a fruitful research direction for improving proof-synthesis tools' search mechanisms.

Submitted 9 September, 2024; v1 submitted 17 August, 2024; originally announced August 2024.

Comments: Published in the International Conference on Software Engineering (ICSE) 2025: Alex Sanchez-Stern, Abhishek Varghese, Zhanna Kaufman, Dylan Zhang, Talia Ringer, and Yuriy Brun, QEDCartographer: Automating Formal Verification Using Reward-Free Reinforcement Learning, in Proceedings of the 47th International Conference on Software Engineering (ICSE), 2025

arXiv:2401.12947 [pdf, other]

Transformer-Based Models Are Not Yet Perfect At Learning to Emulate Structural Recursion

Authors: Dylan Zhang, Curt Tigges, Zory Zhang, Stella Biderman, Maxim Raginsky, Talia Ringer

Abstract: This paper investigates the ability of transformer-based models to learn structural recursion from examples. Recursion is a universal concept in both natural and formal languages. Structural recursion is central to the programming language and formal mathematics tasks where symbolic tools currently excel beyond neural models, such as inferring semantic relations between datatypes and emulating program behavior. We introduce a general framework that nicely connects the abstract concepts of structural recursion in the programming language domain to concrete sequence modeling problems and learned models' behavior. The framework includes a representation that captures the general syntax of structural recursion, coupled with two different frameworks for understanding their semantics -- one that is more natural from a programming languages perspective and one that helps bridge that perspective with a mechanistic understanding of the underlying transformer architecture. With our framework as a powerful conceptual tool, we identify different issues under various set-ups. The models trained to emulate recursive computations cannot fully capture the recursion yet instead fit short-cut algorithms and thus cannot solve certain edge cases that are under-represented in the training distribution. In addition, it is difficult for state-of-the-art large language models (LLMs) to mine recursive rules from in-context demonstrations. Meanwhile, these LLMs fail in interesting ways when emulating reduction (step-wise computation) of the recursive function.

Submitted 23 January, 2024; originally announced January 2024.

Comments: arXiv admin note: text overlap with arXiv:2305.14699

arXiv:2310.06959 [pdf, ps, other]

Proof Repair across Quotient Type Equivalences

Authors: Cosmo Viola, Max Fan, Talia Ringer

Abstract: Proofs in proof assistants like Coq can be brittle, breaking easily in response to changes in the terms and types those proofs depend on. To address this, recent work introduced an algorithm and tool in Coq to automatically repair broken proofs in response to changes that correspond to type equivalences. However, many changes remained out of the scope of this algorithm and tool -- especially changes in underlying behavior. We extend this proof repair algorithm so that it can express certain changes in behavior that were previously out of scope. We focus in particular on equivalences between quotient types -- types equipped with a relation that describes what it means for any two elements of that type to be equal. Quotient type equivalences can be used to express interesting changes in representations of mathematical structures, as well as changes in the underlying implementations of data structures -- two use cases highlighted by our case studies. We extend this algorithm to support quotient type equivalences in two different ways: (1) internally to cubical type theory (applied to Cubical Agda), and (2) externally to CIC$_ω$ (applied to Coq). While our approach in Coq comes equipped with prototype automation, it suffers notably from Coq's lack of quotient types -- something we circumvent using Coq's setoid machinery and an extension to the proof repair algorithm to support the corresponding new proof obligations. In contrast, while our approach in Cubical Agda is completely manual, it takes advantage of cubical type theory's internal quotient types, which makes the algorithm straightforward. Furthermore, it includes the first internal proofs of correctness of repaired proofs, something not possible in general in Coq. We report on the tradeoffs between these two approaches, and demonstrate these tradeoffs on proof repair case studies for previously unsupported changes.

Submitted 18 March, 2024; v1 submitted 10 October, 2023; originally announced October 2023.

Comments: for associated code, see https://github.com/InnovativeInventor/proof-repair-quotients

arXiv:2305.14699 [pdf, other]

Can Transformers Learn to Solve Problems Recursively?

Authors: Shizhuo Dylan Zhang, Curt Tigges, Stella Biderman, Maxim Raginsky, Talia Ringer

Abstract: Neural networks have in recent years shown promise for helping software engineers write programs and even formally verify them. While semantic information plays a crucial part in these processes, it remains unclear to what degree popular neural architectures like transformers are capable of modeling that information. This paper examines the behavior of neural networks learning algorithms relevant to programs and formal verification proofs through the lens of mechanistic interpretability, focusing in particular on structural recursion. Structural recursion is at the heart of tasks on which symbolic tools currently outperform neural models, like inferring semantic relations between datatypes and emulating program behavior. We evaluate the ability of transformer models to learn to emulate the behavior of structurally recursive functions from input-output examples. Our evaluation includes empirical and conceptual analyses of the limitations and capabilities of transformer models in approximating these functions, as well as reconstructions of the "shortcut" algorithms the model learns. By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions. Our work provides a new foundation for understanding the behavior of neural networks that fail to solve the very tasks they are trained for.

Submitted 25 June, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

arXiv:2305.04369 [pdf, other]

Getting More out of Large Language Models for Proofs

Authors: Shizhuo Dylan Zhang, Talia Ringer, Emily First

Abstract: Large language models have the potential to simplify formal theorem proving and make it more accessible. But how to get the most out of these models is still an open question. To answer this question, we take a step back and explore the failure cases of these models using common prompting-based techniques. Our talk will discuss these failure cases and what they can teach us about how to get more out of these models.

Submitted 31 May, 2023; v1 submitted 7 May, 2023; originally announced May 2023.

arXiv:2303.04910 [pdf, other]

Baldur: Whole-Proof Generation and Repair with Large Language Models

Authors: Emily First, Markus N. Rabe, Talia Ringer, Yuriy Brun

Abstract: Formally verifying software properties is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time, and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once, rather than one step at a time. We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power. As its main contributions, this paper demonstrates for the first time that: (1) Whole-proof generation using transformers is possible and is as effective as search-based techniques without requiring costly search. (2) Giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair and further improves automated proof generation. (3) We establish a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs. In addition to empirically showing the effectiveness of whole-proof generation, repair, and added context, we show that Baldur improves on the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.

Submitted 15 March, 2023; v1 submitted 8 March, 2023; originally announced March 2023.

arXiv:2208.04738 [pdf, other]

Long-Term Mentoring for Computer Science Researchers

Authors: Emily Ruppel, Sihang Liu, Elba Garza, Sukyoung Ryu, Alexandra Silva, Talia Ringer

Abstract: Early in the pandemic, we -- leaders in the research areas of programming languages (PL) and computer architecture (CA) -- realized that we had a problem: the only way to form new lasting connections in the community was to already have lasting connections in the community. Both of our academic communities had wonderful short-term mentoring programs to address this problem, but it was clear that we needed long-term mentoring programs. Those of us in CA approached this scientifically, making an evidence-backed case for community-wide long-term mentoring. In the meantime, one of us in PL had impulsively launched an unofficial long-term mentoring program, founded on chaos and spreadsheets. In January 2021, the latter grew to an official cross-institutional long-term mentoring program called SIGPLAN-M; in January 2022, the former grew to Computer Architecture Long-term Mentoring (CALM). The impacts have been strong: SIGPLAN-M reaches 328 mentees and 234 mentors across 41 countries, and mentees have described it as "life changing" and "a career saver." And while CALM is in its pilot phase -- with 13 mentors and 21 mentees across 7 countries -- it has received very positive feedback. The leaders of SIGPLAN-M and CALM shared our designs, impacts, and challenges along the way. Now, we wish to share those with you. We hope this will kick-start a larger long-term mentoring effort across all of computer science.

Submitted 17 September, 2022; v1 submitted 6 August, 2022; originally announced August 2022.

arXiv:2204.10370 [pdf, other]

doi 10.1145/3593374

Passport: Improving Automated Formal Verification Using Identifiers

Authors: Alex Sanchez-Stern, Emily First, Timothy Zhou, Zhanna Kaufman, Yuriy Brun, Talia Ringer

Abstract: Formally verifying system properties is one of the most effective ways of improving system quality, but its high manual effort requirements often render it prohibitively expensive. Tools that automate formal verification, by learning from proof corpora to suggest proofs, have just begun to show their promise. These tools are effective because of the richness of the data the proof corpora contain. This richness comes from the stylistic conventions followed by communities of proof developers, together with the logical systems beneath proof assistants. However, this richness remains underexploited, with most work thus far focusing on architecture rather than making the most of the proof data. In this paper, we develop Passport, a fully-automated proof-synthesis tool that systematically explores how to most effectively exploit one aspect of that proof data: identifiers. Passport enriches a predictive Coq model with three new encoding mechanisms for identifiers: category vocabulary indexing, subword sequence modeling, and path elaboration. We compare Passport to three existing base tools which Passport can enhance: ASTactic, Tac, and Tok. In head-to-head comparisons, Passport automatically proves 29% more theorems than the best-performing of these base tools. Combining the three Passport-enhanced tools automatically proves 38% more theorems than the three base tools together, without Passport's enhancements. Finally, together, these base tools and Passport-enhanced tools prove 45% more theorems than the combined base tools without Passport's enhancements. Overall, our findings suggest that modeling identifiers can play a significant role in improving proof synthesis, leading to higher-quality software.

Submitted 2 August, 2022; v1 submitted 21 April, 2022; originally announced April 2022.

Journal ref: ACM Transactions on Programming Languages and Systems (TOPLAS), 45(2):12:1-12:30, June 2023

arXiv:2010.00774 [pdf, other]

doi 10.1145/3453483.3454033

Proof Repair across Type Equivalences

Authors: Talia Ringer, RanDair Porter, Nathaniel Yazdani, John Leo, Dan Grossman

Abstract: We describe a new approach to automatically repairing broken proofs in the Coq proof assistant in response to changes in types. Our approach combines a configurable proof term transformation with a decompiler from proof terms to tactic scripts. The proof term transformation implements transport across equivalences in a way that removes references to the old version of the changed type and does not rely on axioms beyond those Coq assumes. We have implemented this approach in PUMPKIN Pi, an extension to the PUMPKIN PATCH Coq plugin suite for proof repair. We demonstrate PUMPKIN Pi's flexibility on eight case studies, including supporting a benchmark from a user study, easing development with dependent types, porting functions and proofs between unary and binary numbers, and supporting an industrial proof engineer to interoperate between Coq and other verification tools more easily.

Submitted 11 May, 2021; v1 submitted 2 October, 2020; originally announced October 2020.

Comments: Tool repository with code guide: https://github.com/uwplse/pumpkin-pi/blob/v2.0.0/GUIDE.md

arXiv:2003.06458 [pdf, other]

doi 10.1561/2500000045

QED at Large: A Survey of Engineering of Formally Verified Software

Authors: Talia Ringer, Karl Palmskog, Ilya Sergey, Milos Gligoric, Zachary Tatlock

Abstract: Development of formal proofs of correctness of programs can increase actual and perceived reliability and facilitate better understanding of program specifications and their underlying assumptions. Tools supporting such development have been available for over 40 years, but have only recently seen wide practical use. Projects based on construction of machine-checked formal proofs are now reaching an unprecedented scale, comparable to large software projects, which leads to new challenges in proof development and maintenance. Despite its increasing importance, the field of proof engineering is seldom considered in its own right; related theories, techniques, and tools span many fields and venues. This survey of the literature presents a holistic understanding of proof engineering for program correctness, covering impact in practice, foundations, proof automation, proof organization, and practical proof development.

Submitted 13 March, 2020; originally announced March 2020.

Comments: 183 pages, for errata see https://proofengineering.org/qed_errata.html

ACM Class: F.3.1; D.2.4; I.2.3

Journal ref: Foundations and Trends in Programming Languages, Vol. 5, No. 2-3 (Sept. 2019), pp. 102-281

Showing 1–10 of 10 results for author: Ringer, T