LOTUS: Enabling Semantic Queries with LLMs Over Tables of Unstructured and Structured Data

Liana Patel (Stanford University, USA), Siddharth Jha (UC Berkeley, USA), Carlos Guestrin (Stanford University, USA), and Matei Zaharia (UC Berkeley, USA)
Abstract.

The semantic capabilities of language models (LMs) have the potential to enable rich analytics and reasoning over vast knowledge corpora. Unfortunately, existing systems lack high-level abstractions to perform semantic queries at scale. We introduce semantic operators, a declarative programming interface that extends the relational model with composable AI-based operations for semantic queries over datasets (e.g., sorting or aggregating records using natural language criteria). Each operator can be implemented and optimized in multiple ways, opening a rich space for execution plans similar to relational operators. We implement our operators and several optimizations for them in LOTUS, an open source query engine with a Pandas-like API.

We demonstrate LOTUS’ effectiveness across a series of real applications, including fact-checking, extreme multi-label classification, and search. We find that LOTUS’ programming model is highly expressive, capturing state-of-the-art query pipelines with low development overhead across these diverse applications. Specifically, on the FEVER dataset for fact-checking, LOTUS’ programs can reproduce FacTool, a recent state-of-the-art pipeline, in a few lines of code, and implement a new pipeline with a simple change of operators that improves accuracy by 9.5%, while offering 7-34× lower execution time. In the extreme multi-label classification task on the BioDEX dataset, LOTUS reproduces state-of-the-art result quality with its join operator, while providing an efficient algorithm that runs 800× faster than a naive join. In the search and ranking application, LOTUS allows a simple composition of operators to achieve 5.9-49.4% higher nDCG@10 than the vanilla retriever and re-ranker, while also providing query efficiency, with 1.67-10× lower execution time than LM-based ranking methods used by prior works. LOTUS is publicly available at https://github.com/stanford-futuredata/lotus.


1. Introduction

The powerful semantic capabilities of modern language models (LMs) create exciting opportunities for building AI systems that reason over vast knowledge corpora. Many applications require complex reasoning over large amounts of data, including both unstructured and structured data. For example, a researcher reviewing recent ArXiv (arx, [n. d.]) preprints may want to quickly obtain a summary of relevant papers from the past week, or find the papers that report the best performance for a particular task and dataset. Similarly, a medical professional may automatically extract biomedical characteristics and candidate diagnoses from many patient reports (D’Oosterlinck et al., 2023). Likewise, organizations may wish to automatically digest lengthy internal meeting transcripts and chat histories to validate hypotheses about their business needs and productivity (dis, [n. d.]).

Each of these tasks requires a form of bulk semantic processing, where the AI system must process large amounts of data in often complex query patterns to perform the reasoning task at hand. Supporting the full generality of these applications with efficient and easily programmable query systems would have transformative impact. This prospect, however, raises two important and challenging questions: first, how should developers express semantic queries, and second, how should we design the underlying query system to achieve high efficiency and accuracy?

Figure 1. Accuracy versus execution time (log-scale) for 3 short LOTUS programs, shown as Program A, B, and C, which implement distinct query pipelines for fact-checking on the FEVER (Thorne et al., 2018) dataset. The blue circles show the performance of these un-optimized programs, and the blue star shows performance with LOTUS’ optimizations applied to the best program, Program B. For reference, we show the performance of FacTool’s implementation (Chern et al., 2023) on the dataset in red. Section 4 provides full methodology details.

Unfortunately, existing systems are insufficient for serving applications that require bulk semantic processing. Many existing LM programming frameworks (lan, [n. d.]; lla, [n. d.]) and research works (Khattab et al., 2023, 2021; Khattab and Zaharia, 2020; Lee et al., 2019; Anantha et al., 2021; Gao et al., 2023; Izacard et al., 2022; Zelikman et al., 2022; Zhang et al., 2022) provide methods, abstractions and optimizations for retrieval-augmented generation (RAG), which first performs a semantic search over the text corpus, then invokes the LM conditioned on the user question and retrieved documents. While RAG is one common and useful query pattern, it is limited to point lookups into the data corpus and assumes the user query can be answered by one or a small set of retrieved documents. However, bulk semantic processing pipelines may involve more complex patterns not captured by RAG, such as semantic aggregations or transformations over many documents. Alternatively, several systems (ver, [n. d.]; dat, [n. d.]; noa, 2023a; Liu et al., 2024b) offer and optimize LLM user-defined functions (UDFs) in SQL. This model offers a logically row-wise LM-execution model for batch processing data with arbitrary prompts in composition with arbitrary SQL queries. While useful for batch-processing applications, this model cannot support LM-reasoning patterns across rows and provides only a low level of abstraction with a simple LM() function.

Towards a declarative programming interface for bulk semantic processing, we propose semantic operators, which extend the relational model with AI-based operations that users can compose into powerful, reasoning-based query pipelines over structured and unstructured data. These operators include semantic filters, joins, rankings, aggregations, and projections, which take natural language expressions given by the programmer. To provide an implementation of these operators, we present the LOTUS (LLMs Over Tables of Unstructured and Structured data) system and programming model. LOTUS’ query engine efficiently executes queries with semantic operators using a variety of algorithms and optimizations for each operator, while abstracting away low-level details like model context length limits and choice of algorithms.

Table 1. Summary of Semantic Operators. $T$ denotes a table, $X$ and $Y$ denote arbitrary tuple types, $L[X]$ denotes a list of elements with type $X$, and $A$ denotes the type of a particular column or attribute. $l$ denotes a parameterized natural language expression (“langex” for short), which takes tuples as input and performs a function such as a predicate, an aggregation, a comparator, or a projection, depending on the operator’s signature.
Operator Description
$\mathit{sem\_filter}(l: X \rightarrow \mathit{Bool})$ Returns the tuples in a table that pass the provided langex predicate.
$\mathit{sem\_join}(t: T,\ l: (X, Y) \rightarrow \mathit{Bool})$ Joins a table against a second table $t$ by keeping all pairs of tuples that pass the provided langex predicate.
$\mathit{sem\_sim\_join}(t: T,\ a_1: A_1,\ a_2: A_2,\ k: \mathit{int})$ Performs a similarity join where each row of the source table is joined with the $k$ most semantically similar tuples from the table $t$, using fields $a_1$ and $a_2$ as the left and right join keys, respectively.
$\mathit{sem\_agg}(l: L[X] \rightarrow X)$ Performs an aggregation over the input tuples according to the langex, which specifies a commutative, associative aggregation function over a list of tuples.
$\mathit{sem\_topk}(k: \mathit{int},\ l: L[X] \rightarrow L[X])$ Ranks each tuple and returns the $k$ best according to the langex, which specifies a ranking function that sorts a list of tuples.
$\mathit{sem\_map}(l: X \rightarrow Y)$ Performs a projection, returning a new column, according to the provided langex.
$\mathit{sem\_extract}(l: X \rightarrow Y)$ Performs a projection according to the langex, returning a column with a list of substrings from the input tuples.
$\mathit{sem\_cluster\_by}(C: \mathit{int},\ a: A)$ Performs a semantic similarity clustering over column $a$ to create $C$ groups.
$\mathit{sem\_search}(q: \mathit{String},\ k: \mathit{int},\ a: A)$ Performs a top-$k$ search over column $a$ using query $q$.
$\mathit{sem\_index}(a: A,\ \mathit{path}: \mathit{String})$ Creates similarity index for column $a$ and saves to $\mathit{path}$.

Figure 1 begins to demonstrate the power of LOTUS’ declarative programming model and optimized query engine. For a fact-checking task on the FEVER dataset (Thorne et al., 2018), we can easily create 3 distinct query pipelines, each written as an intuitive LOTUS program of fewer than 50 lines of code by composing 3-5 semantic operators (e.g., filters, maps, search and joins). The modularity and composability of these semantic operators allows us to quickly explore the design space on this task. In doing so, we find the un-optimized LOTUS programs can reproduce and improve accuracy on this task by up to 9.5% compared to a recent state-of-the-art fact-checking pipeline, FacTool (Chern et al., 2023), while also providing query efficiency, by default, to maintain modest execution time.

LOTUS’ optimizer also exploits the rich implementation design space of semantic operators to leverage new algorithmic and optimization opportunities. For expensive operators like semantic joins, aggregations, ranking and filters, LOTUS implements novel algorithms that maximize parallel batched-inference opportunities, use model cascades (Viola and Jones, 2001; Yue et al., 2024; Chen et al., 2023; Kang et al., 2017, 2022) with a lightweight scoring function unique to our setting, leverage semantic similarity indices, and perform algorithmic approximations. Figure 1 demonstrates the effectiveness of these methods. Compared to the best-performing un-optimized LOTUS program, Program B shown in the figure, the logically equivalent program implemented using LOTUS’ optimizations attains 2.5% higher accuracy and runs 2× faster.

We systematically evaluate LOTUS through a series of real applications, including fact-checking, extreme multi-label classification, and search. Our results show that LOTUS’ programming model is highly expressive, capturing high quality and state-of-the-art query pipelines with low development overhead for these wide-ranging applications. Specifically, on the FEVER dataset (Thorne et al., 2018) for fact-checking, LOTUS programs can reproduce a recent state-of-the-art pipeline (Chern et al., 2023), as shown in Figure 1, in a few lines of code, and implement a new pipeline with a simple change of operators that improves accuracy by 9.5%, while offering 7-34× lower execution time. In the extreme multi-label classification task on the BioDEX dataset (D’Oosterlinck et al., 2023), LOTUS reproduces state-of-the-art result quality (D’Oosterlinck et al., 2024) with its join operator, while providing an efficient algorithm with 800× lower execution time than the naive algorithm, demonstrating the power of LOTUS’ declarative interface. In the search and ranking application, LOTUS allows a simple composition of operators to achieve 5.9-49.4% higher nDCG@10 than the vanilla retriever and re-ranker, while also providing query efficiency, with 1.67-10× lower execution time than LM-based ranking methods (Qin et al., 2024) used by prior works.

2. The LOTUS Programming Model

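import pandas as pd
import lotus  # assumed to make the semantic operators available on DataFrames

def get_paper_digest(research_interests: str, baseline: str):
    # Load the papers table and attach the prebuilt semantic index on "abstract".
    papers_df = pd.read_csv("papers.csv")\
        .load_sem_index("abstract", "index_dir")

    # Search, filter, then aggregate. {{abstract}} is a langex column parameter;
    # single-brace fields are filled in by the f-string itself.
    return papers_df\
        .sem_search("abstract", research_interests, 100)\
        .sem_filter(f"the paper {{abstract}} claims to outperform {baseline}")\
        .sem_agg(f"Write a digest summarizing each {{abstract}} and its relevance to {research_interests}")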
Figure 2. Example LOTUS program using semantic operators to return a summary of relevant papers. The function takes a description of the user’s research interests (e.g. approximate nearest neighbor search), and a baseline method (e.g. hierarchical navigable small world indices) that the user is interested in. The program then searches over papers, then filters based on whether the paper outperforms the baseline, and finally constructs a summary.

We now introduce the LOTUS programming model and show its expressive power in allowing developers to declaratively specify AI-based query pipelines that bulk process large datasets of structured and unstructured data. LOTUS extends the relational model with semantic operators, which we show in Table 1. These operators can be easily composed together with standard relational operators to build powerful programs that are transparently optimized (Section 3). In this section, we describe each semantic operator and our API for them in LOTUS, grounding each in concrete examples. While our current API implementation extends Pandas (pan, [n. d.]), LOTUS’ semantic operators could be used with other existing relational query languages and APIs, such as SQL.

2.1. Datatypes

LOTUS’ data model consists of tables with structured and unstructured text fields, and our current implementation extends Pandas (pan, [n. d.]). Figure 2 shows an example LOTUS program that loads data about ArXiv papers and performs a summarization task, which finds relevant papers, filters according to whether the paper claims to outperform a specific baseline, and then summarizes the remaining papers into a single digest. We briefly describe two core components of the LOTUS programming model: its data model (Section 2.1.1), and its parameterized natural language expressions (Section 2.1.2) for specifying semantic operations over the data.

Figure 3. Table schema of ArXiv papers.

2.1.1. Data Model

LOTUS is designed to seamlessly extend the relational model. Each row in the table represents a logical entity. Table columns may contain either structured data or unstructured fields with free-form natural-language text. LOTUS’ semantic-relational operators can take both of these data types as inputs. Figure 3 shows an example table, where each row represents an ArXiv paper document, with fields for the paper’s title, ArXiv URL, abstract, ArXiv domain categories, and the publication date. These columns are then passed as parameters to semantic-relational operators, such as sem_search, sem_filter, and sem_agg, as shown in Figure 2.

Additionally, LOTUS supports semantic similarity indices over natural-language text columns to provide optimized semantic query processing. These indices create embeddings over each document in the column to capture semantic similarity using embedding distance metrics. Semantic indexes can be created off-line and then loaded using sem_index and load_sem_index. Figure 2 provides an example program, where upon reading the ArXiv papers data from a CSV file, the programmer loads a semantic index over the abstract column. The program then repeatedly uses the semantically-indexed column in subsequent LOTUS operations, involving semantic search, filtering and aggregation.

2.1.2. Parameterized Natural Language Expressions (langex)

A core principle of LOTUS is to provide users with a declarative interface that separates the user-specified, logical query plan from its underlying implementation. As such, users program with LOTUS’ semantic operators by writing parameterized natural language expressions (langex), rather than directly prompting an underlying LM. Akin to regular expressions (regex), which specify character patterns to match in text, langex are natural language expressions that programmatically specify reasoning-based patterns over structured data and free-form text. Figure 2 shows an example, where the programmer provides a langex as the parameter to the sem_filter and sem_agg operations. Programmers write each langex in natural language text, parameterized by one or more data columns, which are indicated by the double curly braces within the formatted string. The function of these expressions varies according to the semantic operator used and may represent a predicate, aggregation, comparator function, or projection in natural language. For instance, as shown in Figure 2, the langex signature of sem_filter provides a predicate that indicates a filter criterion to apply over paper abstracts, while sem_agg takes a langex that provides an associative aggregation expression, which here indicates a many-to-one summarization task over paper abstracts.
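To make the column parameterization concrete, the following minimal sketch shows one way a langex could be parsed into the columns it references and then instantiated for each row. It is an illustration only and not LOTUS’ implementation: the helpers `langex_columns` and `render_langex` are hypothetical names, rows are modeled as plain dictionaries, and parsing relies on Python’s standard `string.Formatter`.

```python
from string import Formatter


def langex_columns(langex: str) -> list[str]:
    """Return the column names referenced as {column} fields in a langex."""
    return [field for _, field, _, _ in Formatter().parse(langex) if field]


def render_langex(langex: str, row: dict) -> str:
    """Substitute one row's values into the langex (hypothetical helper)."""
    return langex.format(**{col: row[col] for col in langex_columns(langex)})


baseline = "HNSW"
# In an f-string, {{abstract}} survives as the literal field {abstract},
# while {baseline} is substituted immediately by Python.
langex = f"the paper {{abstract}} claims to outperform {baseline}"

rows = [
    {"abstract": "We beat HNSW on recall."},
    {"abstract": "A survey of indexes."},
]
# One instantiated expression per row, ready to be embedded in an LM prompt.
prompts = [render_langex(langex, row) for row in rows]
```

Separating the column names from the surrounding text is what lets an engine decide how to batch rows and which operator-specific instruction to wrap around each instantiated expression.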

Notably, these language expressions are sufficiently versatile and easy to program with, providing an intuitive and higher-level interface to the user. This makes it simple for the programmer to specify diverse and complex, multi-step query pipelines. All operator-specific prompts are automatically handled by LOTUS’ underlying query engine, and the system can leverage existing prompt optimization techniques (Khattab et al., 2023; Yuksekgonul et al., 2024) on- or off-line.

2.2. Semantic Operators

We now give an overview of each semantic operator and its corresponding LOTUS API. Table 1 provides a concise summary of each operator.

Sem_filter returns the subset of rows that pass the filtering condition specified by the user’s langex. As Figure 4 shows, the langex signature provides a semantic predicate over one or more table columns that can be answered with a binary "True" or "False".

papers_df.sem_filter("The {abstract} claims to outperform GPT-4 on a benchmark task.")
Figure 4. Example usage of sem_filter.

Sem_topk ranks a set of rows according to user-defined criteria and returns the $K$ rows that best match them. The signature of the langex provides a general ranking criterion over one or more columns. The underlying system can use this langex to impose a ranking over any subset of rows, according to the chosen implementation. As Figure 5 shows, the programmer uses the langex to specify arbitrary reasoning-based ranking criteria, such as ranking paper abstracts by the most outrageous claim made.

papers_df.sem_topk("the {abstract} makes the most outrageous claim", K=10)
Figure 5. Example usage of sem_topk.

Programmers can also optionally specify a group-by parameter to indicate a subset of columns to group over during ranking, as shown in Figure 6. The groupings are defined using standard equality matches over the group-by columns. To group by semantic similarity instead, users can first perform a sem_cluster_by and pass the resulting cluster_id column as the group-by column parameter.

papers_df.sem_topk("the {abstract} makes the most outrageous claim", K=10, group_by=["arxiv_domain"])
Figure 6. Example usage of sem_topk with group-by.
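One simple way to realize the ranking that sem_topk describes is to treat the langex as a pairwise comparator and sort with it. The sketch below assumes this strategy purely for illustration; it is not a description of LOTUS’ algorithm. `compare` is a hypothetical stand-in for an LM call that judges which of two rows better matches the criteria, replaced here by a toy heuristic so that the example runs offline.

```python
from functools import cmp_to_key


def compare(a: dict, b: dict) -> int:
    """Stand-in for an LM judgment: negative if `a` should rank ahead of `b`.

    Toy heuristic (assumption): more exclamation marks means more outrageous.
    """
    return b["abstract"].count("!") - a["abstract"].count("!")


def topk_by_comparator(rows: list[dict], k: int) -> list[dict]:
    """Rank rows with the pairwise comparator and keep the k best."""
    return sorted(rows, key=cmp_to_key(compare))[:k]


rows = [
    {"abstract": "We modestly improve recall."},
    {"abstract": "We solve AGI!!!"},
    {"abstract": "Our index is infinitely fast!"},
]
top2 = topk_by_comparator(rows, k=2)
```

Every comparison in such a scheme would correspond to an LM invocation, so the choice of sorting or selection algorithm directly determines cost.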

Sem_join combines data from two tables, evaluating the user’s predicate to return the set of rows from the left and right table that pass. As Figure 7 shows, users specify the right join table, a langex, and optionally, the join key of the left and right tables. Here the langex contains two or more columns, and provides a predicate over the left and right tables. By default, the operator performs an inner join over the two tables, and can alternatively perform left, right or outer joins, if specified.

papers_df.sem_join(papers_df, "The paper {abstract:left} contradicts the claims made by the {abstract:right}.")
Figure 7. Example usage of sem_join.
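The semantics of sem_join can be illustrated with a naive nested-loop evaluation, in which the predicate is checked once for every pair of left and right rows. This is a sketch of the logical behavior under stated assumptions, not LOTUS’ implementation: `predicate` is a hypothetical stand-in for an LM call on the instantiated langex, and the `topic` and `claim` fields are invented for the toy data.

```python
from itertools import product


def predicate(left: dict, right: dict) -> bool:
    """Stand-in for an LM call (toy rule): same topic, opposing claim."""
    return left["topic"] == right["topic"] and left["claim"] != right["claim"]


def naive_sem_join(left_rows: list[dict], right_rows: list[dict]):
    """Inner join by evaluating the predicate on every (left, right) pair."""
    calls, matches = 0, []
    for l, r in product(left_rows, right_rows):
        calls += 1  # each pair would cost one LM invocation
        if predicate(l, r):
            matches.append({"left": l, "right": r})
    return matches, calls


papers = [
    {"id": "A", "topic": "ann", "claim": "graph indexes are fastest"},
    {"id": "B", "topic": "ann", "claim": "graph indexes are not fastest"},
    {"id": "C", "topic": "sql", "claim": "LLMs write correct SQL"},
]
matches, calls = naive_sem_join(papers, papers)
```

The number of predicate evaluations grows with the product of the two table sizes, which is why cheaper approximations are attractive for this operator.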

Sem_sim_join provides a variant of the semantic join, such that rows are matched according to their semantic similarity, rather than an arbitrary natural-language predicate. Akin to an equi-join in standard relational algebra, the semantic similarity join is a specialized semantic join. Figure 8 provides an example, where one table contains papers with an indexed column of abstracts, and the other table contains a list of research interests. The user specifies the left and right table join keys, and a parameter $K$. The left join key may or may not be indexed, whereas the right join key must be a semantically indexed column with its index loaded. The operator performs a left join such that for each row in the left table, the output table will contain the $K$ matching rows from the right table with the highest similarity scores. The programmer can also optionally specify a return column containing the semantic similarity scores of each joined row.

papers_df = pd.read_csv("papers.csv")\
    .load_sem_index("abstract", "abstract_index_dir")

research_interests = {
    "research_topic": [
        "Vector Databases",
        "LLMs for query processing",
        "Text-to-SQL",
        "Compound AI Systems",
        "Brain Computer Interfaces",
    ]
}
interests_df = pd.DataFrame(research_interests)

interests_df.sem_sim_join(papers_df, left_on="research_topic", right_on="abstract", K=10)
Figure 8. Example usage of sem_sim_join.
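The similarity join described above can be sketched as a brute-force top-$K$ search per left row. The example below is a minimal illustration that assumes embeddings have already been computed (toy 2-dimensional vectors here) and uses cosine similarity; the function names are hypothetical, and a real system would typically query a vector index instead of scanning every right row.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def sim_join(left, right, k: int):
    """left/right: lists of (key, embedding). Returns (left, right, score) rows."""
    out = []
    for lkey, lvec in left:
        scored = sorted(((cosine(lvec, rvec), rkey) for rkey, rvec in right), reverse=True)
        out.extend((lkey, rkey, score) for score, rkey in scored[:k])
    return out


# Toy embeddings (assumption): axis 0 ~ "vector search", axis 1 ~ "SQL".
interests = [("Vector Databases", [1.0, 0.0]), ("Text-to-SQL", [0.0, 1.0])]
papers = [
    ("ANN index paper", [0.9, 0.1]),
    ("NL2SQL paper", [0.1, 0.9]),
    ("Hybrid paper", [0.7, 0.7]),
]
joined = sim_join(interests, papers, k=2)
```

Each left row contributes exactly $K$ output rows (fewer only if the right table is smaller), which matches the left-join behavior described in the text.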

Sem_agg performs an aggregation over all rows of the table. As Figure 9 shows, the langex signature provides a commutative and associative aggregation function, which can be applied over any subset of rows to produce intermediate results. Semantic aggregations can be useful for many tasks involving many-to-one reduction patterns, such as summarization or question-answering over multiple documents. Similar to sem_topk, users can also specify a group-by parameter. We also provide additional flexibility by allowing the programmer to compose semantic aggregations with sem_partition_by, which provides finer-grained control over how documents are grouped in each LM invocation. We describe this further below, and show its use in Section 3 for overriding the default commutativity and associativity assumptions for some tasks.

papers_df.sem_agg("write a summary of recent research highlights and discoveries, based on each {abstract}.")
Figure 9. Example usage of sem_agg.

Sem_partition_by creates a row partitioning over the table, which will be used by successive calls to sem_agg to decide which rows to group together within each LM invocation. We find in Section 3 that this can non-trivially affect the result quality of aggregation tasks, like summarization. As Figure 10 shows, sem_partition_by takes a function, which outputs a group-id for each row. LOTUS natively supports a semantic cluster function, which takes the number of clusters to create and a column, but users can also specify arbitrary partition functions. The group-ids output by the partition function indicate to the system which rows should be grouped together, in a best-effort manner, during LM invocations. The system will aggregate over documents within each group, before merging intermediate results across groups.

papers_df\
    .sem_partition_by(cluster(7, "abstract"))\
    .sem_agg("write a summary of recent research highlights and discoveries, based on each {abstract}.")
Figure 10. Example usage of sem_agg with sem_partition_by.
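The two-stage behavior described for partitioned aggregation (aggregate within each group, then merge intermediate results across groups) can be sketched as follows. This is an illustrative outline under assumptions, not LOTUS’ implementation: `summarize` is a hypothetical stand-in for one LM invocation of the aggregation langex, implemented here as simple string joining so the data flow is visible.

```python
from collections import defaultdict


def summarize(texts: list[str]) -> str:
    """Stand-in for one LM invocation of the aggregation langex (toy: join)."""
    return " + ".join(texts)


def partitioned_agg(rows: list[str], group_ids: list[int]) -> str:
    """Aggregate within each partition first, then merge the partial results."""
    groups = defaultdict(list)
    for row, gid in zip(rows, group_ids):
        groups[gid].append(row)
    # Stage 1: one intermediate result per group.
    partials = [summarize(groups[gid]) for gid in sorted(groups)]
    # Stage 2: merge intermediate results across groups.
    return summarize(partials) if len(partials) > 1 else partials[0]


abstracts = ["ann-paper-1", "sql-paper-1", "ann-paper-2"]
group_ids = [0, 1, 0]  # e.g. produced by a clustering or custom partition function
digest = partitioned_agg(abstracts, group_ids)
```

With the toy `summarize`, the result shows that related rows are combined before unrelated ones, which is the effect the partition function is meant to control.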

Sem_map performs a natural language projection over an existing column and outputs a new column in the table. As shown in Figure 11, the user’s langex specifies what to project. This operator provides general functionality and simulates logically row-wise data-flow similar to prior works that use LLM-UDFs in relational languages (Liu et al., 2024a, c). These semantic projections are broadly useful for a variety of tasks, such as row-wise summarization, classification, and entity-extraction.

papers_df.sem_map("Summarize the main result of the paper {abstract}.")
Figure 11. Example usage of sem_map.

Sem_extract provides similar functionality to sem_map but provides the answers by returning a list of sub-strings from the source text. This functionality is useful for applications such as entity extraction or fact-checking, where snippet-finding or verified quotes may be preferable to synthesized LLM-based answers. As Figure 12 shows, the user’s langex signature specifies a projection, similar to sem_map.

papers_df.sem_extract("what benchmarks are mentioned in the paper {abstract}?")
Figure 12. Example usage of sem_extract.
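Because sem_extract is defined to return sub-strings of the source text, its outputs can in principle be checked mechanically against the input. The short sketch below illustrates such a check; `verified_spans` is a hypothetical helper introduced for this example, and the source does not state whether LOTUS performs this validation.

```python
def verified_spans(source: str, candidates: list[str]) -> list[str]:
    """Keep only candidate answers that occur verbatim in the source text."""
    return [span for span in candidates if span and span in source]


abstract = "We evaluate on FEVER and BioDEX, improving accuracy on both."
# Hypothetical LM output: the last item is not a substring of the abstract.
lm_answers = ["FEVER", "BioDEX", "ImageNet"]
benchmarks = verified_spans(abstract, lm_answers)
```

A check of this kind is what distinguishes quoted snippets from synthesized answers: anything the model did not copy from the input is dropped.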

Sem_index generates a semantic similarity index over the specified data column, as shown in Figure 13. To generate the semantic index, users first use the settings module to declare a retrieval model, which will be used for generating semantic embeddings and indexing the column data. The sem_index operator takes a column and a local directory path, to which the generated semantic index will be stored. As the figure shows, users can separately index multiple columns of the table.

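import lotus
from lotus.models import E5Model  # assumed import path; may vary by LOTUS version

rm = E5Model()
lotus.settings.configure(rm=rm)

# Index each column separately; the second argument is the output directory.
papers_df.sem_index("abstract", "abstract_index_dir")
papers_df.sem_index("title", "title_index_dir")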
Figure 13. Example usage of sem_index.

Load_sem_index re-loads the stored semantic index upon reading the table data from disk, as shown in Figure 14. The user specifies the column corresponding to the semantic similarity index and the directory where the index was saved.

Sem_search performs a top-$k$ semantic similarity search over a semantically-indexed column, as Figure 14 shows. The user specifies the table column to search over, a query in natural language, a target $K$ of the number of results to return, and optionally indicates whether to return similarity scores as a column in the returned table.

papers_df = pd.read_csv("papers.csv")\
    .load_sem_index("abstract", "abstract_index_dir")

papers_df.sem_search("abstract", "vector databases", 10, return_scores=True)
Figure 14. Example usage of load_sem_index and sem_search.

LOTUS also exposes advanced relevance-based re-ranking functionality for search. Users can specify a re-ranker model, as shown in Figure 15, and then set the n_rerank parameter during the semantic search. The semantic search in this case will first find the top-$K$ most relevant documents according to the retriever model, and then re-rank the top-$K$ found documents to return the top n_rerank ones using the re-ranker model.

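import pandas as pd
import lotus
from lotus.models import CrossEncoderModel, E5Model  # assumed import path

rm = E5Model()
reranker = CrossEncoderModel()
lotus.settings.configure(rm=rm, reranker=reranker)

# Retrieve the top 100 abstracts, then keep the 10 best after re-ranking.
papers_df = pd.read_csv("papers.csv")\
    .load_sem_index("abstract", "abstract_index_dir")\
    .sem_search("abstract", "vector databases", K=100, n_rerank=10)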
Figure 15. Example usage of re-ranking with sem_search.
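The retrieve-then-re-rank flow described above can be sketched in a few lines. The example is a minimal illustration under assumptions: the two scoring functions are hypothetical stand-ins for the retriever and re-ranker models, and each toy document simply carries precomputed scores under the invented keys `cheap` and `precise`.

```python
def retrieve_then_rerank(docs, retriever_score, reranker_score, k: int, n_rerank: int):
    """Take the top-k by the retriever, then the top-n_rerank by the re-ranker."""
    candidates = sorted(docs, key=retriever_score, reverse=True)[:k]
    return sorted(candidates, key=reranker_score, reverse=True)[:n_rerank]


docs = [
    {"id": "d1", "cheap": 0.9, "precise": 0.2},
    {"id": "d2", "cheap": 0.8, "precise": 0.9},
    {"id": "d3", "cheap": 0.7, "precise": 0.5},
    {"id": "d4", "cheap": 0.1, "precise": 1.0},  # never reaches the re-ranker
]
results = retrieve_then_rerank(
    docs,
    retriever_score=lambda d: d["cheap"],
    reranker_score=lambda d: d["precise"],
    k=3,
    n_rerank=2,
)
```

The toy data shows the trade-off in this design: the re-ranker only sees what the retriever surfaces, so $K$ bounds both the cost of re-ranking and the recall of the final result.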

Sem_cluster_by assigns a group to each row of the table according to semantic similarity. This operator is akin to a relational group_by, but uses semantic similarity to determine the groups, rather than equality over the specified column. As Figure 16 shows, the user specifies an indexed column to cluster, and the number of clusters to create. The returned table augments the input table with an additional column that specifies the assigned cluster_id of each row. Users can optionally also return the similarity scores between each row and its cluster centroid in an additional column of the returned table.

papers_df = pd.read_csv("papers.csv")\
    .load_sem_index("abstract", "abstract_index_dir")\
    .sem_cluster_by("abstract", 20, return_scores=True)
Figure 16. Example usage of sem_cluster_by.

3. Implementing and Optimizing Semantic Operators

Semantic operators create a rich design space, allowing for a diverse set of algorithmic decisions and optimizations which have significant consequences on system efficiency and accuracy. We identify multiple possible algorithms for these operators, noting their performance implications. LOTUS’ query engine aims to automatically exploit this rich design space to provide an optimized implementation of each individual operator and composite pipelines.

Specifically, LOTUS employs a series of novel algorithms designed to leverage the semantics of each operator in order to maximize parallel batched-inference opportunities, use model cascades (Viola and Jones, 2001; Yue et al., 2024; Chen et al., 2023; Kang et al., 2017, 2022) with a lightweight scoring function unique to our setting, leverage the semantic similarity index, and perform algorithmic approximations for expensive operations (e.g. joins). We see in Section 4 that these decisions can significantly improve efficiency while maintaining high quality results for LOTUS’ query pipelines.

While our initial investigation studies several operator-specific design decisions unique to the semantic bulk processing setting, we envision a rich set of additional optimization opportunities. Several prior works demonstrate performance gains in both logical query plan optimizations (e.g. operator re-ordering (Liu et al., 2024c, b; Lin et al., 2024; Lu et al., 2018)) and other general LM-approximation techniques (e.g. code synthesis (Arora et al., 2023; Liu et al., 2024b) and prompt adaptation (Chen et al., 2023)), which we find to be promising opportunities for LOTUS’ implementation, left to future work.

LOTUS’ current implementation extends the Pandas API (pan, [n. d.]). We leverage vLLM (Kwon et al., 2023) to perform efficient batched inference, and we use FAISS (Johnson et al., 2017) for efficient vector similarity search.

3.1. sem_filter

LOTUS’ semantic filter runs batched LLM calls over a set of rows, prompting the model to output a boolean value for each row to indicate whether the record passes the user’s natural language predicate. Figure 17 shows an example of the operator-specific filtering instruction, which is managed transparently by the system.

The user will provide a claim and some relevant context. Your job is to determine whether the claim is true for the given context. You must answer with a single word, "True" or "False".

Context: ...
Claim: ...

Figure 17. Instruction prompt for semantic filter.

3.1.1. Optimizations

We leverage model cascades, using an efficient scoring function unique to our setting, to further optimize the filter function. Many prior works have studied model cascades for ML tasks (Viola and Jones, 2001; Yue et al., 2024; Chen et al., 2023; Kang et al., 2017, 2022), leveraging a relatively cheap and inaccurate proxy model, along with an accurate but expensive oracle model in order to reduce the inference cost by routing easy queries to the small model, and resorting to the large model where necessary. Typically, a scoring function provides confidence scores over the proxy model’s outputs, and allows the system to decide when to route examples to the larger model. Applying this paradigm to LMs introduces several challenges, including generating a reliable scoring function. Several works study this problem, suggesting to fine-tune a smaller LM to score each question along with the answer produced by the weak LLM (Chen et al., 2023), or invoking multiple sampling paths from the small LM and evaluating self-consistency to build a confidence score (Yue et al., 2024). While these methods provide general-purpose scoring functions, LOTUS’ scoring function leverages the operator semantics to provide a lighter-weight scoring function, which avoids the execution time penalty of scoring with an additional model or re-sampling from the proxy-model.

Specifically, LOTUS’ cascade scoring function leverages the binary-label output of each LM invocation. The system first batch-processes each record using the cheaper LM, then computes confidence scores by exponentiating the log-probabilities of the output tokens corresponding to the True or False answer. The system then passes the records whose confidence scores fall below the user-defined threshold, in batch, to the larger LM. As we show in Section 4, leveraging model cascades in this way for semantic filtering can significantly improve the system’s efficiency while maintaining high quality results for some tasks.

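import lotus
from lotus.models import OpenAIModel

# lm1 is the cheap proxy model; lm2 is the expensive oracle model
lm1 = OpenAIModel(model="gpt-3.5-turbo")
lm2 = OpenAIModel(model="gpt-4-turbo")

lotus.settings.configure(lms=[lm1, lm2])

# records whose proxy confidence falls below the threshold are re-evaluated by lm2
papers_df.sem_filter("The {abstract} suggests that LLMs effectively utilize long context", confidence_threshold=0.9)
Figure 18. Example semantic filter with model cascades.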
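The cascade routing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not LOTUS’ actual implementation: the `Model` callable interface, `cascade_filter`, and the two toy models are hypothetical names introduced here, and a real system would issue both passes as batched LM calls.

```python
import math
from typing import Callable, List, Tuple

# Hypothetical model interface (an assumption for this sketch): given a record,
# return the predicted label and the log-probability of the emitted answer token.
Model = Callable[[str], Tuple[bool, float]]


def cascade_filter(records: List[str], proxy: Model, oracle: Model,
                   confidence_threshold: float = 0.9) -> List[bool]:
    """Label every record with the proxy; re-label low-confidence ones with the oracle."""
    # Pass 1: run the cheap proxy over every record.
    proxy_out = [proxy(r) for r in records]
    labels = [label for label, _ in proxy_out]
    # Confidence = probability of the emitted True/False token = exp(log-probability).
    confidences = [math.exp(logprob) for _, logprob in proxy_out]
    # Pass 2: send only the uncertain records to the expensive oracle.
    for i, conf in enumerate(confidences):
        if conf < confidence_threshold:
            labels[i], _ = oracle(records[i])
    return labels


# Toy stand-ins for the two LMs, used only to exercise the routing logic.
oracle_calls: List[str] = []


def toy_proxy(text: str) -> Tuple[bool, float]:
    # Case-sensitive match; confident only on short inputs.
    confidence = 0.95 if len(text) < 30 else 0.6
    return "long context" in text, math.log(confidence)


def toy_oracle(text: str) -> Tuple[bool, float]:
    oracle_calls.append(text)
    return "long context" in text.lower(), 0.0


records = ["uses long context", "This paper studies Long Context utilization in LLMs"]
labels = cascade_filter(records, toy_proxy, toy_oracle, confidence_threshold=0.9)
print(labels, len(oracle_calls))
```

In this toy run only the second record falls below the threshold, so the oracle is invoked once rather than for every row, which is where the cost saving comes from.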

3.2. sem_topk

Performing a semantic top-$k$ ranking requires logically reasoning across rows, entailing joint reasoning over often large amounts of data. Implementing an efficient algorithm introduces many important design decisions pertaining to managing LM context length limits, grouping documents within each LM context window to achieve high quality LM-based comparisons over subsets of the data, and choosing an efficient top-$k$ ranking algorithm to combine LM-based comparisons. These decisions have notable performance implications on efficiency and result quality, as we show in Section 4. While prior works have studied LM-based passage re-ranking (Desai and Durrett, 2020; Drozdov et al., 2023; Liang et al., 2022; Ma et al., 2023; Pradeep et al., 2023a, b; Qin et al., 2024; Sachan et al., 2022; Sun et al., 2023) and ranking with noisy comparisons (Shah and Wainwright, 2016; Braverman and Mossel, [n. d.]) with the goal of achieving high quality results in a modest number of total LM calls or comparisons, LOTUS’ implementation provides a generalized ranking algorithm for arbitrary natural language expressions and aims to produce both high quality results and low query execution time. We leverage several algorithmic design decisions and optimizations to achieve this.

3.2.1. Algorithms

Your job is to to select and return the most relevant document to the users question. Carefully read the users question and the two documents provided below. Respond only with the label of the document such as "Document NUMBER". NUMBER must be either 1 or 2, depending on which document is most relevant. You must pick a number and cannot say things like "None" or "Neither".

Question: ...
Document 1: ..
Document 2: ...
Figure 19. Pair-wise comparison prompt for semantic top-k.

Prior LM-based sorting and ranking works suggest several methods for performing LM-based comparisons. Specifically, prior works primarily study three classes of methods: point-wise ranking methods (Desai and Durrett, 2020; Drozdov et al., 2023; Liang et al., 2022; Sachan et al., 2022; Wu et al., 2024), list-wise ranking methods (Ma et al., 2023; Pradeep et al., 2023a, b; Sun et al., 2023), and pair-wise ranking methods (Shah and Wainwright, 2016; Qin et al., 2024). Point-wise methods output a relevance score for each document independently; while these methods are fully parallelizable across rows, prior work (Desai and Durrett, 2020) shows they provide poor accuracy due to difficulty in calibrating scores across prompts. We note that these methods can be implemented in LOTUS with a sem_map, but are not the focus of our semantic top-$k$ operator due to their quality limitations. By contrast, list-wise methods feed a partial list of 10 to 20 documents to the LM and prompt it to output a ranking over these, which can be aggregated using a sliding window approach. Unfortunately, prior work shows these methods are often prone to prediction failures (Sun et al., 2023) and sensitive to document ordering in the prompt (Qin et al., 2024). Pair-wise prompting methods instead offer a simple and effective approach that feeds a single pair of documents to the LM in each invocation, prompting the model to perform a comparison and output a binary label. This method has been shown to be an effective base unit for ranking with relatively high robustness to input order sensitivity (Qin et al., 2024).

To aggregate pairwise comparisons, we consider several ranking algorithms, including a quadratic sorting algorithm, a heap-based top-$k$ algorithm, and a quick-select-based top-$k$ ranking algorithm. The quadratic ranking algorithm obtains a comparison between each pair of input documents, and uses a win rate to determine a ranking over all elements before selecting the $k$ best. In contrast, the heap-based algorithm maintains a heap of size $k$, and makes a linear pass over the data, updating the heap when more promising elements are found. Each time a new element is inserted or removed from the heap, the algorithm performs a series of sequential LM comparisons to update the data structure. Lastly, the quick-select-based algorithm proceeds in successive rounds, each time choosing a pivot, and comparing all other remaining elements in the document set to the pivot item to determine the rank of the pivot. Because each round is fully parallelizable, we perform these LM-based comparisons efficiently in batch before recursing in the next round.
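The quick-select-based variant can be sketched as follows. This is an illustrative sketch rather than LOTUS’ implementation: `compare_batch` is a hypothetical stand-in for one batch of pair-wise LM comparisons, and the toy comparator below simply compares integers. Note that this version also returns the selected items in ranked order.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
# Hypothetical batched comparator (an assumption): for each (doc, pivot) pair,
# return True if doc should rank above the pivot. One call = one batched round.
BatchCompare = Callable[[Sequence[Tuple[T, T]]], List[bool]]


def quickselect_top_k(docs: List[T], k: int, compare_batch: BatchCompare,
                      rng: random.Random) -> List[T]:
    """Return the k highest-ranked docs, best first."""
    if k <= 0 or not docs:
        return []
    if len(docs) == 1:
        return list(docs)
    pivot_idx = rng.randrange(len(docs))
    pivot = docs[pivot_idx]
    rest = docs[:pivot_idx] + docs[pivot_idx + 1:]
    # One round: compare every remaining doc to the pivot in a single batch.
    wins = compare_batch([(d, pivot) for d in rest])
    better = [d for d, w in zip(rest, wins) if w]
    worse = [d for d, w in zip(rest, wins) if not w]
    if len(better) >= k:
        # The top-k lies entirely among the docs that beat the pivot.
        return quickselect_top_k(better, k, compare_batch, rng)
    # Otherwise everything in `better` plus the pivot belongs to the top-k;
    # rank `better`, append the pivot, and fill the remainder from `worse`.
    top = quickselect_top_k(better, len(better), compare_batch, rng)
    top.append(pivot)
    return top + quickselect_top_k(worse, k - len(top), compare_batch, rng)


batch_sizes: List[int] = []


def toy_compare_batch(pairs: Sequence[Tuple[int, int]]) -> List[bool]:
    batch_sizes.append(len(pairs))  # record how large each batched round is
    return [a > b for a, b in pairs]


top3 = quickselect_top_k([5, 1, 9, 3, 7, 2], 3, toy_compare_batch, random.Random(0))
print(top3)
```

Each call to the comparator corresponds to one round whose comparisons are mutually independent, which is what makes the rounds amenable to batched inference.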

3.2.2. Discussion

As the core LM-based computation unit over subsets of the data, LOTUS’ current implementation uses pair-wise rankings, following recent prior work (Qin et al., 2024), which demonstrates its effectiveness. We also find this choice useful for implementing further optimizations, such as model cascades, described below. Figure 19 shows an example task-instruction prompt, which we use to implement the LM-based comparison by instructing the model to select between two documents according to the user-defined sorting criteria.

Additionally, for aggregating comparisons, LOTUS’ current implementation leverages the quick-select-based (Hoare, 1961) top-$k$ ranking algorithm by default. Prior work (Qin et al., 2024) studies several alternative algorithms, including the heap-based ranking implementation and the quadratic sorting algorithm. We study these alternatives in Section 4, and find that the quick-select-based algorithm offers high accuracy while also offering an efficient implementation, with at least an order of magnitude fewer total calls than the quadratic sorting algorithm and more opportunities for batched calls, leading to lower execution time compared to a heap-based implementation. We believe future work may leverage these multiple algorithms to allow the user to declaratively trade off between performance metrics, like accuracy, execution time, and cost.

3.2.3. Optimizations

LOTUS’ top-$k$ algorithm is amenable to several optimizations. First, the top-$k$ algorithm’s pair-wise comparisons each invoke the LM to output a binary label, which is amenable to model cascades. To implement this, we use a simple and efficient scoring procedure based on log-probabilities of the generated tokens. We describe this procedure above for sem_filter (Section 3.1.1) and apply an equivalent procedure here. In Section 4, we demonstrate this optimization allows programmers to leverage small, cheap models in conjunction with large, expensive models to obtain high accuracy results at reduced cost.

Additionally, LOTUS can leverage the semantic index to optimize pivot selection for some queries, rather than resorting to random pivot selection. This optimization is useful when there exists correlation between the rankings imposed by the user’s arbitrary sorting criteria and the rankings imposed by semantic similarity scores. In this case, LOTUS can sort the document set based on embedding distances to the user’s query, and select the $(k+\epsilon)$-th item, rather than a random item, as the first pivot. This can reduce the number of LM comparisons required by subsequent rounds in the quick-select algorithm, leading to higher query efficiency at no accuracy loss. We believe fruitful future work will automatically estimate correlation between semantic similarity scores and the user-defined ranking criteria to transparently apply this optimization.
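As a rough sketch of the pivot-selection idea, the snippet below ranks documents by embedding similarity to the query and returns the index of the $(k+\epsilon)$-th one. The function name, the `dot` helper, the default slack `eps=2`, and the use of inner product as the similarity measure are all assumptions made for illustration; the text above does not specify them.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product, used here as the (assumed) similarity score."""
    return sum(x * y for x, y in zip(a, b))


def choose_first_pivot(query_emb: Vector, doc_embs: List[Vector],
                       k: int, eps: int = 2) -> int:
    """Return the index of the (k + eps)-th most similar document to the query."""
    if not doc_embs:
        raise ValueError("doc_embs must be non-empty")
    # Rank document indices from most to least similar.
    order = sorted(range(len(doc_embs)),
                   key=lambda i: dot(query_emb, doc_embs[i]),
                   reverse=True)
    # Convert the 1-based rank k + eps to a 0-based position, clamped to the list.
    position = min(k + eps, len(order)) - 1
    return order[position]


query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [0.7, 0.3], [0.3, 0.7]]
pivot_index = choose_first_pivot(query, docs, k=1, eps=2)
print(pivot_index)
```

If the similarity ranking correlates well with the LM-based ranking, most of the true top-$k$ items beat this pivot, so the first quick-select round leaves only about $k+\epsilon$ candidates for later rounds.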

3.3. sem_join

Performing the semantic join involves evaluating the user’s natural language predicate on each pair of rows in the left and right tables. Implemented naively, this can be prohibitively expensive, incurring a quadratic number of LM calls with respect to the sizes of the left and right join tables. LOTUS thus implements several algorithms suitable for a variety of settings, highlighting the rich design space, which we plan to further optimize over in future work. Here we describe the design patterns currently implemented, which we find to be useful for the real-world tasks that we evaluate in Section 4.

3.3.1. Algorithms

The first join algorithm implements the nested-loop join pattern with efficient LM batch processing to maximize GPU utilization rather than naively looping over each pair of rows and invoking the LM. This yields an $O(N_1 \cdot N_2)$ LM call complexity, where $N_1$ and $N_2$ are the sizes of the left and right join tables, respectively. As Figure 17 shows, each LM call instructs the model to output a boolean value after evaluating the user’s natural language predicate. This quadratic join algorithm is suitable for small join tables.

Alternatively, we implement a map-search-filter join pattern, which can be used to approximate the quadratic join algorithm while incurring fewer LM calls. This algorithm first performs a semantic mapping over the left join key to the domain of the right join key. In the example provided by Figure 20, which joins paper abstracts with the datasets they use, the map step would invoke the LM over each abstract, instructing the model to output the dataset used. This step is un-grounded in the sense that the LM is not given knowledge of the right join table, but may optionally leverage user-specified demonstrations. Following the semantic projection, the algorithm then leverages the semantic index and performs a similarity search over the right join table to find candidates likely to pass the predicate. In the example, this search would retrieve a list of datasets from the right table for each paper, using semantic similarity of the datasets to the LM projection output in the first step. The semantic similarity search is parameterized by $K$, which is automatically set based on the user’s specified LM call budget. Lastly, the algorithm performs a filter over the candidate pairs and outputs the joined table of tuple pairs that pass. This join algorithm partially mimics the information flow of prior work (D’Oosterlinck et al., 2024) and abstracts away intermediate steps by instead allowing users to specify an LM call budget for the join. The LM call complexity of this algorithm is $O(N_1 \cdot K)$.

Lastly, we implement a search-filter join pattern. This algorithm first performs a semantic similarity join using a similarity index on the right join key, with the left join key embedded on the fly. In the example provided by Figure 20, this step would retrieve $K$ semantically similar datasets from the right table for each paper abstract in the left table. The batched semantic-similarity search sets the search parameter $K$ according to the user’s specified LM call budget, similar to the map-search-filter algorithm. The search results provide candidate tuple pairs, which are then evaluated in batch using the filter operation to gather the final set of rows that pass the user’s language predicate. This algorithm obtains an LM call complexity of $O(N_1 \cdot K)$.

papers_df.sem_join(dataset_df, "The paper {abstract:left} uses the {dataset:right}.")
Figure 20. Example sem_join for matching papers and datasets.
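To make the difference in LM call counts concrete, the following sketch implements the search-filter pattern over toy data. It is a simplified illustration under stated assumptions: `predicate` is a hypothetical stand-in for one LM filter call, the embeddings are hand-written two-dimensional vectors, and the exhaustive similarity ranking stands in for a real vector index.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def search_filter_join(left: List[str], right: List[str],
                       left_embs: List[Vector], right_embs: List[Vector],
                       predicate: Callable[[str, str], bool],
                       K: int) -> Tuple[List[Tuple[str, str]], int]:
    """Return the joined pairs and the number of predicate (LM) calls made."""
    pairs: List[Tuple[str, str]] = []
    calls = 0
    for i, left_row in enumerate(left):
        # Search: keep the K right rows most similar to this left row.
        ranked = sorted(range(len(right)),
                        key=lambda j: dot(left_embs[i], right_embs[j]),
                        reverse=True)
        # Filter: evaluate the predicate only on the K candidates.
        for j in ranked[:K]:
            calls += 1
            if predicate(left_row, right[j]):
                pairs.append((left_row, right[j]))
    return pairs, calls


# Toy tables and embeddings (axis 0 ~ "question answering", axis 1 ~ "vision").
papers = ["paper on QA", "paper on vision"]
datasets = ["SQuAD", "ImageNet", "COCO"]
paper_embs = [[1.0, 0.0], [0.0, 1.0]]
dataset_embs = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.9]]
truth = {("paper on QA", "SQuAD"), ("paper on vision", "ImageNet"),
         ("paper on vision", "COCO")}

joined, n_calls = search_filter_join(papers, datasets, paper_embs, dataset_embs,
                                     lambda p, d: (p, d) in truth, K=2)
print(joined, n_calls)
```

Here the join makes $N_1 \cdot K = 4$ predicate calls instead of the $N_1 \cdot N_2 = 6$ a nested-loop join would need, and it still recovers every true pair because similarity and the predicate are positively correlated in this toy setup.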

3.3.2. Discussion

These join patterns offer performance tradeoffs suitable for different settings. The nested-loop algorithm offers an efficient solution when the join tables are sufficiently small, whereas the map-search-filter and search-filter patterns are suitable approximations that can apply efficiently over large tables, due to their linear LM call complexity in table size. The expected result quality of either approximation likely varies depending on the presence of predicate clustering and correlation, as defined by prior work (Patel et al., 2024), between the user’s predicate and the semantic embeddings. The search-filter pattern is likely to produce high quality results under positive query correlation, where entities that are semantically similar are also likely to pass the predicate. On the other hand, the map-search-filter pattern is likely to produce higher quality results when the predicate-embedding correlation is low with the original left join key’s embeddings as queries, but can be increased by the LM projection over the left join key.

3.4. sem_agg

Performing semantic aggregations is inherently challenging because, similar to the semantic top-$k$, it requires logically reasoning across rows. Thus, the operator’s implementation must efficiently orchestrate the LM over large amounts of data, while managing long context inputs, which may degrade result quality (Liu et al., 2023) or overflow the underlying model’s context length. LOTUS aims to abstract away such low-level details from the user and provides an efficient implementation, designed to support high quality results and provide the programmer with flexibility.

3.4.1. Algorithms

LOTUS’ implementation builds on and generalizes the LM-based summarization pattern studied by prior research works (Wu et al., 2021; Chang et al., 2024; Adams et al., 2023) and deployed systems (lla, [n. d.]; lan, [n. d.]). These implementations primarily leverage one of two aggregation patterns: either a fold pattern, which produces a sequential, linear pass over the data while iteratively updating an accumulated partial answer, or a hierarchical reduce pattern, which recursively aggregates the input data to produce partial answers until a single answer remains.

3.4.2. Discussion

By default, LOTUS’ aggregation implements the hierarchical pattern, which allows for greater parallelism during query processing and has been shown to produce higher quality results for tasks like summarization in prior work (Chang et al., 2024). However, LOTUS’ partition function can be used to override the default functionality and generalizes the information flow of prior implementations by allowing users to specify arbitrary groupings over the data. Each unique group specified by the partition function will be aggregated with as few LM invocations as possible according to the model context length, before being merged with other groups. This allows the user to perform a fold pattern, as well as arbitrary user-defined patterns.
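The contrast between the two patterns can be sketched as follows. This is a simplified illustration, not LOTUS’ implementation: `combine` is a hypothetical stand-in for a single LM aggregation call, and `fan_in` is an assumed fixed group size standing in for however many partial answers fit in the model’s context window. Each function also reports the number of sequential rounds, which is the quantity that limits parallelism.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for one LM call that merges several partial answers.
Combine = Callable[[List[str]], str]


def fold_aggregate(docs: List[str], combine: Combine) -> Tuple[str, int]:
    """Sequential fold: every call depends on the previous accumulated answer."""
    if not docs:
        raise ValueError("docs must be non-empty")
    acc, rounds = docs[0], 0
    for doc in docs[1:]:
        acc = combine([acc, doc])
        rounds += 1
    return acc, rounds


def hierarchical_aggregate(docs: List[str], combine: Combine,
                           fan_in: int = 2) -> Tuple[str, int]:
    """Hierarchical reduce: calls within a level are independent and batchable."""
    if not docs:
        raise ValueError("docs must be non-empty")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level, rounds = list(docs), 0
    while len(level) > 1:
        groups = [level[i:i + fan_in] for i in range(0, len(level), fan_in)]
        # A leftover singleton group is passed through without an LM call.
        level = [combine(g) if len(g) > 1 else g[0] for g in groups]
        rounds += 1
    return level[0], rounds


def toy_combine(parts: List[str]) -> str:
    return " + ".join(parts)


docs = [f"d{i}" for i in range(8)]
fold_result, fold_rounds = fold_aggregate(docs, toy_combine)
tree_result, tree_rounds = hierarchical_aggregate(docs, toy_combine)
print(fold_rounds, tree_rounds)
```

With eight inputs the fold needs seven sequential rounds while the hierarchical reduce needs three, since each of its levels can be issued as one batch.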

3.4.3. Optimizations

Preliminary results show that achieving high-quality semantic aggregations is non-trivial and sensitive to the documents’ input ordering and grouping. We see qualitative evidence of this in a summarization task of 50 ArXiv paper abstracts, which we show in Figure 21. We contrast the summarization results of a naive semantic aggregation (Figure 22) performed over the paper abstracts with the semantic aggregation performed using a partitioning function based on semantic similarity (Figure 23). The semantic aggregation using the partitioner first clusters documents based on semantic similarity, then aggregates each cluster before performing aggregations across clusters. As the figures show, the two methods result in significantly different summaries. While the latter demonstrates qualitatively more cohesion and effectively abstracts general themes across papers, the former omits details and tends to list low-level details from individual papers rather than capturing higher-level themes. We leave a quantitative study of this to future work and believe that semantic aggregations create a rich design space for optimization.

papers_df.sem_agg("summarize the main topics discussed in the set of papers, given {title} and {abstract} of each paper.")
Figure 21. Summarization task using semantic aggregation
The main topics discussed in the set of papers include the use of large language models for question answering, data management, and hybrid workplace decision support, as well as the importance of easy-to-use instruction processing frameworks for LLMs. The papers also explore the development of a fuzzy approach to record linkages and a list-aware reranking-truncation joint model (GenRT) for search and retrieval-augmented generation. Additionally, the papers discuss the use of PromptCrypt, a mechanism to protect user privacy when using LLMs for communication, and the proposal of a framework, called C-RAG, to certify generation risks for retrieval-augmented language models (RAG). The papers also cover the proposal of a roadmap to pluralistic alignment, specifically using language models as a test bed, and the presentation of a theoretical and empirical analysis of adaptive entry point selection for graph-based approximate nearest neighbor search.

Furthermore, the papers discuss various topics related to artificial intelligence, including natural language processing, machine learning, data engineering, and intelligent marketing. They also analyze the building blocks of decentralized artificial intelligence (DEAI) and their analysis from a bottom-up approach.

In summary, the papers cover a wide range of topics related to the use of large language models, data management, and artificial intelligence, with a focus on improving the performance and efficiency of models in these fields.
Figure 22. Semantic aggregation results using naive input document ordering.
The main topics discussed in the set of papers include:

1. The use of large language models (LLMs) for various applications such as question answering over knowledge graphs, rumor detection on social media, storytelling, and uplift modeling for intelligent marketing.
2. The importance of effective instruction processing and modeling for LLMs, including the need for a standard open-source instruction processing implementation framework and the challenges of constructing high-quality instruction datasets.
3. The need to consider the inter-task impacts and utilize treatment information when modeling uplift effects.
4. The importance of incorporating semantic clues in queries and the use of unified hallucination detection methods for multimodal LLMs.
5. The comparison of different knowledge sources and information retrieval techniques for open-domain scientific claim verification.
6. The use of list-aware retrieval and truncation joint models for search and retrieval-augmented generation.
7. The evolution of information retrieval technology and the comparison of topic modeling approaches in the banking context.
8. The proposal of an efficient ID representation alignment framework for LLM-based recommendation.
9. The importance of modeling structural dependencies among evidence facts for complex question answering over knowledge graphs.
10. The use of post hoc explanations for recommended similar articles and the extraction of table texts from invoice images using deep learning algorithms.
11. The fine-grained complexity of gradient computation for training large language models.

Overall, the papers highlight the potential of LLMs for various applications, the need for effective instruction processing and modeling, and the importance of incorporating semantic clues and utilizing treatment information when modeling uplift effects.
Figure 23. Semantic aggregation results using clustered input document ordering.

3.5. sem_map & sem_extract

Both LOTUS’ semantic map and extract run batched LM calls over a set of rows, prompting the LM with the user’s arbitrary natural-language expression to generate a new column. Both functions can be fully parallelized over rows, and LOTUS implements efficient batched inference with vLLM (Kwon et al., 2023). To implement sem_extract, LOTUS prompts the model to answer the user’s langex with direct quotes, and the system then verifies that these snippets returned by the LM match the reference text.
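The verification step for sem_extract could look roughly like the sketch below. The text above does not say how matching is performed, so the rule used here (exact substring match after collapsing whitespace) is an assumption, and `verify_quotes` is a hypothetical helper name.

```python
from typing import List


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so line breaks do not defeat matching."""
    return " ".join(text.split())


def verify_quotes(quotes: List[str], reference: str) -> List[str]:
    """Keep only the non-empty quotes that appear verbatim in the reference text."""
    ref = normalize(reference)
    return [q for q in quotes if normalize(q) and normalize(q) in ref]


reference = "We use FAISS  for efficient\nvector similarity search."
candidate_quotes = ["FAISS for efficient vector", "uses Milvus", ""]
verified = verify_quotes(candidate_quotes, reference)
print(verified)
```

A stricter or fuzzier matching rule would be a reasonable alternative; the point is only that quotes the model invents are dropped before they reach the output column.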

3.6. Semantic Indexing

LOTUS supports several different algorithms for its semantic index. To generate the semantic similarity index, LOTUS’ sim_index operator first batch processes the user-specified column to generate semantic embeddings with the configured retriever model, then constructs a vector index over the semantic embeddings. The current implementation uses FAISS’ flat index by default and writes the index locally to disk. Users can specify the local file path to store the index, and optionally specify alternative vector indices, such as hierarchical navigable small worlds (HNSW) (Malkov and Yashunin, 2018), inverted indices (IVF) (Baranchuk et al., 2018; Ge et al., 2014; Jégou et al., 2011; Johnson et al., 2017), and locality sensitive hash indices (LSH) (Andoni and Indyk, 2008; Andoni and Razenshteyn, 2015; Indyk and Motwani, 1998; Jafari et al., 2020; Li et al., 2020; Liu et al., 2021; Lu and Kudo, 2020; Park et al., 2015; Sundaram et al., 2013; Zheng et al., 2020; Andoni et al., 2015; Gionis et al., 1999; Gong et al., 2020). In the future, we envision additionally supporting a wider variety of embedding stores and indices (Patel et al., 2024; Wang et al., 2021; vec, [n. d.]; noa, 2023b; chr, [n. d.]; mos, [n. d.]).

LOTUS’ semantic search, similarity join, and cluster operations all leverage the similarity index in their implementation. sem_search first embeds the user’s query string using the configured retriever model, then performs an efficient top-k search using the loaded FAISS index over the search column. sem_sim_join similarly performs a top-k search using the loaded FAISS index over the right key. Here the left key may be a column of natural-language text, which LOTUS will embed on the fly using the retriever model, or an indexed column, for which LOTUS will load the previously generated embeddings. The left-key embeddings then serve as the queries to perform batched search over the search column, given by the right key. Lastly, LOTUS implements sem_cluster_by using FAISS’ optimized k-means library to cluster the user-specified column. Here, the column must be previously indexed, and LOTUS uses the generated embeddings to perform the clustering with the user-specified number of cluster centroids.
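To illustrate what a flat index does during a batched top-k search, the sketch below implements an exhaustive inner-product index in plain Python. It is a conceptual stand-in only: the `FlatIndex` class and its `add` and `search` methods are hypothetical names for this example and are not the FAISS API, which performs the same exhaustive comparison with optimized kernels.

```python
import heapq
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


class FlatIndex:
    """Exhaustive index: every query is scored against every stored vector."""

    def __init__(self) -> None:
        self.vectors: List[Vector] = []

    def add(self, vectors: List[Vector]) -> None:
        self.vectors.extend(vectors)

    def search(self, queries: List[Vector], k: int) -> List[List[int]]:
        """For each query, return the ids of the k highest-scoring vectors."""
        results: List[List[int]] = []
        for q in queries:
            scored = [(dot(q, v), i) for i, v in enumerate(self.vectors)]
            results.append([i for _, i in heapq.nlargest(k, scored)])
        return results


# Index the "right key" embeddings, then search with a batch of query embeddings.
index = FlatIndex()
index.add([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
neighbors = index.search([[1.0, 0.0], [0.0, 1.0]], k=2)
print(neighbors)
```

A single query corresponds to the sem_search case, while passing one query per row of the other table corresponds to the batched search used by the similarity join.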

4. Evaluation

We now evaluate LOTUS’ programmability and efficiency through three diverse applications: fact-checking (Section 4.1), extreme multi-label classification (Section 4.2), and search and ranking (Section 4.3). For each of these applications, we see that state-of-the-art quality results are achievable with low development overhead, using LOTUS programs with a few lines of code and a few or even one semantic operator. In addition, our results demonstrate interesting implementation and optimization choices introduced by semantic operators. Specifically, we find that:

  • On the FEVER dataset (Thorne et al., 2018) for fact-checking, LOTUS programs can reproduce FacTool (Chern et al., 2023), a recent state-of-the-art pipeline, in a few lines of code, and implement a new pipeline with a simple change of operators that improves accuracy by 9.5%, while offering 7-34× lower execution time.

  • On the BioDEX dataset (D’Oosterlinck et al., 2023) for the extreme multi-label classification task, LOTUS reproduces state-of-the-art result quality (D’Oosterlinck et al., 2024) with its join operator, while providing an efficient algorithm that runs up to 800× faster than a naive join, demonstrating the power of LOTUS’ declarative interface.

  • On the SciFact dataset (Thakur et al., 2021) and two newly-constructed paper datasets for the search and ranking application, LOTUS’ semantic top-$k$ operator achieves 5.9-49.4% higher nDCG@10 than a vanilla retriever and re-ranker, while also running 1.67-10× faster than alternative LM-based ranking methods used by prior works (Qin et al., 2024).

Unless otherwise stated, we run our local model experiments with 4 A100 GPUs using Llama 3 models (int, [n. d.]), with a batch size of 64 running on vLLM (Kwon et al., 2023). For our experiments that use OpenAI’s GPT models (ope, [n. d.]), we run with 64-way thread parallelism.

4.1. Application: Fact-Checking

4.1.1. Dataset

We evaluate on FEVER (Thorne et al., 2018), a claim verification dataset. We use the development dataset, which contains about 38,000 total claims, of which we sample 500 for our evaluation. Each claim is labeled with one of three labels, “Supported”, “Refuted”, or “NotEnoughInfo”, and the task is to correctly determine the label of each claim, leveraging evidence from a corpus of 5.5 million Wikipedia articles. Following prior work (Chern et al., 2023), we merge the latter two labels into a single class, “Not Supported”, for our evaluation.

4.1.2. Baselines

FacTool (Chern et al., 2023) is a recent research work that proposes a multi-step pipeline for fact-checking involving claim extraction, query generation, tool querying, evidence collection, and verification. We use FacTool’s open source codebase (gai, [n. d.]) to measure its performance. FacTool’s pipeline, by default, performs retrieval with a Google Search API (cus, [n. d.]). We evaluate the pipeline with both the default retrieval API, and alternatively test with a ColBERT (Khattab and Zaharia, 2020) index over the document corpus to perform retrieval. We find that the results are similar, and we report the results using ColBERT for retrieval to hold the retriever model constant with the implemented LOTUS programs.

Table 2. Fact-checking Results on the FEVER Dataset.

| Method | Accuracy | Execution Time, batched (s) | Execution Time, no batching (s) | LoC |
|---|---|---|---|---|
| FacTool | 83.5 | N/A | 1,174.8 | > 750 |
| LOTUS-FacTool | 90.0 | 111.48 | 979.50 | < 50 |
| LOTUS-fact-filter | 90.5 | 64.28 | 206.45 | < 50 |
| LOTUS-fact-filter (+ cascades) | 93 | 34.09 | 168.24 | < 50 |
| LOTUS-fact-join | 86.5 | 2,394.22 | 12,923.08 | < 50 |

4.1.3. LOTUS Programs

We compose several intuitive LOTUS programs, each in fewer than 50 lines of code. For each one, we use ColBERT as the retriever model for creating the semantic index, Llama 70B as the primary LM, and Llama 8B and TinyLlama for cascade optimizations.

First, we compose a pipeline designed to directly re-implement FacTool’s information flow in LOTUS. Figure 24 shows the pseudocode for the LOTUS-FacTool pipeline. The two tables are shown by wiki_df, which stores the Wikipedia articles, and claim_df, which stores the claims from the FEVER dataset. After loading the semantic index to wiki_df, the pipeline first performs a semantic map over each claim to generate two search queries, which are then used to perform a semantic similarity join over the corpus of Wikipedia articles. The program then concatenates the context retrieved for each claim, and performs a sem_map to output whether the claim is true or false, along with a revised claim if the claim is false. We use the same prompts found in FacTool (gai, [n. d.]), which include 3 demonstrations for generating search queries in the first sem_map, and chain-of-thought prompting in the second sem_map.

wiki_df.load_sem_index("article", "index_dir")

# map each claim to search queries, retrieve articles, concatenate them per claim, then verify
claim_df.sem_map("write 2 search queries given the {claim}", name="query")\
 .sem_sim_join(wiki_df, left_on="query", right_on="article", K=10)\
 .groupby("claim")["article"].apply("\n".join).reset_index(name="articles")\
 .sem_map("Identify whether there are any factual errors in the {claim} based on the {articles}. Include your reasoning, any errors found in the claim, and the factuality of the claim.")
Figure 24. LOTUS-FacTool pipeline, using semantic map, sim-join, and map for fact-checking

The next program, LOTUS-fact-filter, makes a simple, single-operator modification to the LOTUS-FacTool program, replacing the semantic map at the end of the query pipeline with LOTUS’ semantic filter operation. We use 3 demonstrations for the filter. Figure 25 shows the pseudocode for this program. We evaluate this program with and without model cascades for the semantic filter operation.

wiki_df.load_sem_index("article", "index_dir")

# map each claim to search queries, retrieve articles, concatenate them per claim, then filter
claim_df.sem_map("write 2 search queries given the {claim}", name="query")\
 .sem_sim_join(wiki_df, left_on="query", right_on="article", K=10)\
 .groupby("claim")["article"].apply("\n".join).reset_index(name="context")\
 .sem_filter("given the {context}, the {claim} is factual.", confidence_threshold=0.9)
Figure 25. LOTUS-fact-filter pipeline, using semantic map, sim-join, and filter for fact-checking

Lastly, we compose an alternative pipeline, LOTUS-fact-join. As the pseudocode in Figure 26 shows, the pipeline first performs a semantic map over each claim to obtain a set of sub-claims. From this, we create the claimed_facts_df, which separates each sub-claim into a different row. Next, the pipeline uses these sub-claims to perform a semantic similarity join over the Wikipedia corpus, then performs a semantic map over each retrieved article to generate the important facts in each one. Lastly, the program performs a semantic join between the sub-claims and the facts described in the retrieved passages. If the returned table contains a supporting fact for each sub-claim, the claim is labeled as “Supported”. For the sem_map and join operations, we use 3 demonstrations each.

import pandas as pd

wiki_df.load_sem_index("article", "index_dir")

for claim in claims_df["claim"]:
    # extract the sub-claims made by this claim
    df = pd.DataFrame({"claim": [claim]})\
        .sem_map("what sub-claims are made in the {claim}", name="claimed_facts")\
        .apply(lambda x: x["claimed_facts"].split(","))
    # one row per sub-claim
    claimed_facts_df = pd.DataFrame({"claims": df["claims"]})
    claimed_facts_df\
        .sem_sim_join(wiki_df, left_on="claim", right_on="articles", K=20, n_rerank=10)\
        .sem_map("summarize the important facts in the {article}", name="facts")\
        .sem_join(claimed_facts_df, "is the {claimed_facts:right} verified by the {facts:left}")
Figure 26. LOTUS-fact-join pipeline, using semantic map, sim-join, map, and join for fact-checking
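The labeling rule that follows the final semantic join can be sketched in plain Python. This is only an illustration under stated assumptions, not LOTUS code: the function name label_claim, the representation of the join output as (sub-claim, fact) pairs, and the fallback label "Not Supported" are all hypothetical choices made for this example.

```python
from typing import Iterable, List, Tuple


def label_claim(sub_claims: List[str],
                joined_pairs: Iterable[Tuple[str, str]]) -> str:
    """Label a claim from the rows that survive the final semantic join.

    sub_claims:   sub-claims extracted from the claim (hypothetical input).
    joined_pairs: (sub_claim, supporting_fact) rows returned by the join.
    """
    supported = {sub_claim for sub_claim, _fact in joined_pairs}
    # The claim is supported only if every sub-claim matched at least one fact.
    if sub_claims and all(s in supported for s in sub_claims):
        return "Supported"
    # Assumption: any other outcome maps to a single fallback label.
    return "Not Supported"
```

Under this sketch, a claim with two sub-claims is labeled "Supported" only when the join output contains at least one row for each of them.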

4.1.4. Results

The fact-checking results table demonstrates the powerful abstraction that LOTUS provides, allowing programmers to quickly write and test programs that compose simple operators to obtain state-of-the-art results. We report the accuracy of each benchmarked method, an estimate of lines of code (LoC), and execution time in seconds both with and without batching. FacTool’s implementation offers strong accuracy on the FEVER dataset; however, the full repository required several hundred lines of code, highlighting the development burden of building these applications without abstractions for bulk semantic processing. By contrast, each LOTUS program offers comparable or higher accuracy in relatively few lines of code. We also note that FacTool implements its method without batching, whereas LOTUS, by default, leverages batched LM execution for efficiency. To provide an apples-to-apples comparison, we compare FacTool’s un-batched implementation to the LOTUS programs both with and without batching.

First, we see from the fact-checking results table that LOTUS-FacTool is able to reproduce the result quality and efficiency of the original method’s implementation, with 6.5 points higher accuracy and 1.2× lower execution time without batching. The LOTUS-FacTool implementation with batching further decreases execution time compared to the original FacTool implementation by 10×. We see that the next LOTUS pipeline simply changes a single operation, switching the sem_map to a sem_filter, and maintains similar result quality to LOTUS-FacTool while further reducing execution time by 1.72× in the batched implementation.

Leveraging LOTUS’ filter operation allows the programmer to further optimize the program using model cascades, which increases accuracy by 3 points and reduces the batched execution time by 3.27× compared to LOTUS-FacTool. Figure 27 highlights the diverse performance trade-offs presented by the model cascade optimization used in LOTUS’ sem_filter. The plot shows the accuracy and execution time of the LOTUS-fact-filter pipeline using a single model, either Llama 8B, Llama 70B, or TinyLlama (Zhang et al., 2024), shown by the circles. We compare this to the performance attainable using a pair of models, either Llama 8B and Llama 70B, or TinyLlama and Llama 70B, to implement the filter cascade. We generate multiple cascade points, shown by the stars, by varying the confidence threshold used. The plot shows that this filter optimization can substantially reduce execution time, and offer diverse accuracy trade-offs, compared to implementing the pipeline with the oracle model, Llama 70B, alone.
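To make the role of the confidence threshold concrete, the following is a minimal sketch of a two-model filter cascade. It is a simplification under stated assumptions rather than LOTUS’ implementation: the names cascade_filter, Proxy, and Oracle are hypothetical, the small model is assumed to return a confidence score alongside its prediction, and a single threshold decides when to defer to the large model.

```python
from typing import Callable, List, Tuple

# Hypothetical model interfaces: the proxy (small model) returns a prediction
# together with a confidence score; the oracle (large model) returns only a
# prediction.
Proxy = Callable[[str], Tuple[bool, float]]
Oracle = Callable[[str], bool]


def cascade_filter(rows: List[str], proxy: Proxy, oracle: Oracle,
                   confidence_threshold: float) -> Tuple[List[bool], int]:
    """Return one decision per row and the number of oracle calls made."""
    decisions: List[bool] = []
    oracle_calls = 0
    for row in rows:
        prediction, confidence = proxy(row)
        if confidence < confidence_threshold:
            # The small model is unsure, so defer to the large model.
            prediction = oracle(row)
            oracle_calls += 1
        decisions.append(prediction)
    return decisions, oracle_calls
```

Raising the threshold routes more rows to the large model, which costs more time; lowering it keeps more of the small model’s answers. Sweeping the threshold in this way is what produces several points per cascade.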

Returning to the fact-checking results table, we find that the last LOTUS pipeline, LOTUS-fact-join (search-map-join), reduces accuracy and increases execution time. Notably, the performance trade-offs of different LOTUS programs are non-obvious, highlighting the need for programmable, declarative abstractions so programmers can quickly explore and iterate on their query pipelines. Each LOTUS program we described can be implemented easily in relatively few lines of code, and the best performing one offers 9.5% higher accuracy and 34× lower execution time than FacTool’s original, un-batched implementation.

Figure 27. Accuracy versus execution time (s) for the LOTUS-fact-filter pipeline, with and without cascades applied to the filter operation on the FEVER dataset for fact-checking. We compare the pipeline implemented with no cascades using a single model, shown by the colored circles, to the pipeline implemented using cascades with two models, shown by the stars. By varying the confidence threshold specified, we generate several points for each cascade.
Table 3. Extreme Multi-label Classification Results on BioDEX Dataset with Llama-70b

| Method | RP@5 | RP@10 | Execution Time (s) | # LM Calls |
|---|---|---|---|---|
| Semantic Similarity Join | 0.106 | 0.120 | 2.91 | 0.00 |
| LOTUS Semantic Join (map-search-filter pattern) | 0.241 | 0.258 | 2,762 | 7,750 |
| LOTUS Semantic Join (nested-loop pattern) | N/A | N/A | 2,144,560* | 6,092,500 |
| LOTUS Semantic Join (search-filter pattern) | 0.155 | 0.186 | 2,640 | 7,500 |

* Estimated under linear-scaling assumption in number of batch calls.

4.2. Application: Extreme Multi-label Classification

4.2.1. Dataset

We evaluate on the BioDEX dataset (D’Oosterlinck et al., 2023), which consists of a corpus of 65,000 biomedical articles, and expert-created drug safety reports constructed from each article. The task is to correctly label the drug reactions experienced by the patient in each medical article. We sample 250 patient articles for our evaluation. Notably, there are approximately 24,000 possible drug-reaction labels, making this an extreme multi-label classification task. Due to the large number of possible labels, leveraging an LM to perform inference is difficult, and this setting has been studied in prior works (D’Oosterlinck et al., 2024). We show below that this task can be efficiently modeled and programmed using LOTUS’ semantic join.

4.2.2. Baselines

We consider a simple retrieval baseline which uses an E5Model (Wang et al., 2024) as the retriever and performs a semantic-similarity join between the patient articles and the reaction labels. We show the pseudocode for this program in Figure 28.

articles_df.load_sem_index("article", "article_idx_dir")
reaction_labels_df.load_sem_index("drug_reaction", "rxn_idx_dir")

articles_df.sem_sim_join(reaction_labels_df, left_on="article", right_on="drug_reaction", K=20)
Figure 28. Extreme multi-label classification retrieval baseline using sim-join

4.2.3. LOTUS Programs

The proposed LOTUS program performs a semantic join over the drug reaction labels and the medical articles, as shown in Figure 29. We perform the semantic join using at most 7 map demonstrations and set an LM call budget of 10,000. We use Llama-70b as the LM.

articles_df.load_sem_index("article", "article_idx_dir")
reaction_labels_df.load_sem_index("drug_reaction", "rxn_idx_dir")

articles_df.sem_join(reaction_labels_df, "The {article} indicates that the patient is experiencing the {drug_reaction}", map_dems=dems, call_budget=10000)
Figure 29. Extreme multi-label classification LOTUS program

4.2.4. Results

Table 3 shows that the proposed LOTUS program makes meaningful progress on this task, obtaining high-quality results while maintaining query efficiency. The table reports rank-precision@5 (RP@5) and rank-precision@10 (RP@10), following prior work (D’Oosterlinck et al., 2024), as well as execution time in seconds and the number of LM calls required for each program. Since the nested-loop join pattern is prohibitively expensive to run, we show an estimate of its execution time assuming linear scaling in the number of batched calls.
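For readers unfamiliar with the metric, the sketch below shows one common way to compute rank-precision@K: the number of correct labels among the top K predictions, divided by min(K, number of gold labels), averaged over examples. This reflects our reading of the metric used in prior work and should be treated as an assumption; the function names and the example labels are hypothetical, and the cited work remains the authoritative definition.

```python
from typing import List, Sequence, Set


def rank_precision_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Rank-precision@k for one example (assumes `ranked` has no duplicates)."""
    if not gold:
        return 0.0
    hits = sum(1 for label in ranked[:k] if label in gold)
    # Normalize by min(k, |gold|) so a perfect ranking always scores 1.0.
    return hits / min(k, len(gold))


def mean_rank_precision_at_k(all_ranked: List[Sequence[str]],
                             all_gold: List[Set[str]], k: int) -> float:
    """Average rank-precision@k over a set of examples."""
    scores = [rank_precision_at_k(r, g, k) for r, g in zip(all_ranked, all_gold)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, with three gold labels and two of them appearing in the top 5 predictions, the score is 2 / min(5, 3) = 2/3.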

First, we compare the performance of the LOTUS join program implemented with the map-search-filter pattern to the baseline method, the semantic similarity join. As expected, the LOTUS program offers substantially higher quality results, with 2.27× and 2.15× higher RP@5 and RP@10 respectively compared to the retrieval-based similarity join. This highlights the effectiveness of leveraging LMs over the data for complex reasoning-based tasks. We informally compare these accuracy results to recent work (D’Oosterlinck et al., 2024), which composes a multi-step DSPy (Khattab et al., 2023) program compiled using Llama-2-7b-chat as the student LM and GPT-3.5-turbo as the teacher LM to perform a semantic mapping, followed by a re-ranking step using GPT-4 turbo. D’Oosterlinck et al. report 24.73 RP@5 and 27.67 RP@10 for the compiled program (on a 0–100 scale, i.e., roughly 0.247 and 0.277 on the scale of Table 3), representing comparable result quality to LOTUS’ program, although notably the LOTUS program was not compiled with a prompt optimization system.

We now consider several interesting performance trade-offs presented by the semantic join algorithm. We compare the LOTUS map-search-filter join pattern to the naive nested-loop join algorithm and the search-filter join pattern shown in Table 3. We see that the nested-loop join pattern, which involves a quadratic LM budget of over 6 million LM calls, is untenable and prohibitively costly. By contrast, the search-filter and map-search-filter patterns reduce the LM call budget of the naive algorithm by roughly 800×, using an approximation. While these two approximation patterns have similar efficiency, according to execution time and the number of LM calls, they interestingly offer substantially different result quality. Specifically, the map-search-filter pattern offers 55% higher RP@5 and 38% higher RP@10 compared to the search-filter pattern on this task. These results highlight the unique opportunities bulk semantic processing pipelines present for designing new algorithms and optimizations.
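The LM call counts in Table 3 can be reproduced with simple arithmetic, which helps explain where the roughly 800× reduction comes from. The sketch below is a back-of-the-envelope model, not LOTUS code. The label count (24,370) and the number of retrieved candidates per article (30) are assumptions inferred by dividing the table’s call counts by the 250 sampled articles; the assumption that the map step costs one LM call per article is likewise inferred from the table.

```python
def nested_loop_calls(n_left: int, n_right: int) -> int:
    """One LM call per (left row, right row) pair."""
    return n_left * n_right


def search_filter_calls(n_left: int, candidates_per_row: int) -> int:
    """One LM filter call per candidate retrieved for each left row."""
    return n_left * candidates_per_row


def map_search_filter_calls(n_left: int, candidates_per_row: int) -> int:
    """One LM map call per left row, then one filter call per candidate."""
    return n_left + n_left * candidates_per_row


N_ARTICLES = 250   # sampled patient articles (stated in the text)
N_LABELS = 24_370  # assumption: 6,092,500 / 250, i.e. "approximately 24,000"
CANDIDATES = 30    # assumption: 7,500 / 250, inferred from Table 3

NESTED = nested_loop_calls(N_ARTICLES, N_LABELS)            # 6,092,500
SEARCH_FILTER = search_filter_calls(N_ARTICLES, CANDIDATES) # 7,500
MAP_SEARCH_FILTER = map_search_filter_calls(N_ARTICLES, CANDIDATES)  # 7,750
```

Under these assumptions, NESTED / MAP_SEARCH_FILTER is about 786 and NESTED / SEARCH_FILTER is about 812, both consistent with the reported reduction of roughly 800×.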

4.3. Application: Search & Ranking

4.3.1. Dataset

We evaluate LOTUS’ performance on the search and ranking task using three datasets, including BEIR’s SciFact test set (Thakur et al., 2021), a widely used benchmark for retrieval and re-ranking, as well as two new benchmarks, CIFAR-bench, and HellaSwag-bench, which we generate to evaluate more complex, reasoning-based ranking criteria over the data. For each, we report nDCG@10, as well as the execution time (ET) in seconds.

The SciFact dataset consists of a set of scientific claims and a corpus of articles, where the task is to rank articles by relevance given each scientific claim. We sample 300 scientific facts for our evaluation, and report the average ranking execution time across these samples.

While the SciFact dataset provides a sorting task based on a simple relevance criterion, our newly proposed benchmarks provide more complex sorting criteria over a corpus of paper abstracts. Specifically, the ranking task is to find the papers that report the highest accuracy on CIFAR-10 and HellaSwag in CIFAR-bench and HellaSwag-bench respectively. To generate CIFAR-bench, we took 100 abstracts from the Papers with Code Dataset (pap, [n. d.]) that state performance on CIFAR-10 in the abstract, and we manually labeled their accuracy to obtain the top-10. We then synthetically generated HellaSwag-bench by prompting Llama-70B to create 200 paper abstracts, each with a specified accuracy value, randomly sampled from 0–100%. This setup allows us to evaluate LOTUS’ LM-based sorting algorithms on a task with objective ground truth. We note that an alternative approach to these tasks could efficiently leverage sem_map to extract accuracy values from the abstracts in either dataset, then perform a structured sort. However, our evaluation focuses on assessing the semantic ranking capabilities of LOTUS’ top-k algorithms, and we find this benchmark useful for understanding performance trade-offs, which likely generalize to a wider set of reasoning-based sorting queries, such as “which paper makes the most outrageous claim”, for which ground truth is less objective to evaluate. For CIFAR-bench and HellaSwag-bench we report results for n = 20 trials of the ranking task, at temperature t = 0.7, similar to prior works (Khattab et al., 2023).
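As a reminder of how the reported metric behaves, the following is a minimal sketch of nDCG@k with linear gains and a log2 position discount. It is a generic textbook formulation, not the evaluation code used for these experiments; the function names are hypothetical, and other variants of the gain function (for example, exponential gains) exist.

```python
import math
from typing import Dict, Sequence


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: Sequence[str], relevance: Dict[str, float],
              k: int = 10) -> float:
    """nDCG@k of a ranking, given a relevance score per relevant document."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    # The ideal ranking places the most relevant documents first.
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that places all relevant documents first scores 1.0, and pushing a relevant document further down the list lowers the score through the logarithmic discount.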

Table 4. Ranking Results on SciFact

| Method | nDCG@10 | Execution Time (s) |
|---|---|---|
| Semantic Search | 0.712 | 0.009 |
| Semantic Search + Reranker | 0.741 | 2.64 |
| LOTUS Top-k (quickselect + sem-index) - Llama-70B | 0.775 | 33.6 |
| LOTUS Top-k (quickselect + sem-index) - GPT-4o | 0.800 | 11.2 |

Table 5. Ranking Results on CIFAR-bench and HellaSwag-bench Datasets

| Method | CIFAR nDCG@10 | CIFAR Execution Time (s) | HellaSwag nDCG@10 | HellaSwag Execution Time (s) |
|---|---|---|---|---|
| Semantic Search | 0.252 | 0.008 | 0.119 | 0.008 |
| Semantic Search + Reranker | 0.001 | 2.57 | 0.461 | 2.36 |
| LOTUS Top-k (quickselect) - Llama-70B | 0.746 | 41.3 | 0.909 | 59.1 |

4.3.2. Baselines

We consider two simple baselines. The first baseline performs semantic search, as shown in Figure 30, using the E5Model (Wang et al., 2024) for retrieval. The second baseline performs search with re-ranking, as the pseudocode shows in Figure 31, using the E5Model for retrieval and the MixedBread cross-encoder (noa, 2024) for re-ranking.

corpus_df.load_sem_index("article", "index_dir")\
    .sem_search("article", query, K=10)
Figure 30. Semantic search baseline for ranking task
corpus_df.load_sem_index("article", "index_dir")\
    .sem_search("article", query, K=100, n_rerank=10)
Figure 31. Semantic search with re-ranking baseline for ranking task

4.3.3. LOTUS Programs

The proposed LOTUS program performs a semantic top-k over the documents, as shown in Figure 32, which shows example pseudocode for the CIFAR-bench dataset. The langex for the semantic top-k on SciFact sorts based on relevance to the given claim, while the langex for the CIFAR-bench and HellaSwag-bench datasets sorts abstracts by accuracy performance on the respective datasets, as shown in the figure. We note that for the SciFact dataset, we perform a semantic search using the E5Model as the retriever to obtain 100 articles, before ranking them with the sem_topk. We report results using both Llama-70B and GPT-4o as the LM.

corpus_df.load_sem_index("article", "index_dir")\
    .sem_topk("the {doc} provides the best performance on CIFAR-10", K=10)
Figure 32. Proposed LOTUS program with semantic top-k for CIFAR-bench

4.3.4. Results

Table 6. Comparison of Ranking Results for Different Semantic Top-k Algorithms using Llama-70B

| Method | SciFact nDCG@10 | SciFact ET (s) | SciFact # LM Calls | CIFAR nDCG@10 | CIFAR ET (s) | CIFAR # LM Calls | HellaSwag nDCG@10 | HellaSwag ET (s) | HellaSwag # LM Calls |
|---|---|---|---|---|---|---|---|---|---|
| Quadratic Sort | 0.836 | 712 | 4950 | 0.868 | 634 | 4950 | 0.966 | 1,803 | 19,900 |
| Heap Top-k | 0.776 | 65.0 | 216 | 0.832 | 99.6 | 350 | 0.907 | 98.9 | 415.2 |
| QuickSelect Top-k | 0.776 | 42.4 | 285 | 0.746 | 41.3 | 303.95 | 0.909 | 59.1 | 620.95 |
| QuickSelect Top-k + Semantic Index | 0.775 | 33.6 | 229 | 0.710 | 39.6 | 307.7 | 0.975 | 63.6 | 672.7 |

Tables 4 and 5 demonstrate the effectiveness of LOTUS’ semantic top-k operator for complex search tasks. In addition, Table 6 and Figure 33 highlight the rich implementation design space that this task presents. We walk through several noteworthy findings.

We first turn our attention to Table 4, which presents the results for each benchmarked method on the SciFact dataset, which uses a relevance-based sorting criterion. As expected, both semantic search programs, with and without re-ranking, present strong baselines. Specifically, the re-ranker model, which is a supervised model trained specifically for the task of relevance-based ranking, increases nDCG@10 by 3 percentage points, while trading off query efficiency compared to the semantic search baseline. Notably, the unsupervised LM-based LOTUS programs outperform the supervised re-ranker’s result quality. The table shows the LOTUS semantic top-k program with Llama-70B and GPT-4o, which outperform the semantic search baseline by 6 and 9 percentage points respectively. The LOTUS programs with Llama-70B and GPT-4o also outperform the re-ranker by 3 and 6 points respectively, improving upon the quality of the supervised baseline. As expected, the improved result quality comes at a cost to query efficiency due to the LM calls. Notably, LOTUS’ versatility allows programmers to easily compose each of these query pipelines and trade off result quality and efficiency depending on application-specific requirements.

Turning our attention to Table 5, we study LOTUS’ generality in supporting arbitrary language-based ranking criteria over the dataset. On the CIFAR-bench and HellaSwag-bench datasets, which use complex sorting criteria, we see that both semantic search baselines, with and without re-ranking, provide poor result quality with consistently low nDCG@10. The LOTUS program, using a semantic top-k with Llama 70B, achieves significant accuracy gains, with 49.4 and 44.8 points higher nDCG@10 than the best performing baseline on CIFAR-bench and HellaSwag-bench respectively. These accuracy gains reflect the powerful reasoning capabilities of LMs efficiently orchestrated over the data. As expected, these significant accuracy gains come at an increase in execution time.

We now analyze the efficiency of LOTUS’ semantic top-k implementation along with its proposed optimizations. Table 6 compares several semantic top-k algorithms, namely a quick-select top-k algorithm, a quick-select top-k that leverages the similarity index for pivot selection, a heap-based top-k algorithm, and a quadratic sorting algorithm. First, we see that the quadratic algorithm, which performs an LM comparison between each pair of input documents, offers consistently high result quality across each dataset. However, this method is prohibitively expensive, requiring 16–30× more LM calls and over 10× higher execution time than the alternative implementations. The heap top-k and quick-select top-k methods offer comparable result quality, but with interesting trade-offs in query efficiency. Notably, the quick-select top-k method offers 1.67–2.24× lower execution time than the heap-based sorting method across all datasets, despite requiring more LM calls in some cases. This is because the quick-select top-k implementation allows for efficient batch-processing in each round of the algorithm, whereas the heap-based top-k incurs sequential LM calls during heap updates. For this reason, our current implementation leverages the quick-select top-k algorithm, although we envision future iterations may leverage multiple algorithms and allow the user to declaratively trade off accuracy, query throughput, and cost.
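The batching argument can be illustrated with a small quick-select top-k sketch in which every round compares all remaining items against one pivot in a single batch. This is an illustrative sketch under stated assumptions, not LOTUS’ implementation: the comparator interface BatchCompare, the random pivot choice, and the function names are hypothetical, and the result is returned unordered. The helper pairwise_comparisons gives the cost of the quadratic baseline, n(n-1)/2, which matches the 4,950 and 19,900 LM calls reported in Table 6 for 100 and 200 documents.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical batched comparator: given (candidate, pivot) pairs, return one
# boolean per pair, True when the candidate should rank above the pivot.
BatchCompare = Callable[[List[Tuple[str, str]]], List[bool]]


def pairwise_comparisons(n: int) -> int:
    """LM comparisons needed by a quadratic sort over n documents."""
    return n * (n - 1) // 2


def quickselect_topk(items: Sequence[str], k: int, compare_batch: BatchCompare,
                     rng: random.Random) -> Tuple[List[str], int, int]:
    """Return (unordered top-k items, comparator calls, batched rounds)."""
    pool = list(items)
    selected: List[str] = []
    calls = rounds = 0
    while pool and len(selected) < k:
        pivot = pool.pop(rng.randrange(len(pool)))
        verdicts: List[bool] = []
        if pool:
            # All remaining items are compared to the pivot in one batch.
            verdicts = compare_batch([(item, pivot) for item in pool])
            calls += len(pool)
            rounds += 1
        better = [x for x, v in zip(pool, verdicts) if v]
        worse = [x for x, v in zip(pool, verdicts) if not v]
        need = k - len(selected)
        if len(better) >= need:
            pool = better                # the rest of the top-k is above the pivot
        else:
            selected += better + [pivot]
            pool = worse                 # keep searching below the pivot
    return selected, calls, rounds
```

In this sketch the number of sequential steps equals the number of rounds rather than the number of comparisons, which is why a method that issues more LM calls can still finish sooner when each round is executed as one batch.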

In addition to providing an efficient top-k algorithm, the table also demonstrates the use of LOTUS’ similarity index for optimizing top-k query performance. The quick-select top-k algorithm optimized with the semantic similarity index for pivot selection demonstrates 1.2× lower execution time at no accuracy loss on SciFact, where the ranking criterion likely correlates with semantic similarity. On the other hand, for the CIFAR-bench and HellaSwag-bench datasets, where the ranking criterion does not correlate with semantic similarity, we see that the similarity index has no significant impact on the accuracy or efficiency of the top-k operator.

Figure 33. Accuracy versus execution time (s) on the HellaSwag-bench dataset for ranking using LOTUS’ semantic top-k, with and without cascades applied. We compare the operator implemented using a single model and no cascades, shown by the colored circles, to the implementation using cascades with two models, shown by the blue stars. By varying the confidence threshold specified for the cascade, we generate several points that trade-off accuracy and execution time.

Lastly, we analyze the impact of LOTUS’ cascade optimization applied to the quickselect top-k implementation. Figure 33 compares the nDCG and execution time of implementing the operator with a single model, either Llama 8B or Llama 70B, to the implementation that leverages these models together using model cascades. We vary the confidence threshold of the cascade to generate several points in the trade-off space. We find that the cascade optimization offers diverse performance trade-offs, which can outperform the single-model oracle baseline, which uses Llama-70B. For instance, one cascade along the Pareto frontier improves accuracy of the Llama-70B baseline by 3%, while reducing execution time by 1.8×, demonstrating substantial opportunities for automatically optimizing LOTUS’ semantic query pipelines.
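Given a set of (execution time, accuracy) points produced by sweeping the cascade threshold, the Pareto frontier can be extracted with a short routine such as the sketch below. The function name and the point representation are hypothetical, and any numbers passed to it in an example would be made up for illustration rather than taken from Figure 33.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (execution_time_seconds, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep the points that no faster-or-equal point matches or beats in accuracy."""
    frontier: List[Point] = []
    best_accuracy = float("-inf")
    # Sort by time ascending; among equal times, highest accuracy first.
    for time_s, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_accuracy:
            frontier.append((time_s, acc))
            best_accuracy = acc
    return frontier
```

A configuration is worth considering only if it lies on this frontier: every other point is both slower and no more accurate than some alternative.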

5. Related Work

Specialized LLM-based Relational Extensions. Several prior works extend relational languages with a set of logically row-wise LM-based operations to serve specialized tasks or applications. Palimpzest (Liu et al., 2024b) presents a declarative approach to data cleaning and extract-transform-load (ETL) tasks. The authors propose to automatically optimize relational operators with LM calls, and implement two row-wise relational operators, a newly proposed convert operator, which can be transparently optimized to perform entity extraction using LLMs, and an AI-based filter operation, logically similar to LOTUS’ sem_filter. The system also implements several query optimizations, such as operator re-ordering, model selection, and code synthesis to implement user queries, and proposes several others as future work.

SUQL (Liu et al., 2024c) presents a SQL extension to support conversational agents with knowledge grounding over structured and unstructured data. Specifically, the system extends SQL with two new logically row-wise operators, answer, which prompts an LM to answer the user question over each row, and summary, which prompts an LM to provide a summary to the user over each row. The system can provide automatic optimizations, such as predicate re-ordering using a lazy evaluation approach and proposes to use retrieval to optimize some answer queries.

ZenDB (Lin et al., 2024) and EVAPORATE (Arora et al., 2023) tackle the task of automatically ingesting and extracting semi-structured documents into structured tables that can be queried using standard relational operators and languages. ZenDB extracts structure using a semantic hierarchical tree index, which is integrated with a SQL query engine to support efficient query processing over the extracted attribute values. The system implements several optimizations, including predicate reordering, push-down, and projection pull-up. Additionally, EVAPORATE performs efficient entity extraction from semi-structured data using LM-based code generation and weak supervision to ensemble candidate functions.

LOTUS, in contrast to these prior works, defines a general-purpose programming model designed to capture broad-ranging applications with a core set of composable semantic operators, including both logically row-wise ones and more complex ones, such as joins, aggregation, ranking and search functions. In LOTUS’ current implementation, users can use sem_map to perform entity extraction over unstructured text fields, although future work may provide native functionality for entity extraction by integrating systems or optimizations of prior work. Additionally, we believe that several optimizations leveraged in prior work, such as lazy evaluation, operator reordering, model selection, and code synthesis, are worthwhile future work for LOTUS’ optimizer.

LLM UDFs. Recent research work (Liu et al., 2024a) and existing analytical database vendors, such as Google BigQuery (ver, [n. d.]), Databricks (dat, [n. d.]) and AWS Redshift (noa, 2023a), alternatively offer LLM user-defined functions (UDFs) to the programmer. The LLM UDF programming model provides a lower-level interface, which is limited to logically row-wise LLM execution over the data, equivalent to LOTUS’ sem_map. In contrast, LOTUS’ programming model is declarative and provides a rich set of semantic operators, allowing the system to automatically orchestrate the LM to serve a variety of complex query patterns, including aggregations, ranking, and joins.

Liu et al. (Liu et al., 2024a) study how to optimize LLM UDF functions, demonstrating performance gains with a de-duplication method and a prefix-sharing maximization method that reorders rows and re-formulates parameterized prompts to maximize key-value (KV) cache reuse during query execution. We believe these methods could be effective optimizations in future work at LOTUS’ batched execution layer.

LM Programming Frameworks. Recent LM programming frameworks, such as LangChain (lan, [n. d.]), LlamaIndex (lla, [n. d.]), and DSPy (Khattab et al., 2023), have gained significant popularity. These systems provide a set of abstractions for programming with LMs, including utilities for handling prompts and post-processing LM outputs, and support for common use cases, such as RAG, chat-bots, and function-calling. DSPy focuses on abstracting LM pipelines as programs and provides automatic prompt optimization. In contrast to these systems, LOTUS’ programming model is designed for tasks involving bulk processing of data with LLMs. While some of these systems support batched calls with multi-threading, bulk processing is sparsely supported and largely un-optimized.

ML-based Query Processing. Many prior works study the use of machine learning (ML) in databases, but do not focus on LLMs, which present unique opportunities for system design and optimization. MADLib (Hellerstein et al., 2012) extends SQL with efficient abstractions for supervised learning, unsupervised learning, and descriptive statistics. Prior works such as NoScope (Kang et al., 2017), TASTI (Kang et al., [n. d.]), SUPG (Kang et al., 2020), BlazeIt (Kang et al., 2019) and probabilistic predicates (Lu et al., 2018) propose methods to optimize queries involving expensive ML models over large datasets, typically in video analytics. Some optimizations that were useful in these works, such as model cascades and predicate re-ordering, are likewise useful for optimizing LOTUS pipelines with language models. However, our setting with natural-language reasoning tasks has new operators with significantly different semantics, such as sem_topk and sem_agg, requiring a new programming model as well as new query execution algorithms and optimizations.

6. Conclusion

In this work, we proposed semantic operators to provide the first declarative and general-purpose interface to serve bulk-semantic processing. We implement these operators in the LOTUS system to seamlessly extend the relational model and allow programmers to easily compose powerful reasoning-based query pipelines over vast corpora of structured and unstructured data. Our results across a diverse set of applications, including fact-checking, extreme multi-label classification, and search, demonstrate the generality and effectiveness of LOTUS’ programming model as well as the efficiency and optimization opportunities of LOTUS’ query engine. For each task, we find that LOTUS programs capture high quality and state-of-the-art query pipelines with low development overhead, and that they can be automatically optimized to achieve higher performance than existing implementations.

Acknowledgements.
This research was supported in part by affiliate members and other supporters of the Stanford DAWN project, including Meta, Google, and VMware, as well as Cisco, SAP, and a Sloan Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

References

  • dat ([n. d.]) [n. d.]. AI Functions on Databricks. https://docs.databricks.com
  • arx ([n. d.]) [n. d.]. arXiv.org ePrint archive. https://arxiv.org
  • chr ([n. d.]) [n. d.]. Chroma. https://www.trychroma.com/
  • cus ([n. d.]) [n. d.]. Custom Search JSON API | Programmable Search Engine. https://developers.google.com/custom-search/v1/overview
  • dis ([n. d.]) [n. d.]. Discovery Insight Platform. https://www.findourview.com
  • gai ([n. d.]) [n. d.]. GAIR-NLP/factool: FacTool: Factuality Detection in Generative AI. https://github.com/GAIR-NLP/factool
  • int ([n. d.]) [n. d.]. Introducing Meta Llama 3: The most capable openly available LLM to date. https://ai.meta.com/blog/meta-llama-3/
  • lan ([n. d.]) [n. d.]. LangChain. https://www.langchain.com/
  • ver ([n. d.]) [n. d.]. LLM with Vertex AI only using SQL queries in BigQuery. https://cloud.google.com/blog/products/ai-machine-learning/llm-with-vertex-ai-only-using-sql-queries-in-bigquery
  • mos ([n. d.]) [n. d.]. Mosaic AI Vector Search. https://docs.databricks.com
  • ope ([n. d.]) [n. d.]. OpenAI Platform. https://platform.openai.com
  • pan ([n. d.]) [n. d.]. pandas - Python Data Analysis Library. https://pandas.pydata.org/
  • pap ([n. d.]) [n. d.]. Papers with Code - Machine Learning Datasets. https://paperswithcode.com/datasets
  • lla ([n. d.]) [n. d.]. Querying - LlamaIndex 0.9.11.post1. https://docs.llamaindex.ai/en/stable/understanding/querying/querying.html
  • vec ([n. d.]) [n. d.]. Vector Database for Vector Search. https://www.pinecone.io/
  • noa (2023a) 2023a. Large Language Models for sentiment analysis with Amazon Redshift ML (Preview) | AWS Big Data Blog. https://aws.amazon.com/blogs/big-data/large-language-models-for-sentiment-analysis-with-amazon-redshift-ml-preview/ Section: Amazon Redshift.
  • noa (2023b) 2023b. Vespa - the big data serving engine. https://vespa.ai/
  • noa (2024) 2024. mixedbread-ai/mxbai-rerank-large-v1 · Hugging Face. https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v1
  • Adams et al. (2023) Griffin Adams, Alexander Fabbri, Faisal Ladhak, Eric Lehman, and Noémie Elhadad. 2023. From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting. http://arxiv.org/abs/2309.04269 arXiv:2309.04269 [cs].
  • Anantha et al. (2021) Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, and Srinivas Chappidi. 2021. Open-Domain Question Answering Goes Conversational via Question Rewriting. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou (Eds.). Association for Computational Linguistics, Online, 520–534. https://doi.org/10.18653/v1/2021.naacl-main.44
  • Andoni and Indyk (2008) Alexandr Andoni and Piotr Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51, 1 (Jan. 2008), 117–122. https://doi.org/10.1145/1327452.1327494
  • Andoni et al. (2015) Alexandr Andoni, Piotr Indyk, Thijs Laarhoven, Ilya Razenshteyn, and Ludwig Schmidt. 2015. Practical and optimal LSH for angular distance. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’15). MIT Press, Cambridge, MA, USA, 1225–1233.
  • Andoni and Razenshteyn (2015) Alexandr Andoni and Ilya Razenshteyn. 2015. Optimal Data-Dependent Hashing for Approximate Near Neighbors. In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing (STOC ’15). Association for Computing Machinery, New York, NY, USA, 793–801. https://doi.org/10.1145/2746539.2746553
  • Arora et al. (2023) Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. 2023. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. http://arxiv.org/abs/2304.09433 arXiv:2304.09433 [cs].
  • Baranchuk et al. (2018) Dmitry Baranchuk, Artem Babenko, and Yury Malkov. 2018. Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors. https://doi.org/10.48550/arXiv.1802.02422 arXiv:1802.02422 [cs].
  • Braverman and Mossel ([n. d.]) Mark Braverman and Elchanan Mossel. [n. d.]. Noisy sorting without resampling. ([n. d.]).
  • Chang et al. (2024) Yapei Chang, Kyle Lo, Tanya Goyal, and Mohit Iyyer. 2024. BooookScore: A systematic exploration of book-length summarization in the era of LLMs. http://arxiv.org/abs/2310.00785 arXiv:2310.00785 [cs].
  • Chen et al. (2023) Lingjiao Chen, Matei Zaharia, and James Zou. 2023. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance. http://arxiv.org/abs/2305.05176 arXiv:2305.05176 [cs].
  • Chern et al. (2023) I.-Chun Chern, Steffi Chern, Shiqi Chen, Weizhe Yuan, Kehua Feng, Chunting Zhou, Junxian He, Graham Neubig, and Pengfei Liu. 2023. FacTool: Factuality Detection in Generative AI – A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios. https://doi.org/10.48550/arXiv.2307.13528 arXiv:2307.13528 [cs].
  • Desai and Durrett (2020) Shrey Desai and Greg Durrett. 2020. Calibration of Pre-trained Transformers. https://arxiv.org/abs/2003.07892v3
  • D’Oosterlinck et al. (2024) Karel D’Oosterlinck, Omar Khattab, François Remy, Thomas Demeester, Chris Develder, and Christopher Potts. 2024. In-Context Learning for Extreme Multi-Label Classification. https://doi.org/10.48550/arXiv.2401.12178 arXiv:2401.12178 [cs].
  • D’Oosterlinck et al. (2023) Karel D’Oosterlinck, François Remy, Johannes Deleu, Thomas Demeester, Chris Develder, Klim Zaporojets, Aneiss Ghodsi, Simon Ellershaw, Jack Collins, and Christopher Potts. 2023. BioDEX: Large-Scale Biomedical Adverse Drug Event Extraction for Real-World Pharmacovigilance. https://doi.org/10.48550/arXiv.2305.13395 arXiv:2305.13395 [cs].
  • Drozdov et al. (2023) Andrew Drozdov, Honglei Zhuang, Zhuyun Dai, Zhen Qin, Razieh Rahimi, Xuanhui Wang, Dana Alon, Mohit Iyyer, Andrew McCallum, Donald Metzler, and Kai Hui. 2023. PaRaDe: Passage Ranking using Demonstrations with Large Language Models. https://doi.org/10.48550/arXiv.2310.14408 arXiv:2310.14408 [cs].
  • Gao et al. (2023) Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y. Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, and Kelvin Guu. 2023. RARR: Researching and Revising What Language Models Say, Using Language Models. https://doi.org/10.48550/arXiv.2210.08726 arXiv:2210.08726 [cs].
  • Ge et al. (2014) Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2014. Optimized Product Quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 4 (April 2014), 744–755. https://doi.org/10.1109/TPAMI.2013.240
  • Gionis et al. (1999) Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 1999. Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB ’99). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 518–529.
  • Gong et al. (2020) Long Gong, Huayi Wang, Mitsunori Ogihara, and Jun Xu. 2020. iDEC: indexable distance estimating codes for approximate nearest neighbor search. Proceedings of the VLDB Endowment 13, 9 (May 2020), 1483–1497. https://doi.org/10.14778/3397230.3397243
  • Hellerstein et al. (2012) Joe Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. 2012. The MADlib Analytics Library or MAD Skills, the SQL. http://arxiv.org/abs/1208.4165 arXiv:1208.4165 [cs].
  • Hoare (1961) C. A. R. Hoare. 1961. Algorithm 65: find. Commun. ACM 4, 7 (July 1961), 321–322. https://doi.org/10.1145/366622.366647
  • Indyk and Motwani (1998) Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing (STOC ’98). Association for Computing Machinery, New York, NY, USA, 604–613. https://doi.org/10.1145/276698.276876
  • Izacard et al. (2022) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022. Atlas: Few-shot Learning with Retrieval Augmented Language Models. https://doi.org/10.48550/arXiv.2208.03299 arXiv:2208.03299 [cs].
  • Jafari et al. (2020) Omid Jafari, Parth Nagarkar, and Jonathan Montaño. 2020. mmLSH: A Practical and Efficient Technique for Processing Approximate Nearest Neighbor Queries on Multimedia Data. In Similarity Search and Applications (Lecture Notes in Computer Science), Shin’ichi Satoh, Lucia Vadicamo, Arthur Zimek, Fabio Carrara, Ilaria Bartolini, Martin Aumüller, Björn Þór Jónsson, and Rasmus Pagh (Eds.). Springer International Publishing, Cham, 47–61. https://doi.org/10.1007/978-3-030-60936-8_4
  • Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. http://arxiv.org/abs/1702.08734 arXiv:1702.08734 [cs].
  • Jégou et al. (2011) Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (Jan. 2011), 117–128. https://doi.org/10.1109/TPAMI.2010.57
  • Kang et al. (2019) Daniel Kang, Peter Bailis, and Matei Zaharia. 2019. BlazeIt: Optimizing Declarative Aggregation and Limit Queries for Neural Network-Based Video Analytics. http://arxiv.org/abs/1805.01046 arXiv:1805.01046 [cs].
  • Kang et al. (2017) Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. NoScope: Optimizing Neural Network Queries over Video at Scale. http://arxiv.org/abs/1703.02529 arXiv:1703.02529 [cs].
  • Kang et al. (2020) Daniel Kang, Edward Gan, Peter Bailis, Tatsunori Hashimoto, and Matei Zaharia. 2020. Approximate selection with guarantees using proxies. Proceedings of the VLDB Endowment 13, 12 (Aug. 2020), 1990–2003. https://doi.org/10.14778/3407790.3407804
  • Kang et al. (2022) Daniel Kang, Edward Gan, Peter Bailis, Tatsunori Hashimoto, and Matei Zaharia. 2022. Approximate Selection with Guarantees using Proxies. https://doi.org/10.48550/arXiv.2004.00827 arXiv:2004.00827 [cs].
  • Kang et al. ([n. d.]) Daniel Kang, John Guibas, Peter Bailis, Tatsunori Hashimoto, and Matei Zaharia. [n. d.]. Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data. ([n. d.]).
  • Khattab et al. (2021) Omar Khattab, Christopher Potts, and Matei Zaharia. 2021. Relevance-guided Supervision for OpenQA with ColBERT. Transactions of the Association for Computational Linguistics 9 (2021), 929–944. https://doi.org/10.1162/tacl_a_00405
  • Khattab et al. (2023) Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. https://arxiv.org/abs/2310.03714v1
  • Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. https://arxiv.org/abs/2004.12832v2
  • Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. https://arxiv.org/abs/2309.06180v1
  • Lee et al. (2019) Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 6086–6096. https://doi.org/10.18653/v1/P19-1612
  • Li et al. (2020) Mingjie Li, Ying Zhang, Yifang Sun, Wei Wang, Ivor W. Tsang, and Xuemin Lin. 2020. I/O Efficient Approximate Nearest Neighbour Search based on Learned Functions. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, Dallas, TX, USA, 289–300. https://doi.org/10.1109/ICDE48307.2020.00032
  • Liang et al. (2022) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2022. Holistic Evaluation of Language Models. https://arxiv.org/abs/2211.09110v2
  • Lin et al. (2024) Yiming Lin, Madelon Hulsebos, Ruiying Ma, Shreya Shankar, Sepanta Zeigham, Aditya G. Parameswaran, and Eugene Wu. 2024. Towards Accurate and Efficient Document Analytics with Large Language Models. http://arxiv.org/abs/2405.04674 arXiv:2405.04674 [cs].
  • Liu et al. (2024b) Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, and Gerardo Vitagliano. 2024b. A Declarative System for Optimizing AI Workloads. http://arxiv.org/abs/2405.14696 arXiv:2405.14696 [cs].
  • Liu et al. (2023) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. https://doi.org/10.48550/arXiv.2307.03172 arXiv:2307.03172 [cs].
  • Liu et al. (2024a) Shu Liu, Asim Biswal, Audrey Cheng, Xiangxi Mo, Shiyi Cao, Joseph E. Gonzalez, Ion Stoica, and Matei Zaharia. 2024a. Optimizing LLM Queries in Relational Workloads. http://arxiv.org/abs/2403.05821 arXiv:2403.05821 [cs].
  • Liu et al. (2024c) Shicheng Liu, Jialiang Xu, Wesley Tjangnaka, Sina J. Semnani, Chen Jie Yu, and Monica S. Lam. 2024c. SUQL: Conversational Search over Structured and Unstructured Data with Large Language Models. https://doi.org/10.48550/arXiv.2311.09818 arXiv:2311.09818 [cs].
  • Liu et al. (2021) Wanqi Liu, Hanchen Wang, Ying Zhang, Wei Wang, Lu Qin, and Xuemin Lin. 2021. EI-LSH: An early-termination driven I/O efficient incremental c-approximate nearest neighbor search. The VLDB Journal 30, 2 (March 2021), 215–235. https://doi.org/10.1007/s00778-020-00635-4
  • Lu and Kudo (2020) Kejing Lu and Mineichi Kudo. 2020. R2LSH: A Nearest Neighbor Search Scheme Based on Two-dimensional Projected Spaces. In 2020 IEEE 36th International Conference on Data Engineering (ICDE). 1045–1056. https://doi.org/10.1109/ICDE48307.2020.00095
  • Lu et al. (2018) Yao Lu, Aakanksha Chowdhery, Srikanth Kandula, and Surajit Chaudhuri. 2018. Accelerating Machine Learning Inference with Probabilistic Predicates. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18). Association for Computing Machinery, New York, NY, USA, 1493–1508. https://doi.org/10.1145/3183713.3183751
  • Ma et al. (2023) Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. 2023. Zero-Shot Listwise Document Reranking with a Large Language Model. https://arxiv.org/abs/2305.02156v1
  • Malkov and Yashunin (2018) Yu A. Malkov and D. A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. http://arxiv.org/abs/1603.09320 arXiv:1603.09320 [cs].
  • Park et al. (2015) Yongjoo Park, Michael Cafarella, and Barzan Mozafari. 2015. Neighbor-sensitive hashing. Proceedings of the VLDB Endowment 9, 3 (Nov. 2015), 144–155. https://doi.org/10.14778/2850583.2850589
  • Patel et al. (2024) Liana Patel, Peter Kraft, Carlos Guestrin, and Matei Zaharia. 2024. ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data. Proceedings of the ACM on Management of Data 2, 3 (May 2024), 120:1–120:27. https://doi.org/10.1145/3654923
  • Pradeep et al. (2023a) Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023a. RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models. http://arxiv.org/abs/2309.15088 arXiv:2309.15088 [cs].
  • Pradeep et al. (2023b) Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023b. RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! http://arxiv.org/abs/2312.02724 arXiv:2312.02724 [cs].
  • Qin et al. (2024) Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. 2024. Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. https://doi.org/10.48550/arXiv.2306.17563 arXiv:2306.17563 [cs].
  • Sachan et al. (2022) Devendra Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen-tau Yih, Joelle Pineau, and Luke Zettlemoyer. 2022. Improving Passage Retrieval with Zero-Shot Question Generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 3781–3797. https://doi.org/10.18653/v1/2022.emnlp-main.249
  • Shah and Wainwright (2016) Nihar B. Shah and Martin J. Wainwright. 2016. Simple, Robust and Optimal Ranking from Pairwise Comparisons. http://arxiv.org/abs/1512.08949 arXiv:1512.08949 [cs, math, stat].
  • Sun et al. (2023) Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. https://arxiv.org/abs/2304.09542v2
  • Sundaram et al. (2013) Narayanan Sundaram, Aizana Turmukhametova, Nadathur Satish, Todd Mostak, Piotr Indyk, Samuel Madden, and Pradeep Dubey. 2013. Streaming similarity search over one billion tweets using parallel locality-sensitive hashing. Proceedings of the VLDB Endowment 6, 14 (Sept. 2013), 1930–1941. https://doi.org/10.14778/2556549.2556574
  • Thakur et al. (2021) Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. https://doi.org/10.48550/arXiv.2104.08663 arXiv:2104.08663 [cs].
  • Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Marilyn Walker, Heng Ji, and Amanda Stent (Eds.). Association for Computational Linguistics, New Orleans, Louisiana, 809–819. https://doi.org/10.18653/v1/N18-1074
  • Viola and Jones (2001) P. Viola and M. Jones. 2001. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, Vol. 1. IEEE Comput. Soc, Kauai, HI, USA, I–511–I–518. https://doi.org/10.1109/CVPR.2001.990517
  • Wang et al. (2021) Jianguo Wang, Xiaomeng Yi, Rentong Guo, Hai Jin, Peng Xu, Shengjun Li, Xiangyu Wang, Xiangzhou Guo, Chengming Li, Xiaohai Xu, Kun Yu, Yuxing Yuan, Yinghao Zou, Jiquan Long, Yudong Cai, Zhenxiang Li, Zhifeng Zhang, Yihua Mo, Jun Gu, Ruiyi Jiang, Yi Wei, and Charles Xie. 2021. Milvus: A Purpose-Built Vector Data Management System. In Proceedings of the 2021 International Conference on Management of Data (SIGMOD ’21). Association for Computing Machinery, New York, NY, USA, 2614–2627. https://doi.org/10.1145/3448016.3457550
  • Wang et al. (2024) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024. Text Embeddings by Weakly-Supervised Contrastive Pre-training. http://arxiv.org/abs/2212.03533 arXiv:2212.03533 [cs].
  • Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively Summarizing Books with Human Feedback. https://doi.org/10.48550/arXiv.2109.10862 arXiv:2109.10862 [cs].
  • Wu et al. (2024) Shirley Wu, Shiyu Zhao, Michihiro Yasunaga, Kexin Huang, Kaidi Cao, Qian Huang, Vassilis N. Ioannidis, Karthik Subbian, James Zou, and Jure Leskovec. 2024. STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases. https://arxiv.org/abs/2404.13207v2
  • Yue et al. (2024) Murong Yue, Jie Zhao, Min Zhang, Liang Du, and Ziyu Yao. 2024. Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning. http://arxiv.org/abs/2310.03094 arXiv:2310.03094 [cs].
  • Yuksekgonul et al. (2024) Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. 2024. TextGrad: Automatic “Differentiation” via Text. http://arxiv.org/abs/2406.07496 arXiv:2406.07496 [cs].
  • Zelikman et al. (2022) Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. 2022. STaR: Bootstrapping Reasoning With Reasoning. https://doi.org/10.48550/arXiv.2203.14465 arXiv:2203.14465 [cs].
  • Zhang et al. (2024) Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. TinyLlama: An Open-Source Small Language Model. arXiv:2401.02385 [cs.CL] https://arxiv.org/abs/2401.02385
  • Zhang et al. (2022) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic Chain of Thought Prompting in Large Language Models. https://doi.org/10.48550/arXiv.2210.03493 arXiv:2210.03493 [cs].
  • Zheng et al. (2020) Bolong Zheng, Xi Zhao, Lianggui Weng, Nguyen Quoc Viet Hung, Hang Liu, and Christian S. Jensen. 2020. PM-LSH: A fast and accurate LSH framework for high-dimensional approximate NN search. Proceedings of the VLDB Endowment 13, 5 (Jan. 2020), 643–655. https://doi.org/10.14778/3377369.3377374