HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: easybmat
  • failed: aliascnt
  • failed: pdfcol
  • failed: scalerel

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: CC BY 4.0
arXiv:2403.14128v2 [cs.DB] 22 Mar 2024
\newaliascnt

corollarytheorem \aliascntresetthecorollary \newaliascntexampletheorem \aliascntresettheexample \newaliascntdefinitiontheorem \aliascntresetthedefinition \newaliascntpropositiontheorem \aliascntresettheproposition \newaliascntlemmatheorem \aliascntresetthelemma \newaliascntconjecturetheorem \aliascntresettheconjecture \pdfcolInitStacktcb@breakable

Gen-T: Table Reclamation in Data Lakes

Grace Fan Northeastern University
Boston, United States
[email protected]
   Roee Shraga Worcester Polytechnic Institute
Worcester, United States
[email protected]
   Renée J. Miller Northeastern University
Boston, United States
[email protected]
Abstract

We introduce the problem of Table Reclamation. Given a Source Table and a large table repository, reclamation finds a set of tables that, when integrated, reproduce the source table as closely as possible. Unlike query discovery problems like Query-by-Example or by-Target, Table Reclamation focuses on reclaiming the data in the Source Table as fully as possible using real tables that may be incomplete or inconsistent. To do this, we define a new measure of table similarity, called error-aware instance similarity, to measure how close a reclaimed table is to a Source Table, a measure grounded in instance similarity used in data exchange. Our search covers not only Select-Project-Join queries, but integration queries with unions, outerjoins, and the unary operators subsumption and complementation that have been shown to be important in data integration and fusion. Using reclamation, a data scientist can understand if any tables in a repository can be used to exactly reclaim a tuple in the Source. If not, one can understand if this is due to differences in values or to incompleteness in the data. Our solution, Gen-T, performs table discovery to retrieve a set of candidate tables from the table repository, filters these down to a set of originating tables, then integrates these tables to reclaim the Source as closely as possible. We show that our solution, while approximate, is accurate, efficient and scalable in the size of the table repository with experiments on real data lakes containing up to 15K tables, where the average number of tuples varies from small (web tables) to extremely large (open data tables) up to 1M tuples.

I Introduction

We introduce the problem of Table Reclamation where we are given a source table and seek to find a set of tables from a data lake (a large table repository) that, when integrated, reproduce the source table as closely as possible. We begin with an example showing how table reclamation can be used.

Example \theexample

Suppose a user is reading a news article that reports the demographics of employees in Top US tech companies in 2021 (top blue table in Figure 1). The user has access to 2021 Microsoft’s Diversity Report [1] which seems to contradict the numbers in the news article (bottom green table in Figure 1). Using table reclamation and her data lake (which may include wikitables, NYTimes Data and other public and private datasets), the user can ask if there is a set of tables that, when integrated, recreates the data in the news article (blue table). Table reclamation is able to reproduce this table using a number of tables including World_MS_Ethnicity and World_MS_Employees (which, after being joined and a selection on 2021 is applied, produce the first tuple). We call the tables used in reclamation, originating tables. Other tables can be unioned with the Microsoft tables to reclaim the other tuples. From this (the originating tables including their meta-data and data), a user can understand that while her table (green) is reporting US statistics numbers, the article is reporting international numbers.

Refer to caption
Figure 1: A news article reports the top blue table. A user has access to Microsoft’s diversity report, which seems to contradict the article (bottom green table).
Refer to caption
Figure 2: Gen-T Architecture. Given a Source Table, Gen-T finds a set of originating tables (Table Discovery), produces a reclaimed Source Table from the (Table Reclamation), and returns the originating tables and the reclaimed Source Table.

Unlike the well-known problem of Data Provenance [2, 3], we do not have prior knowledge of the query or tables that were originally used to create a Source Table. Instead, we focus on recovering possible tables that, when integrated, confirm the data values and facts in a Source Table. Table reclamation is related to the common Query-By-Example (QBE) or Query-By-Target (QBT) that discover a query over input tables that produces an instance-equivalent table to the given example output table [4, 5, 6, 7, 8, 9]. In order to generalize to a data lake setting, we do not assume we know a complete and correct set of input tables. Rather, we use an additional step of finding candidate tables within or across data lakes that may contribute to the Source Table (i.e., that may be originating tables). Also, existing QBE/QBT systems focus primarily on discovering (Select)-Project-Join queries over (largely complete) relational tables [10, 11, 12, 13, 14], with some using both the data values and the schema of the tables. Due to the noise and heterogeneity of data lake tables, these queries may not be sufficient to fully integrate data lake tables to produce a given Source table. So, we aim to recover Select-Project-Join-Union queries using only the data values, since the metadata of data lake tables may be missing or inconsistent [15, 16, 17, 18]. We also consider operations that have proven to be important in data integration and data fusion of incomplete data, namely subsumption and complementation [19].

Our goal is to reclaim the source table completely, but this may not always be possible.

Example \theexample

Continuing our example, it is possible that the best reclamation we can find has null values for the percentages of Hispanic employees for Google and a different number of Asian employees (20% instead of 24%). These differences indicate that the source data about Google was not completely found within the data lake. The user can analyze the originating tables returned by our approach to understand these differences. As an example, it may be that the originating tables for the Google data in the repository are European in origin and do not report values for all categories like Hispanic employees as this is a protected category under US, but not European law.

If certain tuples cannot be reclaimed, a data scientist would know these are not derivable from her data lake. If the reclaimed table contains different values from the source, a scientist can investigate whether the source values are wrong or if they are valid corrections to errors in the originating tables. Unlike traditional QBE and QBT approaches, our focus is on the data (rather than the query) and on understanding what data in a Source Table (and only data in a Source Table) can be reclaimed. We make the following contributions.

\bullet We define the novel problem of Table Reclamation – finding a set of originating tables that, when integrated, can reproduce a Source Table as closely as possible.

\bullet To evaluate how close a reclaimed table is to a Source Table, we define a new error-aware instance similarity (EIS) score that is a principled extension of instance similarity used in data exchange [20]), and show how it can be computed efficiently.

\bullet We present an approximate Table Reclamation solution named Gen-T that performs table discovery to retrieve a set of candidate tables, and filters out poor candidates using a novel table representation that simulates table integration without performing expensive integration operations. The remaining originating tables are integrated to produce a table whose values are as close as possible to the Source Table.

\bullet We conduct extensive experiments on real and synthetic data lakes, showing that Gen-T outperforms all baseline methods. Gen-T reclaims 5X more values from Source Tables than the best-performing baseline. We perform an ablation study on the sensitivity of Gen-T to erroneous data (data that cannot be reclaimed) and to incomplete data.

\bullet We show that our solution is efficient and scalable to the size of the data lake with experiments on real data lakes containing up to 15K tables, where the average table size (number of tuples) varies from small (web tables) to extremely large (open data tables) with on average over 1M tuples. In addition, our solution is scalable to large source tables, with experiments on source tables containing up to 22 columns and 1K rows.

II Overview

A data scientist provides a Source Table that she would like to reclaim by understanding if it can be produced by integrating any combination of tables within a data lake. Specifically, we aim to determine a set of tables from which the Source Table’s values may originate (termed originating tables), and use them to reclaim (regenerate) the Source Table. Given our data lake setting where tables can be changed autonomously, we formulate the problem as an approximate search of finding a set of tables that can best be used to reclaim the Source, as closely as possible.

Unlike many existing Query-by-Example [5, 10, 11, 12, 13, 14, 6, 7, 8] or by-target [9] approaches, we do not assume that we know the exact set of input tables whose values first formed the Source Table or even if the Source Table can be reclaimed. In addition, while we do not assume tables in a data lake to have keys or any foreign key relationships, we assume the Source Table to have a (possibly multi-attribute) key, which can be found using existing mining techniques [21, 22]. This is a restriction, but it is made to make the instance comparison (which is done often in the algorithm) efficient. Without the source table having a key, instance similarity requires homomorphism checks which is NP-hard [20]. In general, we do not assume that metadata is available for any tables (column names are included in examples only for clarity).

To solve the problem of table reclamation, we use a two-step solution. First, we discover tables from the data lake that share values with the Source Table and therefore may have created portions of it , we call these candidate tables. Then, we search for ways of combining subsets of these tables to regenerate the Source Table.

Refer to caption
Figure 3: Source Table (in green) contains applicants’ information, such as ID, Name, Age, Gender, and Education Level. Tables A, B, C, D (in blue) are possible tables from which the Source Table’s instances originated. Missing values and inconsistent values w.r.t. Source Table are depicted in yellow (‘—’) and red, respectively. Tables on the right (in yellow) are possible integrations of tables resulting from integration methods using Full Disjunction (FD) and outer join (        \mathbin{\rule[-0.07751pt]{2.25pt}{0.4pt}\hbox to 0.0pt{\hss\rule[6.29999pt]{2% .25pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.07751pt]{2.25pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[6.29999pt]{2.25pt}{0.4pt}}}).

Figure 2 shows the pipeline of Gen-T. The input is a Source Table and the output a Reclaimed Source Table and its originating tables. In the Table Discovery phase (Section V-A), Gen-T discovers a set of candidate tables whose values may have contributed to the creation of the Source Table. Then, we apply our novel solution of representing tables as matrices in order to simulate table integration via matrix traversal (Section V-A2V-A3). The goal of this step is to refine the set of candidate tables to a set of originating tables, and essentially filter out candidate tables that are not needed before performing table integration. To efficiently retrieve a set of candidate tables, Gen-T uses an existing, data-driven table discovery method that has no guarantees for this problem setting. Gen-T then prunes the candidate tables to a set of originating tables by computing each candidate table’s similarity with a given Source Table and simulating table integration. Once Matrix Traversal pinpoints a set of originating tables, we integrate these originating tables in the Table Reclamation phase (Section V-B) and produce a reclaimed Source Table.

Example \theexample

Suppose a user has the top left, green table in Figure 3 as a Source Table. To reclaim this table, we use table discovery to find a subset of tables in the data lake with overlapping values – in this example, tables A, B, C, and D. However, Table C contains contradicting non-null values in the “Gender” column compared to values in the same tuples of the “Gender” column in the Source Table. We demonstrate the consequence of directly integrating all these tables, including Table C. The top right, yellow table is the integration result from the state-of-the-art full disjunction (FD) method [23, 24] and the bottom right table shows the result using one possible outerjoin order that may be learned by Auto-Pipeline (a by-target approach) [9]. The resulting tables contain different values in the Gender column (in red) with respect to the corresponding value in the Source Table. These values originate from Table C. When possible, we need to refine the set of candidate tables to filter out tables like Table C that produce integrated tables with erroneous values that do not make the Source Table. Note in this example, the integration of Tables A, B and D alone produces a better reclamation than using all four candidates.

We present related work next in Section III. We define the problem of Table Reclamation and preliminaries for our solution in Section IV. Then, we discuss the two steps in Gen-T– Table Discovery phase (Section V-A) and the Table Reclamation phase (Section V-B). Finally, the experiments in Section VI show the effectiveness, scalability, and generalizability of Gen-T and we conclude with open problems and exciting directions for this new area (Section VII).

III Related Work

We now discuss related work on Table Discovery and Integration and work related to finding the origins of tables.

Table Discovery: Table Discovery has a rich literature, specifically keyword search over tables, unionable table search, and joinable table search. Early work such as Octopus [25] and Google Dataset Search [26], support keyword search over the metadata of tables [16, 27] and smaller scale web-tables [28, 29]. Data-driven table discovery systems [30, 31, 32, 33, 15] were then developed to find schema complements, entity complements, joinable tables, and unionable tables.

For joinable table search, early systems use schema matching or syntactic similarities between tables’ metadata, such as Jaccard similarity [34, 35]. LSH Ensemble [31] makes use of approximate set containment between column values and supports set-containment search using LSH indexing. JOSIE [32] uses exact set containment to retrieve joinable tables that can be equi-joined with a column in the user’s table. MATE [36] supports multi-attribute join with a user’s table. DeepJoin [37] leverages a deep learning model to retrieve equi-joinable and semantically joinable tables. These systems can be used to retrieve a set of candidate tables that have high set similarity with a given user’s table.

For table union search, early systems also used schema similarity  [30, 38]. Using data (rather than metadata), a formal problem statement for unionability was first defined by Nargesian et al. [15] who presented a data-driven solution that leverages syntactic, semantic, and natural language measures. This problem was refined by SANTOS [39] to consider relationship semantics as well as column semantics. Most recently, Starmie [40] offers a scalable solution to finding unionable tables that leverages the entire table context to encode its semantics. Although our method also retrieves relevant tables to a user’s table, we aim to retrieve tables for a specific task – reclaiming the user’s table. Finally, other recent work [41] presents a goal-oriented discovery for specific downstream tasks, aiming to augment columns. We tailor table discovery towards the goal of reclaiming the Source Table.

Table Integration: Lehmberg et al. [42] stitches unionable tables together, but does not support join augmentation of tables. Recently, ALITE [24] performs full disjunction (FD) [23] to maximally combine tuples from a set of tables (intuitively, full disjunction is a commutative and associative form of full outer join). Our goal is to reproduce the given Source Table, which may contain incomplete tuples, so we do not aim to maximally combine tuples if it produces a table that is not identical to the Source Table. Nonetheless, ALITE is a candidate baseline for Gen-T, as it offers a state-of-the-art integration solution.

Preceding the table integration process, there are pre-integration tasks to find alignments between table elements. First, instance-based schema matching determines how the schemas of two tables align to prepare for integration [43, 44, 45, 46, 47, 48]. Our solution, Gen-T, aligns schemas implicitly by renaming columns in the retrieved tables with the column of the source table that best matches. Entity matching [49, 50, 51, 52, 53, 54, 55, 56, 57] is another common pre-integration task, aiming to align tuples for cleaning or joining tables. In our context, since we assume that the Source Table has a key, tuples can be aligned by matching using equality on the key.

Finding Origins of Tables: Our problem setting of tracing a Source Table’s values back to its origins can be related to Data Provenance [2, 3], which given a query and its output table, explains from where the (values or) tuples originate, why and how they were produced. However, in our problem setting, we do not know the query or originating tables that were originally used to create the Source Table.

Query-By-Example (QBE) is a popular approach. The original QBE was a language allowing nonexpert users to query a database [58]. More recently this term has been used for methods that are given a pair of matching input and output tables, and the task is to synthesize a query from the input to the output[5]. For this task, some systems only consider Project and Join operators [10, 11, 12, 59], whereas others also consider the Select operator [13, 14]. Others output a set of queries that could reproduce the example output table, given the input table [6, 7]. AutoPandas [8] performs transformation-by-example by synthesizing Pandas programs rather than SQL queries. More recently, proposed techniques relax the assumption that a set of tables from which the query table can be generated is provided [59, 60, 61]. Instead, they discover a set of tables that, when integrated, produce a table that contains the query table. These methods often expect only a partial query table with a small set of attributes and possibly a set of tuple examples (for example, Ver [59] uses queries that are tables of 2 columns and 3 rows). Their goal is to generate an output table that completes this query table by returning a table that contains many additional tuples in addition to those in the query table. However in our problem, our goal is to reproduce all and only tuples from the source table. Nonetheless, since Ver [59] is the state-of-the-art Query-by-Example method, we use it as a baseline for Gen-T.

Auto-Pipeline [9] defines Query-By-Target, with the similar goal of synthesizing the pipeline used to create the target table, given the target table and a set of input tables. Using the synthesized pipeline on the input tables, it then produces a table that “schematically” aligns with the input target table. As the state-of-the-art in this line of work, it is a baseline for our approach. In both By-Example and By-Target paradigms, many systems assume that the set of input tables on which the system synthesizes a query to generate the example or target table is known and perhaps more importantly are known to contain the tuples and columns needed to reproduce the output table. In our problem, we do not assume this is the case.

IV Problem Definition and Preliminaries

We first discuss how we evaluate a possible reclaimed Source Table and then define the problem of Table Reclamation. Then, we describe preliminaries for our solution, specifically the set of table operators with which we perform table integration to produce a possible reclaimed Source Table.

IV-A EIS Score and Problem Definition

To evaluate a possible reclaimed table, we compare it with the Source Table to see how close they are. The problem of comparing database instances is prevalent in many applications such as analyzing how a dataset has evolved over time (e.g., data versioning) [62], evaluating data cleaning solutions (e.g., compare a clean instance produced by a data repair algorithm against a gold standard) [63], or comparing solutions generated by data exchange or transformation systems [64, 20]. This similarity score requires the computation of a mapping between the tuples across instances, which can be used to explain the result. The most general measures rely on homomorphism checking and are NP-hard [65]. Since data lake tables do not have keys, integrating them can produce multiple copies of tuples from the source table. We will align data lake tuples with a single source tuple where the lake and source tuple share the same key value (i.e., are aligned tuples iff they share the same values on key attributes). Hence, multiple lake tuples may align with the same source tuple, and some will align with no source tuple. But a lake tuple will align with at most one source tuple and because of this, we can do the mapping efficiently.

We propose an error-aware instance similarity score as a generalization of instance similarity defined by Alexe et al. [20], which has been widely used in the integration literature [66, 67, 68, 69]. Instance similarity quantifies the preservation of data associations when source data is exchanged into target database [20]. It relies on computing homomorphisms (as it is designed for data exchange where keys are not assumed). Considering two relations with the same schema, if two tuples are mapped, they define the tuple similarity as the ratio of the number of values that are shared over the size (cardinality) of the tuples. Note that in data exchange two tuples cannot be mapped if they disagree on any non-null values. In our setting, we can map tuples that differ on their non-null values. Hence, we define an error-aware tuple similarity that penalizes mismatching (erroneous) values.

Definition \thedefinition

Given two tuples s𝑠sitalic_s and t𝑡titalic_t with the same schema containing n𝑛nitalic_n non-key attributes, where s𝑠sitalic_s and t𝑡titalic_t share the same key value. Let α(s,t)𝛼𝑠𝑡\alpha(s,t)italic_α ( italic_s , italic_t ) be the number of non-key attributes on which s𝑠sitalic_s and t𝑡titalic_t share the same value and δ(s,t)𝛿𝑠𝑡\delta(s,t)italic_δ ( italic_s , italic_t ) be the number of non-key attributes on which s𝑠sitalic_s and t𝑡titalic_t have different values and t𝑡titalic_t is a non-null value. Then the error-aware tuple similarity is111Note the tuple similarity defined by Alexe et al. [20] is α(s,t)/n𝛼𝑠𝑡𝑛\alpha(s,t)/nitalic_α ( italic_s , italic_t ) / italic_n:

E(s,t)=(α(s,t)δ(s,t))/n𝐸𝑠𝑡𝛼𝑠𝑡𝛿𝑠𝑡𝑛E(s,t)=(\alpha(s,t)-\delta(s,t))/nitalic_E ( italic_s , italic_t ) = ( italic_α ( italic_s , italic_t ) - italic_δ ( italic_s , italic_t ) ) / italic_n (1)

Since a tuple can map to more than one tuple, Alexe et al. define instance similarity using the maximum tuple similarity score and we do the same in defining error-aware instance similarity, but we use error-aware tuple similarity rather than tuple similarity. For completeness (and since in our experiments we use both measures), we now define both.

Definition \thedefinition

Let S𝑆Sitalic_S be a source table (with key attributes K𝐾Kitalic_K) and T𝑇Titalic_T a possible reclaimed table with the same schema that has n𝑛nitalic_n non-key attributes. Note that T𝑇Titalic_T does not have to satisfy the key constraint. For a tuple sS𝑠𝑆s\in Sitalic_s ∈ italic_S, let m(s)={tT|s[K]=t[K]}𝑚𝑠conditional-set𝑡𝑇𝑠delimited-[]𝐾𝑡delimited-[]𝐾m(s)=\{t\in T|s[K]=t[K]\}italic_m ( italic_s ) = { italic_t ∈ italic_T | italic_s [ italic_K ] = italic_t [ italic_K ] }. Then the instance similarity of S𝑆Sitalic_S and T𝑇Titalic_T is:

sS𝑚𝑎𝑥tm(s)(α(s,t)/n)|S|subscript𝑠𝑆subscript𝑚𝑎𝑥𝑡𝑚𝑠𝛼𝑠𝑡𝑛𝑆\dfrac{\sum_{s\in S}\text{max}_{t\in m(s)}(\alpha(s,t)/n)}{|S|}divide start_ARG ∑ start_POSTSUBSCRIPT italic_s ∈ italic_S end_POSTSUBSCRIPT max start_POSTSUBSCRIPT italic_t ∈ italic_m ( italic_s ) end_POSTSUBSCRIPT ( italic_α ( italic_s , italic_t ) / italic_n ) end_ARG start_ARG | italic_S | end_ARG (2)

The Error-Aware Instance Similarity (EIS) of S𝑆Sitalic_S and T𝑇Titalic_T is (for a normalized score in range[0,1]):

0.5sS𝑚𝑎𝑥tm(s)(1+E(s,t))|S|0.5subscript𝑠𝑆subscript𝑚𝑎𝑥𝑡𝑚𝑠1𝐸𝑠𝑡𝑆\dfrac{0.5\cdot\sum_{s\in S}\text{max}_{t\in m(s)}(1+E(s,t))}{|S|}divide start_ARG 0.5 ⋅ ∑ start_POSTSUBSCRIPT italic_s ∈ italic_S end_POSTSUBSCRIPT max start_POSTSUBSCRIPT italic_t ∈ italic_m ( italic_s ) end_POSTSUBSCRIPT ( 1 + italic_E ( italic_s , italic_t ) ) end_ARG start_ARG | italic_S | end_ARG (3)
Refer to caption
Figure 4: Aligned tuples between a Source Table (left green table) and two possible reclaimed tables (right yellow tables) from Figure 3, aligned based on key column ‘ID’.

We illustrate these scores in an example based on Figure 3.

Example \theexample

Consider Source Table S𝑆Sitalic_S (with key column “ID”) from Figure 3 (top-left green table), and two possible reclaimed tables, S^1subscriptnormal-^𝑆1\hat{S}_{1}over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT (top-right yellow table) and S^2subscriptnormal-^𝑆2\hat{S}_{2}over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT (bottom-right yellow table). Their aligned tuples are shown in Figure 4. Notice the instance similarity score (counting only matching values) of S𝑆Sitalic_S and S^1subscriptnormal-^𝑆1\hat{S}_{1}over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is higher than of S𝑆Sitalic_S and S^2subscriptnormal-^𝑆2\hat{S}_{2}over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.

S^1:t0=3/4,t1=4/4,t2=3/40.833:subscript^𝑆1formulae-sequencesubscript𝑡034formulae-sequencesubscript𝑡144subscript𝑡2340.833\hat{S}_{1}:t_{0}=3/4,t_{1}=4/4,t_{2}=3/4\rightarrow 0.833over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT : italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 3 / 4 , italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 4 / 4 , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 3 / 4 → 0.833

S^2:t0=2/4,t1=4/4,t2=3/40.75:subscript^𝑆2formulae-sequencesubscript𝑡024formulae-sequencesubscript𝑡144subscript𝑡2340.75\hat{S}_{2}:t_{0}=2/4,t_{1}=4/4,t_{2}=3/4\rightarrow 0.75over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT : italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 2 / 4 , italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 4 / 4 , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 3 / 4 → 0.75

However, we want to favor S^2subscriptnormal-^𝑆2\hat{S}_{2}over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT that contains nullified (unknown) values (one tuple matching the source correctly and two tuples missing values), over S^1subscriptnormal-^𝑆1\hat{S}_{1}over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, which has reclaimed a possibly erroneous value for a source null. Using EIS score, S^2subscriptnormal-^𝑆2\hat{S}_{2}over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT has a higher similarity with S𝑆Sitalic_S than S^1subscriptnormal-^𝑆1\hat{S}_{1}over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT.

S^1:t0=(31)/4,t1=4/4,t2=3/40.875:subscript^𝑆1formulae-sequencesubscript𝑡0314formulae-sequencesubscript𝑡144subscript𝑡2340.875\hat{S}_{1}:t_{0}=(3-1)/4,t_{1}=4/4,t_{2}=3/4\rightarrow 0.875over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT : italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = ( 3 - 1 ) / 4 , italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 4 / 4 , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 3 / 4 → 0.875

S^2:t0=3/4,t1=4/4,t2=3/40.917:subscript^𝑆2formulae-sequencesubscript𝑡034formulae-sequencesubscript𝑡144subscript𝑡2340.917\hat{S}_{2}:t_{0}=3/4,t_{1}=4/4,t_{2}=3/4\rightarrow 0.917over^ start_ARG italic_S end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT : italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 3 / 4 , italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 4 / 4 , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 3 / 4 → 0.917

Using EIS Score as a similarity measure to compare a possible reclaimed table S^^𝑆\hat{S}over^ start_ARG italic_S end_ARG and Source Table S𝑆Sitalic_S, we aim to solve the following problem:

Definition \thedefinition (Table Reclamation)

Given a collection of tables 𝒯𝒯\mathcal{T}caligraphic_T and a Source Table S𝑆Sitalic_S, find a set of originating tables 𝒯^𝒯normal-^𝒯𝒯\hat{\mathcal{T}}\subseteq\mathcal{T}over^ start_ARG caligraphic_T end_ARG ⊆ caligraphic_T such that its integration produces a reclaimed table S^normal-^𝑆\hat{S}over^ start_ARG italic_S end_ARG with the maximum EIS Score to S𝑆Sitalic_S.

IV-B Preliminaries: Integration Operators

We now present the operators we use to produce a possible reclaimed Source Table. We will show that they are sufficient for integration, inspired by recent work on data lake integration [24]. First the unary operators.

\bullet Projection(π𝜋\piitalic_π): Project on specified columns of the table.
\bullet Selection(σ𝜎\sigmaitalic_σ): Select tuples that satisfy a specified condition.
\bullet Subsumption(β𝛽\betaitalic_β[23]: Given tuples t1,t2subscript𝑡1subscript𝑡2t_{1},t_{2}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT with the same schema, t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT subsumes t2subscript𝑡2t_{2}italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT if for every attribute on which they are both non-null they have the same value, and t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT contains one or more attributes with a non-null value where t2subscript𝑡2t_{2}italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT has nulls. Applying subsumption we remove t2subscript𝑡2t_{2}italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Applying β𝛽\betaitalic_β on a table involves repeatedly applying subsumption and discarding the subsumed tuples (t2subscript𝑡2t_{2}italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT).
\bullet Complementation(κ𝜅\kappaitalic_κ[19, 70]: Given tuples t1,t2subscript𝑡1subscript𝑡2t_{1},t_{2}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT with the same schema, t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT complements t2subscript𝑡2t_{2}italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT if they share at least one non-null column value, and t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT contains some non-null values where t2subscript𝑡2t_{2}italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT has nulls while t2subscript𝑡2t_{2}italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT contains some non-null values where t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT has nulls. The tuples must agree on all values on which they are both non-null. Applying κ𝜅\kappaitalic_κ on t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and t2subscript𝑡2t_{2}italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT produces a single tuple that contains all non-null values of either (both) tuples and is null only if both t1subscript𝑡1t_{1}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and t2subscript𝑡2t_{2}italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are null. Applying κ𝜅\kappaitalic_κ on a table produces a table with no complementing tuples and involves repeatedly applying complementation to pairs of tuples.

We use a single binary operator natural Outer Union, and we assume the schemas of the tables have been aligned so unionable columns share the same name [71].

\bullet Outer Union(\uplus[72]: Union two tables, even if their schemas are not equal. The result contains the union of the columns from both tables. If a column C𝐶Citalic_C is missing from one table (T𝑇Titalic_T), but appears in the other table (S𝑆Sitalic_S), then in the result, the tuples of S𝑆Sitalic_S contain a null (bottom\bot) in their C𝐶Citalic_C column. This operator is commutative and associative. Note that when applied to tables with the same schema, outer union is the same as inner union.

To make reclamation search more efficient, we use the fact that Outer Union and the set of unary operators above can be used to represent any SPJU query. Using this result, our search will focus on outer union and the unary operators.

Theorem 1 (Representative Operators)

Given two tables that contain no duplicate tuples, and no tuples that can be subsumed or complemented, for all SPJU queries, there exists an equivalent query consisting of only Outer Union and the four unary operators (selection, projection, complementation, and subsumption).222The proof is included in our technical report [73].

V Table Reclamation using Gen-T

We describe Table Discovery (Section V-A), which finds a set of originating tables that we can effectively integrate to reclaim a source table (Section V-B).

V-A Table Discovery

We first discover a set of candidate tables, after which we discuss a novel methodology, termed Matrix Traversal, to refine this set into a set of originating tables.

V-A1 Candidate Table Retrieval

Discovering a set of candidate tables requires discovering tables that share some of the same values as the Source Table in an efficient manner. In the context of data lakes, where metadata is inconsistent or missing, searching using schema names is unreliable [16, 17, 18, 74]. However, we can use any existing data-driven table discovery approach that is scalable in a data lake setting.

With a set of top-k𝑘kitalic_k tables returned as relevant to the Source Table, we need to verify the set similarities of their values with the Source Table. To do so, we retrieve candidate tables among the previously discovered tables using a set similarity algorithm. This could be done efficiently with a system like JOSIE [32] that computes exact set containment or MATE [36] that supports multi-attribute joins. In addition to finding candidate tables containing columns that have high set similarity with a Source Table, we also diversify the set of candidate tables such that each candidate table has minimal overlap with other candidates. This is especially important in public data lakes, which tend to have multiple versions of the same tables [62, 75] and a large percentage of duplicate column sets [32]. By diversifying candidates, each candidate may overlap with different values in S𝑆Sitalic_S, as illustrated in the following example.

Example \theexample

Suppose we have the Source Table from Figure 3. In addition to data lake tables A, B, C, D, we also have Table E, an exact duplicate of Table D. If we only rank these tables using set overlap with the Source Table, Tables D and (its duplicate) Table E become top candidates, since all their columns have high set overlap with those in the Source Table. However, Table E does not add any new information when integrated with Table D. Thus, diversifying the set of candidate tables decreases Table E’s score, pushing other tables such as Table A higher in the top-k𝑘kitalic_k ranking.

With a diverse set of candidates found for each column in a Source Table S𝑆Sitalic_S, we ensure that each candidate table still has high set overlap with the Source Table across related columns. To do so, we find all tuples in a candidate table that share column values with S𝑆Sitalic_S. We verify that for each column that has high set overlap with a column in S𝑆Sitalic_S, it still has high set overlap within these tuples. For columns from the candidate table that have high value overlap with columns in S𝑆Sitalic_S, we rename them with names of corresponding columns in S𝑆Sitalic_S. This way, we implicitly perform schema matching between each candidate table and S𝑆Sitalic_S. Finally, we check for and remove any candidate table whose columns and column values are subsumed by other candidate tables.

V-A2 Matrix Traversal

With a set of diversified candidate tables, we could potentially enter the table integration phase (Section V-B). However, it may be computationally expensive to directly integrate all candidate tables. Thus, we need to refine the set of candidate tables to a set of originating tables containing a maximum set of aligned tuples with respect to the Source Table. To do so, we emulate the table integration process and see what candidate tables are necessary to reclaim our Source Table. By simulating tuple alignment, we can uncover erroneous aligned tuples with respect to the Source Table, and discard tables that could decrease the EIS Score.

First, we need to align tuples in candidate tables to tuples in a Source Table, based on shared values with the key attribute from the Source Table. To do so, we need to ensure that each candidate table contains a key column of the Source Table. If it does not, we greedily find a best way to join it with candidates that include the source key. We use standard join cardinality estimation to find a path that covers the most source key values and is as close to functional as possible (this procedure is denoted as Expand()).333The Expand algorithm is included in the technical report [73]. This way, all tables can align its tuples with the Source Table using key values.

We represent aligned tuples and shared columns from a candidate table in the form of a matrix. To encode aligned tuples, we initialize the matrices to have the same dimensions as the Source Table S𝑆Sitalic_S, such that the matrix indices represent the Source Table’s indices. For each key value and its associated column values in the Source Table, check if the value appears in the candidate table at the corresponding column and key value. If so, then the matrix has 1 at the same index as the value’s index in Source Table, and 0 otherwise.

Refer to caption
Figure 5: Matrix initialization and integration of tables A, B, C given the Source Table from Figure 3 simulate their table integration. The result of matrix integration is equivalent to the matrix representation of the table integration result.

Next, we simulate table integration by applying the logical OR on the matrices, which takes the maximum value at each position. Suppose we have tuples t1,t2subscript𝑡1subscript𝑡2t_{1},t_{2}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT from matrices m1,m2subscript𝑚1subscript𝑚2m_{1},m_{2}italic_m start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_m start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, respectively, at the same row index i𝑖iitalic_i. We want to combine values from t1,t2subscript𝑡1subscript𝑡2t_{1},t_{2}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT at column j𝑗jitalic_j. Assuming that S[i,j]𝑆𝑖𝑗bottomS[i,j]\neq\botitalic_S [ italic_i , italic_j ] ≠ ⊥, the produced tuple mrsubscript𝑚𝑟m_{r}italic_m start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT contains the following value at position (i,j𝑖𝑗i,jitalic_i , italic_j): mr[j]=max(t1[j],t2[j])subscript𝑚𝑟delimited-[]𝑗𝑚𝑎𝑥subscript𝑡1delimited-[]𝑗subscript𝑡2delimited-[]𝑗m_{r}[j]=max(t_{1}[j],t_{2}[j])italic_m start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT [ italic_j ] = italic_m italic_a italic_x ( italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT [ italic_j ] , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT [ italic_j ] ). If t1[j]subscript𝑡1delimited-[]𝑗t_{1}[j]italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT [ italic_j ] is 1 and t2[j]subscript𝑡2delimited-[]𝑗t_{2}[j]italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT [ italic_j ] is 0, for example, max(t1[j],t2[j])𝑚𝑎𝑥subscript𝑡1delimited-[]𝑗subscript𝑡2delimited-[]𝑗max(t_{1}[j],t_{2}[j])italic_m italic_a italic_x ( italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT [ italic_j ] , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT [ italic_j ] ) is 1, which is equivalent to the logical OR. This is comparable to applying the Outer Union (\uplus) of two tables, and Subsumption (β𝛽\betaitalic_β) and Complementation (κ𝜅\kappaitalic_κ) on the resulting table (refer to Theorem 1). In this table integration, for tuples t1,t2subscript𝑡1subscript𝑡2t_{1},t_{2}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT that share a non-null value at the same column, the resulting tuple trsubscript𝑡𝑟t_{r}italic_t start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT is formed such that for every column j𝑗jitalic_j, tr[j]=⟂̸ift1[j] oder t2[j] and tr[j]= otherwiset_{r}[j]=\not\perp\text{if}\,t_{1}[j]\neq\bot\text{ or }t_{2}[j]\neq\bot\text{% and }t_{r}[j]=\bot\text{ otherwise}italic_t start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT [ italic_j ] = ⟂̸ if italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT [ italic_j ] ≠ ⊥ or italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT [ italic_j ] ≠ ⊥ and italic_t start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT [ italic_j ] = ⊥ otherwise. Thus, both table and matrix integrations maximally combine tuples such that non-null values replace a null value at the same index.

1 Input: 𝒯={T1,Tn}𝒯subscript𝑇1normal-…subscript𝑇𝑛\mathcal{T}=\{T_{1},\dots T_{n}\}caligraphic_T = { italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }: set of candidate tables; S𝑆Sitalic_S: Source Table
2 Output: T𝗈𝗋𝗂𝗀={T1,T2,Ti}subscript𝑇𝗈𝗋𝗂𝗀subscript𝑇1subscript𝑇2subscript𝑇𝑖T_{\mathsf{orig}}=\{T_{1},T_{2},\dots T_{i}\}italic_T start_POSTSUBSCRIPT sansserif_orig end_POSTSUBSCRIPT = { italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }: refined set of originating tables
3 𝒯𝒯absent\mathcal{T}\leftarrowcaligraphic_T ← Expand(𝒯,S𝒯𝑆\mathcal{T},Scaligraphic_T , italic_S) //join tables without source key absent\mathcal{M}\leftarrowcaligraphic_M ← MatrixInitialization(𝒯𝒯\mathcal{T}caligraphic_T), //Initialize Matrices of S𝑆Sitalic_S shape T𝗌𝗍𝖺𝗋𝗍subscript𝑇𝗌𝗍𝖺𝗋𝗍absentT_{\mathsf{start}}\leftarrowitalic_T start_POSTSUBSCRIPT sansserif_start end_POSTSUBSCRIPT ← GetStartTable(\mathcal{M}caligraphic_M) prevCorrect = mostCorrect 𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒(T𝗌𝗍𝖺𝗋𝗍)absent𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒subscript𝑇𝗌𝗍𝖺𝗋𝗍\leftarrow\mathsf{evaluateSimilarity}(T_{\mathsf{start}})← sansserif_evaluateSimilarity ( italic_T start_POSTSUBSCRIPT sansserif_start end_POSTSUBSCRIPT ) T𝗈𝗋𝗂𝗀[]subscript𝑇𝗈𝗋𝗂𝗀T_{\mathsf{orig}}\leftarrow[]italic_T start_POSTSUBSCRIPT sansserif_orig end_POSTSUBSCRIPT ← [ ] while |T𝗈𝗋𝗂𝗀|<|𝒯|subscript𝑇𝗈𝗋𝗂𝗀𝒯|T_{\mathsf{orig}}|<|\mathcal{T}|| italic_T start_POSTSUBSCRIPT sansserif_orig end_POSTSUBSCRIPT | < | caligraphic_T | do
4       if T𝗈𝗋𝗂𝗀subscript𝑇𝗈𝗋𝗂𝗀T_{\mathsf{orig}}italic_T start_POSTSUBSCRIPT sansserif_orig end_POSTSUBSCRIPT then
5             Mc𝖢𝗈𝗆𝖻𝗂𝗇𝖾(T𝗈𝗋𝗂𝗀)subscript𝑀𝑐𝖢𝗈𝗆𝖻𝗂𝗇𝖾subscript𝑇𝗈𝗋𝗂𝗀M_{c}\leftarrow\mathsf{Combine}(T_{\mathsf{orig}})italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ← sansserif_Combine ( italic_T start_POSTSUBSCRIPT sansserif_orig end_POSTSUBSCRIPT ) //Iteratively combine each pair of consecutive matrices
6      prevCorrect = mostCorrect; nextTable = bottom\bot for all tables T𝒯s.t.TT𝗈𝗋𝗂𝗀formulae-sequence𝑇𝒯𝑠𝑡𝑇subscript𝑇𝗈𝗋𝗂𝗀T\in\mathcal{T}s.t.T\notin T_{\mathsf{orig}}italic_T ∈ caligraphic_T italic_s . italic_t . italic_T ∉ italic_T start_POSTSUBSCRIPT sansserif_orig end_POSTSUBSCRIPT do
7             Mc𝖢𝗈𝗆𝖻𝗂𝗇𝖾(𝖬𝖼,T)subscript𝑀𝑐𝖢𝗈𝗆𝖻𝗂𝗇𝖾subscript𝖬𝖼𝑇M_{c}\leftarrow\mathsf{Combine}(\mathsf{M_{c}},T)italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ← sansserif_Combine ( sansserif_M start_POSTSUBSCRIPT sansserif_c end_POSTSUBSCRIPT , italic_T ) percentCorrectVals 𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒(Mc)absent𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒subscript𝑀𝑐\leftarrow\mathsf{evaluateSimilarity}(M_{c})← sansserif_evaluateSimilarity ( italic_M start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) if percentCorrectVals >>> mostCorrect then
8                   mostCorrect \leftarrow percentCorrectVals nextTable Tabsent𝑇\leftarrow T← italic_T
9      if mostCorrect = prevCorrect then
10             Exit, //Integration did not find more of S𝑆Sitalic_S’s values
11       T𝗈𝗋𝗂𝗀=T𝗈𝗋𝗂𝗀subscript𝑇𝗈𝗋𝗂𝗀limit-fromsubscript𝑇𝗈𝗋𝗂𝗀T_{\mathsf{orig}}=T_{\mathsf{orig}}\cupitalic_T start_POSTSUBSCRIPT sansserif_orig end_POSTSUBSCRIPT = italic_T start_POSTSUBSCRIPT sansserif_orig end_POSTSUBSCRIPT ∪ nextTable
return T𝗈𝗋𝗂𝗀subscript𝑇𝗈𝗋𝗂𝗀T_{\mathsf{orig}}italic_T start_POSTSUBSCRIPT sansserif_orig end_POSTSUBSCRIPT;
Algorithm 1 Matrix Traversal

We demonstrate matrix initialization and traversal in Algorithm 1. Given a set of candidate tables 𝒯𝒯\mathcal{T}caligraphic_T and Source Table S𝑆Sitalic_S, we first ensure that each candidate table contains a key column from S𝑆Sitalic_S by joining candidate tables that do not share a key column with S𝑆Sitalic_S with those that do (Expand() on line 1). This way, we align tuples in each candidate table with respect to S𝑆Sitalic_S and initialize each candidate table’s matrix representation (Line 1). Then, we traverse over the matrices and perform the logical OR operator to combine a pair of matrices in 𝖢𝗈𝗆𝖻𝗂𝗇𝖾()𝖢𝗈𝗆𝖻𝗂𝗇𝖾\mathsf{Combine()}sansserif_Combine ( ) (Line 1). To evaluate the resulting matrix, we check the fraction of 1’s in the matrix, which represent the number of values in the resulting table integration found in the Source Table, thus evaluating the EIS score. At each step of the matrix traversal (Line 1), including the start (Lines 1-1), we choose the matrix that results in a matrix containing the most 1’s in 𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒()𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒\mathsf{evaluateSimilarity()}sansserif_evaluateSimilarity ( ). This traversal ends when either all matrices have been traversed (Line 1), or the percentage of 1’s in the resulting matrix converges (Lines 1-1). We return the set of tables used in the final traversal as the set of originating tables to perform table integration. If a candidate table that was joined with other candidates to contain a key column from S𝑆Sitalic_S (from Expand() on line 1) becomes an originating table, we include its expanded form in the returned set.

V-A3 Three-Valued Matrices

Previously, we use matrices populated with binary values to represent aligned tuples with respect to the Source Table. However, this representation cannot distinguish between nullified and erroneous aligned tuples with respect to the Source Table. Specifically, it does not account for cases in which a tuple in the Source Table and an aligned tuple in a candidate table have different non-null values in the same column, and if a tuple in the Source Table has a null value while the aligned tuple has a non-null value at the same column. Rather, it represents both types of values as 0 in the matrices. In actuality, when we apply Outer Union on two tables with aligned tuples containing different non-null values in the same column, we keep the tuples separate. Thus, we need to distinguish between nullified and erroneous aligned tuples in the matrix representation (Line 1 in Algorithm 1). To do so, we use three-valued matrices, in which we encode a 1 if a candidate table shares the same value with the Source Table at the same index in an aligned tuple, 0 if a candidate table contains a null where the Source Table has a non-null value at the same index, and -1 if a candidate table contains a non-null value that contradicts with the Source Table’s value at the same index (shown in Figure 5). Formally, given Source Table S𝑆Sitalic_S and candidate table T𝑇Titalic_T, we populate position (i,j)𝑖𝑗(i,j)( italic_i , italic_j ) for each aligned tuple tAlignTsubscript𝑡Align𝑇t_{\text{Align}}\in Titalic_t start_POSTSUBSCRIPT Align end_POSTSUBSCRIPT ∈ italic_T in matrix M𝑀Mitalic_M as:

M[i,j]={1ifS[i,j]=T[i,j]0elifS[i,j]T[i,j]=1otherwiseM[i,j]=\begin{cases}1&\text{if}\,S[i,j]=T[i,j]\\ 0&\text{elif}\,S[i,j]\neq\bot\wedge T[i,j]=\bot\\ -1\quad&\text{otherwise}\end{cases}italic_M [ italic_i , italic_j ] = { start_ROW start_CELL 1 end_CELL start_CELL if italic_S [ italic_i , italic_j ] = italic_T [ italic_i , italic_j ] end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL elif italic_S [ italic_i , italic_j ] ≠ ⊥ ∧ italic_T [ italic_i , italic_j ] = ⊥ end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL otherwise end_CELL end_ROW (4)

Combining the amended three-valued matrix representations with the logical OR takes the maximum of two truth-values at each index. Specifically, if we have two tuples from two matrices that contain a 1 and -1 at the same position, applying logical OR would choose the 1 [76]. However, in practice when applying Outer Union on two tuples with contradicting non-null values at the same index, the resulting integration would contain both tuples. Thus, we need to keep both tuples from the matrices if they contain different non-0 values at the same index. We re-define 𝖢𝗈𝗆𝖻𝗂𝗇𝖾𝖢𝗈𝗆𝖻𝗂𝗇𝖾\mathsf{Combine}sansserif_Combine() (Line 1) between two matrices, given tuples t1,t2subscript𝑡1subscript𝑡2t_{1},t_{2}italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT at the same row index accordingly.

𝖢𝗈𝗆𝖻𝗂𝗇𝖾(t1,t2)={t1,t2if j:t1[j]t2[j]0OR(t1,t2)otherwise𝖢𝗈𝗆𝖻𝗂𝗇𝖾subscript𝑡1subscript𝑡2casessubscript𝑡1subscript𝑡2:if 𝑗subscript𝑡1delimited-[]𝑗subscript𝑡2delimited-[]𝑗0ORsubscript𝑡1subscript𝑡2otherwise\mathsf{Combine}(t_{1},t_{2})=\begin{cases}t_{1},t_{2}\quad&\text{if }\exists j% :\,t_{1}[j]\neq t_{2}[j]\neq 0\\ \text{OR}(t_{1},t_{2})\quad&\text{otherwise}\end{cases}sansserif_Combine ( italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = { start_ROW start_CELL italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL start_CELL if ∃ italic_j : italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT [ italic_j ] ≠ italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT [ italic_j ] ≠ 0 end_CELL end_ROW start_ROW start_CELL OR ( italic_t start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_CELL start_CELL otherwise end_CELL end_ROW (5)

This way, we keep tuples separated if they contain contradicting, non-0 values at the same position. Else we apply logical OR and take the maximum of truth values element-wise. The new Combine() function is illustrated in Example V-A3.

The new 𝖢𝗈𝗆𝖻𝗂𝗇𝖾𝖢𝗈𝗆𝖻𝗂𝗇𝖾\mathsf{Combine}sansserif_Combine() could result in matrices with more rows than in the Source Table. Thus, we encode each matrix as a dictionary, with each key value in the Source Table as a dictionary key, and the list of aligned tuples in the resulting matrix with respect to a tuple in the Source Table as values.

Example \theexample

Given the Source Table from Figure 3, Figure 5 shows the result of integrating tables A, B, and C and their matrix representations. We start with matrix A with the largest number of correct values. Integrating matrices A and B produces more correct values after taking logical OR’s of 0’s and 1’s (from the Combine() function). When combining its resulting matrix with matrix C, we find a (1) and (¬11\neg 1¬ 1) in the first tuple for column “Gender”. In this case, we keep both tuples from OR(A, B) and C. For all other value-pairs, we take the logical OR. When we integrate tables A, B, C, we also find tuples for “Name”=Smith to be separate, since we check if we over-combine tuples and replace correct nulls (Algorithm 2). As a result, the matrix representation of this table integration result is exactly the output of the matrix integration, and so table integration is simulated by integration of matrix representations.

For 𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒()𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒\mathsf{evaluateSimilarity()}sansserif_evaluateSimilarity ( ) of the start (Lines 1-1) and resulting matrices (Line 1), we evaluate the EIS Score by taking the aligned tuple with the largest number of aligned values to its corresponding tuple in the Source Table, or the largest number of 1’s. To find the EIS score (Equation 3) between a tuple t𝑡titalic_t in a resulting matrix that shares a key value with a Source tuple s𝑠sitalic_s, we set α(s,t)𝛼𝑠𝑡\alpha(s,t)italic_α ( italic_s , italic_t ) to be the number of non-key attributes for which tuple t𝑡titalic_t has (1)’s, representing shared values between t𝑡titalic_t and s𝑠sitalic_s. For δ(s,t)𝛿𝑠𝑡\delta(s,t)italic_δ ( italic_s , italic_t ), we take the number of non-key attributes for which t𝑡titalic_t has (-1)’s, representing different, non-null values. Thus, we treat correct, nullified, and erroneous aligned tuples with respect to the Source Table in different manners, and combine their matrix representations depending on the behavior of applying Outer Union and unary operators.

V-B Table Reclamation via Integration

With the set of discovered originating tables, we now integrate them to reproduce the Source Table as closely as possible. Using the set of representative operators ={,σ,π,κ,β}direct-sum𝜎𝜋𝜅𝛽\bigoplus=\{\uplus,\sigma,\pi,\kappa,\beta\}⨁ = { ⊎ , italic_σ , italic_π , italic_κ , italic_β } (Theorem 1), we present an efficient algorithm to explore the integration space to reclaim the Source Table.

Given a set of originating tables (𝒯𝒯\mathcal{T}caligraphic_T), a Source Table (S𝑆Sitalic_S) and using direct-sum\bigoplus, Gen-T outputs a table (T𝗋𝖾𝗌𝗎𝗅𝗍subscript𝑇𝗋𝖾𝗌𝗎𝗅𝗍T_{\mathsf{result}}italic_T start_POSTSUBSCRIPT sansserif_result end_POSTSUBSCRIPT) that reclaims S𝑆Sitalic_S as best as possible. Our table integration method is depicted in Algorithm 2. From Section V-A1, we ensure that all candidate tables (and thus originating tables) contain S𝑆Sitalic_S’s columns.

V-B1 Preprocessing

First, we project out columns not in S𝑆Sitalic_S (π𝜋\piitalic_π), and select tuples whose values are in the source key column (σ𝜎\sigmaitalic_σ). Hence, we only keep columns and tuples that overlap with S𝑆Sitalic_S (ProjectSelect(), Line 2). We then union all originating tables that share the same schema (InnerUnion(), Line 2) to reduce the space of tables we need to explore. To prevent over-combining tuples that share nulls with tuples in S𝑆Sitalic_S, for each table T𝒯𝑇𝒯T\in\mathcal{T}italic_T ∈ caligraphic_T, we find tuples in S𝑆Sitalic_S that share key values and contain nulls in the same columns, and replace these nulls in T𝑇Titalic_T with unique labeled non-null values (LabelSourceNulls(), Line 2). Finally, we remove duplicate tuples, subsumed tuples (β𝛽\betaitalic_β), and take the resulting tuples of complementation (κ𝜅\kappaitalic_κ) (TakeMinimalForm() on Line 2).

V-B2 Integration

We integrate all resulting tables 𝒯subscript𝒯\mathcal{T}_{\cup}caligraphic_T start_POSTSUBSCRIPT ∪ end_POSTSUBSCRIPT, all of which contain the source key column. At each iteration of Ti𝒯subscript𝑇𝑖subscript𝒯T_{i}\in\mathcal{T}_{\cup}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_T start_POSTSUBSCRIPT ∪ end_POSTSUBSCRIPT, we outer union Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with the integrated result so far, Tsubscript𝑇T_{\uplus}italic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT. Next, we check if applying Complementation(κ𝜅\kappaitalic_κ) and Subsumption(β𝛽\betaitalic_β) on Tsubscript𝑇T_{\uplus}italic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT results in a table that has a higher EIS score in the evaluateSimilarity() function. This lets us check if these operators are over-combining tuples (e.g., removing a subsumed tuple that is identical to a tuple in S𝑆Sitalic_S) and decreasing the number of values shared with S𝑆Sitalic_S. After iterating through all tables from the input set, we revert the previous labeling of shared nulls with S𝑆Sitalic_S (RemoveLabeledNulls() on Line 2). To ensure that the resulting integrated table (T𝗋𝖾𝗌𝗎𝗅𝗍subscript𝑇𝗋𝖾𝗌𝗎𝗅𝗍T_{\mathsf{result}}italic_T start_POSTSUBSCRIPT sansserif_result end_POSTSUBSCRIPT) has the same schema as S𝑆Sitalic_S, we add null columns in T𝗋𝖾𝗌𝗎𝗅𝗍subscript𝑇𝗋𝖾𝗌𝗎𝗅𝗍T_{\mathsf{result}}italic_T start_POSTSUBSCRIPT sansserif_result end_POSTSUBSCRIPT for every column it does not share with S𝑆Sitalic_S (Line 2). Finally, we return the resulting integration as a possible reclaimed table.

1 Input: 𝒯={T1,,Tn}𝒯subscript𝑇1normal-…subscript𝑇𝑛\mathcal{T}=\{T_{1},\dots,T_{n}\}caligraphic_T = { italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }: tables to integrate; S𝑆Sitalic_S: the Source Table
2 Output: T𝗋𝖾𝗌𝗎𝗅𝗍subscript𝑇𝗋𝖾𝗌𝗎𝗅𝗍T_{\mathsf{result}}italic_T start_POSTSUBSCRIPT sansserif_result end_POSTSUBSCRIPT: integration result
3 𝒯𝒯absent\mathcal{T}\leftarrowcaligraphic_T ← ProjectSelect(𝒯,S𝒯𝑆\mathcal{T},Scaligraphic_T , italic_S) //σ,π𝜎𝜋\sigma,\piitalic_σ , italic_π (T𝒯)𝑇𝒯(T\in\mathcal{T})( italic_T ∈ caligraphic_T ) on columns, keys in S𝑆Sitalic_S 𝒯subscript𝒯absent\mathcal{T}_{\cup}\leftarrowcaligraphic_T start_POSTSUBSCRIPT ∪ end_POSTSUBSCRIPT ← InnerUnion(𝒯𝒯\mathcal{T}caligraphic_T) //Inner Union tables with shared schemas 𝒯subscript𝒯absent\mathcal{T}_{\cup}\leftarrowcaligraphic_T start_POSTSUBSCRIPT ∪ end_POSTSUBSCRIPT ← LabelSourceNulls(𝒯subscript𝒯\mathcal{T}_{\cup}caligraphic_T start_POSTSUBSCRIPT ∪ end_POSTSUBSCRIPT) //Label Nulls shared with Source Table 𝒯subscript𝒯absent\mathcal{T}_{\cup}\leftarrowcaligraphic_T start_POSTSUBSCRIPT ∪ end_POSTSUBSCRIPT ← TakeMinimalForm(𝒯subscript𝒯\mathcal{T}_{\cup}caligraphic_T start_POSTSUBSCRIPT ∪ end_POSTSUBSCRIPT) //Apply β,κ𝛽𝜅\beta,\kappaitalic_β , italic_κ on each table Tsubscript𝑇T_{\uplus}\leftarrow\emptysetitalic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT ← ∅ for Ti𝒯subscript𝑇𝑖subscript𝒯T_{i}\in\mathcal{T}_{\cup}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_T start_POSTSUBSCRIPT ∪ end_POSTSUBSCRIPT do
4       TTTisubscript𝑇subscript𝑇subscript𝑇𝑖T_{\uplus}\leftarrow T_{\uplus}\uplus T_{i}italic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT ← italic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT //Apply outer union \uplus if 𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒(κ(T),S)𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒(T,S)𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒𝜅subscript𝑇normal-⊎𝑆𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒subscript𝑇normal-⊎𝑆\mathsf{evaluateSimilarity}(\kappa(T_{\uplus}),S)\geq\mathsf{% evaluateSimilarity}(T_{\uplus},S)sansserif_evaluateSimilarity ( italic_κ ( italic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT ) , italic_S ) ≥ sansserif_evaluateSimilarity ( italic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT , italic_S ) then
5             Tκ(T)subscript𝑇𝜅subscript𝑇T_{\uplus}\leftarrow\kappa(T_{\uplus})\qquad\qquad\qquaditalic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT ← italic_κ ( italic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT ) //Apply complementation κ𝜅\kappaitalic_κ
6      if 𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒(β(T),S)𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒(T,S)𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒𝛽subscript𝑇normal-⊎𝑆𝖾𝗏𝖺𝗅𝗎𝖺𝗍𝖾𝖲𝗂𝗆𝗂𝗅𝖺𝗋𝗂𝗍𝗒subscript𝑇normal-⊎𝑆\mathsf{evaluateSimilarity}(\beta(T_{\uplus}),S)\geq\mathsf{evaluateSimilarity% }(T_{\uplus},S)sansserif_evaluateSimilarity ( italic_β ( italic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT ) , italic_S ) ≥ sansserif_evaluateSimilarity ( italic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT , italic_S ) then
7             Tβ(T)subscript𝑇𝛽subscript𝑇T_{\uplus}\leftarrow\mathcal{\beta}(T_{\uplus})\qquad\qquad\qquaditalic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT ← italic_β ( italic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT ) //Apply subsumption β𝛽\betaitalic_β
8T𝗋𝖾𝗌𝗎𝗅𝗍subscript𝑇𝗋𝖾𝗌𝗎𝗅𝗍absentT_{\mathsf{result}}\leftarrowitalic_T start_POSTSUBSCRIPT sansserif_result end_POSTSUBSCRIPT ← RemoveLabeledNulls(Tsubscript𝑇T_{\uplus}italic_T start_POSTSUBSCRIPT ⊎ end_POSTSUBSCRIPT) if T𝗋𝖾𝗌𝗎𝗅𝗍subscript𝑇𝗋𝖾𝗌𝗎𝗅𝗍T_{\mathsf{result}}italic_T start_POSTSUBSCRIPT sansserif_result end_POSTSUBSCRIPT has fewer columns than S𝑆Sitalic_S then
9       add null columns in T𝗋𝖾𝗌𝗎𝗅𝗍subscript𝑇𝗋𝖾𝗌𝗎𝗅𝗍T_{\mathsf{result}}italic_T start_POSTSUBSCRIPT sansserif_result end_POSTSUBSCRIPT for each column ST𝗋𝖾𝗌𝗎𝗅𝗍absent𝑆subscript𝑇𝗋𝖾𝗌𝗎𝗅𝗍\in S\setminus T_{\mathsf{result}}∈ italic_S ∖ italic_T start_POSTSUBSCRIPT sansserif_result end_POSTSUBSCRIPT
Output T𝗋𝖾𝗌𝗎𝗅𝗍subscript𝑇𝗋𝖾𝗌𝗎𝗅𝗍T_{\mathsf{result}}italic_T start_POSTSUBSCRIPT sansserif_result end_POSTSUBSCRIPT
Algorithm 2 Table Integration

VI Experiments

We evaluate Gen-T on benchmarks with tables containing real instances from well-known Benchmarks [77, 78] along with tables from real data lakes [39, 79]. Our baselines are modified related techniques to solve the reclamation problem. Section VI-B shows that Gen-T can perfectly reclaim 15-17 Source Tables, whereas most baselines only fully reclaim 1-3 Source Table across benchmarks. Section VI-C shows that Gen-T is 5X faster than the next-fastest baseline on a large data lake. Finally, Section VI-D shows that Gen-T generalizes to a different real-world application.

VI-A Experimental Setup

We implement Gen-T in Python on a CentOS server with Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz processor. Code for Gen-T and all the baselines is publicly available [80]. We evaluate Gen-T on 6 benchmarks (see Table I).

Benchmark # Tables # Cols Avg Rows Size (MB)
TP-TR Small 32 244 782 3
TP-TR Med 32 244 10.8K 40
TP-TR Large 32 244 1M 3.9K
SANTOS Large +TP-TR Med 11K 122K 7.7K 11K
T2D Gold 515 2,147 74 4
WDC Sample +T2D Gold 15K 75K 14 66
TABLE I: Statistics on Data lakes of each benchmark

TP-TR Benchmarks: We use the 8 tables from the TPC-H benchmark [77], which contain business information including customers, products, suppliers, nations, etc. Using these tables, we create three versions of a benchmark suite titled TP-Table Reclamation (TP-TR). TP-TR Large has TPC-H tables with original table sizes. TP-TR Med has TPC-H tables that are each 1/100 of its original table’s rows, and TP-TR Small has TPC-H tables that are each similar-to\sim1/1000 of its original table’s rows. In each, we take each of the 8 tables and create 4 versions of the same table – creating 32 tables in total. For two versions, we randomly nullify different subsets of values, and for the other two versions, we randomly inject different non-null (erroneous) values in different subsets of values. For the majority of the experiments, the number of nulls (respectively, erroneous values) is 50% meaning that we randomly take 50% of each table’s values and replace them with nulls (respectively, replace them with different new strings). In an ablation study (Section VI-B last two paragraphs), we vary (independently) the number of nulls/erroneous values from 10% to 90%. Our goal in the table discovery phase is then to filter out originating tables with injected non-null noise, so that resulting reclaimed tables do not contain erroneous values. Thus, we seek to verify that our approach uses the nullified versions rather than the erroneous versions so that combining them can reproduce the Source Table.

The Source Tables for the TP-TR benchmarks are created using all 8 of the original (unmodified) TPC-H tables over which we randomly generated 26 queries each having a subset of operators {π,σ,,  ,    ,,𝜋𝜎  absent    \pi,\sigma,\bowtie,\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss% \rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie},\mathbin{\rule[-0.08612pt]{2.5% pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie% \mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5% pt}{0.4pt}}},\cup,\uplusitalic_π , italic_σ , ⋈ , start_BINOP ⋈ end_BINOP , start_BINOP ⋈ end_BINOP , ∪ , ⊎}. In these 26 queries, the number of operations ranges from 2 (just π,σ𝜋𝜎\pi,\sigmaitalic_π , italic_σ), to 9, such that the query with the maximum number of unions contains 4 unioned tables, and the query with the maximum number of joins joins 3 tables. We ran the same queries on each TP-TR benchmark to create 26 Source Tables for the TP-TR Small benchmark containing an average of 9 columns and 27 rows, and 26 Source Tables for the TP-TR Med and TP-TR Large benchmarks that have an average of 9 columns and 1K rows.

SANTOS Large +TP-TR Med Benchmark: In addition, to further assess the effectiveness and scalability of our table discovery method, we embed TP-TR Med into a real, large data lake SANTOS Large [39]. In doing so, we evaluate how well Gen-T prunes a potentially large set of candidate tables retrieved from a large data lake to a smaller set of originating tables that can more accurately reclaim a Source Table when integrated. We use the same Source Tables as for TP-TR Med.

T2D Gold Benchmark: In addition, we explore the real-world application of our method with the T2D Gold Benchmark [78], which takes web tables and matches them to properties from DBpedia. This benchmark was not originally created for the problem of Table Reclamation, so we test the generalizability of Gen-T by seeing if it can reclaim any of this benchmark’s tables. We take 515 raw tables that contain some non-numerical columns and a key column. We do not have prior knowledge of whether or not any of these 515 tables can be “reclaimed” as a Source. Thus, we iterate through each of the 515 tables as potential sources.

WDC Sample +T2D Gold: To further assess the effectiveness of Gen-T, we embed T2D Gold tables into a sample of the WDC web table corpus [79], which contains  15K relational web tables. This way, we can examine how well Gen-T prunes a large set of candidate tables found from a large table corpus to a small set of originating tables that can be integrated to reclaim a Source Table.

VI-A1 Baselines

We compare Gen-T to the state-of-the-art for by-target synthesis, Auto-Pipeline [9], the state-of-the-art for Query-by-example, Ver [59], and the state-of-the-art for table integration, ALITE [24] which were modified for the reclamation problem.

Auto-Pipeline has a similar framework to our problem in discovering the integration (query or pipeline) that reclaims a Source Table. However, Gen-T does not assume to have the perfect set of input tables from which we can synthesize the query that reproduces the Source Table. Auto-Pipeline has both query-search and deep reinforcement learning approaches, but since we propose an unsupervised approach, we use the query-search variation as our baseline. Auto-Pipeline’s code implementation is not openly available, so we adopted an open re-implementation of their search approach [62], which adapts the framework in Foofah [81], and, for a fair comparison, revised their set of table operators to only contain table operators that Gen-T considers ({σ,π,,,  ,    𝜎𝜋  absent    \sigma,\pi,\cup,\bowtie,\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt% {\hss\rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie},\mathbin{\rule[-0.08612pt% ]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu% \bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0% pt]{2.5pt}{0.4pt}}}italic_σ , italic_π , ∪ , ⋈ , start_BINOP ⋈ end_BINOP , start_BINOP ⋈ end_BINOP}). We call this re-implemented, adapted baseline Auto-Pipeline*. Since Auto-Pipeline’s benchmarks contain small tables, and most of their operators are string-transformation operators, we do not consider their benchmarks for our experiments.

Ver has a similar objective of discovering and integrating tables to produce a table that contains the source table and other similar tuples. In contrast, Gen-T aims to reproduce only the tuples from the source table. Ver takes small source tables as input (e.g., source tables with 2 columns and 3 rows [59]), so we query Ver with two columns from the Source Tables. We evaluate the output table for each run, and aggregate the results to evaluate the entire source table.

We validate the need for our Matrix Traversal rather than directly integrating the set of candidate tables returned from Set Similarity (Section V-A1), by comparing against ALITE provided with the set of candidate tables from Set Similarity as input. We also compare with a variation of ALITE, which we call ALITE-PS, that, similar to Gen-T, first performs projection and selection to match the Source Table before performing table integration. ALITE without project and select is much slower as it creates a larger integration result.

For all baselines on the TP-TR benchmarks, we create another variant in which we give each method a specific integrating set (int. set) of tables as input, rather than the full set of candidate tables returned from Set Similarity. We know what subset of tables from the 8 original tables were used to create the 26 Source Tables, so, we know that a perfect reclamation contains variants of these tables. Thus, for all original tables used to create each Source Table, the integrating set includes all variations (2 nullified and 2 erroneous versions of each original table) of these tables.

VI-A2 Metrics

For effectiveness, we evaluate how many values in a Source Table have been reclaimed, or how similar the values in the reclaimed table are to those of the Source Table. Thus, the Source Tables are essentially our ground truth in that we see how many of its values we can reproduce. In an aligned tuple in a reclaimed table with respect to a Source Table, it contains an erroneous value if there is a non-null, different value at a given column compared to a value in the same column in the Source Table. Similarly, it contains a nullified value if it contains a null value in a column where the Source tuple contains a non-null value (see Section IV-A).

TP-TR Med SANTOS Large +TP-TR Med TP-TR Large
Method Rec Pre Inst-Div. DKLKL{}_{\text{KL}}start_FLOATSUBSCRIPT KL end_FLOATSUBSCRIPT Rec Pre Inst-Div. DKLKL{}_{\text{KL}}start_FLOATSUBSCRIPT KL end_FLOATSUBSCRIPT Rec Pre Inst-Div. DKLKL{}_{\text{KL}}start_FLOATSUBSCRIPT KL end_FLOATSUBSCRIPT
ALITE 0.662 0.202 0.100 35.831
ALITE w/ int. set 0.694 0.202 0.085 36.348 0.694 0.202 0.085 36.348
ALITE-PS 0.880 0.556 0.009 3.524 0.842 0.554 0.011 4.629 0.775 0.521 0.049 21.978
ALITE-PS w/ int. set 0.880 0.569 0.009 3.524 0.880 0.569 0.009 3.524 0.880 0.569 0.009 3.524
Gen-T 0.976 0.867 0.004 1.326 0.976 0.867 0.004 1.326 0.971 0.807 0.004 1.490
TABLE II: Effectiveness of Gen-T and baselines ALITE, ALITE-PS on the larger TP-TR benchmarks. ALITE, ALITE-PS, and Gen-T are given the same set of candidate tables from Set Similarity, and ALITE and ALITE-PS are also given an integrating set (“w/ int. set”). If there are no results for some method, then it timed out for most, if not all, Source Tables.

Precision and Recall: Consider a Source Table S𝑆Sitalic_S and an output reclaimed table S^^𝑆\hat{S}over^ start_ARG italic_S end_ARG. From the measure Tuple Difference Ratio (TDR) [24], we derive two similarity measures, Recall (Rec) and Precision (Pre), that measure the # of tuples in the intersection of S𝑆Sitalic_S and S^^𝑆\hat{S}over^ start_ARG italic_S end_ARG relative to the # of tuples in each table.
Rec=|SS^|/|S|Rec𝑆^𝑆𝑆\text{Rec}=|S\cap{\hat{S}|}/{|S|}Rec = | italic_S ∩ over^ start_ARG italic_S end_ARG | / | italic_S | and Pre=|SS^|/|S^|Pre𝑆^𝑆^𝑆\text{Pre}=|S\cap{\hat{S}|}/{|\hat{S}|}Pre = | italic_S ∩ over^ start_ARG italic_S end_ARG | / | over^ start_ARG italic_S end_ARG |.

In addition to metrics that measure the similarity between the tuples of a reclaimed table and a Source Table, we also include finer-grain metrics that measure the number of values that do not match within aligned tuples (tuples with the same key value). If there are multiple aligned tuples with respect to one tuple in the Source Table (multiple tuples in the reclaimed table with the same key value), then we consider the tuple that contains the largest number of column values shared with the corresponding tuple in the Source Table. This way, there is at most 1 aligned tuple in the reclaimed table for each tuple in the Source Table. In these measures, which we denote as divergence measures, the ideal score is 0 (the reclaimed table is identical to the Source Table). Specifically, we introduce Instance Divergence and Conditional KL-Divergence, enabling us to measure the nullified and erroneous values in the reclaimed table’s aligned tuples, with respect to the Source Table (see Section IV-A).

Instance Divergence: We measure how many missing values there are in each aligned tuple, with respect to its corresponding tuple in the Source Table. To do so, we use the inverse of Instance Similarity [20] (see Equation 2) and introduce Instance Divergence: Inst-Div.=1Instance SimilarityInst-Div.1Instance Similarity\text{Inst-Div.}=1-\text{Instance Similarity}Inst-Div. = 1 - Instance Similarity.

Conditional KL-divergence: We aim to capture how many erroneous values there are in aligned tuples from a reclaimed table with respect to tuples in a Source Table. We adopt the traditional definition of conditional KL-divergence [82, 83], and add a penalization for erroneous values, such that the score is higher (diverges more) for reclaimed tables containing erroneous values as opposed to nulls in their aligned tuples with the Source Table (see the technical report [73] for more details). We report average scores over all Source Tables.

Efficiency Measures: For efficiency, we measure the average runtimes for all Source Tables, as well as the average ratio of the output reclaimed table’s size to the Source Table’s size (large integrations can significantly increase runtimes).

VI-B Effectiveness

Method Rec Pre Inst-Div. DKLKL{}_{\text{KL}}start_FLOATSUBSCRIPT KL end_FLOATSUBSCRIPT
ALITE 0.704 0.128 0.095 1.332
ALITE w/ int. set 0.745 0.133 0.086 1.197
ALITE-PS 0.805 0.539 0.040 0.655
ALITE-PS w/ int. set 0.833 0.552 0.037 0.688
Auto-Pipeline* 0.674 0.272 0.158 2.574
Auto-Pipeline* w/ int. set 0.683 0.289 0.133 2.109
Ver w/ int. set 0.746 0.351 0.127 10.393
Gen-T 0.954 0.799 0.015 0.165
TABLE III: Similarity (Rec, Pre) and Divergence (Inst-Div., DKLKL{}_{\text{KL}}start_FLOATSUBSCRIPT KL end_FLOATSUBSCRIPT) of Gen-T and baselines on TP-TR Small benchmark.

Tables III and II report the results for all methods on the TP-TR Small benchmark and the larger TP-TR benchmarks, respectively. For experiments on TP-TR Small, TP-TR Med, and TP-TR Large benchmarks, we input candidate tables discovered from just Set Similarity (Section V-A1). For experiments on SANTOS Large +TP-TR Med, we first discover relevant tables from the large data lake using Starmie [40], a state-of-the-art self-supervised system for scalable table discovery. Hence, it can discover a set of candidate tables for the Source Table from a large data lake. Although the primary use case of Starmie was table union search, it was shown to apply to other search semantics such as table discovery to improve the performance of downstream machine learning tasks via feature discovery (join search) and column clustering. Following Starmie, we run Set Similarity to find syntactically similar tables among the returned tables from Starmie.

We run baselines Auto-Pipeline* and Ver (w/ int. set) only on TP-TR Small benchmark (Table III), since they timeout for most, if not all Source Tables in the larger benchmarks. Ver times out when given the entire data lake from TP-TR Small. ALITE times out only on TP-TR Large benchmark (Table II). We discuss scalability and timeouts in Section VI-C.

Across all benchmarks, Gen-T outperforms the baselines for all metrics, while perfectly reclaiming 15-17 Source Tables across all benchmarks. Baselines ALITE-PS and Auto-Pipeline* only perfectly reclaim 3 Source Tables and 1 Source Table, respectively, across the benchmarks on which they do not time out, and ALITE and Ver does not perfectly reclaim any. In fairness, ALITE is an integration method that does not consider the Source Table (it is not ”target-driven” like Auto-Pipeline*). Also, Ver is a QBE method whose goal is to produce source tuples along with many additional tuples. In terms of similarity (Table II), Gen-T outperforms the top performing existing baseline method (ALITE-PS) by 11-25% in Recall and by 48-56% in Precision across all TP-TR benchmarks. For the divergence measures, we see that Gen-T produces tables that contain fewer nullified values in its aligned tuples with respect to the Source Table (Inst-Div.), as well as fewer erroneous values in its aligned tuples, which is reflected in the lower DKLKL{}_{\text{KL}}start_FLOATSUBSCRIPT KL end_FLOATSUBSCRIPT scores than the baselines.

Even compared to each baseline that is given specified integrating sets of tables rather than large sets of candidates (‘w/ int. set’), Gen-T performs much better. Thus, the matrix traversal method (Section V-A3) used in Gen-T to refine the set of originating tables works well in filtering out misleading tables that could be integrated to produce tables containing erroneous values. We provide benchmark samples to exemplify this in our repository [80].

Refer to caption
Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Refer to caption
(d)
Refer to caption
(e)
Refer to caption
(f)
Figure 6: Recall and Precision of different types of queries that produce Source Tables over the TP-TR Benchmarks.

We now perform an analysis of the similarity measures for all methods on different types of queries used to form the Source Tables in TP-TR benchmarks, shown in Figure 6. Ranging from simple queries (that just perform Projection, Selection, and Union) to more complex queries (joining up to 3 tables and unioning up to 4 tables), we see that Gen-T outperforms the baselines on queries of all complexities used to initially create the Source Table. Thus, not only is the matrix traversal effective, but the set of operators used in table integration represents different types of queries well.

Tuning % Erroneous vs. Nullified Values: We further analyze Gen-T’s performance on data lake tables with different number of erroneous and nullified values in TP-TR Med tables (Figure 7). So far, TP-TR Med tables have 50% erroneous values in erroneous versions and 50% nulls in nullified versions (intersection point on the graph where Gen-T has 0.867 Precision). Now, we tune the percentage of values replaced with non-null, random strings (blue line in Figure 7) in erroneous versions, while the nullified versions always contain 50% nulls. Similarly, we tune the percentage of values replaced with nulls (red line in Figure 7) while holding the erroneous versions constant. For Gen-T to produce a perfect reclamation of a Source Table, it should only have originating tables with injected nulls so that these nulls can be replaced with correct values during table integration.

Refer to caption
Figure 7: Gen-T Precision as TP-TR Med has different % of Erroneous (blue triangles) and Nullified Values (red circles).

As data lake tables have more erroneous values (blue line), Gen-T is more likely to contain tables with nullified tuples in its set of originating tables, which results in an integrated table with higher precision. On the other hand, as we tune the percentage of values replaced with nulls (red line), precision decreases. As more nulls are injected, these tables also have fewer correct values. Gen-T is thus more inclined to have originating tables with 50% erroneous values, or 50% correct values, leading to a final integration with lower precision.

VI-C Scalability

Figure 8 shows the scalability of Gen-T, ALITE, ALITE-PS, and Auto-Pipeline* across benchmarks as the number and/or size of tables grows. Figure 8(a) reports average runtimes for all methods across all four benchmarks, starting from ingestion of the candidate tables. For Gen-T, this time includes the time it takes to prune the set of candidate tables to a set of originating tables, and integrate it to produce a reclaimed table. For other methods, this time only includes integration. We find that Auto-Pipeline* only runs on TP-TR Small without timing out, and ALITE, which performs full disjunction, is exponential in time and times out for the last two benchmarks. We set the timeouts as 30min for TP-TR Small, 7hrs for TP-TR Med and SANTOS Large +TP-TR Med and 24hrs for TP-TR Large.

Gen-T has a more consistent runtime across all benchmarks compared to all baselines. Gen-T is 3X faster compared to Auto-Pipeline* on TP-TR Small. On TP-TR Med, Gen-T is 40X faster than ALITE and on TP-TR Large, Gen-T is 5X faster than ALITE-PS. Thus, pruning candidate tables to originating tables seems to cut the cost of integration, a prevalent issue as shown by the baselines.

Figure 8(b) reports the average output sizes, or number of cell values in the reclaimed tables, with respect to the average Source Table sizes. As the number and size of tables grows across benchmarks, the output size relative to the size of the Source Table (expected output size) can easily grow at a fast rate if the integration is among more or larger tables, especially if it includes noisy tables from the real data lake (SANTOS Large). Output sizes for Gen-T remain consistent across all benchmarks (1.4-4.5X larger than the average Source Table size). This trend largely accounts for the higher precision of Gen-T since its output tables mostly consist of Source tuples. In contrast, output sizes for ALITE, ALITE-PS, and Auto-Pipeline* are 200-300X, 2.5-250X, and 4X larger than Source Tables’ sizes, respectively. Thus, Gen-T’s runtimes and output sizes remain consistent across benchmarks of different sizes.

Refer to caption
Refer to caption
(a)
Refer to caption
(b)
Figure 8: Scalability via Average Runtime (sec) and Ratio of Output Sizes to Source Table sizes over TP-TR benchmarks.

VI-D Generalizability

We also experiment with the T2D Gold benchmark to see how well we can apply Gen-T to a real-world scenario with Web Tables. In this case, we do not know whether oder how the tables were originally generated. Accordingly, we attempt to reclaim each table using a subset of other tables in the benchmark by iterating through each 515 tables as potential Source Tables. Gen-T successfully reclaims 3 Source Tables from an integration of multiple tables (5-6 tables), such that the outputs have perfect Recall, Precision, Instance-Div., and near-perfect DKLKL{}_{\text{KL}}start_FLOATSUBSCRIPT KL end_FLOATSUBSCRIPT. Gen-T also finds duplicate tables for 12 Source Tables, or 6 sets of duplicates. This indicates that we can apply Gen-T in a different domain, even if no sources are known to be reclaimed, and retrieve successful reclamations. Baseline methods are able to reclaim 12-13 Source Tables, which are included in the 15 Source Tables reclaimed by Gen-T.

Method Recall Precision Inst-Div. DKLKL{}_{\text{KL}}start_FLOATSUBSCRIPT KL end_FLOATSUBSCRIPT
ALITE 0.956 0.490 0.009 0.627
ALITE-PS 0.956 0.796 0.009 0.627
Auto-Pipeline* 0.881 0.725 0.088 19.261
Gen-T 0.956 1.000 0.009 0.627
TABLE IV: Sources from T2D Gold immersed in the WDC Sample for which all methods have non-empty outputs.

We then run experiments on a data lake consisting of both T2D Gold and WDC Sample tables. This way, we evaluate how well the methods perform when a set of candidate tables returned from Set Similarity may contain irrelevant or misleading tables from the WDC Sample benchmark. Table IV presents the similarity and divergence scores for all methods on 33 of the common sources from T2D Gold for which all methods have non-empty, reasonably sized output tables. We can see that Gen-T outperforms the baselines for all measures, even having a precision of 1.0. In contrast, the baseline methods that are given the candidate tables from Set Similarity integrate all candidate tables and produce tables that contain many additional tuples.

VII Conclusion and Open Problems

Table Reclamation is essential in verifying if a data lake supports the tuples (facts) in a Source table. Our results show that, despite the large search space, Gen-T can solve the reclamation problem efficiently for source tables with keys. In future work, we will relax the key assumption with regard to source tables, and use a fast, approximate instance comparison algorithm to compare instances from a source table and data lake tables [84]. When a table can only be partially reclaimed, we plan to investigate whether the originating tables can be embedded in a new data lake and used to possibly generate a better reclamation. Alternatively, we plan to investigate if reclamation can be combined with data cleaning (for example, value imputation over missing values or entity resolution) to produce a better reclamation. In addition, we plan to consider the case in which values from a source table do not syntactically align with values from a data lake, in which case we can explore the semantic similarity of instances. Table reclamation can also be used to verify the tabular results of generative AI or large language models. Verifying the output of generative AI using a data management lens is an emerging and important area [85, 86]. For example, users who generate summary tables and charts (e.g. Microsoft Copilot [87]) or presentation slides (e.g. SlidesAI [88]) from input data would find it useful to verify model outputs and examine what data was used to generate them. Our approach will advance this area by allowing the automatic verification of tables that are created through complex integrations of other tables, something not yet considered in the literature.

Acknowledgements

This work was supported in part by NSF award numbers IIS-2107248, IIS-1956096, and IIS-2325632.

References

  • [1] Microsoft, 2021, https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RWNrak, last accessed on Nov 2, 2023.
  • [2] J. Cheney, L. Chiticariu, and W. C. Tan, “Provenance in databases: Why, how, and where,” Found. Trends Databases, vol. 1, no. 4, pp. 379–474, 2009.
  • [3] J. Cheney and W. Tan, “Provenance in databases,” in Encyclopedia of Database Systems, Second Edition.   Springer, 2018.
  • [4] M. Koehler, E. Abel, A. Bogatu, C. Civili, L. Mazilu, N. Konstantinou, A. A. A. Fernandes, J. A. Keane, L. Libkin, and N. W. Paton, “Incorporating data context to cost-effectively automate end-to-end data wrangling,” IEEE Trans. Big Data, vol. 7, no. 1, pp. 169–186, 2021.
  • [5] C. Wang, A. Cheung, and R. Bodík, “Synthesizing highly expressive SQL queries from input-output examples,” in PLDI, 2017, pp. 452–466.
  • [6] Y. Shen, K. Chakrabarti, S. Chaudhuri, B. Ding, and L. Novik, “Discovering queries based on example tuples,” in SIGMOD, 2014, pp. 493–504.
  • [7] D. Deutch and A. Gilad, “QPlain: Query by explanation,” in ICDE, 2016, pp. 1358–1361.
  • [8] R. Bavishi, C. Lemieux, R. Fox, K. Sen, and I. Stoica, “AutoPandas: neural-backed generators for program synthesis,” Proc. ACM Program. Lang., vol. 3, no. OOPSLA, pp. 168:1–168:27, 2019.
  • [9] J. Yang, Y. He, and S. Chaudhuri, “Auto-Pipeline: Synthesize data pipelines by-target using reinforcement learning and search,” Proc. VLDB Endow., vol. 14, no. 11, pp. 2563–2575, 2021.
  • [10] M. Zhang, H. Elmeleegy, C. M. Procopiuc, and D. Srivastava, “Reverse engineering complex join queries,” in SIGMOD, 2013, pp. 809–820.
  • [11] D. V. Kalashnikov, L. V. S. Lakshmanan, and D. Srivastava, “FastQRE: Fast query reverse engineering,” in SIGMOD, 2018, pp. 337–350.
  • [12] P. Orvalho, M. Terra-Neves, M. Ventura, R. Martins, and V. M. Manquinho, “SQUARES: A SQL synthesizer using query reverse engineering,” Proc. VLDB Endow., vol. 13, no. 12, pp. 2853–2856, 2020.
  • [13] Q. T. Tran, C. Chan, and S. Parthasarathy, “Query by output,” in SIGMOD, 2009, pp. 535–548.
  • [14] A. Bonifati, R. Ciucanu, A. Lemay, and S. Staworko, “A paradigm for learning queries on big data,” in Data4U@VLDB, 2014, p. 7.
  • [15] F. Nargesian, E. Zhu, K. Q. Pu, and R. J. Miller, “Table union search on open data,” Proc. VLDB Endow., vol. 11, no. 7, pp. 813–825, 2018.
  • [16] M. D. Adelfio and H. Samet, “Schema extraction for tabular data on the web,” Proc. VLDB Endow., vol. 6, no. 6, pp. 421–432, 2013.
  • [17] M. H. Farid, A. Roatis, I. F. Ilyas, H. Hoffmann, and X. Chu, “CLAMS: bringing quality to data lakes,” in SIGMOD, 2016, pp. 2089–2092.
  • [18] F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, and P. C. Arocena, “Data lake management: Challenges and opportunities,” Proc. VLDB Endow., vol. 12, no. 12, pp. 1986–1989, 2019.
  • [19] J. Bleiholder and F. Naumann, “Data fusion,” ACM Comput. Surv., vol. 41, no. 1, pp. 1:1–1:41, 2008.
  • [20] B. Alexe, M. A. Hernández, L. Popa, and W. C. Tan, “MapMerge: Correlating independent schema mappings,” VLDB J., vol. 21, no. 2, pp. 191–211, 2012.
  • [21] L. Jiang and F. Naumann, “Holistic primary key and foreign key detection,” J. Intell. Inf. Syst., vol. 54, no. 3, pp. 439–461, 2020.
  • [22] L. Bornemann, T. Bleifuß, D. V. Kalashnikov, F. Naumann, and D. Srivastava, “Natural key discovery in wikipedia tables,” in WWW, 2020, pp. 2789–2795.
  • [23] C. A. Galindo-Legaria, “Outerjoins as disjunctions,” in SIGMOD, 1994, pp. 348–358.
  • [24] A. Khatiwada, R. Shraga, W. Gatterbauer, and R. J. Miller, “Integrating data lake tables,” Proc. VLDB Endow., vol. 16, pp. 932–945, 2022.
  • [25] M. J. Cafarella, A. Y. Halevy, and N. Khoussainova, “Data integration for the relational web,” Proc. VLDB Endow., vol. 2, no. 1, pp. 1090–1101, 2009.
  • [26] D. Brickley, M. Burgess, and N. F. Noy, “Google dataset search: Building a search engine for datasets in an open web ecosystem,” in WWW, 2019, pp. 1365–1375.
  • [27] G. Limaye, S. Sarawagi, and S. Chakrabarti, “Annotating and searching web tables using entities, types and relationships,” Proc. VLDB Endow., vol. 3, no. 1, pp. 1338–1347, 2010.
  • [28] R. Shraga, H. Roitman, G. Feigenblat, and M. Canim, “Ad hoc table retrieval using intrinsic and extrinsic similarities,” in WWW, 2020, pp. 2479–2485.
  • [29] R.  Shraga, H. Roitman, G. Feigenblat, and M. Canim, “Web table retrieval using multimodal deep learning,” in SIGIR, 2020, pp. 1399–1408.
  • [30] A. D. Sarma, L. Fang, N. Gupta, A. Y. Halevy, H. Lee, F. Wu, R. Xin, and C. Yu, “Finding related tables,” in SIGMOD, 2012, pp. 817–828.
  • [31] E. Zhu, F. Nargesian, K. Q. Pu, and R. J. Miller, “LSH ensemble: Internet-scale domain search,” Proc. VLDB Endow., vol. 9, no. 12, pp. 1185–1196, 2016.
  • [32] E. Zhu, D. Deng, F. Nargesian, and R. J. Miller, “JOSIE: overlap set similarity search for finding joinable tables in data lakes,” in SIGMOD, 2019, pp. 847–864.
  • [33] R. C. Fernandez, E. Mansour, A. A. Qahtan, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang, “Seeping semantics: Linking datasets using word embeddings for data discovery,” in ICDE, 2018, pp. 989–1000.
  • [34] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri, “InfoGather: entity augmentation and attribute discovery by holistic matching with web tables,” in SIGMOD, 2012, pp. 97–108.
  • [35] O. Lehmberg, D. Ritze, P. Ristoski, R. Meusel, H. Paulheim, and C. Bizer, “The mannheim search join engine,” J. Web Semant., vol. 35, pp. 159–166, 2015.
  • [36] M. Esmailoghli, J. Quiané-Ruiz, and Z. Abedjan, “MATE: multi-attribute table extraction,” Proc. VLDB Endow., vol. 15, no. 8, pp. 1684–1696, 2022.
  • [37] Y. Dong, C. Xiao, T. Nozawa, M. Enomoto, and M. Oyamada, “DeepJoin: Joinable table discovery with pre-trained language models,” Proc. VLDB Endow., vol. 16, no. 10, pp. 2458–2470, 2023.
  • [38] X. Ling, A. Y. Halevy, F. Wu, and C. Yu, “Synthesizing union tables from the web,” in IJCAI, 2013, pp. 2677–2683.
  • [39] A. Khatiwada, G. Fan, R. Shraga, Z. Chen, W. Gatterbauer, R. J. Miller, and M. Riedewald, “SANTOS: Relationship-based semantic table union search,” in SIGMOD, 2023.
  • [40] G. Fan, J. Wang, Y. Li, D. Zhang, and R. J. Miller, “Semantics-aware dataset discovery from data lakes with contextualized column-based representation learning,” Proc. VLDB Endow., vol. 16, no. 7, pp. 1726–1739, 2023.
  • [41] S. Galhotra, Y. Gong, and R. C. Fernandez, “Metam: Goal-oriented data discovery,” in ICDE, 2023, pp. 2780–2793.
  • [42] O. Lehmberg and C. Bizer, “Stitching web tables for improving matching quality,” Proc. VLDB Endow., vol. 10, no. 11, pp. 1502–1513, 2017.
  • [43] E. Rahm and P. A. Bernstein, “A survey of approaches to automatic schema matching,” VLDB J., vol. 10, no. 4, pp. 334–350, 2001.
  • [44] R. Shraga, A. Gal, and H. Roitman, “ADnEV: Cross-domain schema matching using deep similarity matrix adjustment and evaluation,” Proc. VLDB Endow., vol. 13, no. 9, pp. 1401–1415, 2020.
  • [45] C. Koutras, G. Siachamis, A. Ionescu, K. Psarakis, J. Brons, M. Fragkoulis, C. Lofi, A. Bonifati, and A. Katsifodimos, “Valentine: Evaluating matching techniques for dataset discovery,” in ICDE, 2021, pp. 468–479.
  • [46] H. H. Do and E. Rahm, “COMA - A system for flexible combination of schema matching approaches,” in Proc. VLDB Endow., 2002, pp. 610–621.
  • [47] S. Melnik, H. Garcia-Molina, and E. Rahm, “Similarity flooding: A versatile graph matching algorithm and its application to schema matching,” in ICDE, 2002, pp. 117–128.
  • [48] C. Chen, B. Golshan, A. Y. Halevy, W. Tan, and A. Doan, “BigGorilla: An open-source ecosystem for data preparation and integration,” IEEE Data Eng. Bull., vol. 41, no. 2, pp. 10–22, 2018.
  • [49] R. Cappuzzo, P. Papotti, and S. Thirumuruganathan, “Creating embeddings of heterogeneous relational datasets for data integration tasks,” in SIGMOD, 2020, pp. 1335–1349.
  • [50] Y. Li, J. Li, Y. Suhara, A. Doan, and W. Tan, “Deep entity matching with pre-trained language models,” Proc. VLDB Endow., vol. 14, no. 1, pp. 50–60, 2020.
  • [51] Y. Li, J. Li, Y. Suhara, J. Wang, W. Hirota, and W. Tan, “Deep entity matching: Challenges and opportunities,” ACM J. Data Inf. Qual., vol. 13, no. 1, pp. 1:1–1:17, 2021.
  • [52] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, and K. Stefanidis, “An overview of end-to-end entity resolution for big data,” ACM Comput. Surv., vol. 53, no. 6, pp. 127:1–127:42, 2021.
  • [53] L. Getoor and A. Machanavajjhala, “Entity resolution: Theory, practice & open challenges,” Proc. VLDB Endow., vol. 5, no. 12, pp. 2018–2019, 2012.
  • [54] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, and V. Raghavendra, “Deep learning for entity matching: A design space exploration,” in SIGMOD, 2018, pp. 19–34.
  • [55] D. Zhang, Y. Nie, S. Wu, Y. Shen, and K. Tan, “Multi-context attention for entity matching,” in WWW, 2020, pp. 2634–2640.
  • [56] C. Zhao and Y. He, “Auto-EM: End-to-end fuzzy entity-matching using pre-trained deep models and transfer learning,” in WWW, 2019, pp. 2413–2424.
  • [57] S. Gurajada, L. Popa, K. Qian, and P. Sen, “Learning-based methods with human-in-the-loop for entity resolution,” in CIKM, 2019, pp. 2969–2970.
  • [58] M. M. Zloof, “Query by example,” in American Federation of Information Processing Societies: 1975 National Computer Conference, ser. AFIPS Conference Proceedings, vol. 44.   AFIPS Press, 1975, pp. 431–438.
  • [59] Y. Gong, Z. Zhu, S. Galhotra, and R. C. Fernandez, “Ver: View discovery in the wild,” in ICDE, 2023, pp. 503–516.
  • [60] E. K. Rezig, A. Bhandari, A. Fariha, B. Price, A. Vanterpool, V. Gadepally, and M. Stonebraker, “DICE: data discovery by example,” Proc. VLDB Endow., vol. 14, no. 12, pp. 2819–2822, 2021.
  • [61] R. C. Fernandez, N. Tang, M. Ouzzani, M. Stonebraker, and S. Madden, “Dataset-On-Demand: Automatic view search and presentation for data discovery,” CoRR, vol. abs/1911.11876, 2019.
  • [62] R. Shraga and R. J. Miller, “Explaining dataset changes for semantic data versioning with Explain-Da-V,” Proc. VLDB Endow., vol. 16, no. 6, pp. 1587–1600, 2023.
  • [63] M. Mahdavi and Z. Abedjan, “Baran: Effective error correction via a unified context representation and transfer learning,” Proc. VLDB Endow., vol. 13, no. 11, pp. 1948–1961, 2020.
  • [64] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa, “Data exchange: semantics and query answering,” Theor. Comput. Sci., vol. 336, no. 1, pp. 89–124, 2005.
  • [65] R. Fagin, P. G. Kolaitis, A. Nash, and L. Popa, “Towards a theory of schema-mapping optimization,” in PODS, 2008, pp. 33–42.
  • [66] X. Zheng, S. Dasgupta, and A. Gupta, “P2KG: Declarative construction and quality evaluation of knowledge graph from polystores,” in ADBIS, 2023, pp. 427–439.
  • [67] P. C. Arocena, B. Glavic, R. Ciucanu, and R. J. Miller, “The iBench integration metadata generator,” Proc. VLDB Endow., vol. 9, no. 3, pp. 108–119, 2015.
  • [68] L. Mazilu, N. W. Paton, A. A. A. Fernandes, and M. Koehler, “Dynamap: Schema mapping generation in the wild,” in SSDBM, 2019, pp. 37–48.
  • [69] P. C. Arocena, B. Glavic, G. Mecca, R. J. Miller, P. Papotti, and D. Santoro, “Benchmarking data curation systems,” IEEE Data Eng. Bull., vol. 39, no. 2, pp. 47–62, 2016.
  • [70] J. Bleiholder, S. Szott, M. Herschel, and F. Naumann, “Complement union for data integration,” in ICDE, 2010, pp. 183–186.
  • [71] R. Ramakrishnan and J. Gehrke, Database management systems (3. ed.).   McGraw-Hill, 2003.
  • [72] E. F. Codd, “Extending the data base relational model to capture more meaning (abstract),” in SIGMOD, 1979, p. 161.
  • [73] G. Fan, R. Shraga, and R. J. Miller, “Gen-T: Table reclamation on data lakes,” CoRR, vol. abs/2403.14128, 2024.
  • [74] Y. Zhang and Z. G. Ives, “Finding related tables in data lakes for interactive data science,” in SIGMOD, 2020, pp. 1951–1966.
  • [75] M. Koch, M. Esmailoghli, S. Auer, and Z. Abedjan, “Duplicate table discovery with xash,” BTW, 2023.
  • [76] L. Bolc and P. Borowik, Many-valued logics 1: Theoretical foundations.   Springer Science & Business Media, 2013.
  • [77] TPC, 2014, http://www.tpc.org/, last accessed on Nov 11, 2023.
  • [78] WDC, 2017, http://webdatacommons.org/webtables/goldstandard.html, last accessed on Nov 11, 2023.
  • [79] O. Lehmberg, D. Ritze, R. Meusel, and C. Bizer, “A large public corpus of web tables containing time and context metadata,” in WWW (Companion Volume), 2016, pp. 75–76.
  • [80] G. Fan, R. Shraga, and R. J. Miller, “Gen-T repository,” 2024, {https://github.com/northeastern-datalab/gen-t}, last accessed on Mar 22, 2024.
  • [81] Z. Jin, M. R. Anderson, M. J. Cafarella, and H. V. Jagadish, “Foofah: Transforming data by example,” in SIGMOD, 2017, pp. 683–698.
  • [82] T. M. Cover and J. A. Thomas, Elements of information theory (2. ed.).   Wiley, 2006.
  • [83] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to information retrieval.   Cambridge University Press, 2008.
  • [84] B. Glavic, G. Mecca, R. J. Miller, P. Papotti, D. Santoro, and E. Veltri, “Similarity measures for incomplete database instances,” in EDBT, 2024, pp. 461–473.
  • [85] N. Tang, C. Yang, J. Fan, L. Cao, and A. Halevy, “VerifAI: Verified generative AI,” CoRR, vol. abs/2307.02796, 2023.
  • [86] R. C. Fernandez, A. J. Elmore, M. J. Franklin, S. Krishnan, and C. Tan, “How large language models will disrupt data management,” Proc. VLDB Endow., vol. 16, no. 11, pp. 3302–3309, 2023.
  • [87] Microsoft, 2024, https://www.microsoft.com/en-us/copilot, last accessed on Nov 21, 2023.
  • [88] SlidesAI, 2024, https://www.slidesai.io/, last accessed on Nov 21, 2023.
  • [89] OpenAI, 2024, free Research Preview. https://chat.openai.com/, last accessed on Feb 25, 2024.

-A Preliminaries: Table Operators

Suppose we have two tables, T1,T2subscript𝑇1subscript𝑇2T_{1},T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, that share common columns C𝐶Citalic_C, and are in their minimal forms in which there are no duplicates and no tuples that can be subsumed or complemented. We show that for each pairwise table operator, Inner Union, Inner Join, Left Join, Outer Join, Cross Product, there exists an equivalent query consisting of Outer Union and/or unary operators. (SP of SPJU queries are accounted for by the unary operators).

Lemma \thelemma (Inner Union)

Inner Union(\cup): it is known that if the schemas of two tables are equal, then Inner Union = Outer Union

Lemma \thelemma (Inner Join)

Inner Join (normal-⋈\bowtie):

T1T2=σ(T1.C=T2.C,β(κ(T1T2)))T_{1}\bowtie T_{2}=\sigma(T_{1}.C=T_{2}.C\neq\bot,\beta(\kappa(T_{1}\uplus T_{% 2})))italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_σ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C = italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ≠ ⊥ , italic_β ( italic_κ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) ) (6)
Lemma \thelemma (Left Join)

Left Join (  normal-⋈  absent\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}% {0.4pt}}\mkern-5.8mu\bowtie}[23]:

T1  T2=β((T1T2)T1)  absentsubscript𝑇1subscript𝑇2𝛽subscript𝑇1subscript𝑇2subscript𝑇1T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2}=\beta((T_{1}\bowtie T_{2})\uplus T_{1})italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) (7)
Lemma \thelemma (Outer Join)

Full Outer Join (    normal-⋈    \mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}% {0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to% 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}[23]:

T1    T2=β(β((T1T2)T1)T2)    subscript𝑇1subscript𝑇2𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}T_{2}=\beta(\beta((T_{1}\bowtie T% _{2})\uplus T_{1})\uplus T_{2})italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) (8)
Lemma \thelemma (Cross Product)

Cross Product(×\times×): We denote columns in T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT as T1.Cformulae-sequencesubscript𝑇1𝐶T_{1}.Citalic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C and T2.Cformulae-sequencesubscript𝑇2𝐶T_{2}.Citalic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C, respectively. Consider a constant column c𝑐citalic_c.

T1×T2=κ(π((T1.C,c),T1)π((T2.C,c),T2))T_{1}\times T_{2}=\kappa(\pi((T_{1}.C,c),T_{1})\uplus\pi((T_{2}.C,c),T_{2}))italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_κ ( italic_π ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C , italic_c ) , italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_π ( ( italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C , italic_c ) , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) (9)

Thus, ,σ,π,κ,β𝜎𝜋𝜅𝛽\uplus,\sigma,\pi,\kappa,\beta⊎ , italic_σ , italic_π , italic_κ , italic_β operators form queries that are equivalent to all SPJU queries.

-A1 Proof of Lemma -A[Inner Join]

Given two tables T1,T2subscript𝑇1subscript𝑇2T_{1},T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT that join on a set of common columns C𝐶Citalic_C, such that T1,T2subscript𝑇1subscript𝑇2T_{1},T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are in their minimal forms in which they contain no duplicate tuples and no tuples can be subsumed or complemented , T1T2subscript𝑇1subscript𝑇2T_{1}\bowtie T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT can be expressed by an equivalent query containing Outer Union, complementation, and subsumption. Specifically, T1T2subscript𝑇1subscript𝑇2T_{1}\bowtie T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is equivalent to query σ(T1.C=T2.C,β(κ(T1T2)))\sigma(T_{1}.C=T_{2}.C\neq\bot,\beta(\kappa(T_{1}\uplus T_{2})))italic_σ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C = italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ≠ ⊥ , italic_β ( italic_κ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) ).

Proof:

We first prove that all tuples in T1T2subscript𝑇1subscript𝑇2T_{1}\bowtie T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are contained in σ(T1.C=T2.C,β(κ(T1T2)))\sigma(T_{1}.C=T_{2}.C\neq\bot,\beta(\kappa(T_{1}\uplus T_{2})))italic_σ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C = italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ≠ ⊥ , italic_β ( italic_κ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) ). Let tuple tT1T2𝑡subscript𝑇1subscript𝑇2t\in T_{1}\bowtie T_{2}italic_t ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, such that join columns C𝐶Citalic_C’s values in t𝑡titalic_t appear in both T1.Cformulae-sequencesubscript𝑇1𝐶T_{1}.Citalic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C and T2.Cformulae-sequencesubscript𝑇2𝐶T_{2}.Citalic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C, and are non-null: t.CT1.CT2.Cformulae-sequence𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶t.C\in T_{1}.C\cap T_{2}.Citalic_t . italic_C ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C s.t. t.Cformulae-sequence𝑡𝐶bottomt.C\neq\botitalic_t . italic_C ≠ ⊥.

When applying β(κ(T1T2))𝛽𝜅subscript𝑇1subscript𝑇2\beta(\kappa(T_{1}\uplus T_{2}))italic_β ( italic_κ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ), only tuples with common non-null values T1.Ci=T2.Ciformulae-sequencesubscript𝑇1subscript𝐶𝑖subscript𝑇2subscript𝐶𝑖bottomT_{1}.C_{i}=T_{2}.C_{i}\neq\botitalic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≠ ⊥ in same column(s) i𝑖iitalic_i are complemented and subsumed. This is similar to tuple t𝑡titalic_t, which is formed by joining on T1.Ci=T2.Ciformulae-sequencesubscript𝑇1subscript𝐶𝑖subscript𝑇2subscript𝐶𝑖T_{1}.C_{i}=T_{2}.C_{i}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Thus, tuple t𝑡titalic_t is derived by selecting on tuples from β(κ(T1T2))𝛽𝜅subscript𝑇1subscript𝑇2\beta(\kappa(T_{1}\uplus T_{2}))italic_β ( italic_κ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) with non-null C𝐶Citalic_C values in both T1.Cformulae-sequencesubscript𝑇1𝐶T_{1}.Citalic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C and T2.Cformulae-sequencesubscript𝑇2𝐶T_{2}.Citalic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C, so tσ(T1.C=T2.C,β(κ(T1T2)))t\in\sigma(T_{1}.C=T_{2}.C\neq\bot,\beta(\kappa(T_{1}\uplus T_{2})))italic_t ∈ italic_σ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C = italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ≠ ⊥ , italic_β ( italic_κ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) ).

Next, we show that all tuples in σ(T1.C=T2.C,β(κ(T1T2)))\sigma(T_{1}.C=T_{2}.C\neq\bot,\beta(\kappa(T_{1}\uplus T_{2})))italic_σ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C = italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ≠ ⊥ , italic_β ( italic_κ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) ) are found in T1T2subscript𝑇1subscript𝑇2T_{1}\bowtie T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Let tuple tσ(T1.C=T2.C,β(κ(T1T2)))t^{\prime}\in\sigma(T_{1}.C=T_{2}.C\neq\bot,\beta(\kappa(T_{1}\uplus T_{2})))italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ italic_σ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C = italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ≠ ⊥ , italic_β ( italic_κ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) ). Here, all C𝐶Citalic_C values in tsuperscript𝑡t^{\prime}italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT are non-null values found in both T1.Cformulae-sequencesubscript𝑇1𝐶T_{1}.Citalic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C and T2.Cformulae-sequencesubscript𝑇2𝐶T_{2}.Citalic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C as a result of selection. From β(κ(T1T2))𝛽𝜅subscript𝑇1subscript𝑇2\beta(\kappa(T_{1}\uplus T_{2}))italic_β ( italic_κ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ), tsuperscript𝑡t^{\prime}italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT contains all values from all columns in T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT in a single tuple, formed by complementing and subsuming based on common C𝐶Citalic_C values. Thus, tT1T2superscript𝑡subscript𝑇1subscript𝑇2t^{\prime}\in T_{1}\bowtie T_{2}italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.

We have thus shown that all tuples from T1T2subscript𝑇1subscript𝑇2T_{1}\bowtie T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are found in σ(T1.C=T2.C,β(κ(T1T2)))\sigma(T_{1}.C=T_{2}.C\neq\bot,\beta(\kappa(T_{1}\uplus T_{2})))italic_σ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C = italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ≠ ⊥ , italic_β ( italic_κ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) ) and vice versa, and so T1T2subscript𝑇1subscript𝑇2T_{1}\bowtie T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is an equivalent query to σ(T1.C=T2.C,β(κ(T1T2)))\sigma(T_{1}.C=T_{2}.C\neq\bot,\beta(\kappa(T_{1}\uplus T_{2})))italic_σ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C = italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ≠ ⊥ , italic_β ( italic_κ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) ). ∎

-A2 Proof of Lemma -A[Left Join]

Given two tables T1,T2subscript𝑇1subscript𝑇2T_{1},T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT that join on a set of common columns C𝐶Citalic_C, such that T1,T2subscript𝑇1subscript𝑇2T_{1},T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are in their minimal forms in which there are no duplicates and no tuples can be subsumed or complemented , T1  T2  absentsubscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT can be expressed by an equivalent query containing Outer Union and subsumption. Specifically, T1  T2  absentsubscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is equivalent to query β((T1T2)T1)𝛽subscript𝑇1subscript𝑇2subscript𝑇1\beta((T_{1}\bowtie T_{2})\uplus T_{1})italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ).

Proof:

We first prove that the resulting table of T1  T2  absentsubscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is contained in the resulting table of β((T1T2)T1)𝛽subscript𝑇1subscript𝑇2subscript𝑇1\beta((T_{1}\bowtie T_{2})\uplus T_{1})italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ):

Let tuple tT1  T2𝑡  absentsubscript𝑇1subscript𝑇2t\in T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0% pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2}italic_t ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. There are two cases for join column C𝐶Citalic_C’s values in tuple t𝑡titalic_t: t.CT1.CT2.Cformulae-sequence𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶t.C\in T_{1}.C\cap T_{2}.Citalic_t . italic_C ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C (i.e., t.Cformulae-sequence𝑡𝐶t.Citalic_t . italic_C values are in both T1.Cformulae-sequencesubscript𝑇1𝐶T_{1}.Citalic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C and in T2.Cformulae-sequencesubscript𝑇2𝐶T_{2}.Citalic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C) and t.CT1.CT2.Cformulae-sequence𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶t.C\in T_{1}.C\setminus T_{2}.Citalic_t . italic_C ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∖ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C (i.e., t.Cformulae-sequence𝑡𝐶t.Citalic_t . italic_C values are only in T1.Cformulae-sequencesubscript𝑇1𝐶T_{1}.Citalic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C and not in T2.Cformulae-sequencesubscript𝑇2𝐶T_{2}.Citalic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C). Since we are performing left join on T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, t.CT2.CT1.Cformulae-sequence𝑡𝐶subscript𝑇2𝐶subscript𝑇1𝐶t.C\notin T_{2}.C\setminus T_{1}.Citalic_t . italic_C ∉ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ∖ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C.

  1. 1.

    t.CT1.CT2.Ct(T1T2)formulae-sequence𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶𝑡subscript𝑇1subscript𝑇2t.C\in T_{1}.C\cap T_{2}.C\implies t\in(T_{1}\bowtie T_{2})italic_t . italic_C ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ⟹ italic_t ∈ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). Since t𝑡titalic_t is in the inner join result and contains more non-Null values than other tuples with C𝐶Citalic_C values only in T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT oder T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, it would not be subsumed when applying β((T1T2)T1)𝛽subscript𝑇1subscript𝑇2subscript𝑇1\beta((T_{1}\bowtie T_{2})\uplus T_{1})italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ).

  2. 2.

    t.C(T1.CT2.C)tβ((T1T2)T1)t.C\in(T_{1}.C\setminus T_{2}.C)\implies t\in\beta((T_{1}\bowtie T_{2})\uplus T% _{1})italic_t . italic_C ∈ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∖ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ) ⟹ italic_t ∈ italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ). Since T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is in its minimal form, and t𝑡titalic_t does not share any C𝐶Citalic_C values with any tuple in T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, it is not subsumed when applying β𝛽\betaitalic_β to (T1T2)T2subscript𝑇1subscript𝑇2subscript𝑇2(T_{1}\bowtie T_{2})\uplus T_{2}( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and thus appear as is in β((T1T2)T1)𝛽subscript𝑇1subscript𝑇2subscript𝑇1\beta((T_{1}\bowtie T_{2})\uplus T_{1})italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ).

Thus, all tuples from T1  T2  absentsubscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are contained in the resulting table of β((T1T2)T1)𝛽subscript𝑇1subscript𝑇2subscript𝑇1\beta((T_{1}\bowtie T_{2})\uplus T_{1})italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ).

Next, we show that the resulting tuples of β((T1T2)T1)𝛽subscript𝑇1subscript𝑇2subscript𝑇1\beta((T_{1}\bowtie T_{2})\uplus T_{1})italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) are contained in the resulting table of T1  T2  absentsubscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.

Let’s consider tuple tβ((T1T2)T1)superscript𝑡𝛽subscript𝑇1subscript𝑇2subscript𝑇1t^{\prime}\in\beta((T_{1}\bowtie T_{2})\uplus T_{1})italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ). There are two cases for C𝐶Citalic_C values in tuple tsuperscript𝑡t^{\prime}italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT: t.CT1.CT2.Cformulae-sequencesuperscript𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶t^{\prime}.C\in T_{1}.C\cap T_{2}.Citalic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT . italic_C ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C and t.CT1.CT2.Cformulae-sequencesuperscript𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶t^{\prime}.C\notin T_{1}.C\cap T_{2}.Citalic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT . italic_C ∉ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C.

  1. 1.

    t.C(T1.CT2.C)t(T1T2)t^{\prime}.C\in(T_{1}.C\cap T_{2}.C)\implies t^{\prime}\in(T_{1}\bowtie T_{2})italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT . italic_C ∈ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ) ⟹ italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). Since (T1T2)(T1  T2)subscript𝑇1subscript𝑇2  absentsubscript𝑇1subscript𝑇2(T_{1}\bowtie T_{2})\subseteq(T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2})( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊆ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ), t(T1  T2)superscript𝑡  absentsubscript𝑇1subscript𝑇2t^{\prime}\in(T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss% \rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2})italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ).

  2. 2.

    All tuples in ((T1T2)T1)subscript𝑇1subscript𝑇2subscript𝑇1((T_{1}\bowtie T_{2})\uplus T_{1})( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) are either subsumed by tuples from (T1T2)subscript𝑇1subscript𝑇2(T_{1}\bowtie T_{2})( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ), or are in T1(T1T2)subscript𝑇1subscript𝑇1subscript𝑇2T_{1}\setminus(T_{1}\bowtie T_{2})italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∖ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). Thus, t.CT1.CT2.CtT1(T1T2)tT1  T2formulae-sequencesuperscript𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶superscript𝑡subscript𝑇1subscript𝑇1subscript𝑇2superscript𝑡  absentsubscript𝑇1subscript𝑇2t^{\prime}.C\notin T_{1}.C\cap T_{2}.C\implies t^{\prime}\in T_{1}\setminus(T_% {1}\bowtie T_{2})\implies t^{\prime}\in T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}% {0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2}italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT . italic_C ∉ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ⟹ italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∖ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⟹ italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.

Thus, all tuples from β((T1T2)T1)𝛽subscript𝑇1subscript𝑇2subscript𝑇1\beta((T_{1}\bowtie T_{2})\uplus T_{1})italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) are contained in the resulting table of T1  T2  absentsubscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.

Now that we have shown that tuples from T1  T2  absentsubscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are contained in the resulting table of β((T1T2)T1)𝛽subscript𝑇1subscript𝑇2subscript𝑇1\beta((T_{1}\bowtie T_{2})\uplus T_{1})italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) and vice versa, we have shown that β((T1T2)T1)𝛽subscript𝑇1subscript𝑇2subscript𝑇1\beta((T_{1}\bowtie T_{2})\uplus T_{1})italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) is an equivalent query to T1  T2  absentsubscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. ∎

-A3 Proof of Lemma -A[Outer Join]

Given two tables T1,T2subscript𝑇1subscript𝑇2T_{1},T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT that join on a set of common columns C𝐶Citalic_C, such that T1,T2subscript𝑇1subscript𝑇2T_{1},T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are in their minimal forms in which there are no duplicates and no tuples can be subsumed or complemented, T1    T2    subscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT can be expressed by an equivalent query containing Outer Union and subsumption. Specifically, T1    T2    subscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is equivalent to query β(β((T1T2)T1)T2)𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ).

Proof:

We first prove that the resulting table of T1    T2    subscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is contained in the resulting table of β(β((T1T2)T1)T2)𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ):

Let tuple tT1    T2𝑡    subscript𝑇1subscript𝑇2t\in T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0% pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4% pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}T_{2}italic_t ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. There are three cases for join column C𝐶Citalic_C’s values in tuple t𝑡titalic_t: t.CT1.CT2.Cformulae-sequence𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶t.C\in T_{1}.C\cap T_{2}.Citalic_t . italic_C ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C (i.e., t.Cformulae-sequence𝑡𝐶t.Citalic_t . italic_C values are in both T1.Cformulae-sequencesubscript𝑇1𝐶T_{1}.Citalic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C and in T2.Cformulae-sequencesubscript𝑇2𝐶T_{2}.Citalic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C), t.CT1.CT2.Cformulae-sequence𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶t.C\in T_{1}.C\setminus T_{2}.Citalic_t . italic_C ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∖ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C (i.e., t.Cformulae-sequence𝑡𝐶t.Citalic_t . italic_C values are only in T1.Cformulae-sequencesubscript𝑇1𝐶T_{1}.Citalic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C and not in T2.Cformulae-sequencesubscript𝑇2𝐶T_{2}.Citalic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C), and t.CT2.CT1.Cformulae-sequence𝑡𝐶subscript𝑇2𝐶subscript𝑇1𝐶t.C\in T_{2}.C\setminus T_{1}.Citalic_t . italic_C ∈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ∖ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C (i.e., t.Cformulae-sequence𝑡𝐶t.Citalic_t . italic_C values are only in T2.Cformulae-sequencesubscript𝑇2𝐶T_{2}.Citalic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C and not in T1.Cformulae-sequencesubscript𝑇1𝐶T_{1}.Citalic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C).

  1. 1.

    t.CT1.CT2.CtT1T2formulae-sequence𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶𝑡subscript𝑇1subscript𝑇2t.C\in T_{1}.C\cap T_{2}.C\implies t\in T_{1}\bowtie T_{2}italic_t . italic_C ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ⟹ italic_t ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Tuple t𝑡titalic_t is a result of inner joining two tuples from T1,T2subscript𝑇1subscript𝑇2T_{1},T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT on shared values in common columns C𝐶Citalic_C. This is similar to taking T1T2subscript𝑇1subscript𝑇2T_{1}\uplus T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and applying subsumption and complementation on tuples with shared values in C𝐶Citalic_C (Lemma -A) to get t𝑡titalic_t. Since t𝑡titalic_t does not share any values in C𝐶Citalic_C with other tuples, it cannot be subsumed. Thus, tβ(β((T1T2)T1)T2)𝑡𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2t\in\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_t ∈ italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ).

  2. 2.

    t.CT1.CT2.Ct(T1  T2)(T1T2)formulae-sequence𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶𝑡  absentsubscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2t.C\in T_{1}.C\setminus T_{2}.C\implies t\in(T_{1}\mathbin{\rule[-0.08612pt]{2% .5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie}% T_{2})\setminus(T_{1}\bowtie T_{2})italic_t . italic_C ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∖ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ⟹ italic_t ∈ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ∖ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). When we take (T1T1)T1subscript𝑇1subscript𝑇1subscript𝑇1(T_{1}\bowtie T_{1})\uplus T_{1}( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, we append all tuples from T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT to T1T2subscript𝑇1subscript𝑇2T_{1}\bowtie T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. After applying subsumption, all tuples from T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT that are used in T1T2subscript𝑇1subscript𝑇2T_{1}\bowtie T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are subsumed by tuples from T1T2subscript𝑇1subscript𝑇2T_{1}\bowtie T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT on shared values in C𝐶Citalic_C. Thus, the only tuples remaining are tuples like t𝑡titalic_t in (T1  T2)(T1T2)  absentsubscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2(T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{% 2.5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{2})\setminus(T_{1}\bowtie T_{2})( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ∖ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). Since t𝑡titalic_t does not share any common values with any tuple in T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, it is not subsumed when taking β(β((T1T2)T1)T2)𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ), and so tβ(β((T1T2)T1)T2)𝑡𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2t\in\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_t ∈ italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ).

  3. 3.

    t.CT2.CT1.Ct(T2  T1)(T1T2)formulae-sequence𝑡𝐶subscript𝑇2𝐶subscript𝑇1𝐶𝑡  absentsubscript𝑇2subscript𝑇1subscript𝑇1subscript𝑇2t.C\in T_{2}.C\setminus T_{1}.C\implies t\in(T_{2}\mathbin{\rule[-0.08612pt]{2% .5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie}% T_{1})\setminus(T_{1}\bowtie T_{2})italic_t . italic_C ∈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ∖ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ⟹ italic_t ∈ ( italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ∖ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). Taking the subsumption of β((T1T2)T1)T2𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2}italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT removes all tuples from T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT that are subsumed by tuples in T1T2subscript𝑇1subscript𝑇2T_{1}\bowtie T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Since the remaining tuples in T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT cannot be subsumed by any tuple from T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT not in T1T2subscript𝑇1subscript𝑇2T_{1}\bowtie T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, t(T2  T1)(T1T2)𝑡  absentsubscript𝑇2subscript𝑇1subscript𝑇1subscript𝑇2t\in(T_{2}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0% pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie}T_{1})\setminus(T_{1}\bowtie T_{2})italic_t ∈ ( italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ∖ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). Thus, tβ(β((T1T2)T1)T2)𝑡𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2t\in\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_t ∈ italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ).

Thus, all tuples from T1    T2    subscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are contained in the resulting table of β(β((T1T2)T1)T2)𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ).

Next, we show that all tuples in β(β((T1T2)T1)T2)𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) are contained in the resulting table of T1    T2    subscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Let’s consider tuple tβ(β((T1T2)T1)T2)superscript𝑡𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2t^{\prime}\in\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). There are two cases for C𝐶Citalic_C values in tuple tsuperscript𝑡t^{\prime}italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT: t.CT1.CT2.Cformulae-sequencesuperscript𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶t^{\prime}.C\in T_{1}.C\cap T_{2}.Citalic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT . italic_C ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C and t.CT1.CT2.Cformulae-sequencesuperscript𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶t^{\prime}.C\notin T_{1}.C\cap T_{2}.Citalic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT . italic_C ∉ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C.

  1. 1.

    t.C(T1.CT2.C)t(T1T2)t^{\prime}.C\in(T_{1}.C\cap T_{2}.C)\implies t^{\prime}\in(T_{1}\bowtie T_{2})italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT . italic_C ∈ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ) ⟹ italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). Since (T1T2)(T1    T2)subscript𝑇1subscript𝑇2    subscript𝑇1subscript𝑇2(T_{1}\bowtie T_{2})\subseteq(T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu% \rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}% T_{2})( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊆ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ), t(T1    T2)superscript𝑡    subscript𝑇1subscript𝑇2t^{\prime}\in(T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss% \rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.% 5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}T_{2})italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ).

  2. 2.

    All tuples in ((T1T2)T1)T2subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2}( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are either subsumed by tuples from (T1T2)subscript𝑇1subscript𝑇2(T_{1}\bowtie T_{2})( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ), are in T1(T1T2)subscript𝑇1subscript𝑇1subscript𝑇2T_{1}\setminus(T_{1}\bowtie T_{2})italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∖ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ), or are in T2(T1T2)subscript𝑇2subscript𝑇1subscript𝑇2T_{2}\setminus(T_{1}\bowtie T_{2})italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∖ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ). Thus, t.CT1.CT2.Ct(T1T2)(T1T2)tT1    T2formulae-sequencesuperscript𝑡𝐶subscript𝑇1𝐶subscript𝑇2𝐶superscript𝑡subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2superscript𝑡    subscript𝑇1subscript𝑇2t^{\prime}.C\notin T_{1}.C\cap T_{2}.C\implies t^{\prime}\in(T_{1}\uplus T_{2}% )\setminus(T_{1}\bowtie T_{2})\implies t^{\prime}\in T_{1}\mathbin{\rule[-0.08% 612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}\mkern-5.8mu% \bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0% pt]{2.5pt}{0.4pt}}}T_{2}italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT . italic_C ∉ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . italic_C ∩ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT . italic_C ⟹ italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ∖ ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⟹ italic_t start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. r

Thus, all tuples from β(β((T1T2)T1)T2)𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) are contained in the resulting table of T1    T2    subscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.

Now that we have shown that tuples from T1    T2    subscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are contained in the resulting table of β(β((T1T2)T1)T2)𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) and vice versa, we have shown that β(β((T1T2)T1)T2)𝛽𝛽subscript𝑇1subscript𝑇2subscript𝑇1subscript𝑇2\beta(\beta((T_{1}\bowtie T_{2})\uplus T_{1})\uplus T_{2})italic_β ( italic_β ( ( italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⋈ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) is an equivalent query to T1    T2    subscript𝑇1subscript𝑇2T_{1}\mathbin{\rule[-0.08612pt]{2.5pt}{0.4pt}\hbox to 0.0pt{\hss\rule[7.0pt]{2% .5pt}{0.4pt}}\mkern-5.8mu\bowtie\mkern-5.8mu\rule[-0.08612pt]{2.5pt}{0.4pt}% \hbox to 0.0pt{\hss\rule[7.0pt]{2.5pt}{0.4pt}}}T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_BINOP ⋈ end_BINOP italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.

-A4 Proof of Lemma -A[Cross Product]

Given two tables T1,T2subscript𝑇1subscript𝑇2T_{1},T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, each with columns CT1,CT2subscript𝐶subscript𝑇1subscript𝐶subscript𝑇2C_{T_{1}},C_{T_{2}}italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT respectively and do not share any columns, and a constant column c𝑐citalic_c, T1×T2subscript𝑇1subscript𝑇2T_{1}\times T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT can be expressed by an equivalent query containing Outer Union, projection, and complementation. Specifically, T1×T2subscript𝑇1subscript𝑇2T_{1}\times T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is equivalent to query κ(π((CT1,c),T1)π((CT2,c),T2))𝜅𝜋subscript𝐶subscript𝑇1𝑐subscript𝑇1𝜋subscript𝐶subscript𝑇2𝑐subscript𝑇2\kappa(\pi((C_{T_{1}},c),T_{1})\uplus\pi((C_{T_{2}},c),T_{2}))italic_κ ( italic_π ( ( italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c ) , italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_π ( ( italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c ) , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ).

Proof:

Since T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT do not share any columns, the complementation operator cannot be applied to T1T2subscript𝑇1subscript𝑇2T_{1}\uplus T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ⊎ italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Thus, we project on all columns CT1subscript𝐶subscript𝑇1C_{T_{1}}italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT and constant column c𝑐citalic_c in T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, and columns CT2,csubscript𝐶subscript𝑇2𝑐C_{T_{2}},citalic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c in T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. This way, T1,T2subscript𝑇1subscript𝑇2T_{1},T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT now share all values in c𝑐citalic_c and we can apply complementation on π((CT1,c),T1)π((CT2,c),T2)𝜋subscript𝐶subscript𝑇1𝑐subscript𝑇1𝜋subscript𝐶subscript𝑇2𝑐subscript𝑇2\pi((C_{T_{1}},c),T_{1})\uplus\pi((C_{T_{2}},c),T_{2})italic_π ( ( italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c ) , italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_π ( ( italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c ) , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) since T1,T2subscript𝑇1subscript𝑇2T_{1},T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Thus, we iteratively apply complementation on all tuples from T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT on all tuples from T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT to form all tuples in T1×T2subscript𝑇1subscript𝑇2T_{1}\times T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Recall that in every tuple in T1×T2subscript𝑇1subscript𝑇2T_{1}\times T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, every value in t.CT1formulae-sequence𝑡subscript𝐶subscript𝑇1t.C_{T_{1}}italic_t . italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT is from T1subscript𝑇1T_{1}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and every value in t.CT2formulae-sequence𝑡subscript𝐶subscript𝑇2t.C_{T_{2}}italic_t . italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT is from T2subscript𝑇2T_{2}italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. Therefore, every tuple in T1×T2subscript𝑇1subscript𝑇2T_{1}\times T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is contained in κ(π((CT1,c),T1)π((CT2,c),T2))𝜅𝜋subscript𝐶subscript𝑇1𝑐subscript𝑇1𝜋subscript𝐶subscript𝑇2𝑐subscript𝑇2\kappa(\pi((C_{T_{1}},c),T_{1})\uplus\pi((C_{T_{2}},c),T_{2}))italic_κ ( italic_π ( ( italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c ) , italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_π ( ( italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c ) , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) and every tuple in κ(π((CT1,c),T1)π((CT2,c),T2))𝜅𝜋subscript𝐶subscript𝑇1𝑐subscript𝑇1𝜋subscript𝐶subscript𝑇2𝑐subscript𝑇2\kappa(\pi((C_{T_{1}},c),T_{1})\uplus\pi((C_{T_{2}},c),T_{2}))italic_κ ( italic_π ( ( italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c ) , italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_π ( ( italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c ) , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) is contained in T1×T2subscript𝑇1subscript𝑇2T_{1}\times T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, and so κ(π((CT1,c),T1)π((CT2,c),T2))𝜅𝜋subscript𝐶subscript𝑇1𝑐subscript𝑇1𝜋subscript𝐶subscript𝑇2𝑐subscript𝑇2\kappa(\pi((C_{T_{1}},c),T_{1})\uplus\pi((C_{T_{2}},c),T_{2}))italic_κ ( italic_π ( ( italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c ) , italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⊎ italic_π ( ( italic_C start_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_c ) , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ) is an equivalent query to T1×T2subscript𝑇1subscript𝑇2T_{1}\times T_{2}italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT × italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. ∎

-B Set Similarity

1 Input: 𝒯={T1,T2,Tn}𝒯subscript𝑇1subscript𝑇2normal-…subscript𝑇𝑛\mathcal{T}=\{T_{1},T_{2},\dots T_{n}\}caligraphic_T = { italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }: set of data lake tables; S𝑆Sitalic_S: Source Table; τ𝜏\tauitalic_τ: Similarity Threshold
2 Output: 𝒯={T1,T2,Tn}superscript𝒯subscript𝑇1subscript𝑇2subscript𝑇𝑛\mathcal{T}^{\prime}=\{T_{1},T_{2},\dots T_{n}\}caligraphic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = { italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }: a set of candidate tables with high syntactic overlap with S𝑆Sitalic_S
3 𝒯scores{}subscriptsuperscript𝒯scores\mathcal{T}^{\prime}_{\text{scores}}\leftarrow\{\}caligraphic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT scores end_POSTSUBSCRIPT ← { } //Store a list of scores for each candidate table for all S𝑆Sitalic_S columns cC𝑐𝐶c\in Citalic_c ∈ italic_C do
4       𝒯Csubscript𝒯𝐶\mathcal{T}_{C}caligraphic_T start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT, overlapScores 𝖲𝖾𝗍𝖮𝗏𝖾𝗋𝗅𝖺𝗉(𝒯,c,τ)absent𝖲𝖾𝗍𝖮𝗏𝖾𝗋𝗅𝖺𝗉𝒯𝑐𝜏\leftarrow\mathsf{SetOverlap}(\mathcal{T},c,\tau)← sansserif_SetOverlap ( caligraphic_T , italic_c , italic_τ ) 𝒯Csubscript𝒯𝐶\mathcal{T}_{C}caligraphic_T start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT, diverseOverlapScores 𝖽𝗂𝗏𝖾𝗋𝗌𝗂𝖿𝗒𝖢𝖺𝗇𝖽𝗂𝖽𝖺𝗍𝖾𝗌(𝒯C,c,τ)absent𝖽𝗂𝗏𝖾𝗋𝗌𝗂𝖿𝗒𝖢𝖺𝗇𝖽𝗂𝖽𝖺𝗍𝖾𝗌subscript𝒯𝐶𝑐𝜏\leftarrow\mathsf{diversifyCandidates}(\mathcal{T}_{C},c,\tau)← sansserif_diversifyCandidates ( caligraphic_T start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT , italic_c , italic_τ ) for all tables T𝒯C𝑇subscript𝒯𝐶T\in\mathcal{T}_{C}italic_T ∈ caligraphic_T start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT do
5             𝒯scores[T]subscriptsuperscript𝒯scoresdelimited-[]𝑇\mathcal{T}^{\prime}_{\text{scores}}[T]caligraphic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT scores end_POSTSUBSCRIPT [ italic_T ]+= diverseOverlapScores[T𝑇Titalic_T]
6 Order 𝒯scoressubscriptsuperscript𝒯scores\mathcal{T}^{\prime}_{\text{scores}}caligraphic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT scores end_POSTSUBSCRIPT by average diverseOverlapScores, in descending order 𝒯keys(𝒯scores)superscript𝒯keyssubscriptsuperscript𝒯scores\mathcal{T}^{\prime}\leftarrow\text{keys}(\mathcal{T}^{\prime}_{\text{scores}})caligraphic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ← keys ( caligraphic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT scores end_POSTSUBSCRIPT ) for all tables T𝒯𝑇superscript𝒯normal-′T\in\mathcal{T}^{\prime}italic_T ∈ caligraphic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT do
7       alignedTuples \leftarrow tuples in T𝑇Titalic_T that contain S𝑆Sitalic_S’s column values if set overlap of T𝑇Titalic_T values in alignedTuples with S<τ𝑆𝜏S<\tauitalic_S < italic_τ then
8             Ablegen T𝑇Titalic_T;
9       Entfernen Sie T𝑇Titalic_T if its values are contained in another table T𝒯superscript𝑇superscript𝒯T^{\prime}\in\mathcal{T}^{\prime}italic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ∈ caligraphic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT Rename T𝑇Titalic_T columns to aligned S𝑆Sitalic_S columns
return 𝒯superscript𝒯\mathcal{T}^{\prime}caligraphic_T start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT;
Algorithm 3 Set Similarity

We find candidate tables with values that have high set overlap with those in a Source Table. As shown in Algorithm 3, we perform Set Similarity with an input set of data lake tables 𝒯𝒯\mathcal{T}caligraphic_T, the Source Table S𝑆Sitalic_S, and a similarity threshold τ𝜏\tauitalic_τ (Line 3), and output a set of candidate tables (Line 3). We first find a set of candidate tables, where each table contains a column whose set overlap with a column from S𝑆Sitalic_S (overlapScore) is above a specified threshold (Lines 3-3). This can be done efficiently with a system like JOSIE [32] that computes exact set containment or MATE [36] that supports multi-attribute joins. In addition, when finding tables with columns that have a high set overlap with columns in S𝑆Sitalic_S, we call 𝖽𝗂𝗏𝖾𝗋𝗌𝗂𝖿𝗒𝖢𝖺𝗇𝖽𝗂𝖽𝖺𝗍𝖾𝗌()𝖽𝗂𝗏𝖾𝗋𝗌𝗂𝖿𝗒𝖢𝖺𝗇𝖽𝗂𝖽𝖺𝗍𝖾𝗌\mathsf{diversifyCandidates()}sansserif_diversifyCandidates ( ) (Line 3) to ensure that each candidate table not only has a high overlap with S𝑆Sitalic_S, but also has minimal overlap with the previous candidates, shown in Diversify Candidates Algorithm 4.

Formally, given candidate table Ti𝒯subscript𝑇𝑖𝒯T_{i}\in\mathcal{T}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_T s.t. i>0𝑖0i>0italic_i > 0, the previous candidate table, Ti1subscript𝑇𝑖1T_{i-1}italic_T start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT, and Source Table S𝑆Sitalic_S, we diversify a set of candidate tables uses the following formula to rank the candidates, in descending order:

diverseOverlapScore=|TiS||S||TiTi1||Ti|diverseOverlapScoresubscript𝑇𝑖𝑆𝑆subscript𝑇𝑖subscript𝑇𝑖1subscript𝑇𝑖\text{diverseOverlapScore}=\frac{|T_{i}\cap S|}{|S|}-\frac{|T_{i}\cap T_{i-1}|% }{|T_{i}|}diverseOverlapScore = divide start_ARG | italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∩ italic_S | end_ARG start_ARG | italic_S | end_ARG - divide start_ARG | italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∩ italic_T start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT | end_ARG start_ARG | italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | end_ARG (10)

When finding diverseOverlapScore, we find the set overlap of Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with S𝑆Sitalic_S vs. the set overlap of Tisubscript𝑇𝑖T_{i}italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT with the previous candidate Ti1subscript𝑇𝑖1T_{i-1}italic_T start_POSTSUBSCRIPT italic_i - 1 end_POSTSUBSCRIPT. This way, we arrange the set of candidate tables to ensure diversification of candidates.

1 Input: c𝑐citalic_c: column from Source Table; 𝒯C={T1,T2,Tn}subscript𝒯𝐶subscript𝑇1subscript𝑇2normal-…subscript𝑇𝑛\mathcal{T}_{C}=\{T_{1},T_{2},\dots T_{n}\}caligraphic_T start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT = { italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }: set of candidate tables with columns having high overlap with c𝑐citalic_c; τ𝜏\tauitalic_τ: Similarity Threshold
2 Output: 𝒯C={T1,T2,Tn}superscriptsubscript𝒯𝐶subscript𝑇1subscript𝑇2subscript𝑇𝑛\mathcal{T}_{C}^{\prime}=\{T_{1},T_{2},\dots T_{n}\}caligraphic_T start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = { italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }: a set of diverse candidate tables
3 𝒯scores{}subscript𝒯scores\mathcal{T}_{\text{scores}}\leftarrow\{\}caligraphic_T start_POSTSUBSCRIPT scores end_POSTSUBSCRIPT ← { } for all tables T𝒯C𝑇subscript𝒯𝐶T\in\mathcal{T}_{C}italic_T ∈ caligraphic_T start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT do
4       C𝐶absentC\leftarrowitalic_C ← column from T𝑇Titalic_T with highest set overlap with c𝑐citalic_c IndT𝑇{}_{T}start_FLOATSUBSCRIPT italic_T end_FLOATSUBSCRIPT \leftarrow index of T𝑇Titalic_T in 𝒯Csubscript𝒯𝐶\mathcal{T}_{C}caligraphic_T start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT if IndT𝑇{}_{T}start_FLOATSUBSCRIPT italic_T end_FLOATSUBSCRIPT = 0 then
5             Continue;
6       Cprevsubscript𝐶prevabsentC_{\text{prev}}\leftarrowitalic_C start_POSTSUBSCRIPT prev end_POSTSUBSCRIPT ← column from 𝒯Csubscript𝒯𝐶\mathcal{T}_{C}caligraphic_T start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT[IndT𝑇{}_{T}start_FLOATSUBSCRIPT italic_T end_FLOATSUBSCRIPT - 1] with highest set overlap with c𝑐citalic_c//Get column from previous candidate table with high overlap with c𝑐citalic_c prevColOverlap (CCprev)/|C|absent𝐶subscript𝐶prev𝐶\leftarrow(C\cap C_{\text{prev}})/|C|← ( italic_C ∩ italic_C start_POSTSUBSCRIPT prev end_POSTSUBSCRIPT ) / | italic_C | //Set overlap with previous column sourceColOverlap (Cc)/|c|absent𝐶𝑐𝑐\leftarrow(C\cap c)/|c|← ( italic_C ∩ italic_c ) / | italic_c | //Set overlap with column from Source table if sourceColOverlap <τabsent𝜏<\tau< italic_τ then
7             Continue;
8       overlapScore \leftarrow sourceColOverlap – prevColOverlap 𝒯scores[T]subscript𝒯scoresdelimited-[]𝑇absent\mathcal{T}_{\text{scores}}[T]\leftarrowcaligraphic_T start_POSTSUBSCRIPT scores end_POSTSUBSCRIPT [ italic_T ] ← overlapScore
Order 𝒯scoressubscript𝒯scores\mathcal{T}_{\text{scores}}caligraphic_T start_POSTSUBSCRIPT scores end_POSTSUBSCRIPT by values in descending order 𝒯C𝒯scoressuperscriptsubscript𝒯𝐶subscript𝒯scores\mathcal{T}_{C}^{\prime}\leftarrow\mathcal{T}_{\text{scores}}caligraphic_T start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ← caligraphic_T start_POSTSUBSCRIPT scores end_POSTSUBSCRIPT.keys return 𝒯Csuperscriptsubscript𝒯𝐶\mathcal{T}_{C}^{\prime}caligraphic_T start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT;
Algorithm 4 Diversify Candidates

After we find candidate tables for each column in the Source Table, we average over all overlap scores such that each is for a Source Table’s column with which they share many values, and rank them in descending order of averaged scores (Line 3). With a set of candidate tables, we find tuples in each candidate table that contain column values from S𝑆Sitalic_S. Within these aligned tuples, we check if each aligned column in a candidate table, with respect to a column in S𝑆Sitalic_S, still has high set overlap (above threshold τ𝜏\tauitalic_τ). If not, we remove them (Line 3). Next, we remove any subsumed candidate table, whose columns and column values are all contained in another candidate table (Line 3). We then rename each candidate tables’ columns to the names of S𝑆Sitalic_S’s columns with which they align (Line 3), thus implicitly performing schema matching between S𝑆Sitalic_S’s columns and the columns from the candidate tables that have overlapping values with S𝑆Sitalic_S’s columns. Finally, we return the set of candidate tables.

Appendix A Matrix Representations

A-A Expanding Candidate Tables

In order to represent candidate tables as matrices, their tuples need to align with those in the Source Table. However, not all candidate tables may share a key column with the Source Table. Thus, we need to join a given candidate table that does not share a key column with the Source Table with those that do. This way, tuples from all candidate tables can be aligned with tuples from a Source Table using key values.

As illustrated in Expand Algorithm 5, we traverse a graph that consists of candidate tables as nodes and we find a join path between candidate tables that do not have a Source Table’s key (start nodes), and candidate tables that do (potential end nodes). If two candidate tables can join on common columns, their nodes are connected with an edge. For each edge, we use standard join-size cardinality estimation to find edge weights [71].

After we find a path from a start node to an end node, we iteratively join all tables in the path, resulting in a table that shares a key column with the Source Table. This way, all candidate tables share a key column with the Source Table.

1 Input: 𝒯={T1,T2,Tn}𝒯subscript𝑇1subscript𝑇2subscript𝑇𝑛\mathcal{T}=\{T_{1},T_{2},\dots T_{n}\}caligraphic_T = { italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }: set of candidate tables; k𝑘{k}italic_k: Source Table’s key column(s)
2 Output: 𝒯k={T1,T2,Tn}subscript𝒯𝑘subscript𝑇1subscript𝑇2subscript𝑇𝑛\mathcal{T}_{k}=\{T_{1},T_{2},\dots T_{n}\}caligraphic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = { italic_T start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_T start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }: set of candidate tables that all now contain Source Table’s key column
3 //Initialize Graph
4 nodes = candidate tables edges = tables that have joinable columns edge Weights = value overlap of joinable columns start_nodes𝑠𝑡𝑎𝑟𝑡_𝑛𝑜𝑑𝑒𝑠start\_nodesitalic_s italic_t italic_a italic_r italic_t _ italic_n italic_o italic_d italic_e italic_s = {candidate tables that do not contain k𝑘kitalic_k} end_nodes𝑒𝑛𝑑_𝑛𝑜𝑑𝑒𝑠end\_nodesitalic_e italic_n italic_d _ italic_n italic_o italic_d italic_e italic_s = {candidate tables that contain k𝑘kitalic_k} for each start𝑠𝑡𝑎𝑟𝑡startitalic_s italic_t italic_a italic_r italic_t in start_nodes𝑠𝑡𝑎𝑟𝑡normal-_𝑛𝑜𝑑𝑒𝑠start\_nodesitalic_s italic_t italic_a italic_r italic_t _ italic_n italic_o italic_d italic_e italic_s do
5       //Initialize sets and dictionaries
6       visited𝑣𝑖𝑠𝑖𝑡𝑒𝑑absentvisited\leftarrowitalic_v italic_i italic_s italic_i italic_t italic_e italic_d ← set() //visited nodes node_weights{}𝑛𝑜𝑑𝑒_𝑤𝑒𝑖𝑔𝑡𝑠node\_weights\leftarrow\{\}italic_n italic_o italic_d italic_e _ italic_w italic_e italic_i italic_g italic_h italic_t italic_s ← { } //maximum weights before each node descendant{}𝑑𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡descendant\leftarrow\{\}italic_d italic_e italic_s italic_c italic_e italic_n italic_d italic_a italic_n italic_t ← { } //best child for each node max_weight0𝑚𝑎𝑥_𝑤𝑒𝑖𝑔𝑡0max\_weight\leftarrow 0italic_m italic_a italic_x _ italic_w italic_e italic_i italic_g italic_h italic_t ← 0 //Maximum weight found so far end_node𝑒𝑛𝑑_𝑛𝑜𝑑𝑒absentend\_node\leftarrowitalic_e italic_n italic_d _ italic_n italic_o italic_d italic_e ← None //end node for a given start node //Initialize stack𝑠𝑡𝑎𝑐𝑘stackitalic_s italic_t italic_a italic_c italic_k for DFS stackstack+start𝑠𝑡𝑎𝑐𝑘𝑠𝑡𝑎𝑐𝑘𝑠𝑡𝑎𝑟𝑡stack\leftarrow stack+startitalic_s italic_t italic_a italic_c italic_k ← italic_s italic_t italic_a italic_c italic_k + italic_s italic_t italic_a italic_r italic_t visitedvisited+start𝑣𝑖𝑠𝑖𝑡𝑒𝑑𝑣𝑖𝑠𝑖𝑡𝑒𝑑𝑠𝑡𝑎𝑟𝑡visited\leftarrow visited+startitalic_v italic_i italic_s italic_i italic_t italic_e italic_d ← italic_v italic_i italic_s italic_i italic_t italic_e italic_d + italic_s italic_t italic_a italic_r italic_t while stack𝑠𝑡𝑎𝑐𝑘stackitalic_s italic_t italic_a italic_c italic_k is not empty do
7             nodestack.pop()formulae-sequence𝑛𝑜𝑑𝑒𝑠𝑡𝑎𝑐𝑘𝑝𝑜𝑝node\leftarrow stack.pop()italic_n italic_o italic_d italic_e ← italic_s italic_t italic_a italic_c italic_k . italic_p italic_o italic_p ( ) unvisited_children𝑢𝑛𝑣𝑖𝑠𝑖𝑡𝑒𝑑_𝑐𝑖𝑙𝑑𝑟𝑒𝑛absentunvisited\_children\leftarrowitalic_u italic_n italic_v italic_i italic_s italic_i italic_t italic_e italic_d _ italic_c italic_h italic_i italic_l italic_d italic_r italic_e italic_n ← children of node𝑛𝑜𝑑𝑒nodeitalic_n italic_o italic_d italic_e not in visited𝑣𝑖𝑠𝑖𝑡𝑒𝑑visiteditalic_v italic_i italic_s italic_i italic_t italic_e italic_d //Current child’s weight is the weight of the path so far, including the edge weight between node and current child
8             for each child𝑐𝑖𝑙𝑑childitalic_c italic_h italic_i italic_l italic_d in unvisited_children𝑢𝑛𝑣𝑖𝑠𝑖𝑡𝑒𝑑normal-_𝑐𝑖𝑙𝑑𝑟𝑒𝑛unvisited\_childrenitalic_u italic_n italic_v italic_i italic_s italic_i italic_t italic_e italic_d _ italic_c italic_h italic_i italic_l italic_d italic_r italic_e italic_n do
9                   child_weightnode_weights[node]+weight(node,child)𝑐𝑖𝑙𝑑_𝑤𝑒𝑖𝑔𝑡𝑛𝑜𝑑𝑒_𝑤𝑒𝑖𝑔𝑡𝑠delimited-[]𝑛𝑜𝑑𝑒𝑤𝑒𝑖𝑔𝑡𝑛𝑜𝑑𝑒𝑐𝑖𝑙𝑑child\_weight\leftarrow node\_weights[node]+weight(node,child)italic_c italic_h italic_i italic_l italic_d _ italic_w italic_e italic_i italic_g italic_h italic_t ← italic_n italic_o italic_d italic_e _ italic_w italic_e italic_i italic_g italic_h italic_t italic_s [ italic_n italic_o italic_d italic_e ] + italic_w italic_e italic_i italic_g italic_h italic_t ( italic_n italic_o italic_d italic_e , italic_c italic_h italic_i italic_l italic_d ) //update descendant if it contains the maximum sum of weights so far
10                   if child_weight>node_weights[child]𝑐𝑖𝑙𝑑normal-_𝑤𝑒𝑖𝑔𝑡𝑛𝑜𝑑𝑒normal-_𝑤𝑒𝑖𝑔𝑡𝑠delimited-[]𝑐𝑖𝑙𝑑child\_weight>node\_weights[child]italic_c italic_h italic_i italic_l italic_d _ italic_w italic_e italic_i italic_g italic_h italic_t > italic_n italic_o italic_d italic_e _ italic_w italic_e italic_i italic_g italic_h italic_t italic_s [ italic_c italic_h italic_i italic_l italic_d ] then
11                         node_weights[child]child_weight𝑛𝑜𝑑𝑒_𝑤𝑒𝑖𝑔𝑡𝑠delimited-[]𝑐𝑖𝑙𝑑𝑐𝑖𝑙𝑑_𝑤𝑒𝑖𝑔𝑡node\_weights[child]\leftarrow child\_weightitalic_n italic_o italic_d italic_e _ italic_w italic_e italic_i italic_g italic_h italic_t italic_s [ italic_c italic_h italic_i italic_l italic_d ] ← italic_c italic_h italic_i italic_l italic_d _ italic_w italic_e italic_i italic_g italic_h italic_t descendant[child]node𝑑𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡delimited-[]𝑐𝑖𝑙𝑑𝑛𝑜𝑑𝑒descendant[child]\leftarrow nodeitalic_d italic_e italic_s italic_c italic_e italic_n italic_d italic_a italic_n italic_t [ italic_c italic_h italic_i italic_l italic_d ] ← italic_n italic_o italic_d italic_e if child𝑐𝑖𝑙𝑑childitalic_c italic_h italic_i italic_l italic_d is in end_nodes𝑒𝑛𝑑normal-_𝑛𝑜𝑑𝑒𝑠end\_nodesitalic_e italic_n italic_d _ italic_n italic_o italic_d italic_e italic_s then
12                               if child_weight>max_weight𝑐𝑖𝑙𝑑normal-_𝑤𝑒𝑖𝑔𝑡𝑚𝑎𝑥normal-_𝑤𝑒𝑖𝑔𝑡child\_weight>max\_weightitalic_c italic_h italic_i italic_l italic_d _ italic_w italic_e italic_i italic_g italic_h italic_t > italic_m italic_a italic_x _ italic_w italic_e italic_i italic_g italic_h italic_t then
13                                     //child has k𝑘kitalic_k and the maximum weighted path so far
14                                     max_weightchild_weight𝑚𝑎𝑥_𝑤𝑒𝑖𝑔𝑡𝑐𝑖𝑙𝑑_𝑤𝑒𝑖𝑔𝑡max\_weight\leftarrow child\_weightitalic_m italic_a italic_x _ italic_w italic_e italic_i italic_g italic_h italic_t ← italic_c italic_h italic_i italic_l italic_d _ italic_w italic_e italic_i italic_g italic_h italic_t end_nodechild𝑒𝑛𝑑_𝑛𝑜𝑑𝑒𝑐𝑖𝑙𝑑end\_node\leftarrow childitalic_e italic_n italic_d _ italic_n italic_o italic_d italic_e ← italic_c italic_h italic_i italic_l italic_d
15                        stackstack+child𝑠𝑡𝑎𝑐𝑘𝑠𝑡𝑎𝑐𝑘𝑐𝑖𝑙𝑑stack\leftarrow stack+childitalic_s italic_t italic_a italic_c italic_k ← italic_s italic_t italic_a italic_c italic_k + italic_c italic_h italic_i italic_l italic_d
16                  visitedvisited+child𝑣𝑖𝑠𝑖𝑡𝑒𝑑𝑣𝑖𝑠𝑖𝑡𝑒𝑑𝑐𝑖𝑙𝑑visited\leftarrow visited+childitalic_v italic_i italic_s italic_i italic_t italic_e italic_d ← italic_v italic_i italic_s italic_i italic_t italic_e italic_d + italic_c italic_h italic_i italic_l italic_d
17      if end_node𝑒𝑛𝑑normal-_𝑛𝑜𝑑𝑒end\_nodeitalic_e italic_n italic_d _ italic_n italic_o italic_d italic_e is not null𝑛𝑢𝑙𝑙nullitalic_n italic_u italic_l italic_l then
18             //reconstruct path with maximum sum of weights by reversing path, starting with found end node
19             path[]𝑝𝑎𝑡path\leftarrow[]italic_p italic_a italic_t italic_h ← [ ] current_nodeend_node𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑛𝑜𝑑𝑒𝑒𝑛𝑑_𝑛𝑜𝑑𝑒current\_node\leftarrow end\_nodeitalic_c italic_u italic_r italic_r italic_e italic_n italic_t _ italic_n italic_o italic_d italic_e ← italic_e italic_n italic_d _ italic_n italic_o italic_d italic_e while current_node𝑐𝑢𝑟𝑟𝑒𝑛𝑡normal-_𝑛𝑜𝑑𝑒current\_nodeitalic_c italic_u italic_r italic_r italic_e italic_n italic_t _ italic_n italic_o italic_d italic_e is in descendant𝑑𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡descendantitalic_d italic_e italic_s italic_c italic_e italic_n italic_d italic_a italic_n italic_t do
20                   pathpath+current_node𝑝𝑎𝑡𝑝𝑎𝑡𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑛𝑜𝑑𝑒path\leftarrow path+current\_nodeitalic_p italic_a italic_t italic_h ← italic_p italic_a italic_t italic_h + italic_c italic_u italic_r italic_r italic_e italic_n italic_t _ italic_n italic_o italic_d italic_e current_nodedescendant[current_node]𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑛𝑜𝑑𝑒𝑑𝑒𝑠𝑐𝑒𝑛𝑑𝑎𝑛𝑡delimited-[]𝑐𝑢𝑟𝑟𝑒𝑛𝑡_𝑛𝑜𝑑𝑒current\_node\leftarrow descendant[current\_node]italic_c italic_u italic_r italic_r italic_e italic_n italic_t _ italic_n italic_o italic_d italic_e ← italic_d italic_e italic_s italic_c italic_e italic_n italic_d italic_a italic_n italic_t [ italic_c italic_u italic_r italic_r italic_e italic_n italic_t _ italic_n italic_o italic_d italic_e ]
21            path.reverse()formulae-sequence𝑝𝑎𝑡𝑟𝑒𝑣𝑒𝑟𝑠𝑒path.reverse()italic_p italic_a italic_t italic_h . italic_r italic_e italic_v italic_e italic_r italic_s italic_e ( ) table𝑡𝑎𝑏𝑙𝑒absenttable\leftarrowitalic_t italic_a italic_b italic_l italic_e ← first node in path𝑝𝑎𝑡pathitalic_p italic_a italic_t italic_h for each join_table𝑗𝑜𝑖𝑛normal-_𝑡𝑎𝑏𝑙𝑒join\_tableitalic_j italic_o italic_i italic_n _ italic_t italic_a italic_b italic_l italic_e in path[1:]path[1:]italic_p italic_a italic_t italic_h [ 1 : ] do
22                   tablejoin(table,join_table)𝑡𝑎𝑏𝑙𝑒𝑗𝑜𝑖𝑛𝑡𝑎𝑏𝑙𝑒𝑗𝑜𝑖𝑛_𝑡𝑎𝑏𝑙𝑒table\leftarrow join(table,join\_table)italic_t italic_a italic_b italic_l italic_e ← italic_j italic_o italic_i italic_n ( italic_t italic_a italic_b italic_l italic_e , italic_j italic_o italic_i italic_n _ italic_t italic_a italic_b italic_l italic_e )
23            𝒯k𝒯k+tablesubscript𝒯𝑘subscript𝒯𝑘𝑡𝑎𝑏𝑙𝑒\mathcal{T}_{k}\leftarrow\mathcal{T}_{k}+tablecaligraphic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ← caligraphic_T start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_t italic_a italic_b italic_l italic_e
Algorithm 5 Expand

A-B Two-Valued vs. Three-Valued Matrix Representations

After aligning tuples in candidate tables to Source Table’s tuples that share the same key values, we can represent candidate tables as matrices to show how similar the values in candidate tables are to those in the Source Table. These matrices have the same dimensions and indices as the Source Table. First, we consider matrices that consist of binary values, where a 0 in tuple i𝑖iitalic_i, column j𝑗jitalic_j represents a value at index (i,j𝑖𝑗i,jitalic_i , italic_j) in the candidate table that is different from the value in the same position in the Source Table, and a 1 represent common values in the same indices.

However, populating alignment matrices with binary values does not fully encode how many values in candidate tables differ from those in the Source Table. Specifically, this representation does not distinguish between nullified values in candidate tables (null values in candidate tables at index (i,j𝑖𝑗i,jitalic_i , italic_j) where there is a non-null value in the Source Table at the same position), and erroneous values (different non-null values in candidate tables at index (i,j𝑖𝑗i,jitalic_i , italic_j) from those in Source Tables at the same position).

Instead, we encode matrix representations using three values, where at a given index in an aligned tuple, there is a 1 for a value shared between a candidate table and the Source Table, 0 for a null value in the candidate table where there is a non-null value in the Source Table, and -1 for a non-null value in the candidate table that differs from the value in the Source Table.

A-C Metrics

Conditional KL-divergence: Given column C𝐶Citalic_C shared between a Source Table and a reclaimed table T𝑇Titalic_T, suppose we have probability distributions, 𝒫𝒫\mathcal{P}caligraphic_P for C𝐶Citalic_C in the Source Table and 𝒬𝒬\mathcal{Q}caligraphic_Q for C𝐶Citalic_C in the reclaimed table. We condition on the key values in key column K𝐾Kitalic_K. The conditional KL-divergence (or conditional relative entropy) between 𝒫𝒫\mathcal{P}caligraphic_P and 𝒬𝒬\mathcal{Q}caligraphic_Q of sample space X𝑋Xitalic_X of column C𝐶Citalic_C conditioned on key K𝐾Kitalic_K is as follows:

DKL(Q||P)=xX,kKP(x|k)log(Q(x|k)(1Q(¬x|k)P(x|k))D_{KL}(Q||P)=-\sum_{x\in X,k\in K}P(x|k)\text{log}\left(\dfrac{Q(x|k)(1-Q(\neg x% |k)}{P(x|k)}\right)italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_Q | | italic_P ) = - ∑ start_POSTSUBSCRIPT italic_x ∈ italic_X , italic_k ∈ italic_K end_POSTSUBSCRIPT italic_P ( italic_x | italic_k ) log ( divide start_ARG italic_Q ( italic_x | italic_k ) ( 1 - italic_Q ( ¬ italic_x | italic_k ) end_ARG start_ARG italic_P ( italic_x | italic_k ) end_ARG ) (11)

Given n𝑛nitalic_n non-key columns 𝒞𝒞\mathcal{C}caligraphic_C in a Source Table we take the average DKLKL{}_{\text{KL}}start_FLOATSUBSCRIPT KL end_FLOATSUBSCRIPT for each column divided by the probability of a key value in T𝑇Titalic_T matching a key value from the Source Table (Q(K)𝑄𝐾Q(K)italic_Q ( italic_K )) and the number of non-key columns (n𝑛nitalic_n). Then, the conditional KL-divergence of the reclaimed table is as follows:

DKL(T)=DKL(Q1||P1)+DKL(Q2||P2)++DKL(Qn||Pn)Q(K)*nD_{KL}(T)=\dfrac{D_{KL}(Q_{1}||P_{1})+D_{KL}(Q_{2}||P_{2})+\dots+D_{KL}(Q_{n}|% |P_{n})}{Q(K)*n}italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_T ) = divide start_ARG italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_Q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | | italic_P start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) + italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_Q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT | | italic_P start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) + ⋯ + italic_D start_POSTSUBSCRIPT italic_K italic_L end_POSTSUBSCRIPT ( italic_Q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT | | italic_P start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) end_ARG start_ARG italic_Q ( italic_K ) * italic_n end_ARG

(12)

The conditional KL-divergence of the reclaimed table is a score [0,)absent0\in[0,\infty)∈ [ 0 , ∞ ), with 0 being the ideal score. There is no upper limit on this metric since it naturally approaches \infty when no key value from the Source Table is found in the reclaimed table.

A-D Effectiveness Experiments

Effectiveness of LLM Baseline: To further assess the effectiveness of Gen-T, we compare it to OpenAI’s ChatGPT3.5 [89] as a representative Large Language Model (LLM). In our prompt for ChatGPT, we define the table reclamation problem and input a source table and a set of tables. We ask ChatGPT to return an integration result that reproduces the source table as closely as possible. Due to scalability constraints, we only use the integrating set as input.

In our effectiveness experiments, we run ChatGPT on TP-TR Small benchmark, since it times out for all source tables in the larger benchmarks. We find that ChatGPT achieves a Recall of 0.239 and a precision of 0.256. For divergence measures, ChatGPT achieves an Instance-Divergence of 0.540 and a DKLKL{}_{\text{KL}}start_FLOATSUBSCRIPT KL end_FLOATSUBSCRIPT of 209.83. ChatGPT returns tables that only contain some source tuples, while containing many non-null erroneous values (high DKLKL{}_{\text{KL}}start_FLOATSUBSCRIPT KL end_FLOATSUBSCRIPT score).

Effectiveness of Pruning in Gen-T: For a more detailed analysis, we analyze Recall, Precision, and F1-Scores of Gen-T and baseline, ALITE-PS, on TP-TR Med for each of its 26 Source Tables (Figure 9). Note that ALITE-PS directly integrates a set of candidate tables, whereas Gen-T first prunes the set of candidate tables to a set of originating tables before performing table integration. Gen-T outperforms ALITE-PS in Precision for all Source Tables, and outperforms ALITE-PS in Recall for 24 of 26 total Source Tables. This shows that ALITE-PS, which directly integrates candidate tables without pruning, reclaims more Source Tuples than Gen-T, which does include a pruning step, for only a few Source Tables. Also, Gen-T outperforms ALITE-PS in F1-Score for all Source Tables (Figure 9(c)), showing that even if ALITE-PS outperforms Gen-T in Recall, it does not impact the F1-Score.

Refer to caption
Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 9: Recall, Precision, and F1 Scores of Gen-T and ALITE-PS for each Source Table in TP-TR Med benchmark.