Search | arXiv e-print repository

ConvKGYarn: Spinning Configurable and Scalable Conversational Knowledge Graph QA datasets with Large Language Models

Authors: Ronak Pradeep, Daniel Lee, Ali Mousavi, Jeff Pound, Yisi Sang, Jimmy Lin, Ihab Ilyas, Saloni Potdar, Mostafa Arefiyan, Yunyao Li

Abstract: The rapid advancement of Large Language Models (LLMs) and conversational assistants necessitates dynamic, scalable, and configurable conversational datasets for training and evaluation. These datasets must accommodate diverse user interaction modes, including text and voice, each presenting unique modeling challenges. Knowledge Graphs (KGs), with their structured and evolving nature, offer an idea… ▽ More The rapid advancement of Large Language Models (LLMs) and conversational assistants necessitates dynamic, scalable, and configurable conversational datasets for training and evaluation. These datasets must accommodate diverse user interaction modes, including text and voice, each presenting unique modeling challenges. Knowledge Graphs (KGs), with their structured and evolving nature, offer an ideal foundation for current and precise knowledge. Although human-curated KG-based conversational datasets exist, they struggle to keep pace with the rapidly changing user information needs. We present ConvKGYarn, a scalable method for generating up-to-date and configurable conversational KGQA datasets. Qualitative psychometric analyses confirm our method can generate high-quality datasets rivaling a popular conversational KGQA dataset while offering it at scale and covering a wide range of human-interaction configurations. We showcase its utility by testing LLMs on diverse conversations - exploring model behavior on conversational KGQA sets with different configurations grounded in the same KG fact set. Our results highlight the ability of ConvKGYarn to improve KGQA foundations and evaluate parametric knowledge of LLMs, thus offering a robust solution to the constantly evolving landscape of conversational assistants. △ Less

Submitted 12 August, 2024; originally announced August 2024.

arXiv:2312.09424 [pdf, other]

Open Domain Knowledge Extraction for Knowledge Graphs

Authors: Kun Qian, Anton Belyi, Fei Wu, Samira Khorshidi, Azadeh Nikfarjam, Rahul Khot, Yisi Sang, Katherine Luna, Xianqi Chu, Eric Choi, Yash Govind, Chloe Seivwright, Yiwen Sun, Ahmed Fakhry, Theo Rekatsinas, Ihab Ilyas, Xiaoguang Qi, Yunyao Li

Abstract: The quality of a knowledge graph directly impacts the quality of downstream applications (e.g. the number of answerable questions using the graph). One ongoing challenge when building a knowledge graph is to ensure completeness and freshness of the graph's entities and facts. In this paper, we introduce ODKE, a scalable and extensible framework that sources high-quality entities and facts from ope… ▽ More The quality of a knowledge graph directly impacts the quality of downstream applications (e.g. the number of answerable questions using the graph). One ongoing challenge when building a knowledge graph is to ensure completeness and freshness of the graph's entities and facts. In this paper, we introduce ODKE, a scalable and extensible framework that sources high-quality entities and facts from open web at scale. ODKE utilizes a wide range of extraction models and supports both streaming and batch processing at different latency. We reflect on the challenges and design decisions made and share lessons learned when building and deploying ODKE to grow an industry-scale open domain knowledge graph. △ Less

Submitted 30 October, 2023; originally announced December 2023.

Comments: 7 pages, 7 figures, 5 tables, preprint technical report, no code or data is released

MSC Class: 68T30 (primary) ACM Class: F.4.1; I.2.4

arXiv:2311.15781 [pdf, other]

Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs

Authors: Simone Conia, Min Li, Daniel Lee, Umar Farooq Minhas, Ihab Ilyas, Yunyao Li

Abstract: Recent work in Natural Language Processing and Computer Vision has been using textual information -- e.g., entity names and descriptions -- available in knowledge graphs to ground neural models to high-quality structured data. However, when it comes to non-English languages, the quantity and quality of textual information are comparatively scarce. To address this issue, we introduce the novel task… ▽ More Recent work in Natural Language Processing and Computer Vision has been using textual information -- e.g., entity names and descriptions -- available in knowledge graphs to ground neural models to high-quality structured data. However, when it comes to non-English languages, the quantity and quality of textual information are comparatively scarce. To address this issue, we introduce the novel task of automatic Knowledge Graph Enhancement (KGE) and perform a thorough investigation on bridging the gap in both the quantity and quality of textual information between English and non-English languages. More specifically, we: i) bring to light the problem of increasing multilingual coverage and precision of entity names and descriptions in Wikidata; ii) demonstrate that state-of-the-art methods, namely, Machine Translation (MT), Web Search (WS), and Large Language Models (LLMs), struggle with this task; iii) present M-NTA, a novel unsupervised approach that combines MT, WS, and LLMs to generate high-quality textual information; and, iv) study the impact of increasing multilingual coverage and precision of non-English textual information in Entity Linking, Knowledge Graph Completion, and Question Answering. As part of our effort towards better multilingual knowledge graphs, we also introduce WikiKGE-10, the first human-curated benchmark to evaluate KGE approaches in 10 languages across 7 language families. △ Less

Submitted 27 November, 2023; originally announced November 2023.

Comments: Camera ready for EMNLP 2023

arXiv:2310.17119 [pdf, other]

FLEEK: Factual Error Detection and Correction with Evidence Retrieved from External Knowledge

Authors: Farima Fatahi Bayat, Kun Qian, Benjamin Han, Yisi Sang, Anton Belyi, Samira Khorshidi, Fei Wu, Ihab F. Ilyas, Yunyao Li

Abstract: Detecting factual errors in textual information, whether generated by large language models (LLM) or curated by humans, is crucial for making informed decisions. LLMs' inability to attribute their claims to external knowledge and their tendency to hallucinate makes it difficult to rely on their responses. Humans, too, are prone to factual errors in their writing. Since manual detection and correct… ▽ More Detecting factual errors in textual information, whether generated by large language models (LLM) or curated by humans, is crucial for making informed decisions. LLMs' inability to attribute their claims to external knowledge and their tendency to hallucinate makes it difficult to rely on their responses. Humans, too, are prone to factual errors in their writing. Since manual detection and correction of factual errors is labor-intensive, developing an automatic approach can greatly reduce human effort. We present FLEEK, a prototype tool that automatically extracts factual claims from text, gathers evidence from external knowledge sources, evaluates the factuality of each claim, and suggests revisions for identified errors using the collected evidence. Initial empirical evaluation on fact error detection (77-85\% F1) shows the potential of FLEEK. A video demo of FLEEK can be found at https://youtu.be/NapJFUlkPdQ. △ Less

Submitted 25 October, 2023; originally announced October 2023.

Comments: EMNLP 2023 (Demonstration Track)

arXiv:2309.11669 [pdf, other]

Construction of Paired Knowledge Graph-Text Datasets Informed by Cyclic Evaluation

Authors: Ali Mousavi, Xin Zhan, He Bai, Peng Shi, Theo Rekatsinas, Benjamin Han, Yunyao Li, Jeff Pound, Josh Susskind, Natalie Schluter, Ihab Ilyas, Navdeep Jaitly

Abstract: Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text from KG and vice versa. However models trained on datasets where KG and text pairs are not equivalent can suffer from more hallucination and poorer recall. In this paper, we verify this empirically by generating datasets with different levels of noise and find… ▽ More Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text from KG and vice versa. However models trained on datasets where KG and text pairs are not equivalent can suffer from more hallucination and poorer recall. In this paper, we verify this empirically by generating datasets with different levels of noise and find that noisier datasets do indeed lead to more hallucination. We argue that the ability of forward and reverse models trained on a dataset to cyclically regenerate source KG or text is a proxy for the equivalence between the KG and the text in the dataset. Using cyclic evaluation we find that manually created WebNLG is much better than automatically created TeKGen and T-REx. Guided by these observations, we construct a new, improved dataset called LAGRANGE using heuristics meant to improve equivalence between KG and text and show the impact of each of the heuristics on cyclic evaluation. We also construct two synthetic datasets using large language models (LLMs), and observe that these are conducive to models that perform significantly well on cyclic generation of text, but less so on cyclic generation of KGs, probably because of a lack of a consistent underlying ontology. △ Less

Submitted 20 September, 2023; originally announced September 2023.

Comments: 16 pages

arXiv:2305.09464 [pdf, other]

doi 10.1145/3555041.3589672

Growing and Serving Large Open-domain Knowledge Graphs

Authors: Ihab F. Ilyas, JP Lacerda, Yunyao Li, Umar Farooq Minhas, Ali Mousavi, Jeffrey Pound, Theodoros Rekatsinas, Chiraag Sumanth

Abstract: Applications of large open-domain knowledge graphs (KGs) to real-world problems pose many unique challenges. In this paper, we present extensions to Saga our platform for continuous construction and serving of knowledge at scale. In particular, we describe a pipeline for training knowledge graph embeddings that powers key capabilities such as fact ranking, fact verification, a related entities ser… ▽ More Applications of large open-domain knowledge graphs (KGs) to real-world problems pose many unique challenges. In this paper, we present extensions to Saga our platform for continuous construction and serving of knowledge at scale. In particular, we describe a pipeline for training knowledge graph embeddings that powers key capabilities such as fact ranking, fact verification, a related entities service, and support for entity linking. We then describe how our platform, including graph embeddings, can be leveraged to create a Semantic Annotation service that links unstructured Web documents to entities in our KG. Semantic annotation of the Web effectively expands our knowledge graph with edges to open-domain Web content which can be used in various search and ranking problems. Finally, we leverage annotated Web documents to drive Open-domain Knowledge Extraction. This targeted extraction framework identifies important coverage issues in the KG, then finds relevant data sources for target entities on the Web and extracts missing information to enrich the KG. Finally, we describe adaptations to our knowledge platform needed to construct and serve private personal knowledge on-device. This includes private incremental KG construction, cross-device knowledge sync, and global knowledge enrichment. △ Less

Submitted 16 May, 2023; originally announced May 2023.

Comments: To be published in SIGMOD 2023

arXiv:2304.01926 [pdf]

High-Throughput Vector Similarity Search in Knowledge Graphs

Authors: Jason Mohoney, Anil Pacaci, Shihabur Rahman Chowdhury, Ali Mousavi, Ihab F. Ilyas, Umar Farooq Minhas, Jeffrey Pound, Theodoros Rekatsinas

Abstract: There is an increasing adoption of machine learning for encoding data into vectors to serve online recommendation and search use cases. As a result, recent data management systems propose augmenting query processing with online vector similarity search. In this work, we explore vector similarity search in the context of Knowledge Graphs (KGs). Motivated by the tasks of finding related KG queries a… ▽ More There is an increasing adoption of machine learning for encoding data into vectors to serve online recommendation and search use cases. As a result, recent data management systems propose augmenting query processing with online vector similarity search. In this work, we explore vector similarity search in the context of Knowledge Graphs (KGs). Motivated by the tasks of finding related KG queries and entities for past KG query workloads, we focus on hybrid vector similarity search (hybrid queries for short) where part of the query corresponds to vector similarity search and part of the query corresponds to predicates over relational attributes associated with the underlying data vectors. For example, given past KG queries for a song entity, we want to construct new queries for new song entities whose vector representations are close to the vector representation of the entity in the past KG query. But entities in a KG also have non-vector attributes such as a song associated with an artist, a genre, and a release date. Therefore, suggested entities must also satisfy query predicates over non-vector attributes beyond a vector-based similarity predicate. While these tasks are central to KGs, our contributions are generally applicable to hybrid queries. In contrast to prior works that optimize online queries, we focus on enabling efficient batch processing of past hybrid query workloads. We present our system, HQI, for high-throughput batch processing of hybrid queries. We introduce a workload-aware vector data partitioning scheme to tailor the vector index layout to the given workload and describe a multi-query optimization technique to reduce the overhead of vector similarity computations. We evaluate our methods on industrial workloads and demonstrate that HQI yields a 31x improvement in throughput for finding related KG queries compared to existing hybrid query processing approaches. △ Less

Submitted 4 April, 2023; originally announced April 2023.

Comments: 13 pages, 7 figures, to be published in ACM SIGMOD 2023

arXiv:2204.07309 [pdf, other]

doi 10.1145/3514221.3526049

Saga: A Platform for Continuous Construction and Serving of Knowledge At Scale

Authors: Ihab F. Ilyas, Theodoros Rekatsinas, Vishnu Konda, Jeffrey Pound, Xiaoguang Qi, Mohamed Soliman

Abstract: We introduce Saga, a next-generation knowledge construction and serving platform for powering knowledge-based applications at industrial scale. Saga follows a hybrid batch-incremental design to continuously integrate billions of facts about real-world entities and construct a central knowledge graph that supports multiple production use cases with diverse requirements around data freshness, accura… ▽ More We introduce Saga, a next-generation knowledge construction and serving platform for powering knowledge-based applications at industrial scale. Saga follows a hybrid batch-incremental design to continuously integrate billions of facts about real-world entities and construct a central knowledge graph that supports multiple production use cases with diverse requirements around data freshness, accuracy, and availability. In this paper, we discuss the unique challenges associated with knowledge graph construction at industrial scale, and review the main components of Saga and how they address these challenges. Finally, we share lessons-learned from a wide array of production use cases powered by Saga. △ Less

Submitted 14 April, 2022; originally announced April 2022.

Journal ref: ACM SIGMOD 2022

arXiv:2106.01501 [pdf, other]

Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins

Authors: Sahaana Suri, Ihab F. Ilyas, Christopher Ré, Theodoros Rekatsinas

Abstract: Structured data, or data that adheres to a pre-defined schema, can suffer from fragmented context: information describing a single entity can be scattered across multiple datasets or tables tailored for specific business needs, with no explicit linking keys (e.g., primary key-foreign key relationships or heuristic functions). Context enrichment, or rebuilding fragmented context, using keyless join… ▽ More Structured data, or data that adheres to a pre-defined schema, can suffer from fragmented context: information describing a single entity can be scattered across multiple datasets or tables tailored for specific business needs, with no explicit linking keys (e.g., primary key-foreign key relationships or heuristic functions). Context enrichment, or rebuilding fragmented context, using keyless joins is an implicit or explicit step in machine learning (ML) pipelines over structured data sources. This process is tedious, domain-specific, and lacks support in now-prevalent no-code ML systems that let users create ML pipelines using just input data and high-level configuration files. In response, we propose Ember, a system that abstracts and automates keyless joins to generalize context enrichment. Our key insight is that Ember can enable a general keyless join operator by constructing an index populated with task-specific embeddings. Ember learns these embeddings by leveraging Transformer-based representation learning techniques. We describe our core architectural principles and operators when developing Ember, and empirically demonstrate that Ember allows users to develop no-code pipelines for five domains, including search, recommendation and question answering, and can exceed alternatives by up to 39% recall, with as little as a single line configuration change. △ Less

Submitted 2 June, 2021; originally announced June 2021.

arXiv:2103.05173 [pdf, other]

PCOR: Private Contextual Outlier Release via Differentially Private Search

Authors: Masoumeh Shafieinejad, Florian Kerschbaum, Ihab F. Ilyas

Abstract: Outlier detection plays a significant role in various real world applications such as intrusion, malfunction, and fraud detection. Traditionally, outlier detection techniques are applied to find outliers in the context of the whole dataset. However, this practice neglects contextual outliers, that are not outliers in the whole dataset but in some specific neighborhoods. Contextual outliers are par… ▽ More Outlier detection plays a significant role in various real world applications such as intrusion, malfunction, and fraud detection. Traditionally, outlier detection techniques are applied to find outliers in the context of the whole dataset. However, this practice neglects contextual outliers, that are not outliers in the whole dataset but in some specific neighborhoods. Contextual outliers are particularly important in data exploration and targeted anomaly explanation and diagnosis. In these scenarios, the data owner computes the following information: i) The attributes that contribute to the abnormality of an outlier (metric), ii) Contextual description of the outlier's neighborhoods (context), and iii) The utility score of the context, e.g. its strength in showing the outlier's significance, or in relation to a particular explanation for the outlier. However, revealing the outlier's context leaks information about the other individuals in the population as well, violating their privacy. We address the issue of population privacy violations in this paper, and propose a solution for the two main challenges. In this setting, the data owner is required to release a valid context for the queried record, i.e. a context in which the record is an outlier. Hence, the first major challenge is that the privacy technique must preserve the validity of the context for each record. We propose techniques to protect the privacy of individuals through a relaxed notion of differential privacy to satisfy this requirement. The second major challenge is applying the proposed techniques efficiently, as they impose intensive computation to the base algorithm. To overcome this challenge, we propose a graph structure to map the contexts to, and introduce differentially private graph search algorithms as efficient solutions for the computation problem caused by differential privacy techniques. △ Less

Submitted 8 March, 2021; originally announced March 2021.

arXiv:2101.06801 [pdf, other]

Real-Time LSM-Trees for HTAP Workloads

Authors: Hemant Saxena, Lukasz Golab, Stratos Idreos, Ihab F. Ilyas

Abstract: Real-time analytics systems employ hybrid data layouts in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high insert rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycl… ▽ More Real-time analytics systems employ hybrid data layouts in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high insert rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle-aware storage engine due to its high write throughput and level-oriented structure, in which records propagate from one level to the next over time. To build a lifecycle-aware storage engine using an LSM-Tree, we make a crucial modification to allow different data layouts in different levels, ranging from purely row-oriented to purely column-oriented, leading to a Real-Time LSM-Tree. We give a cost model and an algorithm to design a Real-Time LSM-Tree that is suitable for a given workload, followed by an experimental evaluation of LASER - a prototype implementation of our idea built on top of the RocksDB key-value store. △ Less

Submitted 14 July, 2022; v1 submitted 17 January, 2021; originally announced January 2021.

arXiv:2012.15713 [pdf, other]

Kamino: Constraint-Aware Differentially Private Data Synthesis

Authors: Chang Ge, Shubhankar Mohapatra, Xi He, Ihab F. Ilyas

Abstract: Organizations are increasingly relying on data to support decisions. When data contains private and sensitive information, the data owner often desires to publish a synthetic database instance that is similarly useful as the true data, while ensuring the privacy of individual data records. Existing differentially private data synthesis methods aim to generate useful data based on applications, but… ▽ More Organizations are increasingly relying on data to support decisions. When data contains private and sensitive information, the data owner often desires to publish a synthetic database instance that is similarly useful as the true data, while ensuring the privacy of individual data records. Existing differentially private data synthesis methods aim to generate useful data based on applications, but they fail in keeping one of the most fundamental data properties of the structured data -- the underlying correlations and dependencies among tuples and attributes (i.e., the structure of the data). This structure is often expressed as integrity and schema constraints, or with a probabilistic generative process. As a result, the synthesized data is not useful for any downstream tasks that require this structure to be preserved. This work presents Kamino, a data synthesis system to ensure differential privacy and to preserve the structure and correlations present in the original dataset. Kamino takes as input of a database instance, along with its schema (including integrity constraints), and produces a synthetic database instance with differential privacy and structure preservation guarantees. We empirically show that while preserving the structure of the data, Kamino achieves comparable and even better usefulness in applications of training classification models and answering marginal queries than the state-of-the-art methods of differentially private data synthesis. △ Less

Submitted 15 April, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

Comments: Update based on reviewers' comments

arXiv:2011.04730 [pdf, other]

Batchwise Probabilistic Incremental Data Cleaning

Authors: Paulo H. Oliveira, Daniel S. Kaster, Caetano Traina-Jr., Ihab F. Ilyas

Abstract: Lack of data and data quality issues are among the main bottlenecks that prevent further artificial intelligence adoption within many organizations, pushing data scientists to spend most of their time cleaning data before being able to answer analytical questions. Hence, there is a need for more effective and efficient data cleaning solutions, which, not surprisingly, is rife with theoretical and… ▽ More Lack of data and data quality issues are among the main bottlenecks that prevent further artificial intelligence adoption within many organizations, pushing data scientists to spend most of their time cleaning data before being able to answer analytical questions. Hence, there is a need for more effective and efficient data cleaning solutions, which, not surprisingly, is rife with theoretical and engineering problems. This report addresses the problem of performing holistic data cleaning incrementally, given a fixed rule set and an evolving categorical relational dataset acquired in sequential batches. To the best of our knowledge, our contributions compose the first incremental framework that cleans data (i) independently of user interventions, (ii) without requiring knowledge about the incoming dataset, such as the number of classes per attribute, and (iii) holistically, enabling multiple error types to be repaired simultaneously, and thus avoiding conflicting repairs. Extensive experiments show that our approach outperforms the competitors with respect to repair quality, execution time, and memory consumption. △ Less

Submitted 9 November, 2020; originally announced November 2020.

Comments: 29 pages, 13 figures

arXiv:2008.10549 [pdf, other]

On sampling from data with duplicate records

Authors: Alireza Heidari, Shrinu Kushagra, Ihab F. Ilyas

Abstract: Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that samples uniformly from the set of entities present in the database in the presence of duplicates. We accomplish this by a two-stage process. In the first step, we estimate the frequencies of all the entities in the database. In the second step, we… ▽ More Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that samples uniformly from the set of entities present in the database in the presence of duplicates. We accomplish this by a two-stage process. In the first step, we estimate the frequencies of all the entities in the database. In the second step, we use rejection sampling to obtain a (approximately) uniform sample from the set of entities. However, efficiently estimating the frequency of all the entities is a non-trivial task and not attainable in the general case. Hence, we consider various natural properties of the data under which such frequency estimation (and consequently uniform sampling) is possible. Under each of those assumptions, we provide sampling algorithms and give proofs of the complexity (both statistical and computational) of our approach. We complement our study by conducting extensive experiments on both real and synthetic datasets. △ Less

Submitted 24 August, 2020; originally announced August 2020.

Comments: 21 pages, 5 figures

arXiv:2006.10208 [pdf, other]

Record fusion: A learning approach

Authors: Alireza Heidari, George Michalopoulos, Shrinu Kushagra, Ihab F. Ilyas, Theodoros Rekatsinas

Abstract: Record fusion is the task of aggregating multiple records that correspond to the same real-world entity in a database. We can view record fusion as a machine learning problem where the goal is to predict the "correct" value for each attribute for each entity. Given a database, we use a combination of attribute-level, recordlevel, and database-level signals to construct a feature vector for each ce… ▽ More Record fusion is the task of aggregating multiple records that correspond to the same real-world entity in a database. We can view record fusion as a machine learning problem where the goal is to predict the "correct" value for each attribute for each entity. Given a database, we use a combination of attribute-level, recordlevel, and database-level signals to construct a feature vector for each cell (or (row, col)) of that database. We use this feature vector alongwith the ground-truth information to learn a classifier for each of the attributes of the database. Our learning algorithm uses a novel stagewise additive model. At each stage, we construct a new feature vector by combining a part of the original feature vector with features computed by the predictions from the previous stage. We then learn a softmax classifier over the new feature space. This greedy stagewise approach can be viewed as a deep model where at each stage, we are adding more complicated non-linear transformations of the original feature vector. We show that our approach fuses records with an average precision of ~98% when source information of records is available, and ~94% without source information across a diverse array of real-world datasets. We compare our approach to a comprehensive collection of data fusion and entity consolidation methods considered in the literature. We show that our approach can achieve an average precision improvement of ~20%/~45% with/without source information respectively. △ Less

Submitted 17 June, 2020; originally announced June 2020.

Comments: 18 pages, 9 figures

arXiv:2005.08540 [pdf, ps, other]

Approximate Denial Constraints

Authors: Ester Livshits, Alireza Heidari, Ihab F. Ilyas, Benny Kimelfeld

Abstract: The problem of mining integrity constraints from data has been extensively studied over the past two decades for commonly used types of constraints including the classic Functional Dependencies (FDs) and the more general Denial Constraints (DCs). In this paper, we investigate the problem of mining approximate DCs (i.e., DCs that are "almost" satisfied) from data. Considering approximate constraint… ▽ More The problem of mining integrity constraints from data has been extensively studied over the past two decades for commonly used types of constraints including the classic Functional Dependencies (FDs) and the more general Denial Constraints (DCs). In this paper, we investigate the problem of mining approximate DCs (i.e., DCs that are "almost" satisfied) from data. Considering approximate constraints allows us to discover more accurate constraints in inconsistent databases, detect rules that are generally correct but may have a few exceptions, as well as avoid overfitting and obtain more general and less contrived constraints. We introduce the algorithm ADCMiner for mining approximate DCs. An important feature of this algorithm is that it does not assume any specific definition of an approximate DC, but takes the semantics as input. Since there is more than one way to define an approximate DC and different definitions may produce very different results, we do not focus on one definition, but rather on a general family of approximation functions that satisfies some natural axioms defined in this paper and captures commonly used definitions of approximate constraints. We also show how our algorithm can be combined with sampling to return results with high accuracy while significantly reducing the running time. △ Less

Submitted 18 May, 2020; originally announced May 2020.

arXiv:1907.00141 [pdf, other]

Approximate Inference in Structured Instances with Noisy Categorical Observations

Authors: Alireza Heidari, Ihab F. Ilyas, Theodoros Rekatsinas

Abstract: We study the problem of recovering the latent ground truth labeling of a structured instance with categorical random variables in the presence of noisy observations. We present a new approximate algorithm for graphs with categorical variables that achieves low Hamming error in the presence of noisy vertex and edge observations. Our main result shows a logarithmic dependency of the Hamming error to… ▽ More We study the problem of recovering the latent ground truth labeling of a structured instance with categorical random variables in the presence of noisy observations. We present a new approximate algorithm for graphs with categorical variables that achieves low Hamming error in the presence of noisy vertex and edge observations. Our main result shows a logarithmic dependency of the Hamming error to the number of categories of the random variables. Our approach draws connections to correlation clustering with a fixed number of clusters. Our results generalize the works of Globerson et al. (2015) and Foster et al. (2018), who study the hardness of structured prediction under binary labels, to the case of categorical labels. △ Less

Submitted 5 July, 2019; v1 submitted 29 June, 2019; originally announced July 2019.

Comments: UAI 2019, 33 pages

arXiv:1906.06574 [pdf, ps, other]

Technical Report: Optimizing Human Involvement for Entity Matching and Consolidation

Authors: Ji Sun, Dong Deng, Ihab Ilyas, Guoliang Li, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

Abstract: An end-to-end data integration system requires human feedback in several phases, including collecting training data for entity matching, debugging the resulting clusters, confirming transformations applied on these clusters for data standardization, and finally, reducing each cluster to a single, canonical representation (or "golden record"). The traditional wisdom is to sequentially apply the hum… ▽ More An end-to-end data integration system requires human feedback in several phases, including collecting training data for entity matching, debugging the resulting clusters, confirming transformations applied on these clusters for data standardization, and finally, reducing each cluster to a single, canonical representation (or "golden record"). The traditional wisdom is to sequentially apply the human feedback, obtained by asking specific questions, within some budget in each phase. However, these questions are highly correlated; the answer to one can influence the outcome of any of the phases of the pipeline. Hence, interleaving them has the potential to offer significant benefits. In this paper, we propose a human-in-the-loop framework that interleaves different types of questions to optimize human involvement. We propose benefit models to measure the quality improvement from asking a question, and cost models to measure the human time it takes to answer a question. We develop a question scheduling framework that judiciously selects questions to maximize the accuracy of the final golden records. Experimental results on three real-world datasets show that our holistic method significantly improves the quality of golden records from 70% to 90%, compared with the state-of-the-art approaches. △ Less

Submitted 15 June, 2019; originally announced June 2019.

arXiv:1904.06492 [pdf, other]

Properties of Inconsistency Measures for Databases

Authors: Ester Livshits, Rina Kochirgan, Segev Tsur, Ihab F. Ilyas, Benny Kimelfeld, Sudeepa Roy

Abstract: How should we quantify the inconsistency of a database that violates integrity constraints? Proper measures are important for various tasks, such as progress indication and action prioritization in cleaning systems, and reliability estimation for new datasets. To choose an appropriate inconsistency measure, it is important to identify the desired properties in the application and understand which… ▽ More How should we quantify the inconsistency of a database that violates integrity constraints? Proper measures are important for various tasks, such as progress indication and action prioritization in cleaning systems, and reliability estimation for new datasets. To choose an appropriate inconsistency measure, it is important to identify the desired properties in the application and understand which of these is guaranteed or at least expected in practice. For example, in some use cases the inconsistency should reduce if constraints are eliminated; in others it should be stable and avoid jitters and jumps in reaction to small changes in the database. We embark on a systematic investigation of properties for database inconsistency measures. We investigate a collection of basic measures that have been proposed in the past in both the Knowledge Representation and Database communities, analyze their theoretical properties, and empirically observe their behaviour in an experimental study. We also demonstrate how the framework can lead to new inconsistency measures by introducing a new measure that, in contrast to the rest, satisfies all of the properties we consider and can be computed in polynomial time. △ Less

Submitted 1 April, 2021; v1 submitted 13 April, 2019; originally announced April 2019.

arXiv:1904.02285 [pdf, other]

doi 10.1145/3299869.3319888

HoloDetect: Few-Shot Learning for Error Detection

Authors: Alireza Heidari, Joshua McGrath, Ihab F. Ilyas, Theodoros Rekatsinas

Abstract: We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2… ▽ More We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies from the noisy input dataset in a weakly supervised manner. We show that our framework detects errors with an average precision of ~94% and an average recall of ~93% across a diverse array of datasets that exhibit different types and amounts of errors. We compare our approach to a comprehensive collection of error detection methods, ranging from traditional rule-based methods to ensemble-based and active learning approaches. We show that data augmentation yields an average improvement of 20 F1 points while it requires access to 3x fewer labeled examples compared to other ML approaches. △ Less

Submitted 3 April, 2019; originally announced April 2019.

Comments: 18 pages,

Journal ref: ACM SIGMOD 2019

arXiv:1903.06607 [pdf, other]

Matching Entities Across Different Knowledge Graphs with Graph Embeddings

Authors: Michael Azmy, Peng Shi, Jimmy Lin, Ihab F. Ilyas

Abstract: This paper explores the problem of matching entities across different knowledge graphs. Given a query entity in one knowledge graph, we wish to find the corresponding real-world entity in another knowledge graph. We formalize this problem and present two large-scale datasets for this task based on exiting cross-ontology links between DBpedia and Wikidata, focused on several hundred thousand ambigu… ▽ More This paper explores the problem of matching entities across different knowledge graphs. Given a query entity in one knowledge graph, we wish to find the corresponding real-world entity in another knowledge graph. We formalize this problem and present two large-scale datasets for this task based on exiting cross-ontology links between DBpedia and Wikidata, focused on several hundred thousand ambiguous entities. Using a classification-based approach, we find that a simple multi-layered perceptron based on representations derived from RDF2Vec graph embeddings of entities in each knowledge graph is sufficient to achieve high accuracy, with only small amounts of training data. The contributions of our work are datasets for examining this problem and strong baselines on which future work can be based. △ Less

Submitted 15 March, 2019; originally announced March 2019.

arXiv:1903.05228 [pdf, other]

Distributed Dependency Discovery

Authors: Hemant Saxena, Lukasz Golab, Ihab F. Ilyas

Abstract: We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce si… ▽ More We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce six primitives shared by existing dependency discovery algorithms, corresponding to data processing steps separated by communication barriers. Through case studies, we show how the primitives allow us to analyze the design space and develop communication-optimized implementations. Finally, we support our analysis with an experimental evaluation on real datasets. △ Less

Submitted 12 March, 2019; originally announced March 2019.

arXiv:1810.04361 [pdf, ps, other]

Semi-supervised clustering for de-duplication

Authors: Shrinu Kushagra, Shai Ben-David, Ihab Ilyas

Abstract: Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and putting records corresponding to different physical entities into different clusters. We introduce a framework whic… ▽ More Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and putting records corresponding to different physical entities into different clusters. We introduce a framework which we call promise correlation clustering. Given a complete graph $G$ with the edges labelled $0$ and $1$, the goal is to find a clustering that minimizes the number of $0$ edges within a cluster plus the number of $1$ edges across different clusters (or correlation loss). The optimal clustering can also be viewed as a complete graph $G^*$ with edges corresponding to points in the same cluster being labelled $0$ and other edges being labelled $1$. Under the promise that the edge difference between $G$ and $G^*$ is "small", we prove that finding the optimal clustering (or $G^*$) is still NP-Hard. [Ashtiani et. al, 2016] introduced the framework of semi-supervised clustering, where the learning algorithm has access to an oracle, which answers whether two points belong to the same or different clusters. We further prove that even with access to a same-cluster oracle, the promise version is NP-Hard as long as the number queries to the oracle is not too large ($o(n)$ where $n$ is the number of vertices). Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss. However, we restrict ourselves to a given class $\mathcal F$ of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees. △ Less

Submitted 10 October, 2018; originally announced October 2018.

Report number: PMLR 89:1659-1667, 2019

Journal ref: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019

arXiv:1801.06750 [pdf, other]

A Formal Framework For Probabilistic Unclean Databases

Authors: Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Re, Theodoros Rekatsinas

Abstract: Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks. Motivated by these empirical successes, we propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated: The first repres… ▽ More Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks. Motivated by these empirical successes, we propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated: The first represents a belief of how intended (clean) data is generated, and the second represents a belief of how noise is introduced in the actual observed database instance. To capture this noisy channel model, we introduce the concept of a Probabilistic Unclean Database (PUD), a triple that consists of a probabilistic database that we call the intention, a probabilistic data transformator that we call the realization and captures how noise is introduced, and a dirty observed database instance that we call the observation. We define three computational problems in the PUD framework: cleaning (infer the most probable clean instance given a PUD), probabilistic query answering (compute the probability of an answer tuple over the unclean observed instance), and learning (estimate the most likely intention and realization models of a PUD given a collection of training data). We illustrate the PUD framework on concrete representations of the intention and realization, show that they generalize traditional concepts of repairs such as cardinality and value repairs, draw connection to consistent query answering, and prove tractability results. We further show that parameters can be learned in practical instantiations, and in fact, prove that under certain conditions we can learn a PUD directly from a single dirty database instance without any need for clean examples. △ Less

Submitted 24 January, 2019; v1 submitted 20 January, 2018; originally announced January 2018.

arXiv:1712.10266 [pdf, other]

APEx: Accuracy-Aware Differentially Private Data Exploration

Authors: Chang Ge, Xi He, Ihab F. Ilyas, Ashwin Machanavajjhala

Abstract: Organizations are increasingly interested in allowing external data scientists to explore their sensitive datasets. Due to the popularity of differential privacy, data owners want the data exploration to ensure provable privacy guarantees. However, current systems for answering queries with differential privacy place an inordinate burden on the data analysts to understand differential privacy, man… ▽ More Organizations are increasingly interested in allowing external data scientists to explore their sensitive datasets. Due to the popularity of differential privacy, data owners want the data exploration to ensure provable privacy guarantees. However, current systems for answering queries with differential privacy place an inordinate burden on the data analysts to understand differential privacy, manage their privacy budget, and even implement new algorithms for noisy query answering. Moreover, current systems do not provide any guarantees to the data analyst on the quality they care about, namely accuracy of query answers. We present APEx, a novel system that allows data analysts to pose adaptively chosen sequences of queries along with required accuracy bounds. By translating queries and accuracy bounds into differentially private algorithms with the least privacy loss, APEx returns query answers to the data analyst that meet the accuracy bounds, and proves to the data owner that the entire data exploration process is differentially private. Our comprehensive experimental study on real datasets demonstrates that APEx can answer a variety of queries accurately with moderate to small privacy loss, and can support data exploration for entity resolution with high accuracy under reasonable privacy settings. △ Less

Submitted 10 May, 2019; v1 submitted 29 December, 2017; originally announced December 2017.

Comments: Full version of the ACM SIGMOD 2019 paper

arXiv:1709.10436 [pdf, other]

Unsupervised String Transformation Learning for Entity Consolidation

Authors: Dong Deng, Wenbo Tao, Ziawasch Abedjan, Ahmed Elmagarmid, Guoliang Li, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

Abstract: Data integration has been a long-standing challenge in data management with many applications. A key step in data integration is entity consolidation. It takes a collection of clusters of duplicate records as input and produces a single "golden record" for each cluster, which contains the canonical value for each attribute. Truth discovery and data fusion methods, as well as Master Data Management… ▽ More Data integration has been a long-standing challenge in data management with many applications. A key step in data integration is entity consolidation. It takes a collection of clusters of duplicate records as input and produces a single "golden record" for each cluster, which contains the canonical value for each attribute. Truth discovery and data fusion methods, as well as Master Data Management (MDM) systems, can be used for entity consolidation. However, to achieve better results, the variant values (i.e., values that are logically the same with different formats) in the clusters need to be consolidated before applying these methods. For this purpose, we propose a data-driven method to standardize the variant values based on two observations: (1) the variant values usually can be transformed to the same representation (e.g., "Mary Lee" and "Lee, Mary") and (2) the same transformation often appears repeatedly across different clusters (e.g., transpose the first and last name). Our approach first uses an unsupervised method to generate groups of value pairs that can be transformed in the same way (i.e., they share a transformation). Then the groups are presented to a human for verification and the approved ones are used to standardize the data. In a real-world dataset with 17,497 records, our method achieved 75% recall and 99.5% precision in standardizing variant values by asking a human 100 yes/no questions, which completely outperformed a state of the art data wrangling tool. △ Less

Submitted 30 July, 2018; v1 submitted 29 September, 2017; originally announced September 2017.

arXiv:1702.00820 [pdf, other]

HoloClean: Holistic Data Repairs with Probabilistic Inference

Authors: Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, Christopher Ré

Abstract: We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates… ▽ More We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Inspired by recent theoretical advances in probabilistic inference, we introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. We show that HoloClean scales to instances with millions of tuples and find data repairs with an average precision of ~90% and an average recall of above ~76% across a diverse array of datasets exhibiting different types of errors. This yields an average F1 improvement of more than 2x against state-of-the-art methods. △ Less

Submitted 2 February, 2017; originally announced February 2017.

arXiv:1207.5226 [pdf, other]

On the Relative Trust between Inconsistent Data and Inaccurate Constraints

Authors: George Beskales, Ihab F. Ilyas, Lukasz Golab, Artur Galiullin

Abstract: Functional dependencies (FDs) specify the intended data semantics while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to data evolution or incomplete knowledge of the data semantics. We argue that the notion of relative trust is a crucial aspect of this problem: if the FDs are out… ▽ More Functional dependencies (FDs) specify the intended data semantics while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to data evolution or incomplete knowledge of the data semantics. We argue that the notion of relative trust is a crucial aspect of this problem: if the FDs are outdated, we should modify them to fit the data, but if we suspect that there are problems with the data, we should modify the data to fit the FDs. In practice, it is usually unclear how much to trust the data versus the FDs. To address this problem, we propose an algorithm for generating non-redundant solutions (i.e., simultaneous modifications of the data and the FDs) corresponding to various levels of relative trust. This can help users determine the best way to modify their data and/or FDs to achieve consistency. △ Less

Submitted 24 July, 2012; v1 submitted 22 July, 2012; originally announced July 2012.

arXiv:1103.3103 [pdf]

Guided Data Repair

Authors: Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, Ihab F. Ilyas

Abstract: In this paper we present GDR, a Guided Data Repair framework that incorporates user feedback in the cleaning process to enhance and accelerate existing automatic repair techniques while minimizing user involvement. GDR consults the user on the updates that are most likely to be beneficial in improving data quality. GDR also uses machine learning methods to identify and apply the correct updates di… ▽ More In this paper we present GDR, a Guided Data Repair framework that incorporates user feedback in the cleaning process to enhance and accelerate existing automatic repair techniques while minimizing user involvement. GDR consults the user on the updates that are most likely to be beneficial in improving data quality. GDR also uses machine learning methods to identify and apply the correct updates directly to the database without the actual involvement of the user on these specific updates. To rank potential updates for consultation by the user, we first group these repairs and quantify the utility of each group using the decision-theory concept of value of information (VOI). We then apply active learning to order updates within a group based on their ability to improve the learned model. User feedback is used to repair the database and to adaptively refine the training set for the model. We empirically evaluate GDR on a real-world dataset and show significant improvement in data quality using our user guided repairing process. We also, assess the trade-off between the user efforts and the resulting data quality. △ Less

Submitted 16 March, 2011; originally announced March 2011.

Comments: VLDB2011

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 4, No. 5, pp. 279-289 (2011)

Showing 1–29 of 29 results for author: Ilyas, I