Showing 1–29 of 29 results for author: Ilyas, I

Searching in archive cs.
  1. arXiv:2408.05948  [pdf, other]

    cs.CL cs.IR cs.LG

    ConvKGYarn: Spinning Configurable and Scalable Conversational Knowledge Graph QA datasets with Large Language Models

    Authors: Ronak Pradeep, Daniel Lee, Ali Mousavi, Jeff Pound, Yisi Sang, Jimmy Lin, Ihab Ilyas, Saloni Potdar, Mostafa Arefiyan, Yunyao Li

    Abstract: The rapid advancement of Large Language Models (LLMs) and conversational assistants necessitates dynamic, scalable, and configurable conversational datasets for training and evaluation. These datasets must accommodate diverse user interaction modes, including text and voice, each presenting unique modeling challenges. Knowledge Graphs (KGs), with their structured and evolving nature, offer an idea…

    Submitted 12 August, 2024; originally announced August 2024.

  2. arXiv:2312.09424  [pdf, other]

    cs.CL cs.AI

    Open Domain Knowledge Extraction for Knowledge Graphs

    Authors: Kun Qian, Anton Belyi, Fei Wu, Samira Khorshidi, Azadeh Nikfarjam, Rahul Khot, Yisi Sang, Katherine Luna, Xianqi Chu, Eric Choi, Yash Govind, Chloe Seivwright, Yiwen Sun, Ahmed Fakhry, Theo Rekatsinas, Ihab Ilyas, Xiaoguang Qi, Yunyao Li

    Abstract: The quality of a knowledge graph directly impacts the quality of downstream applications (e.g., the number of answerable questions using the graph). One ongoing challenge when building a knowledge graph is to ensure completeness and freshness of the graph's entities and facts. In this paper, we introduce ODKE, a scalable and extensible framework that sources high-quality entities and facts from ope…

    Submitted 30 October, 2023; originally announced December 2023.

    Comments: 7 pages, 7 figures, 5 tables, preprint technical report, no code or data is released

    MSC Class: 68T30 (primary) ACM Class: F.4.1; I.2.4

  3. arXiv:2311.15781  [pdf, other]

    cs.AI cs.CL cs.LG

    Increasing Coverage and Precision of Textual Information in Multilingual Knowledge Graphs

    Authors: Simone Conia, Min Li, Daniel Lee, Umar Farooq Minhas, Ihab Ilyas, Yunyao Li

    Abstract: Recent work in Natural Language Processing and Computer Vision has been using textual information -- e.g., entity names and descriptions -- available in knowledge graphs to ground neural models to high-quality structured data. However, when it comes to non-English languages, the quantity and quality of textual information are comparatively scarce. To address this issue, we introduce the novel task…

    Submitted 27 November, 2023; originally announced November 2023.

    Comments: Camera ready for EMNLP 2023

  4. arXiv:2310.17119  [pdf, other]

    cs.CL

    FLEEK: Factual Error Detection and Correction with Evidence Retrieved from External Knowledge

    Authors: Farima Fatahi Bayat, Kun Qian, Benjamin Han, Yisi Sang, Anton Belyi, Samira Khorshidi, Fei Wu, Ihab F. Ilyas, Yunyao Li

    Abstract: Detecting factual errors in textual information, whether generated by large language models (LLMs) or curated by humans, is crucial for making informed decisions. LLMs' inability to attribute their claims to external knowledge and their tendency to hallucinate make it difficult to rely on their responses. Humans, too, are prone to factual errors in their writing. Since manual detection and correct…

    Submitted 25 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 (Demonstration Track)

  5. arXiv:2309.11669  [pdf, other]

    cs.CL

    Construction of Paired Knowledge Graph-Text Datasets Informed by Cyclic Evaluation

    Authors: Ali Mousavi, Xin Zhan, He Bai, Peng Shi, Theo Rekatsinas, Benjamin Han, Yunyao Li, Jeff Pound, Josh Susskind, Natalie Schluter, Ihab Ilyas, Navdeep Jaitly

    Abstract: Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text from KG and vice versa. However, models trained on datasets where KG and text pairs are not equivalent can suffer from more hallucination and poorer recall. In this paper, we verify this empirically by generating datasets with different levels of noise and find…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: 16 pages

  6. Growing and Serving Large Open-domain Knowledge Graphs

    Authors: Ihab F. Ilyas, JP Lacerda, Yunyao Li, Umar Farooq Minhas, Ali Mousavi, Jeffrey Pound, Theodoros Rekatsinas, Chiraag Sumanth

    Abstract: Applications of large open-domain knowledge graphs (KGs) to real-world problems pose many unique challenges. In this paper, we present extensions to Saga, our platform for continuous construction and serving of knowledge at scale. In particular, we describe a pipeline for training knowledge graph embeddings that powers key capabilities such as fact ranking, fact verification, a related entities ser…

    Submitted 16 May, 2023; originally announced May 2023.

    Comments: To be published in SIGMOD 2023

  7. arXiv:2304.01926  [pdf]

    cs.DB cs.AI cs.LG

    High-Throughput Vector Similarity Search in Knowledge Graphs

    Authors: Jason Mohoney, Anil Pacaci, Shihabur Rahman Chowdhury, Ali Mousavi, Ihab F. Ilyas, Umar Farooq Minhas, Jeffrey Pound, Theodoros Rekatsinas

    Abstract: There is an increasing adoption of machine learning for encoding data into vectors to serve online recommendation and search use cases. As a result, recent data management systems propose augmenting query processing with online vector similarity search. In this work, we explore vector similarity search in the context of Knowledge Graphs (KGs). Motivated by the tasks of finding related KG queries a…

    Submitted 4 April, 2023; originally announced April 2023.

    Comments: 13 pages, 7 figures, to be published in ACM SIGMOD 2023

  8. arXiv:2204.07309  [pdf, other]

    cs.DB cs.AI cs.CL

    Saga: A Platform for Continuous Construction and Serving of Knowledge At Scale

    Authors: Ihab F. Ilyas, Theodoros Rekatsinas, Vishnu Konda, Jeffrey Pound, Xiaoguang Qi, Mohamed Soliman

    Abstract: We introduce Saga, a next-generation knowledge construction and serving platform for powering knowledge-based applications at industrial scale. Saga follows a hybrid batch-incremental design to continuously integrate billions of facts about real-world entities and construct a central knowledge graph that supports multiple production use cases with diverse requirements around data freshness, accura…

    Submitted 14 April, 2022; originally announced April 2022.

    Journal ref: ACM SIGMOD 2022

  9. arXiv:2106.01501  [pdf, other]

    cs.DB cs.LG

    Ember: No-Code Context Enrichment via Similarity-Based Keyless Joins

    Authors: Sahaana Suri, Ihab F. Ilyas, Christopher Ré, Theodoros Rekatsinas

    Abstract: Structured data, or data that adheres to a pre-defined schema, can suffer from fragmented context: information describing a single entity can be scattered across multiple datasets or tables tailored for specific business needs, with no explicit linking keys (e.g., primary key-foreign key relationships or heuristic functions). Context enrichment, or rebuilding fragmented context, using keyless join…

    Submitted 2 June, 2021; originally announced June 2021.

  10. arXiv:2103.05173  [pdf, other]

    cs.DB cs.CR

    PCOR: Private Contextual Outlier Release via Differentially Private Search

    Authors: Masoumeh Shafieinejad, Florian Kerschbaum, Ihab F. Ilyas

    Abstract: Outlier detection plays a significant role in various real-world applications such as intrusion, malfunction, and fraud detection. Traditionally, outlier detection techniques are applied to find outliers in the context of the whole dataset. However, this practice neglects contextual outliers, which are not outliers in the whole dataset but are outliers in some specific neighborhoods. Contextual outliers are par…

    Submitted 8 March, 2021; originally announced March 2021.

  11. arXiv:2101.06801  [pdf, other]

    cs.DB

    Real-Time LSM-Trees for HTAP Workloads

    Authors: Hemant Saxena, Lukasz Golab, Stratos Idreos, Ihab F. Ilyas

    Abstract: Real-time analytics systems employ hybrid data layouts in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high insert rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycl…

    Submitted 14 July, 2022; v1 submitted 17 January, 2021; originally announced January 2021.

  12. arXiv:2012.15713  [pdf, other]

    cs.DB cs.CR

    Kamino: Constraint-Aware Differentially Private Data Synthesis

    Authors: Chang Ge, Shubhankar Mohapatra, Xi He, Ihab F. Ilyas

    Abstract: Organizations are increasingly relying on data to support decisions. When data contains private and sensitive information, the data owner often desires to publish a synthetic database instance that is similarly useful as the true data, while ensuring the privacy of individual data records. Existing differentially private data synthesis methods aim to generate useful data based on applications, but…

    Submitted 15 April, 2021; v1 submitted 31 December, 2020; originally announced December 2020.

    Comments: Update based on reviewers' comments

  13. arXiv:2011.04730  [pdf, other]

    cs.DB cs.LG

    Batchwise Probabilistic Incremental Data Cleaning

    Authors: Paulo H. Oliveira, Daniel S. Kaster, Caetano Traina-Jr., Ihab F. Ilyas

    Abstract: Lack of data and data quality issues are among the main bottlenecks that prevent further artificial intelligence adoption within many organizations, pushing data scientists to spend most of their time cleaning data before being able to answer analytical questions. Hence, there is a need for more effective and efficient data cleaning solutions, which, not surprisingly, is rife with theoretical and…

    Submitted 9 November, 2020; originally announced November 2020.

    Comments: 29 pages, 13 figures

  14. arXiv:2008.10549  [pdf, other]

    cs.LG cs.DB cs.IR stat.ML

    On sampling from data with duplicate records

    Authors: Alireza Heidari, Shrinu Kushagra, Ihab F. Ilyas

    Abstract: Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that samples uniformly from the set of entities present in the database in the presence of duplicates. We accomplish this by a two-stage process. In the first step, we estimate the frequencies of all the entities in the database. In the second step, we…

    Submitted 24 August, 2020; originally announced August 2020.

    Comments: 21 pages, 5 figures

  15. arXiv:2006.10208  [pdf, other]

    cs.LG cs.DB cs.IR stat.ML

    Record fusion: A learning approach

    Authors: Alireza Heidari, George Michalopoulos, Shrinu Kushagra, Ihab F. Ilyas, Theodoros Rekatsinas

    Abstract: Record fusion is the task of aggregating multiple records that correspond to the same real-world entity in a database. We can view record fusion as a machine learning problem where the goal is to predict the "correct" value for each attribute for each entity. Given a database, we use a combination of attribute-level, record-level, and database-level signals to construct a feature vector for each ce…

    Submitted 17 June, 2020; originally announced June 2020.

    Comments: 18 pages, 9 figures

  16. arXiv:2005.08540  [pdf, ps, other]

    cs.DB

    Approximate Denial Constraints

    Authors: Ester Livshits, Alireza Heidari, Ihab F. Ilyas, Benny Kimelfeld

    Abstract: The problem of mining integrity constraints from data has been extensively studied over the past two decades for commonly used types of constraints including the classic Functional Dependencies (FDs) and the more general Denial Constraints (DCs). In this paper, we investigate the problem of mining approximate DCs (i.e., DCs that are "almost" satisfied) from data. Considering approximate constraint…

    Submitted 18 May, 2020; originally announced May 2020.

  17. arXiv:1907.00141  [pdf, other]

    cs.LG cs.DS stat.ML

    Approximate Inference in Structured Instances with Noisy Categorical Observations

    Authors: Alireza Heidari, Ihab F. Ilyas, Theodoros Rekatsinas

    Abstract: We study the problem of recovering the latent ground truth labeling of a structured instance with categorical random variables in the presence of noisy observations. We present a new approximate algorithm for graphs with categorical variables that achieves low Hamming error in the presence of noisy vertex and edge observations. Our main result shows a logarithmic dependency of the Hamming error to…

    Submitted 5 July, 2019; v1 submitted 29 June, 2019; originally announced July 2019.

    Comments: UAI 2019, 33 pages

  18. arXiv:1906.06574  [pdf, ps, other]

    cs.DB

    Technical Report: Optimizing Human Involvement for Entity Matching and Consolidation

    Authors: Ji Sun, Dong Deng, Ihab Ilyas, Guoliang Li, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

    Abstract: An end-to-end data integration system requires human feedback in several phases, including collecting training data for entity matching, debugging the resulting clusters, confirming transformations applied on these clusters for data standardization, and finally, reducing each cluster to a single, canonical representation (or "golden record"). The traditional wisdom is to sequentially apply the hum…

    Submitted 15 June, 2019; originally announced June 2019.

  19. arXiv:1904.06492  [pdf, other]

    cs.DB

    Properties of Inconsistency Measures for Databases

    Authors: Ester Livshits, Rina Kochirgan, Segev Tsur, Ihab F. Ilyas, Benny Kimelfeld, Sudeepa Roy

    Abstract: How should we quantify the inconsistency of a database that violates integrity constraints? Proper measures are important for various tasks, such as progress indication and action prioritization in cleaning systems, and reliability estimation for new datasets. To choose an appropriate inconsistency measure, it is important to identify the desired properties in the application and understand which…

    Submitted 1 April, 2021; v1 submitted 13 April, 2019; originally announced April 2019.

  20. HoloDetect: Few-Shot Learning for Error Detection

    Authors: Alireza Heidari, Joshua McGrath, Ihab F. Ilyas, Theodoros Rekatsinas

    Abstract: We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2…

    Submitted 3 April, 2019; originally announced April 2019.

    Comments: 18 pages

    Journal ref: ACM SIGMOD 2019

  21. arXiv:1903.06607  [pdf, other]

    cs.CL cs.DB

    Matching Entities Across Different Knowledge Graphs with Graph Embeddings

    Authors: Michael Azmy, Peng Shi, Jimmy Lin, Ihab F. Ilyas

    Abstract: This paper explores the problem of matching entities across different knowledge graphs. Given a query entity in one knowledge graph, we wish to find the corresponding real-world entity in another knowledge graph. We formalize this problem and present two large-scale datasets for this task based on existing cross-ontology links between DBpedia and Wikidata, focused on several hundred thousand ambigu…

    Submitted 15 March, 2019; originally announced March 2019.

  22. arXiv:1903.05228  [pdf, other]

    cs.DB

    Distributed Dependency Discovery

    Authors: Hemant Saxena, Lukasz Golab, Ihab F. Ilyas

    Abstract: We analyze the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize communication costs, especially in shared-nothing settings, leading to a more complex optimization space. To understand this space, we introduce si…

    Submitted 12 March, 2019; originally announced March 2019.

  23. arXiv:1810.04361  [pdf, ps, other]

    cs.LG stat.ML

    Semi-supervised clustering for de-duplication

    Authors: Shrinu Kushagra, Shai Ben-David, Ihab Ilyas

    Abstract: Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and to put records corresponding to different physical entities into different clusters. We introduce a framework whic…

    Submitted 10 October, 2018; originally announced October 2018.

    Report number: PMLR 89:1659-1667, 2019

    Journal ref: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019

  24. arXiv:1801.06750  [pdf, other]

    cs.DB

    A Formal Framework For Probabilistic Unclean Databases

    Authors: Christopher De Sa, Ihab F. Ilyas, Benny Kimelfeld, Christopher Re, Theodoros Rekatsinas

    Abstract: Most theoretical frameworks that focus on data errors and inconsistencies follow logic-based reasoning. Yet, practical data cleaning tools need to incorporate statistical reasoning to be effective in real-world data cleaning tasks. Motivated by these empirical successes, we propose a formal framework for unclean databases, where two types of statistical knowledge are incorporated: The first repres…

    Submitted 24 January, 2019; v1 submitted 20 January, 2018; originally announced January 2018.

  25. arXiv:1712.10266  [pdf, other]

    cs.DB

    APEx: Accuracy-Aware Differentially Private Data Exploration

    Authors: Chang Ge, Xi He, Ihab F. Ilyas, Ashwin Machanavajjhala

    Abstract: Organizations are increasingly interested in allowing external data scientists to explore their sensitive datasets. Due to the popularity of differential privacy, data owners want the data exploration to ensure provable privacy guarantees. However, current systems for answering queries with differential privacy place an inordinate burden on the data analysts to understand differential privacy, man…

    Submitted 10 May, 2019; v1 submitted 29 December, 2017; originally announced December 2017.

    Comments: Full version of the ACM SIGMOD 2019 paper

  26. arXiv:1709.10436  [pdf, other]

    cs.DB

    Unsupervised String Transformation Learning for Entity Consolidation

    Authors: Dong Deng, Wenbo Tao, Ziawasch Abedjan, Ahmed Elmagarmid, Guoliang Li, Ihab F. Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

    Abstract: Data integration has been a long-standing challenge in data management with many applications. A key step in data integration is entity consolidation. It takes a collection of clusters of duplicate records as input and produces a single "golden record" for each cluster, which contains the canonical value for each attribute. Truth discovery and data fusion methods, as well as Master Data Management…

    Submitted 30 July, 2018; v1 submitted 29 September, 2017; originally announced September 2017.

  27. arXiv:1702.00820  [pdf, other]

    cs.DB

    HoloClean: Holistic Data Repairs with Probabilistic Inference

    Authors: Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, Christopher Ré

    Abstract: We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates…

    Submitted 2 February, 2017; originally announced February 2017.

  28. arXiv:1207.5226  [pdf, other]

    cs.DB

    On the Relative Trust between Inconsistent Data and Inaccurate Constraints

    Authors: George Beskales, Ihab F. Ilyas, Lukasz Golab, Artur Galiullin

    Abstract: Functional dependencies (FDs) specify the intended data semantics while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to data evolution or incomplete knowledge of the data semantics. We argue that the notion of relative trust is a crucial aspect of this problem: if the FDs are out…

    Submitted 24 July, 2012; v1 submitted 22 July, 2012; originally announced July 2012.

  29. arXiv:1103.3103  [pdf]

    cs.DB

    Guided Data Repair

    Authors: Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville, Mourad Ouzzani, Ihab F. Ilyas

    Abstract: In this paper we present GDR, a Guided Data Repair framework that incorporates user feedback in the cleaning process to enhance and accelerate existing automatic repair techniques while minimizing user involvement. GDR consults the user on the updates that are most likely to be beneficial in improving data quality. GDR also uses machine learning methods to identify and apply the correct updates di…

    Submitted 16 March, 2011; originally announced March 2011.

    Comments: VLDB 2011

    Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 4, No. 5, pp. 279-289 (2011)