Showing 1–2 of 2 results for author: Cappuzzo, R

  1. arXiv:2402.06282  [pdf, other]

    cs.DB cs.LG

    Retrieve, Merge, Predict: Augmenting Tables with Data Lakes

    Authors: Riccardo Cappuzzo, Aimee Coelho, Felix Lefebvre, Paolo Papotti, Gael Varoquaux

    Abstract: We present an in-depth analysis of data discovery in data lakes, focusing on table augmentation for given machine learning tasks. We analyze alternative methods used in the three main steps: retrieving joinable tables, merging information, and predicting with the resultant table. As data lakes, the paper uses YADL (Yet Another Data Lake) -- a novel dataset we developed as a tool for benchmarking t…

    Submitted 27 May, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

    Comments: 12 pages + references, 10 figures. Under submission at VLDB2024 (EA&B track)

  2. arXiv:1909.01120  [pdf, other]

    cs.DB cs.CL cs.LG

    Local Embeddings for Relational Data Integration

    Authors: Riccardo Cappuzzo, Paolo Papotti, Saravanan Thirumuruganathan

    Abstract: Deep learning based techniques have been recently used with promising results for data integration problems. Some methods directly use pre-trained embeddings that were trained on a large corpus such as Wikipedia. However, they may not always be an appropriate choice for enterprise datasets with custom vocabulary. Other methods adapt techniques from natural language processing to obtain embeddings…

    Submitted 3 September, 2020; v1 submitted 3 September, 2019; originally announced September 2019.

    Comments: Accepted to SIGMOD 2020 as Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. Code can be found at https://gitlab.eurecom.fr/cappuzzo/embdi