Search | arXiv e-print repository

doi 10.18653/v1/2023.emnlp-main.442

HiddenTables & PyQTax: A Cooperative Game and Dataset For TableQA to Ensure Scale and Data Privacy Across a Myriad of Taxonomies

Authors: William Watson, Nicole Cho, Tucker Balch, Manuela Veloso

Abstract: A myriad of different Large Language Models (LLMs) face a common challenge in contextually analyzing table question-answering tasks. These challenges are engendered from (1) finite context windows for large tables, (2) multi-faceted discrepancies amongst tokenization patterns against cell boundaries, and (3) various limitations stemming from data confidentiality in the process of using external mo… ▽ More A myriad of different Large Language Models (LLMs) face a common challenge in contextually analyzing table question-answering tasks. These challenges are engendered from (1) finite context windows for large tables, (2) multi-faceted discrepancies amongst tokenization patterns against cell boundaries, and (3) various limitations stemming from data confidentiality in the process of using external models such as gpt-3.5-turbo. We propose a cooperative game dubbed "HiddenTables" as a potential resolution to this challenge. In essence, "HiddenTables" is played between the code-generating LLM "Solver" and the "Oracle" which evaluates the ability of the LLM agents to solve Table QA tasks. This game is based on natural language schemas and importantly, ensures the security of the underlying data. We provide evidential experiments on a diverse set of tables that demonstrate an LLM's collective inability to generalize and perform on complex queries, handle compositional dependencies, and align natural language to programmatic commands when concrete table schemas are provided. Unlike encoder-based models, we have pushed the boundaries of "HiddenTables" to not be limited by the number of rows - therefore we exhibit improved efficiency in prompt and completion tokens. Our infrastructure has spawned a new dataset "PyQTax" that spans across 116,671 question-table-answer triplets and provides additional fine-grained breakdowns & labels for varying question taxonomies. Therefore, in tandem with our academic contributions regarding LLMs' deficiency in TableQA tasks, "HiddenTables" is a tactile manifestation of how LLMs can interact with massive datasets while ensuring data security and minimizing generation costs. △ Less

Submitted 16 June, 2024; originally announced June 2024.

Comments: In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)

Journal ref: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023) 7144-7159

arXiv:2405.05260 [pdf, other]

doi 10.1145/3383455.3422520

Financial Table Extraction in Image Documents

Authors: William Watson, Bo Liu

Abstract: Table extraction has long been a pervasive problem in financial services. This is more challenging in the image domain, where content is locked behind cumbersome pixel format. Luckily, advances in deep learning for image segmentation, OCR, and sequence modeling provides the necessary heavy lifting to achieve impressive results. This paper presents an end-to-end pipeline for identifying, extracting… ▽ More Table extraction has long been a pervasive problem in financial services. This is more challenging in the image domain, where content is locked behind cumbersome pixel format. Luckily, advances in deep learning for image segmentation, OCR, and sequence modeling provides the necessary heavy lifting to achieve impressive results. This paper presents an end-to-end pipeline for identifying, extracting and transcribing tabular content in image documents, while retaining the original spatial relations with high fidelity. △ Less

Submitted 18 March, 2024; originally announced May 2024.

arXiv:2404.13050 [pdf, other]

FlowMind: Automatic Workflow Generation with LLMs

Authors: Zhen Zeng, William Watson, Nicole Cho, Saba Rahimi, Shayleen Reynolds, Tucker Balch, Manuela Veloso

Abstract: The rapidly evolving field of Robotic Process Automation (RPA) has made significant strides in automating repetitive processes, yet its effectiveness diminishes in scenarios requiring spontaneous or unpredictable tasks demanded by users. This paper introduces a novel approach, FlowMind, leveraging the capabilities of Large Language Models (LLMs) such as Generative Pretrained Transformer (GPT), to… ▽ More The rapidly evolving field of Robotic Process Automation (RPA) has made significant strides in automating repetitive processes, yet its effectiveness diminishes in scenarios requiring spontaneous or unpredictable tasks demanded by users. This paper introduces a novel approach, FlowMind, leveraging the capabilities of Large Language Models (LLMs) such as Generative Pretrained Transformer (GPT), to address this limitation and create an automatic workflow generation system. In FlowMind, we propose a generic prompt recipe for a lecture that helps ground LLM reasoning with reliable Application Programming Interfaces (APIs). With this, FlowMind not only mitigates the common issue of hallucinations in LLMs, but also eliminates direct interaction between LLMs and proprietary data or code, thus ensuring the integrity and confidentiality of information - a cornerstone in financial services. FlowMind further simplifies user interaction by presenting high-level descriptions of auto-generated workflows, enabling users to inspect and provide feedback effectively. We also introduce NCEN-QA, a new dataset in finance for benchmarking question-answering tasks from N-CEN reports on funds. We used NCEN-QA to evaluate the performance of workflows generated by FlowMind against baseline and ablation variants of FlowMind. We demonstrate the success of FlowMind, the importance of each component in the proposed lecture recipe, and the effectiveness of user interaction and feedback in FlowMind. △ Less

Submitted 16 March, 2024; originally announced April 2024.

Comments: Published in ACM ICAIF 2023

arXiv:2404.12535 [pdf, other]

HalluciBot: Is There No Such Thing as a Bad Question?

Authors: William Watson, Nicole Cho

Abstract: Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). In this context, an overwhelming number of studies have focused on analyzing the post-generation phase - refining outputs via feedback, analyzing logit output values, or deriving clues via the outputs' artifacts. We propose HalluciBot, a model that predicts the p… ▽ More Hallucination continues to be one of the most critical challenges in the institutional adoption journey of Large Language Models (LLMs). In this context, an overwhelming number of studies have focused on analyzing the post-generation phase - refining outputs via feedback, analyzing logit output values, or deriving clues via the outputs' artifacts. We propose HalluciBot, a model that predicts the probability of hallucination $\textbf{before generation}$, for any query imposed to an LLM. In essence, HalluciBot does not invoke any generation during inference. To derive empirical evidence for HalluciBot, we employ a Multi-Agent Monte Carlo Simulation using a Query Perturbator to craft $n$ variations per query at train time. The construction of our Query Perturbator is motivated by our introduction of a new definition of hallucination - $\textit{truthful hallucination}$. Our training methodology generated 2,219,022 estimates for a training corpus of 369,837 queries, spanning 13 diverse datasets and 3 question-answering scenarios. HalluciBot predicts both binary and multi-class probabilities of hallucination, enabling a means to judge the query's quality with regards to its propensity to hallucinate. Therefore, HalluciBot paves the way to revise or cancel a query before generation and the ensuing computational waste. Moreover, it provides a lucid means to measure user accountability for hallucinatory queries. △ Less

Submitted 18 April, 2024; originally announced April 2024.

arXiv:2404.04003 [pdf, other]

BuDDIE: A Business Document Dataset for Multi-task Information Extraction

Authors: Ran Zmigrod, Dongsheng Wang, Mathieu Sibue, Yulong Pei, Petr Babkin, Ivan Brugere, Xiaomo Liu, Nacho Navarro, Antony Papadimitriou, William Watson, Zhiqiang Ma, Armineh Nourbakhsh, Sameena Shah

Abstract: The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in a multi-modal domain. Several datasets exist for research on specific tasks of VRDU such as document classification (DC), key entity extraction (KEE), entity linking, visual question answering (VQA), inter alia. These datasets cover documents like invoices and receipts with sparse ann… ▽ More The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in a multi-modal domain. Several datasets exist for research on specific tasks of VRDU such as document classification (DC), key entity extraction (KEE), entity linking, visual question answering (VQA), inter alia. These datasets cover documents like invoices and receipts with sparse annotations such that they support one or two co-related tasks (e.g., entity extraction and entity linking). Unfortunately, only focusing on a single specific of documents or task is not representative of how documents often need to be processed in the wild - where variety in style and requirements is expected. In this paper, we introduce BuDDIE (Business Document Dataset for Information Extraction), the first multi-task dataset of 1,665 real-world business documents that contains rich and dense annotations for DC, KEE, and VQA. Our dataset consists of publicly available business entity documents from US state government websites. The documents are structured and vary in their style and layout across states and types (e.g., forms, certificates, reports, etc.). We provide data variety and quality metrics for BuDDIE as well as a series of baselines for each task. Our baselines cover traditional textual, multi-modal, and large language model approaches to VRDU. △ Less

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2403.18855 [pdf, other]

Directed Criteria Citation Recommendation and Ranking Through Link Prediction

Authors: William Watson, Lawrence Yong

Abstract: We explore link prediction as a proxy for automatically surfacing documents from existing literature that might be topically or contextually relevant to a new document. Our model uses transformer-based graph embeddings to encode the meaning of each document, presented as a node within a citation network. We show that the semantic representations that our model generates can outperform other conten… ▽ More We explore link prediction as a proxy for automatically surfacing documents from existing literature that might be topically or contextually relevant to a new document. Our model uses transformer-based graph embeddings to encode the meaning of each document, presented as a node within a citation network. We show that the semantic representations that our model generates can outperform other content-based methods in recommendation and ranking tasks. This provides a holistic approach to exploring citation graphs in domains where it is critical that these documents properly cite each other, so as to minimize the possibility of any inconsistencies △ Less

Submitted 18 March, 2024; originally announced March 2024.

Comments: Extended Abstract at the International Conference of AI in Finance (ICAIF '20)

arXiv:2212.08124 [pdf, other]

Elasticity Solver in Minecraft for Learning Mechanics of Materials by Gaming

Authors: Zachariah P. Beck, Brandon Alpert, Alexander J. Bowman, William R. Watson, Adrian Buganza Tepole

Abstract: Video games have emerged as a medium for learning by creating engaging environments, encouraging creative and deep thinking, and exposing learners to complex problems. Unfortunately, even though there are increasing examples of video games for many basic science and engineering concepts, similar efforts for higher level engineering concepts such as mechanics of materials are still lacking. Here we… ▽ More Video games have emerged as a medium for learning by creating engaging environments, encouraging creative and deep thinking, and exposing learners to complex problems. Unfortunately, even though there are increasing examples of video games for many basic science and engineering concepts, similar efforts for higher level engineering concepts such as mechanics of materials are still lacking. Here we present a mesh-free elasticity solver implementation in the popular video game Minecraft, a sandbox game where players can build any structure they can imagine. Modifications to the game, called mods in the Minecraft community, are a common feature of this platform. Our elasticity mod computes the stress and deformation of arbitrary structures and colors the blocks with a heat-map to visualize the result of the analysis. We used this mod in the Honors section of two courses taught at Purdue University: Basic Mechanics I Statics, Mechanics of Materials. This articles describes our experience developing and deploying this tool to encourage its use in biomedical engineering classrooms. A future goal is to engage the broader audience Minecraft players that already interact regularly with Minecraft mods. △ Less

Submitted 14 December, 2022; originally announced December 2022.

arXiv:2101.07342 [pdf, other]

doi 10.1039/D1AN00075F

Feature Fusion of Raman Chemical Imaging and Digital Histopathology using Machine Learning for Prostate Cancer Detection

Authors: Trevor Doherty, Susan McKeever, Nebras Al-Attar, Tiarnan Murphy, Claudia Aura, Arman Rahman, Amanda O'Neill, Stephen P Finn, Elaine Kay, William M. Gallagher, R. William G. Watson, Aoife Gowen, Patrick Jackman

Abstract: The diagnosis of prostate cancer is challenging due to the heterogeneity of its presentations, leading to the over diagnosis and treatment of non-clinically important disease. Accurate diagnosis can directly benefit a patient's quality of life and prognosis. Towards addressing this issue, we present a learning model for the automatic identification of prostate cancer. While many prostate cancer st… ▽ More The diagnosis of prostate cancer is challenging due to the heterogeneity of its presentations, leading to the over diagnosis and treatment of non-clinically important disease. Accurate diagnosis can directly benefit a patient's quality of life and prognosis. Towards addressing this issue, we present a learning model for the automatic identification of prostate cancer. While many prostate cancer studies have adopted Raman spectroscopy approaches, none have utilised the combination of Raman Chemical Imaging (RCI) and other imaging modalities. This study uses multimodal images formed from stained Digital Histopathology (DP) and unstained RCI. The approach was developed and tested on a set of 178 clinical samples from 32 patients, containing a range of non-cancerous, Gleason grade 3 (G3) and grade 4 (G4) tissue microarray samples. For each histological sample, there is a pathologist labelled DP - RCI image pair. The hypothesis tested was whether multimodal image models can outperform single modality baseline models in terms of diagnostic accuracy. Binary non-cancer/cancer models and the more challenging G3/G4 differentiation were investigated. Regarding G3/G4 classification, the multimodal approach achieved a sensitivity of 73.8% and specificity of 88.1% while the baseline DP model showed a sensitivity and specificity of 54.1% and 84.7% respectively. The multimodal approach demonstrated a statistically significant 12.7% AUC advantage over the baseline with a value of 85.8% compared to 73.1%, also outperforming models based solely on RCI and median Raman spectra. Feature fusion of DP and RCI does not improve the more trivial task of tumour identification but does deliver an observed advantage in G3/G4 discrimination. Building on these promising findings, future work could include the acquisition of larger datasets for enhanced model generalization. △ Less

Submitted 18 January, 2021; originally announced January 2021.

Comments: 19 pages, 8 tables, 18 figures

arXiv:1702.05376 [pdf, other]

Towards a Unified Taxonomy of Biclustering Methods

Authors: Dmitry I. Ignatov, Bruce W. Watson

Abstract: Being an unsupervised machine learning and data mining technique, biclustering and its multimodal extensions are becoming popular tools for analysing object-attribute data in different domains. Apart from conventional clustering techniques, biclustering is searching for homogeneous groups of objects while keeping their common description, e.g., in binary setting, their shared attributes. In bioinf… ▽ More Being an unsupervised machine learning and data mining technique, biclustering and its multimodal extensions are becoming popular tools for analysing object-attribute data in different domains. Apart from conventional clustering techniques, biclustering is searching for homogeneous groups of objects while keeping their common description, e.g., in binary setting, their shared attributes. In bioinformatics, biclustering is used to find genes, which are active in a subset of situations, thus being candidates for biomarkers. However, the authors of those biclustering techniques that are popular in gene expression analysis, may overlook the existing methods. For instance, BiMax algorithm is aimed at finding biclusters, which are well-known for decades as formal concepts. Moreover, even if bioinformatics classify the biclustering methods according to reasonable domain-driven criteria, their classification taxonomies may be different from survey to survey and not full as well. So, in this paper we propose to use concept lattices as a tool for taxonomy building (in the biclustering domain) and attribute exploration as means for cross-domain taxonomy completion. △ Less

Submitted 17 February, 2017; originally announced February 2017.

Comments: http://ceur-ws.org/Vol-1552/

MSC Class: 06B99; 62H30 ACM Class: I.5.3; H.2.8; I.2.6; I.2.4

Journal ref: Russian and South African Workshop on Knowledge Discovery Techniques Based on Formal Concept Analysis (RuZA 2015), November 30 - December 5, 2015, Stellenbosch, South Africa, In CEUR Workshop Proceedings, Vol. 1552, p. 23-39

Showing 1–9 of 9 results for author: Watson, W