Search | arXiv e-print repository

Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets

Authors: Melanie Walsh, Anna Preus, Maria Antoniak

Abstract: Large language models (LLMs) can now generate and recognize text in a wide range of styles and genres, including highly specialized, creative genres like poetry. But what do LLMs really know about poetry? What can they know about poetry? We develop a task to evaluate how well LLMs recognize a specific aspect of poetry, poetic form, for more than 20 forms and formal elements in the English language. Poetic form captures many different poetic features, including rhyme scheme, meter, and word or line repetition. We use this task to reflect on LLMs' current poetic capabilities, as well as the challenges and pitfalls of creating NLP benchmarks for poetry and for other creative tasks. In particular, we use this task to audit and reflect on the poems included in popular pretraining datasets. Our findings have implications for NLP researchers interested in model evaluation, digital humanities and cultural analytics scholars, and cultural heritage professionals.

Submitted 27 June, 2024; originally announced June 2024.

arXiv:2406.12108 [pdf]

Computing in the Life Sciences: From Early Algorithms to Modern AI

Authors: Samuel A. Donkor, Matthew E. Walsh, Alexander J. Titus

Abstract: Computing in the life sciences has undergone a transformative evolution, from early computational models in the 1950s to the applications of artificial intelligence (AI) and machine learning (ML) seen today. This paper highlights key milestones and technological advancements through the historical development of computing in the life sciences. The discussion includes the inception of computational models for biological processes, the advent of bioinformatics tools, and the integration of AI/ML in modern life sciences research. Attention is given to AI-enabled tools used in the life sciences, such as scientific large language models and bio-AI tools, examining their capabilities, limitations, and impact to biological risk. This paper seeks to clarify and establish essential terminology and concepts to ensure informed decision-making and effective communication across disciplines.

Submitted 18 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: 53 pages, 4 figures, 10 tables

arXiv:2403.15336 [pdf, other]

Dialogue Understandability: Why are we streaming movies with subtitles?

Authors: Helard Becerra Martinez, Alessandro Ragano, Diptasree Debnath, Asad Ullah, Crisron Rudolf Lucas, Martin Walsh, Andrew Hines

Abstract: Watching movies and TV shows with subtitles enabled is not simply down to audibility or speech intelligibility. A variety of evolving factors related to technological advances, cinema production and social behaviour challenge our perception and understanding. This study seeks to formalise and give context to these influential factors under a wider and novel term referred to as Dialogue Understandability. We propose a working definition for Dialogue Understandability being a listener's capacity to follow the story without undue cognitive effort or concentration being required that impacts their Quality of Experience (QoE). The paper identifies, describes and categorises the factors that influence Dialogue Understandability mapping them over the QoE framework, a media streaming lifecycle, and the stakeholders involved. We then explore available measurement tools in the literature and link them to the factors they could potentially be used for. The maturity and suitability of these tools is evaluated over a set of pilot experiments. Finally, we reflect on the gaps that still need to be filled, what we can measure and what not, future subjective experiments, and new research trends that could help us to fully characterise Dialogue Understandability.

Submitted 22 March, 2024; originally announced March 2024.

arXiv:2401.12755 [pdf, other]

Towards Risk Analysis of the Impact of AI on the Deliberate Biological Threat Landscape

Authors: Matthew E. Walsh

Abstract: The perception that the convergence of biological engineering and artificial intelligence (AI) could enable increased biorisk has recently drawn attention to the governance of biotechnology and artificial intelligence. The 2023 Executive Order, Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, requires an assessment of how artificial intelligence can increase biorisk. Within this perspective, quantitative and qualitative frameworks for evaluating biorisk are presented. Both frameworks are exercised using notional scenarios and their benefits and limitations are then discussed. Finally, the perspective concludes by noting that assessment and evaluation methodologies must keep pace with advances of AI in the life sciences.

Submitted 11 June, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

Comments: 15 pages, 1 figure, 3 tables

arXiv:2401.07340 [pdf]

The Afterlives of Shakespeare and Company in Online Social Readership

Authors: Maria Antoniak, David Mimno, Rosamond Thalken, Melanie Walsh, Matthew Wilkens, Gregory Yauney

Abstract: The growth of social reading platforms such as Goodreads and LibraryThing enables us to analyze reading activity at very large scale and in remarkable detail. But twenty-first century systems give us a perspective only on contemporary readers. Meanwhile, the digitization of the lending library records of Shakespeare and Company provides a window into the reading activity of an earlier, smaller community in interwar Paris. In this article, we explore the extent to which we can make comparisons between the Shakespeare and Company and Goodreads communities. By quantifying similarities and differences, we can identify patterns in how works have risen or fallen in popularity across these datasets. We can also measure differences in how works are received by measuring similarities and differences in co-reading patterns. Finally, by examining the complete networks of co-readership, we can observe changes in the overall structures of literary reception.

Submitted 14 January, 2024; originally announced January 2024.

arXiv:2312.09536 [pdf, other]

doi 10.18653/v1/2023.acl-demo.36

Riveter: Measuring Power and Social Dynamics Between Entities

Authors: Maria Antoniak, Anjalie Field, Jimin Mun, Melanie Walsh, Lauren F. Klein, Maarten Sap

Abstract: Riveter provides a complete easy-to-use pipeline for analyzing verb connotations associated with entities in text corpora. We prepopulate the package with connotation frames of sentiment, power, and agency, which have demonstrated usefulness for capturing social phenomena, such as gender bias, in a broad range of corpora. For decades, lexical frameworks have been foundational tools in computational social science, digital humanities, and natural language processing, facilitating multifaceted analysis of text corpora. But working with verb-centric lexica specifically requires natural language processing skills, reducing their accessibility to other researchers. By organizing the language processing pipeline, providing complete lexicon scores and visualizations for all entities in a corpus, and providing functionality for users to target specific research questions, Riveter greatly improves the accessibility of verb lexica and can facilitate a broad range of future research.

Submitted 15 December, 2023; originally announced December 2023.

Journal ref: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Volume 3: System Demonstrations, 2023, pages 377-388

arXiv:2307.15816 [pdf]

Multi-growth stage plant recognition: a case study of Palmer amaranth (Amaranthus palmeri) in cotton (Gossypium hirsutum)

Authors: Guy RY Coleman, Matthew Kutugata, Michael J Walsh, Muthukumar Bagavathiannan

Abstract: Many advanced, image-based precision agricultural technologies for plant breeding, field crop research, and site-specific crop management hinge on the reliable detection and phenotyping of plants across highly variable morphological growth stages. Convolutional neural networks (CNNs) have shown promise for image-based plant phenotyping and weed recognition, but their ability to recognize growth stages, often with stark differences in appearance, is uncertain. Amaranthus palmeri (Palmer amaranth) is a particularly challenging weed plant in cotton (Gossypium hirsutum) production, exhibiting highly variable plant morphology both across growth stages over a growing season, as well as between plants at a given growth stage due to high genetic diversity. In this paper, we investigate eight-class growth stage recognition of A. palmeri in cotton as a challenging model for You Only Look Once (YOLO) architectures. We compare 26 different architecture variants from YOLO v3, v5, v6, v6 3.0, v7, and v8 on an eight-class growth stage dataset of A. palmeri. The highest mAP@[0.5:0.95] for recognition of all growth stage classes was 47.34% achieved by v8-X, with inter-class confusion across visually similar growth stages. With all growth stages grouped as a single class, performance increased, with a maximum mean average precision (mAP@[0.5:0.95]) of 67.05% achieved by v7-Original. Single class recall of up to 81.42% was achieved by v5-X, and precision of up to 89.72% was achieved by v8-X. Class activation maps (CAM) were used to understand model attention on the complex dataset. Fewer classes, grouped by visual or size features improved performance over the ground-truth eight-class dataset. Successful growth stage detection highlights the substantial opportunity for improving plant phenotyping and weed recognition technologies with open-source object detection architectures.

Submitted 28 July, 2023; originally announced July 2023.

Comments: 27 pages, 10 figures, 5 tables

arXiv:2307.11502 [pdf, other]

doi 10.5281/zenodo.10420938

Software engineering to sustain a high-performance computing scientific application: QMCPACK

Authors: William F. Godoy, Steven E. Hahn, Michael M. Walsh, Philip W. Fackler, Jaron T. Krogel, Peter W. Doak, Paul R. C. Kent, Alfredo A. Correa, Ye Luo, Mark Dewing

Abstract: We provide an overview of the software engineering efforts and their impact in QMCPACK, a production-level ab-initio Quantum Monte Carlo open-source code targeting high-performance computing (HPC) systems. Aspects included are: (i) strategic expansion of continuous integration (CI) targeting CPUs, using GitHub Actions runners, and NVIDIA and AMD GPUs in pre-exascale systems, using self-hosted hardware; (ii) incremental reduction of memory leaks using sanitizers, (iii) incorporation of Docker containers for CI and reproducibility, and (iv) refactoring efforts to improve maintainability, testing coverage, and memory lifetime management. We quantify the value of these improvements by providing metrics to illustrate the shift towards a predictive, rather than reactive, sustainable maintenance approach. Our goal, in documenting the impact of these efforts on QMCPACK, is to contribute to the body of knowledge on the importance of research software engineering (RSE) for the sustainability of community HPC codes and scientific discovery at scale.

Submitted 21 July, 2023; originally announced July 2023.

Comments: Accepted at the first US-RSE Conference, USRSE2023, https://us-rse.org/usrse23/, 8 pages, 3 figures, 4 tables

arXiv:2305.10311 [pdf]

Investigating image-based fallow weed detection performance on Raphanus sativus and Avena sativa at speeds up to 30 km h$^{-1}$

Authors: Guy R. Y. Coleman, Angus Macintyre, Michael J. Walsh, William T. Salter

Abstract: Site-specific weed control (SSWC) can provide considerable reductions in weed control costs and herbicide usage. Despite the promise of machine vision for SSWC systems and the importance of ground speed in weed control efficacy, there has been little investigation of the role of ground speed and camera characteristics on weed detection performance. Here, we compare the performance of four camera-software combinations using the open-source OpenWeedLocator platform - (1) default settings on a Raspberry Pi HQ camera, (2) optimised software settings on a HQ camera, (3) optimised software settings on the Raspberry Pi v2 camera, and (4) a global shutter Arducam AR0234 camera - at speeds ranging from 5 km h$^{-1}$ to 30 km h$^{-1}$. A combined excess green (ExG) and hue, saturation, value (HSV) thresholding algorithm was used for testing under fallow conditions using tillage radish (Raphanus sativus) and forage oats (Avena sativa) as representative broadleaf and grass weeds, respectively. ARD demonstrated the highest recall among camera systems, with up to 95.7% of weeds detected at 5 km h$^{-1}$ and 85.7% at 30 km h$^{-1}$. HQ1 and V2 cameras had the lowest recall of 31.1% and 26.0% at 30 km h$^{-1}$, respectively. All cameras experienced a decrease in recall as speed increased. The highest rate of decrease was observed for HQ1 with 1.12% and 0.90% reductions in recall for every km h$^{-1}$ increase in speed for tillage radish and forage oats, respectively. Detection of the grassy forage oats was worse (P<0.05) than the broadleaved tillage radish for all cameras. Despite the variations in recall, HQ1, HQ2, and V2 maintained near-perfect precision at all tested speeds. The variable effect of ground speed and camera system on detection performance of grass and broadleaf weeds, indicates that careful hardware and software considerations must be made when developing SSWC systems.

Submitted 17 May, 2023; originally announced May 2023.

Comments: 15 pages, 9 figures, 3 tables

ACM Class: C.3; I.4.8; J.3
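
Illustrative note on the entry above: the abstract mentions a combined excess green (ExG) and HSV thresholding algorithm. The sketch below shows only the ExG half, using the common definition ExG = 2G - R - B. The function name `excess_green_mask` and the cutoff `exg_min=25` are hypothetical choices for illustration; they are not taken from the paper or from OpenWeedLocator.

```python
import numpy as np

def excess_green_mask(rgb, exg_min=25):
    """Return a boolean vegetation mask for an H x W x 3 uint8 RGB image.

    exg_min is a hypothetical cutoff chosen for this example only.
    """
    # Use a signed integer type so 2*G - R - B cannot wrap around in uint8.
    r, g, b = (rgb[..., i].astype(np.int32) for i in range(3))
    exg = 2 * g - r - b
    return exg > exg_min

# A 1 x 2 test image: one strongly green pixel and one neutral grey pixel.
image = np.array([[[40, 200, 30], [120, 120, 120]]], dtype=np.uint8)
mask = excess_green_mask(image)
```

In a combined scheme, a mask like this would typically be intersected with a second mask computed from hue, saturation, and value ranges; that step is omitted here.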

arXiv:2204.09042 [pdf, other]

Accelerating Inhibitor Discovery With A Deep Generative Foundation Model: Validation for SARS-CoV-2 Drug Targets

Authors: Vijil Chenthamarakshan, Samuel C. Hoffman, C. David Owen, Petra Lukacik, Claire Strain-Damerell, Daren Fearon, Tika R. Malla, Anthony Tumber, Christopher J. Schofield, Helen M. E. Duyvesteyn, Wanwisa Dejnirattisai, Loic Carrique, Thomas S. Walter, Gavin R. Screaton, Tetiana Matviiuk, Aleksandra Mojsilovic, Jason Crain, Martin A. Walsh, David I. Stuart, Payel Das

Abstract: The discovery of novel inhibitor molecules for emerging drug-target proteins is widely acknowledged as a challenging inverse design problem: Exhaustive exploration of the vast chemical search space is impractical, especially when the target structure or active molecules are unknown. Here we validate experimentally the broad utility of a deep generative framework trained at-scale on protein sequences, small molecules, and their mutual interactions -- that is unbiased toward any specific target. As demonstrators, we consider two dissimilar and relevant SARS-CoV-2 targets: the main protease and the spike protein (receptor binding domain, RBD). To perform target-aware design of novel inhibitor molecules, a protein sequence-conditioned sampling on the generative foundation model is performed. Despite using only the target sequence information, and without performing any target-specific adaptation of the generative model, micromolar-level inhibition was observed in in vitro experiments for two candidates out of only four synthesized for each target. The most potent spike RBD inhibitor also exhibited activity against several variants in live virus neutralization assays. These results therefore establish that a single, broadly deployable generative foundation model for accelerated hit discovery is effective and efficient, even in the most general case where neither target structure nor binder information is available.

Submitted 14 October, 2022; v1 submitted 19 April, 2022; originally announced April 2022.

Comments: Revised title, abstract, and text; additional figures

arXiv:2106.15353 [pdf, other]

Patient-independent Schizophrenia Relapse Prediction Using Mobile Sensor based Daily Behavioral Rhythm Changes

Authors: Bishal Lamichhane, Dror Ben-Zeev, Andrew Campbell, Tanzeem Choudhury, Marta Hauser, John Kane, Mikio Obuchi, Emily Scherer, Megan Walsh, Rui Wang, Weichen Wang, Akane Sano

Abstract: A schizophrenia relapse has severe consequences for a patient's health, work, and sometimes even life safety. If an oncoming relapse can be predicted on time, for example by detecting early behavioral changes in patients, then interventions could be provided to prevent the relapse. In this work, we investigated a machine learning based schizophrenia relapse prediction model using mobile sensing data to characterize behavioral features. A patient-independent model providing sequential predictions, closely representing the clinical deployment scenario for relapse prediction, was evaluated. The model uses the mobile sensing data from the recent four weeks to predict an oncoming relapse in the next week. We used the behavioral rhythm features extracted from daily templates of mobile sensing data, self-reported symptoms collected via EMA (Ecological Momentary Assessment), and demographics to compare different classifiers for the relapse prediction. Naive Bayes based model gave the best results with an F2 score of 0.083 when evaluated in a dataset consisting of 63 schizophrenia patients, each monitored for up to a year. The obtained F2 score, though low, is better than the baseline performance of random classification (F2 score of 0.02 $\pm$ 0.024). Thus, mobile sensing has predictive value for detecting an oncoming relapse and needs further investigation to improve the current performance. Towards that end, further feature engineering and model personalization based on the behavioral idiosyncrasies of a patient could be helpful.

Submitted 25 June, 2021; originally announced June 2021.

Comments: EAI MobiHealth 2020
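
Illustrative note on the entry above: the abstract reports results as F2 scores. The sketch below uses the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), which with beta = 2 weights recall more heavily than precision. The function name `f_beta` and the example values are assumptions for illustration; they do not reproduce the paper's evaluation.

```python
def f_beta(precision, recall, beta=2.0):
    """Standard F-beta score; beta=2 gives the F2 score."""
    if precision == 0 and recall == 0:
        # By convention the score is 0 when both inputs are 0.
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Swapping precision and recall changes F2 because recall counts more.
high_recall = f_beta(precision=0.5, recall=1.0)
high_precision = f_beta(precision=1.0, recall=0.5)
```

Here `high_recall` exceeds `high_precision`, which is why F2 suits settings such as relapse prediction where a missed event is costlier than a false alarm.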

arXiv:2103.14872 [pdf, other]

doi 10.1007/s11119-023-10073-1

Deep Learning Techniques for In-Crop Weed Identification: A Review

Authors: Kun Hu, Zhiyong Wang, Guy Coleman, Asher Bender, Tingting Yao, Shan Zeng, Dezhen Song, Arnold Schumann, Michael Walsh

Abstract: Weeds are a significant threat to the agricultural productivity and the environment. The increasing demand for sustainable agriculture has driven innovations in accurate weed control technologies aimed at reducing the reliance on herbicides. With the great success of deep learning in various vision tasks, many promising image-based weed detection algorithms have been developed. This paper reviews recent developments of deep learning techniques in the field of image-based weed detection. The review begins with an introduction to the fundamentals of deep learning related to weed detection. Next, recent progresses on deep weed detection are reviewed with the discussion of the research materials including public weed datasets. Finally, the challenges of developing practically deployable weed detection methods are summarized, together with the discussions of the opportunities for future research. We hope that this review will provide a timely survey of the field and attract more researchers to address this inter-disciplinary research problem.

Submitted 4 February, 2024; v1 submitted 27 March, 2021; originally announced March 2021.

arXiv:2011.06455 [pdf]

doi 10.1098/rsos.210429

Optimal governance and implementation of vaccination programmes to contain the COVID-19 pandemic

Authors: Mahendra Piraveenan, Shailendra Sawleshwarkar, Michael Walsh, Iryna Zablotska, Samit Bhattacharyya, Habib Hassan Farooqui, Tarun Bhatnagar, Anup Karan, Manoj Murhekar, Sanjay Zodpey, K. S. Mallikarjuna Rao, Philippa Pattison, Albert Zomaya, Matjaz Perc

Abstract: Since the recent introduction of several viable vaccines for SARS-CoV-2, vaccination uptake has become the key factor that will determine our success in containing the COVID-19 pandemic. We argue that game theory and social network models should be used to guide decisions pertaining to vaccination programmes for the best possible results. In the months following the introduction of vaccines, their availability and the human resources needed to run the vaccination programmes have been scarce in many countries. Vaccine hesitancy is also being encountered from some sections of the general public. We emphasize that decision-making under uncertainty and imperfect information, and with only conditionally optimal outcomes, is a unique forte of established game-theoretic modelling. Therefore, we can use this approach to obtain the best framework for modelling and simulating vaccination prioritization and uptake that will be readily available to inform important policy decisions for the optimal control of the COVID-19 pandemic.

Submitted 9 June, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

Comments: 15 pages, 1 figure; published in Royal Society Open Science

Journal ref: R. Soc. Open Sci. 8, 210429 (2021)

arXiv:1802.06515 [pdf, other]

Image Forensics: Detecting duplication of scientific images with manipulation-invariant image similarity

Authors: M. Cicconet, H. Elliott, D. L. Richmond, D. Wainstock, M. Walsh

Abstract: Manipulation and re-use of images in scientific publications is a concerning problem that currently lacks a scalable solution. Current tools for detecting image duplication are mostly manual or semi-automated, despite the availability of an overwhelming target dataset for a learning-based approach. This paper addresses the problem of determining if, given two images, one is a manipulated version of the other by means of copy, rotation, translation, scale, perspective transform, histogram adjustment, or partial erasing. We propose a data-driven solution based on a 3-branch Siamese Convolutional Neural Network. The ConvNet model is trained to map images into a 128-dimensional space, where the Euclidean distance between duplicate images is smaller than or equal to 1, and the distance between unique images is greater than 1. Our results suggest that such an approach has the potential to improve surveillance of the published and in-peer-review literature for image manipulation.

Submitted 17 March, 2020; v1 submitted 18 February, 2018; originally announced February 2018.

Comments: 12 pages; 6 figures; keywords: siamese network, similarity metric, image forensics, image manipulation
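
Illustrative note on the entry above: the abstract describes a decision rule in a 128-dimensional embedding space, where duplicates lie at Euclidean distance at most 1 and unique images lie farther apart. The sketch below applies that rule to made-up vectors. The function name `is_duplicate` and the random embeddings are hypothetical stand-ins; no trained network is involved.

```python
import numpy as np

def is_duplicate(emb_a, emb_b, threshold=1.0):
    """Apply the distance rule from the abstract: duplicate iff distance <= 1."""
    diff = np.asarray(emb_a, dtype=float) - np.asarray(emb_b, dtype=float)
    return float(np.linalg.norm(diff)) <= threshold

# Stand-in 128-dimensional embeddings (not produced by any real model).
rng = np.random.default_rng(0)
anchor = rng.normal(size=128)
near = anchor + 0.01   # offset of 0.01 per dimension: distance is about 0.11
far = anchor + 1.0     # offset of 1.0 per dimension: distance is about 11.3

near_is_dup = is_duplicate(anchor, near)
far_is_dup = is_duplicate(anchor, far)
```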

arXiv:1704.08931 [pdf, other]

A Framework for Rate Efficient Control of Distributed Discrete Systems

Authors: Jie Ren, Solmaz Torabi, John MacLaren Walsh

Abstract: A key issue in the control of distributed discrete systems modeled as Markov decision processes is that often the state of the system is not directly observable at any single location in the system. The participants in the control scheme must share information with one another regarding the state of the system in order to collectively make informed control decisions, but this information sharing can be costly. Harnessing recent results from information theory regarding distributed function computation, in this paper we derive, for several information sharing model structures, the minimum amount of control information that must be exchanged to enable local participants to derive the same control decisions as an imaginary omniscient controller having full knowledge of the global state. Incorporating consideration for this amount of information that must be exchanged into the reward enables one to trade the competing objectives of minimizing this control information exchange and maximizing the performance of the controller. An alternating optimization framework is then provided to help find the efficient controllers and messaging schemes. A series of running examples from wireless resource allocation illustrate the ideas and design tradeoffs.

Submitted 28 April, 2017; originally announced April 2017.

arXiv:1704.01891 [pdf, other]

On Multi-source Networks: Enumeration, Rate Region Computation, and Hierarchy

Authors: Congduan Li, Steven Weber, John MacLaren Walsh

Abstract: Recent algorithmic developments have enabled computers to automatically determine and prove the capacity regions of small hypergraph networks under network coding. A structural theory relating network coding problems of different sizes is developed to make best use of this newfound computational capability. A formal notion of network minimality is developed which removes components of a network coding problem that are inessential to its core complexity. Equivalence between different network coding problems under relabeling is formalized via group actions, and an algorithm which can directly list single representatives from each equivalence class of minimal networks up to a prescribed network size is presented. This algorithm, together with rate region software, is leveraged to create a database containing the rate regions for all minimal network coding problems with five or fewer sources and edges, a collection of 744119 equivalence classes representing more than 9 million networks. In order to best learn from this database, and to leverage it to infer rate regions and their characteristics of networks at scale, a hierarchy between different network coding problems is created with a new theory of combinations and embedding operators.

Submitted 6 April, 2017; originally announced April 2017.

Comments: 20 pages with double column, revision of previous submission arXiv:1507.05728

arXiv:1607.06833 [pdf, other]

Explicit Polyhedral Bounds on Network Coding Rate Regions via Entropy Function Region: Algorithms, Symmetry, and Computation

Authors: Jayant Apte, John MacLaren Walsh

Abstract: Automating the solutions of multiple network information theory problems, stretching from fundamental concerns such as determining all information inequalities and the limitations of linear codes, to applied ones such as designing coded networks, distributed storage systems, and caching systems, can be posed as polyhedral projections. These problems are demonstrated to exhibit multiple types of polyhedral symmetries. It is shown how these symmetries can be exploited to reduce the complexity of solving these problems through polyhedral projection.

Submitted 6 July, 2017; v1 submitted 22 July, 2016; originally announced July 2016.

Comments: 23 pages, 15 figures

arXiv:1605.04598 [pdf, other]

Constrained Linear Representability of Polymatroids and Algorithms for Computing Achievability Proofs in Network Coding

Authors: Jayant Apte, John MacLaren Walsh

Abstract: The constrained linear representability problem (CLRP) for polymatroids determines whether there exists a polymatroid that is linear over a specified field while satisfying a collection of constraints on the rank function. Using a computer to test whether a certain rate vector is achievable with vector linear network codes for a multi-source network coding instance and whether there exists a multi-linear secret sharing scheme achieving a specified information ratio for a given secret sharing instance are shown to be special cases of CLRP. Methods for solving CLRP built from group theoretic techniques for combinatorial generation are developed and described. These techniques form the core of an information theoretic achievability prover; an implementation accompanies the article, and several computational experiments with interesting instances of network coding and secret sharing demonstrating the utility of the method are provided.

Submitted 1 February, 2017; v1 submitted 15 May, 2016; originally announced May 2016.

Comments: submitted to IEEE Transactions on Information Theory, (this version: corrected figure 9)

arXiv:1605.01744 [pdf, other]

Improving Automated Patent Claim Parsing: Dataset, System, and Experiments

Authors: Mengke Hu, David Cinciruk, John MacLaren Walsh

Abstract: Off-the-shelf natural language processing software performs poorly when parsing patent claims owing to their use of irregular language relative to the corpora built from news articles and the web typically utilized to train this software. Stopping short of the extensive and expensive process of accumulating a large enough dataset to completely retrain parsers for patent claims, a method of adapting existing natural language processing software towards patent claims via forced part of speech tag correction is proposed. An Amazon Mechanical Turk collection campaign organized to generate a public corpus to train such an improved claim parsing system is discussed, identifying lessons learned during the campaign that can be of use in future NLP dataset collection campaigns with AMT. Experiments utilizing this corpus and other patent claim sets measure the parsing performance improvement garnered via the claim parsing system. Finally, the utility of the improved claim parsing system within other patent processing applications is demonstrated via experiments showing improved automated patent subject classification when the new claim parsing system is utilized to generate the features.

Submitted 5 May, 2016; originally announced May 2016.

arXiv:1512.03324 [pdf, other]

Mapping the Region of Entropic Vectors with Support Enumeration & Information Geometry

Authors: Yunshu Liu, John MacLaren Walsh

Abstract: The region of entropic vectors is a convex cone that has been shown to be at the core of many fundamental limits for problems in multiterminal data compression, network coding, and multimedia transmission. This cone has been shown to be non-polyhedral for four or more random variables; however, its boundary remains unknown for four or more discrete random variables. Methods for specifying probability distributions that are in faces and on the boundary of the convex cone are derived, then utilized to map optimized inner bounds to the unknown part of the entropy region. The first method utilizes tools and algorithms from abstract algebra to efficiently determine those supports for the joint probability mass functions for four or more random variables that can, for some appropriate set of non-zero probabilities, yield entropic vectors in the gap between the best known inner and outer bounds. These supports are utilized, together with numerical optimization over non-zero probabilities, to provide inner bounds to the unknown part of the entropy region. Next, information geometry is utilized to parameterize and study the structure of probability distributions on these supports yielding entropic vectors in the faces of entropy and in the unknown part of the entropy region.

Submitted 10 December, 2015; originally announced December 2015.

arXiv:1507.05728 [pdf, other]

On Multi-source Networks: Enumeration, Rate Region Computation, and Hierarchy

Authors: Congduan Li, Steven Weber, John MacLaren Walsh

Abstract: This paper investigates the enumeration, rate region computation, and hierarchy of general multi-source multi-sink hyperedge networks under network coding, which includes multiple network models, such as independent distributed storage systems and index coding problems, as special cases. A notion of minimal networks and a notion of network equivalence under group action are defined. An efficient algorithm capable of directly listing single minimal canonical representatives from each network equivalence class is presented and utilized to list all minimal canonical networks with up to 5 sources and hyperedges. Computational tools are then applied to obtain the rate regions of all of these canonical networks, providing exact expressions for 744,119 newly solved network coding rate regions corresponding to more than 2 trillion isomorphic network coding problems. In order to better understand and analyze the huge repository of rate regions through hierarchy, several embedding and combination operations are defined so that the rate region of the network after operation can be derived from the rate regions of networks involved in the operation. The embedding operations enable the definition and determination of a list of forbidden network minors for the sufficiency of classes of linear codes. The combination operations enable the rate regions of some larger networks to be obtained as the combination of the rate regions of smaller networks. The integration of both the combinations and embedding operators is then shown to enable the calculation of rate regions for many networks not reachable via combination operations alone.

Submitted 21 July, 2015; originally announced July 2015.

Comments: 63 pages, submitted to TransIT

arXiv:1505.04202 [pdf, other]

doi 10.1109/TSP.2015.2483479

Interactive Scalar Quantization for Distributed Resource Allocation

Authors: Bradford D. Boyle, Jie Ren, John MacLaren Walsh, Steven Weber

Abstract: In many resource allocation problems, a centralized controller needs to award some resource to a user selected from a collection of distributed users with the goal of maximizing the utility the user would receive from the resource. This can be modeled as the controller computing an extremum of the distributed users' utilities. The overhead rate necessary to enable the controller to reproduce the users' local state can be prohibitively high. An approach to reduce this overhead is interactive communication wherein rate savings are achieved by tolerating an increase in delay. In this paper, we consider the design of a simple achievable scheme based on successive refinements of scalar quantization at each user. The optimal quantization policy is computed via a dynamic program and we demonstrate that tolerating a small increase in delay can yield significant rate savings. We then consider two simpler quantization policies to investigate the scaling properties of the rate-delay trade-offs. Using a combination of these simpler policies, the performance of the optimal policy can be closely approximated with lower computational costs.

Submitted 6 September, 2015; v1 submitted 15 May, 2015; originally announced May 2015.

Comments: 31 pages, 9 figures. Submitted on 2015-05-15 to IEEE Transactions on Signal Processing. Revised 2015-09-06

arXiv:1408.3661 [pdf, other]

Overhead Performance Tradeoffs - A Resource Allocation Perspective

Authors: Jie Ren, Bradford D. Boyle, Gwanmo Ku, Steven Weber, John MacLaren Walsh

Abstract: A key aspect of many resource allocation problems is the need for the resource controller to compute a function, such as the max or arg max, of the competing users' metrics. Information must be exchanged between the competing users and the resource controller in order for this function to be computed. In many practical resource controllers the competing users' metrics are communicated to the resource controller, which then computes the desired extremization function. However, in this paper it is shown that information rate savings can be obtained by recognizing that the controller only needs to determine the result of this extremization function. If the extremization function is to be computed losslessly, the rate savings are shown in most cases to be at most 2 bits independent of the number of competing users. Motivated by the small savings in the lossless case, simple achievable schemes for both the lossy and interactive variants of this problem are considered. It is shown that both of these approaches have the potential to realize large rate savings, especially in the case where the number of competing users is large. For the lossy variant, it is shown that the proposed simple achievable schemes are in fact close to the fundamental limit given by the rate distortion function.

Submitted 15 August, 2014; originally announced August 2014.

Comments: 70 pages, 18 figures, Submitted to IEEE Transactions on Information Theory on 2014-08-14

arXiv:1408.3469 [pdf, other]

doi 10.1109/TIT.2016.2640302

Properties of an Aloha-like stability region

Authors: Nan Xie, John MacLaren Walsh, Steven Weber

Abstract: A well-known inner bound on the stability region of the finite-user slotted Aloha protocol is the set of all arrival rates for which there exists some choice of the contention probabilities such that the associated worst-case service rate for each user exceeds the user's arrival rate, denoted $Λ$. Although testing membership in $Λ$ of a given arrival rate can be posed as a convex program, it is nonetheless of interest to understand the properties of this set. In this paper we develop new results of this nature, including $i)$ an equivalence between membership in $Λ$ and the existence of a positive root of a given polynomial, $ii)$ a method to construct a vector of contention probabilities to stabilize any stabilizable arrival rate vector, $iii)$ the volume of $Λ$, $iv)$ explicit polyhedral, spherical, and ellipsoid inner and outer bounds on $Λ$, and $v)$ characterization of the generalized convexity properties of a natural "excess rate" function associated with $Λ$, including the convexity of the set of contention probabilities that stabilize a given arrival rate vector.

Submitted 4 January, 2017; v1 submitted 15 August, 2014; originally announced August 2014.

Comments: 28 pages, 9 figures. Submitted August 15, 2014, revised September 21, 2015 and August 31, 2016, and accepted November 06, 2016 for publication in IEEE Transactions on Information Theory. Preliminary results presented at ISIT 2010, ITA 2010, and ITA 2011. DOI: 10.1109/TIT.2016.2640302. Copyright transferred to IEEE. This is the last version uploaded by the authors prior to the IEEE proofing process

arXiv:1407.5659 [pdf, other]

Multilevel Diversity Coding Systems: Rate Regions, Codes, Computation, & Forbidden Minors

Authors: Congduan Li, Steven Weber, John MacLaren Walsh

Abstract: The rate regions of multilevel diversity coding systems (MDCS), a sub-class of the broader family of multi-source multi-sink networks with special structure, are investigated. After showing how to enumerate all non-isomorphic MDCS instances of a given size, the Shannon outer bound and several achievable inner bounds based on linear codes are given for the rate region of each non-isomorphic instance. For thousands of MDCS instances, the bounds match, and hence exact rate regions are proven. Results gained from these computations are summarized in key statistics involving aspects such as the sufficiency of scalar binary codes, the necessary size of vector binary codes, etc. Also, it is shown how to generate computer aided human readable converse proofs, as well as how to construct the codes for an achievability proof. Based on this large repository of rate regions, a series of results about general MDCS cases that they inspired are introduced and proved. In particular, a series of embedding operations that preserve the property of sufficiency of scalar or vector codes are presented. The utility of these operations is demonstrated by boiling the thousands of MDCS instances for which binary scalar codes are insufficient down to 12 forbidden smallest embedded MDCS instances.

Submitted 26 August, 2014; v1 submitted 21 July, 2014; originally announced July 2014.

Comments: Submitted to IEEE Transactions on Information Theory, 52 pages

Showing 1–25 of 25 results for author: Walsh, M