Entropy Coding of Unordered Data Structures

Authors: Julius Kunze, Daniel Severo, Giulio Zani, Jan-Willem van de Meent, James Townsend

Abstract: We present shuffle coding, a general method for optimal compression of sequences of unordered objects using bits-back coding. Data structures that can be compressed using shuffle coding include multisets, graphs, hypergraphs, and others. We release an implementation that can easily be adapted to different data types and statistical models, and demonstrate that our implementation achieves state-of-the-art compression rates on a range of graph datasets including molecular data.

Submitted 16 August, 2024; originally announced August 2024.

Comments: Published at ICLR 2024

arXiv:2310.05986 [pdf, other]

The Unreasonable Effectiveness of Linear Prediction as a Perceptual Metric

Authors: Daniel Severo, Lucas Theis, Johannes Ballé

Abstract: We show how perceptual embeddings of the visual system can be constructed at inference-time with no training data or deep neural network features. Our perceptual embeddings are solutions to a weighted least squares (WLS) problem, defined at the pixel-level, and solved at inference-time, that can capture global and local image characteristics. The distance in embedding space is used to define a perceptual similarity metric which we call LASI: Linear Autoregressive Similarity Index. Experiments on full-reference image quality assessment datasets show LASI performs competitively with learned deep feature based methods like LPIPS (Zhang et al., 2018) and PIM (Bhardwaj et al., 2020), at a similar computational cost to hand-crafted methods such as MS-SSIM (Wang et al., 2003). We found that increasing the dimensionality of the embedding space consistently reduces the WLS loss while increasing performance on perceptual tasks, at the cost of increasing the computational complexity. LASI is fully differentiable, scales cubically with the number of embedding dimensions, and can be parallelized at the pixel-level. A Maximum Differentiation (MAD) competition (Wang & Simoncelli, 2008) between LASI and LPIPS shows that both methods are capable of finding failure points for the other, suggesting these metrics can be combined.

Submitted 6 October, 2023; originally announced October 2023.

arXiv:2305.09705 [pdf, other]

Random Edge Coding: One-Shot Bits-Back Coding of Large Labeled Graphs

Authors: Daniel Severo, James Townsend, Ashish Khisti, Alireza Makhzani

Abstract: We present a one-shot method for compressing large labeled graphs called Random Edge Coding. When paired with a parameter-free model based on Pólya's Urn, the worst-case computational and memory complexities scale quasi-linearly and linearly with the number of observed edges, making it efficient on sparse graphs, and requires only integer arithmetic. Key to our method is bits-back coding, which is used to sample edges and vertices without replacement from the edge-list in a way that preserves the structure of the graph. Optimality is proven under a class of random graph models that are invariant to permutations of the edges and of vertices within an edge. Experiments indicate Random Edge Coding can achieve competitive compression performance on real-world network datasets and scales to graphs with millions of nodes and edges.

Submitted 16 May, 2023; originally announced May 2023.

Comments: Published at ICML 2023

arXiv:2210.06662 [pdf, other]

Action Matching: Learning Stochastic Dynamics from Samples

Authors: Kirill Neklyudov, Rob Brekelmans, Daniel Severo, Alireza Makhzani

Abstract: Learning the continuous dynamics of a system from snapshots of its temporal marginals is a problem which appears throughout natural sciences and machine learning, including in quantum systems, single-cell biological data, and generative modeling. In these settings, we assume access to cross-sectional samples that are uncorrelated over time, rather than full trajectories of samples. In order to better understand the systems under observation, we would like to learn a model of the underlying process that allows us to propagate samples in time and thereby simulate entire individual trajectories. In this work, we propose Action Matching, a method for learning a rich family of dynamics using only independent samples from its time evolution. We derive a tractable training objective, which does not rely on explicit assumptions about the underlying dynamics and does not require back-propagation through differential equations or optimal transport solvers. Inspired by connections with optimal transport, we derive extensions of Action Matching to learn stochastic differential equations and dynamics involving creation and destruction of probability mass. Finally, we showcase applications of Action Matching by achieving competitive performance in a diverse set of experiments from biology, physics, and generative modeling.

Submitted 8 June, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: Published in ICML 2023

arXiv:2112.13687 [pdf]

Predição de Incidência de Lesão por Pressão em Pacientes de UTI usando Aprendizado de Máquina [Predicting Pressure Injury Incidence in ICU Patients Using Machine Learning]

Authors: Henrique P. Silva, Arthur D. Reys, Daniel S. Severo, Dominique H. Ruther, Flávio A. O. B. Silva, Maria C. S. S. Guimarães, Roberto Z. A. Pinto, Saulo D. S. Pedro, Túlio P. Navarro, Danilo Silva

Abstract: Pressure ulcers have high prevalence in ICU patients but are preventable if identified in initial stages. In practice, the Braden scale is used to classify high-risk patients. This paper investigates the use of machine learning in electronic health records data for this task, by using data available in MIMIC-III v1.4. Two main contributions are made: a new approach for evaluating models that considers all predictions made during a stay, and a new training method for the machine learning models. The results show a superior performance in comparison to the state of the art; moreover, all models surpass the Braden scale in every operating point in the precision-recall curve.

Submitted 23 December, 2021; originally announced December 2021.

Comments: 3 pages, 1 figure, in Portuguese, accepted at XVIII Congresso Brasileiro de Informática em Saúde (CBIS 2021)

arXiv:2107.09716 [pdf, other]

Regularized Classification-Aware Quantization

Authors: Daniel Severo, Elad Domanovitz, Ashish Khisti

Abstract: Traditionally, quantization is designed to minimize the reconstruction error of a data source. When considering downstream classification tasks, other measures of distortion can be of interest; such as the 0-1 classification loss. Furthermore, it is desirable that the performance of these quantizers not deteriorate once they are deployed into production, as relearning the scheme online is not always possible. In this work, we present a class of algorithms that learn distributed quantization schemes for binary classification tasks. Our method performs well on unseen data, and is faster than previous methods proportional to a quadratic term of the dataset size. It works by regularizing the 0-1 loss with the reconstruction error. We present experiments on synthetic mixture and bivariate Gaussian data and compare training, testing, and generalization errors with a family of benchmark quantization schemes from the literature. Our method is called Regularized Classification-Aware Quantization.

Submitted 12 July, 2021; originally announced July 2021.

Comments: Accepted to the 30th Biennial Symposium on Communications (BSC) 2021

arXiv:2107.09202 [pdf, other]

doi 10.1109/JSAIT.2023.3245417

Compressing Multisets with Large Alphabets using Bits-Back Coding

Authors: Daniel Severo, James Townsend, Ashish Khisti, Alireza Makhzani, Karen Ullrich

Abstract: Current methods which compress multisets at an optimal rate have computational complexity that scales linearly with alphabet size, making them too slow to be practical in many real-world settings. We show how to convert a compression algorithm for sequences into one for multisets, in exchange for an additional complexity term that is quasi-linear in sequence length. This allows us to compress multisets of exchangeable symbols at an optimal rate, with computational complexity decoupled from the alphabet size. The key insight is to avoid encoding the multiset directly, and instead compress a proxy sequence, using a technique called 'bits-back coding'. We demonstrate the method experimentally on tasks which are intractable with previous optimal-rate methods: compression of multisets of images and JavaScript Object Notation (JSON) files. Code for our experiments is available at https://github.com/facebookresearch/multiset-compression.

Submitted 27 February, 2023; v1 submitted 15 July, 2021; originally announced July 2021.

Journal ref: IEEE Journal on Selected Areas in Information Theory, 2023

arXiv:2102.11086 [pdf, other]

Improving Lossless Compression Rates via Monte Carlo Bits-Back Coding

Authors: Yangjun Ruan, Karen Ullrich, Daniel Severo, James Townsend, Ashish Khisti, Arnaud Doucet, Alireza Makhzani, Chris J. Maddison

Abstract: Latent variable models have been successfully applied in lossless compression with the bits-back coding algorithm. However, bits-back suffers from an increase in the bitrate equal to the KL divergence between the approximate posterior and the true posterior. In this paper, we show how to remove this gap asymptotically by deriving bits-back coding algorithms from tighter variational bounds. The key idea is to exploit extended space representations of Monte Carlo estimators of the marginal likelihood. Naively applied, our schemes would require more initial bits than the standard bits-back coder, but we show how to drastically reduce this additional cost with couplings in the latent space. When parallel architectures can be exploited, our coders can achieve better rates than bits-back with little additional cost. We demonstrate improved lossless compression rates in a variety of settings, especially in out-of-distribution or sequential data compression.

Submitted 14 June, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

arXiv:2008.01515 [pdf, other]

doi 10.1007/978-3-030-61377-8_39

Predicting Multiple ICD-10 Codes from Brazilian-Portuguese Clinical Notes

Authors: Arthur D. Reys, Danilo Silva, Daniel Severo, Saulo Pedro, Marcia M. de Souza e Sá, Guilherme A. C. Salgado

Abstract: ICD coding from electronic clinical records is a manual, time-consuming and expensive process. Code assignment is, however, an important task for billing purposes and database organization. While many works have studied the problem of automated ICD coding from free text using machine learning techniques, most use records in the English language, especially from the MIMIC-III public dataset. This work presents results for a dataset with Brazilian Portuguese clinical notes. We develop and optimize a Logistic Regression model, a Convolutional Neural Network (CNN), a Gated Recurrent Unit Neural Network and a CNN with Attention (CNN-Att) for prediction of diagnosis ICD codes. We also report our results for the MIMIC-III dataset, which outperform previous work among models of the same families, as well as the state of the art. Compared to MIMIC-III, the Brazilian Portuguese dataset contains far fewer words per document, when only discharge summaries are used. We experiment concatenating additional documents available in this dataset, achieving a great boost in performance. The CNN-Att model achieves the best results on both datasets, with micro-averaged F1 score of 0.537 on MIMIC-III and 0.485 on our dataset with additional documents.

Submitted 29 July, 2020; originally announced August 2020.

Comments: Accepted at BRACIS 2020

arXiv:1910.00752 [pdf, ps, other]

Ward2ICU: A Vital Signs Dataset of Inpatients from the General Ward

Authors: Daniel Severo, Flávio Amaro, Estevam R. Hruschka Jr, André Soares de Moura Costa

Abstract: We present a proxy dataset of vital signs with class labels indicating patient transitions from the ward to intensive care units called Ward2ICU. Patient privacy is protected using a Wasserstein Generative Adversarial Network to implicitly learn an approximation of the data distribution, allowing us to sample synthetic data. The quality of data generation is assessed directly on the binary classification task by comparing specificity and sensitivity of an LSTM classifier on proxy and original datasets. We initialize a discussion of unintentionally disclosing commercial sensitive information and propose a solution for a special case through class label balancing.

Submitted 1 October, 2019; originally announced October 2019.

Showing 1–10 of 10 results for author: Severo, D