-
Gemma 2: Improving Open Language Models at a Practical Size
Authors:
Gemma Team,
Morgane Riviere,
Shreya Pathak,
Pier Giuseppe Sessa,
Cassidy Hardin,
Surya Bhupatiraju,
Léonard Hussenot,
Thomas Mesnard,
Bobak Shahriari,
Alexandre Ramé,
Johan Ferret,
Peter Liu,
Pouya Tafti,
Abe Friesen,
Michelle Casbon,
Sabela Ramos,
Ravin Kumar,
Charline Le Lan,
Sammy Jerome,
Anton Tsitsulin,
Nino Vieillard,
Piotr Stanczyk,
Sertan Girgin,
Nikola Momchev,
Matt Hoffman
, et al. (172 additional authors not shown)
Abstract:
In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community.
Submitted 2 August, 2024; v1 submitted 31 July, 2024;
originally announced August 2024.
-
EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models
Authors:
Mingqiang Huang,
Ao Shen,
Kai Li,
Haoxiang Peng,
Boyu Li,
Hao Yu
Abstract:
The rapid advancements in artificial intelligence (AI), particularly Large Language Models (LLMs), have profoundly affected our daily work and communication. However, the colossal scale of LLMs presents significant operational challenges, particularly when attempting to deploy them on resource-constrained edge devices such as smartphones, robots, and embedded systems. In this work, we propose EdgeLLM, an efficient CPU-FPGA heterogeneous acceleration framework, to markedly enhance the computational efficiency of LLMs on the edge. We first analyze all operators within AI models and develop a universal data parallelism scheme, which is generic and can be adapted to any type of AI algorithm. Then, we develop fully customized hardware operators according to the designated data formats. A multitude of optimization techniques are integrated in the design, such as approximate FP16*INT4 and FP16*FP16 computation engines, group vector systolic arrays, log-scale structured sparsity, and asynchrony between data transfer and processing. Finally, we propose an end-to-end compilation scheme that can dynamically compile all of the operators and map the whole model onto the CPU-FPGA heterogeneous system. The design has been deployed on an AMD Xilinx VCU128 FPGA. Our accelerator achieves 1.67x higher throughput and 7.4x higher energy efficiency than a commercial GPU (NVIDIA A100-SXM4-80G) on ChatGLM2-6B, and shows 10%~20% better performance than the state-of-the-art FPGA accelerator FlightLLM in terms of HBM bandwidth utilization and LLM throughput.
Submitted 31 July, 2024;
originally announced July 2024.
-
Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance
Authors:
Ao Shen,
Qiang Wang,
Zhiquan Lai,
Xionglve Li,
Dongsheng Li
Abstract:
Large Language Models (LLMs) have demonstrated impressive performance across various domains. However, the enormous number of model parameters makes fine-tuning challenging, significantly limiting their application and deployment. Existing solutions combine parameter quantization with Low-Rank Adaptation (LoRA), greatly reducing memory usage but resulting in noticeable performance degradation. In this paper, we identify an imbalance in fine-tuning quantized pre-trained models: overly complex adapter inputs and outputs versus low effective trainability of the adaptation. We propose Quantized LLMs with Balanced-rank Adaptation (Q-BaRA), which simplifies the adapter inputs and outputs while increasing the adapter's rank to achieve a more suitable balance for fine-tuning quantized LLMs. Additionally, for scenarios where fine-tuned LLMs need to be deployed as low-precision inference models, we introduce Quantization-Aware Fine-tuning with Higher Rank Adaptation (QA-HiRA), which simplifies the adapter inputs and outputs to align with the pre-trained model's block-wise quantization while employing a single matrix to achieve a higher rank. Both Q-BaRA and QA-HiRA are easily implemented and offer the following optimizations: (i) Q-BaRA consistently achieves the highest accuracy compared to baselines and other variants, requiring the same number of trainable parameters and computational effort; (ii) QA-HiRA naturally merges adapter parameters into the block-wise quantized model after fine-tuning, achieving the highest accuracy compared to other methods. We apply our Q-BaRA and QA-HiRA to the LLaMA and LLaMA2 model families and validate their effectiveness across different fine-tuning datasets and downstream scenarios.
Code will be made available at https://github.com/xiaocaigou/qbaraqahira
Submitted 24 July, 2024;
originally announced July 2024.
-
Kolmogorov complexity as a combinatorial tool
Authors:
Alexander Shen
Abstract:
Kolmogorov complexity is often used as a convenient language for counting and/or probabilistic existence proofs. However, there are some applications where Kolmogorov complexity is used in a more subtle way. We provide one (somewhat) surprising example where the existence of a winning strategy in a natural combinatorial game is proven (and no direct proof is known).
Submitted 15 May, 2024;
originally announced May 2024.
-
Physical formula enhanced multi-task learning for pharmacokinetics prediction
Authors:
Ruifeng Li,
Dongzhan Zhou,
Ancheng Shen,
Ao Zhang,
Mao Su,
Mingqian Li,
Hongyang Chen,
Gang Chen,
Yin Zhang,
Shufei Zhang,
Yuqiang Li,
Wanli Ouyang
Abstract:
Artificial intelligence (AI) technology has demonstrated remarkable potential in drug discovery, where pharmacokinetics plays a crucial role in determining the dosage, safety, and efficacy of new drugs. A major challenge for AI-driven drug discovery (AIDD) is the scarcity of high-quality data, which often requires extensive wet-lab work. A typical example of this is pharmacokinetic experiments. In this work, we develop a physical formula enhanced multi-task learning (PEMAL) method that predicts four key parameters of pharmacokinetics simultaneously. By incorporating physical formulas into the multi-task framework, PEMAL facilitates effective knowledge sharing and target alignment among the pharmacokinetic parameters, thereby enhancing the accuracy of prediction. Our experiments reveal that PEMAL significantly lowers the data demand, compared to typical Graph Neural Networks. Moreover, we demonstrate that PEMAL enhances the robustness to noise, an advantage that conventional Neural Networks do not possess. Another advantage of PEMAL is its high flexibility, which can be potentially applied to other multi-task machine learning scenarios. Overall, our work illustrates the benefits and potential of using PEMAL in AIDD and other scenarios with data scarcity and noise.
Submitted 16 April, 2024;
originally announced April 2024.
-
Conditional normality and finite-state dimensions revisited
Authors:
Alexander Shen
Abstract:
The notion of a normal bit sequence was introduced by Borel in 1909; it was the first definition of an individual random object. Normality is a weak notion of randomness requiring only that all $2^n$ factors (substrings) of arbitrary length $n$ appear with the same limit frequency $2^{-n}$. Later many stronger definitions of randomness were introduced, and in this context normality found its place as "randomness against a finite-memory adversary". A quantitative measure of finite-state compressibility was also introduced (the finite-state dimension) and normality means that the finite-state dimension is maximal (equals $1$).
Recently Nandakumar, Pulari and S (2023) introduced the notion of relative finite-state dimension for a binary sequence with respect to some other binary sequence (treated as an oracle), and the corresponding notion of conditional (relative) normality. (Different notions of conditional randomness were considered before, but not for the finite memory case.) They establish equivalence between the block frequency and the gambling approaches to conditional normality and finite-state dimensions.
In this note we revisit their definitions and explain how this equivalence can be obtained easily by generalizing known characterizations of (unconditional) normality and dimension in terms of compressibility (finite-state complexity), superadditive complexity measures and gambling (finite-state gales), thus also answering some questions left open in the above-mentioned paper.
Submitted 3 March, 2024;
originally announced March 2024.
-
Classification under Nuisance Parameters and Generalized Label Shift in Likelihood-Free Inference
Authors:
Luca Masserano,
Alex Shen,
Michele Doro,
Tommaso Dorigo,
Rafael Izbicki,
Ann B. Lee
Abstract:
An open scientific challenge is how to classify events with reliable measures of uncertainty, when we have a mechanistic model of the data-generating process but the distribution over both labels and latent nuisance parameters is different between train and target data. We refer to this type of distributional shift as generalized label shift (GLS). Direct classification using observed data $\mathbf{X}$ as covariates leads to biased predictions and invalid uncertainty estimates of labels $Y$. We overcome these biases by proposing a new method for robust uncertainty quantification that casts classification as a hypothesis testing problem under nuisance parameters. The key idea is to estimate the classifier's receiver operating characteristic (ROC) across the entire nuisance parameter space, which allows us to devise cutoffs that are invariant under GLS. Our method effectively endows a pre-trained classifier with domain adaptation capabilities and returns valid prediction sets while maintaining high power. We demonstrate its performance on two challenging scientific problems in biology and astroparticle physics with data from realistic mechanistic models.
Submitted 1 July, 2024; v1 submitted 7 February, 2024;
originally announced February 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Towards Quantum-Native Communication Systems: New Developments, Trends, and Challenges
Authors:
Xiaolin Zhou,
Anqi Shen,
Shuyan Hu,
Wei Ni,
Xin Wang,
Ekram Hossain,
Lajos Hanzo
Abstract:
The potential synergy between quantum communications and future wireless communication systems is explored. By proposing a quantum-native or quantum-by-design philosophy, the survey examines technologies such as quantum-domain (QD) multi-input multi-output (MIMO), QD non-orthogonal multiple access (NOMA), quantum secure direct communication (QSDC), QD resource allocation, QD routing, and QD artificial intelligence (AI). The recent research advances in these areas are summarized. Given the behavior of photonic and particle-like Terahertz (THz) systems, a comprehensive system-oriented perspective is adopted to assess the feasibility of using quantum communications in future systems. This survey also reviews quantum optimization algorithms and quantum neural networks to explore the potential integration of quantum communication and quantum computing in future systems. Additionally, the current status of quantum sensing, quantum radar, and quantum timing is briefly reviewed in support of future applications. The associated research gaps and future directions are identified, including extending the entanglement coherence time, developing THz quantum communications devices, addressing challenges in channel estimation and tracking, and establishing the theoretical bounds and performance trade-offs of quantum communication, computing, and sensing. This survey offers a unique perspective on the potential for quantum communications to revolutionize future systems and pave the way for even more advanced technologies.
Submitted 9 November, 2023;
originally announced November 2023.
-
Online Matching with Stochastic Rewards: Advanced Analyses Using Configuration Linear Programs
Authors:
Zhiyi Huang,
Hanrui Jiang,
Aocheng Shen,
Junkai Song,
Zhiang Wu,
Qiankun Zhang
Abstract:
Mehta and Panigrahi (2012) proposed Online Matching with Stochastic Rewards, which generalizes the Online Bipartite Matching problem of Karp, Vazirani, and Vazirani (1990) by associating the edges with success probabilities. This new feature captures the pay-per-click model in online advertising. Recently, Huang and Zhang (2020) studied this problem under the online primal dual framework using the Configuration Linear Program (LP), and got the best known competitive ratios of the Stochastic Balance algorithm. Their work suggests that the more expressive Configuration LP is more suitable for this problem than the Matching LP.
This paper advances the theory of Configuration LP in two directions. Our technical contribution includes a characterization of the joint matching outcome of an offline vertex and all its neighbors. This characterization may be of independent interest, and is aligned with the spirit of Configuration LP. By contrast, previous analyses of Ranking generally focus on only one neighbor. Second, we designed a Stochastic Configuration LP that captures a stochastic benchmark proposed by Goyal and Udwani (2020), who used a Path-based LP. The Stochastic Configuration LP is smaller and simpler than the Path-based LP. Moreover, using the new LP we improved the competitive ratio of Stochastic Balance from $0.596$ to $0.611$ when the success probabilities are infinitesimal, and to $0.613$ when the success probabilities are further equal.
Submitted 18 September, 2023;
originally announced September 2023.
-
Marine Debris Detection in Satellite Surveillance using Attention Mechanisms
Authors:
Ao Shen,
Yijie Zhu,
Richard Jiang
Abstract:
Marine debris is an important issue for environmental protection, but current methods for locating marine debris are still limited. In order to achieve higher efficiency and wider applicability in the localization of marine debris, this study combines the instance segmentation of YOLOv7 with different attention mechanisms and explores the best model. Using a labelled dataset consisting of satellite images containing ocean debris, we examined three attention models: lightweight coordinate attention, CBAM (combining spatial and channel focus), and bottleneck transformer (based on self-attention). Box detection assessment revealed that CBAM achieved the best outcome (F1 score of 77%) compared to coordinate attention (F1 score of 71%) and YOLOv7/bottleneck transformer (both F1 scores around 66%). Mask evaluation showed CBAM again leading with an F1 score of 73%, whereas coordinate attention and YOLOv7 had comparable performances (F1 scores of around 68%/69%) and bottleneck transformer lagged behind at an F1 score of 56%. These findings suggest that CBAM offers optimal suitability for detecting marine debris. However, it should be noted that the bottleneck transformer detected some areas missed by manual annotation and displayed better mask precision for larger debris pieces, signifying potentially superior practical performance.
Submitted 9 July, 2023;
originally announced July 2023.
-
Characterizing Out-of-Distribution Error via Optimal Transport
Authors:
Yuzhe Lu,
Yilong Qin,
Runtian Zhai,
Andrew Shen,
Ketong Chen,
Zhenlin Wang,
Soheil Kolouri,
Simon Stepputtis,
Joseph Campbell,
Katia Sycara
Abstract:
Out-of-distribution (OOD) data poses serious challenges in deployed machine learning models, so methods of predicting a model's performance on OOD data without labels are important for machine learning safety. While a number of methods have been proposed by prior work, they often underestimate the actual error, sometimes by a large margin, which greatly impacts their applicability to real tasks. In this work, we identify pseudo-label shift, or the difference between the predicted and true OOD label distributions, as a key indicator to this underestimation. Based on this observation, we introduce a novel method for estimating model performance by leveraging optimal transport theory, Confidence Optimal Transport (COT), and show that it provably provides more robust error estimates in the presence of pseudo-label shift. Additionally, we introduce an empirically-motivated variant of COT, Confidence Optimal Transport with Thresholding (COTT), which applies thresholding to the individual transport costs and further improves the accuracy of COT's error estimates. We evaluate COT and COTT on a variety of standard benchmarks that induce various types of distribution shift -- synthetic, novel subpopulation, and natural -- and show that our approaches significantly outperform existing state-of-the-art methods with an up to 3x lower prediction error.
Submitted 27 October, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
The Kraft--Barmpalias--Lewis-Pye lemma revisited
Authors:
Alexander Shen
Abstract:
This note provides a simplified exposition of the proof of the hierarchical Kraft lemma proven by Barmpalias and Lewis-Pye and its consequences for the oracle use in the Kučera--Gács theorem (saying that every sequence is Turing reducible to a random one).
Submitted 31 May, 2023; v1 submitted 10 April, 2023;
originally announced April 2023.
-
Experimental quantum secret sharing based on phase encoding of coherent states
Authors:
Ao Shen,
Xiao-Yu Cao,
Yang Wang,
Yao Fu,
Jie Gu,
Wen-Bo Liu,
Chen-Xun Weng,
Hua-Lei Yin,
Zeng-Bing Chen
Abstract:
Quantum secret sharing (QSS) is one of the basic communication primitives in future quantum networks which addresses part of the basic cryptographic tasks of multiparty communication and computation. Nevertheless, it is a challenge to provide a practical QSS protocol with security against general attacks. A QSS protocol that balances security and practicality is still lacking. Here, we propose a QSS protocol with simple phase encoding of coherent states among three parties. Removing the requirement of impractical entangled resources and the need for phase randomization, our protocol can be implemented with accessible technology. We provide the finite-key analysis against coherent attacks and implement a proof-of-principle experiment to demonstrate our scheme's feasibility. Our scheme achieves a key rate of 85.3 bps under a 35 dB channel loss. Combined with security against general attacks and accessible technology, our protocol is a promising candidate for practical multiparty quantum communication networks.
Submitted 27 March, 2023; v1 submitted 26 March, 2023;
originally announced March 2023.
-
Asymmetric distribution of data products from WALLABY, an SKA precursor neutral hydrogen survey
Authors:
Manuel Parra-Royon,
Austin Shen,
Tristan Reynolds,
Parthasarathy Venkataraman,
María Angeles Mendoza,
Susana Sánchez-Exposito,
Julian Garrido,
Slava Kitaeff,
Lourdes Verdes-Montenegro
Abstract:
The Widefield ASKAP L-band Legacy All-sky Blind surveY (WALLABY) is a neutral hydrogen survey (HI) that is running on the Australian SKA Pathfinder (ASKAP), a precursor telescope for the Square Kilometre Array (SKA). The goal of WALLABY is to use ASKAP's powerful wide-field phased array feed technology to observe three quarters of the entire sky at the 21 cm neutral hydrogen line with an angular resolution of 30 arcseconds. Post-processing activities at the Australian SKA Regional Centre (AusSRC), Canadian Initiative for Radio Astronomy Data Analysis (CIRADA) and Spanish SKA Regional Centre prototype (SPSRC) will then produce publicly available advanced data products in the form of source catalogues, kinematic models and image cutouts, respectively. These advanced data products will be generated locally at each site and distributed across the network. Over the course of the full survey we expect to replicate data up to 10 MB per source detection, which could imply an ingestion of tens of GB to be consolidated in the other locations near real time. Here, we explore the use of an asymmetric database replication model and strategy, using PostgreSQL as the engine and Bucardo as the asynchronous replication service to enable robust multi-source pools operations with data products from WALLABY. This work would serve to evaluate this type of data distribution solution across globally distributed sites. Furthermore, a set of benchmarks have been developed to confirm that the deployed model is sufficient for future scalability and remote collaboration needs.
Submitted 21 March, 2023;
originally announced March 2023.
-
Systematic Evaluation of Predictive Fairness
Authors:
Xudong Han,
Aili Shen,
Trevor Cohn,
Timothy Baldwin,
Lea Frermann
Abstract:
Mitigating bias in training on biased datasets is an important open problem. Several techniques have been proposed; however, the typical evaluation regime is very limited, considering very narrow data conditions. For instance, the effect of target class imbalance and stereotyping is under-studied. To address this gap, we examine the performance of various debiasing methods across multiple tasks, spanning binary classification (Twitter sentiment), multi-class classification (profession prediction), and regression (valence prediction). Through extensive experimentation, we find that data conditions have a strong influence on relative model performance, and that general conclusions cannot be drawn about method efficacy when evaluating only on standard datasets, as is current practice in fairness research.
Submitted 17 October, 2022;
originally announced October 2022.
-
Inequalities for entropies and dimensions
Authors:
Alexander Shen
Abstract:
We show that linear inequalities for entropies have a natural geometric interpretation in terms of Hausdorff and packing dimensions, using the point-to-set principle and known results about inequalities for complexities, entropies and the sizes of subgroups.
Submitted 28 April, 2023; v1 submitted 15 September, 2022;
originally announced September 2022.
-
Optimising Equal Opportunity Fairness in Model Training
Authors:
Aili Shen,
Xudong Han,
Trevor Cohn,
Timothy Baldwin,
Lea Frermann
Abstract:
Real-world datasets often encode stereotypes and societal biases. Such biases can be implicitly captured by trained models, leading to biased predictions and exacerbating existing societal preconceptions. Existing debiasing methods, such as adversarial training and removing protected information from representations, have been shown to reduce bias. However, a disconnect between fairness criteria and training objectives makes it difficult to reason theoretically about the effectiveness of different techniques. In this work, we propose two novel training objectives which directly optimise for the widely-used criterion of equal opportunity, and show that they are effective in reducing bias while maintaining high performance over two classification tasks.
Submitted 4 May, 2022;
originally announced May 2022.
-
fairlib: A Unified Framework for Assessing and Improving Classification Fairness
Authors:
Xudong Han,
Aili Shen,
Yitong Li,
Lea Frermann,
Timothy Baldwin,
Trevor Cohn
Abstract:
This paper presents fairlib, an open-source framework for assessing and improving classification fairness. It provides a systematic framework for quickly reproducing existing baseline models, developing new methods, evaluating models with different metrics, and visualizing their results. Its modularity and extensibility enable the framework to be used for diverse types of inputs, including natural language, images, and audio. In detail, we implement 14 debiasing methods, including pre-processing, at-training-time, and post-processing approaches. The built-in metrics cover the most commonly used fairness criterion and can be further generalized and customized for fairness evaluation.
Submitted 3 May, 2022;
originally announced May 2022.
-
27 Open Problems in Kolmogorov Complexity
Authors:
Andrei Romashchenko,
Alexander Shen,
Marius Zimand
Abstract:
The paper proposes open problems in classical Kolmogorov complexity. Each problem is presented with background information and thus the article also surveys some recent studies in the area.
Submitted 28 March, 2022;
originally announced March 2022.
-
Versatile Offline Imitation from Observations and Examples via Regularized State-Occupancy Matching
Authors:
Yecheng Jason Ma,
Andrew Shen,
Dinesh Jayaraman,
Osbert Bastani
Abstract:
We propose State Matching Offline DIstribution Correction Estimation (SMODICE), a novel and versatile regression-based offline imitation learning (IL) algorithm derived via state-occupancy matching. We show that the SMODICE objective admits a simple optimization procedure through an application of Fenchel duality and an analytic solution in tabular MDPs. Without requiring access to expert actions, SMODICE can be effectively applied to three offline IL settings: (i) imitation from observations (IfO), (ii) IfO with dynamics or morphologically mismatched expert, and (iii) example-based reinforcement learning, which we show can be formulated as a state-occupancy matching problem. We extensively evaluate SMODICE on both gridworld environments and high-dimensional offline benchmarks. Our results demonstrate that SMODICE is effective for all three problem settings and significantly outperforms the prior state of the art.
Submitted 18 June, 2022; v1 submitted 4 February, 2022;
originally announced February 2022.
-
Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning
Authors:
Yecheng Jason Ma,
Andrew Shen,
Osbert Bastani,
Dinesh Jayaraman
Abstract:
Reinforcement Learning (RL) agents in the real world must satisfy safety constraints in addition to maximizing a reward objective. Model-based RL algorithms hold promise for reducing unsafe real-world actions: they may synthesize policies that obey all constraints using simulated samples from a learned model. However, imperfect models can result in real-world constraint violations even for actions that are predicted to satisfy all constraints. We propose Conservative and Adaptive Penalty (CAP), a model-based safe RL framework that accounts for potential modeling errors by capturing model uncertainty and adaptively exploiting it to balance the reward and the cost objectives. First, CAP inflates predicted costs using an uncertainty-based penalty. Theoretically, we show that policies that satisfy this conservative cost constraint are guaranteed to also be feasible in the true environment. We further show that this guarantees the safety of all intermediate solutions during RL training. Further, CAP adaptively tunes this penalty during training using true cost feedback from the environment. We evaluate this conservative and adaptive penalty-based approach for model-based safe RL extensively on state and image-based environments. Our results demonstrate substantial gains in sample-efficiency while incurring fewer violations than prior safe RL algorithms. Code is available at: https://github.com/Redrew/CAP
Submitted 14 December, 2021;
originally announced December 2021.
-
Individual codewords
Authors:
Alexander Shen
Abstract:
Algorithmic information theory translates statements about classes of objects into statements about individual objects; it defines individual random sequences, effective Hausdorff dimension of individual points, amount of information in individual strings, etc. We observe that a similar translation is possible for list-decodable codes.
Submitted 1 November, 2021;
originally announced November 2021.
-
Gács-Kučera's Theorem Revisited by Levin
Authors:
George Barmpalias,
Alexander Shen
Abstract:
Leonid Levin (arxiv.org/abs/cs/0503039v14, p.7) published a new (and very nice) proof of Gács-Kučera's theorem that occupies only a few lines when presented in his style. We try to explain more details and discuss the connection of this proof with image randomness theorems, making explicit a result (see Proposition 4) that is implicit in Levin's exposition. Then we review the previous work about the oracle use when reducing a given sequence to another one, and its connection with algorithmic dimension theory.
Submitted 25 January, 2023; v1 submitted 31 October, 2021;
originally announced November 2021.
-
Contrastive Learning for Fair Representations
Authors:
Aili Shen,
Xudong Han,
Trevor Cohn,
Timothy Baldwin,
Lea Frermann
Abstract:
Trained classification models can unintentionally lead to biased representations and predictions, which can reinforce societal preconceptions and stereotypes. Existing debiasing methods for classification models, such as adversarial training, are often expensive to train and difficult to optimise. In this paper, we propose a method for mitigating bias in classifier training by incorporating contrastive learning, in which instances sharing the same class label are encouraged to have similar representations, while instances sharing a protected attribute are forced further apart. In such a way our method learns representations which capture the task label in focused regions, while ensuring the protected attribute has diverse spread, and thus has limited impact on prediction and thereby results in fairer models. Extensive experimental results across four tasks in NLP and computer vision show (a) that our proposed method can achieve fairer representations and realises bias reductions compared with competitive baselines; (b) that it can do so without sacrificing main task performance; and (c) that it sets a new state-of-the-art performance in one task despite reducing the bias. Finally, our method is conceptually simple and agnostic to network architectures, and incurs minimal additional compute cost.
Submitted 22 September, 2021;
originally announced September 2021.
-
Evaluating Document Coherence Modelling
Authors:
Aili Shen,
Meladel Mistica,
Bahar Salehi,
Hang Li,
Timothy Baldwin,
Jianzhong Qi
Abstract:
While pretrained language models ("LM") have driven impressive gains over morpho-syntactic and semantic tasks, their ability to model discourse and pragmatic phenomena is less clear. As a step towards a better understanding of their discourse modelling capabilities, we propose a sentence intrusion detection task. We examine the performance of a broad range of pretrained LMs on this detection task for English. Lacking a dataset for the task, we introduce INSteD, a novel intruder sentence detection dataset, containing 170,000+ documents constructed from English Wikipedia and CNN news articles. Our experiments show that pretrained LMs perform impressively in in-domain evaluation, but experience a substantial drop in the cross-domain setting, indicating limited generalisation capacity. Further results over a novel linguistic probe dataset show that there is substantial room for improvement, especially in the cross-domain setting.
Submitted 18 March, 2021;
originally announced March 2021.
-
Inequalities for space-bounded Kolmogorov complexity
Authors:
Bruno Bauwens,
Peter Gács,
Andrei Romashchenko,
Alexander Shen
Abstract:
There is a parallelism between Shannon information theory and algorithmic information theory. In particular, the same linear inequalities are true for Shannon entropies of tuples of random variables and Kolmogorov complexities of tuples of strings (Hammer et al., 1997), as well as for sizes of subgroups and projections of sets (Chan, Yeung, Romashchenko, Shen, Vereshchagin, 1998–2002). This parallelism started with the Kolmogorov-Levin formula (1968) for the complexity of pairs of strings with logarithmic precision. Longpré (1986) proved a version of this formula for space-bounded complexities.
In this paper we prove an improved version of Longpré's result with a tighter space bound, using Sipser's trick (1980). Then, using this space bound, we show that every linear inequality that is true for complexities or entropies, is also true for space-bounded Kolmogorov complexities with a polynomial space overhead.
Submitted 9 September, 2022; v1 submitted 20 October, 2020;
originally announced October 2020.
-
Complexity of majorants
Authors:
Alexander Shen
Abstract:
The minimal Kolmogorov complexity of a total computable function that exceeds everywhere all total computable functions of complexity at most $n$, is $2^{n+O(1)}$. If we replace "everywhere" by "for all sufficiently large inputs", the answer is $n+O(1)$.
Submitted 25 December, 2020; v1 submitted 6 April, 2020;
originally announced April 2020.
-
Correlations between Word Vector Sets
Authors:
Vitalii Zhelezniak,
April Shen,
Daniel Busbridge,
Aleksandar Savkov,
Nils Hammerla
Abstract:
Similarity measures based purely on word embeddings are comfortably competing with much more sophisticated deep learning and expert-engineered systems on unsupervised semantic textual similarity (STS) tasks. In contrast to commonly used geometric approaches, we treat a single word embedding as e.g. 300 observations from a scalar random variable. Using this paradigm, we first illustrate that similarities derived from elementary pooling operations and classic correlation coefficients yield excellent results on standard STS benchmarks, outperforming many recently proposed methods while being much faster and trivial to implement. Next, we demonstrate how to avoid pooling operations altogether and compare sets of word embeddings directly via correlation operators between reproducing kernel Hilbert spaces. Just like cosine similarity is used to compare individual word vectors, we introduce a novel application of the centered kernel alignment (CKA) as a natural generalisation of squared cosine similarity for sets of word vectors. Likewise, CKA is very easy to implement and enjoys very strong empirical results.
Submitted 7 October, 2019;
originally announced October 2019.
-
Correlation Coefficients and Semantic Textual Similarity
Authors:
Vitalii Zhelezniak,
Aleksandar Savkov,
April Shen,
Nils Y. Hammerla
Abstract:
A large body of research into semantic textual similarity has focused on constructing state-of-the-art embeddings using sophisticated modelling, careful choice of learning signals and many clever tricks. By contrast, little attention has been devoted to similarity measures between these embeddings, with cosine similarity being used unquestioningly in the majority of cases. In this work, we illustrate that for all common word vectors, cosine similarity is essentially equivalent to the Pearson correlation coefficient, which provides some justification for its use. We thoroughly characterise cases where Pearson correlation (and thus cosine similarity) is unfit as a similarity measure. Importantly, we show that Pearson correlation is appropriate for some word vectors but not others. When it is not appropriate, we illustrate how common non-parametric rank correlation coefficients can be used instead to significantly improve performance. We support our analysis with a series of evaluations on word-level and sentence-level semantic textual similarity benchmarks. On the latter, we show that even the simplest averaged word vectors compared by rank correlation easily rival the strongest deep representations compared by cosine similarity.
Submitted 19 May, 2019;
originally announced May 2019.
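The relationship between cosine similarity and Pearson correlation mentioned in the abstract above can be illustrated with a minimal sketch. It relies only on the standard identity that Pearson correlation is the cosine similarity of mean-centered vectors; the function names and toy vectors are illustrative assumptions and are not taken from the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def center(u):
    # Subtract the mean of the components from every component.
    mean = sum(u) / len(u)
    return [a - mean for a in u]

def pearson(u, v):
    # Pearson correlation is the cosine similarity of the centered vectors.
    return cosine(center(u), center(v))

# Toy vectors (hypothetical, not real word embeddings).
u = [1.0, 2.0, 3.0, 4.0]
v = [2.0, 1.0, 4.0, 3.0]
print(cosine(u, v), pearson(u, v))
print(cosine(center(u), center(v)), pearson(center(u), center(v)))
```

When the components of a vector already average to roughly zero, centering changes little, so the two measures nearly coincide; the toy vectors here have a nonzero mean, so the first printed pair differs while the second pair agrees.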
-
Don't Settle for Average, Go for the Max: Fuzzy Sets and Max-Pooled Word Vectors
Authors:
Vitalii Zhelezniak,
Aleksandar Savkov,
April Shen,
Francesco Moramarco,
Jack Flann,
Nils Y. Hammerla
Abstract:
Recent literature suggests that averaged word vectors followed by simple post-processing outperform many deep learning methods on semantic textual similarity tasks. Furthermore, when averaged word vectors are trained supervised on large corpora of paraphrases, they achieve state-of-the-art results on standard STS benchmarks. Inspired by these insights, we push the limits of word embeddings even further. We propose a novel fuzzy bag-of-words (FBoW) representation for text that contains all the words in the vocabulary simultaneously but with different degrees of membership, which are derived from similarities between word vectors. We show that max-pooled word vectors are only a special case of fuzzy BoW and should be compared via fuzzy Jaccard index rather than cosine similarity. Finally, we propose DynaMax, a completely unsupervised and non-parametric similarity measure that dynamically extracts and max-pools good features depending on the sentence pair. This method is both efficient and easy to implement, yet outperforms current baselines on STS tasks by a large margin and is even competitive with supervised word vectors trained to directly optimise cosine similarity.
Submitted 30 April, 2019;
originally announced April 2019.
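As a rough illustration of the fuzzy Jaccard index named in the abstract above, the sketch below uses the common definition for non-negative membership vectors: the sum of element-wise minima divided by the sum of element-wise maxima. It does not reproduce the paper's FBoW or DynaMax constructions; the function name and the toy membership values are assumptions.

```python
def fuzzy_jaccard(a, b):
    # Fuzzy intersection size: sum of element-wise minima.
    intersection = sum(min(x, y) for x, y in zip(a, b))
    # Fuzzy union size: sum of element-wise maxima.
    union = sum(max(x, y) for x, y in zip(a, b))
    # Two all-zero vectors are treated as having no overlap (a choice made here).
    return intersection / union if union > 0 else 0.0

# Hypothetical degrees of membership for three vocabulary items.
sentence_a = [0.2, 0.8, 0.5]
sentence_b = [0.4, 0.6, 0.5]
print(fuzzy_jaccard(sentence_a, sentence_b))
```

For crisp sets, where every membership is 0 or 1, this reduces to the ordinary Jaccard index.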
-
A Practical Analysis of Rust's Concurrency Story
Authors:
Aditya Saligrama,
Andrew Shen,
Jon Gjengset
Abstract:
Correct concurrent programs are difficult to write; when multiple threads mutate shared data, they may lose writes, corrupt data, or produce erratic program behavior. While many of the data-race issues with concurrency can be avoided by the placing of locks throughout the code, these often serialize program execution, and can significantly slow down performance-critical applications. Programmers also make mistakes, and often forget locks in less-executed code paths, which leads to programs that misbehave only in rare situations.
Rust is a recent programming language from Mozilla that attempts to solve these intertwined issues by detecting data races at compile time. Rust's type system encodes a data structure's ability to be shared between threads, which in turn allows the compiler to reject programs where threads directly mutate shared state without locks or other protection mechanisms. In this work, we examine how this aspect of Rust's type system impacts the development and refinement of a concurrent data structure, as well as its ability to adapt to situations where correctness is guaranteed by lower-level invariants (e.g., in lock-free algorithms) that are not directly expressible in the type system itself. We detail the implementation of a concurrent lock-free hashmap in order to describe these traits of the Rust language. Our code is publicly available at https://github.com/saligrama/concache and is one of the fastest concurrent hashmaps for the Rust language, which helps mitigate bottlenecks in concurrent programs.
Submitted 27 April, 2019;
originally announced April 2019.
-
A Joint Model for Multimodal Document Quality Assessment
Authors:
Aili Shen,
Bahar Salehi,
Timothy Baldwin,
Jianzhong Qi
Abstract:
The quality of a document is affected by various factors, including grammaticality, readability, stylistics, and expertise depth, making the task of document quality assessment a complex one. In this paper, we explore this task in the context of assessing the quality of Wikipedia articles and academic papers. Observing that the visual rendering of a document can capture implicit quality indicators that are not present in the document text (such as images, font choices, and visual layout), we propose a joint model that combines the text content with a visual rendering of the document for document quality assessment. Experimental results over two datasets reveal that textual and visual features are complementary, achieving state-of-the-art results.
Submitted 13 January, 2019; v1 submitted 4 January, 2019;
originally announced January 2019.
-
DSCnet: Replicating Lidar Point Clouds with Deep Sensor Cloning
Authors:
Paden Tomasello,
Sammy Sidhu,
Anting Shen,
Matthew W. Moskewicz,
Nobie Redmon,
Gayatri Joshi,
Romi Phadte,
Paras Jain,
Forrest Iandola
Abstract:
Convolutional neural networks (CNNs) have become increasingly popular for solving a variety of computer vision tasks, ranging from image classification to image segmentation. Recently, autonomous vehicles have created a demand for depth information, which is often obtained using hardware sensors such as light detection and ranging (LIDAR). Although LIDAR can provide precise distance measurements, most LIDARs are still far too expensive to sell in mass-produced consumer vehicles, which has motivated methods to generate depth information from commodity automotive sensors like cameras.
In this paper, we propose an approach called Deep Sensor Cloning (DSC). The idea is to use Convolutional Neural Networks in conjunction with inexpensive sensors to replicate the 3D point-clouds that are created by expensive LIDARs. To accomplish this, we develop a new dataset (DSDepth) and a new family of CNN architectures (DSCnets). While previous tasks such as KITTI depth prediction use interpolated RGB-D images as ground truth for training, we instead use DSCnets to directly predict LIDAR point-clouds. When we compare the output of our models to a $75,000 LIDAR, we find that our most accurate DSCnet achieves a relative error of 5.77% using a single camera and 4.69% using stereo cameras.
Submitted 26 November, 2018; v1 submitted 16 November, 2018;
originally announced November 2018.
-
Axiomatic approach to the theory of algorithms and relativized computability
Authors:
Alexander Shen
Abstract:
It is well known that many theorems in recursion theory can be "relativized". This means that they remain true if partial recursive functions are replaced by functions that are partial recursive relative to some fixed oracle set. Uspensky formulates three "axioms" called "axiom of computation records", "axiom of programs" and "arithmeticity axiom". Then, using these axioms (more precisely, the first two), he proves basic results of recursion theory. These two axioms are also true for the class of functions that are partial recursive relative to some fixed oracle set. This class is also closed under substitution, primitive recursion and minimization ($μ$-operator); these (intuitively obvious) closure properties are also used in the proofs. This observation made by Uspensky explains why many theorems of recursion theory can be relativized. It turns out that the reverse statement is also true: all relativizable results follow from the first two axioms and closure properties. Indeed, every class of partial functions that is closed under substitution, primitive recursion and minimization and that satisfies the first two axioms is the class of functions that are partial recursive relative to some oracle set $A$. This is the main result of the present article.
Submitted 15 November, 2018;
originally announced November 2018.
-
Random noise increases Kolmogorov complexity and Hausdorff dimension
Authors:
Gleb Posobin,
Alexander Shen
Abstract:
Consider a binary string $x$ of length $n$ whose Kolmogorov complexity is $αn$ for some $α<1$. We want to increase the complexity of $x$ by changing a small fraction of bits in $x$. This is always possible: Buhrman, Fortnow, Newman and Vereshchagin (2005) showed that the increase can be at least $δn$ for large $n$ (where $δ$ is some positive number that depends on $α$ and the allowed fraction of changed bits).
We consider a related question: what happens with the complexity of $x$ when we randomly change a small fraction of the bits (changing each bit independently with some probability $τ$)? It turns out that a linear increase in complexity happens with high probability, but this increase is smaller than in the case of arbitrary change. We note that the amount of the increase depends on $x$ (strings of the same complexity could behave differently), and give exact lower and upper bounds for this increase (with $o(n)$ precision).
The proof uses the combinatorial and probabilistic technique that goes back to Ahlswede, Gács and Körner (1976). For the reader's convenience (and also because we need a slightly stronger statement) we provide a simplified exposition of this technique, so the paper is self-contained.
Submitted 16 January, 2019; v1 submitted 14 August, 2018;
originally announced August 2018.
-
Information Distance Revisited
Authors:
Bruno Bauwens,
Alexander Shen
Abstract:
We consider the notion of information distance between two objects x and y introduced by Bennett, Gács, Li, Vitanyi, and Zurek [1] as the minimal length of a program that computes x from y as well as computing y from x, and study different versions of this notion. It was claimed by Mahmud [11] that the prefix version of information distance equals max(K(x|y), K(y|x)) + O(1) (this equality with logarithmic precision was one of the main results of the paper by Bennett, Gács, Li, Vitanyi, and Zurek). We show that this claim is false, but does hold if the information distance is at least superlogarithmic.
Submitted 1 October, 2019; v1 submitted 29 July, 2018;
originally announced July 2018.
-
Decoding Decoders: Finding Optimal Representation Spaces for Unsupervised Similarity Tasks
Authors:
Vitalii Zhelezniak,
Dan Busbridge,
April Shen,
Samuel L. Smith,
Nils Y. Hammerla
Abstract:
Experimental evidence indicates that simple models outperform complex deep networks on many unsupervised similarity tasks. We provide a simple yet rigorous explanation for this behaviour by introducing the concept of an optimal representation space, in which semantically close symbols are mapped to representations that are close under a similarity measure induced by the model's objective function. In addition, we present a straightforward procedure that, without any retraining or architectural modifications, allows deep recurrent models to perform equally well (and sometimes better) when compared to shallow models. To validate our analysis, we conduct a set of consistent empirical evaluations and introduce several new sentence embedding models in the process. Even though this work is presented within the context of natural language processing, the insights are readily applicable to other domains that rely on distributed representations for transfer tasks.
Submitted 9 May, 2018;
originally announced May 2018.
-
Dimension 1 sequences are close to randoms
Authors:
Noam Greenberg,
Joe Miller,
Alexander Shen,
Linda Brown Westrick
Abstract:
We show that a sequence has effective Hausdorff dimension 1 if and only if it is coarsely similar to a Martin-Löf random sequence. More generally, a sequence has effective dimension $s$ if and only if it is coarsely similar to a weakly $s$-random sequence. Further, for any $s<t$, every sequence of effective dimension $s$ can be changed on density at most $H^{-1}(t)-H^{-1}(s)$ of its bits to produce a sequence of effective dimension $t$, and this bound is optimal.
Submitted 15 September, 2017;
originally announced September 2017.
-
Plain stopping time and conditional complexities revisited
Authors:
Mikhail Andreev,
Gleb Posobin,
Alexander Shen
Abstract:
In this paper we analyze the notion of "stopping time complexity", informally defined as the amount of information needed to specify when to stop while reading an infinite sequence. This notion was introduced by Vovk and Pavlovic (2016). It turns out that plain stopping time complexity of a binary string $x$ could be equivalently defined as (a) the minimal plain complexity of a Turing machine that stops after reading $x$ on a one-directional input tape; (b) the minimal plain complexity of an algorithm that enumerates a prefix-free set containing $x$; (c) the conditional complexity $C(x|x*)$ where $x$ in the condition is understood as a prefix of an infinite binary sequence while the first $x$ is understood as a terminated binary string; (d) as a minimal upper semicomputable function $K$ such that each binary sequence has at most $2^n$ prefixes $z$ such that $K(z)<n$; (e) as $\max C^X(x)$ where $C^X(z)$ is plain Kolmogorov complexity of $z$ relative to oracle $X$ and the maximum is taken over all extensions $X$ of $x$.
We also show that some of these equivalent definitions become non-equivalent in the more general setting where the condition $y$ and the object $x$ may differ. We also answer an open question from Chernov, Hutter and Schmidhuber.
Submitted 3 October, 2017; v1 submitted 27 August, 2017;
originally announced August 2017.
-
Compressibility and probabilistic proofs
Authors:
Alexander Shen
Abstract:
We consider several examples of probabilistic existence proofs using compressibility arguments, including some results that involve the Lovász local lemma.
Submitted 9 March, 2017;
originally announced March 2017.
-
Automatic Kolmogorov complexity, normality and finite state dimension revisited
Authors:
Alexander Kozachinskiy,
Alexander Shen
Abstract:
It is well known that normality can be described as incompressibility via finite automata. Still the statement and the proof of this result as given by Becher and Heiber (2013) in terms of "lossless finite-state compressors" do not follow the standard scheme of Kolmogorov complexity definition (an automaton is used for compression, not decompression). We modify this approach to make it more similar to the traditional Kolmogorov complexity theory (and simpler) by explicitly defining the notion of automatic Kolmogorov complexity and using its simple properties.
Using this characterization and a sufficient condition for normality in terms of Kolmogorov complexity derived from it, we provide easy proofs for classical results about normal sequences (Champernowne, Wall, Piatetski-Shapiro, Besicovitch, Copeland, Erdős et al.).
Then we extend this approach to finite state dimension. We show that the block entropy definition of the finite state dimension remains the same if non-aligned blocks are used. Then we provide equivalent definitions in terms of automatic complexity, superadditive bounds for Kolmogorov complexity, calibrated superadditive functions and finite state a priori probability and use them to give simple proofs for known results about finite state dimension, and for Agafonov's result saying that normality is preserved by automatic selection rules as well as the results of Schnorr and Stimm that relate normality to finite state martingales.
Some results of this paper were presented at the Fundamentals in Computing Theory conferences in 2017 and 2019. A preliminary version of this paper (which does not mention finite state dimension) was published on arXiv in 2017 (see the previous version of this submission).
Submitted 24 August, 2020; v1 submitted 31 January, 2017;
originally announced January 2017.
-
Algorithmic statistics: forty years later
Authors:
Nikolai Vereshchagin,
Alexander Shen
Abstract:
Algorithmic statistics has two different (and almost orthogonal) motivations. From the philosophical point of view, it tries to formalize how statistics works and why some statistical models are better than others. After this notion of a "good model" is introduced, a natural question arises: is it possible that for some piece of data there is no good model? If yes, how often do these bad ("non-stochastic") data appear "in real life"?
Another, more technical motivation comes from algorithmic information theory. In this theory a notion of complexity of a finite object (=amount of information in this object) is introduced; it assigns to every object some number, called its algorithmic complexity (or Kolmogorov complexity). Algorithmic statistics provides a more fine-grained classification: for each finite object some curve is defined that characterizes its behavior. It turns out that several different definitions give (approximately) the same curve.
In this survey we try to provide an exposition of the main results in the field (including full proofs for the most important ones), as well as some historical comments. We assume that the reader is familiar with the main notions of algorithmic information (Kolmogorov complexity) theory.
Submitted 7 March, 2017; v1 submitted 27 July, 2016;
originally announced July 2016.
-
Layerwise computability and image randomness
Authors:
Laurent Bienvenu,
Mathieu Hoyrup,
Alexander Shen
Abstract:
Algorithmic randomness theory starts with a notion of an individual random object. To be reasonable, this notion should have some natural properties; in particular, an object should be random with respect to the image distribution if and only if it has a random preimage. This result (for computable distributions and mappings, and Martin-Löf randomness) has long been known (folklore); in this paper we prove its natural generalization for layerwise computable mappings, and discuss the related quantitative results.
Submitted 14 July, 2016;
originally announced July 2016.
-
Evolving hypernetwork model based on WeChat user relations
Authors:
Fu-Hong Wang,
Jin-Li Guo,
Ai-Zhong Shen,
Qi Suo
Abstract:
Based on the theory of hypernetworks and WeChat online social relations, the paper proposes an evolving hypernetwork model that accounts for the competitiveness and the age of nodes. In the model, nodes arrive at the system according to a Poisson process and age gradually. We analyze the model using Poisson process theory and a continuous technique, and give a characteristic equation of hyperdegrees. We obtain the stationary average hyperdegree distribution of the hypernetwork from the characteristic equation. The numerical simulations of the model agree well with the analytical results. We expect that our work may help the study of WeChat information transmission dynamics and mobile e-commerce.
Submitted 5 November, 2015;
originally announced November 2015.
-
DeepLogo: Hitting Logo Recognition with the Deep Neural Network Hammer
Authors:
Forrest N. Iandola,
Anting Shen,
Peter Gao,
Kurt Keutzer
Abstract:
Recently, there has been a flurry of industrial activity around logo recognition, such as Ditto's service for marketers to track their brands in user-generated images, and LogoGrab's mobile app platform for logo recognition. However, relatively little academic or open-source logo recognition progress has been made in the last four years. Meanwhile, deep convolutional neural networks (DCNNs) have revolutionized a broad range of object recognition applications. In this work, we apply DCNNs to logo recognition. We propose several DCNN architectures, with which we surpass published state-of-the-art accuracy on a popular logo recognition dataset.
Submitted 7 October, 2015;
originally announced October 2015.
-
Generic algorithms for halting problem and optimal machines revisited
Authors:
Laurent Bienvenu,
Damien Desfontaines,
Alexander Shen
Abstract:
The halting problem is undecidable, but can it be solved for "most" inputs? This natural question was considered in a number of papers, in different settings. We revisit their results and show that most of them can be easily proven in a natural framework of optimal machines (considered in algorithmic information theory) using the notion of Kolmogorov complexity. We also consider some related questions about this framework and about asymptotic properties of the halting problem. In particular, we show that the fraction of terminating programs cannot have a limit, and all limit points are Martin-Löf random reals. We then consider mass problems of finding an approximate solution of the halting problem and probabilistic algorithms for them, proving both positive and negative results. We consider the fraction of terminating programs that require a long time for termination, and describe this fraction using the busy beaver function. We also consider approximate versions of separation problems, and revisit Schnorr's results about optimal numberings showing how they can be generalized.
Submitted 4 April, 2016; v1 submitted 4 May, 2015;
originally announced May 2015.
-
Around Kolmogorov complexity: basic notions and results
Authors:
Alexander Shen
Abstract:
Algorithmic information theory studies description complexity and randomness and is now a well-known field of theoretical computer science and mathematical logic. There are several textbooks and monographs devoted to this theory where one can find the detailed exposition of many difficult results as well as historical references. However, it seems that a short survey of its basic notions and main results relating these notions to each other is missing.
This report attempts to fill this gap and covers the basic notions of algorithmic information theory: Kolmogorov complexity (plain, conditional, prefix), Solomonoff universal a priori probability, notions of randomness (Martin-Löf randomness, Mises-Church randomness), effective Hausdorff dimension. We prove their basic properties (symmetry of information, connection between a priori probability and prefix complexity, criterion of randomness in terms of complexity, complexity characterization for effective dimension) and show some applications (incompressibility method in computational complexity theory, incompleteness theorems). It is based on the lecture notes of a course at Uppsala University given by the author.
Submitted 20 April, 2015;
originally announced April 2015.
-
Algorithmic statistics revisited
Authors:
Nikolay Vereshchagin,
Alexander Shen
Abstract:
The mission of statistics is to provide adequate statistical hypotheses (models) for observed data. But what is an "adequate" model? To answer this question, one needs to use the notions of algorithmic information theory. It turns out that for every data string $x$ one can naturally define a "stochasticity profile", a curve that represents a trade-off between complexity of a model and its adequacy. This curve has four different equivalent definitions in terms of (1) randomness deficiency, (2) minimal description length, (3) position in the lists of simple strings and (4) Kolmogorov complexity with decompression time bounded by the busy beaver function. We present a survey of the corresponding definitions and results relating them to each other.
Submitted 27 April, 2015; v1 submitted 20 April, 2015;
originally announced April 2015.
-
K-trivial, K-low and MLR-low sequences: a tutorial
Authors:
Laurent Bienvenu,
Alexander Shen
Abstract:
A remarkable achievement in algorithmic randomness and algorithmic information theory was the discovery of the notions of K-trivial, K-low and Martin-Löf-random-low sets: three different definitions turn out to be equivalent for very non-trivial reasons. This paper, based on the course taught by one of the authors (L.B.) in Poncelet laboratory (CNRS, Moscow) in 2014, provides an exposition of the proof of this equivalence and some related results. We assume that the reader is familiar with basic notions of algorithmic information theory.
Submitted 1 October, 2015; v1 submitted 16 July, 2014;
originally announced July 2014.