Search | arXiv e-print repository

arXiv:2405.19107 [pdf, ps, other]

Offline Regularised Reinforcement Learning for Large Language Models Alignment

Authors: Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, Bilal Piot

Abstract: The dominant framework for alignment of large language models (LLMs), whether through reinforcement learning from human feedback or direct preference optimisation, is to learn from preference data. This involves building datasets where each element is a quadruplet composed of a prompt, two independent responses (completions of the prompt) and a human preference between the two independent responses, yielding a preferred and a dis-preferred response. Such data is typically scarce and expensive to collect. On the other hand, single-trajectory datasets, where each element is a triplet composed of a prompt, a response and a human feedback, are naturally more abundant. The canonical element of such datasets is for instance an LLM's response to a user's prompt followed by a user's feedback such as a thumbs-up/down. Consequently, in this work, we propose DRO, or Direct Reward Optimisation, as a framework and associated algorithms that do not require pairwise preferences. DRO uses a simple mean-squared objective that can be implemented in various ways. We validate our findings empirically, using T5 encoder-decoder language models, and show DRO's performance over selected baselines such as Kahneman-Tversky Optimization (KTO). Thus, we confirm that DRO is a simple and empirically compelling method for single-trajectory policy optimisation.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2306.08650 [pdf, other]

Learning to Rank when Grades Matter

Authors: Le Yan, Zhen Qin, Gil Shamir, Dong Lin, Xuanhui Wang, Mike Bendersky

Abstract: Graded labels are ubiquitous in real-world learning-to-rank applications, especially in human rated relevance data. Traditional learning-to-rank techniques aim to optimize the ranked order of documents. They typically, however, ignore predicting actual grades. This prevents them from being adopted in applications where grades matter, such as filtering out "poor" documents. Achieving both good ranking performance and good grade prediction performance is still an under-explored problem. Existing research either focuses only on ranking performance by not calibrating model outputs, or treats grades as numerical values, assuming labels are on a linear scale and failing to leverage the ordinal grade information. In this paper, we conduct a rigorous study of learning to rank with grades, where both ranking performance and grade prediction performance are important. We provide a formal discussion on how to perform ranking with non-scalar predictions for grades, and propose a multiobjective formulation to jointly optimize both ranking and grade predictions. In experiments, we verify on several public datasets that our methods are able to push the Pareto frontier of the tradeoff between ranking and grade prediction performance, showing the benefit of leveraging ordinal grade information.

Submitted 20 June, 2023; v1 submitted 14 June, 2023; originally announced June 2023.

arXiv:2209.05310 [pdf, other]

On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models

Authors: Rohan Anil, Sandra Gadanho, Da Huang, Nijith Jacob, Zhuoshu Li, Dong Lin, Todd Phillips, Cristina Pop, Kevin Regan, Gil I. Shamir, Rakesh Shivanna, Qiqi Yan

Abstract: For industrial-scale advertising systems, prediction of ad click-through rate (CTR) is a central problem. Ad clicks constitute a significant class of user engagements and are often used as the primary signal for the usefulness of ads to users. Additionally, in cost-per-click advertising systems where advertisers are charged per click, click rate expectations feed directly into value estimation. Accordingly, CTR model development is a significant investment for most Internet advertising companies. Engineering for such problems requires many machine learning (ML) techniques suited to online learning that go well beyond traditional accuracy improvements, especially concerning efficiency, reproducibility, calibration, and credit attribution. We present a case study of practical techniques deployed in Google's search ads CTR model. This paper provides an industry case study highlighting important areas of current ML research and illustrating how impactful new ML methods are evaluated and made useful in a large-scale industrial setting.

Submitted 12 September, 2022; originally announced September 2022.

Comments: ORSUM - ACM RecSys, September 23, 2022, Seattle, WA

arXiv:2202.06499 [pdf, other]

Real World Large Scale Recommendation Systems Reproducibility and Smooth Activations

Authors: Gil I. Shamir, Dong Lin

Abstract: Real world recommendation systems influence a constantly growing set of domains. With deep networks, which now drive such systems, recommendations have been more relevant to the user's interests and tasks. However, they may not always be reproducible even if produced by the same system for the same user, recommendation sequence, request, or query. This problem has received almost no attention in academic publications, but is, in fact, very realistic and critical in real production systems. We consider reproducibility of real large scale deep models, whose predictions determine such recommendations. We demonstrate that the celebrated Rectified Linear Unit (ReLU) activation, used in deep models, can be a major contributor to irreproducibility. We propose the use of smooth activations to improve recommendation reproducibility. We describe a novel family of smooth activations, Smooth ReLU (SmeLU), designed to improve reproducibility with mathematical simplicity, with potentially cheaper implementation. SmeLU is a member of a wider family of smooth activations. While other techniques that improve reproducibility in real systems usually come at accuracy costs, smooth activations not only improve reproducibility, but can even give accuracy gains. We report metrics from real systems in which we were able to productionalize SmeLU with substantial reproducibility gains and better accuracy-reproducibility trade-offs. These include click-through-rate (CTR) prediction systems, content, and application recommendation systems.

Submitted 14 February, 2022; originally announced February 2022.

arXiv:2202.04598 [pdf, ps, other]

Reproducibility in Optimization: Theoretical Framework and Limits

Authors: Kwangjun Ahn, Prateek Jain, Ziwei Ji, Satyen Kale, Praneeth Netrapalli, Gil I. Shamir

Abstract: We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.

Submitted 4 December, 2022; v1 submitted 9 February, 2022; originally announced February 2022.

Comments: 45 Pages; Accepted to NeurIPS 2022

arXiv:2110.06435 [pdf, other]

Dropout Prediction Uncertainty Estimation Using Neuron Activation Strength

Authors: Haichao Yu, Zhe Chen, Dong Lin, Gil Shamir, Jie Han

Abstract: Dropout has been commonly used to quantify prediction uncertainty, i.e., the variations of model predictions on a given input example. However, using dropout in practice can be expensive as it requires running dropout inferences many times. In this paper, we study how to estimate dropout prediction uncertainty in a resource-efficient manner. We demonstrate that we can use neuron activation strengths to estimate dropout prediction uncertainty under different dropout settings and on a variety of tasks using three large datasets, MovieLens, Criteo, and EMNIST. Our approach provides an inference-once method to estimate dropout prediction uncertainty as a cheap auxiliary task. We also demonstrate that using activation features from a subset of the neural network layers can be sufficient to achieve uncertainty estimation performance almost comparable to that of using activation features from all layers, thus reducing resources even further for uncertainty estimation.

Submitted 16 June, 2022; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: 8 pages

arXiv:2102.10696 [pdf, other]

Synthesizing Irreproducibility in Deep Networks

Authors: Robert R. Snapp, Gil I. Shamir

Abstract: The success and superior performance of deep networks is spreading their popularity and use to an increasing number of applications. Very recent works, however, demonstrate that modern day deep networks suffer from irreproducibility (also referred to as nondeterminism or underspecification). Two or more models that are identical in architecture, structure, training hyper-parameters, and parameters, and that are trained on exactly the same training data, yield different predictions on individual previously unseen examples. Thus, a model that performs well on controlled test data, may perform in unexpected ways when deployed in the real world, whose data is expected to be similar to the test data. We study simple synthetic models and data to understand the origins of these problems. We show that even with a single nonlinearity and for very simple data and models, irreproducibility occurs. Our study demonstrates the effects of randomness in initialization, training data shuffling window size, and activation functions on prediction irreproducibility, even under very controlled synthetic data. While, as one would expect, randomness in initialization and in shuffling the training examples exacerbates the phenomenon, we show that model complexity and the choice of nonlinearity also play significant roles in making deep models irreproducible.

Submitted 21 February, 2021; originally announced February 2021.

arXiv:2101.12113 [pdf, other]

Low Complexity Approximate Bayesian Logistic Regression for Sparse Online Learning

Authors: Gil I. Shamir, Wojciech Szpankowski

Abstract: Theoretical results show that Bayesian methods can achieve lower bounds on regret for online logistic regression. In practice, however, such techniques may not be feasible especially for very large feature sets. Various approximations that, for huge sparse feature sets, diminish the theoretical advantages, must be used. Often, they apply stochastic gradient methods with hyper-parameters that must be tuned on some surrogate loss, defeating theoretical advantages of Bayesian methods. The surrogate loss, defined to approximate the mixture, requires techniques such as Monte Carlo sampling, increasing computations per example. We propose low complexity analytical approximations for sparse online logistic and probit regressions. Unlike variational inference and other methods, our methods use analytical closed forms, substantially lowering computations. Unlike dense solutions, such as Gaussian mixtures, our methods allow for sparse problems with huge feature sets without increasing complexity. With the analytical closed forms, there is also no need for applying stochastic gradient methods on surrogate losses, and for tuning and balancing learning and regularization hyper-parameters. Empirical results top the performance of the more computationally involved methods. Like such methods, our methods still reveal per feature and per example uncertainty measures.

Submitted 28 January, 2021; originally announced January 2021.

arXiv:2010.09931 [pdf, other]

Smooth activations and reproducibility in deep networks

Authors: Gil I. Shamir, Dong Lin, Lorenzo Coviello

Abstract: Deep networks are gradually penetrating almost every domain in our lives due to their amazing success. However, with substantive performance accuracy improvements comes the price of irreproducibility. Two identical models, trained on the exact same training dataset, may exhibit large differences in predictions on individual examples even when average accuracy is similar, especially when trained on highly distributed parallel systems. The popular Rectified Linear Unit (ReLU) activation has been key to recent success of deep networks. We demonstrate, however, that ReLU is also a catalyst for irreproducibility in deep networks. We show that not only can activations smoother than ReLU provide better accuracy, but they can also provide better accuracy-reproducibility tradeoffs. We propose a new family of activations, Smooth ReLU (SmeLU), designed to give such better tradeoffs, while also keeping the mathematical expression simple, and thus implementation cheap. SmeLU is monotonic, mimics ReLU, while providing continuous gradients, yielding better reproducibility. We generalize SmeLU to give even more flexibility and then demonstrate that SmeLU and its generalized form are special cases of a more general methodology of REctified Smooth Continuous Unit (RESCU) activations. Empirical results demonstrate the superior accuracy-reproducibility tradeoffs with smooth activations, SmeLU in particular.

Submitted 30 November, 2020; v1 submitted 19 October, 2020; originally announced October 2020.
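
The abstract above describes SmeLU as monotonic, ReLU-like, and having continuous gradients, but does not state its formula. The following is a minimal sketch under an assumption: it uses a piecewise form with a quadratic transition region of half-width `beta` (a hypothetical parameter name here), which satisfies those three stated properties. It is an illustration, not necessarily the exact definition used in the paper.

```python
def smelu(x, beta=1.0):
    """Assumed SmeLU form: 0 left of -beta, identity right of +beta,
    and a quadratic blend in between (beta must be positive)."""
    if x <= -beta:
        return 0.0
    if x >= beta:
        return float(x)
    return (x + beta) ** 2 / (4.0 * beta)


def smelu_grad(x, beta=1.0):
    """Derivative of the assumed form; it rises linearly from 0 to 1
    across [-beta, beta], so there is no jump as in ReLU at 0."""
    if x <= -beta:
        return 0.0
    if x >= beta:
        return 1.0
    return (x + beta) / (2.0 * beta)


if __name__ == "__main__":
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(x, smelu(x), smelu_grad(x))
```

Under this assumed form, the value and the slope both match at `x = -beta` and `x = beta`, which is the "continuous gradients" property the abstract contrasts with ReLU.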

arXiv:2010.09923 [pdf, other]

Anti-Distillation: Improving reproducibility of deep networks

Authors: Gil I. Shamir, Lorenzo Coviello

Abstract: Deep networks have been revolutionary in improving performance of machine learning and artificial intelligence systems. Their high prediction accuracy, however, comes at a price of model irreproducibility in very high levels that do not occur with classical linear models. Two models, even if they are supposedly identical, with identical architecture and identical trained parameter sets, and that are trained on the same set of training examples, while possibly providing identical average prediction accuracies, may predict very differently on individual, previously unseen, examples. Prediction differences may be as large as the order of magnitude of the predictions themselves. Ensembles have been shown to somewhat mitigate this behavior, but without an extra push, may not be utilizing their full potential. In this work, a novel approach, Anti-Distillation, is proposed to address irreproducibility in deep networks, where ensemble models are used to generate predictions. Anti-Distillation forces ensemble components away from one another by techniques like de-correlating their outputs over mini-batches of examples, forcing them to become even more different and more diverse. Doing so enhances the benefit of ensembles, making the final predictions more reproducible. Empirical results demonstrate substantial prediction difference reductions achieved by Anti-Distillation on benchmark and real datasets.

Submitted 19 October, 2020; originally announced October 2020.

arXiv:2005.10320 [pdf, ps, other]

Sequential Universal Modeling for Non-Binary Sequences with Constrained Distributions

Authors: Michael Drmota, Gil Shamir, Wojciech Szpankowski

Abstract: Sequential probability assignment and universal compression go hand in hand. We propose sequential probability assignment for non-binary (and large alphabet) sequences with empirical distributions whose parameters are known to be bounded within a limited interval. Sequential probability assignment algorithms are essential in many applications that require fast and accurate estimation of the maximizing sequence probability. These applications include learning, regression, channel estimation and decoding, prediction, and universal compression. On the other hand, constrained distributions introduce interesting theoretical twists that must be overcome in order to present efficient sequential algorithms. Here, we focus on universal compression for memoryless sources, and present precise analysis for the maximal minimax and the average minimax for constrained distributions. We show that our sequential algorithm based on a modified Krichevsky-Trofimov (KT) estimator is asymptotically optimal up to $O(1)$ for both maximal and average redundancies. This paper follows and addresses the challenge presented in \cite{stw08} that suggested "results for the binary case lay the foundation to studying larger alphabets".

Submitted 6 February, 2021; v1 submitted 20 May, 2020; originally announced May 2020.
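
For orientation, the sketch below shows sequential probability assignment with the standard (unmodified) Krichevsky-Trofimov add-1/2 rule over a known finite alphabet. It is not the modified estimator for constrained distributions that the abstract refers to; the function name and signature are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction


def kt_sequence_probability(sequence, alphabet):
    """Probability the standard KT estimator assigns to `sequence`.

    After t symbols, symbol a with count n_a is predicted with
    probability (n_a + 1/2) / (t + k/2), where k = len(alphabet).
    """
    k = len(alphabet)
    counts = Counter()
    prob = Fraction(1)
    for t, symbol in enumerate(sequence):
        if symbol not in alphabet:
            raise ValueError(f"symbol {symbol!r} is not in the alphabet")
        prob *= (counts[symbol] + Fraction(1, 2)) / (t + Fraction(k, 2))
        counts[symbol] += 1
    return prob


if __name__ == "__main__":
    # Binary alphabet: the four length-2 sequences receive 3/8, 1/8, 1/8, 3/8.
    for s in ("00", "01", "10", "11"):
        print(s, kt_sequence_probability(s, "01"))
```

Exact fractions are used so that it is easy to check that the assigned probabilities over all sequences of a given length sum to one.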

arXiv:2002.02950 [pdf, ps, other]

Logistic Regression Regret: What's the Catch?

Authors: Gil I. Shamir

Abstract: We address the problem of the achievable regret rates with online logistic regression. We derive lower bounds with logarithmic regret under $L_1$, $L_2$, and $L_\infty$ constraints on the parameter values. The bounds are dominated by $d/2 \log T$, where $T$ is the horizon and $d$ is the dimensionality of the parameter space. We show their achievability for $d=o(T^{1/3})$ in all these cases with Bayesian methods, that achieve them up to a $d/2 \log d$ term. Interesting different behaviors are shown for larger dimensionality. Specifically, on the negative side, if $d = Ω(\sqrt{T})$, any algorithm is guaranteed regret of $Ω(d \log T)$ (greater than $Ω(\sqrt{T})$) under $L_\infty$ constraints on the parameters (and the example features). On the positive side, under $L_1$ constraints on the parameters, there exist algorithms that can achieve regret that is sub-linear in $d$ for the asymptotically larger values of $d$. For $L_2$ constraints, it is shown that for large enough $d$, the regret remains linear in $d$ but no longer logarithmic in $T$. Adapting the redundancy-capacity theorem from information theory, we demonstrate a principled methodology based on grids of parameters to derive lower bounds. Grids are also utilized to derive some upper bounds. Our results strengthen results by Kakade and Ng (2005) and Foster et al. (2018) for upper bounds for this problem, introduce novel lower bounds, and adapt a methodology that can be used to obtain such bounds for other related problems.
They also give a novel characterization of the asymptotic behavior when the dimension of the parameter space is allowed to grow with $T$. They additionally establish connections to the information theory literature, demonstrating that the actual regret for logistic regression depends on the richness of the parameter class, where even within this problem, richer classes lead to greater regret.

Submitted 19 February, 2020; v1 submitted 7 February, 2020; originally announced February 2020.

arXiv:0711.2102 [pdf, ps, other]

Patterns of i.i.d. Sequences and Their Entropy - Part II: Bounds for Some Distributions

Authors: Gil I. Shamir

Abstract: A pattern of a sequence is a sequence of integer indices with each index describing the order of first occurrence of the respective symbol in the original sequence. In a recent paper, tight general bounds on the block entropy of patterns of sequences generated by independent and identically distributed (i.i.d.) sources were derived. In this paper, precise approximations are provided for the pattern block entropies for patterns of sequences generated by i.i.d. uniform and monotonic distributions, including distributions over the integers, and the geometric distribution. Numerical bounds on the pattern block entropies of these distributions are provided even for very short blocks. Tight bounds are obtained even for distributions that have infinite i.i.d. entropy rates. The approximations are obtained using general bounds and their derivation techniques. Conditional index entropy is also studied for distributions over smaller alphabets.

Submitted 13 November, 2007; originally announced November 2007.
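
The definition of a pattern given in the abstract above can be illustrated with a short sketch. The function name and the 1-based indexing are choices made for this example only.

```python
def pattern(sequence):
    """Replace each symbol by the order (1, 2, 3, ...) in which that
    symbol first occurred in the sequence."""
    first_seen = {}  # symbol -> index assigned at its first occurrence
    indices = []
    for symbol in sequence:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen) + 1
        indices.append(first_seen[symbol])
    return indices


if __name__ == "__main__":
    print(pattern("abracadabra"))  # [1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]
```

Because the mapping discards symbol identities, different sequences such as "abracadabra" and "xyzxwxvxyzx" share one pattern, which is why the pattern entropy cannot exceed that of the original sequence.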

arXiv:0704.0838 [pdf, ps, other]

Universal Source Coding for Monotonic and Fast Decaying Monotonic Distributions

Authors: Gil I. Shamir

Abstract: We study universal compression of sequences generated by monotonic distributions. We show that for a monotonic distribution over an alphabet of size $k$, each probability parameter costs essentially $0.5 \log (n/k^3)$ bits, where $n$ is the coded sequence length, as long as $k = o(n^{1/3})$. Otherwise, for $k = O(n)$, the total average sequence redundancy is $O(n^{1/3+ε})$ bits overall. We then show that there exists a sub-class of monotonic distributions over infinite alphabets for which redundancy of $O(n^{1/3+ε})$ bits overall is still achievable. This class contains fast decaying distributions, including many distributions over the integers and geometric distributions. For some slower decays, including other distributions over the integers, redundancy of $o(n)$ bits overall is achievable, where a method to compute specific redundancy rates for such distributions is derived. The results are specifically true for finite entropy monotonic distributions. Finally, we study individual sequence redundancy behavior assuming a sequence is governed by a monotonic distribution. We show that for sequences whose empirical distributions are monotonic, individual redundancy bounds similar to those in the average case can be obtained. However, even if the monotonicity in the empirical distribution is violated, diminishing per symbol individual sequence redundancies with respect to the monotonic maximum likelihood description length may still be achievable.

Submitted 5 April, 2007; originally announced April 2007.

Comments: Submitted to IEEE Transactions on Information Theory

arXiv:cs/0605046 [pdf, ps, other]

Patterns of i.i.d. Sequences and Their Entropy - Part I: General Bounds

Authors: Gil I. Shamir

Abstract: Tight bounds on the block entropy of patterns of sequences generated by independent and identically distributed (i.i.d.) sources are derived. A pattern of a sequence is a sequence of integer indices with each index representing the order of first occurrence of the respective symbol in the original sequence. Since a pattern is the result of data processing on the original sequence, its entropy cannot be larger. Bounds derived here describe the pattern entropy as a function of the original i.i.d. source entropy, the alphabet size, the symbol probabilities, and their arrangement in the probability space. Matching upper and lower bounds derived provide a useful tool for very accurate approximations of pattern block entropies for various distributions, and for assessing the decrease of the pattern entropy from that of the original i.i.d. sequence.

Submitted 13 November, 2007; v1 submitted 10 May, 2006; originally announced May 2006.

Comments: Submitted to IEEE Transactions on Information Theory

arXiv:cs/0603068 [pdf, ps, other]

doi 10.1109/ISIT.2004.1365062

Universal Lossless Compression with Unknown Alphabets - The Average Case

Authors: Gil I. Shamir

Abstract: Universal compression of patterns of sequences generated by independently identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence. This pattern can in turn be compressed by itself. It is shown that if the alphabet size $k$ is essentially small, then the average minimax and maximin redundancies as well as the redundancy of every code for almost every source, when compressing a pattern, consist of at least $0.5 \log(n/k^3)$ bits per each unknown probability parameter, and if all alphabet letters are likely to occur, there exist codes whose redundancy is at most $0.5 \log(n/k^2)$ bits per each unknown probability parameter, where $n$ is the length of the data sequences. Otherwise, if the alphabet is large, these redundancies are essentially at least $O(n^{-2/3})$ bits per symbol, and there exist codes that achieve redundancy of essentially $O(n^{-1/2})$ bits per symbol. Two sub-optimal low-complexity sequential algorithms for compression of patterns are presented and their description lengths analyzed, also pointing out that the pattern average universal description length can decrease below the underlying i.i.d. entropy for large enough alphabets.

Submitted 16 March, 2006; originally announced March 2006.

Comments: Revised for IEEE Transactions on Information Theory

ACM Class: G.3

arXiv:cs/0504049 [pdf, ps, other]

Bounds on the Entropy of Patterns of I.I.D. Sequences

Authors: Gil I. Shamir

Abstract: Bounds on the entropy of patterns of sequences generated by independently identically distributed (i.i.d.) sources are derived. A pattern is a sequence of indices that contains all consecutive integer indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence. This pattern can in turn be compressed by itself. The bounds derived here are functions of the i.i.d. source entropy, alphabet size, and letter probabilities. It is shown that for large alphabets, the pattern entropy must decrease from the i.i.d. one. The decrease is in many cases more significant than the universal coding redundancy bounds derived in prior works. The pattern entropy is confined between two bounds that depend on the arrangement of the letter probabilities in the probability space. For very large alphabets whose size may be greater than the coded pattern length, all low probability letters are packed into one symbol. The pattern entropy is upper and lower bounded in terms of the i.i.d. entropy of the new packed alphabet. Correction terms, which are usually negligible, are provided for both upper and lower bounds.

Submitted 12 April, 2005; originally announced April 2005.

Comments: submitted to ITW2005

Showing 1–17 of 17 results for author: Shamir, G