-
Intrinsic Motivation in Dynamical Control Systems
Authors:
Stas Tiomkin,
Ilya Nemenman,
Daniel Polani,
Naftali Tishby
Abstract:
Biological systems often choose actions without an explicit reward signal, a phenomenon known as intrinsic motivation. The computational principles underlying this behavior remain poorly understood. In this study, we investigate an information-theoretic approach to intrinsic motivation, based on maximizing an agent's empowerment (the mutual information between its past actions and future states). We show that this approach generalizes previous attempts to formalize intrinsic motivation, and we provide a computationally efficient algorithm for computing the necessary quantities. We test our approach on several benchmark control problems, and we explain its success in guiding intrinsically motivated behaviors by relating our information-theoretic control function to fundamental properties of the dynamical system representing the combined agent-environment system. This opens the door for designing practical artificial, intrinsically motivated controllers and for linking animal behaviors to their dynamical properties.
Submitted 29 December, 2022;
originally announced January 2023.
-
Detecting chaos in lineage-trees: A deep learning approach
Authors:
Hagai Rappeport,
Irit Levin Reisman,
Naftali Tishby,
Nathalie Q. Balaban
Abstract:
Many complex phenomena, from weather systems to heartbeat rhythm patterns, are effectively modeled as low-dimensional dynamical systems. Such systems may behave chaotically under certain conditions, and so the ability to detect chaos based on empirical measurement is an important step in characterizing and predicting these processes. Classifying a system as chaotic usually requires estimating its largest Lyapunov exponent, which quantifies the average rate of convergence or divergence of initially close trajectories in state space, and for which a positive value is generally accepted as an operational definition of chaos. Estimating the largest Lyapunov exponent from observations of a process is especially challenging in systems affected by dynamical noise, which is the case for many models of real-world processes, in particular models of biological systems. We describe a novel method for estimating the largest Lyapunov exponent from data, based on training Deep Learning models on synthetically generated trajectories, and demonstrate that this method yields accurate and noise-robust predictions given relatively short inputs and across a range of different dynamical systems. Our method is unique in that it can analyze tree-shaped data, a ubiquitous topology in biological settings, and specifically in dynamics over lineages of cells or organisms. We also characterize the types of input information extracted by our models for their predictions, allowing for a deeper understanding of the different ways by which chaos can be analyzed in different topologies.
Submitted 8 June, 2021;
originally announced June 2021.
-
Critical Slowing Down Near Topological Transitions in Rate-Distortion Problems
Authors:
Shlomi Agmon,
Etam Benger,
Or Ordentlich,
Naftali Tishby
Abstract:
In rate-distortion (RD) problems one seeks reduced representations of a source that meet a target distortion constraint. Such optimal representations undergo topological transitions at some critical rate values, when their cardinality or dimensionality changes. We study the convergence time of the Arimoto-Blahut alternating projection algorithms, used to solve such problems, near those critical points, both for the rate-distortion and information bottleneck settings. We argue that they suffer from critical slowing down -- a diverging number of iterations for convergence -- near the critical points. This phenomenon can have theoretical and practical implications for both machine learning and data compression problems.
Submitted 9 May, 2021; v1 submitted 3 March, 2021;
originally announced March 2021.
-
The Dual Information Bottleneck
Authors:
Zoe Piran,
Ravid Shwartz-Ziv,
Naftali Tishby
Abstract:
The Information Bottleneck (IB) framework is a general characterization of optimal representations obtained using a principled approach for balancing accuracy and complexity. Here we present a new framework, the Dual Information Bottleneck (dualIB), which resolves some of the known drawbacks of the IB. We provide a theoretical analysis of the dualIB framework: (i) solving for the structure of its solutions, (ii) unraveling its superiority in optimizing the mean prediction error exponent, and (iii) demonstrating its ability to preserve exponential forms of the original distribution. To approach large-scale problems, we present a novel variational formulation of the dualIB for Deep Neural Networks. In experiments on several data-sets, we compare it to a variational form of the IB. This exposes superior Information Plane properties of the dualIB and its potential to improve the error.
Submitted 8 June, 2020;
originally announced June 2020.
-
Semantic categories of artifacts and animals reflect efficient coding
Authors:
Noga Zaslavsky,
Terry Regier,
Naftali Tishby,
Charles Kemp
Abstract:
It has been argued that semantic categories across languages reflect pressure for efficient communication. Recently, this idea has been cast in terms of a general information-theoretic principle of efficiency, the Information Bottleneck (IB) principle, and it has been shown that this principle accounts for the emergence and evolution of named color categories across languages, including soft structure and patterns of inconsistent naming. However, it is not yet clear to what extent this account generalizes to semantic domains other than color. Here we show that it generalizes to two qualitatively different semantic domains: names for containers, and for animals. First, we show that container naming in Dutch and French is near-optimal in the IB sense, and that IB broadly accounts for soft categories and inconsistent naming patterns in both languages. Second, we show that a hierarchy of animal categories derived from IB captures cross-linguistic tendencies in the growth of animal taxonomies. Taken together, these findings suggest that fundamental information-theoretic principles of efficient coding may shape semantic categories across languages and across domains.
Submitted 11 May, 2019;
originally announced May 2019.
-
Non-linear Canonical Correlation Analysis: A Compressed Representation Approach
Authors:
Amichai Painsky,
Meir Feder,
Naftali Tishby
Abstract:
Canonical Correlation Analysis (CCA) is a linear representation learning method that seeks maximally correlated variables in multi-view data. Non-linear CCA extends this notion to a broader family of transformations, which are more powerful in many real-world applications. Given the joint probability, the Alternating Conditional Expectation (ACE) algorithm provides an optimal solution to the non-linear CCA problem. However, it suffers from limited performance and an increasing computational burden when only a finite number of samples is available. In this work we introduce an information-theoretic compressed representation framework for the non-linear CCA problem (CRCCA), which extends the classical ACE approach. Our suggested framework seeks compact representations of the data that allow a maximal level of correlation. This way we control the trade-off between the flexibility and the complexity of the model. CRCCA provides theoretical bounds and optimality conditions, as we establish fundamental connections to rate-distortion theory, the information bottleneck and remote source coding. In addition, it allows a soft dimensionality reduction, as the compression level is determined by the mutual information between the original noisy data and the extracted signals. Finally, we introduce a simple implementation of the CRCCA framework, based on lattice quantization.
Submitted 10 February, 2020; v1 submitted 31 October, 2018;
originally announced October 2018.
-
Efficient human-like semantic representations via the Information Bottleneck principle
Authors:
Noga Zaslavsky,
Charles Kemp,
Terry Regier,
Naftali Tishby
Abstract:
Maintaining efficient semantic representations of the environment is a major challenge both for humans and for machines. While human languages represent useful solutions to this problem, it is not yet clear what computational principle could give rise to similar solutions in machines. In this work we propose an answer to this open question. We suggest that languages compress percepts into words by optimizing the Information Bottleneck (IB) tradeoff between the complexity and accuracy of their lexicons. We present empirical evidence that this principle may give rise to human-like semantic representations, by exploring how human languages categorize colors. We show that color naming systems across languages are near-optimal in the IB sense, and that these natural systems are similar to artificial IB color naming systems with a single tradeoff parameter controlling the cross-language variability. In addition, the IB systems evolve through a sequence of structural phase transitions, demonstrating a possible adaptation process. This work thus identifies a computational principle that characterizes human semantic systems, and that could usefully inform semantic representations in machines.
Submitted 9 August, 2018;
originally announced August 2018.
-
Color naming reflects both perceptual structure and communicative need
Authors:
Noga Zaslavsky,
Charles Kemp,
Naftali Tishby,
Terry Regier
Abstract:
Gibson et al. (2017) argued that color naming is shaped by patterns of communicative need. In support of this claim, they showed that color naming systems across languages support more precise communication about warm colors than cool colors, and that the objects we talk about tend to be warm-colored rather than cool-colored. Here, we present new analyses that alter this picture. We show that greater communicative precision for warm than for cool colors, and greater communicative need, may both be explained by perceptual structure. However, using an information-theoretic analysis, we also show that color naming across languages bears signs of communicative need beyond what would be predicted by perceptual structure alone. We conclude that color naming is shaped both by perceptual structure, as has traditionally been argued, and by patterns of communicative need, as argued by Gibson et al. - although for reasons other than those they advanced.
Submitted 2 August, 2018; v1 submitted 16 May, 2018;
originally announced May 2018.
-
A General Memory-Bounded Learning Algorithm
Authors:
Michal Moshkovitz,
Naftali Tishby
Abstract:
Designing bounded-memory algorithms is becoming increasingly important nowadays. Previous works studying bounded-memory algorithms focused on proving impossibility results, while the design of bounded-memory algorithms was left relatively unexplored. To remedy this situation, in this work we design a general bounded-memory learning algorithm, when the underlying distribution is known. The core idea of the algorithm is not to save the exact example received, but only a few important bits that give sufficient information. This algorithm applies to any hypothesis class that has an "anti-mixing" property. This paper complements previous works on unlearnability with bounded memory and provides a step towards a full characterization of bounded-memory learning.
Submitted 11 October, 2019; v1 submitted 10 December, 2017;
originally announced December 2017.
-
Gaussian Lower Bound for the Information Bottleneck Limit
Authors:
Amichai Painsky,
Naftali Tishby
Abstract:
The Information Bottleneck (IB) is a conceptual method for extracting the most compact, yet informative, representation of a set of variables, with respect to the target. It generalizes the notion of minimal sufficient statistics from classical parametric statistics to a broader information-theoretic sense. The IB curve defines the optimal trade-off between representation complexity and its predictive power. Specifically, it is achieved by minimizing the level of mutual information (MI) between the representation and the original variables, subject to a minimal level of MI between the representation and the target. This problem is shown to be, in general, NP-hard. One important exception is the multivariate Gaussian case, for which the Gaussian IB (GIB) is known to obtain an analytical closed-form solution, similar to Canonical Correlation Analysis (CCA). In this work we introduce a Gaussian lower bound to the IB curve; we find an embedding of the data which maximizes its "Gaussian part", on which we apply the GIB. This embedding provides an efficient (and practical) representation of any arbitrary data-set (in the IB sense), which in addition holds the favorable properties of a Gaussian distribution. Importantly, we show that the optimal Gaussian embedding is bounded from above by non-linear CCA. This allows a fundamental limit for our ability to Gaussianize arbitrary data-sets and solve complex problems by linear methods.
Submitted 7 November, 2017;
originally announced November 2017.
-
Opening the Black Box of Deep Neural Networks via Information
Authors:
Ravid Shwartz-Ziv,
Naftali Tishby
Abstract:
Despite their great success, there is still no comprehensive theoretical understanding of learning with Deep Neural Networks (DNNs) or their inner organization. Previous work proposed to analyze DNNs in the Information Plane; i.e., the plane of the Mutual Information values that each layer preserves on the input and output variables. They suggested that the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer.
In this work we follow up on this idea and demonstrate the effectiveness of the Information-Plane visualization of DNNs. Our main results are: (i) Most of the training epochs in standard DL are spent on compression of the input to an efficient representation and not on fitting the training labels. (ii) The representation compression phase begins when the training error becomes small and the Stochastic Gradient Descent (SGD) epochs change from a fast drift to smaller training error into a stochastic relaxation, or random diffusion, constrained by the training error value. (iii) The converged layers lie on or very close to the Information Bottleneck (IB) theoretical bound, and the maps from the input to any hidden layer and from this hidden layer to the output satisfy the IB self-consistent equations. This generalization-through-noise mechanism is unique to Deep Neural Networks and absent in one-layer networks. (iv) The training time is dramatically reduced when adding more hidden layers. Thus the main advantage of the hidden layers is computational. This can be explained by the reduced relaxation time, as it scales super-linearly (exponentially for simple diffusion) with the information compression from the previous layer.
Submitted 29 April, 2017; v1 submitted 2 March, 2017;
originally announced March 2017.
-
Mixing Complexity and its Applications to Neural Networks
Authors:
Michal Moshkovitz,
Naftali Tishby
Abstract:
We suggest analyzing neural networks through the prism of space constraints. We observe that most training algorithms applied in practice use bounded memory, which enables us to use a new notion introduced in the study of space-time tradeoffs that we call mixing complexity. This notion was devised in order to measure the (in)ability to learn using a bounded-memory algorithm. In this paper we describe how we use mixing complexity to obtain new results on what can and cannot be learned using neural networks.
Submitted 2 March, 2017;
originally announced March 2017.
-
Control Capacity of Partially Observable Dynamic Systems in Continuous Time
Authors:
Stas Tiomkin,
Daniel Polani,
Naftali Tishby
Abstract:
Stochastic dynamic control systems relate in a probabilistic fashion the space of control signals to the space of corresponding future states. Consequently, stochastic dynamic systems can be interpreted as an information channel between the control space and the state space. In this work we study this control-to-state information capacity of stochastic dynamic systems in continuous time, when the states are observed only partially. The control-to-state capacity, known as empowerment, was shown in the past to be useful in solving various Artificial Intelligence & Control benchmarks, and was used to replace problem-specific utilities. The higher the value of empowerment is, the more optional future states an agent may reach by using its controls inside a given time horizon. The contribution of this work is that we derive an efficient solution for computing the control-to-state information capacity for a linear, partially-observed Gaussian dynamic control system in continuous time, and discover new relationships between control-theoretic and information-theoretic properties of dynamic systems. In particular, using the derived method, we demonstrate that the capacity between the control signal and the system output does not grow without limit with the length of the control signal. This means that only the near-past window of the control signal contributes effectively to the control-to-state capacity, while most of the information beyond this window is irrelevant for the future state of the dynamic system. We show that empowerment depends on a time constant of a dynamic system.
Submitted 18 January, 2017;
originally announced January 2017.
-
Principled Option Learning in Markov Decision Processes
Authors:
Roy Fox,
Michal Moshkovitz,
Naftali Tishby
Abstract:
It is well known that options can make planning more efficient, among their many benefits. Thus far, algorithms for autonomously discovering a set of useful options were heuristic. Naturally, a principled way of finding a set of useful options may be more promising and insightful. In this paper we suggest a mathematical characterization of good sets of options using tools from information theory. This characterization enables us to find conditions for a set of options to be optimal and to derive an algorithm that outputs a useful set of options. We illustrate the proposed algorithm in simulation.
Submitted 30 March, 2017; v1 submitted 18 September, 2016;
originally announced September 2016.
-
Minimum-Information LQG Control - Part II: Retentive Controllers
Authors:
Roy Fox,
Naftali Tishby
Abstract:
Retentive (memory-utilizing) sensing-acting agents may operate under limitations on the communication between their sensing, memory and acting components, requiring them to trade off the external cost that they incur with the capacity of their communication channels. In this paper we formulate this problem as a sequential rate-distortion problem of minimizing the rate of information required for the controller's operation under a constraint on its external cost. We reduce this bounded retentive control problem to the memoryless one, studied in Part I of this work, by viewing the memory reader as one more sensor and the memory writer as one more actuator. We further investigate the structure of the resulting optimal solution and demonstrate its interesting phenomenology.
Submitted 30 March, 2017; v1 submitted 6 June, 2016;
originally announced June 2016.
-
Minimum-Information LQG Control - Part I: Memoryless Controllers
Authors:
Roy Fox,
Naftali Tishby
Abstract:
With the increased demand for power efficiency in feedback-control systems, communication is becoming a limiting factor, raising the need to trade off the external cost that they incur with the capacity of the controller's communication channels. With a proper design of the channels, this translates into a sequential rate-distortion problem, where we minimize the rate of information required for the controller's operation under a constraint on its external cost. Memoryless controllers are of particular interest both for the simplicity and frugality of their implementation and as a basis for studying more complex controllers. In this paper we present the optimality principle for memoryless linear controllers that utilize minimal information rates to achieve a guaranteed external-cost level. We also study the interesting and useful phenomenology of the optimal controller, such as the principled reduction of its order.
Submitted 30 March, 2017; v1 submitted 6 June, 2016;
originally announced June 2016.
-
Memory shapes time perception and intertemporal choices
Authors:
Pedro A. Ortega,
Naftali Tishby
Abstract:
There is a consensus that human and non-human subjects experience temporal distortions in many stages of their perceptual and decision-making systems. Similarly, intertemporal choice research has shown that decision-makers undervalue future outcomes relative to immediate ones. Here we combine techniques from information theory and artificial intelligence to show how both temporal distortions and intertemporal choice preferences can be explained as a consequence of the coding efficiency of sensorimotor representation. In particular, the model implies that interactions that constrain future behavior are perceived as being both longer in duration and more valuable. Furthermore, using simulations of artificial agents, we investigate how memory constraints enforce a renormalization of the perceived timescales. Our results show that qualitatively different discount functions, such as exponential and hyperbolic discounting, arise as a consequence of an agent's probabilistic model of the world.
Submitted 29 May, 2016; v1 submitted 18 April, 2016;
originally announced April 2016.
-
Optimal Selective Attention in Reactive Agents
Authors:
Roy Fox,
Naftali Tishby
Abstract:
In POMDPs, information about the hidden state, delivered through observations, is both valuable to the agent, allowing it to base its actions on better informed internal states, and a "curse", exploding the size and diversity of the internal state space. One attempt to deal with this is to focus on reactive policies, that only base their actions on the most recent observation. However, even reactive policies can be demanding on resources, and agents need to pay selective attention to only some of the information available to them in observations. In this report we present the minimum-information principle for selective attention in reactive agents. We further motivate this approach by reducing the general problem of optimal control in POMDPs, to reactive control with complex observations. Lastly, we explore a newly discovered phenomenon of this optimization process - period doubling bifurcations. This necessitates periodic policies, and raises many more questions regarding stability, periodicity and chaos in optimal control.
Submitted 28 December, 2015;
originally announced December 2015.
-
Taming the Noise in Reinforcement Learning via Soft Updates
Authors:
Roy Fox,
Ari Pakman,
Naftali Tishby
Abstract:
Model-free reinforcement learning algorithms, such as Q-learning, perform poorly in the early stages of learning in noisy environments, because much effort is spent unlearning biased estimates of the state-action value function. The bias results from selecting, among several noisy estimates, the apparent optimum, which may actually be suboptimal. We propose G-learning, a new off-policy learning algorithm that regularizes the value estimates by penalizing deterministic policies in the beginning of the learning process. We show that this method reduces the bias of the value-function estimation, leading to faster convergence to the optimal value and the optimal policy. Moreover, G-learning enables the natural incorporation of prior domain knowledge, when available. The stochastic nature of G-learning also makes it avoid some exploration costs, a property usually attributed only to on-policy algorithms. We illustrate these ideas in several examples, where G-learning results in significant improvements of the convergence rate and the cost of the learning process.
Submitted 30 March, 2017; v1 submitted 28 December, 2015;
originally announced December 2015.
-
Information-Theoretic Bounded Rationality
Authors:
Pedro A. Ortega,
Daniel A. Braun,
Justin Dyer,
Kee-Eung Kim,
Naftali Tishby
Abstract:
Bounded rationality, that is, decision-making and planning under resource limitations, is widely regarded as an important open problem in artificial intelligence, reinforcement learning, computational neuroscience and economics. This paper offers a consolidated presentation of a theory of bounded rationality based on information-theoretic ideas. We provide a conceptual justification for using the free energy functional as the objective function for characterizing bounded-rational decisions. This functional possesses three crucial properties: it controls the size of the solution space; it has Monte Carlo planners that are exact, yet bypass the need for exhaustive search; and it captures model uncertainty arising from lack of evidence or from interacting with other agents having unknown intentions. We discuss the single-step decision-making case, and show how to extend it to sequential decisions using equivalence transformations. This extension yields a very general class of decision problems that encompass classical decision rules (e.g. EXPECTIMAX and MINIMAX) as limit cases, as well as trust- and risk-sensitive planning.
Submitted 21 December, 2015;
originally announced December 2015.
-
Deep Learning and the Information Bottleneck Principle
Authors:
Naftali Tishby,
Noga Zaslavsky
Abstract:
Deep Neural Networks (DNNs) are analyzed via the theoretical framework of the information bottleneck (IB) principle. We first show that any DNN can be quantified by the mutual information between the layers and the input and output variables. Using this representation we can calculate the optimal information theoretic limits of the DNN and obtain finite sample generalization bounds. The advantage of getting closer to the theoretical limit is quantifiable both by the generalization bound and by the network's simplicity. We argue that the optimal architecture, namely the number of layers and the features/connections at each layer, is related to the bifurcation points of the information bottleneck tradeoff, namely, relevant compression of the input layer with respect to the output layer. The hierarchical representations of the layered network naturally correspond to the structural phase transitions along the information curve. We believe that this new insight can lead to new optimality bounds and deep learning algorithms.
Submitted 9 March, 2015;
originally announced March 2015.
-
Multivariate Information Bottleneck
Authors:
Nir Friedman,
Ori Mosenzon,
Noam Slonim,
Naftali Tishby
Abstract:
The Information bottleneck method is an unsupervised non-parametric data organization technique. Given a joint distribution P(A,B), this method constructs a new variable T that extracts partitions, or clusters, over the values of A that are informative about B. The information bottleneck has already been applied to document classification, gene expression, neural code, and spectral analysis. In this paper, we introduce a general principled framework for multivariate extensions of the information bottleneck method. This allows us to consider multiple systems of data partitions that are inter-related. Our approach utilizes Bayesian networks for specifying the systems of clusters and what information each captures. We show that this construction provides insight about bottleneck variations and enables us to characterize solutions of these variations. We also present a general framework for iterative algorithms for constructing solutions, and apply it to several examples.
Submitted 10 January, 2013;
originally announced January 2013.
-
Sufficient Dimensionality Reduction with Irrelevant Statistics
Authors:
Amir Globerson,
Gal Chechik,
Naftali Tishby
Abstract:
The problem of finding a reduced dimensionality representation of categorical variables while preserving their most relevant characteristics is fundamental for the analysis of complex data. Specifically, given a co-occurrence matrix of two variables, one often seeks a compact representation of one variable which preserves information about the other variable. We have recently introduced "Sufficient Dimensionality Reduction" [GT-2003], a method that extracts continuous reduced dimensional features whose measurements (i.e., expectation values) capture maximal mutual information among the variables. However, such measurements often capture information that is irrelevant for a given task. Widely known examples are illumination conditions, which are irrelevant as features for face recognition, writing style, which is irrelevant as a feature for content classification, and intonation, which is irrelevant as a feature for speech recognition. Such irrelevance cannot be deduced a priori, since it depends on the details of the task, and is thus inherently ill-defined in the purely unsupervised case. Separating relevant from irrelevant features can be achieved using additional side data that contains such irrelevant structures. This approach was taken in [CT-2002], extending the information bottleneck method, which uses clustering to compress the data. Here we use this side-information framework to identify features whose measurements are maximally informative for the original data set, but carry as little information as possible on a side data set. In statistical terms this can be understood as extracting statistics which are maximally sufficient for the original dataset, while simultaneously maximally ancillary for the side dataset. We formulate this tradeoff as a constrained optimization problem and characterize its solutions.
We then derive a gradient descent algorithm for this problem, which is based on the Generalized Iterative Scaling method for finding maximum entropy distributions. The method is demonstrated on synthetic data, as well as on real face recognition datasets, and is shown to outperform standard methods such as oriented PCA.
Submitted 19 October, 2012;
originally announced December 2012.
-
The Minimum Information Principle for Discriminative Learning
Authors:
Amir Globerson,
Naftali Tishby
Abstract:
Exponential models of distributions are widely used in machine learning for classification and modelling. It is well known that they can be interpreted as maximum entropy models under empirical expectation constraints. In this work, we argue that for classification tasks, mutual information is a more suitable information theoretic measure to be optimized. We show how the principle of minimum mutual information generalizes that of maximum entropy, and provides a comprehensive framework for building discriminative classifiers. A game theoretic interpretation of our approach is then given, and several generalization bounds provided. We present iterative algorithms for solving the minimum information problem and its convex dual, and demonstrate their performance on various classification tasks. The results show that minimum information classifiers outperform the corresponding maximum entropy models.
Submitted 11 July, 2012;
originally announced July 2012.
-
Bounded Planning in Passive POMDPs
Authors:
Roy Fox,
Naftali Tishby
Abstract:
In Passive POMDPs actions do not affect the world state, but still incur costs. When the agent is bounded by information-processing constraints, it can only keep an approximation of the belief. We present a variational principle for the problem of maintaining the information which is most useful for minimizing the cost, and introduce an efficient and simple algorithm for finding an optimum.
Submitted 27 June, 2012;
originally announced June 2012.
-
Distribution-Dependent Sample Complexity of Large Margin Learning
Authors:
Sivan Sabato,
Nathan Srebro,
Naftali Tishby
Abstract:
We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the margin-adapted dimension, which is a simple function of the second order statistics of the data distribution, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the margin-adapted dimension of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich family of sub-Gaussian distributions with independent features. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. To prove the lower bound, we develop several new tools of independent interest. These include new connections between shattering and hardness of learning, new properties of shattering with linear classifiers, and a new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian variables. Our results can be used to quantitatively compare large margin learning to other learning rules, and to improve the effectiveness of methods that use sample complexity bounds, such as active learning.
Submitted 18 September, 2013; v1 submitted 5 April, 2012;
originally announced April 2012.
-
Multi-Instance Learning with Any Hypothesis Class
Authors:
Sivan Sabato,
Naftali Tishby
Abstract:
In the supervised learning setting termed Multiple-Instance Learning (MIL), the examples are bags of instances, and the bag label is a function of the labels of its instances. Typically, this function is the Boolean OR. The learner observes a sample of bags and the bag labels, but not the instance labels that determine the bag labels. The learner is then required to emit a classification rule for bags based on the sample. MIL has numerous applications, and many heuristic algorithms have been used successfully on this problem, each adapted to specific settings or applications. In this work we provide a unified theoretical analysis for MIL, which holds for any underlying hypothesis class, regardless of a specific application or problem domain. We show that the sample complexity of MIL is only poly-logarithmically dependent on the size of the bag, for any underlying hypothesis class. In addition, we introduce a new PAC-learning algorithm for MIL, which uses a regular supervised learning algorithm as an oracle. We prove that efficient PAC-learning for MIL can be generated from any efficient non-MIL supervised learning algorithm that handles one-sided error. The computational complexity of the resulting algorithm is only polynomially dependent on the bag size.
Submitted 13 August, 2012; v1 submitted 11 July, 2011;
originally announced July 2011.
-
Tight Sample Complexity of Large-Margin Learning
Authors:
Sivan Sabato,
Nathan Srebro,
Naftali Tishby
Abstract:
We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L_2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution's covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions.
Submitted 5 April, 2012; v1 submitted 23 November, 2010;
originally announced November 2010.
-
Predictability, complexity and learning
Authors:
William Bialek,
Ilya Nemenman,
Naftali Tishby
Abstract:
We define predictive information $I_{\rm pred}(T)$ as the mutual information between the past and the future of a time series. Three qualitatively different behaviors are found in the limit of large observation times $T$: $I_{\rm pred}(T)$ can remain finite, grow logarithmically, or grow as a fractional power law. If the time series allows us to learn a model with a finite number of parameters, then $I_{\rm pred}(T)$ grows logarithmically with a coefficient that counts the dimensionality of the model space. In contrast, power-law growth is associated, for example, with the learning of infinite parameter (or nonparametric) models such as continuous functions with smoothness constraints. There are connections between the predictive information and measures of complexity that have been defined both in learning theory and in the analysis of physical systems through statistical mechanics and dynamical systems theory. Further, in the same way that entropy provides the unique measure of available information consistent with some simple and plausible conditions, we argue that the divergent part of $I_{\rm pred}(T)$ provides the unique measure for the complexity of dynamics underlying a time series. Finally, we discuss how these ideas may be useful in different problems in physics, statistics, and biology.
Submitted 23 January, 2001; v1 submitted 19 July, 2000;
originally announced July 2000.
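As a rough illustration of the three regimes named in the abstract above, the asymptotic forms can be sketched as follows. The symbols $K$ (number of model parameters) and $\alpha$ (power-law exponent) do not appear in the abstract and are introduced here only for illustration. The factor $K/2$ is an assumed reading of the "coefficient that counts the dimensionality of the model space".
$$
I_{\rm pred}(T) \to \mathrm{const}, \qquad
I_{\rm pred}(T) \approx \frac{K}{2}\,\log T, \qquad
I_{\rm pred}(T) \propto T^{\alpha}, \quad 0 < \alpha < 1 .
$$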
-
The information bottleneck method
Authors:
Naftali Tishby,
Fernando C. Pereira,
William Bialek
Abstract:
We define the relevant information in a signal $x\in X$ as being the information that this signal provides about another signal $y\in Y$. Examples include the information that face images provide about the names of the people portrayed, or the information that speech sounds provide about the words spoken. Understanding the signal $x$ requires more than just predicting $y$, it also requires specifying which features of $X$ play a role in the prediction. We formalize this problem as that of finding a short code for $X$ that preserves the maximum information about $Y$. That is, we squeeze the information that $X$ provides about $Y$ through a 'bottleneck' formed by a limited set of codewords $\tilde{X}$. This constrained optimization problem can be seen as a generalization of rate distortion theory in which the distortion measure $d(x,\tilde{x})$ emerges from the joint statistics of $X$ and $Y$. This approach yields an exact set of self consistent equations for the coding rules $X \to \tilde{X}$ and $\tilde{X} \to Y$. Solutions to these equations can be found by a convergent re-estimation method that generalizes the Blahut-Arimoto algorithm. Our variational principle provides a surprisingly rich framework for discussing a variety of problems in signal processing and learning, as will be described in detail elsewhere.
Submitted 24 April, 2000;
originally announced April 2000.
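For orientation, the variational principle and self-consistent equations that the abstract above refers to are commonly written as sketched below. The Lagrange multiplier $\beta$, the normalization $Z(x,\beta)$, and the Kullback-Leibler divergence $D_{\rm KL}$ are not named in the abstract; they are assumptions of this sketch, which follows the standard presentation of the method rather than quoting the paper.
$$
\min_{p(\tilde{x}|x)} \; I(X;\tilde{X}) - \beta\, I(\tilde{X};Y)
$$
$$
p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(x,\beta)} \exp\!\Big(-\beta\, D_{\rm KL}\big[\,p(y|x)\,\|\,p(y|\tilde{x})\,\big]\Big), \qquad
p(\tilde{x}) = \sum_{x} p(x)\, p(\tilde{x}|x), \qquad
p(y|\tilde{x}) = \frac{1}{p(\tilde{x})} \sum_{x} p(y|x)\, p(\tilde{x}|x)\, p(x)
$$
In this form the divergence $D_{\rm KL}[p(y|x)\,\|\,p(y|\tilde{x})]$ plays the role of the emergent distortion measure $d(x,\tilde{x})$, and iterating the three equations in turn is the re-estimation scheme that generalizes Blahut-Arimoto.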
-
Beyond Word N-Grams
Authors:
Fernando C. N. Pereira,
Yoram Singer,
Naftali Tishby
Abstract:
We describe, analyze, and evaluate experimentally a new probabilistic model for word-sequence prediction in natural language based on prediction suffix trees (PSTs). By using efficient data structures, we extend the notion of PST to unbounded vocabularies. We also show how to use a Bayesian approach based on recursive priors over all possible PSTs to efficiently maintain tree mixtures. These mixtures have provably and practically better performance than almost any single model. We evaluate the model on several corpora. The low perplexity achieved by relatively small PST mixture models suggests that they may be an advantageous alternative, both theoretically and practically, to the widely used n-gram models.
Submitted 13 July, 1996;
originally announced July 1996.
-
Distributional Clustering of English Words
Authors:
Fernando Pereira,
Naftali Tishby,
Lillian Lee
Abstract:
We describe and experimentally evaluate a method for automatically clustering words according to their distribution in particular syntactic contexts. Deterministic annealing is used to find lowest distortion sets of clusters. As the annealing parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data. Clusters are used as the basis for class models of word co-occurrence, and the models are evaluated with respect to held-out test data.
Submitted 22 August, 1994;
originally announced August 1994.