-
Hierarchical Associative Memory, Parallelized MLP-Mixer, and Symmetry Breaking
Authors:
Ryo Karakida,
Toshihiro Ota,
Masato Taki
Abstract:
Transformers have established themselves as the leading neural network model in natural language processing and are increasingly foundational in various domains. In vision, the MLP-Mixer model has demonstrated competitive performance, suggesting that attention mechanisms might not be indispensable. Inspired by this, recent research has explored replacing attention modules with other mechanisms, including those described by MetaFormers. However, the theoretical framework for these models remains underdeveloped. This paper proposes a novel perspective by integrating Krotov's hierarchical associative memory with MetaFormers, enabling a comprehensive representation of the entire Transformer block, encompassing token-/channel-mixing modules, layer normalization, and skip connections, as a single Hopfield network. This approach yields a parallelized MLP-Mixer derived from a three-layer Hopfield network, which naturally incorporates symmetric token-/channel-mixing modules and layer normalization. Empirical studies reveal that symmetric interaction matrices in the model hinder performance in image recognition tasks. Introducing symmetry-breaking effects transitions the performance of the symmetric parallelized MLP-Mixer to that of the vanilla MLP-Mixer. This indicates that during standard training, weight matrices of the vanilla MLP-Mixer spontaneously acquire a symmetry-breaking configuration, enhancing their effectiveness. These findings offer insights into the intrinsic properties of Transformers and MLP-Mixers and their theoretical underpinnings, providing a robust framework for future model design and optimization.
Submitted 17 June, 2024;
originally announced June 2024.
-
Decision Mamba: Reinforcement Learning via Sequence Modeling with Selective State Spaces
Authors:
Toshihiro Ota
Abstract:
Decision Transformer, a promising approach that applies Transformer architectures to reinforcement learning, relies on causal self-attention to model sequences of states, actions, and rewards. While this method has shown competitive results, this paper investigates the integration of the Mamba framework, known for its advanced capabilities in efficient and effective sequence modeling, into the Decision Transformer architecture, focusing on the potential performance enhancements in sequential decision-making tasks. Our study systematically evaluates this integration by conducting a series of experiments across various decision-making environments, comparing the modified Decision Transformer, Decision Mamba, with its traditional counterpart. This work contributes to the advancement of sequential decision-making models, suggesting that the architecture and training methodology of neural networks can significantly impact their performance in complex tasks, and highlighting the potential of Mamba as a valuable tool for improving the efficacy of Transformer-based models in reinforcement learning scenarios.
Submitted 28 March, 2024;
originally announced March 2024.
-
iMixer: hierarchical Hopfield network implies an invertible, implicit and iterative MLP-Mixer
Authors:
Toshihiro Ota,
Masato Taki
Abstract:
In the last few years, the success of Transformers in computer vision has stimulated the discovery of many alternative models that compete with Transformers, such as the MLP-Mixer. Despite their weak inductive bias, these models have achieved performance comparable to well-studied convolutional neural networks. Recent studies on modern Hopfield networks suggest a correspondence between certain energy-based associative memory models and Transformers or the MLP-Mixer, and shed some light on the theoretical background of Transformer-type architecture design. In this paper, we generalize the correspondence to the recently introduced hierarchical Hopfield network, and find iMixer, a novel generalization of the MLP-Mixer model. Unlike ordinary feedforward neural networks, iMixer involves MLP layers that propagate forward from the output side to the input side. We characterize the module as an example of an invertible, implicit, and iterative mixing module. We evaluate the model performance with various datasets on image classification tasks, and find that iMixer, despite its unique architecture, exhibits stable learning capabilities and achieves performance comparable to or better than the baseline vanilla MLP-Mixer. The results imply that the correspondence between the Hopfield networks and the Mixer models serves as a principle for understanding a broader class of Transformer-like architecture designs.
Submitted 1 April, 2024; v1 submitted 25 April, 2023;
originally announced April 2023.
-
Attention in a family of Boltzmann machines emerging from modern Hopfield networks
Authors:
Toshihiro Ota,
Ryo Karakida
Abstract:
Hopfield networks and Boltzmann machines (BMs) are fundamental energy-based neural network models. Recent studies on modern Hopfield networks have broadened the class of energy functions and led to a unified perspective on general Hopfield networks including an attention module. In this letter, we consider the BM counterparts of modern Hopfield networks using the associated energy functions, and study their salient properties from a trainability perspective. In particular, the energy function corresponding to the attention module naturally introduces a novel BM, which we refer to as the attentional BM (AttnBM). We verify that AttnBM has a tractable likelihood function and gradient for certain special cases and is easy to train. Moreover, we reveal the hidden connections between AttnBM and some single-layer models, namely the Gaussian-Bernoulli restricted BM and the denoising autoencoder with softmax units coming from denoising score matching. We also investigate BMs introduced by other energy functions and show that the energy function of dense associative memory models gives BMs belonging to Exponential Family Harmoniums.
Submitted 28 March, 2023; v1 submitted 9 December, 2022;
originally announced December 2022.
-
An examination of applicability of face recognition sensors in public facilities
Authors:
Takuji Takemoto,
Takashi Ota,
Hiroko Oe
Abstract:
This study aimed to explore the usability and applicability of face recognition sensors in public spaces to collect customer footfall data, which could then be analysed and evaluated for facility design and planning. Nine OMRON sensors were provided for the project and installed at five locations in a public facility for three months. The project was carried out by a local consortium with the cooperation of local technology-based small and medium-sized enterprises (SMEs), business organisations, and a local university. The collected data were analysed to produce a report with diagrams and to reveal issues and potential for practical application in the future.
Submitted 19 May, 2020;
originally announced May 2020.
-
Two-Dimensional Source Coding by Means of Subblock Enumeration
Authors:
Takahiro Ota,
Hiroyoshi Morita
Abstract:
A technique of lossless compression via substring enumeration (CSE) attains compression ratios as good as those of popular lossless compressors for one-dimensional (1D) sources. The CSE utilizes a probabilistic model built from the circular string of an input source for encoding the source. The CSE is applicable to two-dimensional (2D) sources such as images by treating a line of pixels of a 2D source as a symbol of an extended alphabet. At the initial step of the CSE encoding process, we need to output the number of occurrences of all symbols of the extended alphabet, so the time complexity increases exponentially as the size of the source becomes large. To reduce the time complexity, we propose a new CSE which can encode a 2D source block by block instead of line by line. The proposed CSE utilizes the flat torus of an input 2D source as a probabilistic model for encoding the source instead of the circular string of the source. Moreover, we analyze the limit of the average codeword length of the proposed CSE for general sources.
Submitted 24 January, 2017;
originally announced January 2017.
-
A Graph Representation for Two-Dimensional Finite Type Constrained Systems
Authors:
Takahiro Ota,
Akiko Manada,
Hiroyoshi Morita
Abstract:
The demand for two-dimensional source coding and constrained coding has been growing in recent years, but compared to the one-dimensional case, many problems remain open because the analysis is cumbersome. A main reason is that no graph representation has been discovered so far. In this paper, we focus on a two-dimensional finite type constrained system, a set of two-dimensional blocks characterized by a finite number of two-dimensional constraints, and propose its graph representation. We then show how to generate an element of the two-dimensional finite type constrained system from the graph representation.
Submitted 2 February, 2016; v1 submitted 1 February, 2016;
originally announced February 2016.
-
Asymptotic Optimality of Antidictionary Codes
Authors:
Takahiro Ota,
Hiroyoshi Morita
Abstract:
An antidictionary code is a lossless compression algorithm using an antidictionary, which is a set of minimal words that do not occur as substrings in an input string. The code was proposed by Crochemore et al. in 2000, and its asymptotic optimality has been proved with respect to only a specific information source, called a balanced binary source, which is a binary Markov source in which a state transition occurs with probability 1/2 or 1. In this paper, we prove the optimality of both static and dynamic antidictionary codes with respect to a stationary ergodic Markov source on a finite alphabet such that a state transition occurs with probability $p$ ($0 < p \leq 1$).
Submitted 1 June, 2010;
originally announced June 2010.