Search | arXiv e-print repository

World Models Increase Autonomy in Reinforcement Learning

Authors: Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat, Edward S. Hu

Abstract: Reinforcement learning (RL) is an appealing paradigm for training intelligent agents, enabling policy acquisition from the agent's own autonomously acquired experience. However, the training process of RL is far from automatic, requiring extensive human effort to reset the agent and environments. To tackle the challenging reset-free setting, we first demonstrate the superiority of model-based (MB)… ▽ More Reinforcement learning (RL) is an appealing paradigm for training intelligent agents, enabling policy acquisition from the agent's own autonomously acquired experience. However, the training process of RL is far from automatic, requiring extensive human effort to reset the agent and environments. To tackle the challenging reset-free setting, we first demonstrate the superiority of model-based (MB) RL methods in such setting, showing that a straightforward adaptation of MBRL can outperform all the prior state-of-the-art methods while requiring less supervision. We then identify limitations inherent to this direct extension and propose a solution called model-based reset-free (MoReFree) agent, which further enhances the performance. MoReFree adapts two key mechanisms, exploration and policy learning, to handle reset-free tasks by prioritizing task-relevant states. It exhibits superior data-efficiency across various reset-free tasks without access to environmental reward or demonstrations while significantly outperforming privileged baselines that require supervision. Our findings suggest model-based methods hold significant promise for reducing human effort in RL. Website: https://sites.google.com/view/morefree △ Less

Submitted 20 August, 2024; v1 submitted 19 August, 2024; originally announced August 2024.

arXiv:2407.18597 [pdf, other]

Reinforcement Learning for Sustainable Energy: A Survey

Authors: Koen Ponse, Felix Kleuker, Márton Fejér, Álvaro Serra-Gómez, Aske Plaat, Thomas Moerland

Abstract: The transition to sustainable energy is a key challenge of our time, requiring modifications in the entire pipeline of energy production, storage, transmission, and consumption. At every stage, new sequential decision-making challenges emerge, ranging from the operation of wind farms to the management of electrical grids or the scheduling of electric vehicle charging stations. All such problems ar… ▽ More The transition to sustainable energy is a key challenge of our time, requiring modifications in the entire pipeline of energy production, storage, transmission, and consumption. At every stage, new sequential decision-making challenges emerge, ranging from the operation of wind farms to the management of electrical grids or the scheduling of electric vehicle charging stations. All such problems are well suited for reinforcement learning, the branch of machine learning that learns behavior from data. Therefore, numerous studies have explored the use of reinforcement learning for sustainable energy. This paper surveys this literature with the intention of bridging both the underlying research communities: energy and machine learning. After a brief introduction of both fields, we systematically list relevant sustainability challenges, how they can be modeled as a reinforcement learning problem, and what solution approaches currently exist in the literature. Afterwards, we zoom out and identify overarching reinforcement learning themes that appear throughout sustainability, such as multi-agent, offline, and safe reinforcement learning. Lastly, we also cover standardization of environments, which will be crucial for connecting both research fields, and highlight potential directions for future work. In summary, this survey provides an extensive overview of reinforcement learning methods for sustainable energy, which may play a vital role in the energy transition. △ Less

Submitted 26 July, 2024; originally announced July 2024.

Comments: 22 pages excluding references, 40 pages including references, 7 images

arXiv:2407.11511 [pdf, other]

Reasoning with Large Language Models, a Survey

Authors: Aske Plaat, Annie Wong, Suzan Verberne, Joost Broekens, Niki van Stein, Thomas Back

Abstract: Scaling up language models to billions of parameters has opened up possibilities for in-context learning, allowing instruction tuning and few-shot learning on tasks that the model was not specifically trained for. This has achieved breakthrough performance on language tasks such as translation, summarization, and question-answering. Furthermore, in addition to these associative "System 1" tasks, r… ▽ More Scaling up language models to billions of parameters has opened up possibilities for in-context learning, allowing instruction tuning and few-shot learning on tasks that the model was not specifically trained for. This has achieved breakthrough performance on language tasks such as translation, summarization, and question-answering. Furthermore, in addition to these associative "System 1" tasks, recent advances in Chain-of-thought prompt learning have demonstrated strong "System 2" reasoning abilities, answering a question in the field of artificial general intelligence whether LLMs can reason. The field started with the question whether LLMs can solve grade school math word problems. This paper reviews the rapidly expanding field of prompt-based reasoning with LLMs. Our taxonomy identifies different ways to generate, evaluate, and control multi-step reasoning. We provide an in-depth coverage of core approaches and open problems, and we propose a research agenda for the near future. Finally, we highlight the relation between reasoning and prompt-based learning, and we discuss the relation between reasoning, sequential decision processes, and reinforcement learning. We find that self-improvement, self-reflection, and some metacognitive abilities of the reasoning processes are possible through the judicious use of prompts. True self-improvement and self-reasoning, to go from reasoning with LLMs to reasoning by LLMs, remains future work. △ Less

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2405.13092 [pdf, other]

CausalPlayground: Addressing Data-Generation Requirements in Cutting-Edge Causality Research

Authors: Andreas W M Sauter, Erman Acar, Aske Plaat

Abstract: Research on causal effects often relies on synthetic data due to the scarcity of real-world datasets with ground-truth effects. Since current data-generating tools do not always meet all requirements for state-of-the-art research, ad-hoc methods are often employed. This leads to heterogeneity among datasets and delays research progress. We address the shortcomings of current data-generating librar… ▽ More Research on causal effects often relies on synthetic data due to the scarcity of real-world datasets with ground-truth effects. Since current data-generating tools do not always meet all requirements for state-of-the-art research, ad-hoc methods are often employed. This leads to heterogeneity among datasets and delays research progress. We address the shortcomings of current data-generating libraries by introducing CausalPlayground, a Python library that provides a standardized platform for generating, sampling, and sharing structural causal models (SCMs). CausalPlayground offers fine-grained control over SCMs, interventions, and the generation of datasets of SCMs for learning and quantitative research. Furthermore, by integrating with Gymnasium, the standard framework for reinforcement learning (RL) environments, we enable online interaction with the SCMs. Overall, by introducing CausalPlayground we aim to foster more efficient and comparable research in the field. All code and API documentation is available at https://github.com/sa-and/CausalPlayground. △ Less

Submitted 21 May, 2024; originally announced May 2024.

arXiv:2403.13705 [pdf, ps, other]

Research Re: search & Re-search

Authors: Aske Plaat

Abstract: Search algorithms are often categorized by their node expansion strategy. One option is the depth-first strategy, a simple backtracking strategy that traverses the search space in the order in which successor nodes are generated. An alternative is the best-first strategy, which was designed to make it possible to use domain-specific heuristic information. By exploring promising parts of the search… ▽ More Search algorithms are often categorized by their node expansion strategy. One option is the depth-first strategy, a simple backtracking strategy that traverses the search space in the order in which successor nodes are generated. An alternative is the best-first strategy, which was designed to make it possible to use domain-specific heuristic information. By exploring promising parts of the search space first, best-first algorithms are usually more efficient than depth-first algorithms. In programs that play minimax games such as chess and checkers, the efficiency of the search is of crucial importance. Given the success of best-first algorithms in other domains, one would expect them to be used for minimax games too. However, all high-performance game-playing programs are based on a depth-first algorithm. This study takes a closer look at a depth-first algorithm, AB, and a best-first algorithm, SSS. The prevailing opinion on these algorithms is that SSS offers the potential for a more efficient search, but that its complicated formulation and exponential memory requirements render it impractical. The theoretical part of this work shows that there is a surprisingly straightforward link between the two algorithms -- for all practical purposes, SSS is a special case of AB. Subsequent empirical evidence proves the prevailing opinion on SSS to be wrong: it is not a complicated algorithm, it does not need too much memory, and it is also not more efficient than depth-first search. △ Less

Submitted 20 March, 2024; originally announced March 2024.

Comments: PhD thesis Aske Plaat 20 June 1996. AlphaBeta, SSS*, MTD(f)

arXiv:2403.09713 [pdf, other]

doi 10.1613/jair.1.15135

A Hybrid Intelligence Method for Argument Mining

Authors: Michiel van der Meer, Enrico Liscio, Catholijn M. Jonker, Aske Plaat, Piek Vossen, Pradeep K. Murukannaiah

Abstract: Large-scale survey tools enable the collection of citizen feedback in opinion corpora. Extracting the key arguments from a large and noisy set of opinions helps in understanding the opinions quickly and accurately. Fully automated methods can extract arguments but (1) require large labeled datasets that induce large annotation costs and (2) work well for known viewpoints, but not for novel points… ▽ More Large-scale survey tools enable the collection of citizen feedback in opinion corpora. Extracting the key arguments from a large and noisy set of opinions helps in understanding the opinions quickly and accurately. Fully automated methods can extract arguments but (1) require large labeled datasets that induce large annotation costs and (2) work well for known viewpoints, but not for novel points of view. We propose HyEnA, a hybrid (human + AI) method for extracting arguments from opinionated texts, combining the speed of automated processing with the understanding and reasoning capabilities of humans. We evaluate HyEnA on three citizen feedback corpora. We find that, on the one hand, HyEnA achieves higher coverage and precision than a state-of-the-art automated method when compared to a common set of diverse opinions, justifying the need for human insight. On the other hand, HyEnA requires less human effort and does not compromise quality compared to (fully manual) expert analysis, demonstrating the benefit of combining human and artificial intelligence. △ Less

Submitted 1 August, 2024; v1 submitted 11 March, 2024; originally announced March 2024.

Comments: Published in JAIR

Journal ref: Journal of Artificial Intelligence Research (JAIR), 80:1187-1222, 2024

arXiv:2402.06912 [pdf, other]

Solving Deep Reinforcement Learning Tasks with Evolution Strategies and Linear Policy Networks

Authors: Annie Wong, Jacob de Nobel, Thomas Bäck, Aske Plaat, Anna V. Kononova

Abstract: Although deep reinforcement learning methods can learn effective policies for challenging problems such as Atari games and robotics tasks, algorithms are complex, and training times are often long. This study investigates how Evolution Strategies perform compared to gradient-based deep reinforcement learning methods. We use Evolution Strategies to optimize the weights of a neural network via neuro… ▽ More Although deep reinforcement learning methods can learn effective policies for challenging problems such as Atari games and robotics tasks, algorithms are complex, and training times are often long. This study investigates how Evolution Strategies perform compared to gradient-based deep reinforcement learning methods. We use Evolution Strategies to optimize the weights of a neural network via neuroevolution, performing direct policy search. We benchmark both deep policy networks and networks consisting of a single linear layer from observations to actions for three gradient-based methods, such as Proximal Policy Optimization. These methods are evaluated against three classical Evolution Strategies and Augmented Random Search, which all use linear policy networks. Our results reveal that Evolution Strategies can find effective linear policies for many reinforcement learning benchmark tasks, unlike deep reinforcement learning methods that can only find successful policies using much larger networks, suggesting that current benchmarks are easier to solve than previously assumed. Interestingly, Evolution Strategies also achieve results comparable to gradient-based deep reinforcement learning algorithms for higher-complexity tasks. Furthermore, we find that by directly accessing the memory state of the game, Evolution Strategies can find successful policies in Atari that outperform the policies found by Deep Q-Learning. Evolution Strategies also outperform Augmented Random Search in most benchmarks, demonstrating superior sample efficiency and robustness in training linear policy networks. △ Less

Submitted 24 July, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

arXiv:2402.03326 [pdf, other]

Slot Structured World Models

Authors: Jonathan Collu, Riccardo Majellaro, Aske Plaat, Thomas M. Moerland

Abstract: The ability to perceive and reason about individual objects and their interactions is a goal to be achieved for building intelligent artificial systems. State-of-the-art approaches use a feedforward encoder to extract object embeddings and a latent graph neural network to model the interaction between these object embeddings. However, the feedforward encoder can not extract {\it object-centric} re… ▽ More The ability to perceive and reason about individual objects and their interactions is a goal to be achieved for building intelligent artificial systems. State-of-the-art approaches use a feedforward encoder to extract object embeddings and a latent graph neural network to model the interaction between these object embeddings. However, the feedforward encoder can not extract {\it object-centric} representations, nor can it disentangle multiple objects with similar appearance. To solve these issues, we introduce {\it Slot Structured World Models} (SSWM), a class of world models that combines an {\it object-centric} encoder (based on Slot Attention) with a latent graph-based dynamics model. We evaluate our method in the Spriteworld benchmark with simple rules of physical interaction, where Slot Structured World Models consistently outperform baselines on a range of (multi-step) prediction tasks with action-conditional object interactions. All code to reproduce paper experiments is available from \url{https://github.com/JonathanCollu/Slot-Structured-World-Models}. △ Less

Submitted 8 January, 2024; originally announced February 2024.

arXiv:2401.16974 [pdf, other]

CORE: Towards Scalable and Efficient Causal Discovery with Reinforcement Learning

Authors: Andreas W. M. Sauter, Nicolò Botteghi, Erman Acar, Aske Plaat

Abstract: Causal discovery is the challenging task of inferring causal structure from data. Motivated by Pearl's Causal Hierarchy (PCH), which tells us that passive observations alone are not enough to distinguish correlation from causation, there has been a recent push to incorporate interventions into machine learning research. Reinforcement learning provides a convenient framework for such an active appr… ▽ More Causal discovery is the challenging task of inferring causal structure from data. Motivated by Pearl's Causal Hierarchy (PCH), which tells us that passive observations alone are not enough to distinguish correlation from causation, there has been a recent push to incorporate interventions into machine learning research. Reinforcement learning provides a convenient framework for such an active approach to learning. This paper presents CORE, a deep reinforcement learning-based approach for causal discovery and intervention planning. CORE learns to sequentially reconstruct causal graphs from data while learning to perform informative interventions. Our results demonstrate that CORE generalizes to unseen graphs and efficiently uncovers causal structures. Furthermore, CORE scales to larger graphs with up to 10 variables and outperforms existing approaches in structure estimation accuracy and sample efficiency. All relevant code and supplementary material can be found at https://github.com/sa-and/CORE △ Less

Submitted 30 January, 2024; originally announced January 2024.

Comments: To be published In Proc. of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2024), Auckland, New Zealand, May 6 - 10, 2024, IFAAMAS

ACM Class: I.2.6; I.2.8

arXiv:2401.10148 [pdf, other]

Explicitly Disentangled Representations in Object-Centric Learning

Authors: Riccardo Majellaro, Jonathan Collu, Aske Plaat, Thomas M. Moerland

Abstract: Extracting structured representations from raw visual data is an important and long-standing challenge in machine learning. Recently, techniques for unsupervised learning of object-centric representations have raised growing interest. In this context, enhancing the robustness of the latent features can improve the efficiency and effectiveness of the training of downstream tasks. A promising step i… ▽ More Extracting structured representations from raw visual data is an important and long-standing challenge in machine learning. Recently, techniques for unsupervised learning of object-centric representations have raised growing interest. In this context, enhancing the robustness of the latent features can improve the efficiency and effectiveness of the training of downstream tasks. A promising step in this direction is to disentangle the factors that cause variation in the data. Previously, Invariant Slot Attention disentangled position, scale, and orientation from the remaining features. Extending this approach, we focus on separating the shape and texture components. In particular, we propose a novel architecture that biases object-centric models toward disentangling shape and texture components into two non-overlapping subsets of the latent space dimensions. These subsets are known a priori, hence before the training process. Experiments on a range of object-centric benchmarks reveal that our approach achieves the desired disentanglement while also numerically improving baseline performance in most cases. In addition, we show that our method can generate novel textures for a specific object or transfer textures between objects with distinct shapes. △ Less

Submitted 18 January, 2024; originally announced January 2024.

arXiv:2311.10590 [pdf, other]

EduGym: An Environment and Notebook Suite for Reinforcement Learning Education

Authors: Thomas M. Moerland, Matthias Müller-Brockhausen, Zhao Yang, Andrius Bernatavicius, Koen Ponse, Tom Kouwenhoven, Andreas Sauter, Michiel van der Meer, Bram Renting, Aske Plaat

Abstract: Due to the empirical success of reinforcement learning, an increasing number of students study the subject. However, from our practical teaching experience, we see students entering the field (bachelor, master and early PhD) often struggle. On the one hand, textbooks and (online) lectures provide the fundamentals, but students find it hard to translate between equations and code. On the other hand… ▽ More Due to the empirical success of reinforcement learning, an increasing number of students study the subject. However, from our practical teaching experience, we see students entering the field (bachelor, master and early PhD) often struggle. On the one hand, textbooks and (online) lectures provide the fundamentals, but students find it hard to translate between equations and code. On the other hand, public codebases do provide practical examples, but the implemented algorithms tend to be complex, and the underlying test environments contain multiple reinforcement learning challenges at once. Although this is realistic from a research perspective, it often hinders educational conceptual understanding. To solve this issue we introduce EduGym, a set of educational reinforcement learning environments and associated interactive notebooks tailored for education. Each EduGym environment is specifically designed to illustrate a certain aspect/challenge of reinforcement learning (e.g., exploration, partial observability, stochasticity, etc.), while the associated interactive notebook explains the challenge and its possible solution approaches, connecting equations and code in a single document. An evaluation among RL students and researchers shows 86% of them think EduGym is a useful tool for reinforcement learning education. All notebooks are available from https://www.edugym.org/, while the full software package can be installed from https://github.com/RLG-Leiden/edugym. △ Less

Submitted 22 February, 2024; v1 submitted 17 November, 2023; originally announced November 2023.

arXiv:2310.14139 [pdf, other]

Are LSTMs Good Few-Shot Learners?

Authors: Mike Huisman, Thomas M. Moerland, Aske Plaat, Jan N. van Rijn

Abstract: Deep learning requires large amounts of data to learn new tasks well, limiting its applicability to domains where such data is available. Meta-learning overcomes this limitation by learning how to learn. In 2001, Hochreiter et al. showed that an LSTM trained with backpropagation across different tasks is capable of meta-learning. Despite promising results of this approach on small problems, and mo… ▽ More Deep learning requires large amounts of data to learn new tasks well, limiting its applicability to domains where such data is available. Meta-learning overcomes this limitation by learning how to learn. In 2001, Hochreiter et al. showed that an LSTM trained with backpropagation across different tasks is capable of meta-learning. Despite promising results of this approach on small problems, and more recently, also on reinforcement learning problems, the approach has received little attention in the supervised few-shot learning setting. We revisit this approach and test it on modern few-shot learning benchmarks. We find that LSTM, surprisingly, outperform the popular meta-learning technique MAML on a simple few-shot sine wave regression benchmark, but that LSTM, expectedly, fall short on more complex few-shot image classification benchmarks. We identify two potential causes and propose a new method called Outer Product LSTM (OP-LSTM) that resolves these issues and displays substantial performance gains over the plain LSTM. Compared to popular meta-learning baselines, OP-LSTM yields competitive performance on within-domain few-shot image classification, and performs better in cross-domain settings by 0.5% to 1.9% in accuracy score. While these results alone do not set a new state-of-the-art, the advances of OP-LSTM are orthogonal to other advances in the field of meta-learning, yield new insights in how LSTM work in image classification, allowing for a whole range of new research directions. For reproducibility purposes, we publish all our research code publicly. △ Less

Submitted 21 October, 2023; originally announced October 2023.

Comments: Accepted at Machine Learning Journal, Special Issue of the ECML PKDD 2023 Journal Track

arXiv:2310.09028 [pdf, other]

Subspace Adaptation Prior for Few-Shot Learning

Authors: Mike Huisman, Aske Plaat, Jan N. van Rijn

Abstract: Gradient-based meta-learning techniques aim to distill useful prior knowledge from a set of training tasks such that new tasks can be learned more efficiently with gradient descent. While these methods have achieved successes in various scenarios, they commonly adapt all parameters of trainable layers when learning new tasks. This neglects potentially more efficient learning strategies for a given… ▽ More Gradient-based meta-learning techniques aim to distill useful prior knowledge from a set of training tasks such that new tasks can be learned more efficiently with gradient descent. While these methods have achieved successes in various scenarios, they commonly adapt all parameters of trainable layers when learning new tasks. This neglects potentially more efficient learning strategies for a given task distribution and may be susceptible to overfitting, especially in few-shot learning where tasks must be learned from a limited number of examples. To address these issues, we propose Subspace Adaptation Prior (SAP), a novel gradient-based meta-learning algorithm that jointly learns good initialization parameters (prior knowledge) and layer-wise parameter subspaces in the form of operation subsets that should be adaptable. In this way, SAP can learn which operation subsets to adjust with gradient descent based on the underlying task distribution, simultaneously decreasing the risk of overfitting when learning new tasks. We demonstrate that this ability is helpful as SAP yields superior or competitive performance in few-shot image classification settings (gains between 0.1% and 3.9% in accuracy). Analysis of the learned subspaces demonstrates that low-dimensional operations often yield high activation strengths, indicating that they may be important for achieving good few-shot learning performance. For reproducibility purposes, we publish all our research code publicly. △ Less

Submitted 13 October, 2023; originally announced October 2023.

Comments: Accepted at Machine Learning Journal, Special Issue of the ECML PKDD 2023 Journal Track

arXiv:2310.06148 [pdf, other]

Understanding Transfer Learning and Gradient-Based Meta-Learning Techniques

Authors: Mike Huisman, Aske Plaat, Jan N. van Rijn

Abstract: Deep neural networks can yield good performance on various tasks but often require large amounts of data to train them. Meta-learning received considerable attention as one approach to improve the generalization of these networks from a limited amount of data. Whilst meta-learning techniques have been observed to be successful at this in various scenarios, recent results suggest that when evaluate… ▽ More Deep neural networks can yield good performance on various tasks but often require large amounts of data to train them. Meta-learning received considerable attention as one approach to improve the generalization of these networks from a limited amount of data. Whilst meta-learning techniques have been observed to be successful at this in various scenarios, recent results suggest that when evaluated on tasks from a different data distribution than the one used for training, a baseline that simply finetunes a pre-trained network may be more effective than more complicated meta-learning techniques such as MAML, which is one of the most popular meta-learning techniques. This is surprising as the learning behaviour of MAML mimics that of finetuning: both rely on re-using learned features. We investigate the observed performance differences between finetuning, MAML, and another meta-learning technique called Reptile, and show that MAML and Reptile specialize for fast adaptation in low-data regimes of similar data distribution as the one used for training. Our findings show that both the output layer and the noisy training conditions induced by data scarcity play important roles in facilitating this specialization for MAML. Lastly, we show that the pre-trained features as obtained by the finetuning baseline are more diverse and discriminative than those learned by MAML and Reptile. Due to this lack of diversity and distribution specialization, MAML and Reptile may fail to generalize to out-of-distribution tasks whereas finetuning can fall back on the diversity of the learned features. △ Less

Submitted 9 October, 2023; originally announced October 2023.

Comments: Accepted at Machine Learning Journal, Special Issue on Discovery Science 2021

arXiv:2309.07033 [pdf, other]

Human-Robot Co-Creativity: A Scoping Review -- Informing a Research Agenda for Human-Robot Co-Creativity with Older Adults

Authors: Marianne Bossema, Somaya Ben Allouch, Aske Plaat, Rob Saunders

Abstract: This review is the first step in a long-term research project exploring how social robotics and AI-generated content can contribute to the creative experiences of older adults, with a focus on collaborative drawing and painting. We systematically searched and selected literature on human-robot co-creativity, and analyzed articles to identify methods and strategies for researching co-creative robot… ▽ More This review is the first step in a long-term research project exploring how social robotics and AI-generated content can contribute to the creative experiences of older adults, with a focus on collaborative drawing and painting. We systematically searched and selected literature on human-robot co-creativity, and analyzed articles to identify methods and strategies for researching co-creative robotics. We found that none of the studies involved older adults, which shows the gap in the literature for this often involved participant group in robotics research. The analyzed literature provides valuable insights into the design of human-robot co-creativity and informs a research agenda to further investigate the topic with older adults. We argue that future research should focus on ecological and developmental perspectives on creativity, on how system behavior can be aligned with the values of older adults, and on the system structures that support this best. △ Less

Submitted 15 September, 2023; v1 submitted 13 September, 2023; originally announced September 2023.

arXiv:2309.01664 [pdf, ps, other]

Fine-grained Affective Processing Capabilities Emerging from Large Language Models

Authors: Joost Broekens, Bernhard Hilpert, Suzan Verberne, Kim Baraka, Patrick Gebhard, Aske Plaat

Abstract: Large language models, in particular generative pre-trained transformers (GPTs), show impressive results on a wide variety of language-related tasks. In this paper, we explore ChatGPT's zero-shot ability to perform affective computing tasks using prompting alone. We show that ChatGPT a) performs meaningful sentiment analysis in the Valence, Arousal and Dominance dimensions, b) has meaningful emoti… ▽ More Large language models, in particular generative pre-trained transformers (GPTs), show impressive results on a wide variety of language-related tasks. In this paper, we explore ChatGPT's zero-shot ability to perform affective computing tasks using prompting alone. We show that ChatGPT a) performs meaningful sentiment analysis in the Valence, Arousal and Dominance dimensions, b) has meaningful emotion representations in terms of emotion categories and these affective dimensions, and c) can perform basic appraisal-based emotion elicitation of situations based on a prompt-based computational implementation of the OCC appraisal model. These findings are highly relevant: First, they show that the ability to solve complex affect processing tasks emerges from language-based token prediction trained on extensive data sets. Second, they show the potential of large language models for simulating, processing and analyzing human emotions, which has important implications for various applications such as sentiment analysis, socially interactive agents, and social robotics. △ Less

Submitted 4 September, 2023; originally announced September 2023.

arXiv:2304.10098 [pdf, other]

Two-Memory Reinforcement Learning

Authors: Zhao Yang, Thomas. M. Moerland, Mike Preuss, Aske Plaat

Abstract: While deep reinforcement learning has shown important empirical success, it tends to learn relatively slow due to slow propagation of rewards information and slow update of parametric neural networks. Non-parametric episodic memory, on the other hand, provides a faster learning alternative that does not require representation learning and uses maximum episodic return as state-action values for act… ▽ More While deep reinforcement learning has shown important empirical success, it tends to learn relatively slow due to slow propagation of rewards information and slow update of parametric neural networks. Non-parametric episodic memory, on the other hand, provides a faster learning alternative that does not require representation learning and uses maximum episodic return as state-action values for action selection. Episodic memory and reinforcement learning both have their own strengths and weaknesses. Notably, humans can leverage multiple memory systems concurrently during learning and benefit from all of them. In this work, we propose a method called Two-Memory reinforcement learning agent (2M) that combines episodic memory and reinforcement learning to distill both of their strengths. The 2M agent exploits the speed of the episodic memory part and the optimality and the generalization capacity of the reinforcement learning part to complement each other. Our experiments demonstrate that the 2M agent is more data efficient and outperforms both pure episodic memory and pure reinforcement learning, as well as a state-of-the-art memory-augmented RL agent. Moreover, the proposed approach provides a general framework that can be used to combine any episodic memory agent with other off-policy reinforcement learning algorithms. △ Less

Submitted 23 April, 2023; v1 submitted 20 April, 2023; originally announced April 2023.

arXiv:2212.03251 [pdf, other]

First Go, then Post-Explore: the Benefits of Post-Exploration in Intrinsic Motivation

Authors: Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat

Abstract: Go-Explore achieved breakthrough performance on challenging reinforcement learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that successful exploration requires an agent to first return to an interesting state ('Go'), and only then explore into unknown terrain ('Explore'). We refer to such exploration after a goal is reached as 'post-exploration'. In this paper, we present… ▽ More Go-Explore achieved breakthrough performance on challenging reinforcement learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that successful exploration requires an agent to first return to an interesting state ('Go'), and only then explore into unknown terrain ('Explore'). We refer to such exploration after a goal is reached as 'post-exploration'. In this paper, we present a clear ablation study of post-exploration in a general intrinsically motivated goal exploration process (IMGEP) framework, that the Go-Explore paper did not show. We study the isolated potential of post-exploration, by turning it on and off within the same algorithm under both tabular and deep RL settings on both discrete navigation and continuous control tasks. Experiments on a range of MiniGrid and Mujoco environments show that post-exploration indeed helps IMGEP agents reach more diverse states and boosts their performance. In short, our work suggests that RL researchers should consider to use post-exploration in IMGEP when possible since it is effective, method-agnostic and easy to implement. △ Less

Submitted 6 January, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

Comments: arXiv admin note: substantial text overlap with arXiv:2203.16311

arXiv:2211.15183 [pdf, other]

Continuous Episodic Control

Authors: Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat

Abstract: Non-parametric episodic memory can be used to quickly latch onto high-rewarded experience in reinforcement learning tasks. In contrast to parametric deep reinforcement learning approaches in which reward signals need to be back-propagated slowly, these methods only need to discover the solution once, and may then repeatedly solve the task. However, episodic control solutions are stored in discrete… ▽ More Non-parametric episodic memory can be used to quickly latch onto high-rewarded experience in reinforcement learning tasks. In contrast to parametric deep reinforcement learning approaches in which reward signals need to be back-propagated slowly, these methods only need to discover the solution once, and may then repeatedly solve the task. However, episodic control solutions are stored in discrete tables, and this approach has so far only been applied to discrete action space problems. Therefore, this paper introduces Continuous Episodic Control (CEC), a novel non-parametric episodic memory algorithm for sequential decision making in problems with a continuous action space. Results on several sparse-reward continuous control environments show that our proposed method learns faster than state-of-the-art model-free RL and memory-augmented RL algorithms, while maintaining good long-run performance as well. In short, CEC can be a fast approach for learning in continuous control tasks. △ Less

Submitted 23 April, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

arXiv:2203.16311 [pdf, other]

When to Go, and When to Explore: The Benefit of Post-Exploration in Intrinsic Motivation

Authors: Zhao Yang, Thomas M. Moerland, Mike Preuss, Aske Plaat

Abstract: Go-Explore achieved breakthrough performance on challenging reinforcement learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that successful exploration requires an agent to first return to an interesting state ('Go'), and only then explore into unknown terrain ('Explore'). We refer to such exploration after a goal is reached as 'post-exploration'. In this paper we present… ▽ More Go-Explore achieved breakthrough performance on challenging reinforcement learning (RL) tasks with sparse rewards. The key insight of Go-Explore was that successful exploration requires an agent to first return to an interesting state ('Go'), and only then explore into unknown terrain ('Explore'). We refer to such exploration after a goal is reached as 'post-exploration'. In this paper we present a systematic study of post-exploration, answering open questions that the Go-Explore paper did not answer yet. First, we study the isolated potential of post-exploration, by turning it on and off within the same algorithm. Subsequently, we introduce new methodology to adaptively decide when to post-explore and for how long to post-explore. Experiments on a range of MiniGrid environments show that post-exploration indeed boosts performance (with a bigger impact than tuning regular exploration parameters), and this effect is further enhanced by adaptively deciding when and for how long to post-explore. In short, our work identifies adaptive post-exploration as a promising direction for RL exploration research. △ Less

Submitted 13 April, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

arXiv:2203.03292 [pdf, other]

On Credit Assignment in Hierarchical Reinforcement Learning

Authors: Joery A. de Vries, Thomas M. Moerland, Aske Plaat

Abstract: Hierarchical Reinforcement Learning (HRL) has held longstanding promise to advance reinforcement learning. Yet, it has remained a considerable challenge to develop practical algorithms that exhibit some of these promises. To improve our fundamental understanding of HRL, we investigate hierarchical credit assignment from the perspective of conventional multistep reinforcement learning. We show how… ▽ More Hierarchical Reinforcement Learning (HRL) has held longstanding promise to advance reinforcement learning. Yet, it has remained a considerable challenge to develop practical algorithms that exhibit some of these promises. To improve our fundamental understanding of HRL, we investigate hierarchical credit assignment from the perspective of conventional multistep reinforcement learning. We show how e.g., a 1-step `hierarchical backup' can be seen as a conventional multistep backup with $n$ skip connections over time connecting each subsequent state to the first independent of actions inbetween. Furthermore, we find that generalizing hierarchy to multistep return estimation methods requires us to consider how to partition the environment trace, in order to construct backup paths. We leverage these insight to develop a new hierarchical algorithm Hier$Q_k(λ)$, for which we demonstrate that hierarchical credit assignment alone can already boost agent performance (i.e., when eliminating generalization or exploration). Altogether, our work yields fundamental insight into the nature of hierarchical backups and distinguishes this as an additional basis for reinforcement learning research. △ Less

Submitted 7 March, 2022; originally announced March 2022.

Comments: Code and Data available at: https://github.com/joeryjoery/HierQ

arXiv:2203.01075 [pdf, other]

Reliable validation of Reinforcement Learning Benchmarks

Authors: Matthias Müller-Brockhausen, Aske Plaat, Mike Preuss

Abstract: Reinforcement Learning (RL) is one of the most dynamic research areas in Game AI and AI as a whole, and a wide variety of games are used as its prominent test problems. However, it is subject to the replicability crisis that currently affects most algorithmic AI research. Benchmarking in Reinforcement Learning could be improved through verifiable results. There are numerous benchmark environments… ▽ More Reinforcement Learning (RL) is one of the most dynamic research areas in Game AI and AI as a whole, and a wide variety of games are used as its prominent test problems. However, it is subject to the replicability crisis that currently affects most algorithmic AI research. Benchmarking in Reinforcement Learning could be improved through verifiable results. There are numerous benchmark environments whose scores are used to compare different algorithms, such as Atari. Nevertheless, reviewers must trust that figures represent truthful values, as it is difficult to reproduce an exact training curve. We propose improving this situation by providing access to the original experimental data to validate study results. To that end, we rely on the concept of minimal traces. These allow re-simulation of action sequences in deterministic RL environments and, in turn, enable reviewers to verify, re-use, and manually inspect experimental results without needing large compute clusters. It also permits validation of presented reward graphs, an inspection of individual episodes, and re-use of result data (baselines) for proper comparison in follow-up papers. We offer plug-and-play code that works with Gym so that our measures fit well in the existing RL and reproducibility eco-system. Our approach is freely available, easy to use, and adds minimal overhead, as minimal traces allow a data compression ratio of up to $\approx 10^4:1$ (94GB to 8MB for Atari Pong) compared to a regular MDP trace used in offline RL datasets. The paper presents proof-of-concept results for a variety of games. △ Less

Submitted 2 March, 2022; originally announced March 2022.

arXiv:2201.02135 [pdf, other]

doi 10.1007/978-981-19-0638-1

Deep Reinforcement Learning, a textbook

Authors: Aske Plaat

Abstract: Deep reinforcement learning has gathered much attention recently. Impressive results were achieved in activities as diverse as autonomous driving, game playing, molecular recombination, and robotics. In all these fields, computer programs have taught themselves to solve difficult problems. They have learned to fly model helicopters and perform aerobatic manoeuvers such as loops and rolls. In some… ▽ More Deep reinforcement learning has gathered much attention recently. Impressive results were achieved in activities as diverse as autonomous driving, game playing, molecular recombination, and robotics. In all these fields, computer programs have taught themselves to solve difficult problems. They have learned to fly model helicopters and perform aerobatic manoeuvers such as loops and rolls. In some applications they have even become better than the best humans, such as in Atari, Go, poker and StarCraft. The way in which deep reinforcement learning explores complex environments reminds us of how children learn, by playfully trying out things, getting feedback, and trying again. The computer seems to truly possess aspects of human learning; this goes to the heart of the dream of artificial intelligence. The successes in research have not gone unnoticed by educators, and universities have started to offer courses on the subject. The aim of this book is to provide a comprehensive overview of the field of deep reinforcement learning. The book is written for graduate students of artificial intelligence, and for researchers and practitioners who wish to better understand deep reinforcement learning methods and their challenges. We assume an undergraduate-level of understanding of computer science and artificial intelligence; the programming language of this book is Python. We describe the foundations, the algorithms and the applications of deep reinforcement learning. We cover the established model-free and model-based methods that form the basis of the field. Developments go quickly, and we also cover advanced topics: deep multi-agent reinforcement learning, deep hierarchical reinforcement learning, and deep meta learning. △ Less

Submitted 23 April, 2023; v1 submitted 4 January, 2022; originally announced January 2022.

Comments: Revised version 2023, added description of Monte Carlo sampling and N-step algorithm, improved explanation of on-policy and off-policy learning. Preprint available by permission of Publisher

arXiv:2109.05022 [pdf, other]

Potential-based Reward Shaping in Sokoban

Authors: Zhao Yang, Mike Preuss, Aske Plaat

Abstract: Learning to solve sparse-reward reinforcement learning problems is difficult, due to the lack of guidance towards the goal. But in some problems, prior knowledge can be used to augment the learning process. Reward shaping is a way to incorporate prior knowledge into the original reward function in order to speed up the learning. While previous work has investigated the use of expert knowledge to g… ▽ More Learning to solve sparse-reward reinforcement learning problems is difficult, due to the lack of guidance towards the goal. But in some problems, prior knowledge can be used to augment the learning process. Reward shaping is a way to incorporate prior knowledge into the original reward function in order to speed up the learning. While previous work has investigated the use of expert knowledge to generate potential functions, in this work, we study whether we can use a search algorithm(A*) to automatically generate a potential function for reward shaping in Sokoban, a well-known planning task. The results showed that learning with shaped reward function is faster than learning from scratch. Our results indicate that distance functions could be a suitable function for Sokoban. This work demonstrates the possibility of solving multiple instances with the help of reward shaping. The result can be compressed into a single policy, which can be seen as the first phrase towards training a general policy that is able to solve unseen instances. △ Less

Submitted 10 September, 2021; originally announced September 2021.

arXiv:2107.08241 [pdf, other]

High-Accuracy Model-Based Reinforcement Learning, a Survey

Authors: Aske Plaat, Walter Kosters, Mike Preuss

Abstract: Deep reinforcement learning has shown remarkable success in the past few years. Highly complex sequential decision making problems from game playing and robotics have been solved with deep model-free methods. Unfortunately, the sample complexity of model-free methods is often high. To reduce the number of environment samples, model-based reinforcement learning creates an explicit model of the envi… ▽ More Deep reinforcement learning has shown remarkable success in the past few years. Highly complex sequential decision making problems from game playing and robotics have been solved with deep model-free methods. Unfortunately, the sample complexity of model-free methods is often high. To reduce the number of environment samples, model-based reinforcement learning creates an explicit model of the environment dynamics. Achieving high model accuracy is a challenge in high-dimensional problems. In recent years, a diverse landscape of model-based methods has been introduced to improve model accuracy, using methods such as uncertainty modeling, model-predictive control, latent models, and end-to-end learning and planning. Some of these methods succeed in achieving high accuracy at low sample complexity, most do so either in a robotics or in a games context. In this paper, we survey these methods; we explain in detail how they work and what their strengths and weaknesses are. We conclude with a research agenda for future work to make the methods more robust and more widely applicable to other applications. △ Less

Submitted 17 July, 2021; originally announced July 2021.

Comments: arXiv admin note: text overlap with arXiv:2008.05598

arXiv:2106.15691 [pdf, other]

Deep Multiagent Reinforcement Learning: Challenges and Directions

Authors: Annie Wong, Thomas Bäck, Anna V. Kononova, Aske Plaat

Abstract: This paper surveys the field of deep multiagent reinforcement learning. The combination of deep neural networks with reinforcement learning has gained increased traction in recent years and is slowly shifting the focus from single-agent to multiagent environments. Dealing with multiple agents is inherently more complex as (a) the future rewards depend on multiple players' joint actions and (b) the… ▽ More This paper surveys the field of deep multiagent reinforcement learning. The combination of deep neural networks with reinforcement learning has gained increased traction in recent years and is slowly shifting the focus from single-agent to multiagent environments. Dealing with multiple agents is inherently more complex as (a) the future rewards depend on multiple players' joint actions and (b) the computational complexity increases. We present the most common multiagent problem representations and their main challenges, and identify five research areas that address one or more of these challenges: centralised training and decentralised execution, opponent modelling, communication, efficient coordination, and reward shaping. We find that many computational studies rely on unrealistic assumptions or are not generalisable to other settings; they struggle to overcome the curse of dimensionality or nonstationarity. Approaches from psychology and sociology capture promising relevant behaviours, such as communication and coordination, to help agents achieve better performance in multiagent settings. We suggest that, for multiagent reinforcement learning to be successful, future research should address these challenges with an interdisciplinary approach to open up new possibilities in multiagent reinforcement learning. △ Less

Submitted 12 October, 2022; v1 submitted 29 June, 2021; originally announced June 2021.

Comments: 41 pages, 6 figures

ACM Class: A.1; I.2.6; I.2.8; J.4

arXiv:2105.14780 [pdf, other]

doi 10.1109/CoG52621.2021.9619000

Procedural Content Generation: Better Benchmarks for Transfer Reinforcement Learning

Authors: Matthias Müller-Brockhausen, Mike Preuss, Aske Plaat

Abstract: The idea of transfer in reinforcement learning (TRL) is intriguing: being able to transfer knowledge from one problem to another problem without learning everything from scratch. This promises quicker learning and learning more complex methods. To gain an insight into the field and to detect emerging trends, we performed a database search. We note a surprisingly late adoption of deep learning that… ▽ More The idea of transfer in reinforcement learning (TRL) is intriguing: being able to transfer knowledge from one problem to another problem without learning everything from scratch. This promises quicker learning and learning more complex methods. To gain an insight into the field and to detect emerging trends, we performed a database search. We note a surprisingly late adoption of deep learning that starts in 2018. The introduction of deep learning has not yet solved the greatest challenge of TRL: generalization. Transfer between different domains works well when domains have strong similarities (e.g. MountainCar to Cartpole), and most TRL publications focus on different tasks within the same domain that have few differences. Most TRL applications we encountered compare their improvements against self-defined baselines, and the field is still missing unified benchmarks. We consider this to be a disappointing situation. For the future, we note that: (1) A clear measure of task similarity is needed. (2) Generalization needs to improve. Promising approaches merge deep learning with planning via MCTS or introduce memory through LSTMs. (3) The lack of benchmarking tools will be remedied to enable meaningful comparison and measure progress. Already Alchemy and Meta-World are emerging as interesting benchmark suites. We note that another development, the increase in procedural content generation (PCG), can improve both benchmarking and generalization in TRL. △ Less

Submitted 31 May, 2021; originally announced May 2021.

arXiv:2105.11702 [pdf, other]

Transfer Learning and Curriculum Learning in Sokoban

Authors: Zhao Yang, Mike Preuss, Aske Plaat

Abstract: Transfer learning can speed up training in machine learning and is regularly used in classification tasks. It reuses prior knowledge from other tasks to pre-train networks for new tasks. In reinforcement learning, learning actions for a behavior policy that can be applied to new environments is still a challenge, especially for tasks that involve much planning. Sokoban is a challenging puzzle game… ▽ More Transfer learning can speed up training in machine learning and is regularly used in classification tasks. It reuses prior knowledge from other tasks to pre-train networks for new tasks. In reinforcement learning, learning actions for a behavior policy that can be applied to new environments is still a challenge, especially for tasks that involve much planning. Sokoban is a challenging puzzle game. It has been used widely as a benchmark in planning-based reinforcement learning. In this paper, we show how prior knowledge improves learning in Sokoban tasks. We find that reusing feature representations learned previously can accelerate learning new, more complex, instances. In effect, we show how curriculum learning, from simple to complex tasks, works in Sokoban. Furthermore, feature representations learned in simpler instances are more general, and thus lead to positive transfers towards more complex tasks, but not vice versa. We have also studied which part of the knowledge is most important for transfer to succeed, and identify which layers should be used for pre-training. △ Less

Submitted 10 September, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

arXiv:2105.06136 [pdf, other]

Adaptive Warm-Start MCTS in AlphaZero-like Deep Reinforcement Learning

Authors: Hui Wang, Mike Preuss, Aske Plaat

Abstract: AlphaZero has achieved impressive performance in deep reinforcement learning by utilizing an architecture that combines search and training of a neural network in self-play. Many researchers are looking for ways to reproduce and improve results for other games/tasks. However, the architecture is designed to learn from scratch, tabula rasa, accepting a cold-start problem in self-play. Recently, a w… ▽ More AlphaZero has achieved impressive performance in deep reinforcement learning by utilizing an architecture that combines search and training of a neural network in self-play. Many researchers are looking for ways to reproduce and improve results for other games/tasks. However, the architecture is designed to learn from scratch, tabula rasa, accepting a cold-start problem in self-play. Recently, a warm-start enhancement method for Monte Carlo Tree Search was proposed to improve the self-play starting phase. It employs a fixed parameter $I^\prime$ to control the warm-start length. Improved performance was reported in small board games. In this paper we present results with an adaptive switch method. Experiments show that our approach works better than the fixed $I^\prime$, especially for "deep," tactical, games (Othello and Connect Four). We conjecture that the adaptive value for $I^\prime$ is also influenced by the size of the game, and that on average $I^\prime$ will increase with game size. We conclude that AlphaZero-like deep reinforcement learning benefits from adaptive rollout based warm-start, as Rapid Action Value Estimate did for rollout-based reinforcement learning 15 years ago. △ Less

Submitted 13 May, 2021; originally announced May 2021.

arXiv:2104.10527 [pdf, other]

doi 10.1007/s10994-022-06210-y

Stateless Neural Meta-Learning using Second-Order Gradients

Authors: Mike Huisman, Aske Plaat, Jan N. van Rijn

Abstract: Deep learning typically requires large data sets and much compute power for each new problem that is learned. Meta-learning can be used to learn a good prior that facilitates quick learning, thereby relaxing these requirements so that new tasks can be learned quicker; two popular approaches are MAML and the meta-learner LSTM. In this work, we compare the two and formally show that the meta-learner… ▽ More Deep learning typically requires large data sets and much compute power for each new problem that is learned. Meta-learning can be used to learn a good prior that facilitates quick learning, thereby relaxing these requirements so that new tasks can be learned quicker; two popular approaches are MAML and the meta-learner LSTM. In this work, we compare the two and formally show that the meta-learner LSTM subsumes MAML. Combining this insight with recent empirical findings, we construct a new algorithm (dubbed TURTLE) which is simpler than the meta-learner LSTM yet more expressive than MAML. TURTLE outperforms both techniques at few-shot sine wave regression and image classification on miniImageNet and CUB without any additional hyperparameter tuning, at a computational cost that is comparable with second-order MAML. The key to TURTLE's success lies in the use of second-order gradients, which also significantly increases the performance of the meta-learner LSTM by 1-6% accuracy. △ Less

Submitted 21 April, 2021; originally announced April 2021.

Journal ref: Machine Learning, 2022

arXiv:2102.12924 [pdf, other]

Visualizing MuZero Models

Authors: Joery A. de Vries, Ken S. Voskuil, Thomas M. Moerland, Aske Plaat

Abstract: MuZero, a model-based reinforcement learning algorithm that uses a value equivalent dynamics model, achieved state-of-the-art performance in Chess, Shogi and the game of Go. In contrast to standard forward dynamics models that predict a full next state, value equivalent models are trained to predict a future value, thereby emphasizing value relevant information in the representations. While value… ▽ More MuZero, a model-based reinforcement learning algorithm that uses a value equivalent dynamics model, achieved state-of-the-art performance in Chess, Shogi and the game of Go. In contrast to standard forward dynamics models that predict a full next state, value equivalent models are trained to predict a future value, thereby emphasizing value relevant information in the representations. While value equivalent models have shown strong empirical success, there is no research yet that visualizes and investigates what types of representations these models actually learn. Therefore, in this paper we visualize the latent representation of MuZero agents. We find that action trajectories may diverge between observation embeddings and internal state transition dynamics, which could lead to instability during planning. Based on this insight, we propose two regularization techniques to stabilize MuZero's performance. Additionally, we provide an open-source implementation of MuZero along with an interactive visualizer of learned representations, which may aid further investigation of value equivalent algorithms. △ Less

Submitted 3 March, 2021; v1 submitted 25 February, 2021; originally announced February 2021.

arXiv:2010.06488 [pdf, other]

Multiple Node Immunisation for Preventing Epidemics on Networks by Exact Multiobjective Optimisation of Cost and Shield-Value

Authors: Michael Emmerich, Joost Nibbeling, Marios Kefalas, Aske Plaat

Abstract: The general problem in this paper is vertex (node) subset selection with the goal to contain an infection that spreads in a network. Instead of selecting the single most important node, this paper deals with the problem of selecting multiple nodes for removal. As compared to previous work on multiple-node selection, the trade-off between cost and benefit is considered. The benefit is measured in t… ▽ More The general problem in this paper is vertex (node) subset selection with the goal to contain an infection that spreads in a network. Instead of selecting the single most important node, this paper deals with the problem of selecting multiple nodes for removal. As compared to previous work on multiple-node selection, the trade-off between cost and benefit is considered. The benefit is measured in terms of increasing the epidemic threshold which is a measure of how difficult it is for an infection to spread in a network. The cost is measured in terms of the number and size of nodes to be removed or controlled. Already in its single-objective instance with a fixed number of $k$ nodes to be removed, the multiple vertex immunisation problems have been proven to be NP-hard. Several heuristics have been developed to approximate the problem. In this work, we compare meta-heuristic techniques with exact methods on the Shield-value, which is a sub-modular proxy for the maximal eigenvalue and used in the current state-of-the-art greedy node-removal strategies. We generalise it to the multi-objective case and replace the greedy algorithm by a quadratic program (QP), which then can be solved with exact QP solvers. The main contribution of this paper is the insight that, if time permits, exact and problem-specific methods approximation should be used, which are often far better than Pareto front approximations obtained by general meta-heuristics. Based on these, it will be more effective to develop strategies for controlling real-world networks when the goal is to prevent or contain epidemic outbreaks. This paper is supported by ready to use Python implementation of the optimization methods and datasets. △ Less

Submitted 13 October, 2020; originally announced October 2020.

Comments: Based on the Master Thesis of Joost Nibbeling, LIACS, Leiden University, The Netherlands

MSC Class: 90C27; 05C69 ACM Class: J.3; G.2.2; G.4; I.2.8

arXiv:2010.03522 [pdf, other]

doi 10.1007/s10462-021-10004-4

A Survey of Deep Meta-Learning

Authors: Mike Huisman, Jan N. van Rijn, Aske Plaat

Abstract: Deep neural networks can achieve great successes when presented with large data sets and sufficient computational resources. However, their ability to learn new concepts quickly is limited. Meta-learning is one approach to address this issue, by enabling the network to learn how to learn. The field of Deep Meta-Learning advances at great speed, but lacks a unified, in-depth overview of current tec… ▽ More Deep neural networks can achieve great successes when presented with large data sets and sufficient computational resources. However, their ability to learn new concepts quickly is limited. Meta-learning is one approach to address this issue, by enabling the network to learn how to learn. The field of Deep Meta-Learning advances at great speed, but lacks a unified, in-depth overview of current techniques. With this work, we aim to bridge this gap. After providing the reader with a theoretical foundation, we investigate and summarize key methods, which are categorized into i)~metric-, ii)~model-, and iii)~optimization-based techniques. In addition, we identify the main open challenges, such as performance evaluations on heterogeneous benchmarks, and reduction of the computational costs of meta-learning. △ Less

Submitted 21 April, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: Published in the AI Review (AIRE) Journal (2021)

arXiv:2008.05598 [pdf, other]

Deep Model-Based Reinforcement Learning for High-Dimensional Problems, a Survey

Authors: Aske Plaat, Walter Kosters, Mike Preuss

Abstract: Deep reinforcement learning has shown remarkable success in the past few years. Highly complex sequential decision making problems have been solved in tasks such as game playing and robotics. Unfortunately, the sample complexity of most deep reinforcement learning methods is high, precluding their use in some important applications. Model-based reinforcement learning creates an explicit model of t… ▽ More Deep reinforcement learning has shown remarkable success in the past few years. Highly complex sequential decision making problems have been solved in tasks such as game playing and robotics. Unfortunately, the sample complexity of most deep reinforcement learning methods is high, precluding their use in some important applications. Model-based reinforcement learning creates an explicit model of the environment dynamics to reduce the need for environment samples. Current deep learning methods use high-capacity networks to solve high-dimensional problems. Unfortunately, high-capacity models typically require many samples, negating the potential benefit of lower sample complexity in model-based methods. A challenge for deep model-based methods is therefore to achieve high predictive power while maintaining low sample complexity. In recent years, many model-based methods have been introduced to address this challenge. In this paper, we survey the contemporary model-based landscape. First we discuss definitions and relations to other fields. We propose a taxonomy based on three approaches: using explicit planning on given transitions, using explicit planning on learned transitions, and end-to-end learning of both planning and transitions. We use these approaches to organize a comprehensive overview of important recent developments such as latent models. We describe methods and benchmarks, and we suggest directions for future work for each of the approaches. Among promising research directions are curriculum learning, uncertainty modeling, and use of latent models for transfer learning. △ Less

Submitted 1 December, 2020; v1 submitted 11 August, 2020; originally announced August 2020.

arXiv:2006.16712 [pdf, other]

Model-based Reinforcement Learning: A Survey

Authors: Thomas M. Moerland, Joost Broekens, Aske Plaat, Catholijn M. Jonker

Abstract: Sequential decision making, commonly formalized as Markov Decision Process (MDP) optimization, is a important challenge in artificial intelligence. Two key approaches to this problem are reinforcement learning (RL) and planning. This paper presents a survey of the integration of both fields, better known as model-based reinforcement learning. Model-based RL has two main steps. First, we systematic… ▽ More Sequential decision making, commonly formalized as Markov Decision Process (MDP) optimization, is a important challenge in artificial intelligence. Two key approaches to this problem are reinforcement learning (RL) and planning. This paper presents a survey of the integration of both fields, better known as model-based reinforcement learning. Model-based RL has two main steps. First, we systematically cover approaches to dynamics model learning, including challenges like dealing with stochasticity, uncertainty, partial observability, and temporal abstraction. Second, we present a systematic categorization of planning-learning integration, including aspects like: where to start planning, what budgets to allocate to planning and real data collection, how to plan, and how to integrate planning in the learning and acting loop. After these two sections, we also discuss implicit model-based RL as an end-to-end alternative for model learning and planning, and we cover the potential benefits of model-based RL. Along the way, the survey also draws connections to several related RL fields, like hierarchical RL and transfer learning. Altogether, the survey presents a broad conceptual overview of the combination of planning and learning for MDP optimization. △ Less

Submitted 31 March, 2022; v1 submitted 30 June, 2020; originally announced June 2020.

arXiv:2006.15009 [pdf, other]

A Unifying Framework for Reinforcement Learning and Planning

Authors: Thomas M. Moerland, Joost Broekens, Aske Plaat, Catholijn M. Jonker

Abstract: Sequential decision making, commonly formalized as optimization of a Markov Decision Process, is a key challenge in artificial intelligence. Two successful approaches to MDP optimization are reinforcement learning and planning, which both largely have their own research communities. However, if both research fields solve the same problem, then we might be able to disentangle the common factors in… ▽ More Sequential decision making, commonly formalized as optimization of a Markov Decision Process, is a key challenge in artificial intelligence. Two successful approaches to MDP optimization are reinforcement learning and planning, which both largely have their own research communities. However, if both research fields solve the same problem, then we might be able to disentangle the common factors in their solution approaches. Therefore, this paper presents a unifying algorithmic framework for reinforcement learning and planning (FRAP), which identifies underlying dimensions on which MDP planning and learning algorithms have to decide. At the end of the paper, we compare a variety of well-known planning, model-free and model-based RL algorithms along these dimensions. Altogether, the framework may help provide deeper insight in the algorithmic design space of planning and reinforcement learning. △ Less

Submitted 31 March, 2022; v1 submitted 26 June, 2020; originally announced June 2020.

arXiv:2006.07970 [pdf, other]

Tackling Morpion Solitaire with AlphaZero-likeRanked Reward Reinforcement Learning

Authors: Hui Wang, Mike Preuss, Michael Emmerich, Aske Plaat

Abstract: Morpion Solitaire is a popular single player game, performed with paper and pencil. Due to its large state space (on the order of the game of Go) traditional search algorithms, such as MCTS, have not been able to find good solutions. A later algorithm, Nested Rollout Policy Adaptation, was able to find a new record of 82 steps, albeit with large computational resources. After achieving this record… ▽ More Morpion Solitaire is a popular single player game, performed with paper and pencil. Due to its large state space (on the order of the game of Go) traditional search algorithms, such as MCTS, have not been able to find good solutions. A later algorithm, Nested Rollout Policy Adaptation, was able to find a new record of 82 steps, albeit with large computational resources. After achieving this record, to the best of our knowledge, there has been no further progress reported, for about a decade. In this paper we take the recent impressive performance of deep self-learning reinforcement learning approaches from AlphaGo/AlphaZero as inspiration to design a searcher for Morpion Solitaire. A challenge of Morpion Solitaire is that the state space is sparse, there are few win/loss signals. Instead, we use an approach known as ranked reward to create a reinforcement learning self-play framework for Morpion Solitaire. This enables us to find medium-quality solutions with reasonable computational effort. Our record is a 67 steps solution, which is very close to the human best (68) without any other adaptation to the problem than using ranked reward. We list many further avenues for potential improvement. △ Less

Submitted 14 June, 2020; originally announced June 2020.

Comments: 4 pages, 2 figures. the first/ongoing attempt to tackle Morpion Solitaire using ranked reward reinforcement learning. submitted to SYNASC2020

arXiv:2005.09645 [pdf, other]

The Second Type of Uncertainty in Monte Carlo Tree Search

Authors: Thomas M Moerland, Joost Broekens, Aske Plaat, Catholijn M Jonker

Abstract: Monte Carlo Tree Search (MCTS) efficiently balances exploration and exploitation in tree search based on count-derived uncertainty. However, these local visit counts ignore a second type of uncertainty induced by the size of the subtree below an action. We first show how, due to the lack of this second uncertainty type, MCTS may completely fail in well-known sparse exploration problems, known from… ▽ More Monte Carlo Tree Search (MCTS) efficiently balances exploration and exploitation in tree search based on count-derived uncertainty. However, these local visit counts ignore a second type of uncertainty induced by the size of the subtree below an action. We first show how, due to the lack of this second uncertainty type, MCTS may completely fail in well-known sparse exploration problems, known from the reinforcement learning community. We then introduce a new algorithm, which estimates the size of the subtree below an action, and leverages this information in the UCB formula to better direct exploration. Subsequently, we generalize these ideas by showing that loops, i.e., the repeated occurrence of (approximately) the same state in the same trace, are actually a special case of subtree depth variation. Testing on a variety of tasks shows that our algorithms increase sample efficiency, especially when the planning budget per timestep is small. △ Less

Submitted 19 May, 2020; originally announced May 2020.

Comments: arXiv admin note: text overlap with arXiv:1805.09218

arXiv:2004.12357 [pdf, other]

doi 10.1007/978-3-030-58115-2_37

Warm-Start AlphaZero Self-Play Search Enhancements

Authors: Hui Wang, Mike Preuss, Aske Plaat

Abstract: Recently, AlphaZero has achieved landmark results in deep reinforcement learning, by providing a single self-play architecture that learned three different games at super human level. AlphaZero is a large and complicated system with many parameters, and success requires much compute power and fine-tuning. Reproducing results in other games is a challenge, and many researchers are looking for ways… ▽ More Recently, AlphaZero has achieved landmark results in deep reinforcement learning, by providing a single self-play architecture that learned three different games at super human level. AlphaZero is a large and complicated system with many parameters, and success requires much compute power and fine-tuning. Reproducing results in other games is a challenge, and many researchers are looking for ways to improve results while reducing computational demands. AlphaZero's design is purely based on self-play and makes no use of labeled expert data ordomain specific enhancements; it is designed to learn from scratch. We propose a novel approach to deal with this cold-start problem by employing simple search enhancements at the beginning phase of self-play training, namely Rollout, Rapid Action Value Estimate (RAVE) and dynamically weighted combinations of these with the neural network, and Rolling Horizon Evolutionary Algorithms (RHEA). Our experiments indicate that most of these enhancements improve the performance of their baseline player in three different (small) board games, with especially RAVE based variants playing strongly. △ Less

Submitted 26 April, 2020; originally announced April 2020.

Journal ref: PPSN 2020

arXiv:2004.00377 [pdf, other]

A New Challenge: Approaching Tetris Link with AI

Authors: Matthias Muller-Brockhausen, Mike Preuss, Aske Plaat

Abstract: Decades of research have been invested in making computer programs for playing games such as Chess and Go. This paper focuses on a new game, Tetris Link, a board game that is still lacking any scientific analysis. Tetris Link has a large branching factor, hampering a traditional heuristic planning approach. We explore heuristic planning and two other approaches: Reinforcement Learning, Monte Carlo… ▽ More Decades of research have been invested in making computer programs for playing games such as Chess and Go. This paper focuses on a new game, Tetris Link, a board game that is still lacking any scientific analysis. Tetris Link has a large branching factor, hampering a traditional heuristic planning approach. We explore heuristic planning and two other approaches: Reinforcement Learning, Monte Carlo tree search. We document our approach and report on their relative performance in a tournament. Curiously, the heuristic approach is stronger than the planning/learning approaches. However, experienced human players easily win the majority of the matches against the heuristic planning AIs. We, therefore, surmise that Tetris Link is more difficult than expected. We offer our findings to the community as a challenge to improve upon. △ Less

Submitted 1 April, 2020; originally announced April 2020.

arXiv:2003.05988 [pdf, other]

Analysis of Hyper-Parameters for Small Games: Iterations or Epochs in Self-Play?

Authors: Hui Wang, Michael Emmerich, Mike Preuss, Aske Plaat

Abstract: The landmark achievements of AlphaGo Zero have created great research interest into self-play in reinforcement learning. In self-play, Monte Carlo Tree Search is used to train a deep neural network, that is then used in tree searches. Training itself is governed by many hyperparameters.There has been surprisingly little research on design choices for hyper-parameter values and loss-functions, pres… ▽ More The landmark achievements of AlphaGo Zero have created great research interest into self-play in reinforcement learning. In self-play, Monte Carlo Tree Search is used to train a deep neural network, that is then used in tree searches. Training itself is governed by many hyperparameters.There has been surprisingly little research on design choices for hyper-parameter values and loss-functions, presumably because of the prohibitive computational cost to explore the parameter space. In this paper, we investigate 12 hyper-parameters in an AlphaZero-like self-play algorithm and evaluate how these parameters contribute to training. We use small games, to achieve meaningful exploration with moderate computational effort. The experimental results show that training is highly sensitive to hyper-parameter choices. Through multi-objective analysis we identify 4 important hyper-parameters to further assess. To start, we find surprising results where too much training can sometimes lead to lower performance. Our main result is that the number of self-play iterations subsumes MCTS-search simulations, game-episodes, and training epochs. The intuition is that these three increase together as self-play iterations increase, and that increasing them individually is sub-optimal. A consequence of our experiments is a direct recommendation for setting hyper-parameter values in self-play: the overarching outer-loop of self-play iterations should be maximized, in favor of the three inner-loop hyper-parameters, which should be set at lower values. A secondary result of our experiments concerns the choice of optimization goals, for which we also provide recommendations. △ Less

Submitted 12 March, 2020; originally announced March 2020.

arXiv:1903.08781 [pdf, other]

Fault-Tolerant Nanosatellite Computing on a Budget

Authors: Christian M. Fuchs, Nadia Murillo, Aske Plaat, Erik Van der Kouwe, Daniel Harsono, Todor Stefanov

Abstract: Micro- and nanosatellites have become popular platforms for a variety of commercial and scientific applications, but today are considered suitable mainly for short and low-priority space missions due to their low reliability. In part, this can be attributed to their reliance upon cheap, low-feature size, COTS components originally designed for embedded and mobile-market applications, for which tra… ▽ More Micro- and nanosatellites have become popular platforms for a variety of commercial and scientific applications, but today are considered suitable mainly for short and low-priority space missions due to their low reliability. In part, this can be attributed to their reliance upon cheap, low-feature size, COTS components originally designed for embedded and mobile-market applications, for which traditional hardware-voting concepts are ineffective. Software-fault-tolerance concepts have been shown effective for such systems, but have largely been ignored by the space industry due to low maturity, as most have only been researched in theory. In practice, designers of payload instruments and miniaturized satellites are usually forced to sacrifice reliability in favor deliver the level of performance necessary for cutting-edge science and innovative commercial applications. Thus, we developed a software-fault-tolerance-approach based upon thread-level coarse-grain lockstep, which was validated using fault-injection. To offer strong long-term fault coverage, our architecture is implemented as tiled MPSoC on an FPGA, utilizing partial reconfiguration, as well as mixed criticality. This architecture can satisfy the high performance requirements of current and future scientific and commercial space missions at very low cost, while offering the strong fault-coverage guarantees necessary for platform control even for missions with a long duration. This architecture was developed for a 4-year ESA project. Together with two industrial partners, we are developing a prototype to then undergo radiation testing. △ Less

Submitted 20 March, 2019; originally announced March 2019.

Journal ref: Conference on Radiation Effects on Components and Systems 2018 (RADECS)

arXiv:1903.08129 [pdf, ps, other]

Hyper-Parameter Sweep on AlphaZero General

Authors: Hui Wang, Michael Emmerich, Mike Preuss, Aske Plaat

Abstract: Since AlphaGo and AlphaGo Zero have achieved breakground successes in the game of Go, the programs have been generalized to solve other tasks. Subsequently, AlphaZero was developed to play Go, Chess and Shogi. In the literature, the algorithms are explained well. However, AlphaZero contains many parameters, and for neither AlphaGo, AlphaGo Zero nor AlphaZero, there is sufficient discussion about h… ▽ More Since AlphaGo and AlphaGo Zero have achieved breakground successes in the game of Go, the programs have been generalized to solve other tasks. Subsequently, AlphaZero was developed to play Go, Chess and Shogi. In the literature, the algorithms are explained well. However, AlphaZero contains many parameters, and for neither AlphaGo, AlphaGo Zero nor AlphaZero, there is sufficient discussion about how to set parameter values in these algorithms. Therefore, in this paper, we choose 12 parameters in AlphaZero and evaluate how these parameters contribute to training. We focus on three objectives~(training loss, time cost and playing strength). For each parameter, we train 3 models using 3 different values~(minimum value, default value, maximum value). We use the game of play 6$\times$6 Othello, on the AlphaZeroGeneral open source re-implementation of AlphaZero. Overall, experimental results show that different values can lead to different training results, proving the importance of such a parameter sweep. We categorize these 12 parameters into time-sensitive parameters and time-friendly parameters. Moreover, through multi-objective analysis, this paper provides an insightful basis for further hyper-parameter optimization. △ Less

Submitted 19 March, 2019; originally announced March 2019.

Comments: 19 pages 13 figures

arXiv:1902.09493 [pdf, other]

doi 10.1109/AHS.2018.8541457

Dynamic Fault Tolerance Through Resource Pooling

Authors: Christian M. Fuchs, Nadia M. Murillo, Aske Plaat, Erik van der Kouwe, Todor Stefanov

Abstract: Miniaturized satellites are currently not considered suitable for critical, high-priority, and complex multi-phased missions, due to their low reliability. As hardware-side fault tolerance (FT) solutions designed for larger spacecraft can not be adopted aboard very small satellites due to budget, energy, and size constraints, we developed a hybrid FT-approach based upon only COTS components, commo… ▽ More Miniaturized satellites are currently not considered suitable for critical, high-priority, and complex multi-phased missions, due to their low reliability. As hardware-side fault tolerance (FT) solutions designed for larger spacecraft can not be adopted aboard very small satellites due to budget, energy, and size constraints, we developed a hybrid FT-approach based upon only COTS components, commodity processor cores, library IP, and standard software. This approach facilitates fault detection, isolation, and recovery in software, and utilizes fault-coverage techniques across the embedded stack within an multiprocessor system-on-chip (MPSoC). This allows our FPGA-based proof-of-concept implementation to deliver strong fault-coverage even for missions with a long duration, but also to adapt to varying performance requirements during the mission. The operator of a spacecraft utilizing this approach can define performance profiles, which allow an on-board computer (OBC) to trade between processing capacity, fault coverage, and energy consumption using simple heuristics. The software-side FT approach developed also offers advantages if deployed aboard larger spacecraft through spare resource pooling, enabling an OBC to more efficiently handle permanent faults. This FT approach in part mimics a critical biological systems's way of tolerating and adjusting to failures, enabling graceful ageing of an MPSoC. △ Less

Submitted 21 February, 2019; originally announced February 2019.

Journal ref: 2018 NASA/ESA Conference on Adaptive Hardware and Systems (AHS)

arXiv:1810.06078 [pdf, other]

Assessing the Potential of Classical Q-learning in General Game Playing

Authors: Hui Wang, Michael Emmerich, Aske Plaat

Abstract: After the recent groundbreaking results of AlphaGo and AlphaZero, we have seen strong interests in deep reinforcement learning and artificial general intelligence (AGI) in game playing. However, deep learning is resource-intensive and the theory is not yet well developed. For small games, simple classical table-based Q-learning might still be the algorithm of choice. General Game Playing (GGP) pro… ▽ More After the recent groundbreaking results of AlphaGo and AlphaZero, we have seen strong interests in deep reinforcement learning and artificial general intelligence (AGI) in game playing. However, deep learning is resource-intensive and the theory is not yet well developed. For small games, simple classical table-based Q-learning might still be the algorithm of choice. General Game Playing (GGP) provides a good testbed for reinforcement learning to research AGI. Q-learning is one of the canonical reinforcement learning methods, and has been used by (Banerjee $\&$ Stone, IJCAI 2007) in GGP. In this paper we implement Q-learning in GGP for three small-board games (Tic-Tac-Toe, Connect Four, Hex)\footnote{source code: https://github.com/wh1992v/ggp-rl}, to allow comparison to Banerjee et al.. We find that Q-learning converges to a high win rate in GGP. For the $ε$-greedy strategy, we propose a first enhancement, the dynamic $ε$ algorithm. In addition, inspired by (Gelly $\&$ Silver, ICML 2007) we combine online search (Monte Carlo Search) to enhance offline learning, and propose QM-learning for GGP. Both enhancements improve the performance of classical Q-learning. In this work, GGP allows us to show, if augmented by appropriate enhancements, that classical table-based Q-learning can perform well in small games. △ Less

Submitted 14 October, 2018; originally announced October 2018.

Comments: arXiv admin note: substantial text overlap with arXiv:1802.05944

arXiv:1805.09613 [pdf, other]

A0C: Alpha Zero in Continuous Action Space

Authors: Thomas M. Moerland, Joost Broekens, Aske Plaat, Catholijn M. Jonker

Abstract: A core novelty of Alpha Zero is the interleaving of tree search and deep learning, which has proven very successful in board games like Chess, Shogi and Go. These games have a discrete action space. However, many real-world reinforcement learning domains have continuous action spaces, for example in robotic control, navigation and self-driving cars. This paper presents the necessary theoretical ex… ▽ More A core novelty of Alpha Zero is the interleaving of tree search and deep learning, which has proven very successful in board games like Chess, Shogi and Go. These games have a discrete action space. However, many real-world reinforcement learning domains have continuous action spaces, for example in robotic control, navigation and self-driving cars. This paper presents the necessary theoretical extensions of Alpha Zero to deal with continuous action space. We also provide some preliminary experiments on the Pendulum swing-up task, empirically showing the feasibility of our approach. Thereby, this work provides a first step towards the application of iterated search and learning in domains with a continuous action space. △ Less

Submitted 24 May, 2018; originally announced May 2018.

arXiv:1805.09218 [pdf, other]

Monte Carlo Tree Search for Asymmetric Trees

Authors: Thomas M. Moerland, Joost Broekens, Aske Plaat, Catholijn M. Jonker

Abstract: We present an extension of Monte Carlo Tree Search (MCTS) that strongly increases its efficiency for trees with asymmetry and/or loops. Asymmetric termination of search trees introduces a type of uncertainty for which the standard upper confidence bound (UCB) formula does not account. Our first algorithm (MCTS-T), which assumes a non-stochastic environment, backs-up tree structure uncertainty and… ▽ More We present an extension of Monte Carlo Tree Search (MCTS) that strongly increases its efficiency for trees with asymmetry and/or loops. Asymmetric termination of search trees introduces a type of uncertainty for which the standard upper confidence bound (UCB) formula does not account. Our first algorithm (MCTS-T), which assumes a non-stochastic environment, backs-up tree structure uncertainty and leverages it for exploration in a modified UCB formula. Results show vastly improved efficiency in a well-known asymmetric domain in which MCTS performs arbitrarily bad. Next, we connect the ideas about asymmetric termination to the presence of loops in the tree, where the same state appears multiple times in a single trace. An extension to our algorithm (MCTS-T+), which in addition to non-stochasticity assumes full state observability, further increases search efficiency for domains with loops as well. Benchmark testing on a set of OpenAI Gym and Atari 2600 games indicates that our algorithms always perform better than or at least equivalent to standard MCTS, and could be first-choice tree search algorithms for non-stochastic, fully-observable environments. △ Less

Submitted 23 May, 2018; originally announced May 2018.

arXiv:1802.05944 [pdf, other]

Monte Carlo Q-learning for General Game Playing

Authors: Hui Wang, Michael Emmerich, Aske Plaat

Abstract: After the recent groundbreaking results of AlphaGo, we have seen a strong interest in reinforcement learning in game playing. General Game Playing (GGP) provides a good testbed for reinforcement learning. In GGP, a specification of games rules is given. GGP problems can be solved by reinforcement learning. Q-learning is one of the canonical reinforcement learning methods, and has been used by (Ban… ▽ More After the recent groundbreaking results of AlphaGo, we have seen a strong interest in reinforcement learning in game playing. General Game Playing (GGP) provides a good testbed for reinforcement learning. In GGP, a specification of games rules is given. GGP problems can be solved by reinforcement learning. Q-learning is one of the canonical reinforcement learning methods, and has been used by (Banerjee & Stone, IJCAI 2007) in GGP. In this paper we implement Q-learning in GGP for three small-board games (Tic-Tac-Toe, Connect Four, Hex), to allow comparison to Banerjee et al. As expected, Q-learning converges, although much slower than MCTS. Borrowing an idea from MCTS, we enhance Q-learning with Monte Carlo Search, to give QM-learning. This enhancement improves the performance of pure Q-learning. We believe that QM-learning can also be used to improve performance of reinforcement learning further for larger games, something which we will test in future work. △ Less

Submitted 21 May, 2018; v1 submitted 16 February, 2018; originally announced February 2018.

arXiv:1708.06931 [pdf, other]

Bringing Fault-Tolerant GigaHertz-Computing to Space: A Multi-Stage Software-Side Fault-Tolerance Approach for Miniaturized Spacecraft

Authors: Christian M. Fuchs, Todor Stefanov, Nadia Murillo, Aske Plaat

Abstract: Modern embedded technology is a driving factor in satellite miniaturization, contributing to a massive boom in satellite launches and a rapidly evolving new space industry. Miniaturized satellites, however, suffer from low reliability, as traditional hardware-based fault-tolerance (FT) concepts are ineffective for on-board computers (OBCs) utilizing modern systems-on-a-chip (SoC). Therefore, large… ▽ More Modern embedded technology is a driving factor in satellite miniaturization, contributing to a massive boom in satellite launches and a rapidly evolving new space industry. Miniaturized satellites, however, suffer from low reliability, as traditional hardware-based fault-tolerance (FT) concepts are ineffective for on-board computers (OBCs) utilizing modern systems-on-a-chip (SoC). Therefore, larger satellites continue to rely on proven processors with large feature sizes. Software-based concepts have largely been ignored by the space industry as they were researched only in theory, and have not yet reached the level of maturity necessary for implementation. We present the first integral, real-world solution to enable fault-tolerant general-purpose computing with modern multiprocessor-SoCs (MPSoCs) for spaceflight, thereby enabling their use in future high-priority space missions. The presented multi-stage approach consists of three FT stages, combining coarse-grained thread-level distributed self-validation, FPGA reconfiguration, and mixed criticality to assure long-term FT and excellent scalability for both resource constrained and critical high-priority space missions. Early benchmark results indicate a drastic performance increase over state-of-the-art radiation-hard OBC designs and considerably lower software- and hardware development costs. This approach was developed for a 4-year European Space Agency (ESA) project, and we are implementing a tiled MPSoC prototype jointly with two industrial partners. △ Less

Submitted 23 August, 2017; originally announced August 2017.

Comments: 26th IEEE Asian Test Symposium 2017, 27-30 Nov 2017, Taipei City, Taiwan

arXiv:1706.02086 [pdf, other]

Preliminary Performance Estimations and Benchmark Results for a Software-based Fault-Tolerance Approach aboard Miniaturized Satellite Computers

Authors: Christian M. Fuchs, Todor Stefanov, Nadia Murillo, Aske Plaat

Abstract: Modern embedded technology is a driving factor in satellite miniaturization, contributing to a massive boom in satellite launches and a rapidly evolving new space industry. Miniaturized satellites however suffer from low reliability, as traditional hardware-based fault-tolerance (FT) concepts are ineffective for on-board computers (OBCs) utilizing modern systems-on-a-chip (SoC). Larger satellites… ▽ More Modern embedded technology is a driving factor in satellite miniaturization, contributing to a massive boom in satellite launches and a rapidly evolving new space industry. Miniaturized satellites however suffer from low reliability, as traditional hardware-based fault-tolerance (FT) concepts are ineffective for on-board computers (OBCs) utilizing modern systems-on-a-chip (SoC). Larger satellites therefore continue to rely on proven processors with large feature sizes. Software-based concepts have largely been ignored by the space industry as they were researched only in theory, and have not yet reached the level of maturity necessary for implementation. In related work, we presented the first integral, real-world solution to enable fault-tolerant general-purpose computing with modern multiprocessor-SoCs (MPSoCs) for spaceflight, thereby enabling their use in future high-priority space missions. The presented multi-stage approach consists of three FT stages, combining coarse-grained thread-level distributed self-validation, FPGA reconfiguration, and mixed criticality to assure long-term FT and excellent scalability for both resource constrained and critical high-priority space missions. As part of the ongoing implementation effort towards a hardware prototype, several software implementations were achieved and tested. This document contains an outline of the conducted tests, performance evaluation results, and supplementary information not included in the actual paper. It is being continuously expanded and updated. △ Less

Submitted 22 July, 2017; v1 submitted 7 June, 2017; originally announced June 2017.

Showing 1–50 of 70 results for author: Plaat, A