-
SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code Translation
Authors:
Jialing Pan,
Adrien Sadé,
Jin Kim,
Eric Soriano,
Guillem Sole,
Sylvain Flamant
Abstract:
With the recent focus on Large Language Models (LLMs), both StarCoder (Li et al., 2023) and Code Llama (Rozière et al., 2023) have demonstrated remarkable performance in code generation. However, there is still a need for improvement in code translation functionality with efficient training techniques. In response to this, we introduce SteloCoder, a decoder-only StarCoder-based LLM designed specif…
▽ More
With the recent focus on Large Language Models (LLMs), both StarCoder (Li et al., 2023) and Code Llama (Rozière et al., 2023) have demonstrated remarkable performance in code generation. However, there is still a need for improvement in code translation functionality with efficient training techniques. In response to this, we introduce SteloCoder, a decoder-only StarCoder-based LLM designed specifically for multi-programming language-to-Python code translation. In particular, SteloCoder achieves C++, C#, JavaScript, Java, or PHP-to-Python code translation without specifying the input programming language. We modified StarCoder model architecture by incorporating a Mixture-of-Experts (MoE) technique featuring five experts and a gating network for multi-task handling. Experts are obtained by StarCoder fine-tuning. Specifically, we use a Low-Rank Adaptive Method (LoRA) technique, limiting each expert size as only 0.06% of number of StarCoder's parameters. At the same time, to enhance training efficiency in terms of time, we adopt curriculum learning strategy and use self-instruct data for efficient fine-tuning. As a result, each expert takes only 6 hours to train on one single 80Gb A100 HBM. With experiments on XLCoST datasets, SteloCoder achieves an average of 73.76 CodeBLEU score in multi-programming language-to-Python translation, surpassing the top performance from the leaderboard by at least 3.5. This accomplishment is attributed to only 45M extra parameters with StarCoder as the backbone and 32 hours of valid training on one 80GB A100 HBM. The source code is release here: https://github.com/sade-adrien/SteloCoder.
△ Less
Submitted 15 December, 2023; v1 submitted 24 October, 2023;
originally announced October 2023.
-
Unlocking the Power of Representations in Long-term Novelty-based Exploration
Authors:
Alaa Saade,
Steven Kapturowski,
Daniele Calandriello,
Charles Blundell,
Pablo Sprechmann,
Leopoldo Sarra,
Oliver Groth,
Michal Valko,
Bilal Piot
Abstract:
We introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method for novelty-based exploration that estimates visitation counts for clusters of states based on their similarity in a chosen embedding space. By adapting classical clustering to the nonstationary setting of Deep RL, RECODE can efficiently track state visitation counts over thousands of e…
▽ More
We introduce Robust Exploration via Clustering-based Online Density Estimation (RECODE), a non-parametric method for novelty-based exploration that estimates visitation counts for clusters of states based on their similarity in a chosen embedding space. By adapting classical clustering to the nonstationary setting of Deep RL, RECODE can efficiently track state visitation counts over thousands of episodes. We further propose a novel generalization of the inverse dynamics loss, which leverages masked transformer architectures for multi-step prediction; which in conjunction with RECODE achieves a new state-of-the-art in a suite of challenging 3D-exploration tasks in DM-Hard-8. RECODE also sets new state-of-the-art in hard exploration Atari games, and is the first agent to reach the end screen in "Pitfall!".
△ Less
Submitted 2 May, 2023;
originally announced May 2023.
-
BYOL-Explore: Exploration by Bootstrapped Prediction
Authors:
Zhaohan Daniel Guo,
Shantanu Thakoor,
Miruna Pîslar,
Bernardo Avila Pires,
Florent Altché,
Corentin Tallec,
Alaa Saade,
Daniele Calandriello,
Jean-Bastien Grill,
Yunhao Tang,
Michal Valko,
Rémi Munos,
Mohammad Gheshlaghi Azar,
Bilal Piot
Abstract:
We present BYOL-Explore, a conceptually simple yet general approach for curiosity-driven exploration in visually-complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all-together by optimizing a single prediction loss in the latent space with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8, a challeng…
▽ More
We present BYOL-Explore, a conceptually simple yet general approach for curiosity-driven exploration in visually-complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all-together by optimizing a single prediction loss in the latent space with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8, a challenging partially-observable continuous-action hard-exploration benchmark with visually-rich 3-D environments. On this benchmark, we solve the majority of the tasks purely through augmenting the extrinsic reward with BYOL-Explore s intrinsic reward, whereas prior work could only get off the ground with human demonstrations. As further evidence of the generality of BYOL-Explore, we show that it achieves superhuman performance on the ten hardest exploration games in Atari while having a much simpler design than other competitive agents.
△ Less
Submitted 16 June, 2022;
originally announced June 2022.
-
Geometric Entropic Exploration
Authors:
Zhaohan Daniel Guo,
Mohammad Gheshlaghi Azar,
Alaa Saade,
Shantanu Thakoor,
Bilal Piot,
Bernardo Avila Pires,
Michal Valko,
Thomas Mesnard,
Tor Lattimore,
Rémi Munos
Abstract:
Exploration is essential for solving complex Reinforcement Learning (RL) tasks. Maximum State-Visitation Entropy (MSVE) formulates the exploration problem as a well-defined policy optimization problem whose solution aims at visiting all states as uniformly as possible. This is in contrast to standard uncertainty-based approaches where exploration is transient and eventually vanishes. However, exis…
▽ More
Exploration is essential for solving complex Reinforcement Learning (RL) tasks. Maximum State-Visitation Entropy (MSVE) formulates the exploration problem as a well-defined policy optimization problem whose solution aims at visiting all states as uniformly as possible. This is in contrast to standard uncertainty-based approaches where exploration is transient and eventually vanishes. However, existing approaches to MSVE are theoretically justified only for discrete state-spaces as they are oblivious to the geometry of continuous domains. We address this challenge by introducing Geometric Entropy Maximisation (GEM), a new algorithm that maximises the geometry-aware Shannon entropy of state-visits in both discrete and continuous domains. Our key theoretical contribution is casting geometry-aware MSVE exploration as a tractable problem of optimising a simple and novel noise-contrastive objective function. In our experiments, we show the efficiency of GEM in solving several RL problems with sparse rewards, compared against other deep RL exploration approaches.
△ Less
Submitted 7 January, 2021; v1 submitted 6 January, 2021;
originally announced January 2021.
-
Counterfactual Credit Assignment in Model-Free Reinforcement Learning
Authors:
Thomas Mesnard,
Théophane Weber,
Fabio Viola,
Shantanu Thakoor,
Alaa Saade,
Anna Harutyunyan,
Will Dabney,
Tom Stepleton,
Nicolas Heess,
Arthur Guez,
Éric Moulines,
Marcus Hutter,
Lars Buesing,
Rémi Munos
Abstract:
Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e. disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to…
▽ More
Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e. disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events, by learning to extract relevant information from a trajectory. We formulate a family of policy gradient algorithms that use these future-conditional value functions as baselines or critics, and show that they are provably low variance. To avoid the potential bias from conditioning on future information, we constrain the hindsight information to not contain information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative and challenging problems.
△ Less
Submitted 14 December, 2021; v1 submitted 18 November, 2020;
originally announced November 2020.
-
Spoken Language Understanding on the Edge
Authors:
Alaa Saade,
Alice Coucke,
Alexandre Caulier,
Joseph Dureau,
Adrien Ball,
Théodore Bluche,
David Leroy,
Clément Doumouro,
Thibault Gisselbrecht,
Francesco Caltagirone,
Thibaut Lavril,
Maël Primet
Abstract:
We consider the problem of performing Spoken Language Understanding (SLU) on small devices typical of IoT applications. Our contributions are twofold. First, we outline the design of an embedded, private-by-design SLU system and show that it has performance on par with cloud-based commercial solutions. Second, we release the datasets used in our experiments in the interest of reproducibility and i…
▽ More
We consider the problem of performing Spoken Language Understanding (SLU) on small devices typical of IoT applications. Our contributions are twofold. First, we outline the design of an embedded, private-by-design SLU system and show that it has performance on par with cloud-based commercial solutions. Second, we release the datasets used in our experiments in the interest of reproducibility and in the hope that they can prove useful to the SLU community.
△ Less
Submitted 2 October, 2019; v1 submitted 30 October, 2018;
originally announced October 2018.
-
Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces
Authors:
Alice Coucke,
Alaa Saade,
Adrien Ball,
Théodore Bluche,
Alexandre Caulier,
David Leroy,
Clément Doumouro,
Thibault Gisselbrecht,
Francesco Caltagirone,
Thibaut Lavril,
Maël Primet,
Joseph Dureau
Abstract:
This paper presents the machine learning architecture of the Snips Voice Platform, a software solution to perform Spoken Language Understanding on microprocessors typical of IoT devices. The embedded inference is fast and accurate while enforcing privacy by design, as no personal user data is ever collected. Focusing on Automatic Speech Recognition and Natural Language Understanding, we detail our…
▽ More
This paper presents the machine learning architecture of the Snips Voice Platform, a software solution to perform Spoken Language Understanding on microprocessors typical of IoT devices. The embedded inference is fast and accurate while enforcing privacy by design, as no personal user data is ever collected. Focusing on Automatic Speech Recognition and Natural Language Understanding, we detail our approach to training high-performance Machine Learning models that are small enough to run in real-time on small devices. Additionally, we describe a data generation procedure that provides sufficient, high-quality training data without compromising user privacy.
△ Less
Submitted 6 December, 2018; v1 submitted 25 May, 2018;
originally announced May 2018.
-
Deep Representation for Patient Visits from Electronic Health Records
Authors:
Jean-Baptiste Escudié,
Alaa Saade,
Alice Coucke,
Marc Lelarge
Abstract:
We show how to learn low-dimensional representations (embeddings) of patient visits from the corresponding electronic health record (EHR) where International Classification of Diseases (ICD) diagnosis codes are removed. We expect that these embeddings will be useful for the construction of predictive statistical models anticipated to drive personalized medicine and improve healthcare quality. Thes…
▽ More
We show how to learn low-dimensional representations (embeddings) of patient visits from the corresponding electronic health record (EHR) where International Classification of Diseases (ICD) diagnosis codes are removed. We expect that these embeddings will be useful for the construction of predictive statistical models anticipated to drive personalized medicine and improve healthcare quality. These embeddings are learned using a deep neural network trained to predict ICD diagnosis categories. We show that our embeddings capture relevant clinical informations and can be used directly as input to standard machine learning algorithms like multi-output classifiers for ICD code prediction. We also show that important medical informations correspond to particular directions in our embedding space.
△ Less
Submitted 26 March, 2018;
originally announced March 2018.
-
Spectral Inference Methods on Sparse Graphs: Theory and Applications
Authors:
Alaa Saade
Abstract:
In an era of unprecedented deluge of (mostly unstructured) data, graphs are proving more and more useful, across the sciences, as a flexible abstraction to capture complex relationships between complex objects. One of the main challenges arising in the study of such networks is the inference of macroscopic, large-scale properties affecting a large number of objects, based solely on the microscopic…
▽ More
In an era of unprecedented deluge of (mostly unstructured) data, graphs are proving more and more useful, across the sciences, as a flexible abstraction to capture complex relationships between complex objects. One of the main challenges arising in the study of such networks is the inference of macroscopic, large-scale properties affecting a large number of objects, based solely on the microscopic interactions between their elementary constituents. Statistical physics, precisely created to recover the macroscopic laws of thermodynamics from an idealized model of interacting particles, provides significant insight to tackle such complex networks.
In this dissertation, we use methods derived from the statistical physics of disordered systems to design and study new algorithms for inference on graphs. Our focus is on spectral methods, based on certain eigenvectors of carefully chosen matrices, and sparse graphs, containing only a small amount of information. We develop an original theory of spectral inference based on a relaxation of various mean-field free energy optimizations. Our approach is therefore fully probabilistic, and contrasts with more traditional motivations based on the optimization of a cost function. We illustrate the efficiency of our approach on various problems, including community detection, randomized similarity-based clustering, and matrix completion.
△ Less
Submitted 14 October, 2016;
originally announced October 2016.
-
Fast Randomized Semi-Supervised Clustering
Authors:
Alaa Saade,
Florent Krzakala,
Marc Lelarge,
Lenka Zdeborová
Abstract:
We consider the problem of clustering partially labeled data from a minimal number of randomly chosen pairwise comparisons between the items. We introduce an efficient local algorithm based on a power iteration of the non-backtracking operator and study its performance on a simple model. For the case of two clusters, we give bounds on the classification error and show that a small error can be ach…
▽ More
We consider the problem of clustering partially labeled data from a minimal number of randomly chosen pairwise comparisons between the items. We introduce an efficient local algorithm based on a power iteration of the non-backtracking operator and study its performance on a simple model. For the case of two clusters, we give bounds on the classification error and show that a small error can be achieved from $O(n)$ randomly chosen measurements, where $n$ is the number of items in the dataset. Our algorithm is therefore efficient both in terms of time and space complexities. We also investigate numerically the performance of the algorithm on synthetic and real world data.
△ Less
Submitted 9 October, 2016; v1 submitted 20 May, 2016;
originally announced May 2016.
-
Clustering from Sparse Pairwise Measurements
Authors:
Alaa Saade,
Marc Lelarge,
Florent Krzakala,
Lenka Zdeborová
Abstract:
We consider the problem of grouping items into clusters based on few random pairwise comparisons between the items. We introduce three closely related algorithms for this task: a belief propagation algorithm approximating the Bayes optimal solution, and two spectral algorithms based on the non-backtracking and Bethe Hessian operators. For the case of two symmetric clusters, we conjecture that thes…
▽ More
We consider the problem of grouping items into clusters based on few random pairwise comparisons between the items. We introduce three closely related algorithms for this task: a belief propagation algorithm approximating the Bayes optimal solution, and two spectral algorithms based on the non-backtracking and Bethe Hessian operators. For the case of two symmetric clusters, we conjecture that these algorithms are asymptotically optimal in that they detect the clusters as soon as it is information theoretically possible to do so. We substantiate this claim for one of the spectral approaches we introduce.
△ Less
Submitted 19 May, 2016; v1 submitted 25 January, 2016;
originally announced January 2016.
-
Random Projections through multiple optical scattering: Approximating kernels at the speed of light
Authors:
Alaa Saade,
Francesco Caltagirone,
Igor Carron,
Laurent Daudet,
Angélique Drémeau,
Sylvain Gigan,
Florent Krzakala
Abstract:
Random projections have proven extremely useful in many signal processing and machine learning applications. However, they often require either to store a very large random matrix, or to use a different, structured matrix to reduce the computational and memory costs. Here, we overcome this difficulty by proposing an analog, optical device, that performs the random projections literally at the spee…
▽ More
Random projections have proven extremely useful in many signal processing and machine learning applications. However, they often require either to store a very large random matrix, or to use a different, structured matrix to reduce the computational and memory costs. Here, we overcome this difficulty by proposing an analog, optical device, that performs the random projections literally at the speed of light without having to store any matrix in memory. This is achieved using the physical properties of multiple coherent scattering of coherent light in random media. We use this device on a simple task of classification with a kernel machine, and we show that, on the MNIST database, the experimental results closely match the theoretical performance of the corresponding kernel. This framework can help make kernel methods practical for applications that have large training sets and/or require real-time prediction. We discuss possible extensions of the method in terms of a class of kernels, speed, memory consumption and different problems.
△ Less
Submitted 25 October, 2015; v1 submitted 22 October, 2015;
originally announced October 2015.
-
Matrix Completion from Fewer Entries: Spectral Detectability and Rank Estimation
Authors:
Alaa Saade,
Florent Krzakala,
Lenka Zdeborová
Abstract:
The completion of low rank matrices from few entries is a task with many practical applications. We consider here two aspects of this problem: detectability, i.e. the ability to estimate the rank $r$ reliably from the fewest possible random entries, and performance in achieving small reconstruction error. We propose a spectral algorithm for these two tasks called MaCBetH (for Matrix Completion wit…
▽ More
The completion of low rank matrices from few entries is a task with many practical applications. We consider here two aspects of this problem: detectability, i.e. the ability to estimate the rank $r$ reliably from the fewest possible random entries, and performance in achieving small reconstruction error. We propose a spectral algorithm for these two tasks called MaCBetH (for Matrix Completion with the Bethe Hessian). The rank is estimated as the number of negative eigenvalues of the Bethe Hessian matrix, and the corresponding eigenvectors are used as initial condition for the minimization of the discrepancy between the estimated matrix and the revealed entries. We analyze the performance in a random matrix setting using results from the statistical mechanics of the Hopfield neural network, and show in particular that MaCBetH efficiently detects the rank $r$ of a large $n\times m$ matrix from $C(r)r\sqrt{nm}$ entries, where $C(r)$ is a constant close to $1$. We also evaluate the corresponding root-mean-square error empirically and show that MaCBetH compares favorably to other existing approaches.
△ Less
Submitted 28 January, 2016; v1 submitted 10 June, 2015;
originally announced June 2015.
-
Spectral Detection in the Censored Block Model
Authors:
Alaa Saade,
Florent Krzakala,
Marc Lelarge,
Lenka Zdeborová
Abstract:
We consider the problem of partially recovering hidden binary variables from the observation of (few) censored edge weights, a problem with applications in community detection, correlation clustering and synchronization. We describe two spectral algorithms for this task based on the non-backtracking and the Bethe Hessian operators. These algorithms are shown to be asymptotically optimal for the pa…
▽ More
We consider the problem of partially recovering hidden binary variables from the observation of (few) censored edge weights, a problem with applications in community detection, correlation clustering and synchronization. We describe two spectral algorithms for this task based on the non-backtracking and the Bethe Hessian operators. These algorithms are shown to be asymptotically optimal for the partial recovery problem, in that they detect the hidden assignment as soon as it is information theoretically possible to do so.
△ Less
Submitted 10 June, 2015; v1 submitted 31 January, 2015;
originally announced February 2015.
-
Computational Complexity, Phase Transitions, and Message-Passing for Community Detection
Authors:
Aurélien Decelle,
Janina Hüttel,
Alaa Saade,
Cristopher Moore
Abstract:
We take a whirlwind tour of problems and techniques at the boundary of computer science and statistical physics. We start with a brief description of P, NP, and NP-completeness. We then discuss random graphs, including the emergence of the giant component and the k-core, using techniques from branching processes and differential equations. Using these tools as well as the second moment method, we…
▽ More
We take a whirlwind tour of problems and techniques at the boundary of computer science and statistical physics. We start with a brief description of P, NP, and NP-completeness. We then discuss random graphs, including the emergence of the giant component and the k-core, using techniques from branching processes and differential equations. Using these tools as well as the second moment method, we give upper and lower bounds on the critical clause density for random k-SAT. We end with community detection in networks, variational methods, the Bethe free energy, belief propagation, the detectability transition, and the non-backtracking matrix.
△ Less
Submitted 8 September, 2014;
originally announced September 2014.
-
Spectral Clustering of Graphs with the Bethe Hessian
Authors:
Alaa Saade,
Florent Krzakala,
Lenka Zdeborová
Abstract:
Spectral clustering is a standard approach to label nodes on a graph by studying the (largest or lowest) eigenvalues of a symmetric real matrix such as e.g. the adjacency or the Laplacian. Recently, it has been argued that using instead a more complicated, non-symmetric and higher dimensional operator, related to the non-backtracking walk on the graph, leads to improved performance in detecting cl…
▽ More
Spectral clustering is a standard approach to label nodes on a graph by studying the (largest or lowest) eigenvalues of a symmetric real matrix such as e.g. the adjacency or the Laplacian. Recently, it has been argued that using instead a more complicated, non-symmetric and higher dimensional operator, related to the non-backtracking walk on the graph, leads to improved performance in detecting clusters, and even to optimal performance for the stochastic block model. Here, we propose to use instead a simpler object, a symmetric real matrix known as the Bethe Hessian operator, or deformed Laplacian. We show that this approach combines the performances of the non-backtracking operator, thus detecting clusters all the way down to the theoretical limit in the stochastic block model, with the computational, theoretical and memory advantages of real symmetric matrices.
△ Less
Submitted 8 September, 2014; v1 submitted 7 June, 2014;
originally announced June 2014.
-
Spectral density of the non-backtracking operator
Authors:
Alaa Saade,
Florent Krzakala,
Lenka Zdeborová
Abstract:
The non-backtracking operator was recently shown to provide a significant improvement when used for spectral clustering of sparse networks. In this paper we analyze its spectral density on large random sparse graphs using a mapping to the correlation functions of a certain interacting quantum disordered system on the graph. On sparse, tree-like graphs, this can be solved efficiently by the cavity…
▽ More
The non-backtracking operator was recently shown to provide a significant improvement when used for spectral clustering of sparse networks. In this paper we analyze its spectral density on large random sparse graphs using a mapping to the correlation functions of a certain interacting quantum disordered system on the graph. On sparse, tree-like graphs, this can be solved efficiently by the cavity method and a belief propagation algorithm. We show that there exists a paramagnetic phase, leading to zero spectral density, that is stable outside a circle of radius $\sqrtρ$, where $ρ$ is the leading eigenvalue of the non-backtracking operator. We observe a second-order phase transition at the edge of this circle, between a zero and a non-zero spectral density. That fact that this phase transition is absent in the spectral density of other matrices commonly used for spectral clustering provides a physical justification of the performances of the non-backtracking operator in spectral clustering.
△ Less
Submitted 30 April, 2014;
originally announced April 2014.