Search | arXiv e-print repository

Mixtral of Experts

Authors: Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, et al. (1 additional author not shown)

Abstract: We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.

Submitted 8 January, 2024; originally announced January 2024.

Comments: See more details at https://mistral.ai/news/mixtral-of-experts/
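
The routing step described in the Mixtral abstract (a router picks two of eight experts per token and combines their outputs) can be illustrated with a minimal sketch. The abstract does not say how the two outputs are weighted; the softmax over the two selected router logits used here is an assumption, and the names `top2_route`, `moe_layer`, and the toy experts are hypothetical.

```python
import math

def top2_route(logits):
    """Return [(expert_index, weight), ...] for the two largest router logits.

    Assumption: weights are a softmax over the two selected logits only.
    """
    top2 = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    m = max(logits[i] for i in top2)           # subtract max for numerical stability
    exps = [math.exp(logits[i] - m) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_layer(x, router_logits, experts):
    """Weighted sum of the outputs of the two selected experts for one token."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(router_logits):
        y = experts[idx](x)                    # only two experts are ever evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Eight toy "experts" (hypothetical): expert k scales its input by (k + 1).
experts = [lambda v, k=k: [(k + 1) * vi for vi in v] for k in range(8)]
logits = [0.1, 2.0, -1.0, 2.0, 0.3, 0.0, -0.5, 1.0]

routes = top2_route(logits)
output = moe_layer([1.0, -2.0], logits, experts)
print(routes)  # [(1, 0.5), (3, 0.5)]
print(output)  # [3.0, -6.0]
```

Because only the two selected experts run, the per-token compute depends on two expert blocks rather than all eight, which is the intuition behind the gap between total and active parameters quoted in the abstract.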

arXiv:2009.12570 [pdf]

Quantifying the effect of image compression on supervised learning applications in optical microscopy

Authors: Enrico Pomarico, Cédric Schmidt, Florian Chays, David Nguyen, Arielle Planchette, Audrey Tissot, Adrien Roux, Stéphane Pagès, Laura Batti, Christoph Clausen, Theo Lasser, Aleksandra Radenovic, Bruno Sanguinetti, Jérôme Extermann

Abstract: The impressive growth of data throughput in optical microscopy has triggered a widespread use of supervised learning (SL) models running on compressed image datasets for efficient automated analysis. However, since lossy image compression risks to produce unpredictable artifacts, quantifying the effect of data compression on SL applications is of pivotal importance to assess their reliability, especially for clinical use. We propose an experimental method to evaluate the tolerability of image compression distortions in 2D and 3D cell segmentation SL tasks: predictions on compressed data are compared to the raw predictive uncertainty, which is numerically estimated from the raw noise statistics measured through sensor calibration. We show that predictions on object- and image-specific segmentation parameters can be altered by up to 15% and more than 10 standard deviations after 16-to-8 bits downsampling or JPEG compression. In contrast, a recently developed lossless compression algorithm provides a prediction spread which is statistically equivalent to that stemming from raw noise, while providing a compression ratio of up to 10:1. By setting a lower bound to the SL predictive uncertainty, our technique can be generalized to validate a variety of data analysis pipelines in SL-assisted fields.

Submitted 26 September, 2020; originally announced September 2020.

Comments: 26 pages, 8 figures

arXiv:1904.02076 [pdf, other]

Lightweight FEC: Rectangular Codes with Minimum Feedback Information

Authors: Binh-Minh Bui-Xuan, Pierre Meyer, Antoine Roux

Abstract: We propose a hybrid protocol combining a rectangular error-correcting code - paired with an error-detecting code - and a backward error correction in order to send packages of information over a noisy channel. We depict a linear-time algorithm the receiver can use to determine the minimum amount of information to be requested from the sender in order to repair all transmission errors. Repairs may possibly occur over several cycles of emissions and requests. We show that the expected bandwidth use on the backward channel by our protocol is asymptotically small. In most configurations we give the explicit asymptotic expansion for said expectation. This is obtained by linking our problem to a well known algorithmic problem on a gadget graph, feedback edge set. The little use of the backward channel makes our protocol suitable where one could otherwise simply use backward error correction, e.g. TCP, but where overly using the backward channel is undesirable. We confront our protocol to numerical analysis versus TCP protocol. In most cases our protocol allows to reduce the number of iterations down to 60%, while requiring only negligibly more packages.

Submitted 3 April, 2019; originally announced April 2019.
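
For readers unfamiliar with the rectangular codes mentioned in the Lightweight FEC abstract, the following is a minimal sketch of the textbook two-dimensional parity construction: one parity bit per row and per column, which pinpoints a single flipped bit. It is not the paper's protocol (the paired error-detecting code and the feedback requests are not modelled), it assumes the parity bits themselves arrive intact, and all function names are hypothetical.

```python
def encode(block):
    """Return (row_parities, column_parities) for a rectangular block of bits."""
    row_par = [sum(row) % 2 for row in block]
    col_par = [sum(col) % 2 for col in zip(*block)]
    return row_par, col_par

def locate_errors(received, row_par, col_par):
    """Return the rows and columns whose parity check fails."""
    bad_rows = [i for i, row in enumerate(received) if sum(row) % 2 != row_par[i]]
    bad_cols = [j for j, col in enumerate(zip(*received)) if sum(col) % 2 != col_par[j]]
    return bad_rows, bad_cols

def correct_single(received, row_par, col_par):
    """Flip the bit at the unique failing (row, column) pair, if there is one."""
    bad_rows, bad_cols = locate_errors(received, row_par, col_par)
    fixed = [list(row) for row in received]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        fixed[bad_rows[0]][bad_cols[0]] ^= 1
    return fixed

block = [[1, 0, 1],
         [0, 1, 1],
         [1, 1, 0]]
row_par, col_par = encode(block)

received = [list(row) for row in block]
received[1][2] ^= 1                      # simulate one bit flipped in transit

print(locate_errors(received, row_par, col_par))  # ([1], [2])
repaired = correct_single(received, row_par, col_par)
print(repaired == block)                          # True
```

With more than one error, the failing rows and columns no longer identify the flipped bits uniquely, which is the kind of ambiguity a feedback channel can resolve.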

arXiv:1812.08615 [pdf, other]

Temporal Matching

Authors: Julien Baste, Binh-Minh Bui-Xuan, Antoine Roux

Abstract: A link stream is a sequence of pairs of the form $(t,\{u,v\})$, where $t\in\mathbb N$ represents a time instant and $u\neq v$. Given an integer $\gamma$, the $\gamma$-edge between vertices $u$ and $v$, starting at time $t$, is the set of temporally consecutive edges defined by $\{(t',\{u,v\}) | t' \in [t,t+\gamma-1]\}$. We introduce the notion of temporal matching of a link stream to be an independent $\gamma$-edge set belonging to the link stream. We show that the problem of computing a temporal matching of maximum size is NP-hard as soon as $\gamma>1$. We depict a kernelization algorithm parameterized by the solution size for the problem. As a byproduct we also give a $2$-approximation algorithm. Both our $2$-approximation and kernelization algorithms are implemented and confronted to link streams collected from real world graph data. We observe that finding temporal matchings is a sensitive question when mining our data from such a perspective as: managing peer-working when any pair of peers $X$ and $Y$ are to collaborate over a period of one month, at an average rate of at least two email exchanges every week. We furthermore design a link stream generating process by mimicking the behaviour of a random moving group of particles under natural simulation, and confront our algorithms to these generated instances of link streams. All the implementations are open source.

Submitted 7 February, 2019; v1 submitted 20 December, 2018; originally announced December 2018.

Comments: Submitted
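
The $\gamma$-edge definition in the Temporal Matching abstract can be made concrete with a small sketch. It enumerates the $\gamma$-edges present in a link stream and builds an independent set greedily. The abstract does not define independence; the notion used here (two $\gamma$-edges conflict when they share a vertex and their time intervals overlap) is an assumption, and the greedy procedure is a simple illustration, not the paper's kernelization or approximation algorithm. All names are hypothetical.

```python
def gamma_edges(link_stream, gamma):
    """List (t, u, v) for every gamma-edge fully contained in the link stream."""
    present = {(t, frozenset(edge)) for t, edge in link_stream}
    found = []
    for t, edge in sorted(present, key=lambda p: (p[0], sorted(p[1]))):
        # A gamma-edge starting at t needs the same link at t, t+1, ..., t+gamma-1.
        if all((t + k, edge) in present for k in range(gamma)):
            u, v = sorted(edge)
            found.append((t, u, v))
    return found

def independent(a, b, gamma):
    """Assumed notion: conflict iff a shared vertex and overlapping time intervals."""
    (ta, ua, va), (tb, ub, vb) = a, b
    share_vertex = bool({ua, va} & {ub, vb})
    overlap = abs(ta - tb) < gamma       # [ta, ta+gamma-1] meets [tb, tb+gamma-1]
    return not (share_vertex and overlap)

def greedy_temporal_matching(link_stream, gamma):
    """Scan gamma-edges in time order and keep each one that conflicts with none kept."""
    chosen = []
    for cand in gamma_edges(link_stream, gamma):
        if all(independent(cand, c, gamma) for c in chosen):
            chosen.append(cand)
    return chosen

stream = [(0, ("a", "b")), (1, ("a", "b")), (1, ("b", "c")),
          (2, ("b", "c")), (2, ("a", "b")), (3, ("a", "b"))]

print(gamma_edges(stream, 2))
# [(0, 'a', 'b'), (1, 'a', 'b'), (1, 'b', 'c'), (2, 'a', 'b')]
print(greedy_temporal_matching(stream, 2))
# [(0, 'a', 'b'), (2, 'a', 'b')]
```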

arXiv:1110.2477 [pdf, ps, other]

Parallel Binomial American Option Pricing with (and without) Transaction Costs

Authors: Nan Zhang, Alet Roux, Tomasz Zastawniak

Abstract: We present a parallel algorithm that computes the ask and bid prices of an American option when proportional transaction costs apply to the trading of the underlying asset. The algorithm computes the prices on recombining binomial trees, and is designed for modern multi-core processors. Although parallel option pricing has been well studied, none of the existing approaches takes transaction costs into consideration. The algorithm that we propose partitions a binomial tree into blocks. In any round of computation a block is further partitioned into regions which are assigned to distinct processors. To minimise load imbalance the assignment of nodes to processors is dynamically adjusted before each new round starts. Synchronisation is required both within a round and between two successive rounds. The parallel speedup of the algorithm is proportional to the number of processors used. The parallel algorithm was implemented in C/C++ via POSIX Threads, and was tested on a machine with 8 processors. In the pricing of an American put option, the parallel speedup against an efficient sequential implementation was 5.26 using 8 processors and 1500 time steps, achieving a parallel efficiency of 65.75%.

Submitted 11 October, 2011; originally announced October 2011.

MSC Class: 62L15; 90C15; 91B28; 60G42 ACM Class: G.4
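
As background for the option pricing abstract, here is a minimal sequential sketch of American put pricing on a recombining binomial tree, together with the efficiency arithmetic quoted there. It assumes the standard Cox-Ross-Rubinstein parameterisation and no transaction costs, so it covers neither the ask/bid prices nor the parallel block partitioning described in the paper. The function names and the sample market parameters are hypothetical.

```python
import math

def american_put_crr(s0, strike, rate, sigma, maturity, steps):
    """Price an American put by backward induction on a recombining binomial tree.

    Assumption: Cox-Ross-Rubinstein up/down factors, no transaction costs.
    """
    dt = maturity / steps
    u = math.exp(sigma * math.sqrt(dt))        # up factor
    d = 1.0 / u                                # down factor (tree recombines)
    p = (math.exp(rate * dt) - d) / (u - d)    # risk-neutral up probability
    disc = math.exp(-rate * dt)

    # Payoffs at maturity; node j has j up-moves and (steps - j) down-moves.
    values = [max(strike - s0 * u**j * d**(steps - j), 0.0) for j in range(steps + 1)]

    for n in range(steps - 1, -1, -1):
        for j in range(n + 1):
            continuation = disc * (p * values[j + 1] + (1 - p) * values[j])
            exercise = strike - s0 * u**j * d**(n - j)
            values[j] = max(continuation, exercise)   # early exercise allowed
    return values[0]

def parallel_efficiency(speedup, processors):
    """Efficiency is speedup divided by the number of processors."""
    return speedup / processors

price = american_put_crr(s0=100.0, strike=100.0, rate=0.05, sigma=0.2,
                         maturity=1.0, steps=200)
efficiency_pct = round(100 * parallel_efficiency(5.26, 8), 2)
print(efficiency_pct)  # 65.75
```

The efficiency figure reproduces the abstract's numbers directly: a speedup of 5.26 on 8 processors gives 5.26 / 8 = 0.6575, that is 65.75%.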

Showing 1–5 of 5 results for author: Roux, A