-
Order Matters in the Presence of Dataset Imbalance for Multilingual Learning
Authors:
Dami Choi,
Derrick Xin,
Hamid Dadkhahi,
Justin Gilmer,
Ankush Garg,
Orhan Firat,
Chih-Kuan Yeh,
Andrew M. Dai,
Behrooz Ghorbani
Abstract:
In this paper, we empirically study the optimization dynamics of multi-task learning, particularly focusing on those that govern a collection of tasks with significant data imbalance. We present a simple yet effective method of pre-training on high-resource tasks, followed by fine-tuning on a mixture of high/low-resource tasks. We provide a thorough empirical study and analysis of this method's be…
▽ More
In this paper, we empirically study the optimization dynamics of multi-task learning, particularly focusing on those that govern a collection of tasks with significant data imbalance. We present a simple yet effective method of pre-training on high-resource tasks, followed by fine-tuning on a mixture of high/low-resource tasks. We provide a thorough empirical study and analysis of this method's benefits showing that it achieves consistent improvements relative to the performance trade-off profile of standard static weighting. We analyze under what data regimes this method is applicable and show its improvements empirically in neural machine translation (NMT) and multi-lingual language modeling.
△ Less
Submitted 11 December, 2023;
originally announced December 2023.
-
Epsilon Sampling Rocks: Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation
Authors:
Markus Freitag,
Behrooz Ghorbani,
Patrick Fernandes
Abstract:
Recent advances in machine translation (MT) have shown that Minimum Bayes Risk (MBR) decoding can be a powerful alternative to beam search decoding, especially when combined with neural-based utility functions. However, the performance of MBR decoding depends heavily on how and how many candidates are sampled from the model. In this paper, we explore how different sampling approaches for generatin…
▽ More
Recent advances in machine translation (MT) have shown that Minimum Bayes Risk (MBR) decoding can be a powerful alternative to beam search decoding, especially when combined with neural-based utility functions. However, the performance of MBR decoding depends heavily on how and how many candidates are sampled from the model. In this paper, we explore how different sampling approaches for generating candidate lists for MBR decoding affect performance. We evaluate popular sampling approaches, such as ancestral, nucleus, and top-k sampling. Based on our insights into their limitations, we experiment with the recently proposed epsilon-sampling approach, which prunes away all tokens with a probability smaller than epsilon, ensuring that each token in a sample receives a fair probability mass. Through extensive human evaluations, we demonstrate that MBR decoding based on epsilon-sampling significantly outperforms not only beam search decoding, but also MBR decoding with all other tested sampling methods across four language pairs.
△ Less
Submitted 17 May, 2023; v1 submitted 16 May, 2023;
originally announced May 2023.
-
Scaling Laws for Multilingual Neural Machine Translation
Authors:
Patrick Fernandes,
Behrooz Ghorbani,
Xavier Garcia,
Markus Freitag,
Orhan Firat
Abstract:
In this work, we provide a large-scale empirical study of the scaling properties of multilingual neural machine translation models. We examine how increases in the model size affect the model performance and investigate the role of the training mixture composition on the scaling behavior. We find that changing the weightings of the individual language pairs in the training mixture only affect the…
▽ More
In this work, we provide a large-scale empirical study of the scaling properties of multilingual neural machine translation models. We examine how increases in the model size affect the model performance and investigate the role of the training mixture composition on the scaling behavior. We find that changing the weightings of the individual language pairs in the training mixture only affect the multiplicative factor of the scaling law. In particular, we observe that multilingual models trained using different mixing rates all exhibit the same scaling exponent. Through a novel joint scaling law formulation, we compute the effective number of parameters allocated to each language pair and examine the role of language similarity in the scaling behavior of our models. We find little evidence that language similarity has any impact. In contrast, the direction of the multilinguality plays a significant role, with models translating from multiple languages into English having a larger number of effective parameters per task than their reversed counterparts. Finally, we leverage our observations to predict the performance of multilingual models trained with any language weighting at any scale, significantly reducing efforts required for language balancing in large multilingual models. Our findings apply to both in-domain and out-of-domain test sets and to multiple evaluation metrics, such as ChrF and BLEURT.
△ Less
Submitted 19 February, 2023;
originally announced February 2023.
-
Binarized Neural Machine Translation
Authors:
Yichi Zhang,
Ankush Garg,
Yuan Cao,
Łukasz Lew,
Behrooz Ghorbani,
Zhiru Zhang,
Orhan Firat
Abstract:
The rapid scaling of language models is motivating research using low-bitwidth quantization. In this work, we propose a novel binarization technique for Transformers applied to machine translation (BMT), the first of its kind. We identify and address the problem of inflated dot-product variance when using one-bit weights and activations. Specifically, BMT leverages additional LayerNorms and residu…
▽ More
The rapid scaling of language models is motivating research using low-bitwidth quantization. In this work, we propose a novel binarization technique for Transformers applied to machine translation (BMT), the first of its kind. We identify and address the problem of inflated dot-product variance when using one-bit weights and activations. Specifically, BMT leverages additional LayerNorms and residual connections to improve binarization quality. Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16x smaller in size. One-bit activations incur varying degrees of quality drop, but mitigated by the proposed architectural changes. We further conduct a scaling law study using production-scale translation datasets, which shows that one-bit weight Transformers scale and generalize well in both in-domain and out-of-domain settings. Implementation in JAX/Flax will be open sourced.
△ Less
Submitted 9 February, 2023;
originally announced February 2023.
-
Do Current Multi-Task Optimization Methods in Deep Learning Even Help?
Authors:
Derrick Xin,
Behrooz Ghorbani,
Ankush Garg,
Orhan Firat,
Justin Gilmer
Abstract:
Recent research has proposed a series of specialized optimization algorithms for deep multi-task models. It is often claimed that these multi-task optimization (MTO) methods yield solutions that are superior to the ones found by simply optimizing a weighted average of the task losses. In this paper, we perform large-scale experiments on a variety of language and vision tasks to examine the empiric…
▽ More
Recent research has proposed a series of specialized optimization algorithms for deep multi-task models. It is often claimed that these multi-task optimization (MTO) methods yield solutions that are superior to the ones found by simply optimizing a weighted average of the task losses. In this paper, we perform large-scale experiments on a variety of language and vision tasks to examine the empirical validity of these claims. We show that, despite the added design and computational complexity of these algorithms, MTO methods do not yield any performance improvements beyond what is achievable via traditional optimization approaches. We highlight alternative strategies that consistently yield improvements to the performance profile and point out common training pitfalls that might cause suboptimal results. Finally, we outline challenges in reliably evaluating the performance of MTO algorithms and discuss potential solutions.
△ Less
Submitted 22 September, 2022;
originally announced September 2022.
-
Adaptive Gradient Methods at the Edge of Stability
Authors:
Jeremy M. Cohen,
Behrooz Ghorbani,
Shankar Krishnan,
Naman Agarwal,
Sourabh Medapati,
Michal Badura,
Daniel Suo,
David Cardoze,
Zachary Nado,
George E. Dahl,
Justin Gilmer
Abstract:
Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical…
▽ More
Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value -- the stability threshold of a gradient descent algorithm. For Adam with step size $η$ and $β_1 = 0.9$, this stability threshold is $38/η$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the ``Adaptive Edge of Stability'' (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.
△ Less
Submitted 15 April, 2024; v1 submitted 29 July, 2022;
originally announced July 2022.
-
Data Scaling Laws in NMT: The Effect of Noise and Architecture
Authors:
Yamini Bansal,
Behrooz Ghorbani,
Ankush Garg,
Biao Zhang,
Maxim Krikun,
Colin Cherry,
Behnam Neyshabur,
Orhan Firat
Abstract:
In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand…
▽ More
In this work, we study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT). First, we establish that the test loss of encoder-decoder transformer models scales as a power law in the number of training samples, with a dependence on the model size. Then, we systematically vary aspects of the training setup to understand how they impact the data scaling laws. In particular, we change the following (1) Architecture and task setup: We compare to a transformer-LSTM hybrid, and a decoder-only transformer with a language modeling loss (2) Noise level in the training distribution: We experiment with filtering, and adding iid synthetic noise. In all the above cases, we find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data. Lastly, we find that using back-translated data instead of parallel data, can significantly degrade the scaling exponent.
△ Less
Submitted 4 February, 2022;
originally announced February 2022.
-
Examining Scaling and Transfer of Language Model Architectures for Machine Translation
Authors:
Biao Zhang,
Behrooz Ghorbani,
Ankur Bapna,
Yong Cheng,
Xavier Garcia,
Jonathan Shen,
Orhan Firat
Abstract:
Natural language understanding and generation models follow one of the two dominant architectural paradigms: language models (LMs) that process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec) that utilize separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, but with few studies investigati…
▽ More
Natural language understanding and generation models follow one of the two dominant architectural paradigms: language models (LMs) that process concatenated sequences in a single stack of layers, and encoder-decoder models (EncDec) that utilize separate layer stacks for input and output processing. In machine translation, EncDec has long been the favoured approach, but with few studies investigating the performance of LMs. In this work, we thoroughly examine the role of several architectural design choices on the performance of LMs on bilingual, (massively) multilingual and zero-shot translation tasks, under systematic variations of data conditions and model sizes. Our results show that: (i) Different LMs have different scaling properties, where architectural differences often have a significant impact on model performance at small scales, but the performance gap narrows as the number of parameters increases, (ii) Several design choices, including causal masking and language-modeling objectives for the source sequence, have detrimental effects on translation quality, and (iii) When paired with full-visible masking for source sequences, LMs could perform on par with EncDec on supervised bilingual and multilingual translation tasks, and improve greatly on zero-shot directions by facilitating the reduction of off-target translations.
△ Less
Submitted 16 February, 2022; v1 submitted 1 February, 2022;
originally announced February 2022.
-
A Loss Curvature Perspective on Training Instability in Deep Learning
Authors:
Justin Gilmer,
Behrooz Ghorbani,
Ankush Garg,
Sneha Kudugunta,
Behnam Neyshabur,
David Cardoze,
George Dahl,
Zachary Nado,
Orhan Firat
Abstract:
In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics…
▽ More
In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics such as gradient clipping and learning rate warmup. Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid -- or navigate out of -- regions of high curvature and into flatter regions that tolerate a higher learning rate. Our results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor conditioning. Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization.
△ Less
Submitted 8 October, 2021;
originally announced October 2021.
-
Scaling Laws for Neural Machine Translation
Authors:
Behrooz Ghorbani,
Orhan Firat,
Markus Freitag,
Ankur Bapna,
Maxim Krikun,
Xavier Garcia,
Ciprian Chelba,
Colin Cherry
Abstract:
We present an empirical study of scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a certain scaling law. Specifically (i) We propose a formula which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accu…
▽ More
We present an empirical study of scaling properties of encoder-decoder Transformer models used in neural machine translation (NMT). We show that cross-entropy loss as a function of model size follows a certain scaling law. Specifically (i) We propose a formula which describes the scaling behavior of cross-entropy loss as a bivariate function of encoder and decoder size, and show that it gives accurate predictions under a variety of scaling approaches and languages; we show that the total number of parameters alone is not sufficient for such purposes. (ii) We observe different power law exponents when scaling the decoder vs scaling the encoder, and provide recommendations for optimal allocation of encoder/decoder capacity based on this observation. (iii) We also report that the scaling behavior of the model is acutely influenced by composition bias of the train/test sets, which we define as any deviation from naturally generated text (either via machine generated or human translated text). We observe that natural text on the target side enjoys scaling, which manifests as successful reduction of the cross-entropy loss. (iv) Finally, we investigate the relationship between the cross-entropy loss and the quality of the generated translations. We find two different behaviors, depending on the nature of the test data. For test sets which were originally translated from target language to source language, both loss and BLEU score improve as model size increases. In contrast, for test sets originally translated from source language to target language, the loss improves, but the BLEU score stops improving after a certain threshold. We release generated text from all models used in this study.
△ Less
Submitted 16 September, 2021;
originally announced September 2021.
-
When Do Neural Networks Outperform Kernel Methods?
Authors:
Behrooz Ghorbani,
Song Mei,
Theodor Misiakiewicz,
Andrea Montanari
Abstract:
For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NN) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layers NNs are known to encode richer smoothn…
▽ More
For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NN) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layers NNs are known to encode richer smoothness classes than RKHS and we know of special examples for which SGD-trained NN provably outperform RKHS. This is true even in the wide network limit, for a different scaling of the initialization.
How can we reconcile the above claims? For which tasks do NNs outperform RKHS? If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present the spiked covariates model that can capture in a unified framework both behaviors observed in earlier work.
We hypothesize that such a latent low-dimensional structure is present in image classification. We test numerically this hypothesis by showing that specific perturbations of the training distribution degrade the performances of RKHS methods much more significantly than NNs.
△ Less
Submitted 9 November, 2021; v1 submitted 23 June, 2020;
originally announced June 2020.
-
Limitations of Lazy Training of Two-layers Neural Networks
Authors:
Behrooz Ghorbani,
Song Mei,
Theodor Misiakiewicz,
Andrea Montanari
Abstract:
We study the supervised learning problem under either of the following two models: (1) Feature vectors ${\boldsymbol x}_i$ are $d$-dimensional Gaussians and responses are $y_i = f_*({\boldsymbol x}_i)$ for $f_*$ an unknown quadratic function; (2) Feature vectors ${\boldsymbol x}_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and $y_i$'s are the corresponding class label…
▽ More
We study the supervised learning problem under either of the following two models: (1) Feature vectors ${\boldsymbol x}_i$ are $d$-dimensional Gaussians and responses are $y_i = f_*({\boldsymbol x}_i)$ for $f_*$ an unknown quadratic function; (2) Feature vectors ${\boldsymbol x}_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and $y_i$'s are the corresponding class labels. We use two-layers neural networks with quadratic activations, and compare three different learning regimes: the random features (RF) regime in which we only train the second-layer weights; the neural tangent (NT) regime in which we train a linearization of the neural network around its initialization; the fully trained neural network (NN) regime in which we train all the weights in the network. We prove that, even for the simple quadratic model of point (1), there is a potentially unbounded gap between the prediction risk achieved in these three training regimes, when the number of neurons is smaller than the ambient dimension. When the number of neurons is larger than the number of dimensions, the problem is significantly easier and both NT and NN learning achieve zero risk.
△ Less
Submitted 20 June, 2019;
originally announced June 2019.
-
Linearized two-layers neural networks in high dimension
Authors:
Behrooz Ghorbani,
Song Mei,
Theodor Misiakiewicz,
Andrea Montanari
Abstract:
We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizat…
▽ More
We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: the random features model of Rahimi-Recht (RF); the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$.
We consider two specific regimes: the approximation-limited regime, in which $n=\infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime in which $N=\infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + δ} \le N\le d^{\ell+1-δ}$ for small $δ> 0$, then \RF\, effectively fits a degree-$\ell$ polynomial in the raw features, and \NT\, fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples is $d^{\ell + δ} \le n \le d^{\ell +1-δ}$, then kernel methods can fit at most a a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.
△ Less
Submitted 16 February, 2020; v1 submitted 27 April, 2019;
originally announced April 2019.
-
An Investigation into Neural Net Optimization via Hessian Eigenvalue Density
Authors:
Behrooz Ghorbani,
Shankar Krishnan,
Ying Xiao
Abstract:
To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this, we study a number of hypotheses concerning smoothness, curvature, and sharpness in the deep learning literature. We then thoroughly analyze a crucial structural feature of the spectra: in non-batch normalized ne…
▽ More
To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this, we study a number of hypotheses concerning smoothness, curvature, and sharpness in the deep learning literature. We then thoroughly analyze a crucial structural feature of the spectra: in non-batch normalized networks, we observe the rapid appearance of large isolated eigenvalues in the spectrum, along with a surprising concentration of the gradient in the corresponding eigenspaces. In batch normalized networks, these two effects are almost absent. We characterize these effects, and explain how they affect optimization speed through both theory and experiments. As part of this work, we adapt advanced tools from numerical linear algebra that allow scalable and accurate estimation of the entire Hessian spectrum of ImageNet-scale neural networks; this technique may be of independent interest in other applications.
△ Less
Submitted 29 January, 2019;
originally announced January 2019.
-
Sparse regression with highly correlated predictors
Authors:
Behrooz Ghorbani,
Ozgur Yilmaz
Abstract:
We consider a linear regression $y=Xβ+u$ where $X\in\mathbb{\mathbb{R}}^{n\times p}$, $p\gg n,$ and $β$ is $s$-sparse. Motivated by examples in financial and economic data, we consider the situation where $X$ has highly correlated and clustered columns. To perform sparse recovery in this setting, we introduce the \emph{clustering removal algorithm} (CRA), that seeks to decrease the correlation in…
▽ More
We consider a linear regression $y=Xβ+u$ where $X\in\mathbb{\mathbb{R}}^{n\times p}$, $p\gg n,$ and $β$ is $s$-sparse. Motivated by examples in financial and economic data, we consider the situation where $X$ has highly correlated and clustered columns. To perform sparse recovery in this setting, we introduce the \emph{clustering removal algorithm} (CRA), that seeks to decrease the correlation in $X$ by removing the cluster structure without changing the parameter vector $β$. We show that as long as certain assumptions hold about $X$, the decorrelated matrix will satisfy the restricted isometry property (RIP) with high probability. We also provide examples of the empirical performance of CRA and compare it with other sparse recovery techniques.
△ Less
Submitted 4 April, 2015;
originally announced April 2015.