-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Authors:
Gemini Team,
Petko Georgiev,
Ving Ian Lei,
Ryan Burnell,
Libin Bai,
Anmol Gulati,
Garrett Tanzer,
Damien Vincent,
Zhufeng Pan,
Shibo Wang,
Soroosh Mariooryad,
Yifan Ding,
Xinyang Geng,
Fred Alcober,
Roy Frostig,
Mark Omernick,
Lexi Walker,
Cosmin Paduraru,
Christina Sorokin,
Andrea Tacchetti,
Colin Gaffney,
Samira Daruki,
Olcan Sercinoglu,
Zach Gleicher,
Juliette Love
, et al. (1110 additional authors not shown)
Abstract:
In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
Submitted 8 August, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
Instruction-Following Evaluation for Large Language Models
Authors:
Jeffrey Zhou,
Tianjian Lu,
Swaroop Mishra,
Siddhartha Brahma,
Sujoy Basu,
Yi Luan,
Denny Zhou,
Le Hou
Abstract:
One core capability of Large Language Models (LLMs) is to follow natural language instructions. However, the evaluation of such abilities is not standardized: Human evaluations are expensive, slow, and not objectively reproducible, while LLM-based auto-evaluation is potentially biased or limited by the ability of the evaluator LLM. To overcome these issues, we introduce Instruction-Following Eval (IFEval) for large language models. IFEval is a straightforward and easy-to-reproduce evaluation benchmark. It focuses on a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". We identified 25 types of those verifiable instructions and constructed around 500 prompts, with each prompt containing one or more verifiable instructions. We show evaluation results of two widely available LLMs on the market. Our code and data can be found at https://github.com/google-research/google-research/tree/master/instruction_following_eval
Submitted 14 November, 2023;
originally announced November 2023.
-
PaLM 2 Technical Report
Authors:
Rohan Anil,
Andrew M. Dai,
Orhan Firat,
Melvin Johnson,
Dmitry Lepikhin,
Alexandre Passos,
Siamak Shakeri,
Emanuel Taropa,
Paige Bailey,
Zhifeng Chen,
Eric Chu,
Jonathan H. Clark,
Laurent El Shafey,
Yanping Huang,
Kathy Meier-Hellstern,
Gaurav Mishra,
Erica Moreira,
Mark Omernick,
Kevin Robinson,
Sebastian Ruder,
Yi Tay,
Kefan Xiao,
Yuanzhong Xu,
Yujing Zhang,
Gustavo Hernandez Abrego
, et al. (103 additional authors not shown)
Abstract:
We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on downstream tasks across different model sizes, while simultaneously exhibiting faster and more efficient inference compared to PaLM. This improved efficiency enables broader deployment while also allowing the model to respond faster, for a more natural pace of interaction. PaLM 2 demonstrates robust reasoning capabilities exemplified by large improvements over PaLM on BIG-Bench and other reasoning tasks. PaLM 2 exhibits stable performance on a suite of responsible AI evaluations, and enables inference-time control over toxicity without additional overhead or impact on other capabilities. Overall, PaLM 2 achieves state-of-the-art performance across a diverse set of tasks and capabilities.
When discussing the PaLM 2 family, it is important to distinguish between pre-trained models (of various sizes), fine-tuned variants of these models, and the user-facing products that use these models. In particular, user-facing products typically include additional pre- and post-processing steps. Additionally, the underlying models may evolve over time. Therefore, one should not expect the performance of user-facing products to exactly match the results reported in this report.
Submitted 13 September, 2023; v1 submitted 17 May, 2023;
originally announced May 2023.
-
Conditional Adapters: Parameter-efficient Transfer Learning with Fast Inference
Authors:
Tao Lei,
Junwen Bai,
Siddhartha Brahma,
Joshua Ainslie,
Kenton Lee,
Yanqi Zhou,
Nan Du,
Vincent Y. Zhao,
Yuexin Wu,
Bo Li,
Yu Zhang,
Ming-Wei Chang
Abstract:
We propose Conditional Adapter (CoDA), a parameter-efficient transfer learning method that also improves inference efficiency. CoDA generalizes beyond standard adapter approaches to enable a new way of balancing speed and accuracy using conditional computation. Starting with an existing dense pretrained model, CoDA adds sparse activation together with a small number of new parameters and a light-weight training phase. Our experiments demonstrate that the CoDA approach provides an unexpectedly efficient way to transfer knowledge. Across a variety of language, vision, and speech tasks, CoDA achieves a 2x to 8x inference speed-up compared to the state-of-the-art Adapter approaches with moderate to no accuracy loss and the same parameter efficiency.
Submitted 26 November, 2023; v1 submitted 10 April, 2023;
originally announced April 2023.
-
CoLT5: Faster Long-Range Transformers with Conditional Computation
Authors:
Joshua Ainslie,
Tao Lei,
Michiel de Jong,
Santiago Ontañón,
Siddhartha Brahma,
Yury Zemlyanskiy,
David Uthus,
Mandy Guo,
James Lee-Thorp,
Yi Tay,
Yun-Hsuan Sung,
Sumit Sanghai
Abstract:
Many natural language processing tasks benefit from long inputs, but processing long documents with Transformers is expensive -- not only due to quadratic attention complexity but also from applying feedforward and projection layers to every token. However, not all tokens are equally important, especially for longer documents. We propose CoLT5, a long-input Transformer model that builds on this intuition by employing conditional computation, devoting more resources to important tokens in both feedforward and attention layers. We show that CoLT5 achieves stronger performance than LongT5 with much faster training and inference, achieving SOTA on the long-input SCROLLS benchmark. Moreover, CoLT5 can effectively and tractably make use of extremely long inputs, showing strong gains up to 64k input length.
Submitted 23 October, 2023; v1 submitted 16 March, 2023;
originally announced March 2023.
-
Multi-Vector Retrieval as Sparse Alignment
Authors:
Yujie Qian,
Jinhyuk Lee,
Sai Meher Karthik Duddu,
Zhuyun Dai,
Siddhartha Brahma,
Iftekhar Naim,
Tao Lei,
Vincent Y. Zhao
Abstract:
Multi-vector retrieval models improve over single-vector dual encoders on many information retrieval tasks. In this paper, we cast the multi-vector retrieval problem as sparse alignment between query and document tokens. We propose AligneR, a novel multi-vector retrieval model that learns sparsified pairwise alignments between query and document tokens (e.g. 'dog' vs. 'puppy') and per-token unary saliences reflecting their relative importance for retrieval. We show that controlling the sparsity of pairwise token alignments often brings significant performance gains. While most factoid questions focusing on a specific part of a document require a smaller number of alignments, others requiring a broader understanding of a document favor a larger number of alignments. Unary saliences, on the other hand, decide whether a token ever needs to be aligned with others for retrieval (e.g. 'kind' from 'kind of currency is used in new zealand'). With sparsified unary saliences, we are able to prune a large number of query and document token vectors and improve the efficiency of multi-vector retrieval. We learn the sparse unary saliences with entropy-regularized linear programming, which outperforms other methods to achieve sparsity. In a zero-shot setting, AligneR scores 51.1 points nDCG@10, achieving a new retriever-only state-of-the-art on 13 tasks in the BEIR benchmark. In addition, adapting pairwise alignments with a few examples (<= 8) further improves the performance up to 15.7 points nDCG@10 for argument retrieval tasks. The unary saliences of AligneR help us keep only 20% of the document token representations with minimal performance loss. We further show that our model often produces interpretable alignments and significantly improves its performance when initialized from larger language models.
Submitted 2 November, 2022;
originally announced November 2022.
-
Scaling Instruction-Finetuned Language Models
Authors:
Hyung Won Chung,
Le Hou,
Shayne Longpre,
Barret Zoph,
Yi Tay,
William Fedus,
Yunxuan Li,
Xuezhi Wang,
Mostafa Dehghani,
Siddhartha Brahma,
Albert Webson,
Shixiang Shane Gu,
Zhuyun Dai,
Mirac Suzgun,
Xinyun Chen,
Aakanksha Chowdhery,
Alex Castro-Ros,
Marie Pellat,
Kevin Robinson,
Dasha Valter,
Sharan Narang,
Gaurav Mishra,
Adams Yu,
Vincent Zhao,
Yanping Huang
, et al. (10 additional authors not shown)
Abstract:
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
Submitted 6 December, 2022; v1 submitted 20 October, 2022;
originally announced October 2022.
-
Breaking BERT: Evaluating and Optimizing Sparsified Attention
Authors:
Siddhartha Brahma,
Polina Zablotskaia,
David Mimno
Abstract:
Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections - and their quadratic time and memory - may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, and token position to random connections, and measure which patterns reduce performance the least. We find that on three common finetuning tasks even using attention that is at least 78% sparse can have little effect on performance if applied at later transformer layers, but that applying sparsity throughout the network reduces performance significantly. Second, we vary the degree of sparsity for three patterns supported by previous work, and find that connections to neighbouring tokens are the most significant. Finally, we treat sparsity as an optimizable parameter, and present an algorithm to learn degrees of neighboring connections that gives a fine-grained control over the accuracy-sparsity trade-off while approaching the performance of existing methods.
Submitted 7 October, 2022;
originally announced October 2022.
-
Improved Semantic Role Labeling using Parameterized Neighborhood Memory Adaptation
Authors:
Ishan Jindal,
Ranit Aharonov,
Siddhartha Brahma,
Huaiyu Zhu,
Yunyao Li
Abstract:
Deep neural models achieve some of the best results for semantic role labeling. Inspired by instance-based learning that utilizes nearest neighbors to handle low-frequency context-specific training samples, we investigate the use of memory adaptation techniques in deep neural models. We propose a parameterized neighborhood memory adaptive (PNMA) method that uses a parameterized representation of the nearest neighbors of tokens in a memory of activations and makes predictions based on the most similar samples in the training data. We empirically show that PNMA consistently improves the SRL performance of the base model irrespective of types of word embeddings. Coupled with contextualized word embeddings derived from BERT, PNMA improves over existing models for both span and dependency semantic parsing datasets, especially on out-of-domain text, reaching F1 scores of 80.2 and 84.97 on the CoNLL2005 and CoNLL2009 datasets, respectively.
Submitted 29 November, 2020;
originally announced November 2020.
-
CLAR: A Cross-Lingual Argument Regularizer for Semantic Role Labeling
Authors:
Ishan Jindal,
Yunyao Li,
Siddhartha Brahma,
Huaiyu Zhu
Abstract:
Semantic role labeling (SRL) identifies predicate-argument structure(s) in a given sentence. Although different languages have different argument annotations, polyglot training, the idea of training one model on multiple languages, has previously been shown to outperform monolingual baselines, especially for low resource languages. In fact, even a simple combination of data has been shown to be effective with polyglot training by representing the distant vocabularies in a shared representation space. Meanwhile, despite the dissimilarity in argument annotations between languages, certain argument labels do share common semantic meaning across languages (e.g. adjuncts have more or less similar semantic meaning across languages). To leverage such similarity in annotation space across languages, we propose a method called Cross-Lingual Argument Regularizer (CLAR). CLAR identifies such linguistic annotation similarity across languages and exploits this information to map the target language arguments using a transformation of the space on which source language arguments lie. By doing so, our experimental results show that CLAR consistently improves SRL performance on multiple languages over monolingual and polyglot baselines for low resource languages.
Submitted 9 November, 2020;
originally announced November 2020.
-
Small but Mighty: New Benchmarks for Split and Rephrase
Authors:
Li Zhang,
Huaiyu Zhu,
Siddhartha Brahma,
Yunyao Li
Abstract:
Split and Rephrase is a text simplification task of rewriting a complex sentence into simpler ones. As a relatively new task, it is paramount to ensure the soundness of its evaluation benchmark and metric. We find that the widely used benchmark dataset universally contains easily exploitable syntactic cues caused by its automatic generation process. Taking advantage of such cues, we show that even a simple rule-based model can perform on par with the state-of-the-art model. To remedy such limitations, we collect and release two crowdsourced benchmark datasets. We not only make sure that they contain significantly more diverse syntax, but also carefully control for their quality according to a well-defined set of criteria. While no satisfactory automatic metric exists, we apply fine-grained manual evaluation based on these criteria using crowdsourcing, showing that our datasets better represent the task and are significantly more challenging for the models.
Submitted 12 December, 2020; v1 submitted 17 September, 2020;
originally announced September 2020.
-
Efficiently Processing Workflow Provenance Queries on SPARK
Authors:
Rajmohan C,
Pranay Lohia,
Himanshu Gupta,
Siddhartha Brahma,
Mauricio Hernandez,
Sameep Mehta
Abstract:
In this paper, we investigate how we can leverage the Spark platform for efficiently processing provenance queries on large volumes of workflow provenance data. We focus on processing provenance queries at the attribute-value level, which is the finest granularity available. We propose a novel weakly connected component based framework which is carefully engineered to quickly determine a minimal volume of data containing the entire lineage of the queried attribute-value. This minimal volume of data is then processed to figure out the provenance of the queried attribute-value. The proposed framework computes weakly connected components on the workflow provenance graph and further partitions the large components as a collection of weakly connected sets. The framework exploits the workflow dependency graph to effectively partition the large components into a collection of weakly connected sets. We study the effectiveness of the proposed framework through experiments on a provenance trace obtained from a real-life unstructured text curation workflow. On provenance graphs containing up to 500M nodes and edges, we show that the proposed framework answers provenance queries in real-time and easily outperforms the naive approaches.
Submitted 25 October, 2018; v1 submitted 25 August, 2018;
originally announced August 2018.
-
Improved Language Modeling by Decoding the Past
Authors:
Siddhartha Brahma
Abstract:
Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token in the context using the predicted distribution of the next token. This biases the model towards retaining more contextual information, in turn improving its ability to predict the next token. With negligible overhead in the number of parameters and training time, our Past Decode Regularization (PDR) method achieves a word level perplexity of 55.6 on the Penn Treebank and 63.5 on the WikiText-2 datasets using a single softmax. We also show gains by using PDR in combination with a mixture-of-softmaxes, achieving a word level perplexity of 53.8 and 60.5 on these datasets. In addition, our method achieves 1.169 bits-per-character on the Penn Treebank Character dataset for character level language modeling. These results constitute a new state-of-the-art in their respective settings.
Submitted 23 January, 2019; v1 submitted 14 August, 2018;
originally announced August 2018.
-
REGMAPR - Text Matching Made Easy
Authors:
Siddhartha Brahma
Abstract:
Text matching is a fundamental problem in natural language processing. Neural models using bidirectional LSTMs for sentence encoding and inter-sentence attention mechanisms perform remarkably well on several benchmark datasets. We propose REGMAPR - a simple and general architecture for text matching that does not use inter-sentence attention. Starting from a Siamese architecture, we augment the embeddings of the words with two features based on exact and paraphrase match between words in the two sentences. We train the model using three types of regularization on datasets for textual entailment, paraphrase detection and semantic relatedness. REGMAPR performs comparably or better than more complex neural models or models using a large number of handcrafted features. REGMAPR achieves state-of-the-art results for paraphrase detection on the SICK dataset and for textual entailment on the SNLI dataset among models that do not use inter-sentence attention.
Submitted 10 September, 2018; v1 submitted 13 August, 2018;
originally announced August 2018.
-
Unsupervised Learning of Sentence Representations Using Sequence Consistency
Authors:
Siddhartha Brahma
Abstract:
Computing universal distributed representations of sentences is a fundamental task in natural language processing. We propose ConsSent, a simple yet surprisingly powerful unsupervised method to learn such representations by enforcing consistency constraints on sequences of tokens. We consider two classes of such constraints -- sequences that form a sentence and between two sequences that form a sentence when merged. We learn sentence encoders by training them to distinguish between consistent and inconsistent examples, the latter being generated by randomly perturbing consistent examples in six different ways. Extensive evaluation on several transfer learning and linguistic probing tasks shows improved performance over strong unsupervised and supervised baselines, substantially surpassing them in several cases. Our best results are achieved by training sentence encoders in a multitask setting and by an ensemble of encoders trained on the individual tasks.
Submitted 23 January, 2019; v1 submitted 10 August, 2018;
originally announced August 2018.
-
Improved Sentence Modeling using Suffix Bidirectional LSTM
Authors:
Siddhartha Brahma
Abstract:
Recurrent neural networks have become ubiquitous in computing representations of sequential data, especially textual data in natural language processing. In particular, Bidirectional LSTMs are at the heart of several neural models achieving state-of-the-art performance in a wide variety of tasks in NLP. However, BiLSTMs are known to suffer from sequential bias - the contextual representation of a token is heavily influenced by tokens close to it in a sentence. We propose a general and effective improvement to the BiLSTM model which encodes each suffix and prefix of a sequence of tokens in both forward and reverse directions. We call our model Suffix Bidirectional LSTM or SuBiLSTM. This introduces an alternate bias that favors long range dependencies. We apply SuBiLSTMs to several tasks that require sentence modeling. We demonstrate that using SuBiLSTM instead of a BiLSTM in existing models leads to improvements in performance in learning general sentence representations, text classification, textual entailment and paraphrase detection. Using SuBiLSTM we achieve new state-of-the-art results for fine-grained sentiment classification and question classification.
Submitted 10 September, 2018; v1 submitted 18 May, 2018;
originally announced May 2018.
-
On the scaling of polynomial features for representation matching
Authors:
Siddhartha Brahma
Abstract:
In many neural models, new features as polynomial functions of existing ones are used to augment representations. Using the natural language inference task as an example, we investigate the use of scaled polynomials of degree 2 and above as matching features. We find that scaling degree 2 features has the highest impact on performance, reducing classification error by 5% in the best models.
Submitted 20 February, 2018;
originally announced February 2018.
-
SufiSent - Universal Sentence Representations Using Suffix Encodings
Authors:
Siddhartha Brahma
Abstract:
Computing universal distributed representations of sentences is a fundamental task in natural language processing. We propose a method to learn such representations by encoding the suffixes of word sequences in a sentence and training on the Stanford Natural Language Inference (SNLI) dataset. We demonstrate the effectiveness of our approach by evaluating it on the SentEval benchmark, improving on existing approaches on several transfer tasks.
Submitted 20 February, 2018;
originally announced February 2018.
-
Consistency in the face of change: an adaptive approach to physical layer cooperation
Authors:
Ayan Sengupta,
Yahya H. Ezzeldin,
Siddhartha Brahma,
Christina Fragouli,
Suhas Diggavi
Abstract:
Most existing works on physical-layer (PHY) cooperation (beyond routing) focus on how to best use a given, static relay network--while wireless networks are anything but static. In this paper, we pose a different set of questions: given that we have multiple devices within range, which relay(s) do we use for PHY cooperation, to maintain a consistent target performance? How can we efficiently adapt, as network conditions change? And how important is it, in terms of performance, to adapt? Although adapting to the best path when routing is a well understood problem, how to do so over PHY cooperation networks is an open question. Our contributions are: (1) We demonstrate via theoretical evaluation, a diminishing returns trend as the number of deployed relays increases. (2) Using a simple algorithm based on network metrics, we efficiently select the sub-network to use at any given time to maintain a target reliability. (3) When streaming video from Netflix, we experimentally show (using measurements from a WARP radio testbed employing DIQIF relaying) that our adaptive PHY cooperation scheme provides a throughput gain of 2x over nonadaptive PHY schemes, and a gain of 6x over genie-aided IP-level adaptive routing.
Submitted 6 December, 2016;
originally announced December 2016.
-
Towards the Design of Prospect-Theory based Human Decision Rules for Hypothesis Testing
Authors:
V. Sriram Siddhardh Nadendla,
Swastik Brahma,
Pramod K. Varshney
Abstract:
Detection rules have traditionally been designed for rational agents that minimize the Bayes risk (average decision cost). With the advent of crowd-sensing systems, there is a need to redesign binary hypothesis testing rules for behavioral agents, whose cognitive behavior is not captured by traditional utility functions such as Bayes risk. In this paper, we adopt prospect theory based models for decision makers. We consider special agent models namely optimists and pessimists in this paper, and derive optimal detection rules under different scenarios. Using an illustrative example, we also show how the decision rule of a human agent deviates from the Bayesian decision rule under various behavioral models, considered in this paper.
Submitted 4 October, 2016;
originally announced October 2016.
-
Optimal Auction Design with Quantized Bids
Authors:
Nianxia Cao,
Swastik Brahma,
Pramod K. Varshney
Abstract:
This letter considers the design of an auction mechanism to sell the object of a seller when the buyers quantize their private value estimates regarding the object prior to communicating them to the seller. The designed auction mechanism maximizes the utility of the seller (i.e., the auction is optimal), prevents buyers from communicating falsified quantized bids (i.e., the auction is incentive-compatible), and ensures that buyers will participate in the auction (i.e., the auction is individually-rational). The letter also investigates the design of the optimal quantization thresholds using which buyers quantize their private value estimates. Numerical results provide insights regarding the influence of the quantization thresholds on the auction mechanism.
Submitted 28 September, 2015;
originally announced September 2015.
-
Matching-based Spectrum Allocation in Cognitive Radio Networks
Authors:
Raghed El-Bardan,
Walid Saad,
Swastik Brahma,
Pramod K. Varshney
Abstract:
In this paper, a novel spectrum association approach for cognitive radio networks (CRNs) is proposed. Based on a measure of both inference and confidence as well as on a measure of quality-of-service, the association between secondary users (SUs) in the network and frequency bands licensed to primary users (PUs) is investigated. The problem is formulated as a matching game between SUs and PUs. In this game, SUs employ a soft-decision Bayesian framework to detect PUs' signals and, eventually, rank them based on the logarithm of the a posteriori ratio. A performance measure that captures both the ranking metric and rate is further computed by the SUs. Using this performance measure, a PU evaluates its own utility function that it uses to build its own association preferences. A distributed algorithm that allows both SUs and PUs to interact and self-organize into a stable match is proposed. Simulation results show that the proposed algorithm can improve the sum of SUs' rates by up to 20% and 60% relative to the deferred acceptance algorithm and random channel allocation approach, respectively. The results also show an improved convergence time.
Submitted 12 August, 2015;
originally announced August 2015.
-
Consensus based Detection in the Presence of Data Falsification Attacks
Authors:
Bhavya Kailkhura,
Swastik Brahma,
Pramod K. Varshney
Abstract:
This paper considers the problem of detection in distributed networks in the presence of data falsification (Byzantine) attacks. Detection approaches considered in the paper are based on fully distributed consensus algorithms, where all of the nodes exchange information only with their neighbors in the absence of a fusion center. In such networks, we characterize the negative effect of Byzantines on the steady-state and transient detection performance of the conventional consensus based detection algorithms. To address this issue, we study the problem from the network designer's perspective. More specifically, we first propose a distributed weighted average consensus algorithm that is robust to Byzantine attacks. We show that, under reasonable assumptions, the global test statistic for detection can be computed locally at each node using our proposed consensus algorithm. We exploit the statistical distribution of the nodes' data to devise techniques for mitigating the influence of data falsifying Byzantines on the distributed detection system. Since some parameters of the statistical distribution of the nodes' data might not be known a priori, we propose learning based techniques to enable an adaptive design of the local fusion or update rules.
Submitted 13 April, 2015;
originally announced April 2015.
-
Distributed Detection in Tree Networks: Byzantines and Mitigation Techniques
Authors:
Bhavya Kailkhura,
Swastik Brahma,
Berkan Dulek,
Yunghsiang S Han,
Pramod K. Varshney
Abstract:
In this paper, the problem of distributed detection in tree networks in the presence of Byzantines is considered. Closed form expressions for optimal attacking strategies that minimize the miss detection error exponent at the fusion center (FC) are obtained. We also look at the problem from the network designer's (FC's) perspective. We study the problem of designing optimal distributed detection parameters in a tree network in the presence of Byzantines. Next, we model the strategic interaction between the FC and the attacker as a Leader-Follower (Stackelberg) game. This formulation provides a methodology for predicting attacker and defender (FC) equilibrium strategies, which can be used to implement the optimal detector. Finally, a reputation based scheme to identify Byzantines is proposed and its performance is analytically evaluated. We also provide some numerical examples to gain insights into the solution.
Submitted 21 October, 2014;
originally announced October 2014.
-
Asymptotic Analysis of Distributed Bayesian Detection with Byzantine Data
Authors:
Bhavya Kailkhura,
Yunghsiang S. Han,
Swastik Brahma,
Pramod K. Varshney
Abstract:
In this letter, we consider the problem of distributed Bayesian detection in the presence of data falsifying Byzantines in the network. The problem of distributed detection is formulated as a binary hypothesis test at the fusion center (FC) based on 1-bit data sent by the sensors. Adopting Chernoff information as our performance metric, we study the detection performance of the system under Byzantine attack in the asymptotic regime. The expression for minimum attacking power required by the Byzantines to blind the FC is obtained. More specifically, we show that above a certain fraction of Byzantine attackers in the network, the detection scheme becomes completely incapable of utilizing the sensor data for detection. When the fraction of Byzantines is not sufficient to blind the FC, we also provide closed form expressions for the optimal attacking strategies for the Byzantines that most degrade the detection performance.
Submitted 14 August, 2014;
originally announced August 2014.
-
Target Tracking via Crowdsourcing: A Mechanism Design Approach
Authors:
Nianxia Cao,
Swastik Brahma,
Pramod K. Varshney
Abstract:
In this paper, we propose a crowdsourcing based framework for myopic target tracking by designing an incentive-compatible mechanism based optimal auction in a wireless sensor network (WSN) containing sensors that are selfish and profit-motivated. For typical WSNs which have limited bandwidth, the fusion center (FC) has to distribute the total number of bits that can be transmitted from the sensors to the FC among the sensors. To accomplish the task, the FC conducts an auction by soliciting bids from the selfish sensors, which reflect how much they value their energy cost. Furthermore, the rationality and truthfulness of the sensors are guaranteed in our model. The final problem is formulated as a multiple-choice knapsack problem (MCKP), which is solved by the dynamic programming method in pseudo-polynomial time. Simulation results show the effectiveness of our proposed approach in terms of both the tracking performance and lifetime of the sensor network.
Submitted 21 May, 2014;
originally announced May 2014.
-
Optimal Spectrum Auction Design with Two-Dimensional Truthful Revelations under Uncertain Spectrum Availability
Authors:
V. Sriram Siddhardh Nadendla,
Swastik Brahma,
Pramod K. Varshney
Abstract:
In this paper, we propose a novel sealed-bid auction framework to address the problem of dynamic spectrum allocation in cognitive radio (CR) networks. We design an optimal auction mechanism that maximizes the moderator's expected utility, when the spectrum is not available with certainty. We assume that the moderator employs collaborative spectrum sensing in order to make a reliable inference about spectrum availability. Due to the presence of a collision cost whenever the moderator makes an erroneous inference, and a sensing cost at each CR, we investigate feasibility conditions that guarantee a non-negative utility at the moderator. We present tight theoretical bounds on instantaneous network throughput and also show that our algorithm provides maximum throughput if the CRs have i.i.d. valuations. Since the moderator fuses CRs' sensing decisions to obtain a global inference regarding spectrum availability, we propose a novel strategy-proof fusion rule that encourages the CRs to simultaneously reveal truthful sensing decisions, along with truthful valuations to the moderator. Numerical examples are also presented to provide insights into the performance of the proposed auction under different scenarios.
Submitted 13 November, 2015; v1 submitted 26 March, 2014;
originally announced March 2014.
-
Distributed Detection in Tree Topologies with Byzantines
Authors:
Bhavya Kailkhura,
Swastik Brahma,
Yunghsiang S. Han,
Pramod K. Varshney
Abstract:
In this paper, we consider the problem of distributed detection in tree topologies in the presence of Byzantines. The expression for minimum attacking power required by the Byzantines to blind the fusion center (FC) is obtained. More specifically, we show that when more than a certain fraction of individual node decisions are falsified, the decision fusion scheme becomes completely incapable. We obtain closed form expressions for the optimal attacking strategies that minimize the detection error exponent at the FC. We also look at the possible counter-measures from the FC's perspective to protect the network from these Byzantines. We formulate the robust topology design problem as a bi-level program and provide an efficient algorithm to solve it. We also provide some numerical results to gain insights into the solution.
Submitted 17 September, 2013;
originally announced September 2013.
-
Distributed Bayesian Detection with Byzantine Data
Authors:
Bhavya Kailkhura,
Yunghsiang S. Han,
Swastik Brahma,
Pramod K. Varshney
Abstract:
In this paper, we consider the problem of distributed Bayesian detection in the presence of Byzantines in the network. It is assumed that a fraction of the nodes in the network are compromised and reprogrammed by an adversary to transmit false information to the fusion center (FC) to degrade detection performance. The problem of distributed detection is formulated as a binary hypothesis test at the FC based on 1-bit data sent by the sensors. The expression for minimum attacking power required by the Byzantines to blind the FC is obtained. More specifically, we show that above a certain fraction of Byzantine attackers in the network, the detection scheme becomes completely incapable of utilizing the sensor data for detection. We analyze the problem under different attacking scenarios and derive results for different non-asymptotic cases. It is found that existing asymptotics-based results do not hold under several non-asymptotic scenarios. When the fraction of Byzantines is not sufficient to blind the FC, we also provide closed form expressions for the optimal attacking strategies for the Byzantines that most degrade the detection performance.
Submitted 3 September, 2014; v1 submitted 12 July, 2013;
originally announced July 2013.