Search | arXiv e-print repository

Gemma 2: Improving Open Language Models at a Practical Size

Authors: Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momchev, Matt Hoffman , et al. (172 additional authors not shown)

Abstract: In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We al… ▽ More In this work, we introduce Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters. In this new version, we apply several known technical modifications to the Transformer architecture, such as interleaving local-global attentions (Beltagy et al., 2020a) and group-query attention (Ainslie et al., 2023). We also train the 2B and 9B models with knowledge distillation (Hinton et al., 2015) instead of next token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3 times bigger. We release all our models to the community. △ Less

Submitted 2 August, 2024; v1 submitted 31 July, 2024; originally announced August 2024.

arXiv:2401.16701 [pdf, ps, other]

Multivariate Priors and the Linearity of Optimal Bayesian Estimators under Gaussian Noise

Authors: Leighton P. Barnes, Alex Dytso, Jingbo Liu, H. Vincent Poor

Abstract: Consider the task of estimating a random vector $X$ from noisy observations $Y = X + Z$, where $Z$ is a standard normal vector, under the $L^p$ fidelity criterion. This work establishes that, for $1 \leq p \leq 2$, the optimal Bayesian estimator is linear and positive definite if and only if the prior distribution on $X$ is a (non-degenerate) multivariate Gaussian. Furthermore, for $p > 2$, it is… ▽ More Consider the task of estimating a random vector $X$ from noisy observations $Y = X + Z$, where $Z$ is a standard normal vector, under the $L^p$ fidelity criterion. This work establishes that, for $1 \leq p \leq 2$, the optimal Bayesian estimator is linear and positive definite if and only if the prior distribution on $X$ is a (non-degenerate) multivariate Gaussian. Furthermore, for $p > 2$, it is demonstrated that there are infinitely many priors that can induce such an estimator. △ Less

Submitted 29 January, 2024; originally announced January 2024.

arXiv:2309.09129 [pdf, ps, other]

$L^1$ Estimation: On the Optimality of Linear Estimators

Authors: Leighton P. Barnes, Alex Dytso, Jingbo Liu, H. Vincent Poor

Abstract: Consider the problem of estimating a random variable $X$ from noisy observations $Y = X+ Z$, where $Z$ is standard normal, under the $L^1$ fidelity criterion. It is well known that the optimal Bayesian estimator in this setting is the conditional median. This work shows that the only prior distribution on $X$ that induces linearity in the conditional median is Gaussian. Along the way, several ot… ▽ More Consider the problem of estimating a random variable $X$ from noisy observations $Y = X+ Z$, where $Z$ is standard normal, under the $L^1$ fidelity criterion. It is well known that the optimal Bayesian estimator in this setting is the conditional median. This work shows that the only prior distribution on $X$ that induces linearity in the conditional median is Gaussian. Along the way, several other results are presented. In particular, it is demonstrated that if the conditional distribution $P_{X|Y=y}$ is symmetric for all $y$, then $X$ must follow a Gaussian distribution. Additionally, we consider other $L^p$ losses and observe the following phenomenon: for $p \in [1,2]$, Gaussian is the only prior distribution that induces a linear optimal Bayesian estimator, and for $p \in (2,\infty)$, infinitely many prior distributions on $X$ can induce linearity. Finally, extensions are provided to encompass noise models leading to conditional distributions from certain exponential families. △ Less

Submitted 6 August, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

arXiv:2308.01982 [pdf, other]

Predicting Ki67, ER, PR, and HER2 Statuses from H&E-stained Breast Cancer Images

Authors: Amir Akbarnejad, Nilanjan Ray, Penny J. Barnes, Gilbert Bigras

Abstract: Despite the advances in machine learning and digital pathology, it is not yet clear if machine learning methods can accurately predict molecular information merely from histomorphology. In a quest to answer this question, we built a large-scale dataset (185538 images) with reliable measurements for Ki67, ER, PR, and HER2 statuses. The dataset is composed of mirrored images of H\&E and correspondin… ▽ More Despite the advances in machine learning and digital pathology, it is not yet clear if machine learning methods can accurately predict molecular information merely from histomorphology. In a quest to answer this question, we built a large-scale dataset (185538 images) with reliable measurements for Ki67, ER, PR, and HER2 statuses. The dataset is composed of mirrored images of H\&E and corresponding images of immunohistochemistry (IHC) assays (Ki67, ER, PR, and HER2. These images are mirrored through registration. To increase reliability, individual pairs were inspected and discarded if artifacts were present (tissue folding, bubbles, etc). Measurements for Ki67, ER and PR were determined by calculating H-Score from image analysis. HER2 measurement is based on binary classification: 0 and 1+ (IHC scores representing a negative subset) vs 3+ (IHC score positive subset). Cases with IHC equivocal score (2+) were excluded. We show that a standard ViT-based pipeline can achieve prediction performances around 90% in terms of Area Under the Curve (AUC) when trained with a proper labeling protocol. Finally, we shed light on the ability of the trained classifiers to localize relevant regions, which encourages future work to improve the localizations. Our proposed dataset is publicly available: https://ihc4bc.github.io/ △ Less

Submitted 3 August, 2023; originally announced August 2023.

arXiv:2204.02311 [pdf, other]

PaLM: Scaling Language Modeling with Pathways

Authors: Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin , et al. (42 additional authors not shown)

Abstract: Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Tran… ▽ More Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies. △ Less

Submitted 5 October, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

arXiv:2202.02423 [pdf, other]

doi 10.3390/e24091178

Improved Information Theoretic Generalization Bounds for Distributed and Federated Learning

Authors: L. P. Barnes, Alex Dytso, H. V. Poor

Abstract: We consider information-theoretic bounds on expected generalization error for statistical learning problems in a networked setting. In this setting, there are $K$ nodes, each with its own independent dataset, and the models from each node have to be aggregated into a final centralized model. We consider both simple averaging of the models as well as more complicated multi-round algorithms. We give… ▽ More We consider information-theoretic bounds on expected generalization error for statistical learning problems in a networked setting. In this setting, there are $K$ nodes, each with its own independent dataset, and the models from each node have to be aggregated into a final centralized model. We consider both simple averaging of the models as well as more complicated multi-round algorithms. We give upper bounds on the expected generalization error for a variety of problems, such as those with Bregman divergence or Lipschitz continuous losses, that demonstrate an improved dependence of $1/K$ on the number of nodes. These "per node" bounds are in terms of the mutual information between the training dataset and the trained weights at each node, and are therefore useful in describing the generalization properties inherent to having communication or privacy constraints at each node. △ Less

Submitted 15 January, 2024; v1 submitted 4 February, 2022; originally announced February 2022.

Comments: This version of the paper adds an assumption that was missing from Theorem 4 for loss functions of type (i). Thanks to Peyman Gholami for spotting this bug

arXiv:2103.04014 [pdf, ps, other]

Over-the-Air Statistical Estimation

Authors: Chuan-Zheng Lee, Leighton Pate Barnes, Ayfer Ozgur

Abstract: We study schemes and lower bounds for distributed minimax statistical estimation over a Gaussian multiple-access channel (MAC) under squared error loss, in a framework combining statistical estimation and wireless communication. First, we develop "analog" joint estimation-communication schemes that exploit the superposition property of the Gaussian MAC and we characterize their risk in terms of th… ▽ More We study schemes and lower bounds for distributed minimax statistical estimation over a Gaussian multiple-access channel (MAC) under squared error loss, in a framework combining statistical estimation and wireless communication. First, we develop "analog" joint estimation-communication schemes that exploit the superposition property of the Gaussian MAC and we characterize their risk in terms of the number of nodes and dimension of the parameter space. Then, we derive information-theoretic lower bounds on the minimax risk of any estimation scheme restricted to communicate the samples over a given number of uses of the channel and show that the risk achieved by our proposed schemes is within a logarithmic factor of these lower bounds. We compare both achievability and lower bound results to previous "digital" lower bounds, where nodes transmit errorless bits at the Shannon capacity of the MAC, showing that estimation schemes that leverage the physical layer offer a drastic reduction in estimation error over digital schemes relying on a physical-layer abstraction. △ Less

Submitted 5 March, 2021; originally announced March 2021.

Comments: 12 pages, 5 figures

arXiv:2102.05802 [pdf, ps, other]

Fisher Information and Mutual Information Constraints

Authors: Leighton Pate Barnes, Ayfer Ozgur

Abstract: We consider the processing of statistical samples $X\sim P_θ$ by a channel $p(y|x)$, and characterize how the statistical information from the samples for estimating the parameter $θ\in\mathbb{R}^d$ can scale with the mutual information or capacity of the channel. We show that if the statistical model has a sub-Gaussian score function, then the trace of the Fisher information matrix for estimating… ▽ More We consider the processing of statistical samples $X\sim P_θ$ by a channel $p(y|x)$, and characterize how the statistical information from the samples for estimating the parameter $θ\in\mathbb{R}^d$ can scale with the mutual information or capacity of the channel. We show that if the statistical model has a sub-Gaussian score function, then the trace of the Fisher information matrix for estimating $θ$ from $Y$ can scale at most linearly with the mutual information between $X$ and $Y$. We apply this result to obtain minimax lower bounds in distributed statistical estimation problems, and obtain a tight preconstant for Gaussian mean estimation. We then show how our Fisher information bound can also imply mutual information or Jensen-Shannon divergence based distributed strong data processing inequalities. △ Less

Submitted 8 July, 2021; v1 submitted 10 February, 2021; originally announced February 2021.

arXiv:2010.13561 [pdf, other]

Towards Accountability for Machine Learning Datasets: Practices from Software Engineering and Infrastructure

Authors: Ben Hutchinson, Andrew Smart, Alex Hanna, Emily Denton, Christina Greer, Oddur Kjartansson, Parker Barnes, Margaret Mitchell

Abstract: Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However the datasets which empower machine learning are often used, shared and re-used with little visibility into the processes of deliberation which led to their creation. Which stakeholder groups had their perspectives included when the dataset was co… ▽ More Rising concern for the societal implications of artificial intelligence systems has inspired demands for greater transparency and accountability. However the datasets which empower machine learning are often used, shared and re-used with little visibility into the processes of deliberation which led to their creation. Which stakeholder groups had their perspectives included when the dataset was conceived? Which domain experts were consulted regarding how to model subgroups and other phenomena? How were questions of representational biases measured and addressed? Who labeled the data? In this paper, we introduce a rigorous framework for dataset development transparency which supports decision-making and accountability. The framework uses the cyclical, infrastructural and engineering nature of dataset development to draw on best practices from the software development lifecycle. Each stage of the data development lifecycle yields a set of documents that facilitate improved communication and decision-making, as well as drawing attention the value and necessity of careful data work. The proposed framework is intended to contribute to closing the accountability gap in artificial intelligence systems, by making visible the often overlooked work that goes into dataset creation. △ Less

Submitted 29 January, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

arXiv:2005.10783 [pdf, ps, other]

Fisher information under local differential privacy

Authors: Leighton Pate Barnes, Wei-Ning Chen, Ayfer Ozgur

Abstract: We develop data processing inequalities that describe how Fisher information from statistical samples can scale with the privacy parameter $\varepsilon$ under local differential privacy constraints. These bounds are valid under general conditions on the distribution of the score of the statistical model, and they elucidate under which conditions the dependence on $\varepsilon$ is linear, quadratic… ▽ More We develop data processing inequalities that describe how Fisher information from statistical samples can scale with the privacy parameter $\varepsilon$ under local differential privacy constraints. These bounds are valid under general conditions on the distribution of the score of the statistical model, and they elucidate under which conditions the dependence on $\varepsilon$ is linear, quadratic, or exponential. We show how these inequalities imply order optimal lower bounds for private estimation for both the Gaussian location model and discrete distribution estimation for all levels of privacy $\varepsilon>0$. We further apply these inequalities to sparse Bernoulli models and demonstrate privacy mechanisms and estimators with order-matching squared $\ell^2$ error. △ Less

Submitted 21 May, 2020; originally announced May 2020.

arXiv:2005.10761 [pdf, other]

rTop-k: A Statistical Estimation Approach to Distributed SGD

Authors: Leighton Pate Barnes, Huseyin A. Inan, Berivan Isik, Ayfer Ozgur

Abstract: The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k a… ▽ More The large communication cost for exchanging gradients between different nodes significantly limits the scalability of distributed training for large-scale learning models. Motivated by this observation, there has been significant recent interest in techniques that reduce the communication cost of distributed Stochastic Gradient Descent (SGD), with gradient sparsification techniques such as top-k and random-k shown to be particularly effective. The same observation has also motivated a separate line of work in distributed statistical estimation theory focusing on the impact of communication constraints on the estimation efficiency of different statistical models. The primary goal of this paper is to connect these two research lines and demonstrate how statistical estimation models and their analysis can lead to new insights in the design of communication-efficient training techniques. We propose a simple statistical estimation model for the stochastic gradients which captures the sparsity and skewness of their distribution. The statistically optimal communication scheme arising from the analysis of this model leads to a new sparsification technique for SGD, which concatenates random-k and top-k, considered separately in the prior literature. We show through extensive experiments on both image and language domains with CIFAR-10, ImageNet, and Penn Treebank datasets that the concatenated application of these two sparsification methods consistently and significantly outperforms either method applied alone. △ Less

Submitted 2 December, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

arXiv:2004.01277 [pdf, other]

The Courtade-Kumar Most Informative Boolean Function Conjecture and a Symmetrized Li-Médard Conjecture are Equivalent

Authors: Leighton Pate Barnes, Ayfer Özgür

Abstract: We consider the Courtade-Kumar most informative Boolean function conjecture for balanced functions, as well as a conjecture by Li and Médard that dictatorship functions also maximize the $L^α$ norm of $T_pf$ for $1\leqα\leq2$ where $T_p$ is the noise operator and $f$ is a balanced Boolean function. By using a result due to Laguerre from the 1880's, we are able to bound how many times an $L^α$-norm… ▽ More We consider the Courtade-Kumar most informative Boolean function conjecture for balanced functions, as well as a conjecture by Li and Médard that dictatorship functions also maximize the $L^α$ norm of $T_pf$ for $1\leqα\leq2$ where $T_p$ is the noise operator and $f$ is a balanced Boolean function. By using a result due to Laguerre from the 1880's, we are able to bound how many times an $L^α$-norm related quantity can cross zero as a function of $α$, and show that these two conjectures are essentially equivalent. △ Less

Submitted 2 April, 2020; originally announced April 2020.

arXiv:2001.00973 [pdf, other]

Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing

Authors: Inioluwa Deborah Raji, Andrew Smart, Rebecca N. White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, Parker Barnes

Abstract: Rising concern for the societal implications of artificial intelligence systems has inspired a wave of academic and journalistic literature in which deployed systems are audited for harm by investigators from outside the organizations deploying the algorithms. However, it remains challenging for practitioners to identify the harmful repercussions of their own systems prior to deployment, and, once… ▽ More Rising concern for the societal implications of artificial intelligence systems has inspired a wave of academic and journalistic literature in which deployed systems are audited for harm by investigators from outside the organizations deploying the algorithms. However, it remains challenging for practitioners to identify the harmful repercussions of their own systems prior to deployment, and, once deployed, emergent issues can become difficult or impossible to trace back to their source. In this paper, we introduce a framework for algorithmic auditing that supports artificial intelligence system development end-to-end, to be applied throughout the internal organization development lifecycle. Each stage of the audit yields a set of documents that together form an overall audit report, drawing on an organization's values or principles to assess the fit of decisions made throughout the process. The proposed auditing framework is intended to contribute to closing the accountability gap in the development and deployment of large-scale artificial intelligence systems by embedding a robust process to ensure audit integrity. △ Less

Submitted 3 January, 2020; originally announced January 2020.

Comments: Accepted to ACM FAT* (Fariness, Accountability and Transparency) conference 2020. Full workable templates for the documents of the SMACTR framework presented in the paper can be found here https://drive.google.com/drive/folders/1GWlq8qGZXb2lNHxWBuo2wl-rlHsjNPM0?usp=sharing

arXiv:1910.01625 [pdf, ps, other]

Minimax Bounds for Distributed Logistic Regression

Authors: Leighton Pate Barnes, Ayfer Ozgur

Abstract: We consider a distributed logistic regression problem where labeled data pairs $(X_i,Y_i)\in \mathbb{R}^d\times\{-1,1\}$ for $i=1,\ldots,n$ are distributed across multiple machines in a network and must be communicated to a centralized estimator using at most $k$ bits per labeled pair. We assume that the data $X_i$ come independently from some distribution $P_X$, and that the distribution of… ▽ More We consider a distributed logistic regression problem where labeled data pairs $(X_i,Y_i)\in \mathbb{R}^d\times\{-1,1\}$ for $i=1,\ldots,n$ are distributed across multiple machines in a network and must be communicated to a centralized estimator using at most $k$ bits per labeled pair. We assume that the data $X_i$ come independently from some distribution $P_X$, and that the distribution of $Y_i$ conditioned on $X_i$ follows a logistic model with some parameter $θ\in\mathbb{R}^d$. By using a Fisher information argument, we give minimax lower bounds for estimating $θ$ under different assumptions on the tail of the distribution $P_X$. We consider both $\ell^2$ and logistic losses, and show that for the logistic loss our sub-Gaussian lower bound is order-optimal and cannot be improved. △ Less

Submitted 3 October, 2019; originally announced October 2019.

arXiv:1902.02890 [pdf, other]

Lower Bounds for Learning Distributions under Communication Constraints via Fisher Information

Authors: Leighton Pate Barnes, Yanjun Han, Ayfer Ozgur

Abstract: We consider the problem of learning high-dimensional, nonparametric and structured (e.g. Gaussian) distributions in distributed networks, where each node in the network observes an independent sample from the underlying distribution and can use $k$ bits to communicate its sample to a central processor. We consider three different models for communication. Under the independent model, each node com… ▽ More We consider the problem of learning high-dimensional, nonparametric and structured (e.g. Gaussian) distributions in distributed networks, where each node in the network observes an independent sample from the underlying distribution and can use $k$ bits to communicate its sample to a central processor. We consider three different models for communication. Under the independent model, each node communicates its sample to a central processor by independently encoding it into $k$ bits. Under the more general sequential or blackboard communication models, nodes can share information interactively but each node is restricted to write at most $k$ bits on the final transcript. We characterize the impact of the communication constraint $k$ on the minimax risk of estimating the underlying distribution under $\ell^2$ loss. We develop minimax lower bounds that apply in a unified way to many common statistical models and reveal that the impact of the communication constraint can be qualitatively different depending on the tail behavior of the score function associated with each model. A key ingredient in our proofs is a geometric characterization of Fisher information from quantized samples. △ Less

Submitted 31 May, 2019; v1 submitted 7 February, 2019; originally announced February 2019.

arXiv:1811.10533 [pdf, ps, other]

An Isoperimetric Result on High-Dimensional Spheres

Authors: Leighton Pate Barnes, Ayfer Ozgur, Xiugang Wu

Abstract: We consider an extremal problem for subsets of high-dimensional spheres that can be thought of as an extension of the classical isoperimetric problem on the sphere. Let $A$ be a subset of the $(m-1)$-dimensional sphere $\mathbb{S}^{m-1}$, and let $\mathbf{y}\in \mathbb{S}^{m-1}$ be a randomly chosen point on the sphere. What is the measure of the intersection of the $t$-neighborhood of the point… ▽ More We consider an extremal problem for subsets of high-dimensional spheres that can be thought of as an extension of the classical isoperimetric problem on the sphere. Let $A$ be a subset of the $(m-1)$-dimensional sphere $\mathbb{S}^{m-1}$, and let $\mathbf{y}\in \mathbb{S}^{m-1}$ be a randomly chosen point on the sphere. What is the measure of the intersection of the $t$-neighborhood of the point $\mathbf{y}$ with the subset $A$? We show that with high probability this intersection is approximately as large as the intersection that would occur with high probability if $A$ were a spherical cap of the same measure. △ Less

Submitted 20 November, 2018; originally announced November 2018.

Comments: arXiv admin note: text overlap with arXiv:1701.02043

arXiv:1810.03993 [pdf, other]

doi 10.1145/3287560.3287596

Model Cards for Model Reporting

Authors: Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, Timnit Gebru

Abstract: Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance character… ▽ More Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics. In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information. While we focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, we provide cards for two supervised models: One trained to detect smiling faces in images, and one trained to detect toxic comments in text. We propose model cards as a step towards the responsible democratization of machine learning and related AI technology, increasing transparency into how well AI technology works. We hope this work encourages those releasing trained machine learning models to accompany model releases with similar detailed evaluation numbers and other relevant documentation. △ Less

Submitted 14 January, 2019; v1 submitted 5 October, 2018; originally announced October 2018.

Journal ref: FAT* '19: Conference on Fairness, Accountability, and Transparency, January 29--31, 2019, Atlanta, GA, USA

arXiv:1701.02043 [pdf, other]

"The Capacity of the Relay Channel": Solution to Cover's Problem in the Gaussian Case

Authors: Xiugang Wu, Leighton Pate Barnes, Ayfer Ozgur

Abstract: Consider a memoryless relay channel, where the relay is connected to the destination with an isolated bit pipe of capacity $C_0$. Let $C(C_0)$ denote the capacity of this channel as a function of $C_0$. What is the critical value of $C_0$ such that $C(C_0)$ first equals $C(\infty)$? This is a long-standing open problem posed by Cover and named "The Capacity of the Relay Channel," in… ▽ More Consider a memoryless relay channel, where the relay is connected to the destination with an isolated bit pipe of capacity $C_0$. Let $C(C_0)$ denote the capacity of this channel as a function of $C_0$. What is the critical value of $C_0$ such that $C(C_0)$ first equals $C(\infty)$? This is a long-standing open problem posed by Cover and named "The Capacity of the Relay Channel," in $Open \ Problems \ in \ Communication \ and \ Computation$, Springer-Verlag, 1987. In this paper, we answer this question in the Gaussian case and show that $C(C_0)$ can not equal to $C(\infty)$ unless $C_0=\infty$, regardless of the SNR of the Gaussian channels. This result follows as a corollary to a new upper bound we develop on the capacity of this channel. Instead of "single-letterizing" expressions involving information measures in a high-dimensional space as is typically done in converse results in information theory, our proof directly quantifies the tension between the pertinent $n$-letter forms. This is done by translating the information tension problem to a problem in high-dimensional geometry. As an intermediate result, we develop an extension of the classical isoperimetric inequality on a high-dimensional sphere, which can be of interest in its own right. △ Less

Submitted 7 October, 2018; v1 submitted 8 January, 2017; originally announced January 2017.

Comments: Accepted to IEEE Trans. on Information Theory

Showing 1–18 of 18 results for author: Barnes, P