Search | arXiv e-print repository

arXiv:2407.08689 [pdf, ps, other]

Operationalizing the Blueprint for an AI Bill of Rights: Recommendations for Practitioners, Researchers, and Policy Makers

Authors: Alex Oesterling, Usha Bhalla, Suresh Venkatasubramanian, Himabindu Lakkaraju

Abstract: As Artificial Intelligence (AI) tools are increasingly employed in diverse real-world applications, there has been significant interest in regulating these tools. To this end, several regulatory frameworks have been introduced by different countries worldwide. For example, the European Union recently passed the AI Act, the White House issued an Executive Order on safe, secure, and trustworthy AI, and the White House Office of Science and Technology Policy issued the Blueprint for an AI Bill of Rights (AI BoR). Many of these frameworks emphasize the need for auditing and improving the trustworthiness of AI tools, underscoring the importance of safety, privacy, explainability, fairness, and human fallback options. Although these regulatory frameworks highlight the necessity of enforcement, practitioners often lack detailed guidance on implementing them. Furthermore, the extensive research on operationalizing each of these aspects is frequently buried in technical papers that are difficult for practitioners to parse. In this write-up, we address this shortcoming by providing an accessible overview of existing literature related to operationalizing regulatory principles. We provide easy-to-understand summaries of state-of-the-art literature and highlight various gaps that exist between regulatory guidelines and existing AI research, including the trade-offs that emerge during operationalization.
We hope that this work not only serves as a starting point for practitioners interested in learning more about operationalizing the regulatory guidelines outlined in the Blueprint for an AI BoR but also provides researchers with a list of critical open problems and gaps between regulations and state-of-the-art AI research. Finally, we note that this is a working paper and we invite feedback in line with the purpose of this document as described in the introduction.

Submitted 11 July, 2024; originally announced July 2024.

Comments: 15 pages

arXiv:2406.09638 [pdf, other]

RASPNet: A Benchmark Dataset for Radar Adaptive Signal Processing Applications

Authors: Shyam Venkatasubramanian, Bosung Kang, Ali Pezeshki, Muralidhar Rangaswamy, Vahid Tarokh

Abstract: This work presents a large-scale dataset for radar adaptive signal processing (RASP) applications, aimed at supporting the development of data-driven models within the radar community. The dataset, called RASPNet, consists of 100 realistic scenarios compiled over a variety of topographies and land types from across the contiguous United States, designed to reflect a diverse array of real-world environments. Within each scenario, RASPNet consists of 10,000 clutter realizations from an airborne radar setting, which can be utilized for radar algorithm development and evaluation. RASPNet intends to fill a prominent gap in the availability of a large-scale, realistic dataset that standardizes the evaluation of adaptive radar processing techniques. We describe its construction, organization, and several potential applications, which includes a transfer learning example to demonstrate how RASPNet can be leveraged for realistic adaptive radar processing scenarios.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.07556 [pdf]

Community Driven Approaches to Research in Technology & Society CCC Workshop Report

Authors: Suresh Venkatasubramanian, Timnit Gebru, Ufuk Topcu, Haley Griffin, Leah Namisa Rosenbloom, Nasim Sonboli

Abstract: Based on our workshop activities, we outlined three ways in which research can support community needs: (1) Mapping the ecosystem of both the players and ecosystem and harm landscapes, (2) Counter-Programming, which entails using the same surveillance tools that communities are subjected to observe the entities doing the surveilling, effectively protecting people from surveillance, and conducting ethical data collection to measure the impact of these technologies, and (3) Engaging in positive visions and tools for empowerment so that technology can bring good instead of harm. In order to effectively collaborate on the aforementioned directions, we outlined seven important mechanisms for effective collaboration: (1) Never expect free labor of community members, (2) Ensure goals are aligned between all collaborators, (3) Elevate community members to leadership positions, (4) Understand no group is a monolith, (5) Establish a common language, (6) Discuss organization roles and goals of the project transparently from the start, and (7) Enable a recourse for harm. We recommend that anyone engaging in community-based research (1) starts with community-defined solutions, (2) provides alternatives to digital services/information collecting mechanisms, (3) prohibits harmful automated systems, (4) transparently states any systems impact, (5) minimizes and protects data, (6) proactively demonstrates a system is safe and beneficial prior to deployment, and (7) provides resources directly to community partners.
Throughout the recommendation section of the report, we also provide specific recommendations for funding agencies, academic institutions, and individual researchers.

Submitted 21 March, 2024; originally announced June 2024.

arXiv:2402.18803 [pdf, other]

To Pool or Not To Pool: Analyzing the Regularizing Effects of Group-Fair Training on Shared Models

Authors: Cyrus Cousins, I. Elizabeth Kumar, Suresh Venkatasubramanian

Abstract: In fair machine learning, one source of performance disparities between groups is over-fitting to groups with relatively few training samples. We derive group-specific bounds on the generalization error of welfare-centric fair machine learning that benefit from the larger sample size of the majority group. We do this by considering group-specific Rademacher averages over a restricted hypothesis class, which contains the family of models likely to perform well with respect to a fair learning objective (e.g., a power-mean). Our simulations demonstrate these bounds improve over a naive method, as expected by theory, with particularly significant improvement for smaller group sizes.

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2402.06609 [pdf, other]

You Still See Me: How Data Protection Supports the Architecture of ML Surveillance

Authors: Rui-Jie Yew, Lucy Qin, Suresh Venkatasubramanian

Abstract: Human data forms the backbone of machine learning. Data protection laws thus have strong bearing on how ML systems are governed. Given that most requirements in data protection laws accompany the processing of personal data, organizations have an incentive to keep their data out of legal scope. This makes the development and application of certain privacy-preserving techniques--data protection techniques--an important strategy for ML compliance. In this paper, we examine the impact of a rhetoric that deems data wrapped in these techniques as data that is "good-to-go". We show how their application in the development of ML systems--from private set intersection as part of dataset curation to homomorphic encryption and federated learning as part of model computation--can further support individual monitoring and data consolidation. With data accumulation at the core of how the ML pipeline is configured, we argue that data protection techniques are often instrumentalized in ways that support infrastructures of surveillance, rather than in ways that protect individuals associated with data. Finally, we propose technology and policy strategies to evaluate data protection techniques in light of the protections they actually confer. We conclude by highlighting the role that technologists might play in devising policies that combat surveillance ML technologies.

Submitted 18 February, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

Comments: A version of this work was accepted at the 2023 NeurIPS Workshop on Regulatable ML

arXiv:2401.11176 [pdf, other]

Data-Driven Target Localization: Benchmarking Gradient Descent Using the Cramer-Rao Bound

Authors: Shyam Venkatasubramanian, Sandeep Gogineni, Bosung Kang, Muralidhar Rangaswamy

Abstract: In modern radar systems, precise target localization using azimuth and velocity estimation is paramount. Traditional unbiased estimation methods have utilized gradient descent algorithms to reach the theoretical limits of the Cramer Rao Bound (CRB) for the error of the parameter estimates. As an extension, we demonstrate on a realistic simulated example scenario that our earlier presented data-driven neural network model outperforms these traditional methods, yielding improved accuracies in target azimuth and velocity estimation. We emphasize, however, that this improvement does not imply that the neural network outperforms the CRB itself. Rather, the enhanced performance is attributed to the biased nature of the neural network approach. Our findings underscore the potential of employing deep learning methods in radar systems to achieve more accurate localization in cluttered and dynamic environments.

Submitted 22 April, 2024; v1 submitted 20 January, 2024; originally announced January 2024.

arXiv:2311.12356 [pdf, other]

Random Linear Projections Loss for Hyperplane-Based Optimization in Neural Networks

Authors: Shyam Venkatasubramanian, Ahmed Aloui, Vahid Tarokh

Abstract: Advancing loss function design is pivotal for optimizing neural network training and performance. This work introduces Random Linear Projections (RLP) loss, a novel approach that enhances training efficiency by leveraging geometric relationships within the data. Distinct from traditional loss functions that target minimizing pointwise errors, RLP loss operates by minimizing the distance between sets of hyperplanes connecting fixed-size subsets of feature-prediction pairs and feature-label pairs. Our empirical evaluations, conducted across benchmark datasets and synthetic examples, demonstrate that neural networks trained with RLP loss outperform those trained with traditional loss functions, achieving improved performance with fewer data samples, and exhibiting greater robustness to additive noise. We provide theoretical analysis supporting our empirical findings.

Submitted 30 May, 2024; v1 submitted 21 November, 2023; originally announced November 2023.

arXiv:2305.18159 [pdf, other]

The Misuse of AUC: What High Impact Risk Assessment Gets Wrong

Authors: Kweku Kwegyir-Aggrey, Marissa Gerchick, Malika Mohan, Aaron Horowitz, Suresh Venkatasubramanian

Abstract: When determining which machine learning model best performs some high impact risk assessment task, practitioners commonly use the Area under the Curve (AUC) to defend and validate their model choices. In this paper, we argue that the current use and understanding of AUC as a model performance metric misunderstands the way the metric was intended to be used. To this end, we characterize the misuse of AUC and illustrate how this misuse negatively manifests in the real world across several risk assessment domains. We locate this disconnect in the way the original interpretation of AUC has shifted over time to the point where issues pertaining to decision thresholds, class balance, statistical uncertainty, and protected groups remain unaddressed by AUC-based model comparisons, and where model choices that should be the purview of policymakers are hidden behind the veil of mathematical rigor. We conclude that current model validation practices involving AUC are not robust, and often invalid.

Submitted 29 May, 2023; originally announced May 2023.

arXiv:2303.08241 [pdf, other]

Subspace Perturbation Analysis for Data-Driven Radar Target Localization

Authors: Shyam Venkatasubramanian, Sandeep Gogineni, Bosung Kang, Ali Pezeshki, Muralidhar Rangaswamy, Vahid Tarokh

Abstract: Recent works exploring data-driven approaches to classical problems in adaptive radar have demonstrated promising results pertaining to the task of radar target localization. Via the use of space-time adaptive processing (STAP) techniques and convolutional neural networks, these data-driven approaches to target localization have helped benchmark the performance of neural networks for matched scenarios. However, the thorough bridging of these topics across mismatched scenarios still remains an open problem. As such, in this work, we augment our data-driven approach to radar target localization by performing a subspace perturbation analysis, which allows us to benchmark the localization accuracy of our proposed deep learning framework across mismatched scenarios. To evaluate this framework, we generate comprehensive datasets by randomly placing targets of variable strengths in mismatched constrained areas via RFView, a high-fidelity, site-specific modeling and simulation tool. For the radar returns from these constrained areas, we generate heatmap tensors in range, azimuth, and elevation using the normalized adaptive matched filter (NAMF) test statistic. We estimate target locations from these heatmap tensors using a convolutional neural network, and demonstrate that the predictive performance of our framework in the presence of mismatches can be predetermined.

Submitted 21 March, 2023; v1 submitted 14 March, 2023; originally announced March 2023.

Comments: 6 pages, 3 figures. Submitted to 2023 IEEE Radar Conference (RadarConf). Extension of arXiv:2209.02890

arXiv:2209.07616 [pdf, other]

Reducing Access Disparities in Networks using Edge Augmentation

Authors: Ashkan Bashardoust, Sorelle A. Friedler, Carlos E. Scheidegger, Blair D. Sullivan, Suresh Venkatasubramanian

Abstract: In social networks, a node's position is a form of social capital. Better-positioned members not only benefit from (faster) access to diverse information, but innately have more potential influence on information spread. Structural biases often arise from network formation, and can lead to significant disparities in information access based on position. Further, processes such as link recommendation can exacerbate this inequality by relying on network structure to augment connectivity. We argue that one can understand and quantify this social capital through the lens of information flow in the network. We consider the setting where all nodes may be sources of distinct information, and a node's (dis)advantage deems its ability to access all information available on the network. We introduce three new measures of advantage (broadcast, influence, and control), which are quantified in terms of position in the network using access signatures -- vectors that represent a node's ability to share information. We then consider the problem of improving equity by making interventions to increase the access of the least-advantaged nodes. We argue that edge augmentation is most appropriate for mitigating bias in the network structure, and frame a budgeted intervention problem for maximizing minimum pairwise access. Finally, we propose heuristic strategies for selecting edge augmentations and empirically evaluate their performance on a corpus of real-world social networks.
We demonstrate that a small number of interventions significantly increase the broadcast measure of access for the least-advantaged nodes (over 5 times more than random), and also improve the minimum influence. Additional analysis shows that these interventions can also dramatically shrink the gap in advantage between nodes (over 82%) and reduce disparities between their access signatures.

Submitted 15 September, 2022; originally announced September 2022.

arXiv:2209.02890 [pdf, other]

Data-Driven Target Localization Using Adaptive Radar Processing and Convolutional Neural Networks

Authors: Shyam Venkatasubramanian, Sandeep Gogineni, Bosung Kang, Ali Pezeshki, Muralidhar Rangaswamy, Vahid Tarokh

Abstract: Leveraging the advanced functionalities of modern radio frequency (RF) modeling and simulation tools, specifically designed for adaptive radar processing applications, this paper presents a data-driven approach to improve accuracy in radar target localization post adaptive radar detection. To this end, we generate a large number of radar returns by randomly placing targets of variable strengths in a predefined area, using RFView, a high-fidelity, site-specific, RF modeling & simulation tool. We produce heatmap tensors from the radar returns, in range, azimuth [and Doppler], of the normalized adaptive matched filter (NAMF) test statistic. We then train a regression convolutional neural network (CNN) to estimate target locations from these heatmap tensors, and we compare the target localization accuracy of this approach with that of peak-finding and local search methods. This empirical study shows that our regression CNN achieves a considerable improvement in target location estimation accuracy. The regression CNN offers significant gains and reasonable accuracy even at signal-to-clutter-plus-noise ratio (SCNR) regimes that are close to the breakdown threshold SCNR of the NAMF. We also study the robustness of our trained CNN to mismatches in the radar data, where the CNN is tested on heatmap tensors collected from areas that it was not trained on. We show that our CNN can be made robust to mismatches in the radar data through few-shot learning, using a relatively small number of new training samples.

Submitted 9 July, 2024; v1 submitted 6 September, 2022; originally announced September 2022.

arXiv:2205.14867 [pdf, other]

Measuring and mitigating voting access disparities: a study of race and polling locations in Florida and North Carolina

Authors: Mohsen Abbasi, Suresh Venkatasubramanian, Sorelle A. Friedler, Kristian Lum, Calvin Barrett

Abstract: Voter suppression and associated racial disparities in access to voting are long-standing civil rights concerns in the United States. Barriers to voting have taken many forms over the decades. A history of violent explicit discouragement has shifted to more subtle access limitations that can include long lines and wait times, long travel times to reach a polling station, and other logistical barriers to voting. Our focus in this work is on quantifying disparities in voting access pertaining to the overall time-to-vote, and how they could be remedied via a better choice of polling location or provisioning more sites where voters can cast ballots. However, appropriately calibrating access disparities is difficult because of the need to account for factors such as population density and different community expectations for reasonable travel times. In this paper, we quantify access to polling locations, developing a methodology for the calibrated measurement of racial disparities in polling location "load" and distance to polling locations. We apply this methodology to a study of real-world data from Florida and North Carolina to identify disparities in voting access from the 2020 election. We also introduce algorithms, with modifications to handle scale, that can reduce these disparities by suggesting new polling locations from a given list of identified public locations (including schools and libraries).
Applying these algorithms on the 2020 election location data also helps to expose and explore tradeoffs between the cost of allocating more polling locations and the potential impact on access disparities. The developed voting access measurement methodology and algorithmic remediation technique is a first step in better polling location assignment.

Submitted 30 May, 2022; originally announced May 2022.

arXiv:2203.07490 [pdf, other]

Repairing Regressors for Fair Binary Classification at Any Decision Threshold

Authors: Kweku Kwegyir-Aggrey, A. Feder Cooper, Jessica Dai, John Dickerson, Keegan Hines, Suresh Venkatasubramanian

Abstract: We study the problem of post-processing a supervised machine-learned regressor to maximize fair binary classification at all decision thresholds. By decreasing the statistical distance between each group's score distributions, we show that we can increase fair performance across all thresholds at once, and that we can do so without a large decrease in accuracy. To this end, we introduce a formal measure of Distributional Parity, which captures the degree of similarity in the distributions of classifications for different protected groups. Our main result is to put forward a novel post-processing algorithm based on optimal transport, which provably maximizes Distributional Parity, thereby attaining common notions of group fairness like Equalized Odds or Equal Opportunity at all thresholds. We demonstrate on two fairness benchmarks that our technique works well empirically, while also outperforming and generalizing similar techniques from related work.

Submitted 10 December, 2023; v1 submitted 14 March, 2022; originally announced March 2022.

arXiv:2201.10712 [pdf, other]

Toward Data-Driven STAP Radar

Authors: Shyam Venkatasubramanian, Chayut Wongkamthong, Mohammadreza Soltani, Bosung Kang, Sandeep Gogineni, Ali Pezeshki, Muralidhar Rangaswamy, Vahid Tarokh

Abstract: Using an amalgamation of techniques from classical radar, computer vision, and deep learning, we characterize our ongoing data-driven approach to space-time adaptive processing (STAP) radar. We generate a rich example dataset of received radar signals by randomly placing targets of variable strengths in a predetermined region using RFView, a site-specific radio frequency modeling and simulation tool developed by ISL Inc. For each data sample within this region, we generate heatmap tensors in range, azimuth, and elevation of the output power of a minimum variance distortionless response (MVDR) beamformer, which can be replaced with a desired test statistic. These heatmap tensors can be thought of as stacked images, and in an airborne scenario, the moving radar creates a sequence of these time-indexed image stacks, resembling a video. Our goal is to use these images and videos to detect targets and estimate their locations, a procedure reminiscent of computer vision algorithms for object detection, namely, the Faster Region-Based Convolutional Neural Network (Faster R-CNN). The Faster R-CNN consists of a proposal generating network for determining regions of interest (ROI), a regression network for positioning anchor boxes around targets, and an object classification algorithm; it is developed and optimized for natural images. Our ongoing research will develop analogous tools for heatmap images of radar data.
In this regard, we will generate a large, representative adaptive radar signal processing database for training and testing, analogous in spirit to the COCO dataset for natural images. As a preliminary example, we present a regression network in this paper for estimating target locations to demonstrate the feasibility of and significant improvements provided by our data-driven approach.

Submitted 9 March, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

Comments: 5 pages, 4 figures. Submitted to 2022 IEEE Radar Conference (RadarConf)

arXiv:2106.05498 [pdf, ps, other]

It's COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks

Authors: Michelle Bao, Angela Zhou, Samantha Zottola, Brian Brubach, Sarah Desmarais, Aaron Horowitz, Kristian Lum, Suresh Venkatasubramanian

Abstract: Risk assessment instrument (RAI) datasets, particularly ProPublica's COMPAS dataset, are commonly used in algorithmic fairness papers due to benchmarking practices of comparing algorithms on datasets used in prior work. In many cases, this data is used as a benchmark to demonstrate good performance without accounting for the complexities of criminal justice (CJ) processes. However, we show that pretrial RAI datasets can contain numerous measurement biases and errors, and due to disparities in discretion and deployment, algorithmic fairness applied to RAI datasets is limited in making claims about real-world outcomes. These reasons make the datasets a poor fit for benchmarking under assumptions of ground truth and real-world impact. Furthermore, conventional practices of simply replicating previous data experiments may implicitly inherit or edify normative positions without explicitly interrogating value-laden assumptions. Without context of how interdisciplinary fields have engaged in CJ research and context of how RAIs operate upstream and downstream, algorithmic fairness practices are misaligned for meaningful contribution in the context of CJ, and would benefit from transparent engagement with normative considerations and values related to fairness, justice, and equality. These factors prompt questions about whether benchmarks for intrinsically socio-technical systems like the CJ system can exist in a beneficial and ethical way.

Submitted 28 April, 2022; v1 submitted 10 June, 2021; originally announced June 2021.

Comments: NeurIPS 2021 Datasets and Benchmarks

arXiv:2104.12037 [pdf, other]

Precarity: Modeling the Long Term Effects of Compounded Decisions on Individual Instability

Authors: Pegah Nokhiz, Aravinda Kanchana Ruwanpathirana, Neal Patwari, Suresh Venkatasubramanian

Abstract: When it comes to studying the impacts of decision making, the research has been largely focused on examining the fairness of the decisions, the long-term effects of the decision pipelines, and utility-based perspectives considering both the decision-maker and the individuals. However, there has hardly been any focus on precarity which is the term that encapsulates the instability in people's lives. That is, a negative outcome can overspread to other decisions and measures of well-being. Studying precarity necessitates a shift in focus - from the point of view of the decision-maker to the perspective of the decision subject. This centering of the subject is an important direction that unlocks the importance of parting with aggregate measures to examine the long-term effects of decision making. To address this issue, in this paper, we propose a modeling framework that simulates the effects of compounded decision-making on precarity over time. Through our simulations, we are able to show the heterogeneity of precarity by the non-uniform ruinous aftereffects of negative decisions on different income classes of the underlying population and how policy interventions can help mitigate such effects.

Submitted 24 April, 2021; originally announced April 2021.

Comments: To appear at AIES 2021

arXiv:2101.09962 [pdf, other]

GRADE-AO: Towards Near-Optimal Spatially-Coupled Codes With High Memories

Authors: Siyi Yang, Ahmed Hareedy, Shyam Venkatasubramanian, Robert Calderbank, Lara Dolecek

Abstract: Spatially-coupled (SC) codes, known for their threshold saturation phenomenon and low-latency windowed decoding algorithms, are ideal for streaming applications. They also find application in various data storage systems because of their excellent performance. SC codes are constructed by partitioning an underlying block code, followed by rearranging and concatenating the partitioned components in a "convolutional" manner. The number of partitioned components determines the "memory" of SC codes. While adopting higher memories results in improved SC code performance, obtaining optimal SC codes with high memory is known to be hard. In this paper, we investigate the relation between the performance of SC codes and the density distribution of partitioning matrices. We propose a probabilistic framework that obtains (locally) optimal density distributions via gradient descent. Starting from random partitioning matrices abiding by the obtained distribution, we perform low complexity optimization algorithms over the cycle properties to construct high memory, high performance quasi-cyclic SC codes. Simulation results show that codes obtained through our proposed method notably outperform state-of-the-art SC codes with the same constraint length and codes with uniform partitioning.

Submitted 25 January, 2021; originally announced January 2021.

Comments: 8 pages, 10 figures, 1 table, the shortened version has been submitted to ISIT 2021

arXiv:2101.01264 [pdf]

A Research Ecosystem for Secure Computing

Authors: Nadya Bliss, Lawrence A. Gordon, Daniel Lopresti, Fred Schneider, Suresh Venkatasubramanian

Abstract: Computing devices are vital to all areas of modern life and permeate every aspect of our society. The ubiquity of computing and our reliance on it has been accelerated and amplified by the COVID-19 pandemic. From education to work environments to healthcare to defense to entertainment - it is hard to imagine a segment of modern life that is not touched by computing. The security of computers, systems, and applications has been an active area of research in computer science for decades. However, with the confluence of both the scale of interconnected systems and increased adoption of artificial intelligence, there are many research challenges the community must face so that our society can continue to benefit and risks are minimized, not multiplied. Those challenges range from security and trust of the information ecosystem to adversarial artificial intelligence and machine learning. Along with basic research challenges, more often than not, securing a system happens after the design or even deployment, meaning the security community is routinely playing catch-up and attempting to patch vulnerabilities that could be exploited any minute. While security measures such as encryption and authentication have been widely adopted, questions of security tend to be secondary to application capability. There needs to be a sea-change in the way we approach this critically important aspect of the problem: new incentives and education are at the core of this change.
Now is the time to refocus research community efforts on developing interconnected technologies with security "baked in by design" and creating an ecosystem that ensures adoption of promising research developments. To realize this vision, two additional elements of the ecosystem are necessary - proper incentive structures for adoption and an educated citizenry that is well versed in vulnerabilities and risks.

Submitted 4 January, 2021; originally announced January 2021.

Comments: A Computing Community Consortium (CCC) white paper, 5 pages

Report number: ccc2020whitepaper_13

arXiv:2012.06057 [pdf]

Interdisciplinary Approaches to Understanding Artificial Intelligence's Impact on Society

Authors: Suresh Venkatasubramanian, Nadya Bliss, Helen Nissenbaum, Melanie Moses

Abstract: Innovations in AI have focused primarily on the questions of "what" and "how" (algorithms for finding patterns in web searches, for instance) without adequate attention to the possible harms (such as privacy, bias, or manipulation) and without adequate consideration of the societal context in which these systems operate. In part, this is driven by incentives and forces in the tech industry, where a more product-driven focus tends to drown out broader reflective concerns about potential harms and misframings. But this focus on what and how is largely a reflection of the engineering and mathematics-focused training in computer science, which emphasizes the building of tools and development of computational concepts. As a result of this tight technical focus, and the rapid, worldwide explosion in its use, AI has come with a storm of unanticipated socio-technical problems, ranging from algorithms that act in racially or gender-biased ways, get caught in feedback loops that perpetuate inequalities, or enable unprecedented behavioral monitoring surveillance that challenges the fundamental values of free, democratic societies. Given that AI is no longer solely the domain of technologists but rather of society as a whole, we need tighter coupling of computer science and those disciplines that study society and societal values.

Submitted 10 December, 2020; originally announced December 2020.

Comments: A Computing Community Consortium (CCC) white paper, 5 pages

Report number: ccc2020whitepaper_5

arXiv:2010.12611 [pdf, other]

Information access representations and social capital in networks

Authors: Ashkan Bashardoust, Hannah C. Beilinson, Sorelle A. Friedler, Jiajie Ma, Jade Rousseau, Carlos E. Scheidegger, Blair D. Sullivan, Nasanbayar Ulzii-Orshikh, Suresh Venkatasubramanian

Abstract: Social network position confers power and social capital. In the setting of online social networks that have massive reach, creating mathematical representations of social capital is an important step towards understanding how network position can differentially confer advantage to different groups and how network position can itself be a source of advantage. In this paper, we use well established models for information flow on networks as a base to propose a formal descriptor of the network position of a node as represented by its information access. Combining these descriptors allows a full representation of social capital across the network. Using real-world networks, we demonstrate that this representation allows the identification of differences between groups based on network specific measures of inequality of access.

Submitted 16 October, 2023; v1 submitted 23 October, 2020; originally announced October 2020.

arXiv:2007.01242 [pdf]

Evolving Methods for Evaluating and Disseminating Computing Research

Authors: Benjamin Zorn, Tom Conte, Keith Marzullo, Suresh Venkatasubramanian

Abstract: Social and technical trends have significantly changed methods for evaluating and disseminating computing research. Traditional venues for reviewing and publishing, such as conferences and journals, worked effectively in the past. Recently, trends have created new opportunities but also put new pressures on the process of review and dissemination. For example, many conferences have seen large increases in the number of submissions. Likewise, dissemination of research ideas has changed dramatically through publication venues such as arXiv.org and social media networks. While these trends predate COVID-19, the pandemic could accelerate longer term changes. Based on interviews with leading academics in computing research, our findings include: (1) Trends impacting computing research are largely positive and have increased the participation, scope, accessibility, and speed of the research process. (2) Challenges remain in securing the integrity of the process, including addressing ways to scale the review process, avoiding attempts to misinform or confuse the dissemination of results, and ensuring fairness and broad participation in the process itself. Based on these findings, we recommend: (1) Regularly polling members of the computing research community, including program and general conference chairs, journal editors, authors, reviewers, etc., to identify specific challenges they face to better understand these issues.
(2) An influential body, such as the Computing Research Association, regularly issues a "State of the Computing Research Enterprise" report to update the community on trends, both positive and negative, impacting the computing research enterprise. (3) A deeper investigation, specifically to better understand the influence that social media and preprint archives have on computing research, is conducted.

Submitted 2 July, 2020; originally announced July 2020.

Comments: A Computing Community Consortium (CCC) white paper, 12 pages

Report number: ccc2020whitepaper_2

arXiv:2006.11009 [pdf, other]

doi 10.1145/3442188.3445913

Fair clustering via equitable group representations

Authors: Mohsen Abbasi, Aditya Bhaskara, Suresh Venkatasubramanian

Abstract: What does it mean for a clustering to be fair? One popular approach seeks to ensure that each cluster contains groups in (roughly) the same proportion in which they exist in the population. The normative principle at play is balance: any cluster might act as a representative of the data, and thus should reflect its diversity. But clustering also captures a different form of representativeness. A core principle in most clustering problems is that a cluster center should be representative of the cluster it represents, by being "close" to the points associated with it. This is so that we can effectively replace the points by their cluster centers without significant loss in fidelity, and indeed is a common "use case" for clustering. For such a clustering to be fair, the centers should "represent" different groups equally well. We call such a clustering a group-representative clustering. In this paper, we study the structure and computation of group-representative clusterings. We show that this notion naturally parallels the development of fairness notions in classification, with direct analogs of ideas like demographic parity and equal opportunity. We demonstrate how these notions are distinct from and cannot be captured by balance-based notions of fairness. We present approximation algorithms for group representative $k$-median clustering and couple this with an empirical evaluation on various real-world data sets.

Submitted 27 January, 2021; v1 submitted 19 June, 2020; originally announced June 2020.

Comments: 11 pages, 5 figures, ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT)

arXiv:2002.11097 [pdf, other]

Problems with Shapley-value-based explanations as feature importance measures

Authors: I. Elizabeth Kumar, Suresh Venkatasubramanian, Carlos Scheidegger, Sorelle Friedler

Abstract: Game-theoretic formulations of feature importance have become popular as a way to "explain" machine learning models. These methods define a cooperative game between the features of a model and distribute influence among these input elements using some form of the game's unique Shapley values. Justification for these methods rests on two pillars: their desirable mathematical properties, and their applicability to specific motivations for explanations. We show that mathematical problems arise when Shapley values are used for feature importance and that the solutions to mitigate these necessarily induce further complexity, such as the need for causal reasoning. We also draw on additional literature to argue that Shapley values do not provide explanations which suit human-centric goals of explainability.

Submitted 30 June, 2020; v1 submitted 25 February, 2020; originally announced February 2020.

Comments: Accepted to ICML 2020

arXiv:1909.03166 [pdf, other]

Equalizing Recourse across Groups

Authors: Vivek Gupta, Pegah Nokhiz, Chitradeep Dutta Roy, Suresh Venkatasubramanian

Abstract: The rise in machine learning-assisted decision-making has led to concerns about the fairness of the decisions and techniques to mitigate problems of discrimination. If a negative decision is made about an individual (denying a loan, rejecting an application for housing, and so on) justice dictates that we be able to ask how we might change circumstances to get a favorable decision the next time. Moreover, the ability to change circumstances (a better education, improved credentials) should not be limited to only those with access to expensive resources. In other words, recourse for negative decisions should be considered a desirable value that can be equalized across (demographically defined) groups. This paper describes how to build models that make accurate predictions while still ensuring that the penalties for a negative outcome do not disadvantage different groups disproportionately. We measure recourse as the distance of an individual from the decision boundary of a classifier. We then introduce a regularized objective to minimize the difference in recourse across groups. We explore linear settings and further extend recourse to non-linear settings as well as model-agnostic settings where the exact distance from boundary cannot be calculated. Our results show that we can successfully decrease the unfairness in recourse while maintaining classifier performance.

Submitted 6 September, 2019; originally announced September 2019.

Comments: 13 pages, 4 figures, 2 tables
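
Illustrative note on the entry above: the abstract measures recourse as an individual's distance from the decision boundary of a classifier and compares it across groups. For a linear classifier with weights w and bias b, that distance is |w·x + b| / ||w||. The following is a minimal sketch under that linear assumption; the function names, the two-group toy data, and the choice to average over negatively classified individuals are hypothetical and are not taken from the paper.

```python
import math


def score(w, b, x):
    """Signed score w.x + b of a linear classifier (negative = unfavorable)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def distance_to_boundary(w, b, x):
    """Euclidean distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(score(w, b, x)) / norm


def mean_recourse_by_group(w, b, points, groups):
    """Average distance to the boundary among negatively classified points,
    computed separately for each group label (an assumed aggregation)."""
    totals, counts = {}, {}
    for x, g in zip(points, groups):
        if score(w, b, x) < 0:
            totals[g] = totals.get(g, 0.0) + distance_to_boundary(w, b, x)
            counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}


def recourse_gap(w, b, points, groups):
    """Largest difference in mean recourse between any two groups."""
    means = mean_recourse_by_group(w, b, points, groups)
    return max(means.values()) - min(means.values())


# Hypothetical toy data: two features, two groups "a" and "b".
w, b = [3.0, 4.0], -5.0
points = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [0.0, 1.0]]
groups = ["a", "b", "a", "b"]
print(mean_recourse_by_group(w, b, points, groups))
print(recourse_gap(w, b, points, groups))
```

A gap of this kind is one quantity that a regularized training objective could penalize; how the paper's objective is actually formulated is not specified in the abstract.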

arXiv:1906.08652 [pdf, other]

Disentangling Influence: Using Disentangled Representations to Audit Model Predictions

Authors: Charles T. Marx, Richard Lanas Phillips, Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian

Abstract: Motivated by the need to audit complex and black box models, there has been extensive research on quantifying how data features influence model predictions. Feature influence can be direct (a direct influence on model outcomes) and indirect (model outcomes are influenced via proxy features). Feature influence can also be expressed in aggregate over the training or test data or locally with respect to a single point. Current research has typically focused on one of each of these dimensions. In this paper, we develop disentangled influence audits, a procedure to audit the indirect influence of features. Specifically, we show that disentangled representations provide a mechanism to identify proxy features in the dataset, while allowing an explicit computation of feature influence on either individual outcomes or aggregate-level outcomes. We show through both theory and experiments that disentangled influence audits can both detect proxy features and show, for each individual or in aggregate, which of these proxy features affects the classifier being audited the most. In this respect, our method is more powerful than existing methods for ascertaining feature influence.

Submitted 20 June, 2019; originally announced June 2019.

arXiv:1903.02047 [pdf, other]

doi 10.1145/3308558.3313680

Gaps in Information Access in Social Networks

Authors: Benjamin Fish, Ashkan Bashardoust, danah boyd, Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian

Abstract: The study of influence maximization in social networks has largely ignored disparate effects these algorithms might have on the individuals contained in the social network. Individuals may place a high value on receiving information, e.g. job openings or advertisements for loans. While well-connected individuals at the center of the network are likely to receive the information that is being distributed through the network, poorly connected individuals are systematically less likely to receive the information, producing a gap in access to the information between individuals. In this work, we study how best to spread information in a social network while minimizing this access gap. We propose to use the maximin social welfare function as an objective function, where we maximize the minimum probability of receiving the information under an intervention. We prove that in this setting this welfare function constrains the access gap whereas maximizing the expected number of nodes reached does not. We also investigate the difficulties of using the maximin, and present hardness results and analysis for standard greedy strategies. Finally, we investigate practical ways of optimizing for the maximin, and give empirical evidence that a simple greedy-based strategy works well in practice.

Submitted 5 March, 2019; originally announced March 2019.

Comments: Accepted at The Web Conference 2019
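
Illustrative note on the entry above: the abstract contrasts two objectives for choosing where to seed information, namely maximizing the expected number of nodes reached and maximizing the minimum probability of receiving the information (maximin). The sketch below assumes the per-node access probabilities for each candidate seed are already known; the table of probabilities, the seed names, and the function names are hypothetical, and no diffusion model from the paper is implemented here.

```python
def expected_reach(probs):
    """Expected number of nodes reached: the sum of access probabilities."""
    return sum(probs)


def min_access(probs):
    """Maximin objective: the lowest access probability over all nodes."""
    return min(probs)


def access_gap(probs):
    """Difference between the best-served and worst-served node."""
    return max(probs) - min(probs)


def choose_seed(access, objective):
    """Pick the candidate seed whose probability vector maximizes objective."""
    return max(access, key=lambda seed: objective(access[seed]))


# Hypothetical access probabilities for four nodes under two candidate seeds.
access = {
    "hub": [1.0, 0.9, 0.9, 0.1],     # reaches the core well, the periphery poorly
    "bridge": [0.6, 0.6, 0.6, 0.5],  # lower total reach, far more even access
}

best_by_reach = choose_seed(access, expected_reach)
best_by_maximin = choose_seed(access, min_access)
print(best_by_reach, access_gap(access[best_by_reach]))
print(best_by_maximin, access_gap(access[best_by_maximin]))
```

In this toy table the two objectives disagree: the reach-maximizing seed leaves one node with very low access, while the maximin choice narrows the gap at the cost of total reach.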

arXiv:1901.09565 [pdf, other]

Fairness in representation: quantifying stereotyping as a representational harm

Authors: Mohsen Abbasi, Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian

Abstract: While harms of allocation have been increasingly studied as part of the subfield of algorithmic fairness, harms of representation have received considerably less attention. In this paper, we formalize two notions of stereotyping and show how they manifest in later allocative harms within the machine learning pipeline. We also propose mitigation strategies and demonstrate their effectiveness on synthetic datasets.

Submitted 28 January, 2019; originally announced January 2019.

Comments: 9 pages, 6 figures, SIAM International Conference on Data Mining

arXiv:1802.06992 [pdf, ps, other]

Sublinear Algorithms for MAXCUT and Correlation Clustering

Authors: Aditya Bhaskara, Samira Daruki, Suresh Venkatasubramanian

Abstract: We study sublinear algorithms for two fundamental graph problems, MAXCUT and correlation clustering. Our focus is on constructing core-sets as well as developing streaming algorithms for these problems. Constant space algorithms are known for dense graphs for these problems, while $Ω(n)$ lower bounds exist (in the streaming setting) for sparse graphs. Our goal in this paper is to bridge the gap between these extremes. Our first result is to construct core-sets of size $\tilde{O}(n^{1-δ})$ for both the problems, on graphs with average degree $n^δ$ (for any $δ>0$). This turns out to be optimal, under the exponential time hypothesis (ETH). Our core-set analysis is based on studying random-induced sub-problems of optimization problems. To the best of our knowledge, all the known results in our parameter range rely crucially on near-regularity assumptions. We avoid these by using a biased sampling approach, which we analyze using recent results on concentration of quadratic functions. We then show that our construction yields a 2-pass streaming $(1+ε)$-approximation for both problems; the algorithm uses $\tilde{O}(n^{1-δ})$ space, for graphs of average degree $n^δ$.

Submitted 20 February, 2018; originally announced February 2018.

Comments: 29 pages, conference

arXiv:1802.04422 [pdf, other]

A comparative study of fairness-enhancing interventions in machine learning

Authors: Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian, Sonam Choudhary, Evan P. Hamilton, Derek Roth

Abstract: Computers are increasingly used to make decisions that have significant impact in people's lives. Often, these predictions can affect different population subgroups disproportionately. As a result, the issue of fairness has received much recent interest, and a number of fairness-enhanced classifiers and predictors have appeared in the literature. This paper seeks to study the following questions: how do these different techniques fundamentally compare to one another, and what accounts for the differences? Specifically, we seek to bring attention to many under-appreciated aspects of such fairness-enhancing interventions. Concretely, we present the results of an open benchmark we have developed that lets us compare a number of different algorithms under a variety of fairness measures, and a large number of existing datasets. We find that although different algorithms tend to prefer specific formulations of fairness preservations, many of these measures strongly correlate with one another. In addition, we find that fairness-preserving algorithms tend to be sensitive to fluctuations in dataset composition (simulated in our benchmark by varying training-test splits), indicating that fairness interventions might be more brittle than previously thought.

Submitted 12 February, 2018; originally announced February 2018.

arXiv:1707.00391 [pdf, other]

Fair Pipelines

Authors: Amanda Bower, Sarah N. Kitchen, Laura Niss, Martin J. Strauss, Alexander Vargas, Suresh Venkatasubramanian

Abstract: This work facilitates ensuring fairness of machine learning in the real world by decoupling fairness considerations in compound decisions. In particular, this work studies how fairness propagates through a compound decision-making process, which we call a pipeline. Prior work in algorithmic fairness only focuses on fairness with respect to one decision. However, many decision-making processes require more than one decision. For instance, hiring is at least a two stage model: deciding who to interview from the applicant pool and then deciding who to hire from the interview pool. Perhaps surprisingly, we show that the composition of fair components may not guarantee a fair pipeline under a $(1+\varepsilon)$-equal opportunity definition of fair. However, we identify circumstances that do provide that guarantee. We also propose numerous directions for future work on more general compound machine learning decisions.

Submitted 2 July, 2017; originally announced July 2017.

Comments: Presented as a poster at the 2017 Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML 2017)

arXiv:1706.09847 [pdf, other]

Runaway Feedback Loops in Predictive Policing

Authors: Danielle Ensign, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, Suresh Venkatasubramanian

Abstract: Predictive policing systems are increasingly used to determine how to allocate police across a city in order to best prevent crime. Discovered crime data (e.g., arrest counts) are used to help update the model, and the process is repeated. Such systems have been empirically shown to be susceptible to runaway feedback loops, where police are repeatedly sent back to the same neighborhoods regardless of the true crime rate. In response, we develop a mathematical model of predictive policing that proves why this feedback loop occurs, show empirically that this model exhibits such problems, and demonstrate how to change the inputs to a predictive policing system (in a black-box manner) so the runaway feedback loop does not occur, allowing the true crime rate to be learned. Our results are quantitative: we can establish a link (in our model) between the degree to which runaway feedback causes problems and the disparity in crime rates between areas. Moreover, we can also demonstrate the way in which reported incidents of crime (those reported by residents) and discovered incidents of crime (i.e. those directly observed by police officers dispatched as a result of the predictive policing algorithm) interact: in brief, while reported incidents can attenuate the degree of runaway feedback, they cannot entirely remove it without the interventions we suggest.

Submitted 21 December, 2017; v1 submitted 29 June, 2017; originally announced June 2017.

Comments: Extended version accepted to the 1st Conference on Fairness, Accountability and Transparency, 2018. Adds further treatment of reported as well as discovered incidents
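
Illustrative note on the entry above: the abstract describes a loop in which police are allocated according to previously discovered crime, and new discoveries then update the allocation. The sketch below is a simple urn-style simulation of that loop for two neighborhoods. It is an assumption made for illustration and should not be read as the paper's exact model; the function name, the crime rates, and the starting counts are hypothetical, and the paper's proposed intervention is not implemented.

```python
import random


def simulate_patrols(rate_a, rate_b, days, seed=0, start=(1, 1)):
    """Simulate patrol allocation driven only by discovered incidents.

    Each day one patrol is sent to neighborhood A with probability equal to
    A's share of all incidents discovered so far, otherwise to B. A patrol
    discovers an incident with probability equal to that neighborhood's
    (hypothetical) true crime rate, and the discovery is added to its count.
    Returns the final discovered-incident counts (count_a, count_b).
    """
    rng = random.Random(seed)
    count_a, count_b = start
    for _ in range(days):
        send_to_a = rng.random() < count_a / (count_a + count_b)
        true_rate = rate_a if send_to_a else rate_b
        if rng.random() < true_rate:
            if send_to_a:
                count_a += 1
            else:
                count_b += 1
    return count_a, count_b


# Hypothetical rates: neighborhood A has only a slightly higher true rate.
count_a, count_b = simulate_patrols(rate_a=0.3, rate_b=0.2, days=5000, seed=1)
share_a = count_a / (count_a + count_b)
print(count_a, count_b, round(share_a, 3))
```

Running the simulation with different seeds and rates shows how the patrol share for one neighborhood can drift away from the ratio of the true rates, because a neighborhood that is patrolled more also has more of its incidents discovered.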

arXiv:1609.07236 [pdf, other]

On the (im)possibility of fairness

Authors: Sorelle A. Friedler, Carlos Scheidegger, Suresh Venkatasubramanian

Abstract: What does it mean for an algorithm to be fair? Different papers use different notions of algorithmic fairness, and although these appear internally consistent, they also seem mutually incompatible. We present a mathematical setting in which the distinctions in previous papers can be made formal. In addition to characterizing the spaces of inputs (the "observed" space) and outputs (the "decision" space), we introduce the notion of a construct space: a space that captures unobservable, but meaningful variables for the prediction. We show that in order to prove desirable properties of the entire decision-making process, different mechanisms for fairness require different assumptions about the nature of the mapping from construct space to decision space. The results in this paper imply that future treatments of algorithmic fairness should more explicitly state assumptions about the relationship between constructs and observations.

Submitted 23 September, 2016; originally announced September 2016.

arXiv:1603.01374 [pdf, other]

A Unified View of Localized Kernel Learning

Authors: John Moeller, Sarathkrishna Swaminathan, Suresh Venkatasubramanian

Abstract: Multiple Kernel Learning, or MKL, extends (kernelized) SVM by attempting to learn not only a classifier/regressor but also the best kernel for the training task, usually from a combination of existing kernel functions. Most MKL methods seek the combined kernel that performs best over every training example, sacrificing performance in some areas to seek a global optimum. Localized kernel learning (LKL) overcomes this limitation by allowing the training algorithm to match a component kernel to the examples that can exploit it best. Several approaches to the localized kernel learning problem have been explored in the last several years. We unify many of these approaches under one simple system and design a new algorithm with improved performance. We also develop enhanced versions of existing algorithms, with an eye on scalability and performance.

Submitted 4 March, 2016; originally announced March 2016.

Comments: 14 pages, 2 figures, 4 tables. Reformatted version of the accepted SDM 2016 paper

arXiv:1602.08162 [pdf, other]

Streaming Verification of Graph Properties

Authors: Amirali Abdullah, Samira Daruki, Chitradeep Dutta Roy, Suresh Venkatasubramanian

Abstract: Streaming interactive proofs (SIPs) are a framework for outsourced computation. A computationally limited streaming client (the verifier) hands over a large data set to an untrusted server (the prover) in the cloud and the two parties run a protocol to confirm the correctness of the result with high probability. SIPs are particularly interesting for problems that are hard to solve (or even approximate) well in a streaming setting. The most notable of these problems is finding maximum matchings, which has received intense interest in recent years but has strong lower bounds even for constant factor approximations. In this paper, we present efficient streaming interactive proofs that can verify maximum matchings exactly. Our results cover all flavors of matchings (bipartite/non-bipartite and weighted). In addition, we also present streaming verifiers for approximate metric TSP. In particular, these are the first efficient results for weighted matchings and for metric TSP in any streaming verification model.

Submitted 3 October, 2016; v1 submitted 25 February, 2016; originally announced February 2016.

Comments: 26 pages, 2 figures, 1 table

arXiv:1602.07043 [pdf, other]

Auditing Black-box Models for Indirect Influence

Authors: Philip Adler, Casey Falk, Sorelle A. Friedler, Gabriel Rybeck, Carlos Scheidegger, Brandon Smith, Suresh Venkatasubramanian

Abstract: Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score. It is therefore hard to acquire a deeper understanding of model behavior, and in particular how different features influence the model prediction. This is important when interpreting the behavior of complex models, or asserting that certain problematic attributes (like race or gender) are not unduly influencing decisions. In this paper, we present a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the dataset, without knowing how the models work. Our work focuses on the problem of indirect influence: how some features might indirectly influence outcomes via other, related features. As a result, we can find attribute influences even in cases where, upon further direct examination of the model, the attribute is not referred to by the model at all. Our approach does not require the black-box model to be retrained. This is important if (for example) the model is only accessible via an API, and contrasts our work with other methods that investigate feature influence like feature selection. We present experimental evidence for the effectiveness of our procedure using a variety of publicly available datasets and models. We also validate our procedure using techniques from interpretable learning and feature selection, as well as against other black-box auditing procedures.

Submitted 30 November, 2016; v1 submitted 22 February, 2016; originally announced February 2016.

Comments: Final version of paper that appears in the IEEE International Conference on Data Mining (ICDM), 2016

arXiv:1509.05514 [pdf, other]

Streaming Verification in Data Analysis

Authors: Samira Daruki, Justin Thaler, Suresh Venkatasubramanian

Abstract: Streaming interactive proofs (SIPs) are a framework to reason about outsourced computation, where a data owner (the verifier) outsources a computation to the cloud (the prover), but wishes to verify the correctness of the solution provided by the cloud service. In this paper we present streaming interactive proofs for problems in data analysis. We present protocols for clustering and shape fitting problems, as well as an improved protocol for rectangular matrix multiplication. The latter can in turn be used to verify $k$ eigenvectors of a (streamed) $n \times n$ matrix. In general our solutions use polylogarithmic rounds of communication and polylogarithmic total communication and verifier space. For special cases (when optimality certificates can be verified easily), we present constant round protocols with similar costs. For rectangular matrix multiplication and eigenvector verification, our protocols work in the more restricted annotated data streaming model, and use sublinear (but not polylogarithmic) communication.

Submitted 18 September, 2015; originally announced September 2015.

arXiv:1504.02462 [pdf, other]

A Group Theoretic Perspective on Unsupervised Deep Learning

Authors: Arnab Paul, Suresh Venkatasubramanian

Abstract: Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.

Submitted 21 April, 2015; v1 submitted 8 April, 2015; originally announced April 2015.

Comments: 2-page version of arXiv:1412.6621 (prepared for presentation at ICLR 2015 workshop as required by ICLR PC). This version has some minor formatting changes as required by the conference

arXiv:1503.05225 [pdf, ps, other]

Sketching, Embedding, and Dimensionality Reduction for Information Spaces

Authors: Amirali Abdullah, Ravi Kumar, Andrew McGregor, Sergei Vassilvitskii, Suresh Venkatasubramanian

Abstract: Information distances like the Hellinger distance and the Jensen-Shannon divergence have deep roots in information theory and machine learning. They are used extensively in data analysis, especially when the objects being compared are high dimensional empirical probability distributions built from data. However, we lack common tools needed to actually use information distances in applications efficiently and at scale with any kind of provable guarantees. We can't sketch these distances easily, or embed them in better behaved spaces, or even reduce the dimensionality of the space while maintaining the probability structure of the data. In this paper, we build these tools for information distances, both for the Hellinger distance and Jensen-Shannon divergence, as well as related measures, like the $χ^2$ divergence. We first show that they can be sketched efficiently (i.e. up to multiplicative error in sublinear space) in the aggregate streaming model. This result is exponentially stronger than known upper bounds for sketching these distances in the strict turnstile streaming model. Second, we show a finite dimensionality embedding result for the Jensen-Shannon and $χ^2$ divergences that preserves pairwise distances. Finally, we prove a dimensionality reduction result for the Hellinger, Jensen-Shannon, and $χ^2$ divergences that preserves the information geometry of the distributions (specifically, by retaining the simplex structure of the space).
While our second result above already implies that these divergences can be explicitly embedded in Euclidean space, retaining the simplex structure is important because it allows us to continue doing inference in the reduced space. In essence, we preserve not just the distance structure but the underlying geometry of the space.

Submitted 17 March, 2015; originally announced March 2015.

arXiv:1412.6621 [pdf, other]

Why does Deep Learning work? - A perspective from Group Theory

Authors: Arnab Paul, Suresh Venkatasubramanian

Abstract: Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep Learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.

Submitted 28 February, 2015; v1 submitted 20 December, 2014; originally announced December 2014.

Comments: 13 pages, 5 figures

arXiv:1412.3756 [pdf, other]

Certifying and removing disparate impact

Authors: Michael Feldman, Sorelle Friedler, John Moeller, Carlos Scheidegger, Suresh Venkatasubramanian

Abstract: What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender, religious practice) and an explicit description of the process. When the process is implemented using computers, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the algorithm, we propose making inferences based on the data the algorithm uses. We make four contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that, while known, has received relatively little attention. Second, we propose a test for disparate impact based on analyzing the information leakage of the protected class from the other data attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.

Submitted 15 July, 2015; v1 submitted 11 December, 2014; originally announced December 2014.

Comments: Extended version of paper accepted at 2015 ACM SIGKDD Conference on Knowledge Discovery and Data Mining

arXiv:1404.1191 [pdf, other]

A directed isoperimetric inequality with application to Bregman near neighbor lower bounds

Authors: Amirali Abdullah, Suresh Venkatasubramanian

Abstract: Bregman divergences $D_φ$ are a class of divergences parametrized by a convex function $φ$ and include well known distance functions like $\ell_2^2$ and the Kullback-Leibler divergence. There has been extensive research on algorithms for problems like clustering and near neighbor search with respect to Bregman divergences; in all cases, the algorithms depend not just on the data size $n$ and dimensionality $d$, but also on a structure constant $μ \ge 1$ that depends solely on $φ$ and can grow without bound independently. In this paper, we provide the first evidence that this dependence on $μ$ might be intrinsic. We focus on the problem of approximate near neighbor search for Bregman divergences. We show that under the cell probe model, any non-adaptive data structure (like locality-sensitive hashing) for $c$-approximate near-neighbor search that admits $r$ probes must use space $Ω(n^{1 + \frac{μ}{cr}})$. In contrast, for LSH under $\ell_1$ the best bound is $Ω(n^{1 + \frac{1}{cr}})$. Our new tool is a directed variant of the standard boolean noise operator. We show that a generalization of the Bonami-Beckner hypercontractivity inequality exists "in expectation" or upon restriction to certain subsets of the Hamming cube, and that this is sufficient to prove the desired isoperimetric inequality that we use in our data structure lower bound. We also present a structural result reducing the Hamming cube to a Bregman cube. This structure allows us to obtain lower bounds for problems under Bregman divergences from their $\ell_1$ analog.
In particular, we get a (weaker) lower bound for approximate near neighbor search of the form $Ω(n^{1 + \frac{1}{cr}})$ for an $r$-query non-adaptive data structure, and new cell probe lower bounds for a number of other near neighbor questions in Bregman space.

Submitted 16 May, 2015; v1 submitted 4 April, 2014; originally announced April 2014.

Comments: 27 pages

arXiv:1401.3331 [pdf, other]

Advanced Self-interference Cancellation and Multiantenna Techniques for Full-Duplex Radios

Authors: Dani Korpi, Sathya Venkatasubramanian, Taneli Riihonen, Lauri Anttila, Strasdosky Otewa, Clemens Icheln, Katsuyuki Haneda, Sergei Tretyakov, Mikko Valkama, Risto Wichman

Abstract: In an in-band full-duplex system, radios transmit and receive simultaneously in the same frequency band, providing a radical improvement in spectral efficiency over a half-duplex system. However, in order to design such a system, it is necessary to mitigate the self-interference due to simultaneous transmission and reception, which seriously limits the maximum transmit power of the full-duplex device. In particular, large differences in power levels in the receiver front-end set stringent requirements for the linearity of the transceiver electronics. We present an advanced architecture for a compact full-duplex multiantenna transceiver combining antenna design with analog and digital cancellation, including both linear and nonlinear signal processing.

Submitted 14 January, 2014; originally announced January 2014.

Comments: Presented in 47th Annual Asilomar Conference on Signals, Systems, and Computers, 2013

arXiv:1306.3295 [pdf, other]

Rethinking Abstractions for Big Data: Why, Where, How, and What

Authors: Mary Hall, Robert M. Kirby, Feifei Li, Miriah Meyer, Valerio Pascucci, Jeff M. Phillips, Rob Ricci, Jacobus Van der Merwe, Suresh Venkatasubramanian

Abstract: Big data refers to large and complex data sets that, under existing approaches, exceed the capacity and capability of current compute platforms, systems software, analytical tools and human understanding. Numerous lessons on the scalability of big data can already be found in asymptotic analysis of algorithms and from the high-performance computing (HPC) and applications communities. However, scale is only one aspect of current big data trends; fundamentally, current and emerging problems in big data are a result of unprecedented complexity: in the structure of the data and how to analyze it, in dealing with unreliability and redundancy, in addressing the human factors of comprehending complex data sets, in formulating meaningful analyses, and in managing the dense, power-hungry data centers that house big data. The computer science solution to complexity is finding the right abstractions, those that hide as much triviality as possible while revealing the essence of the problem that is being addressed. The "big data challenge" has disrupted computer science by stressing to the very limits the familiar abstractions which define the relevant subfields in data analysis, data management and the underlying parallel systems. As a result, not enough of these challenges are revealed by isolating abstractions in a traditional software stack or standard algorithmic and analytical techniques, and attempts to address complexity either oversimplify or require low-level management of details.
The authors believe that the abstractions for big data need to be rethought, and this reorganization needs to evolve and be sustained through continued cross-disciplinary collaboration.

Submitted 14 June, 2013; originally announced June 2013.

Comments: 8 pages, 1 figure

Report number: UUCS-13-002

arXiv:1305.4757 [pdf, other]

Power to the Points: Validating Data Memberships in Clusterings

Authors: Parasaran Raman, Suresh Venkatasubramanian

Abstract: A clustering is an implicit assignment of labels to points, based on proximity to other points. It is these labels that are then used for downstream analysis (either focusing on individual clusters, or identifying representatives of clusters and so on). Thus, in order to trust a clustering as a first step in exploratory data analysis, we must trust the labels assigned to individual data. Without supervision, how can we validate this assignment? In this paper, we present a method to attach affinity scores to the implicit labels of individual points in a clustering. The affinity scores capture the confidence level of the cluster that claims to "own" the point. This method is very general: it can be used with clusterings derived from Euclidean data, kernelized data, or even data derived from information spaces. It smoothly incorporates importance functions on clusters, allowing us to weight different clusters differently. It is also efficient: assigning an affinity score to a point depends only polynomially on the number of clusters and is independent of the number of points in the data. The dimensionality of the underlying space only appears in preprocessing. We demonstrate the value of our approach with an experimental study that illustrates the use of these scores in different data analysis tasks, as well as the efficiency and flexibility of the method. We also demonstrate useful visualizations of these scores; these might prove useful within an interactive analytics framework.

Submitted 21 May, 2013; originally announced May 2013.

Comments: 18 pages, 9 figures, 5 tables

arXiv:1302.4720 [pdf, other]

Multiple Target Tracking with RF Sensor Networks

Authors: Maurizio Bocca, Ossi Kaltiokallio, Neal Patwari, Suresh Venkatasubramanian

Abstract: RF sensor networks are wireless networks that can localize and track people (or targets) without needing them to carry or wear any electronic device. They use the change in the received signal strength (RSS) of the links due to the movements of people to infer their locations. In this paper, we consider real-time multiple target tracking with RF sensor networks. We perform radio tomographic imaging (RTI), which generates images of the change in the propagation field, as if they were frames of a video. Our RTI method uses RSS measurements on multiple frequency channels on each link, combining them with a fade level-based weighted average. We describe methods to adapt machine vision methods to the peculiarities of RTI to enable real-time multiple target tracking. Several tests are performed in an open environment, a one-bedroom apartment, and a cluttered office environment. The results demonstrate that the system is capable of accurately tracking in real time up to 4 targets in cluttered indoor environments, even when their trajectories intersect multiple times, without mis-estimating the number of targets found in the monitored area. The highest average tracking error measured in the tests is 0.45 m with two targets, 0.46 m with three targets, and 0.55 m with four targets.

Submitted 11 February, 2013; originally announced February 2013.

arXiv:1206.5580 [pdf, other]

A Geometric Algorithm for Scalable Multiple Kernel Learning

Authors: John Moeller, Parasaran Raman, Avishek Saha, Suresh Venkatasubramanian

Abstract: We present a geometric formulation of the Multiple Kernel Learning (MKL) problem. To do so, we reinterpret the problem of learning kernel weights as searching for a kernel that maximizes the minimum (kernel) distance between two convex polytopes. This interpretation, combined with novel structural insights from our geometric formulation, allows us to reduce the MKL problem to a simple optimization routine that yields provable convergence as well as quality guarantees. As a result, our method scales efficiently to much larger data sets than most prior methods can handle. Empirical evaluation on eleven datasets shows that we are significantly faster and even compare favorably with a uniform unweighted combination of kernels.

Submitted 15 March, 2014; v1 submitted 25 June, 2012; originally announced June 2012.

Comments: 20 pages

arXiv:1204.3523 [pdf, ps, other]

Efficient Protocols for Distributed Classification and Optimization

Authors: Hal Daume III, Jeff M. Phillips, Avishek Saha, Suresh Venkatasubramanian

Abstract: In distributed learning, the goal is to perform a learning task over data distributed across multiple nodes with minimal (expensive) communication. Prior work (Daume III et al., 2012) proposes a general model that bounds the communication required for learning classifiers while allowing for $\epsilon$ training error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses $O(d^2 \log(1/\epsilon))$ words of communication to classify distributed data in arbitrary dimension $d$, $\epsilon$-optimally. This readily extends to classification over $k$ nodes with $O(kd^2 \log(1/\epsilon))$ words of communication. Our proposed protocol is simple to implement and is considerably more efficient than the baselines we compare against, as demonstrated by our empirical results. In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed data. We show how to solve fixed-dimensional and high dimensional linear programming efficiently in a distributed setting where constraints may be distributed across nodes. Since many learning problems can be viewed as convex optimization problems where constraints are generated by individual points, this models many typical distributed learning scenarios. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting.
As a consequence, our methods extend to the wide range of problems solvable using these techniques.

Submitted 16 April, 2012; originally announced April 2012.

arXiv:1202.6078 [pdf, other]

Protocols for Learning Classifiers on Distributed Data

Authors: Hal Daume III, Jeff M. Phillips, Avishek Saha, Suresh Venkatasubramanian

Abstract: We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.

Submitted 27 February, 2012; originally announced February 2012.

Comments: 19 pages, 12 figures, accepted at AISTATS 2012

arXiv:1108.0835 [pdf, other]

Approximate Bregman near neighbors in sublinear time: Beyond the triangle inequality

Authors: Amirali Abdullah, John Moeller, Suresh Venkatasubramanian

Abstract: In this paper we present the first provable approximate nearest-neighbor (ANN) algorithms for Bregman divergences. Our first algorithm processes queries in O(log^d n) time using O(n log^d n) space and only uses general properties of the underlying distance function (which includes Bregman divergences as a special case). The second algorithm processes queries in O(log n) time using O(n) space and exploits structural constants associated specifically with Bregman divergences. An interesting feature of our algorithms is that they extend the ring-tree + quad-tree paradigm for ANN searching beyond Euclidean distances and metrics of bounded doubling dimension to distances that might not even be symmetric or satisfy a triangle inequality.

Submitted 15 September, 2013; v1 submitted 3 August, 2011; originally announced August 2011.

Comments: 42 pages, including appendices and bibliography. Accepted at SOCG 2012; this version updated to remove typos and minor errata

arXiv:1108.0017 [pdf, other]

Generating a Diverse Set of High-Quality Clusterings

Authors: Jeff M. Phillips, Parasaran Raman, Suresh Venkatasubramanian

Abstract: We provide a new framework for generating multiple good quality partitions (clusterings) of a single data set. Our approach decomposes this problem into two components: generating many high-quality partitions, and then grouping these partitions to obtain k representatives. The decomposition makes the approach extremely modular and allows us to optimize various criteria that control the choice of representative partitions.

Submitted 29 July, 2011; originally announced August 2011.

Comments: 12 Pages, 5 Figures, 2nd MultiClust Workshop at ECML PKDD 2011

Showing 1–50 of 63 results for author: Venkatasubramanian, S