-
CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing
Authors:
Ahmet Canberk Baykal,
Abdul Basit Anees,
Duygu Ceylan,
Erkut Erdem,
Aykut Erdem,
Deniz Yuret
Abstract:
Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. H…
▽ More
Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. However, these approaches have inherent limitations. The former is not very efficient, while the latter often struggles to effectively handle multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes. The core of our method is the use of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that by conditioning the initial inversion step on the CLIP embedding of the target description, we are able to obtain more successful edit directions. Additionally, we use a CLIP-guided refinement step to make corrections in the resulting residual latent codes, which further improves the alignment with the text prompt. Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds, as shown by our qualitative and quantitative results.
△ Less
Submitted 18 July, 2023; v1 submitted 17 July, 2023;
originally announced July 2023.
-
SLaM: Student-Label Mixing for Distillation with Unlabeled Examples
Authors:
Vasilis Kontonis,
Fotis Iliopoulos,
Khoa Trinh,
Cenk Baykal,
Gaurav Menghani,
Erik Vee
Abstract:
Knowledge distillation with unlabeled examples is a powerful training paradigm for generating compact and lightweight student models in applications where the amount of labeled data is limited but one has access to a large pool of unlabeled data. In this setting, a large teacher model generates ``soft'' pseudo-labels for the unlabeled dataset which are then used for training the student model. Des…
▽ More
Knowledge distillation with unlabeled examples is a powerful training paradigm for generating compact and lightweight student models in applications where the amount of labeled data is limited but one has access to a large pool of unlabeled data. In this setting, a large teacher model generates ``soft'' pseudo-labels for the unlabeled dataset which are then used for training the student model. Despite its success in a wide variety of applications, a shortcoming of this approach is that the teacher's pseudo-labels are often noisy, leading to impaired student performance. In this paper, we present a principled method for knowledge distillation with unlabeled examples that we call Student-Label Mixing (SLaM) and we show that it consistently improves over prior approaches by evaluating it on several standard benchmarks. Finally, we show that SLaM comes with theoretical guarantees; along the way we give an algorithm improving the best-known sample complexity for learning halfspaces with margin under random classification noise, and provide the first convergence analysis for so-called ``forward loss-adjustment" methods.
△ Less
Submitted 8 June, 2023; v1 submitted 7 February, 2023;
originally announced February 2023.
-
The Power of External Memory in Increasing Predictive Model Capacity
Authors:
Cenk Baykal,
Dylan J Cutler,
Nishanth Dikkala,
Nikhil Ghosh,
Rina Panigrahy,
Xin Wang
Abstract:
One way of introducing sparsity into deep networks is by attaching an external table of parameters that is sparsely looked up at different layers of the network. By storing the bulk of the parameters in the external table, one can increase the capacity of the model without necessarily increasing the inference time. Two crucial questions in this setting are then: what is the lookup function for acc…
▽ More
One way of introducing sparsity into deep networks is by attaching an external table of parameters that is sparsely looked up at different layers of the network. By storing the bulk of the parameters in the external table, one can increase the capacity of the model without necessarily increasing the inference time. Two crucial questions in this setting are then: what is the lookup function for accessing the table and how are the contents of the table consumed? Prominent methods for accessing the table include 1) using words/wordpieces token-ids as table indices, 2) LSH hashing the token vector in each layer into a table of buckets, and 3) learnable softmax style routing to a table entry. The ways to consume the contents include adding/concatenating to input representation, and using the contents as expert networks that specialize to different inputs. In this work, we conduct rigorous experimental evaluations of existing ideas and their combinations. We also introduce a new method, alternating updates, that enables access to an increased token dimension without increasing the computation time, and demonstrate its effectiveness in language modeling.
△ Less
Submitted 30 January, 2023;
originally announced February 2023.
-
Alternating Updates for Efficient Transformers
Authors:
Cenk Baykal,
Dylan Cutler,
Nishanth Dikkala,
Nikhil Ghosh,
Rina Panigrahy,
Xin Wang
Abstract:
It has been well established that increasing scale in deep transformer networks leads to improved quality and performance. However, this increase in scale often comes with prohibitive increases in compute cost and inference latency. We introduce Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden. AltUp enables the widening of t…
▽ More
It has been well established that increasing scale in deep transformer networks leads to improved quality and performance. However, this increase in scale often comes with prohibitive increases in compute cost and inference latency. We introduce Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden. AltUp enables the widening of the learned representation, i.e., the token embedding, while only incurring a negligible increase in latency. AltUp achieves this by working on a subblock of the widened representation at each layer and using a predict-and-correct mechanism to update the inactivated blocks. We present extensions of AltUp, such as its applicability to the sequence dimension, and demonstrate how AltUp can be synergistically combined with existing approaches, such as Sparse Mixture-of-Experts models, to obtain efficient models with even higher capacity. Our experiments on benchmark transformer models and language tasks demonstrate the consistent effectiveness of AltUp on a diverse set of scenarios. Notably, on SuperGLUE and SQuAD benchmarks, AltUp enables up to $87\%$ speedup relative to the dense baselines at the same accuracy.
△ Less
Submitted 3 October, 2023; v1 submitted 30 January, 2023;
originally announced January 2023.
-
Weighted Distillation with Unlabeled Examples
Authors:
Fotis Iliopoulos,
Vasilis Kontonis,
Cenk Baykal,
Gaurav Menghani,
Khoa Trinh,
Erik Vee
Abstract:
Distillation with unlabeled examples is a popular and powerful method for training deep neural networks in settings where the amount of labeled data is limited: A large ''teacher'' neural network is trained on the labeled data available, and then it is used to generate labels on an unlabeled dataset (typically much larger in size). These labels are then utilized to train the smaller ''student'' mo…
▽ More
Distillation with unlabeled examples is a popular and powerful method for training deep neural networks in settings where the amount of labeled data is limited: A large ''teacher'' neural network is trained on the labeled data available, and then it is used to generate labels on an unlabeled dataset (typically much larger in size). These labels are then utilized to train the smaller ''student'' model which will actually be deployed. Naturally, the success of the approach depends on the quality of the teacher's labels, since the student could be confused if trained on inaccurate data. This paper proposes a principled approach for addressing this issue based on a ''debiasing'' reweighting of the student's loss function tailored to the distillation training paradigm. Our method is hyper-parameter free, data-agnostic, and simple to implement. We demonstrate significant improvements on popular academic datasets and we accompany our results with a theoretical analysis which rigorously justifies the performance of our method in certain settings.
△ Less
Submitted 13 October, 2022;
originally announced October 2022.
-
Robust Active Distillation
Authors:
Cenk Baykal,
Khoa Trinh,
Fotis Iliopoulos,
Gaurav Menghani,
Erik Vee
Abstract:
Distilling knowledge from a large teacher model to a lightweight one is a widely successful approach for generating compact, powerful models in the semi-supervised learning setting where a limited amount of labeled data is available. In large-scale applications, however, the teacher tends to provide a large number of incorrect soft-labels that impairs student performance. The sheer size of the tea…
▽ More
Distilling knowledge from a large teacher model to a lightweight one is a widely successful approach for generating compact, powerful models in the semi-supervised learning setting where a limited amount of labeled data is available. In large-scale applications, however, the teacher tends to provide a large number of incorrect soft-labels that impairs student performance. The sheer size of the teacher additionally constrains the number of soft-labels that can be queried due to prohibitive computational and/or financial costs. The difficulty in achieving simultaneous \emph{efficiency} (i.e., minimizing soft-label queries) and \emph{robustness} (i.e., avoiding student inaccuracies due to incorrect labels) hurts the widespread application of knowledge distillation to many modern tasks. In this paper, we present a parameter-free approach with provable guarantees to query the soft-labels of points that are simultaneously informative and correctly labeled by the teacher. At the core of our work lies a game-theoretic formulation that explicitly considers the inherent trade-off between the informativeness and correctness of input instances. We establish bounds on the expected performance of our approach that hold even in worst-case distillation instances. We present empirical evaluations on popular benchmarks that demonstrate the improved distillation performance enabled by our work relative to that of state-of-the-art active learning and active distillation methods.
△ Less
Submitted 4 February, 2023; v1 submitted 3 October, 2022;
originally announced October 2022.
-
A Theoretical View on Sparsely Activated Networks
Authors:
Cenk Baykal,
Nishanth Dikkala,
Rina Panigrahy,
Cyrus Rashtchian,
Xin Wang
Abstract:
Deep and wide neural networks successfully fit very complex functions today, but dense models are starting to be prohibitively expensive for inference. To mitigate this, one promising direction is networks that activate a sparse subgraph of the network. The subgraph is chosen by a data-dependent routing function, enforcing a fixed mapping of inputs to subnetworks (e.g., the Mixture of Experts (MoE…
▽ More
Deep and wide neural networks successfully fit very complex functions today, but dense models are starting to be prohibitively expensive for inference. To mitigate this, one promising direction is networks that activate a sparse subgraph of the network. The subgraph is chosen by a data-dependent routing function, enforcing a fixed mapping of inputs to subnetworks (e.g., the Mixture of Experts (MoE) paradigm in Switch Transformers). However, prior work is largely empirical, and while existing routing functions work well in practice, they do not lead to theoretical guarantees on approximation ability. We aim to provide a theoretical explanation for the power of sparse networks. As our first contribution, we present a formal model of data-dependent sparse networks that captures salient aspects of popular architectures. We then introduce a routing function based on locality sensitive hashing (LSH) that enables us to reason about how well sparse networks approximate target functions. After representing LSH-based sparse networks with our model, we prove that sparse networks can match the approximation power of dense networks on Lipschitz functions. Applying LSH on the input vectors means that the experts interpolate the target function in different subregions of the input space. To support our theory, we define various datasets based on Lipschitz target functions, and we show that sparse networks give a favorable trade-off between number of active units and approximation quality.
△ Less
Submitted 8 August, 2022;
originally announced August 2022.
-
Bandit Sampling for Multiplex Networks
Authors:
Cenk Baykal,
Vamsi K. Potluru,
Sameena Shah,
Manuela M. Veloso
Abstract:
Graph neural networks have gained prominence due to their excellent performance in many classification and prediction tasks. In particular, they are used for node classification and link prediction which have a wide range of applications in social networks, biomedical data sets, and financial transaction graphs. Most of the existing work focuses primarily on the monoplex setting where we have acce…
▽ More
Graph neural networks have gained prominence due to their excellent performance in many classification and prediction tasks. In particular, they are used for node classification and link prediction which have a wide range of applications in social networks, biomedical data sets, and financial transaction graphs. Most of the existing work focuses primarily on the monoplex setting where we have access to a network with only a single type of connection between entities. However, in the multiplex setting, where there are multiple types of connections, or \emph{layers}, between entities, performance on tasks such as link prediction has been shown to be stronger when information from other connection types is taken into account. We propose an algorithm for scalable learning on multiplex networks with a large number of layers. The efficiency of our method is enabled by an online learning algorithm that learns how to sample relevant neighboring layers so that only the layers with relevant information are aggregated during training. This sampling differs from prior work, such as MNE, which aggregates information across \emph{all} layers and consequently leads to computational intractability on large networks. Our approach also improves on the recent layer sampling method of \textsc{DeePlex} in that the unsampled layers do not need to be trained, enabling further increases in efficiency.We present experimental results on both synthetic and real-world scenarios that demonstrate the practical effectiveness of our proposed approach.
△ Less
Submitted 7 February, 2022;
originally announced February 2022.
-
Graph Belief Propagation Networks
Authors:
Junteng Jia,
Cenk Baykal,
Vamsi K. Potluru,
Austin R. Benson
Abstract:
With the wide-spread availability of complex relational data, semi-supervised node classification in graphs has become a central machine learning problem. Graph neural networks are a recent class of easy-to-train and accurate methods for this problem that map the features in the neighborhood of a node to its label, but they ignore label correlation during inference and their predictions are diffic…
▽ More
With the wide-spread availability of complex relational data, semi-supervised node classification in graphs has become a central machine learning problem. Graph neural networks are a recent class of easy-to-train and accurate methods for this problem that map the features in the neighborhood of a node to its label, but they ignore label correlation during inference and their predictions are difficult to interpret. On the other hand, collective classification is a traditional approach based on interpretable graphical models that explicitly model label correlations. Here, we introduce a model that combines the advantages of these two approaches, where we compute the marginal probabilities in a conditional random field, similar to collective classification, and the potentials in the random field are learned through end-to-end training, akin to graph neural networks. In our model, potentials on each node only depend on that node's features, and edge potentials are learned via a coupling matrix. This structure enables simple training with interpretable parameters, scales to large networks, naturally incorporates training labels at inference, and is often more accurate than related approaches. Our approach can be viewed as either an interpretable message-passing graph neural network or a collective classification method with higher capacity and modernized training.
△ Less
Submitted 6 June, 2021;
originally announced June 2021.
-
Low-Regret Active learning
Authors:
Cenk Baykal,
Lucas Liebenwein,
Dan Feldman,
Daniela Rus
Abstract:
We develop an online learning algorithm for identifying unlabeled data points that are most informative for training (i.e., active learning). By formulating the active learning problem as the prediction with sleeping experts problem, we provide a regret minimization framework for identifying relevant data with respect to any given definition of informativeness. Motivated by the successes of ensemb…
▽ More
We develop an online learning algorithm for identifying unlabeled data points that are most informative for training (i.e., active learning). By formulating the active learning problem as the prediction with sleeping experts problem, we provide a regret minimization framework for identifying relevant data with respect to any given definition of informativeness. Motivated by the successes of ensembles in active learning, we define regret with respect to an omnipotent algorithm that has access to an infinity large ensemble. At the core of our work is an efficient algorithm for sleeping experts that is tailored to achieve low regret on easy instances while remaining resilient to adversarial ones. Low regret implies that we can be provably competitive with an ensemble method \emph{without the computational burden of having to train an ensemble}. This stands in contrast to state-of-the-art active learning methods that are overwhelmingly based on greedy selection, and hence cannot ensure good performance across problem instances with high amounts of noise. We present empirical results demonstrating that our method (i) instantiated with an informativeness measure consistently outperforms its greedy counterpart and (ii) reliably outperforms uniform sampling on real-world scenarios.
△ Less
Submitted 22 February, 2022; v1 submitted 6 April, 2021;
originally announced April 2021.
-
Lost in Pruning: The Effects of Pruning Neural Networks beyond Test Accuracy
Authors:
Lucas Liebenwein,
Cenk Baykal,
Brandon Carter,
David Gifford,
Daniela Rus
Abstract:
Neural network pruning is a popular technique used to reduce the inference costs of modern, potentially overparameterized, networks. Starting from a pre-trained network, the process is as follows: remove redundant parameters, retrain, and repeat while maintaining the same test accuracy. The result is a model that is a fraction of the size of the original with comparable predictive performance (tes…
▽ More
Neural network pruning is a popular technique used to reduce the inference costs of modern, potentially overparameterized, networks. Starting from a pre-trained network, the process is as follows: remove redundant parameters, retrain, and repeat while maintaining the same test accuracy. The result is a model that is a fraction of the size of the original with comparable predictive performance (test accuracy). Here, we reassess and evaluate whether the use of test accuracy alone in the terminating condition is sufficient to ensure that the resulting model performs well across a wide spectrum of "harder" metrics such as generalization to out-of-distribution data and resilience to noise. Across evaluations on varying architectures and data sets, we find that pruned networks effectively approximate the unpruned model, however, the prune ratio at which pruned networks achieve commensurate performance varies significantly across tasks. These results call into question the extent of \emph{genuine} overparameterization in deep learning and raise concerns about the practicability of deploying pruned networks, specifically in the context of safety-critical systems, unless they are widely evaluated beyond test accuracy to reliably predict their performance. Our code is available at https://github.com/lucaslie/torchprune.
△ Less
Submitted 4 March, 2021;
originally announced March 2021.
-
On Coresets for Support Vector Machines
Authors:
Murad Tukan,
Cenk Baykal,
Dan Feldman,
Daniela Rus
Abstract:
We present an efficient coreset construction algorithm for large-scale Support Vector Machine (SVM) training in Big Data and streaming applications. A coreset is a small, representative subset of the original data points such that a models trained on the coreset are provably competitive with those trained on the original data set. Since the size of the coreset is generally much smaller than the or…
▽ More
We present an efficient coreset construction algorithm for large-scale Support Vector Machine (SVM) training in Big Data and streaming applications. A coreset is a small, representative subset of the original data points such that a models trained on the coreset are provably competitive with those trained on the original data set. Since the size of the coreset is generally much smaller than the original set, our preprocess-then-train scheme has potential to lead to significant speedups when training SVM models. We prove lower and upper bounds on the size of the coreset required to obtain small data summaries for the SVM problem. As a corollary, we show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings. We evaluate the performance of our algorithm on real-world and synthetic data sets. Our experimental results reaffirm the favorable theoretical properties of our algorithm and demonstrate its practical effectiveness in accelerating SVM training.
△ Less
Submitted 15 February, 2020;
originally announced February 2020.
-
Provable Filter Pruning for Efficient Neural Networks
Authors:
Lucas Liebenwein,
Cenk Baykal,
Harry Lang,
Dan Feldman,
Daniela Rus
Abstract:
We present a provable, sampling-based approach for generating compact Convolutional Neural Networks (CNNs) by identifying and removing redundant filters from an over-parameterized network. Our algorithm uses a small batch of input data points to assign a saliency score to each filter and constructs an importance sampling distribution where filters that highly affect the output are sampled with cor…
▽ More
We present a provable, sampling-based approach for generating compact Convolutional Neural Networks (CNNs) by identifying and removing redundant filters from an over-parameterized network. Our algorithm uses a small batch of input data points to assign a saliency score to each filter and constructs an importance sampling distribution where filters that highly affect the output are sampled with correspondingly high probability. In contrast to existing filter pruning approaches, our method is simultaneously data-informed, exhibits provable guarantees on the size and performance of the pruned network, and is widely applicable to varying network architectures and data sets. Our analytical bounds bridge the notions of compressibility and importance of network structures, which gives rise to a fully-automated procedure for identifying and preserving filters in layers that are essential to the network's performance. Our experimental evaluations on popular architectures and data sets show that our algorithm consistently generates sparser and more efficient models than those constructed by existing filter pruning approaches.
△ Less
Submitted 23 March, 2020; v1 submitted 17 November, 2019;
originally announced November 2019.
-
SiPPing Neural Networks: Sensitivity-informed Provable Pruning of Neural Networks
Authors:
Cenk Baykal,
Lucas Liebenwein,
Igor Gilitschenski,
Dan Feldman,
Daniela Rus
Abstract:
We introduce a pruning algorithm that provably sparsifies the parameters of a trained model in a way that approximately preserves the model's predictive accuracy. Our algorithm uses a small batch of input points to construct a data-informed importance sampling distribution over the network's parameters, and adaptively mixes a sampling-based and deterministic pruning procedure to discard redundant…
▽ More
We introduce a pruning algorithm that provably sparsifies the parameters of a trained model in a way that approximately preserves the model's predictive accuracy. Our algorithm uses a small batch of input points to construct a data-informed importance sampling distribution over the network's parameters, and adaptively mixes a sampling-based and deterministic pruning procedure to discard redundant weights. Our pruning method is simultaneously computationally efficient, provably accurate, and broadly applicable to various network architectures and data distributions. Our empirical comparisons show that our algorithm reliably generates highly compressed networks that incur minimal loss in performance relative to that of the original network. We present experimental results that demonstrate our algorithm's potential to unearth essential network connections that can be trained successfully in isolation, which may be of independent interest.
△ Less
Submitted 14 March, 2021; v1 submitted 11 October, 2019;
originally announced October 2019.
-
Data-Dependent Coresets for Compressing Neural Networks with Applications to Generalization Bounds
Authors:
Cenk Baykal,
Lucas Liebenwein,
Igor Gilitschenski,
Dan Feldman,
Daniela Rus
Abstract:
We present an efficient coresets-based neural network compression algorithm that sparsifies the parameters of a trained fully-connected neural network in a manner that provably approximates the network's output. Our approach is based on an importance sampling scheme that judiciously defines a sampling distribution over the neural network parameters, and as a result, retains parameters of high impo…
▽ More
We present an efficient coresets-based neural network compression algorithm that sparsifies the parameters of a trained fully-connected neural network in a manner that provably approximates the network's output. Our approach is based on an importance sampling scheme that judiciously defines a sampling distribution over the neural network parameters, and as a result, retains parameters of high importance while discarding redundant ones. We leverage a novel, empirical notion of sensitivity and extend traditional coreset constructions to the application of compressing parameters. Our theoretical analysis establishes guarantees on the size and accuracy of the resulting compressed network and gives rise to generalization bounds that may provide new insights into the generalization properties of neural networks. We demonstrate the practical effectiveness of our algorithm on a variety of neural network configurations and real-world data sets.
△ Less
Submitted 17 May, 2019; v1 submitted 15 April, 2018;
originally announced April 2018.
-
Training Support Vector Machines using Coresets
Authors:
Cenk Baykal,
Lucas Liebenwein,
Wilko Schwarting
Abstract:
We present a novel coreset construction algorithm for solving classification tasks using Support Vector Machines (SVMs) in a computationally efficient manner. A coreset is a weighted subset of the original data points that provably approximates the original set. We show that coresets of size polylogarithmic in $n$ and polynomial in $d$ exist for a set of $n$ input points with $d$ features and pres…
▽ More
We present a novel coreset construction algorithm for solving classification tasks using Support Vector Machines (SVMs) in a computationally efficient manner. A coreset is a weighted subset of the original data points that provably approximates the original set. We show that coresets of size polylogarithmic in $n$ and polynomial in $d$ exist for a set of $n$ input points with $d$ features and present an $(ε,δ)$-FPRAS for constructing coresets for scalable SVM training. Our method leverages the insight that data points are often redundant and uses an importance sampling scheme based on the sensitivity of each data point to construct coresets efficiently. We evaluate the performance of our algorithm in accelerating SVM training against real-world data sets and compare our algorithm to state-of-the-art coreset approaches. Our empirical results show that our approach outperforms a state-of-the-art coreset approach and uniform sampling in enabling computational speedups while achieving low approximation error.
△ Less
Submitted 9 November, 2017; v1 submitted 12 August, 2017;
originally announced August 2017.
-
Detection of AQM on Paths using Machine Learning Methods
Authors:
Cenk Baykal,
Wilko Schwarting,
Alex Wallar
Abstract:
In this paper, we address the problem of determining whether a bottleneck router on a given network path is using an AQM or a drop-tail scheme. We assume that we are given a source-to-sink path of interest -along which a bottleneck router exists- and data regarding the Round-Trip Times (RTT) and Congestion Window (CWND) sizes with respect to this flow. We develop a reliable classification algorith…
▽ More
In this paper, we address the problem of determining whether a bottleneck router on a given network path is using an AQM or a drop-tail scheme. We assume that we are given a source-to-sink path of interest -along which a bottleneck router exists- and data regarding the Round-Trip Times (RTT) and Congestion Window (CWND) sizes with respect to this flow. We develop a reliable classification algorithm that solely uses RTT and CWND information pertaining to a single flow to classify the queuing scheme, Tail Drop or AQM, used by the bottleneck router. We evaluate our method and present results that demonstrate our algorithm's highly accurate classification ability across a wide array of complex network topologies and configurations.
△ Less
Submitted 7 July, 2017;
originally announced July 2017.