Search | arXiv e-print repository

CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models

Authors: Santiago Castro, Amir Ziai, Avneesh Saluja, Zhuoning Yuan, Rada Mihalcea

Abstract: Recent years have witnessed a significant increase in the performance of Vision and Language tasks. Foundational Vision-Language Models (VLMs), such as CLIP, have been leveraged in multiple settings and demonstrated remarkable performance across several tasks. Such models excel at object-centric recognition yet learn text representations that seem invariant to word order, failing to compose known concepts in novel ways. However, no evidence exists that any VLM, including large-scale single-stream models such as GPT-4V, identifies compositions successfully. In this paper, we introduce a framework to significantly improve the ability of existing models to encode compositional language, with over 10% absolute improvement on compositionality benchmarks, while maintaining or improving the performance on standard object-recognition and retrieval benchmarks. Our code and pre-trained models are publicly available at https://github.com/netflix/clove.

Submitted 29 February, 2024; v1 submitted 22 February, 2024; originally announced February 2024.
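The CLoVe abstract says CLIP-style text representations "seem invariant to word order". A minimal sketch of why that hurts compositionality, assuming a toy mean-pooled bag-of-words encoder (hypothetical vocabulary and random vectors, not CLIP's actual text encoder): two captions that reuse the same words in a different order collapse to the same embedding.

```python
import numpy as np

# Hypothetical toy vocabulary with fixed random word vectors (illustration only).
rng = np.random.default_rng(0)
VOCAB = ["a", "horse", "is", "eating", "grass"]
EMB = {w: rng.normal(size=8) for w in VOCAB}


def bag_of_words_encode(caption: str) -> np.ndarray:
    """Mean-pool word vectors; the result ignores token order by construction."""
    vectors = [EMB[w] for w in caption.lower().split()]
    return np.mean(vectors, axis=0)


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


a = bag_of_words_encode("a horse is eating grass")
b = bag_of_words_encode("grass is eating a horse")
print(round(cosine(a, b), 6))  # 1.0: both captions map to the same vector
```

Any encoder with this property cannot separate "a horse eating grass" from "grass eating a horse", which is the kind of failure compositionality benchmarks probe.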

arXiv:2402.06560 [pdf, other]

Video Annotator: A framework for efficiently building video classifiers using vision-language models and active learning

Authors: Amir Ziai, Aneesh Vartakavi

Abstract: High-quality and consistent annotations are fundamental to the successful development of robust machine learning models. Traditional data annotation methods are resource-intensive and inefficient, often leading to a reliance on third-party annotators who are not the domain experts. Hard samples, which are usually the most informative for model training, tend to be difficult to label accurately and consistently without business context. These can arise unpredictably during the annotation process, requiring a variable number of iterations and rounds of feedback, leading to unforeseen expenses and time commitments to guarantee quality. We posit that more direct involvement of domain experts, using a human-in-the-loop system, can resolve many of these practical challenges. We propose a novel framework we call Video Annotator (VA) for annotating, managing, and iterating on video classification datasets. Our approach offers a new paradigm for an end-user-centered model development process, enhancing the efficiency, usability, and effectiveness of video classifiers. Uniquely, VA allows for a continuous annotation process, seamlessly integrating data collection and model training. We leverage the zero-shot capabilities of vision-language foundation models combined with active learning techniques, and demonstrate that VA enables the efficient creation of high-quality models. VA achieves a median 6.8 point improvement in Average Precision relative to the most competitive baseline across a wide-ranging assortment of tasks.
We release a dataset with 153k labels across 56 video understanding tasks annotated by three professional video editors using VA, and also release code to replicate our experiments at: http://github.com/netflix/videoannotator.

Submitted 9 February, 2024; originally announced February 2024.

Comments: Submitted for review to KDD '24 (ADS Track)
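The Video Annotator results are reported in Average Precision (AP). The abstract does not state which AP variant is used, so the following is a sketch of the common non-interpolated definition: rank examples by score and average the precision at the rank of each positive. The function name is hypothetical.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Non-interpolated AP: mean precision at the rank of each positive example.

    Assumes binary labels (1 = positive) and at least one positive; ties in
    scores are broken by original order, which is a simplification.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("AP is undefined without positive labels")
    return precision_sum / hits


# Two positives ranked 1st and 3rd: (1/1 + 2/3) / 2 = 0.8333...
print(round(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]), 4))  # 0.8333
```

AP rewards classifiers that push positives to the top of the ranking, which matches the annotation use case where editors review the highest-scoring clips first.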

arXiv:2312.06868 [pdf, other]

RAFIC: Retrieval-Augmented Few-shot Image Classification

Authors: Hangfei Lin, Li Miao, Amir Ziai

Abstract: Few-shot image classification is the task of classifying unseen images to one of N mutually exclusive classes, using only a small number of training examples for each class. The limited availability of these examples (denoted as K) presents a significant challenge to classification accuracy in some cases. To address this, we have developed a method for augmenting the set of K with an additional set of A retrieved images. We call this system Retrieval-Augmented Few-shot Image Classification (RAFIC). Through a series of experiments, we demonstrate that RAFIC markedly improves performance of few-shot image classification across two challenging datasets. RAFIC consists of two main components: (a) a retrieval component which uses CLIP, LAION-5B, and faiss, in order to efficiently retrieve images similar to the supplied images, and (b) retrieval meta-learning, which learns to judiciously utilize the retrieved images. Code and data are available at github.com/amirziai/rafic.

Submitted 11 December, 2023; originally announced December 2023.
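RAFIC's retrieval component finds images similar to the K supplied images. A minimal sketch of that step, assuming the embeddings are already computed (e.g. by CLIP) and using brute-force cosine similarity in NumPy where the paper uses faiss over LAION-5B. Scoring against the mean support embedding is an assumption for illustration, and `retrieve_top_a` is a hypothetical name.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def retrieve_top_a(support: np.ndarray, pool: np.ndarray, a: int) -> np.ndarray:
    """Return indices of the `a` pool images most similar to the support set.

    support: (K, d) embeddings of the K labeled images of one class.
    pool:    (M, d) embeddings of the retrieval pool.
    Similarity is cosine similarity to the mean support embedding (an assumed rule).
    """
    query = l2_normalize(l2_normalize(support).mean(axis=0, keepdims=True))
    sims = (l2_normalize(pool) @ query.T).ravel()
    return np.argsort(-sims)[:a]


rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 512))  # stand-in for precomputed image embeddings
support = pool[[3, 7]] + 0.01 * rng.normal(size=(2, 512))  # K = 2 near-duplicates
print(retrieve_top_a(support, pool, a=5))
```

At LAION-5B scale an exhaustive scan like this is impractical, which is why an index library such as faiss is used; on L2-normalized vectors, inner-product search returns the same ranking as cosine similarity.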

arXiv:2308.00938 [pdf, other]

DPA Load Balancer: Load balancing for Data Parallel Actor-based systems

Authors: Ziheng Wang, Atem Aguer, Amir Ziai

Abstract: In this project we explore ways to dynamically load balance actors in a streaming framework. This is used to address input data skew that might lead to stragglers. We continuously monitor actors' input queue lengths for load, and redistribute inputs among reducers using consistent hashing if we detect stragglers. To ensure consistent processing post-redistribution, we adopt an approach that uses input forwarding combined with a state merge step at the end of the processing. We show that this approach can greatly alleviate stragglers for skewed data.

Submitted 2 August, 2023; originally announced August 2023.

Comments: 7 pages
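The DPA abstract combines queue-length monitoring with consistent hashing. A minimal sketch of those two pieces, under stated assumptions: a hash ring with virtual nodes per reducer, a straggler rule of "queue longer than 2x the median" (hypothetical threshold), and load shedding by removing some of the straggler's virtual nodes. The input forwarding and state merge steps from the abstract are not modeled, and all names are hypothetical.

```python
import bisect
import hashlib
from statistics import median


def _hash(value: str) -> int:
    # Stable 64-bit hash (Python's built-in hash() is randomized per process).
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes per reducer (hypothetical API)."""

    def __init__(self, reducers, vnodes=64):
        self.weights = {r: vnodes for r in reducers}
        self._rebuild()

    def _rebuild(self):
        self.ring = sorted(
            (_hash(f"{r}#{i}"), r) for r, n in self.weights.items() for i in range(n)
        )
        self.points = [p for p, _ in self.ring]

    def owner(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

    def shed_load(self, reducer: str, fraction: float = 0.5):
        # Drop some of the reducer's virtual nodes so neighbours absorb its keys.
        self.weights[reducer] = max(1, int(self.weights[reducer] * (1 - fraction)))
        self._rebuild()


def find_stragglers(queue_lengths: dict, factor: float = 2.0) -> list:
    """Flag reducers whose input queue exceeds `factor` x the median length."""
    m = median(queue_lengths.values())
    return [r for r, q in queue_lengths.items() if q > factor * max(m, 1)]


ring = HashRing(["r0", "r1", "r2", "r3"])
keys = [f"user-{i}" for i in range(10_000)]
before = sum(ring.owner(k) == "r2" for k in keys)
for straggler in find_stragglers({"r0": 10, "r1": 12, "r2": 95, "r3": 11}):
    ring.shed_load(straggler)
after = sum(ring.owner(k) == "r2" for k in keys)
print(before, after)  # expect fewer keys on r2 after shedding
```

The point of consistent hashing here is that only keys previously owned by the straggler change owner; every other key keeps its reducer, so little in-flight state has to move.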

arXiv:2212.02192 [pdf, other]

Niimpy: a toolbox for behavioral data analysis

Authors: A. Ikäheimonen, A. M. Triana, N. Luong, A. Ziaei, J. Rantaharju, R. Darst, T. Aledavood

Abstract: Behavioral studies using personal digital devices typically produce rich longitudinal datasets of mixed data types. These data provide information about the behavior of users of these devices in real-time and in the users' natural environments. Analyzing the data requires multidisciplinary expertise and dedicated software. Currently, no generalizable, device-agnostic, freely available software exists within the Python scientific computing ecosystem to preprocess and analyze such data. This paper introduces a Python package, Niimpy, for analyzing digital behavioral data. The Niimpy toolbox is a user-friendly open-source package that can quickly be expanded and adapted to specific research requirements. The toolbox facilitates the analysis phase by offering tools for preprocessing, extracting features, and exploring the data. It also aims to educate the user on behavioral data analysis and promotes open science practices. Over time, Niimpy will expand with extra data analysis features developed by the core group, new users, and developers. Niimpy can help the fast-growing number of researchers with diverse backgrounds who collect data from personal and consumer digital devices to systematically and efficiently analyze the data and extract useful information. This novel information is vital for answering research questions in various fields, from medicine to psychology, sociology, and others.

Submitted 5 December, 2022; originally announced December 2022.

arXiv:2210.05766 [pdf, other]

Match Cutting: Finding Cuts with Smooth Visual Transitions

Authors: Boris Chen, Amir Ziai, Rebecca Tucker, Yuchen Xie

Abstract: A match cut is a transition between a pair of shots that uses similar framing, composition, or action to fluidly bring the viewer from one scene to the next. Match cuts are frequently used in film, television, and advertising. However, finding shots that work together is a highly manual and time-consuming process that can take days. We propose a modular and flexible system to efficiently find high-quality match cut candidates starting from millions of shot pairs. We annotate and release a dataset of approximately 20k labeled pairs that we use to evaluate our system, using both classification and metric learning approaches that leverage a variety of image, video, audio, and audio-visual feature extractors. In addition, we release code and embeddings for reproducing our experiments at github.com/netflix/matchcut.

Submitted 11 October, 2022; originally announced October 2022.

arXiv:2203.02124 [pdf, other]

Abuse and Fraud Detection in Streaming Services Using Heuristic-Aware Machine Learning

Authors: Soheil Esmaeilzadeh, Negin Salajegheh, Amir Ziai, Jeff Boote

Abstract: This work presents a fraud and abuse detection framework for streaming services by modeling user streaming behavior. The goal is to discover anomalous and suspicious incidents and scale the investigation efforts by creating models that characterize the user behavior. We study the use of semi-supervised as well as supervised approaches for anomaly detection. In the semi-supervised approach, by leveraging only a set of authenticated anomaly-free data samples, we show the use of one-class classification algorithms as well as autoencoder deep neural networks for anomaly detection. In the supervised anomaly detection task, we present a so-called heuristic-aware data labeling strategy for creating labeled data samples. We carry out binary classification as well as multi-class multi-label classification tasks for not only detecting the anomalous samples but also identifying the underlying anomaly behavior(s) associated with each one. Finally, using a systematic feature importance study we provide insights into the underlying set of features that characterize different streaming fraud categories. To the best of our knowledge, this is the first paper to use machine learning methods for fraud and abuse detection in real-world scale streaming services.

Submitted 3 March, 2022; originally announced March 2022.

arXiv:2105.08447 [pdf]

Deep Active Contours Using Locally Controlled Distance Vector Flow

Authors: Parastoo Akbari, Atefeh Ziaei, Hamed Azarnoush

Abstract: The Active Contour Model (ACM) has been extensively used in computer vision and image processing. In recent studies, Convolutional Neural Networks (CNNs) have been combined with active contours replacing the user in the process of contour evolution and image segmentation to eliminate limitations associated with ACM's dependence on parameters of the energy functional and initialization. However, prior works did not aim for automatic initialization, which is addressed here. In addition to manual initialization, current methods are highly sensitive to initial location and fail to delineate borders accurately. We propose a fully automatic image segmentation method to address problems of manual initialization, insufficient capture range, and poor convergence to boundaries, in addition to the problem of assignment of energy functional parameters. We train two CNNs, which predict active contour weighting parameters and generate a ground truth mask to extract Distance Transform (DT) and an initialization circle. Distance transform is used to form a vector field pointing from each pixel of the image towards the closest point on the boundary, the size of which is equal to the Euclidean distance map. We evaluate our method on four publicly available datasets including two building instance segmentation datasets, Vaihingen and Bing huts, and two mammography image datasets, INBreast and DDSM-BCRP.
Our approach outperforms the latest research by 0.59 and 2.39 percent in mean Intersection-over-Union (mIoU), 7.38 and 8.62 percent in Boundary F-score (BoundF) for Vaihingen and Bing huts datasets, respectively. Dice similarity coefficient for the INBreast and DDSM-BCRP datasets is 94.23% and 90.89%, respectively, indicating our method is comparable to state-of-the-art frameworks.

Submitted 18 May, 2021; originally announced May 2021.

Comments: 22 pages with 12 figures
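The Deep Active Contours abstract describes a vector field that points from each pixel to the closest boundary point, with magnitude equal to the Euclidean distance map. A brute-force sketch for small masks, assuming the boundary is the set of foreground pixels with a 4-neighbour in the background (an assumed definition; the paper derives its mask from a CNN). Function names are hypothetical; for large images, `scipy.ndimage.distance_transform_edt` with `return_indices=True` computes nearest-point indices efficiently.

```python
import numpy as np


def boundary_pixels(mask: np.ndarray) -> np.ndarray:
    """(row, col) coordinates of foreground pixels with a 4-neighbour in the background."""
    padded = np.pad(mask.astype(bool), 1, constant_values=False)
    core = padded[1:-1, 1:-1]
    interior = (
        core
        & padded[:-2, 1:-1] & padded[2:, 1:-1]
        & padded[1:-1, :-2] & padded[1:-1, 2:]
    )
    return np.argwhere(core & ~interior)


def distance_vector_field(mask: np.ndarray):
    """For every pixel, the vector to the nearest boundary pixel and its length."""
    boundary = boundary_pixels(mask)                       # (B, 2)
    rows, cols = np.indices(mask.shape)
    pixels = np.stack([rows.ravel(), cols.ravel()], 1)     # (N, 2)
    diff = boundary[None, :, :] - pixels[:, None, :]       # (N, B, 2)
    dist = np.linalg.norm(diff, axis=2)                    # (N, B)
    nearest = dist.argmin(axis=1)
    vectors = diff[np.arange(len(pixels)), nearest].reshape(*mask.shape, 2)
    distance_map = dist.min(axis=1).reshape(mask.shape)
    return vectors, distance_map


mask = np.zeros((7, 7), dtype=int)
mask[2:5, 2:5] = 1  # a 3x3 square object
vectors, distance_map = distance_vector_field(mask)
print(distance_map[0, 0], vectors[0, 0])
```

The corner pixel (0, 0) points to the square's corner (2, 2), so its vector is (2, 2) and its length is the Euclidean distance, 2√2.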

arXiv:2011.06304 [pdf, other]

Machine Learning Interpretability Meets TLS Fingerprinting

Authors: Mahdi Jafari Siavoshani, Amir Hossein Khajepour, Amirmohammad Ziaei, Amir Ali Gatmiri, Ali Taheri

Abstract: Protecting users' privacy over the Internet is of great importance; however, it becomes harder and harder to maintain due to the increasing complexity of network protocols and components. Therefore, investigating and understanding how data is leaked from the information transmission platforms and protocols can lead us to a more secure environment. In this paper, we propose a framework to systematically find the most vulnerable information fields in a network protocol. To this end, focusing on the transport layer security (TLS) protocol, we perform different machine-learning-based fingerprinting attacks on the collected data from more than 70 domains (websites) to understand how and where this information leakage occurs in the TLS protocol. Then, by employing the interpretation techniques developed in the machine learning community and applying our framework, we find the most vulnerable information fields in the TLS protocol. Our findings demonstrate that the TLS handshake (which is mainly unencrypted), the TLS record length appearing in the TLS application data header, and the initialization vector (IV) field are among the most critical leaker parts in this protocol, respectively.

Submitted 12 September, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

arXiv:2010.01400 [pdf, other]

doi 10.1145/3599237

Joint Inference of Diffusion and Structure in Partially Observed Social Networks Using Coupled Matrix Factorization

Authors: Maryam Ramezani, Aryan Ahadinia, Amirmohammad Ziaei, Hamid R. Rabiee

Abstract: Access to complete data in large-scale networks is often infeasible. Therefore, the problem of missing data is a crucial and unavoidable issue in the analysis and modeling of real-world social networks. However, most of the research on different aspects of social networks does not consider this limitation. One effective way to solve this problem is to recover the missing data as a pre-processing step. In this paper, a model is learned from partially observed data to infer unobserved diffusion and structure networks. To jointly discover omitted diffusion activities and hidden network structures, we develop a probabilistic generative model called "DiffStru." The interrelations among links of nodes and cascade processes are utilized in the proposed method via learning coupled with low-dimensional latent factors. Besides inferring unseen data, latent factors such as community detection may also aid in network classification problems. We tested different missing data scenarios on simulated independent cascades over LFR networks and real datasets, including Twitter and MemeTracker. Experiments on these synthetic and real-world datasets show that the proposed method successfully detects invisible social behaviors, predicts links, and identifies latent features.

Submitted 22 March, 2023; v1 submitted 3 October, 2020; originally announced October 2020.
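The title refers to coupled matrix factorization: two partially observed matrices are factored with a shared set of latent factors, so observations in one help fill in the other. The sketch below is a generic masked least-squares version fitted by gradient descent, not the authors' DiffStru probabilistic model. The matrix roles (node-node structure A, cascade-node diffusion C), the function name, and all hyperparameters are assumptions for illustration.

```python
import numpy as np


def coupled_mf(A, A_mask, C, C_mask, rank=4, lr=0.01, reg=0.01, steps=2000, seed=0):
    """Jointly factor a node-node matrix A ~ U V^T and a cascade-node matrix C ~ W U^T.

    The node factors U are shared between both reconstructions. Masks are 1 for
    observed entries and 0 for missing ones; only observed entries enter the loss.
    Returns the factors and the data loss (without the L2 term) at every step.
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape[0], C.shape[0]
    U = 0.1 * rng.normal(size=(n, rank))
    V = 0.1 * rng.normal(size=(n, rank))
    W = 0.1 * rng.normal(size=(m, rank))
    losses = []
    for _ in range(steps):
        EA = A_mask * (U @ V.T - A)  # residuals on observed structure entries
        EC = C_mask * (W @ U.T - C)  # residuals on observed diffusion entries
        losses.append(0.5 * ((EA ** 2).sum() + (EC ** 2).sum()))
        gU = EA @ V + EC.T @ W + reg * U  # U receives gradient from both matrices
        gV = EA.T @ U + reg * V
        gW = EC @ U + reg * W
        U -= lr * gU
        V -= lr * gV
        W -= lr * gW
    return U, V, W, losses


rng = np.random.default_rng(1)
A = (rng.random((20, 20)) < 0.2).astype(float)      # toy adjacency matrix
C = (rng.random((10, 20)) < 0.3).astype(float)      # toy cascade participation
A_mask = (rng.random(A.shape) < 0.8).astype(float)  # 80% of entries observed
C_mask = (rng.random(C.shape) < 0.8).astype(float)
U, V, W, losses = coupled_mf(A, A_mask, C, C_mask)
A_hat = U @ V.T  # scores for unobserved links: read off where A_mask == 0
```

Sharing U is the coupling: a node that participates in many cascades gets latent factors that also shape its predicted links, which is the intuition behind inferring structure and diffusion jointly.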

arXiv:1906.02458 [pdf]

Kinematic & Dynamic Analysis of the Human Upper Limb Using the Theory of Screws

Authors: Amir Ziai

Abstract: Screw theory provides geometrical insight into the mechanics of rigid bodies. The screw axis is defined as the line coinciding with the joint axis. Line transformations in the form of a screw operator are used to determine the joint axes of a seven degree of freedom manipulator, representing the human upper limb. Multiplication of a unit screw axis with the joint angular velocity provides the joint twist. Instantaneous motion of a joint is the summation of the twists of the preceding joints and the joint twist itself. Inverse kinematics, velocities and accelerations are calculated using the screw Jacobian for a non-redundant six degree of freedom manipulator. Newton and Euler dynamic equations are then utilized to solve for the forward and inverse dynamic problems. Dynamics of the upper limb and the upper limb combined with an exoskeleton are only different due to the additional mass and inertia of the exoskeleton. Dynamic equations are crucial for controlling the exoskeleton in position and force.

Submitted 6 June, 2019; originally announced June 2019.
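The screw-theory abstract states that a joint twist is the unit screw axis times the joint angular velocity, and that a link's instantaneous motion is the sum of its own joint twist and those of all preceding joints. A minimal sketch for revolute (zero-pitch) joints, assuming all axes are expressed in one base frame at the current pose and using the Plücker form (s, r × s) for an axis with direction s through point r. The two-joint planar arm is a hypothetical example, not the paper's seven degree of freedom limb model.

```python
import numpy as np


def unit_screw(direction, point):
    """Plücker coordinates (s, r x s) of a zero-pitch screw (revolute joint axis)."""
    s = np.asarray(direction, dtype=float)
    s = s / np.linalg.norm(s)
    r = np.asarray(point, dtype=float)
    return np.concatenate([s, np.cross(r, s)])


def joint_twist(axis_screw, joint_rate):
    """Joint twist = unit screw axis scaled by the joint angular velocity."""
    return joint_rate * axis_screw


def cumulative_twists(axes, rates):
    """Twist of each link = its own joint twist plus all preceding joint twists."""
    twists = [joint_twist(a, w) for a, w in zip(axes, rates)]
    return np.cumsum(twists, axis=0)


def point_velocity(twist, p):
    """Linear velocity of point p on a body with twist [w; v_O]: v_O + w x p."""
    w, v_o = twist[:3], twist[3:]
    return v_o + np.cross(w, p)


# Hypothetical planar arm: joints about z at x = 0 and x = 1, end point at x = 2.
axes = [unit_screw([0, 0, 1], [0, 0, 0]), unit_screw([0, 0, 1], [1, 0, 0])]
twists = cumulative_twists(axes, rates=[1.0, 1.0])
v_end = point_velocity(twists[-1], np.array([2.0, 0.0, 0.0]))
print(v_end)  # [0. 3. 0.]
```

The result matches a direct check: joint 1 alone moves the end point at 2 units/s (lever arm 2) and joint 2 adds 1 unit/s (lever arm 1), both along +y.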

arXiv:1906.01843 [pdf, other]

Detecting Kissing Scenes in a Database of Hollywood Films

Authors: Amir Ziai

Abstract: Detecting scene types in a movie can be very useful for applications such as video editing, ratings assignment, and personalization. We propose a system for detecting kissing scenes in a movie. This system consists of two components. The first component is a binary classifier that predicts a binary label (i.e. kissing or not) given features extracted from both the still frames and audio waves of a one-second segment. The second component aggregates the binary labels for contiguous non-overlapping segments into a set of kissing scenes. We experimented with a variety of 2D and 3D convolutional architectures such as ResNet, DenseNet, and VGGish and developed a highly accurate kissing detector that achieves a validation F1 score of 0.95 on a diverse database of Hollywood films covering many genres and spanning multiple decades. The code for this project is available at http://github.com/amirziai/kissing-detector.

Submitted 5 June, 2019; originally announced June 2019.
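The second component of the kissing detector turns per-second binary labels into scenes. A minimal sketch of one way to do that aggregation, assuming runs of consecutive positive seconds form a scene; the `max_gap` and `min_length` smoothing knobs are hypothetical and not parameters reported in the paper.

```python
from typing import List, Tuple


def segments_to_scenes(
    labels: List[int], max_gap: int = 0, min_length: int = 1
) -> List[Tuple[int, int]]:
    """Merge per-second binary labels into (start_sec, end_sec) scenes, end exclusive.

    max_gap: runs separated by at most this many negative seconds are merged.
    min_length: scenes shorter than this many seconds are dropped.
    """
    runs = []
    start = None
    for t, label in enumerate(labels + [0]):  # sentinel closes a trailing run
        if label == 1 and start is None:
            start = t
        elif label != 1 and start is not None:
            runs.append((start, t))
            start = None
    scenes = []
    for s, e in runs:
        if scenes and s - scenes[-1][1] <= max_gap:
            scenes[-1] = (scenes[-1][0], e)  # bridge a short gap
        else:
            scenes.append((s, e))
    return [(s, e) for s, e in scenes if e - s >= min_length]


labels = [0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0]
print(segments_to_scenes(labels))  # [(1, 3), (4, 7), (10, 11)]
print(segments_to_scenes(labels, max_gap=1, min_length=2))  # [(1, 7)]
```

Bridging short gaps and dropping very short runs are common ways to suppress isolated classifier errors before reporting scenes.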

arXiv:1906.00529 [pdf]

Mining Data from the Congressional Record

Authors: Zhengyu Ma, Tianjiao Qi, James Route, Amir Ziai

Abstract: We propose a data storage and analysis method for using the US Congressional record as a policy analysis tool. We use Amazon Web Services and the Solr search engine to store and process Congressional record data from 1789 to the present, and then query Solr to find how frequently language related to tax increases and decreases appears. This frequency data is compared to six economic indicators. Our preliminary results indicate potential relationships between incidence of tax discussion and multiple indicators. We present our data storage and analysis procedures, as well as results from comparisons to all six indicators.

Submitted 2 June, 2019; originally announced June 2019.

arXiv:1905.11531 [pdf, other]

Compositional pre-training for neural semantic parsing

Authors: Amir Ziai

Abstract: Semantic parsing is the process of translating natural language utterances into logical forms, which has many important applications such as question answering and instruction following. Sequence-to-sequence models have been very successful across many NLP tasks. However, a lack of task-specific prior knowledge can be detrimental to the performance of these models. Prior work has used frameworks for inducing grammars over the training examples, which capture conditional independence properties that the model can leverage. Inspired by recent success stories such as BERT, we set out to extend this augmentation framework into two stages. The first stage is to pre-train using a corpus of augmented examples in an unsupervised manner. The second stage is to fine-tune to a domain-specific task. In addition, since the pre-training stage is separate from the training on the main task, we also expand the universe of possible augmentations without causing catastrophic interference. We also propose a novel data augmentation strategy that interchanges tokens that co-occur in similar contexts to produce new training pairs. We demonstrate that the proposed two-stage framework is beneficial for improving the parsing accuracy in a standard dataset called GeoQuery for the task of generating logical forms from a set of questions about the US geography.

Submitted 27 May, 2019; originally announced May 2019.

arXiv:1904.01555 [pdf]

Active Learning for Network Intrusion Detection

Authors: Amir Ziai

Abstract: Network operators are generally aware of common attack vectors that they defend against. For most networks the vast majority of traffic is legitimate. However, bad actors continually design and attempt new attack vectors, which bypass detection and go unnoticed due to low volume. One strategy for finding such activity is to look for anomalous behavior. Investigating anomalous behavior requires significant time and resources. Collecting a large number of labeled examples for training supervised models is both prohibitively expensive and subject to obsolescence as new attacks surface. A purely unsupervised methodology is ideal; however, research has shown that even a very small number of labeled examples can significantly improve the quality of anomaly detection. A methodology that minimizes the number of required labels while maximizing the quality of detection is desirable. False positives in this context result in wasted effort or blockage of legitimate traffic and false negatives translate to undetected attacks. We propose a general active learning framework and experiment with different choices of learners and sampling strategies.

Submitted 2 April, 2019; originally announced April 2019.
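The intrusion-detection abstract compares sampling strategies for deciding which traffic records an analyst should label next. A minimal sketch of two standard strategies, assuming some learner has already produced P(attack) for each unlabeled record: uncertainty sampling (query the records closest to 0.5) and a random baseline. These are generic strategies, not necessarily the exact ones evaluated in the paper, and the function names are hypothetical.

```python
import numpy as np


def uncertainty_sampling(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the `budget` unlabeled samples whose P(attack) is closest to 0.5."""
    margin = np.abs(probs - 0.5)
    return np.argsort(margin, kind="stable")[:budget]


def random_sampling(probs: np.ndarray, budget: int, seed: int = 0) -> np.ndarray:
    """Baseline: query labels for a uniformly random subset."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(probs), size=budget, replace=False)


probs = np.array([0.02, 0.48, 0.97, 0.55, 0.10, 0.70])  # scores on the unlabeled pool
print(uncertainty_sampling(probs, budget=2))  # [1 3]
```

In a full loop, the queried records are labeled, added to the training set, and the learner is retrained before the next round; the budget per round trades analyst effort against how quickly detection quality improves.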

Showing 1–15 of 15 results for author: Ziai, A