Search | arXiv e-print repository

How Effective are Self-Supervised Models for Contact Identification in Videos

Authors: Malitha Gunawardhana, Limalka Sadith, Liel David, Daniel Harari, Muhammad Haris Khan

Abstract: The exploration of video content via Self-Supervised Learning (SSL) models has unveiled a dynamic field of study, emphasizing both the complex challenges and unique opportunities inherent in this area. Despite the growing body of research, the ability of SSL models to detect physical contacts in videos remains largely unexplored, particularly the effectiveness of methods such as downstream supervi… ▽ More The exploration of video content via Self-Supervised Learning (SSL) models has unveiled a dynamic field of study, emphasizing both the complex challenges and unique opportunities inherent in this area. Despite the growing body of research, the ability of SSL models to detect physical contacts in videos remains largely unexplored, particularly the effectiveness of methods such as downstream supervision with linear probing or full fine-tuning. This work aims to bridge this gap by employing eight different convolutional neural networks (CNNs) based video SSL models to identify instances of physical contact within video sequences specifically. The Something-Something v2 (SSv2) and Epic-Kitchen (EK-100) datasets were chosen for evaluating these approaches due to the promising results on UCF101 and HMDB51, coupled with their limited prior assessment on SSv2 and EK-100. Additionally, these datasets feature diverse environments and scenarios, essential for testing the robustness and accuracy of video-based models. This approach not only examines the effectiveness of each model in recognizing physical contacts but also explores the performance in the action recognition downstream task. By doing so, valuable insights into the adaptability of SSL models in interpreting complex, dynamic visual information are contributed. △ Less

Submitted 1 August, 2024; originally announced August 2024.

Comments: 15 pages, 6 figures

arXiv:2403.02782 [pdf, other]

Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos

Authors: Kumaranage Ravindu Yasas Nagasinghe, Honglu Zhou, Malitha Gunawardhana, Martin Renqiang Min, Daniel Harari, Muhammad Haris Khan

Abstract: In this paper, we explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information av… ▽ More In this paper, we explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information available in the datasets, such as heavy intermediate visual observations, procedural names, or natural language step-by-step instructions, for features or supervision signals. However, the task remains formidable due to the implicit causal constraints in the sequencing of steps and the variability inherent in multiple feasible plans. To tackle these intricacies that previous efforts have overlooked, we propose to enhance the capabilities of the agent by infusing it with procedural knowledge. This knowledge, sourced from training procedure plans and structured as a directed weighted graph, equips the agent to better navigate the complexities of step sequencing and its potential variations. We coin our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data, effectively acting as a comprehensive textbook for the training domain. Experimental evaluations across three widely-used datasets under settings of varying complexity reveal that KEPP attains superior, state-of-the-art results while requiring only minimal supervision. △ Less

Submitted 15 June, 2024; v1 submitted 5 March, 2024; originally announced March 2024.

Comments: 8 pages, 6 figures, (supplementary material: 9 pages, 5 figures), accepted to CVPR 2024

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024 , Pages 18816-18826

arXiv:2110.08744 [pdf]

A model for full local image interpretation

Authors: Guy Ben-Yosef, Liav Assif, Daniel Harari, Shimon Ullman

Abstract: We describe a computational model of humans' ability to provide a detailed interpretation of components in a scene. Humans can identify in an image meaningful components almost everywhere, and identifying these components is an essential part of the visual process, and of understanding the surrounding scene and its potential meaning to the viewer. Detailed interpretation is beyond the scope of cur… ▽ More We describe a computational model of humans' ability to provide a detailed interpretation of components in a scene. Humans can identify in an image meaningful components almost everywhere, and identifying these components is an essential part of the visual process, and of understanding the surrounding scene and its potential meaning to the viewer. Detailed interpretation is beyond the scope of current models of visual recognition. Our model suggests that this is a fundamental limitation, related to the fact that existing models rely on feed-forward but limited top-down processing. In our model, a first recognition stage leads to the initial activation of class candidates, which is incomplete and with limited accuracy. This stage then triggers the application of class-specific interpretation and validation processes, which recover richer and more accurate interpretation of the visible scene. We discuss implications of the model for visual interpretation by humans and by computer vision models. △ Less

Submitted 17 October, 2021; originally announced October 2021.

Comments: Published in the Proceedings of the 37th Annual Meeting of the Cognitive Science Society (CogSci), 2015

Journal ref: https://cogsci.mindmodeling.org/2015/papers/0048/

arXiv:2109.13445 [pdf, other]

Emergent Neural Network Mechanisms for Generalization to Objects in Novel Orientations

Authors: Avi Cooper, Xavier Boix, Daniel Harari, Spandan Madan, Hanspeter Pfister, Tomotake Sasaki, Pawan Sinha

Abstract: The capability of Deep Neural Networks (DNNs) to recognize objects in orientations outside the distribution of the training data is not well understood. We present evidence that DNNs are capable of generalizing to objects in novel orientations by disseminating orientation-invariance obtained from familiar objects seen from many viewpoints. This capability strengthens when training the DNN with an… ▽ More The capability of Deep Neural Networks (DNNs) to recognize objects in orientations outside the distribution of the training data is not well understood. We present evidence that DNNs are capable of generalizing to objects in novel orientations by disseminating orientation-invariance obtained from familiar objects seen from many viewpoints. This capability strengthens when training the DNN with an increasing number of familiar objects, but only in orientations that involve 2D rotations of familiar orientations. We show that this dissemination is achieved via neurons tuned to common features between familiar and unfamiliar objects. These results implicate brain-like neural mechanisms for generalization. △ Less

Submitted 13 July, 2023; v1 submitted 27 September, 2021; originally announced September 2021.

arXiv:2006.05249 [pdf]

What takes the brain so long: Object recognition at the level of minimal images develops for up to seconds of presentation time

Authors: Hanna Benoni, Daniel Harari, Shimon Ullman

Abstract: Rich empirical evidence has shown that visual object recognition in the brain is fast and effortless, with relevant brain signals reported to start as early as 80 ms. Here we study the time trajectory of the recognition process at the level of minimal recognizable images (termed MIRC). These are images that can be recognized reliably, but in which a minute change of the image (reduction by either… ▽ More Rich empirical evidence has shown that visual object recognition in the brain is fast and effortless, with relevant brain signals reported to start as early as 80 ms. Here we study the time trajectory of the recognition process at the level of minimal recognizable images (termed MIRC). These are images that can be recognized reliably, but in which a minute change of the image (reduction by either size or resolution) has a drastic effect on recognition. Subjects were assigned to one of nine exposure conditions: 200, 500, 1000, 2000 ms with or without masking, as well as unlimited time. The subjects were not limited in time to respond after presentation. The results show that in the masked conditions, recognition rates develop gradually over an extended period, e.g. average of 18% for 200 ms exposure and 45% for 500 ms, increasing significantly with longer exposure even above 2 secs. When presented for unlimited time (until response), MIRC recognition rates were equivalent to the rates of full-object images presented for 50 ms followed by masking. What takes the brain so long to recognize such images? We discuss why processes involving eye-movements, perceptual decision-making and pattern completion are unlikely explanations. Alternatively, we hypothesize that MIRC recognition requires an extended top-down process complementing the feed-forward phase. △ Less

Submitted 9 June, 2020; originally announced June 2020.

Comments: 7 pages, 2 figures, 1 table

arXiv:1812.05455 [pdf]

Using Motion and Internal Supervision in Object Recognition

Authors: Daniel Harari

Abstract: In this thesis we address two related aspects of visual object recognition: the use of motion information, and the use of internal supervision, to help unsupervised learning. These two aspects are inter-related in the current study, since image motion is used for internal supervision, via the detection of spatiotemporal events of active-motion and the use of tracking. Most current work in object r… ▽ More In this thesis we address two related aspects of visual object recognition: the use of motion information, and the use of internal supervision, to help unsupervised learning. These two aspects are inter-related in the current study, since image motion is used for internal supervision, via the detection of spatiotemporal events of active-motion and the use of tracking. Most current work in object recognition deals with static images during both learning and recognition. In contrast, we are interested in a dynamic scene where visual processes, such as detecting motion events and tracking, contribute spatiotemporal information, which is useful for object attention, motion segmentation, 3-D understanding and object interactions. We explore the use of these sources of information in both learning and recognition processes. In the first part of the work, we demonstrate how motion can be used for adaptive detection of object-parts in dynamic environments, while automatically learning new object appearances and poses. In the second and main part of the study we develop methods for using specific types of visual motion to solve two difficult problems in unsupervised visual learning: learning to recognize hands by their appearance and by context, and learning to extract direction of gaze. We use our conclusions in this part to propose a model for several aspects of learning by human infants from their visual environment. △ Less

Submitted 13 December, 2018; originally announced December 2018.

Comments: PhD dissertation, 87 pages, 51 figures, 7 tables

arXiv:1804.04604 [pdf, other]

Discovery and usage of joint attention in images

Authors: Daniel Harari, Joshua B. Tenenbaum, Shimon Ullman

Abstract: Joint visual attention is characterized by two or more individuals looking at a common target at the same time. The ability to identify joint attention in scenes, the people involved, and their common target, is fundamental to the understanding of social interactions, including others' intentions and goals. In this work we deal with the extraction of joint attention events, and the use of such eve… ▽ More Joint visual attention is characterized by two or more individuals looking at a common target at the same time. The ability to identify joint attention in scenes, the people involved, and their common target, is fundamental to the understanding of social interactions, including others' intentions and goals. In this work we deal with the extraction of joint attention events, and the use of such events for image descriptions. The work makes two novel contributions. First, our extraction algorithm is the first which identifies joint visual attention in single static images. It computes 3D gaze direction, identifies the gaze target by combining gaze direction with a 3D depth map computed for the image, and identifies the common gaze target. Second, we use a human study to demonstrate the sensitivity of humans to joint attention, suggesting that the detection of such a configuration in an image can be useful for understanding the image, including the goals of the agents and their joint activity, and therefore can contribute to image captioning and related tasks. △ Less

Submitted 10 April, 2018; originally announced April 2018.

Comments: 6 pages, 3 figures

arXiv:1804.03576 [pdf, other]

Large Field and High Resolution: Detecting Needle in Haystack

Authors: Hadar Gorodissky, Daniel Harari, Shimon Ullman

Abstract: The growing use of convolutional neural networks (CNN) for a broad range of visual tasks, including tasks involving fine details, raises the problem of applying such networks to a large field of view, since the amount of computations increases significantly with the number of pixels. To deal effectively with this difficulty, we develop and compare methods of using CNNs for the task of small target… ▽ More The growing use of convolutional neural networks (CNN) for a broad range of visual tasks, including tasks involving fine details, raises the problem of applying such networks to a large field of view, since the amount of computations increases significantly with the number of pixels. To deal effectively with this difficulty, we develop and compare methods of using CNNs for the task of small target localization in natural images, given a limited "budget" of samples to form an image. Inspired in part by human vision, we develop and compare variable sampling schemes, with peak resolution at the center and decreasing resolution with eccentricity, applied iteratively by re-centering the image at the previous predicted target location. The results indicate that variable resolution models significantly outperform constant resolution models. Surprisingly, variable resolution models and in particular multi-channel models, outperform the optimal, "budget-free" full-resolution model, using only 5\% of the samples. △ Less

Submitted 10 April, 2018; originally announced April 2018.

Comments: 15 pages, 7 figures

arXiv:1611.09819 [pdf]

Measuring and modeling the perception of natural and unconstrained gaze in humans and machines

Authors: Daniel Harari, Tao Gao, Nancy Kanwisher, Joshua Tenenbaum, Shimon Ullman

Abstract: Humans are remarkably adept at interpreting the gaze direction of other individuals in their surroundings. This skill is at the core of the ability to engage in joint visual attention, which is essential for establishing social interactions. How accurate are humans in determining the gaze direction of others in lifelike scenes, when they can move their heads and eyes freely, and what are the sourc… ▽ More Humans are remarkably adept at interpreting the gaze direction of other individuals in their surroundings. This skill is at the core of the ability to engage in joint visual attention, which is essential for establishing social interactions. How accurate are humans in determining the gaze direction of others in lifelike scenes, when they can move their heads and eyes freely, and what are the sources of information for the underlying perceptual processes? These questions pose a challenge from both empirical and computational perspectives, due to the complexity of the visual input in real-life situations. Here we measure empirically human accuracy in perceiving the gaze direction of others in lifelike scenes, and study computationally the sources of information and representations underlying this cognitive capacity. We show that humans perform better in face-to-face conditions compared with recorded conditions, and that this advantage is not due to the availability of input dynamics. We further show that humans are still performing well when only the eyes-region is visible, rather than the whole face. We develop a computational model, which replicates the pattern of human performance, including the finding that the eyes-region contains on its own, the required information for estimating both head orientation and direction of gaze. Consistent with neurophysiological findings on task-specific face regions in the brain, the learned computational representations reproduce perceptual effects such as the Wollaston illusion, when trained to estimate direction of gaze, but not when trained to recognize objects or faces. △ Less

Submitted 29 November, 2016; originally announced November 2016.

Comments: Daniel Harari and Tao Gao contributed equally to this work

Report number: Center for Brains, Minds and Machines Memo No. 059

arXiv:1610.09625 [pdf]

doi 10.1016/j.cognition.2018.11.001

Discovering containment: from infants to machines

Authors: Shimon Ullman, Nimrod Dorfman, Daniel Harari

Abstract: Current artificial learning systems can recognize thousands of visual categories, or play Go at a champion"s level, but cannot explain infants learning, in particular the ability to learn complex concepts without guidance, in a specific order. A notable example is the category of 'containers' and the notion of containment, one of the earliest spatial relations to be learned, starting already at 2.… ▽ More Current artificial learning systems can recognize thousands of visual categories, or play Go at a champion"s level, but cannot explain infants learning, in particular the ability to learn complex concepts without guidance, in a specific order. A notable example is the category of 'containers' and the notion of containment, one of the earliest spatial relations to be learned, starting already at 2.5 months, and preceding other common relations (e.g., support). Such spontaneous unsupervised learning stands in contrast with current highly successful computational models, which learn in a supervised manner, that is, by using large data sets of labeled examples. How can meaningful concepts be learned without guidance, and what determines the trajectory of infant learning, making some notions appear consistently earlier than others? △ Less

Submitted 30 October, 2016; originally announced October 2016.

Journal ref: Cognition 183 (2019) 67-81

arXiv:1603.08079 [pdf, other]

Do You See What I Mean? Visual Resolution of Linguistic Ambiguities

Authors: Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz, Shimon Ullman

Abstract: Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, r… ▽ More Understanding language goes hand in hand with the ability to integrate complex contextual information obtained via perception. In this work, we present a novel task for grounded language understanding: disambiguating a sentence given a visual scene which depicts one of the possible interpretations of that sentence. To this end, we introduce a new multimodal corpus containing ambiguous sentences, representing a wide range of syntactic, semantic and discourse ambiguities, coupled with videos that visualize the different interpretations for each sentence. We address this task by extending a vision model which determines if a sentence is depicted by a video. We demonstrate how such a model can be adjusted to recognize different interpretations of the same underlying sentence, allowing to disambiguate sentences in a unified fashion across the different ambiguity types. △ Less

Submitted 26 March, 2016; originally announced March 2016.

Comments: EMNLP 2015

Journal ref: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015, pages 1477--1487

arXiv:1412.2672 [pdf]

When Computer Vision Gazes at Cognition

Authors: Tao Gao, Daniel Harari, Joshua Tenenbaum, Shimon Ullman

Abstract: Joint attention is a core, early-developing form of social interaction. It is based on our ability to discriminate the third party objects that other people are looking at. While it has been shown that people can accurately determine whether another person is looking directly at them versus away, little is known about human ability to discriminate a third person gaze directed towards objects that… ▽ More Joint attention is a core, early-developing form of social interaction. It is based on our ability to discriminate the third party objects that other people are looking at. While it has been shown that people can accurately determine whether another person is looking directly at them versus away, little is known about human ability to discriminate a third person gaze directed towards objects that are further away, especially in unconstraint cases where the looker can move her head and eyes freely. In this paper we address this question by jointly exploring human psychophysics and a cognitively motivated computer vision model, which can detect the 3D direction of gaze from 2D face images. The synthesis of behavioral study and computer vision yields several interesting discoveries. (1) Human accuracy of discriminating targets 8°-10° of visual angle apart is around 40% in a free looking gaze task; (2) The ability to interpret gaze of different lookers vary dramatically; (3) This variance can be captured by the computational model; (4) Human outperforms the current model significantly. These results collectively show that the acuity of human joint attention is indeed highly impressive, given the computational challenge of the natural looking task. Moreover, the gap between human and model performance, as well as the variability of gaze interpretation across different lookers, require further understanding of the underlying mechanisms utilized by humans for this challenging task. △ Less

Submitted 8 December, 2014; originally announced December 2014.

Comments: Tao Gao and Daniel Harari contributed equally to this work

Report number: CBMM Memo No. 025, MIT

Showing 1–12 of 12 results for author: Harari, D