-
Accelerating Giant Impact Simulations with Machine Learning
Authors:
Caleb Lammers,
Miles Cranmer,
Sam Hadden,
Shirley Ho,
Norman Murray,
Daniel Tamayo
Abstract:
Constraining planet formation models based on the observed exoplanet population requires generating large samples of synthetic planetary systems, which can be computationally prohibitive. A significant bottleneck is simulating the giant impact phase, during which planetary embryos evolve gravitationally and combine to form planets, which may themselves experience later collisions. To accelerate giant impact simulations, we present a machine learning (ML) approach to predicting collisional outcomes in multiplanet systems. Trained on more than 500,000 $N$-body simulations of three-planet systems, we develop an ML model that can accurately predict which two planets will experience a collision, along with the state of the post-collision planets, from a short integration of the system's initial conditions. Our model greatly improves on non-ML baselines that rely on metrics from dynamics theory, which struggle to accurately predict which pair of planets will experience a collision. By combining with a model for predicting long-term stability, we create an efficient ML-based giant impact emulator, which can predict the outcomes of giant impact simulations with a speedup of up to four orders of magnitude. We expect our model to enable analyses that would not otherwise be computationally feasible. As such, we release our full training code, along with an easy-to-use API for our collision outcome model and giant impact emulator.
Submitted 16 August, 2024;
originally announced August 2024.
-
From Diagnostic CT to DTI Tractography labels: Using Deep Learning for Corticospinal Tract Injury Assessment and Outcome Prediction in Intracerebral Haemorrhage
Authors:
Olivia N Murray,
Hamied Haroon,
Paul Ryu,
Hiren Patel,
George Harston,
Marieke Wermer,
Wilmar Jolink,
Daniel Hanley,
Catharina Klijn,
Ulrike Hammerbeck,
Adrian Parry-Jones,
Timothy Cootes
Abstract:
The preservation of the corticospinal tract (CST) is key to good motor recovery after stroke. The gold standard method of assessing the CST with imaging is diffusion tensor tractography. However, this is not available for most intracerebral haemorrhage (ICH) patients. Non-contrast CT scans are routinely available in most ICH diagnostic pipelines, but delineating white matter from a CT scan is challenging. We utilise nnU-Net, trained on paired diagnostic CT scans and high-directional diffusion tractography maps, to segment the CST from diagnostic CT scans alone, and we show our model reproduces diffusion based tractography maps of the CST with a Dice similarity coefficient of 57%.
Surgical haematoma evacuation is sometimes performed after ICH, but published clinical trials to date show that whilst surgery reduces mortality, there is no evidence of improved functional recovery. Restricting surgery to patients with an intact CST may reveal a subset of patients for whom haematoma evacuation improves functional outcome. We investigated the clinical utility of our model in the MISTIE III clinical trial dataset. We found that our model's CST integrity measure significantly predicted outcome after ICH in the acute and chronic time frames, therefore providing a prognostic marker for patients to whom advanced diffusion tensor imaging is unavailable. This will allow for future probing of subgroups who may benefit from surgery.
Submitted 12 August, 2024;
originally announced August 2024.
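
For readers unfamiliar with the Dice similarity coefficient quoted in the abstract above, the following is a minimal sketch of how such an overlap score is commonly computed between a predicted and a reference binary mask. It is not the authors' evaluation code; the function name `dice_coefficient`, the flat-list mask representation, and the toy masks are assumptions made for illustration.

```python
def dice_coefficient(pred, target):
    """Dice similarity coefficient between two equal-length binary masks.

    Masks are flat sequences of 0/1 values (an assumed representation;
    real segmentation masks would be 3D arrays flattened beforehand).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    # Voxels labelled as foreground in both masks.
    overlap = sum(1 for p, t in zip(pred, target) if p and t)
    if pred_sum + target_sum == 0:
        # Both masks empty: treated as perfect agreement (a convention).
        return 1.0
    return 2.0 * overlap / (pred_sum + target_sum)


# Hypothetical toy masks, not data from the paper.
predicted = [0, 1, 1, 1, 0, 0]
reference = [0, 0, 1, 1, 1, 0]
score = dice_coefficient(predicted, reference)
```

A score of 1 means the two masks coincide exactly and 0 means they share no foreground voxels.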
-
Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation
Authors:
Dahyun Kang,
Piotr Koniusz,
Minsu Cho,
Naila Murray
Abstract:
We address the task of weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model is able to effectively learn to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with "mixed" supervision, where a small number of training images contain ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer that was trained using the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, in particular when few or no pixel-level labels are available.
Submitted 7 July, 2023;
originally announced July 2023.
-
Dungeons and Data: A Large-Scale NetHack Dataset
Authors:
Eric Hambro,
Roberta Raileanu,
Danielle Rothermel,
Vegard Mella,
Tim Rocktäschel,
Heinrich Küttler,
Naila Murray
Abstract:
Recent breakthroughs in the development of agents to solve challenging sequential decision making problems such as Go, StarCraft, or DOTA, have relied on both simulated environments and large-scale datasets. However, progress on this research has been hindered by the scarcity of open-sourced datasets and the prohibitive computational cost to work with them. Here we present the NetHack Learning Dataset (NLD), a large and highly-scalable dataset of trajectories from the popular game of NetHack, which is both extremely challenging for current methods and very fast to run. NLD consists of three parts: 10 billion state transitions from 1.5 million human trajectories collected on the NAO public NetHack server from 2009 to 2020; 3 billion state-action-score transitions from 100,000 trajectories collected from the symbolic bot winner of the NetHack Challenge 2021; and, accompanying code for users to record, load and stream any collection of such trajectories in a highly compressed form. We evaluate a wide range of existing algorithms including online and offline RL, as well as learning from demonstrations, showing that significant research advances are needed to fully leverage large-scale datasets for challenging sequential decision making tasks.
Submitted 24 November, 2023; v1 submitted 1 November, 2022;
originally announced November 2022.
-
Time-rEversed diffusioN tEnsor Transformer: A new TENET of Few-Shot Object Detection
Authors:
Shan Zhang,
Naila Murray,
Lei Wang,
Piotr Koniusz
Abstract:
In this paper, we tackle the challenging problem of Few-shot Object Detection. Existing FSOD pipelines (i) use average-pooled representations that result in information loss; and/or (ii) discard position information that can help detect object instances. Consequently, such pipelines are sensitive to large intra-class appearance and geometric variations between support and query images. To address these drawbacks, we propose a Time-rEversed diffusioN tEnsor Transformer (TENET), which i) forms high-order tensor representations that capture multi-way feature occurrences that are highly discriminative, and ii) uses a transformer that dynamically extracts correlations between the query image and the entire support set, instead of a single average-pooled support embedding. We also propose a Transformer Relation Head (TRH), equipped with higher-order representations, which encodes correlations between query regions and the entire support set, while being sensitive to the positional variability of object instances. Our model achieves state-of-the-art results on PASCAL VOC, FSOD, and COCO.
Submitted 30 October, 2022;
originally announced October 2022.
-
Analysis of Smooth Pursuit Assessment in Virtual Reality and Concussion Detection using BiLSTM
Authors:
Prithul Sarker,
Khondker Fariha Hossain,
Isayas Berhe Adhanom,
Philip K Pavilionis,
Nicholas G. Murray,
Alireza Tavakkoli
Abstract:
The sport-related concussion (SRC) battery relies heavily upon subjective symptom reporting in order to determine the diagnosis of a concussion. Unfortunately, athletes with SRC may return-to-play (RTP) too soon if they are untruthful about their symptoms. It is critical to provide accurate assessments that can overcome underreporting to prevent further injury. To lower the risk of injury, a more robust and precise method for detecting concussion is needed to produce reliable and objective results. In this paper, we propose a novel approach to detect SRC using long short-term memory (LSTM) recurrent neural network (RNN) architectures from oculomotor data. In particular, we propose a new error metric that incorporates mean squared error in different proportions. The experimental results on the smooth pursuit test of the VR-VOMS dataset suggest that the proposed approach can predict concussion symptoms with higher accuracy compared to symptom provocation on the vestibular ocular motor screening (VOMS).
Submitted 12 October, 2022;
originally announced October 2022.
-
Virtual-Reality based Vestibular Ocular Motor Screening for Concussion Detection using Machine-Learning
Authors:
Khondker Fariha Hossain,
Sharif Amit Kamran,
Prithul Sarker,
Philip Pavilionis,
Isayas Adhanom,
Nicholas Murray,
Alireza Tavakkoli
Abstract:
Sport-related concussion (SRC) depends on sensory information from visual, vestibular, and somatosensory systems. At the same time, the current clinical administration of Vestibular/Ocular Motor Screening (VOMS) is subjective and deviates among administrators. Therefore, for the assessment and management of concussion detection, standardization is required to lower the risk of injury and increase the validation among clinicians. With the advancement of technology, virtual reality (VR) can be utilized to advance the standardization of the VOMS, increasing the accuracy of testing administration and decreasing overall false positive rates. In this paper, we experimented with multiple machine learning methods to detect SRC on VR-generated data using VOMS. In our observation, the data generated from VR for smooth pursuit (SP) and the Visual Motion Sensitivity (VMS) tests are highly reliable for concussion detection. Furthermore, we train and evaluate these models, both qualitatively and quantitatively. Our findings show these models can reach high true-positive-rates of around 99.9 percent of symptom provocation on the VR stimuli-based VOMS vs. current clinical manual VOMS.
Submitted 12 October, 2022;
originally announced October 2022.
-
Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation
Authors:
M. Saquib Sarfraz,
Naila Murray,
Vivek Sharma,
Ali Diba,
Luc Van Gool,
Rainer Stiefelhagen
Abstract:
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos and is an important requirement for many video understanding tasks. For this and other video understanding tasks, supervised approaches have achieved encouraging performance but require a high volume of detailed frame-level annotations. We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video. Our main finding is that representing a video with a 1-nearest neighbor graph by taking into account the time progression is sufficient to form semantically and temporally consistent clusters of frames where each cluster may represent some action in the video. Additionally, we establish strong unsupervised baselines for action segmentation and show significant performance improvements over published unsupervised methods on five challenging action segmentation datasets. Our code is available at https://github.com/ssarfraz/FINCH-Clustering/tree/master/TW-FINCH
Submitted 27 March, 2021; v1 submitted 20 March, 2021;
originally announced March 2021.
-
QUALINET White Paper on Definitions of Immersive Media Experience (IMEx)
Authors:
Andrew Perkis,
Christian Timmerer,
Sabina Baraković,
Jasmina Baraković Husić,
Søren Bech,
Sebastian Bosse,
Jean Botev,
Kjell Brunnström,
Luis Cruz,
Katrien De Moor,
Andrea de Polo Saibanti,
Wouter Durnez,
Sebastian Egger-Lampl,
Ulrich Engelke,
Tiago H. Falk,
Jesús Gutiérrez,
Asim Hameed,
Andrew Hines,
Tanja Kojic,
Dragan Kukolj,
Eirini Liotou,
Dragorad Milovanovic,
Sebastian Möller,
Niall Murray,
Babak Naderi,
et al. (19 additional authors not shown)
Abstract:
With the coming of age of virtual/augmented reality and interactive media, numerous definitions, frameworks, and models of immersion have emerged across different fields ranging from computer graphics to literary works. Immersion is oftentimes used interchangeably with presence as both concepts are closely related. However, there are noticeable interdisciplinary differences regarding definitions, scope, and constituents that are required to be addressed so that a coherent understanding of the concepts can be achieved. Such consensus is vital for paving the directionality of the future of immersive media experiences (IMEx) and all related matters. The aim of this white paper is to provide a survey of definitions of immersion and presence which leads to a definition of immersive media experience (IMEx). The Quality of Experience (QoE) for immersive media is described by establishing a relationship between the concepts of QoE and IMEx followed by application areas of immersive media experience. Influencing factors on immersive media experience are elaborated as well as the assessment of immersive media experience. Finally, standardization activities related to IMEx are highlighted and the white paper is concluded with an outlook related to future developments.
Submitted 24 November, 2020; v1 submitted 10 June, 2020;
originally announced July 2020.
-
Automated Quantification of CT Patterns Associated with COVID-19 from Chest CT
Authors:
Shikha Chaganti,
Abishek Balachandran,
Guillaume Chabin,
Stuart Cohen,
Thomas Flohr,
Bogdan Georgescu,
Philippe Grenier,
Sasa Grbic,
Siqi Liu,
François Mellot,
Nicolas Murray,
Savvas Nicolaou,
William Parker,
Thomas Re,
Pina Sanelli,
Alexander W. Sauter,
Zhoubing Xu,
Youngjin Yoo,
Valentin Ziebandt,
Dorin Comaniciu
Abstract:
Purpose: To present a method that automatically segments and quantifies abnormal CT patterns commonly present in coronavirus disease 2019 (COVID-19), namely ground glass opacities and consolidations. Materials and Methods: In this retrospective study, the proposed method takes as input a non-contrasted chest CT and segments the lesions, lungs, and lobes in three dimensions, based on a dataset of 9749 chest CT volumes. The method outputs two combined measures of the severity of lung and lobe involvement, quantifying both the extent of COVID-19 abnormalities and presence of high opacities, based on deep learning and deep reinforcement learning. The first measure of (PO, PHO) is global, while the second of (LSS, LHOS) is lobewise. Evaluation of the algorithm is reported on CTs of 200 participants (100 COVID-19 confirmed patients and 100 healthy controls) from institutions from Canada, Europe and the United States collected between 2002-Present (April, 2020). Ground truth is established by manual annotations of lesions, lungs, and lobes. Correlation and regression analyses were performed to compare the prediction to the ground truth. Results: Pearson correlation coefficient between method prediction and ground truth for COVID-19 cases was calculated as 0.92 for PO (P < .001), 0.97 for PHO(P < .001), 0.91 for LSS (P < .001), 0.90 for LHOS (P < .001). 98 of 100 healthy controls had a predicted PO of less than 1%, 2 had between 1-2%. Automated processing time to compute the severity scores was 10 seconds per case compared to 30 minutes required for manual annotations. Conclusion: A new method segments regions of CT abnormalities associated with COVID-19 and computes (PO, PHO), as well as (LSS, LHOS) severity scores.
Submitted 18 November, 2020; v1 submitted 2 April, 2020;
originally announced April 2020.
-
Virtual KITTI 2
Authors:
Yohann Cabon,
Naila Murray,
Martin Humenberger
Abstract:
This paper introduces an updated version of the well-known Virtual KITTI dataset which consists of 5 sequence clones from the KITTI tracking benchmark. In addition, the dataset provides different variants of these sequences such as modified weather conditions (e.g. fog, rain) or modified camera configurations (e.g. rotated by 15 degrees). For each sequence, we provide multiple sets of images containing RGB, depth, class segmentation, instance segmentation, flow, and scene flow data. Camera parameters and poses as well as vehicle locations are available as well. In order to showcase some of the dataset's capabilities, we ran multiple relevant experiments using state-of-the-art algorithms from the field of autonomous driving. The dataset is available for download at https://europe.naverlabs.com/Research/Computer-Vision/Proxy-Virtual-Worlds.
Submitted 29 January, 2020;
originally announced January 2020.
-
Generating Human Action Videos by Coupling 3D Game Engines and Probabilistic Graphical Models
Authors:
César Roberto de Souza,
Adrien Gaidon,
Yohann Cabon,
Naila Murray,
Antonio Manuel López
Abstract:
Deep video action recognition models have been highly successful in recent years but require large quantities of manually annotated data, which are expensive and laborious to obtain. In this work, we investigate the generation of synthetic training data for video action recognition, as synthetic data have been successfully used to supervise models for a variety of other computer vision tasks. We propose an interpretable parametric generative model of human action videos that relies on procedural generation, physics models and other components of modern game engines. With this model we generate a diverse, realistic, and physically plausible dataset of human action videos, called PHAV for "Procedural Human Action Videos". PHAV contains a total of 39,982 videos, with more than 1,000 examples for each of 35 action categories. Our video generation approach is not limited to existing motion capture sequences: 14 of these 35 categories are procedurally defined synthetic actions. In addition, each video is represented with 6 different data modalities, including RGB, optical flow and pixel-level semantic labels. These modalities are generated almost simultaneously using the Multiple Render Targets feature of modern GPUs. In order to leverage PHAV, we introduce a deep multi-task (i.e. that considers action classes from multiple datasets) representation learning architecture that is able to simultaneously learn from synthetic and real video datasets, even when their action categories differ. Our experiments on the UCF-101 and HMDB-51 benchmarks suggest that combining our large set of synthetic videos with small real-world datasets can boost recognition performance. Our approach also significantly outperforms video representations produced by fine-tuning state-of-the-art unsupervised generative models of videos.
Submitted 12 October, 2019;
originally announced October 2019.
-
Eye-based Continuous Affect Prediction
Authors:
Jonny O'Dwyer,
Niall Murray,
Ronan Flynn
Abstract:
Eye-based information channels include the pupils, gaze, saccades, fixational movements, and numerous forms of eye opening and closure. Pupil size variation indicates cognitive load and emotion, while a person's gaze direction is said to be congruent with the motivation to approach or avoid stimuli. The eyelids are involved in facial expressions that can encode basic emotions. Additionally, eye-based cues can have implications for human annotators of emotions or feelings. Despite these facts, the use of eye-based cues in affective computing is in its infancy, and this work is intended to start to address this. Eye-based feature sets that can be estimated from video, incorporating data from all of the aforementioned information channels, are proposed. Feature set refinement is provided by way of continuous arousal and valence learning and prediction experiments on the RECOLA validation set. The eye-based features are then combined with a speech feature set to provide confirmation of their usefulness and assess affect prediction performance compared with group-of-humans-level performance on the RECOLA test set. The core contribution of this paper, a refined eye-based feature set, is shown to provide benefits for affect prediction. It is hoped that this work stimulates further research into eye-based affective computing.
Submitted 23 January, 2020; v1 submitted 23 July, 2019;
originally announced July 2019.
-
Affective computing using speech and eye gaze: a review and bimodal system proposal for continuous affect prediction
Authors:
Jonny O'Dwyer,
Niall Murray,
Ronan Flynn
Abstract:
Speech has been a widely used modality in the field of affective computing. Recently however, there has been a growing interest in the use of multi-modal affective computing systems. These multi-modal systems incorporate both verbal and non-verbal features for affective computing tasks. Such multi-modal affective computing systems are advantageous for emotion assessment of individuals in audio-video communication environments such as teleconferencing, healthcare, and education. From a review of the literature, the use of eye gaze features extracted from video is a modality that has remained largely unexploited for continuous affect prediction. This work presents a review of the literature within the emotion classification and continuous affect prediction sub-fields of affective computing for both speech and eye gaze modalities. Additionally, continuous affect prediction experiments using speech and eye gaze modalities are presented. A baseline system is proposed using open source software, the performance of which is assessed on a publicly available audio-visual corpus. Further system performance is assessed in a cross-corpus and cross-lingual experiment. The experimental results suggest that eye gaze is an effective supportive modality for speech when used in a bimodal continuous affect prediction system. The addition of eye gaze to speech in a simple feature fusion framework yields a prediction improvement of 6.13% for valence and 1.62% for arousal.
Submitted 17 May, 2018;
originally announced May 2018.
-
End-to-End Saliency Mapping via Probability Distribution Prediction
Authors:
Saumya Jetley,
Naila Murray,
Eleonora Vig
Abstract:
Most saliency estimation methods aim to explicitly model low-level conspicuity cues such as edges or blobs and may additionally incorporate top-down cues using face or text detection. Data-driven methods for training saliency models using eye-fixation data are increasingly popular, particularly with the introduction of large-scale datasets and deep architectures. However, current methods in this latter paradigm use loss functions designed for classification or regression tasks whereas saliency estimation is evaluated on topographical maps. In this work, we introduce a new saliency map model which formulates a map as a generalized Bernoulli distribution. We then train a deep architecture to predict such maps using novel loss functions which pair the softmax activation function with measures designed to compute distances between probability distributions. We show in extensive experiments the effectiveness of such loss functions over standard ones on four public benchmark datasets, and demonstrate improved performance over state-of-the-art saliency methods.
Submitted 5 April, 2018;
originally announced April 2018.
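
The idea in the abstract above, treating a saliency map as a probability distribution over image locations and training with a loss that pairs softmax with a distance between distributions, can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of KL divergence as the distance, the function names, and the four-pixel toy map are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Turn raw per-pixel scores into a distribution over all pixels."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    KL is one possible distance between distributions; it is assumed
    here and is not necessarily the measure used in the paper.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical 2x2 map flattened to four pixels.
fixation_counts = [6, 2, 1, 1]                 # assumed eye-fixation counts
target = [c / sum(fixation_counts) for c in fixation_counts]
predicted = softmax([1.5, 0.3, -0.2, -0.2])    # assumed network outputs
loss = kl_divergence(target, predicted)
```

Because the softmax runs over every pixel of the map rather than over class labels, the prediction is normalised jointly across locations, which is what lets a distribution-level distance serve as the training loss.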
-
Continuous Affect Prediction Using Eye Gaze and Speech
Authors:
Jonny O'Dwyer,
Ronan Flynn,
Niall Murray
Abstract:
Affective computing research traditionally focused on labeling a person's emotion as one of a discrete number of classes e.g. happy or sad. In recent times, more attention has been given to continuous affect prediction across dimensions in the emotional space, e.g. arousal and valence. Continuous affect prediction is the task of predicting a numerical value for different emotion dimensions. The ap…
▽ More
Affective computing research traditionally focused on labeling a person's emotion as one of a discrete number of classes e.g. happy or sad. In recent times, more attention has been given to continuous affect prediction across dimensions in the emotional space, e.g. arousal and valence. Continuous affect prediction is the task of predicting a numerical value for different emotion dimensions. The application of continuous affect prediction is powerful in domains involving real-time audio-visual communications which could include remote or assistive technologies for psychological assessment of subjects. Modalities used for continuous affect prediction may include speech, facial expressions and physiological responses. As opposed to single modality analysis, the research community have combined multiple modalities to improve the accuracy of continuous affect prediction. In this context, this paper investigates a continuous affect prediction system using the novel combination of speech and eye gaze. A new eye gaze feature set is proposed. This novel approach uses open source software for real-time affect prediction in audio-visual communication environments. A unique advantage of the human-computer interface used here is that it does not require the subject to wear specialized and expensive eye-tracking headsets or intrusive devices. The results indicate that the combination of speech and eye gaze improves arousal prediction by 3.5% and valence prediction by 19.5% compared to using speech alone.
Submitted 5 March, 2018;
originally announced March 2018.
-
Continuous Affect Prediction using Eye Gaze
Authors:
Jonny O'Dwyer,
Ronan Flynn,
Niall Murray
Abstract:
In recent times, there has been significant interest in the machine recognition of human emotions, due to the suite of applications to which this knowledge can be applied. A number of different modalities, such as speech or facial expression, individually and with eye gaze, have been investigated by the affective computing research community to either classify the emotion (e.g. sad, happy, angry) or predict the continuous values of affective dimensions (e.g. valence, arousal, dominance) at each moment in time. Surprisingly, an extensive literature review shows that eye gaze as a unimodal input to a continuous affect prediction system has not been considered. In this context, this paper evaluates the use of eye gaze as a unimodal input to a continuous affect prediction system. The performance of continuous prediction of arousal and valence using eye gaze is compared with the performance of a speech system using the AVEC 2014 speech feature set. Using eye gaze as the single modality in a continuous affect prediction system produced a correlation result for valence prediction that is better than the correlation result obtained with the AVEC 2014 speech feature set. Furthermore, the eye gaze feature set proposed in this paper contains 98% fewer features than the AVEC 2014 feature set.
Submitted 5 March, 2018;
originally announced March 2018.
-
PTP: Path-specified Transport Protocol for Concurrent Multipath Transmission in Named Data Networks
Authors:
Yuhang Ye,
Brian Lee,
Ronan Flynn,
Niall Murray,
Guiming Fang,
Jianwen Cao,
Yuansong Qiao
Abstract:
Named Data Networking (NDN) is a promising Future Internet architecture to support content distribution. Its inherent addressless routing paradigm brings valuable characteristics that improve transmission robustness and efficiency, e.g. users can download content from multiple providers concurrently. However, multipath transmission in NDN is different from that in Multipath TCP, i.e. the "paths" in NDN are transparent to and uncontrollable by users. As a result, the user controls the traffic on all transmission paths as a whole, which leads to a noticeable problem of low bandwidth utilization. In particular, congestion on one path triggers traffic reduction on the other transmission paths, even though they are underutilized. Some solutions have been proposed that let routers balance the loads of different paths to avoid congesting a certain path prematurely. However, the complexity of obtaining an optimal load balancing solution (i.e. of solving a Multi-Commodity Flow problem) grows with network size, which limits universal NDN deployment. This paper introduces a compromise solution, the Path-specified Transport Protocol (PTP). PTP supports both the label switching and the addressless routing schemes. Specifically, the label switching scheme lets users precisely control the traffic on each transmission path, and the addressless routing scheme maintains the valuable feature of retrieving content from any provider to guarantee robustness. As the traffic on a transmission path can be explicitly controlled by consumers, load balancing is no longer needed in routers, which reduces the computational burden on routers and consequently increases system scalability. The experimental results show that PTP significantly increases the users' downloading rates and improves the network throughput.
Submitted 24 August, 2018; v1 submitted 8 February, 2018;
originally announced February 2018.
-
Re-ID done right: towards good practices for person re-identification
Authors:
Jon Almazan,
Bojana Gajic,
Naila Murray,
Diane Larlus
Abstract:
Training a deep architecture using a ranking loss has become standard for the person re-identification task. Increasingly, these deep architectures include additional components that leverage part detections, attribute predictions, pose estimators and other auxiliary information, in order to more effectively localize and align discriminative image regions. In this paper we adopt a different approach and carefully design each component of a simple deep architecture and, critically, the strategy for training it effectively for person re-identification. We extensively evaluate each design choice, leading to a list of good practices for person re-identification. By following these practices, our approach outperforms the state of the art, including more complex methods with auxiliary components, by large margins on four benchmark datasets. We also provide a qualitative analysis of our trained representation which indicates that, while compact, it is able to capture information from localized and discriminative regions, in a manner akin to an implicit attention mechanism.
Submitted 16 January, 2018;
originally announced January 2018.
-
A deep architecture for unified aesthetic prediction
Authors:
Naila Murray,
Albert Gordo
Abstract:
Image aesthetics has become an important criterion for visual content curation on social media sites and media content repositories. Previous work on aesthetic prediction models in the computer vision community has focused on aesthetic score prediction or binary image labeling. However, raw aesthetic annotations are in the form of score histograms and provide richer and more precise information than binary labels or mean scores. Consequently, in this work we focus on the rarely-studied problem of predicting aesthetic score distributions and propose a novel architecture and training procedure for our model. Our model achieves state-of-the-art results on the standard AVA large-scale benchmark dataset for three tasks: (i) aesthetic quality classification; (ii) aesthetic score regression; and (iii) aesthetic score distribution prediction, all while using one model trained only for the distribution prediction task. We also introduce a method to modify an image such that its predicted aesthetics changes, and use this modification to gain insight into our model.
Submitted 16 August, 2017;
originally announced August 2017.
-
MVP2P: Layer-Dependency-Aware Live MVC Video Streaming over Peer-to-Peer Networks
Authors:
Zhao Liu,
Niall Murray,
Brian Lee,
Enda Fallon,
Yuansong Qiao
Abstract:
Multiview video supports observing a scene from different viewpoints. The Joint Video Team (JVT) developed H.264/MVC to enhance the compression efficiency for multiview video; however, MVC encoded multiview video (MVC video) still requires high bitrates for transmission. This paper investigates live MVC video streaming over Peer-to-Peer (P2P) networks. The goal is to minimize the server bandwidth costs whilst ensuring high streaming quality to peers. MVC employs intra-view and inter-view prediction structures, which lead to a complicated layer dependency relationship. As the peers' outbound bandwidth is shared while supplying all the MVC video layers, the bandwidth allocation to one MVC layer affects the available outbound bandwidth of the other layers. To optimise the utilisation of the peers' outbound bandwidth for providing video layers, a maximum flow based model is proposed which considers the MVC video layer dependency and the layer supplying relationship between peers. Based on the model, a layer dependency aware live MVC video streaming method over a BitTorrent-like P2P network is proposed, named MVP2P. The key components of MVP2P include a chunk scheduling strategy and a peer selection strategy for receiving peers, and a bandwidth scheduling algorithm for supplying peers. To evaluate the efficiency of the proposed solution, MVP2P is compared with existing methods considering the constraints of peer bandwidth, peer numbers, view switching rates, and peer churn. The test results show that MVP2P significantly outperforms the existing methods.
Submitted 16 April, 2018; v1 submitted 25 July, 2017;
originally announced July 2017.
-
Interferences in match kernels
Authors:
Naila Murray,
Hervé Jégou,
Florent Perronnin,
Andrew Zisserman
Abstract:
We consider the design of an image representation that embeds and aggregates a set of local descriptors into a single vector. Popular representations of this kind include the bag-of-visual-words, the Fisher vector and the VLAD. When two such image representations are compared with the dot-product, the image-to-image similarity can be interpreted as a match kernel. In match kernels, one has to deal with interference, i.e. with the fact that even if two descriptors are unrelated, their matching score may contribute to the overall similarity.
We formalise this problem and propose two related solutions, both aimed at equalising the individual contributions of the local descriptors in the final representation. These methods modify the aggregation stage by including a set of per-descriptor weights. They differ by the objective function that is optimised to compute those weights. The first is a "democratisation" strategy that aims at equalising the relative importance of each descriptor in the set comparison metric. The second one involves equalising the match of a single descriptor to the aggregated vector.
These concurrent methods give a substantial performance boost over the state of the art in image search with short or mid-size vectors, as demonstrated by our experiments on standard public image retrieval benchmarks.
Submitted 24 November, 2016;
originally announced November 2016.
-
LEWIS: Latent Embeddings for Word Images and their Semantics
Authors:
Albert Gordo,
Jon Almazan,
Naila Murray,
Florent Perronnin
Abstract:
The goal of this work is to bring semantics into the tasks of text recognition and retrieval in natural images. Although text recognition and retrieval have received a lot of attention in recent years, previous works have focused on recognizing or retrieving exactly the same word used as a query, without taking the semantics into consideration.
In this paper, we ask the following question: can we predict semantic concepts directly from a word image, without explicitly trying to transcribe the word image or its characters at any point? For this goal we propose a convolutional neural network (CNN) with a weighted ranking loss objective that ensures that the concepts relevant to the query image are ranked ahead of those that are not relevant. This can also be interpreted as learning a Euclidean space where word images and concepts are jointly embedded. This model is learned in an end-to-end manner, from image pixels to semantic concepts, using a dataset of synthetically generated word images and concepts mined from a lexical database (WordNet). Our results show that, despite the complexity of the task, word images and concepts can indeed be associated with a high degree of accuracy.
Submitted 21 September, 2015;
originally announced September 2015.
-
Discovering beautiful attributes for aesthetic image analysis
Authors:
Luca Marchesotti,
Naila Murray,
Florent Perronnin
Abstract:
Aesthetic image analysis is the study and assessment of the aesthetic properties of images. Current computational approaches to aesthetic image analysis provide either accurate or interpretable results. To obtain both accuracy and interpretability by humans, we advocate the use of learned and nameable visual attributes as mid-level features. For this purpose, we propose to discover and learn the visual appearance of attributes automatically, using a recently introduced database, called AVA, which contains more than 250,000 images together with their aesthetic scores and textual comments given by photography enthusiasts. We provide a detailed analysis of these annotations as well as the context in which they were given. We then describe how these three key components of AVA - images, scores, and comments - can be effectively leveraged to learn visual attributes. Lastly, we show that these learned attributes can be successfully used in three applications: aesthetic quality prediction, image tagging and retrieval.
Submitted 16 December, 2014;
originally announced December 2014.
-
Generalized Max Pooling
Authors:
Naila Murray,
Florent Perronnin
Abstract:
State-of-the-art patch-based image representations involve a pooling operation that aggregates statistics computed from local descriptors. Standard pooling operations include sum- and max-pooling. Sum-pooling lacks discriminability because the resulting representation is strongly influenced by frequent yet often uninformative descriptors, but only weakly influenced by rare yet potentially highly-informative ones. Max-pooling equalizes the influence of frequent and rare descriptors but is only applicable to representations that rely on count statistics, such as the bag-of-visual-words (BOV) and its soft- and sparse-coding extensions. We propose a novel pooling mechanism that achieves the same effect as max-pooling but is applicable beyond the BOV and especially to the state-of-the-art Fisher Vector -- hence the name Generalized Max Pooling (GMP). It involves equalizing the similarity between each patch and the pooled representation, which is shown to be equivalent to re-weighting the per-patch statistics. We show on five public image classification benchmarks that the proposed GMP can lead to significant performance gains with respect to heuristic alternatives.
Submitted 2 June, 2014;
originally announced June 2014.