-
Gemini: A Family of Highly Capable Multimodal Models
Authors:
Gemini Team,
Rohan Anil,
Sebastian Borgeaud,
Jean-Baptiste Alayrac,
Jiahui Yu,
Radu Soricut,
Johan Schalkwyk,
Andrew M. Dai,
Anja Hauth,
Katie Millican,
David Silver,
Melvin Johnson,
Ioannis Antonoglou,
Julian Schrittwieser,
Amelia Glaese,
Jilin Chen,
Emily Pitler,
Timothy Lillicrap,
Angeliki Lazaridou,
Orhan Firat,
James Molloy,
Michael Isard,
Paul R. Barham,
Tom Hennigan,
Benjamin Lee
, et al. (1325 additional authors not shown)
Abstract:
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Submitted 17 June, 2024; v1 submitted 18 December, 2023;
originally announced December 2023.
-
On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes
Authors:
Rishabh Agarwal,
Nino Vieillard,
Yongchao Zhou,
Piotr Stanczyk,
Sabela Ramos,
Matthieu Geist,
Olivier Bachem
Abstract:
Knowledge distillation (KD) is widely used for compressing a teacher model to reduce its inference cost and memory footprint, by training a smaller student model. However, current KD methods for auto-regressive sequence models suffer from distribution mismatch between output sequences seen during training and those generated by the student during inference. To address this issue, we introduce Generalized Knowledge Distillation (GKD). Instead of solely relying on a fixed set of output sequences, GKD trains the student on its self-generated output sequences by leveraging feedback from the teacher on such sequences. Unlike supervised KD approaches, GKD also offers the flexibility to employ alternative loss functions between the student and teacher, which can be useful when the student lacks the expressivity to mimic the teacher's distribution. Furthermore, GKD facilitates the seamless integration of distillation with RL fine-tuning (RLHF). We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks, and task-agnostic distillation for instruction-tuning.
Submitted 16 January, 2024; v1 submitted 23 June, 2023;
originally announced June 2023.
-
Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback
Authors:
Paul Roit,
Johan Ferret,
Lior Shani,
Roee Aharoni,
Geoffrey Cideron,
Robert Dadashi,
Matthieu Geist,
Sertan Girgin,
Léonard Hussenot,
Orgad Keller,
Nikola Momchev,
Sabela Ramos,
Piotr Stanczyk,
Nino Vieillard,
Olivier Bachem,
Gal Elidan,
Avinatan Hassidim,
Olivier Pietquin,
Idan Szpektor
Abstract:
Despite the seeming success of contemporary grounded text generation systems, they often tend to generate factually inconsistent text with respect to their input. This phenomenon is emphasized in tasks like summarization, in which the generated summaries should be corroborated by their source article. In this work, we leverage recent progress on textual entailment models to directly address this problem for abstractive summarization systems. We use reinforcement learning with reference-free, textual entailment rewards to optimize for factual consistency and explore the ensuing trade-offs, as improved consistency may come at the cost of less informative or more extractive summaries. Our results, according to both automatic metrics and human evaluation, show that our method considerably improves the faithfulness, salience, and conciseness of the generated summaries.
Submitted 31 May, 2023;
originally announced June 2023.
-
The MEV Saga: Can Regulation Illuminate the Dark Forest?
Authors:
Simona Ramos,
Joshua Ellul
Abstract:
In this article, we develop an interdisciplinary analysis of MEV which aims to bridge the gap that exists between technical and legal research supporting policymakers in their regulatory decisions concerning blockchains, DeFi and associated risks. Consequently, this article is intended for both technical and legal audiences, and while we abstain from a detailed legal analysis, we aim to open a policy discussion regarding decentralized governance design at the block building layer as the place where MEV occurs. Maximal Extractable Value or MEV has been one of the major concerns in blockchain designs as it creates a centralizing force which ultimately affects user transactions. In this article, we dive into the technicalities behind MEV, where we explain the concept behind the novel Proposer-Builder Separation design as an effort by Flashbots to increase decentralization through modularity. We underline potential vulnerability factors under the PBS design, which open space for MEV extracting adversarial strategies by inside participants. We discuss the shift of trust from validators to builders in PoS blockchains such as Ethereum, acknowledging the impact that the latter may have on users' transactions (in terms of front running) and censorship resistance (in terms of transaction inclusion). We recognize that under PBS, centralized (dominant) entities such as builders could potentially harm users by extracting MEV via front running strategies. Finally, we suggest adequate design and policy measures which could potentially mitigate these negative effects while protecting blockchain users.
Submitted 2 May, 2023;
originally announced May 2023.
-
Watch the Gap: Making code more intelligible to users without sacrificing decentralization?
Authors:
Simona Ramos,
Morshed Mannan
Abstract:
The potential for blockchain technology to eliminate the middleman and replace the top-down hierarchical model of governance with a system of distributed cooperation has opened up many new opportunities, as well as dilemmas. Surpassing the level of acceptance by early tech adopters, the market of smart contracts is now moving towards wider acceptance from regular (non-tech) users. For this to happen, however, smart contract development will have to overcome certain technical and legal obstacles to bring the code and the user closer. Guided by notions from contract law and consumer protection, we highlight the information gap that exists between users, legal bodies and the source code. We present a spectrum of low-code to no-code initiatives that aim at bridging this gap, promising the potential of higher regulatory acceptance. Nevertheless, this highlights the so-called "Pitfall of the Trustless Dream", because arguably solutions to the information gap tend to make the system more centralized. In this article, we aim to make a practical contribution of relevance to the widespread adoption of smart contracts and their legal acceptance by analyzing the evolving practices that bring the user and the code closer.
Submitted 10 March, 2023;
originally announced April 2023.
-
Statistical Properties of the Entropy from Ordinal Patterns
Authors:
Eduarda T. C. Chagas,
Alejandro C. Frery,
Juliana Gambini,
Magdalena M. Lucini,
Heitor S. Ramos,
Andrea A. Rey
Abstract:
The ultimate purpose of the statistical analysis of ordinal patterns is to characterize the distribution of the features they induce. In particular, knowing the joint distribution of the pair Entropy-Statistical Complexity for a large class of time series models would allow statistical tests that are unavailable to date. Working in this direction, we characterize the asymptotic distribution of the empirical Shannon's Entropy for any model under which the true normalized Entropy is neither zero nor one. We obtain the asymptotic distribution from the Central Limit Theorem (assuming large time series), the Multivariate Delta Method, and a third-order correction of its mean value. We discuss the applicability of other results (exact, first-, and second-order corrections) regarding their accuracy and numerical stability. Within a general framework for building test statistics about Shannon's Entropy, we present a bilateral test that verifies if there is enough evidence to reject the hypothesis that two signals produce ordinal patterns with the same Shannon's Entropy. We applied this bilateral test to the daily maximum temperature time series from three cities (Dublin, Edinburgh, and Miami) and obtained sensible results.
Submitted 15 September, 2022;
originally announced September 2022.
-
Leveraging Synthetic Data to Learn Video Stabilization Under Adverse Conditions
Authors:
Abdulrahman Kerim,
Washington L. S. Ramos,
Leandro Soriano Marcolino,
Erickson R. Nascimento,
Richard Jiang
Abstract:
Video stabilization plays a central role in improving video quality. However, despite the substantial progress made by these methods, they were mainly tested under standard weather and lighting conditions, and may perform poorly under adverse conditions. In this paper, we propose a synthetic-aware adverse weather robust algorithm for video stabilization that does not require real data and can be trained only on synthetic data. We also present Silver, a novel rendering engine to generate the required training data with an automatic ground-truth extraction procedure. Our approach uses our specially generated synthetic data for training an affine transformation matrix estimator, avoiding the feature extraction issues faced by current methods. Additionally, since no video stabilization datasets under adverse conditions are available, we propose the novel VSAC105Real dataset for evaluation. We compare our method to five state-of-the-art video stabilization algorithms using two benchmarks. Our results show that current approaches perform poorly in at least one weather condition, and that, even when training on a small dataset with synthetic data only, we achieve the best performance in terms of stability average score, distortion score, success rate, and average cropping ratio when considering all weather conditions. Hence, our video stabilization model generalizes well on real-world videos and does not require large-scale synthetic training data to converge.
Submitted 26 August, 2022;
originally announced August 2022.
-
RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning
Authors:
Sabela Ramos,
Sertan Girgin,
Léonard Hussenot,
Damien Vincent,
Hanna Yakubovich,
Daniel Toyama,
Anita Gergely,
Piotr Stanczyk,
Raphael Marinier,
Jeremiah Harmsen,
Olivier Pietquin,
Nikola Momchev
Abstract:
We introduce RLDS (Reinforcement Learning Datasets), an ecosystem for recording, replaying, manipulating, annotating and sharing data in the context of Sequential Decision Making (SDM) including Reinforcement Learning (RL), Learning from Demonstrations, Offline RL or Imitation Learning. RLDS enables not only reproducibility of existing research and easy generation of new datasets, but also accelerates novel research. By providing a standard and lossless format of datasets, it makes it possible to quickly test new algorithms on a wider range of tasks. The RLDS ecosystem makes it easy to share datasets without any loss of information and to be agnostic to the underlying original format when applying various data processing pipelines to large collections of datasets. In addition, RLDS provides tools for collecting data generated by either synthetic agents or humans, as well as for inspecting and manipulating the collected data. Ultimately, integration with TFDS facilitates the sharing of RL datasets with the research community.
Submitted 4 November, 2021;
originally announced November 2021.
-
Hyperparameter Selection for Imitation Learning
Authors:
Leonard Hussenot,
Marcin Andrychowicz,
Damien Vincent,
Robert Dadashi,
Anton Raichuk,
Lukasz Stafiniak,
Sertan Girgin,
Raphael Marinier,
Nikola Momchev,
Sabela Ramos,
Manu Orsini,
Olivier Bachem,
Matthieu Geist,
Olivier Pietquin
Abstract:
We address the issue of tuning hyperparameters (HPs) for imitation learning algorithms in the context of continuous control, when the underlying reward function of the demonstrating expert cannot be observed at any time. The vast literature in imitation learning mostly considers this reward function to be available for HP selection, but this is not a realistic setting. Indeed, were this reward function available, it could be used directly for policy training and imitation would not be necessary. To tackle this mostly ignored problem, we propose a number of possible proxies to the external reward. We evaluate them in an extensive empirical study (more than 10,000 agents across 9 environments) and make practical recommendations for selecting HPs. Our results show that while imitation learning algorithms are sensitive to HP choices, it is often possible to select good enough HPs through a proxy to the reward function.
Submitted 25 May, 2021;
originally announced May 2021.
-
Reverb: A Framework For Experience Replay
Authors:
Albin Cassirer,
Gabriel Barth-Maron,
Eugene Brevdo,
Sabela Ramos,
Toby Boyd,
Thibault Sottiaux,
Manuel Kroiss
Abstract:
A central component of training in Reinforcement Learning (RL) is Experience: the data used for training. The mechanisms used to generate and consume this data have an important effect on the performance of RL algorithms.
In this paper, we introduce Reverb: an efficient, extensible, and easy to use system designed specifically for experience replay in RL. Reverb is designed to work efficiently in distributed configurations with up to thousands of concurrent clients.
The flexible API provides users with the tools to easily and accurately configure the replay buffer. It includes strategies for selecting and removing elements from the buffer, as well as options for controlling the ratio between sampled and inserted elements. This paper presents the core design of Reverb, gives examples of how it can be applied, and provides empirical results of Reverb's performance characteristics.
Submitted 9 February, 2021;
originally announced February 2021.
-
Bayesian Paired-Comparison with the bpcs Package
Authors:
David Issa Mattos,
Érika Martins Silva Ramos
Abstract:
This article introduces the bpcs R package (Bayesian Paired Comparison in Stan) and the statistical models implemented in the package. This package aims to facilitate the use of Bayesian models for paired comparison data in behavioral research. Bayesian analysis of paired comparison data allows parameter estimation even in conditions where the maximum likelihood estimate does not exist, allows easy extension of paired comparison models, provides straightforward interpretation of the results with credible intervals, has better control of type I error, offers more robust evidence towards the null hypothesis, allows propagation of uncertainties, includes prior information, and performs well when handling models with many parameters and latent variables. The bpcs package provides a consistent interface for R users and several functions to evaluate the posterior distribution of all parameters, to estimate the posterior distribution of any contest between items, and to obtain the posterior distribution of the ranks. Three reanalyses of recent studies that used the frequentist Bradley-Terry model are presented. These reanalyses are conducted with the Bayesian models of the bpcs package, and all the code used to fit the models and generate the figures and tables is available in the online appendix.
Submitted 20 September, 2021; v1 submitted 27 January, 2021;
originally announced January 2021.
-
A New Similarity Space Tailored for Supervised Deep Metric Learning
Authors:
Pedro H. Barros,
Fabiane Queiroz,
Flavio Figueredo,
Jefersson A. dos Santos,
Heitor S. Ramos
Abstract:
We propose a novel deep metric learning method. Unlike many works in this area, we define a novel latent space obtained through an autoencoder. The new space, namely S-space, is divided into different regions that describe the positions where pairs of objects are similar/dissimilar. We locate markers to identify these regions. We estimate the similarities between objects through a kernel-based Student's t-distribution to measure the markers' distance and the new data representation. In our approach, we simultaneously estimate the markers' position in the S-space and represent the objects in the same space. Moreover, we propose a new regularization function to prevent similar markers from collapsing together. We present evidence that our proposal can represent complex spaces, for instance, when groups of similar objects are located in disjoint regions. We compare our proposal to 9 different distance metric learning approaches (four of them based on deep learning) on 28 real-world heterogeneous datasets. According to the four quantitative metrics used, our method outperforms all nine strategies from the literature.
Submitted 18 November, 2020; v1 submitted 16 November, 2020;
originally announced November 2020.
-
A Sparse Sampling-based framework for Semantic Fast-Forward of First-Person Videos
Authors:
Michel Melo Silva,
Washington Luis Souza Ramos,
Mario Fernando Montenegro Campos,
Erickson Rangel Nascimento
Abstract:
Technological advances in sensors have paved the way for digital cameras to become increasingly ubiquitous, which, in turn, led to the popularity of the self-recording culture. As a result, the amount of visual data on the Internet is moving in the opposite direction of the available time and patience of the users. Thus, most of the uploaded videos are doomed to be forgotten and unwatched, stashed away in some computer folder or website. In this paper, we address the problem of creating smooth fast-forward videos without losing the relevant content. We present a new adaptive frame selection formulated as a weighted minimum reconstruction problem. Using a smoothing frame transition and filling visual gaps between segments, our approach accelerates first-person videos, emphasizing the relevant segments and avoiding visual discontinuities. Experiments conducted on controlled videos and also on an unconstrained dataset of First-Person Videos (FPVs) show that, when creating fast-forward videos, our method is able to retain as much relevant information and smoothness as the state-of-the-art techniques, but in less processing time.
Submitted 21 September, 2020;
originally announced September 2020.
-
Leveraging the Self-Transition Probability of Ordinal Pattern Transition Graph for Transportation Mode Classification
Authors:
I. Cardoso-Pereira,
J. B. Borges,
P. H. Barros,
A. F. Loureiro,
O. A. Rosso,
H. S. Ramos
Abstract:
The analysis of GPS trajectories is a well-studied problem in Urban Computing and has been used to track people. Analyzing people's mobility and identifying the transportation mode they use is essential for cities that want to reduce traffic jams and travel time between their points, thus helping to improve the quality of life of citizens. The trajectory data of a moving object is represented by a discrete collection of points through time, i.e., a time series. Given its interdisciplinary and broad scope of real-world applications, the need to extract knowledge from time series data is evident. Mining this type of data, however, faces several complexities due to its unique properties. Different representations of data may overcome this. In this work, we propose the use of a feature retained from the Ordinal Pattern Transition Graph, called the probability of self-transition, for transportation mode classification. The proposed feature presents better accuracy results than Permutation Entropy and Statistical Complexity, even when these two are combined. This is the first work, to the best of our knowledge, that uses Information Theory quantifiers for transportation mode classification, showing that it is a feasible approach to this kind of problem.
Submitted 16 July, 2020;
originally announced July 2020.
-
Acme: A Research Framework for Distributed Reinforcement Learning
Authors:
Matthew W. Hoffman,
Bobak Shahriari,
John Aslanides,
Gabriel Barth-Maron,
Nikola Momchev,
Danila Sinopalnikov,
Piotr Stańczyk,
Sabela Ramos,
Anton Raichuk,
Damien Vincent,
Léonard Hussenot,
Robert Dadashi,
Gabriel Dulac-Arnold,
Manu Orsini,
Alexis Jacq,
Johan Ferret,
Nino Vieillard,
Seyed Kamyar Seyed Ghasemipour,
Sertan Girgin,
Olivier Pietquin,
Feryal Behbahani,
Tamara Norman,
Abbas Abdolmaleki,
Albin Cassirer,
Fan Yang
, et al. (14 additional authors not shown)
Abstract:
Deep reinforcement learning (RL) has led to many recent and groundbreaking advances. However, these advances have often come at the cost of both increased scale in the underlying architectures being trained as well as increased complexity of the RL algorithms used to train them. These increases have in turn made it more difficult for researchers to rapidly prototype new ideas or reproduce published RL algorithms. To address these concerns this work describes Acme, a framework for constructing novel RL algorithms that is specifically designed to enable agents that are built using simple, modular components that can be used at various scales of execution. While the primary goal of Acme is to provide a framework for algorithm development, a secondary goal is to provide simple reference implementations of important or state-of-the-art algorithms. These implementations serve both as a validation of our design decisions as well as an important contribution to reproducibility in RL research. In this work we describe the major design decisions made within Acme and give further details as to how its components can be used to implement various algorithms. Our experiments provide baselines for a number of common and state-of-the-art algorithms as well as showing how these algorithms can be scaled up for much larger and more complex environments. This highlights one of the primary advantages of Acme, namely that it can be used to implement large, distributed RL algorithms that can run at massive scales while still maintaining the inherent readability of that implementation.
This work presents a second version of the paper which coincides with an increase in modularity, additional emphasis on offline, imitation and learning from demonstrations algorithms, as well as various new agents implemented as part of Acme.
Submitted 20 September, 2022; v1 submitted 1 June, 2020;
originally announced June 2020.
-
Personalizing Fast-Forward Videos Based on Visual and Textual Features from Social Network
Authors:
Washington L. S. Ramos,
Michel M. Silva,
Edson R. Araujo,
Alan C. Neves,
Erickson R. Nascimento
Abstract:
The growth of Social Networks has fueled the habit of people logging their day-to-day activities, and long First-Person Videos (FPVs) are one of the main tools in this new habit. Semantic-aware fast-forward methods are able to decrease the watch time and select meaningful moments, which is key to increasing the chances of these videos being watched. However, these methods cannot handle semantics in terms of personalization. In this work, we present a new approach to automatically creating personalized fast-forward videos for FPVs. Our approach explores the availability of text-centric data from the user's social networks, such as status updates, to infer her/his topics of interest and assigns scores to the input frames according to her/his preferences. Extensive experiments are conducted on three different datasets with simulated and real-world users as input, achieving an average F1 score of up to 12.8 percentage points higher than the best competitors. We also present a user study to demonstrate the effectiveness of our method.
Submitted 29 December, 2019;
originally announced December 2019.
-
3DBGrowth: volumetric vertebrae segmentation and reconstruction in magnetic resonance imaging
Authors:
Jonathan S. Ramos,
Mirela T. Cazzolato,
Bruno S. Faiçal,
Marcello H. Nogueira-Barbosa,
Caetano Traina Jr.,
Agma J. M. Traina
Abstract:
Segmentation of medical images is critical for making several processes of analysis and classification more reliable. With the growing number of people presenting back pain and related problems, the semi-automatic segmentation and 3D reconstruction of vertebral bodies became even more important to support decision making. A 3D reconstruction allows a fast and objective analysis of each vertebra's condition, which may play a major role in surgical planning and evaluation of suitable treatments. In this paper, we propose 3DBGrowth, which develops a 3D reconstruction over the efficient Balanced Growth method for 2D images. We also take advantage of the slope coefficient from the annotation time to reduce the total number of annotated slices, reducing the time spent on manual annotation. We show experimental results on a representative dataset with 17 MRI exams demonstrating that our approach significantly outperforms the competitors and, on average, only 37% of the total slices with vertebral body content must be annotated without losing performance/accuracy. Compared to the state-of-the-art methods, we have achieved a Dice Score gain of over 5% with comparable processing time. Moreover, 3DBGrowth works well with imprecise seed points, which reduces the time spent on manual annotation by the specialist.
Submitted 8 July, 2019; v1 submitted 24 June, 2019;
originally announced June 2019.
-
BGrowth: an efficient approach for the segmentation of vertebral compression fractures in magnetic resonance imaging
Authors:
Jonathan S. Ramos,
Carolina Y. V. Watanabe,
Marcello H. Nogueira-Barbosa,
Agma J. M. Traina
Abstract:
Segmentation of medical images is a critical issue: several processes of analysis and classification rely on this segmentation. With the growing number of people presenting with back pain and related problems, the automatic or semi-automatic segmentation of fractured vertebral bodies has become a challenging task. In general, those fractures present several regions with non-homogeneous intensities, and the dark regions are quite similar to the structures nearby. To overcome this challenge, in this paper we present a semi-automatic segmentation method called Balanced Growth (BGrowth). The experimental results on a dataset with 102 crushed and 89 normal vertebrae show that our approach significantly outperforms well-known methods from the literature. We have achieved an accuracy of up to 95% while keeping an acceptable processing time, equivalent to that of the state-of-the-art methods. Moreover, BGrowth presents the best results even with a rough (sloppy) manual annotation (seed points).
Submitted 24 June, 2019; v1 submitted 20 June, 2019;
originally announced June 2019.
-
Robots Racialized in the Likeness of Marginalized Social Identities are Subject to Greater Dehumanization than those racialized as White
Authors:
Megan Strait,
Ana Sánchez Ramos,
Virginia Contreras,
Noemi Garcia
Abstract:
The emergence and spread of humanlike robots into increasingly public domains has revealed a concerning phenomenon: people's unabashed dehumanization of robots, particularly those gendered as female. Here we examined this phenomenon further to understand whether other socially marginalized cues (racialization in the likeness of Asian and Black identities), like female-gendering, are associated with the manifestation of dehumanization (e.g., objectification, stereotyping) in human-robot interactions. To that end, we analyzed free-form comments (N=535) on three videos, each depicting a gynoid - Bina48, Nadine, or Yangyang - racialized as Black, White, and Asian respectively. As a preliminary control, we additionally analyzed commentary (N=674) on three videos depicting women embodying similar identity cues. The analyses indicate that people dehumanize robots racialized as Asian and Black more frequently than they do robots racialized as White. Additional, preliminary evaluation of how people's responses towards the gynoids compare to those towards other people suggests that the gynoids' ontology (as robots) further facilitates the dehumanization.
Submitted 1 August, 2018;
originally announced August 2018.
-
A Weighted Sparse Sampling and Smoothing Frame Transition Approach for Semantic Fast-Forward First-Person Videos
Authors:
Michel Melo Silva,
Washington Luis Souza Ramos,
Joao Klock Ferreira,
Felipe Cadar Chamone,
Mario Fernando Montenegro Campos,
Erickson Rangel Nascimento
Abstract:
Thanks to advances in low-cost digital camera technology and the popularity of the self-recording culture, the amount of visual data on the Internet is moving in the opposite direction to the available time and patience of users. Thus, most of the uploaded videos are doomed to be forgotten and unwatched in a computer folder or website. In this work, we address the problem of creating smooth fast-forward videos without losing the relevant content. We present a new adaptive frame selection formulated as a weighted minimum reconstruction problem, which, combined with a smoothing frame transition method, accelerates first-person videos while emphasizing the relevant segments and avoiding visual discontinuities. The experiments show that our method is able to fast-forward videos while retaining as much relevant information and smoothness as the state-of-the-art techniques, in less time. We also present a new 80-hour multimodal (RGB-D, IMU, and GPS) dataset of first-person videos with annotations for recorder profile, frame scene, activities, interaction, and attention.
Submitted 4 April, 2019; v1 submitted 23 February, 2018;
originally announced February 2018.
-
Making a long story short: A Multi-Importance fast-forwarding egocentric videos with the emphasis on relevant objects
Authors:
Michel Melo Silva,
Washington Luis Souza Ramos,
Felipe Cadar Chamone,
João Pedro Klock Ferreira,
Mario Fernando Montenegro Campos,
Erickson Rangel Nascimento
Abstract:
The emergence of low-cost high-quality personal wearable cameras, combined with the increasing storage capacity of video-sharing websites, has evoked a growing interest in first-person videos, since most videos are composed of long-running unedited streams which are usually tedious and unpleasant to watch. State-of-the-art semantic fast-forward methods currently face the challenge of providing an adequate balance between smoothness in visual flow and the emphasis on the relevant parts. In this work, we present the Multi-Importance Fast-Forward (MIFF), a fully automatic methodology to fast-forward egocentric videos facing these challenges. The dilemma of defining what the semantic information of a video is gets addressed by a learning process based on the preferences of the user. Results show that the proposed method keeps over 3 times more semantic content than the state-of-the-art fast-forward. Finally, we discuss the need for a particular video stabilization technique for fast-forward egocentric videos.
Submitted 7 March, 2018; v1 submitted 9 November, 2017;
originally announced November 2017.
-
Fast-Forward Video Based on Semantic Extraction
Authors:
Washington Luis Souza Ramos,
Michel Melo Silva,
Mario Fernando Montenegro Campos,
Erickson Rangel Nascimento
Abstract:
Thanks to the low operational cost and large storage capacity of smartphones and wearable devices, people are recording many hours of daily activities, sport actions and home videos. These videos, also known as egocentric videos, are generally long-running streams with unedited content, which makes them boring and visually unpalatable, raising the challenge of making egocentric videos more appealing. In this work we propose a novel methodology to compose the new fast-forward video by selecting frames based on semantic information extracted from images. The experiments show that our approach outperforms the state-of-the-art as far as semantic information is concerned and that it is also able to produce videos that are more pleasant to watch.
Submitted 16 August, 2017; v1 submitted 14 August, 2017;
originally announced August 2017.
-
Towards Semantic Fast-Forward and Stabilized Egocentric Videos
Authors:
Michel Melo Silva,
Washington Luis Souza Ramos,
Joao Pedro Klock Ferreira,
Mario Fernando Montenegro Campos,
Erickson Rangel Nascimento
Abstract:
The emergence of low-cost personal mobile devices and wearable cameras and the increasing storage capacity of video-sharing websites have pushed forward a growing interest in first-person videos. Since most of the recorded videos are long-running streams with unedited content, they are tedious and unpleasant to watch. State-of-the-art fast-forward methods face the challenge of balancing the smoothness of the video and the emphasis on the relevant frames given a speed-up rate. In this work, we present a methodology capable of summarizing and stabilizing egocentric videos by extracting the semantic information from the frames. This paper also describes a dataset collection with several semantically labeled videos and introduces a new smoothness evaluation metric for egocentric videos that is used to test our method.
Submitted 16 August, 2017; v1 submitted 14 August, 2017;
originally announced August 2017.
-
Detecting Unexpected Obstacles for Self-Driving Cars: Fusing Deep Learning and Geometric Modeling
Authors:
Sebastian Ramos,
Stefan Gehrig,
Peter Pinggera,
Uwe Franke,
Carsten Rother
Abstract:
The detection of small road hazards, such as lost cargo, is a vital capability for self-driving cars. We tackle this challenging and rarely addressed problem with a vision system that leverages appearance, contextual as well as geometric cues. To utilize the appearance and contextual cues, we propose a new deep learning-based obstacle detection framework. Here a variant of a fully convolutional network is used to predict a pixel-wise semantic labeling of (i) free-space, (ii) on-road unexpected obstacles, and (iii) background. The geometric cues are exploited using a state-of-the-art detection approach that predicts obstacles from stereo input images via model-based statistical hypothesis tests. We present a principled Bayesian framework to fuse the semantic and stereo-based detection results. The mid-level Stixel representation is used to describe obstacles in a flexible, compact and robust manner. We evaluate our new obstacle detection system on the Lost and Found dataset, which includes very challenging scenes with obstacles of only 5 cm height. Overall, we report a major improvement over the state-of-the-art, with relative performance gains of up to 50%. In particular, we achieve a detection rate of over 90% for distances of up to 50 m. Our system operates at 22 Hz on our self-driving platform.
Submitted 20 December, 2016;
originally announced December 2016.
-
Lost and Found: Detecting Small Road Hazards for Self-Driving Vehicles
Authors:
Peter Pinggera,
Sebastian Ramos,
Stefan Gehrig,
Uwe Franke,
Carsten Rother,
Rudolf Mester
Abstract:
Detecting small obstacles on the road ahead is a critical part of the driving task which has to be mastered by fully autonomous cars. In this paper, we present a method based on stereo vision to reliably detect such obstacles from a moving vehicle. The proposed algorithm performs statistical hypothesis tests in disparity space directly on stereo image data, assessing freespace and obstacle hypotheses on independent local patches. This detection approach does not depend on a global road model and handles both static and moving obstacles. For evaluation, we employ a novel lost-cargo image sequence dataset comprising more than two thousand frames with pixelwise annotations of obstacle and free-space and provide a thorough comparison to several stereo-based baseline methods. The dataset will be made available to the community to foster further research on this important topic. The proposed approach outperforms all considered baselines in our evaluations on both pixel and object level and runs at frame rates of up to 20 Hz on 2 mega-pixel stereo imagery. Small obstacles down to the height of 5 cm can successfully be detected at 20 m distance at low false positive rates.
Submitted 15 September, 2016;
originally announced September 2016.
-
The Cityscapes Dataset for Semantic Urban Scene Understanding
Authors:
Marius Cordts,
Mohamed Omran,
Sebastian Ramos,
Timo Rehfeld,
Markus Enzweiler,
Rodrigo Benenson,
Uwe Franke,
Stefan Roth,
Bernt Schiele
Abstract:
Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes.
To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes is comprised of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high quality pixel-level annotations; 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.
Submitted 7 April, 2016; v1 submitted 6 April, 2016;
originally announced April 2016.
-
Hierarchical Adaptive Structural SVM for Domain Adaptation
Authors:
Jiaolong Xu,
Sebastian Ramos,
David Vazquez,
Antonio M. Lopez
Abstract:
A key topic in classification is the accuracy loss produced when the data distribution in the training (source) domain differs from that in the testing (target) domain. This is being recognized as a very relevant problem for many computer vision tasks such as image classification, object detection, and object category recognition. In this paper, we present a novel domain adaptation method that leverages multiple target domains (or sub-domains) in a hierarchical adaptation tree. The core idea is to exploit the commonalities and differences of the jointly considered target domains.
Given the relevance of structural SVM (SSVM) classifiers, we apply our idea to the adaptive SSVM (A-SSVM), which only requires the target domain samples together with the existing source-domain classifier for performing the desired adaptation. Altogether, we term our proposal hierarchical A-SSVM (HA-SSVM).
As proof of concept we use HA-SSVM for pedestrian detection and object category recognition. In the former we apply HA-SSVM to the deformable part-based model (DPM) while in the latter HA-SSVM is applied to multi-category classifiers. In both cases, we show how HA-SSVM is effective in increasing the detection/recognition accuracy with respect to adaptation strategies that ignore the structure of the target data. Since the sub-domains of the target data are not always known a priori, we show how HA-SSVM can incorporate sub-domain structure discovery for object category recognition.
Submitted 22 August, 2014;
originally announced August 2014.
-
Spatiotemporal Stacked Sequential Learning for Pedestrian Detection
Authors:
Alejandro González,
Sebastian Ramos,
David Vázquez,
Antonio M. López,
Jaume Amores
Abstract:
Pedestrian classifiers decide which image windows contain a pedestrian. In practice, such classifiers provide a relatively high response at neighboring windows overlapping a pedestrian, while the responses around potential false positives are expected to be lower. Analogous reasoning applies to image sequences. If there is a pedestrian located within a frame, the same pedestrian is expected to appear close to the same location in neighboring frames. Therefore, such a location is likely to receive high classification scores over several frames, while false positives are expected to be more spurious. In this paper we propose to exploit such correlations for improving the accuracy of base pedestrian classifiers. In particular, we propose to use two-stage classifiers which not only rely on the image descriptors required by the base classifiers but also on the response of such base classifiers in a given spatiotemporal neighborhood. More specifically, we train pedestrian classifiers using a stacked sequential learning (SSL) paradigm. We use a new pedestrian dataset we have acquired from a car to evaluate our proposal at different frame rates. We also test on a well-known dataset: Caltech. The obtained results show that our SSL proposal boosts detection accuracy significantly with a minimal impact on the computational cost. Interestingly, SSL improves accuracy the most in the most dangerous situations, i.e., when a pedestrian is close to the camera.
Submitted 14 July, 2014;
originally announced July 2014.
-
Speckle Reduction with Adaptive Stack Filters
Authors:
María Elena Buemi,
Alejandro C. Frery,
Heitor S. Ramos
Abstract:
Stack filters are a special case of non-linear filters. They perform well when filtering images with different types of noise while preserving edges and details. A stack filter decomposes an input image into stacks of binary images according to a set of thresholds. Each binary image is then filtered by a Boolean function, which characterizes the filter. Adaptive stack filters can be computed by training using a prototype (ideal) image and its corrupted version, leading to optimized filters with respect to a loss function. In this work we propose the use of training with selected samples for the estimation of the optimal Boolean function. We study the performance of adaptive stack filters when they are applied to speckled imagery, in particular to Synthetic Aperture Radar (SAR) images. This is done by evaluating the quality of the filtered images through the use of suitable image quality indexes and by measuring the classification accuracy of the resulting images. We used SAR images as input, since they are affected by speckle noise that makes classification a difficult task.
Submitted 8 June, 2013;
originally announced June 2013.
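The threshold-decomposition step described in the abstract above can be illustrated with a minimal sketch on a 1D integer signal. This is not the authors' trained filter: the Boolean function here is a fixed 3-sample majority vote chosen for illustration (with it, the stack filter reduces to a median filter), and all function names are hypothetical.

```python
import numpy as np

def threshold_decompose(signal, levels):
    """Return binary slices b_t = (signal >= t) for t = 1..levels."""
    thresholds = np.arange(1, levels + 1)
    return (signal[None, :] >= thresholds[:, None]).astype(np.uint8)

def majority3(binary):
    """Illustrative Boolean function: majority over a 3-sample window.

    Edges are handled by replicating the border sample (an assumption).
    """
    padded = np.pad(binary, 1, mode="edge")
    window_sum = padded[:-2] + padded[1:-1] + padded[2:]
    return (window_sum >= 2).astype(np.uint8)

def stack_filter(signal, levels, boolean_fn):
    """Filter each binary slice with boolean_fn, then sum the slices back."""
    slices = threshold_decompose(signal, levels)
    filtered = np.array([boolean_fn(s) for s in slices])
    return filtered.sum(axis=0)

# Small signal with integer gray levels in 0..3.
signal = np.array([1, 3, 0, 2, 2, 3, 0, 1])
output = stack_filter(signal, levels=3, boolean_fn=majority3)
print(output.tolist())
```

An adaptive stack filter would replace `majority3` with a Boolean function estimated from a pair of ideal and corrupted training images; that estimation step is not shown here.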
-
Assessment of SAR Image Filtering using Adaptive Stack Filters
Authors:
Maria E. Buemi,
Marta Mejail,
Julio Jacobo,
Alejandro C. Frery,
Heitor S. Ramos
Abstract:
Stack filters are a special case of non-linear filters. They perform well when filtering images with different types of noise while preserving edges and details. A stack filter decomposes an input image into several binary images according to a set of thresholds. Each binary image is then filtered by a Boolean function, which characterizes the filter. Adaptive stack filters can be designed to be optimal; they are computed from a pair of images consisting of an ideal noiseless image and its noisy version. In this work we study the performance of adaptive stack filters when they are applied to Synthetic Aperture Radar (SAR) images. This is done by evaluating the quality of the filtered images through the use of suitable image quality indexes and by measuring the classification accuracy of the resulting images.
Submitted 18 July, 2012;
originally announced July 2012.