Search | arXiv e-print repository

JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model

Authors: Farzaneh Jafari, Stefano Berretti, Anup Basu

Abstract: In recent years, talking head generation has become a focal point for researchers. Considerable effort is being made to refine lip-sync motion, capture expressive facial expressions, generate natural head poses, and achieve high video quality. However, no single model has yet achieved equivalence across all these metrics. This paper aims to animate a 3D face using Jamba, a hybrid Transformers-Mamb… ▽ More In recent years, talking head generation has become a focal point for researchers. Considerable effort is being made to refine lip-sync motion, capture expressive facial expressions, generate natural head poses, and achieve high video quality. However, no single model has yet achieved equivalence across all these metrics. This paper aims to animate a 3D face using Jamba, a hybrid Transformers-Mamba model. Mamba, a pioneering Structured State Space Model (SSM) architecture, was designed to address the constraints of the conventional Transformer architecture. Nevertheless, it has several drawbacks. Jamba merges the advantages of both Transformer and Mamba approaches, providing a holistic solution. Based on the foundational Jamba block, we present JambaTalk to enhance motion variety and speed through multimodal integration. Extensive experiments reveal that our method achieves performance comparable or superior to state-of-the-art models. △ Less

Submitted 2 August, 2024; originally announced August 2024.

Comments: 12 pages with 3 figures

arXiv:2405.07680 [pdf, other]

Establishing a Unified Evaluation Framework for Human Motion Generation: A Comparative Analysis of Metrics

Authors: Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, Germain Forestier

Abstract: The development of generative artificial intelligence for human motion generation has expanded rapidly, necessitating a unified evaluation framework. This paper presents a detailed review of eight evaluation metrics for human motion generation, highlighting their unique features and shortcomings. We propose standardized practices through a unified evaluation setup to facilitate consistent model co… ▽ More The development of generative artificial intelligence for human motion generation has expanded rapidly, necessitating a unified evaluation framework. This paper presents a detailed review of eight evaluation metrics for human motion generation, highlighting their unique features and shortcomings. We propose standardized practices through a unified evaluation setup to facilitate consistent model comparisons. Additionally, we introduce a novel metric that assesses diversity in temporal distortion by analyzing warping diversity, thereby enhancing the evaluation of temporal data. We also conduct experimental analyses of three generative models using a publicly available dataset, offering insights into the interpretation of each metric in specific case scenarios. Our goal is to offer a clear, user-friendly evaluation framework for newcomers, complemented by publicly accessible code. △ Less

Submitted 13 May, 2024; originally announced May 2024.

arXiv:2403.12886 [pdf, other]

EmoVOCA: Speech-Driven Emotional 3D Talking Heads

Authors: Federico Nocentini, Claudio Ferrari, Stefano Berretti

Abstract: The domain of 3D talking head generation has witnessed significant progress in recent years. A notable challenge in this field consists in blending speech-related motions with expression dynamics, which is primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. Whereas literature works attempted to exploit 2D video… ▽ More The domain of 3D talking head generation has witnessed significant progress in recent years. A notable challenge in this field consists in blending speech-related motions with expression dynamics, which is primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. Whereas literature works attempted to exploit 2D video data and parametric 3D models as a workaround, these still show limitations when jointly modeling the two motions. In this work, we address this problem from a different perspective, and propose an innovative data-driven technique that we used for creating a synthetic dataset, called EmoVOCA, obtained by combining a collection of inexpressive 3D talking heads and a set of 3D expressive sequences. To demonstrate the advantages of this approach, and the quality of the dataset, we then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with expressive traits of the face. Comprehensive experiments, both quantitative and qualitative, using our data and generator evidence superior ability in synthesizing convincing animations, when compared with the best performing methods in the literature. Our code and pre-trained model will be made available. △ Less

Submitted 19 March, 2024; originally announced March 2024.

arXiv:2403.10942 [pdf, other]

ScanTalk: 3D Talking Heads from Unregistered Scans

Authors: Federico Nocentini, Thomas Besnier, Claudio Ferrari, Sylvain Arguillere, Stefano Berretti, Mohamed Daoudi

Abstract: Speech-driven 3D talking heads generation has emerged as a significant area of interest among researchers, presenting numerous challenges. Existing methods are constrained by animating faces with fixed topologies, wherein point-wise correspondence is established, and the number and order of points remains consistent across all identities the model can animate. In this work, we present ScanTalk, a… ▽ More Speech-driven 3D talking heads generation has emerged as a significant area of interest among researchers, presenting numerous challenges. Existing methods are constrained by animating faces with fixed topologies, wherein point-wise correspondence is established, and the number and order of points remains consistent across all identities the model can animate. In this work, we present ScanTalk, a novel framework capable of animating 3D faces in arbitrary topologies including scanned data. Our approach relies on the DiffusionNet architecture to overcome the fixed topology constraint, offering promising avenues for more flexible and realistic 3D animations. By leveraging the power of DiffusionNet, ScanTalk not only adapts to diverse facial structures but also maintains fidelity when dealing with scanned data, thereby enhancing the authenticity and versatility of generated 3D talking heads. Through comprehensive comparisons with state-of-the-art methods, we validate the efficacy of our approach, demonstrating its capacity to generate realistic talking heads comparable to existing techniques. While our primary objective is to develop a generic method free from topological constraints, all state-of-the-art methodologies are bound by such limitations. Code for reproducing our results, and the pre-trained model will be made available. △ Less

Submitted 19 March, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

arXiv:2311.14534 [pdf, other]

Finding Foundation Models for Time Series Classification with a PreText Task

Authors: Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, Germain Forestier

Abstract: Over the past decade, Time Series Classification (TSC) has gained an increasing attention. While various methods were explored, deep learning - particularly through Convolutional Neural Networks (CNNs)-stands out as an effective approach. However, due to the limited availability of training data, defining a foundation model for TSC that overcomes the overfitting problem is still a challenging task… ▽ More Over the past decade, Time Series Classification (TSC) has gained an increasing attention. While various methods were explored, deep learning - particularly through Convolutional Neural Networks (CNNs)-stands out as an effective approach. However, due to the limited availability of training data, defining a foundation model for TSC that overcomes the overfitting problem is still a challenging task. The UCR archive, encompassing a wide spectrum of datasets ranging from motion recognition to ECG-based heart disease detection, serves as a prime example for exploring this issue in diverse TSC scenarios. In this paper, we address the overfitting challenge by introducing pre-trained domain foundation models. A key aspect of our methodology is a novel pretext task that spans multiple datasets. This task is designed to identify the originating dataset of each time series sample, with the goal of creating flexible convolution filters that can be applied across different datasets. The research process consists of two phases: a pre-training phase where the model acquires general features through the pretext task, and a subsequent fine-tuning phase for specific dataset classifications. Our extensive experiments on the UCR archive demonstrate that this pre-training strategy significantly outperforms the conventional training approach without pre-training. This strategy effectively reduces overfitting in small datasets and provides an efficient route for adapting these models to new datasets, thus advancing the capabilities of deep learning in TSC. △ Less

Submitted 28 February, 2024; v1 submitted 24 November, 2023; originally announced November 2023.

arXiv:2309.16353 [pdf, other]

ShapeDBA: Generating Effective Time Series Prototypes using ShapeDTW Barycenter Averaging

Authors: Ali Ismail-Fawaz, Hassan Ismail Fawaz, François Petitjean, Maxime Devanne, Jonathan Weber, Stefano Berretti, Geoffrey I. Webb, Germain Forestier

Abstract: Time series data can be found in almost every domain, ranging from the medical field to manufacturing and wireless communication. Generating realistic and useful exemplars and prototypes is a fundamental data analysis task. In this paper, we investigate a novel approach to generating realistic and useful exemplars and prototypes for time series data. Our approach uses a new form of time series ave… ▽ More Time series data can be found in almost every domain, ranging from the medical field to manufacturing and wireless communication. Generating realistic and useful exemplars and prototypes is a fundamental data analysis task. In this paper, we investigate a novel approach to generating realistic and useful exemplars and prototypes for time series data. Our approach uses a new form of time series average, the ShapeDTW Barycentric Average. We therefore turn our attention to accurately generating time series prototypes with a novel approach. The existing time series prototyping approaches rely on the Dynamic Time Warping (DTW) similarity measure such as DTW Barycentering Average (DBA) and SoftDBA. These last approaches suffer from a common problem of generating out-of-distribution artifacts in their prototypes. This is mostly caused by the DTW variant used and its incapability of detecting neighborhood similarities, instead it detects absolute similarities. Our proposed method, ShapeDBA, uses the ShapeDTW variant of DTW, that overcomes this issue. We chose time series clustering, a popular form of time series analysis to evaluate the outcome of ShapeDBA compared to the other prototyping approaches. Coupled with the k-means clustering algorithm, and evaluated on a total of 123 datasets from the UCR archive, our proposed averaging approach is able to achieve new state-of-the-art results in terms of Adjusted Rand Index. △ Less

Submitted 28 September, 2023; originally announced September 2023.

Comments: Published in AALTD workshop at ECML/PKDD 2023

arXiv:2306.01415 [pdf, other]

Learning Landmarks Motion from Speech for Speaker-Agnostic 3D Talking Heads Generation

Authors: Federico Nocentini, Claudio Ferrari, Stefano Berretti

Abstract: This paper presents a novel approach for generating 3D talking heads from raw audio inputs. Our method grounds on the idea that speech related movements can be comprehensively and efficiently described by the motion of a few control points located on the movable parts of the face, i.e., landmarks. The underlying musculoskeletal structure then allows us to learn how their motion influences the geom… ▽ More This paper presents a novel approach for generating 3D talking heads from raw audio inputs. Our method grounds on the idea that speech related movements can be comprehensively and efficiently described by the motion of a few control points located on the movable parts of the face, i.e., landmarks. The underlying musculoskeletal structure then allows us to learn how their motion influences the geometrical deformations of the whole face. The proposed method employs two distinct models to this aim: the first one learns to generate the motion of a sparse set of landmarks from the given audio. The second model expands such landmarks motion to a dense motion field, which is utilized to animate a given 3D mesh in neutral state. Additionally, we introduce a novel loss function, named Cosine Loss, which minimizes the angle between the generated motion vectors and the ground truth ones. Using landmarks in 3D talking head generation offers various advantages such as consistency, reliability, and obviating the need for manual-annotation. Our approach is designed to be identity-agnostic, enabling high-quality facial animations for any users without additional data or training. △ Less

Submitted 26 July, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

Journal ref: International Conference on Image Analysis and Processing (ICIAP) 2023

arXiv:2306.01081 [pdf, other]

4DSR-GCN: 4D Video Point Cloud Upsampling using Graph Convolutional Networks

Authors: Lorenzo Berlincioni, Stefano Berretti, Marco Bertini, Alberto Del Bimbo

Abstract: Time varying sequences of 3D point clouds, or 4D point clouds, are now being acquired at an increasing pace in several applications (e.g., LiDAR in autonomous or assisted driving). In many cases, such volume of data is transmitted, thus requiring that proper compression tools are applied to either reduce the resolution or the bandwidth. In this paper, we propose a new solution for upscaling and re… ▽ More Time varying sequences of 3D point clouds, or 4D point clouds, are now being acquired at an increasing pace in several applications (e.g., LiDAR in autonomous or assisted driving). In many cases, such volume of data is transmitted, thus requiring that proper compression tools are applied to either reduce the resolution or the bandwidth. In this paper, we propose a new solution for upscaling and restoration of time-varying 3D video point clouds after they have been heavily compressed. In consideration of recent growing relevance of 3D applications, %We focused on a model allowing user-side upscaling and artifact removal for 3D video point clouds, a real-time stream of which would require . Our model consists of a specifically designed Graph Convolutional Network (GCN) that combines Dynamic Edge Convolution and Graph Attention Networks for feature aggregation in a Generative Adversarial setting. By taking inspiration PointNet++, We present a different way to sample dense point clouds with the intent to make these modules work in synergy to provide each node enough features about its neighbourhood in order to later on generate new vertices. Compared to other solutions in the literature that address the same task, our proposed model is capable of obtaining comparable results in terms of quality of the reconstruction, while using a substantially lower number of parameters (about 300KB), making our solution deployable in edge computing devices such as LiDAR. △ Less

Submitted 1 June, 2023; originally announced June 2023.

arXiv:2305.11921 [pdf, other]

An Approach to Multiple Comparison Benchmark Evaluations that is Stable Under Manipulation of the Comparate Set

Authors: Ali Ismail-Fawaz, Angus Dempster, Chang Wei Tan, Matthieu Herrmann, Lynn Miller, Daniel F. Schmidt, Stefano Berretti, Jonathan Weber, Maxime Devanne, Germain Forestier, Geoffrey I. Webb

Abstract: The measurement of progress using benchmarks evaluations is ubiquitous in computer science and machine learning. However, common approaches to analyzing and presenting the results of benchmark comparisons of multiple algorithms over multiple datasets, such as the critical difference diagram introduced by Demšar (2006), have important shortcomings and, we show, are open to both inadvertent and inte… ▽ More The measurement of progress using benchmarks evaluations is ubiquitous in computer science and machine learning. However, common approaches to analyzing and presenting the results of benchmark comparisons of multiple algorithms over multiple datasets, such as the critical difference diagram introduced by Demšar (2006), have important shortcomings and, we show, are open to both inadvertent and intentional manipulation. To address these issues, we propose a new approach to presenting the results of benchmark comparisons, the Multiple Comparison Matrix (MCM), that prioritizes pairwise comparisons and precludes the means of manipulating experimental results in existing approaches. MCM can be used to show the results of an all-pairs comparison, or to show the results of a comparison between one or more selected algorithms and the state of the art. MCM is implemented in Python and is publicly available. △ Less

Submitted 19 May, 2023; originally announced May 2023.

arXiv:2211.02366 [pdf, other]

SPEAKER VGG CCT: Cross-corpus Speech Emotion Recognition with Speaker Embedding and Vision Transformers

Authors: A. Arezzo, S. Berretti

Abstract: In recent years, Speech Emotion Recognition (SER) has been investigated mainly transforming the speech signal into spectrograms that are then classified using Convolutional Neural Networks pretrained on generic images and fine tuned with spectrograms. In this paper, we start from the general idea above and develop a new learning solution for SER, which is based on Compact Convolutional Transformer… ▽ More In recent years, Speech Emotion Recognition (SER) has been investigated mainly transforming the speech signal into spectrograms that are then classified using Convolutional Neural Networks pretrained on generic images and fine tuned with spectrograms. In this paper, we start from the general idea above and develop a new learning solution for SER, which is based on Compact Convolutional Transformers (CCTs) combined with a speaker embedding. With CCTs, the learning power of Vision Transformers (ViT) is combined with a diminished need for large volume of data as made possible by the convolution. This is important in SER, where large corpora of data are usually not available. The speaker embedding allows the network to extract an identity representation of the speaker, which is then integrated by means of a self-attention mechanism with the features that the CCT extracts from the spectrogram. Overall, the solution is capable of operating in real-time showing promising results in a cross-corpus scenario, where training and test datasets are kept separate. Experiments have been performed on several benchmarks in a cross-corpus setting as rarely used in the literature, with results that are comparable or superior to those obtained with state-of-the-art network architectures. Our code is available at https://github.com/JabuMlDev/Speaker-VGG-CCT. △ Less

Submitted 4 November, 2022; originally announced November 2022.

arXiv:2210.16815 [pdf, other]

CAD 3D Model classification by Graph Neural Networks: A new approach based on STEP format

Authors: L. Mandelli, S. Berretti

Abstract: In this paper, we introduce a new approach for retrieval and classification of 3D models that directly performs in the Computer-Aided Design (CAD) format without any conversion to other representations like point clouds or meshes, thus avoiding any loss of information. Among the various CAD formats, we consider the widely used STEP extension, which represents a standard for product manufacturing i… ▽ More In this paper, we introduce a new approach for retrieval and classification of 3D models that directly performs in the Computer-Aided Design (CAD) format without any conversion to other representations like point clouds or meshes, thus avoiding any loss of information. Among the various CAD formats, we consider the widely used STEP extension, which represents a standard for product manufacturing information. This particular format represents a 3D model as a set of primitive elements such as surfaces and vertices linked together. In our approach, we exploit the linked structure of STEP files to create a graph in which the nodes are the primitive elements and the arcs are the connections between them. We then use Graph Neural Networks (GNNs) to solve the problem of model classification. Finally, we created two datasets of 3D models in native CAD format, respectively, by collecting data from the Traceparts model library and from the Configurators software modeling company. We used these datasets to test and compare our approach with respect to state-of-the-art methods that consider other 3D formats. Our code is available at https://github.com/divanoLetto/3D_STEP_Classification △ Less

Submitted 30 October, 2022; originally announced October 2022.

arXiv:2210.16807 [pdf, other]

The Florence 4D Facial Expression Dataset

Authors: F. Principi, S. Berretti, C. Ferrari, N. Otberdout, M. Daoudi, A. Del Bimbo

Abstract: Human facial expressions change dynamically, so their recognition / analysis should be conducted by accounting for the temporal evolution of face deformations either in 2D or 3D. While abundant 2D video data do exist, this is not the case in 3D, where few 3D dynamic (4D) datasets were released for public use. The negative consequence of this scarcity of data is amplified by current deep learning b… ▽ More Human facial expressions change dynamically, so their recognition / analysis should be conducted by accounting for the temporal evolution of face deformations either in 2D or 3D. While abundant 2D video data do exist, this is not the case in 3D, where few 3D dynamic (4D) datasets were released for public use. The negative consequence of this scarcity of data is amplified by current deep learning based-methods for facial expression analysis that require large quantities of variegate samples to be effectively trained. With the aim of smoothing such limitations, in this paper we propose a large dataset, named Florence 4D, composed of dynamic sequences of 3D face models, where a combination of synthetic and real identities exhibit an unprecedented variety of 4D facial expressions, with variations that include the classical neutral-apex transition, but generalize to expression-to-expression. All these characteristics are not exposed by any of the existing 4D datasets and they cannot even be obtained by combining more than one dataset. We strongly believe that making such a data corpora publicly available to the community will allow designing and experimenting new applications that were not possible to investigate till now. To show at some extent the difficulty of our data in terms of different identities and varying expressions, we also report a baseline experimentation on the proposed dataset that can be used as baseline. △ Less

Submitted 30 October, 2022; originally announced October 2022.

arXiv:2209.01813 [pdf, other]

Automatic Estimation of Self-Reported Pain by Trajectory Analysis in the Manifold of Fixed Rank Positive Semi-Definite Matrices

Authors: Benjamin Szczapa, Mohamed Daoudi, Stefano Berretti, Pietro Pala, Alberto Del Bimbo, Zakia Hammal

Abstract: We propose an automatic method to estimate self-reported pain based on facial landmarks extracted from videos. For each video sequence, we decompose the face into four different regions and the pain intensity is measured by modeling the dynamics of facial movement using the landmarks of these regions. A formulation based on Gram matrices is used for representing the trajectory of landmarks on the… ▽ More We propose an automatic method to estimate self-reported pain based on facial landmarks extracted from videos. For each video sequence, we decompose the face into four different regions and the pain intensity is measured by modeling the dynamics of facial movement using the landmarks of these regions. A formulation based on Gram matrices is used for representing the trajectory of landmarks on the Riemannian manifold of symmetric positive semi-definite matrices of fixed rank. A curve fitting algorithm is used to smooth the trajectories and temporal alignment is performed to compute the similarity between the trajectories on the manifold. A Support Vector Regression classifier is then trained to encode extracted trajectories into pain intensity levels consistent with self-reported pain intensity measurement. Finally, a late fusion of the estimation for each region is performed to obtain the final predicted pain level. The proposed approach is evaluated on two publicly available datasets, the UNBCMcMaster Shoulder Pain Archive and the Biovid Heat Pain dataset. We compared our method to the state-of-the-art on both datasets using different testing protocols, showing the competitiveness of the proposed approach. △ Less

Submitted 17 September, 2022; v1 submitted 5 September, 2022; originally announced September 2022.

Comments: To appear in IEEE Transactions On Affective Computing, it is an extension of our paper arXiv:2006.13882

arXiv:2208.00050 [pdf, other]

Generating Multiple 4D Expression Transitions by Learning Face Landmark Trajectories

Authors: Naima Otberdout, Claudio Ferrari, Mohamed Daoudi, Stefano Berretti, Alberto Del Bimbo

Abstract: In this work, we address the problem of 4D facial expressions generation. This is usually addressed by animating a neutral 3D face to reach an expression peak, and then get back to the neutral state. In the real world though, people show more complex expressions, and switch from one expression to another. We thus propose a new model that generates transitions between different expressions, and syn… ▽ More In this work, we address the problem of 4D facial expressions generation. This is usually addressed by animating a neutral 3D face to reach an expression peak, and then get back to the neutral state. In the real world though, people show more complex expressions, and switch from one expression to another. We thus propose a new model that generates transitions between different expressions, and synthesizes long and composed 4D expressions. This involves three sub-problems: (i) modeling the temporal dynamics of expressions, (ii) learning transitions between them, and (iii) deforming a generic mesh. We propose to encode the temporal evolution of expressions using the motion of a set of 3D landmarks, that we learn to generate by training a manifold-valued GAN (Motion3DGAN). To allow the generation of composed expressions, this model accepts two labels encoding the starting and the ending expressions. The final sequence of meshes is generated by a Sparse2Dense mesh Decoder (S2D-Dec) that maps the landmark displacements to a dense, per-vertex displacement of a known mesh topology. By explicitly working with motion trajectories, the model is totally independent from the identity. Extensive experiments on five public datasets show that our proposed approach brings significant improvements with respect to previous solutions, while retaining good generalization to unseen data. △ Less

Submitted 18 May, 2023; v1 submitted 29 July, 2022; originally announced August 2022.

Comments: This preprint is an extension of CVPR 2022 paper arXiv:2105.07463

arXiv:2206.11759 [pdf, other]

What makes you, you? Analyzing Recognition by Swapping Face Parts

Authors: Claudio Ferrari, Matteo Serpentoni, Stefano Berretti, Alberto Del Bimbo

Abstract: Deep learning advanced face recognition to an unprecedented accuracy. However, understanding how local parts of the face affect the overall recognition performance is still mostly unclear. Among others, face swap has been experimented to this end, but just for the entire face. In this paper, we propose to swap facial parts as a way to disentangle the recognition relevance of different face parts,… ▽ More Deep learning advanced face recognition to an unprecedented accuracy. However, understanding how local parts of the face affect the overall recognition performance is still mostly unclear. Among others, face swap has been experimented to this end, but just for the entire face. In this paper, we propose to swap facial parts as a way to disentangle the recognition relevance of different face parts, like eyes, nose and mouth. In our method, swapping parts from a source face to a target one is performed by fitting a 3D prior, which establishes dense pixels correspondence between parts, while also handling pose differences. Seamless cloning is then used to obtain smooth transitions between the mapped source regions and the shape and skin tone of the target face. We devised an experimental protocol that allowed us to draw some preliminary conclusions when the swapped images are classified by deep networks, indicating a prominence of the eyes and eyebrows region. Code available at https://github.com/clferrari/FacePartsSwap △ Less

Submitted 23 June, 2022; originally announced June 2022.

Comments: Accepted for publication at 26TH International Conference on Pattern Recognition (ICPR), 2022

arXiv:2105.07463 [pdf, other]

Sparse to Dense Dynamic 3D Facial Expression Generation

Authors: Naima Otberdout, Claudio Ferrari, Mohamed Daoudi, Stefano Berretti, Alberto Del Bimbo

Abstract: In this paper, we propose a solution to the task of generating dynamic 3D facial expressions from a neutral 3D face and an expression label. This involves solving two sub-problems: (i)modeling the temporal dynamics of expressions, and (ii) deforming the neutral mesh to obtain the expressive counterpart. We represent the temporal evolution of expressions using the motion of a sparse set of 3D landm… ▽ More In this paper, we propose a solution to the task of generating dynamic 3D facial expressions from a neutral 3D face and an expression label. This involves solving two sub-problems: (i)modeling the temporal dynamics of expressions, and (ii) deforming the neutral mesh to obtain the expressive counterpart. We represent the temporal evolution of expressions using the motion of a sparse set of 3D landmarks that we learn to generate by training a manifold-valued GAN (Motion3DGAN). To better encode the expression-induced deformation and disentangle it from the identity information, the generated motion is represented as per-frame displacement from a neutral configuration. To generate the expressive meshes, we train a Sparse2Dense mesh Decoder (S2D-Dec) that maps the landmark displacements to a dense, per-vertex displacement. This allows us to learn how the motion of a sparse set of landmarks influences the deformation of the overall face surface, independently from the identity. Experimental results on the CoMA and D3DFACS datasets show that our solution brings significant improvements with respect to previous solutions in terms of both dynamic expression generation and mesh reconstruction, while retaining good generalization to unseen data. The code and the pretrained model will be made publicly available. △ Less

Submitted 3 March, 2022; v1 submitted 16 May, 2021; originally announced May 2021.

Comments: paper accepted at CVPR 2022

arXiv:2006.13895 [pdf, other]

Modelling the Statistics of Cyclic Activities by Trajectory Analysis on the Manifold of Positive-Semi-Definite Matrices

Authors: Ettore Maria Celozzi, Luca Ciabini, Luca Cultrera, Pietro Pala, Stefano Berretti, Mohamed Daoudi, Alberto Del Bimbo

Abstract: In this paper, a model is presented to extract statistical summaries to characterize the repetition of a cyclic body action, for instance a gym exercise, for the purpose of checking the compliance of the observed action to a template one and highlighting the parts of the action that are not correctly executed (if any). The proposed system relies on a Riemannian metric to compute the distance betwe… ▽ More In this paper, a model is presented to extract statistical summaries to characterize the repetition of a cyclic body action, for instance a gym exercise, for the purpose of checking the compliance of the observed action to a template one and highlighting the parts of the action that are not correctly executed (if any). The proposed system relies on a Riemannian metric to compute the distance between two poses in such a way that the geometry of the manifold where the pose descriptors lie is preserved; a model to detect the begin and end of each cycle; a model to temporally align the poses of different cycles so as to accurately estimate the \emph{cross-sectional} mean and variance of poses across different cycles. The proposed model is demonstrated using gym videos taken from the Internet. △ Less

Submitted 24 June, 2020; originally announced June 2020.

Comments: accepted at 15th IEEE International Conference on Automatic Face and Gesture Recognition 2020

arXiv:2006.13882 [pdf, other]

Automatic Estimation of Self-Reported Pain by Interpretable Representations of Motion Dynamics

Authors: Benjamin Szczapa, Mohamed Daoudi, Stefano Berretti, Pietro Pala, Alberto Del Bimbo, Zakia Hammal

Abstract: We propose an automatic method for pain intensity measurement from video. For each video, pain intensity was measured using the dynamics of facial movement using 66 facial points. Gram matrices formulation was used for facial points trajectory representations on the Riemannian manifold of symmetric positive semi-definite matrices of fixed rank. Curve fitting and temporal alignment were then used t… ▽ More We propose an automatic method for pain intensity measurement from video. For each video, pain intensity was measured using the dynamics of facial movement using 66 facial points. Gram matrices formulation was used for facial points trajectory representations on the Riemannian manifold of symmetric positive semi-definite matrices of fixed rank. Curve fitting and temporal alignment were then used to smooth the extracted trajectories. A Support Vector Regression model was then trained to encode the extracted trajectories into ten pain intensity levels consistent with the Visual Analogue Scale for pain intensity measurement. The proposed approach was evaluated using the UNBC McMaster Shoulder Pain Archive and was compared to the state-of-the-art on the same data. Using both 5-fold cross-validation and leave-one-subject-out cross-validation, our results are competitive with respect to state-of-the-art methods. △ Less

Submitted 24 June, 2020; originally announced June 2020.

Comments: accepted at ICPR 2020 Conference

arXiv:2006.03840 [pdf, other]

doi 10.1109/TPAMI.2021.3090942

A Sparse and Locally Coherent Morphable Face Model for Dense Semantic Correspondence Across Heterogeneous 3D Faces

Authors: Claudio Ferrari, Stefano Berretti, Pietro Pala, Alberto Del Bimbo

Abstract: The 3D Morphable Model (3DMM) is a powerful statistical tool for representing 3D face shapes. To build a 3DMM, a training set of face scans in full point-to-point correspondence is required, and its modeling capabilities directly depend on the variability contained in the training data. Thus, to increase the descriptive power of the 3DMM, establishing a dense correspondence across heterogeneous sc… ▽ More The 3D Morphable Model (3DMM) is a powerful statistical tool for representing 3D face shapes. To build a 3DMM, a training set of face scans in full point-to-point correspondence is required, and its modeling capabilities directly depend on the variability contained in the training data. Thus, to increase the descriptive power of the 3DMM, establishing a dense correspondence across heterogeneous scans with sufficient diversity in terms of identities, ethnicities, or expressions becomes essential. In this manuscript, we present a fully automatic approach that leverages a 3DMM to transfer its dense semantic annotation across raw 3D faces, establishing a dense correspondence between them. We propose a novel formulation to learn a set of sparse deformation components with local support on the face that, together with an original non-rigid deformation algorithm, allow the 3DMM to precisely fit unseen faces and transfer its semantic annotation. We extensively experimented our approach, showing it can effectively generalize to highly diverse samples and accurately establish a dense correspondence even in presence of complex facial expressions. The accuracy of the dense registration is demonstrated by building a heterogeneous, large-scale 3DMM from more than 9,000 fully registered scans obtained by joining three large datasets together. △ Less

Submitted 24 June, 2021; v1 submitted 6 June, 2020; originally announced June 2020.

Comments: Accepted for publication in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

arXiv:1908.00646 [pdf, other]

Fitting, Comparison, and Alignment of Trajectories on Positive Semi-Definite Matrices with Application to Action Recognition

Authors: Benjamin Szczapa, Mohamed Daoudi, Stefano Berretti, Alberto Del Bimbo, Pietro Pala, Estelle Massart

Abstract: In this paper, we tackle the problem of action recognition using body skeletons extracted from video sequences. Our approach lies in the continuity of recent works representing video frames by Gramian matrices that describe a trajectory on the Riemannian manifold of positive-semidefinite matrices of fixed rank. In comparison with previous works, the manifold of fixed-rank positive-semidefinite mat… ▽ More In this paper, we tackle the problem of action recognition using body skeletons extracted from video sequences. Our approach lies in the continuity of recent works representing video frames by Gramian matrices that describe a trajectory on the Riemannian manifold of positive-semidefinite matrices of fixed rank. In comparison with previous works, the manifold of fixed-rank positive-semidefinite matrices is here endowed with a different metric, and we resort to different algorithms for the curve fitting and temporal alignment steps. We evaluated our approach on three publicly available datasets (UTKinect-Action3D, KTH-Action and UAV-Gesture). The results of the proposed approach are competitive with respect to state-of-the-art methods, while only involving body skeletons. △ Less

Submitted 9 September, 2019; v1 submitted 1 August, 2019; originally announced August 2019.

Comments: Updated version of the paper published in the workshop HBU2019. The differences with the published version are a few small corrections, mainly misleading notations for the distance function on p. 4, and missing square root in the expression for "d", in the Thm. on p. 4. Noticeable changes w. r. t. v1 and v2 on arxiv, please use this version instead

arXiv:1907.10087 [pdf, other]

doi 10.1109/TPAMI.2020.3002500

Dynamic Facial Expression Generation on Hilbert Hypersphere with Conditional Wasserstein Generative Adversarial Nets

Authors: Naima Otberdout, Mohamed Daoudi, Anis Kacem, Lahoucine Ballihi, Stefano Berretti

Abstract: In this work, we propose a novel approach for generating videos of the six basic facial expressions given a neutral face image. We propose to exploit the face geometry by modeling the facial landmarks motion as curves encoded as points on a hypersphere. By proposing a conditional version of manifold-valued Wasserstein generative adversarial network (GAN) for motion generation on the hypersphere, w… ▽ More In this work, we propose a novel approach for generating videos of the six basic facial expressions given a neutral face image. We propose to exploit the face geometry by modeling the facial landmarks motion as curves encoded as points on a hypersphere. By proposing a conditional version of manifold-valued Wasserstein generative adversarial network (GAN) for motion generation on the hypersphere, we learn the distribution of facial expression dynamics of different classes, from which we synthesize new facial expression motions. The resulting motions can be transformed to sequences of landmarks and then to images sequences by editing the texture information using another conditional Generative Adversarial Network. To the best of our knowledge, this is the first work that explores manifold-valued representations with GAN to address the problem of dynamic facial expression generation. We evaluate our proposed approach both quantitatively and qualitatively on two public datasets; Oulu-CASIA and MUG Facial Expression. Our experimental results demonstrate the effectiveness of our approach in generating realistic videos with continuous motion, realistic appearance and identity preservation. We also show the efficiency of our framework for dynamic facial expressions generation, dynamic facial expression transfer and data augmentation for training improved emotion recognition models. △ Less

Submitted 28 May, 2020; v1 submitted 23 July, 2019; originally announced July 2019.

arXiv:1904.04297 [pdf, other]

Learned 3D Shape Representations Using Fused Geometrically Augmented Images: Application to Facial Expression and Action Unit Detection

Authors: Bilal Taha, Munawar Hayat, Stefano Berretti, Naoufel Werghi

Abstract: This paper proposes an approach to learn generic multi-modal mesh surface representations using a novel scheme for fusing texture and geometric data. Our approach defines an inverse mapping between different geometric descriptors computed on the mesh surface or its down-sampled version, and the corresponding 2D texture image of the mesh, allowing the construction of fused geometrically augmented i… ▽ More This paper proposes an approach to learn generic multi-modal mesh surface representations using a novel scheme for fusing texture and geometric data. Our approach defines an inverse mapping between different geometric descriptors computed on the mesh surface or its down-sampled version, and the corresponding 2D texture image of the mesh, allowing the construction of fused geometrically augmented images (FGAI). This new fused modality enables us to learn feature representations from 3D data in a highly efficient manner by simply employing standard convolutional neural networks in a transfer-learning mode. In contrast to existing methods, the proposed approach is both computationally and memory efficient, preserves intrinsic geometric information and learns highly discriminative feature representation by effectively fusing shape and texture information at data level. The efficacy of our approach is demonstrated for the tasks of facial action unit detection and expression classification. The extensive experiments conducted on the Bosphorus and BU-4DFE datasets, show that our method produces a significant boost in the performance when compared to state-of-the-art solutions △ Less

Submitted 8 April, 2019; originally announced April 2019.

arXiv:1902.03804 [pdf, other]

Additional Baseline Metrics for the paper "Extended YouTube Faces: a Dataset for Heterogeneous Open-Set Face Identification"

Authors: Claudio Ferrari, Stefano Berretti, Alberto Del Bimbo

Abstract: In this report, we provide additional and corrected results for the paper "Extended YouTube Faces: a Dataset for Heterogeneous Open-Set Face Identification". After further investigations, we discovered and corrected wrongly labeled images and incorrect identities. This forced us to re-generate the evaluation protocol for the new data; in doing so, we also reproduced and extended the experimental r… ▽ More In this report, we provide additional and corrected results for the paper "Extended YouTube Faces: a Dataset for Heterogeneous Open-Set Face Identification". After further investigations, we discovered and corrected wrongly labeled images and incorrect identities. This forced us to re-generate the evaluation protocol for the new data; in doing so, we also reproduced and extended the experimental results with other standard metrics and measures used in the literature. The reader can refer to the original paper for additional details regarding the data collection procedure and recognition pipeline. △ Less

Submitted 11 February, 2019; originally announced February 2019.

Comments: 3 pages, 2 figures

arXiv:1810.11392 [pdf, other]

Automatic Analysis of Facial Expressions Based on Deep Covariance Trajectories

Authors: Naima Otberdout, Anis Kacem, Mohamed Daoudi, Lahoucine Ballihi, Stefano Berretti

Abstract: In this paper, we propose a new approach for facial expression recognition using deep covariance descriptors. The solution is based on the idea of encoding local and global Deep Convolutional Neural Network (DCNN) features extracted from still images, in compact local and global covariance descriptors. The space geometry of the covariance matrices is that of Symmetric Positive Definite (SPD) matri… ▽ More In this paper, we propose a new approach for facial expression recognition using deep covariance descriptors. The solution is based on the idea of encoding local and global Deep Convolutional Neural Network (DCNN) features extracted from still images, in compact local and global covariance descriptors. The space geometry of the covariance matrices is that of Symmetric Positive Definite (SPD) matrices. By conducting the classification of static facial expressions using Support Vector Machine (SVM) with a valid Gaussian kernel on the SPD manifold, we show that deep covariance descriptors are more effective than the standard classification with fully connected layers and softmax. Besides, we propose a completely new and original solution to model the temporal dynamic of facial expressions as deep trajectories on the SPD manifold. As an extension of the classification pipeline of covariance descriptors, we apply SVM with valid positive definite kernels derived from global alignment for deep covariance trajectories classification. By performing extensive experiments on the Oulu-CASIA, CK+, and SFEW datasets, we show that both the proposed static and dynamic approaches achieve state-of-the-art performance for facial expression recognition outperforming many recent approaches. △ Less

Submitted 4 December, 2019; v1 submitted 25 October, 2018; originally announced October 2018.

Comments: A preliminary version of this work appeared in "Otberdout N, Kacem A, Daoudi M, Ballihi L, Berretti S. Deep Covariance Descriptors for Facial Expression Recognition, in British Machine Vision Conference 2018, BMVC 2018, Northumbria University, Newcastle, UK, September 3-6, 2018. ; 2018 :159." arXiv admin note: substantial text overlap with arXiv:1805.03869

arXiv:1807.00676 [pdf, other]

A Novel Geometric Framework on Gram Matrix Trajectories for Human Behavior Understanding

Authors: Anis Kacem, Mohamed Daoudi, Boulbaba Ben Amor, Stefano Berretti, Juan Carlos Alvarez-Paiva

Abstract: In this paper, we propose a novel space-time geometric representation of human landmark configurations and derive tools for comparison and classification. We model the temporal evolution of landmarks as parametrized trajectories on the Riemannian manifold of positive semidefinite matrices of fixed-rank. Our representation has the benefit to bring naturally a second desirable quantity when comparin… ▽ More In this paper, we propose a novel space-time geometric representation of human landmark configurations and derive tools for comparison and classification. We model the temporal evolution of landmarks as parametrized trajectories on the Riemannian manifold of positive semidefinite matrices of fixed-rank. Our representation has the benefit to bring naturally a second desirable quantity when comparing shapes, the spatial covariance, in addition to the conventional affine-shape representation. We derived then geometric and computational tools for rate-invariant analysis and adaptive re-sampling of trajectories, grounding on the Riemannian geometry of the underlying manifold. Specifically, our approach involves three steps: (1) landmarks are first mapped into the Riemannian manifold of positive semidefinite matrices of fixed-rank to build time-parameterized trajectories; (2) a temporal warping is performed on the trajectories, providing a geometry-aware (dis-)similarity measure between them; (3) finally, a pairwise proximity function SVM is used to classify them, incorporating the (dis-)similarity measure into the kernel function. We show that such representation and metric achieve competitive results in applications as action recognition and emotion recognition from 3D skeletal data, and facial expression recognition from videos. Experiments have been conducted on several publicly available up-to-date benchmarks. △ Less

Submitted 29 June, 2018; originally announced July 2018.

Comments: Under minor revisions in IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI). A preliminary version of this work appeared in ICCV 17 (A Kacem, M Daoudi, BB Amor, JC Alvarez-Paiva, A Novel Space-Time Representation on the Positive Semidefinite Cone for Facial Expression Recognition, ICCV 17). arXiv admin note: substantial text overlap with arXiv:1707.06440

arXiv:1805.03869 [pdf, other]

Deep Covariance Descriptors for Facial Expression Recognition

Authors: Naima Otberdout, Anis Kacem, Mohamed Daoudi, Lahoucine Ballihi, Stefano Berretti

Abstract: In this paper, covariance matrices are exploited to encode the deep convolutional neural networks (DCNN) features for facial expression recognition. The space geometry of the covariance matrices is that of Symmetric Positive Definite (SPD) matrices. By performing the classification of the facial expressions using Gaussian kernel on SPD manifold, we show that the covariance descriptors computed on… ▽ More In this paper, covariance matrices are exploited to encode the deep convolutional neural networks (DCNN) features for facial expression recognition. The space geometry of the covariance matrices is that of Symmetric Positive Definite (SPD) matrices. By performing the classification of the facial expressions using Gaussian kernel on SPD manifold, we show that the covariance descriptors computed on DCNN features are more efficient than the standard classification with fully connected layers and softmax. By implementing our approach using the VGG-face and ExpNet architectures with extensive experiments on the Oulu-CASIA and SFEW datasets, we show that the proposed approach achieves performance at the state of the art for facial expression recognition. △ Less

Submitted 10 May, 2018; originally announced May 2018.

arXiv:1707.07180 [pdf, other]

Emotion Recognition by Body Movement Representation on the Manifold of Symmetric Positive Definite Matrices

Authors: Mohamed Daoudi, Stefano Berretti, Pietro Pala, Yvonne Delevoye, Alberto Del Bimbo

Abstract: Emotion recognition is attracting great interest for its potential application in a multitude of real-life situations. Much of the Computer Vision research in this field has focused on relating emotions to facial expressions, with investigations rarely including more than upper body. In this work, we propose a new scenario, for which emotional states are related to 3D dynamics of the whole body mo… ▽ More Emotion recognition is attracting great interest for its potential application in a multitude of real-life situations. Much of the Computer Vision research in this field has focused on relating emotions to facial expressions, with investigations rarely including more than upper body. In this work, we propose a new scenario, for which emotional states are related to 3D dynamics of the whole body motion. To address the complexity of human body movement, we used covariance descriptors of the sequence of the 3D skeleton joints, and represented them in the non-linear Riemannian manifold of Symmetric Positive Definite matrices. In doing so, we exploited geodesic distances and geometric means on the manifold to perform emotion classification. Using sequences of spontaneous walking under the five primary emotional states, we report a method that succeeded in classifying the different emotions, with comparable performance to those observed in a human-based force-choice classification task. △ Less

Submitted 22 July, 2017; originally announced July 2017.

Comments: accepted in I19th International Conference on Image Analysis and processing (ICIAP), 11-15 september Catania, Italy, 2017

Showing 1–27 of 27 results for author: Berretti, S