Pose-guided multi-task video transformer for driver action recognition

Ricardo Pizarro, Universidad de Alcalá de Henares
Roberto Valle, Universidad Politécnica de Madrid, {rvalle,lbaumela}@fi.upm.es
Luis Miguel Bergasa, Universidad de Alcalá de Henares
José M. Buenaposada, Universidad Rey Juan Carlos
Luis Baumela, Universidad Politécnica de Madrid, {rvalle,lbaumela}@fi.upm.es

Abstract

We investigate the task of identifying situations of distracted driving through analysis of in-car videos. To tackle this challenge we introduce a multi-task video transformer that predicts both distracted actions and driver pose. Leveraging VideoMAEv2, a large pre-trained architecture, our approach incorporates semantic information from human keypoint locations to enhance action recognition and decrease computational overhead by minimizing the number of spatio-temporal tokens. By guiding token selection with pose and class information, we notably reduce the model’s computational requirements while preserving the baseline accuracy. Our model surpasses existing state-of-the-art results in driver action recognition while exhibiting superior efficiency compared to current video transformer-based approaches.

1 Introduction

Driver distraction is a significant concern for road safety. Eurostat statistics report that in 2021 19,917 persons were killed in road accidents in the EU [1]. The precise number of road accidents resulting from distracted drivers is unclear. Despite estimates indicating that 5-25% of European accidents result from driver distraction, encompassing activities such as cellphone or GPS usage, eating, smoking, or fatigue and stress, research conducted in real-life driving scenarios unveils that 68.3% of crashes exhibit some signs of discernible distraction [52]. To address this problem it is essential to integrate Human Machine Interface (HMI) technologies into advanced driver assistance systems (ADAS). In this paper we study the problem of recognising different situations of distracted driving based on in-car video analysis.

The field of computer vision, particularly using deep learning techniques, has seen significant breakthroughs that have been applied to the detection of driver distraction[27]. Various architecture models are in use, such as Convolutional Neural Networks (CNN)[34], combinations of CNN and Long Short-Term Memory networks (CNN+LSTM)[57], and different adaptations of Transformer models[46, 43, 53]. While progress has been made, there is still considerable room for further refinement of these technologies, especially when taking into account the trade-off between accuracy and efficiency.

Driver distraction detection is intimately related to the problem of human action recognition. In this area we have seen a transition from methods using CNNs [49, 56, 6] and 3D-CNNs [18, 51, 33] or a mixture of both [6] to transformers [8, 7]. Benefiting from self-learning techniques and the use of large-scale datasets, recent video transformer models achieve top accuracy on the human action recognition problem [55]. However, their large computational requirements preclude their use in a practical car driving scenario.

In this paper we present a multi-task video transformer for driver action recognition. Our model is based on VideoMAEv2 [55], a transformer architecture pre-trained with a self-learning strategy and refined with a large human action database. We propose a new approach to use the semantic information of human keypoint locations that improves action recognition and alleviates the computational load of the model by reducing the number of spatio-temporal tokens in the video transformer. The use of human pose to improve human action recognition is not new [14, 35, 39, 10, 61] and has also been applied to driver action recognition [21, 31, 41]. However, previous works using pose start from an external human pose estimation model [35, 10, 61, 15, 21, 31, 41]. This also happens with recent transformer-based methods [2, 26, 63]. Having an external pose estimation model not only increases the computational cost, but also decreases the system robustness in situations where the external model fails. Few methods train and estimate the pose and recognize actions in the same model [14, 39]. As far as we know, we are the first to bring human pose estimation to a video transformer in a multi-task approach to improve action recognition without external pose detector models. Furthermore, we are pioneers in incorporating pose information into the token selection within the video transformer architecture.

In our approach we employ spatio-temporal visual tokens along with human pose, represented through a series of heatmap tokens, to predict a range of driver action classes and identify the most informative tokens. To this end we prune spatio-temporal visual tokens, referred to as video tokens, that do not sufficiently attend to semantic and classification ones (i.e., those relevant to action recognition). With our new merging method, the pruned tokens are summarized by averaging similar dropped tokens, which ensures that information that was not captured by the pruning process is not lost.

In summary, the contributions of our work are:

  • We propose a Multi-task Video Transformer for driver action recognition that leverages human keypoints location heatmaps to select the most informative video tokens. Human pose information is inferred within the Video Transformer itself, thus increasing efficiency and robustness.

  • We guide the token selection with pose and class information. In this way, we significantly reduce the GFLOPS, while maintaining the accuracy of the large baseline model.

  • Our representation of human pose, based on a set of heatmaps of keypoints, allows for multi-person representation, which makes it possible to use our method on multi-actor datasets.

  • Our method surpasses the state-of-the-art in driver action recognition while being much more efficient than current video transformer based methods.

2 Related work

In this section we review the human action recognition literature and the specifics of driver action recognition. Recognizing actions in videos requires considering variations in the actors’ positions and poses, their movements, as well as their interaction with objects.

Human Action Recognition. One way to introduce motion information is to compute convolutions both in the image and time dimensions with 3D CNNs [18]. To speed up 3D-CNNs we can shift the images in the temporal dimension to use 2D convolutions [33], factorize convolutions into temporal and spatial components [51, 59] or do 3D convolutions only in the deepest layers [59]. One of the most popular approaches to add motion is the two-stream CNN [49, 56, 6] using both RGB and optical flow maps. However, optical flow only provides information at a short temporal scale. More recent works use Recurrent Neural Networks (RNN) [14] on top of a two-stream network [56] to introduce a longer but still limited temporal context. A better way of introducing a temporal context is the use of a transformer on top of 3D-CNN extracted features [24]. The introduction of video transformers in action recognition allows for a holistic temporal context to be established [8, 7], albeit with a quadratic complexity in the number of video tokens.

The human pose (i.e. set of articulation joint locations or keypoints), is a very discriminative feature for action recognition [23]. Some early works used the human pose in action recognition explicitly with CNNs. The simplest way of using pose is to sample CNN feature maps at the location of human pose joints either from two-stream networks [9] or from a 3D CNN [5]. Early action recognition methods using CNNs realized that body parts descriptors [19] were needed to improve results with shallow CNNs. More recent methods [14, 16] use attention to discover relevant features. The information in the probability maps, or heatmaps, corresponding to the location of body keypoints has proven to be very discriminative in action recognition [10, 35, 61, 48, 2, 26, 63]. Another line of work uses 3D skeletal representations of the body [25] or directly a 3D Morphable Model of the full body [47].

Driver action recognition. Recognizing driver distraction requires a comprehensive and varied dataset to effectively encapsulate the range of potential behaviors encountered in real-world scenarios. Popular datasets such as Drive&Act [42], DMD [45], 3MDAD [22], and “100 drivers” [54] encompass a broad spectrum of driver interactions, actions, and cameras. The Drive&Act dataset is particularly prevalent within this research area due to its large extent. The standard way of evaluation is to use the front-top camera with the Near Infrared (NIR) modality. Early methods [42] adopted this methodology as a baseline, including finetuned two-stream 3D CNNs pre-trained in other action recognition datasets, such as Kinetics [6], I3D [6] and also skeleton-based methods [42]. A more recent work, CTA-Net [57], uses CNNs, self-attention layers and LSTMs. TransDARC [46], a modern work using Video Swin Transformer [36], gets the best performance in the standard test of Drive&Act. Similar lines of work can be seen on the 3MDAD [22] dataset. Some recent methods use lightweight transformers, such as Feature Pyramid Vision Transformer [53] or LW-transformer [43] within a teacher-student framework. Other recent works, DACNet [50] and MIFI [28], explore the multi-view (DACNet, MIFI) and the multi-modal approaches (DACNet). An alternative modality to RGB or Infrared (IR) images is the use of 2D or 3D body pose in the form of keypoint locations, sometimes given as heatmaps and others as a graph [31, 21, 30]. The main drawback of these action recognition models is that they require an external model to detect the keypoints in the image.

Video Transformers Computational requirements. The quadratic complexity of transformers in the number of tokens is a fundamental limitation for their use in real-time video analysis. It can be addressed in different ways. One is to factorize the attention along the spatial and temporal dimensions [3]. Another is to merge similar tokens into new ones [4] or to prune the less promising tokens in different attention layers [32, 40, 7]. The visual tokens with less attention to the classification token are removed and grouped into one in EViT [32], while in PPT [40] the pruned ones are those with less attention to pose keypoints. EVAD [7] uses attention to the visual tokens on a key-frame to select which ones to keep.

Our proposal. We use a multi-task strategy, estimating both human pose heatmaps and action recognition, which differs from the usual and less efficient approach using externally provided landmarks [2, 26, 63]. Moreover, our pose-guided video token selection method is able to keep the accuracy of the baseline model while being able to reduce the computation by 25% and improve the accuracy by 8% over the current state-of-the-art transformer in the Drive&Act dataset (see Table 2).

3 POse-GUIded multi-task video transformer with token SElection (PO-GUISE)

Our approach, PO-GUISE, incorporates a video transformer pre-trained as a Masked AutoEncoder (VideoMAEv2) [55] as its encoding mechanism. The video transformer is finetuned in different action recognition datasets. To facilitate human body keypoints localization and guide our token selection, we have integrated the pose heatmaps prediction and action classification tasks. Additionally, to mitigate the computational demands associated with video transformer models, we introduce the PO-GUISE module, which effectively reduces the number of video tokens. A comprehensive visual representation of our model is provided in Fig. 1. In the following sections, we will provide a detailed explanation of each component within our model.

Figure 1: Our architecture consists of 4 stages. An input clip is tokenized and processed by a ViT encoder alongside learnable class and heatmap tokens. Within the encoder, green and blue blocks denote standard ViT layers and token selection modules respectively.

3.1 Video Transformer and human-pose processing

Consider a video segment, or clip, with $T\times C\times H\times W$ where $T$ is the number of frames and $C, H, W$ are the channels, height, and width of each frame. In our experiments, we define $T=16$, $C=1$, $H=224$ and $W=224$ respectively. To process this clip with a video transformer [55], we use the joint space-time cube embedding [3]. This technique samples non-overlapping cubes from the input video clip, which are then fed into the embedding layer. This method segments a video sequence by using cubes of dimension $\mathbb{R}^{2\times C\times 16\times 16}$, resulting in $X\in\mathbb{R}^{t\times C\times h\times w}$, where $t=\frac{T}{2}$, $h=\frac{H}{16}$, $w=\frac{W}{16}$. We then project $X$ to a token of dimension $D$ using a linear embedding layer, resulting in an input tensor with shape $X_{vis}\in\mathbb{R}^{N_{vis}\times D}$, where $N_{vis}=t\times h\times w$.
Next, we apply a positional embedding to each token, and a learnable class token, $X_{cls}\in\mathbb{R}^{1\times D}$, is concatenated to the sequence. For the computation of human-pose heatmaps, our model incorporates $N_{p}=h\times w$ learnable tokens into the input sequence, defined as $X_{p}\in\mathbb{R}^{N_{p}\times D}$. The complete sequence of tokens, including the class, pose, and video tokens $X=(X_{cls},X_{p},X_{vis})$, is then processed through a standard ViT architecture. The transformed class token $X_{cls}$ is used in a multilayer perceptron (MLP) for the classification task, while the $X_{p}$ pose tokens are passed through a heatmap head to be compared against the ground truth heatmap for pose estimation.
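
As a quick sanity check of the token bookkeeping above, the following minimal sketch computes the sequence length from the clip size reported in the text. The function name `token_counts` is ours and purely illustrative; only the clip dimensions and cube size come from the description above.

```python
# Token bookkeeping for the clip size reported in the text.
T, C, H, W = 16, 1, 224, 224          # frames, channels, height, width
CUBE_T, CUBE_H, CUBE_W = 2, 16, 16    # space-time cube size

def token_counts(T, H, W):
    """Return (N_vis, N_p, N_total) for a clip of T frames of H x W pixels."""
    t, h, w = T // CUBE_T, H // CUBE_H, W // CUBE_W
    n_vis = t * h * w            # video tokens
    n_p = h * w                  # heatmap (pose) tokens: one frame's worth
    n_total = 1 + n_p + n_vis    # plus the single class token
    return n_vis, n_p, n_total

n_vis, n_p, n_total = token_counts(T, H, W)
print(n_vis, n_p, n_total)  # 1568 196 1765
```

With these settings the heatmap tokens add 196 entries to a sequence of 1568 video tokens, which is the source of the extra cost of the pose task before any token selection is applied.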

3.2 Human-pose estimation task

A crucial part of our approach involves utilizing heatmaps, which enhance the training process and facilitate token selection. These heatmaps are derived from learnable tokens, akin to those in PPT [40]. However, our method diverges from PPT’s image-only processing by extending its capabilities to handle video inputs and multi-person heatmap predictions.

Heatmap prediction starts with the introduction of additional tokens to the network, $X_{p}$. These correspond to the number of tokens that are obtained after tokenizing a single frame. After passing through the encoder, these tokens are processed by a lightweight decoder to convert the tokens into heatmaps. The decoder’s architecture consists of two deconvolution layers followed by a convolution layer with a 1 × 1 kernel and with output channels equal to the number of landmarks $L$ [60]. The output of this decoder is then directly compared with the ground truth heatmaps by measuring the mean-squared error.

While these tokens are inherently capable of predicting heatmaps for an individual frame within a video clip, we can adapt them to capture the entire sequence of movements by modifying the ground truth labels. The use of heatmaps instead of coordinate representations provides greater flexibility by allowing the incorporation of additional information directly within the heatmaps, without requiring any structural changes to the network architecture. For instance, we generate time-aware heatmaps by averaging the ground truth heatmaps across a video clip, which yields a ground truth heatmap where each keypoint is located in its average zone within the clip. Likewise, the framework can be extended to predict multi-person heatmaps by combining detection data from multiple individuals inside a single heatmap.
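
The construction of such targets can be sketched as follows. This is a toy illustration under several assumptions of ours: the heatmap size (14 × 14), the Gaussian shape and its width, and the element-wise maximum used to combine several people are not specified in the text and are chosen only for the example.

```python
import math

def gaussian_heatmap(cx, cy, size=14, sigma=1.0):
    """One keypoint heatmap (size x size) with a Gaussian centred at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(size)] for y in range(size)]

def average_over_time(frames):
    """Time-aware target: element-wise mean of the per-frame heatmaps."""
    n, size = len(frames), len(frames[0])
    return [[sum(f[y][x] for f in frames) / n for x in range(size)]
            for y in range(size)]

def combine_people(maps):
    """Multi-person target: element-wise maximum (an assumed combination rule)."""
    size = len(maps[0])
    return [[max(m[y][x] for m in maps) for x in range(size)]
            for y in range(size)]

# A keypoint moving from x=3 to x=5 over three frames, at constant y=7.
clip = [gaussian_heatmap(cx, 7) for cx in (3, 4, 5)]
target = average_over_time(clip)
peak = max((v, (x, y)) for y, row in enumerate(target) for x, v in enumerate(row))
print(peak[1])  # (4, 7)
```

The averaged target peaks at the mean position of the keypoint within the clip, which matches the "average zone" behaviour described above.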

3.3 POse-GUIded token SElection module (PO-GUISE)

The use of joint space-time cube embeddings for processing videos is computationally expensive, which is not ideal for use in cars or other environments with limited computing power. Videos naturally contain repetitive information over time and areas with no information for action recognition. Thus, we propose the use of token pruning to reduce computation without losing important content.

We introduce a novel approach named PO-GUISE. This method leverages the informative content of class and heatmap tokens to improve the process of token selection. Furthermore, to prevent the loss of potentially valuable information, PO-GUISE also merges some of the tokens that were not initially selected during the pruning step. This merging step is crucial as it compensates for any potentially relevant data that might not have been identified by the pruning algorithm.

We integrate our PO-GUISE module into the network architecture at specific intervals: the first module is placed after three layers, and subsequently after every 2 layers, resulting in a total of three PO-GUISE modules inside a ViT-base model. By doing so, we aim to strike a balance between reducing computational load and maintaining the critical information necessary for processing the video effectively.
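
To make the placement concrete, the sketch below lists the layers that are followed by a selection module and shows how the number of video tokens would shrink if only pruning were applied. The keep ratio of 0.7 and the rounding-down rule are hypothetical values of ours, and the function names are illustrative.

```python
def module_positions(first=3, step=2, count=3):
    """1-based indices of the encoder layers followed by a selection module."""
    return [first + i * step for i in range(count)]

def video_tokens_per_stage(n_vis, rho, count=3):
    """Video tokens left after each module if only pruning is applied."""
    counts = [n_vis]
    for _ in range(count):
        counts.append(int(counts[-1] * rho))  # rounding down is an assumption
    return counts

print(module_positions())  # [3, 5, 7]
schedule = video_tokens_per_stage(1568, rho=0.7)  # rho = 0.7 is hypothetical
print(schedule)
```

Since self-attention cost grows quadratically with the sequence length, each reduction in this schedule lowers the cost of every subsequent layer. The merged tokens introduced by the module add a small number of tokens back, which this pruning-only sketch ignores.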

3.3.1 Token pruning.

Our token pruning method is inspired by existing methods such as EViT [32] and EVAD [7]. The difference is that our method integrates additional spatial information through the use of heatmap tokens to direct attention to the video tokens where the actors are. Let $\mathcal{A}_{M}\in\mathbb{R}^{M\times N\times N}$ be the attention tensor from $M$ heads, obtained from processing the tokens in $X\in\mathbb{R}^{N\times D}$. We first average across attention heads to condense it into an $N\times N$ matrix, $\mathcal{A}$, representing the attention relationships between token pairs. We then index $\mathcal{A}$ by selecting the attention each visual token in $X_{vis}$ gives to $X_{cls}$ and $X_{p}$, resulting in $\mathcal{A}_{vis}\in\mathbb{R}^{N_{vis}\times(1+N_{p})}$. We then multiply the attention scores of the class token by a small constant factor, $\kappa$, and those of the heatmap tokens by $1-\kappa$, to denote the relative importance between them.
Next, by summing the entries of each row of $\mathcal{A}_{vis}$, we get a vector of token scores, $\mathcal{T}\in\mathbb{R}^{N_{vis}}$. Each element in this vector reflects the aggregated importance of a visual token as influenced by the attention to the semantic tokens, ($X_{cls}$, $X_{p}$). The final pruning decision is based on these aggregated scores, thereby allowing us to retain visual tokens that are deemed most significant in the context of both global class information and local spatial heatmap cues. The computed attention score for a visual token $i$ can also be formulated as:

$$\mathcal{T}(i)=\mathcal{A}_{vis}(i,0)\cdot\kappa+\left(\sum_{j=1}^{N_{p}}\mathcal{A}_{vis}(i,j)\right)\cdot(1-\kappa),$$

where $\mathcal{A}_{vis}(i,j)$ is the attention score from video token $i$ to semantic token $j$, and $\kappa$ is a constant factor to balance the importance between class and heatmap tokens.

Using $\mathcal{T}$, we select the most salient tokens, namely those with the $N_{sel}=N_{vis}\cdot\rho$ highest scores, with $\rho$ being a predefined ratio in the range (0, 1]. This results in a set of selected tokens, $X_{sel}\in\mathbb{R}^{N_{sel}\times D}$, and a set of discarded ones, $X_{disc}\in\mathbb{R}^{(N_{vis}-N_{sel})\times D}$, to be processed in the next network block.
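
The scoring and selection steps can be illustrated with a small sketch. The toy attention values, the choice $\kappa=0.1$ and the rounding-down of $N_{sel}$ are assumptions made for the example; the function names are ours and do not refer to any released implementation.

```python
def token_scores(attn_vis, kappa):
    """Score each video token from its attention to the semantic tokens.

    attn_vis[i][0] is the attention of video token i to the class token;
    attn_vis[i][1:] holds its attention to the N_p heatmap tokens.
    """
    return [row[0] * kappa + sum(row[1:]) * (1.0 - kappa) for row in attn_vis]

def select_tokens(scores, rho):
    """Split token indices into (selected, discarded), keeping the top share rho."""
    n_sel = int(len(scores) * rho)  # rounding down is an assumption
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_sel]), sorted(order[n_sel:])

# Toy head-averaged attention: 4 video tokens, 1 class + 2 heatmap tokens.
attn_vis = [
    [0.10, 0.30, 0.20],
    [0.40, 0.05, 0.05],
    [0.05, 0.01, 0.02],
    [0.20, 0.25, 0.25],
]
scores = token_scores(attn_vis, kappa=0.1)  # kappa = 0.1 is hypothetical
selected, discarded = select_tokens(scores, rho=0.5)
print(selected, discarded)  # [0, 3] [1, 2]
```

Note that token 1 attends most strongly to the class token but is still discarded, because with a small $\kappa$ the heatmap tokens dominate the score.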

3.3.2 Token merging.

The process of token pruning might exclude information that may be important for later processing stages, or information that isn’t immediately apparent from examining the attention between classes and the associated heatmaps. To address this, we incorporate a token merging phase for $X_{disc}$ resulting from the pruning phase. To identify and merge tokens with closely aligned features we use cosine similarity. Our approach modifies the merging strategy presented in ToMe [4] by implementing an alternative matching algorithm that better adapts to our context. By eliminating the constraint in ToMe [4] that only up to 50% of the tokens could be eliminated, our algorithm increases the flexibility by allowing the merging and removal of an arbitrary number of tokens as selected by $N_{merge}=N_{disc}\cdot\lambda$, where $N_{disc}=N_{vis}-N_{sel}$ is the number of discarded tokens and $\lambda$ is a predefined ratio in the range (0, 1].

This phase begins by taking a feature tensor (either $Q$ (queries), $K$ (keys), or $attn$ (attention)) obtained from $X_{disc}$. We then compute the pairwise cosine similarity for these tokens, generating a similarity matrix $S\in\mathbb{R}^{N_{disc}\times N_{disc}}$, masking the corresponding diagonals to avoid self-merging. Each row indicates how similar a token is to all the other tokens in $X_{disc}$.

Next, for each row in $S$, we identify the index of the token that has the highest similarity, which becomes the merge candidate for the token in question. Then, to obtain only the most similar tokens, we find the $N_{merge}$ tokens with the strongest similarity to their respective candidates. This selective aggregation ensures that information from tokens with substantial similarity is preserved. Next, each selected token is merged with its candidate by averaging, resulting in a new set of tokens, $X_{merge}$. Finally, $X_{merge}$ and $X_{sel}$ are concatenated to be processed by the next network block. This process ensures that potentially relevant information is not lost and is passed on to subsequent layers.
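
A minimal sketch of this merging step is given below, assuming non-zero feature vectors, a rounded-down $N_{merge}$, and no special handling of mutually nearest pairs (a detail the description above does not cover). The toy tokens and features are ours, as is the function name `merge_discarded`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def merge_discarded(tokens, features, lam):
    """Merge the most similar discarded tokens by averaging.

    tokens:   discarded token embeddings (list of vectors)
    features: vectors used for matching, e.g. keys (one per token)
    lam:      share of the discarded tokens to merge
    """
    n = len(tokens)
    n_merge = int(n * lam)  # rounding down is an assumption
    best = []               # (similarity, token index, candidate index)
    for i in range(n):
        # Skipping j == i plays the role of masking the diagonal of S.
        s, j = max((cosine(features[i], features[j]), j) for j in range(n) if j != i)
        best.append((s, i, j))
    best.sort(reverse=True)  # strongest token-candidate matches first
    return [[(a + b) / 2 for a, b in zip(tokens[i], tokens[j])]
            for _, i, j in best[:n_merge]]

tokens = [[2.0, 0.0], [4.0, 2.0], [0.0, 8.0], [-2.0, 0.0]]
features = [[1.0, 0.0], [1.0, 0.25], [0.0, 1.0], [-1.0, 0.0]]
merged = merge_discarded(tokens, features, lam=0.25)
print(merged)  # [[3.0, 1.0]]
```

In this sketch, a pair of tokens that are each other's best match can be selected twice when $N_{merge}$ is large enough, producing a duplicated average; a full implementation would need to decide how to treat that case.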

4 Experiments

In this section we evaluate our multi-task video transformer. In all the experiments HM stands for Heatmaps computed for: P single human pose (body joints location) or MP multiple-people poses. PR stands for using token pruning by: C using attention to the class token; MF using attention to the middle frame visual tokens; or P using attention to the tokens used to compute human pose heatmaps. MG stands for the proposed pruned tokens merging.

4.1 Datasets

We evaluate our method on two driving action recognition datasets. Drive&Act [42] is a multi-modal driver action recognition dataset containing 12 hours of driving over 15 different drivers and different views. We use fine-grained labels which consist of 34 unique activities that a driver might engage in, such as eating or working on a laptop. The dataset is divided into three predefined splits/folds for training and evaluation with no driver overlap. We report the average of results on the three test sets. An issue with this dataset is the high imbalance between the classes. As such, there is a need to make a distinction between the metrics reported. For consistency with previous works we present both the top-1 accuracy (micro accuracy) and the average-per-class accuracy (macro accuracy). However, we will only use the macro accuracy in our discussion, since it takes into account the class imbalance and gives a better understanding of the model’s overall performance. We use only the front-top (inner_mirror) view taken from a NIR camera for comparison with previous methods. 3MDAD [22] is another dataset that offers synchronized multimodal and multi-view data. It includes “safe driving” and 15 common distraction activities performed by 50 drivers in the daytime. It does not provide a predefined train/test split; therefore, we employ k-fold cross-validation with k = 5. In each fold, the 50 drivers are divided into a 40/10 train/test split, ensuring that drivers appearing in the test subset are not present in the training subset. All experiments are done on the RGB modality using the side camera.
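
The difference between the two metrics can be seen in a small example. The labels and predictions below are invented for illustration and are not taken from any dataset; the function name is ours.

```python
from collections import defaultdict

def micro_macro_accuracy(y_true, y_pred):
    """Return (micro, macro) accuracy for two equal-length label lists."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    micro = sum(correct.values()) / len(y_true)                     # per-sample average
    macro = sum(correct[c] / total[c] for c in total) / len(total)  # per-class average
    return micro, macro

# Imbalanced toy set: 8 clips of a frequent class, 2 of a rare one.
y_true = ["working_on_laptop"] * 8 + ["eating"] * 2
y_pred = ["working_on_laptop"] * 8 + ["working_on_laptop", "eating"]
print(micro_macro_accuracy(y_true, y_pred))  # (0.9, 0.75)
```

A single error on the rare class barely moves the micro accuracy but halves that class's contribution to the macro accuracy, which is why the latter is more informative under class imbalance.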

Additionally, we also use for evaluation a popular human action recognition dataset, HMDB51 [29], consisting of 6766 Internet videos over 51 classes. We report the average classification accuracy on standard three-fold splits.

4.2 Implementation details

For our experiments, we utilize a ViT-base model with pre-trained weights from VideoMAEv2 [55]. We use the AdamW [38] optimizer with a Cosine Annealing learning rate scheduler [37]. We use CrossEntropy loss for classification and log-scaled MSE for heatmap prediction. We additionally use Nash-MTL [44] to balance both tasks. Data augmentation includes Cutmix [62], Mixup [64] and RandAug [12]. We train with $T=16$ frames from the input clip and subsequently resize them to 224 × 224 pixels. For every dataset, we have estimated the human pose for each frame using ViTPose [60]. The detailed hyperparameters used for the experiments in Drive&Act, 3MDAD, and HMDB51 can be seen in the supplementary material.

4.3 Ablation study

For the ablation experiments (see Table 1), we use the Drive&Act dataset fold 0 for training and testing. We train all models in this section only for 100 epochs due to limited computational resources. Additionally, in this section, we refer to ’macro accuracy’ as ’accuracy’. Our baseline result is obtained by finetuning a state-of-the-art Video Transformer, VideoMAEv2 [55] pre-trained on Kinetics [6]. The accuracy for the baseline (see VideoMAEv2-base in Table 1) is 62.47%.

Table 1: Test results on fold 0 of the Drive&Act dataset for different model configurations. Trained with only 100 epochs due to limited computational resources.

| Method | Micro Acc. (↑) | Macro Acc. (↑) | GFLOPS (↓) |
| --- | --- | --- | --- |
| VideoMAEv2-base | 81.59 | 62.47 | 360 |
| + PR(C) | 78.08 | 59.43 | 226 |
| + PR(MF) | 80.06 | 58.05 | 226 |
| + PR(C) + MG | 81.09 | 60.80 | 226 |
| + HM(P) | 80.91 | 63.32 | 418 |
| + HM(P) + PR(C) | 79.12 | 63.22 | 268 |
| + HM(P) + PR(C+P) | 81.68 | 64.85 | 268 |
| + HM(P) + PR(C+P) + MG | 81.55 | 65.43 | 269 |

4.3.1 Comparison with baseline.

First, we test the baseline plus semantic information in the form of a human pose estimation task (baseline+HM(P) in Table 1). On average, it increases the accuracy over all actions by 0.85 points. However, the pose information provides a significant improvement in the accuracy of some actions, as illustrated in Fig. 2. In fact, actions that involve interaction with objects, such as sunglasses, a door, a seat belt, or a bottle, and where the body pose is important, get the greatest improvement, around 20%. The main drawback is that the computational cost increases by 58 GFLOPS due to the extra tokens that need to be processed.

Figure 2: Per class accuracy on Drive&Act dataset fold 0. Bars illustrate the performance of the baseline model, red ’X’ marks that of the model augmented with heatmap data and ’+’ symbols the results of PO-GUISE. For a better presentation, we have grouped classes that are related to time-based activities, such as opening and closing a bottle, putting on or taking off sunglasses.

We also compare three token selection configurations on the baseline model while keeping the GFLOPS similar across experiments. We test pruning by attention to the class token (baseline+PR(C)), pruning by attention to the middle-frame visual tokens (baseline+PR(MF)), and adding our token merging solution to class-token pruning (baseline+PR(C)+MG). All three configurations lose accuracy with respect to the baseline. PR(MF), similar to the method in EVAD[7], loses more than PR(C): 4.42 vs. 3.04 points. This suggests that not all visual tokens in the middle frame are informative and that it is better to rely on the class token for pruning. PR(C)+MG loses the least, only 1.67 points, which suggests that merging preserves valuable information that pruning alone discards. This is crucial for maintaining accuracy while improving computational efficiency. Note that token pruning reduces GFLOPS by 37% (from 360 to 226) and that merging adds no significant computation.

The last set of experiments assesses the influence of different token selection methods on the multi-task model (baseline + HM(P)). The first interesting result is that pruning affects the multi-task model only slightly (0.1 points below the unpruned multi-task model, 63.22% vs. 63.32%) compared with how much it affects the baseline model (about 3 points below the unpruned baseline). Token pruning guided by class and pose tokens (baseline + HM(P) + PR(C+P)) outperforms pruning based solely on class information (baseline + HM(P) + PR(C)) by 1.63 points. Moreover, the full PO-GUISE model, which adds token merging, improves on PR(C) by 2.21 points. These results highlight the effectiveness of pose-guided pruning and of the merging process in selecting task-relevant tokens efficiently.
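The pruning variants compared above differ only in which tokens act as queries when scoring the visual tokens. The sketch below illustrates this idea under explicit assumptions: attention weights are already averaged over heads, the scores of several query tokens are combined by a plain mean, and the top $\lceil \rho N \rceil$ candidates are kept. The function `select_tokens`, the token layout and the attention values are hypothetical and only serve to show how adding a pose token as a query can change the selection.

```python
import math


def select_tokens(attn, query_idx, candidate_idx, rho):
    """Keep the ceil(rho * N) candidates that receive most attention.

    attn[q][k] is the (head-averaged) attention from query q to key k.
    Scores from several query tokens are combined by a plain mean, which
    is an assumption of this sketch.
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    scores = {k: sum(attn[q][k] for q in query_idx) / len(query_idx)
              for k in candidate_idx}
    n_keep = math.ceil(rho * len(candidate_idx))
    ranked = sorted(candidate_idx, key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:n_keep]), sorted(ranked[n_keep:])


# Hypothetical layout: token 0 = class token, token 1 = pose token,
# tokens 2..5 = visual tokens. Only the two query rows are needed.
attn = [
    [0.10, 0.10, 0.40, 0.25, 0.10, 0.05],  # attention from the class token
    [0.10, 0.10, 0.05, 0.15, 0.50, 0.10],  # attention from the pose token
]
visual = [2, 3, 4, 5]
print(select_tokens(attn, [0], visual, 0.5))     # class only: ([2, 3], [4, 5])
print(select_tokens(attn, [0, 1], visual, 0.5))  # class + pose: ([2, 4], [3, 5])
```

In this toy case token 4 is discarded when only the class token is used, but it is retained once the pose token contributes to the score.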

4.3.2 Efficiency analysis.

In this experiment we explore the tradeoff between accuracy and computational cost incurred by different token selection methods applied to the multi-task model (baseline+HM(P)). In Fig. 3 (left) we show the curves of GFLOPS vs. accuracy obtained by training with different values of $\rho$ and $\lambda$. For the experiments + HM(P) + PR(C+P) and + HM(P) + PR(P), $\rho \in \{0.3, 0.4, 0.5, 0.7\}$. For the + HM(P) + PR(C+P) + MG experiments, $\rho \in \{0.3, 0.4, 0.6\}$ and $\lambda \in \{0.1, 0.2, 0.3\}$.
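To give a feeling for what these keep rates mean in terms of token counts, the following sketch computes the token budget of a single selection stage. It rests on several assumptions that are not stated in this section: the input is tokenised with 2 × 16 × 16 space-time cubes (the usual VideoMAE patch embedding), $\rho$ is the fraction of tokens kept by pruning, and $\lambda$ is the fraction of the discarded tokens that survive as merged tokens. How the rates are applied across the stages of the network is not modelled here, and both function names are hypothetical.

```python
import math


def num_video_tokens(frames=16, height=224, width=224, tubelet=2, patch=16):
    """Token count for non-overlapping space-time cubes (assumed tokeniser)."""
    return (frames // tubelet) * (height // patch) * (width // patch)


def one_stage_budget(n_tokens, rho, lam=0.0):
    """Tokens kept by pruning and tokens produced by merging in ONE stage.

    Assumed semantics: rho keeps ceil(rho * n_tokens) tokens; lam keeps
    ceil(lam * discarded) merged tokens out of the discarded ones.
    """
    kept = math.ceil(rho * n_tokens)
    discarded = n_tokens - kept
    merged = math.ceil(lam * discarded) if lam > 0 else 0
    return kept, merged


n = num_video_tokens()  # 8 * 14 * 14 = 1568 tokens for a 16-frame clip
for rho in (0.3, 0.4, 0.6):
    for lam in (0.1, 0.2, 0.3):
        kept, merged = one_stage_budget(n, rho, lam)
        print(f"rho={rho} lambda={lam}: {kept} kept + {merged} merged of {n}")
```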

Figure 3: Left: Comparison between GFLOPS and accuracy for different configurations. Results on fold 0 Drive&Act. Right: Comparison between GFLOPS and accuracy between our proposed models and Transdarc[46]. Averaged over the three folds of Drive&Act.

The curve associated with PR(C+P)+MG is always on top for the different proportions of selected tokens ($\rho$). The gap to the same pruning method without token merging (PR(C+P)) is significant, while not using the pose tokens in pruning reduces performance even further for all values of $\rho$.

In Fig. 3 (right), we can see that our method, PO-GUISE, achieves the best tradeoff between speed and accuracy for action recognition on Drive&Act compared with other video transformers. Note that Transdarc was the previous state of the art on Drive&Act.

4.3.3 Discussion.

Our contribution, the use of human pose information both as an auxiliary task and as a token selection criterion, allows a top-performing video transformer to increase its accuracy by 2.96 points with a remarkable reduction in GFLOPS of about 25.3% (from 360 to 269). The increase in macro accuracy means that some of the less represented actions are better recognized by our method. Actions involving interaction with objects are around 20% more accurate than with the baseline. This makes our method a promising research direction for making current video transformers usable for driver distraction monitoring.
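The headline figures can be recomputed from the fold 0 values in Table 1. The short check below uses only the tabulated macro accuracy and GFLOPS of VideoMAEv2-base and of the full model (+ HM(P) + PR(C+P) + MG); since the table reports rounded values, the result should be read as agreeing with the text up to rounding.

```python
# Values copied from Table 1 (fold 0 of Drive&Act, 100 epochs).
baseline = {"macro_acc": 62.47, "gflops": 360}
po_guise = {"macro_acc": 65.43, "gflops": 269}

# Absolute gain in macro accuracy, in percentage points.
acc_gain = po_guise["macro_acc"] - baseline["macro_acc"]

# Relative reduction in computation, in percent of the baseline cost.
gflops_cut = 100.0 * (baseline["gflops"] - po_guise["gflops"]) / baseline["gflops"]

print(f"macro accuracy gain: {acc_gain:.2f} points")  # 2.96 points
print(f"GFLOPS reduction: {gflops_cut:.1f}%")         # 25.3%
```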

4.4 Visualizations

In Fig. 4, we present the tokens selected for pruning and merging at each stage of the network, for the first frame of the input video clip. With each progressing stage, the tokens primarily concentrate around the person, indicating the network’s focus on the subject.

Figure 4: Sample tokens from a frame processed by PO-GUISE at each of the three stages of the network, showing selected, discarded, and merged tokens. Red shows tokens that were discarded and blue those that were selected for merging in that stage.

4.5 Comparison with the state-of-the-art (SOTA)

We compare PO-GUISE with state-of-the-art techniques on two driver action recognition datasets, Drive&Act (Table 2) and 3MDAD (Table 3), and on HMDB51 (Table 4), a popular small-scale human action recognition dataset.

On the Drive&Act dataset (see Table 2), our model sets a new state-of-the-art result. We surpass the previous best result by Transdarc[46], which employs VideoSwin-base as its backbone architecture. Specifically, our PO-GUISE model achieves a macro accuracy 7.94 points higher while also reducing the computational cost by 12 GFLOPS. Although other methods such as 3D-studentNet [34] and CTA-NET [57] also show competitive performance with lower computational costs, their results are not directly comparable with ours since they use different modalities for training and testing, and they do not report macro accuracy.

For the 3MDAD dataset, our model also establishes a new state of the art, as shown in Table 3. We beat the previous best result, set by the multi-view model MIFI[28], at a comparable computational cost: accuracy improves by 5.35 points and the computational requirements drop by 13 GFLOPS.

On the HMDB51 dataset (see Table 4), our model delivers results on par with current state-of-the-art methods. PO-GUISE is comparable to other video transformer models such as BIKE[58] but at a significantly lower computational cost (269 GFLOPS compared to 932 GFLOPS for BIKE[58]). However, many studies on this dataset incorporate additional modalities such as optical flow or pose data, making direct comparisons challenging. The R(2+1)D+BERT[24] model outperforms ours and other state-of-the-art models, potentially due to its ability to process a larger temporal context (64 frames compared to our 16 frames), which gives the model a better understanding of the complex scenes in this dataset.

Table 2: Test results on the Drive&Act dataset for the front-top NIR camera. For a fair comparison, we distinguish between micro and macro accuracy. '*' denotes methods reproduced using their original code and weights. '+' denotes methods reproduced using MMAction2[11].
Method Micro Acc. (↑) Macro Acc. (↑) GFLOPS (↓)
st-MLP[21] - 33.51 -
Pose[42] - 44.36 -
Interior[42] - 40.30 -
2-stream[42] - 45.39 -
3-stream[42] - 46.95 -
I3D+ [6] 71.50 48.87 33
SMOMS [30] - 51.39 -
OA-SAR [31] - 54.07 -
CTA-NET[57] 65.25 - -
3D-studentNet[34] 65.69 - 37
Transdarc[46] 66.92 55.30 281
VideoMAEv2-base 78.39 62.58 360
+ HM(P) 79.04 63.89 418
+ HM(P) + PR(C+P) + MG 78.76 63.24 269
Table 3: Test results on the 3MDAD dataset (Acc. = averaged 5-fold test accuracy).
Method Acc. (↑) GFLOPS (↓)
FPT[53] 39.88 0.98
LW-transformer[43] 70.39 0.329
DADCNet[50] 77.08 16
I3D-MIFI[28] Single-view* 75.8 111
I3D-MIFI[28] Multi-view* 83.9 223
VideoMAEv2-base (baseline) 91.34 360
+ HM(P) 92.90 418
+ HM(P) + PR(C+P) (0.6) + MG (0.3) 92.62 269
+ HM(P) + PR(C+P) + MG 89.25 210
Table 4: Test results on the HMDB51 dataset (Acc. = averaged 3-fold test accuracy). $T$ is the clip size in frames although, depending on the temporal stride, the network could use fewer frames as input. In pre-train, KT stands for the Kinetics [6] dataset and IN stands for the ImageNet [13] dataset. R stands for RGB images as input modality, F for optical flow images, H for human pose heatmaps, AF for Part Affinity Fields and T for text. † Methods that use optical flow do not include flow computation in the GFLOPS.
Method Pre-train Modality $T$ Acc. (↑) GFLOPS (↓)
TSN [56] IN R+F 25 68.5 -
TSM [33] KT R 16 73.5 65
ρBYOL(S3D-G) [17] KT R 64 75.0 -
S3D-G [59] IN+KT R 64 75.9 -
R(2+1)D [51] KT R 16 74.5 -
I3D [6] IN+KT R 64 74.8 -
R(2+1)D [51] KT R+F 16 78.7 -
I3D [6] IN+KT R+F 64 80.7 -
PoTion+I3D [10] KT R+F+H 64 80.9 -
PA3D+I3D [61] KT R+F+H+AF 64 82.1 -
BIKE(ViT-L)[58] KT R+T 16 83.1 932
ResNext101+BERT [24] KT R+F 64 83.55 72
SMART+TSN [20] KT R+F - 84.3 -
[48]+ResNext101+BERT KT R+F+H 64 84.53 -
R(2+1)D+BERT [24] IG65M R 32 83.99 152
R(2+1)D+BERT [24] IG65M R 64 85.10 -
VideoMAEv2-giant [55] KT R 16 88.1 38,160
VideoMAEv2-base KT R 16 82.30 360
+ HM(P) KT R 16 83.11 418
+ HM(MP) KT R 16 83.64 418
+ HM(MP)+PR(C+P)+MG KT R 16 82.83 269

5 Conclusions

Current video transformers achieve the best performance in driver distraction recognition at an unacceptable computational cost for automotive applications [55]. Token selection is a promising way of reducing the computation burden in these models. The problem with current token selection methods is that they do not use semantic information from the task.

We have shown that adding human pose estimation as an auxiliary task improves accuracy. This differs from current methods, which rely on external pose detectors. In addition, pose tokens carry semantic information that is useful for selecting the most informative tokens for the action recognition task. Our method reduces the number of video tokens in such a way that GFLOPS drop by about 25.3% while accuracy increases by 2.96 points. Most importantly, the accuracy on actions involving interaction with objects, where the location of human body parts in the image is of vital importance, increases by around 20%. Still, further research is needed to reduce the computational cost and improve the accuracy of PO-GUISE. We plan to continue this line of research and explore the inclusion of other semantic tasks that improve token selection, especially for actions where the human pose is less informative (e.g., getting into a car), for which we have observed that performance drops. We also plan to investigate better token selection methods, in both the pruning and fusion domains, and to test the model in a real prototype.

Appendix A Extended implementation details

A.1 Training parameters

Table 5 shows the parameters used for each experiment. For the Drive&Act and 3MDAD datasets we discard the landmarks under the knee. $\rho$ and $\lambda$ correspond to the token keep rates used in PO-GUISE for the TopK pruning and merge methods, respectively.

Table 5: Training parameters used in the main paper experiments.
Configuration Drive&Act 3MDAD HMDB51
Pre-trained weights vit_b_k710_dl_from_giant
MSE scaling factor 1000
Learning rate backbone 8e-06 7e-06 8e-06
Learning rate heads 0.0005 5e-07 0.0004
Optimizer AdamW
Learning rate scheduler Cosine Annealing
RandAug. M 7
RandAug. N 4
label smoothing 0.1
CutMix & Mixup prob. 1.0
CutMix & Mixup switch prob. 0.5
Gradient clipping 1.5
accumulate_grad_batches 2
Batch size 16
Merge feat. sim. matrix K (Keys)
Inference # temporal views 3
Inference # visual views 1
Epochs 200 350 100
#Landmarks 13 13 17
PO-GUISE $\rho$ 0.6 0.4 0.6
PO-GUISE $\lambda$ 0.3 0.2 0.3

A.2 Token merging details

In Algorithm 1 we show the details of the token merging procedure used in PO-GUISE for token selection. The main difference between our merge method and ToMe[4] is the candidate selection algorithm. In ToMe[4], candidate selection is based on splitting the tokens into two sets and computing the similarity between them, which results in a maximum token retention rate of $\lambda \leq 0.5$. In contrast, our method allows a more flexible token retention range of $(0, 1]$ by computing the similarity matrix between all pairs of tokens.

In Table 6 we compare our proposed merge method with ToMe[4]. First, we show the results of our model (+ HM(P) + PR(C+P) + MG) and of the same model with our merge method replaced by ToMe[4] (+ HM(P) + PR(C+P) + ToMe). The replacement causes a loss of 0.9 points in macro accuracy.

Finally, in the last section of Table 6 we compare our token reduction method in PO-GUISE (+ HM(P) + PR(C+P) + MG), with a lower token retention rate ($\rho = 0.4$ and $\lambda = 0.1$), against ToMe[4] used as a standalone token reduction technique. We obtain an increase of 2.11 points in macro accuracy while reducing GFLOPS by 26. This indicates that PO-GUISE, which leverages task-relevant information for token selection, offers a better tradeoff between accuracy loss and computational cost than purely merge-based methods in this task.

Table 6: Comparison between the proposed merge method and ToMe[4]. Results on the test set of fold 0 of the Drive&Act dataset.
Method Micro Acc. (↑) Macro Acc. (↑) GFLOPS (↓)
VideoMAEv2-base 81.59 62.47 360
+ HM(P) + PR(C+P) + MG 81.55 65.43 269
+ HM(P) + PR(C+P) + ToMe[4] 80.06 64.53 269
+ HM(P) + ToMe[4] ($\lambda = 0.5$) 78.58 60.39 222
+ HM(P) + PR(C+P) (0.4) + MG (0.1) 80.87 62.50 196
Algorithm 1 Token Merging Strategy
$X \in \mathcal{R}^{N \times d}$: Original feature tensor
$F \in \mathcal{R}^{N' \times d}$: Feature tensor of unselected tokens
$k$: Number of tokens to merge based on similarity
$F_{merged} \in \mathcal{R}^{M \times d}$: Merged feature tensor
$S \in \mathcal{R}^{N' \times N'}$: Similarity matrix
// Compute cosine similarity for unselected tokens
for $i = 1$ to $N'$ do
     for $j = 1$ to $N'$ do
         $S_{ij} \leftarrow \frac{F_i \cdot F_j}{\|F_i\| \|F_j\|}$ ▷ Cosine similarity between tokens $i$ and $j$
     end for
end for
$S \leftarrow S - diag(diag(S))$ ▷ Set diagonal to zero, avoiding self-merging
// Identify merge candidates based on similarity
for $i = 1$ to $N'$ do
     $merge\_candidate[i] \leftarrow \textsc{Max}(S_{i,:})$
end for
// Select the top-k similar tokens for merging based on the values in $S$
$merge\_candidate \leftarrow sort(merge\_candidate)[:k]$
// Select the X source tokens and merge them with the selected tokens
$F_{merged} \leftarrow mean(X[merge\_candidate], axis=0)$ ▷ Average the selected tokens along the first axis
return $F_{merged}$
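As an illustration, Algorithm 1 can be sketched in plain Python as follows. This is one possible reading of the pseudocode, not the reference implementation: the candidates are ranked by their highest similarity to any other unselected token in descending order, the $k$ best are averaged into a single merged token, and the average is taken over rows of $F$ rather than over the original tensor $X$, so the mapping of indices from $F$ back to $X$ is omitted. The function names and the toy tokens are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def merge_tokens(F, k):
    """Average the k unselected tokens with the strongest best match.

    F is a list of N' feature vectors. Returns (chosen indices, merged token).
    """
    n = len(F)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of tokens")
    # Full N' x N' similarity matrix with a zero diagonal (no self-merging).
    S = [[0.0 if i == j else cosine(F[i], F[j]) for j in range(n)]
         for i in range(n)]
    # Best similarity of each token to any other token.
    best = [max(row) for row in S]
    # Top-k tokens by best similarity (descending order is an assumption).
    order = sorted(range(n), key=lambda i: best[i], reverse=True)
    chosen = sorted(order[:k])
    # Average the chosen tokens along the token axis.
    dim = len(F[0])
    merged = [sum(F[i][c] for i in chosen) / k for c in range(dim)]
    return chosen, merged


# Toy tokens: the first two are nearly parallel, so they are merged first.
F = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
chosen, merged = merge_tokens(F, 2)
print(chosen, merged)
```

Because the similarity matrix covers all pairs of unselected tokens, $k$ can take any value from 1 to $N'$, in contrast to a bipartite split, where each token of one set can be merged into at most one token of the other.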

Appendix B Extended results

To have the full reference of results on Drive&Act we show in Table 7 the per-fold test results.

Table 7: Per-fold test results of PO-GUISE on the Drive&Act dataset.
Fold # Micro Acc. (↑) Macro Acc. (↑)
0 83.26 66.65
1 76.15 63.45
2 76.88 59.64
Average 78.76 63.24

In Fig. 2 of the main paper we merged the following classes for easier visualization:

  • interacting_bottle: opening_bottle, closing_bottle

  • interacting_sunglasses: putting_on_sunglasses, taking_off_sunglasses

  • interacting_laptop: opening_laptop, closing_laptop, working_on_laptop

  • interacting_jacket: putting_on_jacket, taking_off_jacket

  • interacting_backpack: opening_backpack, putting_laptop_into_backpack, taking_laptop_from_backpack, placing_an_object

  • eating: eating, preparing_food

  • interacting_seat_belt: unfastening_seat_belt, fastening_seat_belt

  • interacting_door: closing_door_outside, opening_door_outside, closing_door_inside, opening_door_inside

In Figures 5, 6, and 7, we present the confusion matrices for the baseline model (VideoMAEv2), the model enhanced with heatmap prediction (+HM), and PO-GUISE, respectively. As in Fig. 2 of the main paper, we have merged some classes for easier visualization. Comparing the baseline model (Fig. 5) with the other two models, performance gains occur in categories where the driver interacts with an object, e.g. a bottle, sunglasses, seat belt or phone, and, most prominently, in the sitting still class. Meanwhile, performance decreases in the +HM (Fig. 6) and PO-GUISE (Fig. 7) models for classes related to the person entering or exiting the car (interacting_door, exiting_car and entering_car). This is because the token selection process is not fully effective in scenarios where no actor is present.

Figure 5: Confusion matrix of the merged classes on Drive&Act dataset fold 0 for the baseline VideoMAEv2 model.
Figure 6: Confusion Matrix of the merged classes on Drive&Act dataset fold 0 for the +HM model.
Figure 7: Confusion Matrix of the merged classes on Drive&Act dataset fold 0 for our PO-GUISE model.

Appendix C Heatmap visualization

In Table 8 we can see different heatmap prediction results for different multi-task training schemes.

Table 8: MAE heatmap prediction results in Drive&Act. We compare PO-GUISE without any task scaling against models with different task scaling techniques.
Method MAE
Dual-task 0.0108
+ log scaling 0.0081
+ Nash-MTL[44] 0.0065

We have used ViTPose[60] to compute the pseudo-labels used for training. Additionally, for both training and the visualizations shown in this section, we use a mask based on the prediction confidence of ViTPose[60] to determine whether a heatmap is valid. In all figures, we show the thresholded heatmap of each landmark and its corresponding decoded key-point as a dot. In Figures 8 and 9 we can see the heatmaps predicted by PO-GUISE. The model correctly identifies the position of all landmarks in a variety of scenarios. Fig. 10 shows samples from multi-person scenarios. In the fencing example, PO-GUISE correctly detects the most prominent persons in the clip.
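The two operations described here, masking invalid heatmaps by pseudo-label confidence and decoding a key-point from a heatmap, can be sketched as follows. The sketch assumes that decoding takes the location of the heatmap maximum and that a landmark is valid when its confidence reaches a fixed threshold; the threshold of 0.5, the function names and the 2 × 2 toy heatmaps are hypothetical.

```python
def decode_keypoint(heatmap):
    """Return (row, col, peak value) of the maximum of a 2D heatmap."""
    best = (0, 0, heatmap[0][0])
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best[2]:
                best = (r, c, value)
    return best


def masked_mae(pred, target, confidences, threshold=0.5):
    """Mean absolute error over landmarks with a confident pseudo-label.

    pred and target are lists of 2D heatmaps, one per landmark;
    confidences holds one pose-estimator confidence per landmark.
    """
    total, count = 0.0, 0
    for p, t, conf in zip(pred, target, confidences):
        if conf < threshold:
            continue  # heatmap considered invalid, excluded from the error
        for p_row, t_row in zip(p, t):
            for p_val, t_val in zip(p_row, t_row):
                total += abs(p_val - t_val)
                count += 1
    return total / count if count else 0.0


# Two landmarks; the second pseudo-label is unreliable and is masked out.
target = [[[0.0, 1.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
pred = [[[0.0, 0.8], [0.0, 0.0]], [[0.5, 0.0], [0.0, 0.0]]]
confidences = [0.9, 0.2]
print(decode_keypoint(pred[0]))
print(masked_mae(pred, target, confidences))
```

Masking prevents low-confidence pseudo-labels, for example those of occluded landmarks, from contributing to the training signal or to the reported error.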

Figure 8: Sample heatmaps from the Drive&Act dataset test set using PO-GUISE. The first column corresponds to the middle frame of the video clip, the second column displays the labels obtained from ViTPose [60], and the third column shows the predicted heatmaps. The decoded key-point locations are shown as colored dots.
Figure 9: Sample single-person heatmaps from HMDB51 using PO-GUISE. The first column corresponds to the middle frame of the video clip, the second column displays the labels obtained from a ViTPose model[60], and the third column shows the heatmaps predicted by our method.
Figure 10: Sample multi-person heatmaps from HMDB51 using PO-GUISE. The first column corresponds to the middle frame of the video clip, the second column displays the labels obtained from ViTPose [60], and the third column shows the predicted heatmaps.

References

  • [1] https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Road_safety_statistics_in_the_EU#The_number_of_persons_killed_in_road_traffic_accidents_increased_in_2021.2C_after_decreasing_continuously_since_2011
  • [2] Ahn, D., Kim, S., Hong, H., Ko, B.C.: Star-transformer: A spatio-temporal cross attention transformer for human action recognition. In: IEEE Winter Conf. on Appl. of Comput. Vis. pp. 3330–3339 (January 2023)
  • [3] Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lucic, M., Schmid, C.: Vivit: A video vision transformer. In: ICCV. pp. 6816–6826. IEEE (2021)
  • [4] Bolya, D., Fu, C., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your vit but faster. In: ICLR (2023)
  • [5] Cao, C., Zhang, Y., Zhang, C., Lu, H.: Action recognition with joints-pooled 3d deep convolutional descriptors. In: Kambhampati, S. (ed.) IJCAI. pp. 3324–3330. IJCAI/AAAI Press (2016)
  • [6] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR. pp. 4724–4733 (2017)
  • [7] Chen, L., Tong, Z., Song, Y., Wu, G., Wang, L.: Efficient video action detection with token dropout and context refinement. In: ICCV (2023)
  • [8] Chen, Y., Chen, D., Liu, R., Li, H., Peng, W.: Video action recognition with attentive semantic units. In: ICCV. pp. 10136–10146. IEEE (2023)
  • [9] Chéron, G., Laptev, I., Schmid, C.: P-CNN: pose-based CNN features for action recognition. In: ICCV. pp. 3218–3226. IEEE Computer Society (2015)
  • [10] Choutas, V., Weinzaepfel, P., Revaud, J., Schmid, C.: Potion: Pose motion representation for action recognition. In: CVPR. pp. 7024–7033. Computer Vision Foundation / IEEE Computer Society (2018)
  • [11] MMAction2 Contributors: Openmmlab’s next generation video understanding toolbox and benchmark. https://github.com/open-mmlab/mmaction2 (2020)
  • [12] Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: Practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. pp. 702–703 (2020)
  • [13] Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR. pp. 248–255. IEEE Computer Society (2009)
  • [14] Du, W., Wang, Y., Qiao, Y.: RPAN: an end-to-end recurrent pose-attention network for action recognition in videos. In: ICCV. pp. 3745–3754. IEEE Computer Society (2017)
  • [15] Duan, H., Zhao, Y., Chen, K., Lin, D., Dai, B.: Revisiting skeleton-based action recognition. In: CVPR. pp. 2959–2968 (2022)
  • [16] Fang, H., Cao, J., Tai, Y., Lu, C.: Pairwise body-part attention for recognizing human object interactions. In: ECCV. Lecture Notes in Computer Science, vol. 11214, pp. 52–68. Springer (2018)
  • [17] Feichtenhofer, C., Fan, H., Xiong, B., Girshick, R., He, K.: A large-scale study on unsupervised spatiotemporal representation learning. In: CVPR. pp. 3298–3308 (2021)
  • [18] Feichtenhofer, C., Pinz, A., Wildes, R.P.: Spatiotemporal residual networks for video action recognition. In: NeurIPS. pp. 3468–3476 (2016)
  • [19] Gkioxari, G., Girshick, R.B., Malik, J.: Actions and attributes from wholes and parts. In: ICCV. pp. 2470–2478. IEEE Computer Society (2015)
  • [20] Gowda, S.N., Rohrbach, M., Sevilla-Lara, L.: SMART frame selection for action recognition. In: AAAI. pp. 1451–1459. AAAI Press (2021)
  • [21] Holzbock, A., Tsaregorodtsev, A., Dawoud, Y., Dietmayer, K., Belagiannis, V.: A spatio-temporal multilayer perceptron for gesture recognition. In: 2022 IEEE Intelligent Vehicles Symposium (IV). pp. 1099–1106 (2022). https://doi.org/10.1109/IV51971.2022.9827054
  • [22] Jegham, I., Khalifa, A.B., Alouani, I., Mahjoub, M.A.: A novel public dataset for multimodal multiview and multispectral driver distraction analysis: 3mdad. Signal Processing: Image Communication 88, 115960 (2020)
  • [23] Jhuang, H., Gall, J., Zuffi, S., Schmid, C., Black, M.J.: Towards understanding action recognition. In: ICCV. pp. 3192–3199. IEEE Computer Society (2013)
  • [24] Kalfaoglu, M.E., Kalkan, S., Alatan, A.A.: Late temporal modeling in 3d cnn architectures with bert for action recognition. pp. 731–747. Springer International Publishing (2020)
  • [25] Kim, B., Chang, H.J., Kim, J., Choi, J.Y.: Global-local motion transformer for unsupervised skeleton-based action learning. In: Avidan, S., Brostow, G.J., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV. Lecture Notes in Computer Science, vol. 13664, pp. 209–225. Springer (2022)
  • [26] Kim, S., Ahn, D., Ko, B.C.: Cross-modal learning with 3d deformable attention for action recognition. In: ICCV. pp. 10231–10241 (2023)
  • [27] Koay, H.V., Chuah, J.H., Chow, C.O., Chang, Y.L.: Detecting and recognizing driver distraction through various data modality using machine learning: A review, recent advances, simplified framework and open challenges (2014–2021). Engineering Applications of Artificial Intelligence 115, 105309 (2022)
  • [28] Kuang, J., Li, W., Li, F., Zhang, J., Wu, Z.: Mifi: Multi-camera feature integration for robust 3d distracted driver activity recognition. IEEE Transactions on Intelligent Transportation Systems 25(1), 338–348 (2024)
  • [29] Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: Hmdb: a large video database for human motion recognition. In: ICCV. pp. 2556–2563. IEEE (2011)
  • [30] Li, T., Li, X., Ren, B., Guo, G.: An effective multi-scale framework for driver behavior recognition with incomplete skeletons. IEEE Trans. Veh. Technol. 73(1), 295–309 (2024)
  • [31] Li, Z., Guo, H., Chau, L.P., Tan, C.H., Ma, X., Lin, D., Yap, K.H.: Object-augmented skeleton-based action recognition. In: 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS). pp. 1–4 (2023)
  • [32] Liang, Y., Ge, C., Tong, Z., Song, Y., Wang, J., Xie, P.: Evit: Expediting vision transformers via token reorganizations. In: ICLR (2022)
  • [33] Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: ICCV. pp. 7082–7092. IEEE (2019)
  • [34] Liu, D., Yamasaki, T., Wang, Y., Mase, K., Kato, J.: Toward extremely lightweight distracted driver recognition with distillation-based neural architecture search and knowledge transfer. IEEE Transactions on Intelligent Transportation Systems 24(1), 764–777 (2023)
  • [35] Liu, M., Yuan, J.: Recognizing human actions as the evolution of pose estimation maps. In: CVPR. pp. 1159–1168. Computer Vision Foundation / IEEE Computer Society (2018)
  • [36] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video swin transformer. In: CVPR. pp. 3192–3201. IEEE (2022)
  • [37] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: International Conference on Learning Representations (2017), https://openreview.net/forum?id=Skq89Scxx
  • [38] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019), https://openreview.net/forum?id=Bkg6RiCqY7
  • [39] Luvizon, D.C., Picard, D., Tabia, H.: 2d/3d pose estimation and action recognition using multitask deep learning. In: CVPR. pp. 5137–5146. Computer Vision Foundation / IEEE Computer Society (2018)
  • [40] Ma, H., Wang, Z., Chen, Y., Kong, D., Chen, L., Liu, X., Yan, X., Tang, H., Xie, X.: Ppt: token-pruned pose transformer for monocular and multi-view human pose estimation. In: ECCV. pp. 424–442. Springer (2022)
  • [41] Martin, M., Lerch, D., Voit, M.: Viewpoint invariant 3d driver body pose-based activity recognition. In: IEEE Intelligent Vehicles Symposium, IV. pp. 1–6. IEEE (2023)
  • [42] Martin, M., Roitberg, A., Haurilet, M., Horne, M., Reiß, S., Voit, M., Stiefelhagen, R.: Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In: ICCV. pp. 2801–2810 (2019)
  • [43] Mohammed, A.A., Geng, X., Wang, J., Ali, Z.: Driver distraction detection using semi-supervised lightweight vision transformer. Engineering Applications of Artificial Intelligence 129, 107618 (2024)
  • [44] Navon, A., Shamsian, A., Achituve, I., Maron, H., Kawaguchi, K., Chechik, G., Fetaya, E.: Multi-task learning as a bargaining game. arXiv preprint arXiv:2202.01017 (2022)
  • [45] Ortega, J.D., Kose, N., Cañas, P., Chao, M.A., Unnervik, A., Nieto, M., Otaegui, O., Salgado, L.: Dmd: A large-scale multi-modal driver monitoring dataset for attention and alertness analysis. In: Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16. pp. 387–405. Springer (2020)
  • [46] Peng, K., Roitberg, A., Yang, K., Zhang, J., Stiefelhagen, R.: Transdarc: Transformer-based driver activity recognition with latent space feature calibration. In: Int. Conf. Intelligent Robots and Systems. pp. 278–285 (2022)
  • [47] Rajasegaran, J., Pavlakos, G., Kanazawa, A., Feichtenhofer, C., Malik, J.: On the benefits of 3d pose and tracking for human action recognition. In: CVPR. pp. 640–649. IEEE (2023)
  • [48] Shah, A., Mishra, S., Bansal, A., Chen, J., Chellappa, R., Shrivastava, A.: Pose and joint-aware action recognition. In: IEEE Winter Conf. on Appl. of Comput. Vis. pp. 141–151. IEEE (2022)
  • [49] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (eds.) NeurIPS. vol. 27 (2014)
  • [50] Su, L., Sun, C., Cao, D., Khajepour, A.: Efficient driver anomaly detection via conditional temporal proposal and classification network. IEEE Transactions on Computational Social Systems 10(2), 736–745 (2023)
  • [51] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: CVPR. pp. 6450–6459. Computer Vision Foundation / IEEE Computer Society (2018)
  • [52] Directorate General for Transport: Road safety thematic report – Driver distraction. European Road Safety Observatory. European Commission (2022)
  • [53] Wang, H., Chen, J., Huang, Z., Li, B., Lv, J., Xi, J., Wu, B., Zhang, J., Wu, Z.: Fpt: Fine-grained detection of driver distraction based on the feature pyramid vision transformer. IEEE Transactions on Intelligent Transportation Systems 24(2), 1594–1608 (2023)
  • [54] Wang, J., Li, W., Li, F., Zhang, J., Wu, Z., Zhong, Z., Sebe, N.: 100-Driver: A large-scale, diverse dataset for distracted driver classification. IEEE Transactions on Intelligent Transportation Systems 24(7), 7061–7072 (2023)
  • [55] Wang, L., Huang, B., Zhao, Z., Tong, Z., He, Y., Wang, Y., Wang, Y., Qiao, Y.: VideoMAE V2: Scaling video masked autoencoders with dual masking. In: CVPR. pp. 14549–14560 (June 2023)
  • [56] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V.: Temporal segment networks: Towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV. Lecture Notes in Computer Science, vol. 9912, pp. 20–36. Springer (2016)
  • [57] Wharton, Z., Behera, A., Liu, Y., Bessis, N.: Coarse temporal attention network (CTA-Net) for driver’s activity recognition. In: IEEE Winter Conf. on Appl. of Comput. Vis. pp. 1278–1288 (January 2021)
  • [58] Wu, W., Wang, X., Luo, H., Wang, J., Yang, Y., Ouyang, W.: Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models. In: CVPR. pp. 6620–6630 (2023)
  • [59] Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In: ECCV. Lecture Notes in Computer Science, vol. 11219, pp. 318–335. Springer (2018)
  • [60] Xu, Y., Zhang, J., Zhang, Q., Tao, D.: ViTPose: Simple vision transformer baselines for human pose estimation. Advances in Neural Information Processing Systems 35, 38571–38584 (2022)
  • [61] Yan, A., Wang, Y., Li, Z., Qiao, Y.: PA3D: Pose-action 3D machine for video recognition. In: CVPR. pp. 7914–7923 (2019)
  • [62] Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: CutMix: Regularization strategy to train strong classifiers with localizable features. In: ICCV. pp. 6023–6032 (2019)
  • [63] Zhang, H., Leong, M.C., Li, L., Lin, W.: PGVT: Pose-guided video transformer for fine-grained action recognition. In: IEEE Winter Conf. on Appl. of Comput. Vis. pp. 6645–6656 (January 2024)
  • [64] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=r1Ddp1-Rb