Towards AI-Powered Video Assistant Referee System (VARS) for Association Football

Jan Held [1], Anthony Cioppa [1,2], Silvio Giancola [2], Abdullah Hamdi, Christel Devue, Bernard Ghanem, Marc Van Droogenbroeck

[1] University of Liège (ULiège), Belgium
[2] King Abdullah University of Science and Technology (KAUST), Saudi Arabia
[3] University of Oxford, United Kingdom

Abstract

Over the past decade, the technology used by referees in football has improved substantially, enhancing the fairness and accuracy of decisions. This progress has culminated in the implementation of the Video Assistant Referee (VAR), an innovation that enables backstage referees to review incidents on the pitch from multiple points of view. However, the VAR is currently limited to professional leagues due to its expensive infrastructure and the lack of referees worldwide. In this paper, we present the Video Assistant Referee System (VARS) that leverages the latest findings in multi-view video analysis. Our VARS achieves a new state of the art on the SoccerNet-MVFoul dataset by recognizing the type of foul in 50% of instances and the appropriate sanction in 46% of cases. Finally, we conducted a comparative study to investigate human performance in classifying fouls and their corresponding severity and compared these findings to our VARS. The results of our study highlight the potential of our VARS to reach human performance and support football refereeing across all levels of professional and amateur federations.

keywords:
Football, Soccer, Artificial Intelligence, Computer Vision, Video Recognition, Automated Decision, Video Assistant Referee, Referee Success Rate, Fouls evaluation

1 Introduction

In recent years, technology has played an increasing role in football, revolutionizing how the game is played, coached, and officiated. This transformation extends into the domain of sports video analysis, which encompasses a diverse range of challenging tasks, including player detection and tracking Cioppa2020Multimodal ; Maglo2022Efficient ; Vandeghen2022SemiSupervised ; Somers2024SoccerNetGameState , spotting actions in untrimmed videos Cioppa2020AContextaware ; Giancola2021Temporally ; Hong2022Spotting ; Soares2022Temporally ; Soares2022Action-arxiv ; Giancola2023Towards ; Cabado2024Beyond ; Kassab_2024 , pass feasibility and prediction ArbuesSanguesa2020Using ; Honda2022Pass , summarizing Gautam_2022 ; Midoglu_2024 ; Sushant_2022 ; Midoglu2022MMSys , camera calibration Magera2024AUniversal , player re-identification in occluded scenarios Somers2023Body , or dense video captioning for football broadcast commentaries Mkhallati2023SoccerNetCaption ; Andrews2024AiCommentator . Solving these tasks has been taken to a higher level thanks to the emergence of deep learning techniques Su2015Multiview ; Bahdanau2014Neural-arxiv ; Vaswani2017Attention-arxiv . Similar to many other fields in which deep learning has been used, the advancements in sports video understanding heavily rely on the availability of large-scale datasets Pappalardo2019Apublic ; Yu2018Comprehensive ; Scott2022SoccerTrack ; Jiang2020SoccerDB ; VanZandycke2022DeepSportradarv1 . SoccerNet Giancola2018SoccerNet ; Deliege2021SoccerNetv2 ; Cioppa2022Scaling ; Cioppa2022SoccerNetTracking ; Held2023VARS ; Cioppa2023SoccerNetChallenge-arxiv ; Leduc2024SoccerNetDepth ; Held2024XVARS ; Gautam2024SoccerNetEchoes-arxiv stands among the largest and most comprehensive sports datasets, with extensive annotations for video understanding in football.

In refereeing, the biggest revolution was introduced by the Video Assistant Referee (VAR) in 2016 Spitz_2020 . The system involves a team of referees located in a video operation room outside the stadium. These referees have access to all available camera views and check all decisions taken by the on-field referee. If the VAR indicates a probable “clear and obvious error” (e.g., when the referee misses a penalty or a red card, gives a yellow card to the wrong player, etc.), it will be communicated to the on-field referee, who can then review his decision in the referee review area before taking a final decision. The VAR helps to ensure greater fairness in the game by reducing the impact of incorrect decisions on the outcome of games. Notably, in 8% of the matches, the VAR has a decisive impact on the result of the game DeDiosCrespo2021TheContribution and it slightly reduces the unconscious bias of referees towards home teams Holder2021Monitoring . On average, away teams now score more goals and receive fewer yellow cards Dufner2023TheIntroduction . Controversial referee mistakes like the famous “hand of God” goal by Diego Maradona during the quarter-final match Argentina versus England of the 1986 FIFA World Cup, Josip Šimunić getting three yellow cards in a single game at the 2006 FIFA World Cup, or Thierry Henry’s handball preventing the Republic of Ireland from qualifying for the World Cup could have been avoided with the VAR and would have changed football history.

Despite its potential benefits, the use of the VAR technology remains limited to professional leagues. The infrastructure of the VAR is expensive, including multiple cameras to analyze the incident from different angles, video operation rooms in various locations, and VAR officials hired to analyze the footage. Leagues with financial limitations cannot afford the necessary infrastructure to operate the VAR. In addition to the upfront costs of the infrastructure, there is also an ongoing expense associated with using the VAR. The officials who serve as Video Assistant Referees require specialized training Armenteros2021Educating and monetary compensation following each game. Given the implementation and operational costs of VAR, its use is currently restricted to professional leagues. A further obstacle is the shortage of referees worldwide. In Germany, there were only 50,241 active referees during the 2020/2021 season, whereas the number of games played each weekend was around 90,000 DFB2022Anzahl ; Zeppenfeld2023Anzahl . The introduction of the VAR requires an additional team of referees per game, which is not feasible for semi-professional or amateur leagues. Finally, each referee interprets the Laws of the Game IFAB2022Laws slightly differently, resulting in different decisions for similar actions. Given that the VAR changes from one game to another, inconsistencies may arise, with the VAR making different decisions for similar actions across different matches.

In this paper, we present the “Video Assistant Referee System” (VARS), which could support or extend the current VAR. Our VARS fulfills the same objectives and tasks as the VAR. By analyzing fouls from a single or a multi-camera video feed, it indicates a probable “clear and obvious error”, and can communicate this information to the referee, who will then decide whether to initiate a “review”. The proposed VARS automatically analyzes potential incidents that can then be shown to the referee in the referee review area. Just like the regular VAR, our VARS serves as a support system for the referee and only alerts him in the case of potential game-changing mistakes, but the final decision remains in the hands of the main referee. The main benefit of our VARS is that it no longer requires additional referees, making it the perfect tool for leagues that do not have enough financial or human resources.

Contributions. We summarize our contributions and novelties as follows: (i) We propose an upgraded version of the VARS presented by Held et al. Held2023VARS . We introduce an attention mechanism on the different views and calculate an importance score to allocate more attention to more informative views before aggregating the views. (ii) We present a thorough study on the influence of using multiple views and different types of camera views on the performance of our VARS. (iii) We present a comprehensive human study where we compare the performance of human referees, football players, and our VARS on the task of type of foul classification and offense severity classification. Our human study also illustrates the subjectivity of refereeing decisions by examining the inter-rater agreement among referees.

2 Methodology

We propose an upgraded version of the Video Assistant Referee System, which adds an advanced pooling technique to combine the information from multiple views, extracting the most relevant information based on our attention mechanisms.

Figure 1: Architecture of our Video Assistant Referee System. From multi-view video clips input, our system encodes per-view video features ($\mathbf{E}$), aggregates the view features ($\mathbf{A}$), and classifies different properties ($\mathbf{C_{Foul}}$ and $\mathbf{C_{Off}}$). The figure is inspired by Held2023VARS .

The architecture is shown in Figure 1. Formally, the VARS takes multiple video clips $\mathbf{v}=\{v_{i}\}_{1}^{n}$ as input, which show the same action from $n$ different perspectives. Each clip $v_{i}$ is fed into a video encoder $\mathbf{E}$ to extract a spatio-temporal feature vector $f_{i}$ of dimension $d$. All feature vectors $f_{i}$ are then stored in a matrix $\mathbf{f}$ as follows:

$$\mathbf{f}=\begin{bmatrix}f_{1},f_{2},\dots,f_{n}\end{bmatrix}^{T}. \tag{1}$$

An aggregation block $\mathbf{A}$ takes $\mathbf{f}$ as input and outputs a single multi-view representation $\mathbf{R}$. A multi-head classifier, $\mathbf{C}^{\text{foul}}$ and $\mathbf{C}^{\text{off}}$, simultaneously predicts the fine-grained type of foul class and the offense severity class. For each task, the VARS selects the class with the highest confidence in the respective confidence vector as the final prediction, following:

$$\mathbf{VARS}^{t}\leftarrow\mathop{\mathrm{argmax}}\mathbf{C}^{t}_{\theta_{C^{t}}}(\mathbf{R}),\quad\forall t\in\{\text{foul},\text{off}\}, \tag{2}$$

where $\theta_{C^{t}}$ corresponds to the parameters of the classification head for task $t\in\{\text{foul},\text{off}\}$. The model is trained by minimizing the unweighted summation of both task losses $\mathcal{L}^{\text{foul}}$ and $\mathcal{L}^{\text{off}}$.
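The prediction rule of Eq. (2) and the unweighted joint loss can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each head outputs raw logits that are turned into confidences with a softmax, and that each task loss is a cross-entropy (as stated in the training details). The function names and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Eq. (2): index of the class with the highest confidence."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

def cross_entropy(logits, target):
    """Cross-entropy of one sample given raw logits and the true class index."""
    return -math.log(softmax(logits)[target])

def joint_loss(foul_logits, foul_target, off_logits, off_target):
    """Unweighted sum of the two task losses (foul type and offense severity)."""
    return cross_entropy(foul_logits, foul_target) + cross_entropy(off_logits, off_target)

# Hypothetical head outputs for one action: 8 foul classes, 4 severity classes.
foul_logits = [0.1, 2.0, -1.0, 0.3, 0.0, -0.5, 0.2, -2.0]
off_logits = [0.5, 1.5, 0.2, -1.0]

foul_pred = predict(foul_logits)   # class index chosen by the foul head
off_pred = predict(off_logits)     # class index chosen by the severity head
loss = joint_loss(foul_logits, 1, off_logits, 1)
```

Because the two losses are simply added, neither task is favored explicitly; any weighting between them would be an additional design choice that the text does not describe.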

Figure 2: Architecture of the attention block. “MatMul” represents matrix multiplication, “T” denotes transpose, “Norm” signifies normalization, and “SumRow” indicates the process of summing each row.

Video Encoder E. Based on the work presented in Held2023VARS , the best performance is obtained with a video encoder that extracts spatial and temporal features. In the following, we use the state-of-the-art video encoder MViT Fan2021Multiscale ; Li2022MViTv2 pretrained on Kinetics Kay2017TheKinetics-arxiv , which incorporates a transformer-based architecture with a multiscale feature representation, allowing it to capture spatial and temporal information from video clips.

Multi-view aggregation block A. The original paper Held2023VARS used simple mean or max pooling operations to gather the multi-view information into a unique representation. A major drawback of these pooling approaches is that the combination of the feature vectors is fixed and ignores the relationship between the views. Instead, we propose a new aggregation technique based on an attention mechanism to model such relationships.

Our approach is inspired by the “Integrating Block” presented in Yang2019Learning , where each view is associated with an attention score. However, instead of aggregating multi-view images, we extend the operation to multi-view videos. Technically, we assign an attention score to each view and then calculate the final representation by a weighted sum of the feature vectors. There exist several strategies to assign an attention score to a view. A first naive approach consists of passing each feature vector individually into a learned function. However, this would neglect the relationships between the views and would not provide a relative attention score of the views. A better approach consists of determining the attention score of each view based on its relationships with the other views. To do so, we first take the dot product (denoted by $\cdot$) of $\mathbf{f}$ multiplied by a matrix $W\in\mathbb{R}^{d\times d}$ of trainable weights and its transpose:

$$\mathbf{S}=\mathbf{f}W\cdot(\mathbf{f}W)^{T}. \tag{3}$$

By multiplying the projected matrix $\mathbf{f}W$ with its transpose, we compute the dot product between each pair of projected feature vectors, which measures the similarity between two vectors. The resulting symmetric similarity matrix $\mathbf{S}$ is of dimension $n\times n$, where the value at row $i$ and column $j$ corresponds to the similarity score between view $i$ and view $j$. A higher score indicates a higher similarity between the vectors, while a lower score suggests a lower similarity. Next, we normalize the similarity scores to get a probability-like distribution by passing the matrix $\mathbf{S}$ through a ReLU layer and dividing the result by the sum of all its entries, following:

$$\mathbf{N}=\frac{\mathit{ReLU}(\mathbf{S})}{\sum_{i=1}^{n}\sum_{j=1}^{n}\mathit{ReLU}(\mathbf{S}_{i,j})}. \tag{4}$$

To obtain the attention score for each view, we sum the values in each row of the normalized similarity matrix $\mathbf{N}$. The attention score of a view $i$ is thus the sum of its normalized similarity scores with all views, and it reflects how similar that view is to the others collectively, that is, its overall relevance within the set of views. The reasoning behind this approach is that if a view is highly similar to many other views, it is considered important because it shares visual content with multiple views. On the other hand, if a view is dissimilar to other views, it might be considered less important since it does not contribute significantly to the collective visual information. Formally, we take the sum per row to obtain the attention score $\mathbf{A}$ per view:

$$\mathbf{A}=\sum_{i=1}^{n}\mathbf{N}_{i,j}, \tag{5}$$

where $\mathbf{A}$ is a vector of size $n$ whose $j$-th value corresponds to the attention score of view $j$ with respect to all other views and itself. Since $\mathbf{S}$ is symmetric, so is $\mathbf{N}$, and summing along rows or along columns yields the same scores. The final representation is given by the sum of the extracted feature vectors weighted by their calculated attention scores, following:

$$\mathbf{R}_{i}=\sum_{j=1}^{n}\mathbf{f}_{i,j}\times\mathbf{A}_{j}. \tag{6}$$
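The aggregation of Eqs. (3) to (6) can be traced on a toy example. The sketch below is illustrative only and uses plain Python lists instead of a deep learning framework; `W` is passed in as a fixed matrix because nothing is trained here, and the helper names and the error raised when all similarities are zero are assumptions of this sketch. The rows of `f` are the views, as in Eq. (1), so the code reads `f[j][i]` where Eq. (6) writes $\mathbf{f}_{i,j}$.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def attention_pool(f, W):
    """Aggregate n view features (rows of f, each of size d) into one vector.

    Returns the pooled representation R (size d) and the attention scores A (size n).
    """
    fW = matmul(f, W)                              # n x d projected features
    S = matmul(fW, transpose(fW))                  # Eq. (3): n x n similarity matrix
    relu = [[max(0.0, s) for s in row] for row in S]
    total = sum(sum(row) for row in relu)          # denominator of Eq. (4)
    if total == 0.0:
        # Assumption of this sketch: refuse to normalize when every similarity is zero.
        raise ValueError("all similarities are zero; cannot normalize")
    N = [[s / total for s in row] for row in relu]  # Eq. (4)
    A = [sum(row) for row in N]                     # Eq. (5): one score per view
    d = len(f[0])
    R = [sum(A[j] * f[j][i] for j in range(len(f))) for i in range(d)]  # Eq. (6)
    return R, A

# Toy input: three views with 2-dimensional features; views 0 and 1 agree, view 2 differs.
f = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, standing in for the trainable weights
R, A = attention_pool(f, W)
```

In this toy case the two agreeing views each receive a score of 0.4 and the dissimilar view 0.2, so the pooled vector is [0.8, 0.2]. The scores sum to one because Eq. (4) normalizes by the sum over the whole matrix.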

Classification heads C. A multi-task classification approach is used to classify simultaneously the type of foul, whether it is an offense or not, and its severity. As both tasks are related, learning them together can lead to improved generalization and a better understanding of each task. The model can leverage the relationships between the two tasks to make better predictions. Each classification head consists of two dense layers and takes as input the aggregated representation. The output is a vector whose dimensions correspond to the number of classes in each of the classification problems.

3 Experiments

3.1 Experimental setup

Tasks. We test our VARS on the two classification tasks introduced with the SoccerNet-MVFoul dataset Held2023VARS : Fine-grained foul classification, which is the task of classifying a foul into one of 8 fine-grained foul classes (i.e., “Standing tackling”, “Tackling”, “High leg”, “Pushing”, “Holding”, “Elbowing”, “Challenge”, and “Dive/Simulation”), and Offence severity classification, which is the task of classifying whether an action is an offence, as well as the severity of the foul, defined by four classes: “No offence”, “Offence + No card”, “Offence + Yellow card”, and “Offence + Red card”.

Data. The SoccerNet-MVFoul dataset contains 3,901 actions, composed of at least two videos, the live action and at least one replay, see Figure 1. The views were manually synchronized by a human and no pre-processing of the video clips is necessary. Our VARS is trained on clips of 16 frames, mostly 8 frames before the foul and 8 after the foul, spanning one second temporally with a spatial dimension re-scaled to $224\times 224$ pixels. This approach was chosen because of the high computational cost associated with using a larger number of frames. Future research could explore whether an increase in frame rate or a larger temporal context enhances performance.
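The clip window described above can be sketched as a simple index computation. This is a minimal illustration under stated assumptions: the 16 frames are taken as consecutive indices of an already resampled clip, and the annotated foul frame opens the second half of the window. The function name, the `max_shift` parameter (a stand-in for a random temporal shift at training time), and the clamping at index zero are hypothetical choices, not taken from the authors' code.

```python
import random

def clip_frame_indices(foul_frame, num_frames=16, max_shift=0, rng=random):
    """Return num_frames consecutive frame indices around the annotated foul frame.

    With max_shift = 0 the window holds num_frames // 2 frames before the foul
    frame and the remaining frames from the foul frame onwards. A positive
    max_shift moves the window by a random offset in [-max_shift, max_shift].
    """
    shift = rng.randint(-max_shift, max_shift) if max_shift > 0 else 0
    start = max(0, foul_frame - num_frames // 2 + shift)  # never index before frame 0
    return list(range(start, start + num_frames))

# Centered window for a foul annotated at frame 75: indices 67 to 82.
centered = clip_frame_indices(75)

# Training-time variant with a reproducible random shift of at most 2 frames.
shifted = clip_frame_indices(75, max_shift=2, rng=random.Random(0))
```

Shifting the window changes how many frames fall before and after the annotation, which is one simple way to obtain the "mostly 8 before and 8 after" behavior.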

Training details. The encoder $\mathbf{E}$ is initialized from the pre-trained weights described in the methodology, while the classifier $\mathbf{C}$ is initialized from scratch; both are then trained in an end-to-end fashion. We use a cross-entropy loss, optimized with Adam with a batch size of 6 samples. The learning rate starts at $5\times 10^{-5}$ and is multiplied by 0.3 every 3 steps. To artificially increase the dataset size, we use data augmentation and a random temporal shift to have a flexible number of frames used before and after the foul frame annotation during training. The model begins to overfit after 7 epochs and requires approximately 8 hours of training time on a single NVIDIA V100 GPU.
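The learning-rate schedule above amounts to a step decay, which the following sketch makes explicit. It assumes that a "step" is one scheduler step (the text does not say whether this is an epoch or an iteration) and the function name is hypothetical; step-decay schedulers in common frameworks, such as PyTorch's `StepLR`, follow the same rule.

```python
def learning_rate(step, base_lr=5e-5, gamma=0.3, step_size=3):
    """Step decay: multiply base_lr by gamma once every step_size scheduler steps."""
    return base_lr * gamma ** (step // step_size)

# Learning rate for the first seven scheduler steps (steps 0 to 6).
schedule = [learning_rate(s) for s in range(7)]
```

Under this reading, the rate stays at 5e-5 for steps 0 to 2, drops to 1.5e-5 for steps 3 to 5, and reaches 4.5e-6 at step 6.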

Evaluation metrics. To evaluate the performance of the VARS, SoccerNet-MVFoul uses the classification accuracy, which is the ratio of correctly classified actions to the total number of actions. As SoccerNet-MVFoul Held2023VARS is unbalanced, the authors also suggest a balanced accuracy, which is defined as follows:

$$\text{Balanced Accuracy (BA)}=\frac{1}{N}\sum_{i=1}^{N}\frac{TP_{i}}{P_{i}}, \tag{7}$$

with $N$ being the number of classes, $TP_{i}$ the number of True Positives and $P_{i}$ the number of Positives for class $i$. To ensure a fair comparison, we use the same training, validation, and test sets as those used in the original paper Held2023VARS .
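The difference between plain accuracy and the balanced accuracy of Eq. (7) is easiest to see on a small, deliberately unbalanced example. The sketch below is illustrative: the labels are made up for the example, the function names are hypothetical, and skipping classes that never occur in the ground truth is an assumption of this sketch (Eq. (7) itself presumes every class has at least one positive).

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted class equals the true class."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def balanced_accuracy(y_true, y_pred, num_classes):
    """Eq. (7): mean over classes of TP_i / P_i (the per-class recall)."""
    recalls = []
    for c in range(num_classes):
        positives = sum(1 for t in y_true if t == c)
        if positives == 0:
            continue  # assumption: ignore classes absent from the ground truth
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(true_pos / positives)
    return sum(recalls) / len(recalls)

# Made-up labels: class 0 dominates, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 1, 0]

acc = accuracy(y_true, y_pred)              # 5 of 7 samples are correct
ba = balanced_accuracy(y_true, y_pred, 3)   # mean of recalls 1.0, 0.5 and 0.0
```

Here the accuracy is about 0.71 while the balanced accuracy is 0.5, because the metric gives the rare classes the same weight as the dominant one.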

3.2 Main results

Table 1 shows the results obtained for the fine-grained foul and the offense severity classification task. Compared to the fixed combination of the feature vectors (mean or max pooling), our novel attention mechanism enhances the model’s ability to identify and classify the type of foul by 5% and the balanced accuracy by 1%. This demonstrates the effectiveness of combining the feature vectors of the different views based on their importance compared to max or mean pooling. Similarly, the attention mechanism improves the model’s performance to determine if an action is a foul and the corresponding severity by 3% and the balanced accuracy remains the same compared to max pooling. One might argue that the performance increase is based on the supplementary parameters introduced by the attention mechanism. However, the attention mechanism only adds an extra 0.1% of parameters to the model compared to when using max and mean pooling. This suggests that the performance increase derives from the use of the attention mechanism rather than the introduction of additional parameters.

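| Feat. extr. | Pooling | Type of Foul Acc. | Type of Foul BA | Offence Severity Acc. | Offence Severity BA |
|---|---|---|---|---|---|
| ResNet | Mean | 0.30 | 0.27 | 0.34 | 0.25 |
| ResNet | Max | 0.32 | 0.27 | 0.32 | 0.24 |
| R(2+1)D | Mean | 0.32 | 0.34 | 0.34 | 0.30 |
| R(2+1)D | Max | 0.34 | 0.33 | 0.39 | 0.31 |
| MViT | Mean | 0.44 | 0.40 | 0.38 | 0.31 |
| MViT | Max | 0.45 | 0.39 | 0.43 | 0.34 |
| MViT | Attention | 0.50 | 0.41 | 0.46 | 0.34 |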
Table 1: Multi-task classification. Attention pooling sets a new benchmark on the SoccerNet-MVFoul dataset for all the evaluation metrics and tasks. Type of foul classification accuracy increased by 5% while the balanced accuracy (BA) increased by 1%. We have an increment of 3% for the offense severity classification, while the balanced accuracy stays the same.

3.3 Detailed analysis

Sensitivity analysis. We first investigate the impact of the training dataset size on the performance of our two classification tasks. Figure 3 shows the evolution of the accuracy for different training dataset sizes. For each dataset size, we independently trained and tested the model 10 times to avoid any epistemic uncertainty bias. The tests were all performed on the same test set.

Figure 3: Performance evaluation for different dataset sizes. 100% of the dataset corresponds to 2,319 actions. For each dataset size, we independently trained and tested the model 10 times. The tests were all performed on the same test set. The error bar corresponds to the standard deviation. For 0% of the dataset, we indicate the accuracy by taking a random decision.

As expected, we observe that increasing the dataset size improves the accuracy of our VARS. For the type of foul classification, we notice a significant improvement in accuracy with increasing dataset size, especially at the beginning. However, the accuracy reached a plateau between 40% and 80% of the data. Interestingly, we observed a sudden increase in accuracy when we increased the dataset size from 80% to 100%. This may be attributable to our unbalanced dataset. The dataset contains numerous “Standing tacklings” and “Tacklings”, while many of the other labels are underrepresented. Increasing the dataset size from 40% to 80% may not have improved accuracy if the model still struggles to generalize to certain actions due to a limited number of training samples. However, increasing the dataset size to 100% could have provided the model with the additional data necessary to better generalize actions. Moreover, Figure 3 reveals that our VARS is significantly more prone to epistemic uncertainty for smaller datasets, as indicated by the high standard deviation.

In contrast, the offense severity curve in Figure 3 initially shows a sharp increase, but later demonstrates a slower growth. Yet, with each increase in the dataset size, the accuracy improves, which confirms that more data would further improve the performance. The reason for this lies in the significant variability in the visual appearance of an offense with “No card”, “Yellow card”, or “Red card”. For instance, a yellow card can be the outcome of a tackle, or it can be the result of a player holding an opponent’s shirt. Although both instances may result in a yellow card, their visual representations differ significantly. To accurately determine whether an action is an offense or not and the corresponding severity, the model needs plenty of examples to learn the underlying distribution.

Qualitative results. Figure 4 shows the prediction of our VARS on two examples with a 3-view setup.

Figure 4: Qualitative results. VARS prediction on two examples where the attention score of each view is given in percentage. The ground truth is given in bold and the model prediction with the confidence is given in italic.

In both examples, the VARS correctly determines the type of foul and correctly classifies both actions as a foul with the correct severity. Furthermore, the attention scores offer valuable insights into the contribution of different views or camera angles to the decision-making process of the model. In both cases, the “live action clips” have the lowest attention score, confirming our intuition that they were filmed from too far away to make an accurate decision. Both replays have a similar attention score, as they both offer a lot of information to the model. However, we can see that the most informative view has a slightly higher attention score. The attention score provides insight into which views contribute the most to classifications and helps us better understand how the model processes the visual data. This interpretability is especially important when the VARS is used in practice, as it is essential for fans, players, and referees to understand the reasoning behind decisions and feel confident that the technology is improving the fairness and integrity of the sport. Finally, the attention scores assigned to each view can assist broadcasters in automatically selecting the optimal camera angle for broadcasting purposes. Furthermore, they can support the VAR and help speed up the review process by automatically proposing the most informative camera perspective. This is particularly useful at a professional level, where the VAR can have up to 30 different camera perspectives at their disposal, making finding the optimal camera a challenge on its own. The attention scores would provide valuable information by highlighting the views that are more likely to provide crucial details, to accelerate the decision-making process during the VAR review.
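Proposing the most informative camera perspective from the attention scores reduces to a sort. The sketch below is purely illustrative: the view names and the scores are hypothetical placeholders, not values reported in the paper, and `rank_views` is a helper invented for this example.

```python
def rank_views(attention_scores):
    """Return (view_name, score) pairs sorted from most to least attended."""
    return sorted(attention_scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical attention scores for a 3-view action (placeholders only).
scores = {"live": 0.24, "replay_1": 0.37, "replay_2": 0.39}

ranking = rank_views(scores)
best_view = ranking[0][0]   # view that would be proposed first to the reviewer
```

With many cameras, the same ranking could be truncated to the top few views, so that a reviewer starts from the perspectives the model relied on most.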

4 Human study

In contrast to classical classification tasks that involve well-defined and easily separable classes, determining whether an action in football constitutes a foul may be subjective. Despite the definitions and regulations provided by the Laws of the Game IFAB2022Laws , the rule book published by the IFAB regarding when an action in football is considered a foul and its corresponding severity, these guidelines are still open to interpretation, leading to differing opinions about the same action. In practice, many actions fall into this gray area where both interpretations, foul or no foul, could be considered correct. In this study, we first analyze whether and how the performance of our VARS aligns with human performance (i.e., referees and football players) by comparing the accuracy of the type of foul and offense severity classifications between VARS and our human participants. Secondly, we conduct an inter-rater agreement analysis of human decisions to quantify the extent of agreement among our human participants.

Experimental setup. The study involves two distinct groups of participants with different expertise in football: “Players” and “Referees”. The first group consisted of 15 male individuals aged 18 or older (with a mean M = 23.06 and a standard deviation SD = 3.49 years), who had been playing football for a minimum of three years (M = 8.71 and SD = 3.32 years). The second group consisted of 15 male individuals aged 18 or older (M = 25.33 and SD = 4.51 years), who are certified football referees and have officiated in at least 200 official games (from 223 to 1150 games). Both groups analyzed 77 actions, each presented with three different camera perspectives simultaneously. The participants could review the clips several times and watch the actions in slow motion or frame-by-frame, without any time restriction. To reduce bias, the actions were shown in a different random order to each participant. For each action, we measured the time taken by the participants to make their decision. This time was measured from the moment the participants started the video until they clicked on the ‘Next video’ button. For each action, the participants had the same classification task as presented in Section 3.1. Specifically, they had to determine the type of foul, if the action was a foul or not, and the corresponding severity. For each action, we use the annotations from the SoccerNet-MVFoul dataset as ground truth to determine the accuracy for each participant. An important note is that the participants have a clear advantage over our VARS as they view clips lasting 5 seconds, with a frame rate of 25 fps, while our model gets a 1-second clip at 16 fps as input. Finally, let us note that our study was approved by the local university’s ethics committee (2223-080/5624). All analyses were performed using the JASP software.

4.1 Comparison to human performance

Table 2 shows the average accuracy compared to the ground truth of players, referees, and our VARS, respectively. These results align with similar studies MacMahon_2007 ; Spitz_2016 ; Pizzera2022TheVideo , where the referees had an overall decision accuracy ranging from 45% to 80%.

In terms of the type of foul categorization, players (M = 0.752, SD = 0.055) were numerically more accurate than referees (M = 0.704, SD = 0.120), but this difference was not statistically significant, as shown by an independent samples Student t-test, t(28) = 1.421, p = 0.166, d = 0.519, 95% CI = [-0.214, 1.243]. Mean confidence levels in these categorizations were comparable between players (M = 3.64, SD = 0.28) and referees (M = 3.71, SD = 0.32), t(28) < 1.
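As an illustration of the test reported above, the following minimal sketch recomputes an independent samples Student t-test and Cohen's d from the group summaries alone. It assumes equal variances and uses the rounded means and standard deviations quoted in the text, so the output only approximates the values obtained with JASP on the raw data. The function name `independent_t_from_summary` is ours and purely illustrative.

```python
import math


def independent_t_from_summary(m1, sd1, n1, m2, sd2, n2):
    """Student's t-test (equal variances assumed) from group summaries.

    Returns the t statistic, the degrees of freedom, and Cohen's d
    computed with the pooled standard deviation.
    """
    df = n1 + n2 - 2
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    pooled_sd = math.sqrt(pooled_var)
    standard_error = pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2)
    t = (m1 - m2) / standard_error
    d = (m1 - m2) / pooled_sd
    return t, df, d


# Type-of-foul accuracy: players versus referees (rounded values from the text).
t, df, d = independent_t_from_summary(0.752, 0.055, 15, 0.704, 0.120, 15)
print(f"t({df}) = {t:.3f}, d = {d:.3f}")
```

With these rounded inputs, t and d land close to, but not exactly on, the reported t(28) = 1.421 and d = 0.519; the small gap is expected because the published summaries are rounded to three decimals.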

            Type of Foul      Offence Severity    Time (s)
            Acc.    Conf.     Acc.    Conf.
Players     75%     3.6       58%     3.3         41.53
Referees    70%     3.7       60%     3.6         38.01
VARS        60%     -         51%     -           0.12
Table 2: Accuracy comparison between referees, players, and our VARS. The survey was performed on a subset of the test set of size 77. The time is given in seconds and represents the average time needed to make a decision. Acc. stands for accuracy and Conf. for confidence. A rating of 5 indicates high confidence, while a rating of 1 indicates low confidence.

As for determining if an action corresponds to a foul and the corresponding severity, referees were slightly more accurate (M = 0.594, SD = 0.091) than players (M = 0.582, SD = 0.061). However, this difference was not statistically significant, t(28) = -0.401, p = 0.691, d = -0.147, 95% CI = [-0.862, 0.571]. Although the accuracy of players and referees was comparable, referees were more confident in their severity judgments (M = 3.67, SD = 0.36) than players (M = 3.33, SD = 0.39), t(28) = -2.3, p = 0.029, d = -0.839, 95% CI = [-1.581, -0.084]. Referees’ higher confidence might be due to their specific experience in assessing fouls and their severity on the field.

Overall, our results suggest that the accuracy of players and referees is comparable. The Bayesian version of the Student t-test provides support for this null hypothesis with Bayes factors BF10 of 0.732 and 0.366 for the type of foul and offense severity task, respectively. There is a possibility that this lack of difference between groups is due to power issues, i.e., the sample size being too small. Replication studies conducted on larger groups would be valuable in revealing potential differences between the two human groups.

As we do not have a standard deviation for the VARS, we conducted two One-Sample t-tests to compare its performance against humans (players and referees were grouped as their accuracy was comparable). For action categorization, humans (M = 0.728, SD = 0.095) were significantly more accurate than our VARS (M = 0.597), t(29) = 7.556, p < .001, d = 1.379, 95% CI = [0.870, 1.876]. Humans were also more accurate (M = 0.588, SD = 0.081) than our VARS (M = 0.508) for offense severity judgments, t(29) = 5.492, p < .001, d = 1.003, 95% CI = [0.556, 1.437]. This difference in performance might be due to differences in training between our VARS and humans. Players and referees have accumulated an extensive amount of experience in football, through officiating, playing, and watching the game for countless hours. In contrast, our VARS has only been trained on an unbalanced training set of 2,916 actions, where some types of labels only occur a few times. For example, there are only 27 fouls with a red card in the training set, making it difficult for the model to precisely learn the difference between a foul with a yellow card and one with a red card. Considering the difficulty of the task and the significant experience disadvantage of our VARS compared to humans, the current results are promising. Further, it is notable that our VARS only requires 120 ms to reach a decision, which is more than 300 times faster than humans. Both referees and players require a similar amount of time to make a decision. On average, players take around 41.53 seconds and referees 38.01 seconds, which is similar to the average time of 46 seconds taken for the VAR to make a decision as reported by López Lopez2023Average .
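The one-sample comparison and the speed ratio above can be checked with a few lines of arithmetic. The sketch below is only illustrative: it treats the VARS accuracy as a fixed reference value, uses the rounded summaries quoted in the text (so the statistics differ slightly from those computed with JASP on the raw data), and the helper name `one_sample_t_from_summary` is an assumption of ours.

```python
import math


def one_sample_t_from_summary(mean, sd, n, reference):
    """One-sample t-test of a group mean against a fixed reference value.

    Returns the t statistic, the degrees of freedom, and Cohen's d.
    """
    standard_error = sd / math.sqrt(n)
    t = (mean - reference) / standard_error
    d = (mean - reference) / sd
    return t, n - 1, d


# Humans (players and referees pooled, n = 30) against the VARS accuracy.
t_type, df_type, d_type = one_sample_t_from_summary(0.728, 0.095, 30, 0.597)
t_sev, df_sev, d_sev = one_sample_t_from_summary(0.588, 0.081, 30, 0.508)
print(f"type of foul: t({df_type}) = {t_type:.3f}, d = {d_type:.3f}")
print(f"severity:     t({df_sev}) = {t_sev:.3f}, d = {d_sev:.3f}")

# Speed ratio between the average human decision time and the 120 ms of VARS.
vars_seconds = 0.120
for group, seconds in (("players", 41.53), ("referees", 38.01)):
    print(f"{group}: {seconds / vars_seconds:.0f} times slower than VARS")
```

Both human groups come out more than 300 times slower than the model, consistent with the statement above.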

Figure 5: Example of the subjectivity of human choices. Decisions taken by our participants: “No offense”, “Offense + No card”, and “Offense + Yellow card”.
Nb. of different decisions    1      2      3      4
High-level referees           16%    56%    28%    0%
Referee talents               2%     60%    38%    0%
Table 3: Similarity analysis of the results for Offense Severity classification. Among high-level referees, 28% of cases result in three different decisions being made for the same action. For referee talents, this percentage even increases to 38%. These results show the significant challenge involved in determining whether an action should be classified as a foul and assessing its corresponding severity.

4.2 Inter-rater agreement

In this subsection, we investigate the reliability and consistency of humans in determining whether an action constitutes a foul and its severity. To assess the level of consensus among humans, we calculated the inter-rater agreement in each group. Since determining whether an action is a foul and assessing its severity is the most important task, we only evaluate inter-rater agreement for the offense severity classification task. To quantify it, we calculated the unweighted average Cohen’s kappa, which summarizes the agreement among multiple raters. The referees achieved an unweighted average Cohen’s kappa of 0.213, indicating weak agreement. Similarly, players’ agreement was weak, with a score of 0.223. This suggests limited consistency within both groups in their assessments.

Among our 15 referees, 7 officiate at a high level (in the highest league of their country). These referees are called “high-level referees” in the following. All other referees are called “referee talents”. Table 3 shows the consensus in each subgroup for the offense severity classification task. As can be seen, high-level referees and referee talents reached a consensus within their subgroup for only 16% and 2% of the actions, respectively. In the majority of cases, multiple decisions were made for the same action, indicating the difficulty in determining whether an action should be classified as a foul and assessing its severity. Particularly among referee talents, 38% of actions resulted in three different decisions (out of four possible decisions) for the same action. Figure 5 shows an example of an action where all three decisions “No offense”, “Offense + No card”, and “Offense + Yellow card” were taken among the referees. For certain referees, the fact that the defender plays the ball is considered enough to not award a free-kick in this situation. However, other referees believe that even if the defender plays the ball, he disregards the danger to, or consequences for, an opponent, and they award a yellow card. These findings underscore the complexity and subjectivity inherent in refereeing decisions, highlighting the potential for further research to improve consistency and fairness in officiating.
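To make the two agreement measures used in this subsection concrete, the sketch below computes (i) the unweighted Cohen's kappa averaged over all pairs of raters and (ii) the share of actions that received 1, 2, 3, ... distinct decisions, as in Table 3. It assumes that the average kappa is a simple mean over rater pairs, which may differ in detail from the JASP computation. The ratings are toy data invented for illustration, and all function and label names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two raters' label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty set of actions")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the product of each rater's label frequencies.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every action
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over all pairs of raters."""
    pairs = list(combinations(ratings, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


def distinct_decision_shares(ratings):
    """Share of actions with 1, 2, 3, ... distinct decisions among raters."""
    n_actions = len(ratings[0])
    counts = Counter(len({rater[i] for rater in ratings}) for i in range(n_actions))
    return {k: counts[k] / n_actions for k in sorted(counts)}


# Toy ratings (NOT study data): three raters, four actions.
toy_ratings = [
    ["no_offense", "no_card", "yellow", "no_card"],
    ["no_offense", "no_card", "no_card", "yellow"],
    ["no_offense", "yellow", "yellow", "no_card"],
]
print(f"mean pairwise kappa: {mean_pairwise_kappa(toy_ratings):.3f}")
print(f"distinct-decision shares: {distinct_decision_shares(toy_ratings)}")
```

In this toy example, only one of the four actions is judged unanimously, which mirrors on a small scale the low consensus rates reported in Table 3.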

5 Conclusion

Distinguishing between a foul and no foul and determining its severity is a complex and subjective task that relies entirely on the interpretation of the Laws of the Game IFAB2022Laws by each individual. Despite the challenges posed by this complex task and an unbalanced training dataset, our solution demonstrates promising results. While we have not reached human-level performance yet, we believe that VARS holds the potential to assist and support referees across all levels of professionalism in the future.

Acknowledgement This work was partly supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding and the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI). J. Held and A. Cioppa are funded by the F.R.S.-FNRS. The present research benefited from computational resources made available on Lucia, the Tier-1 supercomputer of the Walloon Region, infrastructure funded by the Walloon Region under the grant agreement n°1910247.

6 Declarations

Availability of data and code. The data and code are available at https://github.com/SoccerNet/sn-mvfoul

Conflict of interest. The authors declare no conflict of interest.

Open access.

References

  • (1) Cioppa, A., Deliège, A., Ul Huda, N., Gade, R., Van Droogenbroeck, M., Moeslund, T.B.: Multimodal and multiview distillation for real-time player detection on a football field. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA, pp. 3846–3855 (2020). https://doi.org/10.1109/CVPRW50498.2020.00448
  • (2) Maglo, A., Orcesi, A., Pham, Q.-C.: Efficient tracking of team sport players with few game-specific annotations. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 3460–3470. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/cvprw56347.2022.00390
  • (3) Vandeghen, R., Cioppa, A., Van Droogenbroeck, M.: Semi-supervised training to improve player and ball detection in soccer. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, pp. 3480–3489. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/cvprw56347.2022.00392
  • (4) Somers, V., Joos, V., Giancola, S., Cioppa, A., Ghasemzadeh, S.A., Magera, F., Standaert, B., Mansourian, A.M., Zhou, X., Kasaei, S., Ghanem, B., Alahi, A., Van Droogenbroeck, M., De Vleeschouwer, C.: SoccerNet game state reconstruction: End-to-end athlete tracking and identification on a minimap. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA (2024)
  • (5) Cioppa, A., Deliège, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M., Gade, R., Moeslund, T.B.: A context-aware loss function for action spotting in soccer videos. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 13123–13133. Inst. Electr. Electron. Eng. (IEEE), Seattle, WA, USA (2020). https://doi.org/10.1109/cvpr42600.2020.01314
  • (6) Giancola, S., Ghanem, B.: Temporally-aware feature pooling for action spotting in soccer broadcasts. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. (CVPR), Nashville, TN, USA, pp. 4490–4499 (2021). https://doi.org/10.1109/CVPRW53098.2021.00506
  • (7) Hong, J., Zhang, H., Gharbi, M., Fisher, M., Fatahalian, K.: Spotting temporally precise, fine-grained events in video. In: Eur. Conf. Comput. Vis. (ECCV). Lect. Notes Comput. Sci., vol. 13695, pp. 33–51. Springer, Tel Aviv, Israel (2022). https://doi.org/10.1007/978-3-031-19833-5_3
  • (8) Soares, J.V.B., Shah, A., Biswas, T.: Temporally precise action spotting in soccer videos using dense detection anchors. In: IEEE Int. Conf. Image Process. (ICIP), pp. 2796–2800. Inst. Electr. Electron. Eng. (IEEE), Bordeaux, France (2022). https://doi.org/10.1109/icip46576.2022.9897256
  • (9) Soares, J.V.B., Shah, A.: Action spotting using dense detection anchors revisited: Submission to the SoccerNet challenge 2022. arXiv abs/2206.07846 (2022). https://doi.org/10.48550/arXiv.2206.07846
  • (10) Giancola, S., Cioppa, A., Georgieva, J., Billingham, J., Serner, A., Peek, K., Ghanem, B., Van Droogenbroeck, M.: Towards active learning for action spotting in association football videos. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 5098–5108. Inst. Electr. Electron. Eng. (IEEE), Vancouver, Can. (2023). https://doi.org/10.1109/cvprw59228.2023.00538
  • (11) Cabado, B., Cioppa, A., Giancola, S., Villa, A., Guijarro-Berdiñas, B., Padrón, E., Ghanem, B., Van Droogenbroeck, M.: Beyond the Premier: Assessing action spotting transfer capability across diverse domains. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA (2024)
  • (12) Kassab, E.J., Solberg, H.M., Gautam, S., Sabet, S.S., Torjusen, T., Riegler, M., Halvorsen, P., Midoglu, C.: Tacdec. Proceedings of the ACM Multimedia Systems Conference 2024 (2024). https://doi.org/10.1145/3625468.3652166
  • (13) Arbués Sangüesa, A., Martín, A., Fernández, J., Ballester, C., Haro, G.: Using player’s body-orientation to model pass feasibility in soccer. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 3875–3884. Inst. Electr. Electron. Eng. (IEEE), Seattle, WA, USA (2020). https://doi.org/10.1109/cvprw50498.2020.00451
  • (14) Honda, Y., Kawakami, R., Yoshihashi, R., Kato, K., Naemura, T.: Pass receiver prediction in soccer using video and players’ trajectories. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 3502–3511. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/cvprw56347.2022.00394
  • (15) Gautam, S., Midoglu, C., Shafiee Sabet, S., Kshatri, D.B., Halvorsen, P.: Soccer game summarization using audio commentary, metadata, and captions. Proceedings of the 1st Workshop on User-centric Narrative Summarization of Long Videos (2022). https://doi.org/10.1145/3552463.3557019
  • (16) Midoglu, C., Sabet, S.S., Sarkhoosh, M.H., Majidi, M., Gautam, S., Solberg, H.M., Kupka, T., Halvorsen, P.: Ai-based sports highlight generation for social media. Proceedings of the 3rd Mile-High Video Conference (2024). https://doi.org/10.1145/3638036.3640799
  • (17) Gautam, S., Midoglu, C., Sabet, S.S., Kshatri, D.B., Halvorsen, P.: Assisting soccer game summarization via audio intensity analysis of game highlights. Unpublished (2022). https://doi.org/10.13140/RG.2.2.34457.70240/1
  • (18) Midoglu, C., Hicks, S., Thambawita, V., Kupka, T., Halvorsen, P.: MMSys’22 grand challenge on AI-based video production for soccer. In: ACM Multimedia Systems Conference (MMSys), Athlone, Ireland, pp. 1–6 (2022)
  • (19) Magera, F., Hoyoux, T., Barnich, O., Van Droogenbroeck, M.: A universal protocol to benchmark camera calibration for sports. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA (2024)
  • (20) Somers, V., De Vleeschouwer, C., Alahi, A.: Body part-based representation learning for occluded person Re-Identification. In: IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), pp. 1613–1623. Inst. Electr. Electron. Eng. (IEEE), Waikoloa, HI, USA (2023). https://doi.org/10.1109/wacv56688.2023.00166
  • (21) Mkhallati, H., Cioppa, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M.: SoccerNet-caption: Dense video captioning for soccer broadcasts commentaries. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 5074–5085. Inst. Electr. Electron. Eng. (IEEE), Vancouver, Can. (2023). https://doi.org/10.1109/cvprw59228.2023.00536
  • (22) Andrews, P., Nordberg, O.E., Zubicueta Portales, S., Borch, N., Guribye, F., Fujita, K., Fjeld, M.: AiCommentator: A multimodal conversational agent for embedded visualization in football viewing. In: Int. Conf. Intell. User Interfaces, pp. 14–34. ACM, Greenville, SC, USA (2024). https://doi.org/10.1145/3640543.3645197
  • (23) Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3D shape recognition. In: IEEE Int. Conf. Comput. Vis. (ICCV), pp. 945–953. Inst. Electr. Electron. Eng. (IEEE), Santiago, Chile (2015). https://doi.org/10.1109/iccv.2015.114
  • (24) Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv abs/1409.0473 (2014). https://doi.org/10.48550/arXiv.1409.0473
  • (25) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv abs/1706.03762 (2017). https://doi.org/10.48550/arXiv.1706.03762
  • (26) Pappalardo, L., Cintia, P., Rossi, A., Massucco, E., Ferragina, P., Pedreschi, D., Giannotti, F.: A public data set of spatio-temporal match events in soccer competitions. Sci. Data 6(1), 1–15 (2019). https://doi.org/10.1038/s41597-019-0247-7
  • (27) Yu, J., Lei, A., Song, Z., Wang, T., Cai, H., Feng, N.: Comprehensive dataset of broadcast soccer videos. In: IEEE Conf. Multimedia Inf. Process. Retr. (MIPR), pp. 418–423. Inst. Electr. Electron. Eng. (IEEE), Miami, FL, USA (2018). https://doi.org/10.1109/MIPR.2018.00090
  • (28) Scott, A., Uchida, I., Onishi, M., Kameda, Y., Fukui, K., Fujii, K.: SoccerTrack: A dataset and tracking algorithm for soccer with fish-eye and drone videos. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 3568–3578. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/cvprw56347.2022.00401
  • (29) Jiang, Y., Cui, K., Chen, L., Wang, C., Xu, C.: SoccerDB: A large-scale database for comprehensive video understanding. In: Int. ACM Work. Multimedia Content Anal. Sports (MMSports), pp. 1–8. ACM, Seattle, WA, USA (2020). https://doi.org/10.1145/3422844.3423051
  • (30) Van Zandycke, G., Somers, V., Istasse, M., Don, C.D., Zambrano, D.: DeepSportradar-v1: Computer vision dataset for sports understanding with high quality annotations. In: Int. ACM Work. Multimedia Content Anal. Sports (MMSports), pp. 1–8. ACM, Lisbon, Port. (2022). https://doi.org/10.1145/3552437.3555699
  • (31) Giancola, S., Amine, M., Dghaily, T., Ghanem, B.: SoccerNet: A scalable dataset for action spotting in soccer videos. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 1792–179210. Inst. Electr. Electron. Eng. (IEEE), Salt Lake City, UT, USA (2018). https://doi.org/10.1109/cvprw.2018.00223
  • (32) Deliège, A., Cioppa, A., Giancola, S., Seikavandi, M.J., Dueholm, J.V., Nasrollahi, K., Ghanem, B., Moeslund, T.B., Van Droogenbroeck, M.: SoccerNet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Nashville, TN, USA, pp. 4508–4519 (2021). https://doi.org/10.1109/CVPRW53098.2021.00508. http://hdl.handle.net/2268/253781
  • (33) Cioppa, A., Deliège, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M.: Scaling up SoccerNet with multi-view spatial localization and re-identification. Sci. Data 9(1), 1–9 (2022). https://doi.org/10.1038/s41597-022-01469-1
  • (34) Cioppa, A., Giancola, S., Deliege, A., Kang, L., Zhou, X., Cheng, Z., Ghanem, B., Van Droogenbroeck, M.: SoccerNet-tracking: Multiple object tracking dataset and benchmark in soccer videos. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, pp. 3490–3501. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/cvprw56347.2022.00393
  • (35) Held, J., Cioppa, A., Giancola, S., Hamdi, A., Ghanem, B., Van Droogenbroeck, M.: VARS: Video assistant referee system for automated soccer decision making from multiple views. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), pp. 5086–5097. Inst. Electr. Electron. Eng. (IEEE), Vancouver, Can. (2023). https://doi.org/10.1109/cvprw59228.2023.00537
  • (36) Cioppa, A., Giancola, S., Somers, V., Magera, F., Zhou, X., Mkhallati, H., Deliège, A., Held, J., Hinojosa, C., Mansourian, A.M., Miralles, P., Barnich, O., De Vleeschouwer, C., Alahi, A., Ghanem, B., Van Droogenbroeck, M., Kamal, A., Maglo, A., Clapés, A., Abdelaziz, A., Xarles, A., Orcesi, A., Scott, A., Liu, B., Lim, B., Chen, C., Deuser, F., Yan, F., Yu, F., Shitrit, G., Wang, G., Choi, G., Kim, H., Guo, H., Fahrudin, H., Koguchi, H., Ardö, H., Salah, I., Yerushalmy, I., Muhammad, I., Uchida, I., Be’ery, I., Rabarisoa, J., Lee, J., Fu, J., Yin, J., Xu, J., Nang, J., Denize, J., Li, J., Zhang, J., Kim, J., Synowiec, K., Kobayashi, K., Zhang, K., Habel, K., Nakajima, K., Jiao, L., Ma, L., Wang, L., Wang, L., Li, M., Zhou, M., Nasr, M., Abdelwahed, M., Liashuha, M., Falaleev, N., Oswald, N., Jia, Q., Pham, Q.-C., Song, R., Hérault, R., Peng, R., Chen, R., Liu, R., Baikulov, R., Fukushima, R., Escalera, S., Lee, S., Chen, S., Ding, S., Someya, T., Moeslund, T.B., Li, T., Shen, W., Zhang, W., Li, W., Dai, W., Luo, W., Zhao, W., Zhang, W., Yang, X., Ma, Y., Joo, Y., Zeng, Y., Gan, Y., Zhu, Y., Zhong, Y., Ruan, Z., Li, Z., Huangi, Z., Meng, Z.: SoccerNet 2023 challenges results. arXiv abs/2309.06006 (2023). https://doi.org/10.48550/arXiv.2309.06006
  • (37) Leduc, A., Cioppa, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M.: SoccerNet-Depth: a scalable dataset for monocular depth estimation in sports videos. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA (2024)
  • (38) Held, J., Itani, H., Cioppa, A., Giancola, S., Ghanem, B., Van Droogenbroeck, M.: X-vars: Introducing explainability in football refereeing with multi-modal large language models. In: IEEE Int. Conf. Comput. Vis. Pattern Recognit. Work. (CVPRW), CVsports, Seattle, WA, USA (2024)
  • (39) Gautam, S., Sarkhoosh, M.H., Held, J., Midoglu, C., Cioppa, A., Giancola, S., Thambawita, V., Riegler, M.A., Halvorsen, P., Shah, M.: SoccerNet-echoes: A soccer game audio commentary dataset. arXiv abs/2405.07354 (2024). https://doi.org/10.48550/arXiv.2405.07354
  • (40) Spitz, J., Wagemans, J., Memmert, D., Williams, A.M., Helsen, W.F.: Video assistant referees (var): The impact of technology on decision making in association football referees. Journal of Sports Sciences 39(2), 147–153 (2020). https://doi.org/10.1080/02640414.2020.1809163
  • (41) De Dios Crespo, J.: 2. The Contribution of VARs to Fairness in Sport, pp. 23–35. Routledge, New York City, NY, USA (2021). https://doi.org/10.4324/9780429455551-2
  • (42) Holder, U., Ehrmann, T., König, A.: Monitoring experts: insights from the introduction of video assistant referee (VAR) in elite football. Journal of Business Economics 92(2), 285–308 (2021). https://doi.org/10.1007/s11573-021-01058-5
  • (43) Dufner, A.-L., Schütz, L.-M., Hill, Y.: The introduction of the video assistant referee supports the fairness of the game — an analysis of the home advantage in the german bundesliga. Psychology of Sport and Exercise 66, 1–5 (2023). https://doi.org/10.1016/j.psychsport.2023.102386
  • (44) Armenteros, M., Webb, T.: Educating International Football Referees: The importance of Uniformity, pp. 301–327. Routledge, New York City, NY, USA (2021). Chap. 16. https://doi.org/10.4324/9780429455551-16
  • (45) Deutscher Fußball-Bund (DFB): Anzahl aktiver Schiedsrichter/-innen bis 2022. https://www.dfb.de/verbandsstruktur/dfb-zentrale/ (2022)
  • (46) Zeppenfeld, B.: Anzahl aktiver Schiedsrichter / Schiedsrichterinnen des Deutschen Fußball Bundes (DFB) von 2018/2019 bis 2022/2023. https://de.statista.com/statistik/daten/studie/1243626/umfrage/dfb-anzahl-aktiver-schiedsrichter/ (2023)
  • (47) IFAB: Laws of the game. Technical report, The International Football Association Board, Zurich, Switzerland (2022)
  • (48) Fan, H., Xiong, B., Mangalam, K., Li, Y., Yan, Z., Malik, J., Feichtenhofer, C.: Multiscale vision transformers. In: IEEE Int. Conf. Comput. Vis. (ICCV), pp. 6804–6815. Inst. Electr. Electron. Eng. (IEEE), Montréal, Can. (2021). https://doi.org/10.1109/iccv48922.2021.00675
  • (49) Li, Y., Wu, C.-Y., Fan, H., Mangalam, K., Xiong, B., Malik, J., Feichtenhofer, C.: MViTv2: Improved multiscale vision transformers for classification and detection. In: IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4794–4804. Inst. Electr. Electron. Eng. (IEEE), New Orleans, LA, USA (2022). https://doi.org/10.1109/cvpr52688.2022.00476
  • (50) Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The kinetics human action video dataset. arXiv abs/1705.06950 (2017). https://doi.org/10.48550/arXiv.1705.06950
  • (51) Yang, Z., Wang, L.: Learning relationships for multi-view 3D object recognition. In: IEEE Int. Conf. Comput. Vis. (ICCV), pp. 7504–7513. Inst. Electr. Electron. Eng. (IEEE), Seoul, South Korea (2019). https://doi.org/10.1109/iccv.2019.00760
  • (52) MacMahon, C., Helsen, W.F., Starkes, J.L., Weston, M.: Decision-making skills and deliberate practice in elite association football referees. Journal of Sports Sciences 25(1), 65–78 (2007). https://doi.org/10.1080/02640410600718640
  • (53) Spitz, J., Put, K., Wagemans, J., Williams, A.M., Helsen, W.F.: Visual search behaviors of association football referees during assessment of foul play situations. Cognitive Research: Principles and Implications 1(1) (2016). https://doi.org/10.1186/s41235-016-0013-8
  • (54) Pizzera, A., Marrable, J., Raab, M.: The video review system in association football: implementation and effectiveness for match officials and referee education. Managing Sport and Leisure, 1–17 (2022). https://doi.org/10.1080/23750472.2022.2147856
  • (55) López, A.M.: Average time needed for a video assistant referee (VAR) intervention in Brazil in 2019 and 2020. https://www.statista.com/statistics/1010093/average-time-video-assistant-referee-checking-brazil/ (2023)