Diffusion Action Segmentation
Daochang Liu1, Qiyue Li2,4, Anh-Dung Dinh1, Tingting Jiang2, Mubarak Shah3, Chang Xu1
1School of Computer Science, Faculty of Engineering, The University of Sydney
2NERCVT, NKLMIP, School of Computer Science, 4School of Mathematical Sciences, Peking University
3Center for Research in Computer Vision, University of Central Florida
{daochang.liu, c.xu}@sydney.edu.au
[email protected]
Abstract
Temporal action segmentation is crucial for understand-
ing long-form videos. Previous works on this task com-
monly adopt an iterative refinement paradigm by using
multi-stage models. We propose a novel framework via
denoising diffusion models, which nonetheless shares the
same inherent spirit of such iterative refinement. In this
framework, action predictions are iteratively generated
from random noise with input video features as conditions.
To enhance the modeling of three striking characteristics
of human actions, including the position prior, the bound-
ary ambiguity, and the relational dependency, we devise
a unified masking strategy for the conditioning inputs in
our framework. Extensive experiments on three bench-
mark datasets, i.e., GTEA, 50Salads, and Breakfast, are
performed and the proposed method achieves superior or
comparable results to state-of-the-art methods, showing the
effectiveness of a generative approach for action segmenta-
tion. Code is at tinyurl.com/DiffAct.
1. Introduction
Temporal action segmentation is a key task for under-
standing and analyzing human activities in complex long
videos, with a wide range of applications from video
surveillance [62], video summarization [3] to skill assess-
ment [43]. The goal of temporal action segmentation is to
take as input an untrimmed video and output an action se-
quence indicating the class label for each frame.
This task has witnessed remarkable progress in recent
years with the development of multi-stage models [20, 41,
68, 72]. The core idea of multi-stage models is to stack sev-
eral stages where the first stage produces an initial predic-
tion and later stages adjust the prediction from the preceding
stage. Researchers have actively explored various architec-
tures to implement the multi-stage model, such as the MS-
TCN [20, 41] relying on dilated temporal convolutional lay-
Figure 1. Multi-stage model vs. diffusion model for action seg-
mentation. They both follow an iterative refinement paradigm.
Left: Many previous methods utilize a multi-stage framework to
refine the initial prediction. Right: We formulate action segmenta-
tion as a frame-wise action sequence generation problem and ob-
tain the refined prediction by an iterative denoising process. Col-
ors in the barcodes represent different actions.
ers and the ASFormer [72] with attention mechanisms. The
success of multi-stage models could be largely attributed to
the underlying iterative refinement paradigm that properly
captures temporal dynamics of actions and significantly re-
duces over-segmentation errors [15].
In this paper, we propose an action segmentation method
following the same philosophy of iterative refinement but
in an essentially new generative approach, which incorpo-
rates the denoising diffusion model. Favored for its sim-
ple training recipe and high generation quality, the diffusion
model [13, 54, 14, 26] has become a rapidly emerging cat-
egory of generative models. A forward process in the dif-
fusion model corrupts the data by gradually adding noise,
while a corresponding reverse process removes the noise
step by step so that new samples can be generated from the
data distribution starting from fully random noise. Such it-
erative denoising in the reverse process coincides with the
iterative refinement paradigm for action segmentation. This
motivates us to rethink action segmentation from a genera-
tive view employing the diffusion model as in Fig. 1, where
we can formulate action segmentation as an action sequence
generation problem conditioned on the input video. One
distinct advantage of such diffusion-based action segmen-
tation is that it not only learns the discriminative mapping
from video frames to actions but also implicitly captures
the prior distribution of human actions through generative
modeling. This prior modeling can be further explicitly en-
hanced in line with three prominent characteristics of hu-
man actions. The first characteristic is the temporal position
prior, which means certain actions are more likely to occur
at particular time locations in the video. Taking a video
of making salads as an example, actions of cutting vegeta-
bles tend to appear in the middle of the video, while serving
salads onto the plate is mostly located at the end. The sec-
ond characteristic is the boundary prior, which reflects that
transitions between actions are visually gradual and thus
lead to ambiguous features around action boundaries. The
third characteristic is the relation prior, which reflects that
human actions usually adhere to some intrinsic temporal
ordering, e.g., cutting a cucumber typically follows
peeling the cucumber. This relation prior differs from the
position prior since it focuses on the arrangements relative
to other actions. To jointly exploit these priors of human
actions, we devise a condition masking strategy, which nat-
urally fits into the newly proposed framework.
In our method, dubbed DiffAct, we formulate action seg-
mentation as a conditional generation problem of the frame-
wise action label sequence, leveraging the input video as the
condition. During training, the model is provided with the
input video features as well as a degraded temporal action
segmentation sequence obtained from the ground truth with
varying levels of injected noise. The model then learns to
denoise the sequence to restore the original ground truth.
To achieve this, we impose three loss functions between the
denoised sequence and the ground-truth sequence, includ-
ing a standard cross-entropy loss, a temporal smoothness
loss, and a boundary alignment loss. At inference, follow-
ing the reversed diffusion process, the model refines a ran-
dom sequence in an iterative manner to generate the action
prediction sequence. On the other hand, to boost the mod-
eling of the three aforementioned priors of human actions,
the conditional information in our framework is controlled
through a condition masking strategy during training, which
encourages the model to reason over information other than
visual features, e.g., time locations, action durations, and
the temporal context. The masked conditions convert the
generative learning of action sequences from a basic con-
ditional one to a combination of fully conditional, partially
conditional, and unconditional ones to enhance the three ac-
tion priors simultaneously.
The effectiveness of our diffusion-based temporal action
segmentation is demonstrated by the experiments on three
datasets, GTEA [21], 50Salads [59], and Breakfast [34],
on which our model performs better or on par compared to
state-of-the-art methods. In summary, our contributions are
three-fold: 1) temporal action segmentation is formulated
as a conditional generation task; 2) a new iterative refine-
ment framework is proposed based on the denoising diffu-
sion process; 3) a condition masking strategy is designed to
further exploit the priors of human actions.
2. Related Works
Temporal action segmentation [15] is a complex video
understanding task that segments videos which can span
many minutes. To model the long-range dependencies among
actions, a rich variety of temporal models have been em-
ployed in the literature, evolving from early recurrent
neural networks [16, 51], to temporal convolutional net-
works [37, 36, 38, 20, 46, 65, 25, 41, 64, 58, 23, 52, 42,
47, 75], graph neural networks [29, 74], and recent trans-
formers [72, 4, 66, 18, 19, 61, 44]. Multi-stage mod-
els [72, 4, 41, 64, 20, 47, 68], which employ incremental re-
fining, are especially notable given their superiority in cap-
turing temporal context and mitigating over-segmentation
errors [15]. Apart from architecture designs, another line of
works focuses on acquiring more accurate and robust frame-
level features through representation learning [1, 50, 40] or
domain adaptation [10, 11].
For action segmentation, the position prior, the bound-
ary ambiguity, and the relation prior are three beneficial in-
ductive biases. The boundary ambiguity and relation prior
have drawn attention from researchers while the position
prior has been less explored. To cope with boundary ambi-
guity, the boundary-aware cascade network [68] presents a
local barrier pooling that assigns aggregation weights
adaptive to boundaries. Ishikawa et al. integrated a comple-
mentary network branch to regress boundary locations [31],
and Chen et al. estimated the uncertainty due to ambiguous
boundaries by Monte-Carlo sampling [9]. Another strategy
is to smooth the annotation into a soft version where ac-
tion probabilities decrease around boundaries [33]. As for
the relation prior, GTRM [29] and UVAST [6] respectively
leveraged graph convolutional networks and sequence-to-
sequence translation to facilitate relational reasoning at the
segment level. For contextual relations between adjacent
actions, Br-Prompt [40] resorted to multi-modal learning
by using text prompts as supervision. Recently, DTL [69]
mined relational constraints from annotations and enforced
the constraints by a temporal logic loss during training. In
this paper, we simply use condition masking to take care of
the position prior, boundary ambiguity, and relation prior si-
multaneously, without additional designs for each of them.
One related work [22] also employs generative learning
for action segmentation. This method synthesized interme-
diate action codes per frame to aid recognition using GANs.
In contrast, our method generates action label sequences via
the diffusion model.
Diffusion models [53, 13, 54, 26], which have been the-
oretically unified with the score-based models [55, 56, 57],
are known for their stable training procedure, which does not
require any adversarial mechanism for generative learning.
The diffusion-based generation has achieved impressive re-
sults in image generation [14, 49, 7, 67], natural language
generation [73], text-to-image synthesis [24, 32], audio gen-
eration [39, 35] and so on. A gradient view was proposed to
further improve the diffusion sampling process with guid-
ance [17]. Diffusion models were repurposed for some im-
age understanding tasks in computer vision very recently,
such as object detection [12] and image segmentation [5, 2].
However, only a few works have targeted video-related tasks,
including video forecasting and infilling [70, 28, 63]. A
pre-trained diffusion model was fine-tuned to predict video
memorability [60], and a frequency-aware diffusion module
was proposed for video captioning [76]. In this paper, we
recognize that diffusion models, given their iterative refine-
ment properties, are especially suitable for temporal action
segmentation. To the best of our knowledge, this work is the
first one employing diffusion models for action analysis.
3. Preliminaries
We first familiarize the readers with the background of
diffusion models. Diffusion models [26, 54] aim to approximate
the data distribution q(x_0) with a model distribution p_θ(x_0).
The forward process or diffusion process corrupts the real data
x_0 ∼ q(x_0) into a series of noisy data x_1, x_2, ..., x_S. The
reverse process or denoising process gradually removes the noise
from x_S ∼ N(0, I) to x_{S-1}, x_{S-2}, ..., until x_0 ∼ p_θ(x_0)
is achieved. S is the total number of steps.
Forward Process. Formally, the forward process adds
Gaussian noise to the data at each step with a pre-defined
variance schedule β_1, β_2, ..., β_S: q(x_s | x_{s-1}) =
N(x_s; √(1 − β_s) x_{s-1}, β_s I). By denoting α_s = 1 − β_s
and ᾱ_s = ∏_{i=1}^{s} α_i, we can directly obtain x_s from
x_0 in a closed form without recursion: q(x_s | x_0) =
N(x_s; √(ᾱ_s) x_0, (1 − ᾱ_s) I), which can be simplified using
the reparameterization trick:

    x_s = \sqrt{\bar{\alpha}_s}\, x_0 + \epsilon \sqrt{1 - \bar{\alpha}_s}.    (1)

The noise ε ∼ N(0, I) is sampled from a normal distribution
at each step and its intensity is defined by √(1 − ᾱ_s).
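For illustration, the closed-form corruption in Eq. 1 can be implemented in a few lines of PyTorch; the linear β schedule below is an illustrative assumption rather than a schedule prescribed by this paper.

```python
import torch

# Closed-form forward corruption of Eq. 1.
# The linear beta schedule is an illustrative assumption.
S = 1000
betas = torch.linspace(1e-4, 0.02, S)        # beta_1, ..., beta_S
alphas = 1.0 - betas                          # alpha_s = 1 - beta_s
alpha_bars = torch.cumprod(alphas, dim=0)     # bar(alpha)_s = prod_{i<=s} alpha_i

def q_sample(x0: torch.Tensor, s: int) -> torch.Tensor:
    """Corrupt clean data x0 directly to noise level s (1-indexed), Eq. 1."""
    a_bar = alpha_bars[s - 1]
    eps = torch.randn_like(x0)                # epsilon ~ N(0, I)
    return a_bar.sqrt() * x0 + eps * (1.0 - a_bar).sqrt()
```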
Reverse Process. The reverse process starts from x_S and
progressively removes noise to recover x_0, with one step in
the process defined as p_θ(x_{s-1} | x_s):

    p_\theta(x_{s-1} | x_s) = \mathcal{N}(x_{s-1}; \mu_\theta(x_s, s), \sigma_s^2 I),    (2)

where σ_s^2 is controlled by β_s, and μ_θ(x_s, s) is a predicted
mean parameterized by a step-dependent neural network.
Several different ways [45] are possible to parameterize p_θ,
including the prediction of the mean as in Eq. 2, the prediction
of the noise ε, and the prediction of x_0. The x_0 prediction is
used in our work. In this case, the model predicts x_0 by a
neural network f_θ(x_s, s), instead of directly predicting the
μ_θ(x_s, s) in Eq. 2. To optimize the model, a mean squared
error loss can be utilized to match f_θ(x_s, s) and x_0:
L = ||f_θ(x_s, s) − x_0||^2, s ∈_R {1, 2, ..., S}. The step s is
randomly selected at each training iteration.
At the inference stage, starting from a pure noise x_S ∼ N(0, I),
the model can gradually reduce the noise according to the update
rule [54] below using the trained f_θ:

    x_{s-1} = \sqrt{\bar{\alpha}_{s-1}}\, f_\theta(x_s, s) + \sqrt{1 - \bar{\alpha}_{s-1} - \sigma_s^2} \cdot \frac{x_s - \sqrt{\bar{\alpha}_s}\, f_\theta(x_s, s)}{\sqrt{1 - \bar{\alpha}_s}} + \sigma_s \epsilon.    (3)

By iteratively applying Eq. 3, a new sample x_0 can be generated
from p_θ via a trajectory x_S, x_{S-1}, ..., x_0. Improved sampling
strategies skip steps in the trajectory, i.e., x_S, x_{S-Δ}, ..., x_0,
for better efficiency [54].
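A minimal sketch of one reverse update under the x_0-prediction parameterization of Eq. 3 is shown below; setting σ = 0 yields a deterministic (DDIM-style) step, and choosing s_prev < s − 1 realizes the skipped-step trajectory. The helper reuses the alpha_bars schedule from the previous sketch, and the function and argument names are illustrative.

```python
import torch

def reverse_step(x_s, x0_pred, s, s_prev, alpha_bars, sigma=0.0):
    """One reverse update following Eq. 3: move from x_s to x_{s_prev}
    given the model's x0 prediction f_theta(x_s, s). Setting s_prev < s - 1
    realizes a skipped-step trajectory; sigma = 0 gives a deterministic
    (DDIM-style) update. `alpha_bars` is the cumulative schedule above."""
    a_bar_s = alpha_bars[s - 1]
    a_bar_prev = alpha_bars[s_prev - 1] if s_prev > 0 else torch.tensor(1.0)
    # Residual direction from the predicted x0 towards the current x_s
    direction = (x_s - a_bar_s.sqrt() * x0_pred) / (1.0 - a_bar_s).sqrt()
    mean = a_bar_prev.sqrt() * x0_pred \
        + (1.0 - a_bar_prev - sigma ** 2).clamp(min=0.0).sqrt() * direction
    return mean + sigma * torch.randn_like(x_s)
```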
Conditional information can be included in the diffusion
model to control the generation process. The conditional
model can be written as f_θ(x_s, s, C), with the conditional
information C as an extra input. In the literature, class la-
bels [14], text prompts [24, 32], and image guidance [48]
are the most common forms of conditional information.
4. Method
We formulate temporal action segmentation as a conditional
generation problem of temporal action sequences (Fig. 2).
Given the input features F ∈ R^{L×D} with D dimensions for a
video of L frames, we approximate the data distribution of the
ground truth action sequence Y_0 ∈ {0, 1}^{L×C} in the one-hot
form with C classes of actions. Since the input features F are
usually extracted per short clip by general-purpose pre-trained
models, we employ an encoder h_φ to enrich the features with
long-range temporal information and make them task-oriented,
i.e., E = h_φ(F). E ∈ R^{L×D′} is the encoded feature with D′
dimensions.
4.1. Diffusion Action Segmentation
The proposed method, DiffAct, constructs a diffusion
process Y_0, Y_1, ..., Y_S from the action ground truth Y_0 to
a nearly pure noise Y_S at training, and a denoising process
Ŷ_S, Ŷ_{S-1}, ..., Ŷ_0 from a pure noise Ŷ_S to the action
prediction Ŷ_0 at inference.
Training. To learn the underlying distribution of human
actions, the model is trained to restore the action ground
truth from its corrupted versions. Specifically, a random
diffusion step s PR t1, 2, ..., Su is chosen at each training
iteration. Then we add noise to the ground truth sequence
Y_0 according to the accumulative noise schedule as in Eq. 1
to attain a corrupted sequence Y_s ∈ [0, 1]^{L×C}:

    Y_s = \sqrt{\bar{\alpha}_s}\, Y_0 + \epsilon \sqrt{1 - \bar{\alpha}_s},    (4)

where noise ε is from a Gaussian distribution.
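As a sketch of how the corruption in Eq. 4 might look in code, the ground-truth labels can be one-hot encoded, scaled to [−1, 1] (following the implementation details in Sec. 5.1), and passed to the q_sample helper sketched in Sec. 3; names are illustrative.

```python
import torch
import torch.nn.functional as F

def corrupt_labels(frame_labels: torch.Tensor, num_classes: int, s: int) -> torch.Tensor:
    """Build Y_0 from per-frame class indices and corrupt it to Y_s (Eq. 4).
    The scaling of the one-hot sequence to [-1, 1] follows the paper's
    implementation details; q_sample is the Eq. 1 helper sketched earlier."""
    y0 = F.one_hot(frame_labels, num_classes).float()  # (L, C) one-hot ground truth
    y0 = y0 * 2.0 - 1.0                                 # normalize to [-1, 1]
    return q_sample(y0, s)                              # Y_s per Eq. 4
```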
Taking the corrupted sequence as input, a decoder g_ψ is
designed to denoise the sequence:

    P_s = g_\psi(Y_s, s, E \odot M),    (5)

where the resultant denoised sequence P_s ∈ [0, 1]^{L×C}
indicates action probabilities for each frame. Apart from Y_s,
the decoder takes two additional inputs, the step s and the
encoded video features E. Using the step s as input makes the
model step-aware so that the same model can be shared to
denoise at different noise intensities. The usage of the encoded
video features E as conditions ensures that the produced action
sequences are not only broadly plausible but also consistent
with the input video. More importantly, the conditional
information is further diversified by element-wise multiplication
of the feature E with a mask M to explicitly capture the three
characteristics of human actions, which will be discussed later.
The choices of the decoder g_ψ and encoder h_φ are flexible and
can be common network architectures for action segmentation.
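The decoder is described later (Sec. 5.1) as an ASFormer decoder made step-dependent by adding a step embedding in the spirit of [26]; a common sinusoidal step embedding is sketched below, with the caveat that the exact module in the released implementation may differ.

```python
import math
import torch
import torch.nn as nn

class StepEmbedding(nn.Module):
    """Sinusoidal step embedding that makes a denoiser step-aware,
    in the spirit of [26]; the exact module used by the paper's
    ASFormer-based decoder may differ from this sketch."""
    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (B,) integer diffusion steps
        half = self.dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float) / half)
        args = s.float()[:, None] * freqs[None, :]
        emb = torch.cat([args.sin(), args.cos()], dim=-1)   # (B, dim)
        return self.mlp(emb)
```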
Loss Functions. With the denoised sequence P_s obtained,
we impose the three loss functions below to match it with
the original ground truth Y_0.
Cross-Entropy Loss. The first loss is the standard cross-
entropy for classification minimizing the negative log-
likelihood of the ground truth action class for each frame:
    L_s^{ce} = \frac{1}{LC} \sum_{i=1}^{L} \sum_{c=1}^{C} -Y_{0,i,c} \log P_{s,i,c},    (6)
where i is the frame index and c is the class index.
Temporal Smoothness Loss. To promote the local simi-
larity along the temporal dimension, the second loss is com-
puted as the mean squared error of the log-likelihoods be-
tween adjacent video frames [20, 41]:
    L_s^{smo} = \frac{1}{(L-1)C} \sum_{i=1}^{L-1} \sum_{c=1}^{C} \left( \log P_{s,i,c} - \log P_{s,i+1,c} \right)^2.    (7)

Note that L_s^{smo} is clipped to avoid outlier values [20].
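A sketch of Eq. 6 and Eq. 7 is given below; the clipping threshold of 4 follows the usual MS-TCN setting [20] and, together with the exact normalization constants, is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def ce_and_smoothness(p_s: torch.Tensor, labels: torch.Tensor, clip_tau: float = 4.0):
    """Sketch of Eq. 6 (cross-entropy) and Eq. 7 (truncated smoothness).
    p_s: (L, C) denoised action probabilities; labels: (L,) GT class indices.
    The clipping threshold follows the common MS-TCN setting and the
    normalization constants are illustrative."""
    log_p = torch.log(p_s.clamp(min=1e-8))
    # Eq. 6: negative log-likelihood of the ground-truth class per frame
    ce = F.nll_loss(log_p, labels)
    # Eq. 7: mean squared difference of adjacent log-likelihoods, clipped
    diff = (log_p[1:, :] - log_p[:-1, :]).abs().clamp(max=clip_tau)
    smo = (diff ** 2).mean()
    return ce, smo
```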
Boundary Alignment Loss. Accurate detection of boundaries
is important for action segmentation. Therefore, the
third loss is to align the action boundaries in the denoised
sequence P_s and the ground truth sequence Y_0. To this
end, we need to derive the boundary probabilities from
both P_s and Y_0. First, a ground truth boundary sequence
B ∈ {0, 1}^{L-1} can be derived from the action ground truth
Figure 2. Method overview. During training, the proposed model
is optimized to denoise corrupted action sequences. At inference,
the model begins with a random noise sequence and obtains results
via an iterative denoising process. The condition masking strategy
strengthens the action prior modeling by blocking certain features
during training. Input features are pre-extracted I3D features.
Y_0, where B_i = 1(Y_{0,i} ≠ Y_{0,i+1}). Since action transitions
usually happen gradually, we smooth this sequence with a
Gaussian filter for a soft version B̄ = λ(B). As for the
boundaries in the denoised sequence, their probabilities are
computed with the dot product of the action probabilities
from neighboring frames in P_s, i.e., 1 − P_{s,i} · P_{s,i+1}. The
boundaries derived from the two sources are then aligned
via a binary cross-entropy loss:

    L_s^{bd} = \frac{1}{L-1} \sum_{i=1}^{L-1} \left[ -\bar{B}_i \log\left(1 - P_{s,i} \cdot P_{s,i+1}\right) - \left(1 - \bar{B}_i\right) \log\left(P_{s,i} \cdot P_{s,i+1}\right) \right].    (8)
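The boundary alignment loss of Eq. 8 could be sketched as follows; the Gaussian width and the peak normalization of the soft boundaries B̄ are assumptions of this sketch, since the paper only specifies a Gaussian filter λ(·).

```python
import torch
import torch.nn.functional as F

def boundary_alignment_loss(p_s: torch.Tensor, labels: torch.Tensor, sigma: float = 1.0):
    """Sketch of Eq. 8. p_s: (L, C) probabilities, labels: (L,) GT classes.
    The Gaussian width `sigma` and the peak normalization of the soft
    boundaries are assumptions; the paper only states a Gaussian filter."""
    # Hard boundaries B_i = 1(Y_{0,i} != Y_{0,i+1})
    b = (labels[1:] != labels[:-1]).float()                       # (L-1,)
    # Soft boundaries bar(B) = lambda(B): Gaussian smoothing
    radius = max(1, int(3 * sigma))
    t = torch.arange(-radius, radius + 1, dtype=torch.float)
    kernel = torch.exp(-0.5 * (t / sigma) ** 2)
    kernel = (kernel / kernel.sum()).view(1, 1, -1)
    b_soft = F.conv1d(b.view(1, 1, -1), kernel, padding=radius).view(-1)
    b_soft = b_soft / b_soft.max().clamp(min=1e-8)                # peaks near 1
    # Predicted boundary probability: 1 - P_{s,i} . P_{s,i+1}
    bd_pred = (1.0 - (p_s[:-1] * p_s[1:]).sum(dim=-1)).clamp(1e-8, 1 - 1e-8)
    # Binary cross-entropy between soft GT boundaries and predictions
    return -(b_soft * bd_pred.log() + (1 - b_soft) * (1 - bd_pred).log()).mean()
```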
The final training loss is a combination of the three losses
at a randomly selected diffusion step: L^{sum} = L_s^{ce} +
L_s^{smo} + L_s^{bd}, s ∈_R {1, 2, ..., S}.
Inference. Intuitively, the denoising decoder g_ψ is
trained to adapt to sequences with arbitrary noise levels so
that it can even handle a sequence made of purely random
noise. Therefore, at inference, the denoising decoder starts
from a pure noise sequence Ŷ_S ∼ N(0, I) and gradually
reduces the noise. Concretely, the update rule for each step is
adapted from Eq. 3 as below:

    \hat{Y}_{s-1} = \sqrt{\bar{\alpha}_{s-1}}\, P_s + \sqrt{1 - \bar{\alpha}_{s-1} - \sigma_s^2} \cdot \frac{\hat{Y}_s - \sqrt{\bar{\alpha}_s}\, P_s}{\sqrt{1 - \bar{\alpha}_s}} + \sigma_s \epsilon,    (9)

where the Ŷ_{s-1} is sent into the decoder to obtain the next
Figure 3. Illustration of the action prior modeling (panels: no
masking, position modeling, boundary modeling, relation modeling).
The striped locations are masked. Different colors represent
different actions.
P_{s-1}. This iterative refinement process yields a series of
action sequences Ŷ_S, Ŷ_{S-1}, ..., Ŷ_0, which leads to the Ŷ_0 at
the end that can well approximate the underlying ground
truth and is regarded as the final prediction. To speed up
the inference, a sampling trajectory [54] with skipped steps,
Ŷ_S, Ŷ_{S-Δ}, ..., Ŷ_0, is utilized in our method. Note that the
encoded features E are sent into the decoder without masking
at inference. In addition, we fix the random seed for this
denoising process to make it deterministic in practice.
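Putting the pieces together, the inference procedure might be sketched as below, where `decoder` stands in for g_ψ and is assumed to return P_s, `reverse_step` is the Eq. 3 helper sketched earlier, and the rescaling between the [0, 1] probabilities and the [−1, 1] range of Y is an assumption consistent with Sec. 5.1.

```python
import torch

@torch.no_grad()
def infer_actions(decoder, cond, seq_len, num_classes, alpha_bars,
                  total_steps=1000, infer_steps=25, seed=0):
    """Iterative denoising at inference: start from pure noise and apply
    Eq. 9 along a skipped-step trajectory. `decoder` stands in for g_psi
    and is assumed to return P_s in [0, 1]; mapping P_s back to the
    [-1, 1] range used for Y is an assumption of this sketch."""
    torch.manual_seed(seed)                        # fixed seed -> deterministic run
    y = torch.randn(seq_len, num_classes)          # hat(Y)_S ~ N(0, I)
    steps = torch.linspace(total_steps, 0, infer_steps + 1).long().tolist()
    p_s = None
    for s, s_prev in zip(steps[:-1], steps[1:]):
        p_s = decoder(y, s, cond)                  # denoised sequence P_s
        y = reverse_step(y, p_s * 2.0 - 1.0, s, s_prev, alpha_bars, sigma=0.0)
    return p_s                                     # final prediction hat(Y)_0
```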
4.2. Action Prior Modeling
Human behaviors follow regular patterns. We identify
three valuable prior knowledge for action segmentation,
which are the position prior, the boundary ambiguity, and
the relation prior. One unique advantage of diffusion-based
action segmentation is that it can capture the prior distribu-
tion of action sequences via its generative learning ability.
This allows us to further exploit the three action priors by
changing the conditional information for the model.
In detail, we devise a condition masking strategy to
control the conditional information, which applies a mask
M ∈ {0, 1}^L to the features E in Eq. 5. At each training
step, the mask is randomly sampled from a set, M ∈_R
{M^N, M^P, M^B, M^R}, with each element detailed below.
No Masking (N). The first type is a basic all-one mask
M^N = 1, which lets all features pass into the decoder. This
naive mask provides full conditional information for the
model to map visual features to action classes.
Masking for Position Prior (P). The second type is an
all-zero mask M^P = 0 that entirely blocks the features.
Without any visual features, the only cues the decoder can
rely on are the video length and the frame positions in the
video1. Therefore, the modeling of the positional prior of
actions is promoted.
Masking for Boundary Prior (B). Due to the ambiguity
of action transitions, the visual features around boundaries
may not be reliable. For this reason, the third mask MB
removes the features close to boundaries based on the soft
ground truth B̄. This mask is defined as M^B_i = 1(B̄_i < 0.5),
i ∈ {1, 2, ..., L}. With mask M^B, the decoder is encouraged
to further explore the context information about the action
before and after the boundary and their durations, which can
be more robust than relying on the unreliable features alone.
1Specifically, such positional information is available due to the tem-
poral convolutions in the decoder.
Masking for Relation Prior (R). The ordinal relation is
another fundamental characteristic of human actions. We
thus mask the segments belonging to a random action class
c̃ ∈_R {1, 2, ..., C} during training to enforce the model to
infer the missing action based on its surrounding actions.
The mask is denoted as M^R, where M^R_i = 1(Y_{0,i,c̃} ≠ 1),
i ∈ {1, 2, ..., L}. Such masked segment modeling benefits the
utilization of the relational dependencies among actions.
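A sketch of sampling one of the four masks for a training iteration is given below; treating the two frames adjacent to a soft-boundary position as near a boundary for M^B is one interpretation of the paper's indexing, since B̄ has length L − 1.

```python
import torch

def sample_condition_mask(labels: torch.Tensor, b_soft: torch.Tensor, num_classes: int):
    """Sample M uniformly from {M^N, M^P, M^B, M^R} for one training
    iteration. labels: (L,) GT classes; b_soft: (L-1,) soft boundaries.
    Marking both frames adjacent to a soft-boundary position for M^B is
    an interpretation of the paper's indexing, which uses length L-1."""
    L = labels.shape[0]
    choice = torch.randint(0, 4, (1,)).item()
    if choice == 0:                                    # M^N: keep all features
        return torch.ones(L)
    if choice == 1:                                    # M^P: block all features
        return torch.zeros(L)
    if choice == 2:                                    # M^B: drop frames near boundaries
        near = torch.zeros(L, dtype=torch.bool)
        near[:-1] |= b_soft >= 0.5
        near[1:] |= b_soft >= 0.5
        return (~near).float()
    c = torch.randint(0, num_classes, (1,)).item()     # M^R: drop one random class
    return (labels != c).float()
```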
To summarize, by the condition masking strategy in
Fig. 3, the three human action priors are simultaneously
integrated into our diffusion-based framework. One inter-
esting interpretation of this strategy is from the perspective
of the classifier-free guidance of diffusion models [27]. Our
model can be viewed as fully conditional with the mask
M^N, partially conditional with the masks M^B and M^R, and
unconditional with the mask M^P. This brings a versatile
model that captures the action distribution from different as-
pects. We also discuss more potential forms and extended
usages of this masking strategy in the supplementary material.
5. Experiments
5.1. Setup
Datasets. Experiments are performed on three bench-
mark datasets. GTEA [21] contains 28 egocentric daily ac-
tivity videos with 11 action classes. On average, each video
is about one minute long and contains roughly 19 action
instances. 50Salads [59] includes 50 top-view videos regarding
salad preparation with actions falling into 17 classes. The
average video length is six minutes and the average number of
instances per video is roughly 20. Breakfast [34] comprises
1712 videos in the third-person view and 48 action classes
related to making breakfast. The videos are two minutes long
on average but show a large variance in length. Seven action
instances are contained in each video on average. Among the
three datasets, Breakfast has the largest scale, and 50Salads
consists of the longest videos and the most instances per video.
Following previous works [40, 6, 69, 72, 41, 31, 20], five-fold
cross-validation is performed on 50Salads and four-fold
cross-validation is performed on GTEA and Breakfast. We use
the same splits
as in previous works.
Metrics. Following previous works, the frame-wise ac-
curacy (Acc), the edit score (Edit), and the F1 scores at over-
lap thresholds 10%, 25%, 50% (F1@{10, 25, 50}) are re-
ported. The accuracy assesses the results at the frame level,
while the edit score and F1 scores measure the performance
at the segment level. The same evaluation codes are used as
in previous works [72, 20].
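For reference, a minimal sketch of the segmental edit score is shown below: frame-wise predictions are collapsed into segment label sequences and compared with a normalized Levenshtein distance. This mirrors the standard evaluation protocol used by previous works [72, 20] but is not the authors' evaluation script.

```python
def to_segments(frame_labels):
    """Collapse a frame-wise label sequence into its segment label sequence."""
    segs = []
    for lab in frame_labels:
        if not segs or segs[-1] != lab:
            segs.append(lab)
    return segs

def edit_score(pred_frames, gt_frames):
    """Segmental edit score: 100 * (1 - normalized Levenshtein distance)
    between predicted and ground-truth segment label sequences."""
    p, g = to_segments(pred_frames), to_segments(gt_frames)
    m, n = len(p), len(g)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + cost)
    return 100.0 * (1.0 - D[m][n] / max(m, n, 1))
```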
Implementation Details. For all the datasets, we lever-
age the I3D features [8] used in most prior works as the
input features F with D = 2048 dimensions. The encoder
hϕ is a re-implemented ASFormer encoder [72]. The de-
Method (per dataset: F1@{10, 25, 50} | Edit | Acc | Avg)

[41] MS-TCN++, PAMI'20
  GTEA:      88.8 / 85.7 / 76.0 | 83.5 | 80.1 | 82.8
  50Salads:  80.7 / 78.5 / 70.1 | 74.3 | 83.7 | 77.5
  Breakfast: 64.1 / 58.6 / 45.9 | 65.6 | 67.6 | 60.4
[11] SSTDA, CVPR'20
  GTEA:      90.0 / 89.1 / 78.0 | 86.2 | 79.8 | 84.6
  50Salads:  83.0 / 81.5 / 73.8 | 75.8 | 83.2 | 79.5
  Breakfast: 75.0 / 69.1 / 55.2 | 73.7 | 70.2 | 68.6
[29] GTRM, CVPR'20
  GTEA:      - / - / - | - | - | -
  50Salads:  75.4 / 72.8 / 63.9 | 67.5 | 82.6 | 72.4
  Breakfast: 57.5 / 54.0 / 43.3 | 58.7 | 65.0 | 55.7
[68] BCN, ECCV'20
  GTEA:      88.5 / 87.1 / 77.3 | 84.4 | 79.8 | 83.4
  50Salads:  82.3 / 81.3 / 74.0 | 74.3 | 84.4 | 79.3
  Breakfast: 68.7 / 65.5 / 55.0 | 66.2 | 70.4 | 65.2
[10] MTDA, WACV'20
  GTEA:      90.5 / 88.4 / 76.2 | 85.8 | 80.0 | 84.2
  50Salads:  82.0 / 80.1 / 72.5 | 75.2 | 83.2 | 78.6
  Breakfast: 74.2 / 68.6 / 56.5 | 73.6 | 71.0 | 68.8
[23] G2L, CVPR'21
  GTEA:      89.9 / 87.3 / 75.8 | 84.6 | 78.5 | 83.2
  50Salads:  80.3 / 78.0 / 69.8 | 73.4 | 82.2 | 76.7
  Breakfast: 74.9 / 69.0 / 55.2 | 73.3 | 70.7 | 68.6
[1] HASR, ICCV'21
  GTEA:      90.9 / 88.6 / 76.4 | 87.5 | 78.7 | 84.4
  50Salads:  86.6 / 85.7 / 78.5 | 81.0 | 83.9 | 83.1
  Breakfast: 74.7 / 69.5 / 57.0 | 71.9 | 69.4 | 68.5
[31] ASRF, WACV'21
  GTEA:      89.4 / 87.8 / 79.8 | 83.7 | 77.3 | 83.6
  50Salads:  84.9 / 83.5 / 77.3 | 79.3 | 84.5 | 81.9
  Breakfast: 74.3 / 68.9 / 56.1 | 72.4 | 67.6 | 67.9
[72] ASFormer, BMVC'21
  GTEA:      90.1 / 88.8 / 79.2 | 84.6 | 79.7 | 84.5
  50Salads:  85.1 / 83.4 / 76.0 | 79.6 | 85.6 | 81.9
  Breakfast: 76.0 / 70.6 / 57.4 | 75.0 | 73.5 | 70.5
[9] UARL, IJCAI'22
  GTEA:      92.7 / 91.5 / 82.8 | 88.1 | 79.6 | 86.9
  50Salads:  85.3 / 83.5 / 77.8 | 78.2 | 84.1 | 81.8
  Breakfast: 65.2 / 59.4 / 47.4 | 66.2 | 67.8 | 61.2
[47] DPRN, PR'22
  GTEA:      92.9 / 92.0 / 82.9 | 90.9 | 82.0 | 88.1
  50Salads:  87.8 / 86.3 / 79.4 | 82.0 | 87.2 | 84.5
  Breakfast: 75.6 / 70.5 / 57.6 | 75.1 | 71.7 | 70.1
[33] SEDT, EL'22
  GTEA:      93.7 / 92.4 / 84.0 | 91.3 | 81.3 | 88.5
  50Salads:  89.9 / 88.7 / 81.1 | 84.7 | 86.5 | 86.2
  Breakfast: - / - / - | - | - | -
[4] TCTr, IVC'22
  GTEA:      91.3 / 90.1 / 80.0 | 87.9 | 81.1 | 86.1
  50Salads:  87.5 / 86.1 / 80.2 | 83.4 | 86.6 | 84.8
  Breakfast: 76.6 / 71.1 / 58.5 | 76.1 | 77.5 | 72.0
[19] FAMMSDTN, NPL'22
  GTEA:      91.6 / 90.9 / 80.9 | 88.3 | 80.7 | 86.5
  50Salads:  86.2 / 84.4 / 77.9 | 79.9 | 86.4 | 83.0
  Breakfast: 78.5 / 72.9 / 60.2 | 77.5 | 74.8 | 72.8
[69] DTL, NeurIPS'22
  GTEA:      - / - / - | - | - | -
  50Salads:  87.1 / 85.7 / 78.5 | 80.5 | 86.9 | 83.7
  Breakfast: 78.8 / 74.5 / 62.9 | 77.7 | 75.8 | 73.9
[6] UVAST, ECCV'22
  GTEA:      92.7 / 91.3 / 81.0 | 92.1 | 80.2 | 87.5
  50Salads:  89.1 / 87.6 / 81.7 | 83.9 | 87.4 | 85.9
  Breakfast: 76.9 / 71.5 / 58.0 | 77.1 | 69.7 | 70.6
[40] BrPrompt, CVPR'22
  GTEA:      94.1 / 92.0 / 83.0 | 91.6 | 81.2 | 88.4
  50Salads:  89.2 / 87.8 / 81.3 | 83.8 | 88.1 | 86.0
  Breakfast: - / - / - | - | - | -
[30] MCFM, ICIP'22
  GTEA:      91.8 / 91.2 / 80.8 | 88.0 | 80.5 | 86.5
  50Salads:  90.6 / 89.5 / 84.2 | 84.6 | 90.3 | 87.8
  Breakfast: - / - / - | - | - | -
DiffAct, Ours
  GTEA:      92.5 / 91.5 / 84.7 | 89.6 | 82.2 | 88.1
  50Salads:  90.1 / 89.2 / 83.7 | 85.0 | 88.9 | 87.4
  Breakfast: 80.3 / 75.9 / 64.6 | 78.4 | 76.4 | 75.1
Table 1. Comparison with state-of-the-art methods. Methods in gray are not suitable for direct comparison due to the extra usage of multi-
modal features [40] or hand pose features [30]. We list them here for readers’ reference. Our method achieves superior results on 50Salads
and Breakfast, and comparable performance on GTEA. The average number (Avg) of the five evaluation metrics is also presented.
coder gψ is a re-implemented ASFormer decoder modified
to be step-dependent, which adds a step embedding to the
input as in [26]. The encoder has 10, 10, 12 layers and
64, 64, 256 feature maps for GTEA, 50Salads, and Break-
fast respectively. We adjust the decoder to be lightweight to
reduce the computational cost of the iterative denoising pro-
cess, which includes 8 layers and 24, 24, 128 feature maps
for the three datasets respectively. The intermediate features
from encoder layers with indices 5, 7, 9 are concatenated to
be the conditioning features E with D′ = 768 for Breakfast
and D′ = 192 for other datasets. The encoder and decoder
are trained end-to-end using Adam with a batch size of 4.
The learning rate is 1e-4 for Breakfast and 5e-4 for other
datasets. In addition to the loss Lsum for the decoder out-
puts, we append a prediction head to the encoder and apply
Lce and Lsmo as auxiliary supervision. The total steps are
set as S = 1000 and 25 steps are utilized at inference based
on the sampling strategy with skipped steps [54]. The ac-
tion sequences are normalized to [-1, 1] when adding and
removing noise in Eq. 4 and Eq. 9. All frames are processed
together and all actions are predicted together, without any
auto-regressive method at training or inference.
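For quick reference, the settings above can be collected in one place; the dictionary below merely restates them in an illustrative form, and the field names are ours rather than those of the released code.

```python
# Restatement of the stated settings (GTEA / 50Salads / Breakfast);
# field names are illustrative and not taken from the released code.
DIFFACT_SETTINGS = {
    "input_dim": 2048,                                   # I3D features, D = 2048
    "encoder_layers": {"gtea": 10, "50salads": 10, "breakfast": 12},
    "encoder_maps":   {"gtea": 64, "50salads": 64, "breakfast": 256},
    "decoder_layers": 8,
    "decoder_maps":   {"gtea": 24, "50salads": 24, "breakfast": 128},
    "condition_layers": (5, 7, 9),                       # encoder layers concatenated into E
    "condition_dim":  {"gtea": 192, "50salads": 192, "breakfast": 768},
    "optimizer": "Adam",
    "batch_size": 4,
    "learning_rate": {"gtea": 5e-4, "50salads": 5e-4, "breakfast": 1e-4},
    "total_steps": 1000,                                 # S
    "inference_steps": 25,                               # skipped-step sampling [54]
    "label_range": (-1.0, 1.0),                          # Y normalized to [-1, 1]
}
```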
5.2. Comparison to State-of-the-Art
Table 1 presents the experimental results of our method
and other recent approaches on three datasets. Our pro-
posed method advances the state-of-the-art by an evident
margin on 50Salads and Breakfast, and achieves compara-
ble performance on GTEA. Specifically, the average per-
formance is improved from 86.2 to 87.4 on 50Salads and
from 73.9 to 75.1 on Breakfast. On the smallest dataset
GTEA, our method obtains similar overall performance
with higher accuracy and F1@50. The results show the ef-
fectiveness of our diffusion-based action segmentation as
a new framework and its particular advantage on large or
complex datasets. It is also promising to combine more re-
cent backbones such as SEDT [33] and DPRN [47] into our
framework to further improve the results.
5.3. Ablation Studies
Extensive ablation studies are performed to validate the
design choices in our method. We select the 50Salads
dataset for ablation studies because of its substantial
complexity and suitable data size.
Effect of Prior Modeling. To inspect the impact of prior
modeling, experiments are conducted in Table 2 with differ-
ent combinations of condition masking schemes. It is ob-
served that our method reaches the best performance when
all three priors are considered. Notably, the position prior is
especially useful among the three priors.
Effect of Training Losses. In Table 3, we investigate the
effect of the loss functions, where each of the following
configurations is adopted: the full L^sum loss, L^sum without
L^bd, L^sum without L^smo, and the vanilla L^ce loss. It is found
that all the loss components are necessary for the best result.
Our proposed boundary alignment loss Lbd brings perfor-
mance gain in terms of both frame-wise accuracy and tem-
poral continuity on top of Lce and Lsmo.
Effect of Inference Steps. Experiment results using dif-
ferent number of inference steps are reported in Table 4,
from which we can notice a steady increase in performance,
with diminishing marginal benefits, as the step number gets
N P B R | F1@{10, 25, 50} | Edit | Acc | Avg
89.0 / 88.1 / 82.4 | 83.7 | 88.1 | 86.3
89.9 / 88.9 / 82.8 | 84.3 | 88.2 | 86.8
89.7 / 88.6 / 82.6 | 83.9 | 88.2 | 86.6
89.6 / 88.7 / 82.7 | 84.0 | 88.0 | 86.6
89.4 / 88.7 / 83.0 | 84.4 | 88.2 | 86.7
90.0 / 88.8 / 83.4 | 84.4 | 88.8 | 87.1
90.2 / 89.3 / 83.6 | 84.6 | 88.5 | 87.2
90.1 / 89.2 / 83.7 | 85.0 | 88.9 | 87.4
Table 2. Ablation study on the prior modeling. N: Baseline. P:
Position prior. B: Boundary prior. R: Relation prior. For each
row, a scheme is randomly selected from the ticked ones at each
training iteration.
Lce Lsmo Lbd | F1@{10, 25, 50} | Edit | Acc | Avg
86.7 / 85.3 / 79.2 | 80.8 | 87.0 | 83.8
89.8 / 88.9 / 83.1 | 84.0 | 88.8 | 86.9
86.9 / 86.1 / 78.7 | 81.0 | 85.4 | 83.6
90.1 / 89.2 / 83.7 | 85.0 | 88.9 | 87.4
Table 3. Ablation study on the loss functions.
Steps | F1@{10, 25, 50} | Edit | Acc | Avg
1 | 64.9 / 63.8 / 59.3 | 56.5 | 88.6 | 66.6
2 | 81.7 / 80.5 / 75.5 | 74.5 | 88.9 | 80.2
4 | 87.6 / 86.6 / 81.2 | 82.1 | 89.1 | 85.3
8 | 89.3 / 88.3 / 83.1 | 83.5 | 89.0 | 86.6
16 | 90.0 / 88.8 / 83.3 | 84.5 | 89.0 | 87.1
25 | 90.1 / 89.2 / 83.7 | 85.0 | 88.9 | 87.4
50 | 90.4 / 89.5 / 84.0 | 85.3 | 89.0 | 87.6
100 | 90.4 / 89.7 / 84.3 | 85.3 | 88.9 | 87.7
Table 4. Ablation study on the number of inference steps.
Features | F1@{10, 25, 50} | Edit | Acc | Avg
Input Features F | 82.5 / 80.6 / 72.3 | 75.7 | 82.5 | 78.7
h_φ Layer 5 | 90.3 / 89.2 / 83.9 | 85.1 | 89.1 | 87.5
h_φ Layer 7 | 90.4 / 89.4 / 83.4 | 85.0 | 88.8 | 87.4
h_φ Layer 9 | 90.0 / 89.0 / 83.4 | 84.4 | 88.8 | 87.1
h_φ Layer 5,7,9 | 90.1 / 89.2 / 83.7 | 85.0 | 88.9 | 87.4
h_φ Prediction | 90.3 / 89.3 / 83.3 | 84.6 | 87.8 | 87.1
Table 5. Ablation study on the conditioning features.
larger. The computation grows linearly with the step num-
ber. We leverage 25 steps to keep a good balance between
the performance and the computational cost.
Effect of Conditioning Features. For the condition of
generation, the input video features F and the features from
different layers of the encoder hϕ are explored as in Table 5.
The performance drops remarkably when using the input
feature F as the condition, suggesting the necessity of an
encoder. On the other hand, the performance is not sensitive
to which encoder layer the features are extracted from.
Effect of the Backbone. The choices of the encoder
and decoder in DiffAct are flexible. Therefore, we change
our backbone to MS-TCN [20] to show such flexibility. In
detail, a single-stage TCN is directly used as the encoder
and is modified with the step embedding to be the decoder.
Method | F1@{10, 25, 50} | Edit | Acc | Avg
[20] MS-TCN | 76.3 / 74.0 / 64.5 | 67.9 | 80.7 | 72.7
[41] MS-TCN++ | 80.7 / 78.5 / 70.1 | 74.3 | 83.7 | 77.5
[1] HASR (MS-TCN) | 83.4 / 81.8 / 71.9 | 77.4 | 81.7 | 79.2
[31] ASRF | 84.9 / 83.5 / 77.3 | 79.3 | 84.5 | 81.9
[69] DTL (MS-TCN) | 78.3 / 76.5 / 67.6 | 70.5 | 81.5 | 74.9
DiffAct (MS-TCN) | 86.9 / 85.3 / 79.4 | 80.3 | 88.2 | 84.0
Table 6. Results on 50Salads using MS-TCN backbone.
Figure 4. Visualization of the iterative denoising process. The
ground truth is presented in (b), where some segments are marked
with class labels. The (a) and (c) respectively plot the inference
trajectory Ŷ_s and the denoised sequences P_s at different steps
(Eq. 9). The video is ‘rgb-01-2’ from 50Salads.
Table 6 compares our results to recent methods with MS-
TCN backbones, showing the superiority of our method.
5.4. Qualitative Result and Computational Cost
Qualitative Result. To illustrate the refinement process
along the denoising steps, the step-wise results for a video
from 50Salads are visualized in Fig. 4. The model refines
an initial random noise sequence to generate the final action
prediction in an iterative manner. For example, as in the
black box in Fig. 4, the segment of ‘cut cucumber’ is bro-
ken up by ‘cut tomato’ and ‘peel cucumber’, which share
similar visual representations. After a number of iterations,
the relation between these actions is constructed and the er-
ror is gradually corrected. Finally, a continuous segment of
‘cut cucumber’ can be properly predicted.
Computational Cost. Table 7 compares the computa-
tional costs of our method and its backbone ASFormer [72].
Our method, which is equipped with a lightweight decoder,
largely outperforms ASFormer with fewer FLOPs at infer-
ence when using 8 steps. Using 25 steps, our method further
improves the result at an acceptable overhead.
Method | Avg | #params | FLOPs | Mem. | Time
ASFormer [72] | 81.9 | 1.134M | 6.66G | 3.5G | 2.38s
DiffAct (8 Steps) | 86.6 | 0.975M | 4.96G | 1.9G | 0.68s
DiffAct (16 Steps) | 87.1 | 0.975M | 7.73G | 1.9G | 1.30s
DiffAct (25 Steps) | 87.4 | 0.975M | 10.85G | 1.9G | 2.09s
Table 7. Computational cost comparison. The number of parame-
ters, the average FLOPs at inference, the GPU memory cost during
training, the average inference time, and the average performance
(Avg) on 50Salads for our method and ASFormer.
Masking | F1@{10, 25, 50} | Edit | Acc | Avg
N | 90.1 / 89.2 / 83.7 | 85.0 | 88.9 | 87.4
P | 25.7 / 21.5 / 11.6 | 34.8 | 20.9 | 22.9
B | 89.4 / 88.6 / 83.0 | 84.1 | 88.4 | 86.7
R | 88.9 / 87.8 / 81.7 | 83.5 | 87.2 | 85.8
Table 8. Results on 50Salads using different condition masking
types at the inference stage. Note that this is only for analysis
purposes since the mask types B and R depend on the ground truth.
The model performance is maintained at a reasonable level using
different masks, suggesting the action priors are well captured.
Figure 5. Visualization of the masks and the corresponding predic-
tions using the masked conditions at inference. The video is ‘rgb-
03-2’ from 50Salads. In MN, MP, MB, MR, masked locations are
colored in black. More results using MP at inference and further
discussions are given in the supplementary material.
6. Discussion
Analysis of the Prior Modeling. In this section, an ex-
ploratory experiment is performed to analyze to what extent
the position prior, boundary prior, and relation prior are cap-
tured in our model. Recall that the proposed method uses
no masking (MN) at inference by default; in this experiment,
we instead input the masked conditions with each masking
type (MP, MB, MR) for inference. As in Table 8,
the model can still achieve reasonably good performance
when the mask MB or mask MR is applied, indicating that
the boundary prior and relation prior are well handled. It
is also interesting to discover that the result using the com-
pletely masked condition (MP), which has a 34.8 edit score,
is much better than random guessing. This reveals that the
model has learned meaningful correlations between actions
and time locations via our position prior modeling. We fur-
ther visualize in Fig. 5 the condition masks and the cor-
responding action predictions for a video when each mask
type is applied at inference. It is clear that the model pro-
duces a generally plausible action sequence when all the
features are blocked by MP. For example, the actions of
cutting and placing ingredients are located in the middle of
the video (Fig. 5 A), while the actions of mixing and serv-
ing occur at the end (Fig. 5 B). With mask MB, the model is
still able to find action boundaries. The missing action ‘cut
tomato’ masked by MR is successfully restored in Fig. 5 C.
These analyses demonstrate the capability of our method in
prior modeling.
Limitation and Future Work. One limitation of the
proposed method is that its advantage on the small-scale
dataset, GTEA, is not as significant as on large datasets. We
speculate that it is more difficult to generatively learn the
distribution of action sequences given only a few videos,
which leads to a lower edit score. Note that this is not a
problem on large datasets on which the model makes clear
gains in terms of the edit score in Table 1. Potential reme-
dies on small data include model reassembly [71] or replac-
ing the Gaussian noise in the diffusion process with some
perturbations based on the statistics of actions, e.g., trans-
forming the distribution towards the mean sequence ob-
tained from the training set, to reduce the hypothesis space
and thus the amount of data required. Future work can also
jointly combine the generation of frame-wise action sequences
and segment-wise ordered action lists in our diffusion-based
action segmentation.
the current framework for unified action segmentation and
action anticipation in the future since our generative frame-
work is intuitively appropriate for the anticipation task. We
share other early attempts in the supplementary.
7. Conclusion
This paper proposes a new framework for tempo-
ral action segmentation which generates action sequences
through an iterative denoising process. A flexible condition
masking strategy is designed to jointly exploit the position
prior, the boundary prior, and the relation prior of human
actions. With its nature of iterative refinement, its ability of
generative modeling, and its enhancement of the three ac-
tion priors, the proposed framework achieves state-of-the-
art results on benchmark datasets, unlocking new possibili-
ties for action segmentation.
Acknowledgement.
This work was supported in
part by the Australian Research Council under Project
DP210101859 and the University of Sydney Research Ac-
celerator (SOAR) Prize. The training platforms supporting
this work were provided by High-Flyer AI and National
Computational Infrastructure Australia.
References
[1] Hyemin Ahn and Dongheui Lee. Refining action segmenta-
tion with hierarchical video representations. In ICCV, 2021.
2, 6, 7
[2] Tomer Amit, Eliya Nachmani, Tal Shaharbany, and Lior
Wolf. SegDiff: Image segmentation with diffusion proba-
bilistic models. arXiv preprint arXiv:2112.00390, 2021. 3
[3] Evlampios Apostolidis, Eleni Adamantidou, Alexandros I
Metsai, Vasileios Mezaris, and Ioannis Patras. Video sum-
marization using deep neural networks: A survey. Proceed-
ings of the IEEE, 2021. 1
[4] Nicolas Aziere and Sinisa Todorovic. Multistage temporal
convolution transformer for action segmentation. Image and
Vision Computing, 2022. 2, 6
[5] Dmitry Baranchuk, Ivan Rubachev, Andrey Voynov,
Valentin Khrulkov, and Artem Babenko. Label-efficient se-
mantic segmentation with diffusion models. ICLR, 2021. 3
[6] Nadine Behrmann, S Alireza Golestaneh, Zico Kolter,
Jürgen Gall, and Mehdi Noroozi. Unified fully and times-
tamp supervised temporal action segmentation via sequence
to sequence translation. In ECCV, 2022. 2, 5, 6
[7] Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal,
Rao Muhammad Anwer, Jorma Laaksonen, Mubarak Shah,
and Fahad Shahbaz Khan. Person image synthesis via de-
noising diffusion model. arXiv preprint arXiv:2211.12500,
2022. 3
[8] Joao Carreira and Andrew Zisserman. Quo vadis, action
recognition? a new model and the Kinetics dataset. In CVPR,
2017. 5
[9] Lei Chen, Muheng Li, Yueqi Duan, Jie Zhou, and Jiwen
Lu. Uncertainty-aware representation learning for action
segmentation. In IJCAI, 2022. 2, 6
[10] Min-Hung Chen, Baopu Li, Yingze Bao, and Ghassan Al-
Regib. Action segmentation with mixed temporal domain
adaptation. In WACV, 2020. 2, 6
[11] Min-Hung Chen, Baopu Li, Yingze Bao, Ghassan Al-
Regib, and Zsolt Kira. Action segmentation with joint self-
supervised temporal domain adaptation. In CVPR, 2020. 2,
6
[12] Shoufa Chen, Peize Sun, Yibing Song, and Ping Luo. Dif-
fusionDet: Diffusion model for object detection. arXiv
preprint arXiv:2211.09788, 2022. 3
[13] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu,
and Mubarak Shah. Diffusion models in vision: A survey.
arXiv preprint arXiv:2209.04747, 2022. 1, 3
[14] Prafulla Dhariwal and Alexander Nichol. Diffusion models
beat GANs on image synthesis. NeurIPS, 2021. 1, 3
[15] Guodong Ding, Fadime Sener, and Angela Yao. Tempo-
ral action segmentation: An analysis of modern technique.
arXiv preprint arXiv:2210.10352, 2022. 1, 2
[16] Li Ding and Chenliang Xu. Tricornet: A hybrid temporal
convolutional and recurrent network for video action seg-
mentation. arXiv preprint arXiv:1705.07818, 2017. 2
[17] Anh-Dung Dinh, Daochang Liu, and Chang Xu. PixelAs-
Param: A gradient view on diffusion sampling with guid-
ance. In ICML, 2023. 3
[18] Dazhao Du, Bing Su, Yu Li, Zhongang Qi, Lingyu Si, and
Ying Shan. Do we really need temporal convolutions in ac-
tion segmentation? arXiv preprint arXiv:2205.13425, 2022.
2
[19] Zexing Du and Qing Wang. Dilated transformer with feature
aggregation module for action segmentation. Neural Pro-
cessing Letters, 2022. 2, 6
[20] Yazan Abu Farha and Jurgen Gall. MS-TCN: Multi-stage
temporal convolutional network for action segmentation. In
CVPR, 2019. 1, 2, 4, 5, 7
[21] Alireza Fathi, Xiaofeng Ren, and James M Rehg. Learning
to recognize objects in egocentric activities. In CVPR, 2011.
2, 5
[22] Harshala Gammulle, Tharindu Fernando, Simon Denman,
Sridha Sridharan, and Clinton Fookes. Coupled generative
adversarial network for continuous fine-grained action seg-
mentation. In WACV, 2019. 2
[23] Shang-Hua Gao, Qi Han, Zhong-Yu Li, Pai Peng, Liang
Wang, and Ming-Ming Cheng. Global2local: Efficient struc-
ture search for video action segmentation. In CVPR, 2021.
2, 6
[24] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo
Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vec-
tor quantized diffusion model for text-to-image synthesis. In
CVPR, 2022. 3
[25] Basavaraj Hampiholi, Christian Jarvers, Wolfgang Mader,
and Heiko Neumann. Depthwise separable temporal con-
volutional network for action segmentation. In 3DV, 2020.
2
[26] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-
sion probabilistic models. NeurIPS, 2020. 1, 3, 6
[27] Jonathan Ho and Tim Salimans. Classifier-free diffusion
guidance. In NeurIPS 2021 Workshop on Deep Generative
Models and Downstream Applications, 2021. 5
[28] Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen,
and Andrea Dittadi. Diffusion models for video prediction
and infilling. arXiv preprint arXiv:2206.07696, 2022. 3
[29] Yifei Huang, Yusuke Sugano, and Yoichi Sato. Improving
action segmentation via graph-based temporal reasoning. In
CVPR, 2020. 2, 6
[30] Kenta Ishihara, Gaku Nakano, and Tetsuo Inoshita. MCFM:
Mutual cross fusion module for intermediate fusion-based
action segmentation. In ICIP, 2022. 6
[31] Yuchi Ishikawa, Seito Kasai, Yoshimitsu Aoki, and Hi-
rokatsu Kataoka. Alleviating over-segmentation errors by
detecting action boundaries. In WACV, 2021. 2, 5, 6, 7
[32] Gwanghyun Kim and Jong Chul Ye. DiffusionClip: Text-
guided image manipulation using diffusion models. In
CVPR, 2022. 3
[33] Gyeong-hyeon Kim and Eunwoo Kim. Stacked encoder-
decoder transformer with boundary smoothing for action
segmentation. Electronics Letters, 2022. 2, 6
[34] Hilde Kuehne, Ali Arslan, and Thomas Serre. The language
of actions: Recovering the syntax and semantics of goal-
directed human activities. In CVPR, 2014. 2, 5
[35] Max WY Lam, Jun Wang, Dan Su, and Dong Yu. BDDM:
Bilateral denoising diffusion models for fast and high-quality
speech synthesis. arXiv preprint arXiv:2203.13508, 2022. 3
[36] Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and
Gregory D Hager. Temporal convolutional networks for ac-
tion segmentation and detection. In CVPR, 2017. 2
[37] Colin Lea, Austin Reiter, René Vidal, and Gregory D Hager.
Segmental spatiotemporal CNNs for fine-grained action seg-
mentation. In ECCV, 2016. 2
[38] Peng Lei and Sinisa Todorovic. Temporal deformable resid-
ual networks for action segmentation in videos. In CVPR,
2018. 2
[39] Yichong Leng, Zehua Chen, Junliang Guo, Haohe Liu, Ji-
awei Chen, Xu Tan, Danilo Mandic, Lei He, Xiang-Yang Li,
Tao Qin, et al. Binauralgrad: A two-stage conditional diffu-
sion probabilistic model for binaural audio synthesis. arXiv
preprint arXiv:2205.14807, 2022. 3
[40] Muheng Li, Lei Chen, Yueqi Duan, Zhilan Hu, Jianjiang
Feng, Jie Zhou, and Jiwen Lu. Bridge-prompt: Towards or-
dinal action understanding in instructional videos. In CVPR,
2022. 2, 5, 6
[41] Shi-Jie Li, Yazan AbuFarha, Yun Liu, Ming-Ming Cheng,
and Juergen Gall. MS-TCN++: Multi-stage temporal con-
volutional network for action segmentation. IEEE TPAMI,
2020. 1, 2, 4, 5, 6, 7
[42] Yunheng Li, Zhuben Dong, Kaiyuan Liu, Lin Feng, Lianyu
Hu, Jie Zhu, Li Xu, Shenglan Liu, et al. Efficient two-step
networks for temporal action segmentation. Neurocomput-
ing, 2021. 2
[43] Daochang Liu, Qiyue Li, Tingting Jiang, Yizhou Wang,
Rulin Miao, Fei Shan, and Ziyu Li. Towards unified surgical
skill assessment. In CVPR, 2021. 1
[44] Zhichao Liu, Leshan Wang, Desen Zhou, Jian Wang,
Songyang Zhang, Yang Bai, Errui Ding, and Rui Fan. Tem-
poral segment transformer for action segmentation. arXiv
preprint arXiv:2302.13074, 2023. 2
[45] Calvin Luo. Understanding diffusion models: A unified per-
spective. arXiv preprint arXiv:2208.11970, 2022. 3
[46] Khoi-Nguyen C Mac, Dhiraj Joshi, Raymond A Yeh, Jinjun
Xiong, Rogerio S Feris, and Minh N Do. Learning motion
in feature space: Locally-consistent deformable convolution
networks for fine-grained action detection. In ICCV, 2019. 2
[47] Junyong Park, Daekyum Kim, Sejoon Huh, and Sungho Jo.
Maximization and restoration: Action segmentation through
dilation passing and temporal reconstruction. Pattern Recog-
nition, 2022. 2, 6
[48] Konpat Preechakul, Nattanat Chatthee, Suttisak Wizad-
wongsa, and Supasorn Suwajanakorn. Diffusion autoen-
coders: Toward a meaningful and decodable representation.
In CVPR, pages 10619–10629, 2022. 3
[49] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image syn-
thesis with latent diffusion models. In CVPR, 2022. 3
[50] Fadime Sener, Dipika Singhania, and Angela Yao. Temporal
aggregate representations for long-range video understand-
ing. In ECCV, 2020. 2
[51] Bharat Singh, Tim K Marks, Michael Jones, Oncel Tuzel,
and Ming Shao. A multi-stream bi-directional recurrent neu-
ral network for fine-grained action detection. In CVPR, 2016.
2
[52] Dipika Singhania, Rahul Rahaman, and Angela Yao. Coarse
to fine multi-resolution temporal convolutional network.
arXiv preprint arXiv:2105.10859, 2021. 2
[53] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan,
and Surya Ganguli. Deep unsupervised learning using
nonequilibrium thermodynamics. In ICML, 2015. 3
[54] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois-
ing diffusion implicit models. ICLR, 2021. 1, 3, 5, 6
[55] Yang Song and Stefano Ermon. Generative modeling by es-
timating gradients of the data distribution. NeurIPS, 2019.
3
[56] Yang Song and Stefano Ermon. Improved techniques for
training score-based generative models. NeurIPS, 2020. 3
[57] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Ab-
hishek Kumar, Stefano Ermon, and Ben Poole. Score-based
generative modeling through stochastic differential equa-
tions. arXiv preprint arXiv:2011.13456, 2020. 3
[58] Yaser Souri, Yazan Abu Farha, Fabien Despinoy, Gianpiero
Francesca, and Juergen Gall. FIFA: Fast inference approxi-
mation for action segmentation. In DAGM German Confer-
ence on Pattern Recognition, 2021. 2
[59] Sebastian Stein and Stephen J McKenna. Combining em-
bedded accelerometers with computer vision for recognizing
food preparation activities. In Proceedings of the 2013 ACM
international joint conference on Pervasive and ubiquitous
computing, 2013. 2, 5
[60] Lorin Sweeney, Graham Healy, and Alan F Smeaton. Diffus-
ing surrogate dreams of video scenes to predict video mem-
orability. arXiv preprint arXiv:2212.09308, 2022. 3
[61] Xiaoyan Tian, Ye Jin, and Xianglong Tang. Local-Global
transformer neural network for temporal action segmenta-
tion. Multimedia Systems, 2022. 2
[62] Sarvesh Vishwakarma and Anupam Agrawal. A survey
on activity recognition and behavior understanding in video
surveillance. The Visual Computer, 2013. 1
[63] Vikram Voleti, Alexia Jolicoeur-Martineau, and Christopher
Pal. Masked conditional video diffusion for prediction, gen-
eration, and interpolation. arXiv preprint arXiv:2205.09853,
2022. 3
[64] Dong Wang, Yuan Yuan, and Qi Wang. Gated forward re-
finement network for action segmentation. Neurocomputing,
2020. 2
[65] Jiahao Wang, Zhengyin Du, Annan Li, and Yunhong Wang.
Atrous temporal convolutional network for video action seg-
mentation. In ICIP, 2019. 2
[66] Jiahui Wang, Zhenyou Wang, Shanna Zhuang, and Hui
Wang. Cross-enhancement transformer for action segmen-
tation. arXiv preprint arXiv:2205.09445, 2022. 2
[67] Yunke Wang, Xiyu Wang, Anh-Dung Dinh, Bo Du, and
Chang Xu. Learning to schedule in diffusion probabilistic
models. In KDD, 2023. 3
[68] Zhenzhi Wang, Ziteng Gao, Limin Wang, Zhifeng Li, and
Gangshan Wu. Boundary-aware cascade networks for tem-
poral action segmentation. In ECCV, 2020. 1, 2, 6
[69] Ziwei Xu, Yogesh S Rawat, Yongkang Wong, Mohan
Kankanhalli, and Mubarak Shah. Don’t pour cereal into cof-
fee: Differentiable temporal logic for temporal action seg-
mentation. In NeurIPS, 2022. 2, 5, 6, 7
[70] Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Dif-
fusion probabilistic modeling for video generation. arXiv
preprint arXiv:2203.09481, 2022. 3
[71] Xingyi Yang, Daquan Zhou, Songhua Liu, Jingwen Ye, and
Xinchao Wang. Deep model reassembly. In NeurIPS, 2022.
8
[72] Fangqiu Yi, Hongyu Wen, and Tingting Jiang. ASFormer:
Transformer for action segmentation. In BMVC, 2021. 1, 2,
5, 6, 7, 8
[73] Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang,
Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, and Ying Nian Wu.
Latent diffusion energy-based model for interpretable text
modeling. arXiv preprint arXiv:2206.05895, 2022. 3
[74] Junbin Zhang, Pei-Hsuan Tsai, and Meng-Hsun Tsai. Se-
mantic2graph: Graph-based multi-modal feature for action
segmentation in videos. arXiv preprint arXiv:2209.05653,
2022. 2
[75] Yunlu Zhang, Keyan Ren, Chun Zhang, and Tong Yan. SG-
TCN: Semantic guidance temporal convolutional network
for action segmentation. In IJCNN, 2022. 2
[76] Xian Zhong, Zipeng Li, Shuqin Chen, Kui Jiang, Chen
Chen, and Mang Ye. Refined semantic enhancement towards
frequency diffusion for video captioning. arXiv preprint
arXiv:2211.15076, 2022. 3