¹ Tencent Youtu Lab. Email: {taianguo,linuswu,hanjunli,ruizhiqiao,winfredsun}@tencent.com
² Tsinghua Shenzhen International Graduate School, Tsinghua University. Email: [email protected]
* Equal contribution. † Work done during internship at Tencent. ‡ Corresponding author: Ruizhi Qiao.

Multimodal Label Relevance Ranking via Reinforcement Learning

Taian Guo*¹ (ORCID 0000-0003-0787-511X), Taolin Zhang*†² (ORCID 0009-0006-2441-2861), Haoqian Wu¹ (ORCID 0000-0003-1035-1499), Hanjun Li¹ (ORCID 0009-0006-4211-7479), Ruizhi Qiao‡¹ (ORCID 0000-0002-3663-0149), Xing Sun¹ (ORCID 0000-0001-8132-9083)

Abstract

Conventional multi-label recognition methods often focus on label confidence, frequently overlooking the pivotal role of partial order relations consistent with human preference. To resolve these issues, we introduce a novel method for multimodal label relevance ranking, named Label Relevance Ranking with Proximal Policy Optimization (LR²PPO), which effectively discerns partial order relations among labels. LR²PPO first utilizes partial order pairs in the target domain to train a reward model, which aims to capture human preference intrinsic to the specific scenario. Furthermore, we meticulously design state representation and a policy loss tailored for ranking tasks, enabling LR²PPO to boost the performance of label relevance ranking model and largely reduce the requirement of partial order annotation for transferring to new scenes. To assist in the evaluation of our approach and similar methods, we further propose a novel benchmark dataset, LRMovieNet, featuring multimodal labels and their corresponding partial order data. Extensive experiments demonstrate that our LR²PPO algorithm achieves state-of-the-art performance, proving its effectiveness in addressing the multimodal label relevance ranking problem. Codes and the proposed LRMovieNet dataset are publicly available at https://github.com/ChazzyGordon/LR2PPO.

Keywords: Label Relevance Ranking · Reinforcement Learning · Multimodal

1 Introduction

Figure 1: Illustration of the Difference between Label Confidence and Label Relevance. This figure provides an example of movie footage consisting of three consecutive keyframes and its scene description. Generally, conventional label confidence tends to place more emphasis on tangible objects, whereas the proposed label relevance better reveals the relations between labels and the real scene to which they correspond. As shown in the top right histogram, label confidence models tend to assign a higher level of confidence to the label ‘Man’ due to its higher frequency of occurrence within the context. In contrast, the label ‘Flirting’ is more closely aligned with the primary theme of the movie scene, resulting in a higher label relevance score.

Multi-label recognition, a fundamental task in computer vision, aims to identify all possible labels contained within a variety of media forms such as images and videos. Visual or multimodal recognition is broadly applied in areas such as scene understanding, intelligent content moderation, recommendation systems, surveillance systems, and autonomous driving. However, due to the complexity of the real world, simple label recognition often proves to be insufficient, as it treats all predicted labels equally without considering the priority of human preferences. A feasible solution could be to rank all the labels according to their relevance to a specific scene. This approach would allow participants to focus on labels with high scene relevance, while reducing the importance of secondary labels, regardless of their potentially high label confidence.

In contrast to predicting label confidence, the task of label relevance constitutes a more challenging problem that cannot be sufficiently tackled via methods such as calibration [30, 31, 21, 40], as the main objective of these methods is to correct label biases or inaccuracies to improve model performance, rather than establishing relevance between the label and the input data. As illustrated in Fig. 1, label confidence typically refers to the estimation from a model about the probability of a label’s occurrence, while label relevance primarily denotes the significance of the label to the primary theme of multimodal inputs. The observation also demonstrates that relevance labels bear a closer alignment with human preferences. Understandably, ranking the labels in order of relevance can be employed to emphasize the important labels.

Recently, Learning to Rank (LTR) methods [14, 13, 35, 9, 54, 38, 7, 1, 59, 44, 8] have been explored to tackle ranking problems. However, the primary focus of ranking techniques is based on retrieved documents or recommendation lists, rather than on the target label set. Therefore, these approaches are not directly and effectively applicable to address the problem of label relevance ranking due to the data and task setting.

In addition to LTR, there is a more theoretical branch of research that deals with the problem of ranking labels, known as label ranking [22, 15, 5, 50, 6, 26, 12, 20, 11, 28, 29, 16, 18, 19]. However, previous works on label ranking are oblivious to the semantic information of the label classes. Therefore, the basis for ranking is the difference between the positive and negative instances within each class label, rather than truly ranking the labels based on the difference in relevance between the label and the input.

Recognizing the significance of label relevance and acknowledging the research gap in this field, our study pioneers an investigation into the relevance between labels and multimodal inputs of video clips. We rank the labels with the relevance scores, thereby facilitating participants to extract primary labels of the clip. To show the difference in relevance between distinct labels and the video clips, we develop a multimodal label relevance dataset LRMovieNet with relevance categories annotation. Expanded from MovieNet [24], LRMovieNet contains various types of multimodal labels and a broad spectrum of label semantic levels, making it more capable of representing situations in the real world.

Intuitively, label relevance ranking can be addressed by performing a simple regression towards the ground truth relevance score. However, this approach has some obvious shortcomings: firstly, the definition of relevance categories does not perfectly conform to human preferences for label relevance. Secondly, the range of relevance scores cannot accurately distinguish the differences in relevance between labels, which limits the accuracy of the label relevance ranking model, especially when transferring to a new scenario. Given the above shortcomings, it is necessary to design a method that directly takes advantage of the differences in relevance between different labels and multimodal inputs to better and more efficiently transfer the label relevance ranking ability from an existing scenario to a new one. For clarity, we term the original scenario as source domain, and the new scenario with new labels or new video clips as target domain.

By introducing a new state definition and policy loss suitable for the label relevance ranking task, our LR²PPO algorithm is able to effectively utilize the partial order relations in the target domain. This makes the ranking model more in line with human preferences, significantly improving the performance of the label relevance ranking algorithm. Specifically, we train a reward model over the partial order annotation to align with human preference in the target domain, and then utilize it to guide the training of the LR²PPO framework. It is sufficient to train the reward model using a few partial order annotations from the target domain, along with partial order pair samples augmented from the source domain. Since partial order annotations can better reflect human preferences for primary labels compared to relevance category definitions, this approach can effectively improve the label relevance ranking performance in the target domain.

The main contributions of our work can be summarized as follows:

1. We recognize the significant role of label relevance, and analyze the limitations of previous ranking methods when dealing with label relevance. To solve this problem, we propose a multimodal label relevance ranking approach to rank the labels according to the relevance between the label and the multimodal input. To the best of our knowledge, this is the first work to explore ranking from the perspective of label relevance.

2. To better generalize the capability to new scenarios, we design a paradigm that transfers label relevance ranking ability from the source domain to the target domain. In addition, we propose LR²PPO (Label Relevance Ranking with Proximal Policy Optimization) to effectively mine the partial order relations among labels.

3. To better evaluate the effectiveness of LR²PPO, we annotate each video clip of the MovieNet dataset [24] with corresponding class labels and their relevance order, and develop a new multimodal label relevance ranking benchmark dataset, LRMovieNet (Label Relevance of MovieNet). Comprehensive experiments on this dataset and traditional LTR datasets demonstrate the effectiveness of our proposed LR²PPO algorithm.

2 Related Works

2.1 Learning to Rank

Learning to rank methods can be categorized into pointwise, pairwise, and listwise approaches. Classic algorithms include Subset Ranking [13], McRank [35], Prank [14] (pointwise), RankNet, LambdaRank, LambdaMart [7] (pairwise), and ListNet [9], ListMLE [54], DLCM [1], SetRank [44], RankFormer [8] (listwise). Generative models like miRNN [59] estimate the entire sequence directly for optimal sequence selection. These methods are primarily used in information retrieval and recommender systems, and differ from label relevance ranking in that they typically rank retrieved documents or recommendation list rather than labels. Meanwhile, label ranking [22, 15, 5, 50, 6, 26, 12, 20, 11, 28, 29, 16, 18, 19] is a rather theoretical research field that investigates the relative order of labels in a closed label set. These methods typically lack perception of the textual semantic information of categories, mainly learning the order relationship based on the difference between positive and negative instances in the training set, rather than truly according to the relevance between the label and the input. In addition, these methods heavily rely on manual annotation, which also limits their application in real-world scenarios. Moreover, these methods have primarily focused on single modality, mainly images, and object labels, suitable for relatively simple scenarios. These methods differ significantly from our proposed LR²PPO, which for the first time explores the ranking of multimodal labels according to the relevance between labels and input. LR²PPO also handles a diverse set of multimodal labels, including not only objects but also events, attributes, and character identities, which are often more challenging and crucial in real-world multimodal video label relevance ranking scenarios.

2.2 Reinforcement Learning

Reinforcement learning is a research field of great significance. Classic reinforcement learning methods, including algorithms like Monte Carlo [4], Q-Learning [53], DQN [42], DPG [51], DDPG [37], TRPO [48], etc., are broadly employed in gaming, robot control, financial trading, etc. Recently, Proximal Policy Optimization (PPO [49]) algorithm proposed by OpenAI enhances the policy update process, achieving significant improvements in many tasks. InstructGPT [43] adopts PPO for human preference feedback learning, significantly improving the performance of language generation. In order to make the ranking model more effectively understand the human preference inherent in the partial order annotation from the target domain, we adapt the Proximal Policy Optimization (PPO) algorithm to the label relevance ranking task. By designing state definitions and policy loss tailored for label relevance ranking, partial order relations are effectively mined in accordance with human preference, improving the performance of label relevance ranking model.

2.3 Vision-Language Pretraining

Works in this area include two-stream models like CLIP [47], ALIGN [27], and single-stream models like ViLBERT [41], UNITER [10], UNIMO [36], SOHO [25], ALBEF [34], VLMO [3], TCL [55], X-VLM [57], BLIP [33], BLIP2 [32], CoCa [56]. These works are mainly applied in tasks like Visual Question Answering (VQA), visual entailment, visual grounding, multimodal retrieval, etc. Our proposed LR²PPO also applies to multimodal inputs, using two-stream Transformers to extract visual and textual features, which are then fused through the cross attention module for subsequent label relevance score prediction.

3 Method

3.1 Preliminary

Definition 1 (Label Confidence)

Given a multi-label classification task with a set of labels $\mathcal{L}=\{l_{1},l_{2},\ldots,l_{n}\}$ , an instance $x$ is associated with a label subset $\mathcal{L}_{x}\subseteq\mathcal{L}$ . The label confidence of a label $l_{i}$ for instance $x$ , denoted as $C(l_{i}|x)$ , is defined as the probability that $l_{i}$ is a correct label for $x$ , i.e.,

$$C(l_{i}|x)=P(l_{i}\in\mathcal{L}_{x}|x). \tag{1}$$

Definition 2 (Label Relevance)

The label relevance of a label $l_{i}$ for instance $x$ , denoted as $R(l_{i}|x)$ , is defined as the degree of association between $l_{i}$ and $x$ , i.e.,

$$R(l_{i}|x)=f(l_{i},x), \tag{2}$$

where $f$ is a function that measures the degree of association between $l_{i}$ and $x$ .

Consider $V$ video clips, where the $j$-th clip consists of frames ${F}^{j}=[F^{j}_{0},F^{j}_{1},...,F^{j}_{N-1}]$, $N$ is the total number of frames extracted from a video clip, and $j$ ranges from 0 to $V-1$. Each video clip is accompanied by text descriptions ${T}^{j}$ and a set of recognized labels denoted as $\mathcal{L}^{j}$, where $\mathcal{L}^{j}=\{l^{j}_{0},l^{j}_{1},...,l^{j}_{i},...,l^{j}_{|\mathcal{L}^{j}|-1}\}$, and $|\mathcal{L}^{j}|$ is the number of labels in the $j$-th video clip. The objective of label relevance ranking is to learn a ranking function $f_{\text{rank}}:{F}^{j},{T}^{j},\mathcal{L}^{j}\rightarrow{U}^{j}$, where ${U}^{j}=[u^{j}_{0},u^{j}_{1},...,u^{j}_{i},...,u^{j}_{|\mathcal{L}^{j}|-1}]$ represents the ranking result of the label set $\mathcal{L}^{j}$.

3.2 Label Relevance Ranking with Proximal Policy Optimization (LR²PPO)

We now present the details of our LR²PPO. As depicted in Fig. 2, LR²PPO primarily consists of three network modules: actor, reward, and critic, and the training can be divided into 3 stages. We first discuss the relations between three training stages, and then we provide a detailed explanation of each stage.

Relations between 3 Training Stages. We use a two-stream Transformer to handle multimodal input and train three models in LR²PPO: actor, reward, and critic. In stage 1, the actor model is supervised on the source domain to obtain a label relevance ranking base model. Stages 2 and 3 generalize the actor model to the target domain. In stage 2, the reward model is trained with a few annotated label pairs in the target domain and augmented pairs in the source domain. In stage 3, reward and critic networks are initialized with the stage 2 reward model. Then, actor and critic are jointly optimized under LR²PPO with label pair data, guided by the stage 2 reward model, to instruct the actor network with partial order relations in the target domain. Finally, the optimized actor network, structurally identical to the stage 1 actor, is used as the final label relevance ranking network, adding no inference overhead.

Stage 1. Label Relevance Ranking Base Model. During Stage 1, the label relevance ranking base model is trained in a supervised manner, i.e., on the source domain using manually annotated relevance categories (high, medium and low). The base network accepts multimodal inputs consisting of multiple video frames $F^{j}$, text descriptions $T^{j}$, and the text label set $\mathcal{L}^{j}$. For a video clip, each text label is concatenated with the text descriptions. ViT [17] and RoBERTa [39] are then used to extract visual and textual features, respectively. Subsequently, the multimodal features are fused through the cross attention module. Finally, the regression head predicts the relevance score of each label, and the Smooth L1 loss against the manually annotated relevance categories is used as the final loss, which can be formulated as:

$$L_{\text{SmoothL1}}(p)=\begin{cases}0.5(p-y)^{2}/\beta & \text{if }|p-y|<\beta\\ |p-y|-0.5\beta & \text{otherwise},\end{cases} \tag{3}$$

where $p$ and $y$ are the predicted and ground truth relevance score of a label in a video clip, respectively. $y=2,1,0$ corresponds to high, medium and low relevance, respectively. $\beta$ is a hyper-parameter.
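As a minimal sketch, Eq. (3) can be written in plain Python for a single label. The function name `smooth_l1`, the `TARGETS` mapping name, and the default `beta=1.0` are illustrative assumptions; the paper does not state the value of $\beta$ here.

```python
def smooth_l1(p: float, y: float, beta: float = 1.0) -> float:
    """Smooth L1 loss of Eq. (3) for one predicted score p and target y.

    beta=1.0 is an assumed default, not a value taken from the paper.
    """
    diff = abs(p - y)
    if diff < beta:
        # Quadratic branch: small errors are penalized smoothly.
        return 0.5 * diff * diff / beta
    # Linear branch: large errors grow linearly.
    return diff - 0.5 * beta


# Relevance categories mapped to regression targets, as described in the text.
TARGETS = {"high": 2.0, "medium": 1.0, "low": 0.0}

print(smooth_l1(1.5, TARGETS["high"]))  # falls in the quadratic branch
print(smooth_l1(0.0, TARGETS["high"]))  # falls in the linear branch
```

In an actual training loop this per-label value would be averaged over all labels of all clips in a batch.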

Stage 2. Reward Model. We train a reward model on the target domain in stage 2. With a few label pair annotations on the target domain, along with augmented pairs sampled from the source domain, the reward model can be trained to assign rewards to the partial order relationships between label pairs of a given clip. This kind of partial order annotation aligns with human preference for label relevance ranking, thus benefiting relevance ranking performance with limited annotation data. For a video clip, the reward model takes as input the concatenation of the initial pair of labels and the same labels in reranked order (either in the ground truth order of relevance or in reverse), and predicts the reward of the reranked pair given the initial pair. The loss function adopted for the training of the reward model can be formulated as:

$$L_{RM}(g_{ini},g_{c})=\max\left(0,\,m_{R}-\left(R([g_{ini},g_{c}])-R([g_{ini},\operatorname{flip}(g_{c})])\right)\right), \tag{4}$$

where $g_{ini}$ and $g_{c}$ represent the initial label pair and the pair in ground truth order respectively, $[\cdot,\cdot]$ means concatenation of two label pairs, $\operatorname{flip}(\cdot)$ means flipping the order of the label pair, and $R(\cdot)$ denotes the reward model. $m_{R}$ is a margin hyperparameter.
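The hinge structure of Eq. (4) can be sketched as follows. Here `reward` stands in for the reward model $R(\cdot)$ and is passed in as an arbitrary callable; the names `flip`, `reward_model_loss` and the default `margin=1.0` are hypothetical choices for illustration, and the real model also consumes the video frames and text, which this sketch omits.

```python
from typing import Callable, List, Tuple

LabelPair = Tuple[str, str]


def flip(pair: LabelPair) -> LabelPair:
    """Reverse the order of a label pair."""
    return (pair[1], pair[0])


def reward_model_loss(
    reward: Callable[[List[str]], float],
    g_ini: LabelPair,
    g_c: LabelPair,
    margin: float = 1.0,  # m_R in Eq. (4); assumed value
) -> float:
    """Margin loss of Eq. (4) for one annotated label pair."""
    # Reward for the initial pair concatenated with the ground truth order.
    r_correct = reward(list(g_ini) + list(g_c))
    # Reward for the initial pair concatenated with the reversed order.
    r_flipped = reward(list(g_ini) + list(flip(g_c)))
    # Zero loss once the correct order beats the flipped one by the margin.
    return max(0.0, margin - (r_correct - r_flipped))
```

The loss is zero only when the correctly ordered pair is rewarded at least `margin` more than the flipped pair, which is what pushes the reward model to separate the two orderings.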

Stage 3. LR²PPO. Stage 3 builds upon the first two training stages to jointly train the LR²PPO framework. To better address the issues in the label relevance ranking, we modify the state definition and policy loss in the original PPO. We redefine the state $s_{t}$ as the order of a group of labels (specifically, a label pair) at timestep $t$ , with the initial state $s_{0}$ being the original label order at input. The policy network (aka. actor model) predicts the relevance score of the labels and ranks them from high to low to obtain a new label order as next state $s_{t+1}$ , which is considered a state transition, or action $a_{t}$ . This process can be regarded as the policy $\pi_{\theta}$ of state transition. The combination of state $s_{t}$ (the initial label pair) and action $a_{t}$ (implicitly representing the reranked pair) is evaluated by the reward model to obtain a reward $r_{t}$ .

We denote the target value function estimate at time step $t$ as $V^{target}_{t}$ , which can be formulated as:

$$V_{t}^{target}=r_{t}+\gamma r_{t+1}+\gamma^{2}r_{t+2}+\ldots+\gamma^{T-t-1}r_{T-1}+\gamma^{T-t}V_{\omega_{old}}(s_{T}), \tag{5}$$

where $V_{\omega_{old}}(s_{T})$ denotes the old value function estimate at state $s_{T}$ . $\omega$ is the trainable parameters of an employed state value network (aka. critic model) $V_{\omega}(\cdot)$ , while $V_{\omega_{old}}(\cdot)$ represents the old state value network. $\gamma$ is the discount factor. $T$ is the terminal time step. We can further obtain the advantage $\hat{A}_{t}$ estimate at time step $t$ via the target value $V_{t}^{target}$ and the critic model’s prediction for the value of the current state $s_{t}$ :

$$\hat{A}_{t}=V_{t}^{target}-V_{\omega_{old}}(s_{t}), \tag{6}$$

where $V_{\omega_{old}}(s_{t})$ denotes the old value function estimate at state $s_{t}$ .
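Eqs. (5) and (6) can be computed with a single backward pass over a trajectory. The sketch below assumes the rewards $r_{0},\ldots,r_{T-1}$ and the old critic's values are already available as Python lists; the function names are illustrative.

```python
from typing import List, Sequence


def target_values(rewards: Sequence[float], v_terminal: float, gamma: float) -> List[float]:
    """Target values of Eq. (5) for t = 0..T-1.

    rewards holds r_0..r_{T-1}; v_terminal is the old critic's value at s_T.
    """
    targets = [0.0] * len(rewards)
    running = v_terminal  # bootstrap from the old value estimate at s_T
    for t in reversed(range(len(rewards))):
        # V_t = r_t + gamma * V_{t+1}, which unrolls to the sum in Eq. (5).
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


def advantages(targets: Sequence[float], values_old: Sequence[float]) -> List[float]:
    """Advantage estimates of Eq. (6): target value minus old critic value."""
    return [v_target - v_old for v_target, v_old in zip(targets, values_old)]


targets = target_values([1.0, 2.0], v_terminal=4.0, gamma=0.5)
print(targets)
print(advantages(targets, [2.5, 4.5]))
```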

In typical reinforcement learning tasks, such as gaming, decision control, language generation, etc., PPO usually takes the maximum component in the vector of predicted action probability distribution to obtain the ratio item of policy loss. (See supplementary for more details.) However, in the label relevance ranking task, it requires a complete probability vector to represent the change in label order, i.e., a state transition. Thus, it is difficult to directly build the ratio item of policy loss from action probability.

To address this problem, we first define the partial order function, i.e.:

$$H_{partial}(p_{t}^{1},p_{t}^{2})=\max(0,m-(p_{t}^{1}-p_{t}^{2})), \tag{7}$$

where $p_{t}^{1}$ and $p_{t}^{2}$ represent the predicted scores of the input label pair by the actor network at time step $t$ , i.e., $\pi_{\theta}(a_{t}|s_{t})$ , and $m$ is a margin hyperparameter, which helps the model better distinguish between correct and incorrect predictions to make the policy loss more precise.

Maximizing the surrogate objective is equivalent to minimizing the policy loss in the context of reinforcement learning. In the original PPO, the ratio $r_{t}(\theta)$ measures the change in policy and serves as a multiplicative factor in the surrogate objective, encouraging the policy to increase the probability of actions that have a positive advantage and to decrease the probability of actions that have a negative advantage. However, in label relevance ranking, determining the action for a single state transition requires a complete probability distribution over label sequences. Consequently, if we employ the ratio calculation approach from the original PPO, the ratio fails to capture the change between the new and old policies, thereby inhibiting effective adjustment of the advantage within the surrogate objective. This issue persists even when adopting the clipped objective function. Please refer to Sec. 5.3 for more experimental analysis. To solve this problem, we propose the partial order ratio $r_{t}^{\prime}(\theta)$ to provide a more suitable adjustment for the advantage in the surrogate objective. It is a function that depends on the sign of $\hat{A}_{t}$, the estimated advantage at time step $t$. In practice, we utilize a small negative threshold $\delta$ instead of zero to stabilize the joint training of the LR²PPO framework. Specifically, $r_{t}^{\prime}(\theta)$ is formulated as:

$$r_{t}^{\prime}(\theta)=\begin{cases}-H_{partial}(p_{t}^{1},p_{t}^{2}) & \hat{A}_{t}\geq\delta\\ -H_{partial}(p_{t}^{2},p_{t}^{1}) & \hat{A}_{t}<\delta.\end{cases} \tag{8}$$

The proposed partial order ratio $r_{t}^{\prime}(\theta)$ encourages the model to correctly rank the labels by penalizing incorrect orderings. In Eq. (8), assuming the advantage $\hat{A}_{t}$ surpasses $\delta$ (i.e., $\hat{A}_{t}\geq\delta$ ), the reward model favors the first label (scored $p_{t}^{1}$ ) over the second (scored $p_{t}^{2}$ ). If $p_{t}^{1}>p_{t}^{2}$ , the absolute value of $r_{t}^{\prime}(\theta)$ falls below $m$ , lessening the penalty in Eq. (9). Conversely, if the advantage is below $\delta$ , the opposite happens. The policy function loss of LR²PPO is formulated as:

$$L_{\text{LR}^{2}\text{PPO}}^{PF}(\theta)=-\mathbb{E}_{t}\left(r^{\prime}_{t}(\theta)\operatorname{abs}(\hat{A}_{t})\right). \tag{9}$$

In our design, the order of label pairs is adjusted based on the relative magnitude of the advantage $\hat{A}_{t}$ and $\delta$ . The absolute value function $abs(\cdot)$ in Eq. (9) ensures that the advantage is always positive, reflecting the fact that moving a more important label to a higher position is beneficial. At the same time, the advantage is taken as an absolute value, indicating that after adjusting the order of the labels, i.e., moving the more important label to the front (according to the relative magnitude of the original advantage $\hat{A}_{t}$ and $\delta$ ), the advantage can maintain a positive value. In this way, policy loss can be more suitable for the label relevance ranking task, ensuring the ranking performance of LR²PPO.
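To make the interplay of Eqs. (7) to (9) concrete, the following sketch evaluates the policy loss on a list of `(p1, p2, advantage)` triples, replacing the expectation with a sample mean. The function names and the values used for `m` and `delta` in the example are assumptions for illustration; the paper's hyperparameter values are not given in this section.

```python
from typing import Sequence, Tuple


def h_partial(p_first: float, p_second: float, m: float) -> float:
    """Partial order function of Eq. (7): hinge on the score gap."""
    return max(0.0, m - (p_first - p_second))


def partial_order_ratio(p1: float, p2: float, advantage: float, m: float, delta: float) -> float:
    """Partial order ratio of Eq. (8)."""
    if advantage >= delta:
        # The first label should be ranked above the second.
        return -h_partial(p1, p2, m)
    # Otherwise the second label should be ranked above the first.
    return -h_partial(p2, p1, m)


def policy_loss(samples: Sequence[Tuple[float, float, float]], m: float, delta: float) -> float:
    """Policy loss of Eq. (9), with the expectation taken as a sample mean."""
    terms = [
        partial_order_ratio(p1, p2, adv, m, delta) * abs(adv)
        for p1, p2, adv in samples
    ]
    return -sum(terms) / len(terms)


# Assumed hyperparameters for the example only.
print(policy_loss([(2.0, 0.5, 1.0), (0.5, 2.0, 1.0)], m=1.0, delta=-0.1))
```

A pair whose scores already respect the preferred order by at least `m` contributes nothing to the loss, while a misordered pair is penalized in proportion to both the score gap and the magnitude of the advantage.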

Meanwhile, as original PPO, the value function loss of LR²PPO is given by:

$$L_{\text{LR}^{2}\text{PPO}}^{VF}(\omega)=L^{VF}(\omega)=\mathbb{E}_{t}\left[\left(V_{\omega}(s_{t})-V^{target}_{t}\right)^{2}\right]. \tag{10}$$

This is the expected value at time step $t$ of the squared difference between the value function estimate $V_{\omega}(s_{t})$ and the target value function estimate $V^{target}_{t}$ , under the policy parameters $\theta$ . This loss function measures the discrepancy between the predicted and actual value functions, driving the model to better estimate the expected return.

As in the original PPO, we utilize an entropy bonus $S(\pi_{\theta})$ to encourage exploration by maximizing the entropy of the policy, and employ a KL penalty $KL_{penalty}(\pi_{\theta_{old}},\pi_{\theta})$ to constrain policy updates and prevent large performance drops during optimization. Please refer to the supplementary material for more details. Finally, the overall loss function of LR²PPO combines the policy function loss, the value function loss, the entropy bonus, and the KL penalty:

$$L_{\text{LR}^{2}\text{PPO}}(\theta,\omega)=L_{\text{LR}^{2}\text{PPO}}^{PF}(\theta)+c_{1}L_{\text{LR}^{2}\text{PPO}}^{VF}(\omega)-c_{2}S(\pi_{\theta})+c_{3}KL_{penalty}(\pi_{\theta_{old}},\pi_{\theta}), \tag{11}$$

where $c_{1}$ , $c_{2}$ and $c_{3}$ are hyper-parameter coefficients.
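A short sketch of how the terms of Eqs. (10) and (11) combine is given below. It assumes the policy loss, entropy and KL penalty have already been computed as scalars; the function names are hypothetical, and the coefficients are left as required arguments because their values are not stated in this section.

```python
from typing import Sequence


def value_loss(values: Sequence[float], targets: Sequence[float]) -> float:
    """Value function loss of Eq. (10) as a mean squared error over time steps."""
    return sum((v - t) ** 2 for v, t in zip(values, targets)) / len(values)


def total_loss(
    policy_loss: float,
    vf_loss: float,
    entropy: float,
    kl_penalty: float,
    c1: float,
    c2: float,
    c3: float,
) -> float:
    """Overall loss of Eq. (11).

    The entropy term enters with a minus sign, so higher entropy lowers the
    loss, while the KL term is added so large policy changes are penalized.
    """
    return policy_loss + c1 * vf_loss - c2 * entropy + c3 * kl_penalty


vf = value_loss([1.0, 3.0], [2.0, 1.0])
print(total_loss(1.0, vf, entropy=2.0, kl_penalty=4.0, c1=0.5, c2=0.25, c3=0.5))
```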

In summary, our LR²PPO algorithm leverages a combination of a label relevance ranking base model, a reward model, and a critic model, trained in a three-stage process. The algorithm is guided by a carefully designed loss function that encourages correct label relevance ranking, accurate value estimation, sufficient exploration, and imposes a constraint on policy updates. The pseudo-code of our LR²PPO is provided in Algorithm 1.

Algorithm 1 Label Relevance Ranking with Proximal Policy Optimization (LR²PPO), Actor-Critic Style

1: Input: policy network $\pi_{\theta_{\text{old}}}$, state value network $V_{\omega_{\text{old}}}$, number of timesteps $T$, number of trajectories in an iteration $N_{\text{Trajs}}$, number of epochs $K$, minibatch size $M$
2: Output: policy network parameter $\theta$, state value network parameter $\omega$
3: Initialization:
4: Initialize $\theta_{\text{old}}$ and $\omega_{\text{old}}$ with base model and reward model, respectively
5: LOOP Process
6: for iteration = 1, 2, … do
7:     for $n_{\text{traj}}=1,2,\ldots,N_{\text{Trajs}}$ do
8:         Run policy $\pi_{\theta_{\text{old}}}$ and state value network $V_{\omega_{\text{old}}}$ in environment for $T$ timesteps
9:         Compute advantage estimates $\hat{A}_{1},\ldots,\hat{A}_{T}$ according to Eq. (6)
10:     end for
11:     Compute joint loss $L_{\text{LR}^{2}\text{PPO}}$ according to Eq. (11)
12:     Optimize surrogate $L_{\text{LR}^{2}\text{PPO}}$ with respect to $\theta$ and $\omega$, with $K$ epochs and minibatch size $M\leq N_{\text{Trajs}}\cdot T$
13:     $\theta_{\text{old}}\leftarrow\theta$, $\omega_{\text{old}}\leftarrow\omega$
14: end for
15: return $\theta$, $\omega$

4 Label Relevance Ranking Dataset

Our objective in this paper is to tackle the challenge of label relevance ranking within multimodal scenarios, with the aim of better identifying salient and core labels in these contexts. However, we find that existing label ranking datasets are often designed with a focus on single-image inputs, lack text modality, and their fixed label systems limit the richness and diversity of labels. Furthermore, datasets related to Learning to Rank are typically tailored for tasks such as document ranking, making them unsuitable for label relevance ranking tasks.

Our proposed method for multimodal label relevance ranking is primarily designed for multimodal scenarios that feature a rich and diverse array of labels, particularly in typical scenarios like video clips. In our search for suitable datasets, we identify the MovieNet dataset as a rich source of multimodal video data. However, the MovieNet dataset only provides image-level object label annotations, while the wealth of information available in movie video clips can be used to extract a broad range of multimodal labels. To address this gap, we have undertaken a process of further label extraction and cleaning from the MovieNet dataset, with the aim of transforming the benchmark for label relevance ranking. This process has allowed us to create a more comprehensive and versatile dataset, better suited to the challenges of multimodal label relevance ranking.

Specifically, we select 3,206 clips from 219 videos in the MovieNet dataset [24]. For each movie clip, we extract frames from the video and input them into the RAM model [58] to obtain image labels. Concurrently, we input the descriptions of each movie clip into the LLaMa2 model [52] and extract the corresponding class labels. These generated image and text labels are then filtered and modified manually, which ensures that accurate and comprehensive annotations are selected for the video clips. We also standardize each clip to 20 labels through truncation or augmentation. As a result, we annotate 101,627 labels for 2,551 clips, with a total of 15,234 distinct label classes. We refer to the new benchmark obtained from our further annotation of the MovieNet dataset as LRMovieNet (Label Relevance of MovieNet).

5 Experiments

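| Group | Method | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 | NDCG@20 |
|---|---|---|---|---|---|---|
| OV-based | CLIP [47] | 0.5523 | 0.5209 | 0.5271 | 0.6009 | 0.7612 |
| OV-based | MKT [23] | 0.3517 | 0.3533 | 0.3765 | 0.4704 | 0.6774 |
| LTR-based | PRM [45] | 0.6320 | 0.6037 | 0.6083 | 0.6650 | 0.8022 |
| LTR-based | DLCM [1] | 0.6153 | 0.5807 | 0.5811 | 0.6310 | 0.7866 |
| LTR-based | ListNet [9] | 0.5947 | 0.5733 | 0.5787 | 0.6438 | 0.7872 |
| LTR-based | GSF [2] | 0.594 | 0.571 | 0.579 | 0.643 | 0.787 |
| LTR-based | SetRank [44] | 0.6337 | 0.6038 | 0.6125 | 0.6658 | 0.8030 |
| LTR-based | RankFormer [8] | 0.6350 | 0.6048 | 0.6108 | 0.6655 | 0.8033 |
| Ours | LR²PPO (S1) | 0.6330 | 0.6018 | 0.6061 | 0.6667 | 0.8021 |
| Ours | LR²PPO | **0.6820** | **0.6714** | **0.6869** | **0.7628** | **0.8475** |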

Table 1: State-of-the-art comparison for Label Relevance Ranking task on the LRMovieNet dataset. Bold indicates the best score.

5.1 Experiments Setup

LRMovieNet Dataset. To assess the effectiveness of our approach, we conduct experiments using the LRMovieNet dataset. After obtaining image and text labels, we split the video dataset into source and target domains based on video label types. As label relevance ranking focuses on multimodal input, we partition the domains from a label perspective. To highlight label differences, we divide the class label set by video genres. Notably, there is a significant disparity between head and long-tail genres. Thus, we use the number of clips per genre to guide the partitioning. Specifically, we rank genres by the number of clips and divide them into sets $S_{P}$ and $S_{Q}$. Set $S_{P}$ includes genres with more clips, while set $S_{Q}$ includes those with fewer clips (i.e., long-tail genres).

We designate labels in set $S_{Q}$ as the target domain and the difference between labels in sets $S_{P}$ and $S_{Q}$ as the source domain, achieving domain partitioning while maintaining label diversity between domains. For source domain labels, we manually assign relevance categories (high, medium, low) based on their relevance to video clip content. For target domain labels, we randomly sample 5%-40% of label pairs and annotate their relative order based on their relevance to the video clip content. To evaluate our label relevance ranking algorithm, we also annotate the test set in the target domain with high, medium, and low relevance categories for the labels. We obtain 2551/2206/1000 video clips for the first stage/second stage/test split. The first stage data contains 10393 distinct labels, while the second stage and validation set contain 4841 different labels.
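The genre-based partition described above can be sketched as follows. This is only an illustration under stated assumptions: the cut point `num_head` between $S_{P}$ and $S_{Q}$, the per-genre label sets in `genre_labels`, and both function names are hypothetical, since the text does not specify how the split threshold was chosen.

```python
from collections import Counter
from typing import Dict, Sequence, Set, Tuple


def split_genres(clip_genres: Sequence[Sequence[str]], num_head: int) -> Tuple[Set[str], Set[str]]:
    """Rank genres by clip count and split them into head (S_P) and tail (S_Q) sets.

    num_head is an assumed cut point, not a value from the paper.
    """
    counts = Counter(genre for genres in clip_genres for genre in genres)
    ranked = [genre for genre, _ in counts.most_common()]
    return set(ranked[:num_head]), set(ranked[num_head:])


def split_labels(
    genre_labels: Dict[str, Set[str]], s_p: Set[str], s_q: Set[str]
) -> Tuple[Set[str], Set[str]]:
    """Target domain: labels of S_Q genres. Source domain: labels of S_P genres minus those."""
    target = set().union(*(genre_labels[g] for g in s_q))
    source = set().union(*(genre_labels[g] for g in s_p)) - target
    return source, target
```

Removing the target labels from the source set is what keeps the two domains disjoint at the label level, so the target domain contains labels the stage 1 model has not been supervised on.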

Method	NDCG@1	NDCG@3	NDCG@5	NDCG@10	NDCG@20
PRM [45]	0.5726	0.5804	0.5973	0.6407	0.7603
DLCM [1]	0.5983	0.6025	0.6125	0.6797	0.7744
ListNet [9]	0.5449	0.5575	0.5699	0.6324	0.7467
GSF [2]	0.6004	0.6265	0.6471	0.7054	0.7892
SetRank [44]	0.5299	0.5380	0.5555	0.6083	0.7365
RankFormer [8]	0.5684	0.5511	0.5643	0.6164	0.7458
LR²PPO	0.6496	0.6830	0.7033	0.7710	0.8240

Table 2: State-of-the-art comparison on traditional datasets for label relevance ranking on the MSLR-WEB10k $\rightarrow$ MQ2008 transferring task.

MSLR-WEB10k $\rightarrow$ MQ2008 To demonstrate the generalizability of our method, we further conduct experiments on traditional LTR datasets. In this transfer learning scenario, we use MSLR-WEB10k as the source domain and MQ2008 as the target domain, based on datasets introduced by Qin and Liu [46].

Evaluation Metrics We use the NDCG (Normalized Discounted Cumulative Gain) metric as the evaluation metric for multimodal label relevance ranking. For each video clip, we compute NDCG@ $k$ for the top $k$ labels.

5.2 State-of-the-art Comparison

LRMovieNet Dataset We compare LR²PPO with previous state-of-the-art LTR methods, reporting NDCG metrics for the LRMovieNet dataset. Results of LTR-based methods are reproduced based on the paper descriptions, since the original models cannot be directly applied to this task. As seen in Tab. 1, LR²PPO outperforms all previous methods at every NDCG@k. Meanwhile, compared to the first-stage model LR²PPO (S1), the final LR²PPO achieves a consistent improvement of over 3% at different NDCG@k. Open-vocabulary (OV) based methods, such as CLIP and MKT, use label confidence to rank labels and exhibit relatively poor performance. Furthermore, these models focus solely on specific objects and thus perform inadequately when dealing with semantic information. LTR-based methods show superiority over OV-based ones. However, they were not originally designed for ranking labels, especially when transferring to a new scenario. In contrast, our LR²PPO model allows the base model to interact with the environment and optimize over unlabeled data, resulting in superior performance compared to the baseline models.

Annotation Proportion	Reward Model Accuracy	NDCG@1	NDCG@3	NDCG@5	NDCG@10	NDCG@20
$0\%$	-	0.6330	0.6018	0.6061	0.6667	0.8021
$5\%$	0.7697	0.6787	0.6581	0.6770	0.7514	0.8416
$10\%$	0.7757	0.6820	0.6714	0.6869	0.7628	0.8475
$20\%$	0.7837	0.6800	0.6784	0.6980	0.7667	0.8506
$40\%$	0.7866	0.6830	0.6682	0.6877	0.7617	0.8467

Table 3: Stage 2 and 3 results with different annotation proportions in target domain.

MSLR-WEB10k $\rightarrow$ MQ2008 Comparisons on LTR datasets are shown in Tab. 2. It can be concluded that our proposed LR²PPO method significantly surpasses previous LTR methods when performing transfer learning on LTR datasets.

5.3 Ablation Studies

Annotation Proportion in Target Domain To explore the influence of the annotation proportion of ordered pairs in the target domain, we adjust the annotation proportion during the training of the reward model in stage 2. Subsequently, we adopt the adjusted reward model when training the entire LR²PPO framework in stage 3. The accuracy of the reward model in the second stage and the NDCG metric in the third stage are reported in Tab. 3. The proportion of 10% achieves relatively high reward model accuracy and ultimate ranking relevance, while maintaining limited annotation.

Partial Order Ratio As shown in Fig. 3(a), partial order ratio shows stability in training and achieves better NDCG, in contrast to the original ratio in PPO, which experiences a training collapse. This result indicates that the original ratio in PPO may not be directly applicable in the setting of label relevance ranking, thereby demonstrating the effectiveness of our proposed partial order ratio.

Hyper-parameters Sensitivity Here, we investigate the impact of the threshold $\delta$ in Eq. (8). As shown in Fig. 3(b), a negative threshold of $-0.1$ achieves better NDCG scores while improving the robustness of the training procedure in stage 3. Refer to the supplementary material for more details.

5.4 Qualitative Assessment

To clearly reveal the effectiveness of our method, we visualize the label relevance ranking prediction results of the LR²PPO algorithm and other state-of-the-art OV-based or LTR-based methods on some samples in the LRMovieNet test set. Fig. 4 shows the comparison between the LR²PPO algorithm and CLIP and PRM. For a set of video frame sequences and a plot text description, as well as a set of labels, we compare the ranking results of different methods based on label relevance for the given label set, and list the top 5 high relevance labels predicted by each method. Compared with CLIP and PRM, our method ranks more high relevance labels at the top and low relevance labels at the bottom. The results show that our method can better rank the labels based on the relevance between the label and the multimodal input, to more accurately obtain high-value labels.

6 Conclusion

In this study, we demonstrate the pivotal role of label relevance in label tasks and propose a novel approach, named LR²PPO, to effectively mine partial order relations and perform label relevance ranking, especially for a new scenario. To evaluate the performance of the method, a new benchmark dataset, named LRMovieNet, is proposed. Experimental results on this dataset and other LTR datasets validate the effectiveness of our proposed method.

References

[1] Ai, Q., Bi, K., Guo, J., Croft, W.B.: Learning a deep listwise context model for ranking refinement. In: The 41st international ACM SIGIR conference on research & development in information retrieval. pp. 135–144 (2018)
[2] Ai, Q., Wang, X., Bruch, S., Golbandi, N., Bendersky, M., Najork, M.: Learning groupwise multivariate scoring functions using deep neural networks. In: Proceedings of the 2019 ACM SIGIR international conference on theory of information retrieval. pp. 85–92 (2019)
[3] Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O.K., Aggarwal, K., Som, S., Piao, S., Wei, F.: Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems 35, 32897–32912 (2022)
[4] Barto, A., Duff, M.: Monte carlo matrix inversion and reinforcement learning. Advances in Neural Information Processing Systems 6 (1993)
[5] Brinker, K., Hüllermeier, E.: Case-based label ranking. In: Machine Learning: ECML 2006: 17th European Conference on Machine Learning Berlin, Germany, September 18-22, 2006 Proceedings 17. pp. 566–573. Springer (2006)
[6] Brinker, K., Hüllermeier, E.: Case-based multilabel ranking. In: Proceedings of the 20th international joint conference on Artifical intelligence. pp. 702–707 (2007)
[7] Burges, C.J.: From ranknet to lambdarank to lambdamart: An overview. Learning 11(23-581), 81 (2010)
[8] Buyl, M., Missault, P., Sondag, P.A.: Rankformer: Listwise learning-to-rank using listwide labels. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 3762–3773. KDD ’23, Association for Computing Machinery (2023). https://doi.org/10.1145/3580305.3599892
[9] Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise approach to listwise approach. In: Proceedings of the 24th international conference on Machine learning. pp. 129–136 (2007)
[10] Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: Uniter: Universal image-text representation learning. In: European conference on computer vision. pp. 104–120. Springer (2020)
[11] Cheng, W., Hühn, J., Hüllermeier, E.: Decision tree and instance-based learning for label ranking. In: Proceedings of the 26th annual international conference on machine learning. pp. 161–168 (2009)
[12] Cheng, W., Hüllermeier, E.: Instance-based label ranking using the mallows model. In: ECCBR workshops. pp. 143–157 (2008)
[13] Cossock, D., Zhang, T.: Subset ranking using regression. In: Learning Theory: 19th Annual Conference on Learning Theory, COLT 2006, Pittsburgh, PA, USA, June 22-25, 2006. Proceedings 19. pp. 605–619. Springer (2006)
[14] Crammer, K., Singer, Y.: Pranking with ranking. Advances in neural information processing systems 14 (2001)
[15] Dekel, O., Singer, Y., Manning, C.D.: Log-linear models for label ranking. Advances in neural information processing systems 16 (2003)
[16] Dery, L., Shmueli, E.: Improving label ranking ensembles using boosting techniques. arXiv preprint arXiv:2001.07744 (2020)
[17] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR 2021: The Ninth International Conference on Learning Representations (2021)
[18] Fotakis, D., Kalavasis, A., Kontonis, V., Tzamos, C.: Linear label ranking with bounded noise. Advances in Neural Information Processing Systems 35, 15642–15656 (2022)
[19] Fotakis, D., Kalavasis, A., Psaroudaki, E.: Label ranking through nonparametric regression. In: International Conference on Machine Learning. pp. 6622–6659. PMLR (2022)
[20] Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., Brinker, K.: Multilabel classification via calibrated label ranking. Machine learning 73, 133–153 (2008)
[21] Garg, S., Wu, Y., Balakrishnan, S., Lipton, Z.: A unified view of label shift estimation. Advances in Neural Information Processing Systems 33, 3290–3300 (2020)
[22] Har-Peled, S., Roth, D., Zimak, D.: Constraint classification for multiclass classification and ranking. Advances in neural information processing systems 15 (2002)
[23] He, S., Guo, T., Dai, T., Qiao, R., Shu, X., Ren, B., Xia, S.T.: Open-vocabulary multi-label classification via multi-modal knowledge transfer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 808–816 (2023)
[24] Huang, Q., Xiong, Y., Rao, A., Wang, J., Lin, D.: Movienet: A holistic dataset for movie understanding. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16. pp. 709–727. Springer (2020)
[25] Huang, Z., Zeng, Z., Huang, Y., Liu, B., Fu, D., Fu, J.: Seeing out of the box: End-to-end pre-training for vision-language representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12976–12985 (2021)
[26] Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning pairwise preferences. Artificial Intelligence 172(16-17), 1897–1916 (2008)
[27] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)
[28] Kanehira, A., Harada, T.: Multi-label ranking from positive and unlabeled data. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5138–5146 (2016)
[29] Korba, A., Garcia, A., d’Alché Buc, F.: A structured prediction approach for label ranking. Advances in neural information processing systems 31 (2018)
[30] Kumar, A., Liang, P.S., Ma, T.: Verified uncertainty calibration. Advances in Neural Information Processing Systems 32 (2019)
[31] Li, C., Pavlu, V., Aslam, J., Wang, B., Qin, K.: Learning to calibrate and rerank multi-label predictions. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part III. pp. 220–236. Springer (2020)
[32] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
[33] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
[34] Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, 9694–9705 (2021)
[35] Li, P., Wu, Q., Burges, C.: Mcrank: Learning to rank using multiple classification and gradient boosting. Advances in neural information processing systems 20 (2007)
[36] Li, W., Gao, C., Niu, G., Xiao, X., Liu, H., Liu, J., Wu, H., Wang, H.: Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409 (2020)
[37] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
[38] Liu, T.Y., et al.: Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3(3), 225–331 (2009)
[39] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
[40] Liu, Z., Shen, Z., Long, Y., Xing, E., Cheng, K.T., Leichner, C.: Data-free neural architecture search via recursive label calibration. In: European Conference on Computer Vision. pp. 391–406. Springer (2022)
[41] Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32 (2019)
[42] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
[43] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
[44] Pang, L., Xu, J., Ai, Q., Lan, Y., Cheng, X., Wen, J.: Setrank: Learning a permutation-invariant ranking model for information retrieval. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. pp. 499–508 (2020)
[45] Pei, C., Zhang, Y., Zhang, Y., Sun, F., Lin, X., Sun, H., Wu, J., Jiang, P., Ge, J., Ou, W., et al.: Personalized re-ranking for recommendation. In: Proceedings of the 13th ACM conference on recommender systems. pp. 3–11 (2019)
[46] Qin, T., Liu, T.: Introducing LETOR 4.0 datasets. CoRR abs/1306.2597 (2013), http://arxiv.org/abs/1306.2597
[47] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML 2021: 38th International Conference on Machine Learning. pp. 8748–8763 (2021)
[48] Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International conference on machine learning. pp. 1889–1897. PMLR (2015)
[49] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
[50] Shalev-Shwartz, S., Singer, Y.: Efficient learning of label ranking by soft projections onto polyhedra (2006)
[51] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: International conference on machine learning. pp. 387–395. PMLR (2014)
[52] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
[53] Watkins, C.J.C.H.: Learning from delayed rewards (1989)
[54] Xia, F., Liu, T.Y., Wang, J., Zhang, W., Li, H.: Listwise approach to learning to rank: theory and algorithm. In: Proceedings of the 25th international conference on Machine learning. pp. 1192–1199 (2008)
[55] Yang, J., Duan, J., Tran, S., Xu, Y., Chanda, S., Chen, L., Zeng, B., Chilimbi, T., Huang, J.: Vision-language pre-training with triple contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15671–15680 (2022)
[56] Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917 (2022)
[57] Zeng, Y., Zhang, X., Li, H.: Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276 (2021)
[58] Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., Liu, S., et al.: Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514 (2023)
[59] Zhuang, T., Ou, W., Wang, Z.: Globally optimized mutual influence aware ranking in e-commerce search. arXiv preprint arXiv:1805.08524 (2018)

Appendix A PPO Algorithm

In this section, we provide a detailed explanation of the original Proximal Policy Optimization (PPO) algorithm, as proposed by Schulman et al. [49]. The PPO algorithm is defined by three key loss functions: the policy function loss, the value function loss, and the entropy bonus. In Sec. 3.2 of the main paper, we present the concept of the target value function estimate $V_{t}^{\text{{target}}}$ and the advantage estimate $\hat{A}_{t}$ at time step $t$ . Building upon this, we proceed to introduce the details of the PPO algorithm.

Formally, $r_{t}(\theta)$ is defined as the ratio of the action probability under the policy $\pi_{\theta}$ to the action probability under the old policy $\pi_{\theta_{\text{old}}}$ :

r_{t}(\theta)=\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\text{old}}}(a_{t}|s_{t})}.

(A.1)

Subsequently, the policy function loss, denoted as $L^{\text{PF}}(\theta)$ , is computed based on the policy parameters $\theta$ :

L^{PF}(\theta)=-\mathbb{E}_{t}\left(r_{t}(\theta)\hat{A}_{t}\right).

(A.2)

To prevent the policy from changing too drastically in a single update, the PPO algorithm typically employs a clipped form of the policy loss:

L^{CPF}(\theta)=-\mathbb{E}_{t}\Bigl[\min\Bigl(r_{t}(\theta)\hat{A}_{t},\text{clip}\bigl(r_{t}(\theta),1-\epsilon,1+\epsilon\bigr)\hat{A}_{t}\Bigr)\Bigr].

(A.3)

The value function loss, denoted as $L^{VF}(\omega)$ , has also been defined in Sec. 3.2 of the main paper.

The entropy bonus, denoted as $S(\pi_{\theta})$ , is defined as:

S(\pi_{\theta})=\mathbb{E}_{t}\left[-\sum_{a_{t}}\pi_{\theta}(a_{t}|s_{t})\log\pi_{\theta}(a_{t}|s_{t})\right].

(A.4)

This is the expectation over time steps $t$ of the negative sum, over all possible actions, of the product of the action probability under the policy $\pi_{\theta}$ and the logarithm of that probability. This term encourages exploration by maximizing the entropy of the policy. Ultimately, the total loss of PPO can be defined as:

L^{\prime}(\theta)=L^{CPF}(\theta)+c_{1}^{\prime}L^{VF}(\omega)-c_{2}^{\prime}S(\pi_{\theta}),

(A.5)

where $c_{1}^{\prime}$ and $c_{2}^{\prime}$ represent adjustable hyperparameters, which can be tuned to optimize the performance of the model.

In many cases, instead of using the clipped policy loss form in Eq. (A.3), the PPO algorithm incorporates a KL penalty term in the overall loss to prevent overly large policy updates that could lead to instability or performance drops in the learning process. The KL penalty, denoted as $KL_{penalty}$ , is formulated as:

KL_{penalty}(\pi_{\theta_{old}},\pi_{\theta})=\mathbb{E}_{t}\left[KL(\pi_{\theta_{old}}(a_{t}|s_{t}),\pi_{\theta}(a_{t}|s_{t}))\right],

(A.6)

where $KL(\cdot,\cdot)$ represents the Kullback-Leibler (KL) divergence. Thereby, the overall loss function, denoted as $L^{\prime\prime}(\theta)$ , is a combination of the four aforementioned loss functions. It is computed as:

L^{\prime\prime}(\theta)=L^{PF}(\theta)+c_{1}^{\prime\prime}L^{VF}(\omega)-c_{2}^{\prime\prime}S(\pi_{\theta})+c_{3}^{\prime\prime}KL_{penalty}(\pi_{\theta_{old}},\pi_{\theta}).

(A.7)

Here, the hyperparameters $c_{1}^{\prime\prime}$ , $c_{2}^{\prime\prime}$ and $c_{3}^{\prime\prime}$ are used to balance the contributions of the four components to the overall loss, and are typically determined through empirical tuning.
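As a minimal sketch of Eq. (A.1), Eq. (A.3), Eq. (A.4) and Eq. (A.5), the following pure-Python functions evaluate the clipped policy loss, the entropy bonus and the total loss on a small batch. The function names are hypothetical, the expectation $\mathbb{E}_{t}$ is approximated by a batch mean, and the default $\epsilon=0.2$ is an illustrative value rather than a setting reported in this paper.

```python
import math
from typing import Sequence


def clipped_policy_loss(
    new_probs: Sequence[float],
    old_probs: Sequence[float],
    advantages: Sequence[float],
    epsilon: float = 0.2,  # illustrative clipping range (assumption)
) -> float:
    """Batch estimate of the clipped policy loss in Eq. (A.3)."""
    terms = []
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old                                   # Eq. (A.1)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


def entropy_bonus(action_dists: Sequence[Sequence[float]]) -> float:
    """Batch estimate of Eq. (A.4): mean entropy of the action distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0) for dist in action_dists
    ]
    return sum(entropies) / len(entropies)


def ppo_total_loss(
    policy_loss: float, value_loss: float, entropy: float, c1: float, c2: float
) -> float:
    """Total loss in Eq. (A.5)."""
    return policy_loss + c1 * value_loss - c2 * entropy


loss = clipped_policy_loss([0.6, 0.2], [0.4, 0.4], [1.0, -1.0])
uniform_entropy = entropy_bonus([[0.5, 0.5]])
```

In the toy batch, the first ratio (1.5) is clipped to 1.2 because its advantage is positive, and the second ratio (0.5) is clipped to 0.8 because its advantage is negative, so neither sample can push the policy further than the clipping range allows.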

Appendix B LR²PPO Algorithm

The joint training of Label Relevance Ranking with Proximal Policy Optimization (LR²PPO) framework is illustrated in Algorithm B.1. Note that the definitions of entropy bonus $S(\pi_{\theta})$ and KL penalty $KL_{penalty}(\pi_{\theta_{old}},\pi_{\theta})$ are the same as the original PPO. (See Eq. (A.4) and Eq. (A.6).)

Algorithm B.1 Label Relevance Ranking with Proximal Policy Optimization (LR²PPO), Full Procedure

1: Input: $T=$ Maximal state transition timestep length, $N_{\text{Trajs}}=$ Number of state transition trajectories collected as training data in a training iteration, $K=$ Number of learning epochs in a training iteration, $N_{\text{Iters}}=$ Number of training iterations, weighting factors $c_{1}$ , $c_{2}$ and $c_{3}$ , $M=$ Minibatch size, $\gamma=$ Discount factor through timesteps, $m=$ Margin in partial order function, $\delta=$ Threshold when calculating partial order ratio
2: Output: $\pi_{\theta}$ , $V_{\omega}$
3: $\pi_{\theta}\leftarrow$ newActorNet(), $V_{\omega}\leftarrow$ newCriticNet(), optimizer $\leftarrow$ newOptimizer $\left(\pi_{\theta},V_{\omega}\right)$
4: env $\leftarrow$ Multimodal input and corresponding labels $\triangleright$ Environment of label relevance ranking task
5: num_batches $=\lfloor\frac{N_{\text{Trajs}}\cdot T}{M}\rfloor$ $\triangleright$ Number of batches in a training iteration
6: for $n_{iter}=1,2,\ldots,N_{\text{Iters}}$ do
7:     train_data $\leftarrow[]$
       // Produce training data.
8:     for $n_{k}=1,2,\ldots,N_{\text{Trajs}}$ do
9:         $s_{t=0}\leftarrow$ env.randomlySampleLabelPair() $\triangleright$ Randomly sample pair of labels as initial state
           // Let actor interact with environment and collect training data:
10:        for $t=1,2,\ldots,T$ do
11:            $a_{t}\leftarrow\pi_{\theta}$ .generate_action $\left(s_{t}\right)$
12:            $\pi_{\theta_{\text{old}}}\left(a_{t}\mid s_{t}\right)\leftarrow\pi_{\theta}$ .distribution.get_probability $\left(a_{t}\right)$ $\triangleright$ Old action logits
13:            $s_{t+1},r_{t}\leftarrow$ env.step $\left(a_{t}\right)$ $\triangleright$ Transition state and get reward
14:            train_data $\leftarrow$ train_data $+\operatorname{tuple}\left(s_{t},a_{t},r_{t},\pi_{\theta_{\text{old}}}\left(a_{t}\mid s_{t}\right)\right)$
15:        end for
           // Calculate and collect target value and advantage at timestep $t$
16:        for $t=1,2,\ldots,T$ do
17:            $V_{\omega_{old}}(s_{T})\leftarrow V_{\omega}$ .get_value( $s_{T}$ )
18:            $V_{\omega_{old}}(s_{t})\leftarrow V_{\omega}$ .get_value( $s_{t}$ )
19:            $V_{t}^{\text{target}}=r_{t}+\gamma r_{t+1}+\gamma^{2}r_{t+2}+\ldots+\gamma^{T-t-1}r_{T-1}+\gamma^{T-t}V_{\omega_{old}}\left(s_{T}\right)$ $\triangleright$ Target value
20:            $\hat{A}_{t}=V_{t}^{\text{target}}-V_{\omega_{old}}\left(s_{t}\right)$ $\triangleright$ Estimated advantage
21:            train_data $[(n_{k}-1)\cdot T+t]\leftarrow$ train_data $[(n_{k}-1)\cdot T+t]+\operatorname{tuple}\left(\hat{A}_{t},V_{t}^{\text{target}}\right)$
22:        end for
23:    end for
24:    optimizer.resetGradients $\left(\pi_{\theta},V_{\omega}\right)$
       // Update trainable parameters $\theta$ and $\omega$ for $K$ epochs:
25:    for epoch $=1,2,\ldots,K$ do
26:        train_data $\leftarrow$ randomizeOrder(train_data)
27:        for batch_index $=1,2,\ldots,$ num_batches do
28:            $\mathrm{E_{batch}}\leftarrow$ getNextBatch(train_data, batch_index, $M$ ) $\triangleright$ Get minibatch for training
29:            for example $\mathrm{e}\in\mathrm{E_{batch}}$ do
30:                $s_{t},a_{t},r_{t},\pi_{\theta_{\text{old}}}\left(a_{t}\mid s_{t}\right),\hat{A}_{t},V_{t}^{\text{target}}\leftarrow\operatorname{unpack}(e)$
31:                $\_\leftarrow\pi_{\theta}$ .generate_action $\left(s_{t}\right)$ $\triangleright$ Parameterization of policy
32:                $\pi_{\theta}\left(a_{t}\mid s_{t}\right)\leftarrow\pi_{\theta}$ .distribution.get_probability $\left(s_{t}\right)$ $\triangleright$ Action logits
33:                $p_{t}^{1},p_{t}^{2}\leftarrow\operatorname{unpack}\left(\pi_{\theta}\left(a_{t}\mid s_{t}\right)\right)$ $\triangleright$ Action logit for each label in the label pair
34:                $r_{t}^{\prime}(\theta)=\begin{cases}-\max\left(0,m-(p_{t}^{1}-p_{t}^{2})\right),&\hat{A}_{t}\geq\delta\\ -\max\left(0,m-(p_{t}^{2}-p_{t}^{1})\right),&\hat{A}_{t}<\delta\end{cases}$ $\triangleright$ Partial order ratio
35:            end for
36:            $L_{\text{LR}^{2}\text{PPO}}^{PF}(\theta)=-\frac{1}{|M|}\sum_{t\in\{1,2,\ldots,|M|\}}r^{\prime}_{t}(\theta)\operatorname{abs}(\hat{A}_{t})$ $\triangleright$ Policy loss
37:            $L_{\text{LR}^{2}\text{PPO}}^{VF}(\omega)=\frac{1}{|M|}\sum_{t\in\{1,2,\ldots,|M|\}}\left(V_{\omega}(s_{t})-V^{\text{target}}_{t}\right)^{2}$ $\triangleright$ Value loss
38:            $S(\pi_{\theta})=-\frac{1}{|M|}\sum_{t\in\{1,2,\ldots,|M|\}}\sum_{a_{t}}\pi_{\theta}(a_{t}|s_{t})\log\pi_{\theta}(a_{t}|s_{t})$ $\triangleright$ Entropy bonus
39:            $KL_{penalty}(\pi_{\theta_{old}},\pi_{\theta})=\frac{1}{|M|}\sum_{t\in\{1,2,\ldots,|M|\}}KL(\pi_{\theta_{old}}(a_{t}|s_{t}),\pi_{\theta}(a_{t}|s_{t}))$ $\triangleright$ KL penalty
40:            $L_{\text{LR}^{2}\text{PPO}}(\theta,\omega)=L_{\text{LR}^{2}\text{PPO}}^{PF}(\theta)+c_{1}L_{\text{LR}^{2}\text{PPO}}^{VF}(\omega)-c_{2}S(\pi_{\theta})+c_{3}KL_{penalty}(\pi_{\theta_{old}},\pi_{\theta})$ $\triangleright$ Total loss
41:            optimizer.backpropagate $\left(\pi_{\theta},V_{\omega},L_{\text{LR}^{2}\text{PPO}}(\theta,\omega)\right)$
42:            optimizer.updateTrainableParameters $\left(\pi_{\theta},V_{\omega}\right)$
43:        end for
44:    end for
45: end for
46: return: $\pi_{\theta}$ , $V_{\omega}$
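The partial order ratio and the policy loss of Algorithm B.1 can be illustrated with a short Python sketch. It is not the released implementation: the function names are hypothetical, the two action logits of a label pair are passed as plain floats, and the defaults $m=1$ and $\delta=-0.1$ follow the values reported in Appendix C.3.

```python
from typing import Sequence, Tuple


def partial_order_ratio(
    p1: float, p2: float, advantage: float, margin: float = 1.0, delta: float = -0.1
) -> float:
    """Partial order ratio r'_t for one label pair with action logits (p1, p2)."""
    if advantage >= delta:
        # Encourage the first label to be ranked above the second by the margin.
        return -max(0.0, margin - (p1 - p2))
    # Otherwise encourage the reverse order.
    return -max(0.0, margin - (p2 - p1))


def lr2ppo_policy_loss(
    logit_pairs: Sequence[Tuple[float, float]],
    advantages: Sequence[float],
    margin: float = 1.0,
    delta: float = -0.1,
) -> float:
    """Minibatch policy loss: -mean(r'_t * |A_t|)."""
    terms = [
        partial_order_ratio(p1, p2, adv, margin, delta) * abs(adv)
        for (p1, p2), adv in zip(logit_pairs, advantages)
    ]
    return -sum(terms) / len(terms)


# Toy minibatch: the first pair already satisfies the margin, the second does not.
pairs = [(2.0, 0.5), (0.2, 0.4)]
advs = [0.5, -0.3]
policy_loss = lr2ppo_policy_loss(pairs, advs)
```

In this toy batch only the second pair contributes to the loss: its advantage falls below $\delta$, so the hinge asks for $p_{t}^{2}-p_{t}^{1}\geq m$, and the violation is weighted by the magnitude of the advantage.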

Appendix C Experiments

C.1 More Details about LRMovieNet Dataset

Prompts to Extract Text Labels When generating textual labels from the scene description of each movie clip in the MovieNet [24] dataset, we feed the following prompts into LLaMa2 [52] to obtain event labels and entity labels, respectively. (1) Event labels: “Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Use descriptive language tags consisting of less than three words each to capture events without entities in the following sentence: {sentence} Please seperate the labels by numbers and DO NOT return sentence. ### Response:” (2) Entity labels: “Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Use descriptive language tags consisting of less than three words each to capture entities in the following sentence: {sentence} Please seperate the labels by numbers and DO NOT return sentence. ### Response: ”. The produced text labels and the image labels (generated by the RAM model [58]) are subsequently screened and manually adjusted.

Statistics of LRMovieNet Dataset To better understand the overall distribution of LRMovieNet, we provide histograms of its statistics in several dimensions, which are displayed in Fig. C.1. Specifically, Fig. C.1 (a) shows the total number of video clips in all videos of different genres, Fig. C.1 (b) displays the number of label classes in different genres, and Fig. C.1 (c) provides the number of labels with different frequencies in the entire LRMovieNet dataset. The details of Fig. C.1 (a) and Fig. C.1 (b) are listed in Tab. C.1(a) and Tab. C.1(b), respectively.

Genre Index	Genre Name	Clips Count
$0$	Drama	1765
$1$	Action	1224
$2$	Thriller	1154
$3$	Sci-Fi	869
$4$	Crime	829
$5$	Adventure	814
$6$	Comedy	562
$7$	Mystery	525
$8$	Fantasy	520
$9$	Romance	427
$10$	Biography	232
$11$	War	171
$12$	Horror	147
$13$	Family	140
$14$	History	132
$15$	Music	82
$16$	Western	66
$17$	Sport	51
$18$	Musical	21

(a) Details about clips count in different genres

Genre Index	Genre Name	Classes Count
$0$	Drama	10088
$1$	Action	7923
$2$	Thriller	7573
$3$	Sci-Fi	6315
$4$	Adventure	5999
$5$	Crime	5542
$6$	Comedy	4933
$7$	Fantasy	4468
$8$	Mystery	4277
$9$	Romance	3688
$10$	Biography	2443
$11$	War	1879
$12$	Horror	1658
$13$	History	1652
$14$	Family	1543
$15$	Music	1191
$16$	Western	872
$17$	Sport	757
$18$	Musical	446

(b) Details about label classes count in different genres

Table C.1: Details about number of video clips and label classes in all videos of different genres in LRMovieNet.

Samples of LRMovieNet Dataset To illustrate the variation of types (e.g., objects, attributes, scenes, character identities) and semantic levels (e.g., general, specific, abstract) in LRMovieNet, we provide some training samples in Fig. C.2. These samples also serve to clarify the process of label relevance categories annotation and the annotated partial order label pairs. The annotated label relevance categories in the source domain are used to train the relevance ranking base model in the first stage. Subsequently, the partial order label pairs in the target domain (along with label pairs augmented from the source domain) are employed to train the reward model in the second stage. This reward model then guides the joint training of LR²PPO in the third stage.

C.2 Metrics

In this paper, we evaluate the performance of our label relevance ranking algorithm using the Normalized Discounted Cumulative Gain (NDCG) metric, which is widely adopted in information retrieval and recommender systems.

To better comprehend the NDCG metric, let’s first introduce the concept of relevance scores. These scores, denoted as $rel_{i}$ , are assigned to the item (label) at position $i$ . Based on their relevance levels, the manually annotated labels in the test set are assigned these scores: labels with high, medium, and low relevance are given scores of 2, 1, and 0, respectively. Next, we define the Discounted Cumulative Gain (DCG) at position $k$ as follows:

\text{DCG}@k=\sum_{i=1}^{k}\frac{2^{rel_{i}}-1}{\log_{2}{(i+1)}},

(C.1)

where $rel_{i}$ is the relevance score of the item at position $i$ . The DCG metric measures the gain of a ranking algorithm by considering both the relevance of the items and their positions in the ranking.

To obtain the Ideal Discounted Cumulative Gain (IDCG) at position $k$ , we first rank the items by their relevance scores in descending order and then calculate the DCG for this ideal ranking:

\text{IDCG}@k=\sum_{i=1}^{k}\frac{2^{rel^{*}_{i}}-1}{\log_{2}{(i+1)}},

(C.2)

where $rel_{i}^{*}$ is the relevance score of the item (label) at position $i$ in the ideal ranking. The IDCG represents the maximum possible DCG for a given query.

Finally, we compute the Normalized Discounted Cumulative Gain (NDCG) at position $k$ by dividing the DCG by the IDCG to ensure that it lies between 0 and 1:

\text{NDCG}@k=\frac{\text{DCG}@k}{\text{IDCG}@k}.

(C.3)

A value of 1 indicates a perfect ranking, while a value of 0 indicates the worst possible ranking. The normalization also allows for the comparison of NDCG values across different video clips, as it accounts for the varying number of labels.
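Eq. (C.1) to Eq. (C.3) translate directly into code. The following sketch assumes that the relevance scores of the labels are given in predicted ranking order; the function names are hypothetical, and returning 0 when the IDCG is zero is a convention chosen here for the degenerate case, not something specified in the paper.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[int], k: int) -> float:
    """DCG@k of Eq. (C.1); relevances are listed in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[int], k: int) -> float:
    """NDCG@k of Eq. (C.3), using the ideal ordering of Eq. (C.2)."""
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    if idcg == 0.0:
        return 0.0  # assumed convention when no label is relevant
    return dcg_at_k(relevances, k) / idcg


# Scores: high = 2, medium = 1, low = 0.
perfect = ndcg_at_k([2, 1, 0], k=3)    # labels already in ideal order
reversed_ = ndcg_at_k([0, 1, 2], k=3)  # most relevant label ranked last
```

For the reversed toy ranking, the DCG is $1/\log_{2}3+3/\log_{2}4$ and the IDCG is $3+1/\log_{2}3$, so the score lies strictly between 0 and 1.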

C.3 Implementation Details

We leverage published Vision Transformer [17] and RoBERTa [39] weights to initialize the parameters of the vision encoder and the language encoder, respectively. The parameters of the two encoders are fixed during all three training stages. The coefficients $c_{1}$ and $c_{2}$ in Eq. (11) are set to $1$ and $1\times 10^{-3}$ , respectively, and the coefficient $c_{3}$ of the KL penalty in Eq. (11) is set to $1\times 10^{-3}$ . In our implementation, the KL penalty is subtracted from the reward instead of being directly included in the joint loss. In this way, the policy is still updated based on the expected return, but the return is adjusted based on the policy change. This allows the algorithm to explore more freely while still being penalized for large policy changes, leading to more effective exploration and potentially higher overall returns. $\gamma$ in the definition of $V_{t}^{target}$ is set to $0$ , and $\beta$ in Eq. (3) is set to $0.3$ . Margin $m_{R}$ in Eq. (4) and margin $m$ in Eq. (7) are both set to $1$ , and $\delta$ in Eq. (8) is set to $-0.1$ . The AdamW optimizer is adopted for all optimized networks in the three training stages, with a learning rate of $2\times 10^{-5}$ in the first and second training stages and $1\times 10^{-3}$ in the third training stage. We train each of the first two stages for 15 epochs. In Algorithm B.1, $T$ is set to $1$ , $N_{\text{Trajs}}$ to $200$ , $K$ to $1$ , $N_{\text{Iters}}$ to $412$ , and $M$ to $24$ . During the training of stage 2, only 10% of the ordered annotation pairs are used. In stage 3, 40% of randomly sampled pairs without annotation are utilized to train the LR²PPO framework, with the guidance of the reward model trained in stage 2. We use four V100 GPUs, each with 32GB of memory, to train the models. In the comparison experiments, we utilize “There is a {label} in the scene” as the textual prompt for both CLIP [47] and MKT [23].
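To make the stage 3 schedule concrete, the short sketch below plugs the reported values into the batch-count formula of Algorithm B.1 and shows one plausible form of the KL-adjusted reward. The helper names are hypothetical, and scaling the KL term by $c_{3}$ before subtracting it from the reward is an assumption about the implementation.

```python
# Stage 3 settings reported above for Algorithm B.1.
T = 1          # timesteps per trajectory
N_TRAJS = 200  # trajectories collected per training iteration
K = 1          # learning epochs per training iteration
N_ITERS = 412  # training iterations
M = 24         # minibatch size
C3 = 1e-3      # KL penalty coefficient


def num_batches(n_trajs: int, timesteps: int, minibatch: int) -> int:
    """Floor of (N_Trajs * T) / M, as in Algorithm B.1."""
    return (n_trajs * timesteps) // minibatch


def kl_shaped_reward(reward: float, kl: float, c3: float = C3) -> float:
    """Reward with the KL penalty subtracted (assumed to be scaled by c3)."""
    return reward - c3 * kl


batches_per_epoch = num_batches(N_TRAJS, T, M)
total_updates = N_ITERS * K * batches_per_epoch
print(batches_per_epoch, total_updates)  # prints "8 3296"
```

With these settings each iteration performs 8 parameter updates on 192 of the 200 collected samples, since the floor in the batch count leaves the remaining 8 samples unused in that iteration.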

C.4 More Ablation Studies about LR²PPO

Classification and regression. The relevance between each label and the multimodal input is annotated as high, medium, or low (with scores 2, 1, and 0, respectively). We adopt regression instead of classification for better performance and for convenience in the later stages. The results of using classification in stage 1 (S1-CLS) are listed in Tab. C.2. We attribute the performance gap between classification (S1-CLS) and regression (S1) to the task's need for a full ranking: the predicted classification logits must be weighted to form the final relevance score, which hinders performance. Regression scores are therefore better suited to ranking than classification logits.
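To illustrate how classification logits can be turned into a single relevance score, the sketch below uses a probability-weighted sum of the class scores 0, 1, and 2. This particular weighting is an assumption made for illustration, and `expected_relevance` is a hypothetical name.

```python
import math

# Scores for the low, medium, and high relevance classes, as annotated above.
RELEVANCE_SCORES = (0.0, 1.0, 2.0)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_relevance(logits, scores=RELEVANCE_SCORES):
    """Collapse three class logits into one score by probability weighting (assumed scheme)."""
    probs = softmax(logits)
    return sum(p * s for p, s in zip(probs, scores))
```

A regression head outputs such a scalar directly, so labels can be sorted without this extra weighting step.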

Results w/o stage 1. The results of omitting stage 1 (w/o S1) are listed in Tab. C.2. LR²PPO performs better because it starts from the model pretrained on the source domain, whose parameters are better aligned with the ranking task, whereas LR²PPO (w/o S1) starts from the officially pretrained ViT [17] and RoBERTa [39].

| Method | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 | NDCG@20 |
| --- | --- | --- | --- | --- | --- |
| LR²PPO (S1-CLS) | 0.6077 | 0.5988 | 0.5998 | 0.6601 | 0.7981 |
| LR²PPO (S1) | 0.6330 | 0.6018 | 0.6061 | 0.6667 | 0.8021 |
| LR²PPO (w/o S1) | 0.6750 | 0.6583 | 0.6781 | 0.7529 | 0.8432 |
| LR²PPO | 0.6820 | 0.6714 | 0.6869 | 0.7628 | 0.8475 |

Table C.2: More ablation studies on LRMovieNet.

C.5 More Hyper-parameters Sensitivity Analysis

Here, we conduct experimental analysis on the hyperparameters of the LR²PPO algorithm, namely margin $m_{R}$ in reward model training, margin $m$ and hyperparameter $c_{1}$ in joint training of LR²PPO. (See Eq. (4), Eq. (7), and Eq. (11) in the main paper.) The results are shown in Fig. C.3. It can be observed that the final performance of LR²PPO is not sensitive to the variations of these hyperparameters, indicating the stability of our method.
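For intuition about what the margins control, the sketch below shows a standard pairwise hinge loss. It is an assumed generic form, not a transcription of Eq. (4) or Eq. (7), which are defined in the main paper; `pairwise_margin_loss` is a hypothetical name.

```python
def pairwise_margin_loss(score_preferred, score_other, margin=1.0):
    """Hinge loss on one partial order pair (assumed generic form).

    The loss is zero once the preferred label outscores the other label
    by at least `margin`; margin = 1 matches the setting in Sec. C.3.
    """
    return max(0.0, margin - (score_preferred - score_other))
```

In this form, a larger margin asks for a wider score gap between the two labels of a pair before that pair stops contributing to the loss.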

C.6 Training Curves

To illustrate how training of the LR²PPO algorithm progresses, we plot the training curves of various variables against iterations in Fig. C.4. Fig. C.4(a) to Fig. C.4(m) show the training curves for the third stage, while Fig. C.4(n) and Fig. C.4(o) show the training curves for the second and first stages, respectively. In the third training stage, the reward and value curves, along with the NDCG curves, all exhibit an upward trend, while the policy loss and value loss show a downward trend. These observations are consistent with effective learning in LR²PPO.