1 Tencent Youtu Lab. Email: {taianguo,linuswu,hanjunli,ruizhiqiao,winfredsun}@tencent.com
2 Tsinghua Shenzhen International Graduate School, Tsinghua University. Email: [email protected]
* Equal contribution. † Work done during internship at Tencent. ‡ Corresponding author: Ruizhi Qiao.

Multimodal Label Relevance Ranking via Reinforcement Learning

Taian Guo* (1, ORCID 0000-0003-0787-511X), Taolin Zhang* (2, ORCID 0009-0006-2441-2861), Haoqian Wu (1, ORCID 0000-0003-1035-1499), Hanjun Li (1, ORCID 0009-0006-4211-7479), Ruizhi Qiao (1, ORCID 0000-0002-3663-0149), Xing Sun (1, ORCID 0000-0001-8132-9083)
Abstract

Conventional multi-label recognition methods often focus on label confidence and frequently overlook the pivotal role of partial order relations consistent with human preference. To address this issue, we introduce a novel method for multimodal label relevance ranking, named Label Relevance Ranking with Proximal Policy Optimization (LR2PPO), which effectively discerns partial order relations among labels. LR2PPO first utilizes partial order pairs in the target domain to train a reward model, which aims to capture the human preference intrinsic to the specific scenario. Furthermore, we design a state representation and a policy loss tailored for ranking tasks, enabling LR2PPO to boost the performance of the label relevance ranking model and largely reduce the amount of partial order annotation required for transferring to new scenes. To assist in the evaluation of our approach and similar methods, we further propose a novel benchmark dataset, LRMovieNet, featuring multimodal labels and their corresponding partial order data. Extensive experiments demonstrate that our LR2PPO algorithm achieves state-of-the-art performance, proving its effectiveness in addressing the multimodal label relevance ranking problem. Code and the proposed LRMovieNet dataset are publicly available at https://github.com/ChazzyGordon/LR2PPO.

Keywords: Label Relevance Ranking · Reinforcement Learning · Multimodal

1 Introduction

Figure 1: Illustration of the Difference between Label Confidence and Label Relevance. This figure provides an example of a movie footage consisting of three consecutive keyframes and its scene description. Generally, conventional label confidence tends to place more emphasis on the tangible objects, whereas the proposed label relevance better reveals the relations between labels and the real scene which they correspond to. As shown in the top right histogram, label confidence models tend to assign a higher level of confidence to the label ‘Man’ due to its higher frequency of occurrence within the context. In contrast, the label ‘Flirting’ is more closely aligned with the primary theme of the movie scene, resulting in a higher label relevance score.

Multi-label recognition, a fundamental task in computer vision, aims to identify all possible labels contained within a variety of media forms such as images and videos. Visual or multimodal recognition is broadly applied in areas such as scene understanding, intelligent content moderation, recommendation systems, surveillance systems, and autonomous driving. However, due to the complexity of the real world, simple label recognition often proves to be insufficient, as it treats all predicted labels equally without considering the priority of human preferences. A feasible solution could be to rank all the labels according to their relevance to a specific scene. This approach would allow participants to focus on labels with high scene relevance, while reducing the importance of secondary labels, regardless of their potentially high label confidence.

In contrast to predicting label confidence, the task of label relevance constitutes a more challenging problem that cannot be sufficiently tackled via methods such as calibration [30, 31, 21, 40], as the main objective of these methods is to correct label biases or inaccuracies to improve model performance, rather than establishing relevance between the label and the input data. As illustrated in Fig. 1, label confidence typically refers to the estimation from a model about the probability of a label’s occurrence, while label relevance primarily denotes the significance of the label to the primary theme of multimodal inputs. The observation also demonstrates that relevance labels bear a closer alignment with human preferences. Understandably, ranking the labels in order of relevance can be employed to emphasize the important labels.

Recently, Learning to Rank (LTR) methods [14, 13, 35, 9, 54, 38, 7, 1, 59, 44, 8] have been explored to tackle ranking problems. However, these techniques primarily rank retrieved documents or recommendation lists rather than a target label set. Because of this difference in data and task setting, they cannot be applied directly and effectively to the problem of label relevance ranking.

In addition to LTR, there is a more theoretical branch of research that deals with the problem of ranking labels, known as label ranking [22, 15, 5, 50, 6, 26, 12, 20, 11, 28, 29, 16, 18, 19]. However, previous works on label ranking are oblivious to the semantic information of the label classes. Therefore, the basis for ranking is the difference between the positive and negative instances within each class label, rather than truly ranking the labels based on the difference in relevance between the label and the input.

Recognizing the significance of label relevance and acknowledging the research gap in this field, our study pioneers an investigation into the relevance between labels and the multimodal inputs of video clips. We rank the labels by their relevance scores, which helps users extract the primary labels of a clip. To capture the difference in relevance between distinct labels and the video clips, we develop a multimodal label relevance dataset, LRMovieNet, annotated with relevance categories. Expanded from MovieNet [24], LRMovieNet contains various types of multimodal labels and a broad spectrum of label semantic levels, making it more capable of representing situations in the real world.

Intuitively, label relevance ranking can be addressed by performing a simple regression towards the ground truth relevance score. However, this approach has some obvious shortcomings: firstly, the definition of relevance categories does not perfectly conform to human preferences for label relevance. Secondly, the range of relevance scores cannot accurately distinguish the differences in relevance between labels, which limits the accuracy of the label relevance ranking model, especially when transferring to a new scenario. Given the above shortcomings, it is necessary to design a method that directly takes advantage of the differences in relevance between different labels and multimodal inputs to better and more efficiently transfer the label relevance ranking ability from an existing scenario to a new one. For clarity, we term the original scenario as source domain, and the new scenario with new labels or new video clips as target domain.

By introducing a new state definition and policy loss suitable for the label relevance ranking task, our LR2PPO algorithm is able to effectively utilize the partial order relations in the target domain. This makes the ranking model more in line with human preferences, significantly improving the performance of the label relevance ranking algorithm. Specifically, we train a reward model over the partial order annotation to align with human preference in the target domain, and then utilize it to guide the training of the LR2PPO framework. It is sufficient to train the reward model using a few partial order annotations from the target domain, along with partial order pair samples augmented from the source domain. Since partial order annotations can better reflect human preferences for primary labels compared to relevance category definitions, this approach can effectively improve the label relevance ranking performance in the target domain.

The main contributions of our work can be summarized as follows:

  1. We recognize the significant role of label relevance and analyze the limitations of previous ranking methods when dealing with it. To solve this problem, we propose a multimodal label relevance ranking approach that ranks labels according to the relevance between each label and the multimodal input. To the best of our knowledge, this is the first work to explore ranking from the perspective of label relevance.

  2. To better generalize this capability to new scenarios, we design a paradigm that transfers label relevance ranking ability from the source domain to the target domain. In addition, we propose LR2PPO (Label Relevance Ranking with Proximal Policy Optimization) to effectively mine the partial order relations among labels.

  3. To better evaluate the effectiveness of LR2PPO, we annotate each video clip of the MovieNet dataset [24] with its corresponding class labels and their relevance order, and develop a new multimodal label relevance ranking benchmark dataset, LRMovieNet (Label Relevance of MovieNet). Comprehensive experiments on this dataset and on traditional LTR datasets demonstrate the effectiveness of our proposed LR2PPO algorithm.

2 Related Works

2.1 Learning to Rank

Learning to rank methods can be categorized into pointwise, pairwise, and listwise approaches. Classic algorithms include Subset Ranking [13], McRank [35], Prank [14] (pointwise), RankNet, LambdaRank, LambdaMart [7] (pairwise), and ListNet [9], ListMLE [54], DLCM [1], SetRank [44], RankFormer [8] (listwise). Generative models like miRNN [59] estimate the entire sequence directly for optimal sequence selection. These methods are primarily used in information retrieval and recommender systems, and differ from label relevance ranking in that they typically rank retrieved documents or recommendation list rather than labels. Meanwhile, label ranking [22, 15, 5, 50, 6, 26, 12, 20, 11, 28, 29, 16, 18, 19] is a rather theoretical research field that investigates the relative order of labels in a closed label set. These methods typically lack perception of the textual semantic information of categories, mainly learning the order relationship based on the difference between positive and negative instances in the training set, rather than truly according to the relevance between the label and the input. In addition, these methods heavily rely on manual annotation, which also limits their application in real-world scenarios. Moreover, these methods have primarily focused on single modality, mainly images, and object labels, suitable for relatively simple scenarios. These methods differ significantly from our proposed LR2PPO, which for the first time explores the ranking of multimodal labels according to the relevance between labels and input. LR2PPO also handles a diverse set of multimodal labels, including not only objects but also events, attributes, and character identities, which are often more challenging and crucial in real-world multimodal video label relevance ranking scenarios.

2.2 Reinforcement Learning

Reinforcement learning is a research field of great significance. Classic reinforcement learning methods, including algorithms like Monte Carlo [4], Q-Learning [53], DQN [42], DPG [51], DDPG [37], TRPO [48], etc., are broadly employed in gaming, robot control, financial trading, etc. Recently, Proximal Policy Optimization (PPO [49]) algorithm proposed by OpenAI enhances the policy update process, achieving significant improvements in many tasks. InstructGPT [43] adopts PPO for human preference feedback learning, significantly improving the performance of language generation. In order to make the ranking model more effectively understand the human preference inherent in the partial order annotation from the target domain, we adapt the Proximal Policy Optimization (PPO) algorithm to the label relevance ranking task. By designing state definitions and policy loss tailored for label relevance ranking, partial order relations are effectively mined in accordance with human preference, improving the performance of label relevance ranking model.

2.3 Vision-Language Pretraining

Works in this area include two-stream models like CLIP [47], ALIGN [27], and single-stream models like ViLBERT [41], UNITER [10], UNIMO [36], SOHO [25], ALBEF [34], VLMO [3], TCL [55], X-VLM [57], BLIP [33], BLIP2 [32], CoCa [56]. These works are mainly applied in tasks like Visual Question Answering (VQA), visual entailment, visual grounding, multimodal retrieval, etc. Our proposed LR2PPO also applies to multimodal inputs, using two-stream Transformers to extract visual and textual features, which are then fused through the cross attention module for subsequent label relevance score prediction.

3 Method

3.1 Preliminary

Definition 1 (Label Confidence)

Given a multi-label classification task with a set of labels $\mathcal{L}=\{l_{1},l_{2},\ldots,l_{n}\}$, an instance $x$ is associated with a label subset $\mathcal{L}_{x}\subseteq\mathcal{L}$. The label confidence of a label $l_{i}$ for instance $x$, denoted as $C(l_{i}|x)$, is defined as the probability that $l_{i}$ is a correct label for $x$, i.e.,

$$C(l_{i}|x)=P(l_{i}\in\mathcal{L}_{x}|x). \tag{1}$$

Definition 2 (Label Relevance)

The label relevance of a label $l_{i}$ for instance $x$, denoted as $R(l_{i}|x)$, is defined as the degree of association between $l_{i}$ and $x$, i.e.,

$$R(l_{i}|x)=f(l_{i},x), \tag{2}$$

where $f$ is a function that measures the degree of association between $l_{i}$ and $x$.

Consider $V$ video clips, where the $j$-th clip consists of frames $F^{j}=[F^{j}_{0},F^{j}_{1},\ldots,F^{j}_{N-1}]$, with $N$ representing the total number of frames extracted from a video clip, and $j$ ranging from $0$ to $V-1$. Each video clip is accompanied by text descriptions $T^{j}$ and a set of recognized labels denoted as $\mathcal{L}^{j}$, where $\mathcal{L}^{j}=\{l^{j}_{0},l^{j}_{1},\ldots,l^{j}_{i},\ldots,l^{j}_{|\mathcal{L}^{j}|-1}\}$, and $|\mathcal{L}^{j}|$ is the number of labels in the $j$-th video clip.
The objective of label relevance ranking is to learn a ranking function $f_{\text{rank}}:F^{j},T^{j},\mathcal{L}^{j}\rightarrow U^{j}$, where $U^{j}=[u^{j}_{0},u^{j}_{1},\ldots,u^{j}_{i},\ldots,u^{j}_{|\mathcal{L}^{j}|-1}]$ represents the ranking result of the label set $\mathcal{L}^{j}$.
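
As a rough illustration of what the ranking function produces, the following sketch orders the labels of one clip by a per-label relevance score. The function name `rank_labels`, the tie-breaking rule, and the score values are assumptions made for this example only; in the paper the scores come from a multimodal network that also consumes the frames and text descriptions.

```python
from typing import Dict, List


def rank_labels(scores: Dict[str, float]) -> List[str]:
    """Sort labels from most to least relevant.

    Ties are broken alphabetically, which is an arbitrary choice for this sketch.
    """
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical relevance scores for the labels of a single clip.
clip_scores = {"Man": 0.4, "Flirting": 1.7, "Restaurant": 1.1}
ranking = rank_labels(clip_scores)
```

With these illustrative scores, the thematic label is placed ahead of the frequently occurring object label, which mirrors the contrast drawn in Fig. 1.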

3.2 Label Relevance Ranking with Proximal Policy Optimization (LR2PPO)

Figure 2: Illustration of the training paradigm of LR2PPO. Each stage takes multimodal data as input but differs in terms of specific data division and annotation type. Technically, in Stage 1, data from the source domain is employed to establish a label relevance ranking base model (i.e., Actor). Stage 2 involves preference data to train a Reward model. Finally, in Stage 3, Critic model interacts with the first two models and all data w/o annotations is utilized to boost the performance of the Actor, which will solely be applied in the inference stage.

We now present the details of our LR2PPO. As depicted in Fig. 2, LR2PPO primarily consists of three network modules: actor, reward, and critic, and the training can be divided into 3 stages. We first discuss the relations between three training stages, and then we provide a detailed explanation of each stage.

Relations between 3 Training Stages. We use a two-stream Transformer to handle multimodal input and train three models in LR2PPO: actor, reward, and critic. In stage 1, the actor model is supervised on the source domain to obtain a label relevance ranking base model. Stages 2 and 3 generalize the actor model to the target domain. In stage 2, the reward model is trained with a few annotated label pairs in the target domain and augmented pairs in the source domain. In stage 3, reward and critic networks are initialized with the stage 2 reward model. Then, actor and critic are jointly optimized under LR2PPO with label pair data, guided by the stage 2 reward model, to instruct the actor network with partial order relations in the target domain. Finally, the optimized actor network, structurally identical to the stage 1 actor, is used as the final label relevance ranking network, adding no inference overhead.

Stage 1. Label Relevance Ranking Base Model. During Stage 1, the training of the label relevance ranking base model adopts a supervised paradigm, i.e., it is trained on the source domain based on manually annotated relevance categories (high, medium and low). The label relevance ranking base network accepts multimodal inputs of multiple video frames $F^{j}$, text descriptions $T^{j}$, and the text label set $\mathcal{L}^{j}$. For a video clip, each text label is concatenated with the text descriptions. Then ViT [17] and RoBERTa [39] are utilized to extract visual and textual features, respectively. Subsequently, the multimodal features are fused through the cross attention module. Finally, the regression head predicts the relevance score of each label, and the SmoothL1Loss is calculated against the manually annotated relevance categories as the final loss, which can be formulated as:

$$L_{\text{SmoothL1}}(p)=\begin{cases}0.5(p-y)^{2}/\beta & \text{if } |p-y|<\beta\\ |p-y|-0.5\beta & \text{otherwise},\end{cases} \tag{3}$$

where $p$ and $y$ are the predicted and ground truth relevance scores of a label in a video clip, respectively. $y=2,1,0$ corresponds to high, medium and low relevance, respectively. $\beta$ is a hyper-parameter.
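
Eq. 3 can be written out for a single label as the scalar sketch below. The function name `smooth_l1` and the default `beta=1.0` are assumptions for illustration; the paper does not state the value of $\beta$ in this section.

```python
def smooth_l1(p: float, y: float, beta: float = 1.0) -> float:
    """Smooth L1 loss of Eq. 3 for one predicted score p and target y.

    beta defaults to 1.0 here purely as a placeholder value.
    """
    diff = abs(p - y)
    if diff < beta:
        # Quadratic region: small errors are penalized smoothly.
        return 0.5 * diff * diff / beta
    # Linear region: large errors grow linearly, limiting outlier influence.
    return diff - 0.5 * beta


# A label annotated as high relevance (y = 2) with two hypothetical predictions.
near_miss = smooth_l1(1.5, 2.0)
far_miss = smooth_l1(0.0, 2.0)
```

The quadratic branch applies to `near_miss` and the linear branch to `far_miss`, so a prediction that lands in the wrong relevance category is penalized in proportion to its distance rather than its square.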

Stage 2. Reward Model. We train a reward model on the target domain in stage 2. With a few label pair annotations on the target domain, along with augmented pairs sampled from the source domain, the reward model can be trained to assign rewards to the partial order relationships between label pairs of a given clip. This kind of partial order annotation aligns with human preference for label relevance ranking, and thus benefits relevance ranking performance with limited annotation data. For a video clip, the reward model takes as input the concatenation of the initial label pair and the same labels in reranked order (either the ground truth order of relevance or its reverse), and predicts the reward of the reranked pair given the initial pair. The loss function adopted for training the reward model can be formulated as:

$$L_{RM}(g_{ini},g_{c})=\max(0,m_{R}-(R([g_{ini},g_{c}])-R([g_{ini},\operatorname{flip}(g_{c})]))), \tag{4}$$

where $g_{ini}$ and $g_{c}$ represent the initial label pair and the pair in ground truth order, respectively, $[\cdot,\cdot]$ means concatenation of two label pairs, $\operatorname{flip}(\cdot)$ means flipping the order of the label pair, and $R(\cdot)$ denotes the reward model. $m_{R}$ is a margin hyperparameter.
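
The margin loss of Eq. 4 can be sketched as follows, with label pairs represented as tuples of strings. Everything other than the formula itself is an assumption for illustration: `toy_reward` is a hypothetical stand-in that looks up fixed relevance values, whereas the reward model in the paper is a multimodal network that also sees the video frames and text.

```python
from typing import Callable, Tuple

Pair = Tuple[str, str]
Quad = Tuple[str, str, str, str]


def reward_margin_loss(reward_fn: Callable[[Quad], float],
                       g_ini: Pair, g_c: Pair, margin: float) -> float:
    """Eq. 4: hinge loss pushing R([g_ini, g_c]) above R([g_ini, flip(g_c)])."""
    r_correct = reward_fn(g_ini + g_c)        # initial pair + ground truth order
    r_flipped = reward_fn(g_ini + g_c[::-1])  # initial pair + reversed order
    return max(0.0, margin - (r_correct - r_flipped))


# Hypothetical reward function: rewards a reranked pair whose first label
# has the higher (made-up) relevance value.
TOY_RELEVANCE = {"Flirting": 2.0, "Man": 0.0}


def toy_reward(labels: Quad) -> float:
    _, _, first, second = labels
    return TOY_RELEVANCE[first] - TOY_RELEVANCE[second]


loss_good = reward_margin_loss(toy_reward, ("Man", "Flirting"), ("Flirting", "Man"), 1.0)
loss_untrained = reward_margin_loss(lambda labels: 0.0, ("Man", "Flirting"), ("Flirting", "Man"), 1.0)
```

A reward function that already separates the two orders by more than the margin incurs zero loss, while one that cannot tell them apart incurs a loss equal to the margin.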

Stage 3. LR2PPO. Stage 3 builds upon the first two training stages to jointly train the LR2PPO framework. To better address the issues in label relevance ranking, we modify the state definition and policy loss of the original PPO. We redefine the state $s_{t}$ as the order of a group of labels (specifically, a label pair) at timestep $t$, with the initial state $s_{0}$ being the original label order at input. The policy network (a.k.a. the actor model) predicts the relevance scores of the labels and ranks them from high to low to obtain a new label order as the next state $s_{t+1}$, which is considered a state transition, or action $a_{t}$. This process can be regarded as the policy $\pi_{\theta}$ of state transition. The combination of state $s_{t}$ (the initial label pair) and action $a_{t}$ (implicitly representing the reranked pair) is evaluated by the reward model to obtain a reward $r_{t}$.
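
The state transition described above can be sketched for a single label pair as follows. The helper names `transition` and `reward_input`, the tie-handling rule, and the score values are assumptions for this example; the actual actor and reward models operate on multimodal features rather than label strings alone.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def transition(state: Pair, actor_scores: Dict[str, float]) -> Pair:
    """Rerank a label pair by actor scores, high to low.

    Ties keep the input order, which is an arbitrary choice for this sketch.
    """
    first, second = state
    if actor_scores[second] > actor_scores[first]:
        return (second, first)
    return (first, second)


def reward_input(state: Pair, next_state: Pair) -> Tuple[str, str, str, str]:
    """Concatenate the initial pair with the reranked pair for the reward model."""
    return state + next_state


s0 = ("Man", "Flirting")                  # original label order at input
scores = {"Man": 0.4, "Flirting": 1.7}    # hypothetical actor predictions
s1 = transition(s0, scores)               # next state after reranking
rm_input = reward_input(s0, s1)
```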

We denote the target value function estimate at time step $t$ as $V_{t}^{\text{target}}$, which can be formulated as:

$$V_{t}^{\text{target}}=r_{t}+\gamma r_{t+1}+\gamma^{2}r_{t+2}+\ldots+\gamma^{T-t-1}r_{T-1}+\gamma^{T-t}V_{\omega_{old}}(s_{T}), \tag{5}$$

where $V_{\omega_{old}}(s_{T})$ denotes the old value function estimate at state $s_{T}$, $\omega$ denotes the trainable parameters of the state value network (a.k.a. the critic model) $V_{\omega}(\cdot)$, $V_{\omega_{old}}(\cdot)$ represents the old state value network, $\gamma$ is the discount factor, and $T$ is the terminal time step. We can further obtain the advantage estimate $\hat{A}_{t}$ at time step $t$ from the target value $V_{t}^{\text{target}}$ and the critic model's prediction for the value of the current state $s_{t}$:

$$\hat{A}_{t}=V_{t}^{\text{target}}-V_{\omega_{old}}(s_{t}), \tag{6}$$

where $V_{\omega_{old}}(s_{t})$ denotes the old value function estimate at state $s_{t}$.
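
Eqs. 5 and 6 amount to a discounted return bootstrapped with the old critic's estimate at the terminal state, minus the old critic's estimate at the current state. A minimal numeric sketch, with hypothetical function names and made-up reward and value numbers:

```python
from typing import Sequence


def target_value(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Eq. 5: rewards = [r_t, ..., r_{T-1}], bootstrap_value = V_old(s_T)."""
    target = bootstrap_value
    # Accumulate backwards so each step applies one factor of gamma.
    for r in reversed(rewards):
        target = r + gamma * target
    return target


def advantage(rewards: Sequence[float], value_t: float,
              bootstrap_value: float, gamma: float) -> float:
    """Eq. 6: target value minus the old critic's estimate V_old(s_t)."""
    return target_value(rewards, bootstrap_value, gamma) - value_t


# Two remaining steps with rewards 1.0 and 2.0, V_old(s_T) = 4.0, gamma = 0.5:
# 1.0 + 0.5 * 2.0 + 0.25 * 4.0 = 3.0
v_target = target_value([1.0, 2.0], bootstrap_value=4.0, gamma=0.5)
a_hat = advantage([1.0, 2.0], value_t=2.5, bootstrap_value=4.0, gamma=0.5)
```

A positive `a_hat` means the observed outcome was better than the old critic expected from state $s_{t}$.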

In typical reinforcement learning tasks, such as gaming, decision control, and language generation, PPO usually takes the maximum component of the predicted action probability distribution to obtain the ratio term of the policy loss (see the supplementary material for more details). However, the label relevance ranking task requires a complete probability vector to represent the change in label order, i.e., a state transition. Thus, it is difficult to build the ratio term of the policy loss directly from the action probability.

To address this problem, we first define the partial order function, i.e.:

$$H_{partial}(p_{t}^{1},p_{t}^{2})=\max(0,m-(p_{t}^{1}-p_{t}^{2})), \tag{7}$$

where $p_{t}^{1}$ and $p_{t}^{2}$ represent the predicted scores of the input label pair by the actor network at time step $t$, i.e., $\pi_{\theta}(a_{t}|s_{t})$, and $m$ is a margin hyperparameter, which helps the model better distinguish between correct and incorrect predictions to make the policy loss more precise.
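
The partial order function of Eq. 7 is a hinge on the score gap of a label pair. A direct transcription, with the function name and the example scores chosen only for illustration:

```python
def h_partial(p1: float, p2: float, m: float) -> float:
    """Eq. 7: zero once p1 exceeds p2 by at least the margin m, positive otherwise."""
    return max(0.0, m - (p1 - p2))


# Hypothetical actor scores for a label pair, margin m = 1.0.
satisfied = h_partial(2.0, 0.5, 1.0)   # gap of 1.5 exceeds the margin
violated = h_partial(0.5, 2.0, 1.0)    # order is reversed, so the hinge is active
```

The value grows linearly with how far the pair is from satisfying the order "first label above second label by at least $m$".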

Maximizing the surrogate objective is equivalent to minimizing the policy loss in the context of reinforcement learning. In the original PPO, the ratio $r_t(\theta)$ is adopted to measure the change in policy and serves as a multiplicative factor in the surrogate objective, encouraging the policy to increase the probability of actions that have a positive advantage and to decrease the probability of actions that have a negative advantage. However, in label relevance ranking, determining the action for a single state transition necessitates a complete probability distribution over the label sequence. Consequently, if we employ the ratio calculation from the original PPO, the ratio fails to capture the change between the new and old policies, which prevents it from adjusting the advantage effectively within the surrogate objective. This issue persists even when adopting the clipped objective function. Please refer to Sec. 5.3 for more experimental analysis. To solve this problem, we propose the partial order ratio $r_t'(\theta)$ to provide a more suitable adjustment for the advantage in the surrogate objective. It is a function that depends on the sign of $\hat{A}_t$, the estimated advantage at time step $t$. In practice, we utilize a small negative threshold $\delta$ instead of zero to stabilize the joint training of the LR2PPO framework. Specifically, $r_t'(\theta)$ is formulated as:

$$r_t'(\theta) = \begin{cases} -H_{partial}(p_t^1, p_t^2), & \hat{A}_t \geq \delta \\ -H_{partial}(p_t^2, p_t^1), & \hat{A}_t < \delta. \end{cases} \tag{8}$$

The proposed partial order ratio $r_t'(\theta)$ encourages the model to rank the labels correctly by penalizing incorrect orderings. In Eq. (8), when the advantage $\hat{A}_t$ is at least $\delta$ (i.e., $\hat{A}_t \geq \delta$), the reward model favors the first label (scored $p_t^1$) over the second (scored $p_t^2$). If $p_t^1 > p_t^2$, the absolute value of $r_t'(\theta)$ falls below $m$, lessening the penalty in Eq. (9). Conversely, if the advantage is below $\delta$, the roles of the two labels are swapped. The policy function loss of LR2PPO is formulated as:

$$L_{\text{LR}^2\text{PPO}}^{PF}(\theta) = -\mathbb{E}_t\left[ r_t'(\theta)\, \operatorname{abs}(\hat{A}_t) \right]. \tag{9}$$

In our design, the order of the label pair is adjusted based on the relative magnitude of the advantage $\hat{A}_t$ and $\delta$. The absolute value function $\operatorname{abs}(\cdot)$ in Eq. (9) keeps the advantage non-negative: once the more important label has been moved to the front (according to the relative magnitude of the original advantage $\hat{A}_t$ and $\delta$), moving it to a higher position is beneficial, so the advantage of the reordered pair retains a positive value. In this way, the policy loss is better suited to the label relevance ranking task, ensuring the ranking performance of LR2PPO.
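The interplay of Eq. (7), Eq. (8) and Eq. (9) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the `(p1, p2, advantage)` sample layout and the default margin are hypothetical, while the threshold default of -0.1 follows the value reported for $\delta$ in Fig. 3.

```python
from typing import Sequence, Tuple


def h_partial(p1: float, p2: float, margin: float) -> float:
    """Eq. (7): hinge on the score gap between the two labels."""
    return max(0.0, margin - (p1 - p2))


def partial_order_ratio(p1: float, p2: float, advantage: float,
                        margin: float = 1.0, delta: float = -0.1) -> float:
    """Eq. (8): choose the pair order from the advantage, then negate the hinge."""
    if advantage >= delta:
        # The first label is preferred, so penalize p1 not leading p2.
        return -h_partial(p1, p2, margin)
    # The second label is preferred, so the roles are swapped.
    return -h_partial(p2, p1, margin)


def policy_loss(samples: Sequence[Tuple[float, float, float]],
                margin: float = 1.0, delta: float = -0.1) -> float:
    """Eq. (9): negative mean of ratio times absolute advantage."""
    terms = [partial_order_ratio(p1, p2, adv, margin, delta) * abs(adv)
             for p1, p2, adv in samples]
    return -sum(terms) / len(terms)


# Hypothetical (p1, p2, advantage) samples: the first pair already agrees
# with the preferred order, the second pair is inverted.
batch = [(2.0, 0.5, 1.0), (0.5, 2.0, 1.0)]
loss = policy_loss(batch)
```

Because the ratio is a negated hinge, the resulting loss is non-negative and vanishes only when every pair is ordered in the preferred direction with a gap of at least the margin.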

Meanwhile, as in the original PPO, the value function loss of LR2PPO is given by:

$$L_{\text{LR}^2\text{PPO}}^{VF}(\omega) = L^{VF}(\omega) = \mathbb{E}_t\left[ \left( V_\omega(s_t) - V_t^{target} \right)^2 \right]. \tag{10}$$

This is the expectation over time steps $t$ of the squared difference between the value function estimate $V_\omega(s_t)$ and the target value $V_t^{target}$, under the policy parameters $\theta$. This loss measures the discrepancy between the predicted and target values, driving the model to better estimate the expected return.
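For concreteness, Eq. (10) is a mean squared error over the collected time steps. The sketch below assumes the value estimates and their targets are already available as two equally long lists of floats; the function name `value_loss` is illustrative.

```python
from typing import Sequence


def value_loss(predicted: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between value estimates and target values, as in Eq. (10)."""
    if len(predicted) != len(targets) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(v - t) ** 2 for v, t in zip(predicted, targets)]
    return sum(squared_errors) / len(squared_errors)


# Two hypothetical time steps with errors of 1 and -2.
example = value_loss([1.0, 2.0], [0.0, 4.0])
```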

As in the original PPO, we utilize an entropy bonus $S(\pi_\theta)$ to encourage exploration by maximizing the entropy of the policy, and employ a KL penalty $KL_{penalty}(\pi_{\theta_{old}}, \pi_\theta)$ to constrain policy updates and prevent large performance drops during optimization. Please refer to the supplementary material for more details. Finally, the overall loss function of LR2PPO combines the policy function loss, the value function loss, the entropy bonus for encouraging exploration, and the KL penalty:

$$\begin{aligned} L_{\text{LR}^2\text{PPO}}(\theta, \omega) = {} & L_{\text{LR}^2\text{PPO}}^{PF}(\theta) + c_1 L_{\text{LR}^2\text{PPO}}^{VF}(\omega) \\ & - c_2 S(\pi_\theta) + c_3 KL_{penalty}(\pi_{\theta_{old}}, \pi_\theta), \end{aligned} \tag{11}$$

where $c_1$, $c_2$ and $c_3$ are hyper-parameter coefficients.
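The sign conventions in Eq. (11) are easy to get wrong, so a small sketch may help: the entropy bonus is subtracted (higher entropy lowers the loss), while the value loss and the KL penalty are added. The function name and the default coefficients below are placeholders chosen for illustration, not the values used in the paper.

```python
def total_loss(policy_loss: float, value_loss: float, entropy: float,
               kl_penalty: float, c1: float = 0.5, c2: float = 0.01,
               c3: float = 0.1) -> float:
    """Weighted combination following Eq. (11).

    c1, c2 and c3 are hyper-parameter coefficients; the defaults here are
    arbitrary illustrative values.
    """
    return policy_loss + c1 * value_loss - c2 * entropy + c3 * kl_penalty


# Hypothetical scalar terms from one minibatch.
combined = total_loss(policy_loss=1.0, value_loss=2.0, entropy=0.5,
                      kl_penalty=0.25, c1=0.5, c2=0.5, c3=2.0)
```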

In summary, our LR2PPO algorithm leverages a combination of a label relevance ranking base model, a reward model, and a critic model, trained in a three-stage process. The algorithm is guided by a carefully designed loss function that encourages correct label relevance ranking, accurate value estimation, sufficient exploration, and imposes a constraint on policy updates. The pseudo-code of our LR2PPO is provided in Algorithm 1.

Algorithm 1 Label Relevance Ranking with Proximal Policy Optimization (LR2PPO), Actor-Critic Style
Input: Policy network $\pi_{\theta_{\text{old}}}$, state value network $V_{\omega_{\text{old}}}$, number of timesteps $T$, number of trajectories in an iteration $N_{\text{Trajs}}$, number of epochs $K$, minibatch size $M$
Output: Policy network parameter $\theta$, state value network parameter $\omega$
Initialization:
Initialize $\theta_{\text{old}}$ and $\omega_{\text{old}}$ with the base model and the reward model
Loop process:
for iteration = 1, 2, … do
   for $n_{\text{traj}} = 1, 2, \ldots, N_{\text{Trajs}}$ do
      Run policy $\pi_{\theta_{\text{old}}}$ and state value network $V_{\omega_{\text{old}}}$ in the environment for $T$ timesteps
      Compute advantage estimates $\hat{A}_1, \ldots, \hat{A}_T$ according to Eq. (6)
   end for
   Compute joint loss $L_{\text{LR}^2\text{PPO}}$ according to Eq. (11)
   Optimize surrogate $L_{\text{LR}^2\text{PPO}}$ with respect to $\theta$ and $\omega$, with $K$ epochs and minibatch size $M \leq N_{\text{Trajs}} \cdot T$
   $\theta_{\text{old}} \leftarrow \theta$, $\omega_{\text{old}} \leftarrow \omega$
end for
return: $\theta$, $\omega$

4 Label Relevance Ranking Dataset

Our objective in this paper is to tackle the challenge of label relevance ranking within multimodal scenarios, with the aim of better identifying salient and core labels in these contexts. However, we find that existing label ranking datasets are often designed with a focus on single-image inputs, lack text modality, and their fixed label systems limit the richness and diversity of labels. Furthermore, datasets related to Learning to Rank are typically tailored for tasks such as document ranking, making them unsuitable for label relevance ranking tasks.

Our proposed method for multimodal label relevance ranking is primarily designed for multimodal scenarios that feature a rich and diverse array of labels, particularly in typical scenarios like video clips. In our search for suitable datasets, we identify the MovieNet dataset as a rich source of multimodal video data. However, the MovieNet dataset only provides image-level object label annotations, while the wealth of information available in movie video clips can be used to extract a broad range of multimodal labels. To address this gap, we have undertaken a process of further label extraction and cleaning from the MovieNet dataset, with the aim of transforming it into a benchmark for label relevance ranking. This process has allowed us to create a more comprehensive and versatile dataset, better suited to the challenges of multimodal label relevance ranking.

Specifically, we select 3,206 clips from 219 videos in the MovieNet dataset [24]. For each movie clip, we extract frames from the video and input them into the RAM model [58] to obtain image labels. Concurrently, we input the descriptions of each movie clip into the LLaMa2 model [52] and extract the corresponding class labels. These generated image and text labels are then filtered and modified manually, which ensures that accurate and comprehensive annotations are selected for the video clips. We also standardize each clip to 20 labels through truncation or augmentation. As a result, we annotate 101,627 labels for 2,551 clips, with a total of 15,234 distinct label classes. We refer to the new benchmark obtained from our further annotation of the MovieNet dataset as LRMovieNet (Label Relevance of MovieNet).

5 Experiments

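| Type | Method | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 | NDCG@20 |
|---|---|---|---|---|---|---|
| OV-based | CLIP [47] | 0.5523 | 0.5209 | 0.5271 | 0.6009 | 0.7612 |
| OV-based | MKT [23] | 0.3517 | 0.3533 | 0.3765 | 0.4704 | 0.6774 |
| LTR-based | PRM [45] | 0.6320 | 0.6037 | 0.6083 | 0.6650 | 0.8022 |
| LTR-based | DLCM [1] | 0.6153 | 0.5807 | 0.5811 | 0.6310 | 0.7866 |
| LTR-based | ListNet [9] | 0.5947 | 0.5733 | 0.5787 | 0.6438 | 0.7872 |
| LTR-based | GSF [2] | 0.594 | 0.571 | 0.579 | 0.643 | 0.787 |
| LTR-based | SetRank [44] | 0.6337 | 0.6038 | 0.6125 | 0.6658 | 0.8030 |
| LTR-based | RankFormer [8] | 0.6350 | 0.6048 | 0.6108 | 0.6655 | 0.8033 |
| Ours | LR2PPO (S1) | 0.6330 | 0.6018 | 0.6061 | 0.6667 | 0.8021 |
| Ours | LR2PPO | **0.6820** | **0.6714** | **0.6869** | **0.7628** | **0.8475** |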

Table 1: State-of-the-art comparison for Label Relevance Ranking task on the LRMovieNet dataset. Bold indicates the best score.

5.1 Experiments Setup

LRMovieNet Dataset To assess the effectiveness of our approach, we conduct experiments using the LRMovieNet dataset. After obtaining image and text labels, we split the video dataset into source and target domains based on video label types. As label relevance ranking focuses on multimodal input, we partition the domains from a label perspective. To highlight label differences, we divide the class label set by video genres. Notably, there is a significant disparity between head and long-tail genres. Thus, we use the number of clips per genre to guide the partitioning. Specifically, we rank genres by the number of clips and divide them into sets $S_P$ and $S_Q$. Set $S_P$ includes genres with more clips, while set $S_Q$ includes those with fewer clips (i.e., long-tail genres).

We designate labels in set $S_Q$ as the target domain and the difference between labels in sets $S_P$ and $S_Q$ as the source domain, achieving domain partitioning while maintaining label diversity between domains. For source domain labels, we manually assign relevance categories (high, medium, low) based on their relevance to video clip content. For target domain labels, we randomly sample 5%-40% of label pairs and annotate their relative order based on their relevance to the video clip content. To evaluate our label relevance ranking algorithm, we also annotate the test set in the target domain with high, medium, and low relevance categories for the labels. We obtain 2551/2206/1000 video clips for the first stage/second stage/test split. The first stage data contains 10393 distinct labels, while the second stage and validation set contain 4841 different labels.
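The partitioning described above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes each clip carries a single genre, that the split point between head and tail genres (`num_head`) is chosen by hand, and that label pairs are sampled uniformly at random. All function and parameter names are hypothetical.

```python
import itertools
import random
from collections import Counter


def split_genres(clip_genres, num_head):
    """Rank genres by clip count and split them into head (S_P) and tail (S_Q)."""
    counts = Counter(clip_genres)
    # Sort by descending clip count; break ties by name for determinism.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return set(ranked[:num_head]), set(ranked[num_head:])


def label_domains(genre_labels, head, tail):
    """Target = labels of S_Q; source = labels of S_P minus labels of S_Q."""
    labels_p = set().union(*(genre_labels[g] for g in head))
    labels_q = set().union(*(genre_labels[g] for g in tail))
    return labels_p - labels_q, labels_q


def sample_pairs(labels, fraction, seed=0):
    """Pick a fraction of all unordered label pairs for order annotation."""
    pairs = list(itertools.combinations(sorted(labels), 2))
    count = min(len(pairs), max(1, round(fraction * len(pairs))))
    return random.Random(seed).sample(pairs, count)


# Toy data: one genre per clip and a label set per genre.
head, tail = split_genres(
    ["drama", "drama", "drama", "action", "action", "western"], num_head=2)
source, target = label_domains(
    {"drama": {"kiss", "city"}, "action": {"gun", "city"},
     "western": {"gun", "horse"}},
    head, tail)
to_annotate = sample_pairs(["a", "b", "c", "d", "e"], fraction=0.4)
```

In the toy data, a label that appears in both a head genre and a tail genre ends up only in the target domain, which mirrors the set difference used to define the source domain.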

| Method | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 | NDCG@20 |
|---|---|---|---|---|---|
| PRM [45] | 0.5726 | 0.5804 | 0.5973 | 0.6407 | 0.7603 |
| DLCM [1] | 0.5983 | 0.6025 | 0.6125 | 0.6797 | 0.7744 |
| ListNet [9] | 0.5449 | 0.5575 | 0.5699 | 0.6324 | 0.7467 |
| GSF [2] | 0.6004 | 0.6265 | 0.6471 | 0.7054 | 0.7892 |
| SetRank [44] | 0.5299 | 0.5380 | 0.5555 | 0.6083 | 0.7365 |
| RankFormer [8] | 0.5684 | 0.5511 | 0.5643 | 0.6164 | 0.7458 |
| LR2PPO | 0.6496 | 0.6830 | 0.7033 | 0.7710 | 0.8240 |

Table 2: State-of-the-art comparison on traditional datasets for label relevance ranking on the MSLR-WEB10K $\rightarrow$ MQ2008 transfer task.

MSLR-WEB10K $\rightarrow$ MQ2008 To demonstrate the generalizability of our method, we further conduct experiments on traditional LTR datasets. In this transfer learning scenario, we use MSLR-WEB10K as the source domain and MQ2008 as the target domain, based on datasets introduced by Qin and Liu [46].

Evaluation Metrics We use the NDCG (Normalized Discounted Cumulative Gain) metric as the evaluation metric for multimodal label relevance ranking. For each video clip, we compute NDCG@$k$ for the top $k$ labels.
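A minimal sketch of NDCG@$k$ for one clip is given below. The paper does not spell out the gain and discount it uses, so this example assumes the common exponential gain $2^{rel} - 1$ with a $\log_2$ position discount, and maps the high/medium/low relevance categories to the grades 2/1/0. Both choices, as well as the function names, are assumptions made for illustration.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[int], k: int) -> float:
    """Discounted cumulative gain of the first k items in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(scores: Sequence[float], relevances: Sequence[int], k: int) -> float:
    """NDCG@k for one clip: DCG of the predicted order over DCG of the ideal order."""
    # Rank labels by predicted score, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevances[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # No relevant label in the list.
    return dcg_at_k(ranked, k) / ideal_dcg


# Three labels graded high (2), low (0), medium (1).
grades = [2, 0, 1]
perfect = ndcg_at_k([0.9, 0.2, 0.5], grades, k=3)   # predicted order matches grades
inverted = ndcg_at_k([0.1, 0.9, 0.5], grades, k=3)  # predicted order is reversed
```

A per-dataset score would then be the average of this per-clip value over all test clips.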

5.2 State-of-the-art Comparison

LRMovieNet Dataset We compare LR2PPO with previous state-of-the-art LTR methods, reporting NDCG metrics on the LRMovieNet dataset. Results of LTR-based methods are reproduced based on the descriptions in the respective papers, since the original models cannot be directly applied to this task. As seen in Tab. 1, LR2PPO significantly outperforms previous methods. Meanwhile, compared to the first-stage model LR2PPO (S1), the final LR2PPO achieves a consistent improvement of over 3% across the different NDCG@k. Open-vocabulary (OV) based methods, such as CLIP and MKT, rank labels by label confidence and exhibit relatively poor performance. Furthermore, these models focus solely on specific objects and thus perform inadequately when dealing with semantic information. LTR-based methods show superiority over OV-based ones. However, they are not originally designed for ranking labels, especially when transferring to a new scenario. In contrast, our LR2PPO model allows the base model to interact with the environment and optimize over unlabeled data, resulting in superior performance compared to the baseline models.

| Annotation Proportion | Reward Model Accuracy | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 | NDCG@20 |
|---|---|---|---|---|---|---|
| 0% | - | 0.6330 | 0.6018 | 0.6061 | 0.6667 | 0.8021 |
| 5% | 0.7697 | 0.6787 | 0.6581 | 0.6770 | 0.7514 | 0.8416 |
| 10% | 0.7757 | 0.6820 | 0.6714 | 0.6869 | 0.7628 | 0.8475 |
| 20% | 0.7837 | 0.6800 | 0.6784 | 0.6980 | 0.7667 | 0.8506 |
| 40% | 0.7866 | 0.6830 | 0.6682 | 0.6877 | 0.7617 | 0.8467 |

Table 3: Stage 2 and 3 results with different annotation proportions in target domain.

(a) Different ratio design
(b) Different threshold $\delta$
Figure 3: NDCG curves during training. (a) PPO with different ratio design. Original ratio in PPO is not applicable to the definitions of state and action in the ranking task, leading to a training collapse, while our proposed partial order ratio solves this problem. (b) PPO with different thresholds $\delta$ in $r'_t(\theta)$. A small negative threshold $\delta = -0.1$ stabilizes the training, leading to superior performance.

MSLR-WEB10K $\rightarrow$ MQ2008 Comparisons on LTR datasets are shown in Tab. 2. It can be concluded that our proposed LR2PPO method surpasses previous LTR methods significantly when performing transfer learning on LTR datasets.

5.3 Ablation Studies

Annotation Proportion in Target Domain To explore the influence of the annotation proportion of ordered pairs in the target domain, we adjust the annotation proportion during the training of the reward model in stage 2. Subsequently, we adopt the adjusted reward model when training the entire LR2PPO framework in stage 3. The accuracy of the reward model in the second stage and the NDCG metric in the third stage are reported in Tab. 3. The proportion of 10% achieves relatively high reward model accuracy and ultimate ranking relevance, while maintaining limited annotation.

Partial Order Ratio As shown in Fig. 3(a), partial order ratio shows stability in training and achieves better NDCG, in contrast to the original ratio in PPO, which experiences a training collapse. This result indicates that the original ratio in PPO may not be directly applicable in the setting of label relevance ranking, thereby demonstrating the effectiveness of our proposed partial order ratio.

Hyper-parameters Sensitivity Here, we investigate the impact of the threshold $\delta$ in Eq. (8). As shown in Fig. 3(b), a negative threshold of $-0.1$ achieves better NDCG scores, while improving the robustness of the training procedure in stage 3. Refer to the supplementary material for more details.

5.4 Qualitative Assessment

To clearly reveal the effectiveness of our method, we visualize the label relevance ranking prediction results of the LR2PPO algorithm and other state-of-the-art OV-based or LTR-based methods on some samples in the LRMovieNet test set. Fig. 4 shows the comparison between the LR2PPO algorithm and CLIP and PRM. For a set of video frame sequences and a plot text description, as well as a set of labels, we compare the ranking results of different methods based on label relevance for the given label set, and list the top 5 high relevance labels predicted by each method. Compared with CLIP and PRM, our method ranks more high relevance labels at the top and low relevance labels at the bottom. The results show that our method can better rank the labels based on the relevance between the label and the multimodal input, to more accurately obtain high-value labels.

Figure 4: Comparison between LR2PPO and other state-of-the-art ranking methods. The red, blue and green labels listed after the method represent low, medium and high in ground truth, respectively. The value below each label represents the corresponding relevance score. Best viewed in color and zoomed in.

6 Conclusion

In this study, we demonstrate the pivotal role of label relevance in label tasks, and propose a novel approach, named LR2PPO, to effectively mine partial order relations and perform label relevance ranking, especially for a new scenario. To evaluate the performance of the method, a new benchmark dataset, named LRMovieNet, is proposed. Experimental results on this dataset and other LTR datasets validate the effectiveness of our proposed method.

References

  • [1] Ai, Q., Bi, K., Guo, J., Croft, W.B.: Learning a deep listwise context model for ranking refinement. In: The 41st international ACM SIGIR conference on research & development in information retrieval. pp. 135–144 (2018)
  • [2] Ai, Q., Wang, X., Bruch, S., Golbandi, N., Bendersky, M., Najork, M.: Learning groupwise multivariate scoring functions using deep neural networks. In: Proceedings of the 2019 ACM SIGIR international conference on theory of information retrieval. pp. 85–92 (2019)
  • [3] Bao, H., Wang, W., Dong, L., Liu, Q., Mohammed, O.K., Aggarwal, K., Som, S., Piao, S., Wei, F.: Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems 35, 32897–32912 (2022)
  • [4] Barto, A., Duff, M.: Monte carlo matrix inversion and reinforcement learning. Advances in Neural Information Processing Systems 6 (1993)
  • [5] Brinker, K., Hüllermeier, E.: Case-based label ranking. In: Machine Learning: ECML 2006: 17th European Conference on Machine Learning Berlin, Germany, September 18-22, 2006 Proceedings 17. pp. 566–573. Springer (2006)
  • [6] Brinker, K., Hüllermeier, E.: Case-based multilabel ranking. In: Proceedings of the 20th international joint conference on Artificial intelligence. pp. 702–707 (2007)
  • [7] Burges, C.J.: From ranknet to lambdarank to lambdamart: An overview. Learning 11(23-581),  81 (2010)
  • [8] Buyl, M., Missault, P., Sondag, P.A.: Rankformer: Listwise learning-to-rank using listwide labels. In: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 3762–3773. KDD ’23, Association for Computing Machinery (2023). https://doi.org/10.1145/3580305.3599892
  • [9] Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise approach to listwise approach. In: Proceedings of the 24th international conference on Machine learning. pp. 129–136 (2007)
  • [10] Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: Uniter: Universal image-text representation learning. In: European conference on computer vision. pp. 104–120. Springer (2020)
  • [11] Cheng, W., Hühn, J., Hüllermeier, E.: Decision tree and instance-based learning for label ranking. In: Proceedings of the 26th annual international conference on machine learning. pp. 161–168 (2009)
  • [12] Cheng, W., Hüllermeier, E.: Instance-based label ranking using the mallows model. In: ECCBR workshops. pp. 143–157 (2008)
  • [13] Cossock, D., Zhang, T.: Subset ranking using regression. In: Learning Theory: 19th Annual Conference on Learning Theory, COLT 2006, Pittsburgh, PA, USA, June 22-25, 2006. Proceedings 19. pp. 605–619. Springer (2006)
  • [14] Crammer, K., Singer, Y.: Pranking with ranking. Advances in neural information processing systems 14 (2001)
  • [15] Dekel, O., Singer, Y., Manning, C.D.: Log-linear models for label ranking. Advances in neural information processing systems 16 (2003)
  • [16] Dery, L., Shmueli, E.: Improving label ranking ensembles using boosting techniques. arXiv preprint arXiv:2001.07744 (2020)
  • [17] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR 2021: The Ninth International Conference on Learning Representations (2021)
  • [18] Fotakis, D., Kalavasis, A., Kontonis, V., Tzamos, C.: Linear label ranking with bounded noise. Advances in Neural Information Processing Systems 35, 15642–15656 (2022)
  • [19] Fotakis, D., Kalavasis, A., Psaroudaki, E.: Label ranking through nonparametric regression. In: International Conference on Machine Learning. pp. 6622–6659. PMLR (2022)
  • [20] Fürnkranz, J., Hüllermeier, E., Loza Mencía, E., Brinker, K.: Multilabel classification via calibrated label ranking. Machine learning 73, 133–153 (2008)
  • [21] Garg, S., Wu, Y., Balakrishnan, S., Lipton, Z.: A unified view of label shift estimation. Advances in Neural Information Processing Systems 33, 3290–3300 (2020)
  • [22] Har-Peled, S., Roth, D., Zimak, D.: Constraint classification for multiclass classification and ranking. Advances in neural information processing systems 15 (2002)
  • [23] He, S., Guo, T., Dai, T., Qiao, R., Shu, X., Ren, B., Xia, S.T.: Open-vocabulary multi-label classification via multi-modal knowledge transfer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 808–816 (2023)
  • [24] Huang, Q., Xiong, Y., Rao, A., Wang, J., Lin, D.: Movienet: A holistic dataset for movie understanding. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16. pp. 709–727. Springer (2020)
  • [25] Huang, Z., Zeng, Z., Huang, Y., Liu, B., Fu, D., Fu, J.: Seeing out of the box: End-to-end pre-training for vision-language representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12976–12985 (2021)
  • [26] Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K.: Label ranking by learning pairwise preferences. Artificial Intelligence 172(16-17), 1897–1916 (2008)
  • [27] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. pp. 4904–4916. PMLR (2021)
  • [28] Kanehira, A., Harada, T.: Multi-label ranking from positive and unlabeled data. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5138–5146 (2016)
  • [29] Korba, A., Garcia, A., d’Alché Buc, F.: A structured prediction approach for label ranking. Advances in neural information processing systems 31 (2018)
  • [30] Kumar, A., Liang, P.S., Ma, T.: Verified uncertainty calibration. Advances in Neural Information Processing Systems 32 (2019)
  • [31] Li, C., Pavlu, V., Aslam, J., Wang, B., Qin, K.: Learning to calibrate and rerank multi-label predictions. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part III. pp. 220–236. Springer (2020)
  • [32] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
  • [33] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
  • [34] Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, 9694–9705 (2021)
  • [35] Li, P., Wu, Q., Burges, C.: Mcrank: Learning to rank using multiple classification and gradient boosting. Advances in neural information processing systems 20 (2007)
  • [36] Li, W., Gao, C., Niu, G., Xiao, X., Liu, H., Liu, J., Wu, H., Wang, H.: Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409 (2020)
  • [37] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)
  • [38] Liu, T.Y., et al.: Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3(3), 225–331 (2009)
  • [39] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
  • [40] Liu, Z., Shen, Z., Long, Y., Xing, E., Cheng, K.T., Leichner, C.: Data-free neural architecture search via recursive label calibration. In: European Conference on Computer Vision. pp. 391–406. Springer (2022)
  • [41] Lu, J., Batra, D., Parikh, D., Lee, S.: Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems 32 (2019)
  • [42] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.: Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013)
  • [43] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
  • [44] Pang, L., Xu, J., Ai, Q., Lan, Y., Cheng, X., Wen, J.: Setrank: Learning a permutation-invariant ranking model for information retrieval. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. pp. 499–508 (2020)
  • [45] Pei, C., Zhang, Y., Zhang, Y., Sun, F., Lin, X., Sun, H., Wu, J., Jiang, P., Ge, J., Ou, W., et al.: Personalized re-ranking for recommendation. In: Proceedings of the 13th ACM conference on recommender systems. pp. 3–11 (2019)
  • [46] Qin, T., Liu, T.: Introducing LETOR 4.0 datasets. CoRR abs/1306.2597 (2013), http://arxiv.org/abs/1306.2597
  • [47] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML 2021: 38th International Conference on Machine Learning. pp. 8748–8763 (2021)
  • [48] Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: International conference on machine learning. pp. 1889–1897. PMLR (2015)
  • [49] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
  • [50] Shalev-Shwartz, S., Singer, Y.: Efficient learning of label ranking by soft projections onto polyhedra (2006)
  • [51] Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., Riedmiller, M.: Deterministic policy gradient algorithms. In: International conference on machine learning. pp. 387–395. PMLR (2014)
  • [52] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  • [53] Watkins, C.J.C.H.: Learning from delayed rewards (1989)
  • [54] Xia, F., Liu, T.Y., Wang, J., Zhang, W., Li, H.: Listwise approach to learning to rank: theory and algorithm. In: Proceedings of the 25th international conference on Machine learning. pp. 1192–1199 (2008)
  • [55] Yang, J., Duan, J., Tran, S., Xu, Y., Chanda, S., Chen, L., Zeng, B., Chilimbi, T., Huang, J.: Vision-language pre-training with triple contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15671–15680 (2022)
  • [56] Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917 (2022)
  • [57] Zeng, Y., Zhang, X., Li, H.: Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276 (2021)
  • [58] Zhang, Y., Huang, X., Ma, J., Li, Z., Luo, Z., Xie, Y., Qin, Y., Luo, T., Li, Y., Liu, S., et al.: Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514 (2023)
  • [59] Zhuang, T., Ou, W., Wang, Z.: Globally optimized mutual influence aware ranking in e-commerce search. arXiv preprint arXiv:1805.08524 (2018)

Appendix A PPO Algorithm

In this section, we provide a detailed explanation of the original Proximal Policy Optimization (PPO) algorithm, as proposed by Schulman et al. [49]. The PPO algorithm is defined by three key loss functions: the policy function loss, the value function loss, and the entropy bonus. In Sec. 3.2 of the main paper, we present the concept of the target value function estimate $V_{t}^{\text{target}}$ and the advantage estimate $\hat{A}_{t}$ at time step $t$. Building upon this, we proceed to introduce the details of the PPO algorithm.
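As an illustration of these two quantities, the following minimal Python sketch computes $V_{t}^{\text{target}}$ and $\hat{A}_{t}$ for a single trajectory, using the discounted-return form that appears in Algorithm B.1. The function names, the zero-based time index, and the toy numbers are assumptions made for this example and are not taken from the released code.

```python
def discounted_targets(rewards, bootstrap_value, gamma):
    """Return V_t^target for t = 0..T-1, given r_0..r_{T-1} and V(s_T)."""
    targets = [0.0] * len(rewards)
    running = bootstrap_value  # plays the role of V_{omega_old}(s_T)
    # Work backwards: V_t^target = r_t + gamma * V_{t+1}^target.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


def estimate_advantages(targets, values):
    """A_hat_t = V_t^target - V_{omega_old}(s_t)."""
    return [target - value for target, value in zip(targets, values)]


# Toy trajectory with T = 3 (all numbers are made up for illustration).
toy_rewards = [1.0, 0.0, 2.0]
toy_values = [1.5, 1.0, 2.0]  # critic outputs V(s_0), V(s_1), V(s_2)
toy_targets = discounted_targets(toy_rewards, bootstrap_value=0.5, gamma=0.5)
toy_advantages = estimate_advantages(toy_targets, toy_values)
```

Accumulating the return backwards avoids recomputing the discounted sum separately for every time step.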

Formally, $r_{t}(\theta)$ is defined as the ratio of the action probability under the policy $\pi_{\theta}$ to the action probability under the old policy $\pi_{\theta_{\text{old}}}$:

$$r_{t}(\theta)=\frac{\pi_{\theta}(a_{t}|s_{t})}{\pi_{\theta_{\text{old}}}(a_{t}|s_{t})}. \tag{A.1}$$

Subsequently, the policy function loss, denoted as $L^{PF}(\theta)$, is computed based on the policy parameters $\theta$:

$$L^{PF}(\theta)=-\mathbb{E}_{t}\left(r_{t}(\theta)\hat{A}_{t}\right). \tag{A.2}$$
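For concreteness, Eq. (A.1) and Eq. (A.2) can be sketched in plain Python as below. The helper names and the toy probabilities are hypothetical; the expectation $\mathbb{E}_{t}$ is approximated by an average over a small batch of sampled time steps.

```python
def probability_ratio(new_prob, old_prob):
    """r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t), Eq. (A.1)."""
    return new_prob / old_prob


def policy_loss(ratios, advantages):
    """L^PF = -mean_t(r_t * A_hat_t), a batch estimate of Eq. (A.2)."""
    weighted = [r * a for r, a in zip(ratios, advantages)]
    return -sum(weighted) / len(weighted)


# Made-up probabilities of the sampled actions under the new and old policy.
new_probs = [0.5, 0.25]
old_probs = [0.25, 0.5]
batch_advantages = [1.0, -2.0]

batch_ratios = [probability_ratio(n, o) for n, o in zip(new_probs, old_probs)]
unclipped_loss = policy_loss(batch_ratios, batch_advantages)
```

Minimizing this loss raises the probability of actions with positive advantage and lowers it for actions with negative advantage, with no bound on how far the ratio may move.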

To prevent the policy from changing too drastically in a single update, the PPO algorithm typically employs a clipped form of the policy loss:

$$L^{CPF}(\theta)=-\mathbb{E}_{t}\Bigl[\min\Bigl(r_{t}(\theta)\hat{A}_{t},\,\text{clip}\bigl(r_{t}(\theta),1-\epsilon,1+\epsilon\bigr)\hat{A}_{t}\Bigr)\Bigr]. \tag{A.3}$$
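A minimal sketch of the clipped loss in Eq. (A.3) follows. The function names are illustrative assumptions, and the default $\epsilon=0.2$ is only a commonly used setting, not a value stated in this appendix.

```python
def clip(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(x, high))


def clipped_policy_loss(ratios, advantages, epsilon=0.2):
    """L^CPF = -mean_t(min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t))."""
    terms = []
    for r, a in zip(ratios, advantages):
        unclipped = r * a
        clipped = clip(r, 1.0 - epsilon, 1.0 + epsilon) * a
        # The minimum keeps the more pessimistic of the two surrogate terms.
        terms.append(min(unclipped, clipped))
    return -sum(terms) / len(terms)


# Toy batch: one ratio above and one below the clipping range for eps = 0.25.
example_ratios = [2.0, 0.5]
example_advantages = [1.0, -2.0]
example_loss = clipped_policy_loss(example_ratios, example_advantages, epsilon=0.25)
```

With $\epsilon=0.25$, the first ratio of 2.0 is clipped to 1.25 because its advantage is positive, while for the second sample the minimum selects the clipped term $0.75\cdot(-2.0)$, so the loss evaluates to 0.125.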

The value function loss, denoted as $L^{VF}(\omega)$, has also been defined in Sec. 3.2 of the main paper.

The entropy bonus, denoted as $S(\pi_{\theta})$, is defined as:

$$S(\pi_{\theta})=\mathbb{E}_{t}\left[\sum_{a_{t}}-\pi_{\theta}(a_{t}|s_{t})\log\pi_{\theta}(a_{t}|s_{t})\right]. \tag{A.4}$$

This is the expectation over time steps $t$ of the negative sum, over all possible actions, of the product of the action probability under the policy $\pi_{\theta}$ and the logarithm of that probability, i.e., the entropy of the policy at state $s_{t}$. This term encourages exploration by maximizing the entropy of the policy. Ultimately, the total loss of PPO can be defined as:

$$L^{\prime}(\theta)=L^{CPF}(\theta)+c_{1}^{\prime}L^{VF}(\omega)-c_{2}^{\prime}S(\pi_{\theta}), \tag{A.5}$$

where $c_{1}^{\prime}$ and $c_{2}^{\prime}$ represent adjustable hyperparameters, which can be tuned to optimize the performance of the model.
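The entropy bonus of Eq. (A.4) and the combined objective of Eq. (A.5) can be sketched as follows, assuming a discrete action space whose per-state action probabilities are given as Python lists. All function names and the coefficient values in the usage lines are assumptions chosen for illustration.

```python
import math


def entropy(probs):
    """Entropy -sum_a p(a) log p(a) of one discrete action distribution."""
    # Zero-probability actions contribute nothing (0 * log 0 is taken as 0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_bonus(action_distributions):
    """S(pi_theta): entropy averaged over the sampled states, Eq. (A.4)."""
    return sum(entropy(d) for d in action_distributions) / len(action_distributions)


def total_loss(clipped_pf_loss, value_loss, bonus, c1, c2):
    """L' = L^CPF + c1' * L^VF - c2' * S(pi_theta), Eq. (A.5)."""
    return clipped_pf_loss + c1 * value_loss - c2 * bonus


# One uniform and one deterministic two-action policy (toy values).
toy_bonus = entropy_bonus([[0.5, 0.5], [1.0, 0.0]])
toy_total = total_loss(0.125, 0.5, 1.0, c1=0.5, c2=0.25)
```

Because the bonus is subtracted, a more uniform (higher-entropy) policy lowers the total loss, which is how the term rewards exploration.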

In many cases, instead of using the clipped policy loss form in Eq. (A.3), the PPO algorithm incorporates a KL penalty term in the overall loss to prevent overly large policy updates that could lead to instability or performance drops in the learning process. The KL penalty, denoted as $KL_{penalty}$, is formulated as:

$$KL_{penalty}(\pi_{\theta_{\text{old}}},\pi_{\theta})=\mathbb{E}_{t}\left[KL\left(\pi_{\theta_{\text{old}}}(a_{t}|s_{t}),\pi_{\theta}(a_{t}|s_{t})\right)\right], \tag{A.6}$$

where $KL(\cdot,\cdot)$ represents the Kullback-Leibler (KL) divergence. Thereby, the overall loss function, denoted as $L^{\prime\prime}(\theta)$, is a combination of the four aforementioned loss functions. It is computed as:

$$L^{\prime\prime}(\theta)=L^{PF}(\theta)+c_{1}^{\prime\prime}L^{VF}(\omega)-c_{2}^{\prime\prime}S(\pi_{\theta})+c_{3}^{\prime\prime}KL_{penalty}(\pi_{\theta_{\text{old}}},\pi_{\theta}). \tag{A.7}$$

Here, the hyperparameters $c_{1}^{\prime\prime}$, $c_{2}^{\prime\prime}$ and $c_{3}^{\prime\prime}$ are used to balance the contributions of the four components to the overall loss, and are typically determined through empirical tuning.
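The KL-penalized variant in Eq. (A.6) and Eq. (A.7) can be sketched in the same style. The sketch assumes discrete action distributions in which the new policy assigns non-zero probability to every action; the function names and the numbers in the usage lines are hypothetical.

```python
import math


def kl_divergence(p_old, p_new):
    """KL(p_old || p_new) for two discrete distributions over the same actions."""
    # Assumes p_new[a] > 0 wherever p_old[a] > 0.
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)


def kl_penalty(old_distributions, new_distributions):
    """Batch estimate of Eq. (A.6): mean KL between old and new policies."""
    pairs = list(zip(old_distributions, new_distributions))
    return sum(kl_divergence(old, new) for old, new in pairs) / len(pairs)


def total_loss_with_kl(pf_loss, value_loss, bonus, kl, c1, c2, c3):
    """L'' = L^PF + c1'' * L^VF - c2'' * S(pi_theta) + c3'' * KL_penalty, Eq. (A.7)."""
    return pf_loss + c1 * value_loss - c2 * bonus + c3 * kl


# Toy distributions for a single state with two actions.
toy_kl = kl_penalty([[0.5, 0.5]], [[0.25, 0.75]])
toy_total_kl = total_loss_with_kl(-0.5, 0.5, 1.0, 0.5, c1=0.5, c2=0.25, c3=2.0)
```

Unlike clipping, the penalty does not cap the ratio directly; a larger $c_{3}^{\prime\prime}$ simply makes deviations from the old policy more expensive.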

Appendix B LR2PPO Algorithm

The joint training of the Label Relevance Ranking with Proximal Policy Optimization (LR2PPO) framework is illustrated in Algorithm B.1. Note that the definitions of the entropy bonus $S(\pi_{\theta})$ and the KL penalty $KL_{penalty}(\pi_{\theta_{\text{old}}},\pi_{\theta})$ are the same as in the original PPO (see Eq. (A.4) and Eq. (A.6)).
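The policy loss in Algorithm B.1 is built on a partial order ratio $r_{t}^{\prime}(\theta)$ computed from the two scores of a label pair. A minimal Python sketch of that ratio and of the resulting policy loss is given here, following the formulas listed in the algorithm; the function names, argument layout, and toy values are assumptions made for illustration and do not reproduce the released implementation.

```python
def partial_order_ratio(p1, p2, advantage, margin, delta):
    """r'_t: negative hinge on the score gap, oriented by the advantage sign."""
    if advantage >= delta:
        # The first label should outrank the second by at least the margin.
        return -max(0.0, margin - (p1 - p2))
    # Otherwise the second label should outrank the first.
    return -max(0.0, margin - (p2 - p1))


def lr2ppo_policy_loss(score_pairs, advantages, margin, delta):
    """L^PF_LR2PPO = -(1/|M|) * sum_t r'_t(theta) * abs(A_hat_t)."""
    total = 0.0
    for (p1, p2), a in zip(score_pairs, advantages):
        total += partial_order_ratio(p1, p2, a, margin, delta) * abs(a)
    return -total / len(advantages)


# Toy minibatch of two label pairs (scores and advantages are made up).
pair_scores = [(2.0, 0.5), (0.25, 0.75)]
pair_advantages = [0.5, -2.0]
pair_loss = lr2ppo_policy_loss(pair_scores, pair_advantages, margin=1.0, delta=0.0)
```

In this toy batch the first pair already satisfies the margin and contributes nothing, while the second pair falls short of the margin by 0.5 and is weighted by its advantage magnitude of 2.0, giving a loss of 0.5.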

Algorithm B.1 Label Relevance Ranking with Proximal Policy Optimization (LR2PPO), Full Procedure
1:Input: $T=$ Maximal state transition timestep length, $N_{\text{Trajs}}=$ Number of state transition trajectories collected as training data in a training iteration, $K=$ Number of learning epochs in a training iteration, $N_{\text{Iters}}=$ Number of training iterations, weighting factors $c_{1}$, $c_{2}$ and $c_{3}$, $M=$ Minibatch size, $\gamma=$ Discount factor through timesteps, $m=$ Margin in partial order function, $\delta=$ Threshold when calculating partial order ratio
2:Output: $\pi_{\theta}$, $V_{\omega}$
3:$\pi_{\theta}\leftarrow$ newActorNet(), $V_{\omega}\leftarrow$ newCriticNet(), optimizer $\leftarrow$ newOptimizer$(\pi_{\theta},V_{\omega})$
4:env $\leftarrow$ Multimodal input and corresponding labels ▷ Environment of label relevance ranking task
5:num_batches $=\lfloor\frac{N_{\text{Trajs}}\cdot T}{M}\rfloor$ ▷ Number of batches in a training iteration
6:for $n_{iter}=1,2,\ldots,N_{\text{Iters}}$ do
7:   train_data $\leftarrow[\,]$
// Produce training data.
8:   for $n_{k}=1,2,\ldots,N_{\text{Trajs}}$ do
9:      $s_{t=0}\leftarrow$ env.randomlySampleLabelPair() ▷ Randomly sample pair of labels as initial state
// Let actor interact with environment and collect training data:
10:      for $t=1,2,\ldots,T$ do
11:         $a_{t}\leftarrow\pi_{\theta}$.generate_action$(s_{t})$
12:         $\pi_{\theta_{\text{old}}}(a_{t}\mid s_{t})\leftarrow\pi_{\theta}$.distribution.get_probability$(a_{t})$ ▷ Old action logits
13:         $s_{t+1},r_{t}\leftarrow$ env.step$(a_{t})$ ▷ Transition state and get reward
14:         train_data $\leftarrow$ train_data $+\operatorname{tuple}(s_{t},a_{t},r_{t},\pi_{\theta_{\text{old}}}(a_{t}\mid s_{t}))$
15:      end for
// Calculate and collect target value and advantage at timestep $t$
16:      for $t=1,2,\ldots,T$ do
17:         $V_{\omega_{\text{old}}}(s_{T})\leftarrow V_{\omega}$.get_value$(s_{T})$
18:         $V_{\omega_{\text{old}}}(s_{t})\leftarrow V_{\omega}$.get_value$(s_{t})$
19:         $V_{t}^{\text{target}}=r_{t}+\gamma r_{t+1}+\gamma^{2}r_{t+2}+\ldots+\gamma^{T-t-1}r_{T-1}+\gamma^{T-t}V_{\omega_{\text{old}}}(s_{T})$ ▷ Target value
20:         $\hat{A}_{t}=V_{t}^{\text{target}}-V_{\omega_{\text{old}}}(s_{t})$ ▷ Estimated advantage
21:         train_data$[(n_{k}-1)\cdot T+t]\leftarrow$ train_data$[(n_{k}-1)\cdot T+t]+\operatorname{tuple}(\hat{A}_{t},V_{t}^{\text{target}})$
22:      end for
23:   end for
24:   optimizer.resetGradients$(\pi_{\theta},V_{\omega})$
// Update trainable parameters $\theta$ and $\omega$ for $K$ epochs:
25:   for epoch $=1,2,\ldots,K$ do
26:      train_data $\leftarrow$ randomizeOrder(train_data)
27:      for batch_index $=1,2,\ldots,$ num_batches do
28:         $\mathrm{E_{batch}}\leftarrow$ getNextBatch(train_data, batch_index, $M$) ▷ Get minibatch for training
29:         for example $\mathrm{e}\in\mathrm{E_{batch}}$ do
30:            $s_{t},a_{t},r_{t},\pi_{\theta_{\text{old}}}(a_{t}\mid s_{t}),\hat{A}_{t},V_{t}^{\text{target}}\leftarrow\operatorname{unpack}(\mathrm{e})$
31:            $\_\leftarrow\pi_{\theta}$.generate_action$(s_{t})$ ▷ Parameterization of policy
32:            $\pi_{\theta}(a_{t}\mid s_{t})\leftarrow\pi_{\theta}$.distribution.get_probability$(s_{t})$ ▷ Action logits
33:            $p_{t}^{1},p_{t}^{2}\leftarrow\operatorname{unpack}(\pi_{\theta}(a_{t}\mid s_{t}))$ ▷ Action logit for each label in the label pair
34:            $r_{t}^{\prime}(\theta)=\begin{cases}-\max(0,m-(p_{t}^{1}-p_{t}^{2})),&\hat{A}_{t}\geq\delta\\-\max(0,m-(p_{t}^{2}-p_{t}^{1})),&\hat{A}_{t}<\delta\end{cases}$ ▷ Partial order ratio
35:         end for
36:         $L_{\text{LR}^{2}\text{PPO}}^{PF}(\theta)=-\frac{1}{|M|}\sum_{t\in\{1,2,\ldots,|M|\}}r_{t}^{\prime}(\theta)\operatorname{abs}(\hat{A}_{t})$ ▷ Policy loss
37:         $L_{\text{LR}^{2}\text{PPO}}^{VF}(\omega)=\frac{1}{|M|}\sum_{t\in\{1,2,\ldots,|M|\}}\left(V_{\omega}(s_{t})-V_{t}^{\text{target}}\right)^{2}$ ▷ Value loss
38:         $S(\pi_{\theta})=-\frac{1}{|M|}\sum_{t\in\{1,2,\ldots,|M|\}}\sum_{a_{t}}\pi_{\theta}(a_{t}|s_{t})\log\pi_{\theta}(a_{t}|s_{t})$ ▷ Entropy bonus
39:         $KL_{penalty}(\pi_{\theta_{\text{old}}},\pi_{\theta})=\frac{1}{|M|}\sum_{t\in\{1,2,\ldots,|M|\}}KL(\pi_{\theta_{\text{old}}}(a_{t}|s_{t}),\pi_{\theta}(a_{t}|s_{t}))$ ▷ KL penalty
40:         LLR2PPO(θ,ω)=LLR2PPOPF(θ)+c1LLR2PPOVF(ω)c2S(πθ)+c3KLpenalty(πθold,πθ)subscript𝐿𝐿𝑅2𝑃𝑃𝑂𝜃𝜔superscriptsubscript𝐿𝐿𝑅2𝑃𝑃𝑂𝑃𝐹𝜃subscript𝑐1superscriptsubscript𝐿𝐿𝑅2𝑃𝑃𝑂𝑉𝐹𝜔subscript𝑐2𝑆subscript𝜋𝜃subscript𝑐3𝐾subscript𝐿𝑝𝑒𝑛𝑎𝑙𝑡𝑦subscript𝜋subscript𝜃𝑜𝑙𝑑subscript𝜋𝜃L_{LR\textsuperscript{2}PPO}(\theta,\omega)=L_{LR\textsuperscript{2}PPO}^{PF}(% \theta)+c_{1}L_{LR\textsuperscript{2}PPO}^{VF}(\omega)-c_{2}S(\pi_{\theta})+c_% {3}KL_{penalty}(\pi_{\theta_{old}},\pi_{\theta})italic_L start_POSTSUBSCRIPT italic_L italic_R italic_P italic_P italic_O end_POSTSUBSCRIPT ( italic_θ , italic_ω ) = italic_L start_POSTSUBSCRIPT italic_L italic_R italic_P italic_P italic_O end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_P italic_F end_POSTSUPERSCRIPT ( italic_θ ) + italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_L start_POSTSUBSCRIPT italic_L italic_R italic_P italic_P italic_O end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_V italic_F end_POSTSUPERSCRIPT ( italic_ω ) - italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_S ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) + italic_c start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT italic_K italic_L start_POSTSUBSCRIPT italic_p italic_e italic_n italic_a italic_l italic_t italic_y end_POSTSUBSCRIPT ( italic_π start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_o italic_l italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) \triangleright Total loss
41:         optimizer.backpropagate (πθ,Vω,LLR2PPO(θ,ω))subscript𝜋𝜃subscript𝑉𝜔subscript𝐿𝐿𝑅2𝑃𝑃𝑂𝜃𝜔(\pi_{\theta},V_{\omega},L_{LR\textsuperscript{2}PPO}(\theta,\omega))( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT , italic_V start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT , italic_L start_POSTSUBSCRIPT italic_L italic_R italic_P italic_P italic_O end_POSTSUBSCRIPT ( italic_θ , italic_ω ) )
42:         optimizer.updateTrainableParameters (πθ,Vω)subscript𝜋𝜃subscript𝑉𝜔\left(\pi_{\theta},V_{\omega}\right)( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT , italic_V start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT )
43:      end for
44:   end for
45:end for
46:return: πθsubscript𝜋𝜃\pi_{\theta}italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT, Vωsubscript𝑉𝜔V_{\omega}italic_V start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT
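
The loss computation in lines 34 to 40 of the algorithm can be sketched in plain Python as follows. This is a minimal illustration rather than the released implementation: the function names are hypothetical, $p_t^1$ and $p_t^2$ are taken here to be the policy's scores for the two labels of pair $t$ (their precise definition is given in the main paper), and the default values of $m$, $\delta$, $c_1$, $c_2$ and $c_3$ are the ones reported in Sec. C.3.

```python
def partial_order_ratio(p1, p2, advantage, m=1.0, delta=-0.1):
    """r'_t(theta): negative hinge on the score margin of one label pair.

    p1, p2 are assumed to be the policy scores of the two labels in the pair.
    """
    if advantage >= delta:
        return -max(0.0, m - (p1 - p2))
    return -max(0.0, m - (p2 - p1))


def policy_loss(p1s, p2s, advantages, m=1.0, delta=-0.1):
    """L^PF: minus the mean of r'_t * |A_t| over the minibatch."""
    terms = [
        partial_order_ratio(p1, p2, adv, m, delta) * abs(adv)
        for p1, p2, adv in zip(p1s, p2s, advantages)
    ]
    return -sum(terms) / len(terms)


def value_loss(values, targets):
    """L^VF: mean squared error between V(s_t) and its target."""
    return sum((v - t) ** 2 for v, t in zip(values, targets)) / len(values)


def total_loss(pf, vf, entropy, kl, c1=1.0, c2=1e-3, c3=1e-3):
    """Joint loss as written in line 40 of the algorithm."""
    return pf + c1 * vf - c2 * entropy + c3 * kl
```

Because $r'_t(\theta)$ is never positive, the policy loss is never negative, and it reaches zero once every pair is separated by at least the margin $m$ in the direction indicated by the advantage.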

Appendix C Experiments

C.1 More Details about LRMovieNet Dataset

Prompts to Extract Text Labels When generating textual labels from the scene description of each movie clip in the MovieNet [24] dataset, we feed the following prompts into LLaMa2 [52] to obtain event labels and entity labels, respectively. (1) Event labels: “Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Use descriptive language tags consisting of less than three words each to capture events without entities in the following sentence: {sentence} Please seperate the labels by numbers and DO NOT return sentence. ### Response:” (2) Entity labels: “Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Use descriptive language tags consisting of less than three words each to capture entities in the following sentence: {sentence} Please seperate the labels by numbers and DO NOT return sentence. ### Response:”. The produced text labels and the image labels (generated by the RAM model [58]) are subsequently screened and manually adjusted.
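
As an illustration of how the two prompts might be instantiated, the sketch below fills the `{sentence}` placeholder and parses a numbered response into a list of labels. The helper names `build_prompt` and `parse_numbered_labels` are hypothetical, the single-line whitespace layout of the templates is an assumption (the original line breaks are not shown here), and the parsing rule is only one plausible way to read "labels separated by numbers". The prompt wording, including its spelling, is copied verbatim from the text above.

```python
import re

_HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request. "
    "### Instruction: Use descriptive language tags consisting of less than "
    "three words each to capture "
)
_FOOTER = (
    " in the following sentence: {sentence} "
    "Please seperate the labels by numbers and DO NOT return sentence. "
    "### Response:"
)

# The two templates differ only in what the tags should capture.
TEMPLATES = {
    "event": _HEADER + "events without entities" + _FOOTER,
    "entity": _HEADER + "entities" + _FOOTER,
}


def build_prompt(sentence, kind):
    """Fill the scene description into the event or entity template."""
    return TEMPLATES[kind].format(sentence=sentence)


def parse_numbered_labels(response):
    """Collect lines such as '1. running' or '2) car chase' as labels."""
    labels = []
    for line in response.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            labels.append(match.group(1).strip())
    return labels
```

In the pipeline described above, the parsed labels would still go through screening and manual adjustment before being used.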

Statistics of LRMovieNet Dataset To better understand the overall distribution of LRMovieNet, we provide histograms of its statistics in several dimensions, which are displayed in Fig. C.1. Specifically, Fig. C.1 (a) shows the total number of video clips in all videos of different genres, Fig. C.1 (b) displays the number of label classes in different genres, and Fig. C.1 (c) provides the number of labels with different frequencies in the entire LRMovieNet dataset. The details of Fig. C.1 (a) and Fig. C.1 (b) are listed in Tab. C.1(a) and Tab. C.1(b), respectively.

(a) Clips count in different genres
(b) Label classes count in different genres
(c) Labels count about label frequency

Figure C.1: Data statistics of LRMovieNet.
| Genre Index | Genre Name | Clips Count |
| --- | --- | --- |
| 0 | Drama | 1765 |
| 1 | Action | 1224 |
| 2 | Thriller | 1154 |
| 3 | Sci-Fi | 869 |
| 4 | Crime | 829 |
| 5 | Adventure | 814 |
| 6 | Comedy | 562 |
| 7 | Mystery | 525 |
| 8 | Fantasy | 520 |
| 9 | Romance | 427 |
| 10 | Biography | 232 |
| 11 | War | 171 |
| 12 | Horror | 147 |
| 13 | Family | 140 |
| 14 | History | 132 |
| 15 | Music | 82 |
| 16 | Western | 66 |
| 17 | Sport | 51 |
| 18 | Musical | 21 |

(a) Details about clips count in different genres
| Genre Index | Genre Name | Classes Count |
| --- | --- | --- |
| 0 | Drama | 10088 |
| 1 | Action | 7923 |
| 2 | Thriller | 7573 |
| 3 | Sci-Fi | 6315 |
| 4 | Adventure | 5999 |
| 5 | Crime | 5542 |
| 6 | Comedy | 4933 |
| 7 | Fantasy | 4468 |
| 8 | Mystery | 4277 |
| 9 | Romance | 3688 |
| 10 | Biography | 2443 |
| 11 | War | 1879 |
| 12 | Horror | 1658 |
| 13 | History | 1652 |
| 14 | Family | 1543 |
| 15 | Music | 1191 |
| 16 | Western | 872 |
| 17 | Sport | 757 |
| 18 | Musical | 446 |

(b) Details about label classes count in different genres
Table C.1: Details about number of video clips and label classes in all videos of different genres in LRMovieNet.

Samples of LRMovieNet Dataset To illustrate the variation of types (e.g., objects, attributes, scenes, character identities) and semantic levels (e.g., general, specific, abstract) in LRMovieNet, we provide some training samples in Fig. C.2. These samples also illustrate how label relevance categories and partial order label pairs are annotated. The annotated label relevance categories in the source domain are used to train the relevance ranking base model in the first stage. Subsequently, the partial order label pairs in the target domain (along with label pairs augmented from the source domain) are employed to train the reward model in the second stage. This reward model then guides the joint training of LR2PPO in the third stage.

(a) Source Domain
(b) Target Domain
Figure C.2: Annotated training samples in source and target domains. The red, blue and green labels listed in the upper subfigure represent low, medium and high ground-truth relevance in the source domain, respectively. For each label pair in the lower subfigure, the left label is more relevant than the right one in accordance with the video episode context (i.e., descriptions and frames). Best viewed in color and zoomed in.

C.2 Metrics

In this paper, we evaluate the performance of our label relevance ranking algorithm using the Normalized Discounted Cumulative Gain (NDCG) metric, which is widely adopted in information retrieval and recommender systems.

To better comprehend the NDCG metric, let’s first introduce the concept of relevance scores. These scores, denoted as $rel_i$, are assigned to the item (label) at position $i$. Based on their relevance levels, the manually annotated labels in the test set are assigned these scores: labels with high, medium, and low relevance are given scores of 2, 1, and 0, respectively. Next, we define the Discounted Cumulative Gain (DCG) at position $k$ as follows:

$$\text{DCG}@k=\sum_{i=1}^{k}\frac{2^{rel_i}-1}{\log_2(i+1)},$$ (C.1)

where $rel_i$ is the relevance score of the item at position $i$. The DCG metric measures the gain of a ranking algorithm by considering both the relevance of the items and their positions in the ranking.

To obtain the Ideal Discounted Cumulative Gain (IDCG) at position $k$, we first rank the items by their relevance scores in descending order and then calculate the DCG for this ideal ranking:

$$\text{IDCG}@k=\sum_{i=1}^{k}\frac{2^{rel_i^*}-1}{\log_2(i+1)},$$ (C.2)

where $rel_i^*$ is the relevance score of the item (label) at position $i$ in the ideal ranking. The IDCG represents the maximum possible DCG for a given query.

Finally, we compute the Normalized Discounted Cumulative Gain (NDCG) at position $k$ by dividing the DCG by the IDCG to ensure that it lies between 0 and 1:

$$\text{NDCG}@k=\frac{\text{DCG}@k}{\text{IDCG}@k}.$$ (C.3)

A value of 1 indicates a perfect ranking, while a value of 0 indicates the worst possible ranking. The normalization also allows for the comparison of NDCG values across different video clips, as it accounts for the varying number of labels.
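
The three definitions in Eq. (C.1) to Eq. (C.3) translate directly into a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, the input is the list of ground-truth relevance scores (2, 1 or 0) in the order predicted by the model, and returning 0 when the IDCG is zero is a convention assumed here for clips without any relevant label.

```python
import math


def dcg_at_k(rels, k):
    """Eq. (C.1): gain 2^rel - 1, discounted by log2(position + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(rels[:k], start=1)
    )


def ndcg_at_k(rels, k):
    """Eq. (C.3): DCG of the predicted order divided by the ideal DCG."""
    ideal = sorted(rels, reverse=True)  # ideal ranking used in Eq. (C.2)
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0  # assumed convention when no label is relevant
    return dcg_at_k(rels, k) / idcg


# Relevance scores of the labels of one clip, in predicted rank order.
predicted_order = [1, 2, 0, 0, 2]
score = ndcg_at_k(predicted_order, k=3)
```

Note that the ideal ranking is built from all labels of the clip before truncating to the top $k$, so a model that pushes a highly relevant label below position $k$ is penalized.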

C.3 Implementation Details

We leverage published Vision Transformer [17] and Roberta [39] weights to initialize the parameters of the vision encoder and the language encoder, respectively. The parameters of the two encoders are fixed during all three training stages. The coefficients $c_1$ and $c_2$ in Eq. (11) are set to $1$ and $1\times 10^{-3}$, respectively. The coefficient $c_3$ of the KL penalty in Eq. (11) is set to $1\times 10^{-3}$. In our implementation, the KL penalty is subtracted from the reward, instead of being directly included in the joint loss. In this way, the policy is still updated based on the expected return, but the return is adjusted based on the policy change. This allows the algorithm to explore more freely while still being penalized for large policy changes, which may lead to more effective exploration and higher overall returns. $\gamma$ in the definition of $V_t^{target}$ is set to $0$. $\beta$ in Eq. (3) is set to $0.3$. Margin $m_R$ in Eq. (4) and margin $m$ in Eq. (7) are both set to $1$, and $\delta$ in Eq. (8) is set to $-0.1$.
The AdamW optimizer is adopted for all optimized networks in the three training stages, with a learning rate of $2\times 10^{-5}$ in the first and second training stages and $1\times 10^{-3}$ in the third training stage. We train each of the first two stages for 15 epochs. In Algorithm B.1, $T$ is set to $1$, $N_{\text{Trajs}}$ is set to $200$, $K$ is set to $1$, $N_{\text{Iters}}$ is set to $412$, and $M$ is set to $24$. During the training of stage 2, only 10% of the ordered annotation pairs are used. In stage 3, 40% of randomly sampled pairs without annotation are utilized to train the LR2PPO framework, with the guidance of the reward model trained in stage 2. We use four V100 GPUs, each with 32GB of memory, to train the models. In comparison experiments, we utilize “There is a {label} in the scene” as the textual prompt for both CLIP [47] and MKT [23].
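
The reward shaping described above, where the KL penalty is subtracted from the reward rather than added to the joint loss, can be sketched as follows. This is a hedged illustration with hypothetical function names, not the released code. The value target is written here as a discounted return, which is an assumption about the form of $V_t^{target}$; with $\gamma = 0$, as used in the paper, any standard discounted target reduces to the immediate shaped reward.

```python
def shaped_rewards(rewards, kl_per_step, c3=1e-3):
    """Subtract the weighted per-step KL penalty from the reward-model reward."""
    return [r - c3 * kl for r, kl in zip(rewards, kl_per_step)]


def value_targets(rewards, gamma=0.0):
    """Discounted return per step (assumed form of the value target)."""
    targets = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        targets.append(running)
    return targets[::-1]


# With gamma = 0 the target at each step equals the shaped reward itself.
rewards = shaped_rewards([1.0, 0.5], kl_per_step=[2.0, 0.0])
targets = value_targets(rewards)
```

Folding the penalty into the reward means that the advantage estimates already reflect how far the policy has drifted from $\pi_{\theta_{old}}$, so no separate KL term has to be backpropagated.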

C.4 More Ablation Studies about LR2PPO

Classification and regression. The relevance between each label and the multimodal input is annotated as high, medium or low (with scores 2, 1 and 0, respectively). We adopt regression instead of classification for better performance and convenience in the later stages. The results of classification in stage 1 (S1-CLS) are listed in Tab. C.2. We attribute the performance gap between classification (S1-CLS) and regression (S1) to the need to produce a full ranking in this task. The predicted classification logits need to be weighted to form the final relevance score, which hinders performance. Regression scores are therefore more suitable for ranking than classification logits.
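
To make the weighting step concrete, the sketch below turns three-way classification logits into a single relevance score that can be used for ranking. The exact weighting used for S1-CLS is not specified here, so the softmax-expected score is an assumption, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_relevance(logits, class_scores=(0.0, 1.0, 2.0)):
    """Probability-weighted score over the low/medium/high classes (assumed scheme)."""
    probs = softmax(logits)
    return sum(p * s for p, s in zip(probs, class_scores))


# Rank labels of one clip by their weighted score, highest first.
label_logits = {"car": [-2.0, 0.0, 3.0], "tree": [2.0, 0.5, -1.0]}
ranking = sorted(label_logits, key=lambda l: expected_relevance(label_logits[l]), reverse=True)
```

A regression head outputs such a scalar directly, which avoids choosing a weighting scheme on top of the class probabilities.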

Results w/o stage 1. The results of omitting stage 1 (w/o S1) are listed in Tab. C.2. LR2PPO achieves better performance since it starts learning from the model pretrained on the source domain, whose parameters are better aligned with the ranking task, while LR2PPO (w/o S1) starts from the officially pretrained ViT [17] and Roberta [39].

| Method | NDCG@1 | NDCG@3 | NDCG@5 | NDCG@10 | NDCG@20 |
| --- | --- | --- | --- | --- | --- |
| LR2PPO (S1-CLS) | 0.6077 | 0.5988 | 0.5998 | 0.6601 | 0.7981 |
| LR2PPO (S1) | 0.6330 | 0.6018 | 0.6061 | 0.6667 | 0.8021 |
| LR2PPO (w/o S1) | 0.6750 | 0.6583 | 0.6781 | 0.7529 | 0.8432 |
| LR2PPO | 0.6820 | 0.6714 | 0.6869 | 0.7628 | 0.8475 |

Table C.2: More ablation studies on LRMovieNet.

C.5 More Hyper-parameters Sensitivity Analysis

Here, we conduct experimental analysis on the hyperparameters of the LR2PPO algorithm, namely margin $m_R$ in reward model training, margin $m$ and hyperparameter $c_1$ in joint training of LR2PPO. (See Eq. (4), Eq. (7), and Eq. (11) in the main paper.) The results are shown in Fig. C.3. It can be observed that the final performance of LR2PPO is not sensitive to the variations of these hyperparameters, indicating the stability of our method.

Figure C.3: Results with different margins $m_R$, $m$, and hyperparameter $c_1$.

C.6 Training Curves

To better illustrate the changes during the training process of the LR2PPO algorithm, we plot the training curves of various variables against iterations. These are displayed in Fig. C.4. Fig. C.4(a) to Fig. C.4(m) show the training curves for the third stage, while Fig. C.4(n) and Fig. C.4(o) correspond to the training curves for the second and first stages, respectively. The reward and value curves, along with the NDCG curves in the third training stage, all exhibit an upward trend, while the policy loss and value loss show a downward trend. These observations support the effectiveness of the LR2PPO learning process.

(a) Reward
(b) Value
(c) Old value
(d) Advantage
(e) Policy loss
(f) Value loss
(g) Partial order ratio
(h) Entropy bonus
(i) KL penalty
(j) Reward - KL penalty
(k) NDCG@3
(l) NDCG@5
(m) NDCG@20
(n) Stage 2 reward accuracy
(o) Stage 1 NDCG curve
Figure C.4: Training curves of LR2PPO.