License: CC BY 4.0
arXiv:2312.15300v1 [cs.CV] 23 Dec 2023

Q-Boost: On Visual Quality Assessment Ability of Low-level Multi-Modality Foundation Models

Abstract

Recent advancements in Multi-modality Large Language Models (MLLMs) have demonstrated remarkable capabilities in complex high-level vision tasks. However, the potential of MLLMs in visual quality assessment, a vital aspect of low-level vision, remains largely unexplored. To address this gap, we introduce Q-Boost, a novel strategy designed to enhance low-level MLLMs in image quality assessment (IQA) and video quality assessment (VQA) tasks, which is structured around two pivotal components: 1) Triadic-Tone Integration: Ordinary prompt design simply oscillates between the binary extremes of positive and negative. Q-Boost innovates by incorporating a ‘middle ground’ approach through neutral prompts, allowing for a more balanced and detailed assessment. 2) Multi-Prompt Ensemble: Multiple quality-centric prompts are used to mitigate bias and obtain more accurate evaluations. The experimental results show that low-level MLLMs equipped with the Q-Boost strategy exhibit outstanding zero-shot performance on the IQA/VQA tasks.

Index Terms—  Multi-modality large language models, Image quality assessment, Triadic-tone integration, Multi-prompt ensemble, Video quality assessment, Zero-shot

1 Introduction

Multi-modality Large Language Models (MLLMs) have exhibited strong capabilities in general visual perception and understanding, which have been tested and confirmed in various vision-language tasks, including image captioning, visual question answering, cross-modality grounding, as well as standard vision tasks like image classification or segmentation. While the focus has largely been on high-level visual content perception and understanding, the proficiency of MLLMs in visual quality assessment remains less understood.

Fig. 1: Motivation of Q-Boost. The Triadic-Tone Integration strategy helps provide a more balanced and comprehensive assessment while the Multi-Prompt Ensemble strategy helps improve the accuracy and reliability of evaluation.

Over the past decade, a significant amount of research has been dedicated to exploring image quality assessment (IQA) and video quality assessment (VQA). In the early stages, researchers extracted handcrafted, quality-aware features from images or videos and then regressed these features to derive quality scores [1, 2, 3]. Subsequently, the advent and success of deep neural networks (DNNs) paved the way for the use of deep-learning models to extract and regress quality features into corresponding scores [4, 5, 6, 7, 8, 9]. Regardless of the extraction technique, these quality assessment methodologies predominantly depend on mean opinion scores (MOSs). The fundamental learning approach involves minimizing the discrepancy between the scores regressed from these features and the MOSs to acquire knowledge for quality assessment. While MOS regression is straightforward and enables models to directly assess quality, it suffers from a lack of interpretability. In contrast, a text-supervised MLLM can output quality assessment results in natural language and explain them, which more closely resembles human perception. Given the rapid advancements in MLLMs, integrating them into quality assessment tasks is not only meaningful but also expands the horizon beyond traditional MOS regression. This approach makes it possible to train models through text supervision to gain quality knowledge and to design prompts that let an MLLM predict quality scores effectively.

Fig. 2: The framework of the Q-Boost. The image and prompt are fed into the MLLM, where the log probabilities (logits) are computed between the [SCORE_TOKEN] and triadic-tone words (including ensembles of multiple prompts). Then the logits of different tones are put through softmax operation and integrated into the zero-shot quality score with weighted average pooling.

Prompts like “Rate the quality of the image” and “Score the quality of the image from 1 to 5” are a natural way of eliciting feedback from MLLMs regarding image quality. Nevertheless, these strategies have been shown to exhibit bias [10], as responses from MLLMs are often positively skewed and extreme. A more reliable approach is to recast the IQA task as a binary classification problem over the probabilities that the MLLM responds to “Rate the quality of the image” with good or bad, where the probabilities are converted into quality scores via softmax. This softmax-based evaluation strategy has been used to benchmark the assessment abilities of various MLLMs [10] and has even revealed that MLLMs are effective zero-shot assessment assistants after tuning on a multi-modality low-level dataset [11].

To further enhance this methodology, we introduce Q-Boost, a novel advancement designed to significantly improve the performance of MLLMs on IQA [6, 4, 12] as well as IQA-based video quality assessment (VQA) [13, 14, 15, 16, 17]. Q-Boost consists of two strategies: 1) The traditional prompt design, which typically fluctuates between the binary extremes of positive and negative, neglects the fact that some images might be of medium quality. Q-Boost addresses this with Triadic-Tone Integration (TTI), which introduces neutral prompts such as average, medium, and acceptable. This not only enables a more balanced and comprehensive assessment but also enriches the evaluation process with greater detail, moving beyond the simplistic binary paradigm. 2) Q-Boost presents Multi-Prompt Ensemble (MPE) to enhance the accuracy of the assessment results. On the one hand, using multiple prompts expands the semantic scope, capturing nuances that a single word may overlook. On the other hand, it reduces ambiguity, since different terms carry slightly varied connotations, leading to a more reliable evaluation. The experimental results show that the MLLM with Q-Boost achieves state-of-the-art zero-shot performance on both IQA and VQA datasets.

| Example image | $\mathcal{P}_{log}(good)$ | $\mathcal{P}_{log}(average)$ | $\mathcal{P}_{log}(poor)$ | GT |
|---|---|---|---|---|
| Image 1 | 11.14 | 8.68 | 7.96 | 4.12/5 |
| Image 2 | 10.02 | 10.71 | 9.07 | 3.11/5 |
| Image 3 | 8.92 | 10.75 | 9.59 | 2.51/5 |
| Image 4 | 6.46 | 8.62 | 10.70 | 1.10/5 |
Table 1: Example log probabilities (logits) of good, average, and poor. It is evident that images of lower quality exhibit a stronger correlation with negative prompts, whereas those of higher quality demonstrate a greater alignment with positive prompts. Images classified as being of medium quality tend to show increased neutral logits.

2 The Q-Boost

The assessment framework of Q-Boost is illustrated in Fig. 2. First, we describe how the log probabilities (logits) of certain prompt words are obtained from the MLLM. The image, along with the prompt “Rate the quality of the image. The quality of this image is [SCORE_TOKEN]”, is input into the MLLM, and the logits are derived as the probability distance between [SCORE_TOKEN] and certain prompt words such as good and bad:

\[ \mathcal{P}_{log}(\mathcal{W}) = \mathrm{MLLM}(\mathcal{T}, \mathcal{I}), \tag{1} \]

where $\mathcal{P}_{log}(\mathcal{W})$ represents the log probabilities of certain prompt words, $\mathrm{MLLM}(\cdot)$ indicates the MLLM assessment operation, $\mathcal{T}$ stands for the input prompt text, and $\mathcal{I}$ denotes the input image.
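As a rough illustration of Eq. (1), the sketch below picks out the logits of candidate rating words from the model's output at the [SCORE_TOKEN] position. It is not the authors' implementation: the function `select_word_logits`, the toy vocabulary, and the toy logit vector are hypothetical stand-ins for a real MLLM forward pass, and the sketch assumes that each rating word maps to a single token id.

```python
def select_word_logits(score_token_logits, word_to_token_id, words):
    """Return the logit of each candidate word at the [SCORE_TOKEN] position.

    score_token_logits: sequence of logits over the vocabulary (hypothetical
        output of one MLLM forward pass on the prompt text and the image).
    word_to_token_id: mapping from word to vocabulary index (assumes that
        every rating word is a single token).
    """
    return {w: score_token_logits[word_to_token_id[w]] for w in words}


# Toy stand-ins for a real tokenizer vocabulary and a real model output.
toy_vocab = {"good": 0, "average": 1, "poor": 2, "cat": 3}
toy_logits = [11.14, 8.68, 7.96, -3.20]

selected = select_word_logits(toy_logits, toy_vocab, ["good", "poor"])
print(selected)
```

Because all candidate words are read from the same output vector, adding more words in this sketch does not require another forward pass.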

2.1 Triadic-Tone Integration (TTI)

In many text-prompted quality assessment works [12, 18], only the binary extremes (positive and negative) are taken into consideration, which makes it difficult to assess images of medium quality, as shown in Table 1. Therefore, we introduce a third, neutral tone, with words such as average, medium, and acceptable, to provide a more balanced and comprehensive evaluation:

\[ \mathcal{Q}_{pos}, \mathcal{Q}_{neu}, \mathcal{Q}_{neg} = \gamma(\bar{\mathcal{P}}_{log}(\mathcal{W}_{pos}), \bar{\mathcal{P}}_{log}(\mathcal{W}_{neu}), \bar{\mathcal{P}}_{log}(\mathcal{W}_{neg})), \tag{2} \]

where $\bar{\mathcal{P}}_{log}(\mathcal{W}_{pos})$, $\bar{\mathcal{P}}_{log}(\mathcal{W}_{neu})$, and $\bar{\mathcal{P}}_{log}(\mathcal{W}_{neg})$ represent the averaged logits of the positive, neutral, and negative prompt words (see Sec. 2.2), and $\gamma(\cdot)$ indicates the softmax function. The integrated quality score is then obtained as follows:

\[ \mathcal{Q} = w_{1}\mathcal{Q}_{pos} + w_{2}\mathcal{Q}_{neu}, \tag{3} \]

where $w_{1}=1.0$ and $w_{2}=0.5$ are the weighting parameters for $\mathcal{Q}_{pos}$ and $\mathcal{Q}_{neu}$.
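The softmax and weighting steps of Eqs. (2) and (3) can be sketched in a few lines. The code below is a minimal illustration rather than the authors' code: the helper names `softmax` and `tti_score` are made up for this example, and the input logits are the four example triplets listed in Table 1.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of logits."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def tti_score(logit_pos, logit_neu, logit_neg, w1=1.0, w2=0.5):
    """Eq. (2): softmax over the three tones; Eq. (3): weighted sum."""
    q_pos, q_neu, _q_neg = softmax([logit_pos, logit_neu, logit_neg])
    return w1 * q_pos + w2 * q_neu


# (good, average, poor) logits of the four example images in Table 1.
table1_logits = [
    (11.14, 8.68, 7.96),   # GT 4.12/5
    (10.02, 10.71, 9.07),  # GT 3.11/5
    (8.92, 10.75, 9.59),   # GT 2.51/5
    (6.46, 8.62, 10.70),   # GT 1.10/5
]
scores = [tti_score(*row) for row in table1_logits]
print([round(s, 3) for s in scores])
```

With these inputs the scores come out at roughly 0.93, 0.59, 0.45, and 0.07, which follows the same order as the ground-truth ratings. Since $w_{1}=1.0$ and $w_{2}=0.5$ and the softmax outputs sum to one, the score stays within [0, 1].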

2.2 Multi-Prompt Ensemble (MPE)

Generally speaking, using multiple prompt words broadens the semantic range, capturing subtleties that a single word might miss. It also reduces ambiguity, since different terms have slightly different implications, resulting in a more dependable evaluation. Therefore, we propose to use a group of synonyms instead of a single word to describe each tone:

\[ \bar{\mathcal{P}}_{log}(\mathcal{W}_{\alpha}) = \frac{1}{n_{\alpha}} \sum \mathcal{P}_{log}(\mathcal{W}_{\alpha}), \tag{4} \]

where $\alpha \in \{pos, neu, neg\}$, $\bar{\mathcal{P}}_{log}(\mathcal{W}_{\alpha})$ represents the averaged logits of tone $\alpha$, and $n_{\alpha}$ stands for the number of words used to describe tone $\alpha$.
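The averaging in Eq. (4) might look like the following sketch. The word groups are the ones listed in the experimental setup (Sec. 3.1.1); the function name `ensemble_logit` and the per-word logit values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def ensemble_logit(word_logits, words):
    """Eq. (4): mean logit over the synonyms that describe one tone."""
    if not words:
        raise ValueError("each tone needs at least one word")
    return sum(word_logits[w] for w in words) / len(words)


# Word groups from the experimental setup (Sec. 3.1.1).
TONE_WORDS = {
    "pos": ["good", "high", "fine"],
    "neu": ["average", "medium", "acceptable"],
    "neg": ["poor", "low", "bad"],
}

# Hypothetical per-word logits at the [SCORE_TOKEN] position.
toy_word_logits = {
    "good": 11.0, "high": 10.0, "fine": 9.0,
    "average": 8.5, "medium": 8.0, "acceptable": 7.5,
    "poor": 7.0, "low": 6.5, "bad": 6.0,
}

averaged = {
    tone: ensemble_logit(toy_word_logits, words)
    for tone, words in TONE_WORDS.items()
}
print(averaged)
```

The three averaged logits are what the softmax in Eq. (2) takes as input.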

2.3 Video Evaluation

The process described above directly yields the quality score of a single image. For videos, we extract one frame per second and average the frames’ logits to obtain the video logits:

\[ \bar{\mathcal{P}}_{log}^{v}(\mathcal{W}_{\alpha}) = \frac{1}{n_{v}} \sum_{i=1}^{n_{v}} \bar{\mathcal{P}}_{log}^{i}(\mathcal{W}_{\alpha}), \tag{5} \]

where $\alpha \in \{pos, neu, neg\}$, $n_{v}$ indicates the number of sampled frames (the length of the video in seconds), $\bar{\mathcal{P}}_{log}^{i}(\mathcal{W}_{\alpha})$ represents the $i$-th frame’s logits, and $\bar{\mathcal{P}}_{log}^{v}(\mathcal{W}_{\alpha})$ stands for the video’s logits. The video quality score is then calculated with the same integration process (Eqs. 2 and 3) described above.
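A minimal sketch of the video pipeline, under the assumption that the per-frame tone logits are already available: the frame logits are averaged as in Eq. (5), then passed through the softmax and weighting of Eqs. (2) and (3). The function name `video_score` and the toy frame values are illustrative only.

```python
import math


def video_score(frame_logits, w1=1.0, w2=0.5):
    """Average per-frame (pos, neu, neg) logits, then apply Eqs. (2)-(3).

    frame_logits: list of (pos, neu, neg) tuples, one per sampled frame.
    """
    if not frame_logits:
        raise ValueError("at least one frame is required")
    n_v = len(frame_logits)
    # Eq. (5): mean logit per tone over the sampled frames.
    video_logits = [sum(f[k] for f in frame_logits) / n_v for k in range(3)]
    # Eq. (2): numerically stable softmax over the three tones.
    m = max(video_logits)
    exps = [math.exp(v - m) for v in video_logits]
    total = sum(exps)
    q_pos, q_neu, _q_neg = (e / total for e in exps)
    # Eq. (3): weighted sum of the positive and neutral probabilities.
    return w1 * q_pos + w2 * q_neu


# Toy frames whose tone logits average to (9, 9, 9).
toy_frames = [(10.0, 9.0, 8.0), (9.0, 9.0, 9.0), (8.0, 9.0, 10.0)]
print(video_score(toy_frames))
```

In this toy case the averaged logits are equal across tones, so each tone gets probability 1/3 and the score is 1/3 + 0.5 × 1/3 = 0.5. A one-frame video reduces to the single-image case.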

Table 2: Review of the IQA/VQA datasets.

| Type | Dataset | Num. | Content |
|---|---|---|---|
| IQA | KonIQ-10k [19] | 10,741 | In-the-wild |
| IQA | KADID-10K [20] | 10,125 | Artificially Distorted |
| IQA | CGIQA-6K [21] | 6,000 | Computer Graphics Images |
| VQA | LIVE-VQC [22] | 585 | In-the-wild |
| VQA | KoNViD-1k [23] | 1,200 | In-the-wild |
| VQA | YouTube-UGC [24] | 1,380 | In-the-wild |
| VQA | MaxWell (val) [14] | 909 | In-the-wild |

Table 3: Results on the IQA datasets. Both mPO-7B (Q-Instruct) and mPO-7B (Q-Boost) utilize the Q-Instruct [11]-tuned weights. mPO-7B (Q-Instruct) is validated without the Q-Boost strategy. About 48.92% of the KonIQ-10k images are included in Q-Instruct, and 0% of the images of the other datasets. Best zero-shot performance is marked in bold.

| Model | KonIQ-10k SRCC | KonIQ-10k PLCC | KADID-10K SRCC | KADID-10K PLCC | CGIQA-6K SRCC | CGIQA-6K PLCC |
|---|---|---|---|---|---|---|
| Opinion-Aware Approaches | | | | | | |
| DBCNN [4] | 0.884 | 0.875 | 0.856 | 0.851 | 0.662 | 0.664 |
| HyperIQA [5] | 0.917 | 0.906 | 0.845 | 0.852 | 0.636 | 0.660 |
| MUSIQ [6] | 0.915 | 0.937 | 0.572 | 0.584 | 0.625 | 0.641 |
| Zero-shot Approaches | | | | | | |
| NIQE [1] | 0.316 | 0.377 | 0.374 | 0.428 | 0.075 | 0.056 |
| IL-NIQE [2] | 0.537 | 0.523 | 0.558 | 0.534 | 0.108 | 0.082 |
| CLIP-IQA [12] | 0.468 | 0.505 | 0.501 | 0.520 | 0.294 | 0.300 |
| mPO-7B (Q-Instruct [11]) | 0.902 | 0.888 | 0.698 | 0.684 | 0.629 | 0.644 |
| mPO-7B (Q-Boost) | 0.912 | 0.923 | 0.702 | 0.699 | 0.630 | 0.648 |

Table 4: Results on the VQA datasets. The best and second-best zero-shot performance are marked in bold and underline.

| Model | LIVE-VQC SRCC | LIVE-VQC PLCC | KoNViD-1k SRCC | KoNViD-1k PLCC | YouTube-UGC SRCC | YouTube-UGC PLCC | MaxWell (val) SRCC | MaxWell (val) PLCC |
|---|---|---|---|---|---|---|---|---|
| Opinion-Aware Approaches | | | | | | | | |
| TLVQM [25] | 0.799 | 0.803 | 0.773 | 0.768 | 0.669 | 0.659 | 0.661 | 0.652 |
| VSFA [7] | 0.773 | 0.795 | 0.773 | 0.775 | 0.724 | 0.743 | 0.676 | 0.678 |
| VIDEVAL [26] | 0.752 | 0.751 | 0.783 | 0.780 | 0.779 | 0.773 | 0.597 | 0.601 |
| Zero-shot Approaches | | | | | | | | |
| (Spatial, Classical) NIQE [1] | 0.596 | 0.628 | 0.541 | 0.553 | 0.278 | 0.290 | 0.312 | 0.301 |
| (Temporal, Classical) TPQI [27] | 0.636 | 0.645 | 0.556 | 0.549 | 0.111 | 0.218 | 0.361 | 0.377 |
| (CLIP-based) SAQI [18] | 0.629 | 0.638 | 0.608 | 0.602 | 0.585 | 0.606 | 0.541 | 0.559 |
| (Ensemble of Three Above) BVQI [18] | 0.784 | 0.794 | 0.760 | 0.760 | 0.525 | 0.556 | 0.637 | 0.648 |
| (Spatial) mPO-7B (Q-Instruct [11]) | 0.720 | 0.657 | 0.692 | 0.603 | 0.539 | 0.540 | 0.554 | 0.515 |
| (Spatial) mPO-7B (Q-Boost) | 0.741 | 0.793 | 0.801 | 0.803 | 0.723 | 0.705 | 0.682 | 0.692 |

2.4 Inference Cost

Q-Boost does not introduce additional computational complexity, as the entire computation takes place after the input prompt is entered and the language backbone returns the [SCORE_TOKEN]. The subsequent MPE simply involves tokenizing the words used and then calculating their logits on the [SCORE_TOKEN]. As the computational cost of the softmax and score weighting in TTI is much lower than that of MLLM inference, the additional computational cost of Q-Boost is almost negligible. This differs from CLIP-based QA methods [12, 18], in which each prompt word needs to pass through the text backbone independently. Thus, applying a similar multi-prompt strategy to CLIP-based QA methods would significantly increase inference costs.

3 Experiment

3.1 Implementation Details

3.1.1 Experimental Setup

The mPLUG-Owl-2-7B [28] (abbr. as mPO-7B) with the CLIP-ViT-Large-14 vision backbone and the LLaMA2-7B language backbone is used for validation. The mPO-7B weights are loaded from the instruction-tuning process of Q-Instruct [11] and frozen during inference. mPO-7B (Q-Instruct) denotes the Q-Instruct-tuned weights used without the Q-Boost strategy. The strategy-enhanced MLLM is denoted as mPO-7B (Q-Boost). Only the TTI strategy is employed for IQA, while both the TTI and MPE strategies are utilized for VQA (reasons are discussed in Sec. 3). In the MPE strategy, (good + high + fine ↔ average + medium + acceptable ↔ poor + low + bad) are used as the groups of words for describing the (positive ↔ neutral ↔ negative) tones. The weighting parameters $w_{1}$ and $w_{2}$ mentioned in Sec. 2.1 are set to 1 and 0.5, respectively.

[Figure with two panels: (a) SRCC, (b) PLCC]
Fig. 3: The SRCC (Fig. (a)) and PLCC (Fig. (b)) performance comparison of the best zero-shot competitor, mPO-7B (Q-Instruct), and mPO-7B (Q-Boost), where the index values are calculated as $\frac{SRCC}{SRCC_{max}}$ and $\frac{PLCC}{PLCC_{max}}$. It can be seen that mPO-7B (Q-Boost) achieves the best performance in general and significantly boosts the performance of mPO-7B (Q-Instruct) on the VQA datasets.

Table 5: Experimental results for the ablation study. good ↔ / ↔ poor indicates using only the binary extreme prompts good and poor; good ↔ average ↔ poor represents using Triadic-Tone Integration without Multi-Prompt Ensemble; g+h+f ↔ a+m+ac ↔ p+l+b stands for utilizing the prompts (good+high+fine ↔ average+medium+acceptable ↔ poor+low+bad), which employs both strategies. The first three datasets are IQA datasets and the last four are VQA datasets.

| Prompts | KonIQ-10k SRCC | KonIQ-10k PLCC | KADID-10K SRCC | KADID-10K PLCC | CGIQA-6K SRCC | CGIQA-6K PLCC | LIVE-VQC SRCC | LIVE-VQC PLCC | KoNViD-1k SRCC | KoNViD-1k PLCC | YouTube-UGC SRCC | YouTube-UGC PLCC | MaxWell (val) SRCC | MaxWell (val) PLCC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| good ↔ / ↔ poor | 0.902 | 0.888 | 0.698 | 0.684 | 0.629 | 0.644 | 0.720 | 0.657 | 0.692 | 0.603 | 0.539 | 0.539 | 0.554 | 0.516 |
| good ↔ average ↔ poor | 0.912 | 0.923 | 0.702 | 0.699 | 0.630 | 0.648 | 0.740 | 0.725 | 0.766 | 0.701 | 0.577 | 0.583 | 0.611 | 0.589 |
| g+h+f ↔ a+m+ac ↔ p+l+b | 0.898 | 0.910 | 0.704 | 0.678 | 0.558 | 0.577 | 0.741 | 0.793 | 0.801 | 0.803 | 0.723 | 0.705 | 0.682 | 0.692 |

3.1.2 Datasets Introduction

Three IQA datasets and four VQA datasets are utilized for performance comparison; they are briefly reviewed in Table 2. About 48.92% of the images in the KonIQ-10k [19] dataset are included in the Q-Instruct [11] tuning stage, and no images from the remaining datasets are seen by mPO-7B (Q-Instruct & Q-Boost). Additionally, the CGIQA-6K [21] dataset contains two separate subsets, consisting of 3,000 game images and 3,000 movie images respectively, and we report the average performance over the two subsets.

3.1.3 IQA/VQA Competitors

Both opinion-aware methods (which require supervised training) and zero-shot methods (training-free) are included for comparison.

  • The opinion-aware IQA methods consist of DBCNN [4], HyperIQA [5], and MUSIQ [6].
  • The zero-shot IQA methods contain NIQE [1], IL-NIQE [2], and CLIP-IQA [12].
  • The opinion-aware VQA methods include TLVQM [25], VSFA [7], and VIDEVAL [26].
  • The zero-shot VQA methods contain NIQE [1], TPQI [27], and SAQI [18].