Q-Boost: On Visual Quality Assessment Ability of Low-level Multi-Modality Foundation Models
Abstract
Recent advancements in Multi-modality Large Language Models (MLLMs) have demonstrated remarkable capabilities in complex high-level vision tasks. However, the potential of MLLMs in visual quality assessment, a vital aspect of low-level vision, remains largely unexplored. To address this gap, we introduce Q-Boost, a novel strategy designed to enhance low-level MLLMs in image quality assessment (IQA) and video quality assessment (VQA) tasks. It is structured around two pivotal components: 1) Triadic-Tone Integration: Ordinary prompt design simply oscillates between the binary extremes of positive and negative. Q-Boost innovates by incorporating a ‘middle ground’ approach through neutral prompts, allowing for a more balanced and detailed assessment. 2) Multi-Prompt Ensemble: Multiple quality-centric prompts are used to mitigate bias and obtain more accurate evaluations. The experimental results show that low-level MLLMs equipped with the Q-Boost strategy exhibit outstanding zero-shot performance on IQA/VQA tasks.
Index Terms— Multi-modality large language models, Image quality assessment, Triadic-tone integration, Multi-prompt ensemble, Video quality assessment, Zero-shot
1 Introduction
Multi-modality Large Language Models (MLLMs) have exhibited strong capabilities in general visual perception and understanding, which have been tested and confirmed in various vision-language tasks, including image captioning, visual question answering, cross-modality grounding, as well as standard vision tasks like image classification or segmentation. While the focus has largely been on high-level visual content perception and understanding, the proficiency of MLLMs in visual quality assessment remains less understood.
Over the past decade, a significant amount of research has been dedicated to image quality assessment (IQA) and video quality assessment (VQA). In the initial phases, researchers extracted handcrafted, quality-aware features from images or videos and regressed these features to derive quality scores [1, 2, 3]. Subsequently, the success of deep neural networks (DNNs) paved the way for deep-learning models that extract and regress quality features into corresponding scores [4, 5, 6, 7, 8, 9]. Regardless of the extraction technique, these quality assessment methodologies predominantly depend on mean opinion scores (MOSs). The fundamental learning approach is to minimize the discrepancy between the regressed features and the MOSs in order to acquire knowledge for quality assessment. While MOS regression is straightforward and enables models to assess quality directly, it suffers from a lack of interpretability. In contrast, a text-supervised MLLM can output quality assessment results in natural language and explain them, which operates more like human perception. Given the rapid advancements in MLLMs, integrating them into quality assessment tasks is not only meaningful but also expands the horizon beyond traditional MOS regression. This approach makes it possible to train models through text supervision to gain quality knowledge and to design prompts with which an MLLM predicts quality scores effectively.
Prompts such as “Rate the quality of the image” and “Score the quality of the image from 1 to 5” are a natural way to elicit feedback from MLLMs regarding image quality. Nevertheless, these strategies have been shown to exhibit bias [10], as responses from MLLMs are often positively skewed and extreme. A preferable alternative is to recast the IQA task as a binary classification problem over the probabilities with which the MLLM answers “Rate the quality of the image” with good or bad, and to convert these probabilities into a quality score via softmax. This softmax-based evaluation strategy has been used to benchmark the assessment abilities of various MLLMs [10] and has even revealed that MLLMs are effective zero-shot assessment assistants after multi-modality low-level dataset tuning [11].
To further enhance this methodology, we introduce Q-Boost, a strategy designed to significantly improve the performance of MLLMs on IQA [6, 4, 12] as well as IQA-based video quality assessment (VQA) [13, 14, 15, 16, 17]. Q-Boost consists of two strategies: 1) The traditional prompt design, which typically fluctuates between the binary extremes of positive and negative, neglects the fact that some images might be of medium quality. Q-Boost addresses this with Triadic-Tone Integration (TTI), which introduces neutral prompts such as average, medium, and acceptable. This not only enables a more balanced and comprehensive assessment but also enriches the evaluation process with greater detail, moving beyond the simplistic binary paradigm. 2) Q-Boost presents Multi-Prompt Ensemble (MPE) to enhance the accuracy of the assessment results. On one hand, using multiple prompts expands the semantic scope, capturing nuances that a single word may overlook. On the other hand, it reduces ambiguity, since different terms carry slightly varied connotations, leading to more reliable evaluation. The experimental results show that the MLLM with Q-Boost achieves state-of-the-art zero-shot performance on both IQA and VQA datasets.
Positive | Neutral | Negative | GT |
---|---|---|---|
11.14 | 8.68 | 7.96 | 4.12/5 |
10.02 | 10.71 | 9.07 | 3.11/5 |
8.92 | 10.75 | 9.59 | 2.51/5 |
6.46 | 8.62 | 10.70 | 1.10/5 |
2 The Q-Boost
The assessment framework of the MLLM (Q-Boost) is illustrated in Fig. 2. First, we describe the process of obtaining the log probabilities (logits) of certain prompt words from the MLLM. The image, along with the prompt “Rate the quality of the image. The quality of this image is [SCORE_TOKEN]”, is input into the MLLM, and the logits of certain prompt words such as good and bad are derived from the model's prediction at [SCORE_TOKEN]:
$$x = \mathcal{M}(T, I) \tag{1}$$
where $x$ represents the log probabilities of certain prompt words, $\mathcal{M}(\cdot)$ indicates the MLLM assessment operation, $T$ stands for the input prompt text, and $I$ denotes the input image.
2.1 Triadic-Tone Integration (TTI)
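In many text-prompted quality assessment works [12, 18], only the binary extremes (positive and negative) are taken into consideration, which makes it difficult to distinguish images of medium quality, as shown in Table 1. Therefore, we introduce a third, neutral tone, described by words such as average, medium, and acceptable, to provide a more balanced and comprehensive evaluation:
$$(p_{pos},\ p_{neu},\ p_{neg}) = \sigma(\bar{x}_{pos},\ \bar{x}_{neu},\ \bar{x}_{neg}) \tag{2}$$
where $\bar{x}_{pos}$, $\bar{x}_{neu}$, and $\bar{x}_{neg}$ represent the averaged logits of the positive, neutral, and negative prompt words (see Sec. 2.2), $\sigma(\cdot)$ indicates the softmax function, and $p_{pos}$, $p_{neu}$, and $p_{neg}$ are the resulting probabilities. Then the integrated quality score $Q$ can be obtained as follows:
$$Q = \alpha\, p_{pos} + \beta\, p_{neu} \tag{3}$$
where $\alpha$ and $\beta$ are the weighting parameters for $p_{pos}$ and $p_{neu}$.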
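The softmax-and-weighting step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the weights 1 and 0.5 are the values reported in the experimental setup, and the example triples are the Table 1 logits under the assumption that each triple is ordered positive, neutral, negative.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def tti_score(pos_logit, neu_logit, neg_logit, alpha=1.0, beta=0.5):
    """Softmax over the three tone logits, then weight the positive and
    neutral probabilities (the negative probability gets weight 0)."""
    p_pos, p_neu, _p_neg = softmax([pos_logit, neu_logit, neg_logit])
    return alpha * p_pos + beta * p_neu

# Logit triples taken from Table 1; the (positive, neutral, negative)
# ordering is an assumption made for this sketch.
examples = [
    (11.14, 8.68, 7.96),   # GT 4.12/5
    (10.02, 10.71, 9.07),  # GT 3.11/5
    (8.92, 10.75, 9.59),   # GT 2.51/5
    (6.46, 8.62, 10.70),   # GT 1.10/5
]
scores = [tti_score(*triple) for triple in examples]
```

With these settings the score lies in [0, 1], and an image whose neutral logit dominates lands near the middle of the range instead of being pushed toward either extreme.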
2.2 Multi-Prompt Ensemble (MPE)
Generally speaking, using multiple prompt words widens the semantic range, capturing subtleties that a single word might miss. It also reduces ambiguity, since different terms have slightly different implications, which results in a more dependable evaluation. Therefore, we propose to use a group of synonyms instead of a single word to describe each tone:
$$\bar{x}_{t} = \frac{1}{N_t} \sum_{i=1}^{N_t} x_{t,i} \tag{4}$$
where $t \in \{pos, neu, neg\}$, $\bar{x}_{t}$ represents the averaged logits of tone $t$, $x_{t,i}$ is the logit of the $i$-th word describing tone $t$, and $N_t$ stands for the number of words describing tone $t$.
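As a rough sketch of this per-tone averaging, the snippet below groups word logits by tone and takes their mean. The word groups are those listed in the experimental setup; the helper name and the logit values are made up purely for illustration.

```python
def ensemble_logits(word_logits, groups):
    """Average the logits of all synonym words within each tone group."""
    return {
        tone: sum(word_logits[w] for w in words) / len(words)
        for tone, words in groups.items()
    }

# Synonym groups used for the three tones.
groups = {
    "pos": ["good", "high", "fine"],
    "neu": ["average", "medium", "acceptable"],
    "neg": ["poor", "low", "bad"],
}

# Illustrative logits only; real values come from the MLLM.
word_logits = {
    "good": 11.0, "high": 10.0, "fine": 9.0,
    "average": 9.0, "medium": 8.0, "acceptable": 7.0,
    "poor": 8.0, "low": 7.0, "bad": 6.0,
}
tone_logits = ensemble_logits(word_logits, groups)
```

The resulting three averaged logits are what the softmax of the triadic-tone step would consume.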
2.3 Video Evaluation
The process described above directly yields the quality score of a single image. For videos, we extract one frame per second and average the frames’ logits to obtain the video logits:
$$x^{v} = \frac{1}{n} \sum_{k=1}^{n} x^{f}_{k} \tag{5}$$
where $k \in \{1, \dots, n\}$, $n$ indicates the length of the video in seconds, $x^{f}_{k}$ represents the $k$-th frame’s logits, and $x^{v}$ stands for the video’s logits. The video quality can then be calculated with the same integration process described above.
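The frame-averaging step can be sketched as follows, assuming each sampled frame has already been reduced to a fixed-length list of tone logits (for example positive, neutral, negative). The function name and the sample values are hypothetical.

```python
def video_logits(frame_logits):
    """Element-wise mean of per-frame logit vectors (one frame per second)."""
    if not frame_logits:
        raise ValueError("at least one frame is required")
    n = len(frame_logits)
    dim = len(frame_logits[0])
    return [sum(frame[j] for frame in frame_logits) / n for j in range(dim)]

# Two illustrative frames, each with (positive, neutral, negative) logits.
frames = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
averaged = video_logits(frames)
```

Because the averaging happens on logits rather than on final scores, the softmax and weighting are applied only once per video.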
Model | KonIQ-10k SRCC | KonIQ-10k PLCC | KADID-10K SRCC | KADID-10K PLCC | CGIQA-6K SRCC | CGIQA-6K PLCC |
---|---|---|---|---|---|---|
Opinion-Aware Approaches | | | | | | |
DBCNN [4] | 0.884 | 0.875 | 0.856 | 0.851 | 0.662 | 0.664 |
HyperIQA [5] | 0.917 | 0.906 | 0.845 | 0.852 | 0.636 | 0.660 |
MUSIQ [6] | 0.915 | 0.937 | 0.572 | 0.584 | 0.625 | 0.641 |
Zero-shot Approaches | | | | | | |
NIQE [1] | 0.316 | 0.377 | 0.374 | 0.428 | 0.075 | 0.056 |
IL-NIQE [2] | 0.537 | 0.523 | 0.558 | 0.534 | 0.108 | 0.082 |
CLIP-IQA [12] | 0.468 | 0.505 | 0.501 | 0.520 | 0.294 | 0.300 |
mPO-7B (Q-Instruct [11]) | 0.902 | 0.888 | 0.698 | 0.684 | 0.629 | 0.644 |
mPO-7B (Q-Boost) | 0.912 | 0.923 | 0.702 | 0.699 | 0.630 | 0.648 |
Model | LIVE-VQC SRCC | LIVE-VQC PLCC | KoNViD-1k SRCC | KoNViD-1k PLCC | YouTube-UGC SRCC | YouTube-UGC PLCC | MaxWell (val) SRCC | MaxWell (val) PLCC |
---|---|---|---|---|---|---|---|---|
Opinion-Aware Approaches | | | | | | | | |
TLVQM [25] | 0.799 | 0.803 | 0.773 | 0.768 | 0.669 | 0.659 | 0.661 | 0.652 |
VSFA [7] | 0.773 | 0.795 | 0.773 | 0.775 | 0.724 | 0.743 | 0.676 | 0.678 |
VIDEVAL [26] | 0.752 | 0.751 | 0.783 | 0.780 | 0.779 | 0.773 | 0.597 | 0.601 |
Zero-shot Approaches | | | | | | | | |
(Spatial, Classical) NIQE [1] | 0.596 | 0.628 | 0.541 | 0.553 | 0.278 | 0.290 | 0.312 | 0.301 |
(Temporal, Classical) TPQI [27] | 0.636 | 0.645 | 0.556 | 0.549 | 0.111 | 0.218 | 0.361 | 0.377 |
(CLIP-based) SAQI [18] | 0.629 | 0.638 | 0.608 | 0.602 | 0.585 | 0.606 | 0.541 | 0.559 |
(Ensemble of Three Above) BVQI [18] | 0.784 | 0.794 | 0.760 | 0.760 | 0.525 | 0.556 | 0.637 | 0.648 |
(Spatial) mPO-7B (Q-Instruct [11]) | 0.720 | 0.657 | 0.692 | 0.603 | 0.539 | 0.540 | 0.554 | 0.515 |
(Spatial) mPO-7B (Q-Boost) | 0.741 | 0.793 | 0.801 | 0.803 | 0.723 | 0.705 | 0.682 | 0.692 |
2.4 Inference Cost
Q-Boost adds almost no computational cost, because all of its computation takes place after the input prompt has been processed and the language backbone has produced its prediction at [SCORE_TOKEN]. MPE simply involves tokenizing the prompt words and then reading their logits at [SCORE_TOKEN]. Since the softmax and score weighting of TTI cost far less than the inference of the MLLM, the additional computation introduced by Q-Boost is almost negligible. This is different from CLIP-based QA methods [12, 18], in which each prompt word needs to pass through the text backbone independently. Thus, applying a similar multi-prompt strategy to CLIP-based QA methods would significantly increase inference costs.
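The cost argument can be made concrete with a small sketch: once a single forward pass has produced a vocabulary-sized logit vector at [SCORE_TOKEN], obtaining the logit of each prompt word is only an index lookup. The toy vocabulary, token ids, and logit values below are hypothetical, and the sketch assumes each prompt word maps to a single token id.

```python
def gather_word_logits(score_token_logits, word_to_id, words):
    """Read the logits of the given words from one vocabulary-sized vector.
    No further model inference is needed, only list indexing."""
    return {w: score_token_logits[word_to_id[w]] for w in words}

# Hypothetical toy vocabulary of nine prompt words.
word_to_id = {
    "good": 0, "high": 1, "fine": 2,
    "average": 3, "medium": 4, "acceptable": 5,
    "poor": 6, "low": 7, "bad": 8,
}
# Stand-in for the output of ONE forward pass at [SCORE_TOKEN].
score_token_logits = [11.0, 10.0, 9.0, 9.0, 8.0, 7.0, 8.0, 7.0, 6.0]

looked_up = gather_word_logits(score_token_logits, word_to_id, list(word_to_id))
```

Adding more synonym words therefore grows the number of lookups but not the number of model calls.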
3 Experiment
3.1 Implementation Details
3.1.1 Experimental Setup
The mPLUG-Owl-2-7B [28] (abbr. as mPO-7B) with a CLIP-ViT-Large-14 vision backbone and a LLaMA2-7B language backbone is used for validation. The mPO-7B weights are loaded from the instruction-tuning process of Q-Instruct [11] and frozen during inference. mPO-7B (Q-Instruct) denotes the Q-Instruct-tuned weights used without the Q-Boost strategy, and the strategy-enhanced MLLM is denoted as mPO-7B (Q-Boost). Only the TTI strategy is employed for IQA, while both the TTI and MPE strategies are utilized for VQA (reasons are discussed in Sec. 3). In the MPE strategy, (good + high + fine / average + medium + acceptable / poor + low + bad) are used as the groups of words describing the (positive / neutral / negative) tones. The weighting parameters $\alpha$ and $\beta$ mentioned in Sec. 2.1 are set to 1 and 0.5 respectively.
Prompts | KonIQ-10k (IQA) SRCC | KonIQ-10k (IQA) PLCC | KADID-10K (IQA) SRCC | KADID-10K (IQA) PLCC | CGIQA-6K (IQA) SRCC | CGIQA-6K (IQA) PLCC | LIVE-VQC (VQA) SRCC | LIVE-VQC (VQA) PLCC | KoNViD-1k (VQA) SRCC | KoNViD-1k (VQA) PLCC | YouTube-UGC (VQA) SRCC | YouTube-UGC (VQA) PLCC | MaxWell (val) (VQA) SRCC | MaxWell (val) (VQA) PLCC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
good / poor | 0.902 | 0.888 | 0.698 | 0.684 | 0.629 | 0.644 | 0.720 | 0.657 | 0.692 | 0.603 | 0.539 | 0.539 | 0.554 | 0.516 |
good / average / poor | 0.912 | 0.923 | 0.702 | 0.699 | 0.630 | 0.648 | 0.740 | 0.725 | 0.766 | 0.701 | 0.577 | 0.583 | 0.611 | 0.589 |
g+h+f / a+m+ac / p+l+b | 0.898 | 0.910 | 0.704 | 0.678 | 0.558 | 0.577 | 0.741 | 0.793 | 0.801 | 0.803 | 0.723 | 0.705 | 0.682 | 0.692 |
3.1.2 Datasets Introduction
Three IQA datasets and four VQA datasets are utilized for performance comparison, which are briefly reviewed in Table 2. About 48.92% of the images in the KonIQ-10k [19] dataset are included in the Q-Instruct [11] tuning stage, and no images from the remaining datasets are seen by mPO-7B (Q-Instruct & Q-Boost). Additionally, the CGIQA-6K [21] dataset contains two separate subsets, consisting of 3,000 game images and 3,000 movie images respectively, and we report the average performance over the two subsets.
3.1.3 IQA/VQA Competitors
Both opinion-aware methods (which require supervised training) and zero-shot methods (which are training-free) are included for comparison.
The opinion-aware IQA methods consist of DBCNN [4], HyperIQA [5], and MUSIQ [6]. The zero-shot IQA methods contain NIQE [1], IL-NIQE [2], and CLIP-IQA [12]. The opinion-aware VQA methods include TLVQM [25], VSFA [7], and VIDEVAL [26]. The zero-shot VQA methods contain NIQE [1], TPQI [27], and SAQI [18].