Multimedia
See recent articles
Showing new listings for Monday, 30 September 2024
- [1] arXiv:2409.18142 (cross-list from cs.AI) [pdf, other]
-
Title: A Survey on Multimodal Benchmarks: In the Era of Large AI ModelsComments: Ongoing projectSubjects: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial advancements in artificial intelligence, significantly enhancing the capability to understand and generate multimodal content. While prior studies have largely concentrated on model architectures and training methodologies, a thorough analysis of the benchmarks used for evaluating these models remains underexplored. This survey addresses this gap by systematically reviewing 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application. We provide a detailed analysis of task designs, evaluation metrics, and dataset constructions, across diverse modalities. We hope that this survey will contribute to the ongoing advancement of MLLM research by offering a comprehensive overview of benchmarking practices and identifying promising directions for future work. An associated GitHub repository collecting the latest papers is available.
- [2] arXiv:2409.18236 (cross-list from cs.CV) [pdf, other]
-
Title: Spatial Visibility and Temporal Dynamics: Revolutionizing Field of View Prediction in Adaptive Point Cloud Video StreamingSubjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Field-of-View (FoV) adaptive streaming significantly reduces bandwidth requirement of immersive point cloud video (PCV) by only transmitting visible points in a viewer's FoV. The traditional approaches often focus on trajectory-based 6 degree-of-freedom (6DoF) FoV predictions. The predicted FoV is then used to calculate point visibility. Such approaches do not explicitly consider video content's impact on viewer attention, and the conversion from FoV to point visibility is often error-prone and time-consuming. We reformulate the PCV FoV prediction problem from the cell visibility perspective, allowing for precise decision-making regarding the transmission of 3D data at the cell level based on the predicted visibility distribution. We develop a novel spatial visibility and object-aware graph model that leverages the historical 3D visibility data and incorporates spatial perception, neighboring cell correlation, and occlusion information to predict the cell visibility in the future. Our model significantly improves the long-term cell visibility prediction, reducing the prediction MSE loss by up to 50% compared to the state-of-the-art models while maintaining real-time performance (more than 30fps) for point cloud videos with over 1 million points.
- [3] arXiv:2409.18459 (cross-list from cs.CV) [pdf, other]
-
Title: FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe GenerationComments: 14 pages, 5 figuresSubjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Research on food image understanding using recipe data has been a long-standing focus due to the diversity and complexity of the data. Moreover, food is inextricably linked to people's lives, making it a vital research area for practical applications such as dietary management. Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities, not only in their vast knowledge but also in their ability to handle languages naturally. While English is predominantly used, they can also support multiple languages including Japanese. This suggests that MLLMs are expected to significantly improve performance in food image understanding tasks. We fine-tuned open MLLMs LLaVA-1.5 and Phi-3 Vision on a Japanese recipe dataset and benchmarked their performance against the closed model GPT-4o. We then evaluated the content of generated recipes, including ingredients and cooking procedures, using 5,000 evaluation samples that comprehensively cover Japanese food culture. Our evaluation demonstrates that the open models trained on recipe data outperform GPT-4o, the current state-of-the-art model, in ingredient generation. Our model achieved F1 score of 0.531, surpassing GPT-4o's F1 score of 0.481, indicating a higher level of accuracy. Furthermore, our model exhibited comparable performance to GPT-4o in generating cooking procedure text.
- [4] arXiv:2409.18680 (cross-list from cs.SD) [pdf, other]
-
Title: Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language ModelsComments: EMNLP24 FindingsSubjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Various audio-LLMs (ALLMs) have been explored recently for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark that consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that the existing ALLMs, while being powerful in comprehending primary audio elements in individual audio inputs, struggling to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios using discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations. The proposed MALLM opens the door for ALLMs towards multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.
- [5] arXiv:2409.18713 (cross-list from eess.IV) [pdf, other]
-
Title: Decoding Complexity-Rate-Quality Pareto-Front for Adaptive VVC StreamingComments: 5 pagesSubjects: Image and Video Processing (eess.IV); Multimedia (cs.MM)
Pareto-front optimization is crucial for addressing the multi-objective challenges in video streaming, enabling the identification of optimal trade-offs between conflicting goals such as bitrate, video quality, and decoding complexity. This paper explores the construction of efficient bitrate ladders for adaptive Versatile Video Coding (VVC) streaming, focusing on optimizing these trade-offs. We investigate various ladder construction methods based on Pareto-front optimization, including exhaustive Rate-Quality and fixed ladder approaches. We propose a joint decoding time-rate-quality Pareto-front, providing a comprehensive framework to balance bitrate, decoding time, and video quality in video streaming. This allows streaming services to tailor their encoding strategies to meet specific requirements, prioritizing low decoding latency, bandwidth efficiency, or a balanced approach, thus enhancing the overall user experience. The experimental results confirm and demonstrate these opportunities for navigating the decoding time-rate-quality space to support various use cases. For example, when prioritizing low decoding latency, the proposed method achieves decoding time reduction of 14.86% while providing Bjontegaard delta rate savings of 4.65% and 0.32dB improvement in the eXtended Peak Signal-to-Noise Ratio (XPSNR)-Rate domain over the traditional fixed ladder solution.
- [6] arXiv:2409.18901 (cross-list from cs.CV) [pdf, other]
-
Title: Improving Visual Object Tracking through Visual PromptingComments: Accepted and to appear in IEEE Transactions on MultimediaSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Image and Video Processing (eess.IV)
Learning a discriminative model to distinguish a target from its surrounding distractors is essential to generic visual object tracking. Dynamic target representation adaptation against distractors is challenging due to the limited discriminative capabilities of prevailing trackers. We present a new visual Prompting mechanism for generic Visual Object Tracking (PiVOT) to address this issue. PiVOT proposes a prompt generation network with the pre-trained foundation model CLIP to automatically generate and refine visual prompts, enabling the transfer of foundation model knowledge for tracking. While CLIP offers broad category-level knowledge, the tracker, trained on instance-specific data, excels at recognizing unique object instances. Thus, PiVOT first compiles a visual prompt highlighting potential target locations. To transfer the knowledge of CLIP to the tracker, PiVOT leverages CLIP to refine the visual prompt based on the similarities between candidate objects and the reference templates across potential targets. Once the visual prompt is refined, it can better highlight potential target locations, thereby reducing irrelevant prompt information. With the proposed prompting mechanism, the tracker can generate improved instance-aware feature maps through the guidance of the visual prompt, thus effectively reducing distractors. The proposed method does not involve CLIP during training, thereby keeping the same training complexity and preserving the generalization capability of the pretrained foundation model. Extensive experiments across multiple benchmarks indicate that PiVOT, using the proposed prompting method can suppress distracting objects and enhance the tracker.
Cross submissions (showing 6 of 6 entries)
- [7] arXiv:2405.20078 (replaced) [pdf, other]
-
Title: NeRF View Synthesis: Subjective Quality Assessment and Objective Metrics EvaluationSubjects: Multimedia (cs.MM)
Neural radiance fields (NeRF) are a groundbreaking computer vision technology that enables the generation of high-quality, immersive visual content from multiple viewpoints. This capability has significant advantages for applications such as virtual/augmented reality, 3D modelling, and content creation for the film and entertainment industry. However, the evaluation of NeRF methods poses several challenges, including a lack of comprehensive datasets, reliable assessment methodologies, and objective quality metrics. This paper addresses the problem of NeRF view synthesis (NVS) quality assessment thoroughly, by conducting a rigorous subjective quality assessment test that considers several scene classes and recently proposed NVS methods. Additionally, the performance of a wide range of state-of-the-art conventional and learning-based full-reference 2D image and video quality assessment metrics is evaluated against the subjective scores of the subjective study. This study found that errors in camera pose estimation can result in spatial misalignments between synthesized and reference images, which need to be corrected before applying an objective quality metric. The experimental results are analyzed in depth, providing a comparative evaluation of several NVS methods and objective quality metrics, across different classes of visual scenes, including real and synthetic content for front-face and 360-degree camera trajectories.
- [8] arXiv:2409.15545 (replaced) [pdf, other]
-
Title: Rethinking Emotion Bias in Music via Frechet Audio DistanceSubjects: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
The subjective nature of music emotion introduces inherent bias in both recognition and generation, especially when relying on a single audio encoder, emotion classifier, or evaluation metric. In this work, we conduct a study on Music Emotion Recognition (MER) and Emotional Music Generation (EMG), employing diverse audio encoders alongside the Frechet Audio Distance (FAD), a reference-free evaluation metric. Our study begins with a benchmark evaluation of MER, highlighting the limitations associated with using a single audio encoder and the disparities observed across different measurements. We then propose assessing MER performance using FAD from multiple encoders to provide a more objective measure of music emotion. Furthermore, we introduce an enhanced EMG approach designed to improve both the variation and prominence of generated music emotion, thus enhancing realism. Additionally, we investigate the realism disparities between the emotions conveyed in real and synthetic music, comparing our EMG model against two baseline models. Experimental results underscore the emotion bias problem in both MER and EMG and demonstrate the potential of using FAD and diverse audio encoders to evaluate music emotion objectively.
- [9] arXiv:2409.16937 (replaced) [pdf, other]
-
Title: Semi-Supervised Cognitive State Classification from Speech with Multi-View Pseudo-LabelingSubjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM); Sound (cs.SD)
The lack of labeled data is a common challenge in speech classification tasks, particularly those requiring extensive subjective assessment, such as cognitive state classification. In this work, we propose a Semi-Supervised Learning (SSL) framework, introducing a novel multi-view pseudo-labeling method that leverages both acoustic and linguistic characteristics to select the most confident data for training the classification model. Acoustically, unlabeled data are compared to labeled data using the Frechet audio distance, calculated from embeddings generated by multiple audio encoders. Linguistically, large language models are prompted to revise automatic speech recognition transcriptions and predict labels based on our proposed task-specific knowledge. High-confidence data are identified when pseudo-labels from both sources align, while mismatches are treated as low-confidence data. A bimodal classifier is then trained to iteratively label the low-confidence data until a predefined criterion is met. We evaluate our SSL framework on emotion recognition and dementia detection tasks. Experimental results demonstrate that our method achieves competitive performance compared to fully supervised learning using only 30% of the labeled data and significantly outperforms two selected baselines.