Search | arXiv e-print repository

VideoGameBunny: Towards vision assistants for video games

Authors: Mohammad Reza Taesiri, Cor-Paul Bezemer

Abstract: Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities have limitations in the video game domain, such as challenges with scene understanding, hallucinations, and inaccurate descriptions of video game content, especially in open-source models. This pa… ▽ More Large multimodal models (LMMs) hold substantial promise across various domains, from personal assistance in daily tasks to sophisticated applications like medical diagnostics. However, their capabilities have limitations in the video game domain, such as challenges with scene understanding, hallucinations, and inaccurate descriptions of video game content, especially in open-source models. This paper describes the development of VideoGameBunny, a LLaVA-style model based on Bunny, specifically tailored for understanding images from video games. We release intermediate checkpoints, training logs, and an extensive dataset comprising 185,259 video game images from 413 titles, along with 389,565 image-instruction pairs that include image captions, question-answer pairs, and a JSON representation of 16 elements of 136,974 images. Our experiments show that our high quality game-related data has the potential to make a relatively small model outperform the much larger state-of-the-art model LLaVa-1.6-34b (which has more than 4x the number of parameters). Our study paves the way for future research in video game understanding on tasks such as playing, commentary, and debugging. Code and data are available at https://videogamebunny.github.io/ △ Less

Submitted 21 July, 2024; originally announced July 2024.

arXiv:2407.06581 [pdf, other]

Vision language models are blind

Authors: Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen

Abstract: While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they are surprisingly still struggling with low-level vision tasks that are easy to humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two c… ▽ More While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, are powering various image-text applications and scoring high on many vision-understanding benchmarks, we find that they are surprisingly still struggling with low-level vision tasks that are easy to humans. Specifically, on BlindTest, our suite of 7 very simple tasks such as identifying (a) whether two circles overlap; (b) whether two lines intersect; (c) which letter is being circled in a word; and (d) counting circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.57% accurate on average. Claude 3.5 Sonnet performs the best at 74.94% accuracy, but this is still far from the human expected accuracy of 100%. Across different image resolutions and line widths, VLMs consistently struggle with tasks that require precise spatial information and recognizing geometric primitives that overlap or are close together. Code and data are available at: https://vlmsareblind.github.io △ Less

Submitted 25 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

arXiv:2404.05238 [pdf, other]

Allowing humans to interactively guide machines where to look does not always improve human-AI team's classification accuracy

Authors: Giang Nguyen, Mohammad Reza Taesiri, Sunnie S. Y. Kim, Anh Nguyen

Abstract: Via thousands of papers in Explainable AI (XAI), attention maps \cite{vaswani2017attention} and feature importance maps \cite{bansal2020sam} have been established as a common means for finding how important each input feature is to an AI's decisions. It is an interesting, unexplored question whether allowing users to edit the feature importance at test time would improve a human-AI team's accuracy… ▽ More Via thousands of papers in Explainable AI (XAI), attention maps \cite{vaswani2017attention} and feature importance maps \cite{bansal2020sam} have been established as a common means for finding how important each input feature is to an AI's decisions. It is an interesting, unexplored question whether allowing users to edit the feature importance at test time would improve a human-AI team's accuracy on downstream tasks. In this paper, we address this question by leveraging CHM-Corr, a state-of-the-art, ante-hoc explainable classifier \cite{taesiri2022visual} that first predicts patch-wise correspondences between the input and training-set images, and then bases on them to make classification decisions. We build CHM-Corr++, an interactive interface for CHM-Corr, enabling users to edit the feature importance map provided by CHM-Corr and observe updated model decisions. Via CHM-Corr++, users can gain insights into if, when, and how the model changes its outputs, improving their understanding beyond static explanations. However, our study with 18 expert users who performed 1,400 decisions finds no statistical significance that our interactive approach improves user accuracy on CUB-200 bird image classification over static explanations. This challenges the hypothesis that interactivity can boost human-AI team accuracy and raises needs for future research. We open-source CHM-Corr++, an interactive tool for editing image classifier attention (see an interactive demo here: http://137.184.82.109:7080/). We release code and data on github: https://github.com/anguyen8/chm-corr-interactive. △ Less

Submitted 20 April, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

Comments: Accepted for presentation at the XAI4CV Workshop, part of the CVPR 2024 proceedings

arXiv:2312.05291 [pdf, other]

GlitchBench: Can large multimodal models detect video game glitches?

Authors: Mohammad Reza Taesiri, Tianjun Feng, Anh Nguyen, Cor-Paul Bezemer

Abstract: Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate multiple input modalities, such as visual inputs. This integration augments the capacity of LLMs for tasks requiring visual comprehension and reasoning. However, the extent and limitations of their enhanced abilities are not fully understood, especially when it comes to real-world tasks. To address this gap,… ▽ More Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate multiple input modalities, such as visual inputs. This integration augments the capacity of LLMs for tasks requiring visual comprehension and reasoning. However, the extent and limitations of their enhanced abilities are not fully understood, especially when it comes to real-world tasks. To address this gap, we introduce GlitchBench, a novel benchmark derived from video game quality assurance tasks, to test and evaluate the reasoning capabilities of LMMs. Our benchmark is curated from a variety of unusual and glitched scenarios from video games and aims to challenge both the visual and linguistic reasoning powers of LMMs in detecting and interpreting out-of-the-ordinary events. We evaluate multiple state-of-the-art LMMs, and we show that GlitchBench presents a new challenge for these models. Code and data are available at: https://glitchbench.github.io/ △ Less

Submitted 29 March, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

Comments: CVPR 2024

arXiv:2308.13651 [pdf, other]

PCNN: Probable-Class Nearest-Neighbor Explanations Improve Fine-Grained Image Classification Accuracy for AIs and Humans

Authors: Giang, Nguyen, Valerie Chen, Mohammad Reza Taesiri, Anh Totti Nguyen

Abstract: Nearest neighbors (NN) are traditionally used to compute final decisions, e.g., in Support Vector Machines or k-NN classifiers, and to provide users with explanations for the model's decision. In this paper, we show a novel utility of nearest neighbors: To improve predictions of a frozen, pretrained image classifier C. We leverage an image comparator S that (1) compares the input image with NN ima… ▽ More Nearest neighbors (NN) are traditionally used to compute final decisions, e.g., in Support Vector Machines or k-NN classifiers, and to provide users with explanations for the model's decision. In this paper, we show a novel utility of nearest neighbors: To improve predictions of a frozen, pretrained image classifier C. We leverage an image comparator S that (1) compares the input image with NN images from the top-K most probable classes given by C; and (2) uses scores from S to weight the confidence scores of C to refine predictions. Our method consistently improves fine-grained image classification accuracy on CUB-200, Cars-196, and Dogs-120. Also, a human study finds that showing users our probable-class nearest neighbors (PCNN) reduces over-reliance on AI, thus improving their decision accuracy over prior work which only shows only the most-probable (top-1) class examples. △ Less

Submitted 26 August, 2024; v1 submitted 25 August, 2023; originally announced August 2023.

Comments: Accepted to Transaction on Machine Learning Research 2024; 50 pages, 35 Figures & 17 Tables

arXiv:2304.05538 [pdf, other]

ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification

Authors: Mohammad Reza Taesiri, Giang Nguyen, Sarra Habchi, Cor-Paul Bezemer, Anh Nguyen

Abstract: Image classifiers are information-discarding machines, by design. Yet, how these models discard information remains mysterious. We hypothesize that one way for image classifiers to reach high accuracy is to first zoom to the most discriminative region in the image and then extract features from there to predict image labels, discarding the rest of the image. Studying six popular networks ranging f… ▽ More Image classifiers are information-discarding machines, by design. Yet, how these models discard information remains mysterious. We hypothesize that one way for image classifiers to reach high accuracy is to first zoom to the most discriminative region in the image and then extract features from there to predict image labels, discarding the rest of the image. Studying six popular networks ranging from AlexNet to CLIP, we find that proper framing of the input image can lead to the correct classification of 98.91% of ImageNet images. Furthermore, we uncover positional biases in various datasets, especially a strong center bias in two popular datasets: ImageNet-A and ObjectNet. Finally, leveraging our insights into the potential of zooming, we propose a test-time augmentation (TTA) technique that improves classification accuracy by forcing models to explicitly perform zoom-in operations before making predictions. Our method is more interpretable, accurate, and faster than MEMO, a state-of-the-art (SOTA) TTA method. We introduce ImageNet-Hard, a new benchmark that challenges SOTA classifiers including large vision-language models even when optimal zooming is allowed. △ Less

Submitted 8 October, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: NeurIPS 2023 Track on Datasets and Benchmarks

arXiv:2210.02506 [pdf, other]

Large Language Models are Pretty Good Zero-Shot Video Game Bug Detectors

Authors: Mohammad Reza Taesiri, Finlay Macklon, Yihe Wang, Hengshuo Shen, Cor-Paul Bezemer

Abstract: Video game testing requires game-specific knowledge as well as common sense reasoning about the events in the game. While AI-driven agents can satisfy the first requirement, it is not yet possible to meet the second requirement automatically. Therefore, video game testing often still relies on manual testing, and human testers are required to play the game thoroughly to detect bugs. As a result, i… ▽ More Video game testing requires game-specific knowledge as well as common sense reasoning about the events in the game. While AI-driven agents can satisfy the first requirement, it is not yet possible to meet the second requirement automatically. Therefore, video game testing often still relies on manual testing, and human testers are required to play the game thoroughly to detect bugs. As a result, it is challenging to fully automate game testing. In this study, we explore the possibility of leveraging the zero-shot capabilities of large language models for video game bug detection. By formulating the bug detection problem as a question-answering task, we show that large language models can identify which event is buggy in a sequence of textual descriptions of events from a game. To this end, we introduce the GameBugDescriptions benchmark dataset, which consists of 167 buggy gameplay videos and a total of 334 question-answer pairs across 8 games. We extensively evaluate the performance of six models across the OPT and InstructGPT large language model families on our benchmark dataset. Our results show promising results for employing language models to detect video game bugs. With the proper prompting technique, we could achieve an accuracy of 70.66%, and on some video games, up to 78.94%. Our code, evaluation data and the benchmark can be found on https://asgaardlab.github.io/LLMxBugs △ Less

Submitted 5 October, 2022; originally announced October 2022.

arXiv:2208.02335 [pdf, other]

Automatically Detecting Visual Bugs in HTML5 <canvas> Games

Authors: Finlay Macklon, Mohammad Reza Taesiri, Markos Viggiato, Stefan Antoszko, Natalia Romanova, Dale Paas, Cor-Paul Bezemer

Abstract: The HTML5 <canvas> is used to display high quality graphics in web applications such as web games (i.e., <canvas> games). However, automatically testing <canvas> games is not possible with existing web testing techniques and tools, and manual testing is laborious. Many widely used web testing tools rely on the Document Object Model (DOM) to drive web test automation, but the contents of the <canva… ▽ More The HTML5 <canvas> is used to display high quality graphics in web applications such as web games (i.e., <canvas> games). However, automatically testing <canvas> games is not possible with existing web testing techniques and tools, and manual testing is laborious. Many widely used web testing tools rely on the Document Object Model (DOM) to drive web test automation, but the contents of the <canvas> are not represented in the DOM. The main alternative approach, snapshot testing, involves comparing oracle snapshot images with test-time snapshot images using an image similarity metric to catch visual bugs, i.e., bugs in the graphics of the web application. However, creating and maintaining oracle snapshot images for <canvas> games is onerous, defeating the purpose of test automation. In this paper, we present a novel approach to automatically detect visual bugs in <canvas> games. By leveraging an internal representation of objects on the <canvas>, we decompose snapshot images into a set of object images, each of which is compared with a respective oracle asset (e.g., a sprite) using four similarity metrics: percentage overlap, mean squared error, structural similarity, and embedding similarity. We evaluate our approach by injecting 24 visual bugs into a custom <canvas> game, and find that our approach achieves an accuracy of 100%, compared to an accuracy of 44.6% with traditional snapshot testing. △ Less

Submitted 3 August, 2022; originally announced August 2022.

Comments: Accepted at ASE 2022 conference

arXiv:2208.00780 [pdf, other]

Visual correspondence-based explanations improve AI robustness and human-AI team accuracy

Authors: Giang Nguyen, Mohammad Reza Taesiri, Anh Nguyen

Abstract: Explaining artificial intelligence (AI) predictions is increasingly important and even imperative in many high-stakes applications where humans are the ultimate decision-makers. In this work, we propose two novel architectures of self-interpretable image classifiers that first explain, and then predict (as opposed to post-hoc explanations) by harnessing the visual correspondences between a query i… ▽ More Explaining artificial intelligence (AI) predictions is increasingly important and even imperative in many high-stakes applications where humans are the ultimate decision-makers. In this work, we propose two novel architectures of self-interpretable image classifiers that first explain, and then predict (as opposed to post-hoc explanations) by harnessing the visual correspondences between a query image and exemplars. Our models consistently improve (by 1 to 4 points) on out-of-distribution (OOD) datasets while performing marginally worse (by 1 to 2 points) on in-distribution tests than ResNet-50 and a $k$-nearest neighbor classifier (kNN). Via a large-scale, human study on ImageNet and CUB, our correspondence-based explanations are found to be more useful to users than kNN explanations. Our explanations help users more accurately reject AI's wrong decisions than all other tested methods. Interestingly, for the first time, we show that it is possible to achieve complementary human-AI team accuracy (i.e., that is higher than either AI-alone or human-alone), in ImageNet and CUB image classification tasks. △ Less

Submitted 30 August, 2023; v1 submitted 26 July, 2022; originally announced August 2022.

Comments: NeurIPS 2022 conference paper

arXiv:2203.11096 [pdf, other]

CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning

Authors: Mohammad Reza Taesiri, Finlay Macklon, Cor-Paul Bezemer

Abstract: Gameplay videos contain rich information about how players interact with the game and how the game responds. Sharing gameplay videos on social media platforms, such as Reddit, has become a common practice for many players. Often, players will share gameplay videos that showcase video game bugs. Such gameplay videos are software artifacts that can be utilized for game testing, as they provide insig… ▽ More Gameplay videos contain rich information about how players interact with the game and how the game responds. Sharing gameplay videos on social media platforms, such as Reddit, has become a common practice for many players. Often, players will share gameplay videos that showcase video game bugs. Such gameplay videos are software artifacts that can be utilized for game testing, as they provide insight for bug analysis. Although large repositories of gameplay videos exist, parsing and mining them in an effective and structured fashion has still remained a big challenge. In this paper, we propose a search method that accepts any English text query as input to retrieve relevant videos from large repositories of gameplay videos. Our approach does not rely on any external information (such as video metadata); it works solely based on the content of the video. By leveraging the zero-shot transfer capabilities of the Contrastive Language-Image Pre-Training (CLIP) model, our approach does not require any data labeling or training. To evaluate our approach, we present the $\texttt{GamePhysics}$ dataset consisting of 26,954 videos from 1,873 games, that were collected from the GamePhysics section on the Reddit website. Our approach shows promising results in our extensive analysis of simple queries, compound queries, and bug queries, indicating that our approach is useful for object and event detection in gameplay videos. An example application of our approach is as a gameplay video search engine to aid in reproducing video game bugs. Please visit the following link for the code and the data: https://asgaardlab.github.io/CLIPxGamePhysics/ △ Less

Submitted 22 March, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

Comments: Accepted by MSR 2022 conference

arXiv:2109.12321 [pdf, other]

Under the Skin of Foundation NFT Auctions

Authors: MohammadAmin Fazli, Ali Owfi, Mohammad Reza Taesiri

Abstract: Non Fungible Tokens (NFTs) have gained a solid foothold within the crypto community, and substantial amounts of money have been allocated to their trades. In this paper, we studied one of the most prominent marketplaces dedicated to NFT auctions and trades, Foundation. We analyzed the activities on Foundation and identified several intriguing underlying dynamics that occur on this platform. Moreov… ▽ More Non Fungible Tokens (NFTs) have gained a solid foothold within the crypto community, and substantial amounts of money have been allocated to their trades. In this paper, we studied one of the most prominent marketplaces dedicated to NFT auctions and trades, Foundation. We analyzed the activities on Foundation and identified several intriguing underlying dynamics that occur on this platform. Moreover, We performed social network analysis on a graph that we had created based on transferred NFTs on Foundation, and then described the characteristics of this graph. Lastly, We built a neural network-based similarity model for retrieving and clustering similar NFTs. We also showed that for most NFTs, their performances in auctions were comparable with the auction performance of other NFTs in their cluster. △ Less

Submitted 25 September, 2021; originally announced September 2021.

Showing 1–11 of 11 results for author: Taesiri, M R