
COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark

Authors: Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku

Abstract: Procedural video understanding is gaining attention in the vision and language community. Deep learning-based video analysis requires extensive data. Consequently, existing works often use web videos as training resources, making it challenging to query instructional contents from raw video observations. To address this issue, we propose a new dataset, COM Kitchens. The dataset consists of unedited overhead-view videos captured by smartphones, in which participants performed food preparation based on given recipes. Fixed-viewpoint video datasets often lack environmental diversity due to high camera setup costs. We used modern wide-angle smartphone lenses to cover cooking counters from sink to cooktop in an overhead view, capturing activity without in-person assistance. With this setup, we collected a diverse dataset by distributing smartphones to participants. With this dataset, we propose the novel video-to-text retrieval task Online Recipe Retrieval (OnRR) and new video captioning domain Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our experiments verified the capabilities and limitations of current web-video-based SOTA methods in handling these tasks.

Submitted 5 August, 2024; originally announced August 2024.

Comments: ECCV2024 accepted

arXiv:2407.05814

Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition

Authors: Yaozong Gan, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Abstract: Recent multimodal large language models (MLLM) such as GPT-4o and GPT-4v have shown great potential in autonomous driving. In this paper, we propose a cross-domain few-shot in-context learning method based on the MLLM for enhancing traffic sign recognition (TSR). We first construct a traffic sign detection network based on Vision Transformer Adapter and an extraction module to extract traffic signs from the original road images. To reduce the dependence on training data and improve the performance stability of cross-country TSR, we introduce a cross-domain few-shot in-context learning method based on the MLLM. To enhance MLLM's fine-grained recognition ability of traffic signs, the proposed method generates corresponding description texts using template traffic signs. These description texts contain key information about the shape, color, and composition of traffic signs, which can stimulate the ability of MLLM to perceive fine-grained traffic sign categories. By using the description texts, our method reduces the cross-domain differences between template and real traffic signs. Our approach requires only simple and uniform textual indications, without the need for large-scale traffic sign images and labels. We perform comprehensive evaluations on the German traffic sign recognition benchmark dataset, the Belgium traffic sign dataset, and two real-world datasets taken from Japan. The experimental results show that our method significantly enhances the TSR performance.

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2406.13316

Reinforcing Pre-trained Models Using Counterfactual Images

Authors: Xiang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Abstract: This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images. Deep learning classification models are often trained using datasets that mirror real-world scenarios. In this training process, because learning is based solely on correlations with labels, there is a risk that models may learn spurious relationships, such as an overreliance on features not central to the subject, like background elements in images. However, due to the black-box nature of the decision-making process in deep learning models, identifying and addressing these vulnerabilities has been particularly challenging. We introduce a novel framework for reinforcing the classification models, which consists of a two-stage process. First, we identify model weaknesses by testing the model using the counterfactual image dataset, which is generated by perturbed image captions. Subsequently, we employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model. Through extensive experiments on several classification models across various datasets, we revealed that fine-tuning with a small set of counterfactual images effectively strengthens the model.

Submitted 19 June, 2024; originally announced June 2024.

Comments: 6 pages, 4 figures

arXiv:2404.17732

Generative Dataset Distillation: Balancing Global Structure and Local Details

Authors: Longzhen Li, Guang Li, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Abstract: In this paper, we propose a new dataset distillation method that considers balancing global structure and local details when distilling the information from a large dataset into a generative model. Dataset distillation has been proposed to reduce the size of the required dataset when training models. The conventional dataset distillation methods face the problem of long redeployment time and poor cross-architecture performance. Moreover, previous methods focused too much on the high-level semantic attributes between the synthetic dataset and the original dataset while ignoring the local features such as texture and shape. Based on the above understanding, we propose a new method for distilling the original image dataset into a generative model. Our method involves using a conditional generative adversarial network to generate the distilled dataset. Subsequently, we ensure balancing global structure and local details in the distillation process, continuously optimizing the generator for more information-dense dataset generation.

Submitted 26 April, 2024; originally announced April 2024.

Comments: Accepted by the 1st CVPR Workshop on Dataset Distillation

arXiv:2403.18258

Enhancing Generative Class Incremental Learning Performance with Model Forgetting Approach

Authors: Taro Togo, Ren Togo, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Abstract: This study presents a novel approach to Generative Class Incremental Learning (GCIL) by introducing the forgetting mechanism, aimed at dynamically managing class information for better adaptation to streaming data. GCIL is one of the hot topics in the field of computer vision, and this is considered one of the crucial tasks in society, specifically the continual learning of generative models. The ability to forget is a crucial brain function that facilitates continual learning by selectively discarding less relevant information for humans. However, in the field of machine learning models, the concept of intentionally forgetting has not been extensively investigated. In this study we aim to bridge this gap by incorporating the forgetting mechanisms into GCIL, thereby examining their impact on the models' ability to learn in continual learning. Through our experiments, we have found that integrating the forgetting mechanisms significantly enhances the models' performance in acquiring new knowledge, underscoring the positive role that strategic forgetting plays in the process of continual learning.

Submitted 27 March, 2024; originally announced March 2024.

arXiv:2402.17969

Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction

Authors: Koki Maeda, Shuhei Kurita, Taiki Miyanishi, Naoaki Okazaki

Abstract: Given the accelerating progress of vision and language modeling, accurate evaluation of machine-generated image captions remains critical. In order to evaluate captions more closely to human preferences, metrics need to discriminate between captions of varying quality and content. However, conventional metrics fall short of comparing beyond superficial matches of words or embedding similarities; thus, they still need improvement. This paper presents VisCE$^2$, a vision language model-based caption evaluation method. Our method focuses on visual context, which refers to the detailed content of images, including objects, attributes, and relationships. By extracting and organizing them into a structured format, we replace the human-written references with visual contexts and help VLMs better understand the image, enhancing evaluation performance. Through meta-evaluation on multiple datasets, we validated that VisCE$^2$ outperforms the conventional pre-trained metrics in capturing caption quality and demonstrates superior consistency with human judgment.

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2402.11145

Supporting Experts with a Multimodal Machine-Learning-Based Tool for Human Behavior Analysis of Conversational Videos

Authors: Riku Arakawa, Kiyosu Maeda, Hiromu Yakura

Abstract: Multimodal scene search of conversations is essential for unlocking valuable insights into social dynamics and enhancing our communication. While experts in conversational analysis have their own knowledge and skills to find key scenes, a lack of comprehensive, user-friendly tools that streamline the processing of diverse multimodal queries impedes efficiency and objectivity. To solve this, we developed Providence, a visual-programming-based tool based on design considerations derived from a formative study with experts. It enables experts to combine various machine learning algorithms to capture human behavioral cues without writing code. Our study showed its preferable usability and satisfactory output with less cognitive load imposed in accomplishing scene search tasks of conversations, verifying the importance of its customizability and transparency. Furthermore, through the in-the-wild trial, we confirmed that the objectivity and reusability of the tool transform experts' workflow, suggesting the advantage of expert-AI teaming in a highly human-contextual domain.

Submitted 16 February, 2024; originally announced February 2024.

arXiv:2310.12650

Hibikino-Musashi@Home 2023 Team Description Paper

Authors: Tomoya Shiba, Akinobu Mizutani, Yuga Yano, Tomohiro Ono, Shoshi Tokuno, Daiju Kanaoka, Yukiya Fukuda, Hayato Amano, Mayu Koresawa, Yoshifumi Sakai, Ryogo Takemoto, Katsunori Tamai, Kazuo Nakahara, Hiroyuki Hayashi, Satsuki Fujimatsu, Yusuke Mizoguchi, Moeno Anraku, Mayo Suzuka, Lu Shen, Kohei Maeda, Fumiya Matsuzaki, Ikuya Matsumoto, Kazuya Murai, Kosei Isomoto, Kim Minje, et al. (3 additional authors not shown)

Abstract: This paper describes an overview of the techniques of Hibikino-Musashi@Home, which intends to participate in the domestic standard platform league. The team has developed a dataset generator for the training of a robot vision system and an open-source development environment running on a human support robot simulator. The robot system comprises self-developed libraries, including those for motion synthesis, and open-source software that works on the robot operating system. The team aims to realize a home service robot that assists humans in a home, and continuously attends the competition to evaluate the developed system. The brain-inspired artificial intelligence system is also proposed for service robots which are expected to work in a real home environment.

Submitted 19 October, 2023; originally announced October 2023.

arXiv:2307.02799

Few-shot Personalized Saliency Prediction Based on Inter-personnel Gaze Patterns

Authors: Yuya Moroto, Keisuke Maeda, Takahiro Ogawa, Miki Haseyama

Abstract: This paper presents few-shot personalized saliency prediction based on inter-personnel gaze patterns. In contrast to general saliency maps, personalized saliency maps (PSMs) have great potential since PSMs indicate the person-specific visual attention useful for obtaining individual visual preferences. The PSM prediction is needed for acquiring the PSMs for unseen images, but its prediction is still a challenging task due to the complexity of individual gaze patterns. Moreover, the eye-tracking data obtained from each person is necessary to construct and predict PSMs, but it is difficult to acquire the massive amounts of such data. One solution for realizing PSM prediction from the limited amount of data is the effective use of eye-tracking data obtained from other persons. To efficiently treat the PSMs of other persons, this paper focuses on the selection of images to acquire eye-tracking data and the preservation of structural information of PSMs of other persons. In the proposed method, such images are selected such that they bring more diverse gaze patterns to persons, and the structural information is preserved by adopting the tensor-based regression method. Experimental results demonstrate that the above two points are beneficial for the few-shot PSM prediction.

Submitted 3 March, 2024; v1 submitted 6 July, 2023; originally announced July 2023.

Comments: 5 pages, 3 figures

arXiv:2204.02308

CalmResponses: Displaying Collective Audience Reactions in Remote Communication

Authors: Kiyosu Maeda, Riku Arakawa, Jun Rekimoto

Abstract: We propose a system displaying audience eye gaze and nod reactions for enhancing synchronous remote communication. Recently, we have had increasing opportunities to speak to others remotely. In contrast to offline situations, however, speakers often have difficulty observing audience reactions at once in remote communication, which makes them feel more anxious and less confident in their speeches. Recent studies have proposed methods of presenting various audience reactions to speakers. Since these methods require additional devices to measure audience reactions, they are not appropriate for practical situations. Moreover, these methods do not present overall audience reactions. In contrast, we design and develop CalmResponses, a browser-based system which measures audience eye gaze and nod reactions only with a built-in webcam and collectively presents them to speakers. The results of our two user studies indicated that the number of fillers in speaker's speech decreases when audiences' eye gaze is presented, and their self-rating score increases when audiences' nodding is presented. Moreover, comments from audiences suggested benefits of CalmResponses for them in terms of co-presence and privacy concerns.

Submitted 5 April, 2022; originally announced April 2022.

Comments: To appear in ACM International Conference on Interactive Media Experiences

arXiv:2202.02319

An integrated heterogeneous computing framework for ensemble simulations of laser-induced ignition

Authors: Kazuki Maeda, Thiago Teixeira, Jonathan M. Wang, Jeffrey M. Hokanson, Caetano Melone, Mario Di Renzo, Steve Jones, Javier Urzay, Gianluca Iaccarino

Abstract: An integrated computational framework is introduced to study complex engineering systems through physics-based ensemble simulations on heterogeneous supercomputers. The framework is primarily designed for the quantitative assessment of laser-induced ignition in rocket engines. We develop and combine an implicit programming system, a compressible reacting flow solver, and a data generation/management strategy on a robust and portable platform. We systematically present this framework using test problems on a hybrid CPU/GPU machine. Efficiency, scalability, and accuracy of the solver are comprehensively assessed with canonical unit problems. Ensemble data management and autoencoding are demonstrated using a canonical diffusion flame case. Sensitivity analysis of the ignition of a turbulent, gaseous fuel jet is performed using a simplified, three-dimensional model combustor. Our approach unifies computer science, physics and engineering, and data science to realize a cross-disciplinary workflow. The framework is exascale-oriented and can be considered a benchmark for future computational science studies of real-world systems.

Submitted 4 February, 2022; originally announced February 2022.

Comments: 28 pages, 12 figures

arXiv:cs/0204006

TableTrans, MultiTrans, InterTrans and TreeTrans: Diverse Tools Built on the Annotation Graph Toolkit

Authors: Steven Bird, Kazuaki Maeda, Xiaoyi Ma, Haejoong Lee, Beth Randall, Salim Zayat

Abstract: Four diverse tools built on the Annotation Graph Toolkit are described. Each tool associates linguistic codes and structures with time-series data. All are based on the same software library and tool architecture. TableTrans is for observational coding, using a spreadsheet whose rows are aligned to a signal. MultiTrans is for transcribing multi-party communicative interactions recorded using multi-channel signals. InterTrans is for creating interlinear text aligned to audio. TreeTrans is for creating and manipulating syntactic trees. This work demonstrates that the development of diverse tools and re-use of software components is greatly facilitated by a common high-level application programming interface for representing the data and managing input/output, together with a common architecture for managing the interaction of multiple components.

Submitted 3 April, 2002; originally announced April 2002.

Comments: 7 pages, 7 figures

ACM Class: D.2.13; H.5.5; I.2.7

Journal ref: Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002

arXiv:cs/0204005

Creating Annotation Tools with the Annotation Graph Toolkit

Authors: Kazuaki Maeda, Steven Bird, Xiaoyi Ma, Haejoong Lee

Abstract: The Annotation Graph Toolkit is a collection of software supporting the development of annotation tools based on the annotation graph model. The toolkit includes application programming interfaces for manipulating annotation graph data and for importing data from other formats. There are interfaces for the scripting languages Tcl and Python, a database interface, specialized graphical user interfaces for a variety of annotation tasks, and several sample applications. This paper describes all the toolkit components for the benefit of would-be application developers.

Submitted 3 April, 2002; originally announced April 2002.

Comments: 8 pages, 12 figures

ACM Class: D.2.13; H.5.5; I.2.7

Journal ref: Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002

arXiv:cs/0204004

Models and Tools for Collaborative Annotation

Authors: Xiaoyi Ma, Haejoong Lee, Steven Bird, Kazuaki Maeda

Abstract: The Annotation Graph Toolkit (AGTK) is a collection of software which facilitates development of linguistic annotation tools. AGTK provides a database interface which allows applications to use a database server for persistent storage. This paper discusses various modes of collaborative annotation and how they can be supported with tools built using AGTK and its database interface. We describe the relational database schema and API, and describe a version of the TableTrans tool which supports collaborative annotation. The remainder of the paper discusses a high-level query language for annotation graphs, along with optimizations, in support of expressive and efficient access to the annotations held on a large central server. The paper demonstrates that it is straightforward to support a variety of different levels of collaborative annotation with existing AGTK-based tools, with a minimum of additional programming effort.

Submitted 3 April, 2002; originally announced April 2002.

Comments: 8 pages, 6 figures

ACM Class: H.2.4; H.5.3; H.5.5; I.2.7

Journal ref: Proceedings of the Third International Conference on Language Resources and Evaluation, Paris: European Language Resources Association, 2002

Showing 1–14 of 14 results for author: Maeda, K