Search | arXiv e-print repository

TraveLER: A Multi-LMM Agent Framework for Video Question-Answering

Authors: Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, Roei Herzig

Abstract: Recently, Large Multimodal Models (LMMs) have made significant progress in video question-answering using a frame-wise approach by leveraging large-scale, image-based pretraining in a zero-shot manner. While image-based methods for videos have shown impressive performance, a current limitation is that they often overlook how key timestamps are selected and cannot adjust when incorrect timestamps a… ▽ More Recently, Large Multimodal Models (LMMs) have made significant progress in video question-answering using a frame-wise approach by leveraging large-scale, image-based pretraining in a zero-shot manner. While image-based methods for videos have shown impressive performance, a current limitation is that they often overlook how key timestamps are selected and cannot adjust when incorrect timestamps are identified. Moreover, they are unable to extract details relevant to the question, instead providing general descriptions of the frame. To overcome this, we design a multi-LMM agent framework that travels along the video, iteratively collecting relevant information from keyframes through interactive question-asking until there is sufficient information to answer the question. Specifically, we propose TraveLER, a model that can create a plan to "Traverse" through the video, ask questions about individual frames to "Locate" and store key information, and then "Evaluate" if there is enough information to answer the question. Finally, if there is not enough information, our method is able to "Replan" based on its collected knowledge. Through extensive experiments, we find that the proposed TraveLER approach improves performance on several video question-answering benchmarks, such as NExT-QA, STAR, and Perception Test, without the need to fine-tune on specific datasets. △ Less

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2309.11580 [pdf, other]

A real-time, hardware agnostic framework for close-up branch reconstruction using RGB data

Authors: Alexander You, Aarushi Mehta, Luke Strohbehn, Jochen Hemming, Cindy Grimm, Joseph R. Davidson

Abstract: Creating accurate 3D models of tree topology is an important task for tree pruning. The 3D model is used to decide which branches to prune and then to execute the pruning cuts. Previous methods for creating 3D tree models have typically relied on point clouds, which are often computationally expensive to process and can suffer from data defects, especially with thin branches. In this paper, we pro… ▽ More Creating accurate 3D models of tree topology is an important task for tree pruning. The 3D model is used to decide which branches to prune and then to execute the pruning cuts. Previous methods for creating 3D tree models have typically relied on point clouds, which are often computationally expensive to process and can suffer from data defects, especially with thin branches. In this paper, we propose a method for actively scanning along a primary tree branch, detecting secondary branches to be pruned, and reconstructing their 3D geometry using just an RGB camera mounted on a robot arm. We experimentally validate that our setup is able to produce primary branch models with 4-5 mm accuracy and secondary branch models with 15 degrees orientation accuracy with respect to the ground truth model. Our framework is real-time and can run up to 10 cm/s with no loss in model accuracy or ability to detect secondary branches. △ Less

Submitted 18 June, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

arXiv:2206.07201 [pdf, other]

An autonomous robot for pruning modern, planar fruit trees

Authors: Alexander You, Nidhi Parayil, Josyula Gopala Krishna, Uddhav Bhattarai, Ranjan Sapkota, Dawood Ahmed, Matthew Whiting, Manoj Karkee, Cindy M. Grimm, Joseph R. Davidson

Abstract: Dormant pruning of fruit trees is an important task for maintaining tree health and ensuring high-quality fruit. Due to decreasing labor availability, pruning is a prime candidate for robotic automation. However, pruning also represents a uniquely difficult problem for robots, requiring robust systems for perception, pruning point determination, and manipulation that must operate under variable li… ▽ More Dormant pruning of fruit trees is an important task for maintaining tree health and ensuring high-quality fruit. Due to decreasing labor availability, pruning is a prime candidate for robotic automation. However, pruning also represents a uniquely difficult problem for robots, requiring robust systems for perception, pruning point determination, and manipulation that must operate under variable lighting conditions and in complex, highly unstructured environments. In this paper, we introduce a system for pruning sweet cherry trees (in a planar tree architecture called an upright fruiting offshoot configuration) that integrates various subsystems from our previous work on perception and manipulation. The resulting system is capable of operating completely autonomously and requires minimal control of the environment. We validate the performance of our system through field trials in a sweet cherry orchard, ultimately achieving a cutting success rate of 58%. Though not fully robust and requiring improvements in throughput, our system is the first to operate on fruit trees and represents a useful base platform to be improved in the future. △ Less

Submitted 14 June, 2022; originally announced June 2022.

arXiv:2202.13050 [pdf, other]

Optical flow-based branch segmentation for complex orchard environments

Authors: Alexander You, Cindy Grimm, Joseph R. Davidson

Abstract: Machine vision is a critical subsystem for enabling robots to be able to perform a variety of tasks in orchard environments. However, orchards are highly visually complex environments, and computer vision algorithms operating in them must be able to contend with variable lighting conditions and background noise. Past work on enabling deep learning algorithms to operate in these environments has ty… ▽ More Machine vision is a critical subsystem for enabling robots to be able to perform a variety of tasks in orchard environments. However, orchards are highly visually complex environments, and computer vision algorithms operating in them must be able to contend with variable lighting conditions and background noise. Past work on enabling deep learning algorithms to operate in these environments has typically required large amounts of hand-labeled data to train a deep neural network or physically controlling the conditions under which the environment is perceived. In this paper, we train a neural network system in simulation only using simulated RGB data and optical flow. This resulting neural network is able to perform foreground segmentation of branches in a busy orchard environment without additional real-world training or using any special setup or equipment beyond a standard camera. Our results show that our system is highly accurate and, when compared to a network using manually labeled RGBD data, achieves significantly more consistent and robust performance across environments that differ from the training set. △ Less

Submitted 25 February, 2022; originally announced February 2022.

arXiv:2109.13162 [pdf, other]

Precision fruit tree pruning using a learned hybrid vision/interaction controller

Authors: Alexander You, Hannah Kolano, Nidhi Parayil, Cindy Grimm, Joseph R. Davidson

Abstract: Robotic tree pruning requires highly precise manipulator control in order to accurately align a cutting implement with the desired pruning point at the correct angle. Simultaneously, the robot must avoid applying excessive force to rigid parts of the environment such as trees, support posts, and wires. In this paper, we propose a hybrid control system that uses a learned vision-based controller to… ▽ More Robotic tree pruning requires highly precise manipulator control in order to accurately align a cutting implement with the desired pruning point at the correct angle. Simultaneously, the robot must avoid applying excessive force to rigid parts of the environment such as trees, support posts, and wires. In this paper, we propose a hybrid control system that uses a learned vision-based controller to initially align the cutter with the desired pruning point, taking in images of the environment and outputting control actions. This controller is trained entirely in simulation, but transfers easily to real trees via a neural network which transforms raw images into a simplified, segmented representation. Once contact is established, the system hands over control to an interaction controller that guides the cutter pivot point to the branch while minimizing interaction forces. With this simple, yet novel, approach we demonstrate an improvement of over 30 percentage points in accuracy over a baseline controller that uses camera depth data. △ Less

Submitted 27 September, 2021; originally announced September 2021.

Comments: Submitted for consideration for the 2022 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2108.08674 [pdf, other]

doi 10.1145/3474085.3475206

Towards Controllable and Photorealistic Region-wise Image Manipulation

Authors: Ansheng You, Chenglin Zhou, Qixuan Zhang, Lan Xu

Abstract: Adaptive and flexible image editing is a desirable function of modern generative models. In this work, we present a generative model with auto-encoder architecture for per-region style manipulation. We apply a code consistency loss to enforce an explicit disentanglement between content and style latent representations, making the content and style of generated samples consistent with their corresp… ▽ More Adaptive and flexible image editing is a desirable function of modern generative models. In this work, we present a generative model with auto-encoder architecture for per-region style manipulation. We apply a code consistency loss to enforce an explicit disentanglement between content and style latent representations, making the content and style of generated samples consistent with their corresponding content and style references. The model is also constrained by a content alignment loss to ensure the foreground editing will not interfere background contents. As a result, given interested region masks provided by users, our model supports foreground region-wise style transfer. Specially, our model receives no extra annotations such as semantic labels except for self-supervision. Extensive experiments show the effectiveness of the proposed method and exhibit the flexibility of the proposed model for various applications, including region-wise style editing, latent space interpolation, cross-domain style transfer. △ Less

Submitted 19 August, 2021; originally announced August 2021.

Journal ref: ACMMM 2021

arXiv:2103.02833 [pdf, other]

Semantics-guided Skeletonization of Sweet Cherry Trees for Robotic Pruning

Authors: Alexander You, Cindy Grimm, Abhisesh Silwal, Joseph R. Davidson

Abstract: Dormant pruning for fresh market fruit trees is a relatively unexplored application of agricultural robotics for which few end-to-end systems exist. One of the biggest challenges in creating an autonomous pruning system is the need to reconstruct a model of a tree which is accurate and informative enough to be useful for deciding where to cut. One useful structure for modeling a tree is a skeleton… ▽ More Dormant pruning for fresh market fruit trees is a relatively unexplored application of agricultural robotics for which few end-to-end systems exist. One of the biggest challenges in creating an autonomous pruning system is the need to reconstruct a model of a tree which is accurate and informative enough to be useful for deciding where to cut. One useful structure for modeling a tree is a skeleton: a 1D, lightweight representation of the geometry and the topology of a tree. This skeletonization problem is an important one within the field of computer graphics, and a number of algorithms have been specifically developed for the task of modeling trees. These skeletonization algorithms have largely addressed the problem as a geometric one. In agricultural contexts, however, the parts of the tree have distinct labels, such as the trunk, supporting branches, etc. This labeled structure is important for understanding where to prune. We introduce an algorithm which produces such a labeled skeleton, using the topological and geometric priors associated with these labels to improve our skeletons. We test our skeletonization algorithm on point clouds from 29 upright fruiting offshoot (UFO) trees and demonstrate a median accuracy of 70% with respect to a human-evaluated gold standard. We also make point cloud scans of 82 UFO trees open-source to other researchers. Our work represents a significant first step towards a robust tree modeling framework which can be used in an autonomous pruning system. △ Less

Submitted 3 March, 2021; originally announced March 2021.

arXiv:2011.03308 [pdf, other]

doi 10.1109/TIP.2021.3099369

Towards Efficient Scene Understanding via Squeeze Reasoning

Authors: Xiangtai Li, Xia Li, Ansheng You, Li Zhang, Guangliang Cheng, Kuiyuan Yang, Yunhai Tong, Zhouchen Lin

Abstract: Graph-based convolutional model such as non-local block has shown to be effective for strengthening the context modeling ability in convolutional neural networks (CNNs). However, its pixel-wise computational overhead is prohibitive which renders it unsuitable for high resolution imagery. In this paper, we explore the efficiency of context graph reasoning and propose a novel framework called Squeez… ▽ More Graph-based convolutional model such as non-local block has shown to be effective for strengthening the context modeling ability in convolutional neural networks (CNNs). However, its pixel-wise computational overhead is prohibitive which renders it unsuitable for high resolution imagery. In this paper, we explore the efficiency of context graph reasoning and propose a novel framework called Squeeze Reasoning. Instead of propagating information on the spatial map, we first learn to squeeze the input feature into a channel-wise global vector and perform reasoning within the single vector where the computation cost can be significantly reduced. Specifically, we build the node graph in the vector where each node represents an abstract semantic concept. The refined feature within the same semantic category results to be consistent, which is thus beneficial for downstream tasks. We show that our approach can be modularized as an end-to-end trained block and can be easily plugged into existing networks. {Despite its simplicity and being lightweight, the proposed strategy allows us to establish the considerable results on different semantic segmentation datasets and shows significant improvements with respect to strong baselines on various other scene understanding tasks including object detection, instance segmentation and panoptic segmentation.} Code is available at \url{https://github.com/lxtGH/SFSegNets}. △ Less

Submitted 20 July, 2021; v1 submitted 6 November, 2020; originally announced November 2020.

Comments: Accepted by IEEE-TIP

arXiv:2003.12697 [pdf, other]

Semantically Multi-modal Image Synthesis

Authors: Zhen Zhu, Zhiliang Xu, Ansheng You, Xiang Bai

Abstract: In this paper, we focus on semantically multi-modal image synthesis (SMIS) task, namely, generating multi-modal images at the semantic level. Previous work seeks to use multiple class-specific generators, constraining its usage in datasets with a small number of classes. We instead propose a novel Group Decreasing Network (GroupDNet) that leverages group convolutions in the generator and progressi… ▽ More In this paper, we focus on semantically multi-modal image synthesis (SMIS) task, namely, generating multi-modal images at the semantic level. Previous work seeks to use multiple class-specific generators, constraining its usage in datasets with a small number of classes. We instead propose a novel Group Decreasing Network (GroupDNet) that leverages group convolutions in the generator and progressively decreases the group numbers of the convolutions in the decoder. Consequently, GroupDNet is armed with much more controllability on translating semantic labels to natural images and has plausible high-quality yields for datasets with many classes. Experiments on several challenging datasets demonstrate the superiority of GroupDNet on performing the SMIS task. We also show that GroupDNet is capable of performing a wide range of interesting synthesis applications. Codes and models are available at: https://github.com/Seanseattle/SMIS. △ Less

Submitted 2 April, 2020; v1 submitted 28 March, 2020; originally announced March 2020.

Comments: To appear in CVPR 2020

arXiv:2002.10120 [pdf, other]

Semantic Flow for Fast and Accurate Scene Parsing

Authors: Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, Yunhai Tong

Abstract: In this paper, we focus on designing effective method for fast and accurate scene parsing. A common practice to improve the performance is to attain high resolution feature maps with strong semantic representation. Two strategies are widely used -- atrous convolutions and feature pyramid fusion, are either computation intensive or ineffective. Inspired by the Optical Flow for motion alignment betw… ▽ More In this paper, we focus on designing effective method for fast and accurate scene parsing. A common practice to improve the performance is to attain high resolution feature maps with strong semantic representation. Two strategies are widely used -- atrous convolutions and feature pyramid fusion, are either computation intensive or ineffective. Inspired by the Optical Flow for motion alignment between adjacent video frames, we propose a Flow Alignment Module (FAM) to learn Semantic Flow between feature maps of adjacent levels, and broadcast high-level features to high resolution features effectively and efficiently. Furthermore, integrating our module to a common feature pyramid structure exhibits superior performance over other real-time methods even on light-weight backbone networks, such as ResNet-18. Extensive experiments are conducted on several challenging datasets, including Cityscapes, PASCAL Context, ADE20K and CamVid. Especially, our network is the first to achieve 80.4\% mIoU on Cityscapes with a frame rate of 26 FPS. The code is available at \url{https://github.com/lxtGH/SFSegNets}. △ Less

Submitted 29 March, 2021; v1 submitted 24 February, 2020; originally announced February 2020.

Comments: accepted by ECCV 2020(oral)

arXiv:1909.07229 [pdf, other]

Global Aggregation then Local Distribution in Fully Convolutional Networks

Authors: Xiangtai Li, Li Zhang, Ansheng You, Maoke Yang, Kuiyuan Yang, Yunhai Tong

Abstract: It has been widely proven that modelling long-range dependencies in fully convolutional networks (FCNs) via global aggregation modules is critical for complex scene understanding tasks such as semantic segmentation and object detection. However, global aggregation is often dominated by features of large patterns and tends to oversmooth regions that contain small patterns (e.g., boundaries and smal… ▽ More It has been widely proven that modelling long-range dependencies in fully convolutional networks (FCNs) via global aggregation modules is critical for complex scene understanding tasks such as semantic segmentation and object detection. However, global aggregation is often dominated by features of large patterns and tends to oversmooth regions that contain small patterns (e.g., boundaries and small objects). To resolve this problem, we propose to first use \emph{Global Aggregation} and then \emph{Local Distribution}, which is called GALD, where long-range dependencies are more confidently used inside large pattern regions and vice versa. The size of each pattern at each position is estimated in the network as a per-channel mask map. GALD is end-to-end trainable and can be easily plugged into existing FCNs with various global aggregation modules for a wide range of vision tasks, and consistently improves the performance of state-of-the-art object detection and instance segmentation approaches. In particular, GALD used in semantic segmentation achieves new state-of-the-art performance on Cityscapes test set with mIoU 83.3\%. Code is available at: \url{https://github.com/lxtGH/GALD-Net} △ Less

Submitted 16 September, 2019; originally announced September 2019.

Comments: accepted at BMVC 2019

arXiv:1907.11830 [pdf, other]

Reprojection R-CNN: A Fast and Accurate Object Detector for 360° Images

Authors: Pengyu Zhao, Ansheng You, Yuanxing Zhang, Jiaying Liu, Kaigui Bian, Yunhai Tong

Abstract: 360° images are usually represented in either equirectangular projection (ERP) or multiple perspective projections. Different from the flat 2D images, the detection task is challenging for 360° images due to the distortion of ERP and the inefficiency of perspective projections. However, existing methods mostly focus on one of the above representations instead of both, leading to limited detection… ▽ More 360° images are usually represented in either equirectangular projection (ERP) or multiple perspective projections. Different from the flat 2D images, the detection task is challenging for 360° images due to the distortion of ERP and the inefficiency of perspective projections. However, existing methods mostly focus on one of the above representations instead of both, leading to limited detection performance. Moreover, the lack of appropriate bounding-box annotations as well as the annotated datasets further increases the difficulties of the detection task. In this paper, we present a standard object detection framework for 360° images. Specifically, we adapt the terminologies of the traditional object detection task to the omnidirectional scenarios, and propose a novel two-stage object detector, i.e., Reprojection R-CNN by combining both ERP and perspective projection. Owing to the omnidirectional field-of-view of ERP, Reprojection R-CNN first generates coarse region proposals efficiently by a distortion-aware spherical region proposal network. Then, it leverages the distortion-free perspective projection and refines the proposed regions by a novel reprojection network. We construct two novel synthetic datasets for training and evaluation. Experiments reveal that Reprojection R-CNN outperforms the previous state-of-the-art methods on the mAP metric. In addition, the proposed detector could run at 178ms per image in the panoramic datasets, which implies its practicability in real-world applications. △ Less

Submitted 26 July, 2019; originally announced July 2019.

Comments: 10 pages, 7 figures

Showing 1–12 of 12 results for author: You, A