-
Attenuation-Aware Weighted Optical Flow with Medium Transmission Map for Learning-based Visual Odometry in Underwater terrain
Authors:
Bach Nguyen Gia,
Chanh Minh Tran,
Kamioka Eiji,
Tan Phan Xuan
Abstract:
This paper addresses the challenge of improving learning-based monocular visual odometry (VO) in underwater environments by integrating principles of underwater optical imaging to manipulate optical flow estimation. Leveraging the inherent properties of underwater imaging, the novel wflow-TartanVO is introduced, enhancing the accuracy of VO systems for autonomous underwater vehicles (AUVs). The pr…
▽ More
This paper addresses the challenge of improving learning-based monocular visual odometry (VO) in underwater environments by integrating principles of underwater optical imaging to manipulate optical flow estimation. Leveraging the inherent properties of underwater imaging, the novel wflow-TartanVO is introduced, enhancing the accuracy of VO systems for autonomous underwater vehicles (AUVs). The proposed method utilizes a normalized medium transmission map as a weight map to adjust the estimated optical flow for emphasizing regions with lower degradation and suppressing uncertain regions affected by underwater light scattering and absorption. wflow-TartanVO does not require fine-tuning of pre-trained VO models, thus promoting its adaptability to different environments and camera models. Evaluation of different real-world underwater datasets demonstrates the outperformance of wflow-TartanVO over baseline VO methods, as evidenced by the considerably reduced Absolute Trajectory Error (ATE). The implementation code is available at: https://github.com/bachzz/wflow-TartanVO
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
TF-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking
Authors:
Thuc Nguyen-Quang,
Minh-Triet Tran
Abstract:
Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences. The emergence of data sets that emphasize robust reidentification, such as DanceTrack, has highlighted the need for effective solutions. While memory-based approaches have shown promise, they often suffer from high computatio…
▽ More
Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences. The emergence of data sets that emphasize robust reidentification, such as DanceTrack, has highlighted the need for effective solutions. While memory-based approaches have shown promise, they often suffer from high computational complexity and memory usage due to storing feature at every single frame. In this paper, we propose a novel memory-based approach that selectively stores critical features based on object motion and overlapping awareness, aiming to enhance efficiency while minimizing redundancy. As a result, our method not only store longer temporal information with limited number of stored features in the memory, but also diversify states of a particular object to enhance the association performance. Our approach significantly improves over MOTRv2 in the DanceTrack test set, demonstrating a gain of 2.0% AssA score and 2.1% in IDF1 score.
△ Less
Submitted 15 July, 2024; v1 submitted 5 July, 2024;
originally announced July 2024.
-
Koopman based trajectory model and computation offloading for high mobility paradigm in ISAC enabled IoT system
Authors:
Minh-Tuan Tran
Abstract:
User experience on mobile devices is constrained by limited battery capacity and processing power, but 6G technology advancements are diving rapidly into mobile technical evolution. Mobile edge computing (MEC) offers a solution, offloading computationally intensive tasks to edge cloud servers, reducing battery drain compared to local processing. The upcoming integrated sensing and communication in…
▽ More
User experience on mobile devices is constrained by limited battery capacity and processing power, but 6G technology advancements are diving rapidly into mobile technical evolution. Mobile edge computing (MEC) offers a solution, offloading computationally intensive tasks to edge cloud servers, reducing battery drain compared to local processing. The upcoming integrated sensing and communication in mobile communication may improve the trajectory prediction and processing delays. This study proposes a greedy resource allocation optimization strategy for multi-user networks to minimize aggregate energy usage. Numerical results show potential improvement at 33\% for every 1000 iteration. Addressing prediction model division and velocity accuracy issues is crucial for better results. A plan for further improvement and achieving objectives is outlined for the upcoming work phase.
△ Less
Submitted 28 June, 2024;
originally announced June 2024.
-
SAM-EG: Segment Anything Model with Egde Guidance framework for efficient Polyp Segmentation
Authors:
Quoc-Huy Trinh,
Hai-Dang Nguyen,
Bao-Tram Nguyen Ngoc,
Debesh Jha,
Ulas Bagci,
Minh-Triet Tran
Abstract:
Polyp segmentation, a critical concern in medical imaging, has prompted numerous proposed methods aimed at enhancing the quality of segmented masks. While current state-of-the-art techniques produce impressive results, the size and computational cost of these models pose challenges for practical industry applications. Recently, the Segment Anything Model (SAM) has been proposed as a robust foundat…
▽ More
Polyp segmentation, a critical concern in medical imaging, has prompted numerous proposed methods aimed at enhancing the quality of segmented masks. While current state-of-the-art techniques produce impressive results, the size and computational cost of these models pose challenges for practical industry applications. Recently, the Segment Anything Model (SAM) has been proposed as a robust foundation model, showing promise for adaptation to medical image segmentation. Inspired by this concept, we propose SAM-EG, a framework that guides small segmentation models for polyp segmentation to address the computation cost challenge. Additionally, in this study, we introduce the Edge Guiding module, which integrates edge information into image features to assist the segmentation model in addressing boundary issues from current segmentation model in this task. Through extensive experiments, our small models showcase their efficacy by achieving competitive results with state-of-the-art methods, offering a promising approach to developing compact models with high accuracy for polyp segmentation and in the broader field of medical imaging.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
What is in the Chrome Web Store? Investigating Security-Noteworthy Browser Extensions
Authors:
Sheryl Hsu,
Manda Tran,
Aurore Fass
Abstract:
This paper is the first attempt at providing a holistic view of the Chrome Web Store (CWS). We leverage historical data provided by ChromeStats to study global trends in the CWS and security implications. We first highlight the extremely short life cycles of extensions: roughly 60% of extensions stay in the CWS for one year. Second, we define and show that Security-Noteworthy Extensions (SNE) are…
▽ More
This paper is the first attempt at providing a holistic view of the Chrome Web Store (CWS). We leverage historical data provided by ChromeStats to study global trends in the CWS and security implications. We first highlight the extremely short life cycles of extensions: roughly 60% of extensions stay in the CWS for one year. Second, we define and show that Security-Noteworthy Extensions (SNE) are a significant issue: they pervade the CWS for years and affect almost 350 million users. Third, we identify clusters of extensions with a similar code base. We discuss how code similarity techniques could be used to flag suspicious extensions. By developing an approach to extract URLs from extensions' comments, we show that extensions reuse code snippets from public repositories or forums, leading to the propagation of dated code and vulnerabilities. Finally, we underline a critical lack of maintenance in the CWS: 60% of the extensions in the CWS have never been updated; half of the extensions known to be vulnerable are still in the CWS and still vulnerable 2 years after disclosure; a third of extensions use vulnerable library versions. We believe that these issues should be widely known in order to pave the way for a more secure CWS.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Designing Interactions with Autonomous Physical Systems
Authors:
Marius Hoggenmueller,
Tram Thi Minh Tran,
Luke Hespanhol,
Martin Tomitsch
Abstract:
In this position paper, we present a collection of four different prototyping approaches which we have developed and applied to prototype and evaluate interfaces for and interactions around autonomous physical systems. Further, we provide a classification of our approaches aiming to support other researchers and designers in choosing appropriate prototyping platforms and representations.
In this position paper, we present a collection of four different prototyping approaches which we have developed and applied to prototype and evaluate interfaces for and interactions around autonomous physical systems. Further, we provide a classification of our approaches aiming to support other researchers and designers in choosing appropriate prototyping platforms and representations.
△ Less
Submitted 16 June, 2024;
originally announced June 2024.
-
TabularFM: An Open Framework For Tabular Foundational Models
Authors:
Quan M. Tran,
Suong N. Hoang,
Lam M. Nguyen,
Dzung Phan,
Hoang Thanh Lam
Abstract:
Foundational models (FMs), pretrained on extensive datasets using self-supervised techniques, are capable of learning generalized patterns from large amounts of data. This reduces the need for extensive labeled datasets for each new task, saving both time and resources by leveraging the broad knowledge base established during pretraining. Most research on FMs has primarily focused on unstructured…
▽ More
Foundational models (FMs), pretrained on extensive datasets using self-supervised techniques, are capable of learning generalized patterns from large amounts of data. This reduces the need for extensive labeled datasets for each new task, saving both time and resources by leveraging the broad knowledge base established during pretraining. Most research on FMs has primarily focused on unstructured data, such as text and images, or semi-structured data, like time-series. However, there has been limited attention to structured data, such as tabular data, which, despite its prevalence, remains under-studied due to a lack of clean datasets and insufficient research on the transferability of FMs for various tabular data tasks. In response to this gap, we introduce a framework called TabularFM, which incorporates state-of-the-art methods for developing FMs specifically for tabular data. This includes variations of neural architectures such as GANs, VAEs, and Transformers. We have curated a million of tabular datasets and released cleaned versions to facilitate the development of tabular FMs. We pretrained FMs on this curated data, benchmarked various learning methods on these datasets, and released the pretrained models along with leaderboards for future comparative studies. Our fully open-sourced system provides a comprehensive analysis of the transferability of tabular FMs. By releasing these datasets, pretrained models, and leaderboards, we aim to enhance the validity and usability of tabular FMs in the near future.
△ Less
Submitted 17 June, 2024; v1 submitted 14 June, 2024;
originally announced June 2024.
-
Context-Based Interface Prototyping: Understanding the Effect of Prototype Representation on User Feedback
Authors:
Marius Hoggenmueller,
Martin Tomitsch,
Luke Hespanhol,
Tram Thi Minh Tran,
Stewart Worrall,
Eduardo Nebot
Abstract:
The rise of autonomous systems in cities, such as automated vehicles (AVs), requires new approaches for prototyping and evaluating how people interact with those systems through context-based user interfaces, such as external human-machine interfaces (eHMIs). In this paper, we present a comparative study of three prototype representations (real-world VR, computer-generated VR, real-world video) of…
▽ More
The rise of autonomous systems in cities, such as automated vehicles (AVs), requires new approaches for prototyping and evaluating how people interact with those systems through context-based user interfaces, such as external human-machine interfaces (eHMIs). In this paper, we present a comparative study of three prototype representations (real-world VR, computer-generated VR, real-world video) of an eHMI in a mixed-methods study with 42 participants. Quantitative results show that while the real-world VR representation results in higher sense of presence, no significant differences in user experience and trust towards the AV itself were found. However, interview data shows that participants focused on different experiential and perceptual aspects in each of the prototype representations. These differences are linked to spatial awareness and perceived realism of the AV behaviour and its context, affecting in turn how participants assess trust and the eHMI. The paper offers guidelines for prototyping and evaluating context-based interfaces through simulations.
△ Less
Submitted 12 June, 2024;
originally announced June 2024.
-
HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model
Authors:
Khoa Vo,
Thinh Phan,
Kashu Yamazaki,
Minh Tran,
Ngan Le
Abstract:
Current video-language models (VLMs) rely extensively on instance-level alignment between video and language modalities, which presents two major limitations: (1) visual reasoning disobeys the natural perception that humans do in first-person perspective, leading to a lack of reasoning interpretation; and (2) learning is limited in capturing inherent fine-grained relationships between two modaliti…
▽ More
Current video-language models (VLMs) rely extensively on instance-level alignment between video and language modalities, which presents two major limitations: (1) visual reasoning disobeys the natural perception that humans do in first-person perspective, leading to a lack of reasoning interpretation; and (2) learning is limited in capturing inherent fine-grained relationships between two modalities.
In this paper, we take an inspiration from human perception and explore a compositional approach for egocentric video representation. We introduce HENASY (Hierarchical ENtities ASsemblY), which includes a spatiotemporal token grouping mechanism to explicitly assemble dynamically evolving scene entities through time and model their relationship for video representation. By leveraging compositional structure understanding, HENASY possesses strong interpretability via visual grounding with free-form text queries. We further explore a suite of multi-grained contrastive losses to facilitate entity-centric understandings. This comprises three alignment types: video-narration, noun-entity, verb-entities alignments.
Our method demonstrates strong interpretability in both quantitative and qualitative experiments; while maintaining competitive performances on five downstream tasks via zero-shot transfer or as video/text representation, including video/text retrieval, action recognition, multi-choice query, natural language query, and moments query.
△ Less
Submitted 6 June, 2024; v1 submitted 1 June, 2024;
originally announced June 2024.
-
SarcNet: A Novel AI-based Framework to Automatically Analyze and Score Sarcomere Organizations in Fluorescently Tagged hiPSC-CMs
Authors:
Huyen Le,
Khiet Dang,
Tien Lai,
Nhung Nguyen,
Mai Tran,
Hieu Pham
Abstract:
Quantifying sarcomere structure organization in human-induced pluripotent stem cell-derived cardiomyocytes (hiPSC-CMs) is crucial for understanding cardiac disease pathology, improving drug screening, and advancing regenerative medicine. Traditional methods, such as manual annotation and Fourier transform analysis, are labor-intensive, error-prone, and lack high-throughput capabilities. In this st…
▽ More
Quantifying sarcomere structure organization in human-induced pluripotent stem cell-derived cardiomyocytes (hiPSC-CMs) is crucial for understanding cardiac disease pathology, improving drug screening, and advancing regenerative medicine. Traditional methods, such as manual annotation and Fourier transform analysis, are labor-intensive, error-prone, and lack high-throughput capabilities. In this study, we present a novel deep learning-based framework that leverages cell images and integrates cell features to automatically evaluate the sarcomere structure of hiPSC-CMs from the onset of differentiation. This framework overcomes the limitations of traditional methods through automated, high-throughput analysis, providing consistent, reliable results while accurately detecting complex sarcomere patterns across diverse samples. The proposed framework contains the SarcNet, a linear layers-added ResNet-18 module, to output a continuous score ranging from one to five that captures the level of sarcomere structure organization. It is trained and validated on an open-source dataset of hiPSC-CMs images with the endogenously GFP-tagged alpha-actinin-2 structure developed by the Allen Institute for Cell Science (AICS). SarcNet achieves a Spearman correlation of 0.831 with expert evaluations, demonstrating superior performance and an improvement of 0.075 over the current state-of-the-art approach, which uses Linear Regression. Our results also show a consistent pattern of increasing organization from day 18 to day 32 of differentiation, aligning with expert evaluations. By integrating the quantitative features calculated directly from the images with the visual features learned during the deep learning model, our framework offers a more comprehensive and accurate assessment, thereby enhancing the further utility of hiPSC-CMs in medical research and therapy development.
△ Less
Submitted 28 May, 2024;
originally announced May 2024.
-
ShapeFormer: Shapelet Transformer for Multivariate Time Series Classification
Authors:
Xuan-May Le,
Ling Luo,
Uwe Aickelin,
Minh-Tuan Tran
Abstract:
Multivariate time series classification (MTSC) has attracted significant research attention due to its diverse real-world applications. Recently, exploiting transformers for MTSC has achieved state-of-the-art performance. However, existing methods focus on generic features, providing a comprehensive understanding of data, but they ignore class-specific features crucial for learning the representat…
▽ More
Multivariate time series classification (MTSC) has attracted significant research attention due to its diverse real-world applications. Recently, exploiting transformers for MTSC has achieved state-of-the-art performance. However, existing methods focus on generic features, providing a comprehensive understanding of data, but they ignore class-specific features crucial for learning the representative characteristics of each class. This leads to poor performance in the case of imbalanced datasets or datasets with similar overall patterns but differing in minor class-specific details. In this paper, we propose a novel Shapelet Transformer (ShapeFormer), which comprises class-specific and generic transformer modules to capture both of these features. In the class-specific module, we introduce the discovery method to extract the discriminative subsequences of each class (i.e. shapelets) from the training set. We then propose a Shapelet Filter to learn the difference features between these shapelets and the input time series. We found that the difference feature for each shapelet contains important class-specific features, as it shows a significant distinction between its class and others. In the generic module, convolution filters are used to extract generic features that contain information to distinguish among all classes. For each module, we employ the transformer encoder to capture the correlation between their features. As a result, the combination of two transformer modules allows our model to exploit the power of both types of features, thereby enhancing the classification performance. Our experiments on 30 UEA MTSC datasets demonstrate that ShapeFormer has achieved the highest accuracy ranking compared to state-of-the-art methods. The code is available at https://github.com/xuanmay2701/shapeformer.
△ Less
Submitted 23 May, 2024;
originally announced May 2024.
-
S3Former: Self-supervised High-resolution Transformer for Solar PV Profiling
Authors:
Minh Tran,
Adrian De Luis,
Haitao Liao,
Ying Huang,
Roy McCann,
Alan Mantooth,
Jack Cothren,
Ngan Le
Abstract:
As the impact of climate change escalates, the global necessity to transition to sustainable energy sources becomes increasingly evident. Renewable energies have emerged as a viable solution for users, with Photovoltaic energy being a favored choice for small installations due to its reliability and efficiency. Accurate mapping of PV installations is crucial for understanding the extension of its…
▽ More
As the impact of climate change escalates, the global necessity to transition to sustainable energy sources becomes increasingly evident. Renewable energies have emerged as a viable solution for users, with Photovoltaic energy being a favored choice for small installations due to its reliability and efficiency. Accurate mapping of PV installations is crucial for understanding the extension of its adoption and informing energy policy. To meet this need, we introduce S3Former, designed to segment solar panels from aerial imagery and provide size and location information critical for analyzing the impact of such installations on the grid. Solar panel identification is challenging due to factors such as varying weather conditions, roof characteristics, Ground Sampling Distance variations and lack of appropriate initialization weights for optimized training. To tackle these complexities, S3Former features a Masked Attention Mask Transformer incorporating a self-supervised learning pretrained backbone. Specifically, our model leverages low-level and high-level features extracted from the backbone and incorporates an instance query mechanism incorporated on the Transformer architecture to enhance the localization of solar PV installations. We introduce a self-supervised learning phase (pretext task) to improve the initialization weights on the backbone of S3Former. We evaluated S3Former using diverse datasets, demonstrate improvement state-of-the-art models.
△ Less
Submitted 7 May, 2024;
originally announced May 2024.
-
Wireless Information and Energy Transfer in the Era of 6G Communications
Authors:
Constantinos Psomas,
Konstantinos Ntougias,
Nikita Shanin,
Dongfang Xu,
Kenneth MacSporran Mayer,
Nguyen Minh Tran,
Laura Cottatellucci,
Kae Won Choi,
Dong In Kim,
Robert Schober,
Ioannis Krikidis
Abstract:
Wireless information and energy transfer (WIET) represents an emerging paradigm which employs controllable transmission of radio-frequency signals for the dual purpose of data communication and wireless charging. As such, WIET is widely regarded as an enabler of envisioned 6G use cases that rely on energy-sustainable Internet-of-Things (IoT) networks, such as smart cities and smart grids. Meeting…
▽ More
Wireless information and energy transfer (WIET) represents an emerging paradigm which employs controllable transmission of radio-frequency signals for the dual purpose of data communication and wireless charging. As such, WIET is widely regarded as an enabler of envisioned 6G use cases that rely on energy-sustainable Internet-of-Things (IoT) networks, such as smart cities and smart grids. Meeting the quality-of-service demands of WIET, in terms of both data transfer and power delivery, requires effective co-design of the information and energy signals. In this article, we present the main principles and design aspects of WIET, focusing on its integration in 6G networks. First, we discuss how conventional communication notions such as resource allocation and waveform design need to be revisited in the context of WIET. Next, we consider various candidate 6G technologies that can boost WIET efficiency, namely, holographic multiple-input multiple-output, near-field beamforming, terahertz communication, intelligent reflecting surfaces (IRSs), and reconfigurable (fluid) antenna arrays. We introduce respective WIET design methods, analyze the promising performance gains of these WIET systems, and discuss challenges, open issues, and future research directions. Finally, a near-field energy beamforming scheme and a power-based IRS beamforming algorithm are experimentally validated using a wireless energy transfer testbed. The vision of WIET in communication systems has been gaining momentum in recent years, with constant progress with respect to theoretical but also practical aspects. The comprehensive overview of the state of the art of WIET presented in this paper highlights the potentials of WIET systems as well as their overall benefits in 6G networks.
△ Less
Submitted 16 May, 2024; v1 submitted 29 April, 2024;
originally announced April 2024.
-
CarcassFormer: An End-to-end Transformer-based Framework for Simultaneous Localization, Segmentation and Classification of Poultry Carcass Defect
Authors:
Minh Tran,
Sang Truong,
Arthur F. A. Fernandes,
Michael T. Kidd,
Ngan Le
Abstract:
In the food industry, assessing the quality of poultry carcasses during processing is a crucial step. This study proposes an effective approach for automating the assessment of carcass quality without requiring skilled labor or inspector involvement. The proposed system is based on machine learning (ML) and computer vision (CV) techniques, enabling automated defect detection and carcass quality as…
▽ More
In the food industry, assessing the quality of poultry carcasses during processing is a crucial step. This study proposes an effective approach for automating the assessment of carcass quality without requiring skilled labor or inspector involvement. The proposed system is based on machine learning (ML) and computer vision (CV) techniques, enabling automated defect detection and carcass quality assessment. To this end, an end-to-end framework called CarcassFormer is introduced. It is built upon a Transformer-based architecture designed to effectively extract visual representations while simultaneously detecting, segmenting, and classifying poultry carcass defects. Our proposed framework is capable of analyzing imperfections resulting from production and transport welfare issues, as well as processing plant stunner, scalder, picker, and other equipment malfunctions. To benchmark the framework, a dataset of 7,321 images was initially acquired, which contained both single and multiple carcasses per image. In this study, the performance of the CarcassFormer system is compared with other state-of-the-art (SOTA) approaches for both classification, detection, and segmentation tasks. Through extensive quantitative experiments, our framework consistently outperforms existing methods, demonstrating remarkable improvements across various evaluation metrics such as AP, AP@50, and AP@75. Furthermore, the qualitative results highlight the strengths of CarcassFormer in capturing fine details, including feathers, and accurately localizing and segmenting carcasses with high precision. To facilitate further research and collaboration, the pre-trained model and source code of CarcassFormer is available for research purposes at: \url{https://github.com/UARK-AICV/CarcassFormer}.
△ Less
Submitted 17 April, 2024;
originally announced April 2024.
-
Improving Referring Image Segmentation using Vision-Aware Text Features
Authors:
Hai Nguyen-Truong,
E-Ro Nguyen,
Tuan-Anh Vu,
Minh-Triet Tran,
Binh-Son Hua,
Sai-Kit Yeung
Abstract:
Referring image segmentation is a challenging task that involves generating pixel-wise segmentation masks based on natural language descriptions. Existing methods have relied mostly on visual features to generate the segmentation masks while treating text features as supporting components. This over-reliance on visual features can lead to suboptimal results, especially in complex scenarios where t…
▽ More
Referring image segmentation is a challenging task that involves generating pixel-wise segmentation masks based on natural language descriptions. Existing methods have relied mostly on visual features to generate the segmentation masks while treating text features as supporting components. This over-reliance on visual features can lead to suboptimal results, especially in complex scenarios where text prompts are ambiguous or context-dependent. To overcome these challenges, we present a novel framework VATEX to improve referring image segmentation by enhancing object and context understanding with Vision-Aware Text Feature. Our method involves using CLIP to derive a CLIP Prior that integrates an object-centric visual heatmap with text description, which can be used as the initial query in DETR-based architecture for the segmentation task. Furthermore, by observing that there are multiple ways to describe an instance in an image, we enforce feature similarity between text variations referring to the same visual input by two components: a novel Contextual Multimodal Decoder that turns text embeddings into vision-aware text features, and a Meaning Consistency Constraint to ensure further the coherent and consistent interpretation of language expressions with the context understanding obtained from the image. Our method achieves a significant performance improvement on three benchmark datasets RefCOCO, RefCOCO+ and G-Ref. Code is available at: https://nero1342.github.io/VATEX\_RIS.
△ Less
Submitted 12 April, 2024;
originally announced April 2024.
-
Enhancing Video Summarization with Context Awareness
Authors:
Hai-Dang Huynh-Lam,
Ngoc-Phuong Ho-Thi,
Minh-Triet Tran,
Trung-Nghia Le
Abstract:
Video summarization is a crucial research area that aims to efficiently browse and retrieve relevant information from the vast amount of video content available today. With the exponential growth of multimedia data, the ability to extract meaningful representations from videos has become essential. Video summarization techniques automatically generate concise summaries by selecting keyframes, shot…
▽ More
Video summarization is a crucial research area that aims to efficiently browse and retrieve relevant information from the vast amount of video content available today. With the exponential growth of multimedia data, the ability to extract meaningful representations from videos has become essential. Video summarization techniques automatically generate concise summaries by selecting keyframes, shots, or segments that capture the video's essence. This process improves the efficiency and accuracy of various applications, including video surveillance, education, entertainment, and social media. Despite the importance of video summarization, there is a lack of diverse and representative datasets, hindering comprehensive evaluation and benchmarking of algorithms. Existing evaluation metrics also fail to fully capture the complexities of video summarization, limiting accurate algorithm assessment and hindering the field's progress. To overcome data scarcity challenges and improve evaluation, we propose an unsupervised approach that leverages video data structure and information for generating informative summaries. By moving away from fixed annotations, our framework can produce representative summaries effectively. Moreover, we introduce an innovative evaluation pipeline tailored specifically for video summarization. Human participants are involved in the evaluation, comparing our generated summaries to ground truth summaries and assessing their informativeness. This human-centric approach provides valuable insights into the effectiveness of our proposed techniques. Experimental results demonstrate that our training-free framework outperforms existing unsupervised approaches and achieves competitive results compared to state-of-the-art supervised methods.
△ Less
Submitted 6 April, 2024;
originally announced April 2024.
-
Cluster-based Video Summarization with Temporal Context Awareness
Authors:
Hai-Dang Huynh-Lam,
Ngoc-Phuong Ho-Thi,
Minh-Triet Tran,
Trung-Nghia Le
Abstract:
In this paper, we present TAC-SUM, a novel and efficient training-free approach for video summarization that addresses the limitations of existing cluster-based models by incorporating temporal context. Our method partitions the input video into temporally consecutive segments with clustering information, enabling the injection of temporal awareness into the clustering process, setting it apart fr…
▽ More
In this paper, we present TAC-SUM, a novel and efficient training-free approach for video summarization that addresses the limitations of existing cluster-based models by incorporating temporal context. Our method partitions the input video into temporally consecutive segments with clustering information, enabling the injection of temporal awareness into the clustering process, setting it apart from prior cluster-based summarization methods. The resulting temporal-aware clusters are then utilized to compute the final summary, using simple rules for keyframe selection and frame importance scoring. Experimental results on the SumMe dataset demonstrate the effectiveness of our proposed approach, outperforming existing unsupervised methods and achieving comparable performance to state-of-the-art supervised summarization techniques. Our source code is available for reference at \url{https://github.com/hcmus-thesis-gulu/TAC-SUM}.
△ Less
Submitted 6 April, 2024;
originally announced April 2024.
-
Ensemble Learning for Vietnamese Scene Text Spotting in Urban Environments
Authors:
Hieu Nguyen,
Cong-Hoang Ta,
Phuong-Thuy Le-Nguyen,
Minh-Triet Tran,
Trung-Nghia Le
Abstract:
This paper presents a simple yet efficient ensemble learning framework for Vietnamese scene text spotting. Leveraging the power of ensemble learning, which combines multiple models to yield more accurate predictions, our approach aims to significantly enhance the performance of scene text spotting in challenging urban settings. Through experimental evaluations on the VinText dataset, our proposed…
▽ More
This paper presents a simple yet efficient ensemble learning framework for Vietnamese scene text spotting. Leveraging the power of ensemble learning, which combines multiple models to yield more accurate predictions, our approach aims to significantly enhance the performance of scene text spotting in challenging urban settings. Through experimental evaluations on the VinText dataset, our proposed method achieves a significant improvement in accuracy compared to existing methods with an impressive accuracy of 5%. These results unequivocally demonstrate the efficacy of ensemble learning in the context of Vietnamese scene text spotting in urban environments, highlighting its potential for real world applications, such as text detection and recognition in urban signage, advertisements, and various text-rich urban scenes.
△ Less
Submitted 31 March, 2024;
originally announced April 2024.
-
Exploring Holistic HMI Design for Automated Vehicles: Insights from a Participatory Workshop to Bridge In-Vehicle and External Communication
Authors:
Haoyu Dong,
Tram Thi Minh Tran,
Rutger Verstegen,
Silvia Cazacu,
Ruolin Gao,
Marius Hoggenmüller,
Debargha Dey,
Mervyn Franssen,
Markus Sasalovici,
Pavlo Bazilinskyy,
Marieke Martens
Abstract:
Human-Machine Interfaces (HMIs) for automated vehicles (AVs) are typically divided into two categories: internal HMIs for interactions within the vehicle, and external HMIs for communication with other road users. In this work, we examine the prospects of bridging these two seemingly distinct domains. Through a participatory workshop with automotive user interface researchers and practitioners, we…
▽ More
Human-Machine Interfaces (HMIs) for automated vehicles (AVs) are typically divided into two categories: internal HMIs for interactions within the vehicle, and external HMIs for communication with other road users. In this work, we examine the prospects of bridging these two seemingly distinct domains. Through a participatory workshop with automotive user interface researchers and practitioners, we facilitated a critical exploration of holistic HMI design by having workshop participants collaboratively develop interaction scenarios involving AVs, in-vehicle users, and external road users. The discussion offers insights into the escalation of interface elements as an HMI design strategy, the direct interactions between different users, and an expanded understanding of holistic HMI design. This work reflects a collaborative effort to understand the practical aspects of this holistic design approach, offering new perspectives and encouraging further investigation into this underexplored aspect of automotive user interfaces.
△ Less
Submitted 28 March, 2024;
originally announced March 2024.
-
Text-Enhanced Data-free Approach for Federated Class-Incremental Learning
Authors:
Minh-Tuan Tran,
Trung Le,
Xuan-May Le,
Mehrtash Harandi,
Dinh Phung
Abstract:
Federated Class-Incremental Learning (FCIL) is an underexplored yet pivotal issue, involving the dynamic addition of new classes in the context of federated learning. In this field, Data-Free Knowledge Transfer (DFKT) plays a crucial role in addressing catastrophic forgetting and data privacy problems. However, prior approaches lack the crucial synergy between DFKT and the model training phases, c…
▽ More
Federated Class-Incremental Learning (FCIL) is an underexplored yet pivotal issue, involving the dynamic addition of new classes in the context of federated learning. In this field, Data-Free Knowledge Transfer (DFKT) plays a crucial role in addressing catastrophic forgetting and data privacy problems. However, prior approaches lack the crucial synergy between DFKT and the model training phases, causing DFKT to encounter difficulties in generating high-quality data from a non-anchored latent space of the old task model. In this paper, we introduce LANDER (Label Text Centered Data-Free Knowledge Transfer) to address this issue by utilizing label text embeddings (LTE) produced by pretrained language models. Specifically, during the model training phase, our approach treats LTE as anchor points and constrains the feature embeddings of corresponding training samples around them, enriching the surrounding area with more meaningful information. In the DFKT phase, by using these LTE anchors, LANDER can synthesize more meaningful samples, thereby effectively addressing the forgetting problem. Additionally, instead of tightly constraining embeddings toward the anchor, the Bounding Loss is introduced to encourage sample embeddings to remain flexible within a defined radius. This approach preserves the natural differences in sample embeddings and mitigates the embedding overlap caused by heterogeneous federated settings. Extensive experiments conducted on CIFAR100, Tiny-ImageNet, and ImageNet demonstrate that LANDER significantly outperforms previous methods and achieves state-of-the-art performance in FCIL. The code is available at https://github.com/tmtuan1307/lander.
△ Less
Submitted 20 March, 2024;
originally announced March 2024.
-
Holistic HMI Design for Automated Vehicles: Bridging In-Vehicle and External Communication
Authors:
Haoyu Dong,
Tram Thi Minh Tran,
Pavlo Bazilinskyy,
Marius Hoggenmüller,
Debargha Dey,
Silvia Cazacu,
Mervyn Franssen,
Ruolin Gao
Abstract:
As the field of automated vehicles (AVs) advances, it has become increasingly critical to develop human-machine interfaces (HMI) for both internal and external communication. Critical dialogue is emerging around the potential necessity for a holistic approach to HMI designs, which promotes the integration of both in-vehicle user and external road user perspectives. This approach aims to create a u…
▽ More
As the field of automated vehicles (AVs) advances, it has become increasingly critical to develop human-machine interfaces (HMI) for both internal and external communication. Critical dialogue is emerging around the potential necessity for a holistic approach to HMI designs, which promotes the integration of both in-vehicle user and external road user perspectives. This approach aims to create a unified and coherent experience for different stakeholders interacting with AVs. This workshop seeks to bring together designers, engineers, researchers, and other stakeholders to delve into relevant use cases, exploring the potential advantages and challenges of this approach. The insights generated from this workshop aim to inform further design and research in the development of coherent HMIs for AVs, ultimately for more seamless integration of AVs into existing traffic.
△ Less
Submitted 17 March, 2024;
originally announced March 2024.
-
A Review of Virtual Reality Studies on Autonomous Vehicle--Pedestrian Interaction
Authors:
Tram Thi Minh Tran,
Callum Parker,
Martin Tomitsch
Abstract:
An increasing number of studies employ virtual reality (VR) to evaluate interactions between autonomous vehicles (AVs) and pedestrians. VR simulators are valued for their cost-effectiveness, flexibility in developing various traffic scenarios, safe conduct of user studies, and acceptable ecological validity. Reviewing the literature between 2010 and 2020, we found 31 empirical studies using VR as…
▽ More
An increasing number of studies employ virtual reality (VR) to evaluate interactions between autonomous vehicles (AVs) and pedestrians. VR simulators are valued for their cost-effectiveness, flexibility in developing various traffic scenarios, safe conduct of user studies, and acceptable ecological validity. Reviewing the literature between 2010 and 2020, we found 31 empirical studies using VR as a testing apparatus for both implicit and explicit communication. By performing a systematic analysis, we identified current coverage of critical use cases, obtained a comprehensive account of factors influencing pedestrian behavior in simulated traffic scenarios, and assessed evaluation measures. Based on the findings, we present a set of recommendations for implementing VR pedestrian simulators and propose directions for future research.
△ Less
Submitted 17 March, 2024;
originally announced March 2024.
-
Simulating Wearable Urban Augmented Reality Experiences in VR: Lessons Learnt from Designing Two Future Urban Interfaces
Authors:
Tram Thi Minh Tran,
Callum Parker,
Marius Hoggenmüller,
Luke Hespanhol,
Martin Tomitsch
Abstract:
Augmented reality (AR) has the potential to fundamentally change how people engage with increasingly interactive urban environments. However, many challenges exist in designing and evaluating these new urban AR experiences, such as technical constraints and safety concerns associated with outdoor AR. We contribute to this domain by assessing the use of virtual reality (VR) for simulating wearable…
▽ More
Augmented reality (AR) has the potential to fundamentally change how people engage with increasingly interactive urban environments. However, many challenges exist in designing and evaluating these new urban AR experiences, such as technical constraints and safety concerns associated with outdoor AR. We contribute to this domain by assessing the use of virtual reality (VR) for simulating wearable urban AR experiences, allowing participants to interact with future AR interfaces in a realistic, safe and controlled setting. This paper describes two wearable urban AR applications (pedestrian navigation and autonomous mobility) simulated in VR. Based on a thematic analysis of interview data collected across the two studies, we found that the VR simulation successfully elicited feedback on the functional benefits of AR concepts and the potential impact of urban contextual factors, such as safety concerns, attentional capacity, and social considerations. At the same time, we highlighted the limitations of this approach in terms of assessing the AR interface's visual quality and providing exhaustive contextual information. The paper concludes with recommendations for simulating wearable urban AR experiences in VR.
△ Less
Submitted 17 March, 2024;
originally announced March 2024.
-
ShapeFormer: Shape Prior Visible-to-Amodal Transformer-based Amodal Instance Segmentation
Authors:
Minh Tran,
Winston Bounsavy,
Khoa Vo,
Anh Nguyen,
Tri Nguyen,
Ngan Le
Abstract:
Amodal Instance Segmentation (AIS) presents a challenging task as it involves predicting both visible and occluded parts of objects within images. Existing AIS methods rely on a bidirectional approach, encompassing both the transition from amodal features to visible features (amodal-to-visible) and from visible features to amodal features (visible-to-amodal). Our observation shows that the utiliza…
▽ More
Amodal Instance Segmentation (AIS) presents a challenging task as it involves predicting both visible and occluded parts of objects within images. Existing AIS methods rely on a bidirectional approach, encompassing both the transition from amodal features to visible features (amodal-to-visible) and from visible features to amodal features (visible-to-amodal). Our observation shows that the utilization of amodal features through the amodal-to-visible can confuse the visible features due to the extra information of occluded/hidden segments not presented in visible display. Consequently, this compromised quality of visible features during the subsequent visible-to-amodal transition. To tackle this issue, we introduce ShapeFormer, a decoupled Transformer-based model with a visible-to-amodal transition. It facilitates the explicit relationship between output segmentations and avoids the need for amodal-to-visible transitions. ShapeFormer comprises three key modules: (i) Visible-Occluding Mask Head for predicting visible segmentation with occlusion awareness, (ii) Shape-Prior Amodal Mask Head for predicting amodal and occluded masks, and (iii) Category-Specific Shape Prior Retriever aims to provide shape prior knowledge. Comprehensive experiments and extensive ablation studies across various AIS benchmarks demonstrate the effectiveness of our ShapeFormer. The code is available at: \url{https://github.com/UARK-AICV/ShapeFormer}
△ Less
Submitted 17 April, 2024; v1 submitted 17 March, 2024;
originally announced March 2024.
-
Dyadic Interaction Modeling for Social Behavior Generation
Authors:
Minh Tran,
Di Chang,
Maksim Siniukov,
Mohammad Soleymani
Abstract:
Human-human communication is like a delicate dance where listeners and speakers concurrently interact to maintain conversational dynamics. Hence, an effective model for generating listener nonverbal behaviors requires understanding the dyadic context and interaction. In this paper, we present an effective framework for creating 3D facial motions in dyadic interactions. Existing work consider a lis…
▽ More
Human-human communication is like a delicate dance where listeners and speakers concurrently interact to maintain conversational dynamics. Hence, an effective model for generating listener nonverbal behaviors requires understanding the dyadic context and interaction. In this paper, we present an effective framework for creating 3D facial motions in dyadic interactions. Existing work consider a listener as a reactive agent with reflexive behaviors to the speaker's voice and facial motions. The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach that jointly models speakers' and listeners' motions through masking and contrastive learning to learn representations that capture the dyadic context. To enable the generation of non-deterministic behaviors, we encode both listener and speaker motions into discrete latent representations, through VQ-VAE. The pre-trained model is further fine-tuned for motion generation. Extensive experiments demonstrate the superiority of our framework in generating listener motions, establishing a new state-of-the-art according to the quantitative measures capturing the diversity and realism of generated motions. Qualitative results demonstrate the superior capabilities of the proposed approach in generating diverse and realistic expressions, eye blinks and head gestures. The code is available at https://github.com/Boese0601/Dyadic-Interaction-Modeling
△ Less
Submitted 17 July, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
ARtVista: Gateway To Empower Anyone Into Artist
Authors:
Trong-Vu Hoang,
Quang-Binh Nguyen,
Duy-Nam Ly,
Khanh-Duy Le,
Tam V. Nguyen,
Minh-Triet Tran,
Trung-Nghia Le
Abstract:
Drawing is an art that enables people to express their imagination and emotions. However, individuals usually face challenges in drawing, especially when translating conceptual ideas into visually coherent representations and bridging the gap between mental visualization and practical execution. In response, we propose ARtVista - a novel system integrating AR and generative AI technologies. ARtVis…
▽ More
Drawing is an art that enables people to express their imagination and emotions. However, individuals usually face challenges in drawing, especially when translating conceptual ideas into visually coherent representations and bridging the gap between mental visualization and practical execution. In response, we propose ARtVista - a novel system integrating AR and generative AI technologies. ARtVista not only recommends reference images aligned with users' abstract ideas and generates sketches for users to draw but also goes beyond, crafting vibrant paintings in various painting styles. ARtVista also offers users an alternative approach to create striking paintings by simulating the paint-by-number concept on reference images, empowering users to create visually stunning artwork devoid of the necessity for advanced drawing skills. We perform a pilot study and reveal positive feedback on its usability, emphasizing its effectiveness in visualizing user ideas and aiding the painting process to achieve stunning pictures without requiring advanced drawing skills. The source code will be available at https://github.com/htrvu/ARtVista.
△ Less
Submitted 13 March, 2024;
originally announced March 2024.
-
iCONTRA: Toward Thematic Collection Design Via Interactive Concept Transfer
Authors:
Dinh-Khoi Vo,
Duy-Nam Ly,
Khanh-Duy Le,
Tam V. Nguyen,
Minh-Triet Tran,
Trung-Nghia Le
Abstract:
Creating thematic collections in industries demands innovative designs and cohesive concepts. Designers may face challenges in maintaining thematic consistency when drawing inspiration from existing objects, landscapes, or artifacts. While AI-powered graphic design tools offer help, they often fail to generate cohesive sets based on specific thematic concepts. In response, we introduce iCONTRA, an…
▽ More
Creating thematic collections in industries demands innovative designs and cohesive concepts. Designers may face challenges in maintaining thematic consistency when drawing inspiration from existing objects, landscapes, or artifacts. While AI-powered graphic design tools offer help, they often fail to generate cohesive sets based on specific thematic concepts. In response, we introduce iCONTRA, an interactive CONcept TRAnsfer system. With a user-friendly interface, iCONTRA enables both experienced designers and novices to effortlessly explore creative design concepts and efficiently generate thematic collections. We also propose a zero-shot image editing algorithm, eliminating the need for fine-tuning models, which gradually integrates information from initial objects, ensuring consistency in the generation process without influencing the background. A pilot study suggests iCONTRA's potential to reduce designers' efforts. Experimental results demonstrate its effectiveness in producing consistent and high-quality object concept transfers. iCONTRA stands as a promising tool for innovation and creative exploration in thematic collection design. The source code will be available at: https://github.com/vdkhoi20/iCONTRA.
△ Less
Submitted 13 March, 2024;
originally announced March 2024.
-
Designing Wearable Augmented Reality Concepts to Support Scalability in Autonomous Vehicle-Pedestrian Interaction
Authors:
Tram Thi Minh Tran,
Callum Parker,
Yiyuan Wang,
Martin Tomitsch
Abstract:
Wearable augmented reality (AR) offers new ways for supporting the interaction between autonomous vehicles (AVs) and pedestrians due to its ability to integrate timely and contextually relevant data into the user's field of view. This article presents novel wearable AR concepts that assist crossing pedestrians in multi-vehicle scenarios where several AVs frequent the road from both directions. Thr…
▽ More
Wearable augmented reality (AR) offers new ways for supporting the interaction between autonomous vehicles (AVs) and pedestrians due to its ability to integrate timely and contextually relevant data into the user's field of view. This article presents novel wearable AR concepts that assist crossing pedestrians in multi-vehicle scenarios where several AVs frequent the road from both directions. Three concepts with different communication approaches for signaling responses from multiple AVs to a crossing request, as well as a conventional pedestrian push button, were simulated and tested within a virtual reality environment. The results showed that wearable AR is a promising way to reduce crossing pedestrians' cognitive load when the design offers both individual AV responses and a clear signal to cross. The willingness of pedestrians to adopt a wearable AR solution, however, is subject to different factors, including costs, data privacy, technical defects, liability risks, maintenance duties, and form factors. We further found that all participants favored sending a crossing request to AVs rather than waiting for the vehicles to detect their intentions-pointing to an important gap and opportunity in the current AV-pedestrian interaction literature.
△ Less
Submitted 8 March, 2024;
originally announced March 2024.
-
Scoping Out the Scalability Issues of Autonomous Vehicle-Pedestrian Interaction
Authors:
Tram Thi Minh Tran,
Callum Parker,
Martin Tomitsch
Abstract:
Autonomous vehicles (AVs) may use external interfaces, such as LED light bands, to communicate with pedestrians safely and intuitively. While previous research has demonstrated the effectiveness of these interfaces in simple traffic scenarios involving one pedestrian and one vehicle, their performance in more complex scenarios with multiple road users remains unclear. The scalability of AV externa…
▽ More
Autonomous vehicles (AVs) may use external interfaces, such as LED light bands, to communicate with pedestrians safely and intuitively. While previous research has demonstrated the effectiveness of these interfaces in simple traffic scenarios involving one pedestrian and one vehicle, their performance in more complex scenarios with multiple road users remains unclear. The scalability of AV external communication has therefore attracted increasing attention, prompting the need for further investigation. This scoping review synthesises information from 54 papers to identify seven key scalability issues in multi-vehicle and multi-pedestrian environments, with Clarity of Recipients, Information Overload, and Multi-Lane Safety emerging as the most pressing concerns. To guide future research in scalable AV-pedestrian interactions, we propose high-level design directions focused on three communication loci: vehicle, infrastructure, and pedestrian. Our work contributes the groundwork and a roadmap for designing simplified, coordinated, and targeted external AV communication, ultimately improving safety and efficiency in complex traffic scenarios.
△ Less
Submitted 8 March, 2024;
originally announced March 2024.
-
Exploring the Impact of Interconnected External Interfaces in Autonomous Vehicleson Pedestrian Safety and Experience
Authors:
Tram Thi Minh Tran,
Callum Parker,
Marius Hoggenmuller,
Yiyuan Wang,
Martin Tomitsch
Abstract:
Policymakers advocate for the use of external Human-Machine Interfaces (eHMIs) to allow autonomous vehicles (AVs) to communicate their intentions or status. Nonetheless, scalability concerns in complex traffic scenarios arise, such as potentially increasing pedestrian cognitive load or conveying contradictory signals. Building upon precursory works, our study explores 'interconnected eHMIs,' where…
▽ More
Policymakers advocate for the use of external Human-Machine Interfaces (eHMIs) to allow autonomous vehicles (AVs) to communicate their intentions or status. Nonetheless, scalability concerns in complex traffic scenarios arise, such as potentially increasing pedestrian cognitive load or conveying contradictory signals. Building upon precursory works, our study explores 'interconnected eHMIs,' where multiple AV interfaces are interconnected to provide pedestrians with clear and unified information. In a virtual reality study (N=32), we assessed the effectiveness of this concept in improving pedestrian safety and their crossing experience. We compared these results against two conditions: no eHMIs and unconnected eHMIs. Results indicated interconnected eHMIs enhanced safety feelings and encouraged cautious crossings. However, certain design elements, such as the use of the colour red, led to confusion and discomfort. Prior knowledge slightly influenced perceptions of interconnected eHMIs, underscoring the need for refined user education. We conclude with practical implications and future eHMI design research directions.
△ Less
Submitted 17 March, 2024; v1 submitted 8 March, 2024;
originally announced March 2024.
-
Overview of the VLSP 2023 -- ComOM Shared Task: A Data Challenge for Comparative Opinion Mining from Vietnamese Product Reviews
Authors:
Hoang-Quynh Le,
Duy-Cat Can,
Khanh-Vinh Nguyen,
Mai-Vu Tran
Abstract:
This paper presents a comprehensive overview of the Comparative Opinion Mining from Vietnamese Product Reviews shared task (ComOM), held as part of the 10$^{th}$ International Workshop on Vietnamese Language and Speech Processing (VLSP 2023). The primary objective of this shared task is to advance the field of natural language processing by developing techniques that proficiently extract comparati…
▽ More
This paper presents a comprehensive overview of the Comparative Opinion Mining from Vietnamese Product Reviews shared task (ComOM), held as part of the 10$^{th}$ International Workshop on Vietnamese Language and Speech Processing (VLSP 2023). The primary objective of this shared task is to advance the field of natural language processing by developing techniques that proficiently extract comparative opinions from Vietnamese product reviews. Participants are challenged to propose models that adeptly extract a comparative "quintuple" from a comparative sentence, encompassing Subject, Object, Aspect, Predicate, and Comparison Type Label. We construct a human-annotated dataset comprising $120$ documents, encompassing $7427$ non-comparative sentences and $2468$ comparisons within $1798$ sentences. Participating models undergo evaluation and ranking based on the Exact match macro-averaged quintuple F1 score.
△ Less
Submitted 4 March, 2024; v1 submitted 21 February, 2024;
originally announced February 2024.
-
Error detection using pneumatic logic
Authors:
Shane Hoang,
Mabel Shehada,
Zinal Patel,
Minh-Huy Tran,
Konstantinos Karydis,
Philip Brisk,
William H. Grover
Abstract:
Pneumatic systems are common in manufacturing, healthcare, transportation, robotics, and many other fields. Failures in these systems can have very serious consequences, particularly if they go undetected. In this work, we present an air-powered error detector device that can detect and respond to failures in pneumatically actuated systems. The device contains 21 monolithic membrane valves that ac…
▽ More
Pneumatic systems are common in manufacturing, healthcare, transportation, robotics, and many other fields. Failures in these systems can have very serious consequences, particularly if they go undetected. In this work, we present an air-powered error detector device that can detect and respond to failures in pneumatically actuated systems. The device contains 21 monolithic membrane valves that act like transistors in a pneumatic logic "circuit" that uses vacuum to represent TRUE and atmospheric pressure as FALSE. Three pneumatic exclusive-OR (XOR) gates are used to calculate the parity bit corresponding to the values of several control bits. If the calculated value of the parity bit differs from the expected value, then an error (like a leak or a blocked air line) has been detected and the device outputs a pneumatic error signal which can in turn be used to alert a user, shut down the system, or take some other action. As a proof-of-concept, we used our pneumatic error detector to monitor the operation of a medical device, an intermittent pneumatic compression (IPC) device commonly used to prevent the formation of life-threatening blood clots in the wearer's legs. Experiments confirm that when the IPC device was damaged, the pneumatic error detector immediately recognized the error (a leak) and alerted the wearer using sound. By providing a simple and low-cost way to add fault detection to pneumatic actuation systems without using sensors, our pneumatic error detector can promote safety and reliability across the wide range of pneumatic systems.
△ Less
Submitted 29 January, 2024;
originally announced January 2024.
-
B-Cos Aligned Transformers Learn Human-Interpretable Features
Authors:
Manuel Tran,
Amal Lahiani,
Yashin Dicente Cid,
Melanie Boxberg,
Peter Lienemann,
Christian Matek,
Sophia J. Wagner,
Fabian J. Theis,
Eldad Klaiman,
Tingying Peng
Abstract:
Vision Transformers (ViTs) and Swin Transformers (Swin) are currently state-of-the-art in computational pathology. However, domain experts are still reluctant to use these models due to their lack of interpretability. This is not surprising, as critical decisions need to be transparent and understandable. The most common approach to understanding transformers is to visualize their attention. Howev…
▽ More
Vision Transformers (ViTs) and Swin Transformers (Swin) are currently state-of-the-art in computational pathology. However, domain experts are still reluctant to use these models due to their lack of interpretability. This is not surprising, as critical decisions need to be transparent and understandable. The most common approach to understanding transformers is to visualize their attention. However, attention maps of ViTs are often fragmented, leading to unsatisfactory explanations. Here, we introduce a novel architecture called the B-cos Vision Transformer (BvT) that is designed to be more interpretable. It replaces all linear transformations with the B-cos transform to promote weight-input alignment. In a blinded study, medical experts clearly ranked BvTs above ViTs, suggesting that our network is better at capturing biomedically relevant structures. This is also true for the B-cos Swin Transformer (Bwin). Compared to the Swin Transformer, it even improves the F1-score by up to 4.7% on two public datasets.
△ Less
Submitted 18 January, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
ChatFDA: Medical Records Risk Assessment
Authors:
M Tran,
C Sun
Abstract:
In healthcare, the emphasis on patient safety and the minimization of medical errors cannot be overstated. Despite concerted efforts, many healthcare systems, especially in low-resource regions, still grapple with preventing these errors effectively. This study explores a pioneering application aimed at addressing this challenge by assisting caregivers in gauging potential risks derived from medic…
▽ More
In healthcare, the emphasis on patient safety and the minimization of medical errors cannot be overstated. Despite concerted efforts, many healthcare systems, especially in low-resource regions, still grapple with preventing these errors effectively. This study explores a pioneering application aimed at addressing this challenge by assisting caregivers in gauging potential risks derived from medical notes. The application leverages data from openFDA, delivering real-time, actionable insights regarding prescriptions. Preliminary analyses conducted on the MIMIC-III \cite{mimic} dataset affirm a proof of concept highlighting a reduction in medical errors and an amplification in patient safety. This tool holds promise for drastically enhancing healthcare outcomes in settings with limited resources. To bolster reproducibility and foster further research, the codebase underpinning our methodology is accessible on https://github.com/autonlab/2023.hackAuton/tree/main/prescription_checker. This is a submission for the 30th HackAuton CMU.
△ Less
Submitted 19 December, 2023;
originally announced December 2023.
-
TSRNet: Simple Framework for Real-time ECG Anomaly Detection with Multimodal Time and Spectrogram Restoration Network
Authors:
Nhat-Tan Bui,
Dinh-Hieu Hoang,
Thinh Phan,
Minh-Triet Tran,
Brijesh Patel,
Donald Adjeroh,
Ngan Le
Abstract:
The electrocardiogram (ECG) is a valuable signal used to assess various aspects of heart health, such as heart rate and rhythm. It plays a crucial role in identifying cardiac conditions and detecting anomalies in ECG data. However, distinguishing between normal and abnormal ECG signals can be a challenging task. In this paper, we propose an approach that leverages anomaly detection to identify unh…
▽ More
The electrocardiogram (ECG) is a valuable signal used to assess various aspects of heart health, such as heart rate and rhythm. It plays a crucial role in identifying cardiac conditions and detecting anomalies in ECG data. However, distinguishing between normal and abnormal ECG signals can be a challenging task. In this paper, we propose an approach that leverages anomaly detection to identify unhealthy conditions using solely normal ECG data for training. Furthermore, to enhance the information available and build a robust system, we suggest considering both the time series and time-frequency domain aspects of the ECG signal. As a result, we introduce a specialized network called the Multimodal Time and Spectrogram Restoration Network (TSRNet) designed specifically for detecting anomalies in ECG signals. TSRNet falls into the category of restoration-based anomaly detection and draws inspiration from both the time series and spectrogram domains. By extracting representations from both domains, TSRNet effectively captures the comprehensive characteristics of the ECG signal. This approach enables the network to learn robust representations with superior discrimination abilities, allowing it to distinguish between normal and abnormal ECG patterns more effectively. Furthermore, we introduce a novel inference method, termed Peak-based Error, that specifically focuses on ECG peaks, a critical component in detecting abnormalities. The experimental result on the large-scale dataset PTB-XL has demonstrated the effectiveness of our approach in ECG anomaly detection, while also prioritizing efficiency by minimizing the number of trainable parameters. Our code is available at https://github.com/UARK-AICV/TSRNet.
△ Less
Submitted 5 March, 2024; v1 submitted 15 December, 2023;
originally announced December 2023.
-
3FM: Multi-modal Meta-learning for Federated Tasks
Authors:
Minh Tran,
Roochi Shah,
Zejun Gong
Abstract:
We present a novel approach in the domain of federated learning (FL), particularly focusing on addressing the challenges posed by modality heterogeneity, variability in modality availability across clients, and the prevalent issue of missing data. We introduce a meta-learning framework specifically designed for multimodal federated tasks. Our approach is motivated by the need to enable federated m…
▽ More
We present a novel approach in the domain of federated learning (FL), particularly focusing on addressing the challenges posed by modality heterogeneity, variability in modality availability across clients, and the prevalent issue of missing data. We introduce a meta-learning framework specifically designed for multimodal federated tasks. Our approach is motivated by the need to enable federated models to robustly adapt when exposed to new modalities, a common scenario in FL where clients often differ in the number of available modalities. The effectiveness of our proposed framework is demonstrated through extensive experimentation on an augmented MNIST dataset, enriched with audio and sign language data. We demonstrate that the proposed algorithm achieves better performance than the baseline on a subset of missing modality scenarios with careful tuning of the meta-learning rates. This is a shortened report, and our work will be extended and updated soon.
△ Less
Submitted 15 December, 2023;
originally announced December 2023.
-
NearbyPatchCL: Leveraging Nearby Patches for Self-Supervised Patch-Level Multi-Class Classification in Whole-Slide Images
Authors:
Gia-Bao Le,
Van-Tien Nguyen,
Trung-Nghia Le,
Minh-Triet Tran
Abstract:
Whole-slide image (WSI) analysis plays a crucial role in cancer diagnosis and treatment. In addressing the demands of this critical task, self-supervised learning (SSL) methods have emerged as a valuable resource, leveraging their efficiency in circumventing the need for a large number of annotations, which can be both costly and time-consuming to deploy supervised methods. Nevertheless, patch-wis…
▽ More
Whole-slide image (WSI) analysis plays a crucial role in cancer diagnosis and treatment. In addressing the demands of this critical task, self-supervised learning (SSL) methods have emerged as a valuable resource, leveraging their efficiency in circumventing the need for a large number of annotations, which can be both costly and time-consuming to deploy supervised methods. Nevertheless, patch-wise representation may exhibit instability in performance, primarily due to class imbalances stemming from patch selection within WSIs. In this paper, we introduce Nearby Patch Contrastive Learning (NearbyPatchCL), a novel self-supervised learning method that leverages nearby patches as positive samples and a decoupled contrastive loss for robust representation learning. Our method demonstrates a tangible enhancement in performance for downstream tasks involving patch-level multi-class classification. Additionally, we curate a new dataset derived from WSIs sourced from the Canine Cutaneous Cancer Histology, thus establishing a benchmark for the rigorous evaluation of patch-level multi-class classification methodologies. Intensive experiments show that our method significantly outperforms the supervised baseline and state-of-the-art SSL methods with top-1 classification accuracy of 87.56%. Our method also achieves comparable results while utilizing a mere 1% of labeled data, a stark contrast to the 100% labeled data requirement of other approaches. Source code: https://github.com/nvtien457/NearbyPatchCL
△ Less
Submitted 12 December, 2023;
originally announced December 2023.
-
Super-rays grouping scheme and novel coding architecture for computational time reduction of graph-based Light Field coding
Authors:
Bach Nguyen Gia,
Chanh Minh Tran,
Tho Nguyen Duc,
Tan Phan Xuan,
Eiji Kamioka
Abstract:
Graph-based Light Field coding using the concept of super-rays is powerful to exploit signal redundancy along irregular shapes and achieves good energy compaction, compared to rectangular block -based approaches. However, its main limitation lies in the high time complexity for eigen-decomposition of each super-ray local graph, a high number of which can be found in a Light Field when segmented in…
▽ More
Graph-based Light Field coding using the concept of super-rays is powerful to exploit signal redundancy along irregular shapes and achieves good energy compaction, compared to rectangular block -based approaches. However, its main limitation lies in the high time complexity for eigen-decomposition of each super-ray local graph, a high number of which can be found in a Light Field when segmented into super-rays. This paper examines a grouping scheme for super-rays in order to reduce the number of eigen-decomposition times, and proposes a novel coding architecture to handle the signal residual data arising for each super-ray group, as a tradeoff to achieve lower computational time. Experimental results have shown to reduce a considerable amount of decoding time for Light Field scenes, despite having a slight increase in the coding bitrates when compared with the original non-grouping super-ray -based approach. The proposal also remains to have competitive performance in Rate Distortion in comparison to HEVC-based and JPEG Pleno -based methods.
△ Less
Submitted 10 December, 2023;
originally announced December 2023.
-
PGDS: Pose-Guidance Deep Supervision for Mitigating Clothes-Changing in Person Re-Identification
Authors:
Quoc-Huy Trinh,
Nhat-Tan Bui,
Dinh-Hieu Hoang,
Phuoc-Thao Vo Thi,
Hai-Dang Nguyen,
Debesh Jha,
Ulas Bagci,
Ngan Le,
Minh-Triet Tran
Abstract:
Person Re-Identification (Re-ID) task seeks to enhance the tracking of multiple individuals by surveillance cameras. It supports multimodal tasks, including text-based person retrieval and human matching. One of the most significant challenges faced in Re-ID is clothes-changing, where the same person may appear in different outfits. While previous methods have made notable progress in maintaining…
▽ More
Person Re-Identification (Re-ID) task seeks to enhance the tracking of multiple individuals by surveillance cameras. It supports multimodal tasks, including text-based person retrieval and human matching. One of the most significant challenges faced in Re-ID is clothes-changing, where the same person may appear in different outfits. While previous methods have made notable progress in maintaining clothing data consistency and handling clothing change data, they still rely excessively on clothing information, which can limit performance due to the dynamic nature of human appearances. To mitigate this challenge, we propose the Pose-Guidance Deep Supervision (PGDS), an effective framework for learning pose guidance within the Re-ID task. It consists of three modules: a human encoder, a pose encoder, and a Pose-to-Human Projection module (PHP). Our framework guides the human encoder, i.e., the main re-identification model, with pose information from the pose encoder through multiple layers via the knowledge transfer mechanism from the PHP module, helping the human encoder learn body parts information without increasing computation resources in the inference stage. Through extensive experiments, our method surpasses the performance of current state-of-the-art methods, demonstrating its robustness and effectiveness for real-world applications. Our code is available at https://github.com/huyquoctrinh/PGDS.
△ Less
Submitted 1 June, 2024; v1 submitted 9 December, 2023;
originally announced December 2023.
-
Overview of the VLSP 2022 -- Abmusu Shared Task: A Data Challenge for Vietnamese Abstractive Multi-document Summarization
Authors:
Mai-Vu Tran,
Hoang-Quynh Le,
Duy-Cat Can,
Quoc-An Nguyen
Abstract:
This paper reports the overview of the VLSP 2022 - Vietnamese abstractive multi-document summarization (Abmusu) shared task for Vietnamese News. This task is hosted at the 9$^{th}$ annual workshop on Vietnamese Language and Speech Processing (VLSP 2022). The goal of Abmusu shared task is to develop summarization systems that could create abstractive summaries automatically for a set of documents o…
▽ More
This paper reports the overview of the VLSP 2022 - Vietnamese abstractive multi-document summarization (Abmusu) shared task for Vietnamese News. This task is hosted at the 9$^{th}$ annual workshop on Vietnamese Language and Speech Processing (VLSP 2022). The goal of Abmusu shared task is to develop summarization systems that could create abstractive summaries automatically for a set of documents on a topic. The model input is multiple news documents on the same topic, and the corresponding output is a related abstractive summary. In the scope of Abmusu shared task, we only focus on Vietnamese news summarization and build a human-annotated dataset of 1,839 documents in 600 clusters, collected from Vietnamese news in 8 categories. Participated models are evaluated and ranked in terms of \texttt{ROUGE2-F1} score, the most typical evaluation metric for document summarization problem.
△ Less
Submitted 26 November, 2023;
originally announced November 2023.
-
SafeSea: Synthetic Data Generation for Adverse & Low Probability Maritime Conditions
Authors:
Martin Tran,
Jordan Shipard,
Hermawan Mulyono,
Arnold Wiliem,
Clinton Fookes
Abstract:
High-quality training data is essential for enhancing the robustness of object detection models. Within the maritime domain, obtaining a diverse real image dataset is particularly challenging due to the difficulty of capturing sea images with the presence of maritime objects , especially in stormy conditions. These challenges arise due to resource limitations, in addition to the unpredictable appe…
▽ More
High-quality training data is essential for enhancing the robustness of object detection models. Within the maritime domain, obtaining a diverse real image dataset is particularly challenging due to the difficulty of capturing sea images with the presence of maritime objects , especially in stormy conditions. These challenges arise due to resource limitations, in addition to the unpredictable appearance of maritime objects. Nevertheless, acquiring data from stormy conditions is essential for training effective maritime detection models, particularly for search and rescue, where real-world conditions can be unpredictable. In this work, we introduce SafeSea, which is a stepping stone towards transforming actual sea images with various Sea State backgrounds while retaining maritime objects. Compared to existing generative methods such as Stable Diffusion Inpainting~\cite{stableDiffusion}, this approach reduces the time and effort required to create synthetic datasets for training maritime object detection models. The proposed method uses two automated filters to only pass generated images that meet the criteria. In particular, these filters will first classify the sea condition according to its Sea State level and then it will check whether the objects from the input image are still preserved. This method enabled the creation of the SafeSea dataset, offering diverse weather condition backgrounds to supplement the training of maritime models. Lastly, we observed that a maritime object detection model faced challenges in detecting objects in stormy sea backgrounds, emphasizing the impact of weather conditions on detection accuracy. The code, and dataset are available at https://github.com/martin-3240/SafeSea.
△ Less
Submitted 23 November, 2023;
originally announced November 2023.
-
SolarFormer: Multi-scale Transformer for Solar PV Profiling
Authors:
Adrian de Luis,
Minh Tran,
Taisei Hanyu,
Anh Tran,
Liao Haitao,
Roy McCann,
Alan Mantooth,
Ying Huang,
Ngan Le
Abstract:
As climate change intensifies, the global imperative to shift towards sustainable energy sources becomes more pronounced. Photovoltaic (PV) energy is a favored choice due to its reliability and ease of installation. Accurate mapping of PV installations is crucial for understanding their adoption and informing energy policy. To meet this need, we introduce the SolarFormer, designed to segment solar…
▽ More
As climate change intensifies, the global imperative to shift towards sustainable energy sources becomes more pronounced. Photovoltaic (PV) energy is a favored choice due to its reliability and ease of installation. Accurate mapping of PV installations is crucial for understanding their adoption and informing energy policy. To meet this need, we introduce the SolarFormer, designed to segment solar panels from aerial imagery, offering insights into their location and size. However, solar panel identification in Computer Vision is intricate due to various factors like weather conditions, roof conditions, and Ground Sampling Distance (GSD) variations. To tackle these complexities, we present the SolarFormer, featuring a multi-scale Transformer encoder and a masked-attention Transformer decoder. Our model leverages low-level features and incorporates an instance query mechanism to enhance the localization of solar PV installations. We rigorously evaluated our SolarFormer using diverse datasets, including GGE (France), IGN (France), and USGS (California, USA), across different GSDs. Our extensive experiments consistently demonstrate that our model either matches or surpasses state-of-the-art models, promising enhanced solar panel segmentation for global sustainable energy initiatives.
△ Less
Submitted 30 October, 2023;
originally announced October 2023.
-
Towards long-tailed, multi-label disease classification from chest X-ray: Overview of the CXR-LT challenge
Authors:
Gregory Holste,
Yiliang Zhou,
Song Wang,
Ajay Jaiswal,
Mingquan Lin,
Sherry Zhuge,
Yuzhe Yang,
Dongkyun Kim,
Trong-Hieu Nguyen-Mau,
Minh-Triet Tran,
Jaehyup Jeong,
Wongi Park,
Jongbin Ryu,
Feng Hong,
Arsh Verma,
Yosuke Yamagishi,
Changhyun Kim,
Hyeryeong Seo,
Myungjoo Kang,
Leo Anthony Celi,
Zhiyong Lu,
Ronald M. Summers,
George Shih,
Zhangyang Wang,
Yifan Peng
Abstract:
Many real-world image recognition problems, such as diagnostic medical imaging exams, are "long-tailed" $\unicode{x2013}$ there are a few common findings followed by many more relatively rare conditions. In chest radiography, diagnosis is both a long-tailed and multi-label problem, as patients often present with multiple findings simultaneously. While researchers have begun to study the problem of…
▽ More
Many real-world image recognition problems, such as diagnostic medical imaging exams, are "long-tailed" $\unicode{x2013}$ there are a few common findings followed by many more relatively rare conditions. In chest radiography, diagnosis is both a long-tailed and multi-label problem, as patients often present with multiple findings simultaneously. While researchers have begun to study the problem of long-tailed learning in medical image recognition, few have studied the interaction of label imbalance and label co-occurrence posed by long-tailed, multi-label disease classification. To engage with the research community on this emerging topic, we conducted an open challenge, CXR-LT, on long-tailed, multi-label thorax disease classification from chest X-rays (CXRs). We publicly release a large-scale benchmark dataset of over 350,000 CXRs, each labeled with at least one of 26 clinical findings following a long-tailed distribution. We synthesize common themes of top-performing solutions, providing practical recommendations for long-tailed, multi-label medical image classification. Finally, we use these insights to propose a path forward involving vision-language foundation models for few- and zero-shot disease classification.
△ Less
Submitted 1 April, 2024; v1 submitted 24 October, 2023;
originally announced October 2023.
-
Filling the Holes on 3D Heritage Object Surface based on Automatic Segmentation Algorithm
Authors:
Sinh Van Nguyen,
Son Thanh Le,
Minh Khai Tran,
Le Thanh Sach
Abstract:
Reconstructing and processing the 3D objects are popular activities in the research field of computer graphics, image processing and computer vision. The 3D objects are processed based on the methods like geometric modeling, a branch of applied mathematics and computational geometry, or the machine learning algorithms based on image processing. The computation of geometrical objects includes proce…
▽ More
Reconstructing and processing the 3D objects are popular activities in the research field of computer graphics, image processing and computer vision. The 3D objects are processed based on the methods like geometric modeling, a branch of applied mathematics and computational geometry, or the machine learning algorithms based on image processing. The computation of geometrical objects includes processing the curves and surfaces, subdivision, simplification, meshing, holes filling, reconstructing, and refining the 3D surface objects on both point cloud data and triangular mesh. While the machine learning methods are developed using deep learning models. With the support of 3D laser scan devices and Lidar techniques, the obtained dataset is close to original shape of the real objects. Besides, the photography and its application based on the modern techniques in recent years help us collect data and process the 3D models more precise. This article proposes an improved method for filling holes on the 3D object surface based on an automatic segmentation. Instead of filling the hole directly as the existing methods, we now subdivide the hole before filling it. The hole is first determined and segmented automatically based on computation of its local curvature. It is then filled on each part of the hole to match its local curvature shape. The method can work on both 3D point cloud surfaces and triangular mesh surface. Comparing to the state of the art methods, our proposed method obtained higher accuracy of the reconstructed 3D objects.
△ Less
Submitted 16 October, 2023;
originally announced October 2023.
-
Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation
Authors:
Kashu Yamazaki,
Taisei Hanyu,
Khoa Vo,
Thang Pham,
Minh Tran,
Gianfranco Doretto,
Anh Nguyen,
Ngan Le
Abstract:
Precise 3D environmental mapping is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D mapping and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language found…
▽ More
Precise 3D environmental mapping is pivotal in robotics. Existing methods often rely on predefined concepts during training or are time-intensive when generating semantic maps. This paper presents Open-Fusion, a groundbreaking approach for real-time open-vocabulary 3D mapping and queryable scene representation using RGB-D data. Open-Fusion harnesses the power of a pre-trained vision-language foundation model (VLFM) for open-set semantic comprehension and employs the Truncated Signed Distance Function (TSDF) for swift 3D scene reconstruction. By leveraging the VLFM, we extract region-based embeddings and their associated confidence maps. These are then integrated with 3D knowledge from TSDF using an enhanced Hungarian-based feature-matching mechanism. Notably, Open-Fusion delivers outstanding annotation-free 3D segmentation for open-vocabulary without necessitating additional 3D training. Benchmark tests on the ScanNet dataset against leading zero-shot methods highlight Open-Fusion's superiority. Furthermore, it seamlessly combines the strengths of region-based VLFM and TSDF, facilitating real-time 3D scene comprehension that includes object concepts and open-world semantics. We encourage the readers to view the demos on our project page: https://uark-aicv.github.io/OpenFusion
△ Less
Submitted 5 October, 2023;
originally announced October 2023.
-
NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation
Authors:
Minh-Tuan Tran,
Trung Le,
Xuan-May Le,
Mehrtash Harandi,
Quan Hung Tran,
Dinh Phung
Abstract:
Data-Free Knowledge Distillation (DFKD) has made significant recent strides by transferring knowledge from a teacher neural network to a student neural network without accessing the original data. Nonetheless, existing approaches encounter a significant challenge when attempting to generate samples from random noise inputs, which inherently lack meaningful information. Consequently, these models s…
▽ More
Data-Free Knowledge Distillation (DFKD) has made significant recent strides by transferring knowledge from a teacher neural network to a student neural network without accessing the original data. Nonetheless, existing approaches encounter a significant challenge when attempting to generate samples from random noise inputs, which inherently lack meaningful information. Consequently, these models struggle to effectively map this noise to the ground-truth sample distribution, resulting in prolonging training times and low-quality outputs. In this paper, we propose a novel Noisy Layer Generation method (NAYER) which relocates the random source from the input to a noisy layer and utilizes the meaningful constant label-text embedding (LTE) as the input. LTE is generated by using the language model once, and then it is stored in memory for all subsequent training processes. The significance of LTE lies in its ability to contain substantial meaningful inter-class information, enabling the generation of high-quality samples with only a few training steps. Simultaneously, the noisy layer plays a key role in addressing the issue of diversity in sample generation by preventing the model from overemphasizing the constrained label information. By reinitializing the noisy layer in each iteration, we aim to facilitate the generation of diverse samples while still retaining the method's efficiency, thanks to the ease of learning provided by LTE. Experiments carried out on multiple datasets demonstrate that our NAYER not only outperforms the state-of-the-art methods but also achieves speeds 5 to 15 times faster than previous approaches. The code is available at https://github.com/tmtuan1307/nayer.
△ Less
Submitted 21 March, 2024; v1 submitted 30 September, 2023;
originally announced October 2023.
-
RSAM: Learning on manifolds with Riemannian Sharpness-aware Minimization
Authors:
Tuan Truong,
Hoang-Phi Nguyen,
Tung Pham,
Minh-Tuan Tran,
Mehrtash Harandi,
Dinh Phung,
Trung Le
Abstract:
Nowadays, understanding the geometry of the loss landscape shows promise in enhancing a model's generalization ability. In this work, we draw upon prior works that apply geometric principles to optimization and present a novel approach to improve robustness and generalization ability for constrained optimization problems. Indeed, this paper aims to generalize the Sharpness-Aware Minimization (SAM)…
▽ More
Nowadays, understanding the geometry of the loss landscape shows promise in enhancing a model's generalization ability. In this work, we draw upon prior works that apply geometric principles to optimization and present a novel approach to improve robustness and generalization ability for constrained optimization problems. Indeed, this paper aims to generalize the Sharpness-Aware Minimization (SAM) optimizer to Riemannian manifolds. In doing so, we first extend the concept of sharpness and introduce a novel notion of sharpness on manifolds. To support this notion of sharpness, we present a theoretical analysis characterizing generalization capabilities with respect to manifold sharpness, which demonstrates a tighter bound on the generalization gap, a result not known before. Motivated by this analysis, we introduce our algorithm, Riemannian Sharpness-Aware Minimization (RSAM). To demonstrate RSAM's ability to enhance generalization ability, we evaluate and contrast our algorithm on a broad set of problems, such as image classification and contrastive learning across different datasets, including CIFAR100, CIFAR10, and FGVCAircraft. Our code is publicly available at \url{https://t.ly/RiemannianSAM}.
△ Less
Submitted 29 September, 2023;
originally announced September 2023.
-
Cloud Watching: Understanding Attacks Against Cloud-Hosted Services
Authors:
Liz Izhikevich,
Manda Tran,
Michalis Kallitsis,
Aurore Fass,
Zakir Durumeric
Abstract:
Cloud computing has dramatically changed service deployment patterns. In this work, we analyze how attackers identify and target cloud services in contrast to traditional enterprise networks and network telescopes. Using a diverse set of cloud honeypots in 5~providers and 23~countries as well as 2~educational networks and 1~network telescope, we analyze how IP address assignment, geography, networ…
▽ More
Cloud computing has dramatically changed service deployment patterns. In this work, we analyze how attackers identify and target cloud services in contrast to traditional enterprise networks and network telescopes. Using a diverse set of cloud honeypots in 5~providers and 23~countries as well as 2~educational networks and 1~network telescope, we analyze how IP address assignment, geography, network, and service-port selection, influence what services are targeted in the cloud. We find that scanners that target cloud compute are selective: they avoid scanning networks without legitimate services and they discriminate between geographic regions. Further, attackers mine Internet-service search engines to find exploitable services and, in some cases, they avoid targeting IANA-assigned protocols, causing researchers to misclassify at least 15\% of traffic on select ports. Based on our results, we derive recommendations for researchers and operators.
△ Less
Submitted 28 September, 2023; v1 submitted 23 September, 2023;
originally announced September 2023.
-
SAM3D: Segment Anything Model in Volumetric Medical Images
Authors:
Nhat-Tan Bui,
Dinh-Hieu Hoang,
Minh-Triet Tran,
Gianfranco Doretto,
Donald Adjeroh,
Brijesh Patel,
Arabinda Choudhary,
Ngan Le
Abstract:
Image segmentation remains a pivotal component in medical image analysis, aiding in the extraction of critical information for precise diagnostic practices. With the advent of deep learning, automated image segmentation methods have risen to prominence, showcasing exceptional proficiency in processing medical imagery. Motivated by the Segment Anything Model (SAM)-a foundational model renowned for…
▽ More
Image segmentation remains a pivotal component in medical image analysis, aiding in the extraction of critical information for precise diagnostic practices. With the advent of deep learning, automated image segmentation methods have risen to prominence, showcasing exceptional proficiency in processing medical imagery. Motivated by the Segment Anything Model (SAM)-a foundational model renowned for its remarkable precision and robust generalization capabilities in segmenting 2D natural images-we introduce SAM3D, an innovative adaptation tailored for 3D volumetric medical image analysis. Unlike current SAM-based methods that segment volumetric data by converting the volume into separate 2D slices for individual analysis, our SAM3D model processes the entire 3D volume image in a unified approach. Extensive experiments are conducted on multiple medical image datasets to demonstrate that our network attains competitive results compared with other state-of-the-art methods in 3D medical segmentation tasks while being significantly efficient in terms of parameters. Code and checkpoints are available at https://github.com/UARK-AICV/SAM3D.
△ Less
Submitted 5 March, 2024; v1 submitted 7 September, 2023;
originally announced September 2023.
-
MEGANet: Multi-Scale Edge-Guided Attention Network for Weak Boundary Polyp Segmentation
Authors:
Nhat-Tan Bui,
Dinh-Hieu Hoang,
Quang-Thuc Nguyen,
Minh-Triet Tran,
Ngan Le
Abstract:
Efficient polyp segmentation in healthcare plays a critical role in enabling early diagnosis of colorectal cancer. However, the segmentation of polyps presents numerous challenges, including the intricate distribution of backgrounds, variations in polyp sizes and shapes, and indistinct boundaries. Defining the boundary between the foreground (i.e. polyp itself) and the background (surrounding tiss…
▽ More
Efficient polyp segmentation in healthcare plays a critical role in enabling early diagnosis of colorectal cancer. However, the segmentation of polyps presents numerous challenges, including the intricate distribution of backgrounds, variations in polyp sizes and shapes, and indistinct boundaries. Defining the boundary between the foreground (i.e. polyp itself) and the background (surrounding tissue) is difficult. To mitigate these challenges, we propose Multi-Scale Edge-Guided Attention Network (MEGANet) tailored specifically for polyp segmentation within colonoscopy images. This network draws inspiration from the fusion of a classical edge detection technique with an attention mechanism. By combining these techniques, MEGANet effectively preserves high-frequency information, notably edges and boundaries, which tend to erode as neural networks deepen. MEGANet is designed as an end-to-end framework, encompassing three key modules: an encoder, which is responsible for capturing and abstracting the features from the input image, a decoder, which focuses on salient features, and the Edge-Guided Attention module (EGA) that employs the Laplacian Operator to accentuate polyp boundaries. Extensive experiments, both qualitative and quantitative, on five benchmark datasets, demonstrate that our MEGANet outperforms other existing SOTA methods under six evaluation metrics. Our code is available at https://github.com/UARK-AICV/MEGANet.
△ Less
Submitted 4 November, 2023; v1 submitted 6 September, 2023;
originally announced September 2023.