Search | arXiv e-print repository

A Simple Baseline with Single-encoder for Referring Image Segmentation

Authors: Seonghoon Yu, Ilchae Jung, Byeongju Han, Taeoh Kim, Yunho Kim, Dongyoon Wee, Jeany Son

Abstract: Referring image segmentation (RIS) requires dense vision-language interactions between visual pixels and textual words to segment objects based on a given description. However, the dual-encoders commonly adopted in RIS, e.g., Swin Transformer and BERT (uni-modal encoders) or CLIP (a multi-modal dual-encoder), lack dense multi-modal interactions during pre-training, leading to a gap with the pixel-level RIS task. To bridge this gap, existing RIS methods often rely on multi-modal fusion modules that connect the two encoders, but this approach incurs high computational costs. In this paper, we present a novel RIS method with a single encoder, i.e., BEiT-3, maximizing the potential of shared self-attention across all framework components. This enables seamless interactions between the two modalities from input to final prediction, producing granularly aligned multi-modal features. Furthermore, we propose lightweight yet effective decoder modules, a Shared FPN and a Shared Mask Decoder, which contribute to the high efficiency of our model. Our simple single-encoder baseline achieves outstanding performance on the RIS benchmark datasets while maintaining computational efficiency, compared to the most recent SoTA methods based on dual-encoders.

Submitted 28 August, 2024; originally announced August 2024.

Comments: ArXiv pre-print
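
The core idea of the abstract above, one encoder whose self-attention is shared by image patches and words, can be illustrated with a minimal sketch. It assumes identity query/key/value projections, a single head, and no feed-forward layers, so it is not BEiT-3 or the authors' code; the names `shared_self_attention` and `joint_encode` are hypothetical. It only shows that once both modalities sit in one sequence, every pixel feature already depends on the text.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def shared_self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention with identity Q/K/V projections.

    Every token attends to every token in the sequence, regardless of modality.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for query in tokens:
        weights = softmax([dot(query, key) * scale for key in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def joint_encode(patches: List[Vector], words: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Concatenate both modalities into one sequence, attend once, then split back."""
    mixed = shared_self_attention(patches + words)
    return mixed[: len(patches)], mixed[len(patches):]


patches = [[1.0, 0.0], [0.0, 1.0]]             # toy image-patch embeddings
pix_a, _ = joint_encode(patches, [[2.0, 0.0]])  # toy word embedding A
pix_b, _ = joint_encode(patches, [[0.0, 2.0]])  # toy word embedding B
print(pix_a[0] != pix_b[0])  # True: pixel features depend on the text
```

In a dual-encoder design the patch features would be identical for both sentences until a separate fusion module runs; here the interaction happens inside the encoder itself.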

arXiv:2407.16177 [pdf, ps, other]

Logifold: A Geometrical Foundation of Ensemble Machine Learning

Authors: Inkee Jung, Siu-Cheong Lau

Abstract: We present a local-to-global and measure-theoretical approach to understanding datasets. The core idea is to formulate a logifold structure and to interpret network models with restricted domains as local charts of datasets. In particular, this provides a mathematical foundation for ensemble machine learning. Our experiments demonstrate that logifolds can be implemented to identify fuzzy domains and to improve accuracy compared to taking the average of model outputs. Additionally, we provide a theoretical example of a logifold, highlighting the importance of restricting to the domains of classifiers in an ensemble.

Submitted 23 July, 2024; originally announced July 2024.

Comments: 6 pages
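
The contrast drawn above between restricting classifiers to their domains and simply averaging model outputs can be sketched as follows. This is a simplification under stated assumptions: domains are crisp membership tests rather than the paper's measure-theoretic fuzzy domains, and `Chart`, `average_all`, and `restricted_average` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple

Probs = Dict[str, float]
# A "chart": (domain membership test, classifier that is only trusted on that domain).
Chart = Tuple[Callable[[float], bool], Callable[[float], Probs]]


def mean_probs(outputs: List[Probs]) -> Probs:
    labels = outputs[0].keys()
    return {c: sum(o[c] for o in outputs) / len(outputs) for c in labels}


def average_all(charts: List[Chart], x: float) -> Probs:
    """Plain ensemble: average every model, even outside its domain."""
    return mean_probs([model(x) for _, model in charts])


def restricted_average(charts: List[Chart], x: float) -> Probs:
    """Average only the models whose (crisp) domain contains x."""
    outputs = [model(x) for in_domain, model in charts if in_domain(x)]
    if not outputs:
        raise ValueError("x is not covered by any chart")
    return mean_probs(outputs)


# Two hypothetical models, each trained only on half of the real line.
charts: List[Chart] = [
    (lambda x: x < 0, lambda x: {"neg": 0.9, "pos": 0.1}),
    (lambda x: x >= 0, lambda x: {"neg": 0.1, "pos": 0.9}),
]
print(average_all(charts, 2.0))         # the out-of-domain model drags "pos" down to 0.5
print(restricted_average(charts, 2.0))  # {'neg': 0.1, 'pos': 0.9}
```

The out-of-domain model still produces confident-looking probabilities, which is exactly what plain averaging cannot detect.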

arXiv:2405.05492 [pdf, other]

A logifold structure on measure space

Authors: Inkee Jung, Siu-Cheong Lau

Abstract: In this paper, we develop a local-to-global and measure-theoretical approach to understanding datasets. The idea is to take network models with restricted domains as local charts of datasets. We develop the mathematical foundations for these structures, and show in experiments how they can be used to find fuzzy domains and to improve accuracy in data classification problems.

Submitted 8 May, 2024; originally announced May 2024.

Comments: 43 pages, 4 figures

MSC Class: 55N31; 53Z50; 68T07; 68T09; 60A10; 81P45; 94D05

arXiv:2310.02713 [pdf, other]

scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis in Brain

Authors: Gyutaek Oh, Baekgyu Choi, Inkyung Jung, Jong Chul Ye

Abstract: Single-cell RNA sequencing (scRNA-seq) has made significant strides in unraveling the intricate cellular diversity within complex tissues. This is particularly critical in the brain, which presents a greater diversity of cell types than other tissues, for gaining a deeper understanding of brain function within various cellular contexts. However, analyzing scRNA-seq data remains a challenge due to inherent measurement noise stemming from dropout events and the limited utilization of extensive gene expression information. In this work, we introduce scHyena, a foundation model designed to address these challenges and enhance the accuracy of scRNA-seq analysis in the brain. Specifically, inspired by the recent Hyena operator, we design a novel Transformer architecture called single-cell Hyena (scHyena) that is equipped with a linear adaptor layer, positional encoding via gene embedding, and a bidirectional Hyena operator. This enables us to process full-length scRNA-seq data without losing any information from the raw data. In particular, our model learns generalizable features of cells and genes through pre-training scHyena on full-length scRNA-seq data. We demonstrate the superior performance of scHyena compared to other benchmark methods in downstream tasks, including cell type classification and scRNA-seq imputation.

Submitted 4 October, 2023; originally announced October 2023.

Comments: 21 pages, 16 figures
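
What "bidirectional" means for a Hyena-style operator can be sketched with a gated long convolution that looks both backward and forward along the sequence. This is an assumption-laden simplification, not scHyena: the filters are explicit lists rather than implicitly parameterized, the convolution is computed directly in O(L²) instead of with an FFT, only one order of gating is shown, and all function names are hypothetical.

```python
from typing import List


def causal_conv(x: List[float], h: List[float]) -> List[float]:
    """y[t] = sum_k h[k] * x[t - k]: each output sees the current and earlier positions."""
    return [sum(h[k] * x[t - k] for k in range(min(t + 1, len(h)))) for t in range(len(x))]


def bidirectional_long_conv(x: List[float], h_fwd: List[float], h_bwd: List[float]) -> List[float]:
    """y[t] = sum_k h_fwd[k] * x[t - k] + sum_k h_bwd[k] * x[t + k].

    The backward term is a causal convolution run on the reversed sequence.
    """
    fwd = causal_conv(x, h_fwd)
    bwd = causal_conv(x[::-1], h_bwd)[::-1]
    return [a + b for a, b in zip(fwd, bwd)]


def hyena_like_block(gate: List[float], value: List[float],
                     h_fwd: List[float], h_bwd: List[float]) -> List[float]:
    """Order-1, Hyena-style step: elementwise gate times a long convolution of the values."""
    mixed = bidirectional_long_conv(value, h_fwd, h_bwd)
    return [g * m for g, m in zip(gate, mixed)]


# A signal at the last position reaches the first position only through the backward filter.
value = [0.0, 0.0, 0.0, 1.0]
print(hyena_like_block([1.0] * 4, value, [0.0] * 4, [1.0] * 4))  # [1.0, 1.0, 1.0, 1.0]
```

With only the causal filter, the first position could never see the last gene in the sequence; adding the reversed-direction filter removes that asymmetry, which matters for gene expression profiles that have no natural left-to-right order.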

arXiv:2205.11179 [pdf, other]

Online Hybrid Lightweight Representations Learning: Its Application to Visual Tracking

Authors: Ilchae Jung, Minji Kim, Eunhyeok Park, Bohyung Han

Abstract: This paper presents a novel hybrid representation learning framework for streaming data, where an image frame in a video is modeled by an ensemble of two distinct deep neural networks: one is a low-bit quantized network and the other is a lightweight full-precision network. The former learns coarse primary information at low cost, while the latter conveys residual information for high fidelity to the original representations. The proposed parallel architecture is effective at maintaining complementary information, since fixed-point arithmetic can be utilized in the quantized network and the lightweight model provides precise representations given by a compact channel-pruned network. We incorporate the hybrid representation technique into an online visual tracking task, where deep neural networks need to handle temporal variations of target appearances in real time. Compared to state-of-the-art real-time trackers based on conventional deep neural networks, our tracking algorithm demonstrates competitive accuracy on the standard benchmarks with a small fraction of the computational cost and memory footprint.

Submitted 23 May, 2022; originally announced May 2022.

Comments: 7 pages, 1 figure, accepted at IJCAI2022
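
The coarse-plus-residual decomposition described above can be sketched for a single linear layer. Assumptions: only weights are quantized (uniform, symmetric), activations stay in floating point, and the residual weights are computed as the exact quantization error purely for illustration, whereas the paper trains a separate channel-pruned full-precision network. All names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def quantize(w: Matrix, bits: int) -> Matrix:
    """Uniform symmetric quantization to the grid {-L, ..., L} * step, L = 2**(bits-1) - 1."""
    if bits < 2:
        raise ValueError("need at least 2 bits for a symmetric grid")
    levels = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for row in w for v in row) or 1.0
    step = max_abs / levels
    return [[round(v / step) * step for v in row] for row in w]


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def hybrid_features(x: Vector, w_coarse: Matrix, w_residual: Matrix, bits: int = 2) -> Vector:
    """Coarse output of a low-bit branch plus a full-precision residual branch."""
    coarse = matvec(quantize(w_coarse, bits), x)
    residual = matvec(w_residual, x)
    return [c + r for c, r in zip(coarse, residual)]


w = [[0.9, -0.3], [0.2, 0.6]]   # toy "ideal" weights
q = quantize(w, bits=2)          # ternary grid {-0.9, 0.0, 0.9}
w_res = [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(w, q)]  # here: the exact error
print(hybrid_features([1.0, 1.0], w, w_res))  # approximately [0.6, 0.8]
```

The two branches run in parallel, so the cheap ternary branch can use fixed-point arithmetic while the small residual branch restores precision.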

arXiv:2107.03046 [pdf, ps, other]

6G: from Densification to Diversification

Authors: Hyunsoo Kim, Taehyung Kim, Hyejin Kim, Insik Jung, Hakkeon Lee, Hyunmin Seo, Daesik Hong

Abstract: The 5G system has finally begun commercialization, and now is the time to start discussing the road map for the 6G system. While the 5G system was designed with a focus on discovering new service types for high-speed, low-latency, and massive connectivity services, the evolution of the network interface for 6G should be considered with an eye toward supporting these complicated communication environments. As machine-driven data traffic continues to increase exponentially, 6G must be able to support a series of connection methods that did not previously exist. In a departure from base-station-oriented cell densification, network diversification is necessary if we are to satisfy the comprehensive requirements of end terminals for diverse applications. In this article, we predict what will drive 6G and look at what key requirements should be considered in 6G. We then diversify network architectures into four types according to link characteristics, communication ranges, and target services. The four types of networks play complementary roles while at the same time collaborating across the entire 6G network. Lastly, we call attention to key technologies and challenges in the air interface, network, and assistive technologies that will have to be addressed when designing the 6G system.

Submitted 7 July, 2021; originally announced July 2021.

Comments: 11 pages, 5 figures, 3 tables

arXiv:1911.11170 [pdf, other]

Real-Time Object Tracking via Meta-Learning: Efficient Model Adaptation and One-Shot Channel Pruning

Authors: Ilchae Jung, Kihyun You, Hyeonwoo Noh, Minsu Cho, Bohyung Han

Abstract: We propose a novel meta-learning framework for real-time object tracking with efficient model adaptation and channel pruning. Given an object tracker, our framework learns to fine-tune its model parameters in only a few gradient-descent iterations during tracking while pruning its network channels using the target ground truth at the first frame. Such a learning problem is formulated as a meta-learning task, where a meta-tracker is trained by updating its meta-parameters for initial weights, learning rates, and pruning masks through carefully designed tracking simulations. The integrated meta-tracker greatly improves tracking performance by accelerating the convergence of online learning and reducing the cost of feature computation. Experimental evaluation on standard datasets demonstrates its outstanding accuracy and speed compared to state-of-the-art methods.

Submitted 4 December, 2019; v1 submitted 25 November, 2019; originally announced November 2019.

Comments: 9 pages, 5 figures, AAAI 2020 accepted
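
The inner loop described above, a few gradient steps starting from meta-learned initial weights with meta-learned per-parameter learning rates on a channel-pruned model, can be sketched on a toy linear scorer. Assumptions: the meta-parameters are simply given as inputs (the outer meta-training loop over tracking simulations is not shown), the mask is a fixed 0/1 list, and `adapt` is a hypothetical name.

```python
from typing import List, Tuple

Sample = Tuple[List[float], float]  # (feature vector, target score)


def adapt(theta0: List[float], alpha: List[float], mask: List[int],
          samples: List[Sample], steps: int = 2) -> List[float]:
    """Few-step fine-tuning of a linear scorer s(x) = sum_i mask[i] * theta[i] * x[i].

    theta0: meta-learned initial weights; alpha: meta-learned per-parameter
    learning rates; mask: 0/1 channel mask (pruned channels contribute nothing).
    Loss: 0.5 * mean squared error over the samples.
    """
    theta = list(theta0)
    n = len(samples)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in samples:
            err = sum(m * t * xi for m, t, xi in zip(mask, theta, x)) - y
            for i in range(len(theta)):
                grad[i] += err * mask[i] * x[i] / n
        theta = [t - a * g for t, a, g in zip(theta, alpha, grad)]
    return theta


# With a well-chosen per-parameter step size, one step already fits the sample;
# the pruned third channel stays untouched despite its large input value.
print(adapt([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [1, 1, 0], [([1.0, 0.0, 5.0], 1.0)]))
```

Meta-learning the step sizes is what lets adaptation converge in very few iterations; a generic small learning rate would need many more steps to reach the same fit.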

arXiv:1808.08834 [pdf, other]

Real-Time MDNet

Authors: Ilchae Jung, Jeany Son, Mooyeol Baek, Bohyung Han

Abstract: We present a fast and accurate visual tracking algorithm based on the multi-domain convolutional neural network (MDNet). The proposed approach accelerates the feature extraction procedure and learns more discriminative models for instance classification; it enhances the representation quality of target and background by maintaining a high-resolution feature map with a large receptive field per activation. We also introduce a novel loss term to differentiate foreground instances across multiple domains and learn a more discriminative embedding of target objects with similar semantics. The proposed techniques are integrated into the pipeline of a well-known CNN-based visual tracking algorithm, MDNet. We accomplish an approximately 25-times speed-up with almost identical accuracy compared to MDNet. Our algorithm is evaluated on multiple popular tracking benchmark datasets, including OTB2015, UAV123, and TempleColor, and consistently outperforms state-of-the-art real-time tracking methods even without dataset-specific parameter tuning.

Submitted 27 August, 2018; originally announced August 2018.

Comments: 16 pages, 8 figures, accepted at ECCV 2018
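
One plausible form of the loss term mentioned above, which separates foreground instances across domains, is a softmax cross-entropy over domain branches applied to target scores. This is a sketch under that assumption; the abstract does not give the formula, and `instance_embedding_loss` is a hypothetical name.

```python
import math
from typing import List


def instance_embedding_loss(pos_scores: List[List[float]], domains: List[int]) -> float:
    """Mean cross-entropy over domains for positive (target) samples.

    pos_scores[n][d]: target score that the domain-d branch gives to sample n.
    domains[n]: index of the video (domain) that sample n actually comes from.
    The loss is low when each target scores highest in its own domain's branch.
    """
    total = 0.0
    for scores, d in zip(pos_scores, domains):
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))  # log-sum-exp
        total += log_z - scores[d]  # equals -log softmax(scores)[d]
    return total / len(domains)


print(instance_embedding_loss([[0.0, 0.0]], [0]))   # log 2: no separation between domains
print(instance_embedding_loss([[5.0, -5.0]], [0]))  # close to 0: well separated
```

Because the softmax runs across domains rather than across foreground/background, it pushes targets from different videos apart even when they share the same semantic class.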

arXiv:1705.00494 [pdf, other]

Orthogonal Code-based Block Transmission for Burst Transmission

Authors: Hyejin Kim, Insik Jung, Wonsuk Chung, Sooyong Choi, Daesik Hong

Abstract: This paper proposes a new multi-carrier system, called orthogonal code-based block transmission (OCBT). OCBT applies a time-spreading method with an orthogonal code to obtain a block signal structure, and a windowing procedure to reduce the out-of-band (OOB) radiation. The proposed OCBT can transmit quadrature amplitude modulation (QAM) signals, so conventional multiple-input multiple-output (MIMO) techniques can be used. Numerical results show that the proposed OCBT using QAM signals has a shorter burst than filter-bank multi-carrier (FBMC), lower complexity than FBMC and windowed orthogonal frequency division multiplexing (W-OFDM), and lower OOB radiation than OFDM.

Submitted 1 May, 2017; v1 submitted 1 May, 2017; originally announced May 2017.

Comments: 5 pages, 5 figures, submitted to IEEE Transactions on Vehicular Technology
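
The time-spreading step described above can be sketched with Walsh-Hadamard codes, one common family of orthogonal codes. Assumptions: the choice of Walsh-Hadamard codes is illustrative rather than taken from the paper, the windowing procedure is omitted, and the channel is ideal. The function names are hypothetical.

```python
from typing import List


def hadamard(n: int) -> List[List[int]]:
    """Sylvester construction of an n x n Walsh-Hadamard matrix (n a power of two).

    Rows are mutually orthogonal +/-1 codes: sum_t H[j][t] * H[k][t] = n if j == k else 0.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def spread(symbols: List[complex]) -> List[complex]:
    """Spread n QAM symbols over n time slots: s[t] = sum_k H[k][t] * a[k]."""
    n = len(symbols)
    h = hadamard(n)
    return [sum(h[k][t] * symbols[k] for k in range(n)) for t in range(n)]


def despread(block: List[complex]) -> List[complex]:
    """Recover a[k] by correlating with code k and dividing by n."""
    n = len(block)
    h = hadamard(n)
    return [sum(h[k][t] * block[t] for t in range(n)) / n for k in range(n)]


qam = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]  # one 4-QAM (QPSK) symbol per code
block = spread(qam)
print(despread(block) == qam)  # True: orthogonal codes separate the symbols
```

Because each code carries a plain QAM symbol and despreading is a linear correlation, existing MIMO processing can in principle operate on the recovered symbols, which is the compatibility the abstract points to.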

arXiv:1612.01669 [pdf, other]

MarioQA: Answering Questions by Watching Gameplay Videos

Authors: Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, Bohyung Han

Abstract: We present a framework to analyze various aspects of models for video question answering (VideoQA) using customizable synthetic datasets, which are constructed automatically from gameplay videos. Our work is motivated by the fact that existing models are often tested only on datasets that require excessively high-level reasoning or mostly contain instances accessible through single-frame inference. Hence, it is difficult to measure the capacity and flexibility of trained models, and existing techniques often rely on ad-hoc implementations of deep neural networks without clear insight into datasets and models. We are particularly interested in understanding temporal relationships between video events to solve VideoQA problems, because reasoning about temporal dependency is one of the components that most distinguish videos from images. To address this objective, we automatically generate a customized synthetic VideoQA dataset using Super Mario Bros. gameplay videos so that it contains events with different levels of reasoning complexity. Using the dataset, we show that properly constructed datasets with events at various complexity levels are critical to learning effective models and improving overall performance.

Submitted 13 August, 2017; v1 submitted 6 December, 2016; originally announced December 2016.
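
Automatic generation of questions that require temporal reasoning, as described above, can be sketched with templates over an event log. The log format, the templates, and `temporal_questions` are all hypothetical; MarioQA's actual event types and question templates are not reproduced here.

```python
from typing import List, Tuple

# Hypothetical event-log entry: (frame index, past-tense action, object).
Event = Tuple[int, str, str]


def temporal_questions(events: List[Event]) -> List[Tuple[str, str]]:
    """Template-based (question, answer) pairs that need the order of two events."""
    ordered = sorted(events)  # sort by frame index first
    qa = []
    for (_, act1, obj1), (_, act2, obj2) in zip(ordered, ordered[1:]):
        qa.append((f"What did Mario do right after he {act1} a {obj1}?", f"{act2} a {obj2}"))
        qa.append((f"What did Mario do right before he {act2} a {obj2}?", f"{act1} a {obj1}"))
    return qa


log = [(300, "jumped on", "Koopa"), (120, "killed", "Goomba")]
for question, answer in temporal_questions(log):
    print(question, "->", answer)
```

Neither answer can be read off a single frame, since each one depends on which of two events came first.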

Showing 1–10 of 10 results for author: Jung, I