Showing 1–11 of 11 results for author: Kanda, S

  1. arXiv:2406.17185

    cs.CL

    Vaporetto: Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification

    Authors: Koichi Akabe, Shunsuke Kanda, Yusuke Oda, Shinsuke Mori

    Abstract: This paper proposes an approach to improve the runtime efficiency of Japanese tokenization based on the pointwise linear classification (PLC) framework, which formulates the whole tokenization process as a sequence of linear classification problems. Our approach optimizes tokenization by leveraging the characteristics of the PLC framework and the task definition. Our approach involves (1) composin…

    Submitted 24 June, 2024; originally announced June 2024.

  2. arXiv:2403.04951

    cs.DS

    NP-Completeness for the Space-Optimality of Double-Array Tries

    Authors: Hideo Bannai, Keisuke Goto, Shunsuke Kanda, Dominik Köppl

    Abstract: Indexing a set of strings for prefix search or membership queries is a fundamental task with many applications such as information retrieval or database systems. A classic abstract data type for modelling such an index is a trie. Due to the fundamental nature of this problem, it has sparked much interest, leading to a variety of trie implementations with different characteristics. A trie implement…

    Submitted 7 March, 2024; originally announced March 2024.

  3. Engineering faster double-array Aho-Corasick automata

    Authors: Shunsuke Kanda, Koichi Akabe, Yusuke Oda

    Abstract: Multiple pattern matching in strings is a fundamental problem in text processing applications such as regular expressions or tokenization. This paper studies efficient implementations of double-array Aho-Corasick automata (DAACs), data structures for quickly performing the multiple pattern matching. The practical performance of DAACs is improved by carefully designing the data structure, and many…

    Submitted 23 June, 2024; v1 submitted 27 July, 2022; originally announced July 2022.

    Comments: Accepted by Software: Practice and Experience (Accepted version)

    Journal ref: Software: Practice and Experience (SPE), 53(6): 1332-1361, 2023

  4. arXiv:2207.02571

    cs.DS

    Computing NP-hard Repetitiveness Measures via MAX-SAT

    Authors: Hideo Bannai, Keisuke Goto, Masakazu Ishihata, Shunsuke Kanda, Dominik Köppl, Takaaki Nishimoto

    Abstract: Repetitiveness measures reveal profound characteristics of datasets, and give rise to compressed data structures and algorithms working in compressed space. Alas, the computation of some of these measures is NP-hard, and straightforward computation is infeasible for datasets of even small sizes. Three such measures are the smallest size of a string attractor, the smallest size of a bidirectional…

    Submitted 12 July, 2022; v1 submitted 6 July, 2022; originally announced July 2022.

    Comments: paper accepted to ESA 2022 (plus Appendix); corrected attribution of Python program for computing https://oeis.org/A339391

  5. arXiv:2202.07885

    cs.DS

    An Optimal-Time RLBWT Construction in BWT-runs Bounded Space

    Authors: Takaaki Nishimoto, Shunsuke Kanda, Yasuo Tabei

    Abstract: The compression of highly repetitive strings (i.e., strings with many repetitions) has been a central research topic in string processing, and quite a few compression methods for these strings have been proposed thus far. Among them, an efficient compression format gathering increasing attention is the run-length Burrows–Wheeler transform (RLBWT), which is a run-length encoded BWT as a reversible…

    Submitted 16 February, 2022; originally announced February 2022.

  6. Rank/Select Queries over Mutable Bitmaps

    Authors: Giulio Ermanno Pibiri, Shunsuke Kanda

    Abstract: The problem of answering rank/select queries over a bitmap is of utmost importance for many succinct data structures. When the bitmap does not change, many solutions exist on both the theoretical and the practical side. In this work we consider the case where one is allowed to modify the bitmap via a flip(i) operation that toggles its i-th bit. By adapting and properly extending some results concerning pre…

    Submitted 23 February, 2021; v1 submitted 27 September, 2020; originally announced September 2020.

    Comments: Accepted by Information Systems (INFOSYS)

    Journal ref: Information Systems, 2021, Volume 99, 15 pages

  7. arXiv:2009.11559

    cs.DS cs.IR

    Dynamic Similarity Search on Integer Sketches

    Authors: Shunsuke Kanda, Yasuo Tabei

    Abstract: Similarity-preserving hashing is a core technique for fast similarity searches, and it randomly maps data points in a metric space to strings of discrete symbols (i.e., sketches) in the Hamming space. While traditional hashing techniques produce binary sketches, recent ones produce integer sketches for preserving various similarity measures. However, most similarity search methods are designed for…

    Submitted 24 September, 2020; originally announced September 2020.

    Comments: Accepted by IEEE ICDM 2020 as a full paper

  8. arXiv:2005.10917

    cs.DS cs.DB cs.LG

    Succinct Trit-array Trie for Scalable Trajectory Similarity Search

    Authors: Shunsuke Kanda, Koh Takeuchi, Keisuke Fujii, Yasuo Tabei

    Abstract: Massive datasets of spatial trajectories representing the mobility of a diversity of moving objects are ubiquitous in research and industry. Similarity search of a large collection of trajectories is indispensable for turning these datasets into knowledge. Locality sensitive hashing (LSH) is a powerful technique for fast similarity searches. Recent methods employ LSH and attempt to realize an effi…

    Submitted 21 September, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

    Comments: Accepted by ACM SIGSPATIAL 2020 as a full paper

  9. arXiv:1910.08278

    cs.LG stat.ML

    $b$-Bit Sketch Trie: Scalable Similarity Search on Integer Sketches

    Authors: Shunsuke Kanda, Yasuo Tabei

    Abstract: Recently, randomly mapping vectorial data to strings of discrete symbols (i.e., sketches) for fast and space-efficient similarity searches has become popular. Such random mapping is called similarity-preserving hashing and approximates a similarity metric by using the Hamming distance. Although many efficient similarity searches have been proposed, most of them are designed for binary sketches. Si…

    Submitted 18 October, 2019; originally announced October 2019.

    Comments: To appear in the Proceedings of IEEE BigData'19

  10. arXiv:1906.06015

    cs.DS cs.IR

    Dynamic Path-Decomposed Tries

    Authors: Shunsuke Kanda, Dominik Köppl, Yasuo Tabei, Kazuhiro Morita, Masao Fuketa

    Abstract: A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a number of highly-compressed keyword dictionaries based on the advancements of practical succinct data structures. However, as most succinct data structures are…

    Submitted 21 July, 2020; v1 submitted 14 June, 2019; originally announced June 2019.

    Comments: Accepted by ACM Journal of Experimental Algorithmics

  11. c-trie++: A Dynamic Trie Tailored for Fast Prefix Searches

    Authors: Kazuya Tsuruta, Dominik Köppl, Shunsuke Kanda, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

    Abstract: Given a dynamic set $K$ of $k$ strings of total length $n$ whose characters are drawn from an alphabet of size $σ$, a keyword dictionary is a data structure built on $K$ that provides locate, prefix search, and update operations on $K$. Under the assumption that $α = w / \lg σ$ characters fit into a single machine word $w$, we propose a keyword dictionary that represents $K$ in…

    Submitted 7 October, 2020; v1 submitted 16 April, 2019; originally announced April 2019.

    Journal ref: Full version of conference paper at DCC, pages 243-252, 2020