Optimal Square Detection Over General Alphabets

Authors: Jonas Ellert, Paweł Gawrychowski, Garance Gourdel

Abstract: Squares (fragments of the form $xx$, for some string $x$) are arguably the most natural type of repetition in strings. The basic algorithmic question concerning squares is to check if a given string of length $n$ is square-free, that is, does not contain a fragment of such form. Main and Lorentz [J. Algorithms 1984] designed an $\mathcal{O}(n\log n)$ time algorithm for this problem, and proved a matching lower bound assuming the so-called general alphabet, meaning that the algorithm is only allowed to check if two characters are equal. However, their lower bound also assumes that there are $Ω(n)$ distinct symbols in the string. As an open question, they asked if there is a faster algorithm if one restricts the size of the alphabet. Crochemore [Theor. Comput. Sci. 1986] designed a linear-time algorithm for constant-size alphabets, and combined with more recent results his approach in fact implies such an algorithm for linearly-sortable alphabets. Very recently, Ellert and Fischer [ICALP 2021] significantly relaxed this assumption by designing a linear-time algorithm for general ordered alphabets, that is, assuming a linear order on the characters that permits constant time order comparisons. However, the open question of Main and Lorentz from 1984 remained unresolved for general (unordered) alphabets. In this paper, we show that testing square-freeness of a length-$n$ string over general alphabet of size $σ$ can be done with $\mathcal{O}(n\log σ)$ comparisons, and cannot be done with $o(n\log σ)$ comparisons. We complement this result with an $\mathcal{O}(n\log σ)$ time algorithm in the Word RAM model. Finally, we extend the algorithm to reporting all the runs (maximal repetitions) in the same complexity.

Submitted 13 March, 2023; originally announced March 2023.

Comments: extended version of a paper published in SODA 2023
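
To make the definition of square-freeness in the abstract above concrete, here is a minimal brute-force sketch. It is not the algorithm from the paper: it takes cubic time in the worst case and only illustrates the problem statement. Like the general-alphabet model, it uses nothing but equality tests on characters. The function name `has_square` is an illustrative assumption.

```python
def has_square(s: str) -> bool:
    """Return True if s contains a fragment of the form xx with x non-empty.

    Brute force for illustration only; uses only equality comparisons.
    """
    n = len(s)
    # Try every possible length of x, then every start position of xx.
    for length in range(1, n // 2 + 1):
        for start in range(0, n - 2 * length + 1):
            left = s[start:start + length]
            right = s[start + length:start + 2 * length]
            if left == right:
                return True
    return False


if __name__ == "__main__":
    print(has_square("banana"))   # True, because of "anan" = "an" + "an"
    print(has_square("abacaba"))  # False, this string is square-free
```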

arXiv:2102.08670 [pdf, other]

Linear Time Runs over General Ordered Alphabets

Authors: Jonas Ellert, Johannes Fischer

Abstract: A run in a string is a maximal periodic substring. For example, the string $\texttt{bananatree}$ contains the runs $\texttt{anana} = (\texttt{an})^{5/2}$ and $\texttt{ee} = \texttt{e}^2$. There are fewer than $n$ runs in any length-$n$ string, and computing all runs for a string over a linearly-sortable alphabet takes $\mathcal{O}(n)$ time (Bannai et al., SODA 2015). Kosolobov conjectured that there also exists a linear time runs algorithm for general ordered alphabets (Inf. Process. Lett. 2016). The conjecture was almost proven by Crochemore et al., who presented an $\mathcal{O}(nα(n))$ time algorithm (where $α(n)$ is the extremely slowly growing inverse Ackermann function). We show how to achieve $\mathcal{O}(n)$ time by exploiting combinatorial properties of the Lyndon array, thus proving Kosolobov's conjecture.

Submitted 17 February, 2021; originally announced February 2021.

Comments: This work has been submitted to ICALP 2021
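
The notion of a run in the abstract above can be illustrated with a small brute-force sketch. It assumes the usual definition: a run is a substring whose length is at least twice its smallest period and that cannot be extended to the left or right without breaking that period. This is a quadratic-time illustration and not the linear-time, Lyndon-array-based method of the paper. The function name `runs_bruteforce` is hypothetical.

```python
def runs_bruteforce(s: str) -> list[tuple[int, int, int]]:
    """Return all runs of s as (start, end_exclusive, smallest_period).

    Quadratic-time illustration, assuming the standard definition of a run.
    """
    n = len(s)
    found: dict[tuple[int, int], int] = {}
    # Periods are tried in increasing order, so the first period recorded
    # for an interval is its smallest period.
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if s[k] != s[k + p]:
                k += 1
                continue
            start = k
            # Extend the stretch in which s[k] == s[k + p] as far as possible.
            while k + p < n and s[k] == s[k + p]:
                k += 1
            end = k + p  # exclusive end of the maximal substring with period p
            if end - start >= 2 * p and (start, end) not in found:
                found[(start, end)] = p
    return sorted((st, en, p) for (st, en), p in found.items())


if __name__ == "__main__":
    text = "bananatree"
    for st, en, p in runs_bruteforce(text):
        print(text[st:en], "period", p)
    # anana period 2
    # ee period 1
```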

arXiv:2006.02219 [pdf, ps, other]

LCP-Aware Parallel String Sorting

Authors: Jonas Ellert, Johannes Fischer, Nodari Sitchinava

Abstract: When lexicographically sorting strings, it is not always necessary to inspect all symbols. For example, the lexicographical rank of "europar" amongst the strings "eureka", "eurasia", and "excells" only depends on its so called relevant prefix "euro". The distinguishing prefix size $D$ of a set of strings is the number of symbols that actually need to be inspected to establish the lexicographical ordering of all strings. Efficient string sorters should be $D$-aware, i.e. their complexity should depend on $D$ rather than on the total number $N$ of all symbols in all strings. While there are many $D$-aware sorters in the sequential setting, there appear to be no such results in the PRAM model. We propose a framework yielding a $D$-aware modification of any existing PRAM string sorter. The derived algorithms are work-optimal with respect to their original counterpart: If the original algorithm requires $O(w(N))$ work, the derived one requires $O(w(D))$ work. The execution time increases only by a small factor that is logarithmic in the length of the longest relevant prefix. Our framework universally works for deterministic and randomized algorithms in all variations of the PRAM model, such that future improvements in ($D$-unaware) parallel string sorting will directly result in improvements in $D$-aware parallel string sorting.

Submitted 3 June, 2020; originally announced June 2020.

Comments: Accepted at Euro-Par 2020 and to be published by Springer as part of the conference proceedings
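
The "europar" example in the abstract above can be reproduced with a short sequential sketch. It assumes that the relevant prefix of a string is one symbol longer than its longest common prefix with any other string in the set (capped at the string's length), and that $D$ is the sum of these prefix lengths. Both are assumptions about the definitions, and the code has nothing to do with the PRAM framework of the paper. The helper names `lcp`, `relevant_prefix`, and `distinguishing_prefix_size` are hypothetical.

```python
def lcp(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def relevant_prefix(s: str, others: list[str]) -> str:
    """Shortest prefix of s that separates it from every string in others."""
    longest = max((lcp(s, t) for t in others), default=0)
    return s[:min(len(s), longest + 1)]


def distinguishing_prefix_size(strings: list[str]) -> int:
    """Sum of relevant prefix lengths over the set (assumed definition of D)."""
    total = 0
    for i, s in enumerate(strings):
        others = strings[:i] + strings[i + 1:]
        total += len(relevant_prefix(s, others))
    return total


if __name__ == "__main__":
    print(relevant_prefix("europar", ["eureka", "eurasia", "excells"]))  # euro
    print(distinguishing_prefix_size(["europar", "eureka", "eurasia", "excells"]))  # 14
```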

arXiv:1911.03542 [pdf, ps, other]

Space Efficient Construction of Lyndon Arrays in Linear Time

Authors: Philip Bille, Jonas Ellert, Johannes Fischer, Inge Li Gørtz, Florian Kurpicz, Ian Munro, Eva Rotenberg

Abstract: We present the first linear time algorithm to construct the $2n$-bit version of the Lyndon array for a string of length $n$ using only $o(n)$ bits of working space. A simpler variant of this algorithm computes the plain ($n\lg n$-bit) version of the Lyndon array using only $\mathcal{O}(1)$ words of additional working space. All previous algorithms are either not linear, or use at least $n\lg n$ bits of additional working space. Also in practice, our new algorithms outperform the previous best ones by an order of magnitude, both in terms of time and space.

Submitted 10 December, 2019; v1 submitted 8 November, 2019; originally announced November 2019.

arXiv:1907.03235 [pdf, other]

Bidirectional Text Compression in External Memory

Authors: Patrick Dinklage, Jonas Ellert, Johannes Fischer, Dominik Köppl, Manuel Penschuck

Abstract: Bidirectional compression algorithms work by substituting repeated substrings by references that, unlike in the famous LZ77-scheme, can point to either direction. We present such an algorithm that is particularly suited for an external memory implementation. We evaluate it experimentally on large data sets of size up to 128 GiB (using only 16 GiB of RAM) and show that it is significantly faster than all known LZ77 compressors, while producing a roughly similar number of factors. We also introduce an external memory decompressor for texts compressed with any uni- or bidirectional compression scheme.

Submitted 3 December, 2019; v1 submitted 7 July, 2019; originally announced July 2019.

Showing 1–5 of 5 results for author: Ellert, J