Search | arXiv e-print repository

doi 10.1145/3442381.3450014

Unsupervised Multi-Index Semantic Hashing

Authors: Christian Hansen, Casper Hansen, Jakob Grue Simonsen, Stephen Alstrup, Christina Lioma

Abstract: Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan strategy for searching over all the hash codes, even though much faster al… ▽ More Semantic hashing represents documents as compact binary vectors (hash codes) and allows both efficient and effective similarity search in large-scale information retrieval. The state of the art has primarily focused on learning hash codes that improve similarity search effectiveness, while assuming a brute-force linear scan strategy for searching over all the hash codes, even though much faster alternatives exist. One such alternative is multi-index hashing, an approach that constructs a smaller candidate set to search over, which depending on the distribution of the hash codes can lead to sub-linear search time. In this work, we propose Multi-Index Semantic Hashing (MISH), an unsupervised hashing model that learns hash codes that are both effective and highly efficient by being optimized for multi-index hashing. We derive novel training objectives, which enable to learn hash codes that reduce the candidate sets produced by multi-index hashing, while being end-to-end trainable. In fact, our proposed training objectives are model agnostic, i.e., not tied to how the hash codes are generated specifically in MISH, and are straight-forward to include in existing and future semantic hashing models. We experimentally compare MISH to state-of-the-art semantic hashing baselines in the task of document similarity search. We find that even though multi-index hashing also improves the efficiency of the baselines compared to a linear scan, they are still upwards of 33% slower than MISH, while MISH is still able to obtain state-of-the-art effectiveness. △ Less

Submitted 26 March, 2021; originally announced March 2021.

Comments: Proceedings of the 2021 World Wide Web Conference, published under Creative Commons CC-BY 4.0 License

arXiv:2007.00380 [pdf, other]

doi 10.1145/3397271.3401220

Unsupervised Semantic Hashing with Pairwise Reconstruction

Authors: Casper Hansen, Christian Hansen, Jakob Grue Simonsen, Stephen Alstrup, Christina Lioma

Abstract: Semantic Hashing is a popular family of methods for efficient similarity search in large-scale datasets. In Semantic Hashing, documents are encoded as short binary vectors (i.e., hash codes), such that semantic similarity can be efficiently computed using the Hamming distance. Recent state-of-the-art approaches have utilized weak supervision to train better performing hashing models. Inspired by t… ▽ More Semantic Hashing is a popular family of methods for efficient similarity search in large-scale datasets. In Semantic Hashing, documents are encoded as short binary vectors (i.e., hash codes), such that semantic similarity can be efficiently computed using the Hamming distance. Recent state-of-the-art approaches have utilized weak supervision to train better performing hashing models. Inspired by this, we present Semantic Hashing with Pairwise Reconstruction (PairRec), which is a discrete variational autoencoder based hashing model. PairRec first encodes weakly supervised training pairs (a query document and a semantically similar document) into two hash codes, and then learns to reconstruct the same query document from both of these hash codes (i.e., pairwise reconstruction). This pairwise reconstruction enables our model to encode local neighbourhood structures within the hash code directly through the decoder. We experimentally compare PairRec to traditional and state-of-the-art approaches, and obtain significant performance improvements in the task of document similarity search. △ Less

Submitted 1 July, 2020; originally announced July 2020.

Comments: Accepted at SIGIR'20

arXiv:2006.09736 [pdf, other]

doi 10.1145/3397271.3401221

Factuality Checking in News Headlines with Eye Tracking

Authors: Christian Hansen, Casper Hansen, Jakob Grue Simonsen, Birger Larsen, Stephen Alstrup, Christina Lioma

Abstract: We study whether it is possible to infer if a news headline is true or false using only the movement of the human eyes when reading news headlines. Our study with 55 participants who are eye-tracked when reading 108 news headlines (72 true, 36 false) shows that false headlines receive statistically significantly less visual attention than true headlines. We further build an ensemble learner that p… ▽ More We study whether it is possible to infer if a news headline is true or false using only the movement of the human eyes when reading news headlines. Our study with 55 participants who are eye-tracked when reading 108 news headlines (72 true, 36 false) shows that false headlines receive statistically significantly less visual attention than true headlines. We further build an ensemble learner that predicts news headline factuality using only eye-tracking measurements. Our model yields a mean AUC of 0.688 and is better at detecting false than true headlines. Through a model analysis, we find that eye-tracking 25 users when reading 3-6 headlines is sufficient for our ensemble learner. △ Less

Submitted 17 June, 2020; originally announced June 2020.

Comments: Accepted to SIGIR 2020

arXiv:2006.00617 [pdf, other]

doi 10.1145/3397271.3401060

Content-aware Neural Hashing for Cold-start Recommendation

Authors: Casper Hansen, Christian Hansen, Jakob Grue Simonsen, Stephen Alstrup, Christina Lioma

Abstract: Content-aware recommendation approaches are essential for providing meaningful recommendations for \textit{new} (i.e., \textit{cold-start}) items in a recommender system. We present a content-aware neural hashing-based collaborative filtering approach (NeuHash-CF), which generates binary hash codes for users and items, such that the highly efficient Hamming distance can be used for estimating user… ▽ More Content-aware recommendation approaches are essential for providing meaningful recommendations for \textit{new} (i.e., \textit{cold-start}) items in a recommender system. We present a content-aware neural hashing-based collaborative filtering approach (NeuHash-CF), which generates binary hash codes for users and items, such that the highly efficient Hamming distance can be used for estimating user-item relevance. NeuHash-CF is modelled as an autoencoder architecture, consisting of two joint hashing components for generating user and item hash codes. Inspired from semantic hashing, the item hashing component generates a hash code directly from an item's content information (i.e., it generates cold-start and seen item hash codes in the same manner). This contrasts existing state-of-the-art models, which treat the two item cases separately. The user hash codes are generated directly based on user id, through learning a user embedding matrix. We show experimentally that NeuHash-CF significantly outperforms state-of-the-art baselines by up to 12\% NDCG and 13\% MRR in cold-start recommendation settings, and up to 4\% in both NDCG and MRR in standard settings where all items are present while training. Our approach uses 2-4x shorter hash codes, while obtaining the same or better performance compared to the state of the art, thus consequently also enabling a notable storage reduction. △ Less

Submitted 31 May, 2020; originally announced June 2020.

Comments: Accepted to SIGIR 2020

arXiv:1909.06856 [pdf, other]

Modelling End-of-Session Actions in Educational Systems

Authors: Christian Hansen, Casper Hansen, Stephen Alstrup, Christina Lioma

Abstract: In this paper we consider the problem of modelling when students end their session in an online mathematics educational system. Being able to model this accurately will help us optimize the way content is presented and consumed. This is done by modelling the probability of an action being the last in a session, which we denote as the End-of-Session probability. We use log data from a system where… ▽ More In this paper we consider the problem of modelling when students end their session in an online mathematics educational system. Being able to model this accurately will help us optimize the way content is presented and consumed. This is done by modelling the probability of an action being the last in a session, which we denote as the End-of-Session probability. We use log data from a system where students can learn mathematics through various kinds of learning materials, as well as multiple types of exercises, such that a student session can consist of many different activities. We model the End-of-Session probability by a deep recurrent neural network in order to utilize the long term temporal aspect, which we experimentally show is central for this task. Using a large scale dataset of more than 70 million student actions, we obtain an AUC of 0.81 on an unseen collection of students. Through a detailed error analysis, we observe that our model is robust across different session structures and across varying session lengths. △ Less

Submitted 15 September, 2019; originally announced September 2019.

Comments: In proceedings of EDM 2019

arXiv:1908.08937 [pdf, ps, other]

Tracking Behavioral Patterns among Students in an Online Educational System

Authors: Stephan Lorenzen, Niklas Hjuler, Stephen Alstrup

Abstract: Analysis of log data generated by online educational systems is an essential task to better the educational systems and increase our understanding of how students learn. In this study we investigate previously unseen data from Clio Online, the largest provider of digital learning content for primary schools in Denmark. We consider data for 14,810 students with 3 million sessions in the period 2015… ▽ More Analysis of log data generated by online educational systems is an essential task to better the educational systems and increase our understanding of how students learn. In this study we investigate previously unseen data from Clio Online, the largest provider of digital learning content for primary schools in Denmark. We consider data for 14,810 students with 3 million sessions in the period 2015-2017. We analyze student activity in periods of one week. By using non-negative matrix factorization techniques, we obtain soft clusterings, revealing dependencies among time of day, subject, activity type, activity complexity (measured by Bloom's taxonomy), and performance. Furthermore, our method allows for tracking behavioral changes of individual students over time, as well as general behavioral changes in the educational system. Based on the results, we give suggestions for behavioral changes, in order to optimize the learning experience and improve performance. △ Less

Submitted 21 August, 2019; originally announced August 2019.

Journal ref: In Proceedings of the 11'th International Conference on Educational Data Mining (EDM), p. 280-285. 2018

arXiv:1906.03072 [pdf, ps, other]

Investigating Writing Style Development in High School

Authors: Stephan Lorenzen, Niklas Hjuler, Stephen Alstrup

Abstract: In this paper we do the first large scale analysis of writing style development among Danish high school students. More than 10K students with more than 100K essays are analyzed. Writing style itself is often studied in the natural language processing community, but usually with the goal of verifying authorship, assessing quality or popularity, or other kinds of predictions. In this work, we ana… ▽ More In this paper we do the first large scale analysis of writing style development among Danish high school students. More than 10K students with more than 100K essays are analyzed. Writing style itself is often studied in the natural language processing community, but usually with the goal of verifying authorship, assessing quality or popularity, or other kinds of predictions. In this work, we analyze writing style changes over time, with the goal of detecting global development trends among students, and identifying at-risk students. We train a Siamese neural network to compute the similarity between two texts. Using this similarity measure, a student's newer essays are compared to their first essays, and a writing style development profile is constructed for the student. We cluster these student profiles and analyze the resulting clusters in order to detect general development patterns. We evaluate clusters with respect to writing style quality indicators, and identify optimal clusters, showing significant improvement in writing style, while also observing suboptimal clusters, exhibiting periods of limited development and even setbacks. Furthermore, we identify general development trends between high school students, showing that as students progress through high school, their writing style deviates, leaving students less similar when they finish high school, than when they start. △ Less

Submitted 4 June, 2019; originally announced June 2019.

Comments: A short version of this paper will be presented at EDM 2019

arXiv:1906.01635 [pdf, ps, other]

Detecting Ghostwriters in High Schools

Authors: Magnus Stavngaard, August Sørensen, Stephan Lorenzen, Niklas Hjuler, Stephen Alstrup

Abstract: Students hiring ghostwriters to write their assignments is an increasing problem in educational institutions all over the world, with companies selling these services as a product. In this work, we develop automatic techniques with special focus on detecting such ghostwriting in high school assignments. This is done by training deep neural networks on an unprecedented large amount of data supplied… ▽ More Students hiring ghostwriters to write their assignments is an increasing problem in educational institutions all over the world, with companies selling these services as a product. In this work, we develop automatic techniques with special focus on detecting such ghostwriting in high school assignments. This is done by training deep neural networks on an unprecedented large amount of data supplied by the Danish company MaCom, which covers 90% of Danish high schools. We achieve an accuracy of 0.875 and a AUC score of 0.947 on an evenly split data set. △ Less

Submitted 4 June, 2019; originally announced June 2019.

Comments: Presented at ESANN 2019

Journal ref: Proceedings. ESANN 2019: 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. ed. Michel Verleysen. 2019. p 197-202

arXiv:1906.00674 [pdf, other]

Contextually Propagated Term Weights for Document Representation

Authors: Casper Hansen, Christian Hansen, Stephen Alstrup, Jakob Grue Simonsen, Christina Lioma

Abstract: Word embeddings predict a word from its neighbours by learning small, dense embedding vectors. In practice, this prediction corresponds to a semantic score given to the predicted word (or term weight). We present a novel model that, given a target word, redistributes part of that word's weight (that has been computed with word embeddings) across words occurring in similar contexts as the target wo… ▽ More Word embeddings predict a word from its neighbours by learning small, dense embedding vectors. In practice, this prediction corresponds to a semantic score given to the predicted word (or term weight). We present a novel model that, given a target word, redistributes part of that word's weight (that has been computed with word embeddings) across words occurring in similar contexts as the target word. Thus, our model aims to simulate how semantic meaning is shared by words occurring in similar contexts, which is incorporated into bag-of-words document representations. Experimental evaluation in an unsupervised setting against 8 state of the art baselines shows that our model yields the best micro and macro F1 scores across datasets of increasing difficulty. △ Less

Submitted 3 June, 2019; originally announced June 2019.

Comments: SIGIR 2019

arXiv:1906.00671 [pdf, other]

Unsupervised Neural Generative Semantic Hashing

Authors: Casper Hansen, Christian Hansen, Jakob Grue Simonsen, Stephen Alstrup, Christina Lioma

Abstract: Fast similarity search is a key component in large-scale information retrieval, where semantic hashing has become a popular strategy for representing documents as binary hash codes. Recent advances in this area have been obtained through neural network based models: generative models trained by learning to reconstruct the original documents. We present a novel unsupervised generative semantic hash… ▽ More Fast similarity search is a key component in large-scale information retrieval, where semantic hashing has become a popular strategy for representing documents as binary hash codes. Recent advances in this area have been obtained through neural network based models: generative models trained by learning to reconstruct the original documents. We present a novel unsupervised generative semantic hashing approach, \textit{Ranking based Semantic Hashing} (RBSH) that consists of both a variational and a ranking based component. Similarly to variational autoencoders, the variational component is trained to reconstruct the original document conditioned on its generated hash code, and as in prior work, it only considers documents individually. The ranking component solves this limitation by incorporating inter-document similarity into the hash code generation, modelling document ranking through a hinge loss. To circumvent the need for labelled data to compute the hinge loss, we use a weak labeller and thus keep the approach fully unsupervised. Extensive experimental evaluation on four publicly available datasets against traditional baselines and recent state-of-the-art methods for semantic hashing shows that RBSH significantly outperforms all other methods across all evaluated hash code lengths. In fact, RBSH hash codes are able to perform similarly to state-of-the-art hash codes while using 2-4x fewer bits. △ Less

Submitted 3 June, 2019; originally announced June 2019.

Comments: SIGIR 2019

arXiv:1904.00761 [pdf, other]

Neural Speed Reading with Structural-Jump-LSTM

Authors: Christian Hansen, Casper Hansen, Stephen Alstrup, Jakob Grue Simonsen, Christina Lioma

Abstract: Recurrent neural networks (RNNs) can model natural language by sequentially 'reading' input tokens and outputting a distributed representation of each token. Due to the sequential nature of RNNs, inference time is linearly dependent on the input length, and all inputs are read regardless of their importance. Efforts to speed up this inference, known as 'neural speed reading', either ignore or skim… ▽ More Recurrent neural networks (RNNs) can model natural language by sequentially 'reading' input tokens and outputting a distributed representation of each token. Due to the sequential nature of RNNs, inference time is linearly dependent on the input length, and all inputs are read regardless of their importance. Efforts to speed up this inference, known as 'neural speed reading', either ignore or skim over part of the input. We present Structural-Jump-LSTM: the first neural speed reading model to both skip and jump text during inference. The model consists of a standard LSTM and two agents: one capable of skipping single words when reading, and one capable of exploiting punctuation structure (sub-sentence separators (,:), sentence end symbols (.!?), or end of text markers) to jump ahead after reading a word. A comprehensive experimental evaluation of our model against all five state-of-the-art neural reading models shows that Structural-Jump-LSTM achieves the best overall floating point operations (FLOP) reduction (hence is faster), while keeping the same accuracy or even improving it compared to a vanilla LSTM that reads the whole text. △ Less

Submitted 2 April, 2019; v1 submitted 20 March, 2019; originally announced April 2019.

Comments: 10 pages

Journal ref: 7th International Conference on Learning Representations (ICLR) 2019

arXiv:1903.08408 [pdf, other]

Modelling Sequential Music Track Skips using a Multi-RNN Approach

Authors: Christian Hansen, Casper Hansen, Stephen Alstrup, Jakob Grue Simonsen, Christina Lioma

Abstract: Modelling sequential music skips provides streaming companies the ability to better understand the needs of the user base, resulting in a better user experience by reducing the need to manually skip certain music tracks. This paper describes the solution of the University of Copenhagen DIKU-IR team in the 'Spotify Sequential Skip Prediction Challenge', where the task was to predict the skip behavi… ▽ More Modelling sequential music skips provides streaming companies the ability to better understand the needs of the user base, resulting in a better user experience by reducing the need to manually skip certain music tracks. This paper describes the solution of the University of Copenhagen DIKU-IR team in the 'Spotify Sequential Skip Prediction Challenge', where the task was to predict the skip behaviour of the second half in a music listening session conditioned on the first half. We model this task using a Multi-RNN approach consisting of two distinct stacked recurrent neural networks, where one network focuses on encoding the first half of the session and the other network focuses on utilizing the encoding to make sequential skip predictions. The encoder network is initialized by a learned session-wide music encoding, and both of them utilize a learned track embedding. Our final model consists of a majority voted ensemble of individually trained models, and ranked 2nd out of 45 participating teams in the competition with a mean average accuracy of 0.641 and an accuracy on the first skip prediction of 0.807. Our code is released at https://github.com/Varyn/WSDM-challenge-2019-spotify. △ Less

Submitted 20 March, 2019; originally announced March 2019.

Comments: 4 pages

Journal ref: 12th ACM International Conference on Web Search and Data Mining (WSDM) 2019, WSDM Cup

arXiv:1903.08404 [pdf, other]

Neural Check-Worthiness Ranking with Weak Supervision: Finding Sentences for Fact-Checking

Authors: Casper Hansen, Christian Hansen, Stephen Alstrup, Jakob Grue Simonsen, Christina Lioma

Abstract: Automatic fact-checking systems detect misinformation, such as fake news, by (i) selecting check-worthy sentences for fact-checking, (ii) gathering related information to the sentences, and (iii) inferring the factuality of the sentences. Most prior research on (i) uses hand-crafted features to select check-worthy sentences, and does not explicitly account for the recent finding that the top weigh… ▽ More Automatic fact-checking systems detect misinformation, such as fake news, by (i) selecting check-worthy sentences for fact-checking, (ii) gathering related information to the sentences, and (iii) inferring the factuality of the sentences. Most prior research on (i) uses hand-crafted features to select check-worthy sentences, and does not explicitly account for the recent finding that the top weighted terms in both check-worthy and non-check-worthy sentences are actually overlapping [15]. Motivated by this, we present a neural check-worthiness sentence ranking model that represents each word in a sentence by \textit{both} its embedding (aiming to capture its semantics) and its syntactic dependencies (aiming to capture its role in modifying the semantics of other terms in the sentence). Our model is an end-to-end trainable neural network for check-worthiness ranking, which is trained on large amounts of unlabelled data through weak supervision. Thorough experimental evaluation against state of the art baselines, with and without weak supervision, shows our model to be superior at all times (+13% in MAP and +28% at various Precision cut-offs from the best baseline with statistical significance). Empirical analysis of the use of weak supervision, word embedding pretraining on domain-specific data, and the use of syntactic dependencies of our model reveals that check-worthy sentences contain notably more identical syntactic dependencies than non-check-worthy sentences. △ Less

Submitted 20 March, 2019; originally announced March 2019.

Comments: 6 pages

Journal ref: In Companion Proceedings of the 2019 World Wide Web Conference

arXiv:1709.01960 [pdf, other]

Constructing Light Spanners Deterministically in Near-Linear Time

Authors: Stephen Alstrup, Søren Dahlgaard, Arnold Filtser, Morten Stöckel, Christian Wulff-Nilsen

Abstract: Graph spanners are well-studied and widely used both in theory and practice. In a recent breakthrough, Chechik and Wulff-Nilsen [CW18] improved the state-of-the-art for light spanners by constructing a $(2k-1)(1+ε)$-spanner with $O(n^{1+1/k})$ edges and $O_ε(n^{1/k})$ lightness. Soon after, Filtser and Solomon [FS19] showed that the classic greedy spanner construction achieves the same bounds The… ▽ More Graph spanners are well-studied and widely used both in theory and practice. In a recent breakthrough, Chechik and Wulff-Nilsen [CW18] improved the state-of-the-art for light spanners by constructing a $(2k-1)(1+ε)$-spanner with $O(n^{1+1/k})$ edges and $O_ε(n^{1/k})$ lightness. Soon after, Filtser and Solomon [FS19] showed that the classic greedy spanner construction achieves the same bounds The major drawback of the greedy spanner is its running time of $O(mn^{1+1/k})$ (which is faster than [CW16]). This makes the construction impractical even for graphs of moderate size. Much faster spanner constructions do exist but they only achieve lightness $Ω_ε(kn^{1/k})$, even when randomization is used. The contribution of this paper is deterministic spanner constructions that are fast, and achieve similar bounds as the state-of-the-art slower constructions. Our first result is an $O_ε(n^{2+1/k+ε'})$ time spanner construction which achieves the state-of-the-art bounds. Our second result is an $O_ε(m + n\log n)$ time construction of a spanner with $(2k-1)(1+ε)$ stretch, $O(\log k\cdot n^{1+1/k})$ edges and $O_ε(\log k\cdot n^{1/k})$ lightness. This is an exponential improvement in the dependence on $k$ compared to the previous result with such running time. Finally, for the important special case where $k=\log n$, for every constant $ε>0$, we provide an $O(m+n^{1+ε})$ time construction that produces an $O(\log n)$-spanner with $O(n)$ edges and $O(1)$ lightness which is asymptotically optimal. This is the first known sub-quadratic construction of such a spanner for any $k = ω(1)$. To achieve our constructions, we show a novel deterministic incremental approximate distance oracle, which may be of independent interest. △ Less

Submitted 19 January, 2022; v1 submitted 6 September, 2017; originally announced September 2017.

arXiv:1708.06403 [pdf, other]

Smart City Analytics: Ensemble-Learned Prediction of Citizen Home Care

Authors: Casper Hansen, Christian Hansen, Stephen Alstrup, Christina Lioma

Abstract: We present an ensemble learning method that predicts large increases in the hours of home care received by citizens. The method is supervised, and uses different ensembles of either linear (logistic regression) or non-linear (random forests) classifiers. Experiments with data available from 2013 to 2017 for every citizen in Copenhagen receiving home care (27,775 citizens) show that prediction can… ▽ More We present an ensemble learning method that predicts large increases in the hours of home care received by citizens. The method is supervised, and uses different ensembles of either linear (logistic regression) or non-linear (random forests) classifiers. Experiments with data available from 2013 to 2017 for every citizen in Copenhagen receiving home care (27,775 citizens) show that prediction can achieve state of the art performance as reported in similar health related domains (AUC=0.715). We further find that competitive results can be obtained by using limited information for training, which is very useful when full records are not accessible or available. Smart city analytics does not necessarily require full city records. To our knowledge this preliminary study is the first to predict large increases in home care for smart city analytics. △ Less

Submitted 21 August, 2017; originally announced August 2017.

arXiv:1708.04164 [pdf, other]

Sequence Modelling For Analysing Student Interaction with Educational Systems

Authors: Christian Hansen, Casper Hansen, Niklas Hjuler, Stephen Alstrup, Christina Lioma

Abstract: The analysis of log data generated by online educational systems is an important task for improving the systems, and furthering our knowledge of how students learn. This paper uses previously unseen log data from Edulab, the largest provider of digital learning for mathematics in Denmark, to analyse the sessions of its users, where 1.08 million student sessions are extracted from a subset of their… ▽ More The analysis of log data generated by online educational systems is an important task for improving the systems, and furthering our knowledge of how students learn. This paper uses previously unseen log data from Edulab, the largest provider of digital learning for mathematics in Denmark, to analyse the sessions of its users, where 1.08 million student sessions are extracted from a subset of their data. We propose to model students as a distribution of different underlying student behaviours, where the sequence of actions from each session belongs to an underlying student behaviour. We model student behaviour as Markov chains, such that a student is modelled as a distribution of Markov chains, which are estimated using a modified k-means clustering algorithm. The resulting Markov chains are readily interpretable, and in a qualitative analysis around 125,000 student sessions are identified as exhibiting unproductive student behaviour. Based on our results this student representation is promising, especially for educational systems offering many different learning usages, and offers an alternative to common approaches like modelling student behaviour as a single Markov chain often done in the literature. △ Less

Submitted 14 August, 2017; originally announced August 2017.

Comments: The 10th International Conference on Educational Data Mining 2017

arXiv:1607.04911 [pdf, other]

Near-Optimal Induced Universal Graphs for Bounded Degree Graphs

Authors: Mikkel Abrahamsen, Stephen Alstrup, Jacob Holm, Mathias Bæk Tejs Knudsen, Morten Stöckel

Abstract: A graph $U$ is an induced universal graph for a family $F$ of graphs if every graph in $F$ is a vertex-induced subgraph of $U$. For the family of all undirected graphs on $n$ vertices Alstrup, Kaplan, Thorup, and Zwick [STOC 2015] give an induced universal graph with $O\!\left(2^{n/2}\right)$ vertices, matching a lower bound by Moon [Proc. Glasgow Math. Assoc. 1965]. Let $k= \lceil D/2 \rceil$.… ▽ More A graph $U$ is an induced universal graph for a family $F$ of graphs if every graph in $F$ is a vertex-induced subgraph of $U$. For the family of all undirected graphs on $n$ vertices Alstrup, Kaplan, Thorup, and Zwick [STOC 2015] give an induced universal graph with $O\!\left(2^{n/2}\right)$ vertices, matching a lower bound by Moon [Proc. Glasgow Math. Assoc. 1965]. Let $k= \lceil D/2 \rceil$. Improving asymptotically on previous results by Butler [Graphs and Combinatorics 2009] and Esperet, Arnaud and Ochem [IPL 2008], we give an induced universal graph with $O\!\left(\frac{k2^k}{k!}n^k \right)$ vertices for the family of graphs with $n$ vertices of maximum degree $D$. For constant $D$, Butler gives a lower bound of $Ω\!\left(n^{D/2}\right)$. For an odd constant $D\geq 3$, Esperet et al. and Alon and Capalbo [SODA 2008] give a graph with $O\!\left(n^{k-\frac{1}{D}}\right)$ vertices. Using their techniques for any (including constant) even values of $D$ gives asymptotically worse bounds than we present. For large $D$, i.e. when $D = Ω\left(\log^3 n\right)$, the previous best upper bound was ${n\choose\lceil D/2\rceil} n^{O(1)}$ due to Adjiashvili and Rotbart [ICALP 2014]. We give upper and lower bounds showing that the size is ${\lfloor n/2\rfloor\choose\lfloor D/2 \rfloor}2^{\pm\tilde{O}\left(\sqrt{D}\right)}$. Hence the optimal size is $2^{\tilde{O}(D)}$ and our construction is within a factor of $2^{\tilde{O}\left(\sqrt{D}\right)}$ from this. The previous results were larger by at least a factor of $2^{Ω(D)}$. As a part of the above, proving a conjecture by Esperet et al., we construct an induced universal graph with $2n-1$ vertices for the family of graphs with max degree $2$. In addition, we give results for acyclic graphs with max degree $2$ and cycle graphs. Our results imply the first labeling schemes that for any $D$ are at most $o(n)$ bits from optimal. △ Less

Submitted 21 July, 2016; v1 submitted 17 July, 2016; originally announced July 2016.

arXiv:1507.04046 [pdf, ps, other]

Distance labeling schemes for trees

Authors: Stephen Alstrup, Inge Li Gørtz, Esben Bistrup Halvorsen, Ely Porat

Abstract: We consider distance labeling schemes for trees: given a tree with $n$ nodes, label the nodes with binary strings such that, given the labels of any two nodes, one can determine, by looking only at the labels, the distance in the tree between the two nodes. A lower bound by Gavoille et. al. (J. Alg. 2004) and an upper bound by Peleg (J. Graph Theory 2000) establish that labels must use… ▽ More We consider distance labeling schemes for trees: given a tree with $n$ nodes, label the nodes with binary strings such that, given the labels of any two nodes, one can determine, by looking only at the labels, the distance in the tree between the two nodes. A lower bound by Gavoille et. al. (J. Alg. 2004) and an upper bound by Peleg (J. Graph Theory 2000) establish that labels must use $Θ(\log^2 n)$ bits\footnote{Throughout this paper we use $\log$ for $\log_2$.}. Gavoille et. al. (ESA 2001) show that for very small approximate stretch, labels use $Θ(\log n \log \log n)$ bits. Several other papers investigate various variants such as, for example, small distances in trees (Alstrup et. al., SODA'03). We improve the known upper and lower bounds of exact distance labeling by showing that $\frac{1}{4} \log^2 n$ bits are needed and that $\frac{1}{2} \log^2 n$ bits are sufficient. We also give ($1+ε$)-stretch labeling schemes using $Θ(\log n)$ bits for constant $ε>0$. ($1+ε$)-stretch labeling schemes with polylogarithmic label size have previously been established for doubling dimension graphs by Talwar (STOC 2004). In addition, we present matching upper and lower bounds for distance labeling for caterpillars, showing that labels must have size $2\log n - Θ(\log\log n)$. For simple paths with $k$ nodes and edge weights in $[1,n]$, we show that labels must have size $\frac{k-1}{k}\log n+Θ(\log k)$. △ Less

Submitted 14 July, 2015; originally announced July 2015.

arXiv:1507.02618 [pdf, other]

Sublinear Distance Labeling

Authors: Stephen Alstrup, Søren Dahlgaard, Mathias Bæk Tejs Knudsen, Ely Porat

Abstract: A distance labeling scheme labels the $n$ nodes of a graph with binary strings such that, given the labels of any two nodes, one can determine the distance in the graph between the two nodes by looking only at the labels. A $D$-preserving distance labeling scheme only returns precise distances between pairs of nodes that are at distance at least $D$ from each other. In this paper we consider dista… ▽ More A distance labeling scheme labels the $n$ nodes of a graph with binary strings such that, given the labels of any two nodes, one can determine the distance in the graph between the two nodes by looking only at the labels. A $D$-preserving distance labeling scheme only returns precise distances between pairs of nodes that are at distance at least $D$ from each other. In this paper we consider distance labeling schemes for the classical case of unweighted graphs with both directed and undirected edges. We present a $O(\frac{n}{D}\log^2 D)$ bit $D$-preserving distance labeling scheme, improving the previous bound by Bollobás et. al. [SIAM J. Discrete Math. 2005]. We also give an almost matching lower bound of $Ω(\frac{n}{D})$. With our $D$-preserving distance labeling scheme as a building block, we additionally achieve the following results: 1. We present the first distance labeling scheme of size $o(n)$ for sparse graphs (and hence bounded degree graphs). This addresses an open problem by Gavoille et. al. [J. Algo. 2004], hereby separating the complexity from distance labeling in general graphs which require $Ω(n)$ bits, Moon [Proc. of Glasgow Math. Association 1965]. 2. For approximate $r$-additive labeling schemes, that return distances within an additive error of $r$ we show a scheme of size $O\left ( \frac{n}{r} \cdot\frac{\operatorname{polylog} (r\log n)}{\log n} \right )$ for $r \ge 2$. This improves on the current best bound of $O\left(\frac{n}{r}\right)$ by Alstrup et. al. [SODA 2016] for sub-polynomial $r$, and is a generalization of a result by Gawrychowski et al. [arXiv preprint 2015] who showed this for $r=2$. △ Less

Submitted 8 September, 2016; v1 submitted 9 July, 2015; originally announced July 2015.

Comments: A preliminary version of this paper appeared at ESA'16

arXiv:1504.04498 [pdf, ps, other]

Simpler, faster and shorter labels for distances in graphs

Authors: Stephen Alstrup, Cyril Gavoille, Esben Bistrup Halvorsen, Holger Petersen

Abstract: We consider how to assign labels to any undirected graph with n nodes such that, given the labels of two nodes and no other information regarding the graph, it is possible to determine the distance between the two nodes. The challenge in such a distance labeling scheme is primarily to minimize the maximum label lenght and secondarily to minimize the time needed to answer distance queries (decoding… ▽ More We consider how to assign labels to any undirected graph with n nodes such that, given the labels of two nodes and no other information regarding the graph, it is possible to determine the distance between the two nodes. The challenge in such a distance labeling scheme is primarily to minimize the maximum label lenght and secondarily to minimize the time needed to answer distance queries (decoding). Previous schemes have offered different trade-offs between label lengths and query time. This paper presents a simple algorithm with shorter labels and shorter query time than any previous solution, thereby improving the state-of-the-art with respect to both label length and query time in one single algorithm. Our solution addresses several open problems concerning label length and decoding time and is the first improvement of label length for more than three decades. More specifically, we present a distance labeling scheme with label size (log 3)/2 + o(n) (logarithms are in base 2) and O(1) decoding time. This outperforms all existing results with respect to both size and decoding time, including Winkler's (Combinatorica 1983) decade-old result, which uses labels of size (log 3)n and O(n/log n) decoding time, and Gavoille et al. (SODA'01), which uses labels of size 11n + o(n) and O(loglog n) decoding time. In addition, our algorithm is simpler than the previous ones. In the case of integral edge weights of size at most W, we present almost matching upper and lower bounds for label sizes. For r-additive approximation schemes, where distances can be off by an additive constant r, we give both upper and lower bounds. In particular, we present an upper bound for 1-additive approximation schemes which, in the unweighted case, has the same size (ignoring second order terms) as an adjacency scheme: n/2. We also give results for bipartite graphs and for exact and 1-additive distance oracles. △ Less

Submitted 17 April, 2015; originally announced April 2015.

ACM Class: E.1; G.2.2; E.4

arXiv:1504.02306 [pdf, other]

Optimal induced universal graphs and adjacency labeling for trees

Authors: Stephen Alstrup, Søren Dahlgaard, Mathias Bæk Tejs Knudsen

Abstract: We show that there exists a graph $G$ with $O(n)$ nodes, where any forest of $n$ nodes is a node-induced subgraph of $G$. Furthermore, for constant arboricity $k$, the result implies the existence of a graph with $O(n^k)$ nodes that contains all $n$-node graphs as node-induced subgraphs, matching a $Ω(n^k)$ lower bound. The lower bound and previously best upper bounds were presented in Alstrup and… ▽ More We show that there exists a graph $G$ with $O(n)$ nodes, where any forest of $n$ nodes is a node-induced subgraph of $G$. Furthermore, for constant arboricity $k$, the result implies the existence of a graph with $O(n^k)$ nodes that contains all $n$-node graphs as node-induced subgraphs, matching a $Ω(n^k)$ lower bound. The lower bound and previously best upper bounds were presented in Alstrup and Rauhe (FOCS'02). Our upper bounds are obtained through a $\log_2 n +O(1)$ labeling scheme for adjacency queries in forests. We hereby solve an open problem being raised repeatedly over decades, e.g. in Kannan, Naor, Rudich (STOC 1988), Chung (J. of Graph Theory 1990), Fraigniaud and Korman (SODA 2010). △ Less

Submitted 15 February, 2016; v1 submitted 9 April, 2015; originally announced April 2015.

Comments: A preliminary version of this paper appeared at FOCS'15

arXiv:1404.3391 [pdf, ps, other]

Adjacency labeling schemes and induced-universal graphs

Authors: Stephen Alstrup, Haim Kaplan, Mikkel Thorup, Uri Zwick

Abstract: We describe a way of assigning labels to the vertices of any undirected graph on up to $n$ vertices, each composed of $n/2+O(1)$ bits, such that given the labels of two vertices, and no other information regarding the graph, it is possible to decide whether or not the vertices are adjacent in the graph. This is optimal, up to an additive constant, and constitutes the first improvement in almost 50… ▽ More We describe a way of assigning labels to the vertices of any undirected graph on up to $n$ vertices, each composed of $n/2+O(1)$ bits, such that given the labels of two vertices, and no other information regarding the graph, it is possible to decide whether or not the vertices are adjacent in the graph. This is optimal, up to an additive constant, and constitutes the first improvement in almost 50 years of an $n/2+O(\log n)$ bound of Moon. As a consequence, we obtain an induced-universal graph for $n$-vertex graphs containing only $O(2^{n/2})$ vertices, which is optimal up to a multiplicative constant, solving an open problem of Vizing from 1968. We obtain similar tight results for directed graphs, tournaments and bipartite graphs. △ Less

Submitted 13 April, 2014; originally announced April 2014.

ACM Class: G.2.2; E.1; E.2; G.2.1

arXiv:1312.4413 [pdf, ps, other]

doi 10.1137/1.9781611973402.72

Near-optimal labeling schemes for nearest common ancestors

Authors: Stephen Alstrup, Esben Bistrup Halvorsen, Kasper Green Larsen

Abstract: We consider NCA labeling schemes: given a rooted tree $T$, label the nodes of $T$ with binary strings such that, given the labels of any two nodes, one can determine, by looking only at the labels, the label of their nearest common ancestor. For trees with $n$ nodes we present upper and lower bounds establishing that labels of size $(2\pm ε)\log n$, $ε<1$ are both sufficient and necessary. (All… ▽ More We consider NCA labeling schemes: given a rooted tree $T$, label the nodes of $T$ with binary strings such that, given the labels of any two nodes, one can determine, by looking only at the labels, the label of their nearest common ancestor. For trees with $n$ nodes we present upper and lower bounds establishing that labels of size $(2\pm ε)\log n$, $ε<1$ are both sufficient and necessary. (All logarithms in this paper are in base 2.) Alstrup, Bille, and Rauhe (SIDMA'05) showed that ancestor and NCA labeling schemes have labels of size $\log n +Ω(\log \log n)$. Our lower bound increases this to $\log n + Ω(\log n)$ for NCA labeling schemes. Since Fraigniaud and Korman (STOC'10) established that labels in ancestor labeling schemes have size $\log n +Θ(\log \log n)$, our new lower bound separates ancestor and NCA labeling schemes. Our upper bound improves the $10 \log n$ upper bound by Alstrup, Gavoille, Kaplan and Rauhe (TOCS'04), and our theoretical result even outperforms some recent experimental studies by Fischer (ESA'09) where variants of the same NCA labeling scheme are shown to all have labels of size approximately $8 \log n$. △ Less

Submitted 16 December, 2013; originally announced December 2013.

ACM Class: E.1; G.2.2; E.4

arXiv:cs/0310065 [pdf, ps, other]

Maintaining Information in Fully-Dynamic Trees with Top Trees

Authors: Stephen Alstrup, Jacob Holm, Kristian de Lichtenberg, Mikkel Thorup

Abstract: We introduce top trees as a design of a new simpler interface for data structures maintaining information in a fully-dynamic forest. We demonstrate how easy and versatile they are to use on a host of different applications. For example, we show how to maintain the diameter, center, and median of each tree in the forest. The forest can be updated by insertion and deletion of edges and by changes… ▽ More We introduce top trees as a design of a new simpler interface for data structures maintaining information in a fully-dynamic forest. We demonstrate how easy and versatile they are to use on a host of different applications. For example, we show how to maintain the diameter, center, and median of each tree in the forest. The forest can be updated by insertion and deletion of edges and by changes to vertex and edge weights. Each update is supported in O(log n) time, where n is the size of the tree(s) involved in the update. Also, we show how to support nearest common ancestor queries and level ancestor queries with respect to arbitrary roots in O(log n) time. Finally, with marked and unmarked vertices, we show how to compute distances to a nearest marked vertex. The later has applications to approximate nearest marked vertex in general graphs, and thereby to static optimization problems over shortest path metrics. Technically speaking, top trees are easily implemented either with Frederickson's topology trees [Ambivalent Data Structures for Dynamic 2-Edge-Connectivity and k Smallest Spanning Trees, SIAM J. Comput. 26 (2) pp. 484-538, 1997] or with Sleator and Tarjan's dynamic trees [A Data Structure for Dynamic Trees. J. Comput. Syst. Sc. 26 (3) pp. 362-391, 1983]. However, we claim that the interface is simpler for many applications, and indeed our new bounds are quadratic improvements over previous bounds where they exist. △ Less

Submitted 21 November, 2003; v1 submitted 31 October, 2003; originally announced October 2003.

Comments: Preliminary versions of this work presented at ICALP'97 and SWAT'00. The new version takes layered top trees into account

ACM Class: E.1; F.2.2; G.2.2

arXiv:cs/0211010 [pdf, ps, other]

Efficient Tree Layout in a Multilevel Memory Hierarchy

Authors: Stephen Alstrup, Michael A. Bender, Erik D. Demaine, Martin Farach-Colton, Theis Rauhe, Mikkel Thorup

Abstract: We consider the problem of laying out a tree with fixed parent/child structure in hierarchical memory. The goal is to minimize the expected number of block transfers performed during a search along a root-to-leaf path, subject to a given probability distribution on the leaves. This problem was previously considered by Gil and Itai, who developed optimal but slow algorithms when the block-transfe… ▽ More We consider the problem of laying out a tree with fixed parent/child structure in hierarchical memory. The goal is to minimize the expected number of block transfers performed during a search along a root-to-leaf path, subject to a given probability distribution on the leaves. This problem was previously considered by Gil and Itai, who developed optimal but slow algorithms when the block-transfer size B is known. We present faster but approximate algorithms for the same problem; the fastest such algorithm runs in linear time and produces a solution that is within an additive constant of optimal. In addition, we show how to extend any approximately optimal algorithm to the cache-oblivious setting in which the block-transfer size is unknown to the algorithm. The query performance of the cache-oblivious layout is within a constant factor of the query performance of the optimal known-block-size layout. Computing the cache-oblivious layout requires only logarithmically many calls to the layout algorithm for known block size; in particular, the cache-oblivious layout can be computed in O(N lg N) time, where N is the number of nodes. Finally, we analyze two greedy strategies, and show that they have a performance ratio between Omega(lg B / lg lg B) and O(lg B) when compared to the optimal layout. △ Less

Submitted 28 July, 2004; v1 submitted 11 November, 2002; originally announced November 2002.

Comments: 18 pages. Version 2 adds faster dynamic programs. Preliminary version appeared in European Symposium on Algorithms, 2002

ACM Class: E.1; F.2.2

Showing 1–25 of 25 results for author: Alstrup, S