-
The Birkhoff completion of finite lattices
Authors:
Mohammad Abdulla,
Johannes Hirth,
Gerd Stumme
Abstract:
We introduce the Birkhoff completion as the smallest distributive lattice in which a given finite lattice can be embedded as a semi-lattice. We discuss its relationship to implicational theories, in particular to R. Wille's simply-implicational theories. By an example, we show how the Birkhoff completion can be used as a tool for ordinal data science.
Submitted 2 May, 2024;
originally announced May 2024.
-
Conceptual Mapping of Controversies
Authors:
Claude Draude,
Dominik Dürrschnabel,
Johannes Hirth,
Viktoria Horn,
Jonathan Kropf,
Jörn Lamla,
Gerd Stumme,
Markus Uhlmann
Abstract:
With our work, we contribute towards a qualitative analysis of the discourse on controversies in online news media. For this, we employ Formal Concept Analysis and the economics of conventions to derive conceptual controversy maps. In our experiments, we analyze two maps from different news journals with methods from ordinal data science. We show how these methods can be used to assess the diversity, complexity and potential bias of controversies. In addition to that, we discuss how the diagrams of concept lattices can be used to navigate between news articles.
Submitted 25 April, 2024;
originally announced April 2024.
-
Towards Ordinal Data Science
Authors:
Gerd Stumme,
Dominik Dürrschnabel,
Tom Hanika
Abstract:
Order is one of the main instruments to measure the relationship between objects in (empirical) data. However, compared to methods that use numerical properties of objects, the number of ordinal methods developed is rather small. One reason for this is the limited availability of computational resources in the last century that would have been required for ordinal computations. Another reason, particularly important for this line of research, is that order-based methods are often seen as too mathematically rigorous for applying them to real-world data. In this paper, we will therefore discuss different means for measuring and 'calculating' with ordinal structures (a specific class of directed graphs) and show how to infer knowledge from them. Our aim is to establish Ordinal Data Science as a fundamentally new research agenda. Besides cross-fertilization with other cornerstone machine learning and knowledge representation methods, a broad range of disciplines will benefit from this endeavor, including psychology, sociology, economics, web science, knowledge engineering, and scientometrics.
Submitted 6 December, 2023; v1 submitted 13 July, 2023;
originally announced July 2023.
-
Automatic Textual Explanations of Concept Lattices
Authors:
Johannes Hirth,
Viktoria Horn,
Gerd Stumme,
Tom Hanika
Abstract:
Lattices and their order diagrams are an essential tool for communicating knowledge and insights about data. This is in particular true when applying Formal Concept Analysis. Such representations, however, are difficult for untrained users to comprehend, and more generally whenever lattices are large. We tackle this problem by automatically generating textual explanations for lattices using standard scales. Our method is based on the general notion of ordinal motifs in lattices for the special case of standard scales. We show the computational complexity of identifying a small number of standard scales that cover most of the lattice structure. For these, we provide textual explanation templates, which can be applied to any occurrence of a scale in any data domain. These templates are derived using principles from human-computer interaction and allow for a comprehensive textual explanation of lattices. We demonstrate our approach on the spices planner data set, which is a medium-sized formal context comprising fifty-six meals (objects) and thirty-seven spices (attributes). The resulting 531 formal concepts can be covered by means of about 100 standard scales.
Submitted 17 April, 2023;
originally announced April 2023.
-
Ordinal Motifs in Lattices
Authors:
Johannes Hirth,
Viktoria Horn,
Gerd Stumme,
Tom Hanika
Abstract:
Lattices are a commonly used structure for the representation and analysis of relational and ontological knowledge. In particular, the analysis of these requires a decomposition of a large and high-dimensional lattice into a set of understandably large parts. With the present work we propose ordinal motifs as analytical units of meaning. We study these ordinal substructures (or standard scales) through (full) scale-measures of formal contexts from the field of formal concept analysis. We show that the underlying decision problems are NP-complete and provide results on how one can incrementally identify ordinal motifs to save computational effort. Accompanying our theoretical results, we demonstrate how ordinal motifs can be leveraged to retrieve basic meaning from a medium-sized ordinal data set.
Submitted 10 April, 2023;
originally announced April 2023.
-
Maximal Ordinal Two-Factorizations
Authors:
Dominik Dürrschnabel,
Gerd Stumme
Abstract:
Given a formal context, an ordinal factor is a subset of its incidence relation that forms a chain in the concept lattice, i.e., a part of the dataset that corresponds to a linear order. To visualize the data in a formal context, Ganter and Glodeanu proposed a biplot based on two ordinal factors. For the biplot to be useful, it is important that these factors comprise as many data points as possible, i.e., that they cover a large part of the incidence relation. In this work, we investigate such ordinal two-factorizations. First, we investigate for formal contexts that omit ordinal two-factorizations the disjointness of the two factors. Then, we show that deciding on the existence of two-factorizations of a given size is an NP-complete problem which makes computing maximal factorizations computationally expensive. Finally, we provide the algorithm Ord2Factor that allows us to compute large ordinal two-factorizations.
Submitted 20 June, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
Greedy Discovery of Ordinal Factors
Authors:
Dominik Dürrschnabel,
Gerd Stumme
Abstract:
In large datasets, it is hard to discover and analyze structure. It is thus common to introduce tags or keywords for the items. In applications, such datasets are then filtered based on these tags. Still, even medium-sized datasets with a few tags result in complex systems that are hard for humans to navigate. In this work, we adopt the method of ordinal factor analysis to address this problem. An ordinal factor arranges a subset of the tags in a linear order based on their underlying structure. A complete ordinal factorization, which consists of such ordinal factors, precisely represents the original dataset. Based on such an ordinal factorization, we provide a way to discover and explain relationships between different items and attributes in the dataset. However, computing even just one ordinal factor of high cardinality is computationally complex. We thus propose a greedy algorithm in this work. This algorithm extracts ordinal factors using already existing fast algorithms developed in formal concept analysis. Then, we leverage it to propose a comprehensive way to discover relationships in the dataset. We furthermore introduce a distance measure based on the representation emerging from the ordinal factorization to discover similar items. To evaluate the method, we conduct a case study on different datasets.
Submitted 19 February, 2023;
originally announced February 2023.
-
Factorizing Lattices by Interval Relations
Authors:
Maren Koyda,
Gerd Stumme
Abstract:
This work investigates the factorization of finite lattices to implode selected intervals while preserving the remaining order structure. We examine how complete congruence relations and complete tolerance relations can be utilized for this purpose and answer the question of finding the finest of those relations to implode a given interval in the generated factor lattice. To overcome the limitations of the factorization based on those relations, we introduce a new lattice factorization that enables the imploding of selected disjoint intervals of a finite lattice. To this end, we propose an interval relation that generates this factorization. To obtain lattices rather than arbitrary ordered sets, we restrict this approach to so-called pure intervals. For our study, we will make use of methods from Formal Concept Analysis (FCA). We will also provide a new FCA construction by introducing the enrichment of an incidence relation by a set of intervals in a formal context, to investigate the approach for lattice-generating interval relations on the context side.
Submitted 20 December, 2022;
originally announced December 2022.
-
Discovering Locally Maximal Bipartite Subgraphs
Authors:
Dominik Dürrschnabel,
Tom Hanika,
Gerd Stumme
Abstract:
Induced bipartite subgraphs of maximal vertex cardinality are an essential concept for the analysis of graphs. Yet, discovering them in large graphs is known to be computationally hard. Therefore, we consider in this work a weaker notion of this problem, where we discard the maximality constraint in favor of inclusion maximality. Thus, we aim to discover locally maximal bipartite subgraphs. For this, we present three heuristic approaches to extract such subgraphs and compare their results to the solutions of the global problem. For the latter, we employ the algorithmic strength of fast SAT-solvers. Our three proposed heuristics are based on a greedy strategy, a simulated annealing approach, and a genetic algorithm, respectively. We evaluate all four algorithms with respect to their time requirement and the vertex cardinality of the discovered bipartite subgraphs on several benchmark datasets.
Submitted 18 November, 2022;
originally announced November 2022.
-
Attribute Exploration with Multiple Contradicting Partial Experts
Authors:
Maximilian Felde,
Gerd Stumme
Abstract:
Attribute exploration is a method from Formal Concept Analysis (FCA) that helps a domain expert discover structural dependencies in knowledge domains which can be represented as formal contexts (cross tables of objects and attributes). In this paper we present an extension of attribute exploration that allows for a group of domain experts and explores their shared views. Each expert has their own view of the domain and the views of multiple experts may contain contradicting information.
Submitted 31 May, 2022;
originally announced May 2022.
-
Mapping Research Trajectories
Authors:
Bastian Schäfermeier,
Gerd Stumme,
Tom Hanika
Abstract:
Steadily growing amounts of information, such as annually published scientific papers, have become so large that they elude an extensive manual analysis. Hence, to maintain an overview, automated methods for the mapping and visualization of knowledge domains are necessary and important, e.g., for scientific decision makers. Of particular interest in this field is the development of research topics of different entities (e.g., scientific authors and venues) over time. However, existing approaches for their analysis are only suitable for single entity types, such as venues, and they often do not capture the research topics or the time dimension in an easily interpretable manner.
Hence, we propose a principled approach for mapping research trajectories, which is applicable to all kinds of scientific entities that can be represented by sets of published papers. For this, we transfer ideas and principles from the geographic visualization domain, specifically trajectory maps and interactive geographic maps. Our visualizations depict the research topics of entities over time in a straightforward, interpretable manner. They can be navigated by the user intuitively and restricted to specific elements of interest. The maps are derived from a corpus of research publications (i.e., titles and abstracts) through a combination of unsupervised machine learning methods.
In a practical demonstrator application, we exemplify the proposed approach on a publication corpus from machine learning. We observe that our trajectory visualizations of 30 top machine learning venues and 1000 major authors in this field are well interpretable and are consistent with background knowledge drawn from the entities' publications. Next to producing interactive, interpretable visualizations supporting different kinds of analyses, our computed trajectories are suitable for trajectory mining applications in the future.
Submitted 25 April, 2022;
originally announced April 2022.
-
The Mont Blanc of Twitter: Identifying Hierarchies of Outstanding Peaks in Social Networks
Authors:
Maximilian Stubbemann,
Gerd Stumme
Abstract:
The investigation of social networks is often hindered by their size as such networks often consist of at least thousands of vertices and edges. Hence, it is of major interest to derive compact structures that represent important connections of the original network. In this work, we derive such structures with orometric methods that are originally designed to identify outstanding mountain peaks and relationships between them. By adapting these methods to social networks, it is possible to derive family trees of important vertices. Our approach consists of two steps. We first apply a novel method for discarding edges that stand for weak connections. This is done such that the connectivity of the network is preserved. Then, we identify the important peaks in the network and the key cols, i.e., the lower points that connect them. This gives us a compact network that displays which peaks are connected through which cols. Thus, a natural hierarchy on the peaks arises from the question of which higher peak lies behind the col, yielding chains of peaks with increasing heights. The resulting line-parent hierarchy displays dominance relations between important vertices. We show that networks with hundreds or thousands of edges can be condensed to a small set of vertices and key connections between them.
Submitted 27 September, 2023; v1 submitted 26 October, 2021;
originally announced October 2021.
-
Towards Explainable Scientific Venue Recommendations
Authors:
Bastian Schäfermeier,
Gerd Stumme,
Tom Hanika
Abstract:
Selecting the best scientific venue (i.e., conference/journal) for the submission of a research article constitutes a multifaceted challenge. Important aspects to consider are the suitability of research topics, a venue's prestige, and the probability of acceptance. The selection problem is exacerbated through the continuous emergence of additional venues. Previously proposed approaches for supporting authors in this process rely on complex recommender systems, e.g., based on Word2Vec or TextCNN. These, however, often fail to provide an explanation for their recommendations. In this work, we propose an unsophisticated method that advances the state-of-the-art in two aspects: first, we enhance the interpretability of recommendations through topic models based on non-negative matrix factorization; second, we can, surprisingly, obtain competitive recommendation performance while using simpler learning methods.
Submitted 21 September, 2021;
originally announced September 2021.
-
LG4AV: Combining Language Models and Graph Neural Networks for Author Verification
Authors:
Maximilian Stubbemann,
Gerd Stumme
Abstract:
The automatic verification of document authorships is important in various settings. Researchers are for example judged and compared by the amount and impact of their publications and public figures are confronted by their posts on social media platforms. Therefore, it is important that authorship information in frequently used web services and platforms is correct. The question whether a given document is written by a given author is commonly referred to as authorship verification (AV). While AV is a widely investigated problem in general, only a few works consider settings where the documents are short and written in a rather uniform style. This makes most approaches impractical for online databases and knowledge graphs in the scholarly domain. Here, authorships of scientific publications have to be verified, often with just abstracts and titles available. To this end, we present our novel approach LG4AV which combines language models and graph neural networks for authorship verification. By directly feeding the available texts into a pre-trained transformer architecture, our model does not need any hand-crafted stylometric features that are not meaningful in scenarios where the writing style is, at least to some extent, standardized. By the incorporation of a graph neural network structure, our model can benefit from relations between authors that are meaningful with respect to the verification process. For example, scientific authors are more likely to write about topics that are addressed by their co-authors and Twitter users tend to post about the same subjects as people they follow. We experimentally evaluate our model and study to which extent the inclusion of co-authorships enhances verification decisions in bibliometric environments.
Submitted 3 September, 2021;
originally announced September 2021.
-
Attribute Selection using Contranominal Scales
Authors:
Dominik Dürrschnabel,
Maren Koyda,
Gerd Stumme
Abstract:
Formal Concept Analysis (FCA) allows one to analyze binary data by deriving concepts and ordering them in lattices. One of the main goals of FCA is to enable humans to comprehend the information that is encapsulated in the data; however, the large size of concept lattices is a limiting factor for the feasibility of understanding the underlying structural properties. The size of such a lattice depends on the number of subcontexts in the corresponding formal context that are isomorphic to a contranominal scale of high dimension. In this work, we propose the algorithm ContraFinder that enables the computation of all contranominal scales of a given formal context. Leveraging this algorithm, we introduce delta-adjusting, a novel approach to decrease the number of contranominal scales in a formal context by the selection of an appropriate attribute subset. We demonstrate that delta-adjusting a context reduces the size of the resulting sub-semilattice and that the implication set is restricted to meaningful implications. This is evaluated with respect to its associated knowledge by means of a classification task. Hence, our proposed technique strongly improves understandability while preserving important conceptual structures.
Submitted 1 July, 2021; v1 submitted 21 June, 2021;
originally announced June 2021.
-
Topological Indoor Mapping through WiFi Signals
Authors:
Bastian Schaefermeier,
Gerd Stumme,
Tom Hanika
Abstract:
The ubiquitous presence of WiFi access points and mobile devices capable of measuring WiFi signal strengths allows for real-world applications in indoor localization and mapping. In particular, no additional infrastructure is required. Previous approaches in this field were, however, often hindered by problems such as effortful map-building processes, changing environments and hardware differences. We tackle these problems focussing on topological maps. These represent discrete locations, such as rooms, and their relations, e.g., distances and transition frequencies. In our unsupervised method, we employ WiFi signal strength distributions, dimension reduction and clustering. It can be used in settings where users carry mobile devices and follow their normal routine. We aim for applications in short-lived indoor events such as conferences.
Submitted 17 June, 2021;
originally announced June 2021.
-
Boolean Substructures in Formal Concept Analysis
Authors:
Maren Koyda,
Gerd Stumme
Abstract:
It is known that a (concept) lattice contains an n-dimensional Boolean suborder if and only if the context contains an n-dimensional contra-nominal scale as subcontext. In this work, we investigate more closely the interplay between the Boolean subcontexts of a given finite context and the Boolean suborders of its concept lattice. To this end, we define mappings from the set of subcontexts of a context to the set of suborders of its concept lattice and vice versa and study their structural properties. In addition, we introduce closed-subcontexts as an extension of closed relations to investigate the set of all sublattices of a given lattice.
Submitted 14 April, 2021;
originally announced April 2021.
-
Force-Directed Layout of Order Diagrams using Dimensional Reduction
Authors:
Dominik Dürrschnabel,
Gerd Stumme
Abstract:
Order diagrams allow human analysts to understand and analyze structural properties of ordered data. While an experienced expert can create easily readable order diagrams, the automatic generation of those remains a hard task. In this work, we adapt force-directed approaches, which are known to generate aesthetically-pleasing drawings of graphs, to the realm of order diagrams. Our algorithm ReDraw thereby embeds the order in a high dimension and then iteratively reduces the dimension until a two-dimensional drawing is achieved. To improve aesthetics, this reduction is equipped with two force-directed steps where one optimizes on distances of nodes and the other on distances of lines in order to satisfy a set of a priori fixed conditions. By respecting an invariant about the vertical position of the elements in each step of our algorithm we ensure that the resulting drawings satisfy all necessary properties of order diagrams. Finally, we present the results of a user study to demonstrate that our algorithm outperforms comparable approaches on drawings of lattices with a high degree of distributivity.
Submitted 4 February, 2021;
originally announced February 2021.
-
Triadic Exploration and Exploration with Multiple Experts
Authors:
Maximilian Felde,
Gerd Stumme
Abstract:
Formal Concept Analysis (FCA) provides a method called attribute exploration which helps a domain expert discover structural dependencies in knowledge domains that can be represented by a formal context (a cross table of objects and attributes). Triadic Concept Analysis is an extension of FCA that incorporates the notion of conditions. Many extensions and variants of attribute exploration have been studied, but only a few attempts at incorporating multiple experts have been made. In this paper we present triadic exploration based on Triadic Concept Analysis to explore conditional attribute implications in a triadic domain. We then adapt this approach to formulate attribute exploration with multiple experts that have different views on a domain.
Submitted 4 February, 2021;
originally announced February 2021.
-
Topic Space Trajectories: A case study on machine learning literature
Authors:
Bastian Schäfermeier,
Gerd Stumme,
Tom Hanika
Abstract:
The annual number of publications at scientific venues, for example, conferences and journals, is growing quickly. Hence, even for researchers it becomes harder and harder to keep track of research topics and their progress. In this task, researchers can be supported by automated publication analysis. Yet, many such methods result in uninterpretable, purely numerical representations. As an attempt to support human analysts, we present topic space trajectories, a structure that allows for the comprehensible tracking of research topics. We demonstrate how these trajectories can be interpreted based on eight different analysis approaches. To obtain comprehensible results, we employ non-negative matrix factorization as well as suitable visualization techniques. We show the applicability of our approach on a publication corpus spanning 50 years of machine learning research from 32 publication venues. Our novel analysis method may be employed for paper classification, for the prediction of future research topics, and for the recommendation of fitting conferences and journals for submitting unpublished work.
Submitted 18 May, 2021; v1 submitted 23 October, 2020;
originally announced October 2020.
-
Interactive Collaborative Exploration using Incomplete Contexts
Authors:
Maximilian Felde,
Gerd Stumme
Abstract:
A well-known knowledge acquisition method in the field of Formal Concept Analysis (FCA) is attribute exploration. It is used to reveal dependencies in a set of attributes with the help of a domain expert. In most applications no single expert is capable (time- and knowledge-wise) of exploring the knowledge domain alone. However, there is up to now no theory that models the interaction of multiple experts for the task of attribute exploration with incomplete knowledge. To this end, we develop a theoretical framework that allows multiple experts to explore domains together. We use a representation of incomplete knowledge as three-valued contexts. We then adapt the corresponding version of attribute exploration to fit the setting of multiple experts. We suggest formalizations for key components like expert knowledge, interaction and collaboration strategy. In particular, we define an order that allows us to compare the results of different exploration strategies on the same task with respect to their information completeness. Furthermore, we discuss other ways of comparing collaboration strategies and suggest avenues for future research.
Submitted 31 January, 2020; v1 submitted 23 August, 2019;
originally announced August 2019.
-
Orometric Methods in Bounded Metric Data
Authors:
Maximilian Stubbemann,
Tom Hanika,
Gerd Stumme
Abstract:
A large amount of data accommodated in knowledge graphs (KG) is actually metric. For example, the Wikidata KG contains a plenitude of metric facts about geographic entities like cities, chemical compounds or celestial objects. In this paper, we propose a novel approach that transfers orometric (topographic) measures to bounded metric spaces. While these methods were originally designed to identify relevant mountain peaks on the surface of the earth, we demonstrate a notion to use them for metric data sets in general, notably metric sets of items enclosed in knowledge graphs. Based on this, we present a method for identifying outstanding items using the transferred valuation functions 'isolation' and 'prominence'. Building on this, we envision an item recommendation process. To demonstrate the relevance of the novel valuations for such processes, we use item sets from the Wikidata knowledge graph. We then evaluate the usefulness of 'isolation' and 'prominence' empirically in a supervised machine learning setting. In particular, we find structurally relevant items in the geographic population distributions of Germany and France.
Submitted 22 July, 2019;
originally announced July 2019.
-
Drawing Order Diagrams Through Two-Dimension Extension
Authors:
Dominik Dürrschnabel,
Tom Hanika,
Gerd Stumme
Abstract:
Order diagrams are an important tool to visualize the complex structure of ordered sets. Favorable drawings of order diagrams, i.e., easily readable for humans, are hard to come by, even for small ordered sets. Many attempts were made to transfer classical graph drawing approaches to order diagrams. Although these methods produce satisfying results for some ordered sets, they unfortunately perform poorly in general. In this work we present the novel algorithm DimDraw to draw order diagrams. This algorithm is based on a relation between the dimension of an ordered set and the bipartiteness of a corresponding graph.
Submitted 14 June, 2019;
originally announced June 2019.
-
Collaborative Interactive Learning -- A clarification of terms and a differentiation from other research fields
Authors:
Tom Hanika,
Marek Herde,
Jochen Kuhn,
Jan Marco Leimeister,
Paul Lukowicz,
Sarah Oeste-Reiß,
Albrecht Schmidt,
Bernhard Sick,
Gerd Stumme,
Sven Tomforde,
Katharina Anna Zweig
Abstract:
The field of collaborative interactive learning (CIL) aims at developing and investigating the technological foundations for a new generation of smart systems that support humans in their everyday life. While the concept of CIL has already been carved out in detail (including the fields of dedicated CIL and opportunistic CIL) and many research objectives have been stated, there is still the need to clarify some terms such as information, knowledge, and experience in the context of CIL and to differentiate CIL from recent and ongoing research in related fields such as active learning, collaborative learning, and others. Both aspects are addressed in this paper.
Submitted 16 May, 2019;
originally announced May 2019.
-
DimDraw -- A novel tool for drawing concept lattices
Authors:
Dominik Dürrschnabel,
Tom Hanika,
Gerd Stumme
Abstract:
Concept lattice drawings are an important tool to visualize complex relations in data in a simple manner to human readers. Many attempts were made to transfer classical graph drawing approaches to order diagrams. Although those methods are satisfying for some lattices they unfortunately perform poorly in general. In this work we present a novel tool to draw concept lattices that is purely motivated by the order structure.
Submitted 2 March, 2019;
originally announced March 2019.
-
Discovering Implicational Knowledge in Wikidata
Authors:
Tom Hanika,
Maximilian Marx,
Gerd Stumme
Abstract:
Knowledge graphs have recently become the state-of-the-art tool for representing the diverse and complex knowledge of the world. Examples include the proprietary knowledge graphs of companies such as Google, Facebook, IBM, or Microsoft, but also freely available ones such as YAGO, DBpedia, and Wikidata. A distinguishing feature of Wikidata is that the knowledge is collaboratively edited and curated. While this greatly enhances the scope of Wikidata, it also makes it impossible for a single individual to grasp complex connections between properties or understand the global impact of edits in the graph. We apply Formal Concept Analysis to efficiently identify comprehensible implications that are implicitly present in the data. Although the complex structure of data modelling in Wikidata is not amenable to a direct approach, we overcome this limitation by extracting contextual representations of parts of Wikidata in a systematic fashion. We demonstrate the practical feasibility of our approach through several experiments and show that the results may lead to the discovery of interesting implicational knowledge. Besides providing a method for obtaining large real-world data sets for FCA, we sketch potential applications in offering semantic assistance for editing and curating Wikidata.
Submitted 3 February, 2019;
originally announced February 2019.
-
Relevant Attributes in Formal Contexts
Authors:
Tom Hanika,
Maren Koyda,
Gerd Stumme
Abstract:
Computing conceptual structures, like formal concept lattices, is a challenging task in the age of massive data sets. There are various approaches to deal with this, e.g., random sampling, parallelization, or attribute extraction. A method not yet investigated in the realm of formal concept analysis is attribute selection, as done in machine learning. Building on this, we introduce a method for attribute selection in formal contexts. To this end, we propose the notion of relevant attributes, which enables us to define a relative relevance function reflecting both the order structure of the concept lattice and the distribution of objects on it. Finally, we overcome computational challenges for computing the relative relevance through an approximation approach based on information entropy.
Submitted 20 December, 2018;
originally announced December 2018.
-
Distances for WiFi Based Topological Indoor Mapping
Authors:
Bastian Schäfermeier,
Tom Hanika,
Gerd Stumme
Abstract:
For localization and mapping of indoor environments through WiFi signals, locations are often represented as likelihoods of the received signal strength indicator. In this work we compare various measures of distance between such likelihoods in combination with different methods for estimation and representation. In particular, we show that among the considered distance measures the Earth Mover's Distance seems the most beneficial for the localization task. Combined with kernel density estimation we were able to retain the topological structure of rooms in a real-world office scenario.
Submitted 19 September, 2018;
originally announced September 2018.
-
Intrinsic dimension and its application to association rules
Authors:
Tom Hanika,
Friedrich Martin Schneider,
Gerd Stumme
Abstract:
The curse of dimensionality in the realm of association rules is twofold. Firstly, we have the well-known exponential increase in computational complexity with increasing item set size. Secondly, there is a related curse concerned with the distribution of (sparse) data itself in high dimension. The former problem is often coped with by projection, i.e., feature selection, whereas the best known strategy for the latter is avoidance. This work summarizes the first attempt to provide a computationally feasible method for measuring the extent of the dimension curse present in a data set with respect to a particular class of machine learning procedures. This recent development enables various other methods from geometric analysis to be investigated and applied in machine learning procedures in the presence of high dimension.
Submitted 15 May, 2018;
originally announced May 2018.
-
Clones in Graphs
Authors:
Stephan Doerfel,
Tom Hanika,
Gerd Stumme
Abstract:
Finding structural similarities in graph data, like social networks, is a far-ranging task in data mining and knowledge discovery. A (conceptually) simple reduction would be to compute the automorphism group of a graph. However, this approach is ineffective in data mining since real world data does not exhibit enough structural regularity. Here we step in with a novel approach based on mappings that preserve the maximal cliques. For this we exploit the well known correspondence between bipartite graphs and the data structure formal context $(G,M,I)$ from Formal Concept Analysis. From there we utilize the notion of clone items. The investigation of these is still an open problem to which we add new insights with this work. Furthermore, we produce a substantial experimental investigation of real world data. We conclude with demonstrating the generalization of clone items to permutations.
Submitted 30 July, 2018; v1 submitted 21 February, 2018;
originally announced February 2018.
-
Intrinsic Dimension of Geometric Data Sets
Authors:
Tom Hanika,
Friedrich Martin Schneider,
Gerd Stumme
Abstract:
The curse of dimensionality is a phenomenon frequently observed in machine learning (ML) and knowledge discovery (KD). There is a large body of literature investigating its origin and impact, using methods from mathematics as well as from computer science. Among the mathematical insights into data dimensionality, there is an intimate link between the dimension curse and the phenomenon of measure concentration, which makes the former accessible to methods of geometric analysis. The present work provides a comprehensive study of the intrinsic geometry of a data set, based on Gromov's metric measure geometry and Pestov's axiomatic approach to intrinsic dimension. In detail, we define a concept of geometric data set and introduce a metric as well as a partial order on the set of isomorphism classes of such data sets. Based on these objects, we propose and investigate an axiomatic approach to the intrinsic dimension of geometric data sets and establish a concrete dimension function with the desired properties. Our model for data sets and their intrinsic dimension is computationally feasible and, moreover, adaptable to specific ML/KD-algorithms, as illustrated by various experiments.
Submitted 26 October, 2020; v1 submitted 24 January, 2018;
originally announced January 2018.
-
Adaptive kNN using Expected Accuracy for Classification of Geo-Spatial Data
Authors:
Mark Kibanov,
Martin Becker,
Juergen Mueller,
Martin Atzmueller,
Andreas Hotho,
Gerd Stumme
Abstract:
The k-Nearest Neighbor (kNN) classification approach is conceptually simple, yet widely applied since it often performs well in practical applications. However, using a global constant k does not always provide an optimal solution, e.g., for datasets with an irregular density distribution of data points. This paper proposes an adaptive kNN classifier where k is chosen dynamically for each instance (point) to be classified, such that the expected accuracy of classification is maximized. We define the expected accuracy as the accuracy of a set of structurally similar observations. An arbitrary similarity function can be used to find these observations. We introduce and evaluate different similarity functions. For the evaluation, we use five different classification tasks based on geo-spatial data. Each classification task consists of (tens of) thousands of items. We demonstrate that the presented expected accuracy measures can be a good estimator for kNN performance, and that the proposed adaptive kNN classifier outperforms common kNN and previously introduced adaptive kNN algorithms. Also, we show that the range of considered k can be significantly reduced to speed up the algorithm without negative influence on classification accuracy.
Submitted 14 December, 2017;
originally announced January 2018.
-
Mining Social Media to Inform Peatland Fire and Haze Disaster Management
Authors:
Mark Kibanov,
Gerd Stumme,
Imaduddin Amin,
Jong Gun Lee
Abstract:
Peatland fires and haze events are disasters with national, regional and international implications. The phenomena lead to direct damage to local assets, as well as broader economic and environmental losses. Satellite imagery is still the main and often the only available source of information for disaster management. In this article, we test the potential of social media to assist disaster management. To this end, we compare insights from two datasets: fire hotspots detected via NASA satellite imagery and almost all GPS-stamped tweets from Sumatra Island, Indonesia, posted during 2014. Sumatra Island is chosen as it regularly experiences a significant number of haze events, which affect citizens in Indonesia as well as in nearby countries including Malaysia and Singapore. We analyse temporal correlations between the datasets and their geo-spatial interdependence. Furthermore, we show how Twitter data reveals changes in users' behavior during severe haze events. Overall, we demonstrate that social media is a valuable source of complementary and supplementary information for haze disaster management. Based on our methodology and findings, an analytics tool to improve peatland fire and haze disaster management by the Indonesian authorities is under development.
Submitted 2 August, 2017; v1 submitted 16 June, 2017;
originally announced June 2017.
-
Predicting Rising Follower Counts on Twitter Using Profile Information
Authors:
Juergen Mueller,
Gerd Stumme
Abstract:
When evaluating the cause of one's popularity on Twitter, one thing is considered to be the main driver: many tweets. There is debate about the kind of tweet one should publish, but little beyond tweets. Of particular interest is the information provided by each Twitter user's profile page. One of these features is the given name on the profile. Studies in psychology and economics identified correlations of the first name to, e.g., one's school marks or chances of getting a job interview in the US. Therefore, we are interested in the influence of this profile information on the follower count. We addressed this question by analyzing the profiles of about 6 million Twitter users. All profiles are separated into three groups: users that have a first name, English words, or neither in their name field. The assumption is that names and words influence the discoverability of a user and subsequently his/her follower count. We propose a classifier that labels users who will increase their follower count within a month by applying different models based on the user's group. The classifiers are evaluated with the area under the receiver operating characteristic curve and achieve a score above 0.800.
Submitted 9 May, 2017;
originally announced May 2017.
-
Gender Inference using Statistical Name Characteristics in Twitter
Authors:
Juergen Mueller,
Gerd Stumme
Abstract:
Much attention has been given to the task of gender inference of Twitter users. Although names are strong gender indicators, the names of Twitter users are rarely used as a feature; probably due to the high number of ill-formed names, which cannot be found in any name dictionary. Instead of relying solely on a name database, we propose a novel name classifier. Our approach extracts characteristics from the user names and uses those in order to assign the names to a gender. This enables us to classify international first names as well as ill-formed names.
Submitted 1 July, 2016; v1 submitted 17 June, 2016;
originally announced June 2016.
-
Link Prediction and the Role of Stronger Ties in Networks of Face-to-Face Proximity
Authors:
Christoph Scholz,
Martin Atzmueller,
Gerd Stumme
Abstract:
Understanding why links are formed is an important and prominent research topic. In this paper, we therefore consider the link prediction problem in face-to-face contact networks, and analyze the predictability of new and recurring links. Furthermore, we study additional influence factors, and the role of stronger ties in these networks. Specifically, we compare neighborhood-based and path-based network proximity measures in a threshold-based analysis for capturing temporal dynamics. The results and insights of the analysis are a first step towards predictability applications for human contact networks, for example, for improving recommendations.
Submitted 8 July, 2014;
originally announced July 2014.
-
On the Predictability of Talk Attendance at Academic Conferences
Authors:
Christoph Scholz,
Jens Illig,
Martin Atzmueller,
Gerd Stumme
Abstract:
This paper focuses on the prediction of real-world talk attendances at academic conferences with respect to different influence factors. We study the predictability of talk attendances using real-world tracked face-to-face contacts. Furthermore, we investigate and discuss the predictive power of user interests extracted from the users' previous publications. We apply Hybrid Rooted PageRank, a state-of-the-art unsupervised machine learning method that combines information from different sources. Using this method, we analyze and discuss the predictive power of contact and interest networks separately and in combination. We find that contact and similarity networks achieve comparable results, and that combinations of different networks can help to improve the prediction quality only to a limited extent. For our experiments, we analyze the predictability of talk attendance at the ACM Conference on Hypertext and Hypermedia 2011, collected using the conference management system Conferator.
Submitted 2 July, 2014;
originally announced July 2014.
-
User-Relatedness and Community Structure in Social Interaction Networks
Authors:
Folke Mitzlaff,
Martin Atzmueller,
Dominik Benz,
Andreas Hotho,
Gerd Stumme
Abstract:
With social media and the corresponding social and ubiquitous applications finding their way into everyday life, there is a rapidly growing amount of user-generated content yielding explicit and implicit network structures. We consider social activities and phenomena as proxies for user relatedness. Such activities are represented in so-called social interaction networks or evidence networks, with different degrees of explicitness. We focus on evidence networks containing relations on users, which are represented by connections between individual nodes. Explicit interaction networks are then created by specific user actions, for example, when building a friend network. On the other hand, more implicit networks capture user traces or evidences of user actions as observed in Web portals, blogs, resource sharing systems, and many other social services. These implicit networks can be applied for a broad range of analysis methods instead of using expensive gold-standard information.
In this paper, we analyze different properties of a set of networks in social media. We show that there are dependencies and correlations between the networks. These allow for drawing reciprocal conclusions concerning pairs of networks, based on the assessment of structural correlations and ranking interchangeability. Additionally, we show how these inter-network correlations can be used for assessing the results of structural analysis techniques, e.g., community mining methods.
Submitted 16 September, 2013;
originally announced September 2013.
-
Onomastics 2.0 - The Power of Social Co-Occurrences
Authors:
Folke Mitzlaff,
Gerd Stumme
Abstract:
Onomastics is "the science or study of the origin and forms of proper names of persons or places." ["Onomastics". Merriam-Webster.com, 2013. http://www.merriam-webster.com (11 February 2013)]. Especially personal names play an important role in daily life, as all over the world future parents are facing the task of finding a suitable given name for their child. This choice is influenced by different factors, such as the social context, language, cultural background and, in particular, personal taste.
With the rise of the Social Web and its applications, users more and more interact digitally and participate in the creation of heterogeneous, distributed, collaborative data collections. These sources of data also reflect current and new naming trends as well as new emerging interrelations among names.
The present work shows how basic approaches from the field of social network analysis and information retrieval can be applied for discovering relations among names, thus extending Onomastics by data mining techniques. The considered approach starts with building co-occurrence graphs relative to data from the Social Web, respectively for given names and city names. As a main result, correlations between semantically grounded similarities among names (e.g., geographical distance for city names) and structural graph-based similarities are observed.
The discovered relations among given names are the foundation of "nameling" [http://nameling.net], a search engine and academic research platform for given names which attracted more than 30,000 users within four months, underpinning the relevance of the proposed methodology.
Submitted 3 March, 2013;
originally announced March 2013.
-
Recommending Given Names
Authors:
Folke Mitzlaff,
Gerd Stumme
Abstract:
All over the world, future parents are facing the task of finding a suitable given name for their child. This choice is influenced by different factors, such as the social context, language, cultural background and especially personal taste. Although this task is omnipresent, little research has been conducted on the analysis and application of interrelations among given names from a data mining perspective.
The present work tackles the problem of recommending given names by first mining for inter-name relatedness in data from the Social Web. Based on these results, the name search engine "Nameling" was built, which attracted more than 35,000 users within less than six months, underpinning the relevance of the underlying recommendation task. The accruing usage data is then used for evaluating different state-of-the-art recommendation systems, as well as our new NameRank algorithm, which we adopted from our previous work on folksonomies and which yields the best results, considering the trade-off between prediction accuracy and runtime performance as well as its ability to generate personalized recommendations. We also show how the gathered inter-name relationships can be used for meaningful result diversification of PageRank-based recommendation systems.
As all of the considered usage data is made publicly available, the present work establishes baseline results, encouraging other researchers to implement advanced recommendation systems for given names.
Submitted 19 February, 2013; v1 submitted 18 February, 2013;
originally announced February 2013.
-
Semantic Analysis of Tag Similarity Measures in Collaborative Tagging Systems
Authors:
Ciro Cattuto,
Dominik Benz,
Andreas Hotho,
Gerd Stumme
Abstract:
Social bookmarking systems allow users to organise collections of resources on the Web in a collaborative fashion. The increasing popularity of these systems as well as first insights into their emergent semantics have made them relevant to disciplines like knowledge extraction and ontology learning. The problem of devising methods to measure the semantic relatedness between tags and characterizing it semantically is still largely open. Here we analyze three measures of tag relatedness: tag co-occurrence, cosine similarity of co-occurrence distributions, and FolkRank, an adaptation of the PageRank algorithm to folksonomies. Each measure is computed on tags from a large-scale dataset crawled from the social bookmarking system del.icio.us. To provide a semantic grounding of our findings, a connection to WordNet (a semantic lexicon for the English language) is established by mapping tags into synonym sets of WordNet and applying well-known metrics of semantic similarity there. Our results clearly expose different characteristics of the selected measures of relatedness, making them applicable to different subtasks of knowledge extraction such as synonym detection or discovery of concept hierarchies.
Submitted 14 May, 2008;
originally announced May 2008.