-
A Critical Re-evaluation of Benchmark Datasets for (Deep) Learning-Based Matching Algorithms
Authors:
George Papadakis,
Nishadi Kirielle,
Peter Christen,
Themis Palpanas
Abstract:
Entity resolution (ER) is the process of identifying records that refer to the same entities within one or across multiple databases. Numerous techniques have been developed to tackle ER challenges over the years, with recent emphasis placed on machine and deep learning methods for the matching phase. However, the quality of the benchmark datasets typically used in the experimental evaluations of learning-based matching algorithms has not been examined in the literature. To cover this gap, we propose four different approaches to assessing the difficulty and appropriateness of 13 established datasets: two theoretical approaches, which involve new measures of linearity and existing measures of complexity, and two practical approaches: the difference between the best non-linear and linear matchers, as well as the difference between the best learning-based matcher and the perfect oracle. Our analysis demonstrates that most of the popular datasets pose rather easy classification tasks. As a result, they are not suitable for properly evaluating learning-based matching algorithms. To address this issue, we propose a new methodology for yielding benchmark datasets. We put it into practice by creating four new matching tasks, and we verify that these new benchmarks are more challenging and therefore more suitable for further advancements in the field.
Submitted 12 November, 2023; v1 submitted 3 July, 2023;
originally announced July 2023.
-
Privacy in Practice: Private COVID-19 Detection in X-Ray Images (Extended Version)
Authors:
Lucas Lange,
Maja Schneider,
Peter Christen,
Erhard Rahm
Abstract:
Machine learning (ML) can help fight pandemics like COVID-19 by enabling rapid screening of large volumes of images. To perform data analysis while maintaining patient privacy, we create ML models that satisfy Differential Privacy (DP). Previous works exploring private COVID-19 models are in part based on small datasets, provide weaker or unclear privacy guarantees, and do not investigate practical privacy. We suggest improvements to address these open gaps. We account for inherent class imbalances and evaluate the utility-privacy trade-off more extensively and over stricter privacy budgets. Our evaluation is supported by empirically estimating practical privacy through black-box Membership Inference Attacks (MIAs). The introduced DP should help limit leakage threats posed by MIAs, and our practical analysis is the first to test this hypothesis on the COVID-19 classification task. Our results indicate that needed privacy levels might differ based on the task-dependent practical threat from MIAs. The results further suggest that with increasing DP guarantees, empirical privacy leakage only improves marginally, and DP therefore appears to have a limited impact on practical MIA defense. Our findings identify possibilities for better utility-privacy trade-offs, and we believe that empirical attack-specific privacy estimation can play a vital role in tuning for practical privacy.
Submitted 26 April, 2023; v1 submitted 21 November, 2022;
originally announced November 2022.
-
Programming Data Structures for Large-Scale Desktop Simulations of Complex Systems
Authors:
Patrik Christen
Abstract:
The investigation of complex systems requires running large-scale simulations over many temporal iterations. It is therefore important to provide efficient implementations. The present study borrows philosophical concepts from Gilbert Simondon to identify data structures and algorithms that have the biggest impact on running time and memory usage. These are the entity $e$-tuple $\mathcal{E}$ and the intertwined update function $φ$. Focusing on implementing data structures in C#, $\mathcal{E}$ is implemented as a list of objects according to current software engineering practice and as an array of pointers according to theoretical considerations. Cellular automaton simulations with $10^9$ entities over one iteration reveal that the object-list with dynamic typing and multi-state readiness has a drastic effect on running time and memory usage, especially dynamic typing as it has a big impact on the evolution time. Pointer-arrays can be implemented in C# and are more efficient in running time and memory than the object-list implementation; however, they are cumbersome to implement. In conclusion, avoiding dynamic typing in object-list based implementations or using pointer-arrays gives evolution times that are acceptable in practice, even on desktop computers.
Submitted 29 July, 2022; v1 submitted 10 May, 2022;
originally announced May 2022.
-
Curb Your Self-Modifying Code
Authors:
Patrik Christen
Abstract:
Self-modifying code has many intriguing applications in a broad range of fields including software security, artificial general intelligence, and open-ended evolution. Having control over self-modifying code, however, is still an open challenge since it is a balancing act between providing as much freedom as possible so as not to limit possible solutions, while at the same time imposing restrictions to avoid security issues and invalid code or solutions. In the present study, I provide a prototype implementation of how one might curb self-modifying code by introducing control mechanisms for code modifications within specific regions and for specific transitions between code and data. I show that this is possible to achieve with the so-called allagmatic method - a framework to formalise, model, implement, and interpret complex systems inspired by Gilbert Simondon's philosophy of individuation and Alfred North Whitehead's philosophy of organism. Thereby, the allagmatic method serves as guidance for self-modification based on concepts defined in a metaphysical framework. I conclude that the allagmatic method seems to be a suitable framework for control mechanisms in self-modifying code and that there are intriguing analogies between the presented control mechanisms and gene regulation.
Submitted 29 July, 2022; v1 submitted 28 February, 2022;
originally announced February 2022.
-
Self-Modifying Code in Open-Ended Evolutionary Systems
Authors:
Patrik Christen
Abstract:
Having a model and being able to implement open-ended evolutionary systems is important for advancing our understanding of open-endedness. Complex systems science and the newest generation of high-level programming languages provide intriguing possibilities to do so. First, some recent advances in modelling and implementing open-ended evolutionary systems are reviewed. Then, the so-called allagmatic method is introduced that describes, models, implements, and allows interpretation of complex systems. After highlighting some current modelling and implementation challenges, model building blocks of open-ended evolutionary systems are identified, a system metamodel of open-ended evolution is formalised in the allagmatic method, a self-modifying code prototype implemented in a high-level programming language is provided, and guidance from the allagmatic method to create code blocks is described. The proposed prototype allows modifying code at runtime in a controlled way within a system metamodel. Since the allagmatic method has been built based on metaphysical concepts borrowed from Gilbert Simondon and Alfred N. Whitehead, the proposed prototype provides a promising starting point to interpret novelty generated at runtime with the help of a metaphysical framework.
Submitted 1 March, 2022; v1 submitted 18 January, 2022;
originally announced January 2022.
-
Big Data is not the New Oil: Common Misconceptions about Population Data
Authors:
Peter Christen,
Rainer Schnell
Abstract:
Databases covering all individuals of a population are increasingly used for research and decision-making. The massive size of such databases is often mistaken as a guarantee for valid inferences. However, population data have characteristics that make them challenging to use. Various assumptions on population coverage and data quality are commonly made, including how such data were captured and what types of processing have been applied to them. Furthermore, the full potential of population data can often only be unlocked when such data are linked to other databases. Record linkage often implies subtle technical problems, which are easily missed. We discuss a diverse range of misconceptions relevant for anybody capturing, processing, linking, or analysing population data. Remarkably many of these misconceptions are due to the social nature of data collections and are therefore missed by purely technical accounts of data processing. Many of these misconceptions are also not well documented in scientific publications. We conclude with a set of recommendations for using population data.
Submitted 2 September, 2022; v1 submitted 20 December, 2021;
originally announced December 2021.
-
Large Scale Record Linkage in the Presence of Missing Data
Authors:
Thilina Ranbaduge,
Peter Christen,
Rainer Schnell
Abstract:
Record linkage is aimed at the accurate and efficient identification of records that represent the same entity within or across disparate databases. It is a fundamental task in data integration and increasingly required for accurate decision making in application domains ranging from health analytics to national security. Traditional record linkage techniques calculate string similarities between quasi-identifying (QID) values, such as the names and addresses of people. Errors, variations, and missing QID values can however lead to low linkage quality because the similarities between records cannot be calculated accurately. To overcome this challenge, we propose a novel technique that can accurately link records even when QID values contain errors or variations, or are missing. We first generate attribute signatures (concatenated QID values) using an Apriori based selection of suitable QID attributes, and then relational signatures that encapsulate relationship information between records. Combined, these signatures can uniquely identify individual records and facilitate fast and high quality linking of very large databases through accurate similarity calculations between records. We evaluate the linkage quality and scalability of our approach using large real-world databases, showing that it can achieve high linkage quality even when the databases being linked contain substantial amounts of missing values and errors.
Submitted 19 April, 2021;
originally announced April 2021.
-
Accurate and Efficient Suffix Tree Based Privacy-Preserving String Matching
Authors:
Sirintra Vaiwsri,
Thilina Ranbaduge,
Peter Christen,
Kee Siong Ng
Abstract:
The task of calculating similarities between strings held by different organizations without revealing these strings is an increasingly important problem in areas such as health informatics, national censuses, genomics, and fraud detection. Most existing privacy-preserving string comparison functions are either based on comparing sets of encoded character q-grams, allow only exact matching of encrypted strings, or are aimed at long genomic sequences that have a small alphabet. The set-based privacy-preserving similarity functions commonly used to compare name and address strings in the context of privacy-preserving record linkage do not take the positions of sub-strings into account. As a result, two very different strings can potentially be considered an exact match, leading to wrongly linked records. Existing set-based techniques also cannot identify the length of the longest common sub-string across two strings. In this paper we propose a novel approach for accurate and efficient privacy-preserving string matching based on suffix trees that are encoded using chained hashing. We incorporate a hashing based encoding technique upon the encoded suffixes to improve privacy against frequency attacks such as those exploiting Benford's law. Our approach allows various operations to be performed without revealing the strings being compared: computing the length of the longest common sub-string, checking whether two strings have the same beginning, middle, or end, and computing the longest common sub-string similarity between two strings. These functions allow a more accurate comparison of, for example, bank account, credit card, or telephone numbers, which cannot be compared appropriately with existing privacy-preserving string matching techniques. Our evaluation on several data sets with different types of strings validates the privacy and accuracy of our proposed approach.
Submitted 7 April, 2021;
originally announced April 2021.
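The weakness of set-based comparison described in this abstract can be illustrated with a minimal plaintext sketch. It is not the paper's suffix-tree protocol and applies no encoding or hashing; the function names and the example strings are assumptions chosen for illustration. It shows that two different digit strings can share the same set of bigrams, so a Dice coefficient over q-gram sets reports an exact match, while the longest common sub-string length reveals the difference.

```python
def qgram_set(s: str, q: int = 2) -> set[str]:
    """Return the set of character q-grams of s (positions are discarded)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def dice_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient over q-gram sets: 2|A & B| / (|A| + |B|)."""
    set_a, set_b = qgram_set(a, q), qgram_set(b, q)
    if not set_a and not set_b:
        return 1.0  # two strings too short to have q-grams are treated as equal
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest common contiguous sub-string (dynamic programming)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical example values: both strings have the bigram set {"12", "21"}.
print(dice_similarity("1212", "2121"))                  # set-based view
print(longest_common_substring_length("1212", "2121"))  # position-aware view
```

Here the set-based similarity cannot distinguish the two strings, whereas the sub-string length (3 out of 4 characters) shows they are not identical.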
-
Open-Ended Automatic Programming Through Combinatorial Evolution
Authors:
Sebastian Fix,
Thomas Probst,
Oliver Ruggli,
Thomas Hanne,
Patrik Christen
Abstract:
Combinatorial evolution - the creation of new things through the combination of existing things - can be a powerful way to evolve rather than design technical objects such as electronic circuits. Intriguingly, this seems to be an ongoing and thus open-ended process creating novelty with increasing complexity. Here, we employ combinatorial evolution in software development. While current approaches such as genetic programming are efficient in solving particular problems, they all converge towards a solution and create nothing new afterwards. Combinatorial evolution of complex systems such as languages and technology is considered open-ended. Therefore, open-ended automatic programming might be possible through combinatorial evolution. We implemented a computer program simulating combinatorial evolution of code blocks stored in a database to make them available for combining. Automatic programming in the sense of algorithm-based code generation is achieved by evaluating regular expressions. We found that reserved keywords of a programming language are suitable for defining the basic code blocks at the beginning of the simulation. We also found that placeholders can be used to combine code blocks and that code complexity can be described in terms of the importance to the programming language. As in a previous combinatorial evolution simulation of electronic circuits, complexity increased from simple keywords and special characters to more complex variable declarations, class definitions, methods, and classes containing methods and variable declarations. Combinatorial evolution, therefore, seems to be a promising approach for open-ended automatic programming.
Submitted 22 November, 2021; v1 submitted 20 February, 2021;
originally announced February 2021.
-
Philosophy-Guided Modelling and Implementation of Adaptation and Control in Complex Systems
Authors:
Olivier Del Fabbro,
Patrik Christen
Abstract:
Control was from its very beginning an important concept in cybernetics. Later on, with the works of W. Ross Ashby, for example, biological concepts such as adaptation were interpreted in the light of cybernetic systems theory. Adaptation is the process by which a system is capable of regulating or controlling itself in order to adapt to changes of its inner and outer environment maintaining a homeostatic state. In earlier works we have developed a system metamodel that on the one hand refers to cybernetic concepts such as structure, operation, and system, and on the other to the philosophy of individuation of Gilbert Simondon. The result is the so-called allagmatic method that is capable of creating concrete models of systems such as artificial neural networks and cellular automata starting from abstract building blocks. In this paper, we add to our already existing method the cybernetic concepts of control and especially adaptation. In regard to the system metamodel, we rely again on philosophical theories, this time the philosophy of organism of Alfred N. Whitehead. We show how these new meta-theoretical concepts are described formally and how they are implemented in program code. We also show what role they play in simple experiments. We conclude that philosophical abstract concepts help to better understand the process of creating computer models and their control and adaptation. In the outlook we discuss how the allagmatic method needs to be extended in order to cover the field of complex systems and Norbert Wiener's ideas on control.
Submitted 25 September, 2021; v1 submitted 31 August, 2020;
originally announced September 2020.
-
F*: An Interpretable Transformation of the F-measure
Authors:
David J. Hand,
Peter Christen,
Nishadi Kirielle
Abstract:
The F-measure, also known as the F1-score, is widely used to assess the performance of classification algorithms. However, some researchers find it lacking in intuitive interpretation, questioning the appropriateness of combining two aspects of performance as conceptually distinct as precision and recall, and also questioning whether the harmonic mean is the best way to combine them. To ease this concern, we describe a simple transformation of the F-measure, which we call F* (F-star), which has an immediate practical interpretation.
Submitted 17 March, 2021; v1 submitted 31 July, 2020;
originally announced August 2020.
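The abstract does not spell out the transformation. As a minimal sketch, assuming the definition $F^* = F/(2-F)$, the following computes both measures from confusion-matrix counts and checks numerically that $F^*$ equals $TP/(TP+FP+FN)$, i.e. the fraction of true positives among all records that are either truly positive or predicted positive. The function names and the example counts are hypothetical.

```python
def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure (harmonic mean of precision and recall) from counts."""
    return 2 * tp / (2 * tp + fp + fn)


def f_star_from_counts(tp: int, fp: int, fn: int) -> float:
    """F* written directly in terms of counts: TP / (TP + FP + FN)."""
    return tp / (tp + fp + fn)


def f_star_from_f(f: float) -> float:
    """F* as a monotonic transformation of the F-measure: F / (2 - F)."""
    return f / (2 - f)


# Hypothetical counts, not taken from any data set in the paper.
tp, fp, fn = 60, 20, 20
f = f_measure(tp, fp, fn)
print(f, f_star_from_f(f), f_star_from_counts(tp, fp, fn))
```

Because the transformation is monotonic on [0, 1], ranking classifiers by F* gives the same order as ranking them by the F-measure.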
-
Pattern Masking for Dictionary Matching
Authors:
Panagiotis Charalampopoulos,
Huiping Chen,
Peter Christen,
Grigorios Loukides,
Nadia Pisanti,
Solon P. Pissis,
Jakub Radoszewski
Abstract:
In the Pattern Masking for Dictionary Matching (PMDM) problem, we are given a dictionary $\mathcal{D}$ of $d$ strings, each of length $\ell$, a query string $q$ of length $\ell$, and a positive integer $z$, and we are asked to compute a smallest set $K\subseteq\{1,\ldots,\ell\}$, so that if $q[i]$, for all $i\in K$, is replaced by a wildcard, then $q$ matches at least $z$ strings from $\mathcal{D}$. The PMDM problem lies at the heart of two important applications featured in large-scale real-world systems: record linkage of databases that contain sensitive information, and query term dropping. In both applications, solving PMDM allows for providing data utility guarantees as opposed to existing approaches.
We first show, through a reduction from the well-known $k$-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet. We present a data structure for PMDM that answers queries over $\mathcal{D}$ in time $\mathcal{O}(2^{\ell/2}(2^{\ell/2}+τ)\ell)$ and requires space $\mathcal{O}(2^{\ell}d^2/τ^2+2^{\ell/2}d)$, for any parameter $τ\in[1,d]$. We also approach the problem from a more practical perspective. We show an $\mathcal{O}((d\ell)^{k/3}+d\ell)$-time and $\mathcal{O}(d\ell)$-space algorithm for PMDM if $k=|K|=\mathcal{O}(1)$. We generalize our exact algorithm to mask multiple query strings simultaneously. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamtáč et al., SODA 2017]. This gives a polynomial-time $\mathcal{O}(d^{1/4+ε})$-approximation algorithm for PMDM, which is tight under plausible complexity conjectures.
Submitted 8 March, 2024; v1 submitted 29 June, 2020;
originally announced June 2020.
-
Philosophy-Guided Mathematical Formalism for Complex Systems Modelling
Authors:
Patrik Christen,
Olivier Del Fabbro
Abstract:
We recently presented the so-called allagmatic method, which includes a system metamodel providing a framework for describing, modelling, simulating, and interpreting complex systems. Its development and programming was guided by philosophy, especially by Gilbert Simondon's philosophy of individuation, Alfred North Whitehead's philosophy of organism, and concepts from cybernetics. Here, a mathematical formalism is presented to better describe and define the system metamodel of the allagmatic method, thereby further generalising it and extending its reach to a more formal treatment and allowing more theoretical studies. By using the formalism, an example for such a further study is provided with mathematical definitions and proofs for model creation and equivalence of cellular automata and artificial neural networks.
Submitted 29 July, 2022; v1 submitted 3 May, 2020;
originally announced May 2020.
-
Cybernetical Concepts for Cellular Automaton and Artificial Neural Network Modelling and Implementation
Authors:
Patrik Christen,
Olivier Del Fabbro
Abstract:
As a discipline, cybernetics has a long and rich history. In its first generation it not only had a worldwide span; in the area of computer modelling, for example, its proponents, such as John von Neumann, Stanislaw Ulam, Warren McCulloch and Walter Pitts, also came up with models and methods such as cellular automata and artificial neural networks, which are still the foundation of most modern modelling approaches. At the same time, cybernetics also got the attention of philosophers, such as the Frenchman Gilbert Simondon, who made use of cybernetical concepts in order to establish a metaphysics and a natural philosophy of individuation, thereby giving cybernetics a philosophical interpretation, which he baptised allagmatic. In this paper, we emphasise this allagmatic theory by showing how Simondon's philosophical concepts can be used to formulate a generic computer model or metamodel for complex systems modelling and its implementation in program code, according to generic programming. We also present how the developed allagmatic metamodel is capable of building simple cellular automata and artificial neural networks.
Submitted 31 August, 2020; v1 submitted 24 November, 2019;
originally announced January 2020.
-
Incremental Clustering Techniques for Multi-Party Privacy-Preserving Record Linkage
Authors:
Dinusha Vatsalan,
Peter Christen,
Erhard Rahm
Abstract:
Privacy-Preserving Record Linkage (PPRL) supports the integration of sensitive information from multiple datasets, in particular the privacy-preserving matching of records referring to the same entity. PPRL has gained much attention in many application areas, with the most prominent ones in the healthcare domain. PPRL techniques tackle this problem by conducting linkage on masked (encoded) values. Employing PPRL on records from multiple (more than two) parties/sources (multi-party PPRL, MP-PPRL) is an increasingly important but challenging problem that so far has not been sufficiently solved. Existing MP-PPRL approaches are limited to finding only those entities that are present in all parties thereby missing entities that match only in a subset of parties. Furthermore, previous MP-PPRL approaches face substantial scalability limitations due to the need of a large number of comparisons between masked records. We thus propose and evaluate new MP-PPRL approaches that find matches in any subset of parties and still scale to many parties. Our approaches maintain all matches within clusters, where these clusters are incrementally extended or refined by considering records from one party after the other. An empirical evaluation using multiple real datasets ranging from 3 to 26 parties each containing up to $5$ million records validates that our protocols are efficient, and significantly outperform existing MP-PPRL approaches in terms of linkage quality and scalability.
Submitted 28 November, 2019;
originally announced November 2019.
-
Automatic Programming of Cellular Automata and Artificial Neural Networks Guided by Philosophy
Authors:
Patrik Christen,
Olivier Del Fabbro
Abstract:
Many computer models such as cellular automata and artificial neural networks have been developed and successfully applied. However, in some cases, these models might be restrictive on the possible solutions or their solutions might be difficult to interpret. To overcome this problem, we outline a new approach, the so-called allagmatic method, that automatically programs and executes models with as few limitations as possible while maintaining human interpretability. Earlier we described a metamodel and its building blocks according to the philosophical concepts of structure (spatial dimension) and operation (temporal dimension). They are entity, milieu, and update function that together abstractly describe cellular automata, artificial neural networks, and possibly any kind of computer model. By automatically combining these building blocks in an evolutionary computation, interpretability might be increased by the relationship to the metamodel, and models might be translated into more interpretable models via the metamodel. We propose generic and object-oriented programming to implement the entities and their milieus as dynamic and generic arrays and the update function as a method. We show two experiments where a simple cellular automaton and an artificial neural network are automatically programmed, compiled, and executed. A target state is successfully evolved and learned in the cellular automaton and artificial neural network, respectively. We conclude that the allagmatic method can create and execute cellular automaton and artificial neural network models in an automated manner with the guidance of philosophy.
Submitted 31 August, 2020; v1 submitted 10 May, 2019;
originally announced May 2019.
-
Temporal graph-based clustering for historical record linkage
Authors:
Charini Nanayakkara,
Peter Christen,
Thilina Ranbaduge
Abstract:
Research in the social sciences is increasingly based on large and complex data collections, where individual data sets from different domains are linked and integrated to allow advanced analytics. A popular type of data used in such a context are historical censuses, as well as birth, death, and marriage certificates. Individually, however, such data sets limit the types of studies that can be conducted. Specifically, it is impossible to track individuals, families, or households over time. Once such data sets are linked and family trees spanning several decades are available, it is possible to, for example, investigate how education, health, mobility, employment, and social status influence each other and the lives of people over two or even more generations. A major challenge, however, is the accurate linkage of historical data sets, which is difficult due to data quality issues and, commonly, the lack of available ground truth data. Unsupervised techniques need to be employed, which can be based on similarity graphs generated by comparing individual records. In this paper we present initial results from clustering birth records from Scotland where we aim to identify all births of the same mother and group siblings into clusters. We extend an existing clustering technique for record linkage by incorporating temporal constraints that must hold between births by the same mother, and propose a novel greedy temporal clustering technique. Experimental results show improvements over non-temporal approaches; however, further work is needed to obtain links of high quality.
Submitted 6 July, 2018;
originally announced July 2018.
-
Developing a Temporal Bibliographic Data Set for Entity Resolution
Authors:
Yichen Hu,
Qing Wang,
Peter Christen
Abstract:
Entity resolution is the process of identifying groups of records within or across data sets where each group represents a real-world entity. Novel techniques that consider temporal features to improve the quality of entity resolution have recently attracted significant attention. However, there are currently no large data sets available that contain both temporal information as well as ground truth information to evaluate the quality of temporal entity resolution approaches. In this paper, we describe the preparation of a temporal data set based on author profiles extracted from the Digital Bibliography and Library Project (DBLP). We completed missing links between publications and author profiles in the DBLP data set using the DBLP public API. We then used the Microsoft Academic Graph (MAG) to link temporal affiliation information for DBLP authors. We selected around 80K (1%) of author profiles that cover 2 million (50%) publications using information in DBLP such as alternative author names and personal web profile to improve the reliability of the resulting ground truth, while at the same time keeping the data set challenging for temporal entity resolution research.
Submitted 19 June, 2018;
originally announced June 2018.
-
A Decision Tree Approach to Predicting Recidivism in Domestic Violence
Authors:
Senuri Wijenayake,
Timothy Graham,
Peter Christen
Abstract:
Domestic violence (DV) is a global social and public health issue that is highly gendered. Being able to accurately predict DV recidivism, i.e., re-offending of a previously convicted offender, can speed up and improve risk assessment procedures for police and front-line agencies, better protect victims of DV, and potentially prevent future re-occurrences of DV. Previous work in DV recidivism has employed different classification techniques, including decision tree (DT) induction and logistic regression, where the main focus was on achieving high prediction accuracy. As a result, even the diagrams of trained DTs were often too difficult to interpret due to their size and complexity, making decision-making challenging. Given there is often a trade-off between model accuracy and interpretability, in this work our aim is to employ DT induction to obtain both interpretable trees as well as high prediction accuracy. Specifically, we implement and evaluate different approaches to deal with class imbalance as well as feature selection. Compared to previous work in DV recidivism prediction that employed logistic regression, our approach can achieve comparable area under the ROC curve results by using only 3 of 11 available features and generating understandable decision trees that contain only 4 leaf nodes.
Submitted 26 March, 2018;
originally announced March 2018.
-
Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases
Authors:
Yuhang Zhang,
Kee Siong Ng,
Michael Walker,
Pauline Chou,
Tania Churchill,
Peter Christen
Abstract:
Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a first-principles formulation of entity resolution, this paper presents a novel Entity Resolution algorithm that introduces a data-driven blocking and record-linkage technique based on the probabilistic identification of entity signatures in data. The scalability and accuracy of the proposed algorithm are evaluated using benchmark datasets and shown to achieve state-of-the-art results. The proposed algorithm can be implemented simply on modern parallel databases, which allows it to be deployed with relative ease in large industrial applications.
Submitted 18 March, 2018; v1 submitted 27 December, 2017;
originally announced December 2017.
-
Loklak - A Distributed Crawler and Data Harvester for Overcoming Rate Limits
Authors:
Sudheesh Singanamalla,
Michael Peter Christen
Abstract:
Modern social networks have become sources for vast quantities of data. Having access to such big data can be very useful for various researchers and data scientists. In this paper we describe Loklak, an open source distributed peer-to-peer crawler and scraper for supporting such research on platforms like Twitter, Weibo and other social networks. Social networks such as Twitter and Weibo impose various limitations on the rate at which users can freely collect such data for research. Our crawler enables researchers to continuously collect data while overcoming the barriers of authentication and imposed rate limits, in order to provide a repository of open data as a service.
Submitted 12 April, 2017;
originally announced April 2017.
-
Scalable Multi-Database Privacy-Preserving Record Linkage using Counting Bloom Filters
Authors:
Dinusha Vatsalan,
Peter Christen,
Erhard Rahm
Abstract:
Privacy-preserving record linkage (PPRL) aims at integrating sensitive information from multiple disparate databases of different organizations. PPRL approaches are increasingly required in real-world application areas such as healthcare, national security, and business. Previous approaches have mostly focused on linking only two databases as well as the use of a dedicated linkage unit. Scaling PPRL to more databases (multi-party PPRL) is an open challenge since privacy threats as well as the computation and communication costs for record linkage increase significantly with the number of databases. We thus propose the use of a new encoding method of sensitive data based on Counting Bloom Filters (CBF) to improve privacy for multi-party PPRL. We also investigate optimizations to reduce communication and computation costs for CBF-based multi-party PPRL with and without the use of a dedicated linkage unit. Empirical evaluations conducted with real datasets show the viability of the proposed approaches and demonstrate their scalability, linkage quality, and privacy protection.
Submitted 5 January, 2017;
originally announced January 2017.
-
Multi-Party Privacy-Preserving Record Linkage using Bloom Filters
Authors:
Dinusha Vatsalan,
Peter Christen
Abstract:
Privacy-preserving record linkage (PPRL), the problem of identifying records that correspond to the same real-world entity across several data sources held by different parties without revealing any sensitive information about these records, is increasingly being required in many real-world application areas. Examples range from public health surveillance to crime and fraud detection, and national security. Various techniques have been developed to tackle the problem of PPRL, with the majority of them considering linking data from only two sources. However, in many real-world applications data from more than two sources need to be linked. In this paper we propose a viable solution for multi-party PPRL using two efficient privacy techniques: Bloom filter encoding and distributed secure summation. Our proposed protocol efficiently identifies matching sets of records held by all data sources that have a similarity above a certain minimum threshold. While being efficient, our protocol is also secure under the semi-honest adversary model in that no party can learn any sensitive information about any other parties' data, but all parties learn which of their records have a high similarity with records held by the other parties. We evaluate our protocol on a large real voter registration database showing the scalability, linkage quality, and privacy of our approach.
Submitted 28 December, 2016;
originally announced December 2016.
-
Application of Advanced Record Linkage Techniques for Complex Population Reconstruction
Authors:
Peter Christen
Abstract:
Record linkage is the process of identifying records that refer to the same entities from several databases. This process is challenging because commonly no unique entity identifiers are available. Linkage therefore has to rely on partially identifying attributes, such as names and addresses of people. Recent years have seen the development of novel techniques for linking data from diverse application areas, where a major focus has been on linking complex data that contain records about different types of entities. Advanced approaches that exploit both the similarities between record attributes as well as the relationships between entities to identify clusters of matching records have been developed.
In this application paper we study the novel problem where rather than different types of entities we have databases where the same entity can have different roles, and where these roles change over time. We specifically develop novel techniques for linking historical birth, death, marriage and census records with the aim to reconstruct the population covered by these records over a period of several decades. Our experimental evaluation on real Scottish data shows that even with advanced linkage techniques that consider group, relationship, and temporal aspects it is challenging to achieve high quality linkage from such complex data.
Submitted 13 December, 2016;
originally announced December 2016.
-
Sensor Discovery and Configuration Framework for The Internet of Things Paradigm
Authors:
Charith Perera,
Prem Prakash Jayaraman,
Arkady Zaslavsky,
Peter Christen,
Dimitrios Georgakopoulos
Abstract:
The Internet of Things (IoT) will comprise billions of devices that can sense, communicate, compute and potentially actuate. The data generated by the Internet of Things are valuable and have the potential to drive innovative and novel applications. The data streams coming from these devices will challenge the traditional approaches to data management and contribute to the emerging paradigm of big data. One of the most challenging tasks before collecting and processing data from these devices (e.g. sensors) is discovering and configuring the sensors and the associated data streams. In this paper, we propose a tool called SmartLink that can be used to discover and configure sensors. Specifically, SmartLink is capable of discovering sensors deployed in a particular location despite their heterogeneity (e.g. different communication protocols, communication sequences, capabilities). SmartLink establishes direct communication between the sensor hardware and cloud-based IoT middleware. We address the challenge of heterogeneity using a plugin architecture. Our prototype tool is developed on the Android platform. We evaluate the significance of our approach by discovering and configuring 52 different types of Libelium sensors.
Submitted 23 December, 2013;
originally announced December 2013.
-
Context-aware Dynamic Discovery and Configuration of 'Things' in Smart Environments
Authors:
Charith Perera,
Prem Jayaraman,
Arkady Zaslavsky,
Peter Christen,
Dimitrios Georgakopoulos
Abstract:
The Internet of Things (IoT) is a dynamic global information network consisting of Internet-connected objects, such as RFIDs, sensors, actuators, as well as other instruments and smart appliances that are becoming an integral component of the future Internet. Currently, such Internet-connected objects or 'things' outnumber both people and computers connected to the Internet and their population is expected to grow to 50 billion in the next 5 to 10 years. To be able to develop IoT applications, such 'things' must become dynamically integrated into emerging information networks supported by architecturally scalable and economically feasible Internet service delivery models, such as cloud computing. Achieving such integration through discovery and configuration of 'things' is a challenging task. Towards this end, we propose a Context-Aware Dynamic Discovery of Things (CADDOT) model. We have developed a tool, SmartLink, that is capable of discovering sensors deployed in a particular location despite their heterogeneity. SmartLink helps to establish direct communication between sensor hardware and cloud-based IoT middleware platforms. We address the challenge of heterogeneity using a plug-in architecture. Our prototype tool is developed on an Android platform. Further, we employ the Global Sensor Network (GSN) as the IoT middleware for the proof of concept validation. The significance of the proposed solution is validated using a test-bed that comprises 52 Arduino-based Libelium sensors.
Submitted 8 November, 2013;
originally announced November 2013.
-
MOSDEN: An Internet of Things Middleware for Resource Constrained Mobile Devices
Authors:
Charith Perera,
Prem Prakash Jayaraman,
Arkady Zaslavsky,
Peter Christen,
Dimitrios Georgakopoulos
Abstract:
The Internet of Things (IoT) is part of the Future Internet and will comprise many billions of Internet Connected Objects (ICO) or 'things', where things can sense, communicate, compute and potentially actuate as well as have intelligence, multi-modal interfaces, physical/virtual identities and attributes. Collecting data from these objects is an important task as it allows software systems to understand the environment better. Many different hardware devices may be involved in the process of collecting and uploading sensor data to the cloud where complex processing can occur. Further, we cannot expect all these objects to be connected to computers, for technical and economic reasons. Therefore, we should be able to utilize resource constrained devices to collect data from these ICOs. On the other hand, it is critical to process the collected sensor data before sending them to the cloud, to ensure the sustainability of the infrastructure given energy constraints. This requires moving the sensor data processing tasks towards the resource constrained computational devices (e.g. mobile phones). In this paper, we propose the Mobile Sensor Data Processing Engine (MOSDEN), a plug-in-based IoT middleware for mobile devices that allows sensor data to be collected and processed without programming effort. Our architecture also supports the sensing as a service model. We present the results of evaluations that demonstrate its suitability for real world deployments. Our proposed middleware is built on the Android platform.
Submitted 15 October, 2013;
originally announced October 2013.
-
Sensor Search Techniques for Sensing as a Service Architecture for The Internet of Things
Authors:
Charith Perera,
Arkady Zaslavsky,
Chi Harold Liu,
Michael Compton,
Peter Christen,
Dimitrios Georgakopoulos
Abstract:
The Internet of Things (IoT) is part of the Internet of the future and will comprise billions of intelligent communicating "things" or Internet Connected Objects (ICO) which will have sensing, actuating, and data processing capabilities. Each ICO will have one or more embedded sensors that will capture potentially enormous amounts of data. The sensors and related data streams can be clustered physically or virtually, which raises the challenge of searching and selecting the right sensors for a query in an efficient and effective way. This paper proposes a context-aware sensor search, selection and ranking model, called CASSARAM, to address the challenge of efficiently selecting a subset of relevant sensors out of a large set of sensors with similar functionality and capabilities. CASSARAM takes into account user preferences and considers a broad range of sensor characteristics, such as reliability, accuracy, location, battery life, and many more. The paper highlights the importance of sensor search, selection and ranking for the IoT, identifies important characteristics of both sensors and data capture processes, and discusses how semantic and quantitative reasoning can be combined together. This work also addresses challenges such as efficient distributed sensor search and relational-expression based filtering. CASSARAM testing and performance evaluation results are presented and discussed.
Submitted 13 September, 2013;
originally announced September 2013.
-
Context Aware Sensor Configuration Model for Internet of Things
Authors:
Charith Perera,
Arkady Zaslavsky,
Michael Compton,
Peter Christen,
Dimitrios Georgakopoulos
Abstract:
We propose a Context Aware Sensor Configuration Model (CASCoM) to address the challenge of automated context-aware configuration of filtering, fusion, and reasoning mechanisms in IoT middleware according to the problems at hand. We incorporate semantic technologies in solving the above challenges.
Submitted 6 September, 2013;
originally announced September 2013.
-
Semantic-driven Configuration of Internet of Things Middleware
Authors:
Charith Perera,
Arkady Zaslavsky,
Michael Compton,
Peter Christen,
Dimitrios Georgakopoulos
Abstract:
We are currently observing emerging solutions to enable the Internet of Things (IoT). Efficient and feature rich IoT middleware platforms are key enablers for IoT. However, due to complexity, most of these middleware platforms are designed to be used by IT experts. In this paper, we propose a semantics-driven model that allows non-IT experts (e.g. plant scientists, city planners) to configure IoT middleware components more easily and quickly. Such tools allow them to retrieve the data they want without knowing the underlying technical details of the sensors and the data processing components. We propose a Context Aware Sensor Configuration Model (CASCoM) to address the challenge of automated context-aware configuration of filtering, fusion, and reasoning mechanisms in IoT middleware according to the problems at hand. We incorporate semantic technologies in solving the above challenges. We demonstrate the feasibility and the scalability of our approach through a prototype implementation based on an IoT middleware called Global Sensor Networks (GSN), though our model can be generalized to any other middleware platform. We evaluate CASCoM in the agriculture domain and measure performance in terms of both usability and computational complexity.
Submitted 5 September, 2013;
originally announced September 2013.
-
Sensing as a Service Model for Smart Cities Supported by Internet of Things
Authors:
Charith Perera,
Arkady Zaslavsky,
Peter Christen,
Dimitrios Georgakopoulos
Abstract:
The world population is growing at a rapid pace. Towns and cities are accommodating half of the world's population, thereby creating tremendous pressure on every aspect of urban living. Cities are known to have a large concentration of resources and facilities. Such environments attract people from rural areas. However, unprecedented attraction has now become an overwhelming issue for city governance and politics. The enormous pressure towards efficient city management has triggered various Smart City initiatives by both government and private sector businesses to invest in ICT to find sustainable solutions to the growing issues. The Internet of Things (IoT) has also gained significant attention over the past decade. IoT envisions connecting billions of sensors to the Internet and expects to use them for efficient and effective resource management in Smart Cities. Today infrastructure, platforms, and software applications are offered as services using cloud technologies. In this paper, we explore the concept of sensing as a service and how it fits with the Internet of Things. Our objective is to investigate the sensing as a service model from technological, economic, and social perspectives and identify the major open challenges and issues.
Submitted 30 July, 2013;
originally announced July 2013.
-
Context Aware Computing for The Internet of Things: A Survey
Authors:
Charith Perera,
Arkady Zaslavsky,
Peter Christen,
Dimitrios Georgakopoulos
Abstract:
As we are moving towards the Internet of Things (IoT), the number of sensors deployed around the world is growing at a rapid pace. Market research has shown a significant growth of sensor deployments over the past decade and has predicted a significant increase in the growth rate in the future. These sensors continuously generate enormous amounts of data. However, in order to add value to raw sensor data we need to understand it. Collection, modelling, reasoning, and distribution of context in relation to sensor data play a critical role in this challenge. Context-aware computing has proven to be successful in understanding sensor data. In this paper, we survey context awareness from an IoT perspective. We present the necessary background by introducing the IoT paradigm and context-aware fundamentals at the beginning. Then we provide an in-depth analysis of the context life cycle. We evaluate a subset of projects (50) which represent the majority of research and commercial solutions proposed in the field of context-aware computing conducted over the last decade (2001-2011) based on our own taxonomy. Finally, based on our evaluation, we highlight the lessons to be learnt from the past and some possible directions for future research. The survey addresses a broad range of techniques, methods, models, functionalities, systems, applications, and middleware solutions related to context awareness and IoT. Our goal is not only to analyse, compare and consolidate past research work but also to appreciate their findings and discuss their applicability towards the IoT.
Submitted 4 May, 2013;
originally announced May 2013.
-
Context-aware Sensor Search, Selection and Ranking Model for Internet of Things Middleware
Authors:
Charith Perera,
Arkady Zaslavsky,
Peter Christen,
Michael Compton,
Dimitrios Georgakopoulos
Abstract:
As we are moving towards the Internet of Things (IoT), the number of sensors deployed around the world is growing at a rapid pace. Market research has shown a significant growth of sensor deployments over the past decade and has predicted a substantial acceleration of the growth rate in the future. It is also evident that an increasing number of IoT middleware solutions are being developed in both research and commercial environments. However, sensor search and selection remain a critical requirement and a challenge. In this paper, we present CASSARAM, a context-aware sensor search, selection, and ranking model for the Internet of Things, to address the research challenges of selecting sensors when large numbers of sensors with overlapping and sometimes redundant functionality are available. CASSARAM proposes the search and selection of sensors based on user priorities. CASSARAM considers a broad range of sensor characteristics for search, such as reliability, accuracy, and battery life, to name a few. Our approach utilises both semantic querying and quantitative reasoning techniques. A user-priority-based weighted Euclidean distance comparison in multidimensional space is used to index and rank sensors. Our objectives are to highlight the importance of sensor search in the IoT paradigm, identify important characteristics of both sensors and data acquisition processes which help to select sensors, and understand how semantic and statistical reasoning can be combined to address this problem in an efficient manner. We developed a tool called CASSARA to evaluate the proposed model in terms of resource consumption and response time.
Submitted 11 March, 2013;
originally announced March 2013.
-
Dynamic Configuration of Sensors Using Mobile Sensor Hub in Internet of Things Paradigm
Authors:
Charith Perera,
Prem Jayaraman,
Arkady Zaslavsky,
Peter Christen,
Dimitrios Georgakopoulos
Abstract:
The Internet of Things (IoT) envisions billions of sensors connected to the Internet. By deploying intelligent low-level computational devices such as mobile phones between sensors and cloud servers, we can reduce data communication through intelligent processing such as fusing and filtering sensor data, which saves a significant amount of energy. This is also ideal for real-world sensor deployments where connecting sensors directly to a computer or to the Internet is not practical. Most of the leading IoT middleware solutions require manual and labour-intensive tasks to be completed in order to connect a mobile phone to them. In this paper, we present a mobile application called Mobile Sensor Hub (MoSHub). It allows a variety of sensors to be connected to a mobile phone and sends the data to the cloud intelligently, reducing network communication. Specifically, we explore techniques that allow MoSHub to be connected to cloud-based IoT middleware solutions autonomously. For our experiments, we employed the Global Sensor Network (GSN) middleware to implement and evaluate our approach. Such automated configuration significantly reduces the manual labour that would otherwise need to be performed by technical experts. We also evaluated different methods that can be used to automate the configuration process.
Submitted 5 February, 2013;
originally announced February 2013.
-
Connecting Mobile Things to Global Sensor Network Middleware using System-generated Wrappers
Authors:
Charith Perera,
Arkady Zaslavsky,
Peter Christen,
Ali Salehi,
Dimitrios Georgakopoulos
Abstract:
The Internet of Things (IoT) will create a cyberphysical world where all the things around us are connected to the Internet, and sense and produce "big data" that has to be stored, processed and communicated with minimum human intervention. With the ever-increasing emergence of new sensors, interfaces and mobile devices, the grand challenge is to keep up with this race in developing software drivers and wrappers for IoT things. In this paper, we examine approaches that automate the process of developing middleware drivers/wrappers for IoT things. We propose the ASCM4GSN architecture to address this challenge efficiently and effectively. We demonstrate the proposed approach using the Global Sensor Network (GSN) middleware, which exemplifies a cluster of data streaming engines. The ASCM4GSN architecture significantly speeds up the wrapper development and sensor configuration process, as demonstrated for Android mobile-phone-based sensors as well as for Sun SPOT sensors.
Submitted 6 January, 2013;
originally announced January 2013.
-
CA4IOT: Context Awareness for Internet of Things
Authors:
Charith Perera,
Arkady Zaslavsky,
Peter Christen,
Dimitrios Georgakopoulos
Abstract:
The Internet of Things (IoT) will connect billions of sensors deployed around the world. This will create an ideal opportunity to build a sensing-as-a-service platform. Due to the large number of sensor deployments, there will be a number of sensors that can be used to sense and collect similar information. Further, due to advances in sensor hardware technology, new methods and measurements will be introduced continuously. In the IoT paradigm, selecting the most appropriate sensors, those that can provide relevant sensor data to address the problems at hand, among billions of possibilities will be a challenge for both technical and non-technical users. In this paper, we propose the Context Awareness for Internet of Things (CA4IOT) architecture to help users by automating the task of selecting sensors according to the problems/tasks at hand. We focus on the automated configuration of filtering, fusion and reasoning mechanisms that can be applied to the sensor data streams collected from the selected sensors. Our objective is to allow users to submit their problems, so that our proposed architecture understands them and produces more comprehensive and meaningful information than the raw sensor data streams generated by individual sensors.
Submitted 6 January, 2013;
originally announced January 2013.
-
Capturing Sensor Data from Mobile Phones using Global Sensor Network Middleware
Authors:
Charith Perera,
Arkady Zaslavsky,
Peter Christen,
Ali Salehi,
Dimitrios Georgakopoulos
Abstract:
Mobile phones play an increasingly bigger role in our everyday lives. Today, most smart phones comprise a wide variety of sensors which can sense the physical environment. The Internet of Things vision encompasses participatory sensing, which is enabled by mobile-phone-based sensing and reasoning. In this research, we propose and demonstrate our DAM4GSN architecture to capture sensor data using sensors built into mobile phones. Specifically, we combine an open source sensor data stream processing engine called 'Global Sensor Network (GSN)' with the Android platform to capture sensor data. To achieve this goal, we proposed and developed a prototype application that can be installed on Android devices, as well as an AndroidWrapper as a GSN middleware component. The process and the difficulty of manually connecting sensor devices to sensor data processing middleware systems are examined. We evaluated the performance of the system based on the power consumption of the mobile client.
Submitted 1 February, 2013; v1 submitted 1 January, 2013;
originally announced January 2013.