-
Verbesserung des Record Linkage für die Gesundheitsforschung in Deutschland
Authors:
Timm Intemann,
Knut Kaulke,
Dennis-Kenji Kipker,
Vanessa Lettieri,
Christoph Stallmann,
Carsten O. Schmidt,
Lars Geidel,
Martin Bialke,
Christopher Hampf,
Dana Stahl,
Martin Lablans,
Florens Rohde,
Martin Franke,
Klaus Kraywinkel,
Joachim Kieschke,
Sebastian Bartholomäus,
Anatol-Fiete Näher,
Galina Tremper,
Mohamed Lambarki,
Stefanie March,
Fabian Prasser,
Anna Christine Haber,
Johannes Drepper,
Irene Schlünder,
Toralf Kirsten
, et al. (5 additional authors not shown)
Abstract:
Record linkage means linking data from multiple sources. This approach makes it possible to answer scientific questions that cannot be addressed using single data sources due to limited variables. The potential of linked data for health research is enormous, as it can enhance prevention, treatment, and population health policies. Due to the sensitivity of health data, there are strict legal requirements to prevent potential misuse. However, these requirements also limit the use of health data for research, thereby hindering innovations in prevention and care. Also, comprehensive record linkage in Germany is often challenging due to a lack of unique personal identifiers or interoperable solutions. Moreover, the need to protect data is often weighed against the importance of research aimed at improving healthcare: for instance, data protection officers may demand the informed consent of individual study participants for data linkage, even when this is not mandatory. Furthermore, legal frameworks may be interpreted differently on different occasions. Given both technical and legal challenges, record linkage for health research in Germany falls behind the standards of other European countries. To ensure successful record linkage, case-specific solutions must be developed, tested, and modified as necessary before implementation. This paper discusses limitations and possibilities of various data linkage approaches tailored to different use cases in compliance with the European General Data Protection Regulation. It further describes requirements for achieving a more research-friendly approach to linking health data records in Germany. Additionally, it provides recommendations to legislators. The objective of this work is to improve record linkage for health research in Germany.
Submitted 14 December, 2023;
originally announced December 2023.
-
Privacy-Preserving Linkage of Distributed Datasets using the Personal Health Train
Authors:
Maximilian Jugl,
Sascha Welten,
Yongli Mou,
Yeliz Ucer Yediel,
Oya Deniz Beyan,
Ulrich Sax,
Toralf Kirsten
Abstract:
With the generation of personal and medical data at several locations, medical data science faces unique challenges when working on distributed datasets. Growing data protection requirements in recent years drastically limit the use of personally identifiable information. Distributed data analysis aims to provide solutions for securely working on highly sensitive data while minimizing the risk of information leaks, which would not be possible to the same degree in a centralized approach. A novel concept in this field is the Personal Health Train (PHT), which encapsulates the idea of bringing the analysis to the data, not vice versa. Data sources are represented as train stations. Trains containing analysis tasks move between stations and aggregate results. Train executions are coordinated by a central station with which data analysts can interact. Data remain at their respective stations, and analysis results are stored only inside the train, providing a safe and secure environment for distributed data analysis.
Duplicate records across multiple locations can skew results in a distributed data analysis. On the other hand, merging information from several datasets referring to the same real-world entities may improve data completeness and therefore data quality. In this paper, we present an approach for record linkage on distributed datasets using the Personal Health Train. We verify this approach and evaluate its effectiveness by applying it to two datasets based on real-world data and outline its possible applications in the context of distributed data analysis tasks.
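The station-and-train idea described in this abstract can be illustrated with a minimal sketch. All names here (`Station`, `Train`, `run_route`, `count_and_sum_age`) are hypothetical and chosen for illustration only; this is not the PHT implementation or its API, and it leaves out the central station, security, and record linkage entirely. It only shows the core pattern: the task travels to each data source, and the train carries away aggregates rather than records.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Train:
    """Hypothetical container for an analysis task and its aggregated results."""
    task: Callable[[List[dict]], dict]
    results: Dict[str, dict] = field(default_factory=dict)


@dataclass
class Station:
    """Hypothetical data source; its records are only read locally."""
    name: str
    _records: List[dict]

    def execute(self, train: Train) -> None:
        # Run the task on local data and store only the aggregate in the train.
        train.results[self.name] = train.task(self._records)


def count_and_sum_age(records: List[dict]) -> dict:
    """Example task (an assumption): return summary statistics, not raw records."""
    return {"n": len(records), "age_sum": sum(r["age"] for r in records)}


def run_route(train: Train, stations: List[Station]) -> Optional[float]:
    """Send the train to every station, then combine the per-station aggregates."""
    for station in stations:
        station.execute(train)
    n = sum(r["n"] for r in train.results.values())
    total = sum(r["age_sum"] for r in train.results.values())
    return total / n if n else None
```

In this sketch, calling `run_route(Train(count_and_sum_age), stations)` yields the overall mean age while the train only ever holds per-station counts and sums.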
Submitted 12 September, 2023;
originally announced September 2023.
-
Distributed Learning for Melanoma Classification using Personal Health Train
Authors:
Yongli Mou,
Sascha Welten,
Yeliz Ucer Yediel,
Toralf Kirsten,
Oya Deniz Beyan
Abstract:
Skin cancer is the most common cancer type. Usually, patients with suspected cancer are examined by doctors through unaided visual inspection. Dermoscopy has become a suitable tool to support physicians in their decision-making. However, clinicians need years of expertise to classify possibly malignant skin lesions correctly. Therefore, research has applied image processing and analysis tools to improve the treatment process. In order to perform image analysis and train a model on dermoscopic images, data needs to be centralized. However, data centralization often does not comply with local data protection regulations, both because of the sensitive nature of the data and because data providers lose sovereignty if they allow unlimited access to it. Distributed Analytics (DA) approaches, which bring the analysis to the data instead of vice versa, are one way to circumvent the privacy-related challenges of data centralization. This paradigm shift enables data analyses, in our case image analysis, with data remaining inside institutional borders, i.e., at the origin. In this documentation, we describe a straightforward use case including model training for skin lesion classification based on decentralised data.
Submitted 24 March, 2021;
originally announced March 2021.
-
Data Partitioning for Parallel Entity Matching
Authors:
Toralf Kirsten,
Lars Kolb,
Michael Hartung,
Anika Groß,
Hanna Köpcke,
Erhard Rahm
Abstract:
Entity matching is an important and difficult step for integrating web data. To reduce the typically high execution time for matching, we investigate how we can perform entity matching in parallel on a distributed infrastructure. We propose different strategies to partition the input data and generate multiple match tasks that can be independently executed. One of our strategies supports both blocking, to reduce the search space for matching, and parallel matching, to improve efficiency. Special attention is given to the number and size of data partitions, as they impact the overall communication overhead and memory requirements of individual match tasks. We have developed a service-based distributed infrastructure for the parallel execution of match workflows. We evaluate our approach in detail for different match strategies for matching real-world product data of different web shops. We also consider caching of input entities and affinity-based scheduling of match tasks.
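As a rough illustration of how blocking and size-limited partitions can yield independent match tasks, consider the following sketch. The abstract does not spell out the paper's strategies, so the blocking key, the `max_size` parameter, and all function names below are assumptions made for this example, not the authors' method.

```python
from collections import defaultdict
from itertools import combinations, combinations_with_replacement


def block_by(entities, key):
    """Group entities by a blocking key so only same-block entities are compared."""
    blocks = defaultdict(list)
    for entity in entities:
        blocks[entity[key]].append(entity)
    return dict(blocks)


def split(block, max_size):
    """Cut a block into partitions of at most max_size entities."""
    return [block[i:i + max_size] for i in range(0, len(block), max_size)]


def match_tasks(entities, key, max_size):
    """Create independent match tasks: one per pair of partitions within a block."""
    tasks = []
    for block in block_by(entities, key).values():
        parts = split(block, max_size)
        # Each partition is matched against itself and every later partition.
        tasks.extend(combinations_with_replacement(parts, 2))
    return tasks


def candidate_pairs(task):
    """Enumerate the entity pairs that a single match task has to compare."""
    left, right = task
    if left is right:
        return list(combinations(left, 2))
    return [(a, b) for a in left for b in right]
```

A smaller `max_size` produces more tasks that each need less memory, while a larger value produces fewer, heavier tasks; this is the trade-off between partition number and partition size that the abstract highlights.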
Submitted 28 June, 2010;
originally announced June 2010.