Background: Duplicate and near-duplicate medical documents are problematic in document management, clinical use, and medical research. In this study, we focus on multisourced medical documents in the context of a population-based cancer registry in Switzerland. Although the data collection process is well regulated, the volume of transmitted documents is steadily increasing, and the presence of full or near-duplicates slows down and complicates document processing. Identifying near-duplicates is particularly challenging because the large number of documents makes exhaustive pairwise comparison infeasible.
Methods: We implemented a system that combines conventional hash functions, Simhash (a locality-sensitive hashing technique), and Smith-Waterman text alignment similarity. Simhash offers good performance, and confirming its results with the Smith-Waterman algorithm at a selected similarity threshold reduces the false positive rate to near zero without lowering sensitivity. Differences between near-duplicate documents are extracted and highlighted in the original PDF documents. We validated the method using 3042 manually verified document pairs, including 1252 full-duplicate and 398 near-duplicate pairs. The area under the curve (AUC) was 0.96, sensitivity 0.92, specificity 1.00, positive predictive value (PPV) 1.00, and negative predictive value (NPV) 0.91. For simulated data of the same size, the corresponding values were 0.86, 0.72, 1.00, 1.00, and 0.77, respectively.
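As an illustration of the three-stage idea described in the Methods (exact hash for full duplicates, Simhash as a fast candidate filter, Smith-Waterman as confirmation), the following is a minimal sketch and not the system used in the study. The word-level tokenisation, the MD5-derived 64-bit token hashes, the alignment scores, and the thresholds `max_hamming=8` and `min_similarity=0.8` are all assumptions chosen for readability. The function names are hypothetical.

```python
import hashlib
import re


def tokens(text):
    """Lower-cased word tokens (an assumed, simplistic tokenisation)."""
    return re.findall(r"\w+", text.lower())


def simhash(text, bits=64):
    """Simhash fingerprint: each token votes +1/-1 on every bit position."""
    weights = [0] * bits
    for tok in tokens(text):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit token hash
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def smith_waterman_similarity(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score over tokens, scaled to the range 0..1.

    The score is divided by the best score the shorter text could reach
    (a normalisation assumed here for illustration).
    """
    x, y = tokens(a), tokens(b)
    if not x or not y:
        return 0.0
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best / (match * min(len(x), len(y)))


def classify_pair(a, b, max_hamming=8, min_similarity=0.8):
    """Exact hash first, then Simhash filter, then alignment confirmation."""
    if hashlib.sha256(a.encode("utf-8")).digest() == \
            hashlib.sha256(b.encode("utf-8")).digest():
        return "full-duplicate"
    if hamming(simhash(a), simhash(b)) > max_hamming:
        return "different"  # rejected cheaply, no alignment needed
    if smith_waterman_similarity(a, b) >= min_similarity:
        return "near-duplicate"
    return "different"  # Simhash candidate not confirmed by alignment
```

In this sketch the quadratic-time alignment runs only on pairs that pass the Simhash filter, which is what keeps the confirmation step affordable. At registry scale the fingerprints would also need to be indexed (for example by banding on bit blocks) so that candidate pairs are found without comparing every pair, a step omitted here.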
Results: We applied the method to 224,398 medical documents in the cancer registry. We found 5.5% duplicates at the text level and 0.17-0.24% near-duplicates, depending on the parameters and threshold values used. Most near-duplicates related to the same patient and originated from the same transmitter. Manual evaluation showed that only 2% of the differences were in medical content and 83% were in administrative data (21% in patient data, 11% in doctor data, and 51% in other administrative data). Many near-duplicates looked strikingly similar from a human perspective.
Conclusions: We demonstrated that our method can efficiently find all full-duplicates and most near-duplicates in a large set of multisourced medical documents. Potential ways to further improve this method are discussed. The method can be applied to documents in all domains.
Keywords: (near-) duplicate documents; Cancer registry; Locality sensitive hashing; Smith-Waterman.
Copyright © 2025 The Author(s). Published by Elsevier B.V. All rights reserved.