Objective: This work aims to reduce the manual effort required to go from data capture to regulatory submission. We show that Siamese networks can generate embeddings that traditional machine learning classifiers can use to classify data fields at much higher accuracy than standard approaches.
Methods: Siamese networks learn data embeddings such that, under a given distance metric, data points in the same class lie closer to one another than to data points in other classes. Because they are designed to learn similarity between pairs of data points, they work well when the number of classes is large relative to the number of training samples. In this work, we show that embeddings generated by a Siamese network from the metadata associated with electronic data capture forms can be used to predict the associated SDTM field.
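The pairwise objective described above can be illustrated with a contrastive loss. This is a minimal sketch only: the abstract does not state which loss, distance metric, or margin the authors used, so the Euclidean distance, the `margin` value, and the function name `contrastive_loss` are all assumptions made for illustration.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Mean contrastive loss over a batch of embedding pairs.

    Assumed for illustration (not taken from the paper): Euclidean
    distance and a fixed margin of 1.0.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    same_class = np.asarray(same_class, dtype=float)

    # Euclidean distance between the two embeddings of each pair.
    dist = np.linalg.norm(emb_a - emb_b, axis=1)

    # Same-class pairs are penalized for being far apart.
    pull = same_class * dist ** 2
    # Different-class pairs are penalized only when closer than the margin.
    push = (1.0 - same_class) * np.maximum(0.0, margin - dist) ** 2

    return float(np.mean(pull + push))
```

Under this loss, a same-class pair contributes nothing once its two embeddings coincide, and a different-class pair contributes nothing once it is at least `margin` apart, which is one way to obtain the "closer within a class than across classes" property.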
Results: With a relatively simple network coupled with a basic classification algorithm, the proposed method achieves accuracies greater than 90%, which is significantly higher than what has been achieved with traditional methods. In many cases, this represents a 15% increase in accuracy over more traditional methods. Many of the inaccurate mappings are due to a lack of training data.
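The abstract does not name the basic classification algorithm applied to the embeddings. As a hedged illustration, the sketch below assumes a 1-nearest-neighbor classifier over precomputed embeddings; the function name `nearest_neighbor_predict` and the toy two-dimensional embeddings are hypothetical, and the SDTM variable names serve only as example labels.

```python
import numpy as np

def nearest_neighbor_predict(train_emb, train_labels, query_emb):
    """Label each query embedding with the label of its closest training embedding.

    A 1-nearest-neighbor classifier is an assumption made for illustration;
    the paper may use a different classifier.
    """
    train_emb = np.asarray(train_emb, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)

    # Pairwise Euclidean distances, shape (n_query, n_train).
    diffs = query_emb[:, None, :] - train_emb[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)

    # Index of the closest training embedding for each query.
    nearest = np.argmin(dists, axis=1)
    return [train_labels[i] for i in nearest]

# Toy embeddings standing in for the output of a trained Siamese network.
train_emb = [[0.0, 0.0], [10.0, 10.0]]
train_labels = ["AESTDTC", "VSORRES"]  # example SDTM variable names
predictions = nearest_neighbor_predict(train_emb, train_labels, [[1.0, 1.0], [9.0, 8.0]])
```

Because the embedding step has already pulled same-class fields together, even a simple distance-based classifier of this kind can separate classes that have few training samples each.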
Conclusion: Siamese networks make it possible to generate embeddings that efficiently represent data fields in a lower-dimensional space. This enables a system that automatically maps between data schemas with high accuracy. Such systems represent a first step toward automating one of the many labor-intensive data management tasks associated with clinical trials.
Copyright: © 2024 Yang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.