Advanced predictive modeling approaches have harnessed data to fuel important innovations at all stages of drug development. However, the need for a machine-readable drug product library that consolidates many aspects of formulation design and performance remains largely unmet. This study presents a scripted, reproducible approach to database curation and explores its potential to streamline oral medicine development. The Product Information files for all centrally authorized drug products containing a small molecule active ingredient were retrieved programmatically from the European Medicines Agency website. Text processing isolated relevant information, including the maximum clinical dose, dosage form, route of administration, excipients, and pharmacokinetic performance. Chemical and bioactivity data were integrated through automated linking to external curated databases. The capability of this database to inform oral medicine development was assessed in the context of drug-likeness evaluation, excipient selection, and prediction of oral fraction absorbed. Existing filters of drug-likeness, such as the Rule of Five, were found to poorly capture the chemical space of marketed oral drug products. Association rule learning identified frequent patterns in tablet formulation compositions that can be used to establish excipient combinations that have seen clinical success. Binary prediction models of oral fraction absorbed constructed exclusively from regulatory data achieved acceptable performance (test-set balanced accuracy = 0.725), demonstrating the modelability of these data and their potential for use during early-stage molecule prioritization tasks. This study illustrates the impact of highly linked drug product data in accelerating clinical translation and underlines the ongoing need for accuracy and completeness of data reported in the regulatory datasphere.
Keywords: absorption; cheminformatics; computational pharmaceutics; drug product database; formulation; machine learning.
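
For readers unfamiliar with the metric reported above, balanced accuracy for a binary classifier is the mean of sensitivity and specificity, which makes it less sensitive to class imbalance than plain accuracy. The following is a minimal sketch of that calculation only; the function name and the example labels are hypothetical and are not taken from the study's data or code.

```python
def balanced_accuracy(y_true, y_pred):
    """Return the mean of sensitivity and specificity for binary labels.

    Labels are assumed to be 1 (e.g. high fraction absorbed) or 0 (low);
    this encoding is an illustrative assumption, not the study's.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")

    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return (sensitivity + specificity) / 2


# Hypothetical, imbalanced example: four positives, two negatives.
example_true = [1, 1, 1, 1, 0, 0]
example_pred = [1, 1, 1, 0, 0, 1]
example_score = balanced_accuracy(example_true, example_pred)
```

In this toy example the sensitivity is 3/4 and the specificity is 1/2, so the balanced accuracy is 0.625, slightly below the plain accuracy of 4/6 because the minority class is predicted less well.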