A Comparative Study on Automatic Coding of Medical Letters with Explainability

Jamie Glen α, Lifeng Han β, Paul Rayson α    Goran Nenadic β
αLancaster University | βThe University of Manchester, UK

[email protected], [email protected]
lifeng.han, [email protected]
Abstract

This study explores the implementation of Natural Language Processing (NLP) and machine learning (ML) techniques to automate the coding of medical letters, with visualised explainability and lightweight local computer settings. Currently, in clinical settings, coding is a manual process that involves assigning codes to each condition, procedure, and medication in a patient’s paperwork (e.g., 56265001 heart disease, using a SNOMED CT code). There is preliminary research on automatic coding in this field using state-of-the-art ML models; however, due to the complexity and size of these models, real-world deployment has not been achieved. To further facilitate the possibility of automatic coding in practice, we explore some solutions in a local computer setting; in addition, we explore the function of explainability for the transparency of AI models. We used the publicly available MIMIC-III database and the HAN/HLAN network models for ICD code prediction. We also experimented with the mapping between the ICD and SNOMED CT knowledge bases. In our experiments, the models provided useful information for 97.98% of codes. The results of this investigation can shed some light on implementing automatic clinical coding in practice, such as in hospital settings, on the local computers used by clinicians. Project page: https://github.com/Glenj01/Medical-Coding.

1 Introduction

The coding of medical letters is currently completed manually in advanced healthcare systems such as those of the UK and the US (NHS UK, https://www.nhs.uk/). It involves professionals reviewing the paperwork for a patient’s hospital visit or appointment and assigning specific codes to the conditions, diseases, procedures, and medications in the letters. This study aims to examine the potential automation of this process using Natural Language Processing (NLP) and Machine Learning (ML) techniques, to create a prototype that could be used alongside the coders to speed up the coding process, and to explore whether such a system could be integrated into real practice.

Clinical codes are used to remove ambiguity in the language of the letters, provide easily generated statistics, give a standardised way to represent medical concepts, and allow the NHS’s Electronic Health Record (EHR) system to process and store the codes more easily NHS-Digital (2023). Also, in the case of private healthcare providers, coding can make it easier to keep track of billing (https://www.ashfordstpeters.nhs.uk/clinical-coding). To do this, the coder takes a medical letter as input, which can be anything from a prescription request to a hospital discharge summary, and outputs potential codes from a designated terminology and/or classification system. The NHS ‘fundamental information standard’ is the “Systematized Nomenclature of Medicine – Clinical Terms” (aka SNOMED-CT) terminology system, which uses ‘concepts’ to represent clinical thoughts. Each concept is paired with a ‘Concept Id’, a unique numerical identifier (e.g., 56265001 heart disease (disorder)), and concepts are then arranged by relationships into hierarchies from the general to the more detailed NHS-Digital (2023). It is worth noting that SNOMED is not the only system used for coding. The other system relevant to this work is the International Classification of Diseases (ICD), specifically ICD-9 (https://www.cdc.gov/nchs/icd/icd9cm.htm). This was the official system used to code diagnoses and procedures in the US. While SNOMED is a terminology system with a comprehensive scope, covering every illness, event, symptom, procedure, test, organism, substance, and medicine, ICD is a classification system whose scope is limited to classifying diagnoses and procedures. In the NHS UK, coding is a significant issue because it takes time, energy, and resources away from an already underfunded and overworked system.
There have been efforts to solve this by having dedicated clinical coding departments in larger hospitals (https://www.stepintothenhs.nhs.uk/careers/clinical-coder); however, in most smaller practices, it is still the medical professionals who do the coding. It takes the average coder 7-8 minutes to code each case, and a dedicated department of 25-30 coders usually codes more than 20,000 cases monthly Dong et al. (2022). Even so, there is almost always a backlog of cases to be coded, which has been known to extend over a year. It is estimated that AI applications in the healthcare industry have the potential to free up 1.944 million hours each year for healthcare professionals, with the largest share, 1.145 million hours, coming from AI in virtual health assistance (such as automated medical coding) Biundo et al. (2020). Clinical coding is challenging for two main reasons. The first is that the classification systems are complex and dynamic. The international edition of SNOMED contains 352,567 concepts (Five Step Briefing, SNOMED International, https://www.snomed.org/five-step-briefing), and while it should be noted that not all of these are diagnoses, finding the correct code can be challenging. The second is that there is no consistent structure in the documents to be coded. They can be notational, lengthy, and incomplete, in addition to being full of abbreviations and symbols. Since all the coding is done manually, the human factor must also be considered. A study by Burns et al. (2012) found that the median accuracy of coders under evaluation was 83.2%. It should be noted that this was with an interquartile range of 67.3% - 92.1%, which further illustrates the inconsistency of human coding.

This paper explores the potential of replacing the time-consuming process of manually coding letters with a program that automatically assigns codes to letters in a local computer setup. In the following sections, this paper will explore the background of automated medical coding, explain the implementation choices and issues encountered with this investigation, review the testing methods and results, and conclude by discussing the implications of these findings and the potential future of medical coding.

2 Backgrounds and Related Work

The background section is presented in two parts. The first part, on pre-neural methods, focuses on the early attempts at automated medical coding, how they worked, and the reasons why none of them were implemented in the real world. The second part, on the introduction of neural networks, follows the development from recurrent neural networks to transformer-based attention networks. We explore the methodology and results of each one and conclude with the platform on which the chosen model is based.

2.1 Pre-Neural Methods

Most papers regarding general healthcare NLP can be divided into two topics: text classification and information extraction Dong et al. (2022). Classification can be split into three versions of increasing complexity: binary classification, where an instance is in one of two distinct categories (e.g., smoker or non-smoker); multi-class classification, where there are multiple categories, but an instance can still only be assigned to one class (e.g., current smoker, former smoker, non-smoker); and multi-label text classification (https://huggingface.co/blog/Valerii-Knowledgator/multi-label-classification). The last of these involves instances that can be associated with several different labels/categories simultaneously, such as discharge letters, in which each letter always contains multiple conditions. Automated medical coding is often identified as a multi-label text classification problem; however, some older attempts still utilise information extraction or a combination of methods from both topics. The first attempts at automated clinical coding were from around 1970, such as the study by Dinwoodie and Howell (1973), which utilises a ‘fruit machine’ methodology. This entails representing each significant word of a diagnosis with an associated code number and, like a fruit machine in a pub, the code is correct when a common code number appears for all words in the diagnosis. While this study returns impressive results, with a correct coding rate of over 95%, this was only achieved on a small collection of pre-coded morbidity data from 16 doctors around Scotland. Thus, the approach will not scale up to the complex real-world scenario.

No real progress was then made for the next few decades. A 2010 literature review on clinical coding Stanfill et al. (2010) evaluated the results of 113 studies, the earliest being the above 1973 study, and concluded that while the systems hold promise, there has been no clear trend of improvement over time. Another interesting trend from this review is that, while no improvements had been made, researchers’ interest was increasing, as all but 4 of the studies found were published after 1994. Examples of attempted innovation from this period include a study from Farkas and Szarvas (2008) focusing on rule-based automated radiology report coding. It uses a variation of multi-label classification that treats the assignment of each label as a separate task, as opposed to treating valid sets of labels as a single class. It then builds a rule-based expert system that operates on if-then codes through the ICD hierarchy. It uses decision trees (which recursively classify the data through conditions, similar to the rule-based system used to classify codes) to predict false positives, which occur when the model incorrectly predicts a positive outcome. It then uses a maximum entropy classifier to tackle false negatives, calculating each token’s probability of a false negative. Both the decision tree and max entropy classifier worked to increase the micro-averaged $F_{\beta=1}$ scores by 4%, to 87.92%.

While these rule-based solutions are very accurate for the specific types of documents they examine, they will not generalise well to new problems since they are domain-specific. For them to be feasible for real-world use, the rules would need to be extended to tens of thousands of codes and would require a substantial investment of time and expertise to be executed properly. Statistical approaches were also attempted, such as initial attempts from Mullenbach et al. (2018), which utilised logistic regression (LR), and Perotte et al. (2014), which made use of Support Vector Machines (SVM). However, the results on the full MIMIC database (shown in Figure 1) indicated that they were also infeasible. Therefore, a different method had to be attempted: deep learning and neural networks.

2.2 Neural Networks and Attentions

Figure 1: Graph showing the AUC, F1 and Precision scores of the various methods explained in the Background section for both MIMIC-III-50 and MIMIC-III Full. Best scores for each category are highlighted in light blue.

The general approach of deep learning with neural networks is to learn, from the training data, a complex function that maps the information in the text to an appropriate set of medical codes Dong et al. (2022). Before any deep learning is carried out, the common first step in these projects, aside from preprocessing, is to produce word embeddings for each token. Each embedding is a semantically meaningful mathematical representation of the token, usually a vector, designed so that tokens with similar meanings have similar vectors Percha (2021). To compare the meaning of two words, one calculates the cosine similarity of their corresponding vectors. The most common method for producing embeddings is ‘word2vec’, which operates on the assumption that words with similar meanings tend to occur in similar contexts. It uses either a continuous bag of words (CBOW) model, which predicts the target word from the context words (the words surrounding the target word), or a skip-gram model, which predicts the context words from the target word Mikolov et al. (2013); both are examples of single-layer neural networks. A more advanced approach, which departs from the standard embedding practice of one vector per word/token/document, is bidirectional encoder representations from transformers (BERT) Devlin et al. (2019). These are massive pre-trained language models that are too resource-intensive to be trained from scratch in most circumstances; however, models trained on a general corpus can be fine-tuned to meet specific needs (such as clinical text mining through transfer learning Peng et al. (2019)). Unfortunately, due to their size and complexity, it is not currently feasible to train them on larger datasets without significant modification. The first successful deep learning attempts utilised recurrent neural networks (RNNs), with a focus on two specific types: Gated Recurrent Units (GRUs) and Long Short-Term Memory Networks (LSTMs).
The project by Nigam (2016) constructs an RNN with a single layer consisting of 20 time steps; at each time step, a normalised vector representing a patient note is submitted in time-sequential order (oldest to most recent). The activation (threshold) function is tanh, a mathematical operation applied to the weighted sum of inputs and biases in each neuron that introduces non-linearity into the network. A dropout rate of 0.1 is applied to prevent overfitting during training, and a learning rate of 0.001 determines how much the weights of the network are updated during each training iteration. Finally, the model is trained with a cross-entropy loss and uses a sigmoid function to normalise each output neuron’s value to between 0 and 1.
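As a small illustration of the embedding comparison described above, the following sketch computes the cosine similarity between toy vectors. The three-dimensional vectors and the word choices are made up for illustration; they are not taken from a trained word2vec model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "embeddings" with hypothetical values (not from a trained model).
embeddings = {
    "fever": [0.9, 0.1, 0.0],
    "pyrexia": [0.8, 0.2, 0.1],
    "fracture": [0.0, 0.2, 0.9],
}

sim_related = cosine_similarity(embeddings["fever"], embeddings["pyrexia"])
sim_unrelated = cosine_similarity(embeddings["fever"], embeddings["fracture"])
print(f"fever vs pyrexia:  {sim_related:.3f}")
print(f"fever vs fracture: {sim_unrelated:.3f}")
```

With embeddings learned from real text, one would expect near-synonyms to score closer to 1 than unrelated terms, as the toy values here are chosen to show.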

GRUs are implemented as recurrent units, where each unit contains a reset gate and an update gate, which allow the GRU to regulate the flow of information and selectively update its hidden state. They are computationally more efficient than LSTMs; however, they may be outperformed by LSTMs in tasks requiring long-range dependencies. LSTMs are built like GRUs but use three gates instead of two: an input gate, a forget gate, and an output gate. They are more powerful due to their additional gates and memory cells, which allow them to better preserve information over time. Convolutional neural networks (CNNs) consist of convolution layers and pooling layers and are mainly used for image and video processing; however, if the text is manipulated and processed correctly, they can be very effective for text processing. For example, one of the most successful studies into automated medical coding is the 2018 project Convolutional Attention for Multi-Label Classification (CAML) Mullenbach et al. (2018), which utilised a CNN but swapped the pooling layer for an attention mechanism. This attention mechanism is applied to the data to identify relevant portions of the document for each code prediction, allowing the model to selectively focus on and assign higher importance to the relevant words and phrases Vaswani et al. (2017). Using attention mechanisms in this way also allows for enhanced interpretability: the model provides insights into which parts of the document its predictions were made from, instead of the document simply being put through a function as with previous methods.

With attention comes transformer-based networks, and while attention networks are not exclusively transformer-based, transformers are exclusively attention-based Vaswani et al. (2017). They rely solely on self-attention mechanisms, processing the entire input sequence in parallel. This makes them more efficient for handling long sequences and allows for faster training and inference than more sequential models like RNNs. Transformers also allow for multi-head attention, an extension of the self-attention mechanism that runs several attention operations in parallel. This enables transformers to capture different aspects of the input data simultaneously and so model more complex relationships and patterns. This has recently been introduced into automated medical coding and, as demonstrated with HiLAT Liu et al. (2022), it is already promising. However, due to the computational complexity of such a model, it has only been tested on the limited MIMIC-III-50 dataset.

The table shown in Figure 1 demonstrates automated coding techniques’ slow but consistent progress. The highlighted segments represent the top performers in their respective categories. The transformer-based HiLAT model outperforms every other model in every metric when tested on the MIMIC-III-50 database. On the other hand, the CNN + attention-based model of CAML does the same when tested against all the models on the MIMIC-III Full database, while it is also the only model that can provide a level of explainability to its answers. These results indicate that an attention-based model is the preferred choice due to the superior results and their ability to provide explainability for their answers.

2.3 The MIMIC-III Dataset

In Clinical NLP, the primary resource is the MIMIC-III dataset, which is the only publicly available mainstream English dataset with enough data to perform proper training. Additionally, most models that attempt to solve the automatic coding problem use this dataset.

MIMIC-III Johnson et al. (2016) is a large, freely available database comprising de-identified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 (https://mimic.mit.edu/docs/iii/). The database is freely available to researchers worldwide, provided they have become a credentialed user of PhysioNet Johnson et al. (2016) and completed the required ‘Data or Specimens Only Research’ CITI training (https://physionet.org/content/mimiciii/view-required-training/1.4/), or another recognised course in protecting human research participants that includes HIPAA requirements. All data in the MIMIC database has been de-identified per HIPAA (Health Insurance Portability and Accountability Act) standards. This ensures that all 18 listed identifying data elements, such as names, telephone numbers, and addresses, are removed. The only elements not removed are dates, which are shifted in a random but consistent manner to preserve intervals. Therefore, all dates occur between 2100 and 2200, but the time of day, day of the week, and approximate seasonality have been conserved.

MIMIC is a relational database consisting of 26 tables containing different forms of data, from the patient’s clinical notes in NOTEEVENTS to extremely granular data such as the hourly documentation of patients’ heart rates. This makes it a vast and complex database to work with; however, since we are only using the database for its clinical notes, only five tables are required:

  • NOTEEVENTS – Deidentified notes, including nursing and physician notes, ECG reports, imaging reports, and discharge summaries.

  • DIAGNOSES_ICD - Hospital-assigned diagnoses, coded using the International Statistical Classification of Diseases and Related Health Problems (ICD) system.

  • PROCEDURES_ICD - Patient procedures, coded using the International Statistical Classification of Diseases and Related Health Problems (ICD) system.

  • D_ICD_DIAGNOSES - Dictionary of International Statistical Classification of Diseases and Related Health Problems (ICD) codes relating to diagnoses.

  • D_ICD_PROCEDURES - Dictionary of International Statistical Classification of Diseases and Related Health Problems (ICD) codes relating to procedures.

This still leaves a lot of unnecessary data. For example, the NOTEEVENTS table contains CHARTTIME, CHARTDATE, and STORETIME, which are the time and date a note was charted and the time it was stored in the system. The notes in NOTEEVENTS vary in usefulness and format, with the type of note indicated in the DESCRIPTION column. Since the medical coding projects that use MIMIC unanimously choose the discharge summaries, as they contain the most potential codes per letter (15.9 labels per document), we removed all the other types of notes. This was done by creating a new table that copied each row for which DESCRIPTION = ‘Discharge Summary’. The next step is to combine the data in separate tables into one table for easier access.
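The filtering step described above can be sketched as follows, using an in-memory SQLite database as a stand-in for a MIMIC-III installation. The table name DISCHARGE_NOTES, the reduced column set, and the three toy rows are hypothetical; the column name and value in the WHERE clause follow the description in the text and may need adjusting against an actual copy of NOTEEVENTS.

```python
import sqlite3

# In-memory stand-in for the MIMIC-III database (toy rows, not real data).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE NOTEEVENTS (HADM_ID INTEGER, DESCRIPTION TEXT, "TEXT" TEXT)'
)
conn.executemany(
    "INSERT INTO NOTEEVENTS VALUES (?, ?, ?)",
    [
        (100, "Discharge Summary", "Admitted with chest pain ..."),
        (101, "Nursing Progress Note", "Patient resting comfortably ..."),
        (102, "Discharge Summary", "History of type 2 diabetes ..."),
    ],
)

# Copy only the discharge summaries into a new table
# (DISCHARGE_NOTES is a hypothetical name chosen for this sketch).
conn.execute(
    """
    CREATE TABLE DISCHARGE_NOTES AS
    SELECT HADM_ID, "TEXT"
    FROM NOTEEVENTS
    WHERE DESCRIPTION = 'Discharge Summary'
    """
)

rows = conn.execute(
    'SELECT HADM_ID, "TEXT" FROM DISCHARGE_NOTES ORDER BY HADM_ID'
).fetchall()
for hadm_id, text in rows:
    print(hadm_id, text)
```

Keeping HADM_ID in the new table is a design choice made here so that the notes can later be joined to the ICD code tables by admission.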

Another note on MIMIC concerns its most popular subset, MIMIC-III-50, which contains only the notes and codes of the top 50 most frequently occurring codes (Table 1). First used in CAML Mullenbach et al. (2018), MIMIC-III-50 is often used as a proof-of-concept database for automatic medical coding projects because it is significantly smaller (8,067 documents compared to 47,724) and has fewer labels per document (5.7 compared to 15.9 for MIMIC full), which means it takes less time and fewer computational resources to train against. Projects like HiLAT Liu et al. (2022) that face challenges in accessing the necessary computing power for training their models have utilised the MIMIC-III-50 dataset to train on and achieve state-of-the-art results. The only issue with using MIMIC-III-50 is that, as Figure 2 demonstrates, it does not give the same opportunity to test models against a long tail distribution.

A database that follows a long tail distribution is one where there are many data points that are not well-represented, and the majority of occurrences are concentrated around a few values at the “head” of the distribution Zhang et al. (2023). This accurately describes the MIMIC-III-Full database, where the top 105 codes make up 50% of the total labels in the set, and there are 3,110 labels that have fewer than 5 examples Nigam (2016), with 203 codes not appearing in any discharge summaries at all. Solving the long tail distribution of MIMIC is one of the key challenges that will need to be addressed by the potential models to be deployed.
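The two long-tail statistics quoted above (how many of the most frequent codes cover half of all label occurrences, and how many codes have very few examples) can be computed from per-document label sets along the following lines. This is a minimal sketch; the function names and the five toy documents are hypothetical and do not reproduce the MIMIC figures.

```python
from collections import Counter


def head_size_for_coverage(label_sets, coverage=0.5):
    """Smallest number of most-frequent codes whose occurrences
    reach `coverage` of all label occurrences."""
    counts = Counter(code for labels in label_sets for code in labels)
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= coverage * total:
            return rank
    return len(counts)


def rare_codes(label_sets, max_examples=5):
    """Codes that occur in fewer than `max_examples` documents."""
    counts = Counter(code for labels in label_sets for code in labels)
    return sorted(code for code, n in counts.items() if n < max_examples)


# Toy label sets, one list of ICD-9 codes per document (illustrative only).
docs = [
    ["401.9", "428.0"],
    ["401.9"],
    ["401.9", "250.00"],
    ["401.9", "428.0"],
    ["038.9"],
]

head = head_size_for_coverage(docs, coverage=0.5)
tail = rare_codes(docs, max_examples=2)
print("codes needed for 50% coverage:", head)
print("codes with fewer than 2 examples:", tail)
```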

                      m.full      MIMIC-III-50
Training documents    47,724      8,067
Vocabulary size       51,917      51,917
Mean tokens per doc   1,485       1,530
Mean labels per doc   15.9        5.7
Total labels          8,922       50
Table 1: Details regarding the discharge summaries in the MIMIC-III Full (m.full) and MIMIC-III-50 databases Dong et al. (2021).
Figure 2: Distribution of labels for the MIMIC-III and MIMIC-III-50 dataset
Figure 3: Medical Letter Example

3 Model Selections

We selected three candidate models. In this section, each model is evaluated in terms of its results, methodology, and suitability for the study’s needs, concluding with the chosen model.

3.1 Problem Formalisation

Before each selected model is evaluated, the problem needs to be formally defined. Take $\mathcal{X}$ as the collection of clinical notes and $\mathcal{Y}$ as the full set of labels (ICD-9 codes). Each instance $x_d \in \mathcal{X}$ is the word sequence of a document $d$ and is associated with a label set $y_d \subseteq \mathcal{Y}$, where each $y_d$ can be represented as a $|\mathcal{Y}|$-dimensional multi-hot vector (a vector where multiple elements can have a value of 1, indicating that multiple features/categories are present at the same time), $\overrightarrow{Y_d} = [y_{d1}, y_{d2}, \ldots, y_{d|\mathcal{Y}|}]$, with $y_{dl} \in \{0, 1\}$, where 1 indicates that the $l$-th label has been used for the $d$-th instance and 0 indicates irrelevance Perotte et al. (2014). From this, the task of the models is to learn a complex function $f: \mathcal{X} \rightarrow \mathcal{Y}$ from the training set.
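To make the multi-hot representation concrete, the sketch below encodes one document's label set against a toy label vocabulary of four ICD-9 codes. The vocabulary, its ordering, and the helper name are assumptions made for illustration.

```python
def multi_hot(label_set, label_index):
    """Encode a set of codes as a 0/1 vector over the full label vocabulary."""
    vec = [0] * len(label_index)
    for code in label_set:
        vec[label_index[code]] = 1
    return vec


# Toy label vocabulary (the real one has thousands of ICD-9 codes).
all_labels = ["038.9", "250.00", "401.9", "428.0"]
label_index = {code: i for i, code in enumerate(all_labels)}

# A document coded with two of the four labels.
y_d = multi_hot({"401.9", "428.0"}, label_index)
print(y_d)
```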

All the chosen models use the same loss function, binary cross-entropy, and optimise it with L2 regularisation using the Adam (Adaptive Moment Estimation) optimiser Kingma and Ba (2014). Loss functions are used in neural networks as a measure of how well the network’s predictions match the true values of the training data, with binary cross-entropy loss measuring the dissimilarity between the true binary labels and the probabilities predicted by the model. In the context of these models, L2 regularisation is used to avoid overfitting, which occurs when the model fits a particular dataset so closely that it fails to generalise to new, unseen data. To prevent this, penalty terms based on the squared Euclidean norm of the weights are added to the loss, which penalise overly specific mappings and encourage the model to learn simpler, more generalised weight configurations. The Adam optimiser is a popular optimisation algorithm used to update the parameters of a neural network to minimise the loss function during training.
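The combined objective described above can be written out as a short sketch: binary cross-entropy averaged over label positions, plus an L2 penalty on the weights. The penalty coefficient `lam` and the flat list of weights are assumptions for illustration; in practice the loss and the Adam update would come from the training framework rather than hand-written code.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over all label positions."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


def l2_penalty(weights, lam):
    """L2 penalty: lam times the squared Euclidean norm of the weights."""
    return lam * sum(w * w for w in weights)


def regularised_loss(y_true, y_prob, weights, lam=1e-4):
    """Binary cross-entropy plus the L2 penalty (lam is a hypothetical value)."""
    return binary_cross_entropy(y_true, y_prob) + l2_penalty(weights, lam)


y_true = [1, 0, 1, 0]
good = regularised_loss(y_true, [0.9, 0.1, 0.8, 0.2], weights=[0.5, -0.3])
bad = regularised_loss(y_true, [0.2, 0.7, 0.4, 0.9], weights=[0.5, -0.3])
print(f"confident and correct: {good:.4f}")
print(f"mostly wrong:          {bad:.4f}")
```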

3.2 Model-1: Convolutional Attention for Multi-Label Classification (CAML)

CAML Mullenbach et al. (2018) (Figure 4), as already mentioned in the background section, utilises a CNN-based architecture but swaps the traditional pooling layer for an attention mechanism. The model starts by horizontally concatenating pretrained word embeddings into a matrix, X. A sliding-window convolution, as is standard in CNNs, is then applied to this matrix, computing a filter response on each window and resulting in the matrix H.

Next, the model applies a per-label attention mechanism. For each label $l$, a matrix-vector product between H and a label-specific vector is computed, and the result is passed through a SoftMax operator, which maps the input values to the range [0,1] while ensuring that they sum to 1, so that they can be used as probabilities. This SoftMax operator returns the distribution over locations in the document in the form of an attention vector $\alpha$. This attention vector is then used to compute a vector representation for each label, $v_l$. Finally, a probability is computed for label $l$ using another linear layer and a sigmoid transformation to obtain the final label prediction $y_l$. The sigmoid ensures that the probability of each label is normalised independently, rather than normalising the probability distribution over all labels as the SoftMax operator does.
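The per-label attention steps above can be sketched in NumPy for a single label. The parameter names `u_l`, `beta_l`, and `b_l` are assumptions introduced for this sketch (they are not defined in the text), and the random matrices stand in for learned convolution outputs and parameters.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def per_label_attention(H, u_l, beta_l, b_l):
    """Attention for one label.

    H      : (d, N) convolution outputs for N document positions
    u_l    : (d,)   label-specific attention vector (assumed name)
    beta_l : (d,)   label-specific output weights (assumed name)
    b_l    : float  label-specific bias (assumed name)
    """
    alpha = softmax(H.T @ u_l)  # (N,) distribution over positions
    v_l = H @ alpha             # (d,) attention-weighted document vector
    y_l = 1.0 / (1.0 + np.exp(-(beta_l @ v_l + b_l)))  # sigmoid probability
    return alpha, v_l, y_l


rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6))   # toy: 4 filters, 6 positions
u_l = rng.normal(size=4)
beta_l = rng.normal(size=4)

alpha, v_l, y_l = per_label_attention(H, u_l, beta_l, b_l=0.0)
print("attention over positions:", np.round(alpha, 3))
print("label probability:", float(y_l))
```

The vector `alpha` is what an explainability view would highlight: positions with larger weights contributed more to the prediction for that label.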

Figure 4: The CAML architecture with per label attention shown for one label from Mullenbach et al. (2018).
Figure 5: The HLAN Model Dong et al. (2021)

3.3 Model-2: Hierarchical Label Attention Network (HLAN)

The HLAN model Dong et al. (2021) is built around providing explainability for its results, and consists of an embedding layer, the HLAN layers, and a prediction layer. The embedding layer converts each token in the sentence into a continuous vector, where the word embedding algorithm word2vec returns the vector of word embeddings $x_{di}$.

The HLAN makes extended use of Gated Recurrent Units (GRU) to capture long-term dependencies. The GRU unit processes tokens one by one, generating a new hidden state for each token. At each hidden state, the GRU considers the previous tokens using a reset gate and an update gate. The GRU method implemented is known as Bi-GRU because it reads the sequence both forwards and backwards, concatenating the states at each step, to create a more complete representation.
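The gating and bidirectional reading described above can be sketched with a minimal NumPy GRU. This uses the standard GRU update equations rather than the HLAN implementation itself; the parameter dictionary layout, the helper names, and the random weights are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x, h, p):
    """One GRU step: update gate z, reset gate r, candidate state h_cand."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return (1.0 - z) * h + z * h_cand


def run_gru(xs, p, hidden):
    """Process tokens one by one, returning the hidden state after each."""
    h = np.zeros(hidden)
    states = []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return states


def bi_gru(xs, p_fwd, p_bwd, hidden):
    """Read forwards and backwards, concatenating the two states per token."""
    fwd = run_gru(xs, p_fwd, hidden)
    bwd = run_gru(xs[::-1], p_bwd, hidden)[::-1]
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])


def init_params(rng, in_dim, hidden):
    """Random toy parameters (a trained model would learn these)."""
    p = {}
    for g in "zrh":
        p["W" + g] = rng.normal(scale=0.1, size=(hidden, in_dim))
        p["U" + g] = rng.normal(scale=0.1, size=(hidden, hidden))
        p["b" + g] = np.zeros(hidden)
    return p


rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]  # 5 tokens, 3-dim embeddings
states = bi_gru(xs, init_params(rng, 3, 4), init_params(rng, 3, 4), hidden=4)
print(states.shape)
```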

The label-wise word-level attention mechanism contains a context matrix $V_w$, where each row $V_{wl}$ is the context vector for the corresponding label $y_l$. The attention score is calculated as a SoftMax function of the dot-product similarity between the vector representation of the hidden layers from the Bi-GRU and the context vector for the same label. The sentence representation matrix $C_s$ is computed as the weighted average of all hidden state vectors $h^{i}$ for each label $y_l$.

The label-wise sentence-level attention mechanism is computed in much the same way, outputting sentence-level attention scores and the document representation matrix $C_d$. The prediction layer then utilises a label-wise dot-product projection with logistic sigmoid activation to model the probability of each label for each document. Finally, the binary cross-entropy loss function is optimised with L2 regularisation and the Adam optimiser.
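The two attention levels and the prediction layer described above can be sketched in NumPy as follows. This is a simplified illustration, not the HLAN implementation: it applies the dot-product attention directly to given hidden states and omits the sentence-level Bi-GRU and any intermediate projection layers. The names `V_w`, `V_s`, `W`, and `b` and all toy shapes are assumptions.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def label_wise_attention(hidden, context):
    """hidden: (T, D) states; context: (L, D), one context vector per label.
    Returns attention scores (L, T) and label-wise representations (L, D)."""
    scores = softmax(context @ hidden.T, axis=1)
    return scores, scores @ hidden


def document_representation(sentences, V_w, V_s):
    """Word-level attention per sentence, then sentence-level attention."""
    # C_s[s, l] is the representation of sentence s for label l.
    C_s = np.stack([label_wise_attention(h, V_w)[1] for h in sentences])
    sent_logits = np.einsum("sld,ld->ls", C_s, V_s)    # (L, S)
    sent_scores = softmax(sent_logits, axis=1)         # (L, S)
    C_d = np.einsum("ls,sld->ld", sent_scores, C_s)    # (L, D)
    return sent_scores, C_d


def predict(C_d, W, b):
    """Label-wise dot-product projection followed by a sigmoid."""
    logits = np.sum(C_d * W, axis=1) + b
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
L, D = 3, 4                                    # toy: 3 labels, 4-dim states
sentences = [rng.normal(size=(5, D)), rng.normal(size=(7, D))]
V_w, V_s = rng.normal(size=(L, D)), rng.normal(size=(L, D))
W, b = rng.normal(size=(L, D)), np.zeros(L)

sent_scores, C_d = document_representation(sentences, V_w, V_s)
probs = predict(C_d, W, b)
print("sentence attention per label:\n", np.round(sent_scores, 3))
print("label probabilities:", np.round(probs, 3))
```

Because every label has its own row of attention scores, the word-level and sentence-level weights can be visualised separately for each predicted code.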

The HLAN has an extra label embedding initialisation (denoted as +LE) that can be implemented in place of the normal embedding layer and functions by leveraging the complex semantic relations (how different elements are related to each other in terms of their meanings) among the ICD codes. The embedding rests on the idea that, for two correlated labels, one would expect the prediction of one label to affect the other for some notes; this is represented by giving each label representation corresponding weights. The HLAN model was based on the HAN model Yang et al. (2016), originally proposed as “Hierarchical Attention Networks for Document Classification”, where the only difference between the two is that at the sentence and document level, HLAN utilises contextual matrices, whereas HAN uses contextual vectors. This means that while HLAN is more individually label-oriented, HAN still produces an attention visualisation for the whole document; its results are only slightly worse, while the computational complexity of training the model is reduced.

3.4 Model-3: Multi-Hop Label-wise Attention (MHLAT)

Much like HLAN, MHLAT Duan et al. (2023) comprises three main components: an input/encoder layer, an MHLAT layer, and a decoder layer (Figure 6). It also utilises the same label-wise attention mechanism; however, that is where the similarities end. In the encoding layer, MHLAT first splits the text into chunks with 512 tokens per chunk. It then adopts the general-domain pre-trained XLNet Yang et al. (2019) (similar to BERT but less computationally expensive), which is further trained on MIMIC and then applied to every chunk. The encoded chunks are then concatenated to form a global vector of the input text, H.
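The chunking step at the start of the encoding layer can be sketched as below. The helper name is hypothetical, and how the final, shorter chunk is handled (for example, padding) is not stated in the text, so it is simply left unpadded here.

```python
def split_into_chunks(tokens, chunk_size=512):
    """Split a token list into consecutive chunks of at most `chunk_size`."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# Toy document of 1,300 placeholder tokens.
tokens = [f"tok{i}" for i in range(1300)]
chunks = split_into_chunks(tokens)
print([len(c) for c in chunks])
```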

Both HLAN and MHLAT apply label-wise attention over multiple passes; however, where HLAN uses multiple Bi-GRUs, increasing the scope each time, MHLAT presents a ‘multi-hop’ approach. Initially, the label-wise attention is derived from matrices of the tokens of the input sentence from the encoder, followed by a ‘fusion’ operation that combines label-specific representations and label embeddings. A hop function is then defined that iteratively updates the context information and label embeddings, and this step is repeated. The decoding layer implements an independent linear layer for computing the label score and utilises the same binary cross-entropy loss function as the other models.

Figure 6: The MHLAT model architecture Duan et al. (2023).
Refer to caption
Figure 7: The results of the above models on the MIMIC-III-50 and MIMIC-III-Full databases.

3.5 Model summaries

Judging purely by the results (given in Figure 7), the MHLAT model achieves state-of-the-art performance, beating the other models on every metric for which it reports results. However, it is worth noting that the model, despite being attention-based, does not build in any form of explainability. As mentioned in the motivations, we want to explore some level of interpretability in coding models; otherwise, the professionals (clinicians) using them would have no way to verify the results and build trust.

Looking at the results of the remaining models, it is clear that HLAN performs better than CAML, which in turn performs better than HAN. However, the objective of the project was to prioritise explainability in the results, which made HLAN/HAN the ideal choice despite a slight reduction in performance on the MIMIC Full dataset. The enhanced interpretability of its answers justifies its use, especially in domains such as medical coding where transparency and understanding of the models’ decisions are crucial.

Figure 8: Data Processing for Model Learning Pipeline.
Figure 9: Model Deployment Pipeline with ICD Coding Visualisation and Mapping to SNOMED CT.

4 Coding with Explainability

The goal of this study is to develop a program that fulfils the investigation aims, namely producing SNOMED codes and visualisations, and that can then be used to evaluate how a comparable system might be implemented in a real setting, such as NHS UK. The program was implemented in Python 3.8 using the TensorFlow framework; it leverages the HAN model to predict ICD codes, converts these codes to SNOMED, and provides visualised attention scores for each document.

4.1 Data Processing and ICD Coding

The preprocessing (Figure 8) takes three of the tables from MIMIC described in Section 2.3, NOTEEVENTS, PROCEDURES_ICD, and DIAGNOSES_ICD, and combines them into one table, notes_labeled, with the schema SUBJECT_ID, HADM_ID, TEXT, LABELS where:

  • SUBJECT_ID – identifier unique to a patient, found in NOTEEVENTS.

  • HADM_ID – identifier unique to a hospital stay, found in NOTEEVENTS.

  • TEXT – The free text of the document. There can be multiple documents with the same HADM_ID. Found in NOTEEVENTS.

  • LABELS – ICD-9 labels professionally assigned and stored in sequence order in either DIAGNOSES_ICD or PROCEDURES_ICD, depending on whether they are diagnoses or procedures.

This is accomplished by first concatenating both _ICD tables into one table of codes, ALL_CODES. Next, the raw TEXT from NOTEEVENTS is preprocessed by removing tokens that contain no alphabetic characters (e.g., removing 500 but not 500mg), removing white space, and lowercasing all tokens. The processed text is stored in the disch_full table, which is then joined to the ALL_CODES table on HADM_ID to form the notes_labeled table.
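
The text-cleaning rules above can be sketched in a few lines. This is an illustrative stand-in rather than the preprocessing script itself: the function name `clean_note` is hypothetical, and the sketch assumes simple whitespace tokenisation, which may differ from the tokeniser used in the actual pipeline.

```python
import re

# A token is kept only if it contains at least one alphabetic character.
_HAS_ALPHA = re.compile(r"[A-Za-z]")


def clean_note(text: str) -> str:
    """Drop purely non-alphabetic tokens, collapse white space, and lowercase."""
    tokens = text.split()  # assumption: whitespace tokenisation
    kept = [token.lower() for token in tokens if _HAS_ALPHA.search(token)]
    return " ".join(kept)


print(clean_note("Patient  given 500 then 500mg   Aspirin"))
# patient given then 500mg aspirin
```

The bare number 500 is removed, while 500mg survives because it contains alphabetic characters.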

The code then generates the MIMIC_III_50 database by iterating through the notes_labeled file, counting the occurrences of each code, and saving the HADM_IDs to 50_hadm_ids and the 50 most frequent codes to TOP_50_CODES. Both the standard notes_labeled and the dev_50 tables are split 90/10 into train/test sets and stored in the train/test versions of their tables.

When we attempted to train the HLAN model on the full MIMIC dataset, the system it was being trained on (our local PC) did not have sufficient memory; therefore, the HAN model Yang et al. (2016) was used instead. This model did not need to be trained, as a pretrained model could be downloaded from GitHub.

At this point there is a working model that takes a text document as input and outputs an attention visualisation in Excel and a list of predicted codes in the console.
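
The code-counting and splitting steps can be sketched as below. All function names are hypothetical, and two details are assumptions not stated above: admissions with none of the retained codes are dropped, and the 90/10 split is made by a seeded random shuffle. The example uses k=2 on toy data purely to keep the output readable.

```python
import random
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_k_codes(labels_by_hadm: Dict[str, List[str]], k: int = 50) -> List[str]:
    """Return the k most frequent codes across all admissions."""
    counts = Counter(code for codes in labels_by_hadm.values() for code in codes)
    return [code for code, _ in counts.most_common(k)]


def filter_to_codes(labels_by_hadm: Dict[str, List[str]],
                    keep: Iterable[str]) -> Dict[str, List[str]]:
    """Keep only the chosen codes; drop admissions left with no codes (assumption)."""
    keep_set = set(keep)
    filtered = {}
    for hadm_id, codes in labels_by_hadm.items():
        kept = [code for code in codes if code in keep_set]
        if kept:
            filtered[hadm_id] = kept
    return filtered


def split_train_test(hadm_ids: Iterable[str], test_fraction: float = 0.1,
                     seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle admission IDs with a fixed seed and split them train/test."""
    ids = sorted(hadm_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


toy = {"100": ["427.31", "401.9"], "101": ["401.9"],
       "102": ["038.9"], "103": ["401.9", "427.31"]}
top = top_k_codes(toy, k=2)
print(top)                               # ['401.9', '427.31']
print(sorted(filter_to_codes(toy, top)))  # ['100', '101', '103']
```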

4.2 Entity Linking to SNOMED

Now with a working model, the next step is to map the ICD codes to SNOMED (Figure 9). The map (https://www.nlm.nih.gov/research/umls/mapping_projects/icd9cm_to_snomedct.html) was originally created for the Unified Medical Language System (UMLS) to facilitate the translation of legacy data still coded in ICD-9 to SNOMED CT codes, so it fits the project’s needs well. It contains multiple columns of data that are not required, mainly usage statistics, but these can simply be ignored. The most recent release of the map by UMLS (202212) was used; it is split into two tab-delimited value files with the same file structure, one for one-to-one mappings and one for one-to-many mappings. The one-to-one file contains 7,596 mappings (64.1% of ICD-9 codes), with each line in the file being a separate mapping. For example, the ICD code 427.31 (Atrial Fibrillation) maps directly to the SNOMED code 49436004 (Atrial Fibrillation (disorder)). The one-to-many file contains 3,495 mappings (29.5% of ICD-9 codes), each from one ICD code to multiple SNOMED codes. The file is laid out as one-to-one maps, with the ICD code repeated for each of its SNOMED codes, for example:

  • 719.46 – Pain in joint, lower leg | 202489000 – Tibiofibular joint pain

  • 719.46 – Pain in joint, lower leg | 239733006 – Anterior knee pain

  • 719.46 – Pain in joint, lower leg | 299372009 – Tenderness of knee joint

This was implemented by first loading the one-to-one map into a dictionary, then iterating through the predicted_codes list. At each iteration (new ICD code) the program checks to see if the ICD code is in the one-to-one map. If it is, the associated SNOMED code and FSN (fully specified name) are outputted; if not, the one-to-many map is loaded as a dictionary.

The program searches for the ICD code in the one-to-many dictionary, and if found, it outputs all the SNOMED codes related to the ICD code. This is done so that even if the program cannot find a direct mapping, it can at least provide the user with potential options. If an ICD code cannot be found in any mappings, the system will print the ICD code description from either D_ICD_DIAGNOSES or D_ICD_PROCEDURES. There are only a few cases, approximately 6.4% of the ICD codes, where there are no mappings available. This usually occurs with catch-all NEC (not elsewhere classified) ICD codes, such as 480.8 - Pneumonia due to other virus not elsewhere classified, for which SNOMED has no alternative mappings available.
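
The lookup order described above (one-to-one, then one-to-many, then the ICD description) can be sketched as follows. The dictionaries are toy stand-ins populated only with the examples quoted in this section, and the function name `map_icd_to_snomed` and the category labels are illustrative. Unlike the program described above, this sketch holds both maps in memory from the start rather than loading the one-to-many map on demand.

```python
from typing import Dict, List, Tuple

# Toy stand-ins for the two UMLS map files and the MIMIC description tables.
ONE_TO_ONE: Dict[str, Tuple[str, str]] = {
    "427.31": ("49436004", "Atrial Fibrillation (disorder)"),
}
ONE_TO_MANY: Dict[str, List[Tuple[str, str]]] = {
    "719.46": [
        ("202489000", "Tibiofibular joint pain"),
        ("239733006", "Anterior knee pain"),
        ("299372009", "Tenderness of knee joint"),
    ],
}
ICD_DESCRIPTIONS: Dict[str, str] = {
    "480.8": "Pneumonia due to other virus not elsewhere classified",
}


def map_icd_to_snomed(icd_code: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (category, results) for one predicted ICD code.

    Results are (SNOMED code, name) pairs; for 'No Map' the single pair
    carries an empty SNOMED code and the ICD description instead.
    """
    if icd_code in ONE_TO_ONE:
        return "1-to-1", [ONE_TO_ONE[icd_code]]
    if icd_code in ONE_TO_MANY:
        return "1-to-M", list(ONE_TO_MANY[icd_code])
    if icd_code in ICD_DESCRIPTIONS:
        return "No Map", [("", ICD_DESCRIPTIONS[icd_code])]
    return "No DESC", []


for code in ["427.31", "719.46", "480.8", "000.0"]:
    category, results = map_icd_to_snomed(code)
    print(code, category, len(results))
```

Returning the category alongside the results makes it straightforward to tally how often each kind of outcome occurs.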

After all these steps, the project now takes notes as input through a text document, processes them using the HAN model, and calculates the attention levels of the ICD codes. The program then converts the ICD codes into SNOMED codes with as many 1-to-1 mappings as it can find, outputting that to the console (Figure 10). Finally, the attention visualisation is exported into Excel (Figure 11) which shows each word in the file and highlights it in a shade of blue. The deeper the blue highlight, the greater the weight that word had when calculating the ICD codes. The visualisation displayed in Figure 11 is split up halfway down for ease of viewing. In reality, the left-hand side of the upper picture and the right-hand side of the lower picture are joined next to each other.
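
One way to turn attention weights into shades of blue is sketched below. This is an assumption about how such a highlight could be computed, not the colouring logic of the actual program: the function name `weight_to_hex` and the linear white-to-blue scale are hypothetical, and writing the colours into an Excel sheet (which would need a spreadsheet library) is not shown.

```python
def weight_to_hex(weight: float, max_weight: float) -> str:
    """Map an attention weight to a hex colour between white and pure blue.

    Larger weights give a deeper blue. Weights are scaled by max_weight and
    clamped to the range [0, 1].
    """
    if max_weight <= 0:
        return "#FFFFFF"  # no usable scale: leave the word unhighlighted
    ratio = min(max(weight / max_weight, 0.0), 1.0)
    # Red and green fade from 255 (white) to 0 (pure blue); blue stays at 255.
    level = round(255 * (1.0 - ratio))
    return f"#{level:02X}{level:02X}FF"


print(weight_to_hex(0.0, 1.0))  # #FFFFFF
print(weight_to_hex(1.0, 1.0))  # #0000FF
```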

Figure 10: Examples with the program returning 1-to-1 and 1-to-many ICD – SNOMED mappings.
Figure 11: Attention visualisation for the results of mapping. The deeper the blue highlight on a word, the more that word was used to calculate the mapping.

4.3 Evaluation Setup

The experiments are evaluated in two ways. First, the model is tested against the standard metrics of micro/macro F1 and precision. Second, the SNOMED mapping is assessed by calculating the percentage of codes for which the program can provide a mapping or a list of options.

To test the model, data had to be gathered by running the model on MIMIC discharge summaries from the test files. This was accomplished by randomly selecting 100 notes from the test_full file (see Gladkoff et al. (2022) on sample size and model confidence). We then ran each set of notes through the model and passed the output to a program that compared the labels generated by the model with the true labels in the file and returned the true positives, false positives, and false negatives, where:

  • True Positives – when the model predicts a label, and it is correct.

  • False Positives – when the model predicts a label, but it is incorrect.

  • False Negatives – when the model doesn’t predict a label even though there is a correct label.
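
For a single document, these three counts follow directly from set operations on the predicted and true label sets. The sketch below is illustrative; the function name `confusion_counts` and the example codes are not taken from the evaluation program.

```python
from typing import Iterable, Tuple


def confusion_counts(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) for one document."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)   # predicted and correct
    false_pos = len(pred_set - gold_set)  # predicted but not in the true labels
    false_neg = len(gold_set - pred_set)  # true labels the model missed
    return true_pos, false_pos, false_neg


print(confusion_counts(["427.31", "401.9", "38.93"], ["427.31", "428.0"]))  # (1, 2, 1)
```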

With these values generated, the model was evaluated using the same metrics that have been used for all the models discussed previously.

  • Recall – measures how often a model correctly identifies positive instances (true positives) out of all the actual positive samples in the dataset (https://www.evidentlyai.com/classification-metrics/accuracy-precision-recall); it is calculated by dividing the number of true positives by the number of actual positive instances (true positives + false negatives).

  • Precision – measures how often a model correctly predicts the positive class, calculated by dividing the number of correct positive predictions (true positives) by the total number of instances the model picked as positive (both true and false positives). The precision results from earlier models were reported as P@5, P@8, or P@15, which measure the proportion of relevant items within the top 5, 8, or 15 items retrieved by the system.

  • F1 Score – calculated as the harmonic mean of the precision and recall scores, therefore encouraging similar values for both. The more the precision and recall deviate from each other, the worse the score.

  • Macro F1 score – the unweighted average of the per-class F1 scores, representing the average performance of the model across all classes (each class having the same weight).

  • Micro F1 score – computes a global F1 score by summing the true positives, false negatives, and false positives over all classes and then putting those sums into the normal F1 equation. Each individual prediction therefore has the same weight, so frequent codes influence the score more than rare ones.
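
The definitions above can be written out as a short sketch. The function names and the per-label counts in the example are hypothetical and chosen only to show how macro and micro averaging diverge; they are not results from the experiments.

```python
from typing import Dict, Sequence, Set, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def macro_f1(per_label: Dict[str, Counts]) -> float:
    """Unweighted mean of per-label F1: every label counts equally."""
    if not per_label:
        return 0.0
    return sum(f1(*counts) for counts in per_label.values()) / len(per_label)


def micro_f1(per_label: Dict[str, Counts]) -> float:
    """F1 computed on counts summed over all labels: every prediction counts equally."""
    tp = sum(counts[0] for counts in per_label.values())
    fp = sum(counts[1] for counts in per_label.values())
    fn = sum(counts[2] for counts in per_label.values())
    return f1(tp, fp, fn)


def precision_at_k(ranked_codes: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of the top-k ranked codes that are in the true label set."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for code in ranked_codes[:k] if code in gold) / k


# Hypothetical counts: one frequent, well-predicted label and one rare, poorly predicted one.
per_label = {"401.9": (8, 2, 2), "427.31": (1, 1, 3)}
print(round(macro_f1(per_label), 3), round(micro_f1(per_label), 3))  # 0.567 0.692
```

In the toy example the micro score is pulled towards the frequent label, whereas the macro score is dragged down by the rare one. With thousands of rarely occurring codes, a macro F1 far below the micro F1 is therefore unsurprising.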

Aside from these results, we also collected the SNOMED mapping outcomes. These were gathered during the same test runs, and each returned SNOMED result falls into one of four categories:

  • 1-to-1 – The ICD to SNOMED code was a one-to-one match

  • 1-to-M – The ICD to SNOMED code was a one-to-many match

  • No Map – No ICD to SNOMED map was found.

  • No DESC – There was no description found associated with the ICD codes in the D_ICD_DIAGNOSES MIMIC file. This was a rare valid return due to the formatting of the D_ICD_DIAGNOSES file.

Figure 12: The evaluation results of the first 20 documents tested (full results in appendix).

4.4 Evaluation Results

4.4.1 ICD Coding Evaluation

For the ICD coding evaluation, the results of the first 20 documents tested are listed in Figure 12, with the full list in Appendix D.

The combined results of all the tests (Table 2) were then calculated, giving a macro F1 of 0.041 (compared to 0.036 from previous HAN tests) and a micro F1 of 0.403 (compared to 0.407 from previous HAN tests). The similarity to the previous results demonstrates that the model was functioning as intended; although the results are not state of the art, they are what was expected. The same can be said for precision, which we calculated using the first 15 values returned, otherwise known as P@15 (the same as previous tests), obtaining a precision of 0.599 (compared to 0.613).

While these results are not identical to those of the previous HAN model testing, this is to be expected as only 100 documents were tested. With so few documents, any outliers have a greater effect on the overall results; the more documents are tested, the closer the results will come to the actual values.

Models | Precision@15 | Macro F1 | Micro F1
HAN-our | 0.599 | 0.041 | 0.403
HAN-ori | 0.613 | 0.036 | 0.407
Table 2: Combined results comparing our HAN testing against the original HAN results.

Outcome | 1-to-1 | 1-to-M | No Map | No DESC
Total | 446 | 117 | 263 | 17
% Total | 52.91% | 13.88% | 31.20% | 2.02%
Table 3: Results of the SNOMED mappings.

4.4.2 SNOMED Mapping Evaluation

Regarding the SNOMED mapping, the individual results (shown in Figure 12) were summed per category, with 100 subtracted from the No DESC value so that the program error of producing a No DESC result at the end of each document was not counted in the total. From this, a 1-to-1 map is displayed 52.91% of the time and a 1-to-many map 13.88% of the time, which means the program successfully mapped to SNOMED on 66.79% of attempts.

The unexpected result in this situation is the significant number of ‘no maps’ returned. This is due to differing versions of ICD-9 codes being used: MIMIC uses the standard ICD-9 coding, but the mapping uses ICD-9-CM, the clinical modification used for morbidity coding. This means that there will be codes in one version that are not featured in the other, and unfortunately, there is not much that can be done to resolve this aside from creating a new mapping.

Even when returning a ‘no map’, the program still returns the description of the ICD code which is useful information for the user. Therefore, this implementation returns a useful response for 97.98% of attempted codes.
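
The percentages quoted in this subsection can be reproduced from the totals in Table 3. The sketch below only restates that arithmetic; the function name `mapping_shares` is illustrative.

```python
from typing import Dict


def mapping_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-category counts into percentages of the total."""
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


# Totals from Table 3 (No DESC already has the 100 end-of-document results removed).
counts = {"1-to-1": 446, "1-to-M": 117, "No Map": 263, "No DESC": 17}
shares = mapping_shares(counts)

mapped = shares["1-to-1"] + shares["1-to-M"]  # codes with at least one SNOMED candidate
useful = 100 - shares["No DESC"]              # codes with a mapping or an ICD description

print(sum(counts.values()), round(mapped, 2), round(useful, 2))  # 843 66.79 97.98
```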

5 Conclusions and Future Work

This study aimed to compare existing coding methods and produce a model that automatically assigns labels to medical texts and gives an explainable outcome, in order to explore how such a system could be implemented in real practice, e.g. NHS UK. High ethical standards were maintained during the project considering the field of study. As outcomes, the model automatically assigns labels to medical texts using a pre-trained HAN model that emphasises interpretability, producing a document that explains how it reached its decisions. The project also explores the potential of integrating a similar system into a real setting, using mappings to SNOMED and having a medical professional give feedback throughout the development of the system and evaluate the results of the final program (see Appendix B for the human evaluation).

Regarding future work for real applications specifically, we believe that for a project like this to be viable, a new dataset needs to be created that more accurately represents the data the model is going to come across. Using discharge summaries from MIMIC to train the model and then expecting it to perform on completely different data is infeasible; no matter how complex the model is and how good it gets at zero-shot learning, etc., it will only ever be good at modelling data that is similar to the data it was trained on. Making a new database would also eliminate the need to map between coding standards, as a database built specifically for a given use case, e.g. NHS UK, can be coded in SNOMED by default. Another direction is to deploy SOTA medication and treatment extraction tools for richer annotation of clinical data, such as recent work by Belkadi et al. (2023); Tu et al. (2023).

From a more general perspective, automated medical coding as a problem seems to be advancing towards transformer-based solutions in both the full modelling like MHLAT and word embeddings with BERT. This technology shows definite promise with its results against MIMIC-III-50, with its only limit being the computational feasibility of training such a complex model.

Limitations

After our first meeting, the external stakeholder created a simplified mock-up of the NHS Electronic Health Record (EHR) system to store patient information (https://github.com/furbrain/SimpleEHR). The system integrated the SNOMED codes into the EHR using the SNOMED terminology service Hermes (https://github.com/wardle/hermes; “Hermes: terminology tools, library and microservice”). Since one of the objectives of the project was to demonstrate how it could be implemented into the wider NHS system, creating a mock-up of the EHR was deemed a good starting point.

Unfortunately, there were issues getting Hermes (more specifically the Hermes docker file) to function on a Windows PC. These issues did not occur on the university virtual machines (VMs), so the project was moved onto the Linux-based VMs. Doing this had its own problems, as we no longer had permissions to ‘sudo install’ any of the Python libraries required to run Hermes. To solve this, a custom text-based VM had to be created with all the permissions needed to run Hermes. There were access problems with this VM due to incorrect SSH keys, but once these were fixed a Hermes terminology server was successfully set up on the VM.

Gaining access to MIMIC-III required the completion of two CITI training modules: Data or Specimens Only Research, and Conflicts of Interest (both in the Appendix). After this, our PhysioNet account (PhysioNet is a repository of medical data, where MIMIC is available to download) became credentialed and therefore gained access to the full MIMIC dataset.

Unfortunately, the custom VM did not have enough space for the full MIMIC dataset. Therefore, the dataset had to be downloaded onto our personal Windows PC, without the working Hermes server, and the project restarted from there. From here, preprocessing could begin to make MIMIC and the HLAN compatible.

Acknowledgements

We thank the external stakeholder (a Local GP) for the support, feedback, and human evaluation during this project. LH and GN are grateful for the grant “Integrating hospital outpatient letters into the healthcare data space” (EP/V047949/1; funder: UKRI/EPSRC).

References

  • Belkadi et al. (2023) Samuel Belkadi, Lifeng Han, Yuping Wu, and Goran Nenadic. 2023. Exploring the value of pre-trained language models for clinical named entity recognition. In 2023 IEEE International Conference on Big Data (BigData), pages 3660–3669.
  • Biundo et al. (2020) E. Biundo, A. Pease, K. Segers, M. de Groote, T. d’Argent, and E. de Schaetzen. 2020. The socio-economic impact of ai in healthcare. Deloitte, MedTech Europe.
  • Burns et al. (2012) Elaine M Burns, E Rigby, R Mamidanna, A Bottle, P Aylin, P Ziprin, and OD Faiz. 2012. Systematic review of discharge coding accuracy. Journal of public health, 34(1):138–148.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Dinwoodie and Howell (1973) HP Dinwoodie and RW Howell. 1973. Automatic disease coding: the ‘fruit-machine’ method in general practice. British journal of preventive & social medicine, 27(1):59.
  • Dong et al. (2022) Hang Dong, Matúš Falis, William Whiteley, Beatrice Alex, Joshua Matterson, Shaoxiong Ji, Jiaoyan Chen, and Honghan Wu. 2022. Automated clinical coding: what, why, and where we are? NPJ digital medicine, 5(1):159.
  • Dong et al. (2021) Hang Dong, Víctor Suárez-Paniagua, William Whiteley, and Honghan Wu. 2021. Explainable automated coding of clinical notes using hierarchical label-wise attention networks and label embedding initialisation. Journal of biomedical informatics, 116:103728.
  • Duan et al. (2023) Junwen Duan, Han Jiang, and Ying Yu. 2023. Mhlat: Multi-hop label-wise attention model for automatic icd coding. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
  • Farkas and Szarvas (2008) Richárd Farkas and György Szarvas. 2008. Automatic construction of rule-based icd-9-cm coding systems. In BMC bioinformatics, volume 9, pages 1–9. Springer.
  • Gladkoff et al. (2022) Serge Gladkoff, Irina Sorokina, Lifeng Han, and Alexandra Alekseeva. 2022. Measuring uncertainty in translation quality evaluation (TQE). In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1454–1461, Marseille, France. European Language Resources Association.
  • Johnson et al. (2016) Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Liu et al. (2022) Leibo Liu, Oscar Perez-Concha, Anthony Nguyen, Vicki Bennett, and Louisa Jorm. 2022. Hierarchical label-wise attention transformer model for explainable icd coding. Journal of Biomedical Informatics, 133:104161.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.
  • Mullenbach et al. (2018) James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1101–1111, New Orleans, Louisiana. Association for Computational Linguistics.
  • NHS-Digital (2023) NHS-Digital. 2023. Building healthcare software - clinical coding, classifications and terminology.
  • Nigam (2016) Priyanka Nigam. 2016. Applying deep learning to icd-9 multi-label classification from medical records. Technical report, Stanford University.
  • Peng et al. (2019) Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: an evaluation of bert and elmo on ten benchmarking datasets. arXiv preprint arXiv:1906.05474.
  • Percha (2021) Bethany Percha. 2021. Modern clinical text mining: a guide and review. Annual review of biomedical data science, 4(1):165–187.
  • Perotte et al. (2014) Adler Perotte, Rimma Pivovarov, Karthik Natarajan, Nicole Weiskopf, Frank Wood, and Noémie Elhadad. 2014. Diagnosis code assignment: models and evaluation metrics. Journal of the American Medical Informatics Association, 21(2):231–237.
  • Stanfill et al. (2010) Mary H Stanfill, Margaret Williams, Susan H Fenton, Robert A Jenders, and William R Hersh. 2010. A systematic literature review of automated clinical coding and classification systems. Journal of the American Medical Informatics Association, 17(6):646–651.
  • Tu et al. (2023) Hangyu Tu, Lifeng Han, and Goran Nenadic. 2023. Extraction of medication and temporal relation from clinical text using neural language models. In 2023 IEEE International Conference on Big Data (BigData), pages 2735–2744.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Conference on Neural Information Processing Systems, pages 6000–6010.
  • Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32.
  • Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California. Association for Computational Linguistics.
  • Zhang et al. (2023) Yifan Zhang, Bingyi Kang, Bryan Hooi, Shuicheng Yan, and Jiashi Feng. 2023. Deep long-tailed learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10795–10816.

Appendix

Appendix A Study Context

This paper explores the potential of replacing the time-consuming process of manually coding letters with a program that automatically assigns codes to letters. For the program to be of any value to its intended users, the external stakeholder (who is a local GP and has an interest in programming) stated that the output should be explainable. This would allow the users to verify the results if unsure and increase the trust between them and the system. The stakeholder also stated that ideally the system would be easily implemented into the wider NHS systems, so the system can store and link the codes and letters to the patients they are about. This would allow the program to utilise previous letters about the patient to aid with the coding.

Because the program is oriented around the inherently personal topic of healthcare, ethics approval to gain access to the required resources was always going to be important. We had to gain access to MIMIC-III (Medical Information Mart for Intensive Care), which is a free database comprising deidentified healthcare data, as well as the UK and US versions of SNOMED-CT and the UMLS ICD-9 to SNOMED-CT maps from the NIH. The MIMIC database had to be pre-processed to train the HLAN (Hierarchical Label Attention Network) system that generated the ICD-9 label predictions. These label predictions had to be mapped to SNOMED-CT terminology codes, and exported in a user-friendly and readable manner.

The external stakeholder will evaluate this, and tests will be created to validate the results already generated by the HLAN and see if mapping to SNOMED affects them.

The following training was completed as good practice:

  • CITI training (https://physionet.org/about/citi-course/): Collaborative Institutional Training Initiative (CITI Program)

  • Massachusetts Institute of Technology Affiliates

  • Curriculum group: Human Research

  • Course Learner Group: Data or Specimens Only Research

Appendix B Human Evaluation Insights

The second method of evaluation is to have the program code some example letters written by the stakeholder to reflect real-world scenarios. To evaluate the program, we collect the results of the program coding those letters, as well as the stakeholder’s verbal feedback on how this would fit within the NHS.

To complete the stakeholder evaluation, the external stakeholder prepared six example letters containing a mix of common and uncommon diseases/procedures that they would come across in their everyday work. The letters included sections designed to test the system, such as the example letter below signed by ‘Dr xxx xxx’:

Dear Dr xxx, Thank you for sending xxx to me. I agree that I think she has quite bad psoriasis; I will refer her for phototherapy. Yours Sincerely, Dr xxx xxx

The letters were processed with the model, and the predicted codes and their attention maps were shown to the stakeholder (the other letters are contained in Appendix E). Unfortunately, the results on almost all the letters were disappointing. With the letter above, the correct codes would be 9104002—psoriasis and either 31394004—light therapy, which is the parent to all forms of phototherapy, or 428545002—phototherapy of skin as the more specific result. The model returned the results and attention map shown in Figure 13.

Figure 13: Codes returned and the attention map presented to the external stakeholder for the example letter. It should be noted that the code V45.01 = cardiac pacemaker in situ.

With these results, not only were the predicted codes incorrect, but the attention maps were also wrong and were missing words. This did not happen with any of the MIMIC discharge summaries, which, even when the codes were wrong, at least specified where in the letter the codes were found (as demonstrated in Figure 14).

There was one letter where the result was correct; the letter stated, ‘I reviewed xxx following his PCA - this has indeed shown a MI which is clearly causing LVF, as evidenced by his raised BNP. We will proceed to a CABG’, where LVF = left ventricular failure and CABG = coronary artery bypass graft. The model returned 42343007 - congestive heart failure, which the external stakeholder identified as a perfect match for LVF, and the procedure ‘continuous invasive mechanical ventilation for less than 96 consecutive hours’, which, although oddly specific, does occur during a CABG.

Since using the pre-prepared letters didn’t give the system a chance to demonstrate how it returns the codes, the external stakeholder was also given the codes returned from a MIMIC discharge summary (Figure 14) that showed codes with direct and indirect SNOMED mappings. Regarding this, they stated that with a good enough accuracy of coding, the solution would genuinely be useful for medical coding, with their only critique being that when there is no direct mapping, usually the least specific (parent in the hierarchy – in the example in Figure 14 that would be 55822004 - Hyperlipidaemia) should be used.

Figure 14: Result given to the external stakeholder with examples of direct and indirect SNOMED mappings.

From these results conclusions can be made looking at the issues from two angles. The first is that, despite the best efforts of the model, it has succumbed to overfitting with the MIMIC discharge summaries, leading to it not properly functioning when given data that doesn’t resemble said discharge summaries.

The other conclusion is that the MIMIC database simply isn’t representative enough of what this project aims to code. The model is only trained using discharge summaries, which are long and detailed documents, but more importantly, they only contain diseases/procedures that would require hospitalisation. This also explains why the model successfully predicted heart failure – a serious condition that presumably would have been included in multiple discharge summaries – but didn’t detect the other letters (included in Appendix E) about less serious diseases such as ear infection, headaches, and psoriasis.

A note on this conclusion is that the final letter that describes ‘Waldenström’s Macroglobulinemia’ – a rare form of blood cancer - returned no mappings despite it being something with potential for hospitalisation. This was still the case when we changed it to its other well-known name, lymphoplasmacytic lymphoma.

Finally, the stakeholder stated that another thing to be added to make it truly useful would be that it implements the whole of the SNOMED terminology, not just the diagnoses and procedures. Using MIMIC data, the models can only be trained on ICD-9 codes, which as described earlier only contain diagnoses and procedures. SNOMED also has hierarchies for medicines, tests, organisms, and substances that also need coding.

Figure 15: MIMIC-III-50 Table of Included Codes and corresponding short title of ICD-9 code Dong et al. (2021)

Appendix C Implementation Details

Implementing the HAN model came with surprisingly few difficulties considering its complexity and the previous issues with everything in the project so far. It required Python 3.8 instead of 3.6 and TensorFlow 1 instead of PyTorch like CAML. A note on TensorFlow 1 - The only version available for download is TensorFlow 1.15, deprecated from TensorFlow 2.0.0 and installed through the TensorFlow Hub onto an Anaconda (conda) virtual environment.

To preprocess the data so that it is in the format expected for the HLAN model to train/test, it requires the same preprocessing as CAML. There were some issues running this as some of the Python libraries, more specifically the versions of NumPy, SciPy, and Scikit-Learn in the requirements list, kept throwing errors about each other’s versions on installation. This was fixed by doing a clean install of Python 3.6 in a virtual environment, and this virtual environment was where the CAML preprocessing script was run 141414https://github.com/jamesmullenbach/caml-mimic. In this virtual environment there were problems running Jupyter Notebook, but to fix this, the code was copied from the notebook into a regular Python file that did what the notebook would have done, just without the visualisation.

Because an outdated version of pandas had been installed as a result of the Python version differences, each time a new line of combined codes and processed text was written, an extra blank line was written as well, which caused the program to throw errors. This was resolved by running the clean_notes program, which removes all blank lines.
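The blank-line clean-up described above can be illustrated with a minimal sketch. This is not the project's clean_notes program: the function name `remove_blank_lines` and the file names are hypothetical, and the sketch assumes a plain-text file in which no record legitimately contains an empty line (for example, inside a quoted multi-line CSV field).

```python
from pathlib import Path


def remove_blank_lines(src: Path, dst: Path) -> int:
    """Copy src to dst, skipping empty or whitespace-only lines.

    Returns the number of lines that were skipped.
    Hypothetical helper for illustration; not the project's clean_notes.
    """
    dropped = 0
    with src.open("r", encoding="utf-8") as fin, \
            dst.open("w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():
                # Keep any line that has visible content.
                fout.write(line)
            else:
                dropped += 1
    return dropped


if __name__ == "__main__":
    import tempfile

    # Small demonstration on a temporary file (contents are made up).
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "notes_raw.csv"
        clean = Path(tmp) / "notes_clean.csv"
        raw.write_text("id,text\n\n1,first note\n\n2,second note\n",
                       encoding="utf-8")
        print(remove_blank_lines(raw, clean), "blank lines removed")
```

Writing to a separate output file keeps the original intact, so the result can be checked before the preprocessed data is passed to the model.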

The model was then used by running the runTest.py file with the existing code blocks already set up for MIMIC-III.

Appendix D Full Evaluation Results

The full evaluation results are listed in Figures 16 and 17.

Figure 16: Full Evaluation Results - Part 1
Figure 17: Full Evaluation Results - Part 2

Appendix E Example Letters from Stakeholder and Results

Letter 1: “ Dear xx xxx,

I saw xxx today in clinic. I think he has chronic otitis media. I have inserted some grommets, which should hopefully improve his hearing.

Yours Sincerely,

xx xxx ”

⇒ The result for Letter 1 (anonymized) is shown in Figure 18. The predicted ICD codes are procedure code 38.93 (venous catheterization) and 427.31 (atrial fibrillation).

Figure 18: Letter 1 Outcomes
Figure 19: Letter 2 Outcomes
Figure 20: Letter 3 Outcomes

Letter 2: “ Dear xx xxx,

Thank you for sending xxx to me. I agree that I think she has quite bad psoriasis; I will refer her for phototherapy.

Yours Sincerely,

xx xxx xxx ”

⇒ The result for Letter 2 (anonymized) is shown in Figure 19. The predicted SNOMED mapping for ICD code 244.9 (https://www.findacode.com/icd-9/244-9-hypothyroidism-primary-nos-icd-9-code.html) is 40930008, which is Hypothyroidism (disorder) (https://www.findacode.com/snomed/40930008--hypothyroidism.html). ICD code V45.01 is cardiac pacemaker in situ (https://www.findacode.com/icd-9/v45-01-postsurgical-state-cardiac-pacemaker-icd-9-code.html).
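The ICD-to-SNOMED lookup reported for this letter can be pictured as a simple table lookup. The sketch below is illustrative only and is not the project's mapping code: the table `ICD9_TO_SNOMED` and the function `map_icd9_to_snomed` are hypothetical names, and the table holds just the single pair quoted above (244.9 to 40930008), so every other code returns no mapping.

```python
from typing import Dict, Optional, Tuple

# Illustrative table only: the one ICD-9 -> SNOMED CT pair reported for
# Letter 2. A real mapping resource would contain many thousands of entries.
ICD9_TO_SNOMED: Dict[str, Tuple[str, str]] = {
    "244.9": ("40930008", "Hypothyroidism (disorder)"),
}


def map_icd9_to_snomed(icd9_code: str) -> Optional[Tuple[str, str]]:
    """Return (SNOMED CT id, description) for an ICD-9 code, or None.

    Hypothetical helper: None means the code is absent from this small
    table, not that SNOMED CT lacks a matching concept.
    """
    return ICD9_TO_SNOMED.get(icd9_code.strip())


if __name__ == "__main__":
    for code in ("244.9", "V45.01"):
        result = map_icd9_to_snomed(code)
        if result is None:
            print(code, "-> no mapping in this illustrative table")
        else:
            print(code, "->", result[0], result[1])
```

Returning `None` rather than raising an error lets a caller report unmapped codes explicitly.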

Letter 3: “ Dear xx xxx,

I reviewed xxx following his PCA - this has indeed shown a MI which is clearly causing LVF, as evidenced by his raised BNP. We will proceed to a CABG

xx xxx xxx ”

⇒ The result for Letter 3 (anonymized) is shown in Figure 20. The predicted SNOMED mapping is 42343007, which is congestive heart failure (disorder) (https://bioportal.bioontology.org/ontologies/SNOMEDCT?p=classes&conceptid=42343007). ICD code 96.71 is “continuous invasive mechanical ventilation for less than 96 consecutive hours” (https://www.findacode.com/icd-9/96-71-continuous-mechanical-ventilation-less-than-96-icd-9-procedure-code.html).

Letter 4: “ Dear xxx xxx,

I saw xxx today, he has clearly developed Waldenstroms Macroglubulinaemia, which is unusual given his Tay-Sach’s disease. I will start him on chemotherapy shortly.

Best Wishes,

xxx xxx xxx xxx ”

⇒ Letter 4 (anonymized) result: no codes found.