  • Manually curating a set of candidate types using expert-level knowledge. Here, we refer to the annotation guidelines available in existing datasets since we believe they are curated by domain experts. For “Tr”, “Pr”, “Te” and “DD” we take the annotation guidelines from i2b2 2010. For “CD”, we use i2b2 2012. For “AD” and “ADE”, we use i2b2 2018 Task 2. We list the curated set in \autorefsubsec_decomposer_annotation.

  • Prompting an LLM for automatic generation. We prompt ChatGPT with “You are an intelligent clinical language model. Your job is to extract {entity_type} from a patient’s discharge summary. What entities can be considered as {entity_type} in a discharge summary?” for each entity type. For reproducibility, we present the results in \autorefsubsec_decomposer_chatgpt.

  • Utilizing an existing medical knowledge bank. We use the Unified Medical Language System (UMLS) since it contains standardized medical vocabulary for many clinical entities. Here, we take the UMLS semantic types for “Tr”, “Pr” and “Te” available in i2b2 2010 guidelines. We list the curated set in \autorefsubsec_decomposer_umls.
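As a minimal sketch of the second option, the prompt quoted above can be instantiated once per entity type. The helper name and the entity-type strings passed to it are assumptions made for illustration; only the prompt wording comes from the text.

```python
# Prompt wording copied from the bullet above; everything else is illustrative.
CANDIDATE_PROMPT = (
    "You are an intelligent clinical language model. "
    "Your job is to extract {entity_type} from a patient’s discharge summary. "
    "What entities can be considered as {entity_type} in a discharge summary?"
)


def build_candidate_prompts(entity_types):
    """Return one candidate-generation prompt per entity type (hypothetical helper)."""
    return {t: CANDIDATE_PROMPT.format(entity_type=t) for t in entity_types}


# The exact strings substituted for {entity_type} are an assumption.
prompts = build_candidate_prompts(["treatment", "medical problem", "test"])
for entity_type, prompt in prompts.items():
    print(f"[{entity_type}] {prompt}")
```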

Annotation

Treatment: medical treatment, medical intervention, medical procedure, medical device, treatment, biological substance, drug, medication

Problem: medical problem, disease, syndrome, symptom, medical condition, behavior, virus, bacterium, injury, abnormality, abnormal test result, mental status

Test: medical test, medical procedure, medical panel, medical examination, medical evaluation, test, procedure, laboratory procedure, diagnostic procedure, panel, measure, physiologic measure, vital sign, examination, evaluation

Clinical Department: clinical department, medical department, clinical unit, clinical service, clinical practice, clinical room, department, location, building, hospital

Disease/Disorder: medical problem, disease, syndrome, symptom, medical condition, behavior, virus, bacterium, injury, abnormality, abnormal test result

Adverse Drug: drug

Adverse Drug Event: medical problem

ChatGPT

Treatment: medical treatment, medication, medical procedure, therapy, medical intervention, consultation, counseling, discharge instruction, supportive care

Problem: medical problem, medical diagnosis, disease, abnormal test result, symptom, abnormal imaging finding, complication, chronic health condition, medication side effect, mental health issue, social determinants of health

Test: medical test, laboratory test, imaging study, diagnostic procedure, genetic test, electrodiagnostic test, functional test, microbiological test

UMLS

Treatment: medical treatment, therapeutic procedure, preventive procedure, medical device, steroid, pharmacologic substance, biomedical material, dental material, antibiotic, clinical drug, drug delivery device

Problem: medical problem, pathologic function, disease, syndrome, mental dysfunction, behavioral dysfunction, cell dysfunction, molecular dysfunction, congenital abnormality, acquired abnormality, injury, poisoning, anatomic abnormality, neoplastic process, virus, bacterium, symptom

Test: medical test, laboratory procedure, diagnostic procedure
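The candidate sets above can be stored as a plain mapping from entity type to candidate types. The sketch below assumes that the open NER model is queried once per candidate type and that the outputs are merged by union; this section does not spell out the merging rule, so that part is an assumption, and `toy_extract` is a hypothetical stand-in for the real model call.

```python
# UMLS candidate set for "Test", copied from the list above.
UMLS_CANDIDATES = {
    "Test": ["medical test", "laboratory procedure", "diagnostic procedure"],
}


def decompose_and_extract(text, entity_type, extract_fn, candidates=UMLS_CANDIDATES):
    """Query extract_fn once per candidate type and merge the spans (assumed union)."""
    found = []
    for candidate_type in candidates[entity_type]:
        for span in extract_fn(text, candidate_type):
            if span not in found:  # keep first occurrence, drop duplicates
                found.append(span)
    return found


def toy_extract(text, candidate_type):
    """Hypothetical stand-in for an open NER model: a tiny lexicon lookup."""
    lexicon = {
        "laboratory procedure": ["CBC"],
        "diagnostic procedure": ["chest x-ray", "CBC"],
    }
    return [span for span in lexicon.get(candidate_type, []) if span in text]


spans = decompose_and_extract("CBC and chest x-ray were ordered.", "Test", toy_extract)
print(spans)  # ['CBC', 'chest x-ray']
```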

Table \thetable: Extension of \autoreftable_main for Precision (\%).
\begin{tabular}{llrrrrrrrrrrr}
\toprule
\multirow{2}{*}{Dataset} & \multirow{2}{*}{Entity} & \multicolumn{5}{c}{UniNER} & \multicolumn{5}{c}{GNER} & \multirow{2}{*}{UniNER-all} \\
\cmidrule(r){3-7}\cmidrule(r){8-12}
 & & B & ED & F & EDF & $\Delta$ & B & ED & F & EDF & $\Delta$ & \\
\midrule
\multirow{3}{*}{i2b2 2010} & Tr & 51.63 & 30.35 & 61.71 & 55.09 & $+3.46$ & 46.08 & 27.19 & 74.16 & 66.60 & $+20.52$ & 80.63 \\
 & Pr & 44.95 & 30.94 & 56.02 & 47.15 & $+2.20$ & 33.71 & 25.61 & 55.43 & 47.63 & $+13.92$ & 75.87 \\
 & Te & 53.51 & 26.04 & 58.67 & 48.84 & $-4.67$ & 32.70 & 24.73 & 59.91 & 53.73 & $+21.03$ & 79.14 \\
\midrule
\multirow{4}{*}{i2b2 2012} & Tr & 57.09 & 36.72 & 65.25 & 59.08 & $+1.99$ & 48.05 & 29.56 & 71.32 & 63.40 & $+15.35$ & 81.10 \\
 & Pr & 42.93 & 32.65 & 53.03 & 46.87 & $+3.94$ & 37.49 & 28.71 & 52.72 & 46.57 & $+9.08$ & 78.97 \\
 & Te & 51.35 & 21.58 & 58.34 & 47.04 & $-4.31$ & 29.96 & 18.15 & 57.95 & 49.69 & $+19.73$ & 72.88 \\
 & CD & 35.87 & 19.94 & 54.85 & 47.56 & $+11.69$ & 55.11 & 11.73 & 57.17 & 50.77 & $-4.34$ & 59.19 \\
\midrule
CLEF 2014 & DD & 69.14 & 34.05 & 79.01 & 55.95 & $-13.19$ & 29.10 & 16.29 & 40.85 & 28.18 & $-0.92$ & 78.71 \\
\midrule
\multirow{2}{*}{i2b2 2018} & AD & 12.43 & 4.54 & 20.93 & 15.09 & $+2.66$ & 1.67 & 2.43 & 6.20 & 8.79 & $+7.12$ & 12.32 \\
 & ADE & 6.04 & 1.36 & 12.36 & 5.23 & $-0.81$ & 0.33 & 0.56 & 1.76 & 2.34 & $+2.01$ & \\
\midrule
Avg. & & 42.49 & 23.82 & 52.02 & 42.79 & $+0.30$ & 31.42 & 18.50 & 47.75 & 41.77 & $+10.35$ & 64.76 \\
\bottomrule
\end{tabular}
Table \thetable: Extension of \autoreftable_main for Recall (\%).
\begin{tabular}{llrrrrrrrrrrr}
\toprule
\multirow{2}{*}{Dataset} & \multirow{2}{*}{Entity} & \multicolumn{5}{c}{UniNER} & \multicolumn{5}{c}{GNER} & \multirow{2}{*}{UniNER-all} \\
\cmidrule(r){3-7}\cmidrule(r){8-12}
 & & B & ED & F & EDF & $\Delta$ & B & ED & F & EDF & $\Delta$ & \\
\midrule
\multirow{3}{*}{i2b2 2010} & Tr & 56.18 & 77.13 & 48.57 & 64.98 & $+8.80$ & 63.25 & 71.70 & 54.33 & 60.18 & $-3.07$ & 70.02 \\
 & Pr & 55.56 & 65.28 & 49.87 & 56.64 & $+1.08$ & 50.39 & 60.12 & 46.83 & 54.03 & $+3.64$ & 70.55 \\
 & Te & 44.81 & 63.84 & 31.02 & 40.32 & $-4.49$ & 43.32 & 68.94 & 27.01 & 39.88 & $-3.44$ & 66.76 \\
\midrule
\multirow{4}{*}{i2b2 2012} & Tr & 52.11 & 70.34 & 44.37 & 57.45 & $+5.34$ & 52.95 & 62.61 & 45.61 & 51.92 & $-1.03$ & 65.34 \\
 & Pr & 50.99 & 60.64 & 46.65 & 54.12 & $+3.13$ & 45.50 & 56.20 & 42.64 & 51.29 & $+5.79$ & 71.70 \\
 & Te & 41.30 & 58.42 & 35.29 & 45.65 & $+4.35$ & 37.15 & 57.18 & 31.83 & 43.98 & $+6.83$ & 59.43 \\
 & CD & 49.04 & 88.78 & 25.18 & 32.56 & $-16.48$ & 63.20 & 79.88 & 29.83 & 30.03 & $-33.17$ & 35.49 \\
\midrule
CLEF 2014 & DD & 35.29 & 70.62 & 31.71 & 60.75 & $+25.46$ & 13.10 & 29.99 & 11.90 & 25.74 & $+12.64$ & 52.79 \\
\midrule
\multirow{2}{*}{i2b2 2018} & AD & 40.93 & 77.74 & 39.50 & 71.63 & $+30.70$ & 30.34 & 34.65 & 26.93 & 30.70 & $+0.36$ & 17.24 \\
 & ADE & 22.86 & 52.88 & 22.27 & 48.11 & $+25.25$ & 3.78 & 16.90 & 3.78 & 15.31 & $+11.53$ & 34.00 \\
\midrule
Avg. & & 44.91 & 68.57 & 37.44 & 53.22 & $+8.31$ & 40.30 & 53.82 & 32.07 & 40.31 & $+0.01$ & 54.33 \\
\bottomrule
\end{tabular}

1 Datasets

We include all entity types for i2b2 2010, ClinicalIE, and CLEF 2014. For i2b2 2012, UniversalNER and GNER performed poorly on the last two entity types (evidence and occurrence), so we exclude them. We attribute the poor performance to these entity types consisting mostly of verb phrases, whereas the training data consists mainly of noun entities. For i2b2 2018 Task 2, we test our method on a more challenging setup: extracting adverse drugs and adverse drug events [henry20202018].

2 Recall and Precision Performance

We provide the precision and recall for each dataset and entity type from \autoreftable_main in \autoreftable_precision and \autoreftable_recall, respectively. We observe a similar trend for both metrics. Furthermore, on average, our framework mainly improves recall for UniNER ($+8.31$ recall vs. $+0.30$ precision) and precision for GNER ($+10.35$ precision vs. $+0.01$ recall).
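The $\Delta$ columns in both tables are consistent with the difference between the EDF and B columns. A small check of that reading (the function name is illustrative, and the interpretation of $\Delta$ is inferred from the reported numbers rather than stated here):

```python
def delta(baseline, edf):
    """Gain of EDF over the baseline B, in percentage points (assumed definition)."""
    return round(edf - baseline, 2)


# (B, EDF) pairs copied from the precision and recall tables.
rows = {
    "UniNER / i2b2 2010 Tr / precision": (51.63, 55.09),
    "GNER / i2b2 2012 CD / recall": (63.20, 30.03),
}
for name, (b, edf) in rows.items():
    print(f"{name}: {delta(b, edf):+.2f}")
```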

3 Filter Prompt

We experiment with different ways to prompt in \autorefsubsubsec_result_filter_prompt and provide the specific instructions here.

\thesubsection Without Description (Default)

Can ’{entity}’ be considered a/an {entity_type}? Answer with yes or no.

\thesubsection With Description

Treatment: Can ’{entity}’ be considered a procedure or substance given to a patient to resolve a medical problem? Answer with yes or no.

Problem: Can ’{entity}’ be considered an observation thought to be abnormal or caused by a disease? Answer with yes or no.

Test: Can ’{entity}’ be considered a procedure or measure to find more information about a medical problem? Answer with yes or no.

Clinical Department: Can ’{entity}’ be considered a clinical unit or clinical service name? Answer with yes or no.
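The two prompt variants can be generated from one helper, as in the sketch below. The function name, the lower-case dictionary keys, and the use of straight quotes around the entity are illustrative choices; the wording of the questions is copied from the templates above.

```python
DEFAULT_TEMPLATE = "Can '{entity}' be considered a/an {entity_type}? Answer with yes or no."

# Descriptions copied from the "With Description" prompts.
DESCRIPTIONS = {
    "treatment": "a procedure or substance given to a patient to resolve a medical problem",
    "problem": "an observation thought to be abnormal or caused by a disease",
    "test": "a procedure or measure to find more information about a medical problem",
    "clinical department": "a clinical unit or clinical service name",
}


def build_filter_prompt(entity, entity_type, use_description=False):
    """Return the yes/no filter question for one candidate entity (hypothetical helper)."""
    if use_description:
        description = DESCRIPTIONS[entity_type]
        return f"Can '{entity}' be considered {description}? Answer with yes or no."
    return DEFAULT_TEMPLATE.format(entity=entity, entity_type=entity_type)


print(build_filter_prompt("aspirin", "treatment"))
print(build_filter_prompt("ICU", "clinical department", use_description=True))
```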

4 Few-shot Experiment

Here, we include annotated samples in our framework and compare the approach to standard in-context learning. We randomly sample examples from the annotation guidelines and add them to the UniversalNER prompt. We also guarantee that at least one sample contains no entities of interest (e.g., a sentence that does not contain treatments or medical problems). Interestingly, performance degrades across entity types as we add more samples. These results contrast with those reported for general LLMs [xie2023empirical] and further suggest that open NER LLMs cannot be treated in the same way. This also applies to our framework, although the degradation is less severe than with standard in-context learning. We remark that performance drops with in-context learning are not uncommon: previous works [zhao2021calibrate, zhu2023promptbench] show that in-context learning performance can be unstable. In addition, few-shot experiments are uncommon for zero-shot NER tasks [zhou2023universalner, ding2024rethinking, zaratiana2023gliner], even when they use LLMs. Our work reveals that open NER LLMs may not benefit from in-context learning and differ from general LLMs in this respect. We leave further investigation to future work.
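One way to realize the sampling constraint described above is to draw the entity-free example first and fill the remaining slots at random. This is a sketch under assumptions: the function name, the `(sentence, entities)` pair representation, and the example sentences are all made up for illustration.

```python
import random


def sample_few_shot(pool, k, seed=0):
    """Sample k examples from pool, with at least one that has no entities.

    pool is a list of (sentence, entities) pairs; entities may be empty.
    """
    rng = random.Random(seed)
    negatives = [i for i, (_, entities) in enumerate(pool) if not entities]
    if k < 1 or k > len(pool) or not negatives:
        raise ValueError("need 1 <= k <= len(pool) and an entity-free example")
    first = rng.choice(negatives)  # guaranteed entity-free example
    others = [i for i in range(len(pool)) if i != first]
    picked = [first] + rng.sample(others, k - 1)
    rng.shuffle(picked)  # avoid always placing the entity-free example first
    return [pool[i] for i in picked]


# Hypothetical annotated sentences, not taken from any guideline.
pool = [
    ("Aspirin was started for chest pain.", ["Aspirin", "chest pain"]),
    ("The patient was seen in clinic today.", []),
    ("A CT scan showed no fracture.", ["CT scan", "fracture"]),
    ("She will follow up in two weeks.", []),
]
shots = sample_few_shot(pool, k=3)
```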



Figure \thefigure: Few-shot performance comparison. We observe a performance drop when using in-context learning (ICL). In contrast, our method (EDF) is more robust. We use the i2b2 2012 dataset with entity types treatment (Tr), problem (Pr), and test (Te).
Table \thetable: Performance on GLiNER.
\begin{tabular}{lllrrrrr}
\toprule
\multirow{2}{*}{Dataset} & \multirow{2}{*}{Entity} & \multirow{2}{*}{Metric} & \multicolumn{5}{c}{GLiNER} \\
\cmidrule{4-8}
 & & & B & ED & F & EDF & $\Delta$ \\
\midrule
\multirow{9}{*}{i2b2 2010} & \multirow{3}{*}{Tr} & P & 52.03 & 35.70 & 70.79 & 66.71 & $+14.68$ \\
 & & R & 44.55 & 76.13 & 39.86 & 63.93 & $+19.38$ \\
\cmidrule{3-8}
 & & F1 & 48.00 & 48.61 & 51.00 & 65.29 & $+17.29$ \\
\cmidrule{2-8}
 & \multirow{3}{*}{Pr} & P & 71.19 & 48.48 & 79.13 & 67.32 & $-3.87$ \\
 & & R & 49.22 & 63.36 & 46.16 & 56.49 & $+7.27$ \\
\cmidrule{3-8}
 & & F1 & 58.20 & 54.93 & 58.31 & 61.43 & $+3.23$ \\
\cmidrule{2-8}
 & \multirow{3}{*}{Te} & P & 42.80 & 22.93 & 63.77 & 56.65 & $+13.85$ \\
 & & R & 27.23 & 55.63 & 23.43 & 39.14 & $+11.91$ \\
\cmidrule{3-8}
 & & F1 & 33.28 & 32.47 & 34.27 & 46.30 & $+13.02$ \\
\midrule
\multirow{12}{*}{i2b2 2012} & \multirow{3}{*}{Tr} & P & 53.77 & 38.88 & 71.83 & 66.92 & $+13.15$ \\
 & & R & 48.37 & 69.76 & 42.87 & 58.03 & $+9.66$ \\
\cmidrule{3-8}
 & & F1 & 50.93 & 49.93 & 53.69 & 62.16 & $+11.23$ \\
\cmidrule{2-8}
 & \multirow{3}{*}{Pr} & P & 71.67 & 51.93 & 77.68 & 67.35 & $-4.32$ \\
 & & R & 50.33 & 63.93 & 47.27 & 58.06 & $+7.73$ \\
\cmidrule{3-8}
 & & F1 & 59.13 & 57.32 & 58.78 & 62.36 & $+3.23$ \\
\cmidrule{2-8}
 & \multirow{3}{*}{Te} & P & 43.97 & 19.72 & 66.72 & 55.42 & $+11.45$ \\
 & & R & 39.17 & 60.09 & 35.17 & 48.41 & $+9.24$ \\
\cmidrule{3-8}
 & & F1 & 41.13 & 29.69 & 46.06 & 51.68 & $+10.25$ \\
\cmidrule{2-8}
 & \multirow{3}{*}{CD} & P & 48.69 & 22.99 & 58.28 & 50.08 & $+1.39$ \\
 & & R & 71.59 & 88.27 & 29.52 & 32.96 & $-38.63$ \\
\cmidrule{3-8}
 & & F1 & 57.96 & 36.48 & 39.19 & 39.76 & $-18.20$ \\
\midrule
\multirow{3}{*}{CLEF 2014} & \multirow{3}{*}{DD} & P & 65.32 & 41.83 & 72.26 & 59.42 & $-5.90$ \\
 & & R & 27.90 & 48.17 & 26.00 & 42.99 & $+15.99$ \\
\cmidrule{3-8}
 & & F1 & 39.09 & 44.78 & 38.24 & 49.89 & $+10.80$ \\
\midrule
\multirow{6}{*}{i2b2 2018} & \multirow{3}{*}{AD} & P & 2.31 & 3.52 & 6.47 & 13.50 & $+11.19$ \\
 & & R & 5.39 & 67.15 & 5.39 & 61.40 & $+56.01$ \\
\cmidrule{3-8}
 & & F1 & 3.23 & 6.69 & 5.88 & 22.13 & $+18.90$ \\
\cmidrule{2-8}
 & \multirow{3}{*}{ADE} & P & 7.42 & 2.17 & 15.03 & 7.70 & $+0.28$ \\
 & & R & 14.31 & 44.93 & 13.12 & 40.95 & $+26.64$ \\
\cmidrule{3-8}
 & & F1 & 9.77 & 4.15 & 14.01 & 12.96 & $+3.19$ \\
\midrule
\multirow{3}{*}{Avg.} & & P & 45.92 & 28.82 & 58.20 & 51.11 & $+5.19$ \\
 & & R & 37.81 & 63.74 & 30.88 & 50.24 & $+12.43$ \\
\cmidrule{3-8}
 & & F1 & 40.10 & 36.51 & 39.94 & 47.40 & $+7.29$ \\
\bottomrule
\end{tabular}

5 Performance on BERT-based Models

We use GLiNER [zaratiana2023gliner], a BERT-based model for open named entity recognition. Note that previous prompt engineering methods cannot be applied here. We conduct the experiment in the same way as for UniNER and GNER, with Xie et al. [xie2023empirical] as our baseline. We present the results in \autoreftable_gliner. We observe the same trend as in \autorefsubsec_overall_perf, with an average F1-score improvement of $7.29\%$.
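For reference, the per-row F1 values can be related to precision and recall through the standard harmonic mean. The sketch below assumes that definition (it is not restated in this section) and checks it against the baseline (B) values of the i2b2 2010 treatment and problem rows of the GLiNER table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given in percent."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Baseline (B) precision and recall copied from the GLiNER table.
print(round(f1_score(52.03, 44.55), 2))  # i2b2 2010 Tr, reported F1: 48.00
print(round(f1_score(71.19, 49.22), 2))  # i2b2 2010 Pr, reported F1: 58.20
```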

6 Performance Drop on CD

We observe significant recall drops for “clinical department” entities across all models. Here, we posit that some annotated entities do not strictly correspond to a clinical department in the clinical domain. For instance, some entities are hospitals; thus, if a filter is prompted with our template (e.g., “Can hospitals be considered as a clinical department?”), it is likely to reject them. One possible solution is to use a clear entity description. As illustrated in \autoreftable_filter_ent_desc, our framework outperforms the baseline (e.g., $45.97\%$ vs. $38.66\%$ F1-score, respectively) when using an entity description.

7 \Ours Filter Precision/Recall Trade-off

We analyze how filtering can be made more or less strict to achieve better trade-offs. We use the filter output probability to determine whether an entity is rejected. Concretely, rather than directly rejecting an entity when the filter outputs “No”, we first look at the token probability. If it is less than a certain threshold, we then reject them. Our framework reduces to entity decomposition if the threshold is $1$. We provide the results in \autoreffig_filter_tradeoff. Overall, increasing the threshold leads to decreased precision and improved recall. Interestingly, better thresholding can improve the F1-score for “clinical department” entities. This might be due to the noise in these entities, as described in \autorefapp_drop_cd.
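The description leaves open which token probability is compared and in which direction. The sketch below assumes one reading that matches the stated limits (a threshold of $1$ disables the filter, a threshold of $0$ recovers the default behaviour, and a larger threshold rejects fewer candidates): a candidate is rejected only when the filter answers “No” with probability above the threshold. The function names and the toy probabilities are hypothetical.

```python
def keep_candidate(filter_answer, no_probability, threshold):
    """Keep the candidate unless the filter says "No" with probability > threshold.

    Assumed reading of the thresholding rule, not a confirmed implementation.
    """
    says_no = filter_answer.strip().lower().rstrip(".") == "no"
    return not (says_no and no_probability > threshold)


def apply_filter(candidates, threshold):
    """candidates: list of (entity, filter_answer, probability_of_no) triples."""
    return [
        entity
        for entity, answer, p_no in candidates
        if keep_candidate(answer, p_no, threshold)
    ]


# Toy values for illustration only.
candidates = [("aspirin", "yes", 0.02), ("hospital", "no", 0.60), ("today", "no", 0.97)]
for threshold in (0.0, 0.8, 1.0):
    print(threshold, apply_filter(candidates, threshold))
```

Under this reading, raising the threshold keeps more candidates, which trades precision for recall as reported.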



Figure \thefigure: Filter Precision/Recall Trade-off. There is an improvement in recall but a decrease in precision when increasing the threshold. The dashed line corresponds to performance with threshold $=0$. We use the i2b2 2012 dataset.

8 LLM Prompt Templates

Our experiments involve large language models, which are often trained with specific templates. We use their default templates (except Llama2) throughout the experiments and present them here.

\thesubsection UniNER

A virtual assistant answers questions from a user based on the provided text.
USER: Text: {input}
ASSISTANT: I’ve read this text.
USER: {instruction}

\thesubsection GNER

[INST] Please analyze the sentence provided, identifying the type of entity for each word on a token-by-token basis.
Output format is: word_1(label_1), word_2(label_2), ...
We’ll use the BIO-format to label the entities, where:
1. B- (Begin) indicates the start of a named entity.
2. I- (Inside) is used for words within a named entity but are not the first word.
3. O (Outside) denotes words that are not part of a named entity.
Sentence: {input} [/INST]
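Because this template asks for token-level output of the form `word_1(label_1), word_2(label_2), ...`, the response has to be converted back into entity spans before scoring. A minimal parsing sketch follows; the function name and regular expression are illustrative, and it assumes tokens are separated by a comma and a space and that words contain no parentheses.

```python
import re

# One "word(label)" item; labels are O, B-<type>, or I-<type>.
TOKEN_RE = re.compile(r"(\S+?)\((B-[^()]+|I-[^()]+|O)\)")


def parse_bio_output(output):
    """Convert 'word(label), ...' text into a list of (entity_text, entity_type)."""
    entities, words, current = [], [], None
    for word, label in TOKEN_RE.findall(output):
        if label.startswith("B-"):
            if words:  # close the previous entity before starting a new one
                entities.append((" ".join(words), current))
            words, current = [word], label[2:]
        elif label.startswith("I-") and current == label[2:]:
            words.append(word)  # continue the open entity of the same type
        else:
            # "O", or an I- tag that does not continue the open entity
            if words:
                entities.append((" ".join(words), current))
            words, current = [], None
    if words:
        entities.append((" ".join(words), current))
    return entities


print(parse_bio_output("chest(B-test), x-ray(I-test), was(O), normal(O)"))
```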

\thesubsection Asclepius

You are an intelligent clinical languge model.
Below is a snippet of patient’s discharge summary and a following instruction from healthcare professional.
Write a response that appropriately completes the instruction.
The response should provide the accurate answer to the instruction, while being concise.

[Discharge Summary Begin]
[Discharge Summary End]

[Instruction Begin]
[Instruction End]

\thesubsection Llama2

<s>[INST] <<SYS>>
You are an intelligent clinical languge model.
Below is an instruction from healthcare professional.
Write a response that appropriately completes the instruction.
The response should provide the accurate answer to the instruction, while being concise.
<</SYS>>

{instruction} [/INST]

9 LLM Hyperparameters

We use the default hyperparameters for each model. For UniNER and GNER, we use greedy search. For Asclepius and Llama2, we use temperature and top $P$ probability $0.95$.