*Appendix \labelapp
Entity Decomposers \labelapp_entity_decomposer
We provide the details for each of our entity decomposition methods described in \autorefsubsec_exp_decomposer here:
• Manually curating a set of candidate types using expert-level knowledge. Here, we refer to the annotation guidelines available in existing datasets since we believe they are curated by domain experts. For “Tr”, “Pr”, “Te” and “DD” we take the annotation guidelines from i2b2 2010. For “CD”, we use i2b2 2012. For “AD” and “ADE”, we use i2b2 2018 Task 2. We list the curated set in \autorefsubsec_decomposer_annotation.
• Prompting an LLM for automatic generation. We prompt ChatGPT with “You are an intelligent clinical language model. Your job is to extract {entity_type} from a patient’s discharge summary. What entities can be considered as {entity_type} in a discharge summary?” for each entity type. For reproducibility, we present the results in \autorefsubsec_decomposer_chatgpt.
• Utilizing an existing medical knowledge bank. We use the Unified Medical Language System (UMLS) since it contains standardized medical vocabulary for many clinical entities. Here, we take the UMLS semantic types for “Tr”, “Pr” and “Te” available in i2b2 2010 guidelines. We list the curated set in \autorefsubsec_decomposer_umls.
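The LLM-based decomposer in the second item above amounts to filling one prompt template per entity type. A minimal sketch follows; the function name and the lowercase entity-type strings are assumptions for illustration, and the call to ChatGPT itself is left out.

```python
# Prompt text quoted from the second item above; {entity_type} is the only slot.
DECOMPOSER_PROMPT = (
    "You are an intelligent clinical language model. "
    "Your job is to extract {entity_type} from a patient's discharge summary. "
    "What entities can be considered as {entity_type} in a discharge summary?"
)


def build_decomposer_prompt(entity_type: str) -> str:
    """Fill the decomposer template for one entity type (hypothetical helper)."""
    return DECOMPOSER_PROMPT.format(entity_type=entity_type)


if __name__ == "__main__":
    # The exact strings substituted for {entity_type} are an assumption.
    for entity_type in ("treatment", "problem", "test"):
        print(build_decomposer_prompt(entity_type))
```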
\thesubsection Annotation
Treatment: medical treatment, medical intervention, medical procedure, medical device, treatment, biological substance, drug, medication
Problem: medical problem, disease, syndrome, symptom, medical condition, behavior, virus, bacterium, injury, abnormality, abnormal test result, mental status
Test: medical test, medical procedure, medical panel, medical examination, medical evaluation, test, procedure, laboratory procedure, diagnostic procedure, panel, measure, physiologic measure, vital sign, examination, evaluation
Clinical Department: clinical department, medical department, clinical unit, clinical service, clinical practice, clinical room, department, location, building, hospital
Disease/Disorder: medical problem, disease, syndrome, symptom, medical condition, behavior, virus, bacterium, injury, abnormality, abnormal test result
Adverse Drug: drug
Adverse Drug Event: medical problem
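As a rough illustration of how such a curated set could be used, the sketch below stores three of the lists above in a dictionary and queries an extractor once per candidate type, taking the union of the returned spans. The union step, the function names, and the toy extractor are assumptions made for illustration; this appendix does not specify how per-candidate outputs are merged.

```python
from typing import Callable, Dict, Iterable, List, Set

# Three entries copied from the curated list above; the others are omitted.
DECOMPOSED_TYPES: Dict[str, List[str]] = {
    "Clinical Department": [
        "clinical department", "medical department", "clinical unit",
        "clinical service", "clinical practice", "clinical room",
        "department", "location", "building", "hospital",
    ],
    "Adverse Drug": ["drug"],
    "Adverse Drug Event": ["medical problem"],
}

Extractor = Callable[[str, str], Iterable[str]]


def extract_with_decomposition(text: str, entity_type: str,
                               extract: Extractor) -> Set[str]:
    """Query the extractor once per candidate type and union the spans."""
    spans: Set[str] = set()
    for candidate in DECOMPOSED_TYPES[entity_type]:
        spans.update(extract(text, candidate))
    return spans


def toy_extract(text: str, candidate: str) -> List[str]:
    """Stand-in for an open NER model: a fixed keyword lookup (made up)."""
    lexicon = {"department": ["cardiology"], "hospital": ["General Hospital"]}
    return [mention for mention in lexicon.get(candidate, []) if mention in text]


if __name__ == "__main__":
    sentence = "Patient was transferred to cardiology at General Hospital."
    print(extract_with_decomposition(sentence, "Clinical Department", toy_extract))
```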
\thesubsection ChatGPT
Treatment: medical treatment, medication, medical procedure, therapy, medical intervention, consultation, counseling, discharge instruction, supportive care
Problem: medical problem, medical diagnosis, disease, abnormal test result, symptom, abnormal imaging finding, complication, chronic health condition, medication side effect, mental health issue, social determinants of health
Test: medical test, laboratory test, imaging study, diagnostic procedure, genetic test, electrodiagnostic test, functional test, microbiological test
\thesubsection UMLS
Treatment: medical treatment, therapeutic procedure, preventive procedure, medical device, steroid, pharmacologic substance, biomedical material, dental material, antibiotic, clinical drug, drug delivery device
Problem: medical problem, pathologic function, disease, syndrome, mental dysfunction, behavioral dysfunction, cell dysfunction, molecular dysfunction, congenital abnormality, acquired abnormality, injury, poisoning, anatomic abnormality, neoplastic process, virus, bacterium, symptom
Test: medical test, laboratory procedure, diagnostic procedure
| Dataset | Entity | UniNER B | UniNER ED | UniNER F | UniNER EDF | UniNER Δ | GNER B | GNER ED | GNER F | GNER EDF | GNER Δ | UniNER-all (Supervised) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i2b2 2010 | Tr | 51.63 | 30.35 | 61.71 | 55.09 | 3.46 | 46.08 | 27.19 | 74.16 | 66.60 | 20.52 | 80.63 |
| i2b2 2010 | Pr | 44.95 | 30.94 | 56.02 | 47.15 | 2.20 | 33.71 | 25.61 | 55.43 | 47.63 | 13.92 | 75.87 |
| i2b2 2010 | Te | 53.51 | 26.04 | 58.67 | 48.84 | 4.67 | 32.70 | 24.73 | 59.91 | 53.73 | 21.03 | 79.14 |
| i2b2 2012 | Tr | 57.09 | 36.72 | 65.25 | 59.08 | 1.99 | 48.05 | 29.56 | 71.32 | 63.40 | 15.35 | 81.10 |
| i2b2 2012 | Pr | 42.93 | 32.65 | 53.03 | 46.87 | 3.94 | 37.49 | 28.71 | 52.72 | 46.57 | 9.08 | 78.97 |
| i2b2 2012 | Te | 51.35 | 21.58 | 58.34 | 47.04 | 4.31 | 29.96 | 18.15 | 57.95 | 49.69 | 19.73 | 72.88 |
| i2b2 2012 | CD | 35.87 | 19.94 | 54.85 | 47.56 | 11.69 | 55.11 | 11.73 | 57.17 | 50.77 | 4.34 | 59.19 |
| CLEF 2014 | DD | 69.14 | 34.05 | 79.01 | 55.95 | 13.19 | 29.10 | 16.29 | 40.85 | 28.18 | 0.92 | 78.71 |
| i2b2 2018 | AD | 12.43 | 4.54 | 20.93 | 15.09 | 2.66 | 1.67 | 2.43 | 6.20 | 8.79 | 7.12 | 12.32 |
| i2b2 2018 | ADE | 6.04 | 1.36 | 12.36 | 5.23 | 0.81 | 0.33 | 0.56 | 1.76 | 2.34 | 2.01 | |
| Avg. | | 42.49 | 23.82 | 52.02 | 42.79 | 0.30 | 31.42 | 18.50 | 47.75 | 41.77 | 10.35 | 64.76 |
| Dataset | Entity | UniNER B | UniNER ED | UniNER F | UniNER EDF | UniNER Δ | GNER B | GNER ED | GNER F | GNER EDF | GNER Δ | UniNER-all (Supervised) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| i2b2 2010 | Tr | 56.18 | 77.13 | 48.57 | 64.98 | 8.80 | 63.25 | 71.70 | 54.33 | 60.18 | 3.07 | 70.02 |
| i2b2 2010 | Pr | 55.56 | 65.28 | 49.87 | 56.64 | 1.08 | 50.39 | 60.12 | 46.83 | 54.03 | 3.64 | 70.55 |
| i2b2 2010 | Te | 44.81 | 63.84 | 31.02 | 40.32 | 4.49 | 43.32 | 68.94 | 27.01 | 39.88 | 3.44 | 66.76 |
| i2b2 2012 | Tr | 52.11 | 70.34 | 44.37 | 57.45 | 5.34 | 52.95 | 62.61 | 45.61 | 51.92 | 1.03 | 65.34 |
| i2b2 2012 | Pr | 50.99 | 60.64 | 46.65 | 54.12 | 3.13 | 45.50 | 56.20 | 42.64 | 51.29 | 5.79 | 71.70 |
| i2b2 2012 | Te | 41.30 | 58.42 | 35.29 | 45.65 | 4.35 | 37.15 | 57.18 | 31.83 | 43.98 | 6.83 | 59.43 |
| i2b2 2012 | CD | 49.04 | 88.78 | 25.18 | 32.56 | 16.48 | 63.20 | 79.88 | 29.83 | 30.03 | 33.17 | 35.49 |
| CLEF 2014 | DD | 35.29 | 70.62 | 31.71 | 60.75 | 25.46 | 13.10 | 29.99 | 11.90 | 25.74 | 12.64 | 52.79 |
| i2b2 2018 | AD | 40.93 | 77.74 | 39.50 | 71.63 | 30.70 | 30.34 | 34.65 | 26.93 | 30.70 | 0.36 | 17.24 |
| i2b2 2018 | ADE | 22.86 | 52.88 | 22.27 | 48.11 | 25.25 | 3.78 | 16.90 | 3.78 | 15.31 | 11.53 | 34.00 |
| Avg. | | 44.91 | 68.57 | 37.44 | 53.22 | 8.31 | 40.30 | 53.82 | 32.07 | 40.31 | 0.01 | 54.33 |
1 Datasets
We include all entities for i2b2 2010, ClinicalIE, and CLEF 2014. For i2b2 2012, we found that UniversalNER and GNER performed poorly on the last two entity types (evidence and occurrence) and decided to exclude them. We attribute this to these types consisting mostly of verb phrases, whereas the training dataset consists mainly of noun entities. For i2b2 2018 Task 2, we test our method on a more challenging setup: extracting adverse drugs and adverse drug events [henry20202018].
2 Recall and Precision Performance
We report precision and recall for each dataset and entity type from \autoreftable_main in \autoreftable_precision and \autoreftable_recall, respectively. We observe a similar trend for both metrics. Furthermore, the average gains from our framework differ by model: UniNER improves mainly in recall (44.91 to 53.22, versus 42.49 to 42.79 for precision), while GNER improves mainly in precision (31.42 to 41.77, versus 40.30 to 40.31 for recall).
3 Filter Prompt
We experiment with different ways to prompt in \autorefsubsubsec_result_filter_prompt and provide the specific instructions here.
\thesubsection Without Description (Default)
Can ’{entity}’ be considered a/an {entity_type}? Answer with yes or no.
\thesubsection With Description
Treatment: Can ’{entity}’ be considered a procedure or substance given to a patient to resolve a medical problem? Answer with yes or no.
Problem: Can ’{entity}’ be considered an observation thought to be abnormal or caused by a disease? Answer with yes or no.
Test: Can ’{entity}’ be considered a procedure or measure to find more information about a medical problem? Answer with yes or no.
Clinical Department: Can ’{entity}’ be considered a clinical unit or clinical service name? Answer with yes or no.
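The two prompt variants above differ only in what follows "be considered". A small sketch of how they could be generated is given below; the helper names, the lowercasing of the entity type in the default variant, and the yes/no parsing rule are assumptions, not the procedure used in the experiments.

```python
# Templates and descriptions are taken from the two subsections above.
DEFAULT_TEMPLATE = (
    "Can '{entity}' be considered a/an {entity_type}? Answer with yes or no."
)
DESCRIPTION_TEMPLATE = (
    "Can '{entity}' be considered {description}? Answer with yes or no."
)
DESCRIPTIONS = {
    "Treatment": "a procedure or substance given to a patient to resolve a medical problem",
    "Problem": "an observation thought to be abnormal or caused by a disease",
    "Test": "a procedure or measure to find more information about a medical problem",
    "Clinical Department": "a clinical unit or clinical service name",
}


def build_filter_prompt(entity: str, entity_type: str,
                        use_description: bool = False) -> str:
    """Return the filter question for one candidate entity (hypothetical helper)."""
    if use_description:
        return DESCRIPTION_TEMPLATE.format(
            entity=entity, description=DESCRIPTIONS[entity_type]
        )
    # Lowercasing the type name is an assumption about how the slot is filled.
    return DEFAULT_TEMPLATE.format(entity=entity, entity_type=entity_type.lower())


def is_accepted(answer: str) -> bool:
    """Assumed parsing rule: keep the entity if the reply starts with 'yes'."""
    return answer.strip().lower().startswith("yes")


if __name__ == "__main__":
    print(build_filter_prompt("aspirin", "Treatment"))
    print(build_filter_prompt("aspirin", "Treatment", use_description=True))
```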
4 Few-shot Experiment
Here, we include annotated samples in our framework and compare the approach to standard in-context learning. We randomly sample examples from the annotation guidelines and add them to the UniversalNER prompt. We also guarantee that at least one sample contains no entities of interest (e.g., a sentence with no treatments or medical problems). Interestingly, performance degrades across entity types as more samples are used. These results contrast with those reported for general LLMs [xie2023empirical] and further support the view that open NER LLMs cannot be treated in the same way. We observe that this also applies to our framework, although the degradation is less severe than with standard in-context learning. Performance drops with in-context learning are not uncommon: previous work [zhao2021calibrate, zhu2023promptbench] shows that in-context learning performance can be unstable. In addition, few-shot experiments are uncommon for zero-shot NER tasks [zhou2023universalner, ding2024rethinking, zaratiana2023gliner], even when LLMs are used. Our work reveals that open NER LLMs may not benefit from in-context learning and differ from general LLMs in this respect. We leave further investigation to future work.
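The sampling constraint described above (random demonstrations, with at least one that contains no entity of interest) can be sketched as follows. The example pool, the function name, and the seeding are made up for illustration; the actual samples come from the annotation guidelines.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[str, List[str]]  # (sentence, gold entities of the target type)

# Made-up sentences standing in for guideline examples.
EXAMPLE_POOL: List[Example] = [
    ("Started on aspirin.", ["aspirin"]),
    ("Given metformin.", ["metformin"]),
    ("Patient resting comfortably.", []),
    ("Continue lisinopril.", ["lisinopril"]),
]


def sample_demonstrations(pool: Sequence[Example], k: int,
                          seed: int = 0) -> List[Example]:
    """Pick k distinct examples, forcing at least one with no entities."""
    if not 1 <= k <= len(pool):
        raise ValueError("k must be between 1 and the pool size")
    negatives = [ex for ex in pool if not ex[1]]
    if not negatives:
        raise ValueError("pool contains no example without entities")
    rng = random.Random(seed)
    chosen_negative = rng.choice(negatives)
    # Fill the remaining slots from everything except the forced negative.
    rest = [ex for ex in pool if ex is not chosen_negative]
    picked = rng.sample(rest, k - 1) + [chosen_negative]
    rng.shuffle(picked)
    return picked


if __name__ == "__main__":
    for sentence, entities in sample_demonstrations(EXAMPLE_POOL, k=3):
        print(sentence, entities)
```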
[Figure: figures/fewshot.png]
| Dataset | Entity | Metric | GLiNER B | GLiNER ED | GLiNER F | GLiNER EDF | GLiNER Δ |
|---|---|---|---|---|---|---|---|
| i2b2 2010 | Tr | P | 52.03 | 35.70 | 70.79 | 66.71 | 14.68 |
| i2b2 2010 | Tr | R | 44.55 | 76.13 | 39.86 | 63.93 | 19.38 |
| i2b2 2010 | Tr | F1 | 48.00 | 48.61 | 51.00 | 65.29 | 17.29 |
| i2b2 2010 | Pr | P | 71.19 | 48.48 | 79.13 | 67.32 | 3.87 |
| i2b2 2010 | Pr | R | 49.22 | 63.36 | 46.16 | 56.49 | 7.27 |
| i2b2 2010 | Pr | F1 | 58.20 | 54.93 | 58.31 | 61.43 | 3.23 |
| i2b2 2010 | Te | P | 42.80 | 22.93 | 63.77 | 56.65 | 13.85 |
| i2b2 2010 | Te | R | 27.23 | 55.63 | 23.43 | 39.14 | 11.91 |
| i2b2 2010 | Te | F1 | 33.28 | 32.47 | 34.27 | 46.30 | 13.02 |
| i2b2 2012 | Tr | P | 53.77 | 38.88 | 71.83 | 66.92 | 13.15 |
| i2b2 2012 | Tr | R | 48.37 | 69.76 | 42.87 | 58.03 | 9.66 |
| i2b2 2012 | Tr | F1 | 50.93 | 49.93 | 53.69 | 62.16 | 11.23 |
| i2b2 2012 | Pr | P | 71.67 | 51.93 | 77.68 | 67.35 | 4.32 |
| i2b2 2012 | Pr | R | 50.33 | 63.93 | 47.27 | 58.06 | 7.73 |
| i2b2 2012 | Pr | F1 | 59.13 | 57.32 | 58.78 | 62.36 | 3.23 |
| i2b2 2012 | Te | P | 43.97 | 19.72 | 66.72 | 55.42 | 11.45 |
| i2b2 2012 | Te | R | 39.17 | 60.09 | 35.17 | 48.41 | 9.24 |
| i2b2 2012 | Te | F1 | 41.13 | 29.69 | 46.06 | 51.68 | 10.25 |
| i2b2 2012 | CD | P | 48.69 | 22.99 | 58.28 | 50.08 | 1.39 |
| i2b2 2012 | CD | R | 71.59 | 88.27 | 29.52 | 32.96 | 38.63 |
| i2b2 2012 | CD | F1 | 57.96 | 36.48 | 39.19 | 39.76 | 18.20 |
| CLEF 2014 | DD | P | 65.32 | 41.83 | 72.26 | 59.42 | 5.90 |
| CLEF 2014 | DD | R | 27.90 | 48.17 | 26.00 | 42.99 | 15.99 |
| CLEF 2014 | DD | F1 | 39.09 | 44.78 | 38.24 | 49.89 | 10.80 |
| i2b2 2018 | AD | P | 2.31 | 3.52 | 6.47 | 13.50 | 11.19 |
| i2b2 2018 | AD | R | 5.39 | 67.15 | 5.39 | 61.40 | 56.01 |
| i2b2 2018 | AD | F1 | 3.23 | 6.69 | 5.88 | 22.13 | 18.90 |
| i2b2 2018 | ADE | P | 7.42 | 2.17 | 15.03 | 7.70 | 0.28 |
| i2b2 2018 | ADE | R | 14.31 | 44.93 | 13.12 | 40.95 | 26.64 |
| i2b2 2018 | ADE | F1 | 9.77 | 4.15 | 14.01 | 12.96 | 3.19 |
| Avg. | | P | 45.92 | 28.82 | 58.20 | 51.11 | 5.19 |
| Avg. | | R | 37.81 | 63.74 | 30.88 | 50.24 | 12.43 |
| Avg. | | F1 | 40.10 | 36.51 | 39.94 | 47.40 | 7.29 |
5 Performance on BERT-based Models
We use GLiNER [zaratiana2023gliner], a BERT-based model for open named entity recognition. Note that previous prompt engineering methods cannot be applied here. We conduct the experiment in the same way as for UniNER and GNER, with Xie et al. [xie2023empirical] as our baseline. We present the results in \autoreftable_gliner. We observe the same trend as in \autorefsubsec_overall_perf, with an improvement in average F1-score.
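As a quick sanity check on the GLiNER results table, the per-entity F1 values appear consistent with the harmonic mean of the reported precision and recall. The sketch below recomputes three of them, assuming that this is how F1 was obtained; the Avg. rows are not checked this way, since an average of F1 scores need not equal the F1 of averaged precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (dataset, entity, setting) -> (P, R, reported F1), copied from the GLiNER table.
ROWS = {
    ("i2b2 2010", "Tr", "B"): (52.03, 44.55, 48.00),
    ("i2b2 2010", "Tr", "EDF"): (66.71, 63.93, 65.29),
    ("i2b2 2010", "Pr", "B"): (71.19, 49.22, 58.20),
}

if __name__ == "__main__":
    for key, (p, r, reported) in ROWS.items():
        print(key, f"recomputed={f1_score(p, r):.2f}", f"reported={reported:.2f}")
```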
6 Performance Drop on CD
We observe significant recall drops for “clinical department” entities across all models. We posit that some annotated entities do not strictly conform to the notion of a clinical department. For instance, some entities are hospitals; thus, if a filter is prompted with our template (e.g., “Can hospitals be considered as a clinical department?”), it is likely to reject them. One possible solution is to use a clear entity description. As illustrated in \autoreftable_filter_ent_desc, our framework outperforms the baseline in F1-score when an entity description is used.
7 \Ours Filter Precision/Recall Trade-off
We analyze how the filter can be made more or less strict to achieve a better trade-off. We use the filter's output probability to decide whether an entity is rejected. Concretely, rather than rejecting an entity as soon as the filter outputs “No”, we first look at the token probability. If it is less than a certain threshold, we reject the entity. Our framework is simplified to entity decomposition if the threshold is . We provide the results in \autoreffig_filter_tradeoff. Overall, increasing the threshold leads to decreased precision and improved recall. Interestingly, better thresholding can improve the F1-score for “clinical department” entities. This might be due to the noise in these entities, as described in \autorefapp_drop_cd.
[Figure: figures/filter_tradeoff.png]
8 LLM Prompt Templates
Our experiments involve large language models, which are often trained with specific templates. We use their default templates (except Llama2) throughout the experiments and present them here.
\thesubsection UniNER
A virtual assistant answers questions from a user based on the provided text.
USER: Text: {input}
ASSISTANT: I’ve read this text.
USER: {instruction}
ASSISTANT:
\thesubsection GNER
[INST] Please analyze the sentence provided, identifying the type of entity for each word on a token-by-token basis.
Output format is: word_1(label_1), word_2(label_2), ...
We’ll use the BIO-format to label the entities, where:
1. B- (Begin) indicates the start of a named entity.
2. I- (Inside) is used for words within a named entity but are not the first word.
3. O (Outside) denotes words that are not part of a named entity.
{instruction}
Sentence: {input} [/INST]
\thesubsection Asclepius
You are an intelligent clinical languge model.
Below is a snippet of patient’s discharge summary and a following instruction from healthcare professional.
Write a response that appropriately completes the instruction.
The response should provide the accurate answer to the instruction, while being concise.
[Discharge Summary Begin]
{input}
[Discharge Summary End]
[Instruction Begin]
{instruction}
[Instruction End]
\thesubsection Llama2
<s>[INST] <<SYS>>
You are an intelligent clinical languge model.
Below is an instruction from healthcare professional.
Write a response that appropriately completes the instruction.
The response should provide the accurate answer to the instruction, while being concise.
<</SYS>>
{instruction} [/INST]
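The templates above are plain strings with {input} and {instruction} slots, so they can be filled with ordinary string formatting. The sketch below does this for the UniNER and Llama2 templates; joining the printed lines with newlines, the helper name, and the sample UniNER instruction are assumptions for illustration. The spelling inside the Llama2 template is kept exactly as printed above.

```python
# Templates copied from the subsections above; line breaks are an assumption.
UNINER_TEMPLATE = (
    "A virtual assistant answers questions from a user based on the provided text.\n"
    "USER: Text: {input}\n"
    "ASSISTANT: I've read this text.\n"
    "USER: {instruction}\n"
    "ASSISTANT:"
)
LLAMA2_TEMPLATE = (
    "<s>[INST] <<SYS>>\n"
    "You are an intelligent clinical languge model.\n"  # spelling as printed
    "Below is an instruction from healthcare professional.\n"
    "Write a response that appropriately completes the instruction.\n"
    "The response should provide the accurate answer to the instruction, "
    "while being concise.\n"
    "<</SYS>>\n"
    "{instruction} [/INST]"
)


def render(template: str, **fields: str) -> str:
    """Fill the named slots of a template (hypothetical helper)."""
    return template.format(**fields)


if __name__ == "__main__":
    # The UniNER instruction below is a made-up example, not the one used here.
    print(render(UNINER_TEMPLATE,
                 input="Patient was started on aspirin.",
                 instruction="What describes treatment in the text?"))
    # The Llama2 instruction reuses the default filter question.
    print(render(LLAMA2_TEMPLATE,
                 instruction="Can 'aspirin' be considered a/an treatment? "
                             "Answer with yes or no."))
```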
9 LLM Hyperparameters
We use the default hyperparameters for each model. For UniNER and GNER, we use greedy search. For Asclepius and Llama2, we use temperature and top probability .