Plausible-Parrots @ MSP2023: Enhancing Semantic Plausibility Modeling using Entity and Event Knowledge

Chong Shen    Chenyue Zhou
Institute for Natural Language Processing, University of Stuttgart, Germany
{chong.shen,chenyue.zhou}@ims.uni-stuttgart.de
Abstract

In this work, we investigate the effectiveness of injecting external knowledge into a large language model (LLM) to identify the semantic plausibility of simple events. Specifically, we enhance the LLM with fine-grained entity types, event types and their definitions extracted from an external knowledge base. This knowledge is injected into our system via designed templates. We also augment the data to balance the label distribution and adapt the task setting to real-world scenarios in which event mentions are expressed as natural language sentences. The experimental results show the effectiveness of the injected knowledge on modeling the semantic plausibility of events. An error analysis further emphasizes the importance of identifying non-trivial entity and event types. Code and data are available at https://github.com/st143575/SemPlaus-plausibleparrots.

1 Introduction

Discerning expressions about plausible events from implausible ones is a fundamental element for understanding event semantics. Semantic plausibility modeling is the task of identifying events that are likely to happen but not necessarily attested in a given world (Gordon and Van Durme, 2013). Previous works have shown the potential of incorporating world knowledge in solving the task, such as physical attributes (Wang et al., 2018), lexical hierarchy (Porada et al., 2021) and degrees of abstractness (Eichel and Im Walde, 2023).

Events in the real world are typically expressed by natural language sentences, which possess diverse, complex and dynamic forms. The words constituting an event are often ambiguous. For example, the subject, verb and object of the event Jobs takes an apple all have multiple meanings and can lead to misinterpretation of the plausibility of the event. Thus, we hypothesize that the types of the trigger (i.e. verb) and the arguments (e.g., subject and object) of an event are crucial to disambiguate the word meaning and thus help the model better understand the semantic plausibility of the event. Furthermore, single events in the form of (subject,verb,object)-triples are inconsistent with the natural language sentences that are input to LLMs during pretraining. This mismatch potentially limits the model’s performance.

[Figure 1 diagram: subject, verb, object; entity type, event type, entity type; each with a definition.]
Figure 1: Simple (s,v,o)-event enhanced by fine-grained entity types for the subject and object and an event type for the verb, accompanied by their definitions.

To mitigate these issues, we propose to inject fine-grained entity types and event types as external knowledge to the model. We design multiple templates to construct natural language prompts to inject types of entities (i.e. subject and object) and events (i.e. verb), together with their definitions extracted from a knowledge base. We also perform data augmentation to counteract the unbalanced label distribution. Experimental results verify our hypothesis that the model benefits from the injected knowledge.

The main contributions of this paper are: (1) We enhance an LLM with fine-grained entity and event knowledge for the semantic plausibility modeling problem; (2) We incorporate rich semantic knowledge of the entity type and event type from an external knowledge base into their labels; (3) We fuse this knowledge using multiple specifically designed templates; (4) We augment the dataset to deal with the unbalanced label distribution.

2 Related Work

2.1 Semantic Plausibility Modeling

Semantic plausibility is a fundamental element for understanding event semantics due to the multi-faceted human intuition in the assessment (Resnik, 1996) and the infrequency, non-typicality and non-preference of plausible and implausible events (Padó et al., 2007). It describes what is likely, but not necessarily attested in a given world (Gordon and Van Durme, 2013). In contrast to selectional preference (Erk and Padó, 2010), which is characterized by the typicality of events, semantic plausibility is sensitive to certain properties that are not explicitly covered by selectional preference (Bagherinezhad et al., 2016).

Modeling semantic plausibility is the task of distinguishing events that are likely to happen from those whose occurrence is implausible. Previous efforts focus on injecting world knowledge about entity properties (Forbes and Choi, 2017; Wang et al., 2018) and capturing abstractness in simple events (Eichel and Im Walde, 2023), which requires domain expertise for the annotation. Furthermore, entity properties such as size and weight may not be sufficient for the model to learn the relationship between the subject and object involved in an event. As we will show, knowledge about “what kind of action is occurring in the event” and “what kind of entities are participating in it”, which is given by the type labels and their detailed definitions, can be an effective alternative to the entity properties.

2.2 Ultra Fine-grained Entity Typing

Ultra Fine-grained Entity Typing (UFET) is a multi-label classification problem of predicting fine-grained semantic types for entity mentions in text (Choi et al., 2018). The most significant challenge stems from the massive label space of entity types (typically over 10k classes). Existing approaches fall into two lines: modeling label dependencies and type hierarchies (Onoe et al., 2021; Zuo et al., 2022), and data augmentation with distant supervision (Dai et al., 2021; Zhang et al., 2022; Li et al., 2022a). Feng et al. (2023) propose CASENT, a sequence-to-sequence system that predicts ultra-fine entity types using probability calibration. The model is trained to predict a ground-truth entity type given an input entity mention in an auto-regressive manner. During inference, a calibration module computes a calibrated confidence score for each predicted candidate type label of the entity mention. The system then selects the final predictions from the candidate labels using a threshold on the confidence scores. CASENT achieves new state-of-the-art performance on the UFET dataset (Choi et al., 2018).

2.3 Event Detection

An event is defined as an occurrence of an action that causes the change of a state (Li et al., 2022b). Events are represented in various ways in different studies. In early work, an event was defined either as a proposition of subject and predicate in studies on temporal news comprehension (Filatova and Hovy, 2001), or as a (predicate, dependency)-pair in studies on script learning (Chambers and Jurafsky, 2008). Later, Balasubramanian et al. (2013) represent events by (subject, relation, object)-triples for event schema induction. Recent studies in information extraction define an event as a more complex structure consisting of event trigger, event type, event arguments and argument roles (Li et al., 2022b). The trigger is the core unit of an event, typically the verb serving as the predicate. The event type describes the representative feature of the event and is usually the type of the trigger. Event arguments are the participants involved in the event and other details about the event, such as time and place.

Event detection is the task of identifying event triggers in a given event mention and classifying them into event types. Although multiple benchmarks have been proposed for event detection (Grishman et al., 2005; Wang et al., 2020), they cover a limited range of topics and suffer from small data sizes. Li et al. (2023) release GLEN, a new event detection dataset with a larger data size and a wider coverage of type labels.

3 Task Definition

We formulate semantic plausibility modeling as a binary sequence classification problem. Given a knowledge-enhanced event mention $x$ as prompt, the model should predict whether it is plausible (1) or implausible (0), i.e.

$$\hat{y} = \operatorname*{arg\,max}_{y \in \{0,1\}} P(y \mid h(x))$$

where $h(x)$ is the output of the model’s last hidden layer for $x$.
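
The decision rule can be illustrated with a minimal sketch. It assumes that a classification head has already mapped $h(x)$ to two logits, one per label; the function names `softmax` and `predict_label` and the example logits are hypothetical and are not taken from the released code.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw scores into a probability distribution P(y | h(x))."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: Sequence[float]) -> int:
    """Return the label y in {0, 1} with the highest probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical logits in the order (implausible, plausible).
label = predict_label([-0.3, 1.2])
```

Since softmax is monotonic, taking the argmax over the logits directly would give the same label; the probabilities are only needed when a score-based metric such as AUC is computed.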

4 Methods

subj / obj   entity types
             wd_qid     wd_label         description
Trader       Q215627    person           being that has certain capacities …
             Q43845     businessperson   person involved in activities for …
             Q1424605   trader           businessperson who exchanges …
             Q702269    professional     person who is paid to undertake …
             Q131524    entrepreneur     individual who organizes and …
strategy     Q131841    idea             mental image or concept
             Q151885    concept          semantic unit understood in …
             Q1371819   plan             outline of a strategy for …
Table 1: An example UFET prediction for the subject and object in the event (trader, ensures, strategy).
event trigger   event type
                xpo_node        name         description
robs            DWD_Q53706      robbery      taking or attempting to take …
accusation      DWD_Q19357312   accusation   act of accusing or charging …
Table 2: An example event detection prediction for the event (option, robs, accusation).

4.1 Data Augmentation

To address the label distribution in one of our datasets, which is skewed towards plausible events, we employ a data augmentation strategy to increase the number of implausible events. We enrich the binary-class variant of the dataset with non-duplicate implausible events randomly sampled from the multi-class variant of the dataset, which also provides binary plausibility labels.
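
A minimal sketch of this sampling step is given below. The data layout (events as tuples paired with a 0/1 label), the function name `augment_with_implausible`, the fixed seed and the toy events are all assumptions made for illustration; the released code may organize this differently.

```python
import random
from typing import List, Tuple

Event = Tuple[str, str, str]   # (subject, verb, object)
Example = Tuple[Event, int]    # label: 1 = plausible, 0 = implausible


def augment_with_implausible(binary_data: List[Example],
                             multiclass_data: List[Example],
                             n_extra: int,
                             seed: int = 0) -> List[Example]:
    """Add up to n_extra implausible events not already in binary_data."""
    seen = {event for event, _ in binary_data}
    # Keep implausible events only, drop duplicates and known events.
    candidates = sorted({event for event, label in multiclass_data
                         if label == 0 and event not in seen})
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(candidates, min(n_extra, len(candidates)))
    return list(binary_data) + [(event, 0) for event in sampled]


# Toy, made-up events purely for illustration.
binary = [(("man", "eats", "apple"), 1), (("man", "eats", "house"), 0)]
multi = [(("man", "eats", "house"), 0), (("dog", "drives", "cloud"), 0),
         (("dog", "drives", "cloud"), 0), (("girl", "reads", "book"), 1)]
augmented = augment_with_implausible(binary, multi, n_extra=5)
```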

4.2 Ultra Fine-grained Entity Typing

We perform UFET to obtain fine-grained entity types and their definitions using CASENT (Feng et al., 2023). For each event triple $(s,v,o)$, we produce two sentences, one with the text span for the subject $s$ marked by the special tokens <M> and </M>, the other with the text span for the object $o$ marked by the same special tokens. Then, CASENT predicts a set of fine-grained entity types for $s$ in the first sentence and for $o$ in the second sentence. After that, we extract the definitions (“description”) for the predicted type labels from a knowledge base (KB) built on Wikidata (https://wikidata.org).

For example, given the event (trader, ensures, strategy), we produce two sentences “<M> Trader </M> ensures strategy.” and “Trader ensures <M> strategy </M>.” as the inputs to CASENT. The system outputs 5 entity types for the subject trader and 3 entity types for the object strategy together with their definitions as shown in Table 1.
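
The construction of the two marked sentences can be sketched as follows. The function name `mark_mentions` is hypothetical, and the sentence realization (capitalized subject, single spaces, final period) is inferred from the example above rather than taken from the released code.

```python
from typing import Tuple


def mark_mentions(subj: str, verb: str, obj: str) -> Tuple[str, str]:
    """Build one sentence with the subject marked and one with the object marked."""
    subj_cap = subj[:1].upper() + subj[1:]  # capitalize the sentence-initial word
    subj_marked = f"<M> {subj_cap} </M> {verb} {obj}."
    obj_marked = f"{subj_cap} {verb} <M> {obj} </M>."
    return subj_marked, obj_marked


subj_sentence, obj_sentence = mark_mentions("trader", "ensures", "strategy")
```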

4.3 Event Detection

For each natural language sentence built from an event triple $(s,v,o)$, we identify the event trigger (which is usually $v$) and predict its type label $t_v$ using the event detection model CEDAR released with GLEN (Li et al., 2023). Furthermore, we extract the definition of the event type label from the KB that accompanies the dataset, which is also built on Wikidata. An event can have multiple event triggers. Each event trigger, however, can have only one event type.

Table 2 shows the predictions of CEDAR for the event (option, robs, accusation) as an example. The model identifies two event triggers, robs and accusation. The former is assigned the type robbery with the definition taking or attempting to take something of value by force or threat of force or by putting the victim in fear, and the latter the type accusation with the definition act of accusing or charging another with a crime. Based on the structure of simple events, we assume that the event trigger is always the verb $v$. Thus, we consider only the event type for the verb in the model’s predictions.
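
The filtering step can be sketched as below. The dictionary layout of a prediction (keys `trigger`, `xpo_node`, `name`, `description`, mirroring the columns of Table 2) and the function name `select_verb_event_type` are assumptions for illustration and do not reflect CEDAR's actual output format.

```python
from typing import Dict, List, Optional

Prediction = Dict[str, str]


def select_verb_event_type(verb: str,
                           predictions: List[Prediction]) -> Optional[Prediction]:
    """Keep only the prediction whose trigger is the verb; None means unknown type."""
    for pred in predictions:
        if pred["trigger"] == verb:
            return pred
    return None


# Predictions for (option, robs, accusation), copied from Table 2.
predictions = [
    {"trigger": "robs", "xpo_node": "DWD_Q53706", "name": "robbery",
     "description": "taking or attempting to take something of value by force "
                    "or threat of force or by putting the victim in fear"},
    {"trigger": "accusation", "xpo_node": "DWD_Q19357312", "name": "accusation",
     "description": "act of accusing or charging another with a crime"},
]
verb_type = select_verb_event_type("robs", predictions)
```

When no trigger matches the verb, the function returns `None`, which corresponds to the case where the verb receives an unknown type.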

[Figure 2 components: S, V, O; UFET; ED; KB; Template; Prompt; RoBERTa; output plausible (1) / implausible (0).]
Figure 2: System architecture.

4.4 Template-based Prompt Engineering

To better integrate the knowledge elements into the input, we design the following 9 templates to construct natural language prompts containing the event sentence, the entity types and the event type, along with their definitions.

  1. TEMPLATE_SENT = [EVT]{sent}[/EVT]

  2. TEMPLATE_SUBJ_BASIC = The subject “{subj}” has type [STYPE]{stype}[/STYPE], which means [DEF]{stype_desc}[/DEF].

  3. TEMPLATE_SUBJ_EXTEND = It can also have type [STYPE]{stype}[/STYPE], which means [DEF]{stype_desc}[/DEF].

  4. TEMPLATE_SUBJ_UNK = The subject “{subj}” has an unknown type.

  5. TEMPLATE_VERB = The verb “{verb}” has type [ETYPE]{etype}[/ETYPE], which means [DEF]{etype_desc}[/DEF].

  6. TEMPLATE_VERB_UNK = The verb “{verb}” has an unknown type.

  7. TEMPLATE_OBJ_BASIC = The object “{obj}” has type [OTYPE]{otype}[/OTYPE], which means [DEF]{otype_desc}[/DEF].

  8. TEMPLATE_OBJ_EXTEND = It can also have type [OTYPE]{otype}[/OTYPE], which means [DEF]{otype_desc}[/DEF].

  9. TEMPLATE_OBJ_UNK = The object “{obj}” has an unknown type.
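
One way to assemble a prompt from these templates is sketched below. The helper names `entity_part` and `build_prompt`, the `(label, definition)` tuple layout and the use of a newline between the four parts are assumptions; only the template strings themselves come from the list above.

```python
from typing import List, Optional, Tuple

TypeInfo = Tuple[str, str]  # (type label, definition)

TEMPLATE_SENT = "[EVT]{sent}[/EVT]"
TEMPLATE_SUBJ_BASIC = ("The subject “{subj}” has type [STYPE]{stype}[/STYPE], "
                       "which means [DEF]{stype_desc}[/DEF].")
TEMPLATE_SUBJ_EXTEND = ("It can also have type [STYPE]{stype}[/STYPE], "
                        "which means [DEF]{stype_desc}[/DEF].")
TEMPLATE_SUBJ_UNK = "The subject “{subj}” has an unknown type."
TEMPLATE_VERB = ("The verb “{verb}” has type [ETYPE]{etype}[/ETYPE], "
                 "which means [DEF]{etype_desc}[/DEF].")
TEMPLATE_VERB_UNK = "The verb “{verb}” has an unknown type."
TEMPLATE_OBJ_BASIC = ("The object “{obj}” has type [OTYPE]{otype}[/OTYPE], "
                      "which means [DEF]{otype_desc}[/DEF].")
TEMPLATE_OBJ_EXTEND = ("It can also have type [OTYPE]{otype}[/OTYPE], "
                       "which means [DEF]{otype_desc}[/DEF].")
TEMPLATE_OBJ_UNK = "The object “{obj}” has an unknown type."


def entity_part(mention: str, types: List[TypeInfo], role: str, prefix: str,
                basic: str, extend: str, unk: str) -> str:
    """Use the basic template for the first type and the extend template for the rest."""
    if not types:
        return unk.format(**{role: mention})
    parts = []
    for i, (label, desc) in enumerate(types):
        template = basic if i == 0 else extend
        fields = {role: mention, prefix + "type": label,
                  prefix + "type_desc": desc}
        parts.append(template.format(**fields))
    return " ".join(parts)


def build_prompt(sent: str, subj: str, subj_types: List[TypeInfo],
                 verb: str, verb_type: Optional[TypeInfo],
                 obj: str, obj_types: List[TypeInfo]) -> str:
    """Concatenate the event sentence with the subject, verb and object parts."""
    if verb_type is None:
        verb_part = TEMPLATE_VERB_UNK.format(verb=verb)
    else:
        verb_part = TEMPLATE_VERB.format(verb=verb, etype=verb_type[0],
                                         etype_desc=verb_type[1])
    return "\n".join([
        TEMPLATE_SENT.format(sent=sent),
        entity_part(subj, subj_types, "subj", "s", TEMPLATE_SUBJ_BASIC,
                    TEMPLATE_SUBJ_EXTEND, TEMPLATE_SUBJ_UNK),
        verb_part,
        entity_part(obj, obj_types, "obj", "o", TEMPLATE_OBJ_BASIC,
                    TEMPLATE_OBJ_EXTEND, TEMPLATE_OBJ_UNK),
    ])


# Abridged version of the (trader, ensures, strategy) example.
prompt = build_prompt(
    "Trader ensures strategy.",
    "Trader", [("businessperson",
                "person involved in activities for the purpose of generating revenue")],
    "ensures", None,
    "strategy", [("idea", "mental image or concept"),
                 ("plan", "outline of a strategy for achievement of an objective")],
)
```

Keeping the unknown-type templates as explicit fallbacks means every prompt has the same four-part shape, whether or not UFET or event detection returns a type.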

5 Experiments

5.1 Data

Our experiments are conducted on PEP-3K (Wang et al., 2018) and PAP (Eichel and Im Walde, 2023). We use the values in the column label as ground truth labels. The training and development sets of both datasets are merged for fine-tuning, resulting in 4911 and 614 examples, respectively. The model is evaluated separately on the test sets of the two datasets, which contain 307 and 308 examples, respectively.

5.2 Preliminary Study

In our initial exploration, we investigate the statistical and linguistic characteristics of the two datasets to gain insights into their composition and semantics. We examine the distribution of examples across the training, development and test splits of each dataset. Additionally, we use word clouds to visualize the most frequent words in plausible and implausible events. The outcomes of this investigation are presented in Appendix A.

                                      PAP                                 PEP-3K
Model                                 AUC    P      R      F1     Acc     AUC    P      R      F1     Acc
$\textsc{RoBERTa}_{evt+ent,ft}$       0.659  0.717  0.526  0.607  0.659   0.883  0.850  0.928  0.888  0.883
$\textsc{RoBERTa}_{evt,ft}$           0.636  0.640  0.623  0.632  0.636   0.844  0.793  0.928  0.855  0.844
$\textsc{RoBERTa}_{ent,ft}$           0.666  0.763  0.481  0.590  0.666   0.840  0.838  0.843  0.840  0.840
$\textsc{RoBERTa}_{bs,0-shot}$        0.532  0.564  0.286  0.379  0.532   0.500  0.500  0.137  0.215  0.502
$\textsc{RoBERTa}_{bs,ft}$            0.646  0.737  0.455  0.562  0.646   0.791  0.830  0.732  0.778  0.792
Table 3: Semantic plausibility modeling results. For PAP, injecting entity type leads to the best AUC. For PEP-3K, injecting both event type and entity type improves all metrics over the fine-tuned baseline. For both datasets, injecting only event type yields higher recall but lower precision than injecting only entity type. evt+ent: event type and entity type, evt: event type, ent: entity type, bs: baseline, ft: fine-tune.

We use the Python library Gensim (https://github.com/piskvorky/gensim) to calculate semantic similarities between the sets of most frequent words in plausible and implausible events. The two datasets exhibit contrasting semantic profiles: in PAP, plausible and implausible terms are clearly dissimilar, while in PEP-3K they are highly similar. This distinction correlates with our subsequent experimental observations, where classifiers tend to perform better on PEP-3K than on PAP. Figures 4 and 5 depict word similarities between the top plausible and implausible words in the datasets.

5.3 Model Implementation

We fine-tune RoBERTa-large (https://huggingface.co/roberta-large; Liu et al., 2019) on the merged dataset enhanced by entity type and event type knowledge (Evt+Ent). As an ablation study, we also fine-tune the model on the data with only event type knowledge injected (Evt), as well as with only entity type knowledge injected (Ent). Figure 2 illustrates the architecture of our system.

5.3.1 Baselines

We compare our approach with two baselines, including (1) a zero-shot inference baseline (BS-0-shot); and (2) RoBERTa fine-tuned on the event mentions without knowledge injection (BS-ft).

5.4 Hyperparameters

For Evt+Ent and Ent, we fine-tune the model for 10 epochs with batch size 16, using the AdamW optimizer (Loshchilov and Hutter, 2017). Training uses an initial learning rate of 1e-5, a weight decay of 0.01 and 10 warm-up steps. The Evt group uses the same hyperparameter values, except that the number of warm-up steps is set to 100.

The baseline BS-ft is fine-tuned for 10 epochs with batch size 8 using the AdamW optimizer. The learning rate is 1e-5, with a weight decay of 0.01 and 10 warm-up steps.
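
For reference, the settings above can be collected in a small configuration sketch. The class name `FineTuneConfig`, the field names and the dictionary `CONFIGS` are hypothetical; only the values are taken from this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning settings; defaults follow the Evt+Ent and Ent groups."""
    epochs: int = 10
    batch_size: int = 16
    learning_rate: float = 1e-5
    weight_decay: float = 0.01
    warmup_steps: int = 10
    optimizer: str = "AdamW"


CONFIGS = {
    "Evt+Ent": FineTuneConfig(),
    "Ent": FineTuneConfig(),
    "Evt": FineTuneConfig(warmup_steps=100),  # differs only in warm-up steps
    "BS-ft": FineTuneConfig(batch_size=8),    # baseline uses a smaller batch size
}
```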

5.5 Evaluation Metrics

While previous works prefer to use accuracy (Acc) for the evaluation, we also report precision (P), recall (R), F1 and Area Under the Curve (AUC).
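
The label-based metrics can be computed as in the following sketch, which treats plausible (1) as the positive class. The function name `binary_metrics` and the toy labels are hypothetical. AUC is left out because it is normally computed from prediction scores, which this sketch does not model.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1 and accuracy with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"P": precision, "R": recall, "F1": f1, "Acc": accuracy}


# Toy labels for illustration: 2 true positives, 1 false positive,
# 1 false negative and 2 true negatives.
scores = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

As a consistency check on Table 3, the PAP precision 0.717 and recall 0.526 of the Evt+Ent model give F1 = 2 · 0.717 · 0.526 / (0.717 + 0.526) ≈ 0.607, which matches the reported value.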

6 Results

The experimental results are shown in Table 3. $\textsc{RoBERTa}_{evt+ent}$ achieves the best score on PEP-3K for every metric (tied with $\textsc{RoBERTa}_{evt}$ on recall), indicating the effectiveness of event type and entity type knowledge for improving semantic plausibility understanding. On PAP, $\textsc{RoBERTa}_{evt+ent}$ surprisingly achieves a lower AUC score than $\textsc{RoBERTa}_{ent}$, which injects only the entity type knowledge. This shows the limitation of our approach in understanding the semantic plausibility of events with different degrees of abstractness. On both datasets, injecting only event type knowledge yields higher recall but lower precision than injecting only entity type knowledge. The results of the baselines indicate that fine-tuning is generally better than zero-shot inference.

7 Qualitative Analysis

We observe 105 wrong predictions among the 308 examples in the PAP test set. 68 examples are assigned an unknown type by the model. An instance is presented in Table 4. Furthermore, 50 of the wrong predictions are assigned the trivial entity type entity. As shown in Table 5, the co-occurrence of such trivial entity types and an unknown event type may prevent the model from understanding the plausibility of the event precisely.

Event: (trader, ensures, strategy)
Prompt:
[EVT] Trader ensures strategy. [/EVT]
The subject “Trader” has type [STYPE]person[/STYPE], which means [DEF]being that has certain capacities or attributes constituting personhood (avoid use with P31; use Q5 for humans)[/DEF]. It can also have type [STYPE]businessperson[/STYPE], which means [DEF]person involved in activities for the purpose of generating revenue[/DEF]. It can also have type [STYPE]trader[/STYPE], which means [DEF]businessperson who exchanges stocks, bonds and other such financial instruments[/DEF]. It can also have type [STYPE]professional[/STYPE], which means [DEF]person who is paid to undertake a specialized set of tasks and to complete them for a fee[/DEF]. It can also have type [STYPE]entrepreneur[/STYPE], which means [DEF]individual who organizes and operates a business[/DEF].
The verb “ensures” has an unknown type.
The object “strategy” has type [OTYPE]idea[/OTYPE], which means [DEF]mental image or concept[/DEF]. It can also have type [OTYPE]concept[/OTYPE], which means [DEF]semantic unit understood in different ways, e.g. as mental representation, ability or abstract object[/DEF]. It can also have type [OTYPE]plan[/OTYPE], which means [DEF]outline of a strategy for achievement of an objective[/DEF].
Predicted label: 0
True label: 1
Table 4: A wrong prediction in PAP made by $\textsc{RoBERTa}_{evt+ent}$. The event type is predicted as unknown.
Event: (hook, wins, role)
Prompt:
[EVT] Hook wins role. [/EVT]
The subject “Hook” has type [STYPE]concept[/STYPE], which means [DEF]semantic unit understood in different ways, e.g. as mental representation, ability or abstract object[/DEF]. It can also have type [STYPE]idea[/STYPE], which means [DEF]mental image or concept[/DEF]. It can also have type [STYPE]entity[/STYPE], which means [DEF]anything that can be considered, discussed, or observed[/DEF]. It can also have type [STYPE]hook[/STYPE], which means [DEF]object for hanging, fishing etc.[/DEF].
The verb “wins” has an unknown type.
The object “role” has type [OTYPE]entity[/OTYPE], which means [DEF]anything that can be considered, discussed, or observed[/DEF]. It can also have type [OTYPE]role[/OTYPE], which means [DEF]set of behaviours, rights, obligations, beliefs, and norms expected from an individual that has a certain social status[/DEF].
Predicted label: 1
True label: 0
Table 5: A wrong prediction in PAP made by $\textsc{RoBERTa}_{evt+ent}$. The entities are assigned the trivial type entity.

8 Conclusion

This paper proposes to enhance a large language model with information about fine-grained entity types, event types and their definitions extracted from an external knowledge base. We design templates to integrate this knowledge into the model’s input and mitigate the unbalanced label distribution via data augmentation. In addition, we adapt the task to real-world scenarios by converting simple events to natural language sentences. The experimental results on PEP-3K show that the model’s performance on the task benefits from injecting this knowledge. However, there is still substantial room for improvement on the PAP dataset. This highlights the need to address abstractness in modeling semantic plausibility.

Limitations

Our experiments are conducted on simple events in the form of (s,v,o)-triples. However, events in real-world scenarios usually comprise much more information and are represented in much more complex ways. Furthermore, we do not investigate argument roles in this paper since CEDAR focuses on event detection rather than event extraction, which also involves the identification of event arguments and the prediction of their roles. Last but not least, the datasets only contain events expressed in text, while events can also be depicted by other modalities, such as images and videos. In future research, we will work on more complex event structures, build an event extraction system that identifies event arguments and classifies argument roles, and detect events from other modalities.

References

  • Bagherinezhad et al. (2016) Hessam Bagherinezhad, Hannaneh Hajishirzi, Yejin Choi, and Ali Farhadi. 2016. Are elephants bigger than butterflies? reasoning about sizes of objects. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30.
  • Balasubramanian et al. (2013) Niranjan Balasubramanian, Stephen Soderland, Oren Etzioni, et al. 2013. Generating coherent event schemas at scale. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1721–1731.
  • Chambers and Jurafsky (2008) Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797.
  • Choi et al. (2018) Eunsol Choi, Omer Levy, Yejin Choi, and Luke Zettlemoyer. 2018. Ultra-fine entity typing. arXiv preprint arXiv:1807.04905.
  • Dai et al. (2021) Hongliang Dai, Yangqiu Song, and Haixun Wang. 2021. Ultra-fine entity typing with weak supervision from a masked language model. arXiv preprint arXiv:2106.04098.
  • Eichel and Im Walde (2023) Annerose Eichel and Sabine Schulte im Walde. 2023. A dataset for physical and abstract plausibility and sources of human disagreement. In Proceedings of the 17th Linguistic Annotation Workshop (LAW-XVII), pages 31–45.
  • Erk and Padó (2010) Katrin Erk and Sebastian Padó. 2010. Exemplar-based models for word meaning in context. In Proceedings of the ACL 2010 Conference Short Papers, pages 92–97.
  • Feng et al. (2023) Yanlin Feng, Adithya Pratapa, and David R Mortensen. 2023. Calibrated seq2seq models for efficient and generalizable ultra-fine entity typing. arXiv preprint arXiv:2311.00835.
  • Filatova and Hovy (2001) Elena Filatova and Eduard Hovy. 2001. Assigning time-stamps to event-clauses. In Proceedings of the ACL 2001 Workshop on Temporal and Spatial Information Processing.
  • Forbes and Choi (2017) Maxwell Forbes and Yejin Choi. 2017. Verb physics: Relative physical knowledge of actions and objects. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 266–276, Vancouver, Canada. Association for Computational Linguistics.
  • Gordon and Van Durme (2013) Jonathan Gordon and Benjamin Van Durme. 2013. Reporting bias and knowledge acquisition. In Proceedings of the 2013 workshop on Automated knowledge base construction, pages 25–30.
  • Grishman et al. (2005) Ralph Grishman, David Westbrook, and Adam Meyers. 2005. NYU’s English ACE 2005 system description. ACE, 5:2.
  • Li et al. (2022a) Bangzheng Li, Wenpeng Yin, and Muhao Chen. 2022a. Ultra-fine entity typing with indirect supervision from natural language inference. Transactions of the Association for Computational Linguistics, 10:607–622.
  • Li et al. (2022b) Qian Li, Jianxin Li, Jiawei Sheng, Shiyao Cui, Jia Wu, Yiming Hei, Hao Peng, Shu Guo, Lihong Wang, Amin Beheshti, et al. 2022b. A survey on deep learning event extraction: Approaches and applications. IEEE Transactions on Neural Networks and Learning Systems.
  • Li et al. (2023) Sha Li, Qiusi Zhan, Kathryn Conger, Martha Palmer, Heng Ji, and Jiawei Han. 2023. GLEN: General-purpose event detection for thousands of types. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2823–2838.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Onoe et al. (2021) Yasumasa Onoe, Michael Boratko, Andrew McCallum, and Greg Durrett. 2021. Modeling fine-grained entity types with box embeddings. arXiv preprint arXiv:2101.00345.
  • Padó et al. (2007) Sebastian Padó, Ulrike Padó, and Katrin Erk. 2007. Flexible, corpus-based modelling of human plausibility judgements. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 400–409.
  • Porada et al. (2021) Ian Porada, Kaheer Suleman, Adam Trischler, and Jackie Chi Kit Cheung. 2021. Modeling event plausibility with consistent conceptual abstraction. arXiv preprint arXiv:2104.10247.
  • Resnik (1996) Philip Resnik. 1996. Selectional constraints: An information-theoretic model and its computational realization. Cognition, 61(1-2):127–159.
  • Wang et al. (2018) Su Wang, Greg Durrett, and Katrin Erk. 2018. Modeling semantic plausibility by injecting world knowledge. arXiv preprint arXiv:1804.00619.
  • Wang et al. (2020) Xiaozhi Wang, Ziqi Wang, Xu Han, Wangyi Jiang, Rong Han, Zhiyuan Liu, Juanzi Li, Peng Li, Yankai Lin, and Jie Zhou. 2020. MAVEN: A massive general domain event detection dataset. arXiv preprint arXiv:2004.13590.
  • Zhang et al. (2022) Yue Zhang, Hongliang Fei, and Ping Li. 2022. Denoising enhanced distantly supervised ultrafine entity typing. arXiv preprint arXiv:2210.09599.
  • Zuo et al. (2022) Xinyu Zuo, Haijin Liang, Ning Jing, Shuang Zeng, Zhou Fang, and Yu Luo. 2022. Type-enriched hierarchical contrastive strategy for fine-grained entity typing. arXiv preprint arXiv:2208.10081.

Appendix A Appendix

A.1 Most Frequent Words

Figure 3 illustrates the word clouds of the most frequent words associated with labels in different dataset splits.

(a) PAP (train), plausible
(b) PAP (train), implausible
(c) PEP-3K (train), plausible
(d) PEP-3K (train), implausible
Figure 3: Word clouds of the most frequent words associated with the labels in the PAP and PEP-3K train splits.

A.2 Semantic Similarity

Figures 4 and 5 depict word similarities between the top plausible and implausible words in the datasets.

Figure 4: Word similarity between top plausible words and implausible words in PAP
Figure 5: Word similarity between top plausible words and implausible words in PEP-3K