Showing 1–47 of 47 results for author: Ponzetto, S P

Searching in archive cs.
  1. arXiv:2406.08120  [pdf, other]

    cs.SE

    Interlinking User Stories and GUI Prototyping: A Semi-Automatic LLM-based Approach

    Authors: Kristian Kolthoff, Felix Kretzer, Christian Bartelt, Alexander Maedche, Simone Paolo Ponzetto

    Abstract: Interactive systems are omnipresent today and the need to create graphical user interfaces (GUIs) is just as ubiquitous. For the elicitation and validation of requirements, GUI prototyping is a well-known and effective technique, typically employed after gathering initial user requirements represented in natural language (NL) (e.g., in the form of user stories). Unfortunately, GUI prototyping ofte…

    Submitted 12 June, 2024; originally announced June 2024.

  2. arXiv:2405.15306  [pdf, other]

    cs.CL cs.CV

    DeTikZify: Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ

    Authors: Jonas Belouadi, Simone Paolo Ponzetto, Steffen Eger

    Abstract: Creating high-quality scientific figures can be time-consuming and challenging, even though sketching ideas on paper is relatively easy. Furthermore, recreating existing figures that are not stored in formats preserving semantic information is equally complex. To tackle this problem, we introduce DeTikZify, a novel multimodal language model that automatically synthesizes scientific figures as sema…

    Submitted 28 May, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: Project page: https://github.com/potamides/DeTikZify

  3. arXiv:2403.05303  [pdf, other]

    cs.CL

    ACLSum: A New Dataset for Aspect-based Summarization of Scientific Publications

    Authors: Sotaro Takeshita, Tommaso Green, Ines Reinig, Kai Eckert, Simone Paolo Ponzetto

    Abstract: Extensive efforts in the past have been directed toward the development of summarization datasets. However, a predominant number of these resources have been (semi)-automatically generated, typically through web data crawling, resulting in subpar resources for training and evaluating summarization systems, a quality compromise that is arguably due to the substantial costs associated with generatin…

    Submitted 8 March, 2024; originally announced March 2024.

  4. arXiv:2403.05186  [pdf, other]

    cs.CL

    ROUGE-K: Do Your Summaries Have Keywords?

    Authors: Sotaro Takeshita, Simone Paolo Ponzetto, Kai Eckert

    Abstract: Keywords, that is, content-relevant words in summaries, play an important role in efficient information conveyance, making it critical to assess if system-generated summaries contain such informative words during evaluation. However, existing evaluation metrics for extreme summarization models do not pay explicit attention to keywords in summaries, leaving developers ignorant of their presence. To…

    Submitted 8 March, 2024; originally announced March 2024.

  5. arXiv:2401.14931  [pdf, other]

    cs.CL cs.AI

    Do LLMs Dream of Ontologies?

    Authors: Marco Bombieri, Paolo Fiorini, Simone Paolo Ponzetto, Marco Rospocher

    Abstract: Large language models (LLMs) have recently revolutionized automated text understanding and generation. The performance of these models relies on the high number of parameters of the underlying neural architectures, which allows LLMs to memorize part of the vast quantity of data seen during the training. This paper investigates whether and to what extent general-purpose pre-trained LLMs have memori…

    Submitted 26 January, 2024; originally announced January 2024.

  6. arXiv:2311.02025  [pdf, other]

    cs.CL

    Vicinal Risk Minimization for Few-Shot Cross-lingual Transfer in Abusive Language Detection

    Authors: Gretel Liz De la Peña Sarracén, Paolo Rosso, Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto

    Abstract: Cross-lingual transfer learning from high-resource to medium and low-resource languages has shown encouraging results. However, the scarcity of resources in target languages remains a challenge. In this work, we resort to data augmentation and continual pre-training for domain adaptation to improve cross-lingual abusive language detection. For data augmentation, we analyze two existing techniques…

    Submitted 3 November, 2023; originally announced November 2023.

    Comments: Accepted at EMNLP 2023 (Main Conference)

  7. arXiv:2308.02951  [pdf, other]

    cs.CL

    Multi-Source (Pre-)Training for Cross-Domain Measurement, Unit and Context Extraction

    Authors: Yueling Li, Sebastian Martschat, Simone Paolo Ponzetto

    Abstract: We present a cross-domain approach for automated measurement and context extraction based on pre-trained language models. We construct a multi-source, multi-domain corpus and train an end-to-end extraction pipeline. We then apply multi-source task-adaptive pre-training and fine-tuning to benchmark the cross-domain generalization capability of our model. Further, we conceptualize and apply a task-s…

    Submitted 5 August, 2023; originally announced August 2023.

    Comments: Published as a workshop paper at BioNLP 2023

  8. arXiv:2210.07362  [pdf, other]

    cs.CL

    Can Demographic Factors Improve Text Classification? Revisiting Demographic Adaptation in the Age of Transformers

    Authors: Chia-Chien Hung, Anne Lauscher, Dirk Hovy, Simone Paolo Ponzetto, Goran Glavaš

    Abstract: Demographic factors (e.g., gender or age) shape our language. Previous work showed that incorporating demographic factors can consistently improve performance for various NLP tasks with traditional NLP models. In this work, we investigate whether these previous findings still hold with state-of-the-art pretrained Transformer-based language models (PLMs). We use three common specialization methods…

    Submitted 9 May, 2023; v1 submitted 13 October, 2022; originally announced October 2022.

    Comments: Findings of EACL 2023. arXiv admin note: text overlap with arXiv:2208.01029

  9. arXiv:2209.09062  [pdf, other]

    cs.CL cs.IR

    Overview of the SV-Ident 2022 Shared Task on Survey Variable Identification in Social Science Publications

    Authors: Tornike Tsereteli, Yavuz Selim Kartal, Simone Paolo Ponzetto, Andrea Zielinski, Kai Eckert, Philipp Mayr

    Abstract: In this paper, we provide an overview of the SV-Ident shared task as part of the 3rd Workshop on Scholarly Document Processing (SDP) at COLING 2022. In the shared task, participants were provided with a sentence and a vocabulary of variables, and asked to identify which variables, if any, are mentioned in individual sentences from scholarly documents in full text. Two teams made a total of 9 submi…

    Submitted 19 September, 2022; originally announced September 2022.

  10. arXiv:2209.06804  [pdf, other]

    cs.DL cs.IR

    Towards Automated Survey Variable Search and Summarization in Social Science Publications

    Authors: Yavuz Selim Kartal, Sotaro Takeshita, Tornike Tsereteli, Kai Eckert, Henning Kroll, Philipp Mayr, Simone Paolo Ponzetto, Benjamin Zapilko, Andrea Zielinski

    Abstract: Nowadays there is a growing trend in many scientific disciplines to support researchers by providing enhanced information access through linking of publications and underlying datasets, so as to support research with infrastructure to enhance reproducibility and reusability of research results. In this research note, we present an overview of an ongoing research project, named VADIS (VAriable Dete…

    Submitted 14 September, 2022; originally announced September 2022.

    Comments: 10 pages, 2 figures

  11. arXiv:2208.01029  [pdf, other]

    cs.CL

    On the Limitations of Sociodemographic Adaptation with Transformers

    Authors: Chia-Chien Hung, Anne Lauscher, Dirk Hovy, Simone Paolo Ponzetto, Goran Glavaš

    Abstract: Sociodemographic factors (e.g., gender or age) shape our language. Previous work showed that incorporating specific sociodemographic factors can consistently improve performance for various NLP tasks in traditional NLP models. We investigate whether these previous findings still hold with state-of-the-art pretrained Transformers. We use three common specialization methods proven effective for inco…

    Submitted 1 August, 2022; originally announced August 2022.

  12. arXiv:2208.01018  [pdf, other]

    cs.CL

    Massively Multilingual Lexical Specialization of Multilingual Transformers

    Authors: Tommaso Green, Simone Paolo Ponzetto, Goran Glavaš

    Abstract: While pretrained language models (PLMs) primarily serve as general-purpose text encoders that can be fine-tuned for a wide variety of downstream tasks, recent work has shown that they can also be rewired to produce high-quality word representations (i.e., static word embeddings) and yield good performance in type-level lexical tasks. While existing work primarily focused on the lexical specializat…

    Submitted 29 May, 2023; v1 submitted 1 August, 2022; originally announced August 2022.

    Comments: Accepted in ACL 2023

  13. X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents

    Authors: Sotaro Takeshita, Tommaso Green, Niklas Friedrich, Kai Eckert, Simone Paolo Ponzetto

    Abstract: The number of scientific publications nowadays is rapidly increasing, causing information overload for researchers and making it hard for scholars to keep up to date with current trends and lines of work. Consequently, recent work on applying text mining technologies for scholarly publications has investigated the application of automatic text summarization technologies, including extreme summariz…

    Submitted 30 May, 2022; originally announced May 2022.

    Comments: JCDL2022

  14. arXiv:2205.14981  [pdf, other]

    cs.CL

    ZusammenQA: Data Augmentation with Specialized Models for Cross-lingual Open-retrieval Question Answering System

    Authors: Chia-Chien Hung, Tommaso Green, Robert Litschko, Tornike Tsereteli, Sotaro Takeshita, Marco Bombieri, Goran Glavaš, Simone Paolo Ponzetto

    Abstract: This paper introduces our proposed system for the MIA Shared Task on Cross-lingual Open-retrieval Question Answering (COQA). In this challenging scenario, given an input question the system has to gather evidence documents from a multilingual pool and generate from them an answer in the language of the question. We devised several approaches combining different model variants for three main compon…

    Submitted 30 May, 2022; originally announced May 2022.

  15. arXiv:2205.10400  [pdf, other]

    cs.CL

    Multi2WOZ: A Robust Multilingual Dataset and Conversational Pretraining for Task-Oriented Dialog

    Authors: Chia-Chien Hung, Anne Lauscher, Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš

    Abstract: Research on (multi-domain) task-oriented dialog (TOD) has predominantly focused on the English language, primarily due to the shortage of robust TOD datasets in other languages, preventing the systematic investigation of cross-lingual transfer for this crucial NLP application area. In this work, we introduce Multi2WOZ, a new multilingual multi-domain TOD dataset, derived from the well-established…

    Submitted 20 May, 2022; originally announced May 2022.

    Comments: NAACL 2022

  16. arXiv:2204.04026  [pdf, other]

    cs.CL

    Fair and Argumentative Language Modeling for Computational Argumentation

    Authors: Carolin Holtermann, Anne Lauscher, Simone Paolo Ponzetto

    Abstract: Although much work in NLP has focused on measuring and mitigating stereotypical bias in semantic spaces, research addressing bias in computational argumentation is still in its infancy. In this paper, we address this research gap and conduct a thorough investigation of bias in argumentative language models. To this end, we introduce ABBA, a novel resource for bias measurement specifically tailored…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: ACL 2022

  17. arXiv:2203.04860  [pdf, other]

    cs.CL

    PET: An Annotated Dataset for Process Extraction from Natural Language Text

    Authors: Patrizio Bellan, Han van der Aa, Mauro Dragoni, Chiara Ghidini, Simone Paolo Ponzetto

    Abstract: Process extraction from text is an important task of process discovery, for which various approaches have been developed in recent years. However, in contrast to other information extraction tasks, there is a lack of gold-standard corpora of business process descriptions that are carefully annotated with all the entities and relationships of interest. Due to this, it is currently hard to compare t…

    Submitted 13 June, 2022; v1 submitted 9 March, 2022; originally announced March 2022.

  18. arXiv:2112.11031  [pdf, other]

    cs.CL cs.IR

    On Cross-Lingual Retrieval with Multilingual Text Encoders

    Authors: Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš

    Abstract: In this work we present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks across a number of diverse language pairs. We first treat these models as multilingual text encoders and benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR. In contrast to supervised l…

    Submitted 21 December, 2021; originally announced December 2021.

    Comments: to appear in IRJ ECIR 2021 Special Issue. arXiv admin note: substantial text overlap with arXiv:2101.08370

    ACM Class: H.3.3; I.2.7

  19. arXiv:2110.08395  [pdf, other]

    cs.CL

    DS-TOD: Efficient Domain Specialization for Task Oriented Dialog

    Authors: Chia-Chien Hung, Anne Lauscher, Simone Paolo Ponzetto, Goran Glavaš

    Abstract: Recent work has shown that self-supervised dialog-specific pretraining on large conversational datasets yields substantial gains over traditional language modeling (LM) pretraining in downstream task-oriented dialog (TOD). These approaches, however, exploit general dialogic corpora (e.g., Reddit) and thus presumably fail to reliably embed domain-specific knowledge useful for concrete downstream TO…

    Submitted 20 May, 2022; v1 submitted 15 October, 2021; originally announced October 2021.

    Comments: Findings of ACL 2022

  20. arXiv:2110.03754  [pdf, other]

    cs.AI

    Process Extraction from Text: Benchmarking the State of the Art and Paving the Way for Future Challenges

    Authors: Patrizio Bellan, Mauro Dragoni, Chiara Ghidini, Han van der Aa, Simone Paolo Ponzetto

    Abstract: The extraction of process models from text refers to the problem of turning the information contained in an unstructured textual process description into a formal representation, i.e., a process model. Several automated approaches have been proposed to tackle this problem, but they are highly heterogeneous in scope and underlying assumptions, i.e., differences in input, target output, and data used…

    Submitted 25 October, 2023; v1 submitted 7 October, 2021; originally announced October 2021.

  21. arXiv:2108.06295  [pdf, other]

    cs.CL cs.DL

    Diachronic Analysis of German Parliamentary Proceedings: Ideological Shifts through the Lens of Political Biases

    Authors: Tobias Walter, Celina Kirschner, Steffen Eger, Goran Glavaš, Anne Lauscher, Simone Paolo Ponzetto

    Abstract: We analyze bias in historical corpora as encoded in diachronic distributional semantic models by focusing on two specific forms of bias, namely a political (i.e., anti-communism) and racist (i.e., antisemitism) one. For this, we use a new corpus of German parliamentary proceedings, DeuPARL, spanning the period 1867–2020. We complement this analysis of historical biases in diachronic word embeddin…

    Submitted 13 August, 2021; originally announced August 2021.

    Comments: Accepted for JCDL2021

  22. arXiv:2105.01305  [pdf, other]

    cs.AI cs.CL

    Large-scale Taxonomy Induction Using Entity and Word Embeddings

    Authors: Petar Ristoski, Stefano Faralli, Simone Paolo Ponzetto, Heiko Paulheim

    Abstract: Taxonomies are an important ingredient of knowledge organization, and serve as a backbone for more sophisticated knowledge representations in intelligent systems, such as formal ontologies. However, building taxonomies manually is a costly endeavor, and hence, automatic methods for taxonomy induction are a good alternative to build large-scale taxonomies. In this paper, we propose TIEmb, an approa…

    Submitted 4 May, 2021; originally announced May 2021.

    Comments: Published at IEEE/WIC/ACM International Conference on Web Intelligence 2017 (WI'17)

  23. arXiv:2103.06598  [pdf, other]

    cs.CL

    DebIE: A Platform for Implicit and Explicit Debiasing of Word Embedding Spaces

    Authors: Niklas Friedrich, Anne Lauscher, Simone Paolo Ponzetto, Goran Glavaš

    Abstract: Recent research efforts in NLP have demonstrated that distributional word vector spaces often encode stereotypical human biases, such as racism and sexism. With word representations ubiquitously used in NLP models and pipelines, this raises ethical issues and jeopardizes the fairness of language technologies. While there exists a large body of work on bias measures and debiasing methods, to date,…

    Submitted 11 March, 2021; originally announced March 2021.

    Comments: Accepted as EACL21 Demo

  24. arXiv:2101.09810  [pdf, other]

    cs.CL

    FakeFlow: Fake News Detection by Modeling the Flow of Affective Information

    Authors: Bilal Ghanem, Simone Paolo Ponzetto, Paolo Rosso, Francisco Rangel

    Abstract: Fake news articles often stir the readers' attention by means of emotional appeals that arouse their feelings. Unlike in short news texts, authors of longer articles can exploit such affective factors to manipulate readers by adding exaggerations or fabricating events, in order to affect the readers' emotions. To capture this, we propose in this paper to model the flow of affective information in…

    Submitted 24 January, 2021; originally announced January 2021.

    Comments: 9 pages, 6 figures, EACL-2021

  25. arXiv:2101.08370  [pdf, other]

    cs.CL cs.IR

    Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval

    Authors: Robert Litschko, Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš

    Abstract: Pretrained multilingual text encoders based on neural Transformer architectures, such as multilingual BERT (mBERT) and XLM, have achieved strong performance on a myriad of language understanding tasks. Consequently, they have been adopted as a go-to paradigm for multilingual and cross-lingual representation learning and transfer, rendering cross-lingual word embeddings (CLWEs) effectively obsolete…

    Submitted 20 January, 2021; originally announced January 2021.

    Comments: accepted at ECIR'21 (preprint)

    ACM Class: H.3.3; I.2.7

  26. arXiv:2012.11213  [pdf, ps, other]

    cs.IR cs.CL

    Self-Supervised Learning for Visual Summary Identification in Scientific Publications

    Authors: Shintaro Yamamoto, Anne Lauscher, Simone Paolo Ponzetto, Goran Glavaš, Shigeo Morishima

    Abstract: Providing visual summaries of scientific publications can increase information access for readers and thereby help deal with the exponential growth in the number of scientific publications. Nonetheless, efforts in providing visual publication summaries have been few and far between, primarily focusing on the biomedical domain. This is primarily because of the limited availability of annotated gold s…

    Submitted 14 January, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

  27. arXiv:2011.01575  [pdf, ps, other]

    cs.CL

    AraWEAT: Multidimensional Analysis of Biases in Arabic Word Embeddings

    Authors: Anne Lauscher, Rafik Takieddin, Simone Paolo Ponzetto, Goran Glavaš

    Abstract: Recent work has shown that distributional word vector spaces often encode human biases like sexism or racism. In this work, we conduct an extensive analysis of biases in Arabic word embeddings by applying a range of recently introduced bias tests on a variety of embedding spaces induced from corpora in Arabic. We measure the presence of biases across several dimensions, namely: embedding models (S…

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: accepted for WANLP 20

  28. arXiv:2003.06651  [pdf, other]

    cs.CL

    Word Sense Disambiguation for 158 Languages using Word Embeddings Only

    Authors: Varvara Logacheva, Denis Teslenko, Artem Shelmanov, Steffen Remus, Dmitry Ustalov, Andrey Kutuzov, Ekaterina Artemova, Chris Biemann, Simone Paolo Ponzetto, Alexander Panchenko

    Abstract: Disambiguation of word senses in context is easy for humans, but is a major challenge for automatic approaches. Sophisticated supervised and knowledge-based models were developed to solve this task. However, (i) the inherent Zipfian distribution of supervised training instances for a given word and/or (ii) the quality of linguistic knowledge representations motivate the development of completely u…

    Submitted 14 March, 2020; originally announced March 2020.

    Comments: 10 pages, 5 figures, 4 tables, accepted at LREC 2020

  29. arXiv:1910.06592  [pdf, other]

    cs.CL cs.SI

    FacTweet: Profiling Fake News Twitter Accounts

    Authors: Bilal Ghanem, Simone Paolo Ponzetto, Paolo Rosso

    Abstract: We present an approach to detect fake news on Twitter at the account level using a neural recurrent model and a variety of different semantic and stylistic features. Our method extracts a set of features from the timelines of news Twitter accounts by reading their posts as chunks, rather than dealing with each tweet independently. We show the experimental benefits of modeling latent stylistic sign…

    Submitted 15 October, 2019; originally announced October 2019.

    Comments: 6 pages

  30. arXiv:1909.06092  [pdf, other]

    cs.CL cs.AI

    A General Framework for Implicit and Explicit Debiasing of Distributional Word Vector Spaces

    Authors: Anne Lauscher, Goran Glavaš, Simone Paolo Ponzetto, Ivan Vulić

    Abstract: Distributional word vectors have recently been shown to encode many of the human biases, most notably gender and racial biases, and models for attenuating such biases have consequently been proposed. However, existing models and studies (1) operate on under-specified and mutually differing bias definitions, (2) are tailored for a particular bias (e.g., gender bias) and (3) have been evaluated inco…

    Submitted 3 January, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

    Comments: AAAI 2020

  31. arXiv:1906.04836  [pdf, other]

    cs.CL

    Unmasking Bias in News

    Authors: Javier Sánchez-Junquera, Paolo Rosso, Manuel Montes-y-Gómez, Simone Paolo Ponzetto

    Abstract: We present experiments on detecting hyperpartisanship in news using a 'masking' method that allows us to assess the role of style vs. content for the task at hand. Our results corroborate previous research on this task in that topic-related features yield better results than stylistic ones. We additionally show that competitive results can be achieved by simply including higher-length n-grams, whi…

    Submitted 11 June, 2019; originally announced June 2019.

  32. HHMM at SemEval-2019 Task 2: Unsupervised Frame Induction using Contextualized Word Embeddings

    Authors: Saba Anwar, Dmitry Ustalov, Nikolay Arefyev, Simone Paolo Ponzetto, Chris Biemann, Alexander Panchenko

    Abstract: We present our system for semantic frame induction that showed the best performance in Subtask B.1 and finished as the runner-up in Subtask A of the SemEval 2019 Task 2 on unsupervised semantic frame induction (QasemiZadeh et al., 2019). Our approach separates this task into two independent steps: verb clustering using word and their context embeddings and role labeling by combining these embeddin…

    Submitted 5 May, 2019; originally announced May 2019.

    Comments: 5 pages, 3 tables, accepted at SemEval 2019

  33. Knowledge-rich Image Gist Understanding Beyond Literal Meaning

    Authors: Lydia Weiland, Ioana Hulpus, Simone Paolo Ponzetto, Wolfgang Effelsberg, Laura Dietz

    Abstract: We investigate the problem of understanding the message (gist) conveyed by images and their captions as found, for instance, on websites or news articles. To this end, we propose a methodology to capture the meaning of image-caption pairs on the basis of large amounts of machine-readable knowledge that has previously been shown to be highly effective for text understanding. Our method identifies t…

    Submitted 18 April, 2019; originally announced April 2019.

    Journal ref: Data & Knowledge Engineering, Volume 117, September 2018, Pages 114-132

  34. arXiv:1904.06217  [pdf, other]

    cs.CL

    Political Text Scaling Meets Computational Semantics

    Authors: Federico Nanni, Goran Glavas, Ines Rehbein, Simone Paolo Ponzetto, Heiner Stuckenschmidt

    Abstract: During the last fifteen years, automatic text scaling has become one of the key tools of the Text as Data community in political science. Prominent text scaling algorithms, however, rely on the assumption that latent positions can be captured just by leveraging the information about word frequencies in documents under study. We challenge this traditional view and present a new, semantically aware…

    Submitted 14 October, 2021; v1 submitted 12 April, 2019; originally announced April 2019.

    Comments: Updated version - accepted for Transactions on Data Science (TDS)

  35. arXiv:1904.05439  [pdf, other]

    cs.CL cs.DL cs.IR

    Event-based Access to Historical Italian War Memoirs

    Authors: Marco Rovera, Federico Nanni, Simone Paolo Ponzetto

    Abstract: The progressive digitization of historical archives provides new, often domain specific, textual resources that report on facts and events which have happened in the past; among these, memoirs are a very common type of primary source. In this paper, we present an approach for extracting information from Italian historical war memoirs and turning it into structured knowledge. This is based on the s…

    Submitted 24 February, 2021; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: 23 pages, 6 figures

    Journal ref: J. Comput. Cult. Herit. 14, 1, Article 2 (February 2021)

  36. arXiv:1809.08593  [pdf, other]

    cs.MM cs.IR

    Understanding the Gist of Images - Ranking of Concepts for Multimedia Indexing

    Authors: Lydia Weiland, Simone Paolo Ponzetto, Wolfgang Effelsberg, Laura Dietz

    Abstract: Nowadays, where multimedia data is continuously generated, stored, and distributed, multimedia indexing, with its purpose of grouping similar data, becomes more important than ever. Understanding the gist (=message) of multimedia instances is framed in related work as a ranking of concepts from a knowledge base, i.e., Wikipedia. We cast the task of multimedia indexing as a gist understanding pro…

    Submitted 23 September, 2018; originally announced September 2018.

  37. Unsupervised Sense-Aware Hypernymy Extraction

    Authors: Dmitry Ustalov, Alexander Panchenko, Chris Biemann, Simone Paolo Ponzetto

    Abstract: In this paper, we show how unsupervised sense representations can be used to improve hypernymy extraction. We present a method for extracting disambiguated hypernymy relationships that propagates hypernyms to sets of synonyms (synsets), constructs embeddings for these sets, and establishes sense-aware relationships between matching synsets. Evaluation on two gold standard datasets for English and…

    Submitted 17 September, 2018; originally announced September 2018.

    Comments: In Proceedings of the 14th Conference on Natural Language Processing (KONVENS 2018). Vienna, Austria

  38. Watset: Local-Global Graph Clustering with Applications in Sense and Frame Induction

    Authors: Dmitry Ustalov, Alexander Panchenko, Chris Biemann, Simone Paolo Ponzetto

    Abstract: We present a detailed theoretical and computational analysis of the Watset meta-algorithm for fuzzy graph clustering, which has been found to be widely applicable in a variety of domains. This algorithm creates an intermediate representation of the input graph that reflects the "ambiguity" of its nodes. Then, it uses hard clustering to discover clusters in this "disambiguated" intermediate graph.…

    Submitted 19 June, 2019; v1 submitted 20 August, 2018; originally announced August 2018.

    Comments: 58 pages, 17 figures, accepted at the Computational Linguistics journal

    MSC Class: 68T50 ACM Class: I.2.7

    Journal ref: Computational Linguistics 45:3 (2019) 423-479

  39. Unsupervised Semantic Frame Induction using Triclustering

    Authors: Dmitry Ustalov, Alexander Panchenko, Andrei Kutuzov, Chris Biemann, Simone Paolo Ponzetto

    Abstract: We use dependency triples automatically extracted from a Web-scale corpus to perform unsupervised semantic frame induction. We cast the frame induction problem as a triclustering problem that is a generalization of clustering for triadic data. Our replicable benchmarks demonstrate that the proposed graph-based approach, Triframes, shows state-of-the-art results on this task on a FrameNet-derived d…

    Submitted 18 May, 2018; v1 submitted 12 May, 2018; originally announced May 2018.

    Comments: 8 pages, 1 figure, 4 tables, accepted at ACL 2018

  40. arXiv:1805.00879  [pdf, ps, other]

    cs.CL

    Unsupervised Cross-Lingual Information Retrieval using Monolingual Data Only

    Authors: Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto, Ivan Vulić

    Abstract: We propose a fully unsupervised framework for ad-hoc cross-lingual information retrieval (CLIR) which requires no bilingual data at all. The framework leverages shared cross-lingual word embedding spaces in which terms, queries, and documents can be represented, irrespective of their actual language. The shared embedding spaces are induced solely on the basis of monolingual corpora in two language…

    Submitted 2 May, 2018; originally announced May 2018.

    Comments: accepted at SIGIR'18 (preprint)

  41. arXiv:1804.10686  [pdf, other]

    cs.CL

    An Unsupervised Word Sense Disambiguation System for Under-Resourced Languages

    Authors: Dmitry Ustalov, Denis Teslenko, Alexander Panchenko, Mikhail Chernoskutov, Chris Biemann, Simone Paolo Ponzetto

    Abstract: In this paper, we present Watasense, an unsupervised system for word sense disambiguation. Given a sentence, the system chooses the most relevant sense of each input word with respect to the semantic similarity between the given sentence and the synset constituting the sense of the target word. Watasense has two modes of operation. The sparse mode uses the traditional vector space model to estimat…

    Submitted 27 April, 2018; originally announced April 2018.

    Comments: In Proceedings of the 11th Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan

  42. arXiv:1803.05829  [pdf, other]

    cs.CL

    Enriching Frame Representations with Distributionally Induced Senses

    Authors: Stefano Faralli, Alexander Panchenko, Chris Biemann, Simone Paolo Ponzetto

    Abstract: We introduce a new lexical resource that enriches the Framester knowledge graph, which links FrameNet, WordNet, VerbNet and other resources, with semantic features from text corpora. These features are extracted from distributionally induced sense inventories and subsequently linked to the manually-constructed frame representations to boost the performance of frame disambiguation in context. Since…

    Submitted 15 March, 2018; originally announced March 2018.

    Comments: In Proceedings of the 11th Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan. ELRA

  43. arXiv:1801.06436  [pdf, other]

    cs.CL

    A Resource-Light Method for Cross-Lingual Semantic Textual Similarity

    Authors: Goran Glavaš, Marc Franco-Salvador, Simone Paolo Ponzetto, Paolo Rosso

    Abstract: Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named ent…

    Submitted 19 January, 2018; originally announced January 2018.

    Comments: Accepted for publication in Knowledge-Based Systems journal

  44. arXiv:1712.08819  [pdf, other]

    cs.CL

    A Framework for Enriching Lexical Semantic Resources with Distributional Semantics

    Authors: Chris Biemann, Stefano Faralli, Alexander Panchenko, Simone Paolo Ponzetto

    Abstract: We present an approach to combining distributional semantic representations induced from text corpora with manually constructed lexical-semantic networks. While both kinds of semantic resources are available with high lexical coverage, our aligned resource combines the domain specificity and availability of contextual information from distributional models with the conciseness and high quality of…

    Submitted 23 December, 2017; originally announced December 2017.

    Comments: Accepted for publication in the journal of Natural Language Engineering, 2018

  45. arXiv:1711.02918  [pdf, other]

    cs.CL

    Improving Hypernymy Extraction with Distributional Semantic Classes

    Authors: Alexander Panchenko, Dmitry Ustalov, Stefano Faralli, Simone P. Ponzetto, Chris Biemann

    Abstract: In this paper, we show how distributionally-induced semantic classes can be helpful for extracting hypernyms. We present methods for inducing sense-aware semantic classes using distributional semantics and using these induced semantic classes for filtering noisy hypernymy relations. Denoising of hypernyms is performed by labeling each semantic class with its hypernyms. On the one hand, this allows…

    Submitted 28 February, 2018; v1 submitted 8 November, 2017; originally announced November 2017.

    Comments: In Proceedings of the 11th Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan

  46. arXiv:1710.01779  [pdf, other]

    cs.CL

    Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl

    Authors: Alexander Panchenko, Eugen Ruppert, Stefano Faralli, Simone Paolo Ponzetto, Chris Biemann

    Abstract: We present DepCC, the largest-to-date linguistically analyzed corpus in English, comprising 365 million documents with 252 billion tokens and 7.5 billion named entity occurrences in 14.3 billion sentences from a web-scale crawl of the Common Crawl project. The sentences are processed with a dependency parser and with a named entity tagger and contain provenance information, enabl…

    Submitted 28 February, 2018; v1 submitted 4 October, 2017; originally announced October 2017.

    Comments: In Proceedings of the 11th Conference on Language Resources and Evaluation (LREC'2018). Miyazaki, Japan

  47. Unsupervised, Knowledge-Free, and Interpretable Word Sense Disambiguation

    Authors: Alexander Panchenko, Fide Marten, Eugen Ruppert, Stefano Faralli, Dmitry Ustalov, Simone Paolo Ponzetto, Chris Biemann

    Abstract: Interpretability of a predictive model is a powerful feature that gains the trust of users in the correctness of the predictions. In word sense disambiguation (WSD), knowledge-based systems tend to be much more interpretable than knowledge-free counterparts as they rely on the wealth of manually-encoded elements representing word senses, such as hypernyms, usage examples, and images. We present a…

    Submitted 21 July, 2017; originally announced July 2017.

    Comments: In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017). 2017. Copenhagen, Denmark. Association for Computational Linguistics

    ACM Class: I.2.6; I.5.3; I.2.4