Showing 1–9 of 9 results for author: Erjavec, T

  1. arXiv:2405.07363  [pdf, ps, other]

    cs.CL

    Multilingual Power and Ideology Identification in the Parliament: a Reference Dataset and Simple Baselines

    Authors: Çağrı Çöltekin, Matyáš Kopp, Katja Meden, Vaidas Morkevicius, Nikola Ljubešić, Tomaž Erjavec

    Abstract: We introduce a dataset on political orientation and power position identification. The dataset is derived from ParlaMint, a set of comparable corpora of transcribed parliamentary speeches from 29 national and regional parliaments. We introduce the dataset, provide the reasoning behind some of the choices during its creation, present statistics on the dataset, and, using a simple classifier, some b…

    Submitted 12 May, 2024; originally announced May 2024.

  2. arXiv:2211.02429  [pdf, other]

    cs.CL

    Dealing with Abbreviations in the Slovenian Biographical Lexicon

    Authors: Angel Daza, Antske Fokkens, Tomaž Erjavec

    Abstract: Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors. They can also make the text less readable, especially in reference printed books, where they are extensively used. Abbreviations are especially problematic in low-resource settings, where systems are less robust to begin with. In this paper, we propose a new method for addres…

    Submitted 4 November, 2022; originally announced November 2022.

    Comments: To be presented at The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022)

  3. MULTEXT-East

    Authors: Tomaž Erjavec

    Abstract: MULTEXT-East language resources, a multilingual dataset for language engineering research, focused on the morphosyntactic level of linguistic description. The MULTEXT-East dataset includes the EAGLES-based morphosyntactic specifications, morphosyntactic lexicons, and annotated multilingual corpora. The parallel corpus, the novel "1984" by George Orwell, is sentence aligned and contains hand-val…

    Submitted 31 March, 2020; originally announced March 2020.

    ACM Class: I.2.7

    Journal ref: Published in: Nancy Ide, James Pustejovsky, eds. 2007. Handbook of linguistic annotation. pp. 441-462. Springer

  4. arXiv:1906.02053  [pdf, other]

    cs.CL

    KAS-term: Extracting Slovene Terms from Doctoral Theses via Supervised Machine Learning

    Authors: Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

    Abstract: This paper presents a dataset and supervised learning experiments for term extraction from Slovene academic texts. Term candidates in the dataset were extracted via morphosyntactic patterns and annotated for their termness by four annotators. Experiments on the dataset show that most co-occurrence statistics, applied after morphosyntactic patterns and a frequency threshold, perform close to random…

    Submitted 5 June, 2019; originally announced June 2019.

  5. arXiv:1906.02045  [pdf, other]

    cs.CL

    The FRENK Datasets of Socially Unacceptable Discourse in Slovene and English

    Authors: Nikola Ljubešić, Darja Fišer, Tomaž Erjavec

    Abstract: In this paper we present datasets of Facebook comment threads to mainstream media posts in Slovene and English, developed inside the Slovene national project FRENK, which cover two topics, migrants and LGBT, and are manually annotated for different types of socially unacceptable discourse (SUD). The main advantages of these datasets compared to the existing ones are identical sampling procedures, pr…

    Submitted 13 June, 2019; v1 submitted 5 June, 2019; originally announced June 2019.

  6. arXiv:1602.05753  [pdf]

    cs.CL cs.HC

    Overview of Annotation Creation: Processes & Tools

    Authors: Mark A. Finlayson, Tomaž Erjavec

    Abstract: Creating linguistic annotations requires more than just a reliable annotation scheme. Annotation can be a complex endeavour potentially involving many people, stages, and tools. This chapter outlines the process of creating end-to-end linguistic annotations, identifying specific tasks that researchers often perform. Because tool support is so central to achieving high quality, reusable annotations…

    Submitted 18 February, 2016; originally announced February 2016.

    Comments: To appear in: James Pustejovsky and Nancy Ide (eds.) "Handbook of Linguistic Annotation." 2016. New York: Springer

  7. arXiv:0909.2718  [pdf]

    cs.CL

    A Common XML-based Framework for Syntactic Annotations

    Authors: Nancy Ide, Laurent Romary, Tomaz Erjavec

    Abstract: It is widely recognized that the proliferation of annotation schemes runs counter to the need to re-use language resources, and that standards for linguistic annotation are becoming increasingly mandatory. To answer this need, we have developed a framework comprised of an abstract model for a variety of different annotation types (e.g., morpho-syntactic tagging, syntactic annotation, co-referenc…

    Submitted 15 September, 2009; originally announced September 2009.

    Comments: International conference with proceedings and peer-review committee.

    Report number: A01-R-289 || ide01d

    Journal ref: 1st NLP and XML Workshop, Tokyo, Japan (2001)

  8. arXiv:cs/0609067  [pdf]

    cs.CL cs.IR

    A tool set for the quick and efficient exploration of large document collections

    Authors: Camelia Ignat, Bruno Pouliquen, Ralf Steinberger, Tomaz Erjavec

    Abstract: We present a set of multilingual text analysis tools that can help analysts in any field to explore large document collections quickly in order to determine whether the documents contain information of interest, and to find the relevant text passages. The automatic tool, which currently exists as a fully functional prototype, is expected to be particularly useful when users repeatedly hav…

    Submitted 12 September, 2006; originally announced September 2006.

    Comments: 10 pages

    ACM Class: H.3.1; H.3.3; H.3.4

    Journal ref: Proceedings of the Symposium on Safeguards and Nuclear Material Management. 27th Annual Meeting of the European SAfeguards Research and Development Association (ESARDA-2005). London, UK, 10-12 May 2005

  9. arXiv:cs/0609058  [pdf]

    cs.CL

    The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages

    Authors: Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaz Erjavec, Dan Tufis, Daniel Varga

    Abstract: We present a new, unique and freely available parallel corpus containing European Union (EU) documents of mostly legal nature. It is available in all 20 official EU languages, with additional documents being available in the languages of the EU candidate countries. The corpus consists of almost 8,000 documents per language, with an average size of nearly 9 million words per language. Pair-wise par…

    Submitted 12 September, 2006; originally announced September 2006.

    Comments: A multilingual textual resource with meta-data freely available for download at http://langtech.jrc.it/JRC-Acquis.html

    ACM Class: H.3.1; H.3.6

    Journal ref: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), pp. 2142-2147. Genoa, Italy, 24-26 May 2006