Showing 1–19 of 19 results for author: Sirts, K

  1. arXiv:2407.03861

    cs.CL

    TartuNLP @ AXOLOTL-24: Leveraging Classifier Output for New Sense Detection in Lexical Semantics

    Authors: Aleksei Dorkin, Kairit Sirts

    Abstract: We present our submission to the AXOLOTL-24 shared task. The shared task comprises two subtasks: identifying new senses that words gain with time (when comparing newer and older time periods) and producing the definitions for the identified new senses. We implemented a conceptually simple and computationally inexpensive solution to both subtasks. We trained adapter-based binary classification mode…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Accepted to the 5th International Workshop on Computational Approaches to Historical Language Change 2024 (LChange'24)

  2. arXiv:2405.18061

    cs.CL

    Context is Important in Depressive Language: A Study of the Interaction Between the Sentiments and Linguistic Markers in Reddit Discussions

    Authors: Neha Sharma, Kairit Sirts

    Abstract: Research exploring linguistic markers in individuals with depression has demonstrated that language usage can serve as an indicator of mental health. This study investigates the impact of discussion topic as context on linguistic markers and emotional expression in depression, using a Reddit dataset to explore interaction effects. Contrary to common findings, our sentiment analysis revealed a broa…

    Submitted 3 July, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

  3. arXiv:2405.01159

    cs.CL

    TartuNLP at EvaLatin 2024: Emotion Polarity Detection

    Authors: Aleksei Dorkin, Kairit Sirts

    Abstract: This paper presents the TartuNLP team submission to EvaLatin 2024 shared task of the emotion polarity detection for historical Latin texts. Our system relies on two distinct approaches to annotating training data for supervised learning: 1) creating heuristics-based labels by adopting the polarity lexicon provided by the organizers and 2) generating labels with GPT4. We employed parameter efficien…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: Accepted to The Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2024)

  4. arXiv:2404.19430

    cs.CL

    Sõnajaht: Definition Embeddings and Semantic Search for Reverse Dictionary Creation

    Authors: Aleksei Dorkin, Kairit Sirts

    Abstract: We present an information retrieval based reverse dictionary system using modern pre-trained language models and approximate nearest neighbors search algorithms. The proposed approach is applied to an existing Estonian language lexicon resource, Sõnaveeb (word web), with the purpose of enhancing and enriching it by introducing cross-lingual reverse dictionary functionality powered by semantic sear…

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: Accepted to *SEM 2024

  5. arXiv:2404.19359

    cs.CL cs.AI

    Evaluating Lexicon Incorporation for Depression Symptom Estimation

    Authors: Kirill Milintsevich, Gaël Dias, Kairit Sirts

    Abstract: This paper explores the impact of incorporating sentiment, emotion, and domain-specific lexicons into a transformer-based model for depression symptom estimation. Lexicon information is added by marking the words in the input transcripts of patient-therapist conversations as well as in social media posts. Overall results show that the introduction of external knowledge within pre-trained language…

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: Accepted to Clinical NLP workshop at NAACL 2024

  6. arXiv:2404.15003

    cs.CL

    Comparison of Current Approaches to Lemmatization: A Case Study in Estonian

    Authors: Aleksei Dorkin, Kairit Sirts

    Abstract: This study evaluates three different lemmatization approaches to Estonian -- Generative character-level models, Pattern-based word-level classification models, and rule-based morphological analysis. According to our experiments, a significantly smaller Generative model consistently outperforms the Pattern-based classification model based on EstBERT. Additionally, we observe a relatively small over…

    Submitted 23 April, 2024; originally announced April 2024.

    Comments: 6 pages, 2 figures

    Journal ref: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pp. 280-285, May 2023

  7. arXiv:2404.12845

    cs.CL

    TartuNLP @ SIGTYP 2024 Shared Task: Adapting XLM-RoBERTa for Ancient and Historical Languages

    Authors: Aleksei Dorkin, Kairit Sirts

    Abstract: We present our submission to the unconstrained subtask of the SIGTYP 2024 Shared Task on Word Embedding Evaluation for Ancient and Historical Languages for morphological annotation, POS-tagging, lemmatization, character- and word-level gap-filling. We developed a simple, uniform, and computationally lightweight approach based on the adapters framework using parameter-efficient fine-tuning. We appl…

    Submitted 19 April, 2024; originally announced April 2024.

    Comments: 11 pages, 3 figures

    Journal ref: Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pp. 120-130, March 2024

  8. arXiv:2403.00438

    cs.CL

    Your Model Is Not Predicting Depression Well And That Is Why: A Case Study of PRIMATE Dataset

    Authors: Kirill Milintsevich, Kairit Sirts, Gaël Dias

    Abstract: This paper addresses the quality of annotations in mental health datasets used for NLP-based depression level estimation from social media texts. While previous research relies on social media-based datasets annotated with binary categories, i.e. depressed or non-depressed, recent datasets such as D2S and PRIMATE aim for nuanced annotations using PHQ-9 symptoms. However, most of these datasets rel…

    Submitted 1 March, 2024; originally announced March 2024.

  9. Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources

    Authors: Kirill Milintsevich, Kairit Sirts

    Abstract: We propose a novel hybrid approach to lemmatization that enhances the seq2seq neural model with additional lemmas extracted from an external lexicon or a rule-based system. During training, the enhanced lemmatizer learns both to generate lemmas via a sequential decoder and copy the lemma characters from the external candidates supplied during run-time. Our lemmatizer enhanced with candidates extra…

    Submitted 28 January, 2021; originally announced January 2021.

  10. Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts

    Authors: Kairit Sirts, Kairit Peekman

    Abstract: Texts obtained from web are noisy and do not necessarily follow the orthographic sentence and word boundary rules. Thus, sentence segmentation and word tokenization systems that have been developed on well-formed texts might not perform so well on unedited web texts. In this paper, we first describe the manual annotation of sentence boundaries of an Estonian web dataset and then present the evalua…

    Submitted 16 November, 2020; originally announced November 2020.

    Comments: BalticHLT2020

  11. arXiv:2011.04784

    cs.CL

    EstBERT: A Pretrained Language-Specific BERT for Estonian

    Authors: Hasan Tanvir, Claudia Kittask, Sandra Eiche, Kairit Sirts

    Abstract: This paper presents EstBERT, a large pretrained transformer-based language-specific BERT model for Estonian. Recent work has evaluated multilingual BERT models on Estonian tasks and found them to outperform the baselines. Still, based on existing studies on other languages, a language-specific BERT model is expected to improve over the multilingual ones. We first describe the EstBERT pretraining p…

    Submitted 28 April, 2021; v1 submitted 9 November, 2020; originally announced November 2020.

    Comments: NoDaLiDa 2021

  12. arXiv:2010.00454

    cs.CL

    Evaluating Multilingual BERT for Estonian

    Authors: Claudia Kittask, Kirill Milintsevich, Kairit Sirts

    Abstract: Recently, large pre-trained language models, such as BERT, have reached state-of-the-art performance in many natural language processing tasks, but for many languages, including Estonian, BERT models are not yet available. However, there exist several multilingual BERT models that can handle multiple languages simultaneously and that have been trained also on Estonian data. In this paper, we evalu…

    Submitted 8 January, 2021; v1 submitted 1 October, 2020; originally announced October 2020.

    Comments: V1: Baltic HLT 2020 V2: Changed NER baseline results

  13. arXiv:1810.08815

    cs.CL

    Modeling Composite Labels for Neural Morphological Tagging

    Authors: Alexander Tkachenko, Kairit Sirts

    Abstract: Neural morphological tagging has been regarded as an extension to POS tagging task, treating each morphological tag as a monolithic label and ignoring its internal structure. We propose to view morphological tags as composite labels and explicitly model their internal structure in a neural sequence tagger. For this, we explore three different neural architectures and compare their performance with…

    Submitted 20 October, 2018; originally announced October 2018.

    Comments: Proceedings of the 22nd Conference on Computational Natural Language Learning, 2018

  14. arXiv:1810.06908

    cs.CL

    Neural Morphological Tagging for Estonian

    Authors: Alexander Tkachenko, Kairit Sirts

    Abstract: We develop neural morphological tagging and disambiguation models for Estonian. First, we experiment with two neural architectures for morphological tagging - a standard multiclass classifier which treats each morphological tag as a single unit, and a sequence model which handles the morphological tags as sequences of morphological category values. Secondly, we complement these models with the ana…

    Submitted 16 October, 2018; originally announced October 2018.

    Journal ref: Proceedings of the Eighth International Conference Baltic HLT 2018

  15. arXiv:1810.05187

    cs.IR cs.LG stat.ML

    The Impact of Annotation Guidelines and Annotated Data on Extracting App Features from App Reviews

    Authors: Faiz Ali Shah, Kairit Sirts, Dietmar Pfahl

    Abstract: Annotation guidelines used to guide the annotation of training and evaluation datasets can have a considerable impact on the quality of machine learning models. In this study, we explore the effects of annotation guidelines on the quality of app feature extraction models. As a main result, we propose several changes to the existing annotation guidelines with a goal of making the extracted app feat…

    Submitted 11 October, 2018; originally announced October 2018.

  16. arXiv:1706.04473

    cs.CL

    Idea density for predicting Alzheimer's disease from transcribed speech

    Authors: Kairit Sirts, Olivier Piguet, Mark Johnson

    Abstract: Idea Density (ID) measures the rate at which ideas or elementary predications are expressed in an utterance or in a text. Lower ID is found to be associated with an increased risk of developing Alzheimer's disease (AD) (Snowdon et al., 1996; Engelman et al., 2010). ID has been used in two different versions: propositional idea density (PID) counts the expressed ideas and can be applied to any text…

    Submitted 14 June, 2017; originally announced June 2017.

    Comments: CoNLL 2017

  17. arXiv:1704.01419

    cs.CL

    Linear Ensembles of Word Embedding Models

    Authors: Avo Muromägi, Kairit Sirts, Sven Laur

    Abstract: This paper explores linear methods for combining several word embedding models into an ensemble. We construct the combined models using an iterative method based on either ordinary least squares regression or the solution to the orthogonal Procrustes problem. We evaluate the proposed approaches on Estonian, a morphologically complex language, for which the available corpora for training word em…

    Submitted 5 April, 2017; originally announced April 2017.

    Comments: Nodalida 2017

  18. STransE: a novel embedding model of entities and relationships in knowledge bases

    Authors: Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, Mark Johnson

    Abstract: Knowledge bases of real-world facts about entities and their relationships are useful resources for a variety of natural language processing tasks. However, because knowledge bases are typically incomplete, it is useful to be able to perform link prediction or knowledge base completion, i.e., predict whether a relationship not in the knowledge base is likely to be true. This paper combines insight…

    Submitted 8 March, 2017; v1 submitted 27 June, 2016; originally announced June 2016.

    Comments: V1: In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2016. V2: Corrected citation to (Krompaß et al., 2015). V3: A revised version of our NAACL-HLT 2016 paper with additional experimental results and latest related work

  19. Neighborhood Mixture Model for Knowledge Base Completion

    Authors: Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, Mark Johnson

    Abstract: Knowledge bases are useful resources for many natural language processing tasks; however, they are far from complete. In this paper, we define a novel entity representation as a mixture of its neighborhood in the knowledge base and apply this technique on TransE, a well-known embedding model for knowledge base completion. Experimental results show that the neighborhood information significantly hel…

    Submitted 9 March, 2017; v1 submitted 21 June, 2016; originally announced June 2016.

    Comments: V1: In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016. V2: Corrected citation to (Krompaß et al., 2015). V3: A revised version of our CoNLL 2016 paper to update latest related work