Showing 1–45 of 45 results for author: Artemova, E

Searching in archive cs.
  1. arXiv:2408.04284

    cs.CL

    LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection

    Authors: Mervat Abassy, Kareem Elozeiri, Alexander Aziz, Minh Ngoc Ta, Raj Vardhan Tomar, Bimarsha Adhikari, Saad El Dine Ahmed, Yuxia Wang, Osama Mohammed Afzal, Zhuohan Xie, Jonibek Mansurov, Ekaterina Artemova, Vladislav Mikhailov, Rui Xing, Jiahui Geng, Hasan Iqbal, Zain Muhammad Mujahid, Tarek Mahmoud, Akim Tsvigun, Alham Fikri Aji, Artem Shelmanov, Nizar Habash, Iryna Gurevych, Preslav Nakov

    Abstract: The widespread accessibility of large language models (LLMs) to the general public has significantly amplified the dissemination of machine-generated texts (MGTs). Advancements in prompt manipulation have exacerbated the difficulty in discerning the origin of a text (human-authored vs machine-generated). This raises concerns regarding the potential misuse of MGTs, particularly within educational an…

    Submitted 8 August, 2024; originally announced August 2024.

  2. arXiv:2407.17629

    cs.CL

    Papilusion at DAGPap24: Paper or Illusion? Detecting AI-generated Scientific Papers

    Authors: Nikita Andreev, Alexander Shirnin, Vladislav Mikhailov, Ekaterina Artemova

    Abstract: This paper presents Papilusion, an AI-generated scientific text detector developed within the DAGPap24 shared task on detecting automatically generated scientific papers. We propose an ensemble-based approach and conduct ablation studies to analyze the effect of the detector configurations on the performance. Papilusion is ranked 6th on the leaderboard, and we improve our performance after the com…

    Submitted 30 July, 2024; v1 submitted 24 July, 2024; originally announced July 2024.

    Comments: to appear in "The 4th Workshop on Scholarly Document Processing @ ACL 2024" proceedings

  3. arXiv:2406.19232

    cs.CL

    RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs

    Authors: Ekaterina Taktasheva, Maxim Bazhukov, Kirill Koncha, Alena Fenogenova, Ekaterina Artemova, Vladislav Mikhailov

    Abstract: Minimal pairs are a well-established approach to evaluating the grammatical knowledge of language models. However, existing resources for minimal pairs address a limited number of languages and lack diversity of language-specific grammatical phenomena. This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that differ in grammatical…

    Submitted 28 June, 2024; v1 submitted 27 June, 2024; originally announced June 2024.

  4. arXiv:2403.19354

    cs.CL

    AIpom at SemEval-2024 Task 8: Detecting AI-produced Outputs in M4

    Authors: Alexander Shirnin, Nikita Andreev, Vladislav Mikhailov, Ekaterina Artemova

    Abstract: This paper describes AIpom, a system designed to detect a boundary between human-written and machine-generated text (SemEval-2024 Task 8, Subtask C: Human-Machine Mixed Text Detection). We propose a two-stage pipeline combining predictions from an instruction-tuned decoder-only model and encoder-only sequence taggers. AIpom is ranked second on the leaderboard while achieving a Mean Absolute Error…

    Submitted 28 March, 2024; originally announced March 2024.

    Comments: 2nd place at SemEval-2024 Task 8, Subtask C, to appear in SemEval-2024 proceedings

  5. arXiv:2403.17553

    cs.CL

    RuBia: A Russian Language Bias Detection Dataset

    Authors: Veronika Grigoreva, Anastasiia Ivanova, Ilseyar Alimova, Ekaterina Artemova

    Abstract: Warning: this work contains upsetting or disturbing content. Large language models (LLMs) tend to learn the social and cultural biases present in the raw pre-training data. To test if an LLM's behavior is fair, functional datasets are employed, and due to their purpose, these datasets are highly language and culture-specific. In this paper, we address a gap in the scope of multilingual bias eval…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: accepted to LREC-COLING 2024

  6. arXiv:2403.12749

    cs.CL

    Sebastian, Basti, Wastl?! Recognizing Named Entities in Bavarian Dialectal Data

    Authors: Siyao Peng, Zihang Sun, Huangyan Shan, Marie Kolm, Verena Blaschke, Ekaterina Artemova, Barbara Plank

    Abstract: Named Entity Recognition (NER) is a fundamental task to extract key information from texts, but annotated resources are scarce for dialects. This paper introduces the first dialectal NER dataset for German, BarNER, with 161K tokens annotated on Bavarian Wikipedia articles (bar-wiki) and tweets (bar-tweet), using a schema adapted from German CoNLL 2006 and GermEval. The Bavarian dialect differs fro…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  7. arXiv:2402.02078

    cs.CL

    Exploring the Robustness of Task-oriented Dialogue Systems for Colloquial German Varieties

    Authors: Ekaterina Artemova, Verena Blaschke, Barbara Plank

    Abstract: Mainstream cross-lingual task-oriented dialogue (ToD) systems leverage the transfer learning paradigm by training a joint model for intent recognition and slot-filling in English and applying it, zero-shot, to other languages. We address a gap in prior research, which often overlooked the transfer to lower-resource colloquial varieties due to limited test data. Inspired by prior work on English va…

    Submitted 3 February, 2024; originally announced February 2024.

    Comments: To appear in EACL 2024 (main)

  8. arXiv:2401.04522

    cs.CL

    LUNA: A Framework for Language Understanding and Naturalness Assessment

    Authors: Marat Saidov, Aleksandra Bakalova, Ekaterina Taktasheva, Vladislav Mikhailov, Ekaterina Artemova

    Abstract: The evaluation of Natural Language Generation (NLG) models has gained increased attention, urging the development of metrics that evaluate various aspects of generated text. LUNA addresses this challenge by introducing a unified interface for 20 NLG evaluation metrics. These metrics are categorized based on their reference-dependence and the type of text representation they employ, from string-bas…

    Submitted 9 January, 2024; originally announced January 2024.

  9. arXiv:2309.01669

    cs.CL

    Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?

    Authors: Leon Weber-Genzel, Robert Litschko, Ekaterina Artemova, Barbara Plank

    Abstract: Instruction tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality problems in gold standard labels. So far, however, the application of AED methods has been limited to classification tasks. It is an…

    Submitted 22 February, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

    Comments: Camera ready version for LAW-XVIII

  10. arXiv:2305.05295

    cs.CL cs.IR

    Boosting Zero-shot Cross-lingual Retrieval by Training on Artificially Code-Switched Data

    Authors: Robert Litschko, Ekaterina Artemova, Barbara Plank

    Abstract: Transferring information retrieval (IR) models from a high-resource language (typically English) to other languages in a zero-shot fashion has become a widely adopted approach. In this work, we show that the effectiveness of zero-shot rankers diminishes when queries and documents are present in different languages. Motivated by this, we propose to train ranking models on artificially code-switched…

    Submitted 26 May, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

    Comments: Accepted to Findings of ACL 2023

  11. arXiv:2304.09957

    cs.CL

    Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

    Authors: Ekaterina Artemova, Barbara Plank

    Abstract: Bilingual word lexicons are crucial tools for multilingual natural language understanding and machine translation tasks, as they facilitate the mapping of words in one language to their synonyms in another language. To achieve this, numerous papers have explored bilingual lexicon induction (BLI) in high-resource scenarios, using a typical pipeline consisting of two unsupervised steps: bitext minin…

    Submitted 19 April, 2023; originally announced April 2023.

    Comments: Accepted to NoDaLiDa 2023

  12. Can BERT eat RuCoLA? Topological Data Analysis to Explain

    Authors: Irina Proskurina, Irina Piontkovskaya, Ekaterina Artemova

    Abstract: This paper investigates how Transformer language models (LMs) fine-tuned for acceptability classification capture linguistic features. Our approach uses the best practices of topological data analysis (TDA) in NLP: we construct directed attention graphs from attention matrices, derive topological features from them, and feed them to linear classifiers. We introduce two novel features, chordality,…

    Submitted 4 April, 2023; originally announced April 2023.

    Comments: Accepted to the Workshop on Slavic NLP @ EACL 2023

  13. RuCoLA: Russian Corpus of Linguistic Acceptability

    Authors: Vladislav Mikhailov, Tatiana Shamardina, Max Ryabinin, Alena Pestova, Ivan Smurov, Ekaterina Artemova

    Abstract: Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Lin…

    Submitted 23 October, 2022; originally announced October 2022.

    Comments: Accepted to the EMNLP 2022 main conference

  14. TAPE: Assessing Few-shot Russian Language Understanding

    Authors: Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, Ekaterina Artemova, Vladislav Mikhailov

    Abstract: Recent advances in zero-shot and few-shot learning have shown promise for a scope of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this line of research, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six mo…

    Submitted 23 October, 2022; originally announced October 2022.

    Comments: Accepted to EMNLP 2022 Findings

  15. Vote'n'Rank: Revision of Benchmarking with Social Choice Theory

    Authors: Mark Rofin, Vladislav Mikhailov, Mikhail Florinskiy, Andrey Kravchenko, Elena Tutubalina, Tatiana Shavrina, Daniel Karabekyan, Ekaterina Artemova

    Abstract: The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular in…

    Submitted 12 February, 2023; v1 submitted 11 October, 2022; originally announced October 2022.

    Comments: To appear in EACL 2023 (main)

  16. arXiv:2206.10914

    cs.CL

    Template-based Approach to Zero-shot Intent Recognition

    Authors: Dmitry Lamanov, Pavel Burnyshev, Ekaterina Artemova, Valentin Malykh, Andrey Bout, Irina Piontkovskaya

    Abstract: The recent advances in transfer learning techniques and pre-training of large contextualized encoders foster innovation in real-life applications, including dialog assistants. Practical needs of intent recognition require effective data usage and the ability to constantly update supported intents, adopting new ones, and abandoning outdated ones. In particular, the generalized zero-shot paradigm, i…

    Submitted 22 June, 2022; originally announced June 2022.

    Comments: accepted to INLG 2022

  17. Findings of the The RuATD Shared Task 2022 on Artificial Text Detection in Russian

    Authors: Tatiana Shamardina, Vladislav Mikhailov, Daniil Chernianskii, Alena Fenogenova, Marat Saidov, Anastasiya Valeeva, Tatiana Shavrina, Ivan Smurov, Elena Tutubalina, Ekaterina Artemova

    Abstract: We present the shared task on artificial text detection in Russian, which is organized as a part of the Dialogue Evaluation initiative, held in 2022. The shared task dataset includes texts from 14 text generators, i.e., one human writer and 13 text generative models fine-tuned for one or more of the following generation tasks: machine translation, paraphrase generation, text summarization, text si…

    Submitted 3 June, 2022; originally announced June 2022.

    Comments: Accepted to Dialogue-22

  18. RuNNE-2022 Shared Task: Recognizing Nested Named Entities

    Authors: Ekaterina Artemova, Maxim Zmeev, Natalia Loukachevitch, Igor Rozhkov, Tatiana Batura, Vladimir Ivanov, Elena Tutubalina

    Abstract: The RuNNE Shared Task approaches the problem of nested named entity recognition. The annotation schema is designed in such a way that an entity may partially overlap or even be nested into another entity. This way, the named entity "The Yermolova Theatre" of type "organization" houses another entity "Yermolova" of type "person". We adopt the Russian NEREL dataset for the RuNNE Shared Task. NEREL…

    Submitted 23 May, 2022; originally announced May 2022.

    Comments: To appear in Dialogue 2022

  19. arXiv:2205.09630

    cs.CL cs.AI cs.LG math.AT

    Acceptability Judgements via Examining the Topology of Attention Maps

    Authors: Daniil Cherniavskii, Eduard Tulchinskii, Vladislav Mikhailov, Irina Proskurina, Laida Kushnareva, Ekaterina Artemova, Serguei Barannikov, Irina Piontkovskaya, Dmitri Piontkovski, Evgeny Burnaev

    Abstract: The role of the attention mechanism in encoding linguistic knowledge has received special interest in NLP. However, the ability of the attention heads to judge the grammatical acceptability of a sentence has been underexplored. This paper approaches the paradigm of acceptability judgments with topological data analysis (TDA), showing that the geometric properties of the attention graph can be effi…

    Submitted 23 October, 2022; v1 submitted 19 May, 2022; originally announced May 2022.

    Comments: Accepted to EMNLP 2022 Findings

    Journal ref: Findings of the Association for Computational Linguistics: EMNLP 2022, 88-107

  20. arXiv:2202.07791

    cs.CL cs.AI

    Russian SuperGLUE 1.1: Revising the Lessons not Learned by Russian NLP models

    Authors: Alena Fenogenova, Maria Tikhonova, Vladislav Mikhailov, Tatiana Shavrina, Anton Emelyanov, Denis Shevelev, Alexandr Kukushkin, Valentin Malykh, Ekaterina Artemova

    Abstract: In the last year, new neural architectures and multilingual pre-trained models have been released for Russian, which led to performance evaluation problems across a range of language understanding tasks. This paper presents Russian SuperGLUE 1.1, an updated benchmark styled after GLUE for Russian NLP models. The new version includes a number of technical, user experience and methodological impro…

    Submitted 15 February, 2022; originally announced February 2022.

    Comments: Computational Linguistics and Intellectual Technologies Papers from the Annual International Conference "Dialogue" (2021) Issue 20

    MSC Class: 68-06; 68T50; 68T01 ACM Class: G.3; I.2.7

  21. arXiv:2201.09997

    cs.CL

    Razmecheno: Named Entity Recognition from Digital Archive of Diaries "Prozhito"

    Authors: Timofey Atnashev, Veronika Ganeeva, Roman Kazakov, Daria Matyash, Michael Sonkin, Ekaterina Voloshina, Oleg Serikov, Ekaterina Artemova

    Abstract: The vast majority of existing datasets for Named Entity Recognition (NER) are built primarily on news, research papers and Wikipedia with a few exceptions, created from historical and literary texts. What is more, English is the main source for data for further labelling. This paper aims to fill in multiple gaps by creating a novel dataset "Razmecheno", gathered from the diary texts of the project…

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: Submitted to LREC 2022

  22. arXiv:2109.14350

    cs.CL

    Call Larisa Ivanovna: Code-Switching Fools Multilingual NLU Models

    Authors: Alexey Birshert, Ekaterina Artemova

    Abstract: Practical needs of developing task-oriented dialogue assistants require the ability to understand many languages. Novel benchmarks for multilingual natural language understanding (NLU) include monolingual sentences in several languages, annotated with intents and slots. In such a setup, models for cross-lingual transfer show remarkable performance in joint intent recognition and slot filling. However…

    Submitted 20 November, 2021; v1 submitted 29 September, 2021; originally announced September 2021.

    Comments: accepted to AIST 2021

  23. Shaking Syntactic Trees on the Sesame Street: Multilingual Probing with Controllable Perturbations

    Authors: Ekaterina Taktasheva, Vladislav Mikhailov, Ekaterina Artemova

    Abstract: Recent research has adopted a new experimental field centered around the concept of text perturbations which has revealed that shuffled word order has little to no impact on the downstream performance of Transformer-based language models across many NLP tasks. These findings contradict the common understanding of how the models encode hierarchical and structural information and even question if th…

    Submitted 28 September, 2021; originally announced September 2021.

    Comments: accepted to MRL @ EMNLP 2021

  24. Artificial Text Detection via Examining the Topology of Attention Maps

    Authors: Laida Kushnareva, Daniil Cherniavskii, Vladislav Mikhailov, Ekaterina Artemova, Serguei Barannikov, Alexander Bernstein, Irina Piontkovskaya, Dmitri Piontkovski, Evgeny Burnaev

    Abstract: The impressive capabilities of recent generative models to create texts that are challenging to distinguish from the human-written ones can be misused for generating fake news, product reviews, and even abusive content. Despite the prominent performance of existing methods for artificial text detection, they still lack interpretability and robustness towards unseen models. To this end, we propose…

    Submitted 28 April, 2022; v1 submitted 10 September, 2021; originally announced September 2021.

    Comments: Accepted to EMNLP 2021

    Journal ref: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 635-649

  25. arXiv:2108.13112

    cs.CL

    NEREL: A Russian Dataset with Nested Named Entities, Relations and Events

    Authors: Natalia Loukachevitch, Ekaterina Artemova, Tatiana Batura, Pavel Braslavski, Ilia Denisov, Vladimir Ivanov, Suresh Manandhar, Alexander Pugachev, Elena Tutubalina

    Abstract: In this paper, we present NEREL, a Russian dataset for named entity recognition and relation extraction. NEREL is significantly larger than existing Russian datasets: to date it contains 56K annotated named entities and 39K annotated relations. Its important difference from previous datasets is annotation of nested named entities, as well as relations within nested entities and at the discourse le…

    Submitted 3 September, 2021; v1 submitted 30 August, 2021; originally announced August 2021.

    Comments: accepted to RANLP

  26. arXiv:2108.06991

    cs.CL

    A Single Example Can Improve Zero-Shot Data Generation

    Authors: Pavel Burnyshev, Valentin Malykh, Andrey Bout, Ekaterina Artemova, Irina Piontkovskaya

    Abstract: Sub-tasks of intent classification, such as robustness to distribution shift, adaptation to specific user groups and personalization, out-of-domain detection, require extensive and flexible datasets for experiments and evaluation. As collecting such datasets is time- and labor-consuming, we propose to use text generation methods to gather datasets. The generator should be trained to generate utter…

    Submitted 16 August, 2021; originally announced August 2021.

    Comments: To appear in INLG2021 proceedings

  27. arXiv:2107.11275

    cs.CL cs.LG

    A Differentiable Language Model Adversarial Attack on Text Classifiers

    Authors: Ivan Fursov, Alexey Zaytsev, Pavel Burnyshev, Ekaterina Dmitrieva, Nikita Klyuchnikov, Andrey Kravchenko, Ekaterina Artemova, Evgeny Burnaev

    Abstract: Robustness of huge Transformer-based models for natural language processing is an important issue due to their capabilities and wide adoption. One way to understand and improve robustness of these models is an exploration of an adversarial attack scenario: check if a small perturbation of an input can fool a model. Due to the discrete nature of textual data, gradient-based adversarial methods, w…

    Submitted 23 July, 2021; originally announced July 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2006.11078

  28. arXiv:2104.14314

    cs.CL

    MOROCCO: Model Resource Comparison Framework

    Authors: Valentin Malykh, Alexander Kukushkin, Ekaterina Artemova, Vladislav Mikhailov, Maria Tikhonova, Tatiana Shavrina

    Abstract: The new generation of pre-trained NLP models pushes the SOTA to new limits, but at the cost of computational resources, to the point that their use in real production environments is often prohibitively expensive. We tackle this problem by evaluating not only the standard quality metrics on downstream tasks but also the memory footprint and inference time. We present MOROCCO, a framework to comp…

    Submitted 29 April, 2021; originally announced April 2021.

  29. arXiv:2104.12847

    cs.CL

    Morph Call: Probing Morphosyntactic Content of Multilingual Transformers

    Authors: Vladislav Mikhailov, Oleg Serikov, Ekaterina Artemova

    Abstract: The outstanding performance of transformer-based language models on a great variety of NLP and NLU tasks has stimulated interest in exploring their inner workings. Recent research has focused primarily on higher-level and complex linguistic phenomena such as syntax, semantics, world knowledge, and common sense. The majority of the studies are anglocentric, and little remains known regarding other…

    Submitted 4 May, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

    Comments: To appear in the Proceedings of the 3rd Workshop on Research in Computational Typology and Multilingual NLP (SIGTYP, NAACL)

  30. Teaching a Massive Open Online Course on Natural Language Processing

    Authors: Ekaterina Artemova, Murat Apishev, Veronika Sarkisyan, Sergey Aksenov, Denis Kirjanov, Oleg Serikov

    Abstract: This paper presents a new Massive Open Online Course on Natural Language Processing, targeted at non-English speaking students. The course lasts 12 weeks; every week consists of lectures, practical sessions, and quiz assignments. Three weeks out of 12 are followed by Kaggle-style coding assignments. Our course intends to serve multiple purposes: (i) familiarize students with the core concepts an…

    Submitted 4 May, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

    Comments: To appear in the Proceedings of the Fifth Workshop on Teaching NLP @ NAACL

  31. arXiv:2103.00573

    cs.CL

    RuSentEval: Linguistic Source, Encoder Force!

    Authors: Vladislav Mikhailov, Ekaterina Taktasheva, Elina Sigdel, Ekaterina Artemova

    Abstract: The success of pre-trained transformer language models has brought a great deal of interest in how these models work, and what they learn about language. However, prior research in the field is mainly devoted to English, and little is known regarding other languages. To this end, we introduce RuSentEval, an enhanced set of 14 probing tasks for Russian, including ones that have not been explored ye…

    Submitted 2 March, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

    Comments: The paper is accepted to BSNLP workshop at EACL 2021. The title follows Power Rangers Mystic Force series (Roll Call Team-Morph: "Magical Source, Mystic Force!")

  32. arXiv:2101.08133

    cs.CL

    Active Learning for Sequence Tagging with Deep Pre-trained Models and Bayesian Uncertainty Estimates

    Authors: Artem Shelmanov, Dmitri Puzyrev, Lyubov Kupriyanova, Denis Belyakov, Daniil Larionov, Nikita Khromov, Olga Kozlova, Ekaterina Artemova, Dmitry V. Dylov, Alexander Panchenko

    Abstract: Annotating training data for sequence tagging of texts is usually very time-consuming. Recent advances in transfer learning for natural language processing in conjunction with active learning open the possibility to significantly reduce the necessary annotation budget. We are the first to thoroughly investigate this powerful combination for the sequence tagging task. We conduct an extensive empiri…

    Submitted 18 February, 2021; v1 submitted 20 January, 2021; originally announced January 2021.

    Comments: In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2021)

  33. arXiv:2101.03778

    cs.CL cs.LG

    Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection

    Authors: Alexander Podolskiy, Dmitry Lipin, Andrey Bout, Ekaterina Artemova, Irina Piontkovskaya

    Abstract: Real-life applications, heavily relying on machine learning, such as dialog systems, demand out-of-domain detection methods. Intent classification models should be equipped with a mechanism to distinguish seen intents from unseen ones so that the dialog agent is capable of rejecting the latter and avoiding undesired behavior. However, despite increasing attention paid to the task, the best practic…

    Submitted 23 May, 2022; v1 submitted 11 January, 2021; originally announced January 2021.

    Comments: AAAI 2021

  34. arXiv:2010.15939

    cs.CL cs.CY

    RuREBus: a Case Study of Joint Named Entity Recognition and Relation Extraction from e-Government Domain

    Authors: Vitaly Ivanin, Ekaterina Artemova, Tatiana Batura, Vladimir Ivanov, Veronika Sarkisyan, Elena Tutubalina, Ivan Smurov

    Abstract: We showcase an application of information extraction methods, such as named entity recognition (NER) and relation extraction (RE), to a novel corpus consisting of documents issued by a state agency. The main challenges of this corpus are: 1) the annotation scheme differs greatly from the one used for the general domain corpora, and 2) the documents are written in a language other than English. U…

    Submitted 29 October, 2020; originally announced October 2020.

    Comments: to appear in AIST 2020

  35. RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark

    Authors: Tatiana Shavrina, Alena Fenogenova, Anton Emelyanov, Denis Shevelev, Ekaterina Artemova, Valentin Malykh, Vladislav Mikhailov, Maria Tikhonova, Andrey Chertok, Andrey Evlampiev

    Abstract: In this paper, we introduce an advanced Russian general language understanding evaluation benchmark -- RussianGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing for general intellectual skills - detection of natural language inference, commonsense reasoning, ability to perform simple logi…

    Submitted 2 November, 2020; v1 submitted 29 October, 2020; originally announced October 2020.

    Comments: to appear in EMNLP 2020

  36. arXiv:2010.03481

    cs.CL

    ELMo and BERT in semantic change detection for Russian

    Authors: Julia Rodina, Yuliya Trofimova, Andrey Kutuzov, Ekaterina Artemova

    Abstract: We study the effectiveness of contextualized embeddings for the task of diachronic semantic change detection for Russian language data. Evaluation test sets consist of Russian nouns and adjectives annotated based on their occurrences in texts created in pre-Soviet, Soviet and post-Soviet time periods. ELMo and BERT architectures are compared on the task of ranking Russian words according to the de…

    Submitted 7 October, 2020; originally announced October 2020.

    Comments: The 9th International Conference on Analysis of Images, Social Networks and Texts (AIST 2020)

  37. DaNetQA: a yes/no Question Answering Dataset for the Russian Language

    Authors: Taisia Glushkova, Alexey Machnev, Alena Fenogenova, Tatiana Shavrina, Ekaterina Artemova, Dmitry I. Ignatov

    Abstract: DaNetQA, a new question-answering corpus, follows the design of (Clark et al., 2019): it comprises natural yes/no questions. Each question is paired with a paragraph from Wikipedia and an answer, derived from the paragraph. The task is to take both the question and a paragraph as input and come up with a yes/no answer, i.e. to produce a binary output. In this paper, we present a reproducible approach to…

    Submitted 15 October, 2020; v1 submitted 6 October, 2020; originally announced October 2020.

    Comments: Analysis of Images, Social Networks and Texts - 9th International Conference, AIST 2020, Skolkovo, Russia, October 15-16, 2020, Revised Selected Papers. Lecture Notes in Computer Science (https://dblp.org/db/series/lncs/index.html), Springer 2020

  38. arXiv:2007.00257

    cs.CL

    So What's the Plan? Mining Strategic Planning Documents

    Authors: Ekaterina Artemova, Tatiana Batura, Anna Golenkovskaya, Vitaly Ivanin, Vladimir Ivanov, Veronika Sarkisyan, Ivan Smurov, Elena Tutubalina

    Abstract: In this paper we present a corpus of Russian strategic planning documents, RuREBus. This project is grounded in both language technology and e-government perspectives. Not only are new language sources and tools being developed, but also their applications to e-government research. We demonstrate the pipeline for creating a text corpus from scratch. First, the annotation schema is designed. Next…

    Submitted 7 July, 2020; v1 submitted 1 July, 2020; originally announced July 2020.

    Comments: 15 pages, 3 figures, 5 tables. The paper has been accepted for the Fifth International Conference on Digital Transformation and Global Society (DTGS 2020)

  39. arXiv:2006.07116

    cs.LG cs.CL stat.ML

    NAS-Bench-NLP: Neural Architecture Search Benchmark for Natural Language Processing

    Authors: Nikita Klyuchnikov, Ilya Trofimov, Ekaterina Artemova, Mikhail Salnikov, Maxim Fedorov, Evgeny Burnaev

    Abstract: Neural Architecture Search (NAS) is a promising and rapidly evolving research area. Training a large number of neural networks requires an exceptional amount of computational power, which makes NAS unreachable for those researchers who have limited or no access to high-performance clusters and supercomputers. A few benchmarks with precomputed neural architecture performances have been recently in…

    Submitted 12 June, 2020; originally announced June 2020.

  40. arXiv:2003.10540

    cs.LG cs.CL stat.ML

    Data-driven models and computational tools for neurolinguistics: a language technology perspective

    Authors: Ekaterina Artemova, Amir Bakarov, Aleksey Artemov, Evgeny Burnaev, Maxim Sharaev

    Abstract: In this paper, our focus is the connection and influence of language technologies on the research in neurolinguistics. We present a review of brain imaging-based neurolinguistic studies with a focus on the natural language representations, such as word embeddings and pre-trained language models. Mutual enrichment of neurolinguistics and language technologies leads to development of brain-aware nat…

    Submitted 23 March, 2020; originally announced March 2020.

    Comments: 37 pages, 1 figure

    Journal ref: Journal of Cognitive Science, 2020

  41. arXiv:2003.09606

    cs.CL

    A Joint Approach to Compound Splitting and Idiomatic Compound Detection

    Authors: Irina Krotova, Sergey Aksenov, Ekaterina Artemova

    Abstract: Applications such as machine translation, speech recognition, and information retrieval require efficient handling of noun compounds as they are one of the possible sources for out-of-vocabulary (OOV) words. In-depth processing of noun compounds requires not only splitting them into smaller components (or even roots) but also the identification of instances that should remain unsplit as they ar…

    Submitted 21 March, 2020; originally announced March 2020.

    Comments: 8 pages, 5 tables, 1 figure, accepted at LREC 2020

  42. arXiv:2003.06651

    cs.CL

    Word Sense Disambiguation for 158 Languages using Word Embeddings Only

    Authors: Varvara Logacheva, Denis Teslenko, Artem Shelmanov, Steffen Remus, Dmitry Ustalov, Andrey Kutuzov, Ekaterina Artemova, Chris Biemann, Simone Paolo Ponzetto, Alexander Panchenko

    Abstract: Disambiguation of word senses in context is easy for humans, but is a major challenge for automatic approaches. Sophisticated supervised and knowledge-based models were developed to solve this task. However, (i) the inherent Zipfian distribution of supervised training instances for a given word and/or (ii) the quality of linguistic knowledge representations motivate the development of completely u…

    Submitted 14 March, 2020; originally announced March 2020.

    Comments: 10 pages, 5 figures, 4 tables, accepted at LREC 2020

  43. Char-RNN and Active Learning for Hashtag Segmentation

    Authors: Taisiya Glushkova, Ekaterina Artemova

    Abstract: We explore the abilities of a character recurrent neural network (char-RNN) for hashtag segmentation. Our approach to the task is the following: we generate a synthetic training dataset according to frequent n-grams that satisfy predefined morpho-syntactic patterns to avoid any manual annotation. The active learning strategy limits the training dataset and selects an informative training subset. The appr…

    Submitted 8 November, 2019; originally announced November 2019.

    Comments: to appear in Cicling2019

  44. arXiv:1910.13291

    cs.CL cs.LG

    Sentence Embeddings for Russian NLU

    Authors: Dmitry Popov, Alexander Pugachev, Polina Svyatokum, Elizaveta Svitanko, Ekaterina Artemova

    Abstract: We investigate the performance of sentence embedding models on several tasks for the Russian language. In our comparison, we include such tasks as multiple choice question answering, next sentence prediction, and paraphrase identification. We employ FastText embeddings as a baseline and compare them to ELMo and BERT embeddings. We conduct two series of experiments, using both unsupervised (i.e., ba…

    Submitted 29 October, 2019; originally announced October 2019.

    Comments: to appear in AIST2019

  45. Multilingual Named Entity Recognition Using Pretrained Embeddings, Attention Mechanism and NCRF

    Authors: Anton A. Emelyanov, Ekaterina Artemova

    Abstract: In this paper we tackle the multilingual named entity recognition task. We use the BERT Language Model as embeddings with a bidirectional recurrent network, attention, and NCRF on top. We apply multilingual BERT only as an embedder without any fine-tuning. We test our model on the dataset of the BSNLP shared task, which consists of texts in Bulgarian, Czech, Polish and Russian languages.

    Submitted 21 June, 2019; originally announced June 2019.

    Comments: BSNLP Shared Task 2019 paper. arXiv admin note: text overlap with arXiv:1806.05626 by other authors