Showing 1–38 of 38 results for author: Kutuzov, A

  1. arXiv:2407.04079

    cs.CL

    AXOLOTL'24 Shared Task on Multilingual Explainable Semantic Change Modeling

    Authors: Mariia Fedorova, Timothee Mickus, Niko Partanen, Janine Siewert, Elena Spaziani, Andrey Kutuzov

    Abstract: This paper describes the organization and findings of AXOLOTL'24, the first multilingual explainable semantic change modeling shared task. We present new sense-annotated diachronic semantic change datasets for Finnish and Russian which were employed in the shared task, along with a surprise test-only German dataset borrowed from an existing source. The setup of AXOLOTL'24 is new to the semantic ch…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Proceedings of the 5th Workshop on Computational Approaches to Historical Language Change (ACL'24)

  2. arXiv:2406.14167

    cs.CL

    Definition generation for lexical semantic change detection

    Authors: Mariia Fedorova, Andrey Kutuzov, Yves Scherrer

    Abstract: We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as `senses', and the change score of a target word is retrieved by comparing their distributions in two time periods under comparison. On the material of five datasets and three languages,…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Findings of ACL 2024

  3. arXiv:2403.18024

    cs.CL

    Enriching Word Usage Graphs with Cluster Definitions

    Authors: Mariia Fedorova, Andrey Kutuzov, Nikolay Arefyev, Dominik Schlechtweg

    Abstract: We present a dataset of word usage graphs (WUGs), where the existing WUGs for multiple languages are enriched with cluster labels functioning as sense definitions. They are generated from scratch by fine-tuned encoder-decoder language models. The conducted human evaluation has shown that these definitions match the existing clusters in WUGs better than the definitions chosen from WordNet by two ba…

    Submitted 26 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  4. arXiv:2403.14009

    cs.CL

    A New Massive Multilingual Dataset for High-Performance Language Technologies

    Authors: Ona de Gibert, Graeme Nail, Nikolay Arefyev, Marta Bañón, Jelmer van der Linde, Shaoxiong Ji, Jaume Zaragoza-Bernabeu, Mikko Aulamo, Gema Ramírez-Sánchez, Andrey Kutuzov, Sampo Pyysalo, Stephan Oepen, Jörg Tiedemann

    Abstract: We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performa…

    Submitted 20 March, 2024; originally announced March 2024.

    Comments: LREC-COLING 2024

  5. arXiv:2309.08958

    cs.CL cs.AI

    Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca

    Authors: Pinzhen Chen, Shaoxiong Ji, Nikolay Bogoychev, Andrey Kutuzov, Barry Haddow, Kenneth Heafield

    Abstract: Foundational large language models (LLMs) can be instruction-tuned to perform open-domain question answering, facilitating applications like chat assistants. While such efforts are often carried out in a single language, we empirically analyze cost-efficient strategies for multilingual scenarios. Our study employs the Alpaca dataset and machine translations of it to form multilingual data, which i…

    Submitted 30 January, 2024; v1 submitted 16 September, 2023; originally announced September 2023.

    Comments: Accepted to Findings of ACL: EACL 2024. Added human evaluation and shortened writing

  6. arXiv:2305.11993

    cs.CL

    Interpretable Word Sense Representations via Definition Generation: The Case of Semantic Change Analysis

    Authors: Mario Giulianelli, Iris Luden, Raquel Fernandez, Andrey Kutuzov

    Abstract: We propose using automatically generated natural language definitions of contextualised word usages as interpretable word and word sense representations. Given a collection of usage examples for a target word, and the corresponding data-driven usage clusters (i.e., word senses), a definition is generated for each usage with a specialised Flan-T5 language model, and the most prototypical definition…

    Submitted 25 July, 2023; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: ACL 2023

  7. arXiv:2305.03880

    cs.CL

    NorBench -- A Benchmark for Norwegian Language Models

    Authors: David Samuel, Andrey Kutuzov, Samia Touileb, Erik Velldal, Lilja Øvrelid, Egil Rønningstad, Elina Sigdel, Anna Palatkina

    Abstract: We present NorBench: a streamlined suite of NLP tasks and probes for evaluating Norwegian language models (LMs) on standardized data splits and evaluation metrics. We also introduce a range of new Norwegian language models (both encoder and encoder-decoder based). Finally, we compare and analyze their performance, along with other existing LMs, across the different benchmark tests of NorBench.

    Submitted 5 May, 2023; originally announced May 2023.

    Comments: Accepted to NoDaLiDa 2023

  8. arXiv:2303.09859

    cs.CL

    Trained on 100 million words and still in shape: BERT meets British National Corpus

    Authors: David Samuel, Andrey Kutuzov, Lilja Øvrelid, Erik Velldal

    Abstract: While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source -- the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this ty…

    Submitted 5 May, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

    Comments: Accepted to EACL 2023

  9. arXiv:2209.13750

    cs.CL

    RuDSI: graph-based word sense induction dataset for Russian

    Authors: Anna Aksenova, Ekaterina Gavrishina, Elisey Rykov, Andrey Kutuzov

    Abstract: We present RuDSI, a new benchmark for word sense induction (WSI) in Russian. The dataset was created using manual annotation and semi-automatic clustering of Word Usage Graphs (WUGs). Unlike prior WSI datasets for Russian, RuDSI is completely data-driven (based on texts from Russian National Corpus), with no external word senses imposed on annotators. Depending on the parameters of graph clusterin…

    Submitted 27 September, 2022; originally announced September 2022.

    Comments: TextGraphs-16 workshop at the CoLING-2022 conference

  10. Contextualized language models for semantic change detection: lessons learned

    Authors: Andrey Kutuzov, Erik Velldal, Lilja Øvrelid

    Abstract: We present a qualitative analysis of the (potentially erroneous) outputs of contextualized embedding-based methods for detecting diachronic semantic change. First, we introduce an ensemble method outperforming previously described contextualized approaches. This method is used as a basis for an in-depth analysis of the degrees of semantic change predicted for English words across 5 decades. Our fi…

    Submitted 31 August, 2022; originally announced September 2022.

    Journal ref: Northern European Journal of Language Technology (NEJLT). ISSN 2000-1533. 8(1)

  11. arXiv:2204.05717

    cs.CL

    Do Not Fire the Linguist: Grammatical Profiles Help Language Models Detect Semantic Change

    Authors: Mario Giulianelli, Andrey Kutuzov, Lidia Pivovarova

    Abstract: Morphological and syntactic changes in word usage (as captured, e.g., by grammatical profiles) have been shown to be good predictors of a word's meaning change. In this work, we explore whether large pre-trained contextualised language models, a common tool for lexical semantic change detection, are sensitive to such morphosyntactic changes. To this end, we first compare the performance of grammat…

    Submitted 12 April, 2022; originally announced April 2022.

    Comments: 3rd International Workshop on Computational Approaches to Historical Language Change 2022 (LChange'22)

  12. arXiv:2201.05123

    cs.CL

    NorDiaChange: Diachronic Semantic Change Dataset for Norwegian

    Authors: Andrey Kutuzov, Samia Touileb, Petter Mæhlum, Tita Ranveig Enstad, Alexandra Wittemann

    Abstract: We describe NorDiaChange: the first diachronic semantic change dataset for Norwegian. NorDiaChange comprises two novel subsets, covering about 80 Norwegian nouns manually annotated with graded semantic change over time. Both datasets follow the same annotation procedure and can be used interchangeably as train and test splits for each other. NorDiaChange covers the time periods related to pre- and…

    Submitted 27 April, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

    Comments: LREC'2022 proceedings

  13. arXiv:2109.10397

    cs.CL

    Grammatical Profiling for Semantic Change Detection

    Authors: Mario Giulianelli, Andrey Kutuzov, Lidia Pivovarova

    Abstract: Semantics, morphology and syntax are strongly interdependent. However, the majority of computational methods for semantic change detection use distributional word representations which encode mostly semantics. We investigate an alternative method, grammatical profiling, based entirely on changes in the morphosyntactic behaviour of words. We demonstrate that it can be used for semantic change detec…

    Submitted 21 September, 2021; originally announced September 2021.

    Comments: CoNLL 2021

  14. arXiv:2106.08294

    cs.CL

    Three-part diachronic semantic change dataset for Russian

    Authors: Andrey Kutuzov, Lidia Pivovarova

    Abstract: We present a manually annotated lexical semantic change dataset for Russian: RuShiftEval. Its novelty is ensured by a single set of target words annotated for their diachronic semantic shifts across three time periods, while the previous work either used only two time periods, or different sets of target words. The paper describes the composition and annotation procedure for the dataset. In additi…

    Submitted 15 June, 2021; originally announced June 2021.

    Comments: Accepted to the 2nd International Workshop on Computational Approaches to Historical Language Change 2021 (LChange'21)

  15. arXiv:2105.01192

    cs.CL

    Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks

    Authors: Tatyana Iazykova, Denis Kapelyushnik, Olga Bystrova, Andrey Kutuzov

    Abstract: Leader-boards like SuperGLUE are seen as important incentives for active development of NLP, since they provide standard benchmarks for fair comparison of modern language models. They have driven the world's best engineering teams as well as their resources to collaborate and solve a set of tasks for general language understanding. Their performance scores are often claimed to be close to or even…

    Submitted 3 May, 2021; originally announced May 2021.

    Comments: Accepted to Dialogue'2021

  16. arXiv:2104.06546

    cs.CL

    Large-Scale Contextualised Language Modelling for Norwegian

    Authors: Andrey Kutuzov, Jeremy Barnes, Erik Velldal, Lilja Øvrelid, Stephan Oepen

    Abstract: We present the ongoing NorLM initiative to support the creation and use of very large contextualised language models for Norwegian (and in principle other Nordic languages), including a ready-to-use software environment, as well as an experience report for data preparation and training. This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo an…

    Submitted 13 April, 2021; originally announced April 2021.

    Comments: Accepted to NoDaLiDa'2021

  17. arXiv:2103.16414

    cs.CL

    Representing ELMo embeddings as two-dimensional text online

    Authors: Andrey Kutuzov, Elizaveta Kuzmenko

    Abstract: We describe a new addition to the WebVectors toolkit which is used to serve word embedding models over the Web. The new ELMoViz module adds support for contextualized embedding architectures, in particular for ELMo models. The provided visualizations follow the metaphor of `two-dimensional text' by showing lexical substitutes: words which are most semantically similar in context to the words of th…

    Submitted 30 March, 2021; originally announced March 2021.

    Comments: EACL'2021 demo paper

  18. arXiv:2010.06436

    cs.CL

    RuSemShift: a dataset of historical lexical semantic change in Russian

    Authors: Julia Rodina, Andrey Kutuzov

    Abstract: We present RuSemShift, a large-scale manually annotated test set for the task of semantic change modeling in Russian for two long-term time period pairs: from the pre-Soviet through the Soviet times and from the Soviet through the post-Soviet times. Target words were annotated by multiple crowd-source workers. The annotation process was organized following the DURel framework and was based on sent…

    Submitted 13 October, 2020; originally announced October 2020.

    Comments: Accepted to COLING 2020

  19. arXiv:2010.03481

    cs.CL

    ELMo and BERT in semantic change detection for Russian

    Authors: Julia Rodina, Yuliya Trofimova, Andrey Kutuzov, Ekaterina Artemova

    Abstract: We study the effectiveness of contextualized embeddings for the task of diachronic semantic change detection for Russian language data. Evaluation test sets consist of Russian nouns and adjectives annotated based on their occurrences in texts created in pre-Soviet, Soviet and post-Soviet time periods. ELMo and BERT architectures are compared on the task of ranking Russian words according to the de…

    Submitted 7 October, 2020; originally announced October 2020.

    Comments: The 9th International Conference on Analysis of Images, Social Networks and Texts (AIST 2020)

  20. arXiv:2005.00050

    cs.CL

    UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection

    Authors: Andrey Kutuzov, Mario Giulianelli

    Abstract: We apply contextualised word embeddings to lexical semantic change detection in the SemEval-2020 Shared Task 1. This paper focuses on Subtask 2, ranking words by the degree of their semantic drift over time. We analyse the performance of two contextualising architectures (BERT and ELMo) and three change detection algorithms. We find that the most effective algorithms rely on the cosine similarity…

    Submitted 18 July, 2020; v1 submitted 30 April, 2020; originally announced May 2020.

    Comments: To appear in Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020)

  21. arXiv:2003.06651

    cs.CL

    Word Sense Disambiguation for 158 Languages using Word Embeddings Only

    Authors: Varvara Logacheva, Denis Teslenko, Artem Shelmanov, Steffen Remus, Dmitry Ustalov, Andrey Kutuzov, Ekaterina Artemova, Chris Biemann, Simone Paolo Ponzetto, Alexander Panchenko

    Abstract: Disambiguation of word senses in context is easy for humans, but is a major challenge for automatic approaches. Sophisticated supervised and knowledge-based models were developed to solve this task. However, (i) the inherent Zipfian distribution of supervised training instances for a given word and/or (ii) the quality of linguistic knowledge representations motivate the development of completely u…

    Submitted 14 March, 2020; originally announced March 2020.

    Comments: 10 pages, 5 figures, 4 tables, accepted at LREC 2020

  22. arXiv:1909.03135

    cs.CL

    To lemmatize or not to lemmatize: how word normalisation affects ELMo performance in word sense disambiguation

    Authors: Andrey Kutuzov, Elizaveta Kuzmenko

    Abstract: We critically evaluate the widespread assumption that deep learning NLP models do not require lemmatized input. To test this, we trained versions of contextualised word embedding ELMo models on raw tokenized corpora and on the corpora with word tokens replaced by their lemmas. Then, these models were evaluated on the word sense disambiguation task. This was done for the English and Russian languag…

    Submitted 6 September, 2019; originally announced September 2019.

    Comments: Accepted to NODALIDA2019 Deep Learning for Natural Language Processing workshop

  23. arXiv:1907.12674

    cs.CL

    One-to-X analogical reasoning on word embeddings: a case for diachronic armed conflict prediction from news texts

    Authors: Andrey Kutuzov, Erik Velldal, Lilja Øvrelid

    Abstract: We extend the well-known word analogy task to a one-to-X formulation, including one-to-none cases, when no correct answer exists. The task is cast as a relation discovery problem and applied to historical armed conflicts datasets, attempting to predict new relations of type `location:armed-group' based on data about past events. As the source of semantic information, we use diachronic word embeddi…

    Submitted 29 July, 2019; originally announced July 2019.

    Comments: 1st International Workshop on Computational Approaches to Historical Language Change (ACL 2019)

  24. arXiv:1906.07040

    cs.CL

    Making Fast Graph-based Algorithms with Graph Metric Embeddings

    Authors: Andrey Kutuzov, Mohammad Dorgham, Oleksiy Oliynyk, Chris Biemann, Alexander Panchenko

    Abstract: The computation of distance measures between nodes in graphs is inefficient and does not scale to large graphs. We explore dense vector representations as an effective way to approximate the same information: we introduce a simple yet efficient and effective approach for learning graph embeddings. Instead of directly operating on the graph structure, our method takes structural measures of pairwis…

    Submitted 17 June, 2019; originally announced June 2019.

    Comments: In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL'2019). Florence, Italy

  25. arXiv:1905.06837

    cs.CL

    Tracing cultural diachronic semantic shifts in Russian using word embeddings: test sets and baselines

    Authors: Vadim Fomin, Daria Bakshandaeva, Julia Rodina, Andrey Kutuzov

    Abstract: The paper introduces manually annotated test sets for the task of tracing diachronic (temporal) semantic shifts in Russian. The two test sets are complementary in that the first one covers comparatively strong semantic changes occurring to nouns and adjectives from pre-Soviet to Soviet times, while the second one covers comparatively subtle socially and culturally determined shifts occurring in ye…

    Submitted 29 July, 2019; v1 submitted 16 May, 2019; originally announced May 2019.

    Comments: Dialogue 2019

  26. arXiv:1808.05611

    cs.CL

    Learning Graph Embeddings from WordNet-based Similarity Measures

    Authors: Andrey Kutuzov, Mohammad Dorgham, Oleksiy Oliynyk, Chris Biemann, Alexander Panchenko

    Abstract: We present path2vec, a new approach for learning graph embeddings that relies on structural measures of pairwise node similarities. The model learns representations for nodes in a dense space that approximate a given user-defined graph distance measure, such as e.g. the shortest path distance or distance measures that take information beyond the graph structure into account. Evaluation of the prop…

    Submitted 12 April, 2019; v1 submitted 16 August, 2018; originally announced August 2018.

    Comments: Accepted to StarSem 2019

  27. arXiv:1806.03537

    cs.CL

    Diachronic word embeddings and semantic shifts: a survey

    Authors: Andrey Kutuzov, Lilja Øvrelid, Terrence Szymanski, Erik Velldal

    Abstract: Recent years have witnessed a surge of publications aimed at tracing temporal changes in lexical semantics using distributional methods, particularly prediction-based word embedding models. However, this vein of research lacks the cohesion, common terminology and shared practices of more established areas of natural language processing. In this paper, we survey the current state of academic resear…

    Submitted 13 June, 2018; v1 submitted 9 June, 2018; originally announced June 2018.

    Comments: Proceedings of COLING 2018

  28. Unsupervised Semantic Frame Induction using Triclustering

    Authors: Dmitry Ustalov, Alexander Panchenko, Andrei Kutuzov, Chris Biemann, Simone Paolo Ponzetto

    Abstract: We use dependency triples automatically extracted from a Web-scale corpus to perform unsupervised semantic frame induction. We cast the frame induction problem as a triclustering problem that is a generalization of clustering for triadic data. Our replicable benchmarks demonstrate that the proposed graph-based approach, Triframes, shows state-of-the-art results on this task on a FrameNet-derived d…

    Submitted 18 May, 2018; v1 submitted 12 May, 2018; originally announced May 2018.

    Comments: 8 pages, 1 figure, 4 tables, accepted at ACL 2018

  29. arXiv:1805.02258

    cs.CL

    Russian word sense induction by clustering averaged word embeddings

    Authors: Andrey Kutuzov

    Abstract: The paper reports our participation in the shared task on word sense induction and disambiguation for the Russian language (RUSSE-2018). Our team was ranked 2nd for the wiki-wiki dataset (containing mostly homonyms) and 5th for the bts-rnc and active-dict datasets (containing mostly polysemous words) among all 19 participants. The method we employed was extremely naive. It implied representing c…

    Submitted 6 May, 2018; originally announced May 2018.

    Comments: Proceedings of the 24th International Conference on Computational Linguistics and Intellectual Technologies (Dialogue-2018)

  30. Size vs. Structure in Training Corpora for Word Embedding Models: Araneum Russicum Maximum and Russian National Corpus

    Authors: Andrey Kutuzov, Maria Kunilovskaya

    Abstract: In this paper, we present a distributional word embedding model trained on one of the largest available Russian corpora: Araneum Russicum Maximum (over 10 billion words crawled from the web). We compare this model to the model trained on the Russian National Corpus (RNC). The two corpora are much different in their size and compilation procedures. We test these differences by evaluating the traine…

    Submitted 19 January, 2018; originally announced January 2018.

    Journal ref: In: van der Aalst W. et al. (eds) Analysis of Images, Social Networks and Texts. AIST 2017. Lecture Notes in Computer Science, vol 10716. Springer, Cham

  31. arXiv:1707.08660

    cs.CL

    Temporal dynamics of semantic relations in word embeddings: an application to predicting armed conflict participants

    Authors: Andrey Kutuzov, Erik Velldal, Lilja Øvrelid

    Abstract: This paper deals with using word embedding models to trace the temporal dynamics of semantic relations between pairs of words. The set-up is similar to the well-known analogies task, but expanded with a time dimension. To this end, we apply incremental updating of the models with new training texts, including incremental vocabulary expansion, coupled with learned transformation matrices that let u…

    Submitted 26 July, 2017; originally announced July 2017.

    Comments: to appear in EMNLP 2017 proceedings

  32. arXiv:1704.05781

    cs.CL

    Redefining Context Windows for Word Embedding Models: An Experimental Study

    Authors: Pierre Lison, Andrey Kutuzov

    Abstract: Distributional semantic models learn vector representations of words through the contexts they occur in. Although the choice of context (which often takes the form of a sliding window) has a direct influence on the resulting embeddings, the exact role of this model component is still not fully understood. This paper presents a systematic analysis of context windows based on a set of four distinct…

    Submitted 19 April, 2017; originally announced April 2017.

  33. arXiv:1608.03803

    cs.CL

    Redefining part-of-speech classes with distributional semantic models

    Authors: Andrey Kutuzov, Erik Velldal, Lilja Øvrelid

    Abstract: This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The results show that the information about PoS affiliati…

    Submitted 12 August, 2016; originally announced August 2016.

    Journal ref: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, 2016, pp. 115-125

  34. arXiv:1604.05372

    cs.CL

    Clustering Comparable Corpora of Russian and Ukrainian Academic Texts: Word Embeddings and Semantic Fingerprints

    Authors: Andrey Kutuzov, Mikhail Kopotev, Tatyana Sviridenko, Lyubov Ivanova

    Abstract: We present our experience in applying distributional semantics (neural word embeddings) to the problem of representing and clustering documents in a bilingual comparable corpus. Our data is a collection of Russian and Ukrainian academic texts, for which topics are their academic fields. In order to build language-independent semantic representations of these documents, we train neural distribution…

    Submitted 18 April, 2016; originally announced April 2016.

    Comments: To be presented at 9th Workshop on Building and Using Comparable Corpora, co-located with LREC-2016 (https://comparable.limsi.fr/bucc2016/)

  35. arXiv:1504.08183

    cs.CL

    Texts in, meaning out: neural language models in semantic similarity task for Russian

    Authors: Andrey Kutuzov, Igor Andreev

    Abstract: Distributed vector representations for natural language vocabulary get a lot of attention in contemporary computational linguistics. This paper summarizes the experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of Russian Semantic Similarity Evaluation track, where our models took from the 2n…

    Submitted 30 April, 2015; originally announced April 2015.

    Comments: Proceedings of the Dialog 2015 Conference. Moscow, Russia

  36. arXiv:1409.1612

    cs.CL cs.IR

    Semantic clustering of Russian web search results: possibilities and problems

    Authors: Andrey Kutuzov

    Abstract: The paper deals with word sense induction from lexical co-occurrence graphs. We construct such graphs on large Russian corpora and then apply this data to cluster Mail.ru Search results according to meanings of the query. We compare different methods of performing such clustering and different source corpora. Models of applying distributional semantics to big linguistic data are described.

    Submitted 26 October, 2014; v1 submitted 4 September, 2014; originally announced September 2014.

    Comments: Presented at Russian Summer School in Information Retrieval (RuSSIR 2014). To be published in Springer Communications in Computer and Information Science series

  37. arXiv:1003.0337

    cs.CL

    Change of word types to word tokens ratio in the course of translation (based on Russian translations of K. Vonnegut novels)

    Authors: Andrey Kutuzov

    Abstract: The article provides lexical statistical analysis of K. Vonnegut's two novels and their Russian translations. It is found out that there happen some changes between the speed of word types and word tokens ratio change in the source and target texts. The author hypothesizes that these changes are typical for English-Russian translations, and moreover, they represent an example of Baker's translat…

    Submitted 1 March, 2010; originally announced March 2010.

    Comments: 11 pages, 5 figures, to be reported at International Computational Linguistic Conference "Dialog-21"-2010 (http://dialog-21.ru)

  38. arXiv:0809.3250

    cs.CL

    Using descriptive mark-up to formalize translation quality assessment

    Authors: Andrey Kutuzov

    Abstract: The paper deals with using descriptive mark-up to emphasize translation mistakes. The author postulates the necessity to develop a standard and formal XML-based way of describing translation mistakes. It is considered to be important for achieving impersonal translation quality assessment. Marked-up translations can be used in corpus translation studies; moreover, automatic translation assessmen…

    Submitted 18 September, 2008; originally announced September 2008.

    Comments: 9 pages

    Journal ref: Published in Russian in 'Translation industry and information supply in international business activities: materials of international conference' - Perm, 2008, pp. 90-101