Showing 1–17 of 17 results for author: Straková, J

  1. arXiv:2406.12422  [pdf, other]

    cs.CL

    Open-Source Web Service with Morphological Dictionary-Supplemented Deep Learning for Morphosyntactic Analysis of Czech

    Authors: Milan Straka, Jana Straková

    Abstract: We present an open-source web service for Czech morphosyntactic analysis. The system combines a deep learning model with rescoring by a high-precision morphological dictionary at inference time. We show that our hybrid method surpasses two competitive baselines: While the deep learning model ensures generalization for out-of-vocabulary words and better disambiguation, an improvement over an existi…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted to TSD 2024

  2. CWRCzech: 100M Query-Document Czech Click Dataset and Its Application to Web Relevance Ranking

    Authors: Josef Vonášek, Milan Straka, Rostislav Krč, Lenka Lasoňová, Ekaterina Egorova, Jana Straková, Jakub Náplava

    Abstract: We present CWRCzech, Click Web Ranking dataset for Czech, a 100M query-document Czech click dataset for relevance ranking with user behavior data collected from search engine logs of Seznam.cz. To the best of our knowledge, CWRCzech is the largest click dataset with raw text published so far. It provides document positions in the search results as well as information about user behavior: 27.6M c…

    Submitted 15 July, 2024; v1 submitted 31 May, 2024; originally announced May 2024.

    Comments: Accepted to SIGIR 2024

  3. arXiv:2404.08974  [pdf, other]

    cs.CL

    OOVs in the Spotlight: How to Inflect them?

    Authors: Tomáš Sourada, Jana Straková, Rudolf Rosa

    Abstract: We focus on morphological inflection in out-of-vocabulary (OOV) conditions, an under-researched subtask in which state-of-the-art systems usually are less effective. We developed three systems: a retrograde model and two sequence-to-sequence (seq2seq) models based on LSTM and Transformer. For testing in OOV conditions, we automatically extracted a large dataset of nouns in the morphologically rich…

    Submitted 28 May, 2024; v1 submitted 13 April, 2024; originally announced April 2024.

    Comments: Published in the proceedings of LREC-COLING 2024. 12 pages, 3 figures

    Journal ref: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 12455-12466

  4. arXiv:2404.05839  [pdf, other]

    cs.CL

    ÚFAL LatinPipe at EvaLatin 2024: Morphosyntactic Analysis of Latin

    Authors: Milan Straka, Jana Straková, Federica Gamba

    Abstract: We present LatinPipe, the winning submission to the EvaLatin 2024 Dependency Parsing shared task. Our system consists of a fine-tuned concatenation of base and large pre-trained LMs, with a dot-product attention head for parsing and softmax classification heads for morphology to jointly learn both dependency parsing and morphological analysis. It is trained by sampling from seven publicly availabl…

    Submitted 29 May, 2024; v1 submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted to LT4HALA 2024

  5. Extending an Event-type Ontology: Adding Verbs and Classes Using Fine-tuned LLMs Suggestions

    Authors: Jana Straková, Eva Fučíková, Jan Hajič, Zdeňka Urešová

    Abstract: In this project, we have investigated the use of advanced machine learning methods, specifically fine-tuned large language models, for pre-annotating data for a lexical extension task, namely adding descriptive words (verbs) to an existing (but incomplete, as of yet) ontology of event types. Several research questions have been focused on, from the investigation of a possible heuristics to provide…

    Submitted 10 August, 2023; v1 submitted 3 June, 2023; originally announced June 2023.

    Comments: Published at LAW-XVII @ ACL 2023

  6. arXiv:2209.07278  [pdf, other]

    cs.CL

    ÚFAL CorPipe at CRAC 2022: Effectivity of Multilingual Models for Coreference Resolution

    Authors: Milan Straka, Jana Straková

    Abstract: We describe the winning submission to the CRAC 2022 Shared Task on Multilingual Coreference Resolution. Our system first solves mention detection and then coreference linking on the retrieved spans with an antecedent-maximization approach, and both tasks are fine-tuned jointly with shared Transformer weights. We report results of fine-tuning a wide range of pretrained models. The center of this co…

    Submitted 24 November, 2023; v1 submitted 15 September, 2022; originally announced September 2022.

    Comments: Accepted to CRAC 2022 (Fifth Workshop on Computational Models of Reference, Anaphora and Coreference)

  7. Czech Grammar Error Correction with a Large and Diverse Corpus

    Authors: Jakub Náplava, Milan Straka, Jana Straková, Alexandr Rosen

    Abstract: We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim to contribute to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high error density essays written by non-native speakers, to web…

    Submitted 21 April, 2022; v1 submitted 14 January, 2022; originally announced January 2022.

    Comments: Published in TACL, MIT Press

  8. arXiv:2111.09280  [pdf, other]

    cs.CL

    Character Transformations for Non-Autoregressive GEC Tagging

    Authors: Milan Straka, Jakub Náplava, Jana Straková

    Abstract: We propose a character-based nonautoregressive GEC approach, with automatically generated character transformations. Recently, per-word classification of correction edits has proven an efficient, parallelizable alternative to current encoder-decoder GEC systems. We show that word replacement edits may be suboptimal and lead to explosion of rules for spelling, diacritization and errors in morpholog…

    Submitted 17 November, 2021; originally announced November 2021.

    Comments: Accepted to W-NUT 2021

  9. arXiv:2110.07428  [pdf, other]

    cs.CL

    Understanding Model Robustness to User-generated Noisy Texts

    Authors: Jakub Náplava, Martin Popel, Milan Straka, Jana Straková

    Abstract: Sensitivity of deep-neural models to input noise is known to be a challenging problem. In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage artificially noised data. However, the amount and type of generated noise has so far been determined arbitrarily. We therefore propose to model the errors statisticall…

    Submitted 17 November, 2021; v1 submitted 14 October, 2021; originally announced October 2021.

    Comments: Accepted to W-NUT 2021

  10. Diacritics Restoration using BERT with Analysis on Czech language

    Authors: Jakub Náplava, Milan Straka, Jana Straková

    Abstract: We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT, and we evaluate it on 12 languages with diacritics. Furthermore, we conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization. Notably, we manually annotate all mispredictions, showing that roughly 44% of them are actually not errors, but e…

    Submitted 24 May, 2021; originally announced May 2021.

    Journal ref: The Prague Bulletin of Mathematical Linguistics No. 116, 2021, pp. 27-42

  11. RobeCzech: Czech RoBERTa, a monolingual contextualized language representation model

    Authors: Milan Straka, Jakub Náplava, Jana Straková, David Samuel

    Abstract: We present RobeCzech, a monolingual RoBERTa language representation model trained on Czech data. RoBERTa is a robustly optimized Transformer-based pretraining approach. We show that RobeCzech considerably outperforms equally-sized multilingual and Czech-trained contextualized language representation models, surpasses current state of the art in all five evaluated NLP tasks and reaches state-of-the…

    Submitted 14 October, 2021; v1 submitted 24 May, 2021; originally announced May 2021.

    Comments: Published in TSD 2021

  12. arXiv:2006.03687  [pdf, other]

    cs.CL

    UDPipe at EvaLatin 2020: Contextualized Embeddings and Treebank Embeddings

    Authors: Milan Straka, Jana Straková

    Abstract: We present our contribution to the EvaLatin shared task, which is the first evaluation campaign devoted to the evaluation of NLP tools for Latin. We submitted a system based on UDPipe 2.0, one of the winners of the CoNLL 2018 Shared Task, The 2018 Shared Task on Extrinsic Parser Evaluation and SIGMORPHON 2019 Shared Task. Our system places first by a wide margin both in lemmatization and POS taggi…

    Submitted 5 June, 2020; originally announced June 2020.

    Comments: Accepted at EvaLatin 2020, LREC (Proceedings of Language Resources and Evaluation, Marseille, France)

  13. arXiv:1910.11295  [pdf, ps, other]

    cs.CL

    ÚFAL MRPipe at MRP 2019: UDPipe Goes Semantic in the Meaning Representation Parsing Shared Task

    Authors: Milan Straka, Jana Straková

    Abstract: We present a system description of our contribution to the CoNLL 2019 shared task, Cross-Framework Meaning Representation Parsing (MRP 2019). The proposed architecture is our first attempt towards a semantic parsing extension of the UDPipe 2.0, a lemmatization, POS tagging and dependency parsing pipeline. For the MRP 2019, which features five formally and linguistically different approaches to m…

    Submitted 24 October, 2019; originally announced October 2019.

  14. arXiv:1909.03544  [pdf, other]

    cs.CL

    Czech Text Processing with Contextual Embeddings: POS Tagging, Lemmatization, Parsing and NER

    Authors: Milan Straka, Jana Straková, Jan Hajič

    Abstract: Contextualized embeddings, which capture appropriate word meaning depending on context, have recently been proposed. We evaluate two methods for precomputing such embeddings, BERT and Flair, on four Czech text processing tasks: part-of-speech (POS) tagging, lemmatization, dependency parsing and named entity recognition (NER). The first three tasks, POS tagging, lemmatization and dependency parsi…

    Submitted 12 April, 2021; v1 submitted 8 September, 2019; originally announced September 2019.

    Comments: Fixed the incorrectly evaluated CNEC 2.0 results

  15. arXiv:1908.07448  [pdf, ps, other]

    cs.CL

    Evaluating Contextualized Embeddings on 54 Languages in POS Tagging, Lemmatization and Dependency Parsing

    Authors: Milan Straka, Jana Straková, Jan Hajič

    Abstract: We present an extensive evaluation of three recently proposed methods for contextualized embeddings on 89 corpora in 54 languages of the Universal Dependencies 2.3 in three tasks: POS tagging, lemmatization, and dependency parsing. Employing the BERT, Flair and ELMo as pretrained embedding inputs in a strong baseline of UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task a…

    Submitted 20 August, 2019; originally announced August 2019.

  16. arXiv:1908.06931  [pdf, other]

    cs.CL

    UDPipe at SIGMORPHON 2019: Contextualized Embeddings, Regularization with Morphological Categories, Corpora Merging

    Authors: Milan Straka, Jana Straková, Jan Hajič

    Abstract: We present our contribution to the SIGMORPHON 2019 Shared Task: Crosslinguality and Context in Morphology, Task 2: contextual morphological analysis and lemmatization. We submitted a modification of the UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies and an overall winner of The 2018 Shared Task on Extrins…

    Submitted 19 August, 2019; originally announced August 2019.

    Comments: Accepted by SIGMORPHON 2019: 16th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

  17. arXiv:1908.06926  [pdf, ps, other]

    cs.CL cs.LG

    Neural Architectures for Nested NER through Linearization

    Authors: Jana Straková, Milan Straka, Jan Hajič

    Abstract: We propose two neural network architectures for nested named entity recognition (NER), a setting in which named entities may overlap and also be labeled with more than one label. We encode the nested labels using a linearized scheme. In our first proposed approach, the nested labels are modeled as multilabels corresponding to the Cartesian product of the nested labels in a standard LSTM-CRF archit…

    Submitted 19 August, 2019; originally announced August 2019.

    Comments: Accepted by ACL 2019