See parent task for details.
Note that the Ukrainian analyzer is third-party, and so may have limitations on unpacking, depending on what's available in the plugin.
(Ukrainian was prioritized because it aligns with OKRs.)
Subject | Repo | Branch | Lines +/-
---|---|---|---
Unpack and Upgrade Ukrainian Analysis Chain | mediawiki/extensions/CirrusSearch | master | +363 -22
Build Ukrainian Analysis Plugin | search/extra | master | +1 K -0
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T219550 [EPIC] Harmonize language analysis across languages
Resolved | | Gehel | T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved | | TJones | T318264 Investigate Unpacking Ukrainian Analyzer
Resolved | | RKemper | T322776 Deploy Ukrainian Analyzer Plugin
Resolved | | TJones | T323927 Reindex Ukrainian-language wikis to enable unpacked analysis
So... the components of the analyzer are all defined together in one object, and the elements are all clear in the code: standard tokenizer, lowercase, stopwords, and stemmer, along with a pre-tokenization char filter on line 50. The stopwords are available as a plaintext file, and the dictionary used for the stemmer has been extracted into its own separate artifact.
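For illustration, here is a rough Python sketch of what an unpacked chain with those components could look like as Elasticsearch index settings. The `mapping` char filter, `stop` filter, `standard` tokenizer, and `lowercase` filter are standard Elasticsearch building blocks. Everything else is an assumption: the `ukrainian_stemmer` filter type is a hypothetical name for whatever a plugin would register, the char filter mappings are placeholders rather than the upstream list, and the stopwords passed in are sample values, not the real file.

```python
import json


def build_unpacked_settings(stopwords):
    """Return index settings for a hand-assembled (unpacked) Ukrainian chain.

    This is a sketch, not the actual CirrusSearch configuration.
    """
    return {
        "analysis": {
            "char_filter": {
                # Placeholder mappings (apostrophe variants to ASCII apostrophe).
                # The real list should come from the upstream analyzer source.
                "ukrainian_charfilter": {
                    "type": "mapping",
                    "mappings": ["\u2019=>'", "\u02bc=>'"],
                }
            },
            "filter": {
                # Stopwords would be loaded from the plaintext file in practice.
                "ukrainian_stop": {"type": "stop", "stopwords": list(stopwords)},
                # Hypothetical filter type that a stemmer plugin would register.
                "uk_stemmer": {"type": "ukrainian_stemmer"},
            },
            "analyzer": {
                "text": {
                    "type": "custom",
                    "char_filter": ["ukrainian_charfilter"],
                    "tokenizer": "standard",
                    # Order matters: lowercase first, then stopwords, then stemming.
                    "filter": ["lowercase", "ukrainian_stop", "uk_stemmer"],
                }
            },
        }
    }


if __name__ == "__main__":
    sample = build_unpacked_settings(["і", "та", "в"])
    print(json.dumps(sample, ensure_ascii=False, indent=2))
```

Once the chain is spelled out like this, extra filters (homoglyph handling, ICU folding, and so on) can be slotted in at the appropriate point, which is not possible while the analyzer is a single opaque object.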
I did a quick analysis of a small set of Ukrainian Wikipedia articles (just 500), and there are definitely some small fixes to be made: Cyrillic-Latin homoglyphs (esp. Ukrainian і vs Latin i) and bidi markers on Arabic and Latin tokens, plus the usual suspects for ICU folding.
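To make the two token-level problems concrete, here is a small Python sketch that strips bidi control characters and flags tokens mixing Cyrillic and Latin letters. It is only a diagnostic illustration using the standard library, not the filter logic used in production. The function names and the exact set of bidi characters covered are my own assumptions.

```python
import re
import unicodedata

# Common bidi controls: LRM/RLM, embeddings/overrides, and isolates.
# The exact set to strip is an assumption made for this sketch.
BIDI_MARKS = re.compile("[\u200e\u200f\u202a-\u202e\u2066-\u2069]")


def strip_bidi(token):
    """Remove invisible bidi control characters from a token."""
    return BIDI_MARKS.sub("", token)


def scripts_in(token):
    """Return the set of scripts (Cyrillic/Latin only) used by letters in a token."""
    found = set()
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            found.add("Cyrillic")
        elif name.startswith("LATIN"):
            found.add("Latin")
    return found


def is_mixed_script(token):
    """True if a token contains both Cyrillic and Latin letters."""
    return scripts_in(token) == {"Cyrillic", "Latin"}


if __name__ == "__main__":
    latin_i = "к\u0069т"      # Latin i (U+0069) between Cyrillic letters
    cyrillic_i = "к\u0456т"   # Ukrainian і (U+0456), all Cyrillic
    print(is_mixed_script(latin_i), is_mixed_script(cyrillic_i))
    print(repr(strip_bidi("Wikipedia\u200e")))
```

A mixed-script token is only a candidate for repair. Deciding which characters to convert (Latin to Cyrillic or the reverse) needs the surrounding letters as context.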
David and I talked about it and reviewed the code and other resources earlier today, and it looks like the best thing to do in the short term is fork the project for the analyzer and strip it down to just the stemmer. The other non-standard components—stopwords and char filter—can be recreated in Cirrus, like we have for other analyzers. And of course we already have the infrastructure for supporting our own plugins in general.
Since the stemming dictionary is a separate component, our new stemmer plugin would get most of the benefit of any likely future updates from there. I'm not too concerned about the stopword list—while it could be refined a bit in the future, the main list of common stopwords isn't going to change.
Longer term, we could make some upstream changes—the most obvious of which would be exposing all the components so anyone could customize, like we want to be able to do—but that's not in scope for this task and such changes wouldn't solve our immediate problem, since we are on ES 7.10 and not continuing with later versions of Elastic.
The other option is to skip Ukrainian, but David and I are both averse to that, since it would leave it out of any future generic analysis improvements.
I'm upping the estimate from 5 to 8 to reflect the broader scope of the task now.
Change 851086 had a related patch set uploaded (by Tjones; author: Tjones):
[search/extra@master] Build Ukrainian Stemmer Plugin
Change 851086 merged by jenkins-bot:
[search/extra@master] Build Ukrainian Analysis Plugin
Full write-up on MediaWiki:
Lowercasing and Multiple tokens
Homoglyphs
ICU Folding
Patch with config updates and tests coming soon.
Change 858373 had a related patch set uploaded (by Tjones; author: Tjones):
[mediawiki/extensions/CirrusSearch@master] Unpack and Upgrade Ukrainian Analysis Chain
Change 858373 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Unpack and Upgrade Ukrainian Analysis Chain