Investigate Unpacking Ukrainian Analyzer
Closed, Resolved · Public · 8 Estimated Story Points

Assigned To

Authored By

	TJones
	Sep 21 2022, 6:46 PM

Description

See parent task for details.

Note that the Ukrainian analyzer is third-party, and so may have limitations on unpacking, depending on what's available in the plugin.

(Ukrainian was prioritized because it aligns with OKRs.)

Details

	Subject	Repo	Branch	Lines +/-
	Unpack and Upgrade Ukrainian Analysis Chain	mediawiki/extensions/CirrusSearch	master	+363 -22
	Build Ukrainian Analysis Plugin	search/extra	master	+1 K -0


Related Objects

Status	Assigned	Task
Open	None	T219550 [EPIC] Harmonize language analysis across languages
Resolved	Gehel	T272606 [EPIC] Unpack all Elasticsearch analyzers
Resolved	TJones	T318264 Investigate Unpacking Ukrainian Analyzer
Resolved	RKemper	T322776 Deploy Ukrainian Analyzer Plugin
Resolved	TJones	T323927 Reindex Ukrainian-language wikis to enable unpacked analysis

Event Timeline

TJones created this task.Sep 21 2022, 6:46 PM

Restricted Application added a subscriber: Base.Sep 21 2022, 6:46 PM

TJones mentioned this in T272606: [EPIC] Unpack all Elasticsearch analyzers.Sep 21 2022, 6:47 PM

bking updated the task description.Sep 26 2022, 3:18 PM

TJones set the point value for this task to 5.Sep 26 2022, 4:00 PM

MPhamWMF moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.Sep 29 2022, 12:00 AM

TJones moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board.Oct 24 2022, 4:37 PM

So... the components of the analyzer are all defined together in one object, and the elements are all clear in the code: standard tokenizer, lowercase, stopwords, and stemmer, along with a pre-tokenization char filter on line 50. The stopwords are available as a plaintext file, and the dictionary used for the stemmer has been extracted into its own separate artifact.
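As a rough illustration of what "unpacking" those components could look like, here is a minimal sketch that builds Elasticsearch-style index settings as a Python dict. The `standard` tokenizer and the `lowercase`, `stop`, and `mapping` types are standard Elasticsearch building blocks; every other name here (`ukrainian_charfilter`, `ukrainian_stop`, `ukrainian_stemmer`, the placeholder stemmer type, and the sample inputs) is hypothetical and not taken from the plugin or from Cirrus.

```python
def build_unpacked_settings(stopwords, char_mappings):
    """Return index settings for a hypothetical unpacked Ukrainian analyzer.

    The filter names and the stemmer type are placeholders for illustration;
    the real plugin may expose different identifiers.
    """
    return {
        "analysis": {
            "char_filter": {
                # Pre-tokenization character mappings, supplied by the caller.
                "ukrainian_charfilter": {
                    "type": "mapping",
                    "mappings": list(char_mappings),
                },
            },
            "filter": {
                # Stopword list recreated outside the third-party analyzer.
                "ukrainian_stop": {"type": "stop", "stopwords": list(stopwords)},
                # Placeholder type name: an assumption, not a real filter type.
                "ukrainian_stemmer": {"type": "ukrainian_stemmer_placeholder"},
            },
            "analyzer": {
                "text": {
                    "type": "custom",
                    "char_filter": ["ukrainian_charfilter"],
                    "tokenizer": "standard",
                    # Same order as the monolithic analyzer described above.
                    "filter": ["lowercase", "ukrainian_stop", "ukrainian_stemmer"],
                },
            },
        },
    }
```

Once the chain is spelled out like this, extra filters can be slotted in at any point, which is the main reason to unpack.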

I did a quick analysis of a small set of Ukrainian Wikipedia articles (just 500), and there are definitely some small fixes to be made: Cyrillic-Latin homoglyphs (especially Ukrainian і vs. Latin i) and bidi markers on Arabic and Latin tokens, plus the usual suspects for ICU folding.
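To make the bidi-marker issue concrete, here is a minimal sketch of stripping Unicode directional formatting characters from a token. The function name and the exact set of characters are my assumptions for illustration; this is not how the production analysis chain handles them.

```python
# Unicode bidirectional formatting characters: marks, embeddings,
# overrides, and isolates. The set is an illustrative choice.
BIDI_MARKS = {
    "\u200e", "\u200f",                                  # LRM, RLM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",    # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",              # LRI, RLI, FSI, PDI
}


def strip_bidi(token):
    """Remove invisible bidi control characters from a token."""
    return "".join(ch for ch in token if ch not in BIDI_MARKS)
```

Without this kind of cleanup, a token with a leading left-to-right mark looks identical to the plain token on screen but does not match it in the index.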

David and I talked about it and reviewed the code and other resources earlier today, and it looks like the best thing to do in the short term is fork the project for the analyzer and strip it down to just the stemmer. The other non-standard components—stopwords and char filter—can be recreated in Cirrus, like we have for other analyzers. And of course we already have the infrastructure for supporting our own plugins in general.

Since the stemming dictionary is a separate component, our new stemmer plugin would get most of the benefit of any likely future updates from there. I'm not too concerned about the stopword list—while it could be refined a bit in the future, the main list of common stopwords isn't going to change.

Longer term, we could make some upstream changes—the most obvious of which would be exposing all the components so anyone could customize, like we want to be able to do—but that's not in scope for this task and such changes wouldn't solve our immediate problem, since we are on ES 7.10 and not continuing with later versions of Elastic.

The other option is to skip Ukrainian, but David and I are both averse to that, since it would leave it out of any future generic analysis improvements.

I'm upping the estimate from 5 to 8 to reflect the broader scope of the task now.

TJones claimed this task.Oct 31 2022, 4:25 PM

Change 851086 had a related patch set uploaded (by Tjones; author: Tjones):

[search/extra@master] Build Ukrainian Stemmer Plugin

https://gerrit.wikimedia.org/r/851086

gerritbot added a project: Patch-For-Review.Oct 31 2022, 8:48 PM

Change 851086 merged by jenkins-bot:

[search/extra@master] Build Ukrainian Analysis Plugin

https://gerrit.wikimedia.org/r/851086

Maintenance_bot removed a project: Patch-For-Review.Nov 8 2022, 9:30 AM

TJones mentioned this in T322776: Deploy Ukrainian Analyzer Plugin.Nov 9 2022, 6:14 PM

Full write-up on MediaWiki:

Lowercasing and Multiple tokens

A small number of input tokens generate multiple stemmer output tokens
Stemmer output is sometimes capitalized
Sometimes, the multiple output tokens differ only by capitalization
Re-lowercasing and deduplicating removes about 5% of tokens!
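
A minimal sketch of the re-lowercase-and-deduplicate step, assuming the stemmer output is available as (position, token) pairs. The function name and the sample tokens are invented for illustration and are not real stemmer output.

```python
def relowercase_and_dedupe(stemmer_output):
    """Lowercase stemmer output and drop duplicates at the same position.

    stemmer_output: iterable of (position, token) pairs, where several
    tokens may share a position when the stemmer emits multiple stems.
    """
    seen = set()
    result = []
    for position, token in stemmer_output:
        key = (position, token.lower())
        if key not in seen:
            # First time this lowercased token appears at this position.
            seen.add(key)
            result.append(key)
    return result
```

For example, two outputs at the same position that differ only by capitalization collapse into one token, while the same token at a different position is kept.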

Homoglyphs

There are lots of mixed-script tokens with homoglyphs in them (Ukrainian і and ї are hard to type), and the homoglyph filter groups them with their fully Cyrillic counterparts!
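
A minimal sketch of the homoglyph idea: if a token mixes Cyrillic and Latin letters and every Latin letter has a Cyrillic lookalike, convert it to all-Cyrillic. The mapping table, the function names, and the conversion rule are simplified assumptions for illustration, not the actual homoglyph filter.

```python
# Latin lowercase letters and their Cyrillic lookalikes (partial list).
LATIN_TO_CYRILLIC = {
    "i": "\u0456",  # Ukrainian i
    "a": "\u0430",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
    "c": "\u0441",
    "x": "\u0445",
    "y": "\u0443",
}


def is_cyrillic(ch):
    return "\u0400" <= ch <= "\u04ff"


def is_latin(ch):
    return ch.isascii() and ch.isalpha()


def normalize_homoglyphs(token):
    """Convert a mixed-script token to all-Cyrillic when that is possible."""
    has_cyrillic = any(is_cyrillic(ch) for ch in token)
    has_latin = any(is_latin(ch) for ch in token)
    if not (has_cyrillic and has_latin):
        return token  # Single-script tokens are left alone.
    converted = "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in token)
    if any(is_latin(ch) for ch in converted):
        return token  # Some Latin letter has no lookalike; do not guess.
    return converted
```

Pure Latin tokens such as "wiki" pass through unchanged, so only genuinely mixed-script tokens are affected.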

ICU Folding

No exceptions enabled
- й/и and ї/і are all in the Ukrainian alphabet, but folding them causes very few mergers, and most are obviously good (typos, or inconsistently transliterated names)
- ґ/г is already folded by the Ukrainian analyzer, so I didn't mess with it. (ґ was only added back to the alphabet in 1990!)
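
To show why й/и and ї/і merge under folding while ґ/г is a separate matter, here is a rough approximation using Unicode decomposition. This is only a sketch: ICU folding does far more than strip combining marks, and the function name is my own.

```python
import unicodedata


def fold_diacritics(text):
    """Approximate diacritic folding: decompose, drop combining marks, recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)
```

Under this approximation й (U+0439) becomes и and ї (U+0457) becomes і, because both decompose into a base letter plus a combining mark. ґ (U+0491) has no canonical decomposition, so it is left untouched here.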

Patch with config updates and tests coming soon.

Change 858373 had a related patch set uploaded (by Tjones; author: Tjones):

[mediawiki/extensions/CirrusSearch@master] Unpack and Upgrade Ukrainian Analysis Chain

https://gerrit.wikimedia.org/r/858373

gerritbot added a project: Patch-For-Review.Nov 18 2022, 7:58 PM

TJones moved this task from In Progress to Needs review on the Discovery-Search (Current work) board.Nov 18 2022, 8:32 PM

Change 858373 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Unpack and Upgrade Ukrainian Analysis Chain

https://gerrit.wikimedia.org/r/858373

TJones moved this task from Needs review to To Be Deployed on the Discovery-Search (Current work) board.Nov 22 2022, 4:53 PM

ReleaseTaggerBot added a project: MW-1.40-notes (1.40.0-wmf.12; 2022-11-28).Nov 22 2022, 5:01 PM

Maintenance_bot removed a project: Patch-For-Review.Nov 22 2022, 5:30 PM