
Make LanguageCode::bcp47() available in Lua
Open, Needs Triage, Public, Feature

Description

In MediaWiki PHP code, the function LanguageCode::bcp47() is used to map MediaWiki language codes (like simple) into standard IETF BCP 47 language tags (like en-simple). This function is not available to Lua; when a module wants to emit an HTML lang= attribute based on some on-wiki data (e.g. Lexeme data from Wikidata), it either has to reimplement this mapping, or (more likely) accept that some non-standard language codes will be emitted. Scribunto should make this function available, probably somewhere in the mw.language library.
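For illustration, the "reimplement this mapping" workaround could look roughly like the sketch below. The helper name toHtmlLang and the table name are made up for this example, and the table holds only the simple entry mentioned above; the real PHP mapping covers more codes and may do more than a table lookup.

```lua
-- Illustrative, incomplete table: only the "simple" entry is taken from this task.
local nonStandardCodes = {
    simple = 'en-simple',
}

-- Hypothetical helper, not part of Scribunto.
-- Returns the mapped tag if the code is known to be non-standard,
-- otherwise returns the code unchanged.
local function toHtmlLang( code )
    return nonStandardCodes[ code ] or code
end
```

A module would pass the result to its lang= attribute. Any code missing from the hand-maintained table is emitted unchanged, which is exactly the maintenance burden this task wants to avoid.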

Event Timeline

I can imagine three possible interfaces (not counting name variations):

htmlCode = mw.language.bcp47( languageCode )
htmlCode = mw.language.new( languageCode ).bcp47
htmlCode = mw.language.new( languageCode ):bcp47()

That is, a “static” method on mw.language, a field on an mw.language instance, or a method on an mw.language instance. Since mw.language:getCode is also a method, I think I’m leaning towards the third interface.

There’s probably also some possible bikeshedding over whether we want to use the term “BCP 47” (and if yes, on its own or with “code” or “tag”), or call it an “HTML language code”, or something else.

(I’m also tempted to add a method like mw.html:inLanguage() to mw.html, which would be a shortcut for setting the lang and dir attributes – in that case, the bcp47() method name wouldn’t matter as much, because most users wouldn’t use it directly.)

Tacsipacsi changed the subtype of this task from "Task" to "Feature Request". Jun 14 2022, 12:44 PM
Tacsipacsi subscribed.

Thanks for filing this task! I was planning to do it for months, but didn’t get around to doing it…

The problem with a field or instance method is that the number of distinct language objects that can be constructed using mw.language.new is limited ($wgScribuntoEngineConf[*]['maxLangCacheSize'], by default 30 for both sandboxed and standalone Lua). On multilingual wikis like Commons, this can quickly be exceeded. LanguageCode::bcp47() itself is quite cheap, so there’s no need to constrain the number of processed languages. (getCode is different: no one would ever want to write mw.language.new( lang ):getCode(), as it’s exactly the same as simply lang; getCode is useful when an already-constructed language object is passed around, so it would make no sense as a static method.)

Unfortunately mw.language:isRTL is also expensive (it needs to load the language file for the given language), so mw.html:inLanguage would also be expensive if it set dir to anything other than auto (but auto is dumb, it’s unable to handle mixed-directionality text, which often happens, as content may not be 100% translated).

The only non-expensive solution here is the static mw.language.bcp47, so I think that should be implemented (the others can be implemented as convenience functions, but not as the only solution).

Hm, that’s a very good point. Let’s just limit this task to mw.language.bcp47() for now, then?

My only caution is that using strings to represent languages can be quite error-prone, given that there are different "string codes" floating around for the same language. In an ideal world, I'd say you should create a lightweight language object in Lua that you can pass around without actually invoking mw.language.new (or bump the maxLangCacheSize counter only when actually loading the language file, not simply when creating the object). ::getBcp47Code() (or ::getHtmlCode()) and ::getCode() can be invoked on that lightweight object to make it clear which code you are using, and similarly, mw.language.newFromBcp47() and mw.language.newFromMw() would make it clear which "type of string" you are using to create the language object.
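A rough sketch of what such a lightweight object might look like in plain Lua follows. Everything here is hypothetical: the LightLanguage table and its functions do not exist in Scribunto, the mapping table contains only the simple entry from the task description, and tag lookup is case-sensitive for brevity.

```lua
-- Illustrative, incomplete mapping: only "simple" is taken from this task.
local mwToBcp47 = { simple = 'en-simple' }

-- Build the reverse lookup once.
local bcp47ToMw = {}
for mwCode, tag in pairs( mwToBcp47 ) do
    bcp47ToMw[ tag ] = mwCode
end

-- Hypothetical lightweight language object; it never calls mw.language.new,
-- so it would not count against maxLangCacheSize.
local LightLanguage = {}
LightLanguage.__index = LightLanguage

-- Construct from a MediaWiki language code.
function LightLanguage.newFromMw( code )
    return setmetatable( { code = code }, LightLanguage )
end

-- Construct from a BCP 47 tag, storing the MediaWiki code internally.
function LightLanguage.newFromBcp47( tag )
    return LightLanguage.newFromMw( bcp47ToMw[ tag ] or tag )
end

-- MediaWiki code, e.g. "simple".
function LightLanguage:getCode()
    return self.code
end

-- BCP 47 tag, e.g. "en-simple".
function LightLanguage:getBcp47Code()
    return mwToBcp47[ self.code ] or self.code
end
```

The object always stores the MediaWiki code, so callers have to state which kind of string they are passing in and which kind they want back.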

Introducing a lightweight language object also makes sense, although I don’t think mw.language.newFromBcp47() would be used a lot – usually the language comes from MediaWiki (page language, content language, {{int:lang}} hack etc.) and is then converted to BCP 47 for use in HTML attributes, not the other way round. In an ideal world, MediaWiki would just use BCP 47 language codes instead of inventing different ones for some languages…

Apparently @cscott recently added toBcp47Code(), but as a non-static instance method.

I think the expensiveness argument from earlier needs to be reevaluated anyway, though, in light of T342418: Speed up Language creation – both creating a language and looking up its RTL flag should be fast now, as far as I’m aware. (But I don’t know if Scribunto was updated for that or if it still includes it in the expensive function count.)
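If language creation and the RTL lookup are indeed cheap now, the mw.html:inLanguage() idea floated earlier could be approximated in module code along these lines. This is only a sketch: inLanguage is a hypothetical local helper, not an existing mw.html method; it runs only inside Scribunto; and it relies on the toBcp47Code() instance method mentioned above, so it assumes a Scribunto version that provides it.

```lua
-- Hypothetical helper, not an existing mw.html method.
-- element: an mw.html object; code: a MediaWiki language code such as "simple".
local function inLanguage( element, code )
    -- Each distinct code still counts against the language object limit.
    local lang = mw.language.new( code )
    return element
        :attr( 'lang', lang:toBcp47Code() ) -- instance method mentioned in this thread
        :attr( 'dir', lang:getDir() )       -- "ltr" or "rtl"
end
```

Because it goes through mw.language.new, a page that uses many distinct languages can still run into the language object limit; only a static function would avoid that.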

> Apparently @cscott recently added toBcp47Code(), but as a non-static instance method.

I’ve already used it, forgetting about the fact that it supposedly doesn’t exist. 🙂

> I think the expensiveness argument from earlier needs to be reevaluated anyway, though, in light of T342418: Speed up Language creation – both creating a language and looking up its RTL flag should be fast now, as far as I’m aware. (But I don’t know if Scribunto was updated for that or if it still includes it in the expensive function count.)

Creating a new language object is not an “expensive parser function” as tracked by the core parser (I don’t know if it has ever been one, but it definitely wasn’t when I wrote T310581#8002337). It uses a separate counter, which still defaults to 30, but has been 200 in WMF production since T85461.

> Apparently @cscott recently added toBcp47Code(), but as a non-static instance method.

> I’ve already used it, forgetting about the fact that it supposedly doesn’t exist. 🙂

Yeah, that’s how I found it :D

Michael subscribed.

I wonder if this has some overlap with T253387?

Not really. Given simplewiki, T253387 should give simple and wikipedia, as simple is a valid “non-standard language code” (i.e. a language code that is known to MediaWiki but isn’t valid, or has a slightly different meaning, in BCP 47). Given simple, this task gives en-simple, which is the BCP 47 language tag for Simple English.