Wikifunctions:Type proposals/Wikidata based types: Difference between revisions
Revision as of 23:39, 16 July 2024
Summary
This page describes proposed Wikifunctions types for Lexemes, Lexeme forms, Wikidata items, Wikidata statements, and Wikidata properties. They are modeled closely after the structure of the corresponding types in Wikidata, and Wikifunctions' content for instances of these types will be drawn from Wikidata. The initial motivation for creating these types is to have access to lexicographic content provided by Wikidata. For an overview of the Lexicographic data model in Wikidata, see the Lexicographical data documentation on Wikidata.
Each of the five proposed types has its own top-level section, following the Uses section. After the type descriptions, there are additional sections covering discussion topics, and a section for comments.
Type references in this page: Types are in general referenced using the form "Zx/Label"; the new types proposed here are shown as "Zlll/Lexeme", "Zfff/Lexeme form", "Ziii/Wikidata item", "Zsss/Wikidata statement", and "Zppp/Wikidata property", since their final Z IDs are not determined yet. Other types that don't yet exist (e.g., Lexeme sense) are shown using "Z0". For general information regarding Wikifunctions' representational model and its terminology, please see Wikifunctions:Function_model.
Uses
The proposed Lexeme, Lexeme form, Wikidata item, Wikidata statement, and Wikidata property types are needed to represent linguistic knowledge that is available on Wikidata. This knowledge will be used by a wide variety of Natural Language Generation (NLG) functions, including functions that will be used for Abstract Wikipedia. (Other uses of the more general types, namely Wikidata item, Wikidata statement, and Wikidata property, will likely arise in future.)
The initial uses of these types will be as input and output types of linguistic knowledge-access functions such as the following (which will serve as building blocks for other NLG functions). These are suggestive examples; this is not a comprehensive list and not part of the type proposal per se.
- get Lexeme Forms from Lexeme
  - Input: Lexeme
  - Output: Typed list( Lexeme form )
- get text from Lexeme Form
  - Input: Lexeme form
  - Output: Multilingual text
- get grammatical features of Lexeme Form
  - Input: Lexeme form
  - Output: Typed list( Wikidata item )
- get labels from Wikidata Item
  - Input: Wikidata item
  - Output: Multilingual text
- get Form from Lexeme by grammatical features
  - Input 1: Lexeme
  - Input 2: Typed list( Wikidata item )
  - Output: Typed list( Lexeme form )
- get text from Lexeme by grammatical features
  - Input 1: Lexeme
  - Input 2: Typed list( Wikidata item )
  - Output: Multilingual text
- get plural from English Lexeme
  - Input: Lexeme
  - Output: Monolingual text
- get grammatical gender of Lexeme
  - Input: Lexeme
  - Output: Wikidata item
- get Item value for property from Lexeme
  - Input 1: Lexeme
  - Input 2: Wikidata property
  - Output: Wikidata item
- get statements for property from Lexeme
  - Input 1: Lexeme
  - Input 2: Wikidata property
  - Output: Typed list( Wikidata statement )
- get Item value from statement
  - Input: Wikidata statement
  - Output: Wikidata item
- get gender of German noun
  - Input: Lexeme
  - Output: German grammatical gender
- choose correct German adjective Form for a noun
  - Input 1: Lexeme (Adjective)
  - Input 2: Lexeme (Noun)
  - Output: Lexeme form
- German undetermined noun phrase from a noun and adjective
  - Input 1: Lexeme (Adjective)
  - Input 2: Lexeme (Noun)
  - Output: Monolingual text
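As a rough illustration of a few of these building blocks, the Python sketch below assumes that a Lexeme reaches the code as a plain dictionary keyed like the canonical example values on this page (ZlllK7 for forms, ZfffK3 for representations, ZfffK4 for grammatical features). The function names are hypothetical, and matching a Form when it carries all requested features is an assumption rather than part of the proposal.

```python
def items(typed_list):
    """Return the elements of a canonical typed list, skipping the leading type marker."""
    return typed_list[1:]


def get_forms(lexeme):
    """Sketch of 'get Lexeme Forms from Lexeme'."""
    return items(lexeme["ZlllK7"])


def get_grammatical_features(form):
    """Sketch of 'get grammatical features of Lexeme Form'."""
    return items(form["ZfffK4"])


def get_text(form):
    """Sketch of 'get text from Lexeme Form': returns the Multilingual text object."""
    return form["ZfffK3"]


def get_forms_by_features(lexeme, features):
    """Sketch of 'get Form from Lexeme by grammatical features'.

    Assumption: a Form matches when it has every requested feature.
    """
    wanted = set(features)
    return [
        form for form in get_forms(lexeme)
        if wanted <= set(get_grammatical_features(form))
    ]


def _form(form_id, text, feature):
    """Helper that builds a trimmed-down Lexeme form for the demo below."""
    return {
        "Z1K1": "Zfff",
        "ZfffK1": form_id,
        "ZfffK2": "L3345",
        "ZfffK3": {
            "Z1K1": "Z12",
            "Z12K1": ["Z11", {"Z1K1": "Z11", "Z11K1": "Z1002", "Z11K2": text}],
        },
        "ZfffK4": ["Ziii", feature],
    }


# Trimmed-down copy of the L3345 ("word") example value from this page.
word = {
    "Z1K1": "Zlll",
    "ZlllK1": "L3345",
    "ZlllK7": ["Zfff",
               _form("L3345F1", "word", "Q110786"),
               _form("L3345F2", "words", "Q146786")],
}

plural_forms = get_forms_by_features(word, ["Q146786"])
print([form["ZfffK1"] for form in plural_forms])  # ['L3345F2']
```

Functions such as "get plural from English Lexeme" could then be compositions of these pieces.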
Lexeme
A Lexeme represents a lexeme as described in the Wikidata lexicographic data model. It roughly represents the idea of a word or an entry in a lexicon.
Keys
A Lexeme consists of the following Keys with the given value Types. Keys that are not needed for initial work on NLG functions are tagged as either Stretch goal or Out of scope.
Key | Label | Type | |
---|---|---|---|
K1 | identity | Zlll/Lexeme | |
K2 | lemmas | Z12/Multilingual text | |
K3 | language | Z60/Natural language | |
K4 | part of speech | Ziii/Wikidata item | |
K5 | statements | Typed list( Zsss/Wikidata statement ) | Stretch goal |
K6 | senses | Typed list( Z0/Lexeme sense ) | Out of scope |
K7 | forms | Typed list( Zfff/Lexeme form ) | |
Example values
Value for the Lexeme word (L3345).
{
"type": "Lexeme",
"identity": "L3345",
"lemmas": {
"type": "Multilingual text",
"texts": ["Monolingual text",
{
"type": "Monolingual text",
"language": "English",
"text": "word"
}
]
},
"language": "English",
"part of speech": "noun",
"statements": ["Wikidata statement"],
"forms": ["Lexeme form",
{
"type": "Lexeme form",
"identity": "L3345F1",
"lexeme": "L3345",
"representations": {
"type": "Multilingual text",
"texts": ["Monolingual text",
{
"type": "Monolingual text",
"language": "English",
"text": "word"
}
]
},
"grammatical features": ["Wikidata item",
"singular"
],
"statements": ["Wikidata statement"]
},
{
"type": "Lexeme form",
"identity": "L3345F2",
"lexeme": "L3345",
"representations": {
"type": "Multilingual text",
"texts": ["Monolingual text",
{
"type": "Monolingual text",
"language": "English",
"text": "words"
}
]
},
"grammatical features": ["Wikidata item",
"plural"
],
"statements": ["Wikidata statement"]
}
]
}
|
{
"Z1K1": "Zlll",
"ZlllK1": "L3345",
"ZlllK2": {
"Z1K1": "Z12",
"Z12K1": ["Z11",
{
"Z1K1": "Z11",
"Z11K1": "Z1002",
"Z11K2": "word"
}
]
},
"ZlllK3": "Z1002",
"ZlllK4": "Q1084",
"ZlllK5": ["Zsss"],
"ZlllK7": ["Zfff",
{
"Z1K1": "Zfff",
"ZfffK1": "L3345F1",
"ZfffK2": "L3345",
"ZfffK3": {
"Z1K1": "Z12",
"Z12K1": ["Z11",
{
"Z1K1": "Z11",
"Z11K1": "Z1002",
"Z11K2": "word"
}
]
},
"ZfffK4": ["Ziii",
"Q110786"
],
"ZfffK5": ["Zsss"]
},
{
"Z1K1": "Zfff",
"ZfffK1": "L3345F2",
"ZfffK2": "L3345",
"ZfffK3": {
"Z1K1": "Z12",
"Z12K1": ["Z11",
{
"Z1K1": "Z11",
"Z11K1": "Z1002",
"Z11K2": "words"
}
]
},
"ZfffK4": ["Ziii",
"Q146786"
],
"ZfffK5": ["Zsss"]
}
]
}
|
Validator
Initially, the validator doesn't do anything. As we improve our understanding of how Lexemes are used, the validator could ensure that:
- the languages in the lemmas field fit the language field
- there are lemmas
- the part of speech is from the correct set of parts of speech for the given language
- the Forms point back to the Lexeme
- the Forms are in languages that fit the language field
- the right Forms are available
- the Forms have the grammatical features expected for the given part of speech and language
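Two of these checks are simple enough to sketch in Python. The snippet below is only an illustration, assuming the Lexeme is available as a dictionary keyed like the canonical example above; the function name and the wording of the messages are hypothetical. It covers "there are lemmas" and "the Forms point back to the Lexeme"; the language and part-of-speech checks would need per-language knowledge that is not defined on this page.

```python
def validate_lexeme(lexeme):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    lexeme_id = lexeme.get("ZlllK1")

    # Check: there is at least one lemma (the first list element is the type marker).
    lemmas = lexeme.get("ZlllK2", {}).get("Z12K1", ["Z11"])[1:]
    if not lemmas:
        problems.append("Lexeme has no lemmas")

    # Check: every Form points back to this Lexeme through its K2/lexeme key.
    for form in lexeme.get("ZlllK7", ["Zfff"])[1:]:
        if form.get("ZfffK2") != lexeme_id:
            problems.append(
                f"Form {form.get('ZfffK1')} does not point back to {lexeme_id}"
            )
    return problems
```

Returning a list of messages rather than a single boolean is a design choice made here so that several problems can be reported at once.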
Identity
Two Lexemes are the same if they have the same value for identity.
Converting to code
Python
A Python dictionary that follows the structure of the ZObject.
JavaScript
A JavaScript object that follows the structure of the ZObject.
Renderer
Initially, we don't have a bespoke renderer. We plan to add one later when we understand better how the Type works.
Parsers
Initially, we don't create a bespoke parser. We plan to add one later when we understand better how the Type works.
Lexeme form
A Lexeme form represents a form as described in the Wikidata lexicographic data model. It roughly represents the idea of a word that is adapted to its grammatical role, e.g. the verb used for the third person present in English, or the noun in plural when needed.
Keys
A Lexeme form consists of the following Keys with the given value Types. Keys that are not needed for initial work on NLG functions are tagged as Stretch goal.
Key | Label | Type | |
---|---|---|---|
K1 | identity | Zfff/Lexeme form | |
K2 | lexeme | Zlll/Lexeme | |
K3 | representations | Z12/Multilingual Text | |
K4 | grammatical features | Typed List( Ziii/Wikidata item ) | |
K5 | statements | Typed list( Zsss/Wikidata statement ) | Stretch goal |
Example values
Value for the plural form "colours/colors":
{
"type": "Lexeme form",
"identity": "L1347F2",
"lexeme": "L1347",
"representations": {
"type": "Multilingual text",
"texts": ["Monolingual text",
{
"type": "Monolingual text",
"language": "British English",
"text": "colours"
},
{
"type": "Monolingual text",
"language": "Canadian English",
"text": "colours"
},
{
"type": "Monolingual text",
"language": "American English",
"text": "colors"
}
]
},
"grammatical features": ["Wikidata item",
"plural"
],
"statements": ["Wikidata statement"]
}
|
{
"Z1K1": "Zfff",
"ZfffK1": "L1347F2",
"ZfffK2": "L1347",
"ZfffK3": {
"Z1K1": "Z12",
"Z12K1": ["Z11",
{
"Z1K1": "Z11",
"Z11K1": "Z1199",
"Z11K2": "colours"
},
{
"Z1K1": "Z11",
"Z11K1": "Z1437",
"Z11K2": "colours"
},
{
"Z1K1": "Z11",
"Z11K1": "Z1689",
"Z11K2": "colors"
}
]
},
"ZfffK4": ["Ziii",
"Q146786"
],
"ZfffK5": ["Zsss"]
}
|
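Because a Form can carry one representation per language variant, generation code will usually need to pick the one that matches its target variant. The following Python sketch shows one possible lookup, assuming the dictionary layout of the canonical example above; the function name and the optional fallback parameter are hypothetical.

```python
def representation_for(form, language, fallback=None):
    """Return the representation string for a language ZID, or None if absent.

    If the requested language is missing and a fallback language is given,
    the fallback's representation is returned instead.
    """
    texts = form["ZfffK3"]["Z12K1"][1:]  # skip the typed-list marker
    by_language = {mono["Z11K1"]: mono["Z11K2"] for mono in texts}
    if language in by_language:
        return by_language[language]
    return by_language.get(fallback)


# The plural Form of L1347 from the example above, without statements.
colours = {
    "Z1K1": "Zfff",
    "ZfffK1": "L1347F2",
    "ZfffK2": "L1347",
    "ZfffK3": {
        "Z1K1": "Z12",
        "Z12K1": ["Z11",
                  {"Z1K1": "Z11", "Z11K1": "Z1199", "Z11K2": "colours"},
                  {"Z1K1": "Z11", "Z11K1": "Z1437", "Z11K2": "colours"},
                  {"Z1K1": "Z11", "Z11K1": "Z1689", "Z11K2": "colors"}],
    },
    "ZfffK4": ["Ziii", "Q146786"],
}

print(representation_for(colours, "Z1689"))  # colors
```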
Validator
The validator ensures that:
Identity
Two Lexeme forms are the same if their identity is the same.
Converting to code
Python
A Python dictionary that follows the structure of the ZObject.
JavaScript
A JavaScript object that follows the structure of the ZObject.
Renderer
Initially, we don't have a bespoke renderer. We plan to add one later when we understand better how the Type works.
Parsers
Initially, we don't have a bespoke parser. We plan to add one later when we understand better how the Type works.
Wikidata item
A Wikidata item represents an item as described in the Wikibase data model.
Keys
A Wikidata item consists of the following Keys with the given value Types. Keys that are not needed for initial work on NLG functions are tagged as Out of scope.
Key | Label | Type | |
---|---|---|---|
K1 | identity | Ziii/Wikidata Item | |
K2 | labels | Z12/Multilingual Text | |
K3 | aliases | Z32/Multilingual Stringset | Out of scope |
K4 | descriptions | Z12/Multilingual Text | Out of scope |
K5 | sitelinks | Typed List( Z0/Wikidata Sitelink ) | Out of scope |
K6 | statements | Typed List( Zsss/Wikidata Statement ) | Out of scope |
Example values
Value for plural (with only the keys which are in scope for now):
{
"type": "Wikidata item",
"identity": "Q146786",
"labels": {
"type": "Multilingual text",
"texts": ["Monolingual text",
{
"type": "Monolingual text",
"language": "English",
"text": "plural"
},
{
"type": "Monolingual text",
"language": "Korean",
"text": "복수"
},
{
"type": "Monolingual text",
"language": "Croatian",
"text": "množina"
},
…
]
}
}
|
{
"Z1K1": "Ziii",
"ZiiiK1": "Q146786",
"ZiiiK2": {
"Z1K1": "Z12",
"Z12K1": ["Z11",
{
"Z1K1": "Z11",
"Z11K1": "Z1002",
"Z11K2": "plural"
},
{
"Z1K1": "Z11",
"Z11K1": "Z1643",
"Z11K2": "복수"
},
{
"Z1K1": "Z11",
"Z11K1": "Z1272",
"Z11K2": "množina"
},
…
]
}
}
|
Validator
Initially, the validator doesn't do anything. As we improve our understanding of how Lexemes are used, the validator could do more things.
Identity
Two Wikidata Items are the same if they have the same value for identity.
Converting to code
Python
A Python dictionary that follows the structure of the ZObject.
JavaScript
A JavaScript object that follows the structure of the ZObject.
Renderer
Initially, we don't have a bespoke renderer. We plan to add one later when we understand better how the Type works.
Parsers
Initially, we don't create a bespoke parser. We plan to add one later when we understand better how the Type works.
Wikidata statement
A Wikidata statement represents a statement as described in the Wikibase data model. An instance of this type roughly represents a simple statement in a natural language, e.g. "Paris is the capital of France", or, more pertinently, "the French word soleil (sun) is of the masculine grammatical gender".
Note that initially we only support item values: only statements that have an item value will be represented. We will not initially support no-value snaks or some-value snaks, and we do not represent qualifiers or sources.
Keys
A Wikidata statement consists of the following Keys with the given value Types. Keys that are not needed for initial work on NLG functions are tagged as either Stretch goal or Out of scope.
Key | Label | Type | |
---|---|---|---|
K1 | subject | Z1/Object | |
K2 | predicate | Zppp/Wikidata property | Stretch goal |
K3 | value | Z1/Object | |
K4 | qualifiers | Typed List( Z0/Wikidata qualifier ) | Out of scope |
K5 | sources | Typed list( Z0/Wikidata source ) | Out of scope |
K6 | rank | Z0/Statement rank | Out of scope |
K7 | identity | Zsss/Wikidata statement | Out of scope |
(Out of scope) Statements with no value or an unknown value are represented by special objects.
Example values
Value for the statement that the Lexeme Wort has the grammatical gender neuter:
{
"type": "Wikidata statement",
"subject": "Wort",
"predicate": "grammatical gender",
"value": "neuter"
}
|
{
"Z1K1": "Zsss",
"ZsssK1": "L2206",
"ZsssK2": "P5185",
"ZsssK3": "Q1775461"
}
|
Validator
Initially, the validator ensures that K3/value is a Ziii/Wikidata item and that the subject is either a Zlll/Lexeme, a Zfff/Lexeme form, or a Ziii/Wikidata item.
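One way to picture this check, assuming subjects and values appear as ID strings as in the canonical example above, is to test the shape of the IDs. The Python sketch below is hypothetical: a real validator might instead inspect the types of the dereferenced objects. The Form pattern accepts both the notation used on this page (L1347F2) and the hyphenated notation used on Wikidata (L1347-F2).

```python
import re

# Assumed ID shapes; these patterns are illustrative, not part of the proposal.
QID = re.compile(r"Q[1-9]\d*")                # Wikidata item
LID = re.compile(r"L[1-9]\d*")                # Lexeme
FID = re.compile(r"L[1-9]\d*-?F[1-9]\d*")     # Lexeme form


def _matches(value, patterns):
    """True if value is a string that fully matches at least one pattern."""
    return isinstance(value, str) and any(p.fullmatch(value) for p in patterns)


def validate_statement(statement):
    """Return a list of problems; an empty list means the statement passed."""
    problems = []
    if not _matches(statement.get("ZsssK1"), (QID, LID, FID)):
        problems.append("subject must be a Lexeme, Lexeme form, or Wikidata item")
    if not _matches(statement.get("ZsssK3"), (QID,)):
        problems.append("value must be a Wikidata item")
    return problems


# The example statement from this section.
statement = {"Z1K1": "Zsss", "ZsssK1": "L2206", "ZsssK2": "P5185", "ZsssK3": "Q1775461"}
print(validate_statement(statement))  # []
```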
Identity
Two statements are the same if their identity is the same. Initially, there is no identity. In that case, two statements are the same if their subject, predicate, and value are the same.
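The identity rule above can be sketched as a small comparison function. This is an illustration only, assuming dictionaries keyed like the canonical example; the function name is hypothetical. The case where only one of the two statements has an identity is not covered by the text above, so the sketch falls back to comparing subject, predicate, and value, which is an assumption.

```python
def same_statement(a, b):
    """Compare two statements by identity, or by their triple if identity is missing."""
    id_a, id_b = a.get("ZsssK7"), b.get("ZsssK7")
    if id_a is not None and id_b is not None:
        return id_a == id_b
    # No usable identity: compare subject (K1), predicate (K2), and value (K3).
    return all(a.get(key) == b.get(key) for key in ("ZsssK1", "ZsssK2", "ZsssK3"))


first = {"Z1K1": "Zsss", "ZsssK1": "L2206", "ZsssK2": "P5185", "ZsssK3": "Q1775461"}
second = dict(first)
print(same_statement(first, second))  # True
```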
Converting to code
Python
A Python dictionary that follows the structure of the ZObject.
JavaScript
A JavaScript object that follows the structure of the ZObject.
Renderer
Initially, we don't have a bespoke renderer. We plan to add one later when we understand better how the Type works.
Parsers
Initially, we don't have a bespoke parser. We plan to add one later when we understand better how the Type works.
Wikidata property
A Wikidata property represents a property as described in the Wikibase data model. The set of properties defines the possible predicates that can be used in a statement.
Keys
A Wikidata property consists of the following Keys with the given value Types. Keys that are not needed for initial work on NLG functions are tagged as Out of scope.
Key | Label | Type | |
---|---|---|---|
K1 | identity | Zppp/Wikidata property | |
K2 | data type | Z0/Wikidata data type | Out of scope |
K3 | labels | Z12/Multilingual text | Out of scope |
K4 | statements | Typed List( Zsss/Wikidata Statement ) | Out of scope |
(Out of scope) Properties have a good deal of special handling for representing constraints, formatters, etc. These are all out of scope initially.
Example values
Value for the property grammatical gender (with only the keys which are in scope for now):
{
"type": "Wikidata property",
"identity": "grammatical gender"
}
|
{
"Z1K1": "Zppp",
"ZpppK1": "P5185"
}
|
Validator
Initially, the validator doesn't do anything. As we improve our understanding of how Lexemes are used, the validator could do more things.
Identity
Two Wikidata properties are the same if they have the same value for identity.
Converting to code
Python
A Python dictionary that follows the structure of the ZObject.
JavaScript
A JavaScript object that follows the structure of the ZObject.
Renderer
Initially, we don't have a bespoke renderer. We plan to add one later when we understand better how the Type works.
Parsers
Initially, we don't create a bespoke parser. We plan to add one later when we understand better how the Type works.
Alternatives
- We could follow the Wikidata Lexicographic model less tightly
- We could have bespoke Types for each language
Notes and Questions
Transparent handling of IDs and Literals
Throughout this proposal, we assume that we magically handle QIDs, LIDs, and FIDs just like ZIDs, i.e. like references to Objects.
Alternatively, we could have made the “dereference a QID” function explicit, with Items and QIDs as separate types.
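To make the alternative concrete, the Python sketch below treats a QID as its own type and requires an explicit call to obtain the Item. Everything in it is hypothetical: the QID type, the dereference_qid function, and the in-memory ITEM_STORE that stands in for Wikidata.

```python
from typing import NewType

# A QID is just a string at runtime, but is kept distinct from an Item in signatures.
QID = NewType("QID", str)

# Hypothetical stand-in for Wikidata, holding a trimmed copy of the example item.
ITEM_STORE = {
    "Q146786": {
        "Z1K1": "Ziii",
        "ZiiiK1": "Q146786",
        "ZiiiK2": {
            "Z1K1": "Z12",
            "Z12K1": ["Z11", {"Z1K1": "Z11", "Z11K1": "Z1002", "Z11K2": "plural"}],
        },
    },
}


def dereference_qid(qid: QID) -> dict:
    """Explicitly turn a QID into the Item object it refers to."""
    try:
        return ITEM_STORE[qid]
    except KeyError:
        raise LookupError(f"unknown item {qid}") from None


item = dereference_qid(QID("Q146786"))
print(item["ZiiiK2"]["Z12K1"][1]["Z11K2"])  # plural
```

Under the transparent handling assumed by the proposal, the caller would never see this step: wherever a QID appears, the Item would already be there.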
What about literal Lexemes and Lexeme Forms?
Can we write a literal inline? We don't know. Because a Lexeme has as a key the Lexeme ID, technically the answer is probably no (just as we can't write Types inline). But it would be helpful for test cases if nothing else.
We are thinking "no support for inline Lexemes", but are not sure.
Items for Languages or Objects of Type Natural Language
The proposal assumes that we transparently translate both Wikidata QIDs for languages and IETF language codes to the appropriate Object of Type Z60/Natural Language.
Instead we could also have used the new Item type for Wikidata QIDs for a language and String for the language code, and have Functions providing the mapping.
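Such mapping Functions could look roughly like the following Python sketch. The table is a tiny hand-written sample: the ZIDs are taken from the example values on this page, on the assumption that they correspond to English, Korean, and Croatian as the parallel readable examples suggest, and the function names are hypothetical. A mapping from language QIDs would follow the same pattern.

```python
# Sample mapping from IETF language codes to Z60/Natural language ZIDs (assumed pairs).
CODE_TO_Z60 = {
    "en": "Z1002",
    "ko": "Z1643",
    "hr": "Z1272",
}

# Reverse table, built once, for going from a Z60 ZID back to its code.
Z60_TO_CODE = {zid: code for code, zid in CODE_TO_Z60.items()}


def natural_language_from_code(code):
    """Return the Z60 ZID for an IETF language code, or None if it is not mapped."""
    return CODE_TO_Z60.get(code.lower())


def code_from_natural_language(zid):
    """Return the IETF language code for a Z60 ZID, or None if it is not mapped."""
    return Z60_TO_CODE.get(zid)


print(natural_language_from_code("en"))  # Z1002
```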
Consecutive Key IDs
The proposal aims to predict a logical order for the complete Type. As a result, the Keys will initially have gaps (e.g. K6 on Lexeme would initially be missing).
Instead we could provide the Keys in a consecutive order. That will later lead to inconsistencies in the order between Wikifunctions and Wikidata, though.
Functions only usable in compositions
Every Function that would use one of the new dereferencing functions would be only available in compositions, which may mean that they will potentially have trouble due to our current orchestration and evaluation performance.
(Wait, is this true? If we convert Lexemes, Forms, and Items into literals, why wouldn’t it be usable in code implementation? Even without reentrance? So if we get a Lexeme, we could just have objects representing that Lexeme. That could take us somewhere, even without reentrance?)
Comments
- Support as proposer. --DVrandecic (WMF) (talk) 01:10, 11 July 2024 (UTC)
- Support as this will allow us to deal with irregular forms. However, how will statements which have unsupported values be presented? -- ScienceD90 (talk) 01:43, 11 July 2024 (UTC)
- That's a good question; in practice, we may have to only return supported statement values for now. It's why it's a stretch goal. Jdforrester (WMF) (talk) 09:16, 11 July 2024 (UTC)
- I see there is no representation of statement ranks, not even as an out of scope key -- ScienceD90 (talk) 22:47, 11 July 2024 (UTC)
- Good point! Added. --DVrandecic (WMF) (talk) 21:42, 15 July 2024 (UTC)
- Support --Luca.favorido (talk) 06:06, 11 July 2024 (UTC)
- Support I'm not sure to understand very details and I'm sure we will need more (in particular "sense") but it's seems to be already a good start. Cheers, VIGNERON (talk) 06:43, 11 July 2024 (UTC)
Discussion
- Lexeme According to the Wikidata model, a Lexeme has a single Lemma and a single language, which implies ZlllK2 should be singular and Monolingual text (Z11). Arguably, if it’s monolingual, we don’t also need Natural language (Z60) or, if we have Z60, lemma can just be a String (Z6) or, indeed, its ZfffK1 (which allows for orthographic variation without departing far from the Wikidata model).
- It’s not yet clear to me how we determine the required Lexeme in the first place, but monolingual lemma to Lexeme list would be a start. In English, “word” can be a noun or a verb but Wikidata insists on a single lexical category per lexeme (ZlllK4), so a lexeme list per K4 or a list of K4s for a lemma would seem to be necessary. (That said, a lemma is just one of the lexeme’s forms, so we could just go from literal form to lexeme list(s).) --GrounderUK (talk) 11:22, 11 July 2024 (UTC)
- Just looking at https://www.wikidata.org/wiki/Lexeme:L1 shows multiple lemmas (lemmata?), one for sux-latn and one for sux-xsux, so regardless of what a model claims, the reality is that they have more than one lemma (or perhaps one should say, more than one orthography of a lemma?) per Lexeme. Jdforrester (WMF) (talk) 12:00, 11 July 2024 (UTC)
- @GrounderUK and Jdforrester (WMF): all lexemes have one and only one language but indeed some (60k) have them have more than one lemma (inside the same language lato sensu, but variation is important as we don't want to generate sentences that mix randomly different language stricto sensu : « organise an organization » would be weird) ; in languages with multiple writing system, it's almost all of them. And yes “word” is two different lexemes with a very different sets of forms : d:L:L3345 (a noun, with two forms: a singular and a plural) and d:L:L17039 (a verb, with 5 forms: present, past, etc.). In some extreme (and thankfully rare) cases, there is even two lexeme with the same triple of lemma/language/category (the verbs “ressortir” in French : d:L17373 and d:L691143). We need to take that into account for accessing the right lexeme. Cheers, VIGNERON (talk) 13:14, 11 July 2024 (UTC)
- But do you know of any case where the lemma is (or could usefully be) more than one of the lexeme’s forms? GrounderUK (talk) 13:32, 11 July 2024 (UTC)
- @GrounderUK: if I understand your question correctly, it's quite common to have several representations of form identical to the lemma (verbs in Romance languages or nouns in languages with declension comes to me mind right now as an obvious example). Cheers, VIGNERON (talk) 14:32, 11 July 2024 (UTC)
- The lemma shouldn't be for different forms, just for different representations of the lemma form. -- DVrandecic (WMF) (talk) 00:49, 16 July 2024 (UTC)
- If I have not misunderstood User:VIGNERON, he prefers different Breton representations of the same grammatical form to be different Lexeme forms. However, it seems to me that the Wikifunctions type should be capable of supporting all the different approaches that are (or might be) supported by Wikidata, as well as some variants that are not. It will then be possible to write functions that transform what is present in Wikidata into a new Lexeme object that is consistent with an alternative approach. For example, color/colour should not have three representations for its plural form since there are only two distinct representations and more than three English variants. Moreover, there are no irregular forms, so it might make more sense, in some context, to replace the recorded forms with a reference to the rule that is followed (which is ultimately a function). I’m guessing that Grammatical feature could distinguish between regular and irregular forms and Wikidata statement could reference the relevant function and (separately) the base Form whose representation(s) follow the rule (if this is not the lemma). It would be convenient if we could distinguish between those values that are represented on Wikidata and those that are not. GrounderUK (talk) 10:51, 16 July 2024 (UTC)
- That's a really good consideration. I was thinking of the Lexeme type to be a pretty much carbon-copy of the data in Wikidata, which is one reason why it is so tightly following the Wikidata data model. If we want to extend that, for example to keep track of whether a form is generated by a Function or whether it is given in Wikidata, that would be happening entirely on Wikifunctions' side. I.e. we would have another Type that represents that. Particularly because Lexemes have an identifier, it would be good to keep the Lexemes the same as they are in Wikidata.
- Or, to put it differently: one way could be to eventually have "English noun" as a Type. English noun can be constructed from a Lexeme, and then it would try to find all relevant forms and fit them in the right place. If it doesn't have certain forms, it might decide to add them through a function. "English noun" could also be constructed from a string (or two) and in that case it would also use the functions. But at this point, the English noun value would be one removed from the Lexeme. But it would be much easier for us to work with in the context of generating texts for English than a raw Lexeme is. It could also keep the information whether the forms are generated or whether they are given.
- Does this make sense? -- DVrandecic (WMF) (talk) 19:10, 16 July 2024 (UTC)
- It does make sense, yes. But then I would envisage non-Wikidata Lexeme types that would parallel the Wikidata Lexeme types, so that a raw Wikidata Lexeme could be transformed into a substitute that would (generally) be treated by functions as if it were a Wikidata lexeme. I imagine that would be the best way to handle alternative approaches adopted in Wikidata (and inconsistencies). GrounderUK (talk) 19:42, 16 July 2024 (UTC)
- Maybe. I think Wikidata Lexemes are potentially a bit heavyweight and generic, and maybe more focused Types could be easier to handle. But I think both approaches would be valid. -- DVrandecic (WMF) (talk) 23:38, 16 July 2024 (UTC)
- Yeah, there the singular lemma is the singular L1-F1 with multiple representations. I don’t know whether the lemma is always F1 but it should (by definition) be exactly one of the forms. GrounderUK (talk) 13:20, 11 July 2024 (UTC)
- Maybe. I can imagine in some languages the lemma actually not be any of the forms, but a more normalized version that appears in the lexicon but not in language. But it could also be that the lemma is always one of the forms (that, I think, would be the case for the languages I speak, as far as I can tell). -- DVrandecic (WMF) (talk) 00:50, 16 July 2024 (UTC)