Wikidata:Events/Data Modelling Days 2023/Conflations

✨---------------✨---------------✨---------------✨---------------✨---------------

Conflations and duplications
Camillo, user:Epìdosis


✨---------------✨---------------✨---------------✨---------------✨---------------


👥 Number of participants (including speakers): ...
18 (at 9:00 UTC)
27 (at 9:04 UTC, beginning of the talk)
30 (at 9:07 UTC)
32 (at 9:09 UTC)
35 (at 9:13 UTC)
37 (at 9:23 UTC)
39 (at 9:36 UTC)

🖊️ Notes & links

Epìdosis will present some data about conflations and duplications and facilitate a discussion about possible ways to mitigate this problem in its various aspects.

Paper for the 4th Wikidata Workshop : https://wikidataworkshop.github.io/2023/papers/6__novel_conflations_and_duplica[1].pdf

Slides: https://commons.wikimedia.org/wiki/File:DMD2023_-_Conflations_and_duplications.pdf

Introduction

Causes:
same or similar name (homonym persons, like John Smith)
may come from manual edits or from batches (run through tools like OpenRefine or by bots)

Stats:
3.6 M merges
4.1 M redirects

How to detect:
    * constraint violations of unique value and single value constraints
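As a rough illustration of the two checks named above, the following minimal sketch works on a toy in-memory mapping rather than on live Wikidata data. The item IDs, the external ID strings and both function names are hypothetical and chosen only for the example.

```python
from collections import defaultdict

# Hypothetical sample: item -> values of one external-ID property.
# The QIDs and ID strings are invented for illustration.
statements = {
    "Q1": ["id-100"],
    "Q2": ["id-100"],            # same ID as Q1
    "Q3": ["id-200", "id-201"],  # two IDs on one item
    "Q4": ["id-300"],
}


def unique_value_violations(statements):
    """Return {external_id: sorted items} for IDs used on more than one item."""
    items_by_id = defaultdict(set)
    for item, ids in statements.items():
        for ext_id in ids:
            items_by_id[ext_id].add(item)
    return {i: sorted(items) for i, items in items_by_id.items() if len(items) > 1}


def single_value_violations(statements):
    """Return {item: sorted ids} for items carrying more than one distinct ID."""
    return {
        item: sorted(set(ids))
        for item, ids in statements.items()
        if len(set(ids)) > 1
    }


print(unique_value_violations(statements))  # {'id-100': ['Q1', 'Q2']}
print(single_value_violations(statements))  # {'Q3': ['id-200', 'id-201']}
```

The first result lists candidates where the same ID sits on two items; the second lists items holding two IDs. Neither result says by itself whether the problem is a duplicate, a conflation, or an error in the external database, so each hit still needs manual review.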

Some of the constraint violations are hard to solve if you don't have editing rights in the external database that Wikidata connects to

Going through 6 different cases and how to correct each of them.

nikki says: when deprecating an id because it belongs to another item, there's https://www.wikidata.org/wiki/Property:P8327 (intended subject of deprecated statement) which you can use to point to the correct item
> Camillo: usually, if an ID is related to item B but it is also on item A, my suggestion is just deleting it from item A and leaving it only on item B; leaving it as deprecated on item A has some advantages (mainly preventing it from being readded), but also some drawbacks (e.g. some tools, like Mix'n'match, and some data reusers don't take ranks into account, so consider the deprecated ID just as a normal ID and will for example use it as an anchor to match other IDs, thus reinstating the conflation)
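The drawback Camillo describes can be sketched from the point of view of a data reuser. The example below assumes entities in a simplified form of the Wikibase JSON layout (claims, then property, then statements with "rank" and "mainsnak"); the property "P9999", the item IDs "Q_A" and "Q_B", the ID value and the helper names are all hypothetical. Only P8327 is taken from the discussion above.

```python
def statement(value, rank="normal", intended_subject=None):
    """Build a simplified statement dict (hypothetical helper)."""
    st = {
        "rank": rank,
        "mainsnak": {"snaktype": "value", "datavalue": {"value": value}},
    }
    if intended_subject is not None:
        # P8327 = "intended subject of deprecated statement"
        st["qualifiers"] = {
            "P8327": [
                {"snaktype": "value", "datavalue": {"value": {"id": intended_subject}}}
            ]
        }
    return st


# Item A keeps the misplaced ID as deprecated; item B holds it at normal rank.
entities = {
    "Q_A": {"claims": {"P9999": [statement("id-100", "deprecated", "Q_B")]}},
    "Q_B": {"claims": {"P9999": [statement("id-100")]}},
}


def build_index(entities, prop, skip_deprecated=True):
    """Map each external ID value to the items that carry it."""
    index = {}
    for qid, entity in entities.items():
        for st in entity.get("claims", {}).get(prop, []):
            if skip_deprecated and st.get("rank") == "deprecated":
                continue  # rank-aware reuse: ignore deprecated statements
            snak = st["mainsnak"]
            if snak.get("snaktype") != "value":
                continue  # skip "no value" / "unknown value" snaks
            index.setdefault(snak["datavalue"]["value"], []).append(qid)
    return index


print(build_index(entities, "P9999"))                         # {'id-100': ['Q_B']}
print(build_index(entities, "P9999", skip_deprecated=False))  # {'id-100': ['Q_A', 'Q_B']}
```

A reuser that ignores ranks gets the second index, in which the deprecated ID again links item A and item B; this is the situation in which a matching tool could reinstate the conflation.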

Conclusion:
- In case of the same ID in two items: you can and should solve the violation
- In case of two IDs in the same item: you should ignore the violation

3) Solutions: merges and splits

4) Issues:
mitigate causes: prevent low quality batches
improve detection: strengthen data round-tripping
improve solutions: create a gadget for splits (from the chat: gadget moveClaim)
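What a gadget for splits would have to do can be sketched on a toy data model. This is not the code of moveClaim or of any existing gadget: the entity layout, the property names and the function `split_item` are hypothetical, and the sketch moves whole properties, whereas a real tool would work statement by statement.

```python
import copy


def split_item(source, claims_to_move, sitelinks_to_move):
    """Return (remaining, new_item, todo) without modifying `source`.

    `todo` lists the term types that still need manual review, since only
    claims and sitelinks are moved here.
    """
    remaining = copy.deepcopy(source)
    new_item = {"labels": {}, "descriptions": {}, "aliases": {},
                "claims": {}, "sitelinks": {}}
    for prop in claims_to_move:
        if prop in remaining["claims"]:
            new_item["claims"][prop] = remaining["claims"].pop(prop)
    for site in sitelinks_to_move:
        if site in remaining["sitelinks"]:
            new_item["sitelinks"][site] = remaining["sitelinks"].pop(site)
    todo = [k for k in ("labels", "descriptions", "aliases") if remaining.get(k)]
    return remaining, new_item, todo


# Hypothetical conflated item mixing an organization and its building.
conflated = {
    "labels": {"en": "Example Library"},
    "descriptions": {"en": "library and building"},
    "aliases": {},
    "claims": {"prop_founder": ["Q_person"], "prop_architect": ["Q_architect"]},
    "sitelinks": {"enwiki": "Example Library", "dewiki": "Beispielbibliothek (Gebäude)"},
}

organization, building, todo = split_item(conflated, ["prop_architect"], ["dewiki"])
print(building["claims"])   # {'prop_architect': ['Q_architect']}
print(todo)                 # ['labels', 'descriptions']
```

Returning an explicit list of terms left to review is one way to make sure that no step of a split is forgotten.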

❓ Questions and discussions
Melderick says: In case 3, what's wrong with using qualifier Identifier Shared With ?
That's ok. ... You can also add a second identifier. This is a provisional solution. The best would be to not have the conflation. Therefore, we need the deprecated IDs. First you deprecate (with the reason "conflation" in a qualifier), but you can also add other qualifiers like "shared with"
Frédéric (chat): The moveClaim gadget is very useful for splitting an item. It's a tool to move or copy a statement from one entity to another.
Please explain moveClaim more
It's a gadget that you can activate on https://www.wikidata.org/wiki/Special:Preferences#mw-prefsection-gadgets. It's a tool to move (or copy) a claim from one entity to another. The moveClaim and moveSitelinks are useful when performing splits. However, there is no gadget for moving descriptions and aliases. moveClaim and moveSitelinks should both be integrated, they are good gadgets. We should integrate them in a wider gadget for splits. They are already ready. You just need to construct a new one specifically for splits.
Tamsin/DrThneed (chat): Part of detecting/preventing duplicates surely encompasses item quality...sometimes I cannot tell if the item I want to create is the same or different than an existing one because the existing one with the right label has no references and very few statements...
Tamsin: data quality is a part of the issue. Sometimes the Item isn't complete enough to tell what it really is about.
This is a frequent issue, especially when you are doing a big import. Sometimes you are not sure whether the item is the one you are looking for. I would suggest first spending time on what the item is. Try to understand the item. Improve such items so that it becomes clear what they are. If the item doesn't give enough clues about what it is, you can propose it for deletion. If there is no connected Wikipedia article and no links, only criterion 2 remains; if the item is not identifiable, it should be deleted. Of course you should create clearly identifiable items, otherwise it's a lot of effort for others to make them clear. (Tamsin: I'm grumbling because I resent spending time improving so many items that should have been better made to start with! Automatic descriptions would of course help; Nicolas is grumbling too :D especially for Lexemes)
Frédéric (user:Fjjulien): Sometimes conflation exists in the ontology itself. The most frequent cases of conflation at the class level are subclasses of both an organization and a building (or a facility). These cases can be very difficult to resolve, because some users actually want the conflation to exist and to remain. One particular case is "community centre" http://www.wikidata.org/entity/Q77115 In this case, most external IDs describe a building, but a given user considers efforts for clarifying the concept to be symptomatic of a narrow worldview.
Camillo: I haven't said this before: I usually work on persons, where things are very clear-cut. Buildings etc. are one of the most difficult cases for data modelling. Honestly, I often just do other things. In the time it takes to solve the conflation of a building or library, I can solve many other conflations.
Frédéric: but these conflations create big problems, notably for reusing the data outside of Wikidata. I wish there was a stronger stance to deal with these cases when there are users who have an ideology for this. This goes against the philosophy of Wikidata and can lead to edit wars.
Camillo: I think this falls into another talk I gave. If you find an issue, you have to discuss how to solve it and find an agreement, but sometimes there is no agreement. But this is the only way. This is the precondition for a solution. My other talk is about how to solve the issue once you have agreement.
Camillo: The longer you leave a class-level case of conflation, the more it becomes instantiated, so it is best to resolve class-level conflation quickly.
Nicolas Vigneron (chat): isn't the 3 items solution a good solution here? (like for Bonnie and Clyde)
Jan Ainali (chat): As a reuser through Govdirectory, we sometimes meet conflated public organizations and buildings, but then we always resolve them. And it becomes very clear when showing the usecase.
nikki (chat): I think what lydia (I think) was saying yesterday is relevant, trying to model things perfectly by never conflating things can make the data harder to use too, we have to find the right balance/granularity
Jan Ainali (chat): In my opinion, we have no choice but to split conflations of this kind and the one Frédéric mentions
Jan Ainali (chat): Balance yes. But not when it is clear that the items have different and confusing metadata.
nikki (chat): @jan: yeah, my point is that having to find a balance means there will be disagreements about what the right balance is 😃
Jan Ainali (chat): For example, recently I encountered a building being torn down, so someone put an end date on it. But the agency lived on somewhere else.
Michael Markert (ThULB Jena) (chat): I think this is one of the main reasons why library people (at least in Germany) sometimes are sceptical about Wikidata - in their authority files there is a strong consensus on splitting items if there is a conflation

Frédéric: The "Sitelinks to redirects" feature is a great means of dealing with conceptual conflation in Wikipedias: https://www.wikidata.org/wiki/Wikidata:Sitelinks_to_redirects


🎯 Key takeaways and outcomes
Seemingly, most duplications in recent years have been added by batches run through QuickStatements and OpenRefine. We need a new policy containing guidelines and standards regarding the quality of semi-automated batches.
There are abundant cases of duplicate IDs in external databases. A solution is to improve data round-tripping, at least for the biggest databases (e.g. national authority files; cf. phab:T312718 - https://phabricator.wikimedia.org/T312718)
The moveClaim and moveSitelink gadgets are very useful for splitting conflated items, but they cannot move descriptions. Proposal: create a new gadget helping users in performing splits; it should guide the user step by step, so that no step is forgotten, and it should speed up the whole process.
Sometimes, items are too incomplete to be disambiguated. If an item cannot be clearly identified, has no sitelink and no external link, it can be proposed for deletion.
Conflation of organization and building/facility is a common problem, notably in the GLAM sector.
Such conflation can be found at the class level and sometimes has its source in authority files. In these cases, it is preferable to seek community consensus before attempting a split. Sometimes, the 3 items solution (like for Bonnie and Clyde) is the best path forward.

☑️ Next steps
Lydia to create a ticket for making the splitting of Items easier, for example with a gadget
...