
Use RDF statement counts from entity data, not page props ( wikibase:identifiers, wikibase:statements and wikibase:sitelinks )
Closed, Resolved · Public · 8 Estimated Story Points

Description

There are a lot of items that have wikibase:statements set to 0 but actually do have statements (mostly one statement). This may be caused by the page props not being up to date when the update happens.

SELECT distinct ?item ?p WHERE {
  ?item wikibase:statements 0 .
  ?item ?p [] .
  FILTER(?p != rdfs:label && ?p != schema:description && ?p != schema:version 
         && ?p != schema:dateModified && ?p != skos:altLabel && ?p != wikibase:statements && ?p != wikibase:sitelinks)
}

This is due to the secondary data stored by Wikibase in page_props, which is used to generate the RDF data of Wikibase items, being incorrect at the point in time the RDF is generated.
page_props is updated asynchronously, so it might only get updated with the statement count after the RDF updater has already dumped the edited item.

There are also cases where page_props is not updated at all; those issues are not part of this task.

Acceptance criteria

  • statement count in the RDF output is based on the count at the point of generating the output, and not read from the secondary database table

See also: T149239: Ensure consistency of secondary data for external consumers

Notes:

  • The “statement count” is not as simple as counting the entity’s own statements – for lexemes, it includes the statements of all senses and forms. Make sure to use the same code that also generates the page prop, so that whatever mechanism WikibaseLexeme uses to count those statements is also used when generating the count for RDF.
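
For illustration, here is a minimal sketch (in Python, against the public JSON from Special:EntityData, not the PHP code path Wikibase actually uses for the page prop) of what counting from the entity data itself could look like, including the sense and form statements of lexemes:

import requests

# Hedged sketch: count an entity's statements from the JSON served by
# Special:EntityData, mirroring (not reusing) what the wb-claims page prop
# code does in PHP. For lexemes, the statements of all senses and forms
# are included, as described in the note above.
def count_statements(entity):
    def own(e):
        # "claims" for items/properties/lexemes, "statements" for some entity types
        groups = e.get("claims") or e.get("statements") or {}
        return sum(len(statements) for statements in groups.values())
    total = own(entity)
    for sub in entity.get("senses", []) + entity.get("forms", []):
        total += own(sub)
    return total

if __name__ == "__main__":
    data = requests.get("https://www.wikidata.org/wiki/Special:EntityData/Q42.json").json()
    entity = data["entities"]["Q42"]
    print(count_statements(entity), "statements,", len(entity.get("sitelinks", {})), "sitelinks")

The real implementation should of course reuse whatever code generates the page prop, so that Wikibase, WikibaseLexeme and WikibaseMediaInfo stay consistent; this only shows the shape of the computation.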

Related Objects

Status  Subtype  Assigned  Task
Resolved  Lucas_Werkmeister_WMDE
Resolved  Ottomata
Resolved  Ottomata
Declined  Ottomata
Resolved  Ottomata
Resolved  Ottomata
Resolved  Ottomata
Resolved  Ottomata
Resolved  Ottomata
Resolved  Ottomata
Resolved  Ottomata
Declined  None
Resolved  dcausse
Resolved  Ottomata
Resolved  Ottomata
Resolved  Ottomata
Resolved  Ottomata
Duplicate  Ottomata
Resolved  Ottomata
Declined  Ottomata

Event Timeline

There are a very large number of changes, so older changes are hidden.

If this is mainly regarding the WDQS & updater we could always make an API for it that packages everything nicely..?

Well, it's for everything that would use page props, but I know about the RDF part specifically; I don't know about others. What kind of API do you propose?

If I remember correctly, @aaron recently changed WikiPage to push secondary data updates to the job queue instead of merely deferring them to the post-send stage of the request. This greatly increases the risk of race conditions for tools that rely on recentchanges to detect edits, and then read secondary data from the API (or the database).

@Smalyshev, perhaps it makes sense to raise the issue on wikitech-l? It seems to me that it is an issue of general interest.

Or, if that is not possible, have a flag in recentchanges that indicates whether all secondary updates for an edit have been completed.

Instead of a flag, you could have an indexed sequence ID which is null while the updates are ongoing (or maybe a separate table which consists only of its own primary key and a foreign key to rc_id; at least in MySQL that might be the easiest way to simulate sequences), and then add an API option to sort and continue by that.
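
For illustration, a minimal sketch (Python; hypothetical consumer code, not any existing tool) of the racy pattern described above, where a tool detects an edit via recentchanges and immediately reads the Wikibase page props back from the API:

import requests

API = "https://www.wikidata.org/w/api.php"

with requests.Session() as s:
    rc = s.get(API, params={
        "action": "query", "list": "recentchanges",
        "rcnamespace": 0, "rclimit": 5,
        "rcprop": "title|ids|timestamp", "format": "json",
    }).json()["query"]["recentchanges"]

    for change in rc:
        props = s.get(API, params={
            "action": "query", "prop": "pageprops", "titles": change["title"],
            "ppprop": "wb-claims|wb-identifiers|wb-sitelinks", "format": "json",
        }).json()["query"]["pages"]
        # The values below may still describe the previous revision (or be
        # missing entirely) if the deferred/job-queue update has not run yet.
        for page in props.values():
            print(change["title"], change["revid"], page.get("pageprops"))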

A Wikipedia language version that was created last month has "wikibase:statements" missing: list (now 10, was up to 40). I don't think any of the items were newly created for that wiki.

Not really certain about this:

  • Sometimes, the triple with wikibase:statements seems to disappear after an edit is made.

Not sure if it's re-appearing later or if I get results from a different server ( sample).

Yes, it looks like the same problem: since the servers are updated independently, it could happen that when one server updates, the page props data is not there yet, but when the other one is updated, it is there already.

The number of results on the olowiki query also varies: 10 or 15 today.

Movies: statement count gives an idea of how many triples with wikibase:statements there are now. The gap is closing, also thanks to many edits in the field.

Personally, I think the new wikibase:statements triples are useful even if they are not always accurate.

Making them completely accurate seems to need some complex changes (T149239), so I think it would be good to make sure that we get at least basic coverage on all items, even if it may be 1 or 2 edits off.

Sample from https://www.wikidata.org/w/index.php?title=Q28465861&action=history

12:44, 21 January 2017‎  . . (1,605 bytes) (+422)‎ . . (‎Created claim: country (P17): 
12:33, 21 January 2017‎ . . (1,183 bytes) (+372)‎ . . (‎Created claim: Commons category (P373): 
12:33, 21 January 2017‎ . . (676 bytes) (+426)‎ . . (‎Created claim: instance of (P31):

WDQS: currently no wikibase:statements

Is there a way to have it show at least wikibase:statements 2 (the first two edits)?

Esc3300 renamed this task from Statement counts from pageprops do not match actual ones to Statement counts from pageprops do not match actual ones ( wikibase:statements and wikibase:sitelinks ). Jan 23 2017, 11:13 AM
Smalyshev changed the task status from Open to Stalled. Mar 30 2017, 1:32 AM
This comment was removed by daniel.
Lydia_Pintscher lowered the priority of this task from High to Medium. Apr 28 2017, 1:37 PM

Implementing this probably requires T161731; with it, we could use the (already existing) page prop change events for synchronizing.

Users will probably run into this far more often now due to the drop of wb_entity_per_page. Pages like https://www.wikidata.org/wiki/Wikidata:Database_reports/without_claims_by_site/nlwiki will feature false positives for example.

Looks like it's a Wikidata issue. E.g. see https://www.wikidata.org/wiki/Q1965957 - it has a statement with an identifier. However, the dump at https://www.wikidata.org/wiki/Special:EntityData/Q1965957.ttl?flavor=dump shows both wikibase:statements and wikibase:identifiers as zero. Checking the database I see:

>select * from page_props where pp_page = 1895030;
+---------+----------------+----------+------------+
| pp_page | pp_propname    | pp_value | pp_sortkey |
+---------+----------------+----------+------------+
| 1830394 | wb-claims      | 0        |          0 |
| 1830394 | wb-identifiers | 0        |          0 |
| 1830394 | wb-sitelinks   | 1        |          1 |
+---------+----------------+----------+------------+

So the data is not up to date in Wikidata's page_props table. Not sure why or how this can be fixed. Editing the item seems to fix it, so maybe there should be some bot that null-edits such items periodically?

So the data is not up to date in Wikidata's page_props table. Not sure why or how this can be fixed. Editing the item seems to fix it, so maybe there should be some bot that null-edits such items periodically?

Interesting. So https://www.wikidata.org/wiki/Special:EntityData/Q1965957.ttl?flavor=dump is somehow cached and not generated from live data? Where is it cached? How does this work?
If we did null edits, what would trigger the query engine to update the data?

@Multichill no, the TTL export matches the data. But the data (the page_props table) does not match the actual counts in the item. A null edit would trigger a recount of the page props, which would update the table.

But will a null edit also cause the WDQS updater to reload the entity?

Yes, if there's an RC record and revision number increase, it will reload.

Yes, if there's an RC record and revision number increase, it will reload.

I don't think a null edit will cause a new revision. So no RC record and no revision number increase.
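
For reference, a minimal sketch (Python) of the periodic-bot idea from above. It assumes that a purge with forcelinkupdate re-runs the same secondary data updates a null edit would, and, as just discussed, it creates no new revision, so WDQS would still not reload the item on its own:

import requests

API = "https://www.wikidata.org/w/api.php"
# Hypothetical input, e.g. item IDs collected with the SPARQL query from the
# task description.
STALE_TITLES = ["Q1965957"]

def refresh_page_props(titles):
    # Assumption: forcelinkupdate re-runs the links update, refreshing the
    # wb-claims / wb-identifiers / wb-sitelinks page props without saving a
    # new revision (so no RC entry and no WDQS reload).
    with requests.Session() as s:
        for title in titles:
            r = s.post(API, data={
                "action": "purge",
                "titles": title,
                "forcelinkupdate": 1,
                "format": "json",
            })
            r.raise_for_status()
            print(title, r.json())

if __name__ == "__main__":
    refresh_page_props(STALE_TITLES)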

This might be a stupid question, but… why are we getting those counts from the page props in the first place? Special:EntityData already has to load the full entity data, so I don’t think we’re saving much work by getting those numbers from the page_props table. And using page_props not only means that the RDF for the latest revision might have statement counters that are slightly out of date – it also means that the RDF for earlier revisions contains counters that are just totally unrelated to that revision. For instance:

$ curl -s https://www.wikidata.org/wiki/Special:EntityData/Q42.ttl?revision=112 | grep -A2 wikibase:statements
        wikibase:statements "248"^^xsd:integer ;
        wikibase:identifiers "183"^^xsd:integer ;
        wikibase:sitelinks "113"^^xsd:integer .

If you look at Q42?oldid=112, you can see that it did not actually contain over two hundred statements – which is understandable, since that revision predates support for statements on Wikidata by several months. (Maybe this mismatch was already reported somewhere, but I didn’t find any relevant Phabricator task other than this one.)
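
A quick way to confirm that from the entity data itself (assuming Special:EntityData's JSON output accepts the same revision parameter as the Turtle output quoted above):

import requests

# Count the statements that revision 112 of Q42 actually contains.
url = "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"
data = requests.get(url, params={"revision": 112}).json()
claims = data["entities"]["Q42"].get("claims", {})
print(sum(len(statements) for statements in claims.values()))
# Expected: 0, since that revision predates statement support on Wikidata.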

Sounds like the question wasn’t stupid after all (or at least, nobody said so in three weeks), so unstalling. IMHO there’s a clear path forward: stop using the page props in the RDF export, count statements/identifiers/sitelinks directly. (Though that still leaves the issue that apparently, sometimes the page props in the database don’t get updated even after a long time.)

I looked around in old bugs and found T129046. I think it went like this:

  • Several Wikibase page_props were added for some unknown reason (you would have to look in the code where these are actually used now)
  • I wanted to track items without statements (T129037) and easiest was to use the page_props (T129046)
  • Some other page_props were added to the RDF later
WMDE-leszek set the point value for this task to 8.
noarave renamed this task from Statement counts from pageprops do not match actual ones ( wikibase:statements and wikibase:sitelinks ) to Use RDF statement counts from entity data, not page props ( wikibase:identifiers, wikibase:statements and wikibase:sitelinks ). Nov 3 2020, 2:55 PM

Change 641202 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Calculate page props on-the-fly during RDF dump

https://gerrit.wikimedia.org/r/641202

Change 641209 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseLexeme@master] tests: Inject EntityContentFactory into RdfBuilder

https://gerrit.wikimedia.org/r/641209

Change 641210 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/WikibaseMediaInfo@master] tests: Inject EntityContentFactory into RdfBuilder

https://gerrit.wikimedia.org/r/641210

Change 641209 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] tests: Inject EntityContentFactory into RdfBuilder

https://gerrit.wikimedia.org/r/641209

Change 641210 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] tests: Inject EntityContentFactory into RdfBuilder

https://gerrit.wikimedia.org/r/641210

Change 641984 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Rename page props-related identifiers for clarity

https://gerrit.wikimedia.org/r/641984

Change 641985 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@master] Remove unneeded array_intersect_key()

https://gerrit.wikimedia.org/r/641985

Change 641202 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Calculate page props on-the-fly during RDF dump

https://gerrit.wikimedia.org/r/641202

Change 641984 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Rename page props-related identifiers for clarity

https://gerrit.wikimedia.org/r/641984

Change 641985 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Remove unneeded array_intersect_key()

https://gerrit.wikimedia.org/r/641985

I wonder if we should backport at least the main change to wmf.18? Otherwise, I believe it will only start showing up in the full RDF dumps of 7 December (since next week is a no-deploy week, and so the dumps of 30 November will still be on wmf.18, if I’m not mistaken).

I wonder if we should backport at least the main change to wmf.18? Otherwise, I believe it will only start showing up in the full RDF dumps of 7 December (since next week is a no-deploy week, and so the dumps of 30 November will still be on wmf.18, if I’m not mistaken).

Indeed, if possible it'd be great to backport.

Change 642103 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[mediawiki/extensions/Wikibase@wmf/1.36.0-wmf.18] Calculate page props on-the-fly during RDF dump

https://gerrit.wikimedia.org/r/642103

Alright, scheduled for Monday’s EU backport+config window – if I read the cron config correctly, the full RDF dumps start on Monday night (23:00 – UTC, I assume?), so if all goes well, next week’s dumps should already have this change.

Edit: Well, that’s assuming Wikidata (group1) is back to wmf.18 by then (currently on wmf.16).

Alright, scheduled for Monday’s EU backport+config window – if I read the cron config correctly, the full RDF dumps start on Monday night (23:00 – UTC, I assume?), so if all goes well, next week’s dumps should already have this change.

Yes, absolutely. If we can backport on Monday, and assuming the train rolls forward to group1 before 23:00 UTC next Monday, we should have the new dumps next week :)

Change 642103 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@wmf/1.36.0-wmf.18] Calculate page props on-the-fly during RDF dump

https://gerrit.wikimedia.org/r/642103

Mentioned in SAL (#wikimedia-operations) [2020-11-23T13:11:02Z] <lucaswerkmeister-wmde@deploy1001> Synchronized php-1.36.0-wmf.18/extensions/Wikibase: Backport: [[gerrit:642103|Calculate page props on-the-fly during RDF dump (T145712)]] (duration: 01m 14s)

Backport tested on test.wikidata.org – www.wikidata.org isn’t on wmf.18 yet but hopefully will be by the time the dumps start.