Page MenuHomePhabricator

Consider ways to invalidate FormatterCache when things go wrong
Closed, ResolvedPublic

Description

In T252079#6117724 it was identified that we had an incident that was exacerbated by the fact that our formatter cache caches negative results for the full 24hour TTL.
In this incident LUA calls incorrectly always returning no value when checking for terms, and this value was cached for 24 hours.
A rollback of the train happening in the hours after the incident, however the cache continued to have bad data until the next day.

In a comment on that ticket I speculated about some possible solutions:

  • a way to force this cache to be updated when performing a page purge (similar to force links update?)
  • a way to ditch all of the cache keys in this cache (probably just incrementing some value in the cache key)

Another possible area for investigation would be, is a 24h TTL actually needed for this cache?
Currently the TTL used is the same generic TTL for the "shared cache" used for entity storage.
We could experiment with this value as a much shorter value would result in a short fallout if something like this were to happen again.
If we can't improve the situation then perhaps we should think about one of the other ideas in this ticket?

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 630173 had a related patch set uploaded (by Tobias Andersson; owner: Tobias Andersson):
[mediawiki/extensions/Wikibase@master] Move the duplicated code into lib and use it as a factory on client and repo.

https://gerrit.wikimedia.org/r/630173

Change 630188 had a related patch set uploaded (by Tobias Andersson; owner: Tobias Andersson):
[mediawiki/extensions/Wikibase@master] Add formatter cache version and a script to manage this from repo.

https://gerrit.wikimedia.org/r/630188

@toan and I had a chat about this ticket this morning and summarized we said that:

  • For formatter cache version patches need a few tweaks but all look like the right direction
    • We have never tried "dumping" the whole formatter cache at once before, but predict that it should just result in a small spike in slowness (and more db load) as the first spike of repopulation gets dealt with and then a long small tail
    • One thing that came up was that using WANCache inside the formatter cache would probably result in a slightly better internal situation when and if this were to happen.
    • It will be very hard to figure out what any of this would look like in advance of trying it.
    • One thing that could be done to make possible usage less interesting would be some sort of % based rollout of the new version.

Regarding the idea of allowing action=purge to purge the formatter cache:

  • We discussed this on the client side
    • The usage tracking mechanisms don't have an index on page_id, so getting from the page to the entities used would not be ideal.
    • We also doubted the value in this approach given the main issue is when incident happen, and in this case the formatter cache version would need to be incremented instead.
  • We also discussed this on the repo side
    • This would be easier, but less useful.
    • Purging the formatter cache for Q64 when doing action=purge on Q64 is odd, as that cache is not actually used in the page rendering at all.

Perhaps we could look at the above ideas further if we ever need more solutions in this area, but for now not having them is probably fine, as we have a version key in the cache key now.

Change 630173 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Move the duplicated code into lib and use it as a factory on client and repo.

https://gerrit.wikimedia.org/r/630173

toan subscribed.

@toan and I had a chat about this ticket this morning and summarized we said that:

  • For formatter cache version patches need a few tweaks but all look like the right direction
    • We have never tried "dumping" the whole formatter cache at once before, but predict that it should just result in a small spike in slowness (and more db load) as the first spike of repopulation gets dealt with and then a long small tail
    • One thing that came up was that using WANCache inside the formatter cache would probably result in a slightly better internal situation when and if this were to happen.
    • It will be very hard to figure out what any of this would look like in advance of trying it.
    • One thing that could be done to make possible usage less interesting would be some sort of % based rollout of the new version.

In his analysis of Wikidata S8 database load in T246415#6474756 @Ladsgroup wrote:

56% of the total load is term store (and it's heavily cached, this is actually 1% of the actual amount of read that reaches to the database, this is scary).

Considering that insight, throwing the entire cache for Terms away in one go would seem to be a very bold move. Can we make that "% based rollout of the new version" with our current setup and config, or are code change needed to make this possible?

At the very least, the patch that introduces the new formatter version (and possibly invalidates the current keys in the process?) should probably be flagged as a risky patch to watch out for during the next deployment.

At the very least, the patch that introduces the new formatter version (and possibly invalidates the current keys in the process?) should probably be flagged as a risky patch to watch out for during the next deployment.

The way the change is currently written, it shouldn’t invalidate the current cache contents (though we can still flag it as risky).

Change 635532 had a related patch set uploaded (by Pablo Grass (WMDE); owner: Pablo Grass (WMDE)):
[mediawiki/extensions/Wikibase@master] FormatterCacheFactory: add version to cache key(s)

https://gerrit.wikimedia.org/r/635532

Change 630188 abandoned by Tobias Andersson:
[mediawiki/extensions/Wikibase@master] Add formatterCacheVersion

Reason:
https://phabricator.wikimedia.org/T265983

https://gerrit.wikimedia.org/r/630188