Consider ways to invalidate FormatterCache when things go wrong
Closed, ResolvedPublic

Description

In T252079#6117724 it was identified that an incident was exacerbated by the fact that our formatter cache caches negative results for the full 24-hour TTL.
In this incident, Lua calls incorrectly always returned no value when checking for terms, and this value was cached for 24 hours.
A rollback of the train happened in the hours after the incident; however, the cache continued to hold bad data until the next day.

In a comment on that ticket I speculated about some possible solutions:

In T252079#6117724, @Addshore wrote:

a way to force this cache to be updated when performing a page purge (similar to force links update?)

a way to ditch all of the cache keys in this cache (probably just incrementing some value in the cache key)

Another possible area for investigation would be whether a 24h TTL is actually needed for this cache.
Currently the TTL used is the same generic TTL as for the "shared cache" used for entity storage.
We could experiment with this value, as a much shorter TTL would result in a shorter fallout if something like this were to happen again.
If we can't improve the situation that way, then perhaps we should think about one of the other ideas in this ticket?
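
To illustrate the TTL question, here is a minimal sketch of a cache that keeps the long TTL for real values but applies a much shorter one to negative (empty) results. It is plain Python rather than Wikibase code; the class name `TtlCache`, the `negative_ttl` parameter and the default values are assumptions made for illustration and do not exist in the actual formatter cache.

```python
import time


class TtlCache:
    """Tiny in-memory cache with a separate, shorter TTL for negative results.

    Hypothetical illustration only; not the Wikibase FormatterCache.
    """

    def __init__(self, ttl=24 * 3600, negative_ttl=300, clock=time.monotonic):
        self._ttl = ttl                    # lifetime of real values, in seconds
        self._negative_ttl = negative_ttl  # lifetime of "no value" results
        self._clock = clock                # injectable so tests can control time
        self._store = {}                   # key -> (value, expires_at)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            # Still fresh: serve from cache, even if the cached value is None.
            return entry[0]
        value = compute()
        # A None result is treated as negative and expires much sooner.
        ttl = self._negative_ttl if value is None else self._ttl
        self._store[key] = (value, now + ttl)
        return value
```

With something like this, a wrongly cached "no label" result would be retried after `negative_ttl` seconds instead of after a full day, at the cost of more lookups for terms that are legitimately missing.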

Details

	Subject	Repo	Branch	Lines +/-
	Add formatterCacheVersion	mediawiki/extensions/Wikibase	master	+126 -50
	Move the duplicated code into lib and use it as a factory on client and repo.	mediawiki/extensions/Wikibase	master	+205 -70


Related Objects

Status	Assigned	Task
Resolved	WMDE-leszek	T252595 Consider ways to invalidate FormatterCache when things go wrong
Resolved	• Pablo-WMDE	T265983 Make shared caching service (ObjectCache) injectable into FormatterCacheFactory
Resolved	• Pablo-WMDE	T265984 Create first exemplary services in WikibaseRepo
Resolved	• Pablo-WMDE	T266215 Implement the FormatterCache versioning

Event Timeline

Addshore created this task. May 12 2020, 8:12 PM

Restricted Application added a project: Wikidata. May 12 2020, 8:12 PM

Restricted Application added a subscriber: Aklapper.

Addshore mentioned this in T252079: mw.wikibase.getLabelByLang('Q1','en') returning nil today. May 12 2020, 8:12 PM

Addshore added a project: Sustainability.

Addshore moved this task from Tag to Incident Followup on the Sustainability board. May 12 2020, 8:21 PM

Addshore edited projects, added Sustainability (Incident Followup); removed Sustainability.

Addshore moved this task from Incoming to Needs Tech Work on the Wikidata-Campsite board. Jun 17 2020, 2:44 PM

Addshore edited projects, added Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)); removed Wikidata-Campsite. Sep 21 2020, 10:43 AM

Addshore edited projects, added Wikidata-Campsite; removed Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)).

Addshore moved this task from Needs Tech Work to Prioritized Wikidata Tech Backlog (prioritised from top to bottom) on the Wikidata-Campsite board. Sep 21 2020, 10:46 AM

WMDE-leszek edited projects, added Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)); removed Wikidata-Campsite. Sep 22 2020, 12:39 PM

• toan claimed this task. Sep 25 2020, 8:45 AM

• toan moved this task from To Do (prioritised from top to bottom) to Doing on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board.

Change 630173 had a related patch set uploaded (by Tobias Andersson; owner: Tobias Andersson):
[mediawiki/extensions/Wikibase@master] Move the duplicated code into lib and use it as a factory on client and repo.

https://gerrit.wikimedia.org/r/630173

gerritbot added a project: Patch-For-Review. Sep 25 2020, 1:21 PM

Change 630188 had a related patch set uploaded (by Tobias Andersson; owner: Tobias Andersson):
[mediawiki/extensions/Wikibase@master] Add formatter cache version and a script to manage this from repo.

https://gerrit.wikimedia.org/r/630188

@toan and I had a chat about this ticket this morning. In summary:

The formatter cache version patches need a few tweaks, but all look like the right direction.
- We have never tried "dumping" the whole formatter cache at once before, but predict that it should just result in a small spike in slowness (and more DB load) as the first wave of repopulation is dealt with, followed by a long, small tail.
- One thing that came up was that using WANCache inside the formatter cache would probably result in a slightly better internal situation if and when this were to happen.
- It will be very hard to figure out what any of this would look like in advance of trying it.
- One thing that could be done to make a possible use of this less "interesting" (that is, less eventful) would be some sort of percentage-based rollout of the new version.

Regarding the idea of allowing action=purge to purge the formatter cache:

We discussed this on the client side:
- The usage tracking mechanisms don't have an index on page_id, so getting from the page to the entities used would not be ideal.
- We also doubted the value of this approach, given that the main issue is when incidents happen, and in that case the formatter cache version would need to be incremented instead.
We also discussed this on the repo side:
- This would be easier, but less useful.
- Purging the formatter cache for Q64 when doing action=purge on Q64 is odd, as that cache is not actually used in the page rendering at all.

Perhaps we could look at the above ideas further if we ever need more solutions in this area, but for now not having them is probably fine, as we have a version key in the cache key now.
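
As a rough illustration of why a version in the cache key is enough to "ditch" the whole cache, here is a minimal Python sketch. It is not the Wikibase implementation: the class `VersionedFormatterCache`, the key layout `formatterCache:<version>:<entity>:<language>` and the dict-like backend are all assumptions made for the example.

```python
MISS = object()  # sentinel: distinguishes "not cached" from a cached None


class VersionedFormatterCache:
    """Hypothetical wrapper that embeds a version number in every cache key."""

    def __init__(self, backend, version):
        self._backend = backend  # any dict-like store shared by all versions
        self._version = version  # bump this to abandon all existing entries

    def _key(self, entity_id, language):
        # The key layout is an assumption; only the embedded version matters.
        return f"formatterCache:{self._version}:{entity_id}:{language}"

    def get(self, entity_id, language):
        return self._backend.get(self._key(entity_id, language), MISS)

    def set(self, entity_id, language, value):
        self._backend[self._key(entity_id, language)] = value


backend = {}
old = VersionedFormatterCache(backend, version=1)
old.set("Q1", "en", None)          # a bad negative result cached during an incident

new = VersionedFormatterCache(backend, version=2)
print(new.get("Q1", "en") is MISS)  # the bad entry is unreachable under version 2
```

Nothing is deleted in this scheme: entries written under the old version simply stop being read and disappear once their TTL runs out, while every lookup under the new version starts cold.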

• toan moved this task from Doing to Peer Review on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board. Sep 29 2020, 8:11 AM

Addshore mentioned this in T263999: Some lua-calls with language specified does not end up in formatterCache. Sep 29 2020, 11:29 AM

Change 630173 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Move the duplicated code into lib and use it as a factory on client and repo.

https://gerrit.wikimedia.org/r/630173

ReleaseTaggerBot added a project: MW-1.36-notes (1.36.0-wmf.12; 2020-10-05; NEVER DEPLOYED). Sep 30 2020, 3:00 PM

• toan removed • toan as the assignee of this task. Oct 2 2020, 9:22 AM

• toan subscribed.

• toan claimed this task. Oct 5 2020, 11:41 AM

• Pablo-WMDE claimed this task. Oct 5 2020, 12:31 PM

• Pablo-WMDE moved this task from Peer Review to Doing on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board.

• Pablo-WMDE reassigned this task from • Pablo-WMDE to • toan. Oct 9 2020, 4:05 PM

• Pablo-WMDE subscribed.

• toan moved this task from Doing to Peer Review on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board. Oct 12 2020, 2:18 PM

In T252595#6498006, @Addshore wrote:

@toan and I had a chat about this ticket this morning. In summary:

The formatter cache version patches need a few tweaks, but all look like the right direction.

We have never tried "dumping" the whole formatter cache at once before, but predict that it should just result in a small spike in slowness (and more DB load) as the first wave of repopulation is dealt with, followed by a long, small tail.

One thing that came up was that using WANCache inside the formatter cache would probably result in a slightly better internal situation if and when this were to happen.

It will be very hard to figure out what any of this would look like in advance of trying it.

One thing that could be done to make a possible use of this less "interesting" (that is, less eventful) would be some sort of percentage-based rollout of the new version.

In his analysis of Wikidata S8 database load in T246415#6474756 @Ladsgroup wrote:

56% of the total load is term store (and it's heavily cached, this is actually 1% of the actual amount of reads that reach the database, this is scary).

Considering that insight, throwing the entire cache for Terms away in one go would seem to be a very bold move. Can we do that "% based rollout of the new version" with our current setup and config, or are code changes needed to make this possible?

At the very least, the patch that introduces the new formatter version (and possibly invalidates the current keys in the process?) should probably be flagged as a risky patch to watch out for during the next deployment.
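
For reference, one way a percentage-based rollout of a new cache version could work is to bucket each cache key deterministically and let only a configurable share of keys use the new version. The following is a sketch under that assumption, written in Python; the function `pick_version` and its parameters are hypothetical and not part of Wikibase or its configuration.

```python
import hashlib


def pick_version(key, old_version, new_version, percent):
    """Return the cache version to use for `key` during a gradual rollout.

    Each key is mapped to a stable bucket in 0..99 via a hash of the key,
    so a given key always lands on the same side for a given percentage.
    Hypothetical helper for illustration only.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return new_version if bucket < percent else old_version


# At 10%, roughly one key in ten misses the old cache and is recomputed.
keys = [f"Q{i}:en" for i in range(10000)]
moved = sum(1 for k in keys if pick_version(k, 1, 2, 10) == 2)
print(f"{moved} of {len(keys)} keys use the new version")
```

Because the bucket depends only on the key, raising the percentage step by step only ever moves additional keys to the new version, which would spread the repopulation load over several steps instead of one spike.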

• toan moved this task from Peer Review to Doing on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board. Oct 13 2020, 10:00 AM

• toan moved this task from Doing to Peer Review on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board. Oct 13 2020, 10:51 AM

In T252595#6536854, @Michael wrote:

At the very least, the patch that introduces the new formatter version (and possibly invalidates the current keys in the process?) should probably be flagged as a risky patch to watch out for during the next deployment.

The way the change is currently written, it shouldn’t invalidate the current cache contents (though we can still flag it as risky).

• Pablo-WMDE moved this task from Peer Review to To Do (prioritised from top to bottom) on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board. Oct 20 2020, 9:38 AM

• toan removed • toan as the assignee of this task. Oct 20 2020, 9:38 AM

We split this into subtasks to ease discussing the challenges in isolation.

Change 635532 had a related patch set uploaded (by Pablo Grass (WMDE); owner: Pablo Grass (WMDE)):
[mediawiki/extensions/Wikibase@master] FormatterCacheFactory: add version to cache key(s)

https://gerrit.wikimedia.org/r/635532

Lucas_Werkmeister_WMDE added a project: Story. Oct 21 2020, 2:41 PM

Change 630188 abandoned by Tobias Andersson:
[mediawiki/extensions/Wikibase@master] Add formatterCacheVersion

Reason:
https://phabricator.wikimedia.org/T265983

https://gerrit.wikimedia.org/r/630188

WMDE-leszek closed subtask T265983: Make shared caching service (ObjectCache) injectable into FormatterCacheFactory as Resolved. Oct 23 2020, 5:52 AM

WMDE-leszek closed subtask T266215: Implement the FormatterCache versioning as Resolved. Oct 23 2020, 10:03 AM

• Pablo-WMDE moved this task from To Do (prioritised from top to bottom) to Doing on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board. Oct 23 2020, 2:50 PM

• Pablo-WMDE moved this task from Doing to To Do (prioritised from top to bottom) on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board.

noarave moved this task from To Do (prioritised from top to bottom) to Doing on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board. Nov 10 2020, 3:12 PM

Maintenance_bot moved this task from incoming to in progress on the Wikidata board. Nov 10 2020, 3:15 PM

ItamarWMDE moved this task from Doing to Parent tasks on the Wikidata-Campsite (Wikidata-Campsite-Iteration-∞ (On Hold)) board. Nov 12 2020, 10:53 AM