Job queue corruption after codfw switch over (Queue growth, duplicate runs)
Closed, Resolved · Public

Description

Watchlist entries are duplicated, often several times, after maintenance. Still happening even after a refresh, etc. See here: https://usercontent.irccloud-cdn.com/file/TFhgFYVM/

From T163418

This is an exact duplicate of what we saw last year, T129517: The refreshLinks jobs enqueue rate is 10 times the normal rate
refreshLinks jobs that are recursive keep feeding the queue with more and more jobs, and the jobrunners are processing them at unprecedented rates.

Event Timeline

Related: T158004: Release RC Page filtering to non-ORES wikis is scheduled for 14:00 UTC this Monday, which will deploy ORES (by default, not as a Beta Feature) to almost all of the wikis.

Not to almost all wikis, only to English Wikipedia. It will enable the RCFilters beta feature on almost all wikis, but as the title of that task says, they're all non-ORES wikis with the exception of enwiki.

On April 12 a (supposedly harmless) test was performed in codfw using the switchdc script. It's probably not a coincidence then that the jobs we saw running unexpectedly today relate to events from that day. Presumably something happened there that led to jobs being set to some state somewhere and being triggered a second time today.

I just thoroughly reviewed the switchdc logs from April 12 and no action was taken on eqiad's redises or jobrunners. So I think this is a red herring.

That's exactly why I think it isn't a red herring. It ran on codfw, where the jobqueue redis is a replica of eqiad's. It's conceivable to me that these tests started something that caused jobs after that point to be marked in a certain way (in codfw). Then on April 19, the minute the step "Stop MediaWiki maintenance in the old master DC" was run, this marking stopped - presumably whatever state was left behind was now normalised. Then a few minutes later, when the job queue was enabled in codfw, all jobs previously "marked" by codfw started running again (for the first time, from codfw's perspective).

What is the correct status for this task? What will be the impact of deploying ORES by default on English Wikipedia (T158004)?

If there is any chance to have ORES predictions creating duplicates on all watchlists, we will postpone the deployment. Please provide an answer before 17:00 UTC today.

As I said in T163337#3198009, all the new duplicates are cleaned up now and there is no way anyone can see a duplicate entry in their watchlist. But I can't guarantee it won't happen again during the switchover on May 3rd, because there might be a problem with the jobrunner again (in that case we can run the cleaner again, but we still need to be careful).

Thanks @Ladsgroup. I'll keep a note about the possible future duplicates.

On April 12 a (supposedly harmless) test was performed in codfw using the switchdc script. It's probably not a coincidence then that the jobs we saw running unexpectedly today relate to events from that day. Presumably something happened there that led to jobs being set to some state somewhere and being triggered a second time today.

I just thoroughly reviewed the switchdc logs from April 12 and no action was taken on eqiad's redises or jobrunners. So I think this is a red herring.

That's exactly why I think it isn't a red herring. It ran on codfw, where the jobqueue redis is a replica of eqiad's. It's conceivable to me that these tests started something that caused jobs after that point to be marked in a certain way (in codfw). Then on April 19, the minute the step "Stop MediaWiki maintenance in the old master DC" was run, this marking stopped - presumably whatever state was left behind was now normalised.

@aaron Looked into this a bit and found that replication is very badly broken. Never mind that it "fixed itself" a minute before the switch over, it's just that we're not seeing the problem right now because we only run jobs in codfw and the job runners only read from the current Redis master.

Even now the codfw master and eqiad master don't match each other at all: eqiad consistently shows much higher job counts, among various other inconsistencies.

codfw
[17:08 UTC] [email protected]:~$ mwscript showJobs.php enwiki --group
categoryMembershipChange: 1 queued; 83 claimed (4 active, 79 abandoned); 0 delayed
cdnPurge: 2 queued; 0 claimed (0 active, 0 abandoned); 42 delayed
cirrusSearchCheckerJob: 3 queued; 0 claimed (0 active, 0 abandoned); 1335 delayed
cirrusSearchIncomingLinkCount: 7 queued; 1336 claimed (52 active, 1284 abandoned); 323 delayed
cirrusSearchLinksUpdatePrioritized: 4 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
cirrusSearchOtherIndex: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
deleteLinks: 0 queued; 200 claimed (31 active, 169 abandoned); 0 delayed
enqueue: 10 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
htmlCacheUpdate: 37025 queued; 5 claimed (2 active, 3 abandoned); 0 delayed
LocalRenameUserJob: 0 queued; 1 claimed (0 active, 1 abandoned); 0 delayed
ORESFetchScoreJob: 15 queued; 262 claimed (22 active, 240 abandoned); 0 delayed
recentChangesUpdate: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
refreshLinks: 17653 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
refreshLinksPrioritized: 0 queued; 29 claimed (29 active, 0 abandoned); 0 delayed
ThumbnailRender: 0 queued; 8 claimed (0 active, 8 abandoned); 0 delayed
eqiad
-[17:08 UTC] [email protected]:~$ mwscript showJobs.php enwiki --group
+[17:08 UTC] [email protected]:~$ mwscript showJobs.php enwiki --group
-categoryMembershipChange: 1 queued; 83 claimed (4 active, 79 abandoned); 0 delayed
-cdnPurge: 2 queued; 0 claimed (0 active, 0 abandoned); 42 delayed
+cdnPurge: 35252 queued; 110041 claimed (1974 active, 108067 abandoned); 49 delayed
-cirrusSearchCheckerJob: 3 queued; 0 claimed (0 active, 0 abandoned); 1335 delayed
+cirrusSearchCheckerJob: 1588 queued; 8968 claimed (223 active, 8745 abandoned); 1123 delayed
-cirrusSearchIncomingLinkCount: 7 queued; 1336 claimed (52 active, 1284 abandoned); 323 delayed
+cirrusSearchIncomingLinkCount: 4573 queued; 122407 claimed (2894 active, 119513 abandoned); 289 delayed
-cirrusSearchLinksUpdatePrioritized: 4 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
+cirrusSearchLinksUpdatePrioritized: 7 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
+categoryMembershipChange: 3083 queued; 44236 claimed (1363 active, 42873 abandoned); 0 delayed
 cirrusSearchOtherIndex: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
-deleteLinks: 0 queued; 200 claimed (31 active, 169 abandoned); 0 delayed
-enqueue: 10 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
+enqueue: 471218 queued; 2672194 claimed (89895 active, 2582299 abandoned); 0 delayed
-htmlCacheUpdate: 37025 queued; 5 claimed (2 active, 3 abandoned); 0 delayed
+htmlCacheUpdate: 73967 queued; 19454 claimed (4331 active, 15123 abandoned); 0 delayed
+deleteLinks: 502 queued; 826 claimed (36 active, 790 abandoned); 0 delayed
+EchoNotificationDeleteJob: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
 LocalRenameUserJob: 0 queued; 1 claimed (0 active, 1 abandoned); 0 delayed
-ORESFetchScoreJob: 15 queued; 262 claimed (22 active, 240 abandoned); 0 delayed
+ORESFetchScoreJob: 76683 queued; 78689 claimed (4587 active, 74102 abandoned); 0 delayed
 recentChangesUpdate: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
-refreshLinks: 17653 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
+refreshLinks: 443999 queued; 289844 claimed (13768 active, 276076 abandoned); 0 delayed
-refreshLinksPrioritized: 0 queued; 29 claimed (29 active, 0 abandoned); 0 delayed
+refreshLinksPrioritized: 4186 queued; 22266 claimed (1074 active, 21192 abandoned); 0 delayed
 ThumbnailRender: 0 queued; 8 claimed (0 active, 8 abandoned); 0 delayed
+wikibase-addUsagesForPage: 7 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

Ideas for next steps:

  • Figure out if intra-DC replication is equally broken.
  • Better understand how Redis replication works and how this can diverge without triggering any errors or alerts.
  • Try re-creating one of the slaves from the master (how?) and monitor whether they diverge again?
  • Make it work?

Quickly checked the replication status and the number of DB keys stored on rdb2003 / rdb1003 / rdb2004:

elukey@rdb2003:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} INFO | grep :keys; done
6378
db0:keys=4705607,expires=0,avg_ttl=0
6379
db0:keys=5861448,expires=1156899,avg_ttl=2219206300
6380
db0:keys=5860767,expires=1155979,avg_ttl=2188224808
6381
db0:keys=5770790,expires=1157860,avg_ttl=2174571917

elukey@rdb1003:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} INFO | grep :keys; done
6378
db0:keys=4705607,expires=0,avg_ttl=0
6379
db0:keys=5861621,expires=1156973,avg_ttl=1224536465
6380
db0:keys=4713560,expires=0,avg_ttl=1127781794
6381
db0:keys=4620926,expires=0,avg_ttl=1154068780


elukey@rdb2004:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} INFO | grep :keys; done
6378
db0:keys=4705607,expires=0,avg_ttl=0
6379
db0:keys=4706387,expires=0,avg_ttl=0
6380
db0:keys=4713723,expires=0,avg_ttl=0
6381
db0:keys=5771093,expires=1158035,avg_ttl=0

I don't see anything weird ongoing from the Redis replication point of view (except rdb2004's TTLs, but that could be a red herring); maybe terbium's view of the queues is not correct?

2004's ttl is perfectly normal if a redis instance has been a slave since startup.

I will try today to do a dump of redis data for one instance across DCs and intra-DC, so that we can settle once and for all if the difference we see on jobs between terbium and wasat can be attributed to a difference in data stores or to some misconfiguration.

Ok so, I started using a (slightly modified) version of the script presented here:

https://redislabs.com/blog/fast-and-efficient-parallelized-comparison-of-redis-databases/

to compare our redis databases. A first couple of test runs found no significant difference between master and replica in the same datacenter. I will update the ticket with my findings.
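
For the curious, the comparison boils down to something like the following rough sketch with plain redis-cli (not the actual parallelized script linked above; host names are examples taken from this task, and it assumes the same requirepass on both instances): sample random keys from the master and check that each one exists with the same type (and, for strings, the same value) on the replica.

#!/bin/bash
# Rough spot-check: sample random keys from the master and verify type/value on the replica.
# Hostnames below are examples; adjust to the pair actually being compared.
MASTER=rdb2001.codfw.wmnet
SLAVE=rdb1001.eqiad.wmnet
PORT=6380
PASS="$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${PORT}.conf)"

for i in $(seq 1 1000); do
    key=$(redis-cli -h "$MASTER" -p "$PORT" -a "$PASS" RANDOMKEY)
    mtype=$(redis-cli -h "$MASTER" -p "$PORT" -a "$PASS" TYPE "$key")
    stype=$(redis-cli -h "$SLAVE" -p "$PORT" -a "$PASS" TYPE "$key")
    if [ "$mtype" != "$stype" ]; then
        echo "TYPE MISMATCH ${key}: master=${mtype} slave=${stype}"
        continue
    fi
    if [ "$mtype" = "string" ]; then
        mval=$(redis-cli -h "$MASTER" -p "$PORT" -a "$PASS" GET "$key")
        sval=$(redis-cli -h "$SLAVE" -p "$PORT" -a "$PASS" GET "$key")
        [ "$mval" != "$sval" ] && echo "VALUE MISMATCH ${key}"
    fi
done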

Ladsgroup subscribed.

I took care of the duplicate data. I have no idea how the jobrunner works.

Mentioned in SAL (#wikimedia-operations) [2017-04-27T08:54:04Z] <_joe_> restarting redis rdb1001:6380 after cleaning up the current AOF files for investigation of T163337

I finally managed to find some differences in one redis replication set, and those are all related to big data structures like enwiki:jobqueue:refreshLinks:l-unclaimed and enwiki:jobqueue:refreshLinks:z-claimed or enwiki:jobqueue:refreshLinks:h-idBySha1.

Specifically:

# On the slave
rdb1001:6380>LLEN  enwiki:jobqueue:refreshLinks:l-unclaimed
(integer) 85303
# On the master
rdb2001:6380> LLEN enwiki:jobqueue:refreshLinks:l-unclaimed
(integer) 1735

What I tried next was: stopping the slave, removing all its AOF files (to be sure a full resync would happen), starting it up again. The result was a perfect match in the size of the key in question.
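
In shell terms the procedure was roughly the following (a sketch; the service name redis-instance-tcp_6380 is an assumption, use whatever actually manages the instance on these hosts):

# On the slave host, for one instance (6380 in this example):
PORT=6380
PASS="$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${PORT}.conf)"

sudo service redis-instance-tcp_${PORT} stop        # ASSUMED unit name
sudo rm /srv/redis/$(hostname)-${PORT}.aof          # drop the AOF so no partial resync is possible
sudo service redis-instance-tcp_${PORT} start

# Wait until the slave reports a healthy link to its master (i.e. the full resync completed)
until redis-cli -p "${PORT}" -a "$PASS" INFO replication | grep -q 'master_link_status:up'; do
    sleep 5
done

# Then re-check the key that was diverging
redis-cli -p "${PORT}" -a "$PASS" LLEN enwiki:jobqueue:refreshLinks:l-unclaimed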

So it seems that when a full resync is triggered, the data does coincide in the two instances, which is what happens when we switch replication; the state then somehow drifts apart afterwards for some reason.

Now looking at the logs for the instance, I see

tcp_6380.log.1.gz:[29535] 19 Apr 14:25:34.648 * Full resync from master: c293e5300d666d7f1783d888c6bded7c6ba30ca2:5437935692

So when we switched masters a full resync DID happen, and I have to assume it successfully synced all the data, and that the drift happened from that moment on.

This seems to suggest that redis replication CAN indeed break for those data structures. I suspect this can happen because we're replicating Lua scripts instead of straight commands, as the way that's implemented in the redis replica is not perfectly clear to me. From the documentation (https://redis.io/topics/replication) I see:

During Lua scripts executions no keys expires are performed. As a Lua script runs, conceptually the time in the master is frozen, so that a given key will either exist or not for all the time the script runs. This prevents keys to expire in the middle of a script, and is needed in order to send the same script to the slave in a way that is guaranteed to have the same effects in the data set.

which suggests the Lua scripts themselves are sent to the slave, not the resulting actions on the datastore. This technique of replication causes nightmares that anyone who has had to deal with mysql statement replication and non-deterministic statements knows very well. It's possible that, due to some replication lag, some statements do not end up having exactly the same effect on the slave as they had on the master.

I am making this hypothesis based on the fact that differences are more easily found on inter-dc replicas (with higher lag) than intra-dc replicas, but I managed to find significant differences in both situations.
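
To make the failure mode concrete, here is a contrived example (not one of our actual scripts) of the kind of script that can diverge under this replication model: the write is guarded by a conditional on a key that may exist on the master but already be gone on the slave by the time the script is re-executed there.

# Contrived Lua-via-redis-cli example of a script whose effects depend on key existence.
# If 'some:marker:key' is present on the master but not on the slave when the script is
# replayed there, the RPUSH happens only on the master and the two lists drift apart.
redis-cli -p 6380 EVAL "
  if redis.call('EXISTS', KEYS[1]) == 1 then
      redis.call('RPUSH', KEYS[2], ARGV[1])
  end
  return redis.call('LLEN', KEYS[2])
" 2 some:marker:key some:list:key someJobBlob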

While this warrants a much more in-depth investigation, a stopgap workaround for the switchover is to restart all redis slaves in phase 0 of the switchover, and then wait for all of them to have caught up. This will reduce the possibility of creating significant discrepancies to near zero.

Mentioned in SAL (#wikimedia-operations) [2017-04-27T09:25:37Z] <_joe_> restarting all redis instances for jobqueues on eqiad to force a full resync with masters in codfw T163337

On April 12 a (supposedly harmless) test was performed in codfw using the switchdc script. It's probably not a coincidence then that the jobs we saw running unexpectedly today relate to events from that day. Presumably something happened there that led to jobs being set to some state somewhere and being triggered a second time today.

I just thoroughly reviewed the switchdc logs from April 12 and no action was taken on eqiad's redises or jobrunners. So I think this is a red herring.

That's exactly why I think it isn't a red herring. It ran on codfw, where the jobqueue redis is a replica of eqiad's. It's conceivable to me that these tests started something that caused jobs after that point to be marked in a certain way (in codfw). Then on April 19, the minute the step "Stop MediaWiki maintenance in the old master DC" was run, this marking stopped - presumably whatever state was left behind was now normalised. Then a few minutes later, when the job queue was enabled in codfw, all jobs previously "marked" by codfw started running again (for the first time, from codfw's perspective).

When I say "nothing was done" on april 12th, I mean the only thing the script did was to read INFO REPLICATION, which is something our monitoring does every minute. Anyways, I think we found the reason for the duplication happened at the time of the switchover, and it is related to the redis replication drifting for some reason.

So, I just re-ran showJobs in eqiad and codfw and there were big discrepancies, as expected

--- wasat.before	2017-04-27 11:45:20.725345007 +0200
+++ terbium.before	2017-04-27 11:45:31.189344455 +0200
@@ -1,15 +1,14 @@
-categoryMembershipChange: 3 queued; 52 claimed (3 active, 49 abandoned); 0 delayed
-cdnPurge: 0 queued; 0 claimed (0 active, 0 abandoned); 32 delayed
+categoryMembershipChange: 3398 queued; 30062 claimed (406 active, 29656 abandoned); 0 delayed
+cdnPurge: 32648 queued; 57664 claimed (959 active, 56705 abandoned); 21 delayed

-cirrusSearchCheckerJob: 0 queued; 0 claimed (0 active, 0 abandoned); 913 delayed
-cirrusSearchIncomingLinkCount: 0 queued; 1521 claimed (30 active, 1491 abandoned); 132 delayed
-cirrusSearchLinksUpdatePrioritized: 0 queued; 2 claimed (2 active, 0 abandoned); 0 delayed
-cirrusSearchOtherIndex: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
+cirrusSearchCheckerJob: 1620 queued; 5953 claimed (97 active, 5856 abandoned); 839 delayed
+cirrusSearchDeletePages: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
+cirrusSearchIncomingLinkCount: 3141 queued; 75143 claimed (1293 active, 73850 abandoned); 148 delayed

-deleteLinks: 0 queued; 308 claimed (21 active, 287 abandoned); 0 delayed
-EchoNotificationDeleteJob: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
-enqueue: 6 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
-htmlCacheUpdate: 28806 queued; 16 claimed (16 active, 0 abandoned); 0 delayed
+deleteLinks: 366 queued; 511 claimed (28 active, 483 abandoned); 0 delayed
+enqueue: 505113 queued; 2905195 claimed (63309 active, 2841886 abandoned); 0 delayed
+htmlCacheUpdate: 72547 queued; 31156 claimed (4176 active, 26980 abandoned); 0 delayed

-ORESFetchScoreJob: 0 queued; 248 claimed (22 active, 226 abandoned); 0 delayed
-refreshLinks: 15500 queued; 57 claimed (57 active, 0 abandoned); 0 delayed
-refreshLinksPrioritized: 0 queued; 66 claimed (66 active, 0 abandoned); 0 delayed
+ORESFetchScoreJob: 82437 queued; 60306 claimed (2169 active, 58137 abandoned); 0 delayed
+refreshLinks: 112933 queued; 116386 claimed (6897 active, 109489 abandoned); 0 delayed
+refreshLinksPrioritized: 175 queued; 601 claimed (601 active, 0 abandoned); 0 delayed

 ThumbnailRender: 0 queued; 8 claimed (0 active, 8 abandoned); 0 delayed
+wikibase-addUsagesForPage: 20 queued; 0 claimed (0 active, 0 abandoned); 0 delayed

After I forcibly synced all the servers in eqiad with the ones in codfw, the picture was the following:

--- wasat.after          	2017-04-27 11:49:01.761333335 +0200
+++ terbium.after     	2017-04-27 11:48:49.173333999 +0200
@@ -1,16 +1,15 @@
-categoryMembershipChange: 4 queued; 53 claimed (4 active, 49 abandoned); 0 delayed
-cdnPurge: 0 queued; 1 claimed (1 active, 0 abandoned); 25 delayed
-cirrusSearchCheckerJob: 1 queued; 3 claimed (3 active, 0 abandoned); 764 delayed
-cirrusSearchIncomingLinkCount: 0 queued; 1528 claimed (35 active, 1493 abandoned); 166 delayed
-cirrusSearchLinksUpdatePrioritized: 7 queued; 17 claimed (17 active, 0 abandoned); 0 delayed
+categoryMembershipChange: 2 queued; 56 claimed (7 active, 49 abandoned); 0 delayed
+cdnPurge: 0 queued; 1 claimed (1 active, 0 abandoned); 23 delayed
+cirrusSearchCheckerJob: 0 queued; 5 claimed (5 active, 0 abandoned); 763 delayed
+cirrusSearchIncomingLinkCount: 4 queued; 1529 claimed (36 active, 1493 abandoned); 148 delayed
+cirrusSearchLinksUpdatePrioritized: 11 queued; 17 claimed (17 active, 0 abandoned); 0 delayed
 deleteLinks: 0 queued; 309 claimed (22 active, 287 abandoned); 0 delayed
-enotifNotify: 1 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
 enqueue: 2 queued; 0 claimed (0 active, 0 abandoned); 0 delayed
 htmlCacheUpdate: 29039 queued; 17 claimed (17 active, 0 abandoned); 0 delayed
 LocalRenameUserJob: 0 queued; 1 claimed (0 active, 1 abandoned); 0 delayed
-ORESFetchScoreJob: 2 queued; 250 claimed (25 active, 225 abandoned); 0 delayed
-recentChangesUpdate: 0 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
+ORESFetchScoreJob: 0 queued; 250 claimed (25 active, 225 abandoned); 0 delayed
+recentChangesUpdate: 1 queued; 1 claimed (1 active, 0 abandoned); 0 delayed
 refreshLinks: 16131 queued; 61 claimed (61 active, 0 abandoned); 0 delayed
-refreshLinksPrioritized: 1 queued; 66 claimed (66 active, 0 abandoned); 0 delayed
+refreshLinksPrioritized: 0 queued; 66 claimed (66 active, 0 abandoned); 0 delayed
 ThumbnailRender: 0 queued; 8 claimed (0 active, 8 abandoned); 0 delayed
-wikibase-addUsagesForPage: 14 queued; 19 claimed (19 active, 0 abandoned); 0 delayed
+wikibase-addUsagesForPage: 0 queued; 19 claimed (19 active, 0 abandoned); 0 delayed

where the small differences can be due to replication lag and/or the fact that I didn't launch the two commands at exactly the same time.
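
(For the record, the replication lag itself can be quantified by comparing offsets from INFO replication rather than eyeballing job counts; a sketch, using the same auth pattern as the other snippets in this task:)

PORT=6380
PASS="$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${PORT}.conf)"

# On the master: its own offset plus the offset/lag it sees for each connected slave
redis-cli -p "${PORT}" -a "$PASS" INFO replication | egrep 'master_repl_offset|^slave[0-9]'

# On a slave: link status and how far it has processed the replication stream
redis-cli -p "${PORT}" -a "$PASS" INFO replication | egrep 'master_link_status|slave_repl_offset'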

So at the moment my picture for what happened with the switchover is the following:

  • For some reason to be determined, the redis statement replication drifts over time on specific data structures
  • When this happens, any mediawiki instance using the slaves sees a lot of jobs as "unclaimed" or "unfinished" that are in fact already processed
  • During the switchover, we made the redis slaves in codfw masters, thus those jobs were now up for grabs
  • When we switched over, the redis masters in eqiad supposedly fully synced to codfw according to the logs; this remains to be verified properly
  • At the moment, restarting all the redises in eqiad brutally, thus forcing a full resync, brought the datastores back to a consistent state.

I would like to monitor the situation in the next few days to see if the drift happens again or not; if we have to trust the data we have, it would happen pretty quickly.

Also let me add a few remarks on the redis replication:

  • It won't tell you it's broken, because that's basically by design if you use Lua scripts
  • It is hard to verify how it is broken if it only happens to drift on the contents of a few data structures out of ~7M keys per instance
  • We might need to do periodic PSYNCs on the slaves in the future to ensure they are indeed in sync with the master

Just to err on the side of caution, I reviewed all the code of JobQueueRedis and of the JobChron, and I found no obvious parts of our LUA scripts that could cause replication to break, like non-deterministic statements.

With no clear outlier in the code, I can do nothing else but look more closely at some data structures, and add some monitoring to know if and how much such data structures differ.
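
A crude version of that monitoring could just compare the sizes of the main per-queue structures between the master and a slave and flag large gaps; a sketch (queue list, hosts and threshold are placeholders to tune):

#!/bin/bash
# Drift check sketch: compare l-unclaimed / z-claimed sizes between master and slave.
MASTER=rdb1001.eqiad.wmnet   # placeholder: current master of the shard
SLAVE=rdb2001.codfw.wmnet    # placeholder: replica to compare against
PORT=6380
PASS="$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${PORT}.conf)"
WIKI=enwiki

for queue in refreshLinks enqueue htmlCacheUpdate; do
    for suffix in l-unclaimed z-claimed; do
        key="${WIKI}:jobqueue:${queue}:${suffix}"
        cmd=LLEN; [ "$suffix" = "z-claimed" ] && cmd=ZCARD
        m=$(redis-cli -h "$MASTER" -p "$PORT" -a "$PASS" "$cmd" "$key")
        s=$(redis-cli -h "$SLAVE" -p "$PORT" -a "$PASS" "$cmd" "$key")
        diff=$(( m > s ? m - s : s - m ))
        [ "$diff" -gt 1000 ] && echo "DRIFT ${key}: master=${m} slave=${s}"
    done
done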

Joe is going to add a more precise update but there is an interesting thing that we discovered today that might be relevant.

https://github.com/antirez/redis/issues/187

In our current version of Redis (2.8), expiration of keys (set with EXPIRE) is not performed by the slaves themselves; instead, a DEL is sent to the slaves when the master discovers that a certain key's TTL has elapsed. The main problem is that the master may discover that a key is expired only when it is accessed for the first time after the TTL has elapsed.

Today we thought that this might have been a plausible explanation for why we saw all those duplicated jobs after the switchover (keys that should have been expired but never received a DEL from the masters), but then Joe pointed out that the major differences seem to be in sets and lists, where a TTL does not seem to be applicable to individual elements (still to verify).

The root cause of our issue might be a similar/related problem... if anybody has any idea, please speak up :)

An additional case I'm going to study in more detail:

# Current master
redisc -h rdb2003.codfw.wmnet -p 6380 LLEN enwiki:jobqueue:enqueue:l-unclaimed
(integer) 2
# Local slave
redisc -h rdb2004.codfw.wmnet -p 6380 LLEN enwiki:jobqueue:enqueue:l-unclaimed
(integer) 61187
# Remote Slave
redisc -h rdb1003.eqiad.wmnet -p 6380 LLEN enwiki:jobqueue:enqueue:l-unclaimed
(integer) 59603

I roamed through the data structures and concluded there must be some race condition in the way expiration on the replica and Lua scripts interact. I think it might happen that a job is expired on the master but has not yet been removed at the moment a Lua script is replicated, and is still read through the Lua script on the master; this might cause the effect we're seeing here. I still have to put together a conclusive reproduction though, so this is still a hypothesis.

Given how near we are to the switchover, I'd rather restart all redises once we're in read-only mode instead of trying to fix the issue for now.

Also to be noted: I found no discrepancy in a sample of session servers, which might be due to the fact that those don't have large data structures; I'd leave them not force-synced if further testing doesn't show corruption in their replication.

Note that the only keys that use a TTL are the root job de-duplication hash/timestamp keys. Everything else is normally just directly deleted (lpop,hdel,...) via Lua on ack() or by the JobChron Lua script (which uses similar commands). None of the main data structures have native redis expiration (which uses the special master deletion logic).

In fact per-element expiration is not even possible on those data structures; @aaron, are the l-unclaimed lists potentially affected by a de-duplication hash being present or not? I can of course look at our code more closely and find out myself, but I trust your understanding of the code better than mine :)

Change 351004 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/switchdc@master] Add stage for restarting Redis.

https://gerrit.wikimedia.org/r/351004

Change 351004 merged by Giuseppe Lavagetto:
[operations/switchdc@master] Add stage for restarting Redis.

https://gerrit.wikimedia.org/r/351004

Krinkle renamed this task from Watchlist entries duplicated several times to Job queue corruption after codfw switch over (Queue worth, duplicate runs). May 5 2017, 2:54 AM
Krinkle updated the task description. (Show Details)
Krinkle added subscribers: Ciencia_Al_Poder, GWicke, ori.

The skew found by Joe keeps re-occurring, for example:

elukey@rdb1003:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} LLEN enwiki:jobqueue:enqueue:l-unclaimed; done
6378
(integer) 7233
6379
(integer) 0
6380
(integer) 0
6381
(integer) 0

elukey@rdb1004:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} LLEN enwiki:jobqueue:enqueue:l-unclaimed; done
6378
(integer) 7233
6379
(integer) 13
6380
(integer) 23
6381
(integer) 12

elukey@rdb2003:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} LLEN enwiki:jobqueue:enqueue:l-unclaimed; done
6378
(integer) 7233
6379
(integer) 5
6380
(integer) 137650     <====================================
6381
(integer) 2

One thing that I wanted to do was to follow changes to enwiki:jobqueue:enqueue:l-unclaimed in rdb100[34] and rdb2003's Redis 6380 AOF files to see where the values diverge, but it is not that easy :D

This issue is interesting: https://github.com/antirez/redis/issues/614 (should be already fixed in our version, 2.8, bringing it up only as reference)

As Joe pointed out it might be possible that we are hitting one of those weird corner cases, but we'd need to find a way to reliably reproduce it.

Investigation idea, not sure where it will lead but it is worth exploring.

I checked enwiki:jobqueue:enqueue:l-unclaimed elements on rdb2003 with the Redis command LRANGE enwiki:jobqueue:enqueue:l-unclaimed 0 100 and grabbed a random element. Then I grepped the element in /srv/redis/rdb2003-6380.aof, and the same grep on rdb1003.

Diff between them:

https://phabricator.wikimedia.org/P5398 (WMF-NDA protected, I am pretty sure no PII data is in there but better safe than sorry)

From a newbie point of view, it seems that on the master the job is enqueued and then moved somewhere else, while on the slave it is only enqueued. It would be really great to reverse those EVALSHAs to figure out which Lua scripts they correspond to, but afaics there is no such option in Redis.

Is there any way to get them from any mw host?

I'd assume you could get them from JobQueueRedis.php, since it's just the SHA1 of the lua strings, so:

  • doPushInternal() - 2AAC70021ADE78213B297E3E8316FA24F82D2897
  • doPop() - 3B56824083B7C4C3509163C7E76E00904014D92E
  • doAck() - AAE4A1B31C11DE4B5BA0FC23E3D2BBF585EA4589
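
(For anyone who wants to double-check the mapping without digging through PHP: SCRIPT LOAD returns the SHA-1 that EVALSHA uses, so loading the Lua string should print one of the hashes above. The file name below is hypothetical, just a place to paste the script body from JobQueueRedis.php.)

# Paste the Lua string for e.g. doPop() from JobQueueRedis.php into doPop.lua, then:
redis-cli -p 6380 SCRIPT LOAD "$(cat doPop.lua)"
# should print the doPop() hash listed above (lowercased); SCRIPT LOAD only registers
# the script, it does not execute it, so this is safe to run against any instance.
# Note: trailing whitespace/newlines change the hash, so copy the string exactly.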

I wonder if the script replication works fully correctly (implemented in https://github.com/antirez/redis/issues/558). E.g. if you restart the slave (losing the scripts), does it always pick up the sha1 => script mapping from the AOF/RDB file (or even make an out-of-band request to resolve the SHA-1 from the master)?

https://github.com/antirez/redis/issues/2999 seems possibly related. I'm not sure why the ack() script would be the one to have the problems though.

I'm not sure why the ack() script would be the one to have the problems though.

Maybe it is only a side effect of the issue that we are chasing: on the master the job is put in the correct queue after being in unclaimed, and it gets "acked" correctly. On the slave, this doesn't happen and the job remains in the unclaimed list.

Another test that I did today: since I noticed a lot of RPUSH commands on the Redis slaves, I tried to count them in the corresponding AOFs. I assumed that the format in the AOF is the following:

RPUSH
$69
frwiktionary:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed

So basically RPUSH command and target queue listed right after it (this is why I used grep -A 3 in the commands below).

elukey@rdb1003:/srv/redis$ grep RPUSH rdb1003-6380.aof  -A 3 --color | grep ":" | sort | uniq -c | sort -n -k 1 | tail
      6 enwiktionary:jobqueue:RecordLintJob:l-unclaimed
      6 enwiktionary:jobqueue:refreshLinks:l-unclaimed
      7 commonswiki:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed
      8 commonswiki:jobqueue:refreshLinks:l-unclaimed
     13 frwiki:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed
     22 enwiktionary:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed
     24 enwiki:jobqueue:htmlCacheUpdate:l-unclaimed
     27 enwiki:jobqueue:RecordLintJob:l-unclaimed
     30 enwiki:jobqueue:refreshLinks:l-unclaimed
     40 enwiki:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed
elukey@rdb2003:/srv/redis$ grep RPUSH rdb2003-6380.aof  -A 3 --color | grep ":" | sort | uniq -c | sort -n -k 1 | tail
     13 frwiki:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed
     16 commonswiki:jobqueue:refreshLinks:l-unclaimed
     18 enwiki:jobqueue:htmlCacheUpdate:l-unclaimed
     22 enwiktionary:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed
     27 enwiki:jobqueue:RecordLintJob:l-unclaimed
     29 enwiki:jobqueue:refreshLinks:l-unclaimed
     40 enwiki:jobqueue:RestbaseUpdateJobOnDependencyChange:l-unclaimed
    125 enwiki:jobqueue:ORESFetchScoreJob:l-unclaimed
   1929 enwiki:jobqueue:enqueue:l-unclaimed
   5451 enwiktionary:jobqueue:refreshLinks:l-unclaimed

On the slaves there are a lot more RPUSH commands than on the master, so I checked the biggest offenders on the slave and compared their sizes with the master:

elukey@rdb2003:/srv/redis$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p 6380 LLEN enwiktionary:jobqueue:refreshLinks:l-unclaimed
(integer) 354285

elukey@rdb1003:/srv/redis$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p 6380 LLEN enwiktionary:jobqueue:refreshLinks:l-unclaimed
(integer) 424
elukey@rdb2003:/srv/redis$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p 6380 LLEN enwiki:jobqueue:enqueue:l-unclaimed
(integer) 121914

elukey@rdb1003:/srv/redis$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p 6380 LLEN enwiki:jobqueue:enqueue:l-unclaimed
(integer) 0
elukey@rdb2003:/srv/redis$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p 6380 LLEN enwiki:jobqueue:ORESFetchScoreJob:l-unclaimed
(integer) 8864

elukey@rdb1003:/srv/redis$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p 6380 LLEN enwiki:jobqueue:ORESFetchScoreJob:l-unclaimed
(integer) 1

Not sure if this is another red herring or a good lead to follow, but I am writing it anyway :)

Added note after a bit of thinking: the AOF rewrite functionality might play a role here. I am not super expert in how Redis rewrites its AOF to keep its size under certain thresholds, but probably an [L|R]PUSH and a corresponding POP of the same element of the same structure (like a list) end up being treated as a no-op.

If this is somehow true, we might see more RPUSHes on the slave because the corresponding POPs were not propagated correctly (some bugs mention similar things).

The only RPUSH that I found is in the JobChron:

https://github.com/wikimedia/mediawiki-services-jobrunner/blob/master/redisJobChronService#L373

As far as I can see it grabs jobs that have been claimed for too long and re-enqueues them in the corresponding unclaimed queue. Another theory is that https://github.com/wikimedia/mediawiki-services-jobrunner/blob/master/redisJobChronService#L356-L409 triggers a storm of RPUSHes only on the slave and not on the master for some reason.
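
One cheap way to poke at that theory (sketch): watch, side by side, whether z-claimed shrinks and l-unclaimed grows only on the slave while the master stays flat. Assumes the same requirepass on both hosts, as in the other snippets here.

PORT=6380
PASS="$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${PORT}.conf)"
KEYBASE=enwiki:jobqueue:enqueue

while true; do
    for host in rdb1003.eqiad.wmnet rdb2003.codfw.wmnet; do
        unclaimed=$(redis-cli -h "$host" -p "$PORT" -a "$PASS" LLEN "${KEYBASE}:l-unclaimed")
        claimed=$(redis-cli -h "$host" -p "$PORT" -a "$PASS" ZCARD "${KEYBASE}:z-claimed")
        echo "$(date -u '+%H:%M:%S') ${host} unclaimed=${unclaimed} claimed=${claimed}"
    done
    sleep 60
done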

What do you think of https://github.com/antirez/redis/issues/2999 ? I wonder if it would be worth adding a flag to make the luaEval() calls in JobQueueRedis just use EVAL instead of EVALSHA, resync, and see if it still diverges?

Tried to make some tests doing the following:

  • executing EVAL on a lua script
  • computing the sha1 of the lua script from php (as luaEval does)
  • check SCRIPT exists on master
  • run EVALSHA with the sha1 of the script
  • check the values in the slave

TEST1:

  • redis.log( redis.LOG_WARNING, '[eval test] REDIS EVAL TEST' )

On the master I can see [eval test] REDIS EVAL TEST twice in the logs, but only once on the slave (precisely, only after EVALSHA is executed).
This is a strange result, but the script does not perform any change to the database.

To repro:

  1. EVAL "redis.log( redis.LOG_WARNING, '[eval test] REDIS EVAL TEST' )" 0
  2. EVALSHA 246fc611b61839a9610659a6567ad894cd1cbdb9

I can see the log on the slave only after 2). Interestingly, the command SCRIPT EXISTS 246fc611b61839a9610659a6567ad894cd1cbdb9 returns true on the master right after step 1), while on the slave it returns true only after step 2).

TEST2:

  • redis.call('SET', 'elukeytest', '42')

Correctly found the value in the slave

TEST 3

  • redis.call('INCR', 'elukeytest')

Correctly incremented two times (one for EVAL and one for EVALSHA)

TEST 4

  • redis.call('LPUSH', 'elukeytestlist', '42')

Correctly shows two times '42' (the first time it creates the list and adds the element, the second it only adds the element).

So for the moment the only weirdness is related to logging.

Couple of notes from https://raw.githubusercontent.com/antirez/redis/2.8/00-RELEASENOTES:

Mentioned in SAL (#wikimedia-operations) [2017-04-19T16:49:07Z] <Amir1> ladsgroup@naos:~$ mwscript extensions/ORES/maintenance/CleanDuplicateScores.php --wiki=enwiki (T163337)

There seem to still be a lot of duplicate scores in the table, leading to duplicate results on Special:Contributions. For example, on this page all results except the last two are duplicates (all from January, February and March 2016; the non-dupe results are from mid-2015).

I don't think these duplicates were caused by the most recent switchover though, because they're for revisions from 2016.

mysql:research@s3-analytics-slave [enwiki]> select oresc_rev, oresc_model, oresc_class, count(*) as c from ores_classification where oresc_rev between 700000000 and 710000000 group by oresc_rev, oresc_model, oresc_class having count(*)>1 ;
[...]
233348 rows in set (31.37 sec)

This particular run of duplicates involves about 2.1 million revisions with IDs between 646M and 762M:

mysql:research@s3-analytics-slave [enwiki]> select count(*), min(oresc_rev), max(oresc_rev), max(c) from (select oresc_rev, count(*) as c from ores_classification where oresc_rev between 646000000 and 762000000 group by oresc_rev, oresc_model, oresc_class having count(*)>1 ) as t;
+----------+----------------+----------------+--------+
| count(*) | min(oresc_rev) | max(oresc_rev) | max(c) |
+----------+----------------+----------------+--------+
|  2126404 |      646487174 |      761664891 |      3 |
+----------+----------------+----------------+--------+
1 row in set (2 min 59.34 sec)

This ID range corresponds to a time range spanning almost two years, from early 2015 to early 2017:

mysql:research@s3-analytics-slave [enwiki]> select rev_timestamp from revision where rev_id in (646487174,761664891);
+----------------+
| rev_timestamp  |
+----------------+
| 20150210115437 |
| 20170124033422 |
+----------------+
2 rows in set (0.24 sec)

133 of them are triplicated, the others are duplicated:

mysql:research@s3-analytics-slave [enwiki]> select count(*), c from (select oresc_rev, count(*) as c from ores_classification where oresc_rev between 646000000 and 762000000 group by oresc_rev, oresc_model, oresc_class having count(*)>1 ) as t group by c;
+----------+---+
| count(*) | c |
+----------+---+
|  2126271 | 2 |
|      133 | 3 |
+----------+---+

The density is 2%-3% most of the way through, but drops sharply at the end, with the 740M-750M range having a density as low as 0.03%:

mysql:research@s3-analytics-slave [enwiki]> select count(*), FLOOR(oresc_rev/10000000) as tenmillion  from (select oresc_rev, count(*) as c from ores_classification where oresc_rev between 646000000 and 762000000 group by oresc_rev, oresc_model, oresc_class having count(*)>1 ) as t group by tenmillion;
+----------+------------+
| count(*) | tenmillion |
+----------+------------+
|    71577 |         64 |
|   175789 |         65 |
|   207713 |         66 |
|   202436 |         67 |
|   223048 |         68 |
|   243488 |         69 |
|   233348 |         70 |
|   237363 |         71 |
|   307764 |         72 |
|   214180 |         73 |
|     3265 |         74 |
|     5287 |         75 |
|     1146 |         76 |
+----------+------------+
13 rows in set (2 min 59.65 sec)

I have a hypothesis that I can't test right now. But most of the revisions in this range are in ores_classification because of our aggressive friend who was practically going through all edits on English Wikipedia and sending requests concurrently, which probably ended up putting duplicate rows in the db. I will investigate this later today.

Why would duplicate jobs lead to duplicate contributions rows? That seems like a bug in that job's logic, since Job classes are supposed to avoid that kind of thing when possible (since run-once is not guaranteed by pretty much any queue system).

@aaron another interesting open bug that might be worth reviewing: https://github.com/antirez/redis/issues/1525 ("EVAL replicated + conditionals about key existence = replication bug.")

It leverages TTLs though that we don't use, so probably it will not resolve much, but good to keep it as reference.

Keep wondering why we have this massive difference in unclaimed:

elukey@rdb2003:~$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6380.conf)" -p 6380 LLEN enwiktionary:jobqueue:refreshLinks:l-unclaimed
(integer) 1344401

elukey@rdb1003:~$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6380.conf)" -p 6380 LLEN enwiktionary:jobqueue:refreshLinks:l-unclaimed
(integer) 37

Is there a way to check what happened to the jobs stored in enwiktionary:jobqueue:refreshLinks:l-unclaimed on rdb2003 that are not on rdb1003? I am asking since I am not super familiar with where the job metadata is stored. If those jobs were all executed correctly, then we might narrow the bug down to the JobChron script executions on the Redis slaves.

@aaron another interesting open bug that might be worth reviewing: https://github.com/antirez/redis/issues/1525 ("EVAL replicated + conditionals about key existence = replication bug.")

It leverages TTLs though that we don't use, so probably it will not resolve much, but good to keep it as reference.

That's not true. We do have keys that have a TTL and that are checked in order to do things on the data structures that show corruption.

Also, if you read the bug correctly, it clearly says that this doesn't just happen for keys with TTLs; in general there is a race condition with key removal.

This is exactly the direction I was investigating into, and it seems that it's what's happening to us.

The bad news is that there doesn't seem to be a fix that has been applied consistently to even the latest Redis versions.

When I was reading https://github.com/antirez/redis/issues/1525's content I was attracted by the conditionals since we use them in the Lua scripts, but I wasn't sure about the TTLs.

It would be great if we could find the race condition and try to apply a workaround, but it might be safer to just turn on daily restarts of the Redis replicas to ensure consistency and see how it goes.
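
If we go the pragmatic route, the daily restart does not need to be anything fancier than a small script run from cron on the slave hosts; a sketch (the service name is an assumption, and the role check is there so it can never touch a current master):

#!/bin/bash
# Sketch: restart every local redis instance that is currently a slave, forcing a full resync.
# ASSUMPTION: instances are managed as services named redis-instance-tcp_<port>.
for port in 6378 6379 6380 6381; do
    pass="$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${port}.conf)"
    if redis-cli -p "$port" -a "$pass" INFO replication | grep -q 'role:master'; then
        echo "skipping ${port}: instance is currently a master" >&2
        continue
    fi
    sudo service "redis-instance-tcp_${port}" restart
    sleep 120   # give the full resync some room before hitting the next instance
done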

Note: given the high number of RPUSHes on the slaves' AOF (that are not on the masters) I'll try to follow the theory that https://github.com/wikimedia/mediawiki-services-jobrunner/blob/master/redisJobChronService#L371-L374 might be executed differently on the master and on the slave. Could be a waste of time but it might give some useful info.

@aaron whenever you have time, can you review my last statement and let me know what you think about it? At this point either we invest time in finding the race condition, or we just set up regular cron jobs to periodically restart the Redis slaves in order to get them back in sync with their masters. With the current settings, if a master dies we can't fully rely on the slaves to process its data.

@aaron What would it take not to use Lua anymore? I'm assuming for a moment that we don't use Lua during web requests to simply add/push jobs to the queues, but rather only for the deduplication and/or batching commands for fetching and marking-as-inprogress of multiple jobs when inside job runners.

If that is the case, a minor performance setback could be acceptable in favor of plain commands that take a bit longer within the job runner processes.

Krinkle renamed this task from Job queue corruption after codfw switch over (Queue worth, duplicate runs) to Job queue corruption after codfw switch over (Queue growth, duplicate runs). Jun 5 2017, 1:21 PM
Krinkle updated the task description. (Show Details)

@aaron another interesting open bug that might be worth reviewing: https://github.com/antirez/redis/issues/1525 ("EVAL replicated + conditionals about key existence = replication bug.")

It leverages TTLs though that we don't use, so probably it will not resolve much, but good to keep it as reference.

That's not true. We do have keys that have a TTL and that are checked in order to do things on the data structures that show corruption.

Also, if you read the bug correctlly, it says clearly that doesn't just happens for keys with TTLs but in general there is a race condition with key removal.

This is exactly the direction I was investigating into, and it seems that it's what's happening to us.

The bad news is that there doesn't seem to be a fix that has been applied consistently to even the latest Redis versions.

The keys with TTLs are just the root job de-duplication keys. Dependencies on those for writes are decided on the PHP level, not inside Lua scripts.

@aaron thanks for the clarification, but is there a way in your opinion to test if my theory about the JobChron rpushing jobs again in unclaimed is true? It would also make sense that conditionals could trigger race conditions as described in https://github.com/antirez/redis/issues/1525, even if not due to timeouts.

I suppose you could try stopping the 'jobchron' service for a few minutes on the runners, while letting 'jobrunner' keep running, and then seeing if divergence still happens. The main effect would be that delayed jobs won't become available during the time, which is tolerable for a while.

That might give some clue as to how much divergence happens in ack() vs the periodic Lua script.
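
A sketch of how that experiment could look in practice (the host list file is hypothetical; the service names jobchron/jobrunner are taken from the comment above):

# 1. stop jobchron on all job runners, leave jobrunner running
for host in $(cat jobrunner-hosts.txt); do      # hypothetical host list
    ssh "$host" 'sudo service jobchron stop'
done

# 2. force a fresh resync of the skewed slave (see the earlier procedure), then sample
#    the suspect key for a couple of hours to see whether it still drifts
PASS="$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6380.conf)"
for i in $(seq 1 24); do
    date -u
    redis-cli -h rdb2003.codfw.wmnet -p 6380 -a "$PASS" LLEN enwiki:jobqueue:enqueue:l-unclaimed
    sleep 300
done

# 3. put jobchron back
for host in $(cat jobrunner-hosts.txt); do
    ssh "$host" 'sudo service jobchron start'
done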

As FYI https://gerrit.wikimedia.org/r/#/c/357193/ and https://gerrit.wikimedia.org/r/#/c/359950 got merged, setting a daily Redis restart for all the slaves (codfw and local ones).

Re-checked my previous test:

elukey@rdb1003:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} LLEN enwiki:jobqueue:enqueue:l-unclaimed; done
6378
(integer) 7233
6379
(integer) 0
6380
(integer) 0
6381
(integer) 0

elukey@rdb1004:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} LLEN enwiki:jobqueue:enqueue:l-unclaimed; done
6378
(integer) 7233
6379
(integer) 0
6380
(integer) 1
6381
(integer) 1

elukey@rdb2003:~$ for instance in 6378 6379 6380 6381; do echo ${instance}; redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_${instance}.conf)" -p ${instance} LLEN enwiki:jobqueue:enqueue:l-unclaimed; done
6378
(integer) 7233
6379
(integer) 0
6380
(integer) 94771
6381
(integer) 0

I verified that only the slaves got restarted; the patch works as planned. I'll try to figure out how much time it takes for enwiki:jobqueue:enqueue:l-unclaimed to get out of sync.

restarted Redis 6380 on rdb200[34] and started the following script in screen:

#!/bin/bash
set -e
set -u

while true;
do
    echo $(date '+%d/%m/%Y %H:%M:%S')
    redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6380.conf)" -p 6380 LLEN enwiki:jobqueue:enqueue:l-unclaimed
    grep -c RPUSH /srv/redis/rdb200[34]-6380.aof
    sleep 300
done

Very rudimentary way to get some datapoints but it should work fine.

It seems that the inconsistency on the codfw slaves slowly creeps up after the ~2 hour mark, rather than all of a sudden. I am now running the above script on rdb[12]003 to measure the difference between the two over a day (removed the grep -c part).

Very rudimentary time series data about LLEN enwiki:jobqueue:enqueue:l-unclaimed (6380 shard) starting from 01:05 UTC onwards (slave restarts happen around this time everyday):

  • rdb1003
2 1 0 0 0 3 2 1 0 2 0 1 1 0 0 0 0 0 1 0 0 0 3 1 0 0 1 2 0 2 0 0 2 0 0 0 2 0 2 1 1 0 0 0 1 0 0 0 1 1 0 1 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 2 0 0 0 1 0 1 1 1 3 0 1 1 0
  • rdb2003
1 0 2 4 0 1 0 0 0 0 0 1 3 1 0 238 171 112 77 25 815 743 1404 1285 1163 1083 990 1584 2593 3607 4616 5580 6402 7330 8253 9204 10078 10934 11755 12375 13252 14140 15033 15743 15822 16563 16839 17666 18442 19261 20073 20534 20227 19894 19630 19333 19425 19223 19241 19033 18698 18441 18177 18147 18660 19407 20212 20989 21732 22357 23202 23942 24653 25360 26049 26497 27138 27789 28443 29113 29436 29929

The interesting thing is that rdb1004 (eqiad local replica) does not show any skew of data. Another interesting thing is that the data skew happens one/two hours after the last full sync with the master. The number of RPUSH commands in the .aof files is ~700 for rdb2003, ~370 for rdb1003 and ~270 for rdb1004.

Another random test that I did today.

  1. on rdb2003 I ran redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6380.conf)" -p 6380 LRANGE enwiki:jobqueue:enqueue:l-unclaimed 0 50000 > ranges.txt to get the list of job sha1s that are in codfw but not in eqiad.
  2. ran the following grep on rdb1003, rdb1004 and rdb2003's tcp_6380 Redis AOF files, checking differences for some random job sha1s found in ranges.txt:

grep d445898a3bbc4eb688aeed6fe2073a7b /srv/redis/rdb2003-6380.aof -B 500 --color | egrep 'd445898a3bbc4eb688aeed6fe2073a7b|^[A-Z]'

This should, theoretically, print any occurrence of a job together with its related Redis command (the AOF file lists one element per line).

These are some results: https://phabricator.wikimedia.org/P5616 (not sure if any data is sensitive so I copied the entire snippet that I'd have written in here and added WMF-NDA perms)

This comment was removed by elukey.

Deleted by accident my previous comment, will re-do it :)

So https://phabricator.wikimedia.org/P5621 seems interesting. I found the job id among the ones in the LRANGE enwiki:jobqueue:enqueue:l-unclaimed 0 50000 but this time there is no trace of a RPUSH, only traces of ZADD to the enwiki:jobqueue:enqueue:z-claimed queue.

No traces of the job id in the eqiad master rdb1003.

I am not an expert of the LUA code of jobrunner and jobchron but it seems to me that the following might be happening in codfw:

  • a job gets claimed and zadded in enwiki:jobqueue:enqueue:z-claimed
  • on the eqiad master, it gets executed and subsequently zremoved from enwiki:jobqueue:enqueue:z-claimed by something like the doAck function.
  • due to replication issues or race conditions, the job id gets moved from the claimed queue to the unclaimed queue by the jobchron.

The fact that I don't find any traces of the job id in the eqiad master's AOF seems consistent to me, since I am pretty sure that when the AOF gets rewritten commands like zadd/zrem for the same element are deleted (since they are a no-op for the database).

Still didn't find a reliable repro case, but it does seem that the order of execution of the lua scripts is not the same as it is on the master. Any opinions?
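
For reference, a quick way to ask each instance where a given job id currently lives (sketch; the id is the one from the earlier grep, and the LRANGE scan can be slow on a very long list):

UUID=d445898a3bbc4eb688aeed6fe2073a7b    # example id from the earlier grep
QUEUE=enwiki:jobqueue:enqueue
PASS="$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6380.conf)"

for host in rdb1003.eqiad.wmnet rdb2003.codfw.wmnet; do
    echo "== ${host}"
    # a score (the claim timestamp, I believe) if the id is in the claimed z-set, (nil) otherwise
    redis-cli -h "$host" -p 6380 -a "$PASS" ZSCORE "${QUEUE}:z-claimed" "$UUID"
    # number of times the id appears in the unclaimed list (0 or 1 normally)
    redis-cli -h "$host" -p 6380 -a "$PASS" LRANGE "${QUEUE}:l-unclaimed" 0 -1 | grep -c "$UUID"
done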

Ideas for next steps:

  • Figure out if intra-DC replication is equally broken.

@Krinkle can we find a strategy to verify this point? We can live with out of sync codfw replicas for a while but we are surely in danger if we can't rely on the local/intradc replicas.

Mentioned in SAL (#wikimedia-operations) [2017-06-26T16:59:18Z] <elukey> EXPERIMENT - T163337 - set slaveof no one on rdb2004 to remove its dependency to rdb2003 (puppet disabled on rdb2004, to rollback just enable/run it)

The idea behind the experiment is to stop rdb2004 from replicating rdb2003, to see whether having a chained slave creates any sort of different behavior compared to the rdb1004 slave (which does not have any slaves of its own).

Summary for the moment:

  • Redis 3.0's changelog shows tons of changes for Redis replication - https://raw.githubusercontent.com/antirez/redis/3.0/00-RELEASENOTES
  • with daily restarts, the intra/local dc replica seems consistent and not out of sync for data structures like enwiki:jobqueue:enqueue:l-unclaimed
  • for some reason the inter dc replica in codfw suffers from race conditions that are causing inconsistent values in enwiki:jobqueue:enqueue:l-unclaimed and others.

Experiment finished and nothing changed in rdb2003, so we can rule out that (remote) option.

The next step would be to stop the jobchron daemon on all the eqiad jobrunners, restart all the codfw shards and observe the data structures like z-claimed and l-unclaimed. The experiment should last two/three hours maximum, since from what we observed the skew happens around this timeframe in codfw.

Any opposition?

Side question: Will stopping/restarting all of those things be noticeable to editors?

The only visible effect should be that delayed jobs (IIUC jobs scheduled to be executed at some point in the future) will get delayed a bit because of the absence of the jobchron, but two/three hours of delay shouldn't be a huge deal in my opinion.

All the other restarts (related to replicas) are absolutely not noticeable for users :)

Opened an issue upstream to ask for some guidance while we investigate: https://github.com/antirez/redis/issues/4103

Today I tried to count the occurrences of EVALSHA 3917a6b71d2857ca3d40fe6dabc2fc46d8e4e535 (JobChron script) in the AOFs of rdb1003 + slaves. IIUC this format is the one indicating KEYS and ARGS of the script.

The jobchron Lua code is executed periodically for each queue that needs to be updated (== one EVALSHA for each queue, like enwiki, etc.).

I tried to grab the value of the rClaimCutoff variable, a timestamp that should indicate which elements of the claimed z-set are stale, i.e. the jobs that have been claimed for too long and need to be re-pushed to the unclaimed queue.

As usual I examined rdb1003:6380 + slaves and ran the following:

grep 3917a6b71d2857ca3d40fe6dabc2fc46d8e4e535 /srv/redis/rdb1003-6380.aof -A 18| egrep '^14[0-9]*' | uniq -c
grep 3917a6b71d2857ca3d40fe6dabc2fc46d8e4e535 /srv/redis/rdb2003-6380.aof -A 18| egrep '^14[0-9]*' | uniq -c
grep 3917a6b71d2857ca3d40fe6dabc2fc46d8e4e535 /srv/redis/rdb1004-6380.aof -A 18| egrep '^14[0-9]*' | uniq -c

These greps should return something like:

[..]
     15 1499235919
     30 1499235920
      8 1499235921
      5 1499235922
     14 1499235923
      5 1499235924
     14 1499235925
     11 1499235926
     11 1499235927
     16 1499235928
     11 1499235929
[..]

The KEYS/ARGS parameters are written down on separate lines in the AOF, right after the EVALSHA, and rClaimCutoff should be at line 18.
In the ideal world of perfect replication, master and slaves should emit the same output, but what happens is that on rdb2003 (the skewed slave) some job runs are recorded one time fewer than on the master:

[..]
      44 1499235537
-     27 1499235538
+     26 1499235538
      10 1499235539
       2 1499235540
      20 1499235541
      11 1499235542
      16 1499235543
      41 1499235544
-     28 1499235545
+     27 1499235545
      25 1499235546
      19 1499235547
       9 1499235548
      15 1499235549
-     22 1499235550
+     21 1499235550
[..]

This might be completely biased by AOF rewrites, but it led me to think that we are not only experiencing a race condition in Lua code execution, but might also be hitting a major Redis replication bug.

This brings me to the answer that I got from upstream in https://github.com/antirez/redis/issues/4103, that is worth quoting in here:

the problem may due to a number of reasons: Lua misuse, 2.8 bugs, or the mix of the two, since there are past Lua bugs that result into problems in replication. However we have a policy of not investigating bugs like this were most of the subsystems changed severely, compared to the version the bug is reported: now we have, for instance, effects replication so that Redis is able to run complex scripts on the master, and instead of propagating the script, it can just propagate the commands that resulted from the script execution. The replication subsystem itself changed dramatically so it has a lot better and a lot less bugged compared to Redis 2.8.

Since Redis 3.2 and 4.0 are 99% (if not more) backward compatible, it is very unlikely to experience problems when upgrading, so my suggestion is to try to move to a newer version. If the problem will still be there, we can surely perform an investigation in order to understand what's wrong.

From what I can read, Redis 2.8 is not supported anymore by upstream; the only chance we have is to migrate to at least 3.2 and try to reproduce the bug there. What a nice coincidence that the new Debian stable is Stretch and the Redis version in there is 3.2 :)

So I personally see two possible ways of resolving this issue:

  1. Dedicate our energies to T157088 and wait for the brand new Eventbus/Kafka/ChangeProp restyle of the JobQueues, accepting the fact that our codfw replicas are broken. We have a procedure to make the switchover to codfw without duplicate jobs, and the local DC replica seems consistent with the master now that we are restarting them daily.
  2. Test Redis 3.2 and think about migrating one rdb "tree" (eqiad master + slave + codfw slave) to it, trying to figure out whether the bug is still present or not. This might of course uncover new bugs, and it would probably require a significant effort from ops and whoever knows how the jobqueue internals work.

I am inclined to follow 1) but I would really like to know other opinions :)

After a chat with the other opsens there seems to be agreement on the first option, avoiding spending more time on Redis. I'll keep this task open for a couple more days to give people a chance to speak up and propose different solutions, then I'll close it if nobody opposes :)

elukey claimed this task.

Change 349235 abandoned by Ladsgroup:
Use DISTINCT option on ChangesList select

https://gerrit.wikimedia.org/r/349235