
Cleanup duplicate indices in cloudelastic
Closed, ResolvedPublic

Description

Indices that are supposed to be split between the psi and omega clusters of cloudelastic instead exist on both. At some point (during initial deployment?) the name/port mappings in cloudelastic were mixed up: data for the psi cluster in prod went to omega in cloudelastic, and data for omega in prod ended up in psi in cloudelastic.

Clean up all the unreferenced indices left over by the mix-up.

Event Timeline

This shows a single wiki as an example, but the same pattern repeats for all of the wikis that are split between omega and psi. Here acewiki correctly does not exist on 9243 (chi). It should not exist on 9443 (omega), but it does on cloudelastic:9443. It should exist on 9643 (psi), and it does in all clusters.

ebernhardson@mwmaint1002:~$ for port in 9243 9443 9643; do for cluster in search.svc.{eqiad,codfw}.wmnet cloudelastic.wikimedia.org; do echo $cluster:$port; curl https://$cluster:$port/_cat/indices | awk '/acewiki/ { print $1 }'; done; done
search.svc.eqiad.wmnet:9243
search.svc.codfw.wmnet:9243
cloudelastic.wikimedia.org:9243
search.svc.eqiad.wmnet:9443
search.svc.codfw.wmnet:9443
cloudelastic.wikimedia.org:9443
acewiki_content_1582068728
acewiki_general_1582068795
search.svc.eqiad.wmnet:9643
acewiki_general_1605055533
acewiki_content_1605055462
acewiki_titlesuggest_1547630056
acewiki_archive_1605055570
search.svc.codfw.wmnet:9643
acewiki_content_1605055494
acewiki_general_1605055564
acewiki_archive_1605055603
acewiki_titlesuggest_1547630084
cloudelastic.wikimedia.org:9643
acewiki_general_1605055524
acewiki_content_1605055456
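
The same comparison can be written as a small check. This is a minimal sketch, assuming the port-to-cluster mapping described above (9243 chi, 9443 omega, 9643 psi) and using a subset of the acewiki listing as input. The function name and data layout are hypothetical and not part of any existing tooling.

```python
# Port-to-cluster mapping as described in the comment above.
PORT_TO_CLUSTER = {9243: "chi", 9443: "omega", 9643: "psi"}

# Index names observed per (host, port); a subset copied from the listing above.
OBSERVED = {
    ("search.svc.eqiad.wmnet", 9443): [],
    ("search.svc.codfw.wmnet", 9443): [],
    ("cloudelastic.wikimedia.org", 9443): [
        "acewiki_content_1582068728",
        "acewiki_general_1582068795",
    ],
    ("cloudelastic.wikimedia.org", 9643): [
        "acewiki_general_1605055524",
        "acewiki_content_1605055456",
    ],
}


def misplaced_indices(observed, expected_cluster):
    """Return {(host, port): [index, ...]} for every endpoint that holds
    indices but whose port does not map to the expected cluster."""
    misplaced = {}
    for (host, port), names in observed.items():
        if names and PORT_TO_CLUSTER[port] != expected_cluster:
            misplaced[(host, port)] = sorted(names)
    return misplaced
```

Calling `misplaced_indices(OBSERVED, "psi")` flags only the two indices on cloudelastic.wikimedia.org:9443, which matches the manual reading of the listing.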

Double-checking cluster assignments in MediaWiki, we get:

ebernhardson@mwmaint1002:~$ mwscript shell.php --wiki=acewiki
Psy Shell v0.10.5 (PHP 7.2.31-1+0~20200514.41+debian9~1.gbpe2a56b+wmf1+icu63 — cli) by Justin Hileman
>>> (new CirrusSearch\SearchConfig())->getClusterAssignment()->getServerList('cloudelastic')
=> [
     [
       "host" => "localhost",
       "transport" => "Http",
       "port" => 6107,
     ],
   ]

Making a connection verifies this is psi:

ebernhardson@mwmaint1002:~$ curl localhost:6107
{
  "name" : "cloudelastic1001-cloudelastic-psi-eqiad",
   ...

Referencing mediawiki-config we can see this should be cloudelastic-psi:

'cloudelastic-psi' => [
    [ // forwarded to https://cloudelastic.wikimedia.org:9443/
        'host' => 'localhost',
        'transport' => 'Http',
        'port' => 6107,
    ],
],

Overall we will need to do the santa thing, make a list and check it twice, before deleting all the unused indices from cloudelastic.

Pondering this, first step should probably be closing rather than deleting the indices. Closed indices can be easily reopened if we start getting errors from CirrusSearch that we closed an active index. Without errors after some reasonable time period the indices can be safely deleted.
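
The close-first approach maps onto three Elasticsearch index APIs: `POST /<index>/_close`, `POST /<index>/_open`, and `DELETE /<index>`. A minimal sketch that only builds the requests, assuming a hypothetical base URL and helper name; nothing is sent unless the returned object is passed to `urllib.request.urlopen`.

```python
import urllib.request


def index_request(base_url, index, action):
    """Build (but do not send) the HTTP request for one lifecycle step.

    action is one of "close", "open", or "delete".
    """
    routes = {
        "close": ("POST", f"/{index}/_close"),   # reversible
        "open": ("POST", f"/{index}/_open"),     # undo a close
        "delete": ("DELETE", f"/{index}"),       # irreversible
    }
    method, path = routes[action]
    return urllib.request.Request(base_url.rstrip("/") + path, method=method)


# Hypothetical endpoint, for illustration only.
req = index_request("https://elastic.example.invalid:9443",
                    "acewiki_content_1582068728", "close")
```

Keeping the delete behind a separate, explicit action mirrors the plan above: close first, watch for errors, and only then issue the irreversible call.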

Note that T279607 also has some indices to clean up; it might make sense to address both at the same time.

Change 682189 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] Reconcile configured indices with live state

https://gerrit.wikimedia.org/r/682189

Ran the script from the change above. The initial report is in P15515, with 1655 indices across clusters that don't match the set of indices expected to exist. On cloudelastic I've closed all indices where the script identified another index as the current live index, such as when the index is on the wrong cluster or is left over from a failed reindex. Watching logstash I don't see anything new complaining, so these indices were probably classified correctly. Will wait until Monday to actually delete anything though.
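
To illustrate the "another index is the current live index" criterion, here is a rough sketch. It is not the script from the change above. It assumes index names follow the `<wiki>_<type>_<timestamp>` pattern seen in the listings, and that a caller-supplied mapping says which index is live for each base name. The function names are hypothetical, and which index is live in the sample input is assumed for illustration.

```python
def base_name(index):
    """Strip a trailing numeric suffix: 'acewiki_content_1582068728'
    becomes 'acewiki_content'. Names without one are returned as-is."""
    base, sep, suffix = index.rpartition("_")
    return base if sep and suffix.isdigit() else index


def superseded(indices, live_by_base):
    """Return (index, live_index) pairs for every index whose base name
    is served by a different live index."""
    stale = []
    for name in sorted(indices):
        live = live_by_base.get(base_name(name))
        if live is not None and live != name:
            stale.append((name, live))
    return stale


# Illustrative input; the live mapping here is an assumption.
found = ["acewiki_content_1582068728", "acewiki_general_1582068795"]
live = {
    "acewiki_content": "acewiki_content_1605055456",
    "acewiki_general": "acewiki_general_1605055524",
}
candidates = superseded(found, live)
```

Indices with no known live counterpart are deliberately left out of the result, since those are the cases that call for manual review rather than a scripted close.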

Will do a bit more manual review for the remaining 48 problem indices. I could likely close them in a scripted fashion like the other 1600, but since there aren't that many and mistakes on the prod clusters are more painful, I've decided on manual review and close for the moment.

Cleared out the remaining 48 indices from the prod clusters on Friday. Checking the weekend logs I don't see anything particularly suspicious, so I will go ahead and delete all the closed indices. Separately, we seem to be sending archive deletes to cloudelastic even though cloudelastic doesn't have archive indices. These result in job queue failures and log noise that we should clean up, even if the invalid deletes have no direct negative effects.

The main purpose of this task is complete: the indices are cleaned up. Still need to finish code review on the scripts that were used so they are available in the future. Also need to update wikitech; I think there are a few hacky versions of finding stale indices in there.

Change 682189 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Reconcile configured indices with live state

https://gerrit.wikimedia.org/r/682189

Updated wikitech, dropping the section on clearing out duplicate titlesuggest indices and updating the section on removing duplicate indices. This should now be complete.