
Move 100% of external traffic to Kubernetes
Closed, Resolved (Public)

Assigned To
Authored By
Clement_Goubert
Apr 11 2024, 12:14 PM
Referenced Files
F55438321: image.png
Jun 18 2024, 3:00 PM
Tokens

  • "Fox" token, awarded by TheresNoTime
  • "Mountain of Wealth" token, awarded by Lucas_Werkmeister_WMDE
  • "Mountain of Wealth" token, awarded by Jdforrester-WMF
  • "Barnstar" token, awarded by taavi
  • "Burninate" token, awarded by jijiki
  • "Love" token, awarded by Ladsgroup
  • "Stroopwafel" token, awarded by hnowlan

Description

This is (almost) the final step!

Progressively forward the remaining 30% of external traffic to MW-on-K8s

What?

Wikikube cluster will be fully serving:

  • External traffic (API, web, mobile)
  • Internal traffic
  • MediaWiki jobs (former jobrunners)
  • Commons

In MW-on-K8s terms, this translates to the following deployments:

  • mw-web
  • mw-api-int
  • mw-api-ext
  • mw-jobrunner
  • mw-parsoid
  • mw-wikifunctions

Things that are pending migration

External traffic

Karriere

Internal traffic

Progression

  • 75%
  • 80%
  • 85%
  • 90%
  • 95%
  • 100%

Cleanup

  • T367949: Spin down api_appserver and appserver clusters
  • The Icinga checks related to the Etcd last index. Starting from modules/icinga/manifests/monitor/etcd_mw_config.pp and all the related scripts and timers installed and running on the icinga hosts. Plus the files generated by the scripts themselves.

Notes

The above is per DC.

Details

Subject  Repo                          Branch      Lines +/-
         operations/deployment-charts  master      +3 -3
         operations/puppet             production  +0 -4
         operations/puppet             production  +4 -0
         operations/puppet             production  +0 -4
         operations/alerts             master      +11 -156
         operations/puppet             production  +1 -1
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +2 -2
         operations/puppet             production  +1 -1
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +2 -2
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +4 -4
         operations/puppet             production  +17 -22
         operations/puppet             production  +1 -1
         operations/puppet             production  +20 -9
         operations/puppet             production  +1 -1
         operations/puppet             production  +1 -1
         operations/puppet             production  +1 -1
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +5 -0
         operations/alerts             master      +1 -3
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +2 -2
         operations/puppet             production  +1 -1
         operations/puppet             production  +21 -16
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +3 -3
         operations/puppet             production  +16 -11
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +2 -2

Event Timeline


Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1405.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1371.eqiad.wmnet with OS bullseye completed:

  • mw1371 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021132_hnowlan_3953702_mw1371.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1409.eqiad.wmnet with OS bullseye completed:

  • mw1409 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021135_hnowlan_3953708_mw1409.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1405.eqiad.wmnet with OS bullseye completed:

  • mw1405 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021137_hnowlan_3953738_mw1405.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1435.eqiad.wmnet with OS bullseye completed:

  • mw1435 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021139_hnowlan_3953714_mw1435.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1399.eqiad.wmnet with OS bullseye completed:

  • mw1399 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021143_hnowlan_3953743_mw1399.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1026160 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: bump replicas

https://gerrit.wikimedia.org/r/1026160

Change #1026159 merged by Hnowlan:

[operations/puppet@production] trafficserver: move 85% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1026159

Change #1028840 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: make 5 eqiad api appservers k8s workers

https://gerrit.wikimedia.org/r/1028840

Change #1028842 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mw-web, mw-api-ext: bump replicas in advance of traffic shift

https://gerrit.wikimedia.org/r/1028842

Change #1028844 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: move k8s traffic shift to 90%

https://gerrit.wikimedia.org/r/1028844

We are currently holding at 85% of global traffic, and as such are not reimaging any more servers (except for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes), because we do not want to start migrating Commons to mw-on-k8s without the hardware to move it back in case of emergency. T363307 gets a pass because of its benefits for cluster speed and stability.

Once T295007: Upload by URL should use the job queue, possibly chunked with range requests and T118887: Upload by URL doesn't work well for large files: HTTP request timed out. are resolved, we should move commons traffic in relatively big increments (20 or 25% at a time), running mw-web and mw-api-ext at higher php-fpm utilization levels as long as latency is under control. Based on earlier testing, we can aim for:

  • 60% average utilization
  • saturation alert at 75%

If latency starts being impacted by the increase in traffic, we should perform ad-hoc replica increases. To allow these replica increases to happen without reimages, I will be tweaking the RollingUpdate strategy's maxUnavailable and testing deployments to see how low we can go on available CPU without breaking them.
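The sizing arithmetic above can be sketched in a few lines. This is an illustrative back-of-the-envelope model only: the function names and all numbers are made up, the linear-scaling assumption is mine, and the real replica counts live in the deployment charts.

```python
import math

def replicas_for_traffic(current_replicas: int, current_share: float,
                         target_share: float, current_util: float,
                         target_util: float = 0.60) -> int:
    """Scale replicas so average php-fpm utilization stays at
    target_util when the traffic share grows from current_share
    to target_share (assumes load scales linearly with traffic)."""
    scale = (target_share / current_share) * (current_util / target_util)
    return math.ceil(current_replicas * scale)

def max_unavailable_pods(replicas: int, fraction: float = 0.06) -> int:
    """Kubernetes rounds a percentage maxUnavailable *down* to whole
    pods, so small deployments may tolerate zero unavailable pods."""
    return math.floor(replicas * fraction)

# Hypothetical example: holding at 85% with 120 replicas at 55%
# utilization, going to 100% while letting utilization rise to 60%:
replicas_for_traffic(120, 0.85, 1.00, 0.55)  # -> 130
max_unavailable_pods(130)                    # -> 7 pods down at once
```

The maxUnavailable rounding is why the 6% bump matters: at small replica counts a low percentage floors to zero pods, stalling rollouts on a CPU-constrained cluster.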

If we really need more hardware, it should be taken from the appserver cluster first.

Change #1031430 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/alerts@master] mw-on-k8s: Raise saturation threshold to 75%

https://gerrit.wikimedia.org/r/1031430

Change #1031430 merged by jenkins-bot:

[operations/alerts@master] mw-on-k8s: Raise saturation threshold to 75%

https://gerrit.wikimedia.org/r/1031430

Change #1031844 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-on-k8s: Bump maxUnavailable to 6%

https://gerrit.wikimedia.org/r/1031844

Change #1031844 merged by jenkins-bot:

[operations/deployment-charts@master] mw-on-k8s: Bump maxUnavailable to 6%

https://gerrit.wikimedia.org/r/1031844

Mentioned in SAL (#wikimedia-operations) [2024-05-15T15:49:34Z] <cgoubert@deploy1002> Started scap: mw-on-k8s: Bump maxUnavailable to 6% - T362323

Mentioned in SAL (#wikimedia-operations) [2024-05-15T15:51:00Z] <cgoubert@deploy1002> Finished scap: mw-on-k8s: Bump maxUnavailable to 6% - T362323 (duration: 02m 01s)

Change #1032497 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: migrate 5% of traffic to commons

https://gerrit.wikimedia.org/r/1032497

Change #1032497 merged by Hnowlan:

[operations/puppet@production] trafficserver: migrate 5% of commons traffic to k8s

https://gerrit.wikimedia.org/r/1032497

Change #1032828 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: move to 15% traffic split for commons

https://gerrit.wikimedia.org/r/1032828

Change #1032828 merged by Hnowlan:

[operations/puppet@production] trafficserver: move to 15% traffic split for commons

https://gerrit.wikimedia.org/r/1032828

Change #1034043 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: move commons-on-k8s to 30%

https://gerrit.wikimedia.org/r/1034043

Change #1034043 merged by Hnowlan:

[operations/puppet@production] trafficserver: move commons-on-k8s to 30%

https://gerrit.wikimedia.org/r/1034043

Change #1034088 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: move commons-on-k8s to 100%

https://gerrit.wikimedia.org/r/1034088

Change #1034088 merged by Hnowlan:

[operations/puppet@production] trafficserver: move commons-on-k8s to 100%

https://gerrit.wikimedia.org/r/1034088

akosiaris renamed this task from Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) to Move 100% of external traffic to Kubernetes (excluding Votewiki). May 24 2024, 11:13 AM
akosiaris updated the task description.

Note that the votewiki blocker is apparently also now fixed.

As far as MediaWiki calling itself goes (I see it was removed from the task description, but it is technically still pending): with everything now using $wgLocalHTTPProxy, we can either change the bare-metal listeners so they call MW-on-K8s, or simply keep moving external traffic and let the related internal backend traffic follow. I am inclined to choose the second option.
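The second option can be illustrated with a toy simulation. This is a hypothetical sketch; the real split is a weighted rule in the trafficserver configuration, and the function names here are invented for illustration.

```python
import random

def route_external(k8s_share: float) -> str:
    """Weighted split at the edge: a request lands on mw-on-k8s with
    probability k8s_share (stand-in for the trafficserver weighting)."""
    return "mw-on-k8s" if random.random() < k8s_share else "bare-metal"

def route_internal(serving_platform: str) -> str:
    """With $wgLocalHTTPProxy, MediaWiki's internal self-calls go
    through the local proxy of whichever platform served the request,
    so internal traffic follows the external split automatically."""
    return serving_platform

# At a 90% external split, roughly 90% of the related internal
# traffic follows along without any listener change.
random.seed(0)
k8s_internal = sum(route_internal(route_external(0.90)) == "mw-on-k8s"
                   for _ in range(10_000))
```

This is why simply continuing to move external traffic suffices: no bare-metal listener needs to be repointed for the internal share to converge on 100% as well.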

> Note that the votewiki blocker is apparently also now fixed.

Yes, I commented on the task to ask what's needed for testing in MW-on-K8s.

Change #1028840 merged by Clément Goubert:

[operations/puppet@production] kubernetes: rename and repurpose 5 api appservers as k8s workers

https://gerrit.wikimedia.org/r/1028840

Change #1038732 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 90% traffic

https://gerrit.wikimedia.org/r/1038732

Change #1038735 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: Migrate votewiki to k8s

https://gerrit.wikimedia.org/r/1038735

Ladsgroup renamed this task from Move 100% of external traffic to Kubernetes (excluding Votewiki) to Move 100% of external traffic to Kubernetes. Jun 4 2024, 10:18 AM

Change #1038735 merged by Clément Goubert:

[operations/puppet@production] trafficserver: Migrate votewiki to k8s

https://gerrit.wikimedia.org/r/1038735

Mentioned in SAL (#wikimedia-operations) [2024-06-04T10:23:24Z] <claime> Migrating votewiki to mw-on-k8s - T362323

Change #1038757 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: rename and reimage 3 api appservers, 2 appservers

https://gerrit.wikimedia.org/r/1038757

Change #1038757 merged by Hnowlan:

[operations/puppet@production] kubernetes: rename and reimage 3 api appservers, 2 appservers

https://gerrit.wikimedia.org/r/1038757

Change #1039196 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 95% traffic

https://gerrit.wikimedia.org/r/1039196

Change #1038732 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 90% traffic

https://gerrit.wikimedia.org/r/1038732

Change #1041589 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: move 90% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1041589

Change #1041589 merged by Clément Goubert:

[operations/puppet@production] trafficserver: move 90% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1041589

Mentioned in SAL (#wikimedia-operations) [2024-06-11T10:45:32Z] <claime> move 90% of traffic to mw-on-k8s - T362323

Change #1039196 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 95% traffic

https://gerrit.wikimedia.org/r/1039196

Change #1042205 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: move 95% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1042205

Change #1028844 abandoned by Hnowlan:

[operations/puppet@production] trafficserver: move k8s traffic shift to 95%

Reason:

Done in another change

https://gerrit.wikimedia.org/r/1028844

Change #1042205 merged by Clément Goubert:

[operations/puppet@production] trafficserver: move 95% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1042205

Change #1047046 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 100% traffic

https://gerrit.wikimedia.org/r/1047046

Change #1047047 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: move 95% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1047047

Change #1047046 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 100% traffic

https://gerrit.wikimedia.org/r/1047046

Change #1047047 merged by Giuseppe Lavagetto:

[operations/puppet@production] trafficserver: move 100% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1047047

Mentioned in SAL (#wikimedia-operations) [2024-06-18T14:24:14Z] <claime> trafficserver: move 100% of traffic to mw-on-k8s - T362323

Change #1047107 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] httpbb: Remove appserver hourly tests

https://gerrit.wikimedia.org/r/1047107

Change #1047115 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] statograph: Use k8s envoy metric for statuspage

https://gerrit.wikimedia.org/r/1047115

Change #1047115 merged by Clément Goubert:

[operations/puppet@production] statograph: Use k8s envoy metric for statuspage

https://gerrit.wikimedia.org/r/1047115

Mentioned in SAL (#wikimedia-operations) [2024-06-18T16:23:47Z] <claime> resetting Wiki response time metric on wikimedia.statuspage.io following complete switch to k8s - T362323

Change #1047138 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] statograph: Use benthos query to save thanos

https://gerrit.wikimedia.org/r/1047138

Mentioned in SAL (#wikimedia-operations) [2024-06-18T17:21:09Z] <cdanis> resetting Wiki response time metric on wikimedia.statuspage.io following complete switch to k8s - T362323 T367894

Change #1047439 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/alerts@master] mediawiki: Remove bare-metal cluster alerts

https://gerrit.wikimedia.org/r/1047439

Change #1047439 merged by jenkins-bot:

[operations/alerts@master] mediawiki: Remove bare-metal cluster alerts

https://gerrit.wikimedia.org/r/1047439

Change #1047107 merged by Clément Goubert:

[operations/puppet@production] httpbb: Remove appserver hourly tests

https://gerrit.wikimedia.org/r/1047107

Change #1049185 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] httpbb: empty host list

https://gerrit.wikimedia.org/r/1049185

Change #1049186 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] httpbb: Remove unused appserver tests

https://gerrit.wikimedia.org/r/1049186

Change #1049185 merged by Clément Goubert:

[operations/puppet@production] httpbb: empty host list

https://gerrit.wikimedia.org/r/1049185

Change #1049186 merged by Clément Goubert:

[operations/puppet@production] httpbb: Remove unused appserver tests

https://gerrit.wikimedia.org/r/1049186

Change #1028842 abandoned by Hnowlan:

[operations/deployment-charts@master] mw-web, mw-api-ext: bump replicas in advance of traffic shift

Reason:

already bumped

https://gerrit.wikimedia.org/r/1028842

Should we call this Resolved and track the remaining migrations in the parent, T290536?

/me shakes fist at Phorge for not letting me award this task another token

🪙🪙🪙🪙🪙