
Move 100% of external traffic to Kubernetes
Closed, Resolved (Public)

Assigned To
Authored By
Clement_Goubert
Apr 11 2024, 12:14 PM
Referenced Files
F55438321: image.png
Jun 18 2024, 3:00 PM
Tokens

  • "Fox" token, awarded by TheresNoTime
  • "Mountain of Wealth" token, awarded by Lucas_Werkmeister_WMDE
  • "Mountain of Wealth" token, awarded by Jdforrester-WMF
  • "Barnstar" token, awarded by taavi
  • "Burninate" token, awarded by jijiki
  • "Love" token, awarded by Ladsgroup
  • "Stroopwafel" token, awarded by hnowlan

Description

This is (almost) the final step!

Progressively forward the remaining 30% of external traffic to MW-on-K8s

What?

Wikikube cluster will be fully serving:

  • External traffic (API, web, mobile)
  • Internal traffic
  • MediaWiki jobs (former jobrunners)
  • Commons

In MW-on-K8s terms, this translates to the following deployments:

  • mw-web
  • mw-api-int
  • mw-api-ext
  • mw-jobrunner
  • mw-parsoid
  • mw-wikifunctions

Things that are pending migration

External traffic

Karriere

Internal traffic

Progression

  • 75%
  • 80%
  • 85%
  • 90%
  • 95%
  • 100%

Cleanup

  • T367949: Spin down api_appserver and appserver clusters
  • The Icinga checks related to the Etcd last index. Starting from modules/icinga/manifests/monitor/etcd_mw_config.pp and all the related scripts and timers installed and running on the icinga hosts. Plus the files generated by the scripts themselves.

Notes

The above is per DC.

Details

Subject  Repo                          Branch      Lines +/-
         operations/deployment-charts  master      +3 -3
         operations/puppet             production  +0 -4
         operations/puppet             production  +4 -0
         operations/puppet             production  +0 -4
         operations/alerts             master      +11 -156
         operations/puppet             production  +1 -1
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +2 -2
         operations/puppet             production  +1 -1
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +2 -2
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +4 -4
         operations/puppet             production  +17 -22
         operations/puppet             production  +1 -1
         operations/puppet             production  +20 -9
         operations/puppet             production  +1 -1
         operations/puppet             production  +1 -1
         operations/puppet             production  +1 -1
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +5 -0
         operations/alerts             master      +1 -3
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +2 -2
         operations/puppet             production  +1 -1
         operations/puppet             production  +21 -16
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +3 -3
         operations/puppet             production  +16 -11
         operations/puppet             production  +1 -1
         operations/deployment-charts  master      +2 -2

Event Timeline


Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1405.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1371.eqiad.wmnet with OS bullseye completed:

  • mw1371 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021132_hnowlan_3953702_mw1371.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1409.eqiad.wmnet with OS bullseye completed:

  • mw1409 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021135_hnowlan_3953708_mw1409.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1405.eqiad.wmnet with OS bullseye completed:

  • mw1405 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021137_hnowlan_3953738_mw1405.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1435.eqiad.wmnet with OS bullseye completed:

  • mw1435 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021139_hnowlan_3953714_mw1435.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1399.eqiad.wmnet with OS bullseye completed:

  • mw1399 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202405021143_hnowlan_3953743_mw1399.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change #1026160 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: bump replicas

https://gerrit.wikimedia.org/r/1026160

Change #1026159 merged by Hnowlan:

[operations/puppet@production] trafficserver: move 85% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1026159

Change #1028840 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: make 5 eqiad api appservers k8s workers

https://gerrit.wikimedia.org/r/1028840

Change #1028842 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mw-web, mw-api-ext: bump replicas in advance of traffic shift

https://gerrit.wikimedia.org/r/1028842

Change #1028844 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: move k8s traffic shift to 90%

https://gerrit.wikimedia.org/r/1028844

We are currently holding at 85% of global traffic, and as such are not reimaging any more servers (except for T363307: Co-locate kube-apiserver and etcd on new staging control plane nodes), because we do not want to start migrating Commons to mw-on-k8s without the hardware to move it back in case of emergency. T363307 gets a pass because of its benefits for cluster speed and stability.

Once T295007: Upload by URL should use the job queue, possibly chunked with range requests and T118887: Upload by URL doesn't work well for large files: HTTP request timed out. are resolved, we should move commons traffic in relatively big increments (20 or 25% at a time), running mw-web and mw-api-ext at higher php-fpm utilization levels as long as latency is under control. Based on earlier testing, we can aim for:

  • 60% average utilization
  • saturation alert at 75%

If latency starts being impacted by the increase in traffic, we should perform ad-hoc replica increases. To allow these replica increases to happen without reimages, I will be tweaking the RollingUpdate strategy's maxUnavailable and testing deployments to see how low we can go on available CPU without breaking them.
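The sizing arithmetic above can be sketched in a few lines. This is an illustrative back-of-the-envelope model only: the function names and all numbers are made up, the linear-scaling assumption is mine, and the real replica counts live in the deployment charts.

```python
import math

def replicas_for_traffic(current_replicas: int, current_share: float,
                         target_share: float, current_util: float,
                         target_util: float = 0.60) -> int:
    """Scale replicas so average php-fpm utilization stays at
    target_util when the traffic share grows from current_share
    to target_share (assumes load scales linearly with traffic)."""
    scale = (target_share / current_share) * (current_util / target_util)
    return math.ceil(current_replicas * scale)

def max_unavailable_pods(replicas: int, fraction: float = 0.06) -> int:
    """Kubernetes rounds a percentage maxUnavailable *down* to whole
    pods, so small deployments may tolerate zero unavailable pods."""
    return math.floor(replicas * fraction)

# Hypothetical example: holding at 85% with 120 replicas at 55%
# utilization, going to 100% while letting utilization rise to 60%:
replicas_for_traffic(120, 0.85, 1.00, 0.55)  # -> 130
max_unavailable_pods(130)                    # -> 7 pods down at once
```

The maxUnavailable rounding is why the 6% bump matters: at small replica counts a low percentage floors to zero pods, stalling rollouts on a CPU-constrained cluster.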

If we really need more hardware, it should be taken from the appserver cluster first.

Change #1031430 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/alerts@master] mw-on-k8s: Raise saturation threshold to 75%

https://gerrit.wikimedia.org/r/1031430

Change #1031430 merged by jenkins-bot:

[operations/alerts@master] mw-on-k8s: Raise saturation threshold to 75%

https://gerrit.wikimedia.org/r/1031430

Change #1031844 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-on-k8s: Bump maxUnavailable to 6%

https://gerrit.wikimedia.org/r/1031844

Change #1031844 merged by jenkins-bot:

[operations/deployment-charts@master] mw-on-k8s: Bump maxUnavailable to 6%

https://gerrit.wikimedia.org/r/1031844

Mentioned in SAL (#wikimedia-operations) [2024-05-15T15:49:34Z] <cgoubert@deploy1002> Started scap: mw-on-k8s: Bump maxUnavailable to 6% - T362323

Mentioned in SAL (#wikimedia-operations) [2024-05-15T15:51:00Z] <cgoubert@deploy1002> Finished scap: mw-on-k8s: Bump maxUnavailable to 6% - T362323 (duration: 02m 01s)

Change #1032497 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: migrate 5% of traffic to commons

https://gerrit.wikimedia.org/r/1032497

Change #1032497 merged by Hnowlan:

[operations/puppet@production] trafficserver: migrate 5% of commons traffic to k8s

https://gerrit.wikimedia.org/r/1032497

Change #1032828 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: move to 15% traffic split for commons

https://gerrit.wikimedia.org/r/1032828

Change #1032828 merged by Hnowlan:

[operations/puppet@production] trafficserver: move to 15% traffic split for commons

https://gerrit.wikimedia.org/r/1032828

Change #1034043 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: move commons-on-k8s to 30%

https://gerrit.wikimedia.org/r/1034043

Change #1034043 merged by Hnowlan:

[operations/puppet@production] trafficserver: move commons-on-k8s to 30%

https://gerrit.wikimedia.org/r/1034043

Change #1034088 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] trafficserver: move commons-on-k8s to 100%

https://gerrit.wikimedia.org/r/1034088

Change #1034088 merged by Hnowlan:

[operations/puppet@production] trafficserver: move commons-on-k8s to 100%

https://gerrit.wikimedia.org/r/1034088

akosiaris renamed this task from Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) to Move 100% of external traffic to Kubernetes (excluding Votewiki). May 24 2024, 11:13 AM
akosiaris updated the task description.

Note that the votewiki blocker is apparently also now fixed.

As far as MediaWiki calling itself goes (I see it was removed from the task description, but it is technically still pending): with everything now using $wgLocalHTTPProxy, we can either change the bare-metal listeners so they call MW-on-K8s, or simply keep moving external traffic and let the related internal backend traffic follow. I am inclined to choose the second option.
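The second option can be illustrated with a toy simulation. This is a hypothetical sketch; the real split is a weighted rule in the trafficserver configuration, and the function names here are invented for illustration.

```python
import random

def route_external(k8s_share: float) -> str:
    """Weighted split at the edge: a request lands on mw-on-k8s with
    probability k8s_share (stand-in for the trafficserver weighting)."""
    return "mw-on-k8s" if random.random() < k8s_share else "bare-metal"

def route_internal(serving_platform: str) -> str:
    """With $wgLocalHTTPProxy, MediaWiki's internal self-calls go
    through the local proxy of whichever platform served the request,
    so internal traffic follows the external split automatically."""
    return serving_platform

# At a 90% external split, roughly 90% of the related internal
# traffic follows along without any listener change.
random.seed(0)
k8s_internal = sum(route_internal(route_external(0.90)) == "mw-on-k8s"
                   for _ in range(10_000))
```

This is why simply continuing to move external traffic suffices: no bare-metal listener needs to be repointed for the internal share to converge on 100% as well.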

> Note that the votewiki blocker is apparently also now fixed.

Yes, I commented on the task to ask what's needed for testing in MW-on-K8s.

Change #1028840 merged by Clément Goubert:

[operations/puppet@production] kubernetes: rename and repurpose 5 api appservers as k8s workers

https://gerrit.wikimedia.org/r/1028840

Change #1038732 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 90% traffic

https://gerrit.wikimedia.org/r/1038732

Change #1038735 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: Migrate votewiki to k8s

https://gerrit.wikimedia.org/r/1038735

Ladsgroup renamed this task from Move 100% of external traffic to Kubernetes (excluding Votewiki) to Move 100% of external traffic to Kubernetes. Jun 4 2024, 10:18 AM

Change #1038735 merged by Clément Goubert:

[operations/puppet@production] trafficserver: Migrate votewiki to k8s

https://gerrit.wikimedia.org/r/1038735

Mentioned in SAL (#wikimedia-operations) [2024-06-04T10:23:24Z] <claime> Migrating votewiki to mw-on-k8s - T362323

Change #1038757 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] kubernetes: rename and reimage 3 api appservers, 2 appservers

https://gerrit.wikimedia.org/r/1038757

Change #1038757 merged by Hnowlan:

[operations/puppet@production] kubernetes: rename and reimage 3 api appservers, 2 appservers

https://gerrit.wikimedia.org/r/1038757

Change #1039196 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 95% traffic

https://gerrit.wikimedia.org/r/1039196

Change #1038732 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 90% traffic

https://gerrit.wikimedia.org/r/1038732

Change #1041589 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: move 90% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1041589

Change #1041589 merged by Clément Goubert:

[operations/puppet@production] trafficserver: move 90% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1041589

Mentioned in SAL (#wikimedia-operations) [2024-06-11T10:45:32Z] <claime> move 90% of traffic to mw-on-k8s - T362323

Change #1039196 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 95% traffic

https://gerrit.wikimedia.org/r/1039196

Change #1042205 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: move 95% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1042205

Change #1028844 abandoned by Hnowlan:

[operations/puppet@production] trafficserver: move k8s traffic shift to 95%

Reason:

Done in another change

https://gerrit.wikimedia.org/r/1028844

Change #1042205 merged by Clément Goubert:

[operations/puppet@production] trafficserver: move 95% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1042205

Change #1047046 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 100% traffic

https://gerrit.wikimedia.org/r/1047046

Change #1047047 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] trafficserver: move 95% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1047047

Change #1047046 merged by jenkins-bot:

[operations/deployment-charts@master] mw-web, mw-api-ext: Raise replicas for 100% traffic

https://gerrit.wikimedia.org/r/1047046

Change #1047047 merged by Giuseppe Lavagetto:

[operations/puppet@production] trafficserver: move 100% of traffic to mw-on-k8s

https://gerrit.wikimedia.org/r/1047047

Mentioned in SAL (#wikimedia-operations) [2024-06-18T14:24:14Z] <claime> trafficserver: move 100% of traffic to mw-on-k8s - T362323

Change #1047107 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] httpbb: Remove appserver hourly tests

https://gerrit.wikimedia.org/r/1047107

Change #1047115 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] statograph: Use k8s envoy metric for statuspage

https://gerrit.wikimedia.org/r/1047115

Change #1047115 merged by Clément Goubert:

[operations/puppet@production] statograph: Use k8s envoy metric for statuspage

https://gerrit.wikimedia.org/r/1047115

Mentioned in SAL (#wikimedia-operations) [2024-06-18T16:23:47Z] <claime> resetting Wiki response time metric on wikimedia.statuspage.io following complete switch to k8s - T362323

Change #1047138 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] statograph: Use benthos query to save thanos

https://gerrit.wikimedia.org/r/1047138

Mentioned in SAL (#wikimedia-operations) [2024-06-18T17:21:09Z] <cdanis> resetting Wiki response time metric on wikimedia.statuspage.io following complete switch to k8s - T362323 T367894

Change #1047439 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/alerts@master] mediawiki: Remove bare-metal cluster alerts

https://gerrit.wikimedia.org/r/1047439

Change #1047439 merged by jenkins-bot:

[operations/alerts@master] mediawiki: Remove bare-metal cluster alerts

https://gerrit.wikimedia.org/r/1047439

Change #1047107 merged by Clément Goubert:

[operations/puppet@production] httpbb: Remove appserver hourly tests

https://gerrit.wikimedia.org/r/1047107

Change #1049185 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] httpbb: empty host list

https://gerrit.wikimedia.org/r/1049185

Change #1049186 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/puppet@production] httpbb: Remove unused appserver tests

https://gerrit.wikimedia.org/r/1049186

Change #1049185 merged by Clément Goubert:

[operations/puppet@production] httpbb: empty host list

https://gerrit.wikimedia.org/r/1049185

Change #1049186 merged by Clément Goubert:

[operations/puppet@production] httpbb: Remove unused appserver tests

https://gerrit.wikimedia.org/r/1049186

Change #1028842 abandoned by Hnowlan:

[operations/deployment-charts@master] mw-web, mw-api-ext: bump replicas in advance of traffic shift

Reason:

already bumped

https://gerrit.wikimedia.org/r/1028842

Should we call this Resolved and track the remaining migrations in the parent, T290536?

/me shakes fist at Phorge for not letting me award this task another token

🪙🪙🪙🪙🪙