
DC-Ops (Group)
Active · Public

Members (7)

Watchers (1)

Details

Description

Tasks handled by the Wikimedia Foundation's datacenter operations team, which is a sub-team of the SRE department.

This project includes the sub-projects procurement and decommission-hardware, as well as every datacenter site-specific project: ops-codfw, ops-drmrs, ops-eqdfw, ops-eqiad, ops-eqord, ops-esams, ops-eqsin, ops-ulsfo, and ops-magru.

This can be linked to via: https://phabricator.wikimedia.org/tag/dc-ops/

Please note that any Wikitech documentation maintained by DC-Ops is linked from https://wikitech.wikimedia.org/wiki/Dc-operations

SLAs

DC-Ops makes every attempt to resolve all tasks and requests in a timely manner. We've implemented the following SLA targets.

Please note that no SLA clock starts until the task has both reached the start condition listed below and been given the proper project tags. See the section for each type of request below for details, and please use the templates listed below.

| Project | Days to Resolve | SLA start | Template |
| --- | --- | --- | --- |
| procurement | 90 | Date of Task filing | Procurement Template |
| Racking/Installation | 30 | Arrival of Hardware to DC site | |
| Hardware Failure / Repair | 10 | Date of Task filing | Hardware Failure Template |
| Decommission | 45 | When all sub-team steps are complete and task is assigned to on-site | Server Decommission Template |
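
As a rough illustration of how the targets above translate into due dates, the following sketch adds each target to its SLA start date. It assumes the targets are counted in calendar days (the table does not say whether weekends count). The names `SLA_DAYS` and `sla_due_date` and the dictionary keys are hypothetical and not part of any DC-Ops tooling.

```python
from datetime import date, timedelta

# Targets copied from the SLA table; the keys are illustrative labels only.
SLA_DAYS = {
    "procurement": 90,
    "racking": 30,
    "hardware-failure": 10,
    "decommission": 45,
}


def sla_due_date(kind: str, sla_start: date) -> date:
    """Return the date by which a task of the given kind should be resolved.

    Assumption: targets are calendar days counted from the SLA start.
    """
    return sla_start + timedelta(days=SLA_DAYS[kind])


if __name__ == "__main__":
    # Hypothetical example: a hardware failure task filed on 2024-08-03.
    print(sla_due_date("hardware-failure", date(2024, 8, 3)))
```

Note that the start date passed in must be the SLA start for that request type (for example, hardware arrival for racking), not necessarily the task creation date.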

Hardware Repair

If you need to file a task requesting hardware troubleshooting, please use the File Hardware Failure Task link here or in the navbar on the left.

Troubleshooting includes hardware failures, RAID reconfiguration, etc.

A full runbook on how to troubleshoot hardware failures can be viewed here: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook

Requesting Hardware

If you have a budget line item and want to request pricing, please file your procurement request via this link. If you do not yet have a budget line for the request in this fiscal year, you can still file via that link; simply note in the relevant section of the task that there is no budget allocation.

Once hardware has been ordered, a racking task must be entered using the form. This form may also be used if a system has to be moved and re-imaged.

Decommissioning Hardware

This covers all hardware being returned to DC-Ops, whether for processing into spares or for being put into decommission state and removed from the rack.

Any hardware no longer required for use should have a task filed for decommission via the pre-defined server decommission request form.

Netbox Reporting

The template for Netbox report errors is here: https://phabricator.wikimedia.org/maniphest/task/edit/form/133/

Recent Activity

Today

ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1260.eqiad.wmnet with OS bullseye executed with errors:

  • wikikube-worker1260 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-worker1260.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Sat, Aug 3, 3:09 AM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.
Sat, Aug 3, 2:22 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1269.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1269 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030205_jclark_4062815_wikikube-worker1269.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 2:22 AM · SRE, serviceops, ops-eqiad, DC-Ops
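
The bot comment headers in this feed follow a regular pattern, so a reader who wants to tally which hosts passed or failed could extract the fields with a regular expression. The sketch below is modelled only on the header lines visible here. The names `HEADER_RE` and `parse_header` are hypothetical, and other cookbooks or message formats may not match.

```python
import re
from typing import Optional

# Pattern inferred from the headers shown in this feed (an assumption, not a
# documented format): "... started by USER@ORIGIN for host HOST with OS OS ...".
HEADER_RE = re.compile(
    r"Cookbook (?P<cookbook>\S+) (?:was )?started by (?P<user>[^@\s]+)@(?P<origin>\S+) "
    r"for host (?P<host>\S+) with OS (?P<os>\S+)"
    r"(?: (?P<outcome>completed|executed with errors):)?"
)


def parse_header(line: str) -> Optional[dict]:
    """Extract cookbook, user, origin, host, OS and outcome from a header line."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # A header without an outcome is the "was started" notification.
    fields["outcome"] = fields["outcome"] or "started"
    return fields


if __name__ == "__main__":
    sample = (
        "Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 "
        "for host wikikube-worker1269.eqiad.wmnet with OS bullseye completed:"
    )
    print(parse_header(sample))
```

Lines that do not look like a cookbook header return `None`, so the function can be run over every line of the feed without pre-filtering.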
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1266.eqiad.wmnet with OS bullseye executed with errors:

  • wikikube-worker1266 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-worker1266.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Sat, Aug 3, 2:15 AM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.
Sat, Aug 3, 2:01 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1260.eqiad.wmnet with OS bullseye

Sat, Aug 3, 1:49 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1268.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1268 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030130_jclark_4055036_wikikube-worker1268.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 1:48 AM · SRE, serviceops, ops-eqiad, DC-Ops
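
The log paths quoted in the PASS entries appear to share a filename layout of timestamp, user, a numeric identifier, and short hostname, separated by underscores. The sketch below splits a path on that assumption. The layout is inferred from the paths in this feed rather than from documentation, the meaning of the numeric field is not stated here (it is called `run_id` purely for illustration), and `ReimageLog` and `parse_log_path` are hypothetical names.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    timestamp: datetime  # parsed from the YYYYMMDDHHMM prefix; timezone unknown
    user: str
    run_id: int          # numeric field of unstated meaning (assumption)
    host: str


def parse_log_path(path: str) -> ReimageLog:
    """Split a reimage log path into its underscore-separated filename fields."""
    stem = PurePosixPath(path).stem  # drops the directory and the ".out" suffix
    # Assumes the user name itself contains no underscore.
    stamp, user, run_id, host = stem.split("_", 3)
    return ReimageLog(datetime.strptime(stamp, "%Y%m%d%H%M"), user, int(run_id), host)


if __name__ == "__main__":
    print(parse_log_path(
        "/var/log/spicerack/sre/hosts/reimage/"
        "202408030130_jclark_4055036_wikikube-worker1268.out"
    ))
```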
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1269.eqiad.wmnet with OS bullseye

Sat, Aug 3, 1:46 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1267.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1267 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030128_jclark_4054700_wikikube-worker1267.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 1:45 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1260.eqiad.wmnet with OS bullseye executed with errors:

  • wikikube-worker1260 (FAIL)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • The reimage failed, see the cookbook logs for the details. You can also try typing "install-console" wikikube-worker1260.eqiad.wmnet to get a root shell, but depending on the failure this may not work.
Sat, Aug 3, 1:37 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1268.eqiad.wmnet with OS bullseye

Sat, Aug 3, 1:12 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1265.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1265 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030053_jclark_4041648_wikikube-worker1265.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 1:11 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1267.eqiad.wmnet with OS bullseye

Sat, Aug 3, 1:09 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1264.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1264 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030050_jclark_4041233_wikikube-worker1264.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 1:08 AM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.
Sat, Aug 3, 1:06 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1263.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1263 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030047_jclark_4038673_wikikube-worker1263.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 1:05 AM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.
Sat, Aug 3, 12:59 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1262.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1262 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030040_jclark_4034810_wikikube-worker1262.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 12:57 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1266.eqiad.wmnet with OS bullseye

Sat, Aug 3, 12:55 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1261.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1261 (PASS)
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Add puppet_version metadata to Debian installer
    • Checked BIOS boot parameters are back to normal
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030037_jclark_4034779_wikikube-worker1261.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 12:55 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1265.eqiad.wmnet with OS bullseye

Sat, Aug 3, 12:33 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1258.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1258 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030022_jclark_4034271_wikikube-worker1258.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 12:33 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1264.eqiad.wmnet with OS bullseye

Sat, Aug 3, 12:30 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1259.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1259 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030019_jclark_4034288_wikikube-worker1259.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 12:29 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1263.eqiad.wmnet with OS bullseye

Sat, Aug 3, 12:27 AM · SRE, serviceops, ops-eqiad, DC-Ops
Jclark-ctr updated the task description for T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.
Sat, Aug 3, 12:22 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1262.eqiad.wmnet with OS bullseye

Sat, Aug 3, 12:18 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1261.eqiad.wmnet with OS bullseye

Sat, Aug 3, 12:18 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1256.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1256 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030007_jclark_4025538_wikikube-worker1256.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 12:17 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1260.eqiad.wmnet with OS bullseye

Sat, Aug 3, 12:17 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1259.eqiad.wmnet with OS bullseye

Sat, Aug 3, 12:14 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1258.eqiad.wmnet with OS bullseye

Sat, Aug 3, 12:14 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1254.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1254 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408030002_jclark_4025526_wikikube-worker1254.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 12:13 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1255.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1255 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408022356_jclark_4025678_wikikube-worker1255.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 12:07 AM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1257.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1257 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408022354_jclark_4025565_wikikube-worker1257.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Sat, Aug 3, 12:05 AM · SRE, serviceops, ops-eqiad, DC-Ops

Yesterday

ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1255.eqiad.wmnet with OS bullseye

Fri, Aug 2, 11:49 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1256.eqiad.wmnet with OS bullseye

Fri, Aug 2, 11:48 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1254.eqiad.wmnet with OS bullseye

Fri, Aug 2, 11:48 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1257.eqiad.wmnet with OS bullseye

Fri, Aug 2, 11:48 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1253.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1253 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408022335_jclark_4019134_wikikube-worker1253.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Fri, Aug 2, 11:46 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1252.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1252 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408022333_jclark_4017219_wikikube-worker1252.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Fri, Aug 2, 11:44 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1251.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1251 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408022329_jclark_4017039_wikikube-worker1251.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Fri, Aug 2, 11:41 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host wikikube-worker1250.eqiad.wmnet with OS bullseye completed:

  • wikikube-worker1250 (PASS)
    • Downtimed on Icinga/Alertmanager
    • Removed from Puppet and PuppetDB if present and deleted any certificates
    • Removed from Debmonitor if present
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga/Alertmanager
    • Removed previous downtime on Alertmanager (old OS)
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202408022326_jclark_4016830_wikikube-worker1250.out
    • configmaster.wikimedia.org updated with the host new SSH public key for wmf-update-known-hosts-production
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
    • Updated Netbox status planned -> active
    • The sre.puppet.sync-netbox-hiera cookbook was run successfully
Fri, Aug 2, 11:36 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1253.eqiad.wmnet with OS bullseye

Fri, Aug 2, 11:30 PM · SRE, serviceops, ops-eqiad, DC-Ops
Maintenance_bot added a project to T371741: PDU sensor over limit: SRE.
Fri, Aug 2, 11:29 PM · SRE, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1252.eqiad.wmnet with OS bullseye

Fri, Aug 2, 11:26 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1251.eqiad.wmnet with OS bullseye

Fri, Aug 2, 11:24 PM · SRE, serviceops, ops-eqiad, DC-Ops
ops-monitoring-bot added a comment to T369743: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304.

Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host wikikube-worker1250.eqiad.wmnet with OS bullseye

Fri, Aug 2, 11:21 PM · SRE, serviceops, ops-eqiad, DC-Ops
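
The reimage comments above all share one shape: a "was started by" line when the cookbook begins, and a "started by ... completed:" line when it finishes. As a minimal sketch for anyone tallying progress on a batch like T369743, the snippet below reads those lines and reports the latest state per host. The function name `latest_reimage_state` and the regular expression are hypothetical and inferred only from the comments in this feed, not taken from the cookbook's source; it also assumes the lines are given newest first, as the feed shows them.

```python
import re

# Pattern inferred from the bot comments in this feed (an assumption, not an
# official format): "... [was] started by USER@CUMIN for host HOST with OS OS"
# with an optional trailing " completed:" on the finish comment.
REIMAGE_RE = re.compile(
    r"^Cookbook (?P<cookbook>\S+) (?:was )?started by "
    r"(?P<user>[^@\s]+)@(?P<cumin>\S+) for host (?P<host>\S+) "
    r"with OS (?P<os>\S+)(?P<done> completed:)?$"
)


def latest_reimage_state(lines):
    """Map each host to 'started' or 'completed', using its newest comment.

    Lines are expected newest first, so the first match seen for a host
    is its most recent state; later (older) matches are ignored.
    """
    state = {}
    for line in lines:
        match = REIMAGE_RE.match(line.strip())
        if match is None:
            continue  # not a reimage comment (timestamps, bullets, etc.)
        status = "completed" if match.group("done") else "started"
        state.setdefault(match.group("host"), status)
    return state


if __name__ == "__main__":
    sample = [
        "Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 "
        "for host wikikube-worker1255.eqiad.wmnet with OS bullseye",
        "Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 "
        "for host wikikube-worker1253.eqiad.wmnet with OS bullseye completed:",
        "Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 "
        "for host wikikube-worker1253.eqiad.wmnet with OS bullseye",
    ]
    for host, status in latest_reimage_state(sample).items():
        print(host, status)
```

Keeping only the first match per host is a deliberate choice for a newest-first feed; for a chronological export, the lines would need to be reversed first.
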
phaultfinder created T371741: PDU sensor over limit.
Fri, Aug 2, 11:19 PM · SRE, ops-eqiad, DC-Ops
gerritbot added a comment to T363399: Q4:rack/setup/install parsoidtest1001.

Change #1053791 abandoned by Dzahn:

[operations/puppet@production] scap: remove scandium from dsh groups

Reason:

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1024402 seems stalled since April

https://gerrit.wikimedia.org/r/1053791

Fri, Aug 2, 9:17 PM · Patch-For-Review, SRE, serviceops, ops-eqiad, DC-Ops