
Study performance impact of disabling TCP selective acknowledgments
Closed, Resolved · Public

Description

Yesterday we disabled TCP SACKs [1] on the caches/bastions to mitigate a remote DoS in the TCP implementation in the Linux kernel [2].

SACKs were disabled on the caches between 17:30 and 18:00 UTC on Mon 17th, and on the remaining hosts starting at 8:20 UTC on Tue 18th (with Puppet runs taking up to 30 minutes after the merge of the patch).

Once a fixed kernel is fully deployed we can consider whether to re-enable TCP SACKs. Various options:

  • If there's no visible performance impact, we can keep it disabled (reduces exploitable complexity)
  • If there's a performance regression, re-enable on the caches only
  • If there's a performance regression, re-enable fleet-wide on hosts running a fixed Linux kernel (Puppet can apply this gradually as reboots happen; see the sketch below)
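
For illustration, a minimal Python sketch of that last option, assuming a hypothetical minimum fixed kernel version and the standard net.ipv4.tcp_sack sysctl; the real change would live in Puppet, not in a script like this:

```python
#!/usr/bin/env python3
"""Sketch only: enable TCP SACKs if and only if the running kernel looks
patched. The real rollout happens via Puppet; the version threshold is an
assumption for illustration, not the actual cut-off used."""
import platform
import re

MIN_FIXED_KERNEL = (4, 9, 168)  # hypothetical minimum "fixed" version


def fixed_kernel_running() -> bool:
    # On Debian, platform.version() looks like
    # '#1 SMP Debian 4.9.168-1+deb9u3 (2019-04-12)', which carries the
    # package version that matters here; take the first x.y.z found in it.
    m = re.search(r"(\d+)\.(\d+)\.(\d+)", platform.version())
    return bool(m) and tuple(int(g) for g in m.groups()) >= MIN_FIXED_KERNEL


def main() -> None:
    value = "1" if fixed_kernel_running() else "0"  # 1 = SACKs on, 0 = off
    # Requires root; equivalent to `sysctl -w net.ipv4.tcp_sack=<value>`.
    with open("/proc/sys/net/ipv4/tcp_sack", "w") as f:
        f.write(value + "\n")


if __name__ == "__main__":
    main()
```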

Footnotes:
[1] https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Selective_acknowledgments
[2] https://github.com/Netflix/security-bulletins/blob/master/advisories/third-party/2019-001.md

Event Timeline

loadEventEnd seems to have regressed around the time the change was deployed. In the week-over-week comparison you can see that the curves diverge past that point, with the difference increasing:

Capture d'écran 2019-06-18 11.42.03.png (407×476 px, 33 KB)

Ideally you want to re-enable this for half the fleet to verify, as daily/weekly traffic patterns add a lot of noise to the data. As long as a patched kernel is close on the horizon, I think we can wait for that before conducting such a test.

ema triaged this task as Medium priority. Jun 18 2019, 9:58 AM
ema added a project: Traffic.

As long as a patched kernel is close on the horizon, I think we can wait for that before conducting such a test.

Very close: the first cache reboots with a patched kernel are planned for this afternoon (EU time).

loadEventEnd seems to have regressed around the time the change was deployed

I'm gonna acknowledge the alert in Icinga as we're aware of this and there's no immediate action to take.

Mentioned in SAL (#wikimedia-operations) [2019-06-19T12:20:48Z] <ema> cache nodes: resume rolling reboots for kernel and varnish upgrades T224694 T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-19T16:01:26Z] <ema> cache nodes: stop rolling reboots for today, 47/80 done T224694 T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-20T09:17:36Z] <ema> cache nodes: resume rolling reboots for kernel and varnish upgrades T224694 T225998 T226048

All cache nodes are now running Linux 4.9.168-1+deb9u3. Next week we can thus re-enable SACKs on part of the cache fleet and further evaluate performance implications.

loadEventEnd seems to have regressed around the time the change was deployed. In the week-over-week comparison you can see that the curves diverge past that point, with the difference increasing:

Capture d'écran 2019-06-18 11.42.03.png (407×476 px, 33 KB)

Indeed, per the Navigation Timing dashboard, loadEventEnd shows a global increase of ~100 ms at the 75th percentile compared to last week, starting around June 15th (two days before this task was created) and still ongoing today.

Screenshot 2019-06-23 at 14.30.51.png (210 KB)

And it affects both desktop and mobile (link):

Screenshot 2019-06-23 at 14.32.04.png (239 KB)
Screenshot 2019-06-23 at 14.32.11.png (253 KB)

But, based on other nav timing metrics, this might not be network-related.

responseStart
Screenshot 2019-06-23 at 14.36.41.png (230 KB)

It did go up around the same date, but only by 10 milliseconds, and it has since come back down, whereas the load time regression hasn't.

response
Screenshot 2019-06-23 at 14.37.42.png (247 KB)

navtiming-response (responseStart to responseEnd) did not go up on the same day, only several days later, and only by < 3 milliseconds; it has since come back down.

The domInteractive metric (up to the end of HTML download/parse, not subresources or paint) and the firstPaint metric show a similar pattern to responseStart: they went up by a small amount (around 10 milliseconds) on the same day (June 14-15), but have since come back down.

The metrics that remained up and correlate with loadEventEnd are:

  • domComplete and loadEventStart (naturally, nearly identical to loadEventEnd).
  • mediaWikiLoadEnd (nav start to JS settle).
  • processing (responseEnd to domComplete).

This suggests it is either related to client-side rendering (the amount of data or the speed at which it can be processed), or to secondary requests (e.g. the HTML request might be fine, but CSS/JS/images are more affected by the networking issue).
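
For reference, a small illustration (with made-up numbers) of how the derived intervals mentioned here map onto raw W3C Navigation Timing marks; the Wikimedia-specific metrics such as mediaWikiLoadEnd are not reproduced:

```python
# Illustrative only: fabricated sample values, milliseconds since navigationStart.
sample = {
    "responseStart": 180,
    "responseEnd": 195,
    "domInteractive": 620,
    "domComplete": 1270,
    "loadEventStart": 1275,
    "loadEventEnd": 1282,
}

derived = {
    # navtiming-response: time spent receiving the response body.
    "response": sample["responseEnd"] - sample["responseStart"],
    # processing: responseEnd until the document has finished loading.
    "processing": sample["domComplete"] - sample["responseEnd"],
    # duration of the load event itself.
    "loadEvent": sample["loadEventEnd"] - sample["loadEventStart"],
}
print(derived)  # {'response': 15, 'processing': 1075, 'loadEvent': 7}
```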

Mentioned in SAL (#wikimedia-operations) [2019-06-24T10:28:23Z] <moritzm> re-enabling TCP SACKs on cp5001-cp5003 (half of Varnish/upload in eqsin) T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-24T10:36:58Z] <moritzm> re-enabling TCP SACKs on cp5007-cp5009 (half of Varnish/text in eqsin) T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-24T13:04:13Z] <moritzm> re-enabling TCP SACKs on cp1075-1082 (half of Varnish/text and Varnish/upload in eqiad) T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-24T13:10:37Z] <moritzm> re-enabling TCP SACKs on cp2001,2002,2004-2008,2010,2011, 2014, 2017 (half of Varnish/text and Varnish/upload in codfw) T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-24T13:19:03Z] <moritzm> re-enabling TCP SACKs on cp3040-cp3047, cp3049 (half of Varnish/text and Varnish/upload in esams) T225998

Mentioned in SAL (#wikimedia-operations) [2019-06-24T13:26:25Z] <moritzm> re-enabling TCP SACKs on cp4024-4029 (half of Varnish/text and Varnish/upload in ulsfo) T225998

Breakdown of servers and their config

eqsin:
Enabled: cp5001-cp5003, cp5007-cp5009
Disabled: cp5004-cp5006, cp5010-cp5012

ulsfo:
Enabled: cp4024-cp4029
Disabled: cp4021-cp4023, cp4030-cp4032

esams:
Enabled: cp3040-cp3047, cp3049
Disabled: cp3030, cp3032-cp3039

codfw:
Enabled: cp2001,cp2002,cp2004-cp2008,cp2010,cp2011,cp2014,cp2017
Disabled: cp2012,cp2013,cp2016,cp2018-cp2026

eqiad:
Enabled: cp1075-cp1082
Disabled: cp1083-cp1090
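
To bucket request samples into the two halves for the comparison that follows, a helper along these lines (hypothetical, for illustration only) could expand the shorthand ranges above into explicit host sets:

```python
import re


def expand(spec: str) -> set:
    """Expand e.g. 'cp5001-cp5003,cp5007' into {'cp5001', 'cp5002', 'cp5003', 'cp5007'}."""
    hosts = set()
    for part in spec.split(","):
        m = re.match(r"(cp)(\d+)-(?:cp)?(\d+)$", part.strip())
        if m:
            prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
            hosts.update(f"{prefix}{n}" for n in range(lo, hi + 1))
        else:
            hosts.add(part.strip())
    return hosts


sack_enabled = (
    expand("cp5001-cp5003,cp5007-cp5009")                                # eqsin
    | expand("cp4024-cp4029")                                            # ulsfo
    | expand("cp3040-cp3047,cp3049")                                     # esams
    | expand("cp2001,cp2002,cp2004-cp2008,cp2010,cp2011,cp2014,cp2017")  # codfw
    | expand("cp1075-cp1082")                                            # eqiad
)
print(len(sack_enabled))  # 40 hosts in the SACK-enabled half
```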

The results are in, looking at loadEventEnd since yesterday 14:00 GMT until now.

status        | median (ms) | p90 (ms) | p95 (ms) | sample size
SACK enabled  | 1282        | 4079     | 6338.75  | 335806
SACK disabled | 1291        | 4259     | 6783     | 339182
difference    | 0.7%        | 4.2%     | 6.5%     |

It's definitely worth re-enabling SACK everywhere, as the performance difference is significant, particularly for users with slower/less reliable internet connections.
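
As a rough sketch of how such a comparison can be computed (not the actual dashboard query; the percentile interpolation may differ, and the difference column above is taken relative to the SACK-disabled values):

```python
def percentile(values, p):
    """Linear-interpolation percentile, p in [0, 100]."""
    xs = sorted(values)
    k = (len(xs) - 1) * p / 100
    lo, hi = int(k), min(int(k) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def summarize(enabled, disabled):
    """enabled/disabled: lists of loadEventEnd samples (ms) per bucket."""
    for name, p in (("median", 50), ("p90", 90), ("p95", 95)):
        e, d = percentile(enabled, p), percentile(disabled, p)
        print(f"{name}: enabled={e:.2f} disabled={d:.2f} diff={(d - e) / d:.1%}")


if __name__ == "__main__":
    # Tiny fabricated demo; real input would be the two full sample sets.
    summarize(enabled=[1200, 1280, 1300, 4000, 6200],
              disabled=[1250, 1290, 1330, 4200, 6800])
```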

The internal recvfrom field from EventLogging refers to the Varnish frontend that received the /beacon request (similar to the last hostname in the XCS response header). AFAIK that does not correlate with the multiple requests that made up the page load, as each of those requests is distributed independently by LVS.

Looking at the critical non-image requests (HTML, CSS, JS) for a logged-out page view, I see responses from cp3042 and some from cp3030, and images from separate cp hosts as well (text cluster vs. upload cluster).

It does for text: when your IP gets hashed to a specific Varnish frontend, you get all "small" requests from it, i.e. anything below the size threshold for being stored in Varnish frontends.
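
Purely to illustrate that pinning effect (this is not the actual LVS scheduler, and the host list is a made-up subset):

```python
# Source-IP hashing: the same client IP always maps to the same frontend,
# which is why all small, frontend-cacheable responses come from one host.
import hashlib

TEXT_FRONTENDS = ["cp3030", "cp3041", "cp3042"]  # illustrative subset, not real cluster membership


def frontend_for(client_ip: str, frontends=TEXT_FRONTENDS) -> str:
    digest = hashlib.sha1(client_ip.encode()).digest()
    return frontends[int.from_bytes(digest[:4], "big") % len(frontends)]


print(frontend_for("192.0.2.42"))  # same IP -> same frontend, every time
```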

Remember that x-cache headers are read from right to left. Trying this out right now with a cleared cache and cleared local storage, I get the following headers for all assets on the page:

Non-images

x-cache: cp1079 hit/1, cp3030 hit/1, cp3041 miss
x-cache: cp1087 hit/7, cp3041 hit/3, cp3041 hit/140
x-cache: cp1087 hit/7, cp3030 hit/6, cp3041 hit/3267
x-cache: cp1079 hit/9, cp3033 hit/2, cp3041 hit/5468
x-cache: cp1085 hit/5, cp3032 hit/3, cp3041 hit/4463
x-cache: cp1089 hit/1, cp3033 hit/3, cp3041 hit/1
x-cache: cp1083 hit/7, cp3030 hit/5, cp3041 hit/1494482 (image/icon served by MediaWiki)
x-cache: cp1081 hit/4, cp3041 hit/3, cp3041 hit/1458154 (image/icon served by MediaWiki)
x-cache: cp1075 hit/4, cp3030 hit/4, cp3041 hit/529456 (image/icon served by MediaWiki)
x-cache: cp1087 hit/1, cp3041 hit/2, cp3041 hit/675
x-cache: cp1085 hit/5, cp3040 hit/2, cp3041 hit/208723
x-cache: cp1083 hit/10, cp3042 hit/5, cp3041 hit/rOMWC29385090c63e (image/icon served by MediaWiki)
x-cache: cp1087 hit/1, cp3032 hit/1, cp3041 hit/2138
x-cache: cp1081 pass, cp3030 hit/3, cp3041 hit/47756
x-cache: cp1087 hit/9, cp3032 hit/3, cp3041 hit/29240

Images

x-cache: cp3036 hit, cp3044 hit/1308
x-cache: cp3046 hit, cp3044 hit/103289
x-cache: cp3043 hit, cp3044 hit/1
x-cache: cp3043 hit, cp3044 hit/474536

So it was cp3041 every time for me, for anything besides images.

Remember that x-cache headers are read from right to left.

More details here: https://wikitech.wikimedia.org/wiki/Varnish#X-Cache
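
For example, a small (illustrative) parser that reads these headers right to left and picks out the frontend:

```python
def parse_x_cache(value: str):
    """Split an x-cache value into (host, status) hops, outermost entry last."""
    return [tuple(entry.strip().split(" ", 1)) for entry in value.split(",")]


hops = parse_x_cache("cp1079 hit/1, cp3030 hit/1, cp3041 miss")
print("frontend:", hops[-1])  # ('cp3041', 'miss'), the cache closest to the client
print("deepest:", hops[0])    # ('cp1079', 'hit/1'), the layer furthest from the client
```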

Change 519193 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Re-enable TCP selective acknowledgements on hosts running a fixed kernel

https://gerrit.wikimedia.org/r/519193

Change 519193 merged by Muehlenhoff:
[operations/puppet@production] Re-enable TCP selective acknowledgements on hosts running a fixed kernel

https://gerrit.wikimedia.org/r/519193

@Gilles: is there anything left to be done here? Other than blogging about the results that is. :-)

Nothing left to do here, I believe.

Yeah, I think we can close this one.