Investigate esams text varnish backend fetch failures
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• ema
	Jun 24 2019, 9:49 AM

Description

There has been an increase in varnish-be fetch failures in esams text lately, correlating timewise with T226048. FetchError monitoring, recently introduced, might help diagnosing the issue: T224994.

See the kibana dashboard https://logstash.wikimedia.org/goto/caa7c533400fa99e5a4b0516ee0298a6 as well as the fetcherror pie graph for a breakdown of the different errors: https://logstash.wikimedia.org/goto/f14da21012a25060b1f63685028ecc7b

Last 24h at the time of this writing:

reason	count	percentage
Resource temporarily unavailable - straight insufficient bytes	13279	31.4%
http format error	13062	30.9%
Could not get storage	10341	24.5%
HTC status 3	3113	7.4%
chunked read err	2476	5.9%

"Resource temporarily unavailable" is EAGAIN, and it might be caused by between_bytes_timeout or similar.

$ sudo varnishtest -v bin/varnishtest/tests/b00022.vtc 2>&1 | grep FetchError
**** v1    1.5 vsl|       1002 FetchError      b Resource temporarily unavailable
**** v1    1.5 vsl|       1002 FetchError      b eof socket fail

"HTC status 3" is probably due to https://github.com/varnishcache/varnish-cache/issues/1772. Once we patch Varnish and upgrade all hosts to the new version, this error should become "Timed out reusing backend connection".

"http format error" is interesting, it looks like some garbage (sic, that's what varnish calls it) is occasionally returned by the appservers. See https://logstash.wikimedia.org/goto/7338143bb141cf85845385c77b52a944 and https://logstash.wikimedia.org/goto/4bfb870a2c82886a21ababfb898459b5

Details

Subject	Repo	Branch	Lines +/-
cache: add cp3043 back to the text cluster	operations/puppet	production	+4 -4
varnishfetcherror: log BogoHeader	operations/puppet	production	+1 -1
Honor first_byte_timeout for recycled backend connections	operations/debs/varnish4	debian-wmf	+262 -0

Customize query in gerrit

Related Objects

Mentioned In: T226048: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster)
T226318: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408
Mentioned Here: T226373: Swift object servers become briefly unresponsive on a regular basis
T224994: varnish: implement FetchError logging
T226048: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster)

Event Timeline

• ema created this task.Jun 24 2019, 9:49 AM

Restricted Application added a project: SRE. · View Herald TranscriptJun 24 2019, 9:49 AM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

• ema triaged this task as Medium priority.Jun 24 2019, 9:49 AM

• ema mentioned this in T226318: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408.

• Gilles subscribed.Jun 24 2019, 10:16 AM

This comment was removed by • Gilles.

• ema moved this task from Backlog to Caching on the Traffic board.Jun 24 2019, 12:49 PM

• ema updated the task description. (Show Details)Jun 24 2019, 1:03 PM

CDanis subscribed.Jun 24 2019, 1:04 PM

• ema updated the task description. (Show Details)Jun 24 2019, 1:20 PM

Change 518064 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] Honor first_byte_timeout for recycled backend connections

https://gerrit.wikimedia.org/r/518064

gerritbot added a project: Patch-For-Review.Jun 24 2019, 1:25 PM

• ema updated the task description. (Show Details)Jun 24 2019, 1:27 PM

• ema updated the task description. (Show Details)Jun 24 2019, 1:36 PM

Change 518064 merged by Ema:
[operations/debs/varnish4@debian-wmf] Honor first_byte_timeout for recycled backend connections

https://gerrit.wikimedia.org/r/518064

• ema updated the task description. (Show Details)Jun 24 2019, 2:01 PM

Mentioned in SAL (#wikimedia-operations) [2019-06-24T14:01:57Z] <ema> cp3032: upgrade varnish to 5.1.3-1wm11 T226375

• ema mentioned this in T226048: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster).Jun 24 2019, 4:31 PM

Paladox subscribed.Jun 24 2019, 7:33 PM

Change 519056 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishfetcherror: log BogoHeader

https://gerrit.wikimedia.org/r/519056

Change 519056 merged by Ema:
[operations/puppet@production] varnishfetcherror: log BogoHeader

https://gerrit.wikimedia.org/r/519056

Change 519189 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: add cp3043 back to the text cluster

https://gerrit.wikimedia.org/r/519189

Change 519189 merged by Ema:
[operations/puppet@production] cache: add cp3043 back to the text cluster

https://gerrit.wikimedia.org/r/519189

Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts:

['cp3043.esams.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906261419_ema_252538.log.

Completed auto-reimage of hosts:

['cp3043.esams.wmnet']

and were ALL successful.

Zache subscribed.Jun 27 2019, 3:45 AM

Long-ago dealt with it looks like, and in any case varnish-be doesn't exist anymore.

Investigate esams text varnish backend fetch failuresClosed, ResolvedPublicActions

Description

Details

Related Objects

Event Timeline

Investigate esams text varnish backend fetch failures
Closed, ResolvedPublic
Actions