Page MenuHomePhabricator

Investigate esams text varnish backend fetch failures
Closed, ResolvedPublic

Description

There has been an increase in varnish-be fetch failures in esams text lately, correlating timewise with T226048. FetchError monitoring, recently introduced, might help diagnosing the issue: T224994.

See the kibana dashboard https://logstash.wikimedia.org/goto/caa7c533400fa99e5a4b0516ee0298a6 as well as the fetcherror pie graph for a breakdown of the different errors: https://logstash.wikimedia.org/goto/f14da21012a25060b1f63685028ecc7b

fetcherror-pie.png (1×2 px, 188 KB)

Last 24h at the time of this writing:

reasoncountpercentage
Resource temporarily unavailable - straight insufficient bytes1327931.4%
http format error1306230.9%
Could not get storage1034124.5%
HTC status 331137.4%
chunked read err24765.9%

"Resource temporarily unavailable" is EAGAIN, and it might be caused by between_bytes_timeout or similar.

$ sudo varnishtest -v bin/varnishtest/tests/b00022.vtc 2>&1 | grep FetchError
**** v1    1.5 vsl|       1002 FetchError      b Resource temporarily unavailable
**** v1    1.5 vsl|       1002 FetchError      b eof socket fail

"HTC status 3" is probably due to https://github.com/varnishcache/varnish-cache/issues/1772. Once we patch Varnish and upgrade all hosts to the new version, this error should become "Timed out reusing backend connection".

"http format error" is interesting, it looks like some garbage (sic, that's what varnish calls it) is occasionally returned by the appservers. See https://logstash.wikimedia.org/goto/7338143bb141cf85845385c77b52a944 and https://logstash.wikimedia.org/goto/4bfb870a2c82886a21ababfb898459b5

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change 518064 had a related patch set uploaded (by Ema; owner: Ema):
[operations/debs/varnish4@debian-wmf] Honor first_byte_timeout for recycled backend connections

https://gerrit.wikimedia.org/r/518064

Change 518064 merged by Ema:
[operations/debs/varnish4@debian-wmf] Honor first_byte_timeout for recycled backend connections

https://gerrit.wikimedia.org/r/518064

Mentioned in SAL (#wikimedia-operations) [2019-06-24T14:01:57Z] <ema> cp3032: upgrade varnish to 5.1.3-1wm11 T226375

Change 519056 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] varnishfetcherror: log BogoHeader

https://gerrit.wikimedia.org/r/519056

Change 519056 merged by Ema:
[operations/puppet@production] varnishfetcherror: log BogoHeader

https://gerrit.wikimedia.org/r/519056

Change 519189 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: add cp3043 back to the text cluster

https://gerrit.wikimedia.org/r/519189

Change 519189 merged by Ema:
[operations/puppet@production] cache: add cp3043 back to the text cluster

https://gerrit.wikimedia.org/r/519189

Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts:

['cp3043.esams.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201906261419_ema_252538.log.

Completed auto-reimage of hosts:

['cp3043.esams.wmnet']

and were ALL successful.

BBlack assigned this task to ema.
BBlack subscribed.

Long-ago dealt with it looks like, and in any case varnish-be doesn't exist anymore.