
ForeignResourceStructureTest flaky in CI due to "Failed to download resource at https://codeload.github.com"
Closed, Resolved (Public)

Description

Similar to T362095: "composer install" flaky in CI due to "Failed to connect to github.com port 443: Connection timed out"

16:33:32 1) ForeignResourceStructureTest::testVerifyIntegrity
16:33:32 LogicException: Failed to download resource at https://codeload.github.com/wikimedia/jquery.i18n/tar.gz/70b5ee20a638cb8fe36baef8d51ac2eb577ce012
16:33:32 
16:33:32 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:300
16:33:32 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:381
16:33:32 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:186
16:33:32 /workspace/src/tests/phpunit/integration/includes/ResourceLoader/ForeignResourceStructureTest.php:41

Event Timeline

Reedy renamed this task from Increase in Failed to download resource at https://codeload.github.com to Increase in "Failed to download resource at https://codeload.github.com" in CI.Apr 12 2024, 3:45 PM
Krinkle renamed this task from Increase in "Failed to download resource at https://codeload.github.com" in CI to ForeignResourceStructureTest flaky in CI due to "Failed to download resource at https://codeload.github.com".Apr 22 2024, 9:11 PM
Krinkle moved this task from Backlog to WMF-deployed Build Failure on the ci-test-error board.

Got this failure again, this time with jQuery:

00:03:33.091 1) ForeignResourceStructureTest::testVerifyIntegrity
00:03:33.091 LogicException: Failed to download resource at https://code.jquery.com/qunit/qunit-2.20.0.js
00:03:33.091 
00:03:33.091 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:292
00:03:33.091 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:350
00:03:33.091 /workspace/src/includes/ResourceLoader/ForeignResourceManager.php:204
00:03:33.091 /workspace/src/tests/phpunit/integration/includes/ResourceLoader/ForeignResourceStructureTest.php:41

See the console log. This happened on a pure JS change: Object.assign's first argument must never be null/undefined

I just filed the jQuery version at T368385 as well; not sure if it makes sense to track separately or should be considered a duplicate, TBH.

BTW, I also remember occasionally getting this failure (for GitHub)… maybe ForeignResourceManager should retry the download once or twice if it fails? (AFAICT it’s never called during normal requests, so the potential extra runtime shouldn’t be a production concern, I think.)
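The retry idea above can be sketched roughly as follows. This is a Python illustration only, not the PHP code in ForeignResourceManager; the function name `fetch_with_retry`, the three attempts, and the exponential backoff are all assumptions made for the example.

```python
import time
from typing import Callable


def fetch_with_retry(
    fetch: Callable[[str], bytes],
    url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch(url), retrying on OSError with exponential backoff.

    `fetch` is any callable that returns the response body or raises an
    OSError subclass (timeouts and connection errors are OSError subclasses).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError as error:
            last_error = error
            # Wait before the next try, but not after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"Failed to download resource at {url}: {last_error}"
    ) from last_error
```

Passing the downloader and the sleep function as parameters keeps the sketch testable without network access. Whether a couple of retries would be enough depends on how long the upstream outage or throttle lasts.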

I just filed the jQuery version at T368385 as well; not sure if it makes sense to track separately or should be considered a duplicate, TBH.

On second thought… I think I lean towards merging the two, yeah.

Is it possible to specify a fallback source (e.g. some mirror that hosts this code) with ForeignResourceManager that is then only used in CI context?

I guess that depends on what the goal of the test is…

In a change where the foreign resources are touched, I think specifying additional sources would effectively mean that we trust all of the listed sources equally? Since we would let the change pass CI (and check the new resources into Git) if the file matched any of the sources.

In a change where the foreign resources aren’t touched, I think the test serves almost no purpose and might as well be disabled, except that it’s tricky to implement that? (It could still detect if upstream clandestinely changes the resource at the same URL, but I don’t know that it’s our responsibility to detect that, to be honest.)

Change #1049584 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/core@master] ForeignResourceManager: Show details about error

https://gerrit.wikimedia.org/r/1049584

Seen in another build:

LogicException: Failed to download resource at https://registry.npmjs.org/oojs/-/oojs-7.0.1.tgz

So either codeload.github.com, code.jquery.com and registry.npmjs.org are all having infrastructure problems (admittedly, GitHub and npm are both owned by Microsoft, so it’s not completely impossible)… or the issue is on our end?

Change #1049594 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[mediawiki/core@master] Skip failing ForeignResourceStructureTest

https://gerrit.wikimedia.org/r/1049594

Add Cloudflare to the list of seemingly affected upstreams (build):

LogicException: Failed to download resource at https://cdnjs.cloudflare.com/ajax/libs/chosen/1.8.2/chosen-sprite%402x.png: HTTP request timed out.

At this point I think it’s pretty likely that the issue is on our end… but unfortunately “HTTP request timed out” isn’t as much detail as I’d hoped for :S

Change #1049606 had a related patch set uploaded (by Lucas Werkmeister (WMDE); author: Lucas Werkmeister (WMDE)):

[mediawiki/core@master] Revert "Skip failing ForeignResourceStructureTest"

https://gerrit.wikimedia.org/r/1049606

Change #1049594 merged by jenkins-bot:

[mediawiki/core@master] Skip failing ForeignResourceStructureTest

https://gerrit.wikimedia.org/r/1049594

This is odd, on the integration VMs I am not seeing any connection problems.

thcipriani@integration-agent-docker-1045:~$ nc -vz cdnjs.cloudflare.com -w 1 443
Connection to cdnjs.cloudflare.com (104.17.24.14) 443 port [tcp/https] succeeded!
thcipriani@integration-agent-docker-1045:~$ nc -vz github.com -w 1 443
Connection to github.com (140.82.114.4) 443 port [tcp/https] succeeded!

I note that there are extra layers (Docker container, package manager) that should still be tested, but network connectivity seems fine at the moment.

Change #1049584 merged by jenkins-bot:

[mediawiki/core@master] ForeignResourceManager: Show details about error

https://gerrit.wikimedia.org/r/1049584

Hrm. Looks like there are basic TCP connection errors on the VM. I also note the IP changing a surprising amount during this short test:

thcipriani@integration-agent-docker-1045:~$ while : ; do nc -vz github.com -w 6 443; sleep 30; done
Connection to github.com (140.82.113.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.113.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.114.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.112.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.112.4) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.113.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.113.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.112.3) 443 port [tcp/https] succeeded!
Connection to github.com (140.82.112.3) 443 port [tcp/https] succeeded!
nc: connect to github.com (140.82.113.4) port 443 (tcp) timed out: Operation now in progress
Connection to github.com (140.82.113.4) 443 port [tcp/https] succeeded!

Trying this locally: the changing IPs seem par for GitHub; the failing outbound connection is something else, though.
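For repeating this kind of check from inside the Docker container, where nc may not be installed, a rough Python equivalent of a single nc -vz probe could look like this. It is a sketch only: the function name `probe` is hypothetical, and the 6-second default simply mirrors the -w 6 used above.

```python
import socket


def probe(host: str, port: int, timeout: float = 6.0) -> tuple[str, bool]:
    """Resolve host once, attempt one TCP connection, return (ip, succeeded)."""
    ip = socket.gethostbyname(host)
    try:
        # create_connection raises an OSError subclass on timeout or refusal.
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False
```

Calling `probe("github.com", 443)` in a loop with a sleep in between would record both the resolved IP and whether the connection timed out.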

I think @Andrew moved the contint Cloud VPS nodes to the OVS agent hypervisors today. Could that be related?

One potential issue is that ForeignResourceStructureTest::testVerifyIntegrity is triggered from each of the Jenkins jobs running for mediawiki/core (and thus under PHP 7.4, 8.1, 8.2, 8.3), and for every patch sent.

Note that it is downloading the tarballs to verify their integrity. The download could potentially be skipped if we instead kept a checksum of each of the files contained in the tarball.
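A minimal sketch of the per-file checksum idea, in Python rather than PHP and not taken from ForeignResourceManager: build a manifest of hashes once from the tarball, then verify the checked-in files against that manifest without downloading again. The function names and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
import io
import tarfile


def tarball_manifest(data: bytes) -> dict[str, str]:
    """Map each regular file in a tar archive to the SHA-256 of its contents."""
    manifest = {}
    # "r:*" lets tarfile detect gzip/bz2/xz compression transparently.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as archive:
        for member in archive.getmembers():
            if member.isfile():
                extracted = archive.extractfile(member)
                manifest[member.name] = hashlib.sha256(extracted.read()).hexdigest()
    return manifest


def changed_files(manifest: dict[str, str], files_on_disk: dict[str, bytes]) -> list[str]:
    """Return names that are missing on disk or whose bytes differ from the manifest."""
    return sorted(
        name
        for name, digest in manifest.items()
        if name not in files_on_disk
        or hashlib.sha256(files_on_disk[name]).hexdigest() != digest
    )
```

The trade-off is that the manifest itself would have to be trusted and kept in the repository; it detects local drift but no longer notices an upstream change at the same URL.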

After a quick look at includes/ResourceLoader/ForeignResourceManager.php, it supports XDG_CACHE_HOME and uses mw-foreign beneath it. If I look at the CI cache, there is a single job having such a directory:

$ ls -l /srv/castor/mediawiki-core/master/mediawiki-quibble-vendor-mysql-php80/mw-foreign
total 5708
drwxr-sr-x 2 jenkins-deploy wikidev    4096 May  7 13:21 .
drwxrwsrwx 7 jenkins-deploy wikidev    4096 May  7 13:21 ..
-rw-r--r-- 1 jenkins-deploy wikidev   12718 May  7 13:21 CLDRPluralRuleParser_35271498_328afeab_CLDRPluralRuleParser_js.data
-rw-r--r-- 1 jenkins-deploy wikidev  511630 May  7 13:21 codex_09902517_c9fd17c2_codex_1_5_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev   67311 May  7 13:21 codex_design_tokens_945744ae_48a6385a_codex_design_tokens_1_5_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev  187300 May  7 13:21 codex_icons_b05e0ae9_00df03a6_codex_icons_1_5_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev   12959 May  7 13:21 fetch_polyfill_84f8f065_573ed6ed_whatwg_fetch_3_6_2_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev   22342 May  7 13:21 intersection_observer_dc899fec_e740e0f2_intersection_observer_0_12_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev    1215 May  7 13:21 jquery_chosen_056947c8_a922d849_LICENSE_md.data
-rw-r--r-- 1 jenkins-deploy wikidev    1904 May  7 13:21 jquery_chosen_4bc4fa80_f000651a_README_md.data
-rw-r--r-- 1 jenkins-deploy wikidev   47205 May  7 13:21 jquery_chosen_6ad86030_ccf582d1_chosen_jquery_js.data
-rw-r--r-- 1 jenkins-deploy wikidev   11978 May  7 13:21 jquery_chosen_83917de4_61c23e25_chosen_css.data
-rw-r--r-- 1 jenkins-deploy wikidev     538 May  7 13:21 jquery_chosen_b5bfabcd_0a20953f_chosen_sprite_png.data
-rw-r--r-- 1 jenkins-deploy wikidev     738 May  7 13:21 jquery_chosen_ebd43fb5_2007fde4_chosen_sprite%402x_png.data
-rw-r--r-- 1 jenkins-deploy wikidev    6010 May  7 13:21 jquery_client_fad01184_d7a52564_jquery_client_3_0_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev  285314 May  7 13:21 jquery_e5d63f9e_c357ee36_jquery_3_7_1_js.data
-rw-r--r-- 1 jenkins-deploy wikidev  112819 May  7 13:21 jquery_i18n_a5136c3c_9483ae39_70b5ee20a638cb8fe36baef8d51ac2eb577ce012.data
-rw-r--r-- 1 jenkins-deploy wikidev 1459212 May  7 13:21 moment_cc10245f_fe546a6a_2_25_2.data
-rw-r--r-- 1 jenkins-deploy wikidev   34584 May  7 13:21 mustache_700834f3_1b259b17_mustache_4_2_0_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev   24215 May  7 13:21 oojs_e785e93d_32a85675_oojs_7_0_1_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev 1601863 May  7 13:21 ooui_d516a5cd_35888a57_oojs_ui_0_49_1_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev  142467 May  7 13:21 pako_1a96f6ff_cdb1394f_pako_deflate_js.data
-rw-r--r-- 1 jenkins-deploy wikidev    5183 May  7 13:21 pako_3cda6da3_cba999cf_README_md.data
-rw-r--r-- 1 jenkins-deploy wikidev   27876 May  7 13:21 pako_53b54ed7_12e6bae1_pako_deflate_min_js.data
-rw-r--r-- 1 jenkins-deploy wikidev    1104 May  7 13:21 pako_f23deea8_f82a78b9_LICENSE.data
-rw-r--r-- 1 jenkins-deploy wikidev   84760 May  7 13:21 pinia_0987ebc5_a1f75ba8_pinia_2_0_16_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev    9706 May  7 13:21 qunitjs_674dbb25_3a1a650e_qunit_2_20_0_css.data
-rw-r--r-- 1 jenkins-deploy wikidev  261991 May  7 13:21 qunitjs_c71e1571_f31cb574_qunit_2_20_0_js.data
-rw-r--r-- 1 jenkins-deploy wikidev  215019 May  7 13:21 sinonjs_5439e32e_313e716d_sinon_1_17_7_js.data
-rw-r--r-- 1 jenkins-deploy wikidev    1082 May  7 13:21 url_7b040ba9_b66de763_LICENSE_md.data
-rw-r--r-- 1 jenkins-deploy wikidev     148 May  7 13:21 url_8675b312_28ba3621_polyfill_js.data
-rw-r--r-- 1 jenkins-deploy wikidev   18263 May  7 13:21 url_ae97ebf9_6b4ac0f0_polyfill_js.data
-rw-r--r-- 1 jenkins-deploy wikidev  531501 May  7 13:21 vue_3418ace6_2ddc00f2_vue_3_3_9_tgz.data
-rw-r--r-- 1 jenkins-deploy wikidev   65770 May  7 13:21 vuex_0a04d0b3_7636816b_vuex_4_0_2_tgz.data

That last ran on May 7; since then we no longer run the php8.0 job. That makes me wonder whether the test actually runs on CI, or maybe the cache is not working somehow? :/

Is it possible to specify a fallback source […] only used in CI context?

In a change where the foreign resources are touched, […].

In a change where the foreign resources aren’t touched, I think the test serves almost no purpose […].

I believe this was solved in 2019 with T203694: Run ForeignResourceManager verification on MediaWiki core commits.

If you've run manageForeignResources.php verify once in the past, re-running doesn't download anything and completes near-instantly. The same is true for composer phpunit -- tests/phpunit/integration/includes/ResourceLoader/ForeignResourceStructureTest.php. These two share the same offline and non-expiring cache, validated by content hash. This is enabled in CI as well.
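To illustrate why a warm cache makes re-runs free of network access, here is a rough Python sketch of a cache validated by content hash. It is not the actual ForeignResourceManager logic: the function names, the file naming scheme, the use of SHA-256, and the ~/.cache fallback when XDG_CACHE_HOME is unset are all assumptions; only the mw-foreign directory name and the XDG_CACHE_HOME variable come from this thread.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable


def cache_dir() -> Path:
    """Resolve $XDG_CACHE_HOME/mw-foreign (assumed fallback: ~/.cache/mw-foreign)."""
    base = os.environ.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "mw-foreign"


def fetch_verified(
    url: str,
    expected_sha256: str,
    download: Callable[[str], bytes],
    directory: Path,
) -> bytes:
    """Return the resource bytes, calling download() only on a cache miss."""
    # Hypothetical naming scheme: short URL hash plus short expected hash.
    key = hashlib.sha256(url.encode()).hexdigest()[:8] + "_" + expected_sha256[:8]
    path = directory / f"{key}.data"
    if path.exists():
        data = path.read_bytes()
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data  # cache hit: no network access needed
    data = download(url)
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError(f"Integrity check failed for {url}")
    directory.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data
```

Because entries are validated by their content hash rather than by age, they never need to expire: a stale or corrupted file simply fails validation and is downloaded again.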

If CI is having general networking issues, then composer install won't succeed anyway, and individual tests like this make little difference (although the vendor job will reach PHPUnit without it, so on those jobs manageForeignResources might be the first visible failure). For this to happen, there would have to be an empty cache. This might happen once or twice a quarter after renaming or adding Jenkins jobs, e.g. for a new PHP version. Or after a Debian upgrade of the CI runners at WMCS, where Castor would begin with an empty cache once.

I guess one of these scenarios happened recently, and thus uncovered a networking reliability problem.

With https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1049879 I emitted some debug statements, which seem to indicate that the class does indeed write downloaded material under /cache/mw-foreign, but I don't get why it is not saved. I have at least confirmed that the files get written to /cache/mw-foreign and that, if I trigger a job as if it were in postmerge, Castor does save the mw-foreign files.

So my assumption is that eventually the cache gets nuked, or is saved while that directory does not exist.

I also note the Quibble step which installs the dev dependencies does download all dependencies:

11:31:10   - Downloading squizlabs/php_codesniffer (3.8.1)
11:31:10   - Downloading dealerdirect/phpcodesniffer-composer-installer (v1.0.0)
11:31:10   - Downloading composer/pcre (3.1.4)
11:31:10   - Downloading psr/cache (1.0.1)
11:31:10   - Downloading doctrine/deprecations (1.1.3)
11:31:10   - Downloading doctrine/event-manager (1.2.0)

Filed as T368550

I ran a job and confirmed that the mw-foreign cache was saved. I was watching the cached directory on the Castor instance and eventually it vanished:

VANISHED ! Wed, 26 Jun 2024 15:30:02 +0000

I tracked back the build to https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1049832 from gate-and-submit (which does trigger castor). In the build artifact mw-debug-cli.log.gz there is:

[PHPUnit] Start test ForeignResourceStructureTest::testVerifyIntegrity
[PHPUnit] Skipped test ForeignResourceStructureTest::testVerifyIntegrity: T362425

Since the test is skipped, mw-foreign is not populated, and due to T368550 it is not saved even when a build generates it. Mystery solved.

Change #1049988 had a related patch set uploaded (by Ladsgroup; author: Bartosz Dziewoński):

[mediawiki/core@wmf/1.43.0-wmf.11] Skip failing ForeignResourceStructureTest

https://gerrit.wikimedia.org/r/1049988

Change #1049989 had a related patch set uploaded (by Ladsgroup; author: Bartosz Dziewoński):

[mediawiki/core@wmf/1.43.0-wmf.10] Skip failing ForeignResourceStructureTest

https://gerrit.wikimedia.org/r/1049989

I think the root cause is T368550, which prevents the cache from being kept between builds. That means that as the jobs running in parallel download material over a short period of time, we might trigger a throttle/rate limit upstream, which kills the connection and fails the build. As Timo said on IRC: We now roll the dice 300 times in every build instead of between 0-1 times per build.
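The dice-rolling point can be made concrete with a little arithmetic. If each download fails independently with probability p, the chance that at least one of n downloads fails is 1 - (1 - p)^n. The sketch below uses a hypothetical p of 0.1%, chosen purely for illustration and not measured anywhere in this task; real failures caused by upstream throttling would also not be independent.

```python
def failure_probability(per_download_failure: float, downloads: int) -> float:
    """Probability that at least one of `downloads` independent downloads fails."""
    return 1.0 - (1.0 - per_download_failure) ** downloads


# Hypothetical 0.1% failure rate per download, for illustration only.
p = 0.001
print(f"1 download:    {failure_probability(p, 1):.1%}")
print(f"300 downloads: {failure_probability(p, 300):.1%}")
```

Under that assumed rate, a single download almost never fails, while 300 of them fail at least once in roughly a quarter of builds, which is the kind of jump that turns a rare hiccup into a visibly flaky test.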

The cache should be restored now, but that needs to be verified before we re-enable ForeignResourceStructureTest.

Change #1049988 merged by jenkins-bot:

[mediawiki/core@wmf/1.43.0-wmf.11] Skip failing ForeignResourceStructureTest

https://gerrit.wikimedia.org/r/1049988

Change #1049989 merged by jenkins-bot:

[mediawiki/core@wmf/1.43.0-wmf.10] Skip failing ForeignResourceStructureTest

https://gerrit.wikimedia.org/r/1049989

Mentioned in SAL (#wikimedia-operations) [2024-06-26T17:05:56Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:rMW1049982acf77|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]]

Mentioned in SAL (#wikimedia-operations) [2024-06-26T17:08:53Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:rMW1049982acf77|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwd

Mentioned in SAL (#wikimedia-operations) [2024-06-26T17:14:49Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:rMW1049982acf77|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]], [[gerrit:1049989|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049988|Skip failing ForeignResourceStructureTest (T362425)]], [[gerrit:1049984|Modify WikiExporter's BATCH_SIZE from 50000 to 10000 (T368098)]] (duration: 08m 52s)

hashar claimed this task.

The root cause was that we ran CI without any cache (T368550), and thus the integrity checker had to redownload files again and again, as described above in T362425#9925584.

I am closing this since I restored the caching last week.

Jdforrester-WMF subscribed.

This is not fixed. The revert of the test skip has (a) not landed and, more importantly, (b) is not passing CI.

Change #1049606 merged by jenkins-bot:

[mediawiki/core@master] Revert "Skip failing ForeignResourceStructureTest"

https://gerrit.wikimedia.org/r/1049606

It passed this time 🤷‍♂️