InternetArchiveBot incorrectly replacing archive URL
Closed, Resolved · Public

Assigned To

Authored By

	Graham87
	Mar 22 2017, 2:39 AM

Description

The Pandora Archive is a web-archiving service run by the National Library of Australia that contains some URLs that are not on the Wayback Machine. It is used in thousands of articles on Wikipedia:
https://en.wikipedia.org/w/index.php?title=Special:LinkSearch/pandora.nla.gov.au&limit=500&offset=10000&target=http%3A%2F%2Fpandora.nla.gov.au

Internet Archive Bot replaced the Pandora link with a dead link in this diff:
https://en.wikipedia.org/w/index.php?title=Kieran_Modra&diff=771512151&oldid=761483969

I've disabled the bot to stop it from doing something like this in the future.

Event Timeline

Graham87 created this task. Mar 22 2017, 2:39 AM

Restricted Application assigned this task to Cyberpower678. Mar 22 2017, 2:39 AM

Restricted Application added a project: Internet-Archive.

Restricted Application added a subscriber: Aklapper.

Cyberpower678 triaged this task as High priority. Mar 22 2017, 2:48 AM

Cyberpower678 moved this task from Inbox to Archive requests on the InternetArchiveBot board.

This will take some doing. The newest update force-validates archive URLs to make sure they are actually archive URLs. As I went around Wikipedia researching ways to improve the reliability of IABot, I saw too many cases where the archiveurl parameter was being misused. The bot recognizes numerous archiving services; I'll work on adding the ones I missed in the first round of development.

On a side note, archive URLs get ignored if the bot doesn't see a need to add an archive for the original URL.

Cool. Could you also check to see if the bot has accidentally overwritten any other archive pages in this way? I've checked other pages about Australian Paralympians, where I've used the Pandora Archive often, but can't find any more examples.

Internet Archive Bot replaced the Pandora link with a dead link in this diff:
https://en.wikipedia.org/w/index.php?title=Kieran_Modra&diff=771512151&oldid=761483969

Not exactly a dead link.

Original link: http://www.ausport.gov.au/olym96/paracycl.html
Before (Pandora): http://pandora.nla.gov.au/nph-arch/2000/Z2000-Jan-20/http://www.ausport.gov.au/olym96/paracycl.html (archivedate=20 January 2000)
After: https://web.archive.org/web/20060128204513/http://www.ausport.gov.au/olym96/paracycl.html (archivedate=28 January 2006)

Both of these links "work" and contain an archived copy of what was visible at URL http://www.ausport.gov.au/olym96/paracycl.html at some point in time.

The problem is that at some point between 2000 and 2006, the original page ceased to exist and the URL started to respond with a "Page Not Found" page. Archive.org did crawl this URL in 2006 and did archive it, but what it archived was a copy of the "Page Not Found" page.

http://pandora.nla.gov.au/nph-arch/2000/Z2000-Jan-20/http://www.ausport.gov.au/olym96/paracycl.html

Original page from 2000

[Screenshot: Screen Shot 2017-03-21 at 20.43.15.png]

https://web.archive.org/web/20060128204513/http://www.ausport.gov.au/olym96/paracycl.html

"Page Not Found" as it looked in 2006

[Screenshot: Screen Shot 2017-03-21 at 20.43.55.png]

http://www.ausport.gov.au/olym96/paracycl.html

"Page Not Found" as it looks today (2017)

[Screenshot: Screen Shot 2017-03-21 at 20.44.41.png]

To prevent this, the Wikipedia bot would need to verify not only that Archive.org has a copy, but also that the copy is not a "Page Not Found" kind of copy. The bot can use the HTTP status code recorded with the snapshot to verify this (without needing to inspect the page itself).

However, this won't work for this particular example, because ausport.gov.au did not have its server configured correctly back in 2006. It was serving the "Page Not Found" page with a 200 OK (Success) status code instead of 404. This is a common server misconfiguration: the server does return a page, but that page is only a placeholder saying that there is no page. The current version of ausport.gov.au has fixed this and serves the "Page Not Found" page with a real 404 status, so the bot correctly will not try to use the newer copies.

[Screenshot: Screen Shot 2017-03-21 at 20.49.31.png]

[Screenshot: Screen Shot 2017-03-21 at 20.49.26.png]
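
The difference between a snapshot's recorded status code and what the page actually says can be sketched roughly as follows. This is only an illustration and not how InternetArchiveBot works: the function name `classify_capture` and the phrase list are assumptions made up for this example, and a real check would need a far more careful heuristic.

```python
import re

# Phrases typical of "not found" placeholder pages. This list is an
# illustrative assumption; it is not taken from InternetArchiveBot.
SOFT_404_PATTERNS = [
    re.compile(r"page\s+not\s+found", re.IGNORECASE),
    re.compile(r"no\s+longer\s+(exists|available)", re.IGNORECASE),
]


def classify_capture(status_code: int, body: str) -> str:
    """Classify an archived capture as 'error', 'soft-404', or 'ok'.

    status_code is the HTTP status recorded when the page was crawled;
    body is the archived HTML or text of the capture.
    """
    # A recorded non-2xx status is enough to reject the capture
    # without inspecting the page itself.
    if not 200 <= status_code < 300:
        return "error"
    # A 2xx status can still hide a placeholder page, so look at the text.
    if any(pattern.search(body) for pattern in SOFT_404_PATTERNS):
        return "soft-404"
    return "ok"
```

A capture like the 2006 one in this task would pass a status-only check, because it was recorded as 200; only the text check has a chance of catching it.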

In T161074#3120713, @Graham87 wrote:

Cool. Could you also check to see if the bot has accidentally overwritten any other archive pages in this way? I've checked other pages about Australian Paralympians, where I've used the Pandora Archive often, but can't find any more examples.

I couldn't find any other instances. You switched the bot off soon after it was turned back on.

In T161074#3120715, @Krinkle wrote:

To prevent this, the Wikipedia bot would need to verify not only that Archive.org has a copy, but also that the copy is not a "Page Not Found" kind of copy. The bot can use the HTTP status code recorded with the snapshot to verify this (without needing to inspect the page itself).

However, this won't work for this particular example, because ausport.gov.au did not have its server configured correctly back in 2006. It was serving the "Page Not Found" page with a 200 OK (Success) status code instead of 404. This is a common server misconfiguration: the server does return a page, but that page is only a placeholder saying that there is no page. The current version of ausport.gov.au has fixed this and serves the "Page Not Found" page with a real 404 status, so the bot correctly will not try to use the newer copies.

The bot already does that, and actually instructs the Wayback Machine to deliver only snapshots with a 200, 203, or 206 status. The fact that the snapshot was a 404 page means the site wasn't set up properly at that time.
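
As a rough sketch of that kind of status filtering (not the bot's actual code), the selection could look like the following. The capture records are a hypothetical shape invented for this example, a list of dicts with a 14-digit `timestamp` and an integer `status`, and `closest_usable_capture` is likewise a made-up name.

```python
from datetime import datetime

# Status codes named in the comment above as acceptable for a snapshot.
ACCEPTED_STATUSES = {200, 203, 206}


def parse_ts(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDhhmmss timestamp."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_usable_capture(captures, target_ts):
    """Return the capture with an accepted status nearest to target_ts.

    captures is a list of dicts such as
    {"timestamp": "20060128204513", "status": 200} (hypothetical shape).
    Returns None when no capture has an accepted status.
    """
    target = parse_ts(target_ts)
    usable = [c for c in captures if c["status"] in ACCEPTED_STATUSES]
    if not usable:
        return None
    return min(usable, key=lambda c: abs(parse_ts(c["timestamp"]) - target))
```

Note that a filter like this still selects a placeholder page that was served with a 200 status, which is exactly what happened in this task.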

Wow, NLA certainly has a bunch of different formats for the same snapshot.

Updated the archive validation subroutines.
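
For illustration only, recognizing the two archive URL shapes that appear in this task might look like the sketch below. It is not the bot's validation code. The regular expressions cover just the single Pandora format and the 14-digit Wayback format quoted above; the other NLA formats mentioned are not shown in this task and are not handled. The name `parse_archive_url` and the assumption that the Pandora date segment starts with one uppercase letter are both hypothetical.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# e.g. http://pandora.nla.gov.au/nph-arch/2000/Z2000-Jan-20/<original URL>
PANDORA_RE = re.compile(
    r"^https?://pandora\.nla\.gov\.au/nph-arch/\d{4}/"
    r"[A-Z](\d{4})-([A-Z][a-z]{2})-(\d{1,2})/(.+)$"
)
# e.g. https://web.archive.org/web/20060128204513/<original URL>
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/(\d{4})(\d{2})(\d{2})\d{6}/(.+)$"
)


def parse_archive_url(url):
    """Return (service, original_url, archive_date), or None if unrecognized."""
    try:
        m = PANDORA_RE.match(url)
        if m:
            year, month, day, original = m.groups()
            if month not in MONTHS:
                return None
            return ("pandora", original, date(int(year), MONTHS[month], int(day)))
        m = WAYBACK_RE.match(url)
        if m:
            year, month, day, original = m.groups()
            return ("wayback", original, date(int(year), int(month), int(day)))
    except ValueError:
        # The digits matched the pattern but do not form a real calendar date.
        return None
    return None
```

Applied to the two links from the diff, this yields the same original URL with archive dates of 20 January 2000 and 28 January 2006, while the bare ausport.gov.au URL is not recognized as an archive URL.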

