
Investigate Weird Behaviour Around Google-selected Canonical
Open, Needs Triage, Public

Description

I'm trying to gauge if the following is representative of a bigger problem.

Searching for the word "Fjord" on google.com, with the browser's language preference set to English, yields a zhwiki URL as the first Wikipedia result: https://zh.wikipedia.org/zh-cn/en:fjord

The expected page is https://en.wikipedia.org/wiki/Fjord

Inspecting the expected URL in Google Search Console shows the following:

Page indexing
Page is not indexed: Page with redirect
User-declared canonical: https://en.wikipedia.org/wiki/Fjord
Google-selected canonical: https://zh.wikipedia.org/zh-cn/en:fjord

The page indexing info for https://zh.wikipedia.org/zh-cn/en:fjord says the canonical is https://en.wikipedia.org/wiki/Fjord as expected.

As far as I can tell, enwiki Fjord was never a redirect for desktop UAs. On mobile, however, it 302-redirects to en.m.wikipedia.org/wiki/Fjord and is therefore treated as a redirect. As a result, the page is no longer treated as canonical, and Google "guesses" a canonical by looking at page content. It concludes that the zhwiki URL is the right canonical because that URL redirects to the same enwiki article.

The Google-selected (guessed) canonical is apparently based on content match. Understandably, the zhwiki URL matches the page that the mobile crawler eventually sees, because the zhwiki URL is itself a redirect to that page. The canonical reported for the zhwiki URL is, however, the enwiki non-mobile version.
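The desktop versus mobile difference described above can be checked by requesting the same URL with different User-Agent headers and comparing the status code and Location header, without following any redirect. The following is a minimal sketch, not a description of how Google's crawler behaves: the function names `http_head` and `compare_user_agents` are made up for this example, and the caller has to supply User-Agent strings that the server actually classifies as desktop or mobile.

```python
import http.client
from urllib.parse import quote, urlsplit


def http_head(url, user_agent):
    """Send one HEAD request over HTTPS and return (status, Location header or None).

    http.client never follows redirects on its own, so a 302 is reported as-is.
    """
    parts = urlsplit(url)
    path = quote(parts.path or "/", safe="/:%")
    if parts.query:
        path += "?" + parts.query
    conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", path, headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def compare_user_agents(url, user_agents, fetch=http_head):
    """Return {label: (status, location)} for each labelled User-Agent string.

    `fetch` is injectable so the comparison can be exercised without network access.
    """
    return {label: fetch(url, ua) for label, ua in user_agents.items()}
```

Calling `compare_user_agents("https://en.wikipedia.org/wiki/Fjord", {"desktop": ..., "mobile": ...})` with real User-Agent strings would show whether only the mobile request receives a redirect, which is the behaviour this task assumes.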

I don't know how widespread this is or how problematic it will be. I examined a couple hundred URLs that are marked as "Pages with redirect" in search console and therefore not indexed but could not find another example such as the Fjord one. In other words, all the other pages that Google seems to think are redirects are indeed redirects.

What's special about the Fjord article?

If attempts to index an en.wikipedia page with a mobile crawler are treated as redirects, leading Google to choose a weird canonical, why isn't this happening to other pages (even pages that were crawled on the same day by the same crawler)?

Event Timeline

SCherukuwada added a subscriber: Krinkle.

@Krinkle This could use a second pair of eyes to make sure I'm not missing something completely obvious if you could spare the time.

I've reached out to Partnerships to surface this to Google.

The /zh-cn/ path is for automatic language variant conversion. This is enabled on certain wikis only (including zh.wikipedia.org). It displays meaningfully different text in the article content body, and gets its own canonical URL in the HTML. E.g. compare https://zh.wikipedia.org/wiki/關東地方 to https://zh.wikipedia.org/zh-sg/關東地方 (linked from the variants menu in the toolbar), which contains notably different characters in the content; each has its own <link rel=canonical>. This conversion resides close to the Parser and is part of the MediaWiki-Language-converter component, both maintained by the CTT team under MwEng. I don't think this component is playing a role in what URL Google is displaying.
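To compare what each variant URL declares, the canonical link can be read out of the page HTML. The following is a minimal sketch using only Python's standard library; `CanonicalLinkParser` and `extract_canonical` are names made up for this example, and fetching the HTML is left to the caller.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Record the href of the first <link rel="canonical"> element seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens and is case-insensitive.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr.get("href")


def extract_canonical(html):
    """Return the declared canonical URL, or None if the page declares none."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical
```

Running this over the HTML of the /wiki/ and /zh-sg/ URLs mentioned above would show whether the two declare different canonicals.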

The en:fjord part is the article title, which in this case includes an interwiki prefix. Interwiki titles generally redirect to their destination. This uses the same interwiki map that is used in wikitext or VisualEditor when creating links with [[foo:brackets]]. In page content, we output a link to the destination directly (no redirect). For the benefit of search, and for interwiki stacking, these are supported in URLs as well. For example, from nl.wiktionary you would use [[w:de:Foo]] to end up at de.wikipedia: the link first renders as nl.wikipedia.org/wiki/de:Foo, and the second step then redirects based on that wiki's local configuration and settings. Likewise, on nl.wiktionary itself you can use HTTP to resolve interwikis from that wiki's context.
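The stacking described above can be modelled as repeatedly stripping one prefix and switching to the wiki it maps to. The following is a toy sketch only: `TOY_INTERWIKI` is a hand-written, hypothetical table containing just the prefixes named in this task, keyed by host for simplicity. It is not the real interwiki map, and MediaWiki's actual title handling is not reproduced here.

```python
# Hypothetical per-wiki prefix tables; illustrative values only.
TOY_INTERWIKI = {
    "nl.wiktionary.org": {"w": "nl.wikipedia.org"},
    "nl.wikipedia.org": {"de": "de.wikipedia.org"},
    "zh.wikipedia.org": {"en": "en.wikipedia.org"},
}


def resolve_interwiki(host, title, table=TOY_INTERWIKI):
    """Return the list of (host, title) hops taken while resolving prefixes.

    Each hop consumes one prefix using the table of the wiki currently
    holding the title, which is what makes stacking depend on local settings.
    """
    hops = [(host, title)]
    while ":" in title:
        prefix, rest = title.split(":", 1)
        target = table.get(host, {}).get(prefix.lower())
        if target is None:
            break  # Not an interwiki prefix on this wiki; treat as a local title.
        host, title = target, rest
        hops.append((host, title))
    return hops
```

Under this toy table, `resolve_interwiki("nl.wiktionary.org", "w:de:Foo")` passes through nl.wikipedia.org with the title de:Foo before ending at de.wikipedia.org with the title Foo.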

Unlike "wiki" redirects between articles, these interwiki redirects are truly at the HTTP level, so there is no intermediary page from which we could serve an incorrect canonical link. https://zh.wikipedia.org/zh-cn/en:fjord HTTP-redirects to https://en.wikipedia.org/wiki/fjord, where the normal page view rendering then outputs its own canonical URL without knowing about or referring to the incoming URL.
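One way to confirm that no intermediary page is served is to walk the redirect chain one hop at a time and record each status code. The following is a minimal sketch: `trace_redirects` is a name made up for this example, and it takes a caller-supplied `fetch(url)` function returning `(status, location)`, so the loop can be tried against canned responses. The canned status codes in `EXAMPLE_RESPONSES` are placeholders, not observed values.

```python
from urllib.parse import urljoin

REDIRECT_STATUSES = {301, 302, 303, 307, 308}

# Placeholder responses for illustration; the status codes are assumptions.
EXAMPLE_RESPONSES = {
    "https://zh.wikipedia.org/zh-cn/en:fjord": (301, "https://en.wikipedia.org/wiki/fjord"),
    "https://en.wikipedia.org/wiki/fjord": (200, None),
}


def trace_redirects(url, fetch, max_hops=10):
    """Follow HTTP redirects manually and return [(url, status), ...].

    Stops at the first non-redirect response, at a redirect without a
    Location header, or after max_hops requests.
    """
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_STATUSES or not location:
            break
        url = urljoin(url, location)  # Location may be relative.
    return hops
```

With a real `fetch`, a chain in which every hop before the last is a 3xx response would be consistent with the point above: only the final URL returns HTML, so only it can declare a canonical.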

This falls under MediaWiki-Interwiki. This component is currently unowned. I don't think this component, or indeed anything in MediaWiki, is doing anything that would justify Google claiming a redirect as its preferred URL displayed in search results.

Separate from this, I do recall numerous past cases where non-canonical URLs were displayed in Google search results. These involved redirects, but also direct responses under non-canonical URLs (e.g. the m-dot subdomain, or unusual index.php query parameter formations). I didn't have access to Google Search Console then to know how those were perceived there, so this may be unrelated. In the past, the cause was pretty consistent: Google was limiting the scope of the search query and then displaying the "most"-canonical URL it knows of that matches the query.

For example, let's say Google indexes the link nl.wiktionary.org/wiki/w:de:Foobar (note that it would need to be hand-written by a person somewhere and then linked to from a page that Google indexes; we don't generate or supply this kind of URL ourselves). And then let's say there is no (high-ranking) page on nl.wiktionary.org about "Foobar". If I search worldwide, it will find the de.wikipedia.org/wiki/Foobar result and display its canonical URL. If I instead search with a Dutch-language device, or with site:wiktionary.org or site:nl.wiktionary.org, Google will still find that same canonical entry through one of its known aliases. But, for presumed UX/product reasons, it will display and link to the matching alias, so that the result makes sense for the reader.

Something similar might be going on with your "fjord" example.

As far as I can tell enwiki-Fjord was never a redirect for desktop UAs. […]
I don't know how widespread this is or how problematic it will be. […] What's special about the Fjord article? […], why isn't this happening to other pages […]

Indeed, Special:Log shows no past page renames, and no past page deletions under this title. Supposing a redirect did exist, what chain reaction do you suspect may have happened? I'm thinking about it the other way around: not redirects from here, but to here. Although by itself that "should" indeed not matter, as we have thousands of redirects to our canonical pages both within and across domains.

As for why it doesn't happen more often: if the interwiki redirect part of this URL is indeed playing a role somehow, then it would naturally be very limited, since it could only affect cases where such URLs are hand-written in the wild, then survive to be discovered and crawled by Google, and then trigger the unknown factors inside Google that cause this. That's easily 1% of 1% of 1% etc. I'd definitely expect more to exist, but they might not be so easy to find. The start of that 1% chain would likely be a Wikipedia editor writing on a talk page or external message board somewhere, linking to something very specific they're working on.