Make local_sites_with_dupe filter configurable and count duplicates
Closed, Resolved · Public · 5 Estimated Story Points

Description

Problem

When searching for files on a wiki, CirrusSearch constructs a query that targets not only the local wiki's index but also the commons index.
The reason is that files can be uploaded locally (generally for fair-use images, but also for privacy reasons on private wikis).

To avoid showing duplicate file entries in the search results (a file that has the same name on commons and on the local wiki), CirrusSearch filters out results from commons that are known to have a duplicate entry on the local wiki.
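Conceptually, the commons leg of that query carries a must_not clause on the local_sites_with_dupe field. A minimal sketch of the query shape as a PHP array (the search terms and wiki ID are illustrative; this is not CirrusSearch's actual query-building code):

    <?php
    // Illustrative only: the commons part of a file search excludes
    // documents already flagged as duplicated on the requesting wiki.
    $localWikiId = 'enwiki'; // example local wiki ID

    $commonsQuery = [
        'bool' => [
            'must' => [
                [ 'match' => [ 'text' => 'some search terms' ] ],
            ],
            'must_not' => [
                [ 'term' => [ 'local_sites_with_dupe' => $localWikiId ] ],
            ],
        ],
    ];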

This process requires that every wiki be able to trigger updates to the commons index.

This is not ideal, and it is unclear whether doing this at index time has value:

  • it only works if the local file is uploaded or updated after the file on commons
  • the dupe flag is not cleaned up when it is no longer necessary
  • across all the wikis, enwiki has 79555 identified dupes (only 0.08% of the files on commons), but only 23539 of those are actual duplicates; in other words, about 70% (1 − 23539/79555 ≈ 0.70) are false positives

Solution

Since we do not know how often this filter is effective, it remains unclear whether this behaviour is worth porting to the search update pipeline. Hence, we will disable the filter via a configurable flag and count the duplicates we see in result pages. We will monitor the results, as well as any community feedback, over a period of one month. Our hypothesis is that duplicate results in file searches are rare and can therefore be lived with. If we are proven wrong, we will turn the filter back on and think of alternative ways of maintaining the local_sites_with_dupe array, for example by using a periodic job.
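As a sketch, the switch could live in LocalSettings.php; the flag name below is hypothetical and stands in for whatever option the patch actually introduces:

    <?php
    // Hypothetical flag name, for illustration only: when false, the
    // local_sites_with_dupe filter is left out of the commons query and
    // duplicates are only counted in the result pages.
    $wgCirrusSearchApplyLocalSitesWithDupeFilter = false;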

AC:

  • a config option exists that turns off the use of the local_sites_with_dupe filter in Elasticsearch queries
  • a metric exists that indicates the number of duplicates in search results
  • local_sites_with_dupe is no longer maintained or updated
  • file duplicates are removed from the result set in a similar way to how missing revisions are filtered out
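For the duplicate metric, a minimal sketch using MediaWiki's buffered statsd factory; the metric key is an assumption, not necessarily what the patch uses:

    <?php
    use MediaWiki\MediaWikiServices;

    // Sketch: report how many commons hits were shadowed by a local file.
    // $duplicateCount would be computed during result extraction; the
    // metric key below is hypothetical.
    $duplicateCount = 2; // example value
    $stats = MediaWikiServices::getInstance()->getStatsdDataFactory();
    if ( $duplicateCount > 0 ) {
        $stats->updateCount( 'CirrusSearch.fulltext.file_duplicates', $duplicateCount );
    }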

Event Timeline

Restricted Application added a subscriber: Aklapper.
bking set the point value for this task to 5. · Jul 10 2023, 3:47 PM

@dcausse, would you have a preference as to where to implement the deduplication? I have identified two options so far:

  • a) inside \CirrusSearch\Search\BaseCirrusSearchResultSet::extractResults
  • b) in a dedicated subclass of \CirrusSearch\Search\BaseCirrusSearchResultSet, DeduplicatedCirrusSearchResultSet, used instead of the dynamically created instance in \CirrusSearch\Search\FullTextResultsType::transformElasticsearchResult (which carries the comment // Should we make this a concrete class? anyway)

I don't have strong preferences, but I would perhaps first try solutions that do not involve creating new classes. BaseCirrusSearchResultSet::extractResults is final and cannot be overridden, so if everything needed is available from there, option a could work. If some dependencies are missing, option b is probably better: adapt \CirrusSearch\Search\FullTextResultsType to do the deduplication in its transformElasticsearchResult method, as it might be easier to inject new dependencies there (injecting a new dependency into \CirrusSearch\Search\BaseCirrusSearchResultSet might be difficult because it is used in many different places).
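For reference, a minimal sketch of title-based deduplication as it could look in transformElasticsearchResult; the accessor names on the result objects are illustrative, not the actual CirrusSearch API:

    <?php
    // Sketch: keep the first result per (namespace, title) key; later
    // occurrences (e.g. a commons copy shadowed by a local file) are
    // counted as duplicates. Accessor names are hypothetical.
    function deduplicateResults( array $results ): array {
        $seen = [];
        $kept = [];
        $duplicates = 0;
        foreach ( $results as $result ) {
            $key = $result->getNamespace() . ':' . $result->getTitleText();
            if ( isset( $seen[$key] ) ) {
                $duplicates++;
                continue;
            }
            $seen[$key] = true;
            $kept[] = $result;
        }
        return [ $kept, $duplicates ];
    }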

Change 944269 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[mediawiki/extensions/CirrusSearch@master] Handle duplicate files in commons and local wiki.

https://gerrit.wikimedia.org/r/944269

Thanks, @dcausse! It took me a while to come up with a setup to test this, but I finally managed to. I started with a naive approach; could you provide some early comments, please?

pfischer renamed this task from "Replace the local_sites_with_dupe filter with deduplication on the result set" to "Make local_sites_with_dupe filter configurable and count duplicates". · Aug 10 2023, 2:41 PM
pfischer updated the task description.

Change 944269 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Count duplicate files in search results.

https://gerrit.wikimedia.org/r/944269

Change 952346 had a related patch set uploaded (by Peter Fischer; author: Peter Fischer):

[operations/mediawiki-config@master] Disable search result deduplication.

https://gerrit.wikimedia.org/r/952346

Change 952346 merged by jenkins-bot:

[operations/mediawiki-config@master] Disable search result deduplication.

https://gerrit.wikimedia.org/r/952346

Mentioned in SAL (#wikimedia-operations) [2023-08-30T07:03:10Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:952346|Disable search result deduplication. (T341227)]]

Mentioned in SAL (#wikimedia-operations) [2023-08-30T07:04:55Z] <ladsgroup@deploy1002> ladsgroup and pfischer: Backport for [[gerrit:952346|Disable search result deduplication. (T341227)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)

Mentioned in SAL (#wikimedia-operations) [2023-08-30T07:19:03Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:952346|Disable search result deduplication. (T341227)]] (duration: 15m 53s)

So, we're seeing ~300 duplicates per day against ~30M full-text search queries per day, i.e. roughly 0.001% of queries. It seems clear that we can ignore this for the time being and revisit if we have a good reason to in the future.