User Details
- User Since
- Jun 9 2015, 9:03 AM (475 w, 6 d)
- Availability
- Available
- IRC Nick
- dcausse
- LDAP User
- DCausse
- MediaWiki User
- DCausse (WMF) [ Global Accounts ]
Thu, Jul 18
The new code was deployed this morning and data is flowing properly into these new topics. We improved batching a bit to save some space via compression; I believe we have some room to increase some buffer sizes if we want to optimize for space even further.
I can't reproduce with the example given in the description: File:Nut_Grab.jpg (page id 29851242) is properly excluded when searching pageid:29851242 -deepcategory:"Animals with nuts". So I suspect that the problem might have been caused by the issues we had with dumps recently. @Prototyperspective, could you confirm, or possibly provide another example of a file that does not comply with the search query?
For reference, at the time of writing the list of categories identified by deepcategory:"Animals with nuts" is:
- Animals with nuts
- Animals eating nuts
- Animals eating peanuts
- Curculio (larval damage)
- Animals eating hazelnuts
- Animals eating walnuts
- Birds eating nuts
- Sciurus vulgaris eating walnuts
- Sciurus vulgaris eating hazelnuts
- Sciuridae eating peanuts
- Birds eating peanuts
- Sciurus carolinensis eating walnuts
- Curculio nucum (larva)
- Tamias striatus eating peanuts
- Sciurus vulgaris eating peanuts
- Sciurus carolinensis eating peanuts
- Tamias striatus fed by hand (EIC)
Wed, Jul 17
We are getting ready to deploy the new updater that will populate these new topics. @bking, could we have the topics created with the proper retention and partitioning settings? (We could also let the topics auto-create and adjust the retention after the fact using https://wikitech.wikimedia.org/wiki/Kafka/Administration#Alter_topic_retention_settings.) Thanks!
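For illustration, a minimal sketch of what explicit topic creation (and an after-the-fact retention change) could look like with the stock Kafka CLI tools. The topic name, broker host, partition count, and retention value below are placeholders, not the actual settings; the script only prints the commands so they can be reviewed before running:

```shell
# Placeholder values -- substitute the real topic names and settings.
TOPIC="eqiad.cirrussearch.update_pipeline.update"   # hypothetical topic name
BOOTSTRAP="kafka-main1001.eqiad.wmnet:9092"         # hypothetical broker
PARTITIONS=3
RETENTION_MS=$(( 7 * 24 * 3600 * 1000 ))            # 7 days, in milliseconds

# Create the topic explicitly instead of relying on auto-creation.
echo kafka-topics.sh --create \
  --bootstrap-server "$BOOTSTRAP" \
  --topic "$TOPIC" \
  --partitions "$PARTITIONS" \
  --replication-factor 3 \
  --config retention.ms="$RETENTION_MS"

# Or, if the topic was auto-created, adjust retention in place.
echo kafka-configs.sh --alter \
  --bootstrap-server "$BOOTSTRAP" \
  --entity-type topics --entity-name "$TOPIC" \
  --add-config retention.ms="$RETENTION_MS"
```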
Perhaps something to consider as well is fine-tuning MirrorMaker: I don't think that in the case of the wdqs updater we need the *.rdf-streaming-updater.mutation* topics replicated between the two kafka-main clusters.
Tue, Jul 16
I think that all 412 schemas are now properly indexed.
Wed, Jul 10
@Scott_French thanks for pinging us; indeed, the test that was running was using https://api-ro.discovery.wmnet, and this was unintentional (we rarely run such tests, and by re-using an old configuration I overlooked that it relied on api-ro). I have stopped the test (and removed this old config); there should be no use cases from our end still hitting api-ro.
Fri, Jul 5
@Lucas_Werkmeister_WMDE thanks for the fix! I manually re-indexed this item with our new (WIP) tooling; it would have been fixed automatically by the cleanup process, but that could have taken up to two weeks in the worst case.
Thu, Jul 4
Should T192361 be re-opened and added as a subtask here?
Wed, Jul 3
It seems that \EntitySchema\Wikibase\DataValues\EntitySchemaValue::getType() is returning EntityIdValue::getType(), and thus some code considers it an EntityIdValue (`VT:wikibase-entityid`); here WikibaseCirrusSearch is calling https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/8b3312396b4b8b91790d7b33c4703fb31bd290d8/repo/WikibaseRepo.datatypes.php#421 with an EntitySchemaValue.
The process is unable to render this document: https://www.wikidata.org/w/api.php?action=query&cbbuilders=content|links&format=json&format=json&formatversion=2&pageids=120965176&prop=cirrusbuilddoc fails with a caught exception of type TypeError:
It seems that https://packages.sury.org/php/dists/buster/ has recently started returning a 403.
Mon, Jul 1
Moving a page from one namespace to another should now properly clean up the search index; existing phantom redirects might still be around for a couple of weeks while the automated cleanup process takes care of them. Please let me know if you see new instances of this problem in the future, and sorry for the inconvenience.
Hi, I'm having issues with a Flink job running in staging that fails to deploy with the following error:
>>> Status | Error | DEPLOYED | {"type":"org.apache.flink.kubernetes.operator.exception.DeploymentFailedException","message":"pods \"flink-app-consumer-search-784bc9fd87-9n862\" is forbidden: violates PodSecurity \"restricted:latest\": allowPrivilegeEscalation != false (container \"flink-main-container\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container \"flink-main-container\" must set securityContext.capabilities.drop=[\"ALL\"]), runAsNonRoot != true (pod or container \"flink-main-container\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"flink-main-container\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")","additionalMetadata":{"reason":"FailedCreate"},"throwableList":[]}
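For reference, each violation listed in the error maps directly onto a securityContext field. The sketch below shows the generic Kubernetes pod-spec shape those fields take (the field names come straight from the error message; the surrounding structure is illustrative, not our actual flink-app chart values):

```yaml
# Fields required by PodSecurity "restricted:latest", per the error above.
spec:
  securityContext:
    runAsNonRoot: true          # may also be set per container
    seccompProfile:
      type: RuntimeDefault      # or "Localhost"
  containers:
    - name: flink-main-container
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```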
The talk page is indeed ranked very low; it is quite recent (created in May 2024) and has 0 incoming_links, which puts it far behind https://he.wikipedia.org/wiki/שיחת_משתמש:קיפודנחש/ארכיון_31_עד_מאי_2024, which has more than 3k incoming links. CirrusSearch indeed does not prioritize master pages over their subpages; if we want to do this, it would have to be carefully evaluated, because one thing we can't do is rank a subpage lower solely relative to its master page: all subpages would be down-ranked.
Fri, Jun 28
I added some logging info to get a sense of the numbers; moving this to waiting while we gather a bit more info.
Tagging serviceops for help with envoy, to see if it can be used as a load balancer for the internal requests made from one blazegraph cluster to another without using lvs.
@Vgutierrez thanks for the help!
Wed, Jun 26
In the meantime, an ugly workaround is to search both the EntitySchema and EntitySchema talk namespaces but filter on the content model using the keyword contentmodel:EntitySchema: https://www.wikidata.org/w/index.php?search=contentmodel%3AEntitySchema+intitle%3A%2FE%2F&title=Special:Search&profile=advanced&fulltext=1&ns640=1&ns641=1 .
Yes, this is sadly kind of expected (I should have told you about this on the config patch, sorry). The cleanup process had already started moving pages around while the entity schema namespace was considered non-content, so these pages are no longer findable now that the namespace has been brought back into the content namespaces. I need to reindex these pages to make search work again, but sadly our tooling is not working as expected, and I need to deploy https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/143 first to be able to fix the index. If this is causing major disruption I can mess with the index by hand, but I'd rather not do that unless strictly required. Sorry for the inconvenience!
Another instance of this issue was reported on wiki:
@dcausse (WMF): fwiw, I have 6 items updated on the 19 & 20 June - https://w.wiki/ASz6 - for which WDQS has not been updated ... on the production WDQS, not test. Only one of them was edited within the June 19 between 03:00 and 15:30 UTC window, afaics. It's not a problem for me, more of a FYI. --Tagishsimon (talk) 16:01, 21 June 2024 (UTC)
Surprisingly, E378, one of the schemas that is not indexed, appears to be indexed in the "content" index of wikidata, but AFAICT 640 is not a content namespace.
It might, however, have been considered a content namespace a few weeks ago.
I wonder if T363153, and esp. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EntitySchema/+/1040113/, might be the reason for this change. When a namespace with existing documents has its search characteristics changed (wgContentNamespaces and/or wgNamespacesToBeSearchedDefault), the indexed docs are not moved automatically from one index to another; instead we rely on the saneitizer to slowly fix the inconsistencies. This is what might have happened here, and would explain why the schemas suddenly disappeared and then got re-indexed slowly over time.
The above reindex did not work as I expected; the attached patch should remedy this by allowing non-indexed pages to be re-indexed properly when manually re-indexing a whole namespace.
The root cause as to why these schemas were not indexed in the first place is yet to be investigated.
Tue, Jun 25
There are currently 354 pages indexed in the entity schema namespace, while the allpages API suggests that there are 397 schemas.
Jun 20 2024
After discussing this with Erik, we have a rough plan:
- add a new lvs endpoint dedicated to internal federation, targeting a new port opened by nginx
- add a new port in the nginx config for which we add the X-Disable-Throttling and x-bigdata-read-only headers to the request forwarded to blazegraph
- use the blazegraph service alias feature to map https://query-main.wikidata.org/sparql -> https://wdqs-main.discovery.wmnet:$NEW_PORT/sparql
- adapt ProxiedHttpConnectionFactory to allow the bypass of *.wmnet hostnames
Yes, this is my understanding as well; the undesirable effects I could see are:
- someone tagging an entity with a P31 that points to a scholarly article
- someone introducing a scholarly article into the subclass-of chain, thus making the sparql property path ineffective
I'm not knowledgeable enough to say, but I suspect these problems should be quite rare, and are perhaps already identified via other means?
Jun 18 2024
We need 4 weeks to be able to backfill after an import: counting from the time the wikidata dump process starts, through the time required to shuffle the data around (compression, hdfs-rsync to HDFS), until the end of the import into blazegraph. See the initial lag column in T241128 for past import times; perhaps 3 weeks would be manageable, but we went with 4 weeks to have extra room.
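To make the budget concrete, here is a toy sketch of the window; the per-stage durations below are illustrative placeholders, not measured values (see the initial lag column in T241128 for real numbers):

```shell
# Illustrative placeholder durations, in days -- not measured values.
DUMP_DAYS=9      # wikidata dump process
SHUFFLE_DAYS=4   # compression + hdfs-rsync to HDFS
IMPORT_DAYS=10   # import into blazegraph

TOTAL_DAYS=$(( DUMP_DAYS + SHUFFLE_DAYS + IMPORT_DAYS ))
BUDGET_DAYS=28   # the 4-week window

echo "total: ${TOTAL_DAYS} days, slack: $(( BUDGET_DAYS - TOTAL_DAYS )) days"
```

With numbers in this ballpark, a 3-week (21-day) budget leaves almost no slack, which is why 4 weeks was chosen.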
Jun 14 2024
Unsure if feasible, but perhaps manually flagging a list of safe regexes & very popular regexes could help reduce the number of requests to shellbox?
I did some testing, and sadly when a wdqs node makes a query to https://query.wikidata.org it hits varnish again:
from wdqs1020 to https://query.wikidata.org (echo 'SELECT ?test_dcausse { ?test_dcausse ?p ?o . } LIMIT 1' | curl -f -s --data-urlencode query@- https://query.wikidata.org/sparql?format=json)
"x-request-id": "b34bb930-ef85-4b23-956e-7dcb11f0f7ec", "content-length": "99", "x-forwarded-proto": "http", "x-client-port": "40256", "x-bigdata-max-query-millis": "60000", "x-wmf-nocookies": "1", "x-client-ip": "2620:0:861:10a:10:64:131:24", "x-varnish": "800949377", "x-forwarded-for": "2620:0:861:10a:10:64:131:24\\, 10.64.0.79\\, 2620:0:861:10a:10:64:131:24", "x-requestctl": "", "x-cdis": "pass", "accept": "*/*", "x-real-ip": "2620:0:861:10a:10:64:131:24", "via-nginx": "1", "x-bigdata-read-only": "yes", "host": "query.wikidata.org", "content-type": "application/x-www-form-urlencoded", "connection": "close", "x-envoy-expected-rq-timeout-ms": "65000", "x-connection-properties": "H2=1; SSR=0; SSL=TLSv1.3; C=TLS_AES_256_GCM_SHA384; EC=UNKNOWN;", "user-agent": "curl/7.74.0"
Jun 13 2024
@RKemper I think we should now do a full import to measure the time it takes, in order to have a rough estimate to answer T367409.
To have a full run we need to re-enable the updater on wdqs2023 (which I think will be done with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042965)
The command to run should be (using the latest dumps):
cookbook sre.wdqs.data-reload \
  --task-id T349069 \
  --reason "Test wdqs reload based on HDFS" \
  --reload-data wikidata_full \
  --from-hdfs hdfs:///wmf/data/discovery/wikidata/munged_n3_dump/wikidata/full/20240603/ \
  --stat-host stat1009.eqiad.wmnet \
  wdqs2023.codfw.wmnet
Jun 11 2024
Triggered a reindex of all the lexemes using https://gitlab.wikimedia.org/repos/search-platform/cirrus-rerender; it might take about 3 hours to complete.
Jun 6 2024
@RKemper for testing I created a smaller folder at hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/; it has only two chunks, so I hope it helps iterate a bit faster on this. The command should become:
cookbook sre.wdqs.data-reload \
  --task-id T349069 \
  --reason "Test wdqs reload based on HDFS" \
  --reload-data wikidata_full \
  --from-hdfs hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ \
  --stat-host stat1009.eqiad.wmnet \
  wdqs2023.codfw.wmnet
Jun 3 2024
Yes (all the images under docker-registry.wikimedia.org/wikimedia/wikidata-query-flink-rdf-streaming-updater should no longer be used and can be safely removed if needed)
Sorry to see this happening again, it is probable that we missed some edge cases when deploying T317045.
May 30 2024
Hi, we might have a use-case related to "other dumps" that might benefit from the Dumps 2.0 infrastructure, I filed T366248 with some details about it.
May 29 2024
The system should now index lexemes properly.
We still have to reindex all the lexemes to fix the ones created/edited before the fix was applied.
@BTullis thanks! Categories are reloaded via a cronjob on all WDQS machines; the job is about to run in 30 mins.
May 28 2024
Output with:
cirrus = (spark.table("discovery.cirrus_index").where('cirrus_replica="codfw" AND snapshot="20240428"'))
The search fields specific to Lexemes are currently ignored causing this NOTICE but also preventing lexemes from being searchable (esp. the new ones).
The schemas should be adapted to support these fields and the lexemes will have to be re-indexed.
@achou, except for expert search users explicitly searching for topics (which I suspect is rare), the Growth team is the only team using this data in a user-facing product. It is hard to tell what the impact would be for them, but I suspect that if only a few (<100) are lost it would hardly impact anything. If you suspect that more might be lost, perhaps having duplicates is better, if that is an option for you.