User Details
- User Since
- Nov 12 2020, 6:16 PM (194 w, 2 d)
- LDAP User
- Fabian Kaelin
- MediaWiki User
- FKaelin (WMF)
Mon, Jul 8
Finally got around to this. Thank you @YLiou_WMF for the data file, this looks good to me in general.
Jun 19 2024
I confirm that this request is legit, also adding @XiaoXiao-WMF as manager.
Jun 17 2024
Thanks. Is this now using AQS 2? It has been a moment; can you point to a current/good example job that writes to an AQS Cassandra dataset from airflow?
Summary of developments:
- implementation of end-to-end ML training workflows for
- revert risk model (done)
- add-a-link model (in progress)
- airflow dags to execute the pipelines (scheduled for retraining pipelines, manual trigger for development)
- discussions for how new model versions can be deployed
- for now, continue with manual process established by ML platform
- T366528 to track automation, as a manual process will not scale as research puts more training pipelines into production
- guide for contributing to repository containing training workflows
- future work in collab with ML platform
- GPU support
- enable using new ML boxes once they become available
- use the GPUs available on existing infra in production airflow jobs (maybe via a sprint with ML Platform that we didn't get to in Q4 FY24)
- standards for ML training
- there is a style guide and existing ML training pipelines to base new work on, but we refrained from introducing framework-like code or abstractions - instead we used the existing infrastructure.
- led by the ML Platform team, we should revisit this once the new ML boxes become available, as there will be a need for new tooling at that point
- related: the current tooling for end-to-end ML training workflows is not convenient for iterative research/development (setup/deployment is error prone and too involved for one-off use cases), research engineering has a goal in FY25 to improve researcher tooling
Weekly updates
- initial review on the MR
- meeting with Aisha/Martin to discuss MR and how to approach remaining work
Weekly update
- pipelines are merged
- airflow dags are deployed, final testing in progress
Jun 13 2024
In T358366#9831389 I asked if other fields could be added to the schema; in particular the diff between two revisions, which is frequently used by research (wikidiff). I agree with @xcollazo's concerns, but this led me to think about the implications of computing the diff separately in regards to reconciliation.
- the diff is expensive to compute, as the parent revision might be at any moment in the past and is not necessarily the most recent previous revision. The wikidiff pipeline batches jobs by page (i.e. a batch contains the full history of the pages in the batch).
- the full diff dataset is computed for each snapshot, following the "snapshot pattern". However, it would not be significantly cheaper to make this pipeline incremental (e.g. only append diffs for the new month of revisions), as any revision in the past can be a parent revision, so the join is still expensive
- so how would one go about "enriching" wmf_dumps.wikitext_raw_rc2 with a diff column? the job could filter the full history for only the pages changed in that hour (broadcast join) and then do the self join, but that would still require a full pass over the data which seems expensive. This certainly is solvable, e.g. one could decrease the update interval, but it is tempting to instead implement the diff as a streaming "enrichment" pipeline.
- this would look similar to the existing "page change" job, e.g. query mediawiki for the current and parent revision text and compute the diff (maybe with a cache for the previous wikitext for each page which is the most common parent revision)
- however, this leads to the question of correctness/reconciliation: since this diff dataset would not be derived from wmf_dumps.wikitext_raw_rc2, it would require its own reconciliation mechanism. Which would be an argument in favour of the "Is wmf_dumps.wikitext_raw the right 'place' to check whether we are missing events or not? Shouldn't we do these checks upstream?" point raised above.
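For illustration, the core per-revision computation behind the streaming "enrichment" idea above is a unified diff between the parent and current revision text. The actual wikidiff pipeline runs as a distributed Spark job; this is just a minimal sketch using Python's stdlib difflib, with made-up revision texts.

```python
import difflib


def revision_diff(parent_text: str, child_text: str) -> str:
    """Unified diff between a parent revision and a child revision."""
    return "\n".join(
        difflib.unified_diff(
            parent_text.splitlines(),
            child_text.splitlines(),
            fromfile="parent",
            tofile="child",
            lineterm="",
        )
    )


# Hypothetical revision texts for illustration.
parent = "The quick brown fox.\nIt jumps over the dog."
child = "The quick brown fox.\nIt jumps over the lazy dog."
print(revision_diff(parent, child))
```

A streaming job would apply this per event, possibly caching the previous wikitext per page since that is the most common parent revision.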
Jun 3 2024
May 31 2024
Indeed, different versions of the database seem to be present on the cluster hosts.
May 23 2024
- This pipeline is implemented (MR)
- Remaining work: schedule an airflow dag to regularly compute new topics dataset
Closing this task as resolved as the storage request was handled.
Closing this as resolved. After more discussion and some experimentation, it was decided that doing batch inference within the distributed jobs (e.g. by broadcasting the model to the workers) is preferable. Pasting the comments from the relevant slack thread here.
May 22 2024
More on "Availability" / time travel. This question is not easy to answer, as it also relates to the current snapshot approach, which forces a pipeline to reason about the past in a rather limiting way. Aka "do you want the data as it looked today, or 1 month ago, or 2 months ago?", and finding out if/how the past data is different is not trivial and rarely practical. Generally pipelines either
- offload dealing with the snapshot semantics to the consumers by producing snapshotted datasets themselves
- implement a pseudo-incremental dataset by disregarding the "new past" and any changes it might contain.
For this reason I find it hard to define requirements for time travel, it is basically a new capability (for example, the replacement for "mediawiki_wikitext_current" could be a transformation of a time travel query). Starting with 90 days should be sufficient as it is strictly an improvement to what one can do now.
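As a sketch of what this new capability would look like: a time-travel read on an Iceberg table such as wmf_dumps.wikitext_raw could use Spark SQL's `FOR TIMESTAMP AS OF` clause. The table name comes from this thread, but the query below is a hypothetical illustration (column names included), not a tested production query.

```python
# Sketch: an Iceberg time-travel query as it might be issued via spark.sql().
# Whether the query succeeds depends on the snapshot retention window
# (e.g. the 90 days discussed above) covering the requested timestamp.
def time_travel_query(table: str, timestamp: str) -> str:
    # revision_id / revision_text are illustrative column names.
    return (
        f"SELECT revision_id, revision_text "
        f"FROM {table} FOR TIMESTAMP AS OF '{timestamp}'"
    )


query = time_travel_query("wmf_dumps.wikitext_raw", "2024-03-01 00:00:00")
# spark.sql(query)  # would run on the cluster; not executed here
```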
- Availability: Research mostly treats the current history dumps as a pseudo-incremental dataset - i.e. pipelines that depend on the history wait for a new snapshot to be released and then only use the "new" data from that snapshot (aka the revisions created in the month since the last snapshot was generated). This means that wmf_dumps.wikitext_raw allows significantly reducing the latency - roughly from 1 month (wait for the snapshot interval to trigger) + 12 days (dump processing) to a few hours.
- Schema: As the schemas are almost identical, my main question is about extending the existing dataset in ways that depend on the snapshot mechanism. For example research has a number of use cases that involve comparing the revision text with the parent revision text. This involves a computationally expensive self join and some pitfalls, so there is a wikidiff job that creates (yet another) version of the wikitext history that includes a column with the unified diff between the current and parent revision.
- Could we add the diff to the proposed wmf_dumps.wikitext_raw? As the parent revision could be at any point in the past, this would likely require the equivalent of wmf.mediawiki_wikitext_current to be available when new revisions are ingested into the dataset.
- More generally, what is the replacement for the wmf.mediawiki_wikitext_current?
- Data quality: the discussion around correctness of the events data T120242 also applies in this context. For research in particular, many use cases don't have high requirements (e.g. for training datasets for ML, or for metrics datasets that involve models that can also be "incorrect"), and we could/would migrate existing jobs to the new dumps table once it is available/supported in prod.
Apr 30 2024
I am closing this as done - a summary:
This task requires design/implementation. Given that the current implementation is stable, I am moving this task to the freezer until there is a more urgent need for an incremental dataset.
Update on the use of gitlab issues:
- Research doesn't use them for team internal planning, work is tracked in Phabricator.
- Some researchers use gitlab issues for managing tasks with external collaborators (e.g. outreachy internships) as it is more convenient than depending on another tool.
Apr 16 2024
Closing this. Deploying on CloudVPS is supported, blubber integration to be done when a kubernetes deploy is needed.
Done - code
Removing due date and moving to backlog to prioritize.
Apr 3 2024
@Pablo thanks for flagging - there was indeed an issue with the wikidiff table: it is an external hive table, and while the required data was on hdfs and triggered the risk observatory dag, the hive table itself was not being correctly updated, so no data was ingested. This is fixed now, and the dashboard now shows data through Feb 24.
Mar 21 2024
Pasting this reply from a slack thread for context
Mar 5 2024
Mar 4 2024
Weekly updates
- Interesting development with the ML team: there is a conversation with a European HPC infra provider about getting compute resources, and research projects are good candidates. Naturally this is relevant to this cloud GPU initiative, and research is very interested.
Feb 29 2024
These directories can be removed both on the stat clients and hdfs. Thanks!
Feb 27 2024
This is fixed (MR) and the data is available.
Feb 12 2024
Weekly updates
- Trained the simplification model with a 3 billion parameter base model (flan-t5-xl) on a single H100 (80GB). Results look promising.
- Training for 2 epochs (~10h), running inference on test datasets (~6h), and downloading model weights: total cost ~$50
- The fine-tuned model weights are on stat1008. Validated that inference on the currently available GPU in the WMF infra works (it is slow).
Feb 6 2024
Feb 5 2024
The cultural geographical gap data is now in production, aggregated at the level of the WMF regions. The gap name is geography_cultural_wmf_region, e.g. see here; the documentation is updated, as is the example intersections notebook.
For completeness: the datasets are also documented for the hive tables (which are equivalent to the published datasets) that are only available internally; see datahub (SSO login required)
This is done: Datasets.
Weekly updates
- initial experiments with lambda labs, using text simplification as use case (T354653)
- tested with A100 (40GB) and H100 (80GB) to validate approach and get an estimate of the cost for fine-tuning runs.
- for a model size that can be trained on WMF infra (T5 large, 700M params), 1 epoch takes ~24h on WMF infra. On lambda labs 1 epoch costs ~$6 (i.e. time depends on hardware: ~4h on an A100, ~2h on an H100).
- next up: use a model (3B param model) that can't currently be fine-tuned using WMF infra, but can be served using WMF infra.
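The cost figures above follow from simple arithmetic: epochs × hours per epoch × hourly GPU rate. The rate below is an illustrative placeholder, not quoted lambda labs pricing.

```python
def finetune_cost(epochs: int, hours_per_epoch: float, usd_per_hour: float) -> float:
    """Estimated fine-tuning cost: wall-clock GPU hours times the hourly rate."""
    return epochs * hours_per_epoch * usd_per_hour


# Illustrative: an H100 at ~$3/h finishing an epoch in ~2h gives ~$6 per epoch,
# consistent with the estimate above.
print(finetune_cost(epochs=1, hours_per_epoch=2.0, usd_per_hour=3.0))  # 6.0
```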
Jan 29 2024
Weekly updates
Jan 25 2024
Also for reference, at some point I created a template superset dashboard which mirrors the content_gap_metric hive tables - here https://superset.wikimedia.org/superset/dashboard/472, that is just a draft with example charts.
The issue seems to be that the superset UI for these queries can't render nested parquet structures, e.g. the metrics column contains a set of scalar columns, and the quantiles are nested structs themselves. The query itself works, but the UI can't render the result as is. If you formulate the query in a way that doesn't contain nested structs, it works, for example:
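The example query referred to above did not survive in this copy; the following is a hypothetical reconstruction based on the content_gap_metrics.by_category schema shown elsewhere in this thread. Dot notation selects scalar fields out of the nested structs, so the result set contains no nested columns for the UI to render; the field name `article_count` is illustrative, not a confirmed schema field.

```python
# Hypothetical flattened query: every selected column is a scalar,
# so Superset (or spark.sql) can render the result without nested structs.
flattened_query = """
SELECT wiki_db,
       category,
       metrics.article_count AS article_count,
       time_bucket
FROM content_gap_metrics.by_category
WHERE content_gap = 'geography_cultural_region'
"""
# spark.sql(flattened_query)  # would run on the cluster; not executed here
```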
Jan 24 2024
This work is done with this MR, which migrated the KG pipeline to the canonical_data.countries table; that table now includes the wikidata qid of the country, which allowed replacing the base region mapping file. For the cultural gap in particular, the "re-mapping" of some territories not in the canonical countries table was retained to expand the coverage of the gap.
Example dataset for the cultural geographic gap (aggregated for wmf regions) for review: https://analytics.wikimedia.org/published/datasets/one-off/fab/content_gap/ . The code is merged and the new gap will be published with the next scheduled run (same format as the linked file above); if needed we can also easily re-run the previous pipeline to have the data sooner.
Somehow I accidentally closed this.
Jan 23 2024
Jan 18 2024
pa = spark.table("wmf.pageview_actor").where("year=2024 and month=1 and day=18 and hour=16")
prefetch_fields = ['prefetch_sec_purpose', 'prefetch_purpose', 'prefetch_x_moz']
cols = [F.col("x_analytics_map").getItem(f).isNotNull().alias(f) for f in prefetch_fields]
pa.groupBy(*cols).count().orderBy("count", ascending=False).show(1000, truncate=False)
Jan 17 2024
The content gaps for the geography gap using the cultural model are available on hive:
spark.table("content_gap_metrics.by_category").where("content_gap='geography_cultural_region'").show()

+-------+--------------------+--------------------+--------------------+--------------------+-----------+
|wiki_db|            category|             metrics|           quantiles|         content_gap|time_bucket|
+-------+--------------------+--------------------+--------------------+--------------------+-----------+
| frwiki|         Afghanistan|{1, 499592, 434.8...|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
| frwiki|             Albania|{13, 551487, 363....|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
| frwiki|             Algeria|{13, 5211013, 503...|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
+-------+--------------------+--------------------+--------------------+--------------------+-----------+
Thanks for the updates @dr0ptp4kt, and nice that you are able to reproduce such a google proxy request.
Jan 15 2024
TrainWing as originally planned will not be built, instead it will incrementally be built within existing data engineering infrastructure. As such, I mark this task as invalid.
This work is completed, the pipelines that were added:
- Risk observatory pipeline (airflow dag)
- Wikidiff pipeline (airflow dag)
- Article embeddings pipeline
Dec 18 2023
Dec 8 2023
It so happens I sniped myself into looking into this after noticing a lot of Google IPs in trending streaming pages. Here are pyspark snippets with more results.
Nov 21 2023
@leila, should the redundant meta page be deleted, or redirected?
This is done with Embeddings At Scale and Evaluation of similarity search solutions
Nov 20 2023
@Jelto Unfortunately I don't know either how these logs looked, and I am not familiar with npm/js. Looking at the pipeline output, it says only 1 file was tested. But eslint is for javascript, and there is only a single js file in this project. From my perspective this is a good start; we can refine this as needed in the future.
Nov 16 2023
@isarantopoulos the Python 3.8 MR is merged, and here is the MR for the xgboost bump.
Nov 15 2023
Nov 8 2023
Nov 7 2023
This is done