
fkaelin
User

Projects

Today

  • Clear sailing ahead.

Tomorrow

  • Clear sailing ahead.

Monday

  • Clear sailing ahead.

User Details

User Since
Nov 12 2020, 6:16 PM (194 w, 2 d)
Availability
Available
LDAP User
Fabian Kaelin
MediaWiki User
FKaelin (WMF) [ Global Accounts ]

Recent Activity

Tue, Jul 23

fkaelin closed T349755: Training pipeline for Revert Risk Language Agnostic (RRLA) model as Resolved.

This is resolved, including the training of the model. Code: pipeline / dag

Tue, Jul 23, 10:07 AM · Knowledge-Integrity, Research
fkaelin closed T349755: Training pipeline for Revert Risk Language Agnostic (RRLA) model, a subtask of T314384: Develop a ML-based service to predict reverts on Wikipedia(s), as Resolved.
Tue, Jul 23, 10:07 AM · Machine-Learning-Team, Research, Epic

Mon, Jul 8

fkaelin added a comment to T354958: Additional data release - aggregated survey form.

Finally got around to this. Thank you @YLiou_WMF for the data file, this looks good to me in general.

Mon, Jul 8, 1:11 PM · Research

Jun 19 2024

fkaelin updated subscribers of T367757: Request to add mnz to analytics-research-admins.

I confirm that this request is legit, also adding @XiaoXiao-WMF as manager.

Jun 19 2024, 3:07 PM · Patch-For-Review, SRE, SRE-Access-Requests

Jun 17 2024

fkaelin added a comment to T340494: Create keyspace and table for Knowledge Gaps.

Thanks. Is this now using AQS 2? It has been a while; can you point to a current/good example job that writes to an AQS cassandra dataset from airflow?

Jun 17 2024, 4:10 PM · Cassandra, Data-Engineering
fkaelin added a comment to T351009: Develop an ML training workflow for ongoing work.

Summary of developments:

  • implementation of end-to-end ml training workflows for
  • airflow dags to execute pipelines (scheduled for retraining pipelines, manual trigger for development)
  • discussions for how new model versions can be deployed
    • for now, continue with manual process established by ML platform
    • T366528 to track automation, as a manual process will not scale as research puts more training pipelines into production
  • guide for contributing to repository containing training workflows
  • future work in collab with ML platform
    • GPU support
      • enable using new ML boxes once they become available
      • use gpu available on existing infra in production airflow job (maybe with a sprint with ML platform that we didn't get to in Q4 FY24)
    • standards for ML training
      • there is a style guide and existing ml training pipelines to base new work on, but we refrained from introducing "abstractions/framework"-like code or language; instead we used the existing infrastructure.
      • led by the ML Platform team, we should revisit this once the new ML boxes become available, as there will be a need for new tooling at that point
        • related: the current tooling for end-to-end ML training workflows is not convenient for iterative research/development (setup/deployment is error prone and too involved for one-off use cases); research engineering has a goal in FY25 to improve researcher tooling
Jun 17 2024, 2:15 PM · Research-engineering, Research (FY2023-24-Research-April-June)
fkaelin added a comment to T361929: [Research Engineering Request] Building end-to-end training pipeline for the add-a-link model.

Weekly updates

  • initial review on the MR
  • meeting with Aisha/Martin to discuss MR and how to approach remaining work
Jun 17 2024, 1:37 PM · Research (FY2024-25-Research-July-September), Research-engineering
fkaelin added a comment to T357316: Develop pipelines for research datasets - Q3/Q4.

Weekly update

  • pipelines are merged
  • airflow dags are deployed, final testing in progress
Jun 17 2024, 1:37 PM · Research (FY2024-25-Research-July-September), Research-engineering

Jun 13 2024

fkaelin added a comment to T358373: [Dumps 2] Reconcillation mechanism to detect and fetch missing/mismatched revisions.

In T358366#9831389 I asked if other fields could be added to the schema; in particular the diff between two revisions, which is frequently used by research (wikidiff). I agree with @xcollazo's concerns, but this led me to think about the implications of computing the diff separately with regard to reconciliation.

  • the diff is expensive to compute, as the parent revision might be at any moment in the past and is not necessarily the most recent previous revision. The wikidiff pipeline batches jobs by page (i.e. a batch contains the full history of the pages in the batch).
  • the full diff dataset is computed for each snapshot to follow the "snapshot pattern". However, it is not significantly cheaper to make this pipeline incremental (e.g. only append diffs for the new month of revisions), as any revision in the past can be a parent revision, so the join is still expensive
  • so how would one go about "enriching" wmf_dumps.wikitext_raw_rc2 with a diff column? the job could filter the full history for only the pages changed in that hour (broadcast join) and then do the self join, but that would still require a full pass over the data which seems expensive. This certainly is solvable, e.g. one could decrease the update interval, but it is tempting to instead implement the diff as a streaming "enrichment" pipeline.
  • this would look similar to the existing "page change" job, e.g. query mediawiki for the current and parent revision text and compute the diff (maybe with a cache for the previous wikitext for each page which is the most common parent revision)
  • however, this leads to the question of correctness/reconciliation, since this diff dataset would not be derived from wmf_dumps.wikitext_raw_rc2 and would thus require its own reconciliation mechanism? Which would be an argument in favour of the "Is wmf_dumps.wikitext_raw the right 'place' to check whether we are missing events or not? Shouldn't we do these checks upstream?" point raised above.
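
To illustrate why the parent lookup needs the full page history, here is a minimal sketch in plain Python. It is not the wikidiff pipeline; the keys rev_id, parent_id and text are hypothetical, and difflib stands in for whatever diff implementation the real job uses.

```python
import difflib


def diff_to_parent(revisions):
    """Return {rev_id: unified diff against the parent revision's text}.

    `revisions` is the full history of one page as a list of dicts with
    hypothetical keys rev_id, parent_id and text. The full history is needed
    because the parent can be any earlier revision.
    """
    text_by_id = {r["rev_id"]: r["text"] for r in revisions}
    diffs = {}
    for rev in revisions:
        # A missing parent (first revision of a page) is diffed against empty text.
        parent_text = text_by_id.get(rev["parent_id"], "")
        diff_lines = difflib.unified_diff(
            parent_text.splitlines(),
            rev["text"].splitlines(),
            lineterm="",
        )
        diffs[rev["rev_id"]] = "\n".join(diff_lines)
    return diffs


history = [
    {"rev_id": 1, "parent_id": None, "text": "a\nb"},
    {"rev_id": 2, "parent_id": 1, "text": "a\nb\nc"},
    # The parent of this revision is not the most recent previous revision.
    {"rev_id": 3, "parent_id": 1, "text": "a"},
]
diffs = diff_to_parent(history)
```

Revision 3 is diffed against revision 1, not revision 2, which is the case that makes an incremental join expensive.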
Jun 13 2024, 6:58 PM · Patch-For-Review, Dumps 2.0 (Kanban Board)
fkaelin created T367446: Consolidate duplicated configuration/constants.
Jun 13 2024, 4:48 PM · Research-engineering, Research

Jun 3 2024

fkaelin created T366528: Deployment of model updates .
Jun 3 2024, 7:47 PM · Research-engineering, Machine-Learning-Team, Research

May 31 2024

fkaelin added a comment to T366369: MaxMind seems to be mapping the same IP to different countries.

Indeed, different versions of the database seem to be present on cluster hosts.

May 31 2024, 4:44 PM · Data-Engineering

May 23 2024

fkaelin assigned T351118: [Research Engineering Request] Produce regular snapshots of all Wikipedia article topics to MunizaA.
  • This pipeline is implemented (MR)
  • Remaining work: schedule an airflow dag to regularly compute new topics dataset
May 23 2024, 6:57 PM · Research-engineering, Research
fkaelin claimed T354958: Additional data release - aggregated survey form.
May 23 2024, 6:54 PM · Research
fkaelin closed T294380: Storage request for datasets published by research team as Resolved.

Closing this task as resolved as the storage request was handled.

May 23 2024, 6:53 PM · SRE-swift-storage
fkaelin closed T304425: Test LiftWing API/Predictions from Hadoop as Resolved.

Closing this as resolved. After more discussion and some experimentation, it was decided that doing batch inference within the distributed jobs (e.g. by broadcasting the model to the workers) is preferable. Pasting the comments from the relevant slack thread here.
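
A rough sketch of what batch inference inside a distributed job can look like, assuming a model object with a `predict(list_of_rows)` method. The function and the stand-in model are illustrative names, not code from the task; the per-partition function is plain Python so it can be checked without a cluster.

```python
from itertools import islice


def predict_partition(rows, model, batch_size=2):
    """Yield (row, prediction) pairs for one partition, scoring rows in batches.

    `model` is any object with a `predict(list_of_rows)` method. On a cluster
    it would be the value of a broadcast variable, so each worker holds one copy.
    """
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        for row, prediction in zip(batch, model.predict(batch)):
            yield row, prediction


class LengthModel:
    """Stand-in model used only for illustration: predicts the text length."""

    def predict(self, batch):
        return [len(text) for text in batch]


# Local check without a cluster. With Spark, one might instead write:
#   bc_model = spark.sparkContext.broadcast(model)
#   rdd.mapPartitions(lambda rows: predict_partition(rows, bc_model.value))
results = list(predict_partition(["a", "bb", "ccc"], LengthModel()))
```

Compared to calling a remote prediction API per row, this keeps the model next to the data and avoids one network round trip per prediction.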

May 23 2024, 6:52 PM · Lift-Wing
fkaelin closed T304425: Test LiftWing API/Predictions from Hadoop , a subtask of T290173: Orchestration of end-to-end machine learning workloads, as Resolved.
May 23 2024, 6:52 PM · Research-Freezer
fkaelin changed the status of T351009: Develop an ML training workflow for ongoing work from Open to In Progress.
May 23 2024, 6:52 PM · Research-engineering, Research (FY2023-24-Research-April-June)

May 22 2024

fkaelin added a comment to T358366: Consult with Product and Research team on schema and data retention expectations for wmf_dumps.wikitext_raw.

More on "Availability" / time travel. This question is not easy to answer, as it also relates to the current snapshot approach, which forces a pipeline to reason about the past in a rather limiting way. Aka "do you want the data as it looked today, or 1 month ago, or 2 months ago?", and finding out if/how the past data is different is not trivial and rarely practical. Generally pipelines either

  1. offload dealing with the snapshot semantics to the consumers by producing snapshotted datasets themselves
  2. implement a pseudo-incremental dataset by disregarding the "new past" and any changes it might contain.

For this reason I find it hard to define requirements for time travel, it is basically a new capability (for example, the replacement for "mediawiki_wikitext_current" could be a transformation of a time travel query). Starting with 90 days should be sufficient as it is strictly an improvement to what one can do now.

May 22 2024, 8:14 PM · Dumps 2.0 (Kanban Board)
fkaelin added a comment to T358366: Consult with Product and Research team on schema and data retention expectations for wmf_dumps.wikitext_raw.
  • Availability: Research is mostly treating the current history dumps as a pseudo-incremental dataset, i.e. pipelines that depend on the history wait for a new snapshot to be released and then only use the "new" data from that snapshot (aka the revisions created in the month since the last snapshot was generated). This means that wmf_dumps.wikitext_raw makes it possible to significantly reduce the latency, roughly from 1 month (wait for snapshot interval to trigger) + 12 days (dump processing) to a few hours.
  • Schema: As the schemas are almost identical, my main question is about extending the existing dataset in ways that depend on the snapshot mechanism. For example research has a number of use cases that involve comparing the revision text with the parent revision text. This involves a computationally expensive self join and some pitfalls, so there is a wikidiff job that creates (yet another) version of the wikitext history that includes a column with the unified diff between the current and parent revision.
    • Could we add the diff to the proposed wmf_dumps.wikitext_raw? As the parent revision could be at any point in the past, this would likely require an equivalent of wmf.mediawiki_wikitext_current to be available when new revisions are ingested into the dataset.
    • More generally, what is the replacement for the wmf.mediawiki_wikitext_current?
  • Data quality: the discussion around correctness of the events data T120242 also applies in this context. For research in particular, many use cases don't have high requirements (e.g. for training datasets for ML, or for metrics datasets that involve models that can also be "incorrect"), and we could/would migrate existing jobs to the new dumps table once it is available/supported in prod.
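
The pseudo-incremental pattern described under "Availability" can be sketched as follows. This is a simplified illustration in plain Python, with hypothetical row fields, rather than code from an existing pipeline.

```python
from datetime import datetime


def new_revisions(snapshot_rows, previous_snapshot_time):
    """Keep only revisions created after the previous snapshot was generated.

    Anything the new snapshot changed about older revisions is deliberately
    ignored, which is what makes this only pseudo-incremental.
    """
    return [
        row for row in snapshot_rows
        if row["rev_timestamp"] > previous_snapshot_time
    ]


# Hypothetical rows from a monthly history snapshot.
snapshot = [
    {"rev_id": 10, "rev_timestamp": datetime(2024, 3, 20)},
    {"rev_id": 11, "rev_timestamp": datetime(2024, 4, 2)},
    {"rev_id": 12, "rev_timestamp": datetime(2024, 4, 28)},
]
fresh = new_revisions(snapshot, previous_snapshot_time=datetime(2024, 4, 1))
```

Only revisions 11 and 12 are processed; revision 10 is assumed to have been handled when the previous snapshot was consumed.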
May 22 2024, 8:11 PM · Dumps 2.0 (Kanban Board)

Apr 30 2024

fkaelin closed T355440: PoC - general model training support (Cloud GPU) as Resolved.

I am closing this as done - a summary:

Apr 30 2024, 4:59 PM · Research (FY2023-24-Research-April-June)
fkaelin edited projects for T344830: Incremental knowledge gap dataset, added: Research-Freezer; removed Research.

This task requires design/implementation. Given that the current implementation is stable, I am moving this task to the freezer until there is a more urgent need for an incremental dataset.

Apr 30 2024, 1:19 PM · Research-Freezer
fkaelin added a comment to T341515: Team Interface: Working on.

Update on the use of gitlab issues:

  • Research doesn't use them for team internal planning, work is tracked in Phabricator.
  • Some researchers use gitlab issues for managing tasks with external collaborators (e.g. outreachy internships) as it is more convenient than depending on another tool.
Apr 30 2024, 1:04 PM · Research-management, Research

Apr 16 2024

fkaelin closed T348826: Integrate with WMF deployment pipeline as Declined.

Closing this. Deploying on CloudVPS is supported, blubber integration to be done when a kubernetes deploy is needed.

Apr 16 2024, 3:10 PM · Research
fkaelin closed T348826: Integrate with WMF deployment pipeline, a subtask of T348820: Tooling to work with embeddings, as Declined.
Apr 16 2024, 3:10 PM · Epic, Research
fkaelin closed T348367: Create a python package to compute wikitext embeddings in the WMF data infra as Resolved.

Done - code

Apr 16 2024, 3:04 PM · Research
fkaelin closed T348367: Create a python package to compute wikitext embeddings in the WMF data infra, a subtask of T348819: Develop pipelines for research datasets - Q2, as Resolved.
Apr 16 2024, 3:04 PM · Research (FY2023-24-Research-October-December)
fkaelin closed T348823: Tooling to create an index from a dataset of vectors as Resolved.
Apr 16 2024, 3:02 PM · Research
fkaelin closed T348823: Tooling to create an index from a dataset of vectors, a subtask of T348820: Tooling to work with embeddings, as Resolved.
Apr 16 2024, 3:02 PM · Epic, Research
fkaelin removed Due Date on T343061: Denylist for language agnostic revert risk model.
Apr 16 2024, 2:57 PM · Research
fkaelin moved T343061: Denylist for language agnostic revert risk model from Staged to Backlog on the Research board.

Removing due date and moving to backlog to prioritize.

Apr 16 2024, 2:56 PM · Research
fkaelin closed T342915: Generate training/evaluation datasets using airflow , a subtask of T341817: Standardize research pipelines - Dataset generation, as Resolved.
Apr 16 2024, 2:54 PM · Epic, Research
fkaelin closed T342915: Generate training/evaluation datasets using airflow as Resolved.
Apr 16 2024, 2:54 PM · Research
fkaelin added a comment to T343065: Scheduled risk observatory pipeline.

@Pablo can this ticket be closed as well, as the work was tracked with T341777?

Apr 16 2024, 1:35 AM · Research (FY2023-24-Research-April-June)

Apr 3 2024

fkaelin added a comment to T341777: Automate the data collection process.

@Pablo thanks for flagging - there was indeed an issue with the wikidiff table: it is an external hive table, the required data was on hdfs and triggered the risk observatory dag, but the hive table itself was not being correctly updated, so no data was ingested. This is fixed now, and the dashboard shows data until Feb 24 now.

Apr 3 2024, 11:10 AM · Research

Mar 21 2024

fkaelin added a comment to T305688: Make HTML Dumps available in hadoop.

Pasting this reply from a slack thread for context

Mar 21 2024, 3:15 PM · Data-Engineering, Research, Structured-Data-Backlog

Mar 5 2024

fkaelin updated the task description for T356729: Research API repository.
Mar 5 2024, 4:27 PM · Research

Mar 4 2024

fkaelin added a comment to T355440: PoC - general model training support (Cloud GPU).

Weekly updates

  • Interesting development with the ml team: there is a conversation with a European HPC infra provider about getting compute resources, and research projects are good candidates. Naturally this is relevant to this cloud GPU initiative, and research is very interested.
Mar 4 2024, 3:17 PM · Research (FY2023-24-Research-April-June)

Feb 29 2024

fkaelin added a comment to T354241: Check home/HDFS leftovers of nickifeajika.

These directories can be removed both on the stat clients and hdfs. Thanks!

Feb 29 2024, 1:55 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03), Data-Engineering

Feb 27 2024

fkaelin closed T358613: Content_gap_metrics stage of knowledge_gaps job failing repeatedly as Resolved.

This is fixed (MR) and the data is available.

Feb 27 2024, 10:54 PM · Research, Movement-Metrics, Movement-Insights

Feb 12 2024

fkaelin updated subscribers of T357316: Develop pipelines for research datasets - Q3/Q4.
Feb 12 2024, 3:23 PM · Research (FY2024-25-Research-July-September), Research-engineering
fkaelin created T357316: Develop pipelines for research datasets - Q3/Q4.
Feb 12 2024, 3:14 PM · Research (FY2024-25-Research-July-September), Research-engineering
fkaelin added a comment to T355440: PoC - general model training support (Cloud GPU).

Weekly updates

  • Trained the simplification model on a 3 billion parameter model (flan-t5-xl) on a single H100 (80GB). Results look promising.
  • Training for 2 epochs (~10h), running inference on test datasets (~6h), and downloading model weights: total cost ~50$
  • The fine-tuned model weights are on stat1008. Validated that inference on the currently available GPU in the WMF infra works (it is slow)
Feb 12 2024, 3:09 PM · Research (FY2023-24-Research-April-June)

Feb 6 2024

fkaelin created T356729: Research API repository.
Feb 6 2024, 2:24 AM · Research

Feb 5 2024

fkaelin closed T355226: Productionize geography gaps data, cultural model as Resolved.

The cultural geographical gap data is now in production, aggregated on the level of the WMF regions. The gap name is geography_cultural_wmf_region, e.g. see here, the documentation is also updated as well as the example intersections notebook.

Feb 5 2024, 10:15 PM · Research
fkaelin added a comment to T331156: Improve documentation of the metrics available in the knowledge gap index.

For completeness: the datasets are also documented for the hive tables (which are equivalent to the published datasets) that are only available internally; see datahub (SSO login required)

Feb 5 2024, 10:07 PM · Research
fkaelin closed T331156: Improve documentation of the metrics available in the knowledge gap index as Resolved.

This is done: Datasets.

Feb 5 2024, 10:04 PM · Research
fkaelin closed T331156: Improve documentation of the metrics available in the knowledge gap index, a subtask of T331155: Knowledge Gaps Metrics, as Resolved.
Feb 5 2024, 10:04 PM · Epic, Research
fkaelin added a comment to T355440: PoC - general model training support (Cloud GPU).

Weekly updates

  • initial experiments with lambda labs, using text simplification as use case (T354653)
  • tested with A100 (40GB) and H100 (80GB) to validate approach and get an estimate of the cost for fine-tuning runs.
  • for a model size that can be trained on WMF infra (T5 large, 700M params), 1 epoch takes ~24h in WMF infra. On lambda labs 1 epoch costs ~6$ (time depends on hardware: ~4h on an A100, ~2h on an H100).
  • next up: use a model (3B param model) that can't currently be fine-tuned using WMF infra, but can be served using WMF infra.
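
For reference, the per-epoch figures above imply the following hourly rates and speedups. This is back-of-the-envelope arithmetic on the approximate numbers in this update; the hourly rates are derived, not quoted prices.

```python
# Approximate figures from the update above.
WMF_HOURS_PER_EPOCH = 24
COST_PER_EPOCH_USD = 6
cloud_hours_per_epoch = {"A100": 4, "H100": 2}

summary = {}
for gpu, hours in cloud_hours_per_epoch.items():
    implied_rate = COST_PER_EPOCH_USD / hours  # USD per GPU hour
    speedup = WMF_HOURS_PER_EPOCH / hours      # relative to WMF infra
    summary[gpu] = (implied_rate, speedup)
    print(f"{gpu}: ~{implied_rate:.2f} $/h implied, ~{speedup:.0f}x faster than WMF infra")
```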
Feb 5 2024, 9:54 PM · Research (FY2023-24-Research-April-June)

Jan 29 2024

fkaelin added a comment to T355440: PoC - general model training support (Cloud GPU).

Weekly updates

Jan 29 2024, 3:32 AM · Research (FY2023-24-Research-April-June)

Jan 25 2024

fkaelin added a comment to T355859: NEW BUG REPORT: Error querying content_gap_metrics tables from Presto/Superset.

Also for reference, at some point I created a template superset dashboard which mirrors the content_gap_metric hive tables - here https://superset.wikimedia.org/superset/dashboard/472, that is just a draft with example charts.

Jan 25 2024, 2:24 PM · Data-Platform-SRE, Data-Platform
fkaelin added a comment to T355859: NEW BUG REPORT: Error querying content_gap_metrics tables from Presto/Superset.

The issue seems that the superset ui for these queries can't render nested parquet structures, e.g. the metrics column contains a set of scalar columns, and the quantiles are nested structs themselves. The query itself works, but the UI can't render the result as is. If you formulate the query in a way that doesn't contain nested structs it works, for example:
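
One possible shape for such a query, as a sketch rather than the example originally posted: select the nested fields individually so that the result only contains scalar columns. The helper and the field name `article_count` are hypothetical; the table and top-level column names are those of the content_gap_metrics.by_category hive table.

```python
def flat_select(table, scalar_columns, struct_fields):
    """Build a SELECT that exposes nested struct fields as scalar columns.

    `struct_fields` maps a struct column to the field names to extract;
    each field becomes `column.field AS column_field`.
    """
    parts = list(scalar_columns)
    for column, fields in struct_fields.items():
        parts += [f"{column}.{field} AS {column}_{field}" for field in fields]
    return f"SELECT {', '.join(parts)} FROM {table}"


# `article_count` is a hypothetical field name; the real fields of the
# `metrics` struct are not listed here.
query = flat_select(
    "content_gap_metrics.by_category",
    ["wiki_db", "category", "time_bucket"],
    {"metrics": ["article_count"]},
)
```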

Jan 25 2024, 2:16 PM · Data-Platform-SRE, Data-Platform

Jan 24 2024

fkaelin closed T348348: Standardize usage of geographic entities for knowledge gaps as Resolved.

This work is done with this MR, which migrated the KG pipeline to the canonical_data.countries table. That table now includes the wikidata qid of the country, which allowed replacing the base region mapping file. For the cultural gap in particular, the "re-mapping" of some territories not in the canonical countries table was retained to expand the coverage of the gap.

Jan 24 2024, 5:19 PM · Research
fkaelin added a comment to T355226: Productionize geography gaps data, cultural model.

Example dataset for the cultural geographic gap (aggregated for wmf regions) for review: https://analytics.wikimedia.org/published/datasets/one-off/fab/content_gap/ . The code is merged and for the next scheduled run the new gap will be published as well (same format as the linked file above), if needed we can easily also re-run the previous pipeline to have the data sooner.

Jan 24 2024, 4:34 PM · Research
fkaelin reopened T355226: Productionize geography gaps data, cultural model as "Open".

Somehow I accidentally closed this.

Jan 24 2024, 4:23 PM · Research
fkaelin moved T348348: Standardize usage of geographic entities for knowledge gaps from Backlog to In Progress on the Research board.
Jan 24 2024, 4:03 PM · Research

Jan 23 2024

fkaelin closed T355226: Productionize geography gaps data, cultural model as Resolved.
Jan 23 2024, 2:31 PM · Research
fkaelin changed the status of T348348: Standardize usage of geographic entities for knowledge gaps from Open to In Progress.
Jan 23 2024, 2:29 PM · Research
fkaelin changed the status of T355226: Productionize geography gaps data, cultural model from Open to In Progress.
Jan 23 2024, 2:29 PM · Research

Jan 18 2024

fkaelin added a comment to T346463: Identify and label prefetch proxy data in our traffic.
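from pyspark.sql import functions as F

# `spark` is the active SparkSession (e.g. as provided in a notebook).
pa = spark.table("wmf.pageview_actor").where("year=2024 and month=1 and day=18 and hour=16")
prefetch_fields = ["prefetch_sec_purpose", "prefetch_purpose", "prefetch_x_moz"]
# One boolean column per prefetch header: is the key present in x_analytics_map?
cols = [F.col("x_analytics_map").getItem(f).isNotNull().alias(f) for f in prefetch_fields]
pa.groupBy(*cols).count().orderBy("count", ascending=False).show(1000, truncate=False)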
Jan 18 2024, 8:36 PM · Traffic, Movement-Insights, Data-Engineering

Jan 17 2024

fkaelin added a comment to T355226: Productionize geography gaps data, cultural model.

The content gaps for the geography gap using the cultural model are available on hive:

(spark.table("content_gap_metrics.by_category")
    .where("content_gap='geography_cultural_region'")
    .show())
+-------+--------------------+--------------------+--------------------+--------------------+-----------+
|wiki_db|            category|             metrics|           quantiles|         content_gap|time_bucket|
+-------+--------------------+--------------------+--------------------+--------------------+-----------+
| frwiki|         Afghanistan|{1, 499592, 434.8...|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
| frwiki|             Albania|{13, 551487, 363....|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
| frwiki|             Algeria|{13, 5211013, 503...|{{1, 1, 1, 1, 1},...|geography_cultura...|    2023-10|
Jan 17 2024, 6:09 PM · Research
fkaelin added a comment to T346463: Identify and label prefetch proxy data in our traffic.

Thanks for the updates @dr0ptp4kt, and nice that you are able to reproduce such a google proxy request.

Jan 17 2024, 4:47 PM · Traffic, Movement-Insights, Data-Engineering

Jan 15 2024

fkaelin closed T342917: Provide feedback for TrainWing design as Invalid.

TrainWing as originally planned will not be built, instead it will incrementally be built within existing data engineering infrastructure. As such, I mark this task as invalid.

Jan 15 2024, 4:29 PM · Research
fkaelin closed T348819: Develop pipelines for research datasets - Q2 as Resolved.

This work is completed, the pipelines that were added:

Jan 15 2024, 3:58 PM · Research (FY2023-24-Research-October-December)
fkaelin closed T348819: Develop pipelines for research datasets - Q2, a subtask of T341817: Standardize research pipelines - Dataset generation, as Resolved.
Jan 15 2024, 3:58 PM · Epic, Research
fkaelin closed T349615: Implement risk obsevatory pipeline as Resolved.

Completed with https://gitlab.wikimedia.org/repos/research/research-datasets/-/merge_requests/11

Jan 15 2024, 3:56 PM · Research
fkaelin closed T349615: Implement risk obsevatory pipeline, a subtask of T343065: Scheduled risk observatory pipeline, as Resolved.
Jan 15 2024, 3:56 PM · Research (FY2023-24-Research-April-June)
fkaelin closed T341818: Migrate and consolidate Research teams' code to Gitlab as Resolved.

The remaining known repos have been migrated to gitlab and the github repos archived (cc @Isaac @MGerlach).

Jan 15 2024, 2:31 PM · Research (FY2023-24-Research-October-December)
fkaelin updated the task description for T341818: Migrate and consolidate Research teams' code to Gitlab.
Jan 15 2024, 2:29 PM · Research (FY2023-24-Research-October-December)

Dec 18 2023

fkaelin created T353665: Remove nickifeajika from analytics-privatedata-users.
Dec 18 2023, 5:55 PM · Data-Platform-SRE (2024.02.12 - 2024.03.03)

Dec 8 2023

fkaelin added a comment to T346463: Identify and label prefetch proxy data in our traffic.

It so happens I sniped myself into looking into this after noticing a lot of google ips in trending streaming pages. Here are pyspark snippets with more results.

Dec 8 2023, 7:36 PM · Traffic, Movement-Insights, Data-Engineering

Nov 21 2023

fkaelin moved T345630: Update documentation for maintaining research web pages from In Progress to Needs Sign-off on the Research board.
Nov 21 2023, 6:08 PM · Research
fkaelin updated subscribers of T345630: Update documentation for maintaining research web pages.

@leila, should the redundant meta page be deleted, or redirected?

Nov 21 2023, 6:08 PM · Research
fkaelin updated the task description for T345630: Update documentation for maintaining research web pages.
Nov 21 2023, 6:05 PM · Research
fkaelin moved T348823: Tooling to create an index from a dataset of vectors from Staged to In Progress on the Research board.
Nov 21 2023, 5:10 PM · Research
fkaelin moved T349615: Implement risk obsevatory pipeline from Staged to In Progress on the Research board.
Nov 21 2023, 5:08 PM · Research
fkaelin closed T348822: Choose vector search framework as Resolved.

Benchmark code, analysis notebook

Nov 21 2023, 5:04 PM · Research
fkaelin closed T348822: Choose vector search framework, a subtask of T348820: Tooling to work with embeddings, as Resolved.
Nov 21 2023, 5:03 PM · Epic, Research
fkaelin closed T348821: Define requirements for List-Building & Add-A-Link as Resolved.

This is done with Embeddings At Scale and Evaluation of similarity search solutions

Nov 21 2023, 4:59 PM · Research
fkaelin closed T348821: Define requirements for List-Building & Add-A-Link, a subtask of T348820: Tooling to work with embeddings, as Resolved.
Nov 21 2023, 4:59 PM · Epic, Research
fkaelin changed Due Date from Oct 27 2023, 4:00 AM to Nov 24 2023, 5:00 AM on T345630: Update documentation for maintaining research web pages.
Nov 21 2023, 4:47 PM · Research
fkaelin moved T345630: Update documentation for maintaining research web pages from Staged to In Progress on the Research board.
Nov 21 2023, 4:47 PM · Research

Nov 20 2023

fkaelin updated subscribers of T350795: Add linting of research landing-page to gitlab CI .

@Jelto Unfortunately I also don't know how these logs looked, and I am not familiar with npm/js. Looking at the pipeline output, it says only 1 file was tested. But eslint is for javascript, and there is only a single js file in this project. From my perspective this is a good start, we can refine this as needed in the future.

Nov 20 2023, 9:00 PM · Research, collaboration-services

Nov 16 2023

fkaelin moved T350389: Upgrade xgboost in knowledge_integrity from Staged to In Progress on the Research board.

@isarantopoulos the python 3.8 is merged, and here is the MR for the xgboost bump.

Nov 16 2023, 4:54 AM · Research, Machine-Learning-Team

Nov 15 2023

fkaelin closed T348825: Deployment on yarn as Resolved.

This is done.

Nov 15 2023, 6:24 PM · Research
fkaelin closed T348825: Deployment on yarn, a subtask of T348820: Tooling to work with embeddings, as Resolved.
Nov 15 2023, 6:24 PM · Epic, Research

Nov 8 2023

fkaelin moved T350795: Add linting of research landing-page to gitlab CI from Backlog to Support Needed on the Research board.
Nov 8 2023, 3:19 PM · Research, collaboration-services
fkaelin created T350795: Add linting of research landing-page to gitlab CI .
Nov 8 2023, 3:18 PM · Research, collaboration-services

Nov 7 2023

fkaelin closed T349614: Archeology on the notebooks / documentation as Resolved.

This is done

Nov 7 2023, 6:16 PM · Research
fkaelin closed T349614: Archeology on the notebooks / documentation, a subtask of T343065: Scheduled risk observatory pipeline, as Resolved.
Nov 7 2023, 6:16 PM · Research (FY2023-24-Research-April-June)
fkaelin assigned T350389: Upgrade xgboost in knowledge_integrity to MunizaA.
Nov 7 2023, 4:18 PM · Research, Machine-Learning-Team
fkaelin set Due Date to Nov 30 2023, 5:00 AM on T350389: Upgrade xgboost in knowledge_integrity.
Nov 7 2023, 4:17 PM · Research, Machine-Learning-Team

Nov 2 2023

Dzahn awarded T334511: Move research webpages to gitlab a Love token.
Nov 2 2023, 3:54 PM · GitLab (Pipeline Services Migration🐤), collaboration-services, Research

Oct 26 2023

fkaelin renamed T349755: Training pipeline for Revert Risk Language Agnostic (RRLA) model from Create an standardized training pipeline for Revert Risk Language Agnostic (RRLA) model to [Requesting Engineering Support] Training pipeline for Revert Risk Language Agnostic (RRLA) model.
Oct 26 2023, 2:42 PM · Knowledge-Integrity, Research

Oct 25 2023

fkaelin closed T348666: Add randomization to the revision order showed in Annotool, a subtask of T344016: Improvements to Annotool, as Resolved.
Oct 25 2023, 4:58 PM · Research
fkaelin closed T348666: Add randomization to the revision order showed in Annotool as Resolved.
Oct 25 2023, 4:58 PM · Research

Oct 24 2023

fkaelin changed Due Date from Oct 18 2023, 10:00 PM to Oct 25 2023, 10:00 PM on T348666: Add randomization to the revision order showed in Annotool.
Oct 24 2023, 4:33 PM · Research
fkaelin moved T348666: Add randomization to the revision order showed in Annotool from Backlog to In Progress on the Research board.
Oct 24 2023, 4:32 PM · Research
fkaelin closed T343063: Multilingual revert risk pipeline as Resolved.
Oct 24 2023, 4:31 PM · Research
fkaelin set Due Date to Dec 29 2023, 5:00 AM on T348826: Integrate with WMF deployment pipeline.
Oct 24 2023, 3:41 PM · Research