User Details
- User Since
- Oct 8 2021, 4:05 PM (147 w, 12 h)
- Availability
- Available
- LDAP User
- TBurmeister
- MediaWiki User
- TBurmeister (WMF) [ Global Accounts ]
Mon, Jul 29
Since the time frame for when this task was relevant has now passed, I'm going to mark it as Declined.
Jul 2 2024
Jul 1 2024
Final status update!
- Finished categorizing / recategorizing all the pages with Data_platform and/or Data_platform_systems.
- Added category or prefix-based search boxes to the navigation template and the Systems landing page.
- Finalized category list, doc FAQs, and doc maintainer guidance at https://wikitech.wikimedia.org/wiki/Data_Platform_Engineering.
Resolving, as planned FY23-24 work on this task is now complete (see updates in subtasks).
Jun 27 2024
I added the Research nav template to the new page at https://meta.wikimedia.org/wiki/Research:Data_introduction, and added links between that page and other key pages like https://meta.wikimedia.org/wiki/Research:Data and https://meta.wikimedia.org/wiki/Research:Resources. I also added [[category:Research]] to the page. Doing a full refresh of the page / navigation design for the entire Research portal was not my intent: I'm just bad at naming phab tasks sometimes.
This content is now published at https://meta.wikimedia.org/wiki/Research:Data_introduction, and integrated with a section added to https://meta.wikimedia.org/wiki/Research:Data.
The scope of this task is too large: the documentation it describes could basically constitute an entire degree program in research/data science :-) But also: some of the work is already done or should be undertaken as part of working on the existing Research portal docs. See:
- (work done this FY) https://meta.wikimedia.org/wiki/Research:Data_introduction#Tools_and_data_access_methods
- (pages that exist and should be expanded to cover more of these topics):
I ran out of time this fiscal year to be able to publish the expanded version of this page that I had originally envisioned, and I think this work would need to be approached differently, in closer collaboration with the Research team and aligned with their other priorities. For now, the topic is covered by:
- https://meta.wikimedia.org/wiki/Research:Data_introduction
- https://meta.wikimedia.org/wiki/Research:Unique_devices
- https://meta.wikimedia.org/wiki/Research:Activity_session
- https://meta.wikimedia.org/wiki/Research:Page_view
- Improved documentation for the AQS API, which is one of the primary data sources in this space! (See T288664 and related)
All content from the draft page has now been moved into the main intro/overview page at https://meta.wikimedia.org/wiki/Research:Data_introduction#Content (and preceding sections).
I ran out of time this fiscal year to be able to publish the expanded version of this page that I had originally envisioned, and I think this work would need to be approached differently, in closer collaboration with the Research team and aligned with their other priorities. For now, the topic is covered by:
- https://meta.wikimedia.org/wiki/Research:Data_introduction
- https://meta.wikimedia.org/wiki/Research:Edit and related research concepts pages
- Slides from ICWSM Wikimedia data tutorial, and PAWS notebook, added to aforementioned page.
- https://wikitech.wikimedia.org/wiki/Data_Platform/Data_Lake/Edits
Jun 24 2024
Jun 18 2024
Continued cleanup after the big doc migration:
- Created a new landing page for AQS docs, to accommodate on-wiki user docs located at Data_Platform/AQS, separate from the maintainer docs at Data_Platform/Systems/AQS.
- Updated the content of Analytics page. Its only remaining subpages are under /Archive.
- Updated navigation template to make use of the more consolidated page structure, linking to key landing pages like /Data_Lake instead of sections of the higher-level landing pages like /Discover_data.
Jun 17 2024
Epic status update:
- I moved 456 pages from /Analytics and /Data_Engineering to their new homes as outlined above in https://phabricator.wikimedia.org/T350911#9886785.
- I cleaned up the many redirects that were making these docs confusing to navigate (see the legacy state here).
- I updated links on all the new Data Platform landing pages and some other key content pages; everywhere else, the redirects will take people to the right place.
More metrics definitions docs I just discovered during doc cleanup / restructuring:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2/Metrics_Definition
Additional docs that should be integrated into the revised Contact Us and Intake Process pages, and then marked as historical/deprecated:
Are these metric definitions specific to Wikistats, or are they more generally applicable? If they're generally applicable, I'd like to integrate them into https://meta.wikimedia.org/wiki/Research:Data_introduction (T343146). If they're only applicable to Wikistats, then T357327 is probably relevant.
I just discovered https://www.mediawiki.org/wiki/Analytics/Metric_definitions, which contains documentation that is probably more relevant to users of Wikistats than the admin-focused content at https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Wikistats_2, which is what the dashboard UI links to. These pages should probably be consolidated, or at least more clearly connected to each other, though I'm unsure whether these metrics definitions are actually broad enough that they could really be part of higher-level conceptual documentation, like https://meta.wikimedia.org/wiki/Research:Data_introduction. (I'm asking this question in T123989)
There is an existing phab task (T361327) that I think relates to this question and highlights the current split of metrics documentation between meta and mediawiki.org. Unfortunately my work in the data domain has focused mostly on the underlying datasets, not metrics derived from them, so I'm not sure I can be of much help here. My main recommendation would be to pick one place and keep it consistent, and add clear cross-references to and from other locations where your audience might expect to find the information. I would lean towards choosing meta over mediawiki.org for this type of content, but even that choice is ambiguous because there's already so much metrics-related content on mw: https://www.mediawiki.org/wiki/Wikimedia_Product/Data_dictionary
Jun 15 2024
Jun 14 2024
I added a new section with a couple links at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark#Use_PySpark_to_run_SQL_on_Hive_tables. I also updated the task description to capture the key info from the original Slack thread and make it easier for someone else to pick up and work on this task (it's out of scope for my work on Data Platform docs this quarter, so I'm unassigning myself).
Jun 13 2024
Update:
- Started moving draft content back into https://meta.wikimedia.org/wiki/Research:Data_introduction#Content
- Finished revising sections:
  - Namespaces
  - Wikitext vs. HTML
  - Content data sources (added more info)
I was planning to leave redirects, but I'm open to not doing that if you think it's a better long-term solution for the overall health of the docs and wikitech!
Jun 12 2024
Decision from 12 June working session (see https://wikitech.wikimedia.org/wiki/Data_Engineering/TOC and this content analysis of page structure for reference):
- Move the following to be subpages of Data_Platform, and add Category:Data_Platform:
  - Pages under Data_Engineering/, with the exception of /Systems
  - Pages under Analytics/, with the exception of /Systems and /Cluster
- Move the following to be subpages of Data_Platform/Systems, and add Category:Data_Platform_systems:
  - Pages under Data_Engineering/Systems
  - Pages under Analytics/Systems and Analytics/Cluster
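The move rules above can be sketched as a small title-mapping function. This is only an illustration of the decision, not the script or process actually used for the migration: the function name `plan_move` is hypothetical, and the sketch assumes that the rest of each subpage path is kept unchanged after the old prefix is stripped, which the decision itself does not spell out.

```python
from typing import Optional, Tuple

# Old prefixes whose subpages go under Data_Platform/Systems. They are checked
# first because they are more specific than the general prefixes below.
SYSTEMS_PREFIXES = (
    "Data_Engineering/Systems",
    "Analytics/Systems",
    "Analytics/Cluster",
)
# Old prefixes whose remaining subpages go directly under Data_Platform.
GENERAL_PREFIXES = ("Data_Engineering", "Analytics")


def _subpath(title: str, prefix: str) -> Optional[str]:
    """Return the part of title below prefix, or None if title is not a subpage."""
    marker = prefix + "/"
    if title.startswith(marker) and len(title) > len(marker):
        return title[len(marker):]
    return None


def plan_move(title: str) -> Optional[Tuple[str, str]]:
    """Map an old page title to (new title, category), or None if no rule applies.

    Assumption: the subpage path after the old prefix is preserved as-is.
    """
    for prefix in SYSTEMS_PREFIXES:
        rest = _subpath(title, prefix)
        if rest is not None:
            return "Data_Platform/Systems/" + rest, "Category:Data_Platform_systems"
    for prefix in GENERAL_PREFIXES:
        rest = _subpath(title, prefix)
        if rest is not None:
            return "Data_Platform/" + rest, "Category:Data_Platform"
    return None


if __name__ == "__main__":
    for old in ("Analytics/Data_Lake", "Analytics/Systems/AQS", "Help:Toolforge"):
        print(old, "->", plan_move(old))
```

Checking the systems prefixes before the general ones is what implements the "with the exception of /Systems and /Cluster" clauses: a page such as Analytics/Systems/AQS never reaches the general Analytics rule.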
Jun 11 2024
Status update:
- Had 2 working sessions with DPE team/stakeholders to finalize content deprecations and set the team up for ongoing doc maintenance of revised/new pages on Wikitech
- Created a new navigation menu that mirrors the structure of the Data Platform landing page, and started adding it to the major content pages that we link to from the landing page. For example, it is now on https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
- See this update about deprecation and redirection of content at https://wikitech.wikimedia.org/wiki/Data_Engineering
Jun 10 2024
As part of replacing the outdated Data Engineering docs on Wikitech, I have made the following revisions after consultation with the DPE team/stakeholders:
May 30 2024
This is related to and part of a more comprehensive dataset documentation strategy, which includes metadata for how we describe datasets: T349103.
Met with @MGerlach to discuss the concept of this page and how it relates to the higher-level introductory content. New plan is to:
- focus this content on the core concepts of namespaces, wikitext / parsing, templates, and redirects.
- integrate this content into Research:Data_introduction and split all the educational overview content there off into a separate page: one page for concepts, a separate page for data domains & links to data sources and access methods.
May 29 2024
May 28 2024
- The four major user-focused landing pages have been published!
- I updated https://wikitech.wikimedia.org/wiki/Data_Platform_Engineering to serve as a navigational page that guides people to the Data Platform docs on Wikitech and the Data Platform Engineering team docs on mediawiki.org.
- Remaining work before this task can be closed:
  - Migrate infrastructure admin-focused content from https://wikitech.wikimedia.org/wiki/Data_Engineering into the relevant landing page sections so that the old pages can be redirected to the new landing pages.
  - Move remaining team content to mediawiki.org and mark the old landing pages as historical.
Page published at https://wikitech.wikimedia.org/wiki/Data_Platform/Transform_data. The team agreed it's okay to publish with some existing TODOs, because work on much of the data lifecycle and governance documentation is still ongoing.
May 23 2024
The feedback cycle for all 4 landing pages has concluded. I'm waiting for a go/no-go decision from the team about:
- Whether we can eliminate the Collect Data page and just link to the Metrics Platform page from the Data Platform main landing page
- Whether it's okay to move forward with publishing the pages in the mainspace even though there are still open TODOs and links to in-progress pages related to the ongoing data governance and lifecycle work, especially on the Publish data landing page.
I've requested a decision about those two items by tomorrow, May 24, in order to keep this work on schedule and give me enough time for implementation.
May 22 2024
The DPE team feedback session took place on May 21. I'm working on integrating the feedback received into the "Publish data" landing page, and aligning the group on a decision about whether this page needs to exist.
May 20 2024
Additional feedback received in other channels (copying here as TODO items):
Hi @NicoleLBee! I love this tool! The ability to find tools to annotate based on specific fields would be really useful. Specifically, I'd like to be able to focus on tasks like this:
- Find tools that have no links to documentation in their "developer_docs_url" and/or "user_docs_url" annotations fields, then look for whether those docs exist and try to add the links. This would be really useful for running sessions like Doc Your Tool, where one element of the task is just finding tools that may already have user and developer docs, but the links to those docs aren't in Toolhub.
- Find tools that have no content in the annotations fields for "audiences", "content_types", "tasks", and/or "subject_domains", and add those annotations. One example use for this functionality would be to run an "annotate-a-thon" at events where specific communities or interest groups have gathered. For example, at an event where many GLAM Wikimedians are present, I would like to be able to have many people use Toolhunt to find GLAM tools and add annotations like "GLAM" in the "subject_domains" field. Adding those annotations would enable people in the GLAM community to more effectively use Toolhub to find tools relevant to them.
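To make the two requests above concrete, here is a minimal sketch of the kind of filter I have in mind. It is not Toolhunt or Toolhub code: the field names come from the annotations listed above, but the flat dictionary shape of a tool record, the helper names, and the sample tools are all assumptions for illustration.

```python
from typing import Any, Dict, Iterable, List

# Annotation fields named in the feature request above.
DOC_LINK_FIELDS = ("developer_docs_url", "user_docs_url")
TAXONOMY_FIELDS = ("audiences", "content_types", "tasks", "subject_domains")


def is_blank(value: Any) -> bool:
    """Treat None, empty or whitespace-only strings, and empty lists as unannotated."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip() == ""
    if isinstance(value, (list, tuple)):
        return len(value) == 0
    return False


def missing_fields(tool: Dict[str, Any], fields: Iterable[str]) -> List[str]:
    """Return the names of the given annotation fields that have no content."""
    return [name for name in fields if is_blank(tool.get(name))]


def tools_missing_any(tools: Iterable[Dict[str, Any]], fields: Iterable[str]) -> List[Dict[str, Any]]:
    """Return the tools that lack content in at least one of the given fields."""
    wanted = tuple(fields)
    return [tool for tool in tools if missing_fields(tool, wanted)]


if __name__ == "__main__":
    # Hypothetical records; real tool records may be shaped differently.
    sample = [
        {"name": "example-tool-a", "user_docs_url": "", "subject_domains": ["GLAM"]},
        {"name": "example-tool-b", "developer_docs_url": "https://example.org/dev",
         "user_docs_url": "https://example.org/user", "subject_domains": []},
    ]
    for tool in tools_missing_any(sample, DOC_LINK_FIELDS):
        print(tool["name"], "is missing", missing_fields(tool, DOC_LINK_FIELDS))
```

Matching on "at least one field is empty" reflects the "and/or" in both requests; a stricter "all fields empty" filter would only need `all` instead of a non-empty check.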
Changes I made to the draft page this week include:
- Added a new subsection for structured data, including data sources that weren't previously listed and the DBpedia links that are currently on Research:Data.
- Added a sortable reference table listing all data sources with their domains and access methods.
- Added a section for third-party dataset discovery tools.
May 17 2024
May 16 2024
Status update: community feedback period is ongoing, and I'm working on a reply to address questions / concerns that have already been posted on the Talk page.