Commons Impact Metrics now available via data dumps and API

Commons Impact Metrics is a new data product offering monthly data dumps and a new Wikimedia Analytics API for Wikimedia Commons categories of images relating to cultural heritage. These categories include content from libraries, museums, and archives, as well as visual documentation of natural, built, and living heritage. We’re agnostic about who facilitates the upload, so relevant categories can be added by institutions, Wikimedia affiliates, Wiki Loves… campaigns, and individual contributors. Using this data, Commons contributors and their partners can count monthly edits in a category; identify their most active contributors and most viewed files; and understand which Wikimedia projects, languages, and articles are using their images. We will share some specific examples using real data in a follow-up Diff post.

Partners need to demonstrate the impact of their contributions

Museums, libraries and archives have contributed millions of high quality images to Commons. An analysis of digitized paintings on Wikipedia found that 67% of the articles they illustrate are not directly related to art. These images are used to illustrate articles about history, religion, geography, and even broader concepts like play and love. Paintings, drawings, and sculpture from cultural institutions are important visual records of people and historical events that preceded the widespread use of photography. 

By contributing their digitized collections to the Wikimedia projects, institutions can make their knowledge visible and relevant to new audiences in more than 300 languages. In 2018, The Met reported that their image of Katsushika Hokusai’s print The Great Wave attracted 10x more views on English Wikipedia than on their own website, and 20x more views on Wikipedia articles in all languages than on their own website. In 2021, Wellcome Collection announced that their images on Wikipedia had been viewed more than 1.5 billion times, illustrating articles about Sagittarius, an East India Company opium processing factory, the Al-Aqsa Mosque, bipolar disorder, Neville Chamberlain, and more.

A collage of the images used to illustrate Wikipedia articles
Wellcome Collection images used to illustrate Wikipedia articles, taken from Images from Wellcome Collection pass 1.5 billion views on Wikipedia, by Alice White, CC BY 4.0

It’s not only large museums or international brands that achieve this visibility on Wikipedia. Each month, a few hundred image files contributed by The Museum of Veterinary Anatomy in São Paulo attract millions of page views.

Many of the institutions sharing images do so for reasons that go beyond a simple measure of reach. For example, the Smithsonian is focused on amplifying the accomplishments of American women by adding their biographies to Wikipedia and some of their most viewed images on Wikipedia are of women of color, such as Sojourner Truth, A Chippeway Widow, and Josephine Baker. Similarly, Wikimedia UK and the private Khalili Collections are working together to improve Wikipedia’s coverage of topics from Islamic pilgrimage to Japanese fashions. This followed research that found “a systemic cultural bias against non-Western visual art and artists across all Wikipedia platforms and in various languages.” 

We’re providing more reliable data 

To understand the impact of their contributions to Wikimedia projects, cultural institutions are interested in the utilization of their content (which projects, languages, and articles their images are used in) and views (how many times their images were seen per project, language, and article). Vital tools to access this data were developed by the community, but it has been difficult for volunteer developers to maintain them consistently, leading to outages that damage the credibility of the Wikimedia movement. Because of the way these tools generate analytics, any outages also lead to data loss and to both under- and overcounting of views.

To address these reliability issues, we developed a centralized, pre-computed dataset to increase trust in the numbers used to demonstrate the impact of image contributions.

  1. More reliable because the dataset, data dumps, and API are developed and maintained by the Data Products team and openly available for use and integration.
  2. Less complex because we are delivering pre-computed data instead of raw data that has to be processed in order to extract metrics from it.
  3. More stable because the calculation traverses the category graph to a maximum depth of seven sub-categories. This covers most known use cases without inadvertently causing the system to fail by traversing the entirety of the category graph.
  4. More scalable because, while existing tools can fail silently when handling larger quantities of content, we handle categories with up to 1 million files. For comparison, GLAMorgan has a maximum of 30,000 files in the category graph.
  5. Well documented, with an API and service that aim to standardize our definitions and methods so that results are easier to compare and learn from.
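
The depth limit in point 3 can be illustrated with a small sketch. This is not the Data Products team's implementation; it is a generic breadth-first traversal under the assumption that "maximum depth of seven" means seven levels below the starting category. The function name and the toy graph are hypothetical.

```python
from collections import deque

MAX_DEPTH = 7  # depth limit mentioned in the post; its exact interpretation is assumed here


def collect_subcategories(graph, root, max_depth=MAX_DEPTH):
    """Return {category: depth} for categories reachable from root within max_depth levels."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        if depths[current] == max_depth:
            continue  # do not descend past the limit
        for child in graph.get(current, ()):
            if child not in depths:  # guards against cycles in the category graph
                depths[child] = depths[current] + 1
                queue.append(child)
    return depths


# Hypothetical toy graph: a chain of nested categories C0 -> C1 -> ... -> C10,
# plus a cycle from C2 back to C0.
toy_graph = {f"C{i}": [f"C{i + 1}"] for i in range(10)}
toy_graph["C2"].append("C0")

reached = collect_subcategories(toy_graph, "C0")
print(sorted(reached, key=reached.get))
```

In this toy graph the traversal stops at C7, so C8 and deeper categories are never visited, and the cycle back to C0 does not cause an endless loop.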

The data product was informed by community discussions and feedback

The development of this new data product was informed by community documentation and feedback, including a Wikimedia-l email thread (The problems with Wikimedia metrics, February 2023); a GLAM Manifesto (February 2023); a meeting report by Wikimedia Sweden (Wikimedia metrics tools, February 2023); and the GLAM Metrics Needs page (February-August 2023). We also reviewed past research, including a study from 2013, Report on requirements for usage and reuse statistics for GLAM content.

We carried out design research interviews with 16 participants representing 9 affiliates and institutions; released a prototype for user testing; and facilitated a workshop at the GLAM Wiki 2023 conference (Understanding the Impact of Image Contributions to Commons, November 2023). In May 2024, a beta version of Commons Impact Metrics data was made available at Wikimedia Dumps, and Marcel Ruiz Forns led an exploration session at the Wikimedia Hackathon, “Commons Impact Metrics BETA Data Dumps available today!”

Marcel Ruiz Forns and Krishna Chaitanya Velaga at the Wikimedia Hackathon 2024, by Robert Sim, CC BY-SA 4.0

We had to make some trade-offs to deliver useful analytics this year

There is an allow list of categories

Early investigations showed that we couldn’t work with the Commons repository in its entirety. There are more than 100 million media files associated with more than 16 million categories. The category graph can quickly connect to nearly every file on Commons, causing computational and system failures. We therefore use an allow list of the categories that are in scope. We have backfilled more than 6 months of data for more than 1,200 categories that are used by the GLAM Wiki Dashboard, the Cassandra Dashboard, and BaGLAMa 2. (As on-demand metrics tools, GLAMorgan and GLAMorous don’t have predefined lists of categories.) Having this backfilled data will help category owners understand how our new definitions and methods impact their numbers. New categories can be requested and will be added each month.

The data has a monthly granularity

To keep the dumps lighter and more manageable for volunteers, affiliates, and partners, we aggregate the data at a monthly granularity and publish it on a monthly release schedule. New data will be available between the 2nd and 5th of each month.
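
To make the idea of monthly granularity concrete, here is a minimal sketch of rolling daily counts up into one row per category per month. The post does not describe the actual pipeline, so the row layout, the function name, and the numbers below are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def aggregate_monthly(daily_rows):
    """Sum (day, category, views) rows into {(YYYY-MM, category): total_views}."""
    totals = defaultdict(int)
    for day, category, views in daily_rows:
        totals[(day.strftime("%Y-%m"), category)] += views
    return dict(totals)


# Hypothetical daily counts for one made-up category
daily_rows = [
    (date(2024, 5, 1), "Example museum images", 120),
    (date(2024, 5, 17), "Example museum images", 80),
    (date(2024, 6, 2), "Example museum images", 50),
]

monthly = aggregate_monthly(daily_rows)
print(monthly)
```

Each month collapses to a single total per category, which is what keeps a monthly dump much smaller than a daily one.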

We prioritized pageviews over mediarequests

The other big decision we needed to make was whether to prioritize pageviews (used by BaGLAMa 2 and GLAMorgan) or mediarequests (used by GLAM Wiki Dashboard). We selected pageviews because we thought it was important for partners and contributors to see which articles their images are illustrating. This gives them insight into how their collections can meet the everyday information needs of internet users, and where they are closing visual knowledge gaps on Wikimedia projects. This is more actionable information than total view counts by projects and languages only.

However, one downside of using pageviews is that we had to exclude main page views until we have a better way of accounting for when media is actually present on the page. While main page views can significantly increase the overall count for a category or file, they represent less intentional traffic than article views. For more about the pros and cons of these different approaches, see the Data Model page on Wikitech.
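
As a rough illustration of a pageview-based count that leaves out main pages, the sketch below looks up monthly views for each article that uses a file and skips any page flagged as a main page. It is an assumption-laden toy, not the product's data model: the function name, the input structures, and the view counts are hypothetical.

```python
def views_by_article(usage, pageviews, main_pages):
    """Return (page, views) pairs for pages using a file, excluding main pages, most viewed first."""
    counts = {}
    for page in usage:
        if page in main_pages:
            continue  # main page traffic is excluded, as described above
        counts[page] = pageviews.get(page, 0)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs: pages are (wiki, title) pairs; the view counts are made up
usage = [
    ("en.wikipedia", "Main Page"),
    ("en.wikipedia", "Sojourner Truth"),
    ("fr.wikipedia", "Sojourner Truth"),
]
pageviews = {
    ("en.wikipedia", "Main Page"): 5_000_000,
    ("en.wikipedia", "Sojourner Truth"): 90_000,
    ("fr.wikipedia", "Sojourner Truth"): 4_000,
}
main_pages = {("en.wikipedia", "Main Page")}

ranked = views_by_article(usage, pageviews, main_pages)
print(ranked)
```

The per-article breakdown is the point of choosing pageviews: it shows which articles an image illustrates, not only how often the file was served.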

In the future, we may offer mediarequest counts alongside pageview counts if there is a demonstrated need. We have captured this potential work in a Phabricator ticket.

This product will be supported by two teams at the Foundation

The dumps and API service are managed by the Data Products Team. Please watch the Data Model page and join the analytics-l mailing list for any planned updates that could have an impact on integrations and tools. Issues and requests should be logged in Phabricator using the Commons-Impact-Metrics and Data Products tags. You can track incoming requests and progress on the Phabricator workboard. If you have questions or feedback, please use the Talk page for the project on Wikitech.

If you want to add a new Wikimedia Commons category to the allow list (or rename or remove existing categories), you can make a request to the Culture and Heritage team by using this form. You can see all open requests on this Phabricator workboard.

How to start using Commons Impact Metrics 

We’ll soon publish a second post with examples of the questions that can be answered using Commons Impact Metrics.
