
[L] Create tool for manual evaluation of section-level image suggestions
Closed, Resolved (Public)

Description

This task is to adapt either
https://media-search-signal-test.toolforge.org/synonyms_bak.html
or
https://image-recommendation-test.toolforge.org/
for manual evaluation of section-level image suggestions once T315976 is complete.

There will be a follow-up task to do the manual testing itself.

Acceptance Criteria:

  • Determine which of the above tools is better/easier to adapt for this purpose.
  • The tool will evaluate results in English, Portuguese, Indonesian, Russian, Arabic, Czech, Bengali, French and Spanish Wikipedias
  • The tool will allow the user to choose which wiki/language they want to evaluate
  • The tool will evaluate 500 random section-level image suggestions across 500 random different articles, per wiki
  • The tool will display and evaluate the output (both a preview of the section text and the image), similar to https://media-search-signal-test.toolforge.org/
  • The tool will allow testers to manually decide whether the match is good or bad for each result for each unillustrated article.
  • The tool will show the user information about the source of the match -- whether it's from section alignment (e.g. this image was used on X wiki), or from visual topics, or an intersection of both
  • The tool will show suggestions from section alignment, visual topics, and at the intersection of both, but will prioritize suggestions that intersect
  • The tool will output the results into a spreadsheet, showing how many good and bad matches were produced for each article, and what the confidence score for each of those matches was
  • We will remove images that are not .jpgs from the evaluation dataset so that we remove potential icons
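
The sampling criteria above (drop non-.jpg images, prioritize intersection suggestions, at most one suggestion per article, 500 per wiki) could be sketched roughly as follows. This is only an illustration, not the tool's actual code: the record fields (`article`, `image`, `source`), the source label `"intersection"`, and the function name `build_eval_sample` are all hypothetical.

```python
import random


def build_eval_sample(suggestions, per_wiki=500, seed=0):
    """Pick up to `per_wiki` suggestions, at most one per article.

    Each suggestion is assumed (hypothetically) to look like
    {"article": str, "image": str, "source": str}.
    """
    rng = random.Random(seed)
    # Drop images that are not .jpg files to filter out likely icons.
    jpgs = [s for s in suggestions if s["image"].lower().endswith(".jpg")]
    # Shuffle first, then stable-sort so intersection suggestions come first
    # while the order within each group stays random.
    rng.shuffle(jpgs)
    jpgs.sort(key=lambda s: s["source"] != "intersection")
    sample, seen_articles = [], set()
    for s in jpgs:
        if s["article"] in seen_articles:
            continue  # keep the sample spread across different articles
        seen_articles.add(s["article"])
        sample.append(s)
        if len(sample) == per_wiki:
            break
    return sample
```

Because the shuffle happens before the stable sort, intersection suggestions are preferred without the selection within each group becoming deterministic.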

Update

We are going to do another round of evaluation to see if we can improve the % good topics and the number of available intersection suggestions.

Plan

Data
  • Investigate more images linked through Wikidata via T311832 and T311831. Update: Structured Data on Commons depicts statements can increase the number of images available for topics
    • @mfossati to spend 1 day on both spikes to see how it goes; then include them in the updated evaluation data if it goes well
    • Reverse-lookup depicts statements
  • Use the updated section topics data set (fewer tables and lists, media items, dates) in the updated evaluation data set
  • Half of the evaluation data set should use section topics with a section-level relevance score over 10
  • Half of the evaluation data set should use section topics with an article-level relevance score over 3.725 - update: threshold computed via "recursive" percentiles 🤓
  • Run the pipeline to generate intersection-based suggestions
  • Do not include alignment-based suggestions in the updated data set -- only include section-topics and intersection-based suggestions
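
The two relevance-score rules above amount to building the evaluation set from two halves. A minimal sketch, assuming hypothetical field names `section_score` and `article_score` on each section-topic record (the real pipeline's schema is not shown in this task):

```python
import random

SECTION_SCORE_MIN = 10      # section-level relevance threshold from the plan
ARTICLE_SCORE_MIN = 3.725   # article-level relevance threshold from the plan


def split_by_threshold(topics, total, seed=0):
    """Return two lists of up to total/2 topics each: one passing the
    section-level threshold, one passing the article-level threshold."""
    rng = random.Random(seed)
    half = total // 2
    by_section = [t for t in topics if t["section_score"] > SECTION_SCORE_MIN]
    picked_section = rng.sample(by_section, min(half, len(by_section)))
    # Avoid putting the same record into both halves.
    picked_ids = {id(t) for t in picked_section}
    by_article = [
        t for t in topics
        if t["article_score"] > ARTICLE_SCORE_MIN and id(t) not in picked_ids
    ]
    picked_article = rng.sample(by_article, min(total - half, len(by_article)))
    return picked_section, picked_article
```

Keeping the halves as separate lists makes it straightforward to compare ratings for the two thresholds afterwards.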
Tool
  • Add an explanation for users when the source is a depicts statement (e.g., here is an image we think might fit the article section, because: this image has the depicts statement X, and an article about that item is linked from the section.)
  • Remove suggestions from the queue when the evaluator clicks “unsure” so that we cycle through more suggestions
  • Keep the new evaluation data set separate from the previous one by adding a 'dataset_id' field to ratedSuggestions
  • Before switching over to the new data set, put the updated results from the old data set in the ticket
  • Just like in the first data set, remove images that are not .jpgs from the evaluation dataset so that we remove potential icons
  • Once these updates are made, run another evaluation, just amongst ourselves in our languages
    • If it goes well, we can do one more round with ambassadors
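
The queue and data-set changes above could look something like the sketch below. It is an assumption-laden illustration rather than the tool's code: the class `EvaluationQueue`, the value of `CURRENT_DATASET_ID`, and the choice to store "unsure" ratings alongside good/bad ones are all hypothetical; only the `dataset_id` field and the `ratedSuggestions` name come from the plan.

```python
from collections import deque

CURRENT_DATASET_ID = 2  # hypothetical id for the new evaluation data set


class EvaluationQueue:
    def __init__(self, suggestions, dataset_id=CURRENT_DATASET_ID):
        self.dataset_id = dataset_id
        self.pending = deque(suggestions)
        self.rated_suggestions = []  # stands in for the ratedSuggestions store

    def next_suggestion(self):
        """Return the suggestion to show next, or None if the queue is empty."""
        return self.pending[0] if self.pending else None

    def rate(self, rating):
        """Record a rating ('good', 'bad' or 'unsure') for the head of the queue."""
        if rating not in ("good", "bad", "unsure"):
            raise ValueError(f"unknown rating: {rating}")
        # Every rating, including 'unsure', removes the suggestion from the
        # queue so evaluators keep cycling through new suggestions.
        suggestion = self.pending.popleft()
        self.rated_suggestions.append(
            {"suggestion": suggestion, "rating": rating, "dataset_id": self.dataset_id}
        )
```

Tagging each stored rating with `dataset_id` lets results from the first and second data sets be reported separately from the same store.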

Round 1 results

wiki     % good intersection  % good alignment  % good p18 topics  total rated suggestions
arwiki   100                  71                41                 343
bnwiki   50                   40                13                 55
cswiki   38                   38                16                 206
enwiki   82                   68                44                 178
eswiki   89                   78                29                 398
frwiki   85                   71                18                 344
idwiki   88                   95                71                 530
ptwiki   100                  83                69                 398
ruwiki   77                   75                28                 966
overall  79                   69                37                 3418

As of Feb 16.

Round 2 internal results

wiki     % good intersection  % good p18 topics  total rated suggestions
enwiki   100                  71                 329
eswiki   100                  85                 101
frwiki   75                   57                 70
ptwiki   100                  75                 94
ruwiki   56                   62                 139
overall  86                   70                 733

As of Feb 28.

Event Timeline

CBogen renamed this task from Create tool for manual evaluation of section-level image suggestions to [L] Create tool for manual evaluation of section-level image suggestions. Oct 5 2022, 4:45 PM
CBogen updated the task description.
Cparle subscribed.

Here's a sample of the data generated by T315976 (approx 2000 suggestions per wiki, 1000 generated via section topics and 1000 via section alignment)

https://docs.google.com/spreadsheets/d/1XOsTmCOCxeIvMVO-_LulD205PACYwrG1FpNlFel94AQ/edit?usp=sharing

@Cparle Could you please add French Wikipedia (frwiki) as well, since it's one of the target wikis?

Based on our meeting today, I added the following acceptance criteria:

    • The tool will show the user information about the source of the match -- whether it's from section alignment (e.g. this image was used on X wiki), or from visual topics, or an intersection of both
    • The tool will show suggestions from section alignment, visual topics, and at the intersection of both, but will prioritize suggestions that intersect
      • Spreadsheet columns will be: wiki; article name; section name; section link; image; source; match strength (good/bad)
  • The tool will output data that allows us to evaluate what percentage of bad matches come from section alignment, and which wikis specifically, so we can decide if there is a next step here
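
As a rough illustration of the output described above, the sketch below writes ratings to CSV using the listed spreadsheet columns and computes, per wiki, what share of bad matches came from section alignment. The function names and the source label `"alignment"` are hypothetical, and CSV stands in for whatever spreadsheet format the tool actually produces.

```python
import csv
from collections import Counter

# Column names taken from the acceptance criteria above.
COLUMNS = ["wiki", "article name", "section name", "section link",
           "image", "source", "match strength"]


def write_ratings_csv(rows, out):
    """Write rating rows (dicts keyed by COLUMNS) to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def bad_match_share(rows, source="alignment"):
    """Per wiki, the percentage of bad matches that come from `source`."""
    bad_total, bad_from_source = Counter(), Counter()
    for row in rows:
        if row["match strength"] != "bad":
            continue
        bad_total[row["wiki"]] += 1
        if row["source"] == source:
            bad_from_source[row["wiki"]] += 1
    return {w: round(100 * bad_from_source[w] / n) for w, n in bad_total.items()}
```

Wikis with no bad matches at all are simply absent from the result, which avoids dividing by zero.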

Rating summary by wiki on Tues Feb 7

wiki_db  % good intersection  % good alignment  % good topics  total rated suggestions
arwiki   100                  72                44             309
bnwiki   50                   ?                 ?              30
cswiki   38                   38                16             206
enwiki   82                   68                43             161
eswiki   89                   78                29             398
frwiki   85                   90                39             96
idwiki   88                   90                47             276
ptwiki   100                  ?                 ?              8
ruwiki   77                   76                33             696
overall  75                   74                33             2203

Topic score and % good image:

topic score  % good  total ratings
>10          70      20
5-10         58      31
2-5          53      163
1-2          44      257
<1           37      824
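
A breakdown like the one above can be produced by bucketing each rating by its topic score. The sketch below is illustrative only: the function names are hypothetical, and how scores exactly on a bucket boundary (1, 2, 5, 10) are assigned is an assumption, since the table does not say.

```python
from collections import defaultdict


def score_bucket(score):
    # Boundary handling is an assumption: lower bounds are inclusive,
    # and a score of exactly 10 falls in the 5-10 bucket.
    if score > 10:
        return ">10"
    if score >= 5:
        return "5-10"
    if score >= 2:
        return "2-5"
    if score >= 1:
        return "1-2"
    return "<1"


def percent_good_by_bucket(ratings):
    """ratings: iterable of (topic_score, is_good) pairs.

    Returns {bucket: (percent_good, total_ratings)}.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [good, total]
    for score, is_good in ratings:
        entry = counts[score_bucket(score)]
        entry[0] += int(is_good)
        entry[1] += 1
    return {b: (round(100 * good / total), total)
            for b, (good, total) in counts.items()}
```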

Update: we are going to do another round of evaluation to see if we can improve the % good topics and the number of available intersection suggestions.

The updated version of the new plan is in the ticket description!

CBogen updated the task description.