
Develop a model to detect sentences that need copy-editing
Closed, Resolved · Public

Description

In the literature review, it became clear that there is no off-the-shelf method for detecting copyedit candidates in Wikipedia that works across all of Wikipedia's languages.

As a longer-term goal, it would thus be desirable to develop a custom model that can identify sentences that require copyediting in Wikipedia in different languages. Such a model would probably be similar to the citation-needed model which aims to predict sentences in Wikipedia that need citations (https://arxiv.org/abs/1902.11116).

Things we need:

  • A labeled dataset of positive and negative examples (i.e. sentences requiring and not requiring copyedits) in different languages. A promising approach is to use copyedit templates, which exist in at least 81 Wikipedias. We also have a dataset of edits (reverted and non-reverted) with the edit tag "newcomer task copyedit" in 10+ wikis (T299245#8188596)
  • Implementing one or more models to predict positive and negative examples based on text features.

Event Timeline

Update week 2022-01-10:

  • I met with Djellel this week to discuss the potential for working together on building a custom model to detect candidates for copyedits in Wikipedia.
  • We are starting to scope the plan to develop such a model and to specify the necessary tasks. The idea is that individual parts of the model development can be done as internship projects.

Update week 2022-06-13:
Discussing with Djellel we clarified the goal:

  • the evaluation of LanguageTool T305180 showed that it can surface meaningful copyedits (and suggestions for improvement) in Wikipedia articles, but that we have to deal with the challenge that it might surface many false positives. Therefore, we would like to build a model that can assign a confidence score to LanguageTool's copyedits in Wikipedia articles. Then we can work through the list from top to bottom, highlighting those copyedits we are most confident about.
  • The first step is to get a labeled dataset of copyedits in Wikipedia articles. A promising candidate (thanks to Marshall's suggestion) are the suggested edits in which newcomers are guided (without actual recommendations) to do copyedits. We can extract the corresponding edits in all deployed languages via the revision_tag "newcomer task: copyedit" (see an example in enwiki). Analysing the corresponding diffs should give us a multilingual labeled dataset to train a model that can rank LanguageTool's errors.
MGerlach renamed this task from Develop a model to detect sentences that need copy-editing (Q3+) to Develop a model to detect sentences that need copy-editing. Jul 8 2022, 2:14 PM

Update week 2022-07-04:

  • started to build the dataset by querying revisions tagged "newcomer task copyedit" in the revision tags from mediawiki-history (a query sketch is below). for enwiki there were ~25k such revisions.
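
For reference, a minimal sketch of how such a query could look with PySpark; the table, partition, and column names (wmf.mediawiki_history, snapshot, revision_tags, revision_parent_id) reflect my understanding of the schema and are assumptions that may need adjusting.

```
from pyspark.sql import SparkSession

# assumes a Hive-enabled Spark session on the analytics cluster
spark = (SparkSession.builder
         .appName("newcomer-copyedit-revisions")
         .enableHiveSupport()
         .getOrCreate())

revisions = spark.sql("""
    SELECT wiki_db,
           page_id,
           revision_id,
           revision_parent_id
    FROM wmf.mediawiki_history
    WHERE snapshot = '2022-06'
      AND event_entity = 'revision'
      AND event_type = 'create'
      AND array_contains(revision_tags, 'newcomer task copyedit')
""")

# one row per tagged edit; revision_parent_id gives the "before" revision for diffing
revisions.write.parquet("newcomer_copyedit_revisions.parquet")
```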

Update week 2022-07-11:

  • got a dataset of all edits in enwiki tagged "newcomer task copyedit" (pairs of revision ids of the old and new revision). currently adding the wikitext of each revision in order to analyze the diffs of the edits.

Update week 2022-07-18:

  • completed a dataset of copyedit edits in enwiki tagged with "newcomer task copyedit". retrieved the wikitext of the old and new revision, respectively, and annotated the diff with the edit-types library
  • this allows for filtering of edits that likely correspond to copyediting by focusing on edits that are i) localized to a single sentence and ii) only contain changes to individual words (no changes to links, templates, etc.); a simplified sketch of this filter is below
  • this yields a set of ~7000 localized copyedit edits which we can use for training the model. it contains not only positive but also negative samples (i.e. reverted edits)
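
A simplified, self-contained sketch of the filtering idea (not the actual edit-types-based implementation): an edit counts as a localized copyedit only if exactly one sentence changed and that change is restricted to a handful of words without touching wiki markup. The sentence splitter and markup check here are deliberately naive placeholders.

```
import difflib
import re

MARKUP = re.compile(r"\[\[|\{\{|<ref")  # crude check for links/templates/refs

def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def is_localized_copyedit(old_text, new_text, max_changed_words=3):
    old_sents, new_sents = split_sentences(old_text), split_sentences(new_text)
    if len(old_sents) != len(new_sents):
        return False  # sentences added or removed -> not a localized copyedit
    changed = [(o, n) for o, n in zip(old_sents, new_sents) if o != n]
    if len(changed) != 1:
        return False  # change must be confined to a single sentence
    old_s, new_s = changed[0]
    if MARKUP.search(old_s) or MARKUP.search(new_s):
        return False  # skip edits that touch links, templates, references, etc.
    # count word-level insertions/deletions between the two sentences
    diff = difflib.ndiff(old_s.split(), new_s.split())
    n_changed = sum(1 for d in diff if d.startswith(("+ ", "- ")))
    return 0 < n_changed <= 2 * max_changed_words

print(is_localized_copyedit("He go to school. It rains.",
                            "He goes to school. It rains."))  # True
```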

Update week 2022-08-22:

Number of edits tagged "newcomer task copyedit" per wiki:

enwiki 25550
ruwiki 3154
frwiki 2095
itwiki 1964
arwiki 1958
fawiki 1773
eswiki 1502
dewiki 1454
idwiki 1449
zhwiki 919
hewiki 808
ptwiki 690
trwiki 642
kowiki 600
ukwiki 527
viwiki 471
plwiki 386
cswiki 307
svwiki 280
jawiki 255
nowiki 215
azwiki 186
fiwiki 180
huwiki 173
hrwiki 157
bnwiki 145
dawiki 111
uzwiki 108
simplewiki 95
elwiki 83
cawiki 76
rowiki 68
thwiki 60
srwiki 53
hiwiki 52
mswiki 47
bswiki 38
tewiki 36
euwiki 27
kawiki 21
ckbwiki 20
kkwiki 17
sqwiki 16
bewiki 15
urwiki 14
testwiki 14
ltwiki 14
etwiki 13
lvwiki 9
slwiki 8
mlwiki 6
mywiki 6
aswiki 3
glwiki 3
zh_yuewiki 3
tawiki 2
astwiki 2
azbwiki 2
knwiki 2
siwiki 1
tlwiki 1
kywiki 1

Update week 2022-09-05:

  • manual evaluation of a sample of copyedits (T315086#8225199) showed that spellcheckers in particular suffer from a large number of false positives. my hypothesis is that this comes from the fact that a spellchecker only looks up each individual word (checking whether it is in a pre-defined dictionary) without considering any of the context of the sentence.
  • thus, our first approach will be to use a language model (e.g. BERT) to generate features capturing the context of the copyedit in order to assign a high or low score to the proposed copyedit (a minimal sketch is below). we can use the "newcomer task copyedit" dataset for evaluation.
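
As an illustration of this direction, a minimal sketch using a masked language model from Hugging Face transformers to compute a pseudo-log-likelihood of a sentence, so a proposed copyedit is scored in context rather than word-by-word like a spellchecker. The model name (bert-base-multilingual-cased) and the token-by-token masking scheme are illustrative assumptions, not the final design.

```
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
model.eval()

def pseudo_log_likelihood(sentence):
    """Average log-probability of each token when it is masked out, given its context."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total, n = 0.0, 0
    with torch.no_grad():
        for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs = torch.log_softmax(logits, dim=-1)
            total += log_probs[ids[i]].item()
            n += 1
    return total / max(n, 1)

# a higher score for the corrected sentence than for the original supports the copyedit
print(pseudo_log_likelihood("He go to school every day."))
print(pseudo_log_likelihood("He goes to school every day."))
```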

weekly update:

  • discussed with Djellel; we refined the plan for the development of the model in terms of features, training and test data
  • planning to start implementing in the next week(s)

weekly update:

  • conducted a first analysis on using the BERT language model to classify sentences as grammatically correct/incorrect
  • for benchmark corpora and synthetic corpora (not Wikipedia) we obtain high accuracy, showing the general applicability of this approach to score/rank copyedits. however, for the dataset of sentences from the newcomer-copyedit task, the model cannot distinguish between sentences before and after the edit (i.e. from these examples we cannot detect systematic differences which would help us distinguish supposedly correct from incorrect sentences). since the general approach works with benchmark/synthetic corpora, the limiting factor seems to be the underlying dataset used to fine-tune the model with labeled sentences from Wikipedia.
  • Therefore, as a next step, we will try to obtain an alternative dataset of labeled sentences from Wikipedia that are grammatically correct/incorrect using copyedit templates. We will adapt the approach used for extracting positive and negative examples of articles with reliability issues -- instead of looking for templates indicating reliability issues (such as pov), we will look for articles with copyedit issues (e.g. copy_edit).
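
A small sketch of how the template check could look on wikitext using mwparserfromhell; the set of template names is illustrative and would need to be adapted per wiki.

```
import mwparserfromhell

COPYEDIT_TEMPLATES = {"copy edit", "copyedit", "copy-edit", "cleanup-copyedit"}

def has_copyedit_template(wikitext):
    parsed = mwparserfromhell.parse(wikitext)
    for template in parsed.filter_templates():
        name = str(template.name).strip().lower()
        if name in COPYEDIT_TEMPLATES:
            return True
    return False

print(has_copyedit_template("{{Copy edit|date=June 2022}}\nSome article text."))  # True
```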

weekly update:

  • generated a new ground-truth dataset of edits to articles where the copyedit-template was removed. the rationale is that the removal of the template indicates that the edit improved the article with respect to copyediting.
  • looking at all such events in the revision history of all articles in English Wikipedia. only keeping edits for articles where: i) the template was removed only once in the revision history of the article (to avoid cases where the template is added/removed many times); ii) the edit was marked as a minor edit (to avoid edits which contain major additions or removals of content).
  • I then align sentences from the old to the new revision by matching all possible pairs of sentences via their (minimum) Levenshtein distance (a sketch of this alignment follows after this list).
  • this yields 13k pairs of sentences across 5k articles where each sentence was supposedly changed as part of copyediting, given the removal of the copyedit template. One-off dataset available here.
  • in principle, the pipeline can be adapted easily to other Wikipedias which use this or similar templates.
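
A minimal sketch of the alignment step described above, using a plain dynamic-programming Levenshtein distance; the relative-distance threshold is an illustrative assumption.

```
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def align_sentences(old_sentences, new_sentences, max_relative_distance=0.5):
    pairs = []
    for old_s in old_sentences:
        new_s = min(new_sentences, key=lambda s: levenshtein(old_s, s))
        dist = levenshtein(old_s, new_s)
        if 0 < dist <= max_relative_distance * max(len(old_s), len(new_s)):
            pairs.append((old_s, new_s))  # changed, but still clearly the "same" sentence
    return pairs

old = ["He go to school.", "The weather is nice."]
new = ["He goes to school.", "The weather is nice."]
print(align_sentences(old, new))  # [('He go to school.', 'He goes to school.')]
```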

weekly update:

  • generated larger dataset of sentence pairs by looking at all edits from removal of copyedit template (not only those marked as minor). this yields 176k pairs of sentences from 34k different articles.

weekly update:

  • filtering the dataset of edited sentences from copyedit-template removals. many sentences were not changed due to grammatical/copyedit errors but rather for stylistic reasons. focusing only on a small subset of sentences that are clearly related to grammatical errors seems to make it possible to distinguish whether a sentence needs editing using pre-trained language models.

weekly update:

  • using a standard pre-trained language model, we can automatically distinguish sentence pairs (the same sentence before and after an edit tagged as a copyedit via the removal of the copyedit template) with a moderate precision of ~70-80% (the pairwise evaluation is sketched below). this suggests we might use this model to predict whether a specific sentence requires copyediting.
  • so far we have only checked this for sentences from English Wikipedia. as a next step I will extract similar sentence pairs (before/after an edit where the copyedit template was removed) from other wikis. the template exists in 83 different wikis (Q6292692)
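
For concreteness, the pairwise evaluation mentioned above can be expressed as a tiny helper; `score_sentence` is a placeholder for whatever scoring function is used (e.g. the masked-LM pseudo-log-likelihood sketched earlier in this task).

```
def pairwise_accuracy(pairs, score_sentence):
    """Fraction of (before, after) pairs where the edited sentence scores higher."""
    correct = sum(1 for before, after in pairs
                  if score_sentence(after) > score_sentence(before))
    return correct / len(pairs)

# pairs = [("He go to school.", "He goes to school."), ...]
# print(pairwise_accuracy(pairs, pseudo_log_likelihood))
```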

weekly update:

  • generated dataset of sentence pairs (before/after) from edit-diffs where copyedit-template was removed for all wikis which have the copyedit-template. after some filtering, there are 30 different wikis with at least 1000 pairs of aligned sentences (before/after the removal of the copyedit template)

weekly update:

  • obtained first results for evaluating the model to score sentences for copyediting in multiple languages
  • considering 7 languages (arwiki, bnwiki, cswiki, enwiki, eswiki, frwiki, viwiki), we obtain an accuracy between 70-80% across languages in distinguishing the ground-truth sentences from those wikis obtained from the removal of copyedit templates
  • as a next step: apply model's scores to larger dataset of sentences from Wikipedia and manually check results

weekly update:

  • generated a larger dataset for evaluation of the model consisting of 1M random sentences from articles of each of the 7 wikis (arwiki, bnwiki, cswiki, enwiki, eswiki, frwiki, viwiki)

weekly update:

  • applied copyedit-scoring model to the 1M sentences
  • get a subsample of 100 sentences from the worst/average/highest score ranges for manual evaluation (a sampling sketch is below)
  • discussing strategies for manual scoring with collaborators
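
An illustrative sketch of how such a stratified sample could be drawn with pandas; the file name, column names ("sentence", "score"), and quantile cut-offs are assumptions, not the actual pipeline.

```
import pandas as pd

scored = pd.read_parquet("scored_sentences.parquet")  # hypothetical file of the 1M scored sentences

low, high = scored["score"].quantile([0.01, 0.99])
buckets = {
    "worst":   scored[scored["score"] <= low],
    "average": scored[scored["score"].between(scored["score"].quantile(0.45),
                                              scored["score"].quantile(0.55))],
    "best":    scored[scored["score"] >= high],
}
# 100 sentences per bucket for manual annotation
sample = pd.concat(df.sample(n=100, random_state=42).assign(bucket=name)
                   for name, df in buckets.items())
sample.to_csv("manual_evaluation_sample.csv", index=False)
```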

weekly update:

  • improved the copyedit model by fine-tuning with a larger dataset automatically extracted for different languages via an existing diff-extractor tool (a fine-tuning sketch follows after this list)
  • evaluation accuracy to distinguish before/after ground-truth copyedit sentences: between 64-81%
  • apply model to assign copyedit-scores to any sentence (low score=sentence needs copyediting)
  • next step: manual evaluation of scored sentences
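
A hedged sketch of what such a fine-tuning step could look like with Hugging Face transformers, treating "before" sentences as positive (needs copyediting) and "after" sentences as negative; the model choice, hyperparameters, and data layout are illustrative, not the actual setup used here.

```
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# pairs: list of (before_sentence, after_sentence) tuples extracted from the diffs
pairs = [("He go to school.", "He goes to school.")]
data = Dataset.from_dict({
    "text": [s for before, after in pairs for s in (before, after)],
    "label": [lab for _ in pairs for lab in (1, 0)],  # 1 = needs copyediting
})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="copyedit-model", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data,
    tokenizer=tokenizer,
)
trainer.train()
```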

weekly updates:

  • fixed sentence extraction pipeline for the 1M-sentences dataset. there were some errors from misformatted sentences due to, e.g., formatting templates leading to missing digits
  • re-running automatic scoring on the new dataset
  • planning manual evaluation of scored sentences

weekly update:

  • starting manual annotation of subsample of the scored sentences from the 1M-sentence dataset

weekly update:

  • we have a prototype for a multilingual model to score sentences by their need for copyediting
  • working on the documentation of results and started to write a paper for submission to a conference
  • still working on getting manual ratings for the automatic scores. revising the extraction of well-formed sentences in different languages since they often contain misformatting from templates etc. (one example is the use of {{formatnum}} in wikitext). one solution is to switch to extracting sentences from the HTML versions of articles, where templates are expanded (a sketch is below).
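
A minimal sketch of the HTML-based extraction idea (one option among several): take the parsed article HTML, where templates such as {{formatnum}} are already expanded, keep paragraph text, and split it into sentences. The tag choices and the naive sentence splitter are simplifications of what a production pipeline would do.

```
import re
from bs4 import BeautifulSoup

def sentences_from_html(article_html):
    soup = BeautifulSoup(article_html, "html.parser")
    # drop elements that rarely contain well-formed prose sentences
    for tag in soup.find_all(["table", "sup", "style", "math"]):
        tag.decompose()
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    # keep only reasonably long sentence candidates
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if len(s.split()) >= 5]

html = "<p>The city has a population of 1,234,567. It lies on the river.</p>"
print(sentences_from_html(html))
```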

weekly update:

  • generated 1M sentence dataset for 11 Wikipedias extracted from HTML-dumps for evaluation of scoring (dataset)

@MGerlach Without knowing much about the details of the solutions, I have prepared a sentence dataset for 321 languages recently for training a fasttext model for language identification. Code https://github.com/santhoshtr/wikisentences, dataset https://analytics.wikimedia.org/published/datasets/one-off/santhosh/wikisentences/ (there is a fresh version with data clean up that I have not published yet). Please contact me if this could be useful.

weekly update:

  • putting together the results of the model. we have evaluation results for 11 languages from offline evaluation. Djellel is still running one round of manual evaluation.
  • we are discussing how to frame the contributions for the paper (task, dataset, model design and results for multilingual evaluation). we started sketching the paper

weekly update:

  • no update (I didn't get to work on the documentation this week as planned, as I was out sick for one day)

weekly update