The literature review made clear that there is no off-the-shelf method for detecting copyedit candidates in Wikipedia that works across all of its language editions.
As a longer-term goal, it would thus be desirable to develop a custom model that can identify sentences requiring copyediting in different language versions of Wikipedia. Such a model would probably be similar to the citation-needed model, which aims to predict which sentences in Wikipedia need citations (https://arxiv.org/abs/1902.11116).
Things we need:
- A labeled dataset of positive and negative examples (i.e. sentences requiring and not requiring copyedits) in different languages.
  - A promising source is copyedit templates, which exist in at least 81 Wikipedias.
  - We have a dataset of edits (reverted and non-reverted) with the edit tag "newcomer task copyedit" in 10+ wikis (T299245#8188596).
- One or more models that predict positive and negative examples based on text features.
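The modelling step listed above could be prototyped along the following lines. This is only a minimal sketch under stated assumptions, not the planned model: the surface features, the toy sentences, and the function names (`text_features`, `train`, `predict_proba`) are all hypothetical, and a plain logistic regression stands in for whatever architecture is eventually chosen. The punctuation-based features assume Latin-script conventions and would need adapting per language.

```python
import math
import re


def text_features(sentence: str) -> list[float]:
    """Hand-picked surface features (illustrative assumption, not a vetted set)."""
    return [
        1.0,                                                      # bias term
        float(bool(re.search(r"\s{2,}", sentence))),              # repeated whitespace
        float(bool(re.search(r"\s[,.;:!?]", sentence))),          # space before punctuation
        float(bool(re.search(r"(\b\w+\b)\s+\1\b", sentence,
                             re.IGNORECASE))),                    # duplicated word
        float(not sentence.rstrip().endswith((".", "!", "?"))),   # no terminal punctuation
        float(sentence[:1].islower()),                            # lowercase first letter
    ]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights: list[float], sentence: str) -> float:
    """Estimated probability that the sentence needs a copyedit."""
    x = text_features(sentence)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def train(examples: list[tuple[str, int]], epochs: int = 200, lr: float = 0.5) -> list[float]:
    """Fit logistic-regression weights with plain stochastic gradient ascent."""
    weights = [0.0] * len(text_features(""))
    for _ in range(epochs):
        for sentence, label in examples:
            x = text_features(sentence)
            error = label - predict_proba(weights, sentence)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights


# Toy labelled data (invented for illustration): 1 = needs copyedit, 0 = fine.
EXAMPLES = [
    ("the the city is large .", 1),
    ("The city is large.", 0),
    ("this is  a sentence without ending", 1),
    ("This is a sentence.", 0),
    ("he went to to the market", 1),
    ("He went to the market.", 0),
]

weights = train(EXAMPLES)
needs_edit = predict_proba(weights, "she she likes tea")
looks_fine = predict_proba(weights, "She likes tea.")
```

In practice the labelled examples would come from the two sources listed above (copyedit templates and tagged newcomer edits) rather than from hand-written sentences, and the features would need to be validated separately for each language.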