Improve features for wikibase vandalism detection model
Open, Medium, Public

Description

The set of features used for Wikidata vandalism detection is good, but it can be better.

Event Timeline

Restricted Application added a subscriber: Aklapper.

We are working on this with @Lea_Lacroix_WMDE to get feedback and improve them.

Any updates here?

I just had a meeting with Wikidata's communication manager. She is starting the process and it takes some time.

And the announcement for feedback will be done on Monday, July 2nd :)

Vvjjkkii renamed this task from Improve features for wikibase vandalism detection model to yxcaaaaaaa. Jul 1 2018, 1:09 AM
Vvjjkkii removed Ladsgroup as the assignee of this task.
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from yxcaaaaaaa to Improve features for wikibase vandalism detection model. Jul 2 2018, 4:10 PM
CommunityTechBot assigned this task to Ladsgroup.
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

@Ladsgroup let's sit down together and flesh this out :)

@Ladsgroup Can you add the list of existing features as we discussed?

Sure:

is_client_move,
is_client_delete,
is_merge_into,
is_merge_from,
is_revert,
is_restore,
is_item_creation,
sex_or_gender_changed,
country_of_citizenship_changed,
member_of_sports_team_changed,
date_of_birth_changed,
image_changed,
signature_changed,
commons_category_changed,
official_website_changed,
en_label_changed,
is_human,
is_blp,
comment_longest_repeated_char,
comment_uppercase_ratio,
comment_numbers_ratio,
comment_whitespace_ratio,
comment_english_bad_words,
comment_english_informals,
comment_longest_repeated_uppercase_char,
comment_has_url,
comment_has_first_person_pronouns_en,
comment_has_second_person_pronouns_en,
comment_has_do_or_dont_en,
log(wikibase.revision.parent.claims + 1),
log(wikibase.revision.parent.properties + 1),
log(wikibase.revision.parent.aliases + 1),
log(wikibase.revision.parent.sources + 1),
log(wikibase.revision.parent.qualifiers + 1),
log(wikibase.revision.parent.badges + 1),
log(wikibase.revision.parent.labels + 1),
log(wikibase.revision.parent.sitelinks + 1),
log(wikibase.revision.parent.descriptions + 1),
wikibase.revision.diff.sitelinks_added,
wikibase.revision.diff.sitelinks_removed,
wikibase.revision.diff.sitelinks_changed,
wikibase.revision.diff.labels_added,
wikibase.revision.diff.labels_removed,
wikibase.revision.diff.labels_changed,
wikibase.revision.diff.descriptions_added,
wikibase.revision.diff.descriptions_removed,
wikibase.revision.diff.descriptions_changed,
wikibase.revision.diff.aliases_added,
wikibase.revision.diff.aliases_removed,
wikibase.revision.diff.properties_added,
wikibase.revision.diff.properties_removed,
wikibase.revision.diff.properties_changed,
wikibase.revision.diff.claims_added,
wikibase.revision.diff.claims_removed,
wikibase.revision.diff.claims_changed,
wikibase.revision.diff.identifiers_changed,
wikibase.revision.diff.sources_added,
wikibase.revision.diff.sources_removed,
wikibase.revision.diff.qualifiers_added,
wikibase.revision.diff.qualifiers_removed,
wikibase.revision.diff.badges_added,
wikibase.revision.diff.badges_removed,
wikibase.revision.diff.proportion_of_qid_added,
wikibase.revision.diff.proportion_of_language_added,
wikibase.revision.diff.proportion_of_links_added,
revision.comment.suggests_section_edit,
revision.comment.has_link,
revision.user.is_bot,
revision.user.has_advanced_rights,
revision.user.is_admin,
revision.user.is_trusted,
revision.user.is_patroller,
revision.user.is_curator,
revision_oriented.revision.user.is_anon,
log(temporal.revision.user.seconds_since_registration + 1)

These are all of the features. Tell me if any of them is not clear enough.
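For readers unfamiliar with the names above, here is a rough sketch of what a few of the comment features and the log(x + 1) scaling might compute. It is an illustration only: the helper names (`longest_repeated_char`, `char_ratio`, `comment_features`, `log_scaled`) are hypothetical, the ratios assume the denominator is the full comment length, and the URL check is a simplified guess. The model's real definitions may differ.

```python
import math
import re


def longest_repeated_char(text: str) -> int:
    """Length of the longest run of one repeated character."""
    longest = 0
    current = 0
    previous = None
    for ch in text:
        current = current + 1 if ch == previous else 1
        previous = ch
        longest = max(longest, current)
    return longest


def char_ratio(text: str, predicate) -> float:
    """Fraction of characters satisfying `predicate` (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(1 for ch in text if predicate(ch)) / len(text)


def comment_features(comment: str) -> dict:
    """Assumed definitions; the feature names come from the list above."""
    return {
        "comment_longest_repeated_char": longest_repeated_char(comment),
        "comment_uppercase_ratio": char_ratio(comment, str.isupper),
        "comment_numbers_ratio": char_ratio(comment, str.isdigit),
        "comment_whitespace_ratio": char_ratio(comment, str.isspace),
        "comment_has_url": bool(re.search(r"https?://", comment)),
    }


def log_scaled(count: int) -> float:
    """log(count + 1), the shape used for the parent-revision counts."""
    return math.log(count + 1)


example = comment_features("AAAA hi 12")
```

The + 1 inside the logarithm keeps the value defined when a count is zero, so an item with no claims maps to 0.0 rather than an error.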

Halfak triaged this task as Medium priority. Feb 19 2019, 10:21 PM
Halfak moved this task from Unsorted to New development on the Machine-Learning-Team board.
Addshore subscribed.

I don't know how much work this is.
@Lydia_Pintscher should this still be on the campsite?
Is this ready to be done?

It's probably good to do embedding or clustering on a set of one-hot encodings of properties changed, languages changed, number of statements per property, etc. That would likely make the model considerably more accurate.
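To make the idea a bit more concrete, here is a minimal sketch of the first half: encoding the set of properties changed in each edit as a multi-hot vector, then grouping edits with a plain k-means. Everything here is an assumption for illustration: the function names are hypothetical, the sample edits are made up, and seeding the centroids with the first k vectors is chosen only to keep the example deterministic. A real pipeline would more likely use an existing library and far more data.

```python
def build_vocabulary(edits):
    """Map every property ID seen in any edit to a column index."""
    properties = sorted({prop for edit in edits for prop in edit})
    return {prop: index for index, prop in enumerate(properties)}


def encode(edit, vocabulary):
    """Multi-hot vector: 1.0 in the column of each changed property."""
    vector = [0.0] * len(vocabulary)
    for prop in edit:
        if prop in vocabulary:
            vector[vocabulary[prop]] = 1.0
    return vector


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Plain k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Made-up edits: each is the set of property IDs touched by one revision.
edits = [
    {"P21", "P27"},          # sex or gender, country of citizenship
    {"P18"},                 # image
    {"P21", "P27", "P569"},  # plus date of birth
    {"P18", "P373"},         # image, Commons category
]
vocabulary = build_vocabulary(edits)
vectors = [encode(edit, vocabulary) for edit in edits]
assignments, centroids = kmeans(vectors, k=2)
```

The cluster index (or the distance to each centroid) could then be fed to the model as an extra feature; the same encoding would apply to languages changed.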

@Ladsgroup @Lydia_Pintscher

Is it already possible to write up a more precise description of what the expected outcome of this would look like? Right now it is very wide open and general, with no clear end.

with this:

It's probably good to do embedding or clustering on a set of one-hot encodings of properties changed, languages changed, number of statements per property, etc. That would likely make the model considerably more accurate.

Maybe we repurpose this task to capture doing that (which I don't understand yet)? @Ladsgroup does that make sense?