Page MenuHomePhabricator

Add a shared table to CheckUser that records changes to different wikis per IP address of the user
Closed, ResolvedPublic2 Estimated Story Points

Description

For additional context, see the discussion on T355672: Investigate: How to make the GUC query performant.

Background

A couple of new features need this table:

  • The new Global User Contributions (GUC) tool will allow looking up edits from temporary users using the same IP address or range, across all wikis
  • GlobalBlocking will introduce global autoblocks against accounts, which will involve looking up IP addresses used by accounts across all wikis

In order for both features to avoid querying the CheckUser tables for all wikis, we can introduce a shared table that records which IPs have recently edited which wikis.

Suggested structure

Columns:

  • IP (hex value)
  • Central ID
  • Wiki ID
  • Whether the row is for GUC
  • Timestamp of most recent change

Indexes:

  • Primary key on (IP, is the row for guc, wiki ID, central ID). This allows us to select the wiki IDs, given an IP address or range for GUC while only including temporary accounts contributions to ensure an efficient lookup.
  • Index on the central ID and timestamp, to allow finding the most recently used IP address(es) for GlobalBlocking auto blocks.
  • Index on timestamp. This allows us to purge rows older than $wgCUDMaxAge.
Shared table

This could be shared using virtual domains, configured similar to this example.

Related Objects

Event Timeline

Jotting down some thoughts about questions that might affect the table structure:

Do we need to distinguish between temporary users and registered users?

Reasoning:

  • GlobalBlocking will allow blocking any registered user, not just temporary users. It needs to allow global autoblocks against temporary accounts for T355286: [Epic] Globally blocking a temporary account should prevent further account creations. If it also allows global autoblocks against registered users, then this table would need to hold IP addresses of anyone who makes a CheckUser-recorded change on any wiki.
  • That may slow down the GUC query, which only needs to know about changes that temporary users made from a given IP/range, since it may perform useless lookups to wikis where only a registered user edited from that IP/range. Or the difference in performance may be small enough to ignore.

Possible solutions:

  • Add a column to indicate temporary/registered user. (IP, wiki ID) would no longer be unique.
  • Do nothing unless we find that the GUC query becomes too slow

Do we need to indicate they type of change (edit, log event, private event)?

(Pointed out by @Dreamy_Jazz)

Reasoning:

  • Local autoblocks do this: they look up the last IP used, whether it was for an edit or a logged/unlogged action. If global autoblocks do this too, then this table could contain IPs of users that only performed actions and didn't make edits.
  • As above, this may degrade the performance of the GUC query, which only needs to look up edits. Or the difference may be negligible.

Possible solutions:

  • Add a column for the type of event. Perhaps that maps to the exact table (cu_changes, cu_log_event, cu_private_event), to avoid needing to look for the IP in all tables.
  • Do nothing unless we find that the GUC query becomes too slow
Tchanders renamed this task from Add a shared table that records changes to wikis by IP address to Add a shared table to CheckUser that records changes to different wikis per IP address of the user.Jun 21 2024, 2:03 PM
Tchanders updated the task description. (Show Details)

Some other thoughts:

Do we need to not create the table if the wiki is not in a wiki farm (i.e. only one wiki)?
Reasoning:

  • The CheckUser extension will have no need for a central table if it is installed on one wiki. Instead it can query the local cu_changes / cu_log_event / cu_private_event tables, or potentially no need to provide a different query for a one-wiki install (as GUC and GlobalBlocking global autoblocks make no sense for a one-wiki install).
  • Having this extra table will represent a duplication of data that could amount to a storage concern for third-party wiki installs where only one wiki is used.

Possible solutions:

  • Allow this table to not exist if only one wiki is used, with support to enable multiple wiki support after an install
  • Do nothing because the overhead cost is not worth supporting third-party installs

Can the table rely on eventual consistency?

Reasoning:

  • A single row may be updated frequently, such as in the case of a mass-attack
  • This fast rate of editing will cause frequent updates to the timestamp and could lead to deadlocks

Possible solutions:

  • The data does not have to be immediately accurate, and we can debounce the requests to update the timestamp using some method. This should result in eventual consistency. This eventual consistency should not cause issues if the previous timestamp was more than $wgCUDMaxAge seconds old.
  • We need to find a way to ensure that the table is updated as close to the time of the edit / log action as possible.

Do we need a column for a central ID?

Reasoning:

  • The current column structure does not include a way to filter for rows that are specific to a given user
  • This makes it practically impossible to use this table for GlobalBlocking, because the autoblock code starts with the target of the block (the given user) to find the relevant IPs

Possible solutions:

  • Add a central ID column to the table and make the primary key also include the central ID column
  • Find a different way to perform global autoblocks to avoid adding more columns to this table

Should we build this table with the idea that a Global CheckUser tool could use this table?

Reasoning:

  • Adding support for Global CheckUser has been requested in T212779. This central table may be a way to implement this tool
  • While we are unlikely to work on T212779 for the time being, building the schema of the table to support (or support with some schema modifications) a Global CheckUser tool could reduce the work needed for T212779.

Possible solutions:

  • We keep T212779 in mind and avoid using a schema for this table that makes it impossible for T212779 to use this table.
  • We do not consider T212779 as part of making this table. If the schema of this table works for that task, then it was not because of making it possible to implement T212779.

I will give my own answers to the questions posed by me and Thalia.

Do we need to not create the table if the wiki is not in a wiki farm (i.e. only one wiki)?

I would argue that we do not need to have the table if the install is not on a multi-wiki install. This is because the features that would use this table would be redundant to existing tools (i.e. GUC is redundant to Special:IPContributions on a one-wiki install, GlobalBlocking autoblocks can be performed by using the local autoblocking code).

A wiki can specify they are multi-wiki through configuration.

Can the table rely on eventual consistency?

I would argue that some level of eventual consistency is fine. If we were to use the job queue and de-duplicate jobs using the performer, wiki and timestamp then we could avoid mass-editing attacks causing deadlocks or too frequent writes.

Do we need a column for a central ID?

Yes, and the schema needs to be updated to include this. We add a central ID column, and then include this column in the primary key (which makes the primary key the IP hex, central ID, wiki ID).

Should we build this table with the idea that a Global CheckUser tool could use this table?

I do not believe we should actively think of this, as for WMF wikis there does not seem to be much of an appetite for a Global CheckUser tool. This table will be primarily for GlobalBlocking autoblocks and GUC.

Do we need to distinguish between temporary users and registered users?

We could join to the table that holds central IDs to usernames and then use that to exclude / include temporary accounts. Adding a is_temp column could become confusing for the same reasons that the user_is_temp column is specifically not for use in a MediaWiki context (i.e. two sources of truth being the username and DB column that may be a mismatch).

Do we need to indicate they type of change (edit, log event, private event)?

We may need to do this to make the query more efficient. For example, loginwiki will never have edits to show but will always have a log event on the temporary account creation. As such, we could add a column that is a bitfield indicating which of the tables the row refers to. Alternatively, we could have a column per table that indicates whether results are in that table.

I will wait for feedback on my thoughts before proceeding with any kind of patch.

I will give my own answers to the questions posed by me and Thalia.

Do we need to not create the table if the wiki is not in a wiki farm (i.e. only one wiki)?

I would argue that we do not need to have the table if the install is not on a multi-wiki install. This is because the features that would use this table would be redundant to existing tools (i.e. GUC is redundant to Special:IPContributions on a one-wiki install, GlobalBlocking autoblocks can be performed by using the local autoblocking code).

A wiki can specify they are multi-wiki through configuration.

I agree.

Can the table rely on eventual consistency?

I would argue that some level of eventual consistency is fine. If we were to use the job queue and de-duplicate jobs using the performer, wiki and timestamp then we could avoid mass-editing attacks causing deadlocks or too frequent writes.

I agree, if I've understood the proposal correctly: that we use only one job per performer/wiki/timestamp, and that we debounce jobs with the same wiki/performer (debounce time yet to be determined).

Do we need a column for a central ID?

Yes, and the schema needs to be updated to include this. We add a central ID column, and then include this column in the primary key (which makes the primary key the IP hex, central ID, wiki ID).

I somehow missed that we'd be looking up by IP address for GUC, but by global account for GlobalBlocking...

After reading this, I considered if we should have two tables:

  • One that GUC uses, which records which IPs edited which wikis, and is only written to for temp account contributions
  • One that GlobalBlocking uses, which records which central users have edited which wikis, and doesn't include IP addresses

Advantages:

  • GUC query is quicker since the table doesn't have registered user contributions

Disadvantages:

  • Takes more space (for temp accounts, ID, wiki ID and timestamp are repeated in two tables)
  • GlobalBlocking would need to query local CheckUser tables

Conclusion: I agree with adding a central user ID column to this table.

Should we build this table with the idea that a Global CheckUser tool could use this table?

I do not believe we should actively think of this, as for WMF wikis there does not seem to be much of an appetite for a Global CheckUser tool. This table will be primarily for GlobalBlocking autoblocks and GUC.

I agree.

Do we need to distinguish between temporary users and registered users?

We could join to the table that holds central IDs to usernames and then use that to exclude / include temporary accounts. Adding a is_temp column could become confusing for the same reasons that the user_is_temp column is specifically not for use in a MediaWiki context (i.e. two sources of truth being the username and DB column that may be a mismatch).

I agree. We can always revisit this if the queries are too slow.

Do we need to indicate they type of change (edit, log event, private event)?

We may need to do this to make the query more efficient. For example, loginwiki will never have edits to show but will always have a log event on the temporary account creation. As such, we could add a column that is a bitfield indicating which of the tables the row refers to. Alternatively, we could have a column per table that indicates whether results are in that table.

I wonder, could we start by not doing this, and monitor the query performance? It would be nice to keep things simpler, but only if it works.

Hi, I reviewed the original idea long time ago and this still sounds good to me (no changes since my review). Please let me know if you need anything further from me.

Thanks @Ladsgroup. We'll add you as a reviewer to the patch to introduce the table with its agreed-upon structure and in indexes, if that's ok.

Do we need to indicate they type of change (edit, log event, private event)?

We may need to do this to make the query more efficient. For example, loginwiki will never have edits to show but will always have a log event on the temporary account creation. As such, we could add a column that is a bitfield indicating which of the tables the row refers to. Alternatively, we could have a column per table that indicates whether results are in that table.

I wonder, could we start by not doing this, and monitor the query performance? It would be nice to keep things simpler, but only if it works.

Having discussed this with Thalia outside of Phabricator, we have decided to include this but only distinguish between edits and non-edits. To do this we will have a column that is treated as a boolean and indicates whether the row is just for edit events. This also means that there will be two rows per performer/IP/wiki combination, where the first is for all actions and the second is just for edits (though it will not be present if no edits have been made for that combination in the last 90 days).

Change #1053361 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] Add cu_central_index table on virtual-checkuser database domain

https://gerrit.wikimedia.org/r/1053361

I've proposed the name cu_central_index for the new table, because:

  • This can be seen as an index used to find the relevant information for GUC. In a GlobalBlocking autoblocks context, only this table is necessary to find the relevant IPs (but that shouldn't matter for the name IMO).
  • While it may be confused with the idea of SQL indexes, I don't think the table name will appear in any place that it could be mis-identified as a SQL index.

I am open to other suggested names for the table.

Do we need to distinguish between temporary users and registered users?

We could join to the table that holds central IDs to usernames and then use that to exclude / include temporary accounts. Adding a is_temp column could become confusing for the same reasons that the user_is_temp column is specifically not for use in a MediaWiki context (i.e. two sources of truth being the username and DB column that may be a mismatch).

I agree. We can always revisit this if the queries are too slow.

On second thoughts, I have decided that we should have a column that combines both whether the user is temporary and whether the row is for an edit, in a column named cuci_for_guc.

This is because, excluding rows that do not have the central ID of a temporary user will be hard in a way which does not bind itself to any specific implementation of the wgCentralIdProvider:

  • If MediaWiki-extensions-CentralAuth is installed then the code will need to join to the globaluser oder localuser table to perform the LIKE query on the username for the central ID. Therefore, this table must be on the same database as the MediaWiki-extensions-CentralAuth tables (which may not occur for WMF production).
  • If the local central ID provider is used then you need to join to the user table, which requires that the table be on the local wiki database.

Therefore, by indicating whether the row is for use in GUC or not the query can filter for rows where cuui_for_guc is set to 1 and be confident that the central ID for a temporary account and that the actions associated with this row are only edits. Being sure of both of these is necessary to avoid lookups where no temporary accounts are found, or issues caused by the cuui_timestamp value being for a non-edit action or made by an account other than a temporary account (see note 1 for why).

note 1: If this mismatch occurs, then a query could be made to a wiki where no rows are found. Also, if the GUC query limits itself to the X most recently edited wikis, then the cuui_timestamp` value being for an action that is not displayed would cause paging over the wikis to be incorrect.

Current table structure in the proposed patch on a MariaDB database:

MariaDB [my_database]> describe cu_central_index;
+-----------------+------------------+------+-----+---------+-------+
| Field           | Type             | Null | Key | Default | Extra |
+-----------------+------------------+------+-----+---------+-------+
| cuci_ip_hex     | varbinary(255)   | NO   | PRI | NULL    |       |
| cuci_central_id | int(10) unsigned | NO   | PRI | NULL    |       |
| cuci_for_guc    | tinyint(1)       | NO   | PRI | NULL    |       |
| cuci_wiki       | varbinary(255)   | NO   | PRI | NULL    |       |
| cuci_timestamp  | binary(14)       | NO   | MUL | NULL    |       |
+-----------------+------------------+------+-----+---------+-------+
5 rows in set (0.001 sec)

Indexes for the current proposed table structure:

MariaDB [my_database]> show index from cu_central_index;
+------------------+------------+---------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| Table            | Non_unique | Key_name                  | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored |
+------------------+------------+---------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| cu_central_index |          0 | PRIMARY                   |            1 | cuci_ip_hex     | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index |          0 | PRIMARY                   |            2 | cuci_for_guc    | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index |          0 | PRIMARY                   |            3 | cuci_wiki       | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index |          0 | PRIMARY                   |            4 | cuci_central_id | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index |          1 | cuci_timestamp            |            1 | cuci_timestamp  | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index |          1 | cuci_central_id_timestamp |            1 | cuci_central_id | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index |          1 | cuci_central_id_timestamp |            2 | cuci_timestamp  | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
+------------------+------------+---------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
7 rows in set (0.003 sec)
Dreamy_Jazz set the point value for this task to 2.Thu, Jul 11, 9:01 AM

As commented in T337089#9989917, a discussion with Product came to the conclusion that ideally we'd build GUC out a bit more so that users could use GUC to look up, globally:

  • Temp account edits from the same IP
  • Edits from the same temp account
  • Edits from the same registered (named) account

...for the last 90 days. This would mean that they wouldn't need to switch between Special:GUC and the GUC tool, at least while checking for recent edits.

I discussed this further with @Dreamy_Jazz. Technically it seems feasible to me, as long as it doesn't make the shared table too big (a concern expressed in this thread in gerrit). Use of the tool should still be restricted to users with checkuser-temporary-account, since building a publicly-available tool that looks up data in the CheckUser tables is risky, and since users without that right would have a better experience on the GUC tool, which shows historical data.

What could the tables look like?

All of this and global autoblocks in one table:

  • IP
  • Central user ID
  • Wiki ID (normalised ID, FK to a separate table of wiki ID strings)
  • Timestamp of most recent action (including edits), for global autoblocks
  • Timestamp of most recent edit, for GUC
  • Whether the user is a temporary account, for GUC (see T368151#9970775 for why)

Separate tables:

  • Global autoblocks table
    • Central user ID
    • Timestamp of most recent action
  • GUC temp user table
    • Central user ID
    • IP
    • Wiki ID
    • Timestamp of most recent edit
  • GUC named user table
    • Central user ID
    • Wiki ID
    • Timestamp of most recent edit

Maximum number of rows in each table

One big table: number of IPs * number of (all) users * number of wikis

Global autoblocks table: number of (all) users
GUC temp user table: number of temp users * number of temp user IPs * number of wikis
GUC named user table: number of named users * number of wikis

Maximum number of rows in each table

One big table: number of IPs * number of (all) users * number of wikis

Global autoblocks table: number of (all) users
GUC temp user table: number of temp users * number of temp user IPs * number of wikis
GUC named user table: number of named users * number of wikis

Just to clarify that the users would be limited to all users who have performed any CheckUser-logged action in the last 90 days. For the GUC specific tables the number of users is instead all users who have made edits in the last 90 days.

Edits by temp users/logged out users are quite small in number. Recording information on all users edits in all wikis requires two orders of magnitude more writes and more rows (specially in the original idea of storing all of them in one table) and can and will cause issues for the databases if we are not very careful. The last 90 days won't help much (if anything, it adds even more simultanes writes and potential deadlocks and contentions.) I highly recommend you separate the global block and GUC tables. GUC for temp users is tiny compared the global autoblocks one (or non-temp GUC one) and can easily move forward. But I recommend these mitigations before doing anything on global autoblocks or non-temp GUC:

  • Do not try to update the timestamp if it's updated within last minute (debounce it).
  • (maybe) if the last update was within an hour, roll a dice and update the timestamp in 10% of the times only.
  • Don't store any information on the bots
  • Don't store information on WMCS CIDR ranges (we skip global block checks for those as they caused outages).
  • (maybe) Don't store anything on wikidata and always separately make a query to rc table of it when GUC is being used.
  • Be very careful about your data structures and DDLs. I suggest splitting IPv6 and IPv4 if you're planning to store IPs for non-temp users. T292623#7546780
  • Make sure the writes happen in a job, not part of main transaction or deferred update.

We have decided to limit the scope of GUC to the following for the time being to lookup temporary accounts on a given IP or IP range.

Therefore, we likely need two tables (one for GUC and one for global autoblocks). We may also want to create a wiki ID table.

Change #1056177 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] Add cu_central_index_wiki_map table

https://gerrit.wikimedia.org/r/1056177

Change #1056202 had a related patch set uploaded (by Dreamy Jazz; author: Dreamy Jazz):

[mediawiki/extensions/CheckUser@master] Add cu_central_index_user_wiki table

https://gerrit.wikimedia.org/r/1056202

New proposed structure

cu_central_index_wiki_map

A table to normalise the wiki ID and give it a SMALLINT number to decrease space usage

MariaDB [my_database]> describe cu_central_index_wiki_map;
+-----------+----------------------+------+-----+---------+----------------+
| Field     | Type                 | Null | Key | Default | Extra          |
+-----------+----------------------+------+-----+---------+----------------+
| ciwm_id   | smallint(5) unsigned | NO   | PRI | NULL    | auto_increment |
| ciwm_wiki | varbinary(255)       | NO   |     | NULL    |                |
+-----------+----------------------+------+-----+---------+----------------+
2 rows in set (0.004 sec)
cu_central_index_ip_temp_edit

A table to allow lookups of which wikis have temporary accounts edits for a given IP or IP range, for the purposes of GUC

MariaDB [my_database]> describe cu_central_index_ip_temp_edit;
+-----------------+----------------------+------+-----+---------+-------+
| Field           | Type                 | Null | Key | Default | Extra |
+-----------------+----------------------+------+-----+---------+-------+
| ciite_ip_hex    | varbinary(255)       | NO   | PRI | NULL    |       |
| ciite_ciwm_id   | smallint(5) unsigned | NO   | PRI | NULL    |       |
| ciite_timestamp | binary(14)           | NO   | MUL | NULL    |       |
+-----------------+----------------------+------+-----+---------+-------+
3 rows in set (0.002 sec)
MariaDB [my_database]> show indexes from cu_central_index_ip_temp_edit;
+-------------------------------+------------+------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| Table                         | Non_unique | Key_name               | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored |
+-------------------------------+------------+------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| cu_central_index_ip_temp_edit |          0 | PRIMARY                |            1 | ciite_ip_hex    | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index_ip_temp_edit |          0 | PRIMARY                |            2 | ciite_ciwm_id   | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index_ip_temp_edit |          1 | ciite_timestamp        |            1 | ciite_timestamp | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index_ip_temp_edit |          1 | ciite_ip_hex_timestamp |            1 | ciite_ip_hex    | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index_ip_temp_edit |          1 | ciite_ip_hex_timestamp |            2 | ciite_timestamp | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
+-------------------------------+------------+------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
5 rows in set (0.001 sec)
cu_central_index_user_wiki

A table that allows looking up the wikis that a user has performed actions on, for global blocking and potentially also GUC

MariaDB [my_database]> describe cu_central_index_user_wiki;
+-----------------+----------------------+------+-----+---------+-------+
| Field           | Type                 | Null | Key | Default | Extra |
+-----------------+----------------------+------+-----+---------+-------+
| ciuw_central_id | int(10) unsigned     | NO   | PRI | NULL    |       |
| ciuw_ciwm_id    | smallint(5) unsigned | NO   | PRI | NULL    |       |
| ciuw_timestamp  | binary(14)           | NO   | MUL | NULL    |       |
+-----------------+----------------------+------+-----+---------+-------+
3 rows in set (0.002 sec)
MariaDB [my_database]> show indexes from cu_central_index_user_wiki;
+----------------------------+------------+---------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| Table                      | Non_unique | Key_name                  | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Ignored |
+----------------------------+------------+---------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
| cu_central_index_user_wiki |          0 | PRIMARY                   |            1 | ciuw_central_id | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index_user_wiki |          0 | PRIMARY                   |            2 | ciuw_ciwm_id    | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index_user_wiki |          1 | ciuw_timestamp            |            1 | ciuw_timestamp  | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index_user_wiki |          1 | ciuw_central_id_timestamp |            1 | ciuw_central_id | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
| cu_central_index_user_wiki |          1 | ciuw_central_id_timestamp |            2 | ciuw_timestamp  | A         |           0 |     NULL | NULL   |      | BTREE      |         |               | NO      |
+----------------------------+------------+---------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+
5 rows in set (0.001 sec)

For global autoblocks, the following pusedo code would be used:

  • Get wiki IDs for a central ID ordered by timestamp descending from the cu_central_index_user_wiki table
  • Choose the first wiki ID, and loop:
    • Get the IPs used on that wiki from the local CheckUser result tables, ordered by timestamp descending
    • Choose the first IP from the query, and loop:
      • If the last timestamp associated with this IP is less than the timestamp associated with the next wiki ID from the query in step 1, then "break;"
      • Autoblock the IP

For GUC on temp accounts and named accounts:

  • Get wiki IDs for a central ID from the cu_central_index_user_wiki table
  • Perform the lookups of the edits from the local cu_changes table for each wiki

Thanks @Dreamy_Jazz. I really like this approach because it allows us to implement what we definitely need for GUC and autoblocks, and then try and see if we can get the GUC accounts working. (And we'll still use the applicable mitigations from T368151#9993273.)

So we should be able to do GUC for IPs and global autoblocks fairly straightforwardly. Then we can try GUC for accounts (temp and permanent) based on the cu_central_index_user_wiki table, and fine-tune a few things if necessary. (E.g. if the number of wikis with actions is too big, we can try adjusting the table or at worst reduce the scope back down again.)

Change #1056177 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] Add cuci_wiki_map table

https://gerrit.wikimedia.org/r/1056177

Change #1053361 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] Add cuci_temp_edit table

https://gerrit.wikimedia.org/r/1053361

I don't think we need to QA the creation of the database tables.

You need it in production too. Shall I create it there? or you did it already :P

You need it in production too. Shall I create it there? or you did it already :P

I was about to create it :D

Change #1056202 merged by jenkins-bot:

[mediawiki/extensions/CheckUser@master] Add cuci_user table

https://gerrit.wikimedia.org/r/1056202

You need it in production too. Shall I create it there? or you did it already :P

I was about to create it :D

MariaDB [wikishared]> show tables;
+------------------------------------+
| Tables_in_wikishared               |
+------------------------------------+
....
| cuci_temp_edit                     |
| cuci_user                          |
| cuci_wiki_map                      |
....
+------------------------------------+
X rows in set (0.001 sec)

I'll also update the on-wiki documentation to add these new tables.