Analytics Datasets: MediaWiki History


This data set contains a historical record of all* events and states of Wikimedia wikis since 2001. It includes data about revisions (reverts, tags, ...), users (renames, groups, blocks, bot/human, registered/anonymous, edit count, ...) and pages (moves, redirects, deletions, restores, ...). For further details, visit the Wikitech page with the data set schema and other important information.

Updates

This data set is updated monthly, around the end of the first week of the month. Each update contains a full dump from 2001 (the beginning of MediaWiki time) up to the current month. The reason for this particularity is the underlying data, the MediaWiki databases: every time a user is renamed, a revision is reverted, a page is moved, etc., the existing related records in the logging table are updated accordingly. So an event triggered today may modify records describing the state of that table 10 years ago. And since the logging table is the base of the MediaWiki history reconstruction process, incremental downloads of these dumps may generate inconsistent data. Consider using EventStreams for real-time updates on MediaWiki changes (API docs).
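
For illustration only, here is a minimal sketch of consuming the EventStreams recentchange stream with the third-party sseclient Python package; the endpoint URL is the public Wikimedia one, while the enwiki filter and the printed fields are just example choices.

import json
from sseclient import SSEClient as EventSource  # third-party package: sseclient

# Public EventStreams endpoint for recent changes on all Wikimedia wikis.
URL = 'https://stream.wikimedia.org/v2/stream/recentchange'

for event in EventSource(URL):
    if event.event != 'message' or not event.data:
        continue
    try:
        change = json.loads(event.data)
    except ValueError:
        continue  # skip keep-alives and malformed messages
    # Example: only watch English Wikipedia changes.
    if change.get('wiki') == 'enwiki':
        print('{user} edited {title}'.format(**change))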

Versioning

Each update is named after the last month it covers, in YYYY-MM format. For example, if the dump spans from 2001 to August 2019, it is named 2019-08 (even though it is released in the first days of September 2019). There is a folder for each version at the root of the download URL. Note that, for storage reasons, only a small number of versions is kept: versions older than 6 months are removed.

Partitioning

The data is organized by wiki and time range, so it can be downloaded for a single wiki (or a set of wikis). The time split is necessary to keep file sizes manageable. There are 3 different time-range splits: monthly, yearly and all-time. Very big wikis are partitioned monthly, medium wikis are partitioned yearly, and small wikis are dumped into a single file. This ensures that files are not larger than ~2 GB, while also avoiding the generation of a very large number of files.
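
As a rough sketch only (which split a given wiki uses is decided by the dump process, and the function name, start year and default end month below are assumptions), a client could enumerate the time-range labels to download like this:

def time_ranges(split, last='2019-08'):
    # Enumerate the time-range labels (e.g. '2019-08', '2019', 'all-time')
    # to download for one wiki.
    # split: 'monthly', 'yearly' or 'all-time'; last: the dump version (YYYY-MM).
    end_year, end_month = map(int, last.split('-'))
    if split == 'all-time':
        return ['all-time']
    if split == 'yearly':
        return [str(year) for year in range(2001, end_year + 1)]
    # monthly split: every month from 2001-01 up to and including the last month
    return ['{:04d}-{:02d}'.format(year, month)
            for year in range(2001, end_year + 1)
            for month in range(1, 13)
            if (year, month) <= (end_year, end_month)]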

File format

The file format is TSV because, unlike JSON, it carries no per-record metadata, which makes the download lighter. Even though MediaWiki history data is fairly flat, some fields are arrays of strings. Such arrays are encoded as: array(<value1>,<value2>,...,<valueN>). The compression algorithm is Bzip2, because it is widely used, free software, and has a high compression ratio. Note that with Bzip2, you can concatenate several compressed files and treat them as a single Bzip2 file.
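
For illustration, a minimal Python sketch of reading one of these files follows; the file name is hypothetical, and the array parsing is naive in that it assumes individual values contain no commas.

import bz2
import csv

def parse_array(value):
    # Decode the array(<value1>,<value2>,...,<valueN>) encoding of string-array fields.
    # Naive sketch: assumes individual values contain no commas.
    if value.startswith('array(') and value.endswith(')'):
        inner = value[len('array('):-1]
        return inner.split(',') if inner else []
    return value

# bz2.open transparently handles concatenated Bzip2 streams (Python 3.3+),
# so several dump files can be catenated and read as a single file.
with bz2.open('2019-08.tsv.bz2', mode='rt', newline='') as f:  # hypothetical file name
    for row in csv.reader(f, delimiter='\t'):
        # row is a list of column values in schema order; array-typed columns
        # can be decoded with parse_array(row[i]).
        pass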

Directory structure

When choosing a file (or set of files) for download, the URL should look like this:
/<version>/<wiki>/<time_range>.tsv.bz2
Where <version> is YYYY-MM, e.g. 2019-08; <wiki> is the wiki_db, e.g. enwiki or commonswiki; and <time_range> is either YYYY-MM for big wikis, YYYY for medium wikis, or all-time for the rest. Examples of dump files are sketched below.
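
For instance, assuming the public mirror at https://dumps.wikimedia.org/other/mediawiki_history as the download root (an assumption; adjust to the actual root you use) and hypothetical size classes for the wikis shown, paths could be built like this:

BASE = 'https://dumps.wikimedia.org/other/mediawiki_history'  # assumed download root

def dump_path(version, wiki, time_range):
    # Follows the /<version>/<wiki>/<time_range>.tsv.bz2 pattern described above.
    return '{}/{}/{}/{}.tsv.bz2'.format(BASE, version, wiki, time_range)

print(dump_path('2019-08', 'enwiki', '2019-07'))       # a big wiki, monthly split
print(dump_path('2019-08', 'itwiki', '2018'))          # a medium wiki, yearly split
print(dump_path('2019-08', 'simplewiki', 'all-time'))  # a small wiki, single file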

Download MediaWiki History Data

If you're interested in how this data set is generated, have a look at the following articles:
