Wikimedia Enterprise HTML Dumps

This partial mirror of Wikimedia Enterprise HTML dumps is an experimental service.

Dumps are produced for a specific set of namespaces and wikis and made available for public download. Each dump output file is a tar.gz archive which, when decompressed and extracted, contains one file in JSON format with one line per article. The attributes defined for each article include the following:

name: the title of the article
identifier: the page id
date_modified: the last time the page was modified
version: a compound structure representing the page revision
url: the full URL of the page on the wiki
namespace: a compound structure representing the namespace of the page
in_language: a compound structure representing the language of the wiki that has the page
article_body: the page text in HTML with additional markup from the MediaWiki parser (see specification)
wikitext: the plain wikitext of the page without additional markup
license: license information for the page
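
As an illustration of the layout described above, the following minimal Python sketch streams articles out of a dump archive without unpacking it to disk. The function name iter_articles and the example file name are hypothetical, and the sketch assumes that every regular file in the archive holds one JSON object per line.

```python
import json
import tarfile


def iter_articles(archive_path):
    """Yield one dict per article line found in a dump tar.gz archive.

    Assumption: each regular file inside the archive contains one JSON
    object per line, as described for these dumps.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            stream = archive.extractfile(member)
            for raw_line in stream:
                line = raw_line.strip()
                if line:  # ignore blank lines
                    yield json.loads(line)
```

A caller could then loop over iter_articles("example-dump.json.tar.gz") (a placeholder file name) and read attributes such as article["name"] or article["url"] from each record.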

Accompanying each tar.gz file is a file, also in JSON format, containing the md5sum of the archive and the date the dump was produced.
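
A downloaded archive can be checked against the published md5sum along the following lines. This is only a sketch: the helper names are hypothetical, and because the key names used in the accompanying JSON file are not listed here, the expected checksum is passed in as a plain string rather than read from that file.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```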

All files for a dump run are included in a single directory of the format YYYYMMDD. The dumps will eventually become available on a regular schedule, around the 2nd and 21st of each month.

Initially, we hope to keep three runs for public download, covering about a 6-week period.
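
Because run directories are named YYYYMMDD, the most recent runs can be picked out by parsing the names as dates. The sketch below assumes a list of directory names has already been obtained by some other means; the function names are hypothetical, and the default of three simply mirrors the number of runs mentioned above.

```python
from datetime import datetime


def parse_run_date(dirname):
    """Parse a YYYYMMDD directory name into a date, or raise ValueError."""
    if len(dirname) != 8 or not dirname.isdigit():
        raise ValueError(f"not a YYYYMMDD name: {dirname!r}")
    return datetime.strptime(dirname, "%Y%m%d").date()


def latest_runs(dirnames, keep=3):
    """Return up to `keep` valid run directory names, newest first."""
    dated = []
    for name in dirnames:
        try:
            dated.append((parse_run_date(name), name))
        except ValueError:
            continue  # ignore entries that are not run directories
    return [name for _, name in sorted(dated, reverse=True)[:keep]]
```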

View the directories here: other/enterprise_html/runs

