Index of /other/pagecounts-all-sites/

Name         Last Modified          Size   Type
../                                 -      Directory
2014/        2015-Jan-05 18:29:14   -      Directory
2015/        2015-Dec-01 01:09:29   -      Directory
2016/        2016-Aug-01 01:16:16   -      Directory
README.txt   2015-Mar-12 09:20:46   5.2K   text/plain

The files in this directory provide hourly aggregated pagecounts and
projectcounts of all sites (i.e.: desktop, mobile, and zero) across
public wikis.

1.   On-wiki documentation
2.   Used pageview definition
3.   File structure
3.1. Disambiguating abbreviations ending in “.m”
4.   Contact / Bugs

1. On-wiki documentation

While this README.txt is currently (2014-10-01) up-to-date, this file cannot
easily be updated by the community. Hence, we consider the on-wiki documentation
the authoritative documentation.

This README.txt is just a convenience for people who have trouble accessing the
on-wiki documentation.

2. Used pageview definition

The files use the pageview definition of webstatscollector and, in contrast to
the pagecounts-raw files, apply it not only to the desktop site but to all
sites: desktop, mobile, and zero.

(Note that this does not yet catch everything that we want to consider a
pageview. It comes with the same definition issues as webstatscollector data
from pagecounts-raw. But! It provides data for mobile, as a stop-gap measure.)

3. File structure

The file structure should be compatible with the files from pagecounts-raw.

Filenames are of the form

  ${YEAR}/${YEAR}-${MONTH}/pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz
  ${YEAR}/${YEAR}-${MONTH}/projectcounts-${YEAR}${MONTH}${DAY}-${HOUR}0000


The pagecounts are gzipped text files holding hourly per-page aggregates of
pageviews and total response bytes, and projectcounts are plain text files
holding hourly per-domain-name aggregates of pageviews and total response bytes.

Note that (to maintain compatibility with pagecounts-raw) the time used in the
filename refers to the end of the aggregation period, not the beginning.
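
The naming scheme and the end-of-period convention can be sketched in Python as
follows. This is only an illustration, not part of the dataset tooling: the
function names are made up for this example, and it assumes that ${MONTH},
${DAY} and ${HOUR} are zero-padded to two digits and that each file covers
exactly one hour.

```python
from datetime import datetime, timedelta


def pagecounts_path(period_end: datetime) -> str:
    """Relative path of the pagecounts file whose hour ends at period_end."""
    return period_end.strftime("%Y/%Y-%m/pagecounts-%Y%m%d-%H0000.gz")


def projectcounts_path(period_end: datetime) -> str:
    """Relative path of the projectcounts file whose hour ends at period_end."""
    return period_end.strftime("%Y/%Y-%m/projectcounts-%Y%m%d-%H0000")


def aggregation_window(filename: str) -> tuple:
    """Return (start, end) of the hour covered by a file name or path.

    The timestamp in the name is the END of the aggregation period,
    so the start is assumed to be one hour earlier.
    """
    base = filename.rsplit("/", 1)[-1]
    if base.endswith(".gz"):
        base = base[:-3]
    stamp = base.split("-", 1)[1]  # e.g. "20150101-000000"
    end = datetime.strptime(stamp, "%Y%m%d-%H%M%S")
    return end - timedelta(hours=1), end
```

Under these assumptions, a file stamped 20150101-000000 sits in the 2015/2015-01
directory but holds the requests of the last hour of 2014.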

Both pagecounts and projectcounts are made up of lines having 4 space-separated
fields:

  domain page_title count_views total_response_size

* domain

    Domain name of the request.

    Common trailing parts have been abbreviated just as they are for the above
    pagecounts-raw files:

    ""           -> ""
    ""           -> ".b"
    ""          -> ".d"
    "" -> ".f"
    ""           -> ".m" (only for some projects. See below)
    ""            -> ".n"
    ""           -> ".q"
    ""          -> ".s"
    ""         -> ".v"
    ""          -> ".voy"
    ""           -> ".w"
    ""            -> ".wd"

    For "", only the following domains are considered:


  (There is also ".mw", but ".mw" is only there for legacy reasons: it is a
  legacy attempt to count mobile sites per language across projects. Please
  do not use ".mw"; use the counts for the mobile sites (like "en.m.voy")
  instead.)

* page_title

    For pagecounts files, it holds the title of the page.


    For projectcounts files, it is "-".

* count_views

    the number of times this page has been viewed in the respective hour.

* total_response_size

    the total response size caused by the requests for this page in the
    respective hour.

So for example a line of

  en Main_Page 42 50043

means 42 requests to the page "Main_Page" on the domain abbreviated as "en",
which accounted in total for 50043 response bytes. And

  de.m.voy Berlin 176 314159

would stand for 176 requests to the page "Berlin" on the domain abbreviated as
"de.m.voy", which accounted in total for 314159 response bytes.
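
Reading such lines can be sketched in Python as follows. This is a minimal
example, not an official parser: the names CountLine, parse_line and
read_pagecounts are made up for this illustration, and it assumes that page
titles contain no spaces and that the files decode as UTF-8 (undecodable bytes
are replaced).

```python
import gzip
from typing import Iterator, NamedTuple


class CountLine(NamedTuple):
    domain: str
    page_title: str
    count_views: int
    total_response_size: int


def parse_line(line: str) -> CountLine:
    """Split one line into its 4 space-separated fields."""
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d: %r" % (len(fields), line))
    domain, page_title, views, size = fields
    return CountLine(domain, page_title, int(views), int(size))


def read_pagecounts(path: str) -> Iterator[CountLine]:
    """Yield the parsed lines of one gzipped pagecounts file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield parse_line(line)
```

With this sketch, parse_line("en Main_Page 42 50043") yields the domain "en",
the title "Main_Page", 42 views and 50043 response bytes. For projectcounts
files the same parse_line applies, with page_title being "-"; those files are
plain text, so they would be opened with open() rather than gzip.open().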

Each domain+page_title combination occurs at most once.

The file is sorted by domain and page_title.

3.1. Disambiguating abbreviations ending in “.m”

There are two ways for an abbreviation to end in “.m”. Either the domain is one
of the whitelisted projects that the abbreviation table above maps to “.m” (like
the one abbreviated to “commons.m”), or the domain is the mobile site of a
Wikipedia (like the one abbreviated to “en.m”).

Since the whitelisted projects (see abbreviation table above)
never match a language code on Wikipedia, the mapping between domain name and
abbreviation is bijective.

While this solution requires an “if” for the edge case of “Summing up pageviews
across all mobile sites”, it stays compatible with pagecounts-raw's
abbreviations while also keeping the concept and semantics of abbreviating
domain names. It also makes it easier to automate comparisons between this
dataset and TSVs (like sampled-1000) or Hive data.
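
The “if” mentioned above might look like the following Python sketch. It is an
illustration under stated assumptions, not the analytics team's code: the names
are made up, and the whitelist below contains only “commons”, the one
whitelisted project named in this README. Replace it with the full list of
whitelisted projects before relying on the result.

```python
# Hypothetical, incomplete whitelist: only "commons" is named in this README.
WHITELISTED_M_PROJECTS = frozenset({"commons"})


def is_mobile(abbreviation, whitelisted=WHITELISTED_M_PROJECTS):
    """Guess whether an abbreviated domain denotes a mobile site."""
    parts = abbreviation.split(".")
    if len(parts) < 2 or parts[1] != "m":
        return False  # e.g. "en", "en.voy", or the legacy "en.mw"
    if len(parts) == 2:
        # "<name>.m" is either a whitelisted project or a mobile Wikipedia.
        return parts[0] not in whitelisted
    return True  # e.g. "en.m.voy"


def sum_mobile_views(rows):
    """Sum count_views over (domain, count_views) pairs for mobile sites."""
    return sum(views for domain, views in rows if is_mobile(domain))
```

Under these assumptions, “en.m” and “de.m.voy” count as mobile, while
“commons.m”, “en” and the legacy “en.mw” do not.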

4. Contact / Bugs

You can reach the analytics team via email
or via IRC on freenode in #wikimedia-analytics.