Index of /other/pagecounts-all-sites/

The files in this directory provide hourly aggregated pagecounts and
projectcounts of all sites (i.e.: desktop, mobile, and zero) across
public wikis.

1.   On-wiki documentation
2.   Used pageview definition
3.   File structure
3.1. Disambiguating abbreviations ending in “.m”
4.   Contact / Bugs



1. On-wiki documentation
==========================

While this README.txt is currently (2014-10-01) up-to-date, this file cannot
easily be updated by the community. Hence, we consider the on-wiki documentation
at

  https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites

the authoritative documentation.

This README.txt is just a convenience for people that have issues accessing the
on-wiki documentation.


2. Used pageview definition
=============================

The files use the pageview definition of webstatscollector. In contrast to the
files at

  http://dumps.wikimedia.org/other/pagecounts-raw/

they apply it not only to the desktop site but to all sites: desktop, mobile,
and zero.

(Note that this does not yet catch everything that we want to count as a
pageview. It comes with the same definition issues as the webstatscollector data
from pagecounts-raw, but it provides data for mobile as a stop-gap measure.)


3. File structure
===================

The file structure should be compatible with the files from

  http://dumps.wikimedia.org/other/pagecounts-raw/

Filenames are of the form

  http://dumps.wikimedia.org/other/pagecounts-all-sites/${YEAR}/${YEAR}-${MONTH}/pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz
  http://dumps.wikimedia.org/other/pagecounts-all-sites/${YEAR}/${YEAR}-${MONTH}/projectcounts-${YEAR}${MONTH}${DAY}-${HOUR}0000
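
As an illustration of the filename pattern, the following Python sketch builds
both URLs for a given timestamp. The function name build_urls is hypothetical,
and the zero-padded two-digit month, day, and hour are an assumption based on
the pagecounts-raw naming; this README does not spell out the padding.

```python
from datetime import datetime
from typing import Tuple

BASE_URL = "http://dumps.wikimedia.org/other/pagecounts-all-sites"


def build_urls(stamp: datetime) -> Tuple[str, str]:
    """Return (pagecounts_url, projectcounts_url) for the hour named in the file.

    Assumption: MONTH, DAY and HOUR are zero-padded to two digits.
    """
    directory = f"{BASE_URL}/{stamp:%Y}/{stamp:%Y-%m}"
    suffix = f"{stamp:%Y%m%d-%H}0000"
    return (
        f"{directory}/pagecounts-{suffix}.gz",
        f"{directory}/projectcounts-{suffix}",
    )
```

For example, build_urls(datetime(2014, 10, 1, 5)) yields URLs ending in
"pagecounts-20141001-050000.gz" and "projectcounts-20141001-050000".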

The pagecounts are gzipped text files holding hourly per-page aggregates of
pageviews and total response bytes. The projectcounts are plain text files
holding hourly per-domain-name aggregates of pageviews and total response bytes.

Note that (to maintain compatibility with pagecounts-raw) the time used in the
filename refers to the end of the aggregation period, not the beginning.
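
To make the end-of-period convention concrete, here is a small Python sketch
that turns a filename into the aggregation period it covers. The function name
aggregation_period is hypothetical. It assumes each period is exactly one hour
long and returns naive datetimes, since this README does not state a time zone.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "pagecounts-20150101-000000.gz" or "projectcounts-20150101-000000".
_STAMP = re.compile(r"(?:pagecounts|projectcounts)-(\d{8})-(\d{2})0000(?:\.gz)?$")


def aggregation_period(filename):
    """Return (start, end) of the hour covered by a pagecounts/projectcounts file."""
    match = _STAMP.search(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    # The timestamp in the filename is the END of the aggregation period.
    end = datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H")
    # Assumption: the period is exactly one hour long.
    return end - timedelta(hours=1), end
```

Under these assumptions, a file named pagecounts-20150101-000000.gz covers the
last hour of 2014-12-31, not the first hour of 2015-01-01.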

Both pagecounts and projectcounts are made up of lines having 4 space-separated
fields:

  domain page_title count_views total_response_size



* domain

    Domain name of the request.

    Common trailing parts have been abbreviated just as they are for the above
    pagecounts-raw files:

    ".wikipedia.org"           -> ""
    ".wikibooks.org"           -> ".b"
    ".wiktionary.org"          -> ".d"
    ".wikimediafoundation.org" -> ".f"
    ".wikimedia.org"           -> ".m" (only for some projects. See below)
    ".wikinews.org"            -> ".n"
    ".wikiquote.org"           -> ".q"
    ".wikisource.org"          -> ".s"
    ".wikiversity.org"         -> ".v"
    ".wikivoyage.org"          -> ".voy"
    ".mediawiki.org"           -> ".w"
    ".wikidata.org"            -> ".wd"

    For ".wikimedia.org", only the following domains are considered:

       * commons.wikimedia.org
       * meta.wikimedia.org
       * incubator.wikimedia.org
       * species.wikimedia.org
       * strategy.wikimedia.org
       * outreach.wikimedia.org
       * usability.wikimedia.org
       * quality.wikimedia.org


    (There is also ".mw", but it exists only for legacy reasons: it was an
    attempt to count mobile sites per language across projects. Please do not
    use ".mw"; use the counts for the mobile sites (like "en.m.voy") instead.)
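
The abbreviation table above can be inverted to recover a full domain name. The
Python sketch below is one possible way to do that; expand_domain and the two
constants are hypothetical names. It rejects ".mw", and it assumes that an
abbreviation ending in ".m" belongs to wikimedia.org whenever its first label is
one of the eight listed projects.

```python
# Trailing abbreviations from the table above (".m" is handled separately).
SUFFIXES = {
    "b": "wikibooks.org",
    "d": "wiktionary.org",
    "f": "wikimediafoundation.org",
    "n": "wikinews.org",
    "q": "wikiquote.org",
    "s": "wikisource.org",
    "v": "wikiversity.org",
    "voy": "wikivoyage.org",
    "w": "mediawiki.org",
    "wd": "wikidata.org",
}

# The only wikimedia.org projects that are abbreviated with ".m".
WIKIMEDIA_PROJECTS = frozenset({
    "commons", "meta", "incubator", "species",
    "strategy", "outreach", "usability", "quality",
})


def expand_domain(abbreviation):
    """Expand an abbreviated domain such as "de.m.voy" to its full name."""
    prefix, dot, suffix = abbreviation.rpartition(".")
    if dot and suffix == "mw":
        raise ValueError('".mw" is a legacy abbreviation and is not supported')
    # Assumption: the first label decides whether ".m" means wikimedia.org.
    if dot and suffix == "m" and prefix.split(".")[0] in WIKIMEDIA_PROJECTS:
        return prefix + ".wikimedia.org"
    if dot and suffix in SUFFIXES:
        return prefix + "." + SUFFIXES[suffix]
    # ".wikipedia.org" is abbreviated to the empty string.
    return abbreviation + ".wikipedia.org"
```

With this sketch, "commons.m" expands to "commons.wikimedia.org", while "en.m"
expands to "en.m.wikipedia.org".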

* page_title

    For pagecounts files, it holds the title of the page.

    E.g.:
      Main_Page
      Berlin

    For projectcounts files, it is "-".

* count_views

    The number of times this page has been viewed in the respective hour.

* total_response_size

    The total response size caused by the requests for this page in the
    respective hour.


So for example a line of

  en Main_Page 42 50043

means 42 requests to "en.wikipedia.org/wiki/Main_Page", which accounted in total
for 50043 response bytes. And

  de.m.voy Berlin 176 314159

would stand for 176 requests to "de.m.wikivoyage.org/wiki/Berlin", which
accounted in total for 314159 response bytes.
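
A line in this format can be parsed with a few lines of Python. The following is
a minimal sketch; CountLine and parse_line are hypothetical names, and it
assumes that none of the four fields contains a space.

```python
from typing import NamedTuple


class CountLine(NamedTuple):
    domain: str
    page_title: str          # "-" in projectcounts files
    count_views: int
    total_response_size: int


def parse_line(line):
    """Parse one line of a pagecounts or projectcounts file."""
    # Assumption: fields are separated by single spaces and contain none.
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    domain, page_title, views, size = fields
    return CountLine(domain, page_title, int(views), int(size))
```

For the first example line above, parse_line("en Main_Page 42 50043") returns a
CountLine with count_views 42 and total_response_size 50043. The gzipped
pagecounts files can be read line by line with gzip.open(path, "rt") from the
Python standard library.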

Each domain+page_title combination occurs at most once.

The file is sorted by domain and page_title.
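
Because all lines of one domain are adjacent in a sorted file, per-domain totals
can be computed in a single streaming pass. The Python sketch below illustrates
this with itertools.groupby; views_per_domain is a hypothetical name, and the
input is assumed to be an iterable of lines in the format described above.

```python
from itertools import groupby


def views_per_domain(lines):
    """Yield (domain, total_views) pairs from lines sorted by domain."""
    rows = (line.split() for line in lines if line.strip())
    # groupby only merges adjacent rows, which is enough for a sorted file.
    for domain, group in groupby(rows, key=lambda row: row[0]):
        yield domain, sum(int(row[2]) for row in group)
```

Only one group is held in memory at a time, so the whole file never needs to be
loaded at once.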



3.1. Disambiguating abbreviations ending in “.m”
--------------------------------------------------

There are two ways for an abbreviation to end in “.m”: either the domain is a
whitelisted project on wikimedia.org (like “commons.wikimedia.org” being
abbreviated to “commons.m”), or the domain is the mobile site of Wikipedia (like
“en.m.wikipedia.org” being abbreviated to “en.m”).

Since the whitelisted wikimedia.org projects (see abbreviation table above)
never match a language code on Wikipedia, the mapping between domain name and
abbreviation is bijective.

While this solution requires an “if” for the edge case of “Summing up pageviews
across all mobile sites”, it stays compatible with pagecounts-raw's
abbreviations while also keeping the concept and semantics of abbreviating
domain names. It also makes it easier to automate comparisons between this
dataset and TSVs (like sampled-1000) or Hive data.
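
The “if” mentioned above could look like the following Python sketch. The names
is_mobile and total_mobile_views are hypothetical. The sketch treats a domain as
mobile when it carries an "m" label that is not the wikimedia.org abbreviation.
It does not count zero sites or the legacy ".mw" entries as mobile, which is an
assumption rather than something this README prescribes.

```python
# The whitelisted wikimedia.org projects from the abbreviation table.
WIKIMEDIA_PROJECTS = frozenset({
    "commons", "meta", "incubator", "species",
    "strategy", "outreach", "usability", "quality",
})


def is_mobile(abbreviation):
    """Return True if the abbreviated domain denotes a mobile site."""
    labels = abbreviation.split(".")
    # The edge case: for a whitelisted project, a trailing "m" stands for
    # ".wikimedia.org" and says nothing about mobile.
    if labels[0] in WIKIMEDIA_PROJECTS and labels[-1] == "m":
        labels = labels[:-1]
    return "m" in labels[1:]


def total_mobile_views(rows):
    """Sum count_views over (domain, count_views) pairs of mobile sites."""
    return sum(views for domain, views in rows if is_mobile(domain))
```

Here "en.m" and "de.m.voy" count as mobile, while "commons.m" does not.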



4. Contact / Bugs
===================

You can reach the analytics team via email at

  analytics@lists.wikimedia.org

or via IRC on freenode in #wikimedia-analytics.