Wikipedia pageviews analysis


I've been challenged with a Wikipedia pageviews analysis. This is my first project with this amount of data and I'm a bit lost. When I download the file from the link and unpack it, I can see that it has a table-like structure, with rows looking like this:

1   |  2                             |3|4

en.m The_Beatles_in_the_United_States 2 0

I'm struggling to work out what exactly is in each column. My guesses:

1. language version plus additional info (.m = mobile?)

2. name of the article

My biggest concern is the last two columns. The last one contains only "0" values and I have no idea what it represents. I'd assume the third one shows the number of views, but I'm not sure.

I'd be grateful if someone could help me to understand what exactly can be found in each column or recommend some reading on this subject. Thanks!


Solution

  • After spending more time on this, I finally found a solution. I'm posting it in case someone has the same problem in the future. Wikipedia explains what can be found in the database. These explanations were painful to find, but you can access them here and here.

    Based on that, you can see that rows have the following structure:

    • domain_code
    • page_title
    • count_views
    • total_response_size (no longer maintained)
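As a rough illustration of that structure, here is a minimal Python sketch that splits one row into the four fields. The `PageviewRow` and `parse_row` names are made up for this example, and it assumes the fields are separated by single spaces, as in the sample row from the question.

```python
from typing import NamedTuple


class PageviewRow(NamedTuple):
    """One row of a pageviews file (hypothetical helper type)."""
    domain_code: str
    page_title: str
    count_views: int
    total_response_size: int


def parse_row(line: str) -> PageviewRow:
    # The first field is the domain code, the last two are integers;
    # whatever sits in between is treated as the page title.
    domain_code, rest = line.rstrip("\n").split(" ", 1)
    page_title, count_views, total_response_size = rest.rsplit(" ", 2)
    return PageviewRow(
        domain_code, page_title, int(count_views), int(total_response_size)
    )


row = parse_row("en.m The_Beatles_in_the_United_States 2 0")
print(row.domain_code, row.page_title, row.count_views, row.total_response_size)
```

Splitting from both ends keeps the title intact even if it ever contained a space, which the sample row does not.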

    Some explanations for each column:

    Column 1:

    Domain name of the request, abbreviated. (...) Domain_code now can also be an abbreviation for mobile and zero domain names, in which case .m or .zero is inserted as second part of the domain name (just like with full domain name). E.g. 'en.m.v' stands for "en.m.wikiversity.org".

    Column 2:

    For page-level files, it holds the title of the unnormalized part after /wiki/ in the request URL (e.g. Main_Page, Berlin). For project-level files, it is - .

    Column 3:

    The number of times this page has been viewed in the respective hour.

    Column 4:

    The total response size caused by the requests for this page in the respective hour.

    If I understand it correctly, the response size was discontinued because of its low accuracy, which is why the column contains only 0s: "The pagecounts and projectcounts files also include total response byte sizes at their respective aggregation level, but this was dropped from the pageviews and projectviews files because it wasn't very accurate."
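With the columns understood, a first pass over the data might total the views per domain code and ignore the fourth column. The following is only a sketch: `views_by_domain` is a hypothetical helper, and the sample rows other than the one from the question are invented for illustration.

```python
from collections import Counter
from typing import Iterable


def views_by_domain(lines: Iterable[str]) -> Counter:
    """Sum count_views (column 3) per domain_code (column 1)."""
    totals: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        domain_code, _page_title, count_views, _response_size = parts
        totals[domain_code] += int(count_views)
    return totals


# Invented sample data in the same four-column layout.
sample = [
    "en.m The_Beatles_in_the_United_States 2 0",
    "en Main_Page 7 0",
    "en.m Main_Page 3 0",
]
totals = views_by_domain(sample)
print(totals.most_common())
```

Because the function accepts any iterable of lines, the same call should also work on an open file handle for an unpacked hourly file, so the whole file never has to be loaded into memory at once.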

    Hope someone finds it useful.