Getting total page views per page from (French) Wikipedia


I am looking for the total pageviews (from July 2015, when the Pageviews API was released, to 1 January 2019) of each page of the French Wikipedia project.

Using the Pageviews API (How to use Wikipedia API to get the page view statistics of a particular page in wikipedia?) seems far too heavy to me: I need data for over 2 million pages.

Using Massviews (https://tools.wmflabs.org/massviews/) with a query returning all page titles (https://quarry.wmflabs.org/query/34473) does not work either: Massviews has a 20,000-page limit and fails to retrieve data for some of the page titles in my query results.

Do you know of a more efficient tool for this?


Solution

  • Wikipedia's API is powerful: for example, a request like this returns the pageviews of Apollo_10 on French Wikipedia. Writing a script based on it is not hard.
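As a rough illustration of such a script, here is a minimal Python sketch against the Wikimedia REST per-article pageviews endpoint. The function names, the monthly granularity, the `all-access`/`all-agents` choices, the end date of 31 December 2018 and the User-Agent string are my own assumptions for the example, not something taken from the question; adjust them to what you actually need.

```python
import json
import urllib.parse
import urllib.request

# Public Wikimedia REST endpoint for per-article pageviews.
API_ROOT = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_url(title, project="fr.wikipedia", start="20150701", end="20181231"):
    """Build a monthly per-article request URL (dates are YYYYMMDD)."""
    # Titles use underscores instead of spaces; percent-encode the rest, including "/".
    article = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return f"{API_ROOT}/{project}/all-access/all-agents/{article}/monthly/{start}/{end}"


def total_views(payload):
    """Sum the "views" field over all items of a decoded API response."""
    return sum(item["views"] for item in payload.get("items", []))


def fetch_total(title, user_agent="pageview-script/0.1 (you@example.org)"):
    """Fetch and sum the views of one title (network call; user_agent is a placeholder)."""
    request = urllib.request.Request(build_url(title), headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return total_views(json.load(response))
```

Calling `fetch_total("Apollo_10")` issues one HTTP request per page, so covering 2 million pages this way means 2 million requests, which is the overhead the question is worried about.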

    If querying the API for every page is too heavy, you can use Google BigQuery, which includes pageview data in its public datasets. There is a tutorial about this.

    Here is my example:

    1. Open the BigQuery console.
    2. Enter the query below.
    SELECT * FROM `bigquery-public-data.wikipedia.pageviews_2015` WHERE datehour = '2015-07-12 18:00:00 UTC';

    3. Run it, and you will get a table containing all the pageview data for that hour.

    If you want a specific page of French Wikipedia, you can filter on wiki = 'fr' and title = 'xxx'. As I'm new to BigQuery, I don't know how to query data across the yearly tables and export it, but with basic SQL it should be possible: you can aggregate the data by title and export the result.
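The aggregation by title could look roughly like the sketch below, which only builds the SQL text for one yearly table at a time. It assumes the tables are named `pageviews_<year>` as in the example query and that the view count lives in a column called `views`; check both against the dataset schema before running anything. Also note that mobile traffic may be stored under a separate `wiki` value, which this sketch does not include.

```python
def yearly_totals_sql(year, wiki="fr"):
    """Return a query that sums views per title for one yearly table."""
    # The question's range starts in July 2015; later years are taken in full.
    start = "2015-07-01" if year == 2015 else f"{year}-01-01"
    return (
        "SELECT title, SUM(views) AS total_views\n"
        f"FROM `bigquery-public-data.wikipedia.pageviews_{year}`\n"
        f"WHERE datehour >= TIMESTAMP('{start}')\n"
        f"  AND datehour < TIMESTAMP('{year + 1}-01-01')\n"
        f"  AND wiki = '{wiki}'\n"
        "GROUP BY title"
    )


# One query per year from 2015 to 2018; the per-year results still
# have to be summed by title to get the overall total.
queries = [yearly_totals_sql(year) for year in range(2015, 2019)]
```

Selecting only `title` and `views` (rather than `*`) keeps the amount of data processed, and therefore the cost, lower.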

    The only problem is that BigQuery is not free. For example, the query above processes about 6 GB. On-demand queries are free for the first 1 TB per month and cost 5 dollars per TB after that (at the time of writing). BigQuery charges for all the data processed in the columns you select, even if you use LIMIT, so it may cost a lot.
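To get a feel for the numbers, here is a small back-of-the-envelope helper using the figures quoted above (1 TB free, then 5 dollars per TB). The function name is made up for this example, and the prices are the ones from this answer, so check the current pricing page before relying on them.

```python
def on_demand_cost_usd(tb_processed, free_tb=1.0, usd_per_tb=5.0):
    """Estimate the on-demand cost for a given amount of data processed in one month."""
    # Only the volume above the free allowance is billed.
    billable_tb = max(0.0, tb_processed - free_tb)
    return billable_tb * usd_per_tb


# A single 6 GB query (0.006 TB) stays inside the free allowance.
single_query = on_demand_cost_usd(0.006)
# Scanning 2 TB in a month bills the 1 TB above the allowance.
two_tb = on_demand_cost_usd(2.0)
```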