postgresql, wikipedia, wikipedia-api, web-analytics

Popularity of each Wikipedia article


I would like to store a list of all en.wikipedia articles in my database. For each article I want to store the pageid, the title, and a popularity measure. I thought about using the view count over the last month as the measure of popularity, but if that is not possible I could use something else, such as the number of revisions. I'm aware of http://dumps.wikimedia.org/enwiki/latest/ and know that I can get a full list of articles from there (current count: 36508337). However, I cannot find a good way to get the view count for each article.

Update: The suggested duplicate does not help me because (a) I am looking for a popularity measure, and the answers to the other question only state that it is not possible to get the number of watchers for a page, which is fine with me; and (b) no answer there gives me the page views (or any other metric) for every page.


Solution

  • Okay, I'm finally done. Here is what I did:

    I found http://dumps.wikimedia.org/other/pagecounts-ez/, which provides page views per month. This seemed promising, but the files do not include the pageid. So I take the list of all articles from http://dumps.wikimedia.org/enwiki/latest/, create a name->pageid mapping, and then parse the page count dump. This takes about 30 minutes. Here are some statistics:

    1. 68% of the articles in the page count file do not exist in the latest dump. This is probably because some users link to, for example, Misfits_(TV_series), while others link to Misfits_(tv_series) or even Misfits_%28TV_series%29. I did not bother with those because my program already took long enough to run.

    2. The top 3 pages are:

      2.1. Front page with 639 million views (in the last month)

      2.2. Malware with 8.5 million views

      2.3. Falcon 9 v1.1 with 4.7 million views (cool!)

    3. I made a histogram of the number of pages with a given view count. [Figure: histogram of the number of pages per view count]

    4. I also plotted the number of pages I would have to deal with if I disregarded all articles below a certain view count. [Figure: log-log plot of the number of pages with at least x views]
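For anyone who wants to reproduce the join described above, here is a minimal Python sketch. It is not the program I actually ran. It assumes both dumps have already been parsed into plain tuples: (pageid, title) pairs from the article list and (title, views) pairs from the page count file. All function names and the sample pageids are made up for illustration. Percent-decoding the titles folds variants like Misfits_%28TV_series%29 into Misfits_(TV_series), which should recover some of the unmatched rows. The lowercase variant Misfits_(tv_series) is deliberately left unmatched here, because titles are case-sensitive after the first character and folding case could merge distinct articles.

```python
from collections import Counter
from urllib.parse import unquote


def normalize_title(raw):
    """Decode percent-escapes, use underscores for spaces, uppercase the first letter."""
    title = unquote(raw).strip().replace(" ", "_")
    return title[:1].upper() + title[1:]


def build_title_index(pages):
    """Build the name->pageid mapping from (pageid, title) pairs."""
    return {normalize_title(title): pageid for pageid, title in pages}


def aggregate_views(title_index, view_rows):
    """Sum views per pageid; also count rows whose title is not in the index."""
    views = Counter()
    unmatched = 0
    for raw_title, count in view_rows:
        pageid = title_index.get(normalize_title(raw_title))
        if pageid is None:
            unmatched += 1
        else:
            views[pageid] += count
    return views, unmatched


def pages_with_at_least(views, threshold):
    """Number of pages that have at least `threshold` views."""
    return sum(1 for v in views.values() if v >= threshold)


# Hypothetical sample data, not real pageids or view counts.
pages = [(1, "Misfits (TV series)"), (2, "Malware")]
view_rows = [
    ("Misfits_(TV_series)", 10),
    ("Misfits_%28TV_series%29", 5),
    ("Misfits_(tv_series)", 2),
    ("malware", 7),
]

index = build_title_index(pages)
views, unmatched = aggregate_views(index, view_rows)
print(dict(views), unmatched, pages_with_at_least(views, 10))
```

Calling pages_with_at_least for a range of thresholds gives the data behind the "at least x views" plot. For the full dump you would stream both files line by line rather than hold lists in memory, though the name->pageid dictionary itself still has to fit in RAM.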