Tags: r, dplyr, wikipedia-api, pageviews

How to run the article_pageviews function on multiple Wikipedia articles at once in R, saving the output in a data frame?


I'm currently trying to harvest Wikipedia viewing data (how many views a certain article had in a given timeframe) using the article_pageviews function from the pageviews package. I also have a data frame containing the names of the Wikipedia articles for which I wish to extract viewing data.

My data frame containing the names looks like this:

name        Variable1   Variable2
Henry V        .            .
Henry VI       .            . 
Henry VII      .            .
   .           .            .
   .           .            .
   .           .            .

For the extraction of the viewing data I'm using the following code:

Viewings <- article_pageviews(
  project = "en.wikipedia",
  article = "name of wikipedia article",
  platform = "all",
  user_type = "all",
  start = as.Date('2019-01-01'),
  end = as.Date('2020-01-01'),
  reformat = TRUE,
  granularity = "monthly"
  )

Running this code yields a table with 12 observations (1 for each month) containing the variable views. I'm interested in the sum of the views across all 12 observations:

sum(Viewings$views)

I was wondering whether there is a way to run the article_pageviews function on all the Wikipedia page names saved in my data frame at once, and to save the sum(Viewings$views) for each article in the data frame. The only alternative would be to run article_pageviews on each article separately, so it would be interesting to know whether this process can be automated.


Solution

  • You can let map_dbl from the purrr package take the names in your df as input and fetch all the pageviews.

    library(dplyr)
    library(purrr)
    library(pageviews)
    
    df <- tibble(name = c('Henry V', 'Henry VI', 'Henry VII', 'sadfasdfasdf'))
    
    Viewings <- df %>%
      mutate(
        views_total = map_dbl(name, .f = function(article){
          tryCatch({
            article_pageviews(
                project = "en.wikipedia",
                article = article,
                platform = "all",
                user_type = "all",
                start = as.Date('2019-01-01'),
                end = as.Date('2020-01-01'),
                reformat = TRUE,
                granularity = "monthly"
              ) %>%
              pull(views) %>%
              sum(na.rm = TRUE)
            },
            error = function(e){return(NA_real_)}
          )
        })
      )
    

    The code above also covers the case where an article can't be found (such as 'sadfasdfasdf'): tryCatch intercepts the resulting error and returns NA for that row instead of aborting the whole pipeline.
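
    As an alternative to writing tryCatch by hand, purrr's possibly() wraps a function so that any error yields a default value. Below is a minimal, offline sketch of that pattern; get_views is a hypothetical stand-in for article_pageviews (which requires network access), not part of the pageviews package.

    library(purrr)

    # Hypothetical stand-in for article_pageviews: errors for unknown names.
    get_views <- function(article) {
      lookup <- c("Henry V" = 120, "Henry VI" = 80)
      if (!article %in% names(lookup)) stop("article not found")
      unname(lookup[[article]])
    }

    # possibly() returns a wrapped function that gives `otherwise` on error.
    safe_views <- possibly(get_views, otherwise = NA_real_)

    map_dbl(c("Henry V", "Henry VI", "sadfasdfasdf"), safe_views)
    #> [1] 120  80  NA

    Inside the mutate() call above, you would wrap article_pageviews (plus the pull/sum step) the same way, which keeps the mapping expression short.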