I'm trying to harvest Wikipedia page view data (how many views a given article had in a given timeframe) using the article_pageviews function from the pageviews package. I also have a data frame containing the names of the Wikipedia articles whose view data I want to extract.
My data frame containing the names looks like this:
name       Variable1  Variable2
Henry V    .          .
Henry VI   .          .
Henry VII  .          .
.          .          .
.          .          .
.          .          .
To extract the view data I'm using the following code:
Viewings <- article_pageviews(
  project = "en.wikipedia",
  article = "name of wikipedia article",
  platform = "all",
  user_type = "all",
  start = as.Date('2019-01-01'),
  end = as.Date('2020-01-01'),
  reformat = TRUE,
  granularity = "monthly"
)
Running this code yields a table with 12 observations (one per month) containing the variable views. I'm interested in the sum of the views across all 12 observations:
sum(Viewings$views)
Is there a way to run the article_pageviews function on all the Wikipedia page names saved in my data frame at once, and store sum(Viewings$views) for each article in the data frame? The only alternative I can see is to run article_pageviews on each article separately, so I'd like to know whether this process can be automated.
You can use map_dbl from the purrr package to take the names in your df as input and return the total page views for each article.
library(dplyr)
library(purrr)
library(pageviews)

df <- tibble(name = c('Henry V', 'Henry VI', 'Henry VII', 'sadfasdfasdf'))

Viewings <- df %>%
  mutate(
    # For each article name, fetch the monthly views and sum them
    views_total = map_dbl(name, .f = function(article) {
      tryCatch({
        article_pageviews(
          project = "en.wikipedia",
          article = article,
          platform = "all",
          user_type = "all",
          start = as.Date('2019-01-01'),
          end = as.Date('2020-01-01'),
          reformat = TRUE,
          granularity = "monthly"
        ) %>%
          pull(views) %>%
          sum(na.rm = TRUE)
      },
      # If the request fails (e.g. the article does not exist), return NA
      error = function(e) NA_real_
      )
    })
  )
The code above also covers the case where an article can't be found (such as 'sadfasdfasdf'): tryCatch catches the resulting error and the function returns NA for that article instead.
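To see the error-handling pattern on its own, without hitting the Wikipedia API, here is a minimal sketch. It assumes a hypothetical stand-in function, fake_pageviews, which is not part of the pageviews package, and the view counts in it are made up purely for illustration. The helper total_views is also a name introduced here.

```r
library(purrr)

# Hypothetical stand-in for article_pageviews(): returns made-up monthly
# views for known articles and raises an error for unknown ones.
fake_pageviews <- function(article) {
  known <- list("Henry V" = c(10, 20, 30), "Henry VI" = c(5, 5))
  if (!article %in% names(known)) {
    stop("article not found: ", article)
  }
  data.frame(views = known[[article]])
}

# Sum the views for one article; return NA if the lookup raises an error.
total_views <- function(article, fetch = fake_pageviews) {
  tryCatch(
    sum(fetch(article)$views, na.rm = TRUE),
    error = function(e) NA_real_
  )
}

# One total per name; the unknown article yields NA instead of stopping the run.
totals <- map_dbl(c("Henry V", "Henry VI", "sadfasdfasdf"), total_views)
print(totals)
```

Because map_dbl requires every call to return a single double, returning NA_real_ (rather than a plain NA or NULL) in the error handler keeps the result a numeric vector that fits directly into a data frame column.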