I'm trying to extract the Wikipedia revision history of several hundred pages. However, the MediaWiki API caps the number of revisions returned per request at 500 (https://www.mediawiki.org/wiki/API:Revisions).
The "rvcontinue" parameter allows you to extract the next 500 and so on, but I'm not sure how to automate this in R. (I've seen some examples of Python code (Why does the Wikipedia API Call in Python throw up a Type Error?), but I don't know how to replicate it in R).
Sample GET request code for one page is below. Any help is appreciated!
base_url <- "http://en.wikipedia.org/w/api.php"
query_param <- list(action = "query",
pageids = "8091",
format = "json",
prop = "revisions",
rvprop = "timestamp|ids|user|userid|size",
rvlimit = "max",
rvstart = "2014-05-01T12:00:00Z",
rvend = "2021-12-30T23:59:00Z",
rvdir = "newer",
rvcontinue = #the continue value returned from the original request goes here
)
revision_hist <- GET(base_url, query_param)
Ideally my GET request would automatically update the rvcontinue parameter every 500 values until there are none left.
Thanks!
Edit 1
You need to extract the value of rvcontinue from the first response and feed it into the second query. I'm still tinkering with the loop, but here are the basics:
# Query 1
base_url <- "http://en.wikipedia.org/w/api.php"
query_param <- list(action = "query",
pageids = "8091",
format = "json",
prop = "revisions",
rvprop = "timestamp|ids|user|userid|size",
rvlimit = "max",
rvstart = "2014-05-01T12:00:00Z",
rvend = "2021-12-30T23:59:00Z",
rvdir = "newer"
)
r <- httr::GET(base_url, query = query_param)
parsed <- jsonlite::fromJSON(httr::content(r, as = "text"))
# Query 2
query_param2 <- list(action = "query",
pageids = "8091",
format = "json",
prop = "revisions",
rvprop = "timestamp|ids|user|userid|size",
rvlimit = "max",
rvstart = "2014-05-01T12:00:00Z",
rvend = "2021-12-30T23:59:00Z",
rvdir = "newer",
rvcontinue = parsed[["continue"]][["rvcontinue"]]
)
r2 <- httr::GET(base_url, query = query_param2)
parsed2 <- jsonlite::fromJSON(httr::content(r2, as = "text"))
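The two queries above can be folded into a loop. What follows is only a sketch, not a tested solution against the live API: fetch_json and fetch_all_revisions are names I made up, I'm assuming the https endpoint, and I'm merging the whole "continue" element of each response into the next request (it carries rvcontinue) rather than picking out rvcontinue alone. The fetch argument exists only so the HTTP call can be swapped out.

```r
# Hypothetical helper: one GET request, parsed from JSON.
fetch_json <- function(params, base_url = "https://en.wikipedia.org/w/api.php") {
  r <- httr::GET(base_url, query = params)
  httr::stop_for_status(r)
  jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"))
}

# Hypothetical helper: keep requesting until the response has no "continue".
fetch_all_revisions <- function(pageid, fetch = fetch_json) {
  params <- list(action = "query",
                 pageids = pageid,
                 format = "json",
                 prop = "revisions",
                 rvprop = "timestamp|ids|user|userid|size",
                 rvlimit = "max",
                 rvstart = "2014-05-01T12:00:00Z",
                 rvend = "2021-12-30T23:59:00Z",
                 rvdir = "newer")
  batches <- list()
  cont <- list()  # empty on the first request
  repeat {
    parsed <- fetch(c(params, cont))
    # The API reports problems in the body, e.g. "badcontinue"
    if (!is.null(parsed[["error"]])) stop(parsed[["error"]][["info"]])
    revs <- parsed[["query"]][["pages"]][[as.character(pageid)]][["revisions"]]
    batches[[length(batches) + 1]] <- revs
    cont <- parsed[["continue"]]  # NULL once there is nothing left
    if (is.null(cont)) break
  }
  batches  # one element per request
}
```

Calling fetch_all_revisions("8091") should return a list with one data frame per request. do.call(rbind, batches) combines them as long as every batch has the same columns; if they differ, something like dplyr::bind_rows() is probably the easier route.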
Original answer
I haven't solved it completely, but I noticed that you're probably missing query = query_param in httr::GET(). Here I tried using rvcontinue = "rvcontinue", but that doesn't work: the API expects the actual continue value returned by the previous query, not a placeholder string.
base_url <- "http://en.wikipedia.org/w/api.php"
query_param <- list(action = "query",
pageids = "8091",
format = "json",
prop = "revisions",
rvprop = "timestamp|ids|user|userid|size",
rvlimit = "max",
rvstart = "2014-05-01T12:00:00Z",
rvend = "2021-12-30T23:59:00Z",
rvdir = "newer",
rvcontinue = "rvcontinue"
)
response <- httr::GET(base_url, query = query_param)
parsed <- jsonlite::fromJSON(httr::content(response, as = "text"))
Here's my error message:
> print(parsed)
$error
$error$code
[1] "badcontinue"
$error$info
[1] "Invalid continue param. You should pass the original value returned by the previous query."
$error$`*`
[1] "See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."
$servedby
[1] "mw1398"
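Since the API puts the problem in an error element of the parsed body, as in the output above, it seems worth checking for that element before using the result. A minimal sketch, where check_api_error is a name I made up and the code and info fields are the ones shown in the printed error:

```r
# Hypothetical helper: stop if the parsed response carries an "error" element.
check_api_error <- function(parsed) {
  err <- parsed[["error"]]
  if (!is.null(err)) {
    stop(sprintf("MediaWiki API error [%s]: %s", err[["code"]], err[["info"]]),
         call. = FALSE)
  }
  invisible(parsed)  # pass the response through unchanged when it is fine
}
```

With the response above, check_api_error(parsed) would stop with the "badcontinue" code and its message instead of letting the bad result flow into later steps.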