I am attempting to scrape data from Reddit using the R package RedditExtractoR. Specifically, I am using reddit_urls() to return results from Reddit for the search term "president".
I first created an object links499 that (should) contain 499 pages' worth of URLs containing the term "president", sorted by number of comments.
links499 <- reddit_urls(search_terms = "president",
                        cn_threshold = 0,
                        page_threshold = 499,
                        sort_by = "comments",
                        wait_time = 2)

links499Com <- get_reddit(search_terms = "president",
                          cn_threshold = 0,
                          page_threshold = 499,
                          sort_by = "comments",
                          wait_time = 2)
Each of these objects had the same number of unique URL titles (n = 239), and both returned only URLs with a very high number of comments (the lowest of which was 12,378). This makes sense because I am pulling URLs from Reddit in order of decreasing number of comments.
# Have the same number of unique titles
length(unique(links499$title))
length(unique(links499Com$title))

# Both have a minimum of 12,378 comments
min(links499$num_comments)
min(links499Com$num_comments)
I next wanted to return an even larger number of matching URLs for the search term "president" from Reddit. I thought this could be accomplished by simply increasing the page_threshold parameter. However, I (unsuccessfully) tried the same code, only now searching through 1,000 pages' worth of URLs.
links1000 <- reddit_urls(search_terms = "president",
                         cn_threshold = 0,
                         page_threshold = 1000,
                         sort_by = "comments",
                         wait_time = 2)

links1000Com <- get_reddit(search_terms = "president",
                           cn_threshold = 0,
                           page_threshold = 1000,
                           sort_by = "comments",
                           wait_time = 2)
I thought links1000 would contain URLs with the search term "president" from the 1,000 pages with the largest number of comments (whereas links499 would contain URLs with the search term "president" from the 499 pages with the largest number of comments). However, links1000 and links499 were identical.
Moreover, links1000Com could not be created and threw an error: URL 'https://www.reddit.com/r/politics/comments/dzd8lu/discussion_thread_fifth_democratic_presidential/.json?limit=500': status was 'Failure when receiving data from the peer'.
It seems there is a 500-page limit.
My question is: how would I obtain all URLs (and their associated comments)? Not just for the top 499 or top 1,000 pages, but continuing until every URL on Reddit matching the search term "president" has been returned?
Thank you for sharing any advice.
*** EDIT ***
As suggested, I am adding reproducible code below. Thank you again!
library(tidyverse)
library(RedditExtractoR)
links499 <- reddit_urls(search_terms = "president",
                        cn_threshold = 0, # minimum number of comments
                        page_threshold = 499,
                        sort_by = "comments",
                        wait_time = 2)

links499Com <- get_reddit(search_terms = "president",
                          cn_threshold = 0,
                          page_threshold = 499,
                          sort_by = "comments",
                          wait_time = 2)
# Have the same number of unique titles (n=239)
length(unique(links499$title))
length(unique(links499Com$title))
# Both have minimum of 12378
min(links499Com$num_comments)
min(links499$num_comments)
links1000 <- reddit_urls(search_terms = "president",
                         cn_threshold = 0, # minimum number of comments
                         page_threshold = 1000, # can probably get as many URLs as you want but you can only extract a certain amount of data at one time
                         sort_by = "comments",
                         wait_time = 2)

links1000Com <- get_reddit(search_terms = "president",
                           cn_threshold = 0,
                           page_threshold = 1000,
                           sort_by = "comments",
                           wait_time = 2)
# Have the same number of unique titles (n=241)
length(unique(links1000$title))
length(unique(links1000Com$title))
# Both have minimum of 12378
min(links1000Com$num_comments)
min(links1000$num_comments)
So, looking at the code for get_reddit and reddit_urls, you will see that get_reddit is a wrapper around reddit_urls and that the two functions simply have different defaults (see the package source for get_reddit and reddit_urls).
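If you want to verify that yourself, base R can show both the differing defaults and the wrapper relationship from your own session (a quick check on my part, assuming the same RedditExtractoR version used in the question):
# Compare the default arguments of the two functions
formals(RedditExtractoR::reddit_urls)
formals(RedditExtractoR::get_reddit)

# Inspect the wrapper's body to see how it calls reddit_urls internally
body(RedditExtractoR::get_reddit)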
However, the answer to your question is: you can't get more than 1,000 results for a search query.
Limitations and caveats
The limit=500 argument in your error message refers to the desired number of posts to return, not the desired number of pages. The way reddit does pagination is different from what you might expect: basically, reddit keeps track of the order of posts, and in order to get the next set of posts (a new page) you pass the ID of the last post you received along with your call. I think reddit also keeps track of the originator of the call (your computer) and places limits on how much it will return.
This describes reddit's API (in particular before and after).
Here is a resource in Python that describes the limitations of reddit's API.
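As a minimal sketch of what that cursor-based pagination looks like (using the same public JSON search endpoint as the code further down; the query and limit values here are just for illustration):
library(jsonlite)

# First page of search results from the public JSON endpoint
page1 <- fromJSON("https://www.reddit.com/search/.json?q=president&limit=100")

# `after` is the ID of the last post on this page,
# or NULL once reddit will not paginate any further
cursor <- page1$data$after

# The next page is requested by passing that cursor back via `after=`
page2 <- fromJSON(paste0("https://www.reddit.com/search/.json?q=president&limit=100&after=", cursor))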
EDIT:
It's also not clear to me why we're not getting the number of results we asked for. One thing I noticed is that reddit seems to stop giving keys to further results after a certain number of pages. It's not clear what this is based on. I wrote some code to check it out and see if I could pull out the results myself:
search_query = "president"
number_of_pages = 10
results_holder <- data_frame(page = 1:number_of_pages, search = character(length = number_of_pages), titles = as.list(rep(1, number_of_pages)), url = as.list(rep(1, number_of_pages)))
first_search <- paste0("https://www.reddit.com/search/.json?q=",search_query,"&limit=1000&sort=comment")
tmp <- read_lines(first_search)
tmp2 <- jsonlite::fromJSON(tmp)
results_holder$search[1] <- first_search
results_holder$titles[[1]] <- tmp2$data$children$data$title
results_holder$url[[1]] <- tmp2$data$children$data$permalink
last_name <- tmp2$data$after
for(i in 2:number_of_pages){
new_search = paste0("https://www.reddit.com/search/.json?q=",search_query,"&limit=1000&sort=comment&after=",last_name)
tmp_loop <- read_lines(new_search)
tmp2_loop <- jsonlite::fromJSON(tmp_loop)
results_holder$search[i] <- new_search
results_holder$titles[[i]] <- tmp2_loop$data$children$data$title
results_holder$url[[i]] <- tmp2_loop$data$children$data$permalink
last_name <- tmp2_loop$data$after
Sys.sleep(5)
}
From this you can examine the object results_holder$search and see that eventually we start over in the pagination.
What I see happening (and can verify by doing the same thing in the browser) is that reddit stops giving a value for after in the JSON file. This is the value we need in order to construct the new search string and get the next page. Sometimes I can get it to return 3 pages (~250 results) before it starts giving "after": null.
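For what it's worth, here is one way to spot where that happens in the results_holder built above (my own quick checks, not part of the original code):
# The search URLs actually issued; an empty `after=` at the end of a URL means the cursor was NULL
results_holder$search

# How many results each page returned
sapply(results_holder$titles, length)

# Duplicated permalinks across pages are another sign the results started repeating
sum(duplicated(unlist(results_holder$url)))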