
RedditExtractoR: reddit_urls() does not return all results


I am attempting to scrape data from Reddit using the R package RedditExtractoR. Specifically, I am using reddit_urls() to return results from Reddit for the search term "president".

I first created an object, links499, that should contain 499 pages' worth of URLs containing the term "president". I sorted by comments.

links499 <- reddit_urls(search_terms = "president",
  cn_threshold = 0,
  page_threshold = 499,
  sort_by = "comments",
  wait_time = 2)

links499Com <- get_reddit(search_terms = "president", 
  cn_threshold = 0,
  page_threshold = 499,
  sort_by = "comments",
  wait_time = 2)

Each of these objects had the same number of unique URL titles (n = 239), and both returned only URLs with a very high number of comments (the lowest of which was 12,378). This makes sense, since I am pulling URLs from Reddit in order of decreasing number of comments.

# Have the same number of unique titles
length(unique(links499$title))
length(unique(links499Com$title))

# Both have minimum of 12378
min(links499$num_comments)
min(links499Com$num_comments)

I next wanted to return an even larger number of matched URLs for the search term "president" from Reddit. I thought this could be accomplished by simply increasing the page_threshold parameter. However, I (unsuccessfully) tried the same code, this time searching through 1,000 pages' worth of URLs.

links1000 <- reddit_urls(search_terms = "president",
  cn_threshold = 0,
  page_threshold = 1000,
  sort_by = "comments",
  wait_time = 2)

links1000Com <- get_reddit(search_terms = "president", 
  cn_threshold = 0,
  page_threshold = 1000,
  sort_by = "comments",
  wait_time = 2)

I thought links1000 would contain URLs with the search term "president" from the 1,000 pages with the largest number of comments (whereas links499 would contain URLs from the 499 pages with the largest number of comments). However, links1000 and links499 were identical.
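
A quick check along these lines (using only the columns already shown above) makes the comparison explicit:

# TRUE if the two objects are exactly the same
identical(links499, links1000)
# character(0) if the larger search returned no new titles
setdiff(links1000$title, links499$title)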

Moreover, links1000Com could not be created and threw an error: URL 'https://www.reddit.com/r/politics/comments/dzd8lu/discussion_thread_fifth_democratic_presidential/.json?limit=500': status was 'Failure when receiving data from the peer'.

It seems there is a 500-page limit.

My question is: how can I obtain all URLs (and their associated comments), not just those from the top 499 or top 1,000 pages, but continuing until every URL on Reddit with the search term "president" has been returned?

Thank you for sharing any advice.

*** EDIT ***

As suggested, I am adding reproducible code below. Thank you again!

library(tidyverse)
library(RedditExtractoR)

links499 <- reddit_urls(search_terms = "president",
                        cn_threshold = 0, # minimum number of comments
                        page_threshold = 499,
                        sort_by = "comments",
                        wait_time = 2)

links499Com <- get_reddit(search_terms = "president", 
                          cn_threshold = 0,
                          page_threshold = 499,
                          sort_by = "comments",
                          wait_time = 2)

# Have the same number of unique titles (n=239)
length(unique(links499$title))
length(unique(links499Com$title))

# Both have minimum of 12378
min(links499Com$num_comments)
min(links499$num_comments)

links1000 <- reddit_urls(
    search_terms = "president",
    cn_threshold = 0, # minimum number of comments
    page_threshold = 1000, # can probably get as many URLs as you want but you can only extract a certain amount of data at one time
    sort_by = "comments",
    wait_time = 2
)

links1000Com <- get_reddit(search_terms = "president", 
                          cn_threshold = 0,
                          page_threshold = 1000,
                          sort_by = "comments",
                          wait_time = 2)

# Have the same number of unique titles (n=241)
length(unique(links1000$title))
length(unique(links1000Com$title))

# Both have minimum of 12378
min(links1000Com$num_comments)
min(links1000$num_comments)

Solution

  • Looking at the code for get_reddit and reddit_urls, you will see that get_reddit is a wrapper around reddit_urls and that the two functions simply have different defaults (get_reddit, reddit_urls).
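
    You can see this for yourself by printing the two functions in an R session; nothing here is specific to a package version, it just shows the function bodies and their default arguments:

    # print the source of both functions and compare their defaults
    print(reddit_urls)
    print(get_reddit)
    formals(reddit_urls)   # default arguments of reddit_urls
    formals(get_reddit)    # default arguments of get_reddit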

    However, the answer to your question is: you can't get more than 1000 results for a search query.

    Limitations and caveats

    • Search terms may be stemmed. A search for "dogs" may return results with the word "dog" in them.
    • Search results are limited to 1000 results.

    The limit=500 argument in your error message refers to the desired number of posts to return, not the desired number of pages. Reddit paginates differently from what you might expect: it keeps track of the order of posts, and to get the next set of posts (a new page) you pass the ID of the last post you received along with your call. I think reddit also keeps track of the originator of the call (your computer) and places limits on how much it will return.
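
    As a minimal sketch of that handshake (it uses the same endpoint as the code further down, and the after value shown in the comment is a placeholder, not a real post ID):

    # page 1: the response carries data$after, the key that unlocks page 2
    p1 <- jsonlite::fromJSON(readr::read_lines(
      "https://www.reddit.com/search/.json?q=president&limit=100"))
    p1$data$after   # e.g. "t3_xxxxxx" (placeholder)

    # page 2: hand that key back via the after= parameter
    p2 <- jsonlite::fromJSON(readr::read_lines(paste0(
      "https://www.reddit.com/search/.json?q=president&limit=100&after=",
      p1$data$after)))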

    This describes reddit's API (in particular before and after).

    Here is a resource in Python which describes limitations on reddit's API.


    EDIT:

    It's also not clear to me why we're not getting the number of results we asked for. One thing I noticed is that reddit seems to stop giving out keys to further results after a certain number of pages; it's not clear what this is based on. I wrote some code to check it out and see if I could pull out the results myself:

    library(tidyverse)   # for read_lines() and tibble(); jsonlite is called via ::

    search_query <- "president"
    number_of_pages <- 10

    # one row per page: the search URL used plus the titles/permalinks it returned
    results_holder <- tibble(
      page   = 1:number_of_pages,
      search = character(length = number_of_pages),
      titles = as.list(rep(1, number_of_pages)),
      url    = as.list(rep(1, number_of_pages))
    )

    first_search <- paste0("https://www.reddit.com/search/.json?q=",
                           search_query, "&limit=1000&sort=comment")

    tmp  <- read_lines(first_search)
    tmp2 <- jsonlite::fromJSON(tmp)
    results_holder$search[1]   <- first_search
    results_holder$titles[[1]] <- tmp2$data$children$data$title
    results_holder$url[[1]]    <- tmp2$data$children$data$permalink
    last_name <- tmp2$data$after   # pagination key for the next page

    for (i in 2:number_of_pages) {
      # build the next request by passing along the "after" key from the previous page
      new_search <- paste0("https://www.reddit.com/search/.json?q=", search_query,
                           "&limit=1000&sort=comment&after=", last_name)
      tmp_loop  <- read_lines(new_search)
      tmp2_loop <- jsonlite::fromJSON(tmp_loop)
      results_holder$search[i]   <- new_search
      results_holder$titles[[i]] <- tmp2_loop$data$children$data$title
      results_holder$url[[i]]    <- tmp2_loop$data$children$data$permalink
      last_name <- tmp2_loop$data$after
      Sys.sleep(5)   # be polite to reddit's servers
    }
    

    From this you can examine the object results_holder$search and see that eventually we start over in the pagination.

    What I see happening (and can verify by doing the same thing in the browser) is that reddit stops giving a value for after in the JSON file. This is the value we need in order to construct the new search string and get the next page. Sometimes I can get it to return 3 pages (~250 results) before it starts giving "after": null. A sketch of the same loop with an explicit stop for that case is below.
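
    This is just the loop from above with a guard added, not a change to how reddit behaves; it reuses search_query, results_holder, and last_name from the earlier snippet:

    for (i in 2:number_of_pages) {
      new_search <- paste0("https://www.reddit.com/search/.json?q=", search_query,
                           "&limit=1000&sort=comment&after=", last_name)
      tmp2_loop <- jsonlite::fromJSON(read_lines(new_search))
      results_holder$search[i]   <- new_search
      results_holder$titles[[i]] <- tmp2_loop$data$children$data$title
      results_holder$url[[i]]    <- tmp2_loop$data$children$data$permalink
      last_name <- tmp2_loop$data$after
      # once reddit stops handing out an "after" key, there is nothing left to page through
      if (is.null(last_name)) {
        message("No 'after' key returned on page ", i, "; stopping.")
        break
      }
      Sys.sleep(5)
    }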