Search code examples
rpaginationrate-limitingsocrata

Socrata API deep scrolling in R


Edit My question has changed considerably as I learned more about the problem. After adding several edits this post became unwieldy, so I'm starting over with a clean slate and the final question.

I'm trying to download all public datasets for the state of Utah through the state Socrata portal using R. There are more than 10,000 records listed, so deep scrolling methods using the scroll_id parameter are required. Using this parameter in the query returns results alphabetically by resource.id. However, using a NULL id defaults to starting with the first letter alphabetically and will not return all results because several thousand records begin with a number.

My question is: how can I query with the scroll_id parameter such that it starts with the record with the lowest number, rather than first letter?

The workaround I have is first querying the portal without a scroll_id parameter. This will return 10,000 results in arbitrary order. I then take the lowest digit record from that set and re-query with deep scrolling from there. This produces what I imagine is most records, but probably not all. I'm sure this is an absurd workaround for a simple problem, but here is the code I'm working with:

pacman::p_load(tidyverse,
               httr,
               jsonlite)

# Function to query without scroll_id
get_socrata_data <- function(domain, 
                             limit = 15000) {
    "http://api.us.socrata.com/api/catalog/v1?domains=${domain}&limit=${limit}" %>%
        str_interp() %>%
        httr::GET() %>%
        content("text") %>%
        fromJSON(flatten = TRUE)
}

# Scroll function that includes scroll_id
scroll_socrata <- function(domain,
                           limit = 15000,
                           scroll_id = NULL) {
    "http://api.us.socrata.com/api/catalog/v1?domains=${domain}&scroll_id=${scroll_id}&limit=${limit}" %>%
        str_interp() %>%
        httr::GET() %>%
        content("text") %>%
        fromJSON(flatten = TRUE)
}

ut_domain <- 'opendata.utah.gov'

# Initial query without scroll_id
ut_results <- get_socrata_data(ut_domain)
nrow(ut_results$results)

# If results are capped, restart with first digit id using scroll function
if (nrow(ut_results$results) == 10000) {
    
    # Pull out first record starting with digit
    digit_start <- ut_results$results |>
        filter(str_starts(resource.id, '\\d'))
    first_digit_id <- first(sort(digit_start$resource.id))
    
    # Data frame for results
    output <- data.frame()
    
    # Restart query with scroll_socrata function that includes scroll_id
    page <- scroll_socrata(ut_domain,
                           scroll_id = first_digit_id)
    
    output <- bind_rows(output, page$results)
    
    # more pages if necessary
    while (nrow(page$results) == 10000) {
        page <- scroll_socrata(ut_domain,
                               scroll_id = last(page$results$resource.id))
        output <- bind_rows(output, page$results)
    }
}

# check total
nrow(output)


Solution

  • At long last, the very simple answer to my problem. Entering a scroll_id of '0' will query records starting with the resource.id with the lowest digit.

    scroll_socrata <- function(domain,
                               limit = 15000,
                               scroll_id = NULL) {
        "http://api.us.socrata.com/api/catalog/v1?domains=${domain}&scroll_id=${scroll_id}&limit=${limit}" %>%
            str_interp() %>%
            httr::GET() %>%
            content("text") %>%
            fromJSON(flatten = TRUE)
    }
    
    ut_domain <- 'opendata.utah.gov'
    
    # DF for results
    output <- data.frame()
    
    # First query starting with lowest digit resource.id
    page <- scroll_socrata(ut_domain, 
                           limit = 15000,
                           scroll_id = '0')
    
    output <- bind_rows(output, page$results)
    
    # More pages if necessary
    while (nrow(page$results) == 10000) {
        page <- scroll_socrata(ut_domain,
                               limit = 15000,
                               scroll_id = last(page$results$resource.id))
        output <- bind_rows(output, page$results)
    }
        
    output