Search code examples
rpaginationhttroktajsonlite

Using R and httr to retrieve data - Okta API GET request with paginated link headers


I am trying to use the RStudio / Hadley Wickham 'httr' R package to return all records from an Okta API GET request ('List Users Assigned to Application'). The following request works perfectly fine to get the maximum limit of records (500) per call:

oktaurl <- "https://mydomain.okta.com/api/v1/apps/applicationID/users?limit=500"

oktagetjson <- with_verbose(content(GET(oktaurl,
                                        add_headers("Authorization" = "bearer myapikey",
                                                       "Content-Type" =  "application/json;charset=UTF-8"))))

Parsing the 'oktagetjson' returned data into a usable data frame with 'jsonlite' and R is not a problem; however, this particular API call is hard limited to a maximum of 500 records per call so I need to somehow retrieve and paginate through all the 'Link:' headers to get all several thousand records. The 'Link:' headers themselves are in the form of:

Link: <https://mydomain.okta.com/api/v1/apps/applicationID/users?limit=500>; rel="self"

Link: <https://mydomain.okta.com/api/v1/apps/applicationID/users?after=random cursor string&limit=500>; rel="next"

(The Okta API documentation describes their pagination structure here)

I am stuck here:

  1. I can see the first two pagination 'Link:' headers listed above in the R / RStudio console when making the call 'oktagetjson <- with_verbose(content(GET(oktaurl, etc ... ) ...) ' to get my oktagetjston object but the 'Link:' headers are not returned as part of the object itself. Calling headers(HEAD("https://mydomain.okta.com/api/v1/apps/<applicationID>/users")) returns some headers but does not return the pagination 'Link:' headers
  2. The 'Link:' headers contain random cursor strings so I cannot guess their actual format
  3. Even if I could retrieve all of the required 'Link:' headers, I have no idea how to call / iterate / paginate / recursively follow through all of them in R to build an object of the entire data set of several thousand records.

Unfortunately, due to the nature of the request, service provider and data I cannot provide a fully reproducible example with real links and sample data but I hope the concept is clear enough for someone to point me in the right direction - even if that direction is to not use the 'httr' package or R for this effort.

Thank you for your consideration.


Solution

  • Hacked something together a while ago that works but certainly won't win any elegance awards. Have modified it to get users assigned to Okta applications as well. Useful if you are auditing / joining with other company / directory data.

    library(jsonlite)
    library(dplyr)
    library(httr)
    library(purrr)
    library(stringi)
    library(tidyr)
    
    # create character vector to hold URLs we'll use later when we GET content
    url_list <- as.character()
    
    # list placeholder for GET content
    okta_content <- list()
    
    # initial URL construction parts for first URL
    okta_urllimit = as.character("200")
    okta_baseurl <- paste0("https://<your company>.okta.com/api/v1/users?limit=",okta_urllimit)
    
    # next URL construction parts for 'next' URLs
    basenexturl <- "https://<your company>.okta.com/api/v1/users?after="
    baselimiturl <- "&limit=200"
    
    # Pass initial URL to get first batch
    okta_get01 <- httr::GET(okta_baseurl,
                             config = (
                               add_headers(Authorization = "SSWS <your Okta API key>")))
    
    
    # append the URL vector 
    url_list <- append(url_list, okta_baseurl)
    
    
    # unlist the all_headers list element from the URL
    testallheaders <- as.character(unlist(okta_get01$all_headers))
    
    okta_content <- append(okta_content,content(okta_get01))
    
    # if "next" is in the second link URL (testallheaders[16]) then iterate for as long as
    # the next URL header element has "next" in it
    
    while (
      grepl("next",testallheaders[16]) == 'TRUE'
    )
    
    { 
      # parse the sha value 
      testparsenext <- regmatches(testallheaders[16], gregexpr('(?<=after=).*?(?=&limit)',testallheaders[16], perl=T))[[1]]
      # and create URL
      oktaurlnext <- paste0(basenexturl,testparsenext,baselimiturl)
    
    
      # iterate and replace 'okta_baseurl' with each subsquent oktaurlnext
    
      okta_get01 <- httr::GET(oktaurlnext,
                               config = (
                                 add_headers(Authorization = "SSWS <your Okta API key>")))
    
      testallheaders <- as.character(unlist(okta_get01$all_headers))
      url_list <- append(url_list, oktaurlnext)
      okta_content <- append(okta_content,content(okta_get01))
    
      next
    }
    
    
    # Parse the results into something usable
    
    oktagettojson <- toJSON(okta_content, simplifyDataFrame = TRUE, flatten = TRUE, recursive = TRUE)
    oktagetdf <- fromJSON(oktagettojson, simplifyDataFrame = TRUE, flatten = TRUE)
    dfnames <- names(oktagetdf)
    oktagetdf <- oktagetdf %>% map_if(is.list, as.character)
    oktagetdf <- do.call(cbind, lapply(oktagetdf, data.frame, stringsAsFactors=FALSE))
    names(oktagetdf) <- dfnames
    
    # adding columns to separate AD domain mastered account and domain names
    oktagetdf <- separate(oktagetdf, profile.login,
                          into = c("credPrefix", "credSuffix"), sep = "@", remove = FALSE, extra = "drop")
    
    # select some data frame columns of interest
    okta_allusers <- subset(oktagetdf, select = c("id","status","created","lastLogin","profile.login","credPrefix", "credSuffix","profile.firstName","profile.lastName","profile.email","credentials.provider.type","credentials.provider.name"))