Search code examples
rpurrrhttr

Extracting data from API with R


I am trying to extract data from Philadelphia's public transit API using the httr2 package. I load the packages and call the API in the code below

library(tidyverse)
library(httr2)

septaURL <- 'https://www3.septa.org/api/TransitViewAll'

response <- request(septaURL) %>% 
  req_perform()

It returns a bunch of nested lists. I use the pluck function to drill down to a list of 128 items (the number of bus routes), which are lists of varying length (depending on how many buses are driving that route), which themselves are lists of 19 items (the data I want to retrieve).

response %>% 
  resp_body_json() %>% 
  pluck('routes', 1) %>%
  glimpse()

However when I pipe in the map function to extract the data, I get a tibble of 0 columns, 0 rows.

response %>% 
  resp_body_json() %>% 
  pluck('routes', 1) %>%
  #map_dfr(
    \(x) {
      tibble(
        Route = x %>% pluck('route_id'),
        Trip = x %>% pluck('trip'),
        Late = x %>% pluck('late'),
        seatsAvailable = x %>% pluck('estimated_seat_availability')
      )
    }
  ) %>% 
  glimpse()

If I add an additional argument to the pluck function, it gives me the data I want for a single route in a tibble. But I want the data from all 128 routes.

response %>% 
  resp_body_json() %>% 
  pluck('routes', 1, 1) %>% #Added an argument here
  #map_dfr(
    \(x) {
      tibble(
        Route = x %>% pluck('route_id'),
        Trip = x %>% pluck('trip'),
        Late = x %>% pluck('late'),
        seatsAvailable = x %>% pluck('estimated_seat_availability')
      )
    }
  ) %>% 
  glimpse()

How do I get a tibble like the one produced above but for all routes?


Solution

  • There are many ways to do this:

    1. Bind everything together and select your columns

    I think this is the simplest solution. This isn't an overly large data set, so make your life easier and create a large data frame and select the columns.

    response %>% 
      resp_body_json() %>% 
      pluck('routes', 1) %>%
      map_dfr(bind_rows) %>% 
      select(Route = route_id, Trip = trip, Late = late, seatsAvailable = estimated_seat_availability)
    
    1. Iterate using correct map_* variant
    response %>% 
      resp_body_json() %>% 
      pluck('routes', 1) %>%
      map_dfr(
      \(x) {
          tibble(
            Route = map_chr(x, "route_id"),
            Trip = map_chr(x, "trip"),
            Late = map_int(x, "late"),
            seatsAvailable = map_chr(x, "estimated_seat_availability")
          )
      }
      )
    

    _chr will return a character vector, _int will return an integer vector and so forth. map functions are often used to apply a function over a list/vector, but they can also be used as selectors, which is the case here.

    1. Iterate at a specified depth using map_depth()

    If you needed to iterate at a depth other than 1 you could use this:

    response %>% 
      resp_body_json() %>% 
      pluck('routes', 1) %>%
      map_dfr(
      \(x) {
          tibble(
            Route = unlist(map_depth(x, 1, "route_id")),
            Trip = unlist(map_depth(x, 1, "trip")),
            Late = unlist(map_depth(x, 1, "late")),
            seatsAvailable = unlist(map_depth(x, 1, "estimated_seat_availability"))
          )
      }
      )
    

    Since map_depth() will return a list, we use unlist() to get a vector (technically it will be a named vector but you can remove the names by specifying unlist(map_depth(...), use.names = F)).

    Also, since we are talking about Septa, I can save you some time pulling the Late column. Just always assume it is late :D.

    Output

       Route Trip    Late seatsAvailable      
       <chr> <chr>  <int> <chr>               
     1 1     719611    10 NOT_AVAILABLE       
     2 1     719643     0 NOT_AVAILABLE       
     3 1     719622     7 NOT_AVAILABLE       
     4 1     719642     7 NOT_AVAILABLE       
     5 1     719645    16 MANY_SEATS_AVAILABLE
     6 1     719624    -1 MANY_SEATS_AVAILABLE
     7 1     719612   998 NOT_AVAILABLE       
     8 1     719607     0 NOT_AVAILABLE       
     9 10    777333    -1 NOT_AVAILABLE       
    10 10    777498   998 NOT_AVAILABLE       
    # ℹ 888 more rows