
How to download and/or extract data stored in a 'raw' binary zip object within a response object in R?


I am unable to download or read a zip file from an API request using the httr package. Is there another package I can try that will let me download or read binary zip files stored within the response of a GET request in R?

I tried two approaches and made one further observation:

  1. used GET to get an application/json type response object (successful) and then used fromJSON to extract content using content(my_response, 'text'). The output includes a column called 'zip' which is the data I'm interested in downloading, which documentation states is a base64 encoded binary file. This column is currently a really long string of random letters and I'm not sure how to convert this to the actual dataset.

  2. I tried bypassing using fromJSON because I noticed there is a field of class 'raw' within the response object itself. This object is a list of random numbers which I suspect are the binary representation of the dataset. I tried using rawToChar(my_response$content) to try to convert the raw data type into character, but this results in the same long character string being produced as in #1.

  3. I noticed that with approach #1, if I use base64_dec() to decode the long character string, I get the same kind of output as the 'raw' field within the response object itself.

library(httr)
library(jsonlite)

getzip1  <- GET(getzip1_link)
getzip1 # successful response, status 200
df <- fromJSON(content(getzip1, "text"))

df$status # "OK"
df$dataset$zip # <- this is the very long string of letters (eg. "I1NC5qc29uUEsBAhQDFA...")

# Method 1: try to convert from the 'zip' object in the output of fromJSON
try1 <- base64_dec(df$dataset$zip)
# looks similar to getzip1$content (i.e. this produces a sequence of hex bytes: 50 4b 03 04 14 00, etc., perhaps the binary representation)

# Method 2: try to get data directly from raw object
class(getzip1$content) # <- 'raw' class object directly from GET request
try2 <- rawToChar(getzip1$content) # returns same output as df$dataset$zip


I should be able to use either the raw 'content' object from my response or the long character string in the 'zip' object of the output of fromJSON in order to view the dataset or somehow download it. I don't know how to do this. Please help!


Solution

  • Welcome!

    Based on the API documentation, the response from the getDataset endpoint has the following schema:

    Dataset archive including meta information, the dataset itself is base64 encoded to allow for binary ZIP transfers.

    {
     "status": "OK",
     "dataset": {
     "state_id": 5,
     "session_id": 1624,
     "session_name": "2019-2020 Regular Session",
     "dataset_hash": "1c7d77fe298a4d30ad763733ab2f8c84",
     "dataset_date": "2018-12-23",
     "dataset_size": 317775,
     "mime": "application\/zip",
     "zip": "MIME 64 Encoded Document"
     }
    }
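
    As a small illustration of this schema, the sketch below parses a response of the same shape and inspects the metadata before anything is decoded. The JSON literal copies the documentation example above, placeholder `zip` value included, and the names `example_json` and `meta` are illustrative assumptions rather than part of the API.

    ```r
    library(jsonlite)

    # Documentation example, pasted as a string (the "\\/" keeps the JSON escape intact).
    example_json <- '{
      "status": "OK",
      "dataset": {
        "state_id": 5,
        "session_id": 1624,
        "session_name": "2019-2020 Regular Session",
        "dataset_hash": "1c7d77fe298a4d30ad763733ab2f8c84",
        "dataset_date": "2018-12-23",
        "dataset_size": 317775,
        "mime": "application\\/zip",
        "zip": "MIME 64 Encoded Document"
      }
    }'

    # The nested "dataset" object becomes a named list.
    meta <- fromJSON(example_json)$dataset

    meta$mime               # the JSON escape "\/" is parsed as a plain "/"
    is.character(meta$zip)  # the archive arrives as text and still needs decoding
    ```

    In a real response the `zip` element holds the base64 text of the archive, which is why it prints as one very long string.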
    

    We can obtain the data in R with the following code:

    library(httr)
    library(jsonlite)
    library(stringr) # str_c(); stringr also re-exports the %>% pipe
    token <- "" # Your API key
    session_id <- 1253L # Obtained from the getDatasetList endpoint
    access_key <- "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
    destfile <- file.path("path", "to", "file.zip") # Modify; the directory must already exist
    response <- str_c("https://api.legiscan.com/?key=",
                      token,
                      "&op=getDataset&id=",
                      session_id,
                      "&access_key=",
                      access_key) %>%
      GET()
    stopifnot(status_code(x = response) == 200) # Stop unless the request succeeded
    # Parse the JSON body once; body$dataset also contains the extra metadata
    body <- content(x = response,
                    as = "text",
                    encoding = "UTF-8") %>%
      fromJSON()
    # Decode the base64 string into a raw vector and write it out as a binary file
    body %>%
      getElement(name = "dataset") %>%
      getElement(name = "zip") %>%
      base64_dec() %>%
      writeBin(con = destfile)
    unzip(zipfile = destfile) # Extracts into the current working directory
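
    The bytes `50 4b 03 04` mentioned in the question are the signature that a ZIP archive containing at least one file starts with ("PK" followed by 0x03 0x04), so they are a sign that the decoding worked. A minimal offline sketch of that check follows; the helper `looks_like_zip` and the toy payload are hypothetical stand-ins for the decoded `body$dataset$zip`.

    ```r
    library(jsonlite)

    # ZIP local file header signature: "PK" followed by 0x03 0x04.
    zip_magic <- as.raw(c(0x50, 0x4b, 0x03, 0x04))

    # Hypothetical helper: TRUE when a raw vector starts with the ZIP signature.
    looks_like_zip <- function(bytes) {
      is.raw(bytes) && length(bytes) >= 4 && identical(bytes[1:4], zip_magic)
    }

    # Toy payload: the six bytes quoted in the question, base64 encoded.
    toy_b64 <- base64_enc(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00)))
    decoded <- base64_dec(toy_b64)

    looks_like_zip(decoded)             # decoded bytes start with the signature
    looks_like_zip(charToRaw(toy_b64))  # the still-encoded text does not
    ```

    This also explains why `rawToChar()` on the response content only gives back text: that raw vector is the JSON document itself, and the archive appears only after the `zip` field is base64 decoded.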
    

    unzip() extracts the files, which in this case look like:

    hash.md5 # Can be checked against the metadata
    AL/2016-2016_1st_Special_Session/bill/*.json
    AL/2016-2016_1st_Special_Session/people/*.json
    AL/2016-2016_1st_Special_Session/vote/*.json
    

    As always, wrap your code in functions and profit.
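
    One way to follow that advice is sketched below, assuming the same endpoint and JSON layout as above. The function names `write_base64_zip` and `get_dataset` are made up for this example, and passing the parameters through `query =` is simply an alternative to pasting the URL together by hand.

    ```r
    library(httr)
    library(jsonlite)

    # Hypothetical helper: decode a base64 string and write the bytes to `destfile`.
    write_base64_zip <- function(b64, destfile) {
      bytes <- base64_dec(b64)
      writeBin(bytes, destfile)
      invisible(destfile)
    }

    # Hypothetical helper: request one dataset, save the archive, and extract it.
    get_dataset <- function(token, session_id, access_key, destfile,
                            exdir = dirname(destfile)) {
      response <- GET("https://api.legiscan.com/",
                      query = list(key = token,
                                   op = "getDataset",
                                   id = session_id,
                                   access_key = access_key))
      stop_for_status(response) # Raise an R error on HTTP failure
      body <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
      write_base64_zip(body$dataset$zip, destfile)
      unzip(zipfile = destfile, exdir = exdir)
    }

    # Usage (needs a valid API key, so it is left commented out):
    # get_dataset(token = "", session_id = 1253L,
    #             access_key = "2qAtLbkQiJed9Z0FxyRblu",
    #             destfile = file.path("path", "to", "file.zip"))
    ```

    Keeping the decode-and-write step in its own function means it can be tried on any base64 string without touching the network.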

    PS: Here is how the code would look in Julia, as a comparison.

    using Base64, HTTP, JSON3
    token = "" # Your API key
    session_id = 1253 # Obtained from the getDatasetList endpoint
    access_key = "2qAtLbkQiJed9Z0FxyRblu" # Obtained from the getDatasetList endpoint
    destfile = joinpath("path", "to", "file.zip") # Modify
    response = string("https://api.legiscan.com/?",
                      join(["key=$token",
                            "op=getDataset",
                            "id=$session_id",
                            "access_key=$access_key"],
                            "&")) |>
        HTTP.get
    @assert response.status == 200
    JSON3.read(response.body) |>
        (content -> content.dataset.zip) |>
        base64decode |>
        (data -> write(destfile, data))
    run(`unzip $destfile`) # pipeline(`unzip`, destfile) would redirect output into destfile instead of unzipping it