R: fetching pdf documents from Companies House API

I'm trying to fetch documents from the API using R. I appreciate the clarification of the process in this post. I've been following the steps above with partial success, but I still fail at the last step, getting access to the documents' content:

  1. Find the document filing you're interested in (e.g. make a filing history request for the company). Parse the response for the link to the document in the field "links" : { "document_metadata" : "link URI fragment here" }.

No problem:


### retrieving filing history ####
library(httr)     # GET, authenticate, add_headers, modify_url, content, config
library(stringr)  # str_to_upper

company_num <- 'FC013908'
key <- 'my_key'
fh_path <- paste0('/company/', str_to_upper(company_num), "/filing-history")
fh_url <- modify_url("", path = fh_path)
fh_test <- GET(fh_url, authenticate(key, "")) # status_code = 200
fh_parsed <- jsonlite::fromJSON(content(fh_test, "text", encoding = "utf-8"), flatten = TRUE)
docs <- fh_parsed$items
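
To make the parsing step concrete, here is a minimal sketch of how `flatten = TRUE` exposes the metadata link as a `links.document_metadata` column. The JSON below is a hypothetical, heavily trimmed stand-in for a filing-history response (the URLs and filing types are made up for illustration), so treat it as an assumption about the response shape rather than real API output.

```r
library(jsonlite)

# Hypothetical, trimmed filing-history payload; a real response has many more fields.
sample_json <- '{
  "items": [
    {"type": "AA",
     "links": {"self": "/company/X/filing-history/1",
               "document_metadata": "https://example.invalid/document/abc"}},
    {"type": "CS01",
     "links": {"self": "/company/X/filing-history/2"}}
  ]
}'

parsed <- fromJSON(sample_json, flatten = TRUE)
docs_sample <- parsed$items

# flatten = TRUE turns the nested "links" object into dot-separated columns.
names(docs_sample)

# Filings without a document come back as NA, so drop them before
# requesting any metadata.
meta_links <- docs_sample$links.document_metadata
meta_links <- meta_links[!is.na(meta_links)]
```

Filtering out the `NA` entries first avoids sending a metadata request for a filing that has no document attached.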


  2. For a given document, request the document metadata via the CH Document API. Parse the response to get the document (mime) types available and the link to the actual document data (document URI fragment).

No problems here:

md_meta_url = docs$links.document_metadata[1]  
key_pass <- paste0(key,":")
decoded_auth <- paste0('Basic ', base64_encode(key_pass))

md_test <- GET(md_meta_url,
               add_headers(Authorization = decoded_auth))
md_test # status_code = 200!
md_parsed <- jsonlite::fromJSON(content(md_test, "text",encoding = "utf-8"), flatten = TRUE)

This way I can obtain the content URL:

cont_url = md_parsed$links$document
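
The hand-built header used in this step is ordinary HTTP Basic authentication, with the API key as the username and an empty password, which is also what `authenticate(key, "")` sends in step 1. A small sketch, assuming `base64_encode()` comes from the openssl package (the post does not say which package it used) and using a hypothetical helper name `basic_auth_header`:

```r
library(openssl)

# Hypothetical helper: build the value of the Authorization header for
# HTTP Basic auth, where the API key is the username and the password is empty.
basic_auth_header <- function(key) {
  # Basic auth encodes the string "username:password" in base64.
  paste0("Basic ", base64_encode(paste0(key, ":")))
}

basic_auth_header("my_key")
```

Because the two forms are equivalent, either can be used for the calls to the Companies House endpoints.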

  3. Request the actual document, specifying the mime type (e.g. "application/pdf").

I do it while NOT following the redirect and, as expected, I get the 302 status code with the location header:

accept = 'application/pdf'
cont_test <- GET(cont_url,
                 add_headers(Authorization = decoded_auth,
                             Accept = accept),
                 config(followlocation = FALSE))

final_url <- cont_test$headers$location

> final_url
[1] ""

However, when I try the final step:

  4. Request this URI from Amazon, again passing the content type you want.

I get a 400 error:

final_test <- GET(final_url,
                  add_headers(Authorization = decoded_auth,
                              Accept = accept))

> final_test
Response []
  Date: 2018-06-20 08:37
  Status: 400
  Content-Type: application/xml
  Size: 523 B

Needless to say, executing


returns an Access Denied error. I suspect it may be related to Amazon authorization problems similar to those described here. Any ideas on how to solve this final hurdle?



  • The answer was provided by @voracityemail in response to my question on Companies House Developers Hub. Basically, the final call doesn't require the Authorization header, so if you run the following code for final_test:

    final_test <- GET(final_url, add_headers(Accept = accept))

    it will return a 200 status code:

    > final_test
    Response []
      Date: 2018-06-27 10:02
      Status: 200
      Content-Type: application/pdf
      Size: 21.7 kB

    and then


    will open the specified document in the browser. Victory!
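
For anyone who wants the last two steps in one place, here is a rough sketch that wraps them in a function and saves the PDF to disk instead of opening it. The function name `fetch_ch_pdf` and the `dest` argument are hypothetical, and the code assumes the Document API answers with a 302 whose `location` header points at the file, as described in the question. A likely (but unconfirmed in this thread) reason for the original 400 is that the redirect target carries its own credentials, so a second Authorization header conflicts with them.

```r
library(httr)

# Hypothetical helper: download one document given its content URL.
fetch_ch_pdf <- function(cont_url, key, dest) {
  # Step 3: authenticated request to the Document API, without following
  # the redirect, so the location header can be read.
  first <- GET(cont_url,
               authenticate(key, ""),
               add_headers(Accept = "application/pdf"),
               config(followlocation = FALSE))
  if (status_code(first) != 302) {
    stop("Expected a 302 redirect, got status ", status_code(first))
  }
  final_url <- headers(first)$location

  # Step 4: request the redirect target WITHOUT the Authorization header,
  # streaming the response body straight to `dest`.
  second <- GET(final_url,
                add_headers(Accept = "application/pdf"),
                write_disk(dest, overwrite = TRUE))
  stop_for_status(second)
  invisible(dest)
}

# Usage (needs network access and a valid key):
# fetch_ch_pdf(cont_url, key, "document.pdf")
```

Writing with `write_disk()` avoids holding the whole PDF in memory, which may matter for larger filings.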