r, jsonlite, aws.s3

Find and replace in an aws.s3 object during JSON streaming


I have a fairly practical question for which it is hard to provide a reprex, sorry for that. So I'll try to explain it properly.

A script connects to an AWS S3 bucket using the aws.s3 package. The bucket contains .gz files that hold JSON. Unfortunately, some lines (not all) contain malformed JSON: they end with }]]} instead of }]}.

So I am looking for a way in R to find and replace the pattern before parsing the JSON fails. The code below already contains a non-working, commented-out line (# gsub()), which is my guess at how to fix it.

What would be your solution?

    data_i <- aws.s3::get_object(
      object = objectname_i,
      bucket = bucketname_i,
      region = "eu-central-1",
      as = "raw"
    ) |>
      rawConnection() |>
      gzcon() |>
      # gsub("}]]}", "}]]}") |>
      jsonlite::stream_in()

Solution

  • I found the following solution. After setting up the connection, I still use gzcon() to decompress, as before. I then read the lines over the connection with readLines(), which gives me the data in R as a character vector.

    Now I can operate on that R object with gsub(). gsub() works on character vectors, not on connections, which is why it could not simply be inserted into the pipe.

    Finally, to use stream_in() again, I open a textConnection() on the corrected lines. The result is the data.frame s3ObjectDataframe.

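        # Download the object as raw bytes and wrap it in a gzip-decompressing connection
        s3ObjectUnpacked <- aws.s3::get_object(
          object = objectname_i,
          bucket = bucketname_i,
          region = "eu-central-1",
          as = "raw"
        ) |>
          rawConnection() |>
          gzcon()

        # Read the decompressed content, one JSON record per element
        s3ObjectPerLine <- readLines(s3ObjectUnpacked)
        close(s3ObjectUnpacked)

        # fixed = TRUE treats the pattern as a literal string, not as a regex
        s3ObjectCorrected <- gsub("}]]}", "}]}", s3ObjectPerLine, fixed = TRUE)

        # Parse the corrected lines (readLines() has already stripped the line breaks)
        s3ObjectDataframe <- jsonlite::stream_in(textConnection(s3ObjectCorrected))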
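
The approach above can be tried without access to S3. The following is only a minimal sketch under stated assumptions: the two sample records and all variable names (lines_in, raw_bytes, lines_fixed, df) are made up for illustration, and a temporary gzip file stands in for the raw vector that aws.s3::get_object(as = "raw") would return.

```r
library(jsonlite)

# Hypothetical sample data: the second record ends with the stray "]"
lines_in <- c(
  '{"id":1,"tags":[{"k":"a"}]}',
  '{"id":2,"tags":[{"k":"b"}]]}'
)

# Write a gzipped file and read it back as raw bytes,
# standing in for the result of get_object(as = "raw")
tmp <- tempfile(fileext = ".gz")
con_out <- gzfile(tmp, "w")
writeLines(lines_in, con_out)
close(con_out)
raw_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

# Decompress and read the records as a character vector
con_in <- gzcon(rawConnection(raw_bytes))
lines_raw <- readLines(con_in)
close(con_in)

# Repair the malformed ending; fixed = TRUE matches the literal string
lines_fixed <- gsub("}]]}", "}]}", lines_raw, fixed = TRUE)

# Parse the repaired lines, then close the text connection explicitly
txt_con <- textConnection(lines_fixed)
df <- jsonlite::stream_in(txt_con, verbose = FALSE)
close(txt_con)
```

Note that gsub() replaces the pattern wherever it occurs in a line, not only at the end, so this relies on }]]} never appearing in valid records.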