Search code examples
rline-breaksamazon-athena

In R How do you remove line breaks from a field so it's readable by Amazon Athena?


I'm working in R (Max OSX) for my ETL process and have to export it to an AWS S3 bucket so it can be read by Athena. Currently Athena isn't reading it correctly because of the line breaks inside of the columns. How do I fix it without removing all the linebreaks from the data?


Solution

  • It sounds like you've got a CSV file with newlines in a column or two. That's…not great…for anything, really, including Athena (but these "big data" systems are so cute as the especially seem to fail to do things like read CSV files that other environments handle with ease).

    It also sounds like you need whatever you make to be usable in Tableau (though I'm not really sure why you can't write it into multiple formats). I got that from another question you've recently asked.

    Tableau can read newline-delimited JSON and Athena loves newline-delimited JSON, so just write your data frame out with something like:

    jsonlite::stream_out(mtcars, gzfile("mtcars.json.gz"))
    

    punt it over to S3 and define the schema in Athena (like you'd have to do with CSV anyway).

    You can also point your dragon droppings (tableau/drag-n-drop) users to the file or the uncompressed version if that horrible program can't handle gzip'd ndjson.