Search code examples
amazon-web-serviceshiveamazon-athena

Dealing with multi-line JSON? (And, bonus points, CRLF)


I am trying to create & query an Athena table that is reading JSON files, but it chokes on the newlines. And to make it more difficult, they're currently windows newlines (CRLF)

Error message:

HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: A JSONObject text must end with '}' at 2 [character 3 line 1]

For readability, our JSON looks like this (one row per file, though we have nested arrays, not included here).

{
  "event" : "REQUEST",
  "globalTransactionId" : "8de9e1b8-3ab0-4c3b-8b85-cae4e58257a7",
  "inboundIdentifierValue" : "22",
  "timeStamp" : "2020-03-19T20:36:42.864Z",
  "elapsedTime" : 0
}

My create table looks like this, and if I strip out ALL newlines, it works.

I've tried both the openx serde and org.apache.hive.hcatalog.data.JsonSerDe

CREATE EXTERNAL TABLE TestA(
event string,
globaltransactionid string,
inboundidentifiervalue string,
`timestamp` string,
elapsedtime bigint
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucker/test/' 

Is there any reasonable way to make this work, or do we have to change the file format?


Solution

  • Athena requires that there is one JSON document per line, multiline documents are not supported.