Following up on my previous question, I built a simple JSON file with one "row" per line. I'm still a bit in shock, because the file as a whole is not valid JSON: there are no square brackets around the rows.
One data file:
{"firstName": "Neal", "lastName": "Walters", "city": "Irving", "state", "TX" }
{"firstName": "Fred", "lastName": "Flintstone", "city": "Bedrock", "state", "TX"}
{"firstName": "Barney", "lastName": "Rubble", "city": "Stillwater", "state", "OK"}
After running it through a Glue crawler, my first query was quite disappointing.
Below is the schema it generated. From it, we can see that Glue apparently treated the file as CSV instead of JSON. I didn't see any option when setting up the crawler that asked what type of file it was; did I miss a hidden setting somewhere?
For a simple example like this, I can probably fix the schema manually. But is Glue really such a poor parser? In my real application I have about 150 fields, so ideally it would generate all the columns for me.
CREATE EXTERNAL TABLE `flattb_testflatjson`(
  `col0` string,
  `col1` string,
  `col2` string,
  `col3` string,
  `col4` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://relatix/polygonData/history/testflatjson/'
TBLPROPERTIES (
  'CrawlerSchemaDeserializerVersion'='1.0',
  'CrawlerSchemaSerializerVersion'='1.0',
  'UPDATED_BY_CRAWLER'='FlatJsonTestForAthena',
  'areColumnsQuoted'='false',
  'averageRecordSize'='83',
  'classification'='csv',
  'columnsOrdered'='true',
  'compressionType'='none',
  'delimiter'=',',
  'objectCount'='1',
  'recordCount'='3',
  'sizeKey'='255',
  'typeOfData'='file')
Glue is terrible in general, but this actually surprised me until I saw the comment by Achyut: your JSON is malformed.
JSON is a data format, not a file format. There is no such thing as a correctly formatted JSON file, because the specification doesn't cover that. Tools like Spark, Hadoop, and Athena require JSON data to be in files with one JSON document per line, because that makes it easy to split up and process the data efficiently. This layout is sometimes called "JSON streaming" (not a great name, since we're talking about files), "line-delimited JSON", or "JSON Lines".
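For reference, your three rows written as valid line-delimited JSON would look like this (note the colon after "state" instead of a comma):

{"firstName": "Neal", "lastName": "Walters", "city": "Irving", "state": "TX"}
{"firstName": "Fred", "lastName": "Flintstone", "city": "Bedrock", "state": "TX"}
{"firstName": "Barney", "lastName": "Rubble", "city": "Stillwater", "state": "OK"}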
I think you will be better off just creating the table manually. You can find an example to start off from in the documentation: https://docs.aws.amazon.com/athena/latest/ug/json-serde.html
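As a rough sketch for your three-field example (the table name and column list are just illustrative, and the S3 location is taken from your crawler output), a manually created table using the OpenX JSON SerDe from that documentation page would look something like this:

CREATE EXTERNAL TABLE testflatjson (
  firstName string,
  lastName string,
  city string,
  state string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://relatix/polygonData/history/testflatjson/';

For your real data you would list more columns, but you only need to declare the fields you actually want to query; properties not declared in the table are simply ignored by the SerDe.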
You should also use a proper JSON serialisation library for writing your JSON, so that you don't end up with syntax errors like the accidental comma instead of a colon after your "state" keys.
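For example, in Python (assuming your records come from some list of dicts; the file name here is just a placeholder), something like this produces the one-document-per-line layout without any hand-built strings:

import json

rows = [
    {"firstName": "Neal", "lastName": "Walters", "city": "Irving", "state": "TX"},
    {"firstName": "Fred", "lastName": "Flintstone", "city": "Bedrock", "state": "TX"},
    {"firstName": "Barney", "lastName": "Rubble", "city": "Stillwater", "state": "OK"},
]

# One JSON document per line, no surrounding array -- the layout
# Athena and Glue expect for line-delimited JSON.
with open("testflatjson.json", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

json.dumps takes care of the quoting and punctuation, so a stray comma where a colon belongs can't happen.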