
Athena: use only a subset of JSON fields


I have pretty complex nested JSON logs. I'd like to create an Athena external table based on these logs, but expose only SOME of the JSON fields in the table.

I have two questions:

  1. Do I still have to write the full, complex DDL with nested structs even though I don't need most of the fields?
  2. Assuming I can do that, and I store the JSON logs in Parquet format in S3: will Athena scan only the parts/fields of the logs I specify, or will it do a full scan anyway, for the full price? :-)

A simplified JSON event example:

{
  "name": "n1",
  "f1": "v1",
  "group1": {
     "g1F1": "v1",
     "g1F2": "v2",
     "group11": {
         "g11F1": "v1",
         "g11F2": "v2"
     },
     "group12": {
         "g12F1": "v1",
         "g12F2": "v2"
     }
   },
   "group2": {
     "g2F1": "v1",
     "g2F2": "v2",
      ...
   },
   ...
}  

Let's say I am only interested in the top-level fields "name" and "f1", plus some nested fields, say "group2"'s "g2F1" and "g2F2". Could I do something like this:

CREATE EXTERNAL TABLE mytable (
  name string,
  f1 string,
  group2 struct<g2F1: string, g2F2: string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<mybucket>'

?

When I try this approach, the DDL runs with no errors and the table is created, but 'select * from mytable' returns no results...

Thank you!!


Update:

Resolution for Question/Problem 1: Not sure why, but once I re-uploaded the gzipped logs into S3, table creation and queries started working!

For Question 2 - see Tanveer's answer below.


Update 2:

For those who are also considering this option: I got the Athena table created over about 6 GB of gzipped data, and all was good. However, query execution times are extremely high: a simple query with a filter on a few fields takes about 25-30 minutes. Considering this is only a POC using 6 GB, and I would have a few hundred TB in production, this option is not feasible for my use case. Granted, I did not use the Parquet format, but since I would often need most of the columns, I don't think Parquet would drastically improve query performance for me.


Solution

  • Answers below.

    1. The reason your query is not returning data is that Athena does not like line breaks inside a JSON record. Your table definition is fine, and you don't need a table definition covering the whole data set; that's the good thing about schema-on-read. Each record in your data should be on a single line, as below, and then your Athena query will return data. Don't get disheartened. Glue ETL can read JSON files with line breaks, since Glue ETL runs Spark under the hood, so you can convert your JSON files to Parquet using Glue ETL and then create the Athena table on top of the Parquet data.

      {"name": "n1","f1": "v1","group1": {"g1F1": "v1","g1F2": "v2","group11": {"g11F1": "v1","g11F2": "v2"},"group12": {"g12F1": "v1","g12F2": "v2"}},"group2": {"g2F1": "v1","g2F2": "v2"}}

    2. Once you convert the data to Parquet, you will be charged only for the columns you read in the query, not for the whole table, as explained here.
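The Glue ETL job the answer describes can be quite small. A hedged sketch of such a script (the S3 prefixes and the `select` of specific fields are illustrative assumptions; it only runs inside a Glue job or a Spark environment, and the exact job wiring comes from the Glue console):

```python
# AWS Glue ETL sketch: read multi-line JSON, write Parquet.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Spark, unlike the Athena JSON SerDe, can parse records spanning lines
# when the multiLine option is enabled
df = spark.read.option("multiLine", "true").json("s3://<mybucket>/raw/")

# Keep only the fields the table needs; Parquet stores them as separate
# columns, so Athena later scans (and bills) just the columns a query touches
df = df.select("name", "f1", "group2.g2F1", "group2.g2F2")

df.write.mode("overwrite").parquet("s3://<mybucket>/parquet/")
```

The Athena table is then created over the `parquet/` prefix with `STORED AS PARQUET` instead of the JSON SerDe.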