aws-glue amazon-athena aws-glue-data-catalog

Ignore object keys; only add JSON body?

I have data structured as so that I want crawled and added to Athena:

key_1: {id: key_1, name: name_1, other_data: data_1}
key_2: {id: key_2, name: name_2, other_data: data_2}
key_3: {id: key_3, name: name_3, other_data: data_3}

Ideally, I'd like the Glue table schema to look as so:

id:string
name:string
other_data:string

But Glue is instead creating a schema that is:

key_1:
  id:string
  name:string
  other_data:string
key_2:
  id:string
  name:string
  other_data:string
key_3:
  id:string
  name:string
  other_data:string

Is there any way to get Glue to ignore the outer keys and instead develop a schema only over the body content without having to change the underlying S3 file itself?

Glue Configuration:

Schema updates in the data store Update the table definition in the data catalog.
Inherit schema from table Update all new and existing partitions with metadata from the table.
Object deletion in the data store Mark the table as deprecated in the data catalog.

Solution

No, not using Glue config. Glue Crawlers are not very smart, and usually can't solve anything but the most basic tasks – but in this case the limitation is in Athena. You can't tell a table to unpack a structure like that.

In your example the rows have different keys (key_1, key_2, key_3) are these really different from row to row or is it in fact the same key on each row? Since you use the same strings for the id property I assume these are connected somehow?

If they are different I think you're more or less completely out of luck, each row would make Glue Crawler create a different column, and you would probably end up with an arbitrary set of columns somewhere around the limit of the number of allowed columns – and the set of columns would change depending on the order that the crawler saw the rows on a particular day. This is never going to work with Glue or Athena, unfortunately.

If, however, the outer key on each row is the same (let's say "outer_key"), you can create a view that does the unpacking for you, so that your queries don't have to:

CREATE VIEW unpacked_rows AS
SELECT outer_key.id, outer_key.name, outer_key.other_key
FROM my_table

You can then query that view like any table, and get the result you want:

SELECT * FROM unpacked_rows

id    | name   | other_key
------+--------+----------
key_1 | name_1 | data_1
key_2 | name_2 | data_2
key_3 | name_3 | data_3