I have data structured as so that I want crawled and added to Athena:
key_1: {id: key_1, name: name_1, other_data: data_1}
key_2: {id: key_2, name: name_2, other_data: data_2}
key_3: {id: key_3, name: name_3, other_data: data_3}
Ideally, I'd like the Glue table schema to look as so:
id:string
name:string
other_data:string
But Glue is instead creating a schema that is:
key_1:
id:string
name:string
other_data:string
key_2:
id:string
name:string
other_data:string
key_3:
id:string
name:string
other_data:string
Is there any way to get Glue to ignore the outer keys and instead develop a schema only over the body content without having to change the underlying S3 file itself?
Glue Configuration:
No, not using Glue config. Glue Crawlers are not very smart, and usually can't solve anything but the most basic tasks – but in this case the limitation is in Athena. You can't tell a table to unpack a structure like that.
In your example the rows have different keys (key_1, key_2, key_3) are these really different from row to row or is it in fact the same key on each row? Since you use the same strings for the id
property I assume these are connected somehow?
If they are different I think you're more or less completely out of luck, each row would make Glue Crawler create a different column, and you would probably end up with an arbitrary set of columns somewhere around the limit of the number of allowed columns – and the set of columns would change depending on the order that the crawler saw the rows on a particular day. This is never going to work with Glue or Athena, unfortunately.
If, however, the outer key on each row is the same (let's say "outer_key"), you can create a view that does the unpacking for you, so that your queries don't have to:
CREATE VIEW unpacked_rows AS
SELECT outer_key.id, outer_key.name, outer_key.other_key
FROM my_table
You can then query that view like any table, and get the result you want:
SELECT * FROM unpacked_rows
id | name | other_key
------+--------+----------
key_1 | name_1 | data_1
key_2 | name_2 | data_2
key_3 | name_3 | data_3