Search code examples
aws-glueamazon-athenaaws-glue-data-catalog

Ignore object keys; only add JSON body?


I have data structured as so that I want crawled and added to Athena:

key_1: {id: key_1, name: name_1, other_data: data_1}
key_2: {id: key_2, name: name_2, other_data: data_2}
key_3: {id: key_3, name: name_3, other_data: data_3}

Ideally, I'd like the Glue table schema to look as so:

id:string
name:string
other_data:string

But Glue is instead creating a schema that is:

key_1:
  id:string
  name:string
  other_data:string
key_2:
  id:string
  name:string
  other_data:string
key_3:
  id:string
  name:string
  other_data:string

Is there any way to get Glue to ignore the outer keys and instead develop a schema only over the body content without having to change the underlying S3 file itself?

Glue Configuration:

  • Schema updates in the data store Update the table definition in the data catalog.
  • Inherit schema from table Update all new and existing partitions with metadata from the table.
  • Object deletion in the data store Mark the table as deprecated in the data catalog.

Solution

  • No, not using Glue config. Glue Crawlers are not very smart, and usually can't solve anything but the most basic tasks – but in this case the limitation is in Athena. You can't tell a table to unpack a structure like that.

    In your example the rows have different keys (key_1, key_2, key_3) are these really different from row to row or is it in fact the same key on each row? Since you use the same strings for the id property I assume these are connected somehow?

    If they are different I think you're more or less completely out of luck, each row would make Glue Crawler create a different column, and you would probably end up with an arbitrary set of columns somewhere around the limit of the number of allowed columns – and the set of columns would change depending on the order that the crawler saw the rows on a particular day. This is never going to work with Glue or Athena, unfortunately.

    If, however, the outer key on each row is the same (let's say "outer_key"), you can create a view that does the unpacking for you, so that your queries don't have to:

    CREATE VIEW unpacked_rows AS
    SELECT outer_key.id, outer_key.name, outer_key.other_key
    FROM my_table
    

    You can then query that view like any table, and get the result you want:

    SELECT * FROM unpacked_rows
    
    id    | name   | other_key
    ------+--------+----------
    key_1 | name_1 | data_1
    key_2 | name_2 | data_2
    key_3 | name_3 | data_3