I am trying to ingest data from DynamoDB table to Hive table built over S3 using AWS Glue.
The data from Source would look like below -
{
"id": "123",
"name": "Peter",
"title": "My List",
"listItems": [
{
"manufacture_date": "2023-01-01",
"purchase_price": "20.0"
},
{
"manufacture_date": "2023-01-02",
"purchase_price": "30.0"
}
]
}
Code to ingest the data:
datasource = glue_context.create_dynamic_frame.from_options(
connection_type="dynamodb",
connection_options={"dynamodb.input.tableName": "purchase_list"},
transformation_ctx="datasource")
ddb_src_df = datasource.toDF()
ddb_src_df.createOrReplaceTempView("ddb_src_df")
inst_sql = f"""insert overwrite table purchase_list
select id,
name,
title,
cast(listItems as string) as listItems
from ddb_src_df"""
glue_context.spark_session.sql(inst_sql)
I would like to have the Attribute names populated on the table so that I can unnest later from the table. But the above code is returning me without Attribute names like below -
id | name | title | listItems |
---|---|---|---|
123 | Peter | My List | [{2023-01-01, 20.0}, {2023-01-02,30.0}] |
How can I achieve something like this below?
Expected output:
id | name | title | listItems |
---|---|---|---|
123 | Peter | My List | [{manufacture_date:2023-01-01,purchase_price:20.0}, {manufacture_date:2023-01-02,purchase_price:30.0}] |
Target table strucuture:
id string
name string
title string
listItems string
I wanted to have the struct column listItems as String as the Attribute has potential to add more attributes in the future. Please advice.
You're almost there. In your insert
, replace cast(listItems as string)
with to_json(listItems)
and you'll get strings "json-style". You can then remove double-quotes around attribute names and values using regexp_replace
or something, if needed.