Search code examples
apache-sparkapache-spark-sqlamazon-dynamodbaws-glue

Issues while fetching attribute names in Array Struct field during ingesting the data from DynamDB to Hive table in S3


I am trying to ingest data from DynamoDB table to Hive table built over S3 using AWS Glue.

The data from Source would look like below -

{
 "id": "123",
 "name": "Peter",
 "title": "My List",
 "listItems": [
  {
   "manufacture_date": "2023-01-01",
   "purchase_price": "20.0"
  },
  {
   "manufacture_date": "2023-01-02",
   "purchase_price": "30.0"
  }
 ]
}

Code to ingest the data:

datasource = glue_context.create_dynamic_frame.from_options(
                            connection_type="dynamodb",
                            connection_options={"dynamodb.input.tableName": "purchase_list"},
                            transformation_ctx="datasource")
ddb_src_df = datasource.toDF()
ddb_src_df.createOrReplaceTempView("ddb_src_df")


inst_sql = f"""insert overwrite table purchase_list
        select id,
name,
title,
cast(listItems as string) as listItems
from ddb_src_df"""

glue_context.spark_session.sql(inst_sql)

I would like to have the Attribute names populated on the table so that I can unnest later from the table. But the above code is returning me without Attribute names like below -

id name title listItems
123 Peter My List [{2023-01-01, 20.0}, {2023-01-02,30.0}]

How can I achieve something like this below?

Expected output:

id name title listItems
123 Peter My List [{manufacture_date:2023-01-01,purchase_price:20.0}, {manufacture_date:2023-01-02,purchase_price:30.0}]

Target table strucuture:

id string
name string
title string
listItems string

I wanted to have the struct column listItems as String as the Attribute has potential to add more attributes in the future. Please advice.


Solution

  • You're almost there. In your insert, replace cast(listItems as string) with to_json(listItems) and you'll get strings "json-style". You can then remove double-quotes around attribute names and values using regexp_replace or something, if needed.