amazon-web-services pyspark aws-glue

Update values in a DataFrame based on JSON structure


I am having a problem using Relationalize on an AWS Glue DynamicFrame. The problem is caused by the structure of the data I receive. Below is the output of printSchema():

root
...
|-- details: struct
        ...
|    |-- reviews: struct
|    |-- customerType: choice
|    |    |-- string
|    |    |-- struct
|    |    |    |-- brandId: string
|    |    |    |-- id: string
                ...

The sample data that produces this schema in the dynamic frame looks like the following:

type one:
"customerType": {
   "name_JP": "管理見込",
   "id": "002",
   "brand": "XXX",
   "brandId": "XXX#002",
   "name_EN": "Managed"
 },
 
type two:
"customerType": "",
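
To make the cause of the choice type concrete, here is a minimal plain-Python sketch (not Glue code) that collects the kinds of values seen for one field across records. The helper name observed_types and the trimmed sample records are assumptions made for illustration; Glue's own schema inference is more involved than this.

```python
def observed_types(records, field):
    """Return the set of schema-style type names seen for `field`."""
    names = set()
    for record in records:
        value = record.get(field)
        if isinstance(value, dict):
            names.add("struct")   # JSON object
        elif isinstance(value, str):
            names.add("string")   # JSON string, including ""
        else:
            names.add(type(value).__name__)
    return names


# Trimmed versions of "type one" and "type two" from the question.
records = [
    {"customerType": {"id": "002", "brandId": "XXX#002"}},
    {"customerType": ""},
]

types_seen = observed_types(records, "customerType")
print(sorted(types_seen))
```

Because the same field holds a string in some records and a struct in others, more than one type is observed, which matches the choice of string and struct shown by printSchema().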

My idea is to update the empty string to either None or an empty struct object. I tried the following code, but it fails and I am not sure how to resolve it.

import pyspark.sql.functions as F
from pyspark.sql.types import *

new_df = case_details.toDF()

new_df = new_df.select('*', 'details.reviews.*') \
   .withColumn("generalReason", F.when(str(F.col("generalReason")) == F.lit(""), StructType()).otherwise(F.col("generalReason"))) \
   .drop(*new_df.select('details.reviews.*').columns)

m_df = DynamicFrame.fromDF(new_df, glueContext, "m_df")
m_df.toDF().printSchema()

Solution

  • I found the correct approach after spending some time reading the AWS documentation: resolve the choice columns on the DynamicFrame with resolveChoice.

    case_details = case_details.resolveChoice(
        specs=[
            ("details.reviews.generalReason", "project:struct"),
            ("details.reviews.rejectedList.reason", "project:struct"),
            ("details.customerType", "project:struct"),
            ("details.businessCategory", "project:struct"),
            ("details.doctor", "project:struct"),
            ("details.ownerOutletpName", "project:struct"),
            ("details.ownerOutletpName.latitude", "cast:double"),
            ("details.ownerOutletpName.longitude", "cast:double"),
        ],
        transformation_ctx="case_details_resolveChoice",
    )
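
As a rough illustration of what a project:struct spec does to the two sample shapes, here is a plain-Python sketch. It is not Glue's implementation: project_struct is a hypothetical helper, and it assumes that a value which cannot be projected to a struct (such as the empty string) ends up as null.

```python
import copy


def project_struct(record, path):
    """Return a copy of `record` where the value at dotted `path` is kept
    only if it is a dict (struct); any other value becomes None.
    Assumption: this mirrors the effect of "project:struct" on such values."""
    result = copy.deepcopy(record)
    *parents, leaf = path.split(".")
    node = result
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return result  # path not present in this record; nothing to do
    if leaf in node and not isinstance(node[leaf], dict):
        node[leaf] = None
    return result


# Trimmed versions of "type one" and "type two", nested under "details".
type_one = {"details": {"customerType": {"id": "002", "brandId": "XXX#002"}}}
type_two = {"details": {"customerType": ""}}

resolved = [project_struct(r, "details.customerType") for r in (type_one, type_two)]
for row in resolved:
    print(row)
```

After this step every record has either a struct or a null at details.customerType, so the column has a single type and no longer shows up as a choice.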