I am having a problem using Relationalize on an AWS Glue DynamicFrame. The problem is caused by the structure of the data I receive. Below is the output of printSchema():
root
...
|-- details: struct
...
| |-- reviews: struct
| |-- customerType: choice
| | |-- string
| | |-- struct
| | | |-- brandId: string
| | | |-- id: string
...
The sample data that produces this choice type in the DynamicFrame looks like the following:
type one:
"customerType": {
"name_JP": "管理見込",
"id": "002",
"brand": "XXX",
"brandId": "XXX#002",
"name_EN": "Managed"
},
type two:
"customerType": "",
My idea is to replace the empty string with either None or an empty struct object. I tried the following code, but it fails and I am not sure how to resolve it.
import pyspark.sql.functions as F
from pyspark.sql.types import *
new_df = case_details.toDF()
new_df = new_df.select('*', 'details.reviews.*') \
    .withColumn("generalReason", F.when(str(F.col("generalReason")) == F.lit(""), StructType()).otherwise(F.col("generalReason"))) \
    .drop(*new_df.select('details.reviews.*').columns)
m_df = DynamicFrame.fromDF(new_df, glueContext, "m_df")
m_df.toDF().printSchema()
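To make the intent clearer, here is a minimal sketch of the idea described above (mapping the empty string to None), written against plain Python dicts rather than Glue or Spark. The helper name `blank_to_none` and the two shortened sample records are made up for illustration only; this is not the code that runs in my job.

```python
import copy


def blank_to_none(record, path):
    """Return a copy of `record` where an empty string at the dotted `path` becomes None."""
    result = copy.deepcopy(record)
    *parents, leaf = path.split(".")
    node = result
    for key in parents:
        node = node.get(key)
        if not isinstance(node, dict):
            return result  # path is not present, nothing to change
    if node.get(leaf) == "":
        node[leaf] = None
    return result


# Hypothetical records, shortened from the two shapes shown earlier.
type_one = {"details": {"customerType": {"id": "002", "brandId": "XXX#002"}}}
type_two = {"details": {"customerType": ""}}

cleaned = [blank_to_none(r, "details.customerType") for r in (type_one, type_two)]
print(cleaned[0]["details"]["customerType"])  # the struct is left untouched
print(cleaned[1]["details"]["customerType"])  # the empty string is now None
```

Once every record carries either a struct or None in that field, there is only one real type left, which is what I was trying to achieve with the Spark code.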
I found the correct approach after reading the AWS documentation for a while: resolve each choice column explicitly with resolveChoice on the DynamicFrame.
case_details = case_details.resolveChoice(
    specs=[
        ("details.reviews.generalReason", "project:struct"),
        ("details.reviews.rejectedList.reason", "project:struct"),
        ("details.customerType", "project:struct"),
        ("details.businessCategory", "project:struct"),
        ("details.doctor", "project:struct"),
        ("details.ownerOutletpName", "project:struct"),
        ("details.ownerOutletpName.latitude", "cast:double"),
        ("details.ownerOutletpName.longitude", "cast:double"),
    ],
    transformation_ctx="case_details_resolveChoice"
)
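As a rough mental model of what these specs do, the sketch below applies (path, action) pairs to a plain nested dict. It rests on my assumption that project:struct keeps values that are already structs and that anything else ends up null, and that cast:double converts a value to a floating-point number where possible. It is not Glue's implementation; `resolve_choice_sketch`, the `ACTIONS` table and the sample record (including the latitude value) are hypothetical.

```python
import copy


def _project_struct(value):
    # Assumed behaviour: keep struct-like values (dicts), everything else becomes None.
    return value if isinstance(value, dict) else None


def _cast_double(value):
    # Assumed behaviour: convert to float, fall back to None when that is not possible.
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


ACTIONS = {"project:struct": _project_struct, "cast:double": _cast_double}


def resolve_choice_sketch(record, specs):
    """Apply (dotted_path, action) pairs, in order, to a copy of a nested dict."""
    result = copy.deepcopy(record)
    for path, action in specs:
        *parents, leaf = path.split(".")
        node = result
        for key in parents:
            node = node.get(key) if isinstance(node, dict) else None
        if isinstance(node, dict) and leaf in node:
            node[leaf] = ACTIONS[action](node[leaf])
    return result


specs = [
    ("details.customerType", "project:struct"),
    ("details.ownerOutletpName", "project:struct"),
    ("details.ownerOutletpName.latitude", "cast:double"),
]

# Made-up record combining the empty-string case with a string latitude.
record = {"details": {"customerType": "", "ownerOutletpName": {"latitude": "35.68"}}}
resolved = resolve_choice_sketch(record, specs)
print(resolved)
```

The order of the specs matters in this sketch: the parent struct is resolved first, and only then are its latitude and longitude fields cast, which mirrors the order I used in the resolveChoice call.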