I have a dataframe and I am trying to do following transformation.
exploded_prod_df = exploded_hits_product_df.withColumn("product_cd", F.explode(F.col("products.customdimensions")))
My issue is products is Null currently, which may not be the case in future. I tried many things but they didn't work. For example, I tried below code
exploded_prod_df = (
exploded_hits_product_df
.withColumn('products_customdimensions', F.when(F.col('products').isNull(), F.lit(None).cast(T.StructType()))
.otherwise('products.customdimensions'))
.withColumn("product_cd", F.explode_outer(F.col("products_customdimensions")))
)
but this is giving error. Please help me. This is Google Analytics data. Error is: Cannot resolve "CASE WHEN (products IS NULL) THEN NULL ELSE products.customdimensions END" due to data type mismatch: Input to casewhen should all be the same type, but it's ["STRUCT<>", "STRING"].;
If you could provide the output of print(product_df.dtypes)
I would be happy to provide more context in this answer.
These are my thoughts so far:
explode
and explode_outer
only work on Map
and Array
types.
If products
is NULL and products.customdimensions
was one of these types, it is unlikely you would receive errors when using these explode functions.
Since you are getting errors, I think the issue is that products.customdimensions
is of type Struct
Option 1. You know the fields in products.customdimensions
.
If so, you can access each field you need using .withColumn
.
df = product_df.withColumn("product_type", F.col("products.customdimensions.type"))\
.withColumn("product_color", F.col("products.customdimensions.color"))
df.select('products','products.customdimensions', 'product_type', 'product_color').show()
products | customdimensions | product_type | product_color |
---|---|---|---|
{{red, apple}} | {red, apple} | apple | red |
null | null | null | null |
If you instead need this information to be in a vertical format, you could create an Array
with the fields you need and then explode it.
df = product_df.withColumn('products_customdimensions',
F.when(F.col('products').isNull(), F.array(F.lit(None)))\
.otherwise(F.array("products.customdimensions.type",
"products.customdimensions.color"))\
)
df = df.withColumn("product_cd", F.explode('products_customdimensions'))
df.select('products', 'products.customdimensions', 'product_cd').show()
products | customdimensions | product_cd |
---|---|---|
{{red, apple}} | {red, apple} | apple |
{{red, apple}} | {red, apple} | red |
null | null | null |
Option 2. You do not know the fields in products.customdimensions
.
If this is the case, you may want to convert the Struct
to a Map
. Once it is a Map
, there is an option to use it in the explode
or explode_outer
function
Here is an example of converting a Struct
to a Map
,
Convert struct to map in Spark SQL