I want to make a distinct dataframe that if they have the same product_number
, there is a prioritization about the product_name
.
Here is the prioritization: Product A > B > C
In this table, you can see that there are same product_number
with different product_name
.
How can I make it distinct to something like this?
For product_number
1000003, A has been prioritized.
For product_number
1000005, A has been prioritized.
For product_number
1000006, B has been prioritized.
Note: Name of the product is not actually A, B, C nor only 3
Please try below one -
Example -
from pyspark.sql import Row
from pyspark.sql.functions import *
df_pn = spark.createDataFrame([
Row(product_number="1000001", product_name="coil"),
Row(product_number="1000002", product_name="mouse"),
Row(product_number="1000003", product_name="coil"),
Row(product_number="1000003", product_name="laptop"),
Row(product_number="1000004", product_name="coil"),
Row(product_number="1000005", product_name="mouse"),
Row(product_number="1000005", product_name="coil"),
Row(product_number="1000005", product_name="laptop"),
Row(product_number="1000006", product_name="mouse"),
Row(product_number="1000006", product_name="laptop"),
Row(product_number="1000007", product_name="laptop")
])
product_map = { "coil" : "1", "mouse" : "3", "laptop" : "2" }
from itertools import chain
from pyspark.sql.functions import create_map, lit, when
mapping_exprsn = create_map([lit(x) for x in chain(*product_map.items())])
df1 = df_pn.filter(df_pn['product_name'].isNull()).withColumn('product_priority', lit(None))
df2 = df_pn.filter(df_pn['product_name'].isNotNull()).withColumn(
'product_priority',
when(
df_pn['product_name'].isNotNull(),
mapping_exprsn[df_pn['product_name']]
)
)
result = df1.unionAll(df2)
display(result)
result.createOrReplaceTempView("test")
select test.product_number, test.product_name from test
join (select min(product_priority) product_priority, first(product_name) as product_name, product_number from test group by product_number) test_agg
on test.product_number=test_agg.product_number
where test.product_priority=test_agg.product_priority