I have a very big dataframe like:
A B
a_1 b_1
a_2 b_2
a_3 b_3
a_1 b_4
a_2 b_4
a_2 b_2
I want to create columns corresponding to each unique value of B and set its value as 1 if it exist for each unique value of A. The expected result should look like this
A B C_b_1 C_b_2 C_b_3 C_b_4
a_1 b_1 1 0 0 1
a_2 b_2 0 1 0 1
a_3 b_3 0 0 1 0
a_1 b_4 1 0 0 1
a_2 b_4 0 1 0 1
a_2 b_2 0 1 0 1
Explanation: For a_1 the distinct values of B are {b_1, b_4} and hence the columns corresponding to them are set to 1. For a_2 the distinct values of B are {b_2, b_4} and hence those columns are 1. Similarly for a_3.
The data is pretty huge, expect 'A' to have about 37000 distinct values while 'B' has about 370. The number of records are over 17 million.
You can use df.stat.crosstab
, and join it back to the original dataframe using the A
column:
df.join(df.stat.crosstab("A","B").withColumnRenamed("A_B", "A"), "A").show
+---+---+---+---+---+---+
| A| B|b_1|b_2|b_3|b_4|
+---+---+---+---+---+---+
|a_3|b_3| 0| 0| 1| 0|
|a_2|b_2| 0| 2| 0| 1|
|a_2|b_4| 0| 2| 0| 1|
|a_2|b_2| 0| 2| 0| 1|
|a_1|b_4| 1| 0| 0| 1|
|a_1|b_1| 1| 0| 0| 1|
+---+---+---+---+---+---+