Tags: scala, dataframe, apache-spark, user-defined-functions, rdd

How to label encode for a column in spark scala?


I have a very big dataframe like:

    A     B
   a_1   b_1
   a_2   b_2
   a_3   b_3
   a_1   b_4
   a_2   b_4
   a_2   b_2

I want to create a column for each unique value of B, and set it to 1 in a row if that value of B occurs for the row's value of A. The expected result should look like this:

    A     B     C_b_1     C_b_2     C_b_3     C_b_4
   a_1   b_1      1         0         0         1
   a_2   b_2      0         1         0         1
   a_3   b_3      0         0         1         0
   a_1   b_4      1         0         0         1
   a_2   b_4      0         1         0         1
   a_2   b_2      0         1         0         1

Explanation: For a_1 the distinct values of B are {b_1, b_4} and hence the columns corresponding to them are set to 1. For a_2 the distinct values of B are {b_2, b_4} and hence those columns are 1. Similarly for a_3.

The data is pretty large: 'A' has about 37,000 distinct values, 'B' has about 370, and there are over 17 million records.
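For reproducibility, the sample data above can be built like this (a minimal sketch, assuming an active `SparkSession` named `spark`):

```scala
// Assumes an existing SparkSession `spark`; the column names A and B
// match the sample table in the question.
import spark.implicits._

val df = Seq(
  ("a_1", "b_1"),
  ("a_2", "b_2"),
  ("a_3", "b_3"),
  ("a_1", "b_4"),
  ("a_2", "b_4"),
  ("a_2", "b_2")
).toDF("A", "B")
```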


Solution

  • You can use df.stat.crosstab, and join it back to the original dataframe using the A column:

    df.join(df.stat.crosstab("A","B").withColumnRenamed("A_B", "A"), "A").show
    +---+---+---+---+---+---+
    |  A|  B|b_1|b_2|b_3|b_4|
    +---+---+---+---+---+---+
    |a_3|b_3|  0|  0|  1|  0|
    |a_2|b_2|  0|  2|  0|  1|
    |a_2|b_4|  0|  2|  0|  1|
    |a_2|b_2|  0|  2|  0|  1|
    |a_1|b_4|  1|  0|  0|  1|
    |a_1|b_1|  1|  0|  0|  1|
    +---+---+---+---+---+---+
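Note that `crosstab` produces counts, not 0/1 indicators: in the output above, the `b_2` column for `a_2` is 2 because that pair occurs twice. If strict 0/1 indicators are needed, one way is to cap each count at 1 before the join (a sketch, assuming the same `df` as above):

```scala
import org.apache.spark.sql.functions.{col, when}

val ct = df.stat.crosstab("A", "B").withColumnRenamed("A_B", "A")

// Turn every non-zero count into 1 so the columns are 0/1 indicators.
val indicators = ct.columns.filter(_ != "A").foldLeft(ct) { (acc, c) =>
  acc.withColumn(c, when(col(c) > 0, 1).otherwise(0))
}

df.join(indicators, "A")
```

The column names can then be prefixed (e.g. to `C_b_1`) with a further `withColumnRenamed` pass if the exact headers from the question are required.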