Tags: scala, dataframe, apache-spark, user-defined-functions, rdd

How to label encode for a column in spark scala?


I have a very big dataframe like:

    A     B
   a_1   b_1
   a_2   b_2
   a_3   b_3
   a_1   b_4
   a_2   b_4
   a_2   b_2

I want to create a column for each unique value of B, and set it to 1 in a row if that value of B occurs for the row's value of A. The expected result should look like this:

    A     B     C_b_1     C_b_2     C_b_3     C_b_4
   a_1   b_1      1         0         0         1
   a_2   b_2      0         1         0         1
   a_3   b_3      0         0         1         0
   a_1   b_4      1         0         0         1
   a_2   b_4      0         1         0         1
   a_2   b_2      0         1         0         1

Explanation: For a_1 the distinct values of B are {b_1, b_4} and hence the columns corresponding to them are set to 1. For a_2 the distinct values of B are {b_2, b_4} and hence those columns are 1. Similarly for a_3.

The data is pretty large: 'A' has about 37,000 distinct values, 'B' has about 370, and there are over 17 million records.
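For reproducibility, the sample data above can be built like this (a minimal sketch, assuming an active `SparkSession` named `spark`):

```scala
// Assumes an existing SparkSession `spark`; the column names A and B
// match the sample table in the question.
import spark.implicits._

val df = Seq(
  ("a_1", "b_1"),
  ("a_2", "b_2"),
  ("a_3", "b_3"),
  ("a_1", "b_4"),
  ("a_2", "b_4"),
  ("a_2", "b_2")
).toDF("A", "B")
```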


Solution

  • You can use df.stat.crosstab, and join it back to the original dataframe using the A column:

    df.join(df.stat.crosstab("A","B").withColumnRenamed("A_B", "A"), "A").show
    +---+---+---+---+---+---+
    |  A|  B|b_1|b_2|b_3|b_4|
    +---+---+---+---+---+---+
    |a_3|b_3|  0|  0|  1|  0|
    |a_2|b_2|  0|  2|  0|  1|
    |a_2|b_4|  0|  2|  0|  1|
    |a_2|b_2|  0|  2|  0|  1|
    |a_1|b_4|  1|  0|  0|  1|
    |a_1|b_1|  1|  0|  0|  1|
    +---+---+---+---+---+---+
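Note that `crosstab` produces counts, not 0/1 indicators: in the output above, the `b_2` column for `a_2` is 2 because that pair occurs twice. If strict 0/1 indicators are needed, one way is to cap each count at 1 before the join (a sketch, assuming the same `df` as above):

```scala
import org.apache.spark.sql.functions.{col, when}

val ct = df.stat.crosstab("A", "B").withColumnRenamed("A_B", "A")

// Turn every non-zero count into 1 so the columns are 0/1 indicators.
val indicators = ct.columns.filter(_ != "A").foldLeft(ct) { (acc, c) =>
  acc.withColumn(c, when(col(c) > 0, 1).otherwise(0))
}

df.join(indicators, "A")
```

The column names can then be prefixed (e.g. to `C_b_1`) with a further `withColumnRenamed` pass if the exact headers from the question are required.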