Tags: python, apache-spark, pyspark

Create an interaction between two categorical columns in PySpark


I have two multi-level categorical columns stored in df:

  1. dow represents the day of week (seven categories mapped to integers: 1, 2, ..., 7).
  2. type represents four types of observation (four categories mapped to integers: 1, 2, 3, 4).

How can I create an interaction (i.e., the multiplication) of these two columns in PySpark?

I know how to encode them using OneHotEncoder. However, I'm not sure how to go about the feature engineering process to account for all 28 combinations (7 x 4 possible cases), especially because OneHotEncoder returns sparse vectors.

For the purpose of this question, assume my pyspark dataframe df looks as follows:

dow type target
1 1 200
1 2 222
1 7 229

Here dow can take on seven different values and type can take on four. Is there a built-in way to create interactions between these two columns in order to account for all possible combinations?


Solution

  • You could encode each pair as a single integer by multiplying dow by 10 and adding type. Because type is a single digit, every (dow, type) combination gets its own distinct integer:

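    from pyspark.sql import functions as F

    # Encode each (dow, type) pair as a two-digit integer: tens digit = dow, units digit = type
    (
        df
        .select(
            (F.col('dow') * F.lit(10) + F.col('type')).alias('result'),
            'dow',
            'type'
        )
        .show()
    )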
    
    +------+---+----+
    |result|dow|type|
    +------+---+----+
    |    11|  1|   1|
    |    12|  1|   2|
    |    17|  1|   7|
    +------+---+----+
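
The dow * 10 + type codes are unique but leave gaps (11, 12, ..., 74). If a contiguous index over the 28 combinations is preferred, for example as input to a one-hot encoding step, one possible variant is sketched below. The helper name interaction_index and its n_types parameter are hypothetical, and the sketch assumes dow stays within 1..7 and type within 1..4 (a value such as the 7 in the sample row would fall outside that range and collide with another pair).

```python
def interaction_index(dow, obs_type, n_types=4):
    """Map a (dow, type) pair to a single 0-based category index.

    Hypothetical helper, not part of PySpark. It only uses arithmetic
    operators, so it accepts plain ints as well as PySpark Column
    expressions such as F.col('dow').
    Assumes 1 <= dow <= 7 and 1 <= obs_type <= n_types.
    """
    return (dow - 1) * n_types + (obs_type - 1)


# Sanity check on plain ints: the 28 pairs map onto 0..27 with no gaps or repeats.
codes = [interaction_index(d, t) for d in range(1, 8) for t in range(1, 5)]
assert sorted(codes) == list(range(28))
```

With a DataFrame, the same helper could be applied as df.withColumn('dow_type', interaction_index(F.col('dow'), F.col('type'))), since Column objects support these arithmetic operators. The resulting column could then be passed to OneHotEncoder like any other integer-indexed category.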