I have two multi-level categorical columns stored in `df`:

- `dow` represents the day of week (seven categories mapped to integers: 1, 2, ..., 7).
- `type` represents four types of observation (four categories mapped to integers: 1, 2, 3, 4).

How can I create an interaction (i.e., the multiplication) of these two columns in PySpark?
I know how to encode them using `OneHotEncoder`. However, I'm not sure how to go about the feature engineering process to account for all 28 combinations (7 x 4 possible cases), especially because `OneHotEncoder` returns sparse vectors.
For the purpose of this question, assume my PySpark dataframe `df` looks as follows:
dow | type | target |
---|---|---|
1 | 1 | 200 |
1 | 2 | 222 |
1 | 7 | 229 |
Here `dow` can take on seven different values and `type` can take on four. Is there a built-in way to create interactions between these two columns in order to account for all possible combinations?
You could do integer encoding by multiplying `dow` by 10 and adding `type` to it. Because `type` is always a single digit, each (`dow`, `type`) combination gets its own integer:
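from pyspark.sql import functions as F

# Encode each (dow, type) pair as a single integer: dow * 10 + type
(
    df
    .select(
        (F.col('dow') * F.lit(10) + F.col('type')).alias('result'),
        'dow',
        'type'
    )
    .show()
)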
+------+---+----+
|result|dow|type|
+------+---+----+
|    11|  1|   1|
|    12|  1|   2|
|    17|  1|   7|
+------+---+----+
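
To check that this encoding really separates all 28 combinations, here is a minimal plain-Python sketch (no Spark needed) that enumerates every pair. It assumes `dow` stays in 1..7 and `type` in 1..4, as the question states. The helper names `pair_code` and `dense_index` are hypothetical and only used for illustration; `dense_index` is an alternative, not part of the answer above, that maps the pairs onto a contiguous 0..27 range, which may be more convenient if the result is later treated as a category index.

```python
from itertools import product

# Value ranges as stated in the question (assumed to hold for all rows).
DOW_VALUES = range(1, 8)   # 1..7
TYPE_VALUES = range(1, 5)  # 1..4


def pair_code(dow, type_):
    """Encoding used in the answer: dow * 10 + type."""
    return dow * 10 + type_


def dense_index(dow, type_):
    """Hypothetical alternative: contiguous index in 0..27."""
    return (dow - 1) * len(TYPE_VALUES) + (type_ - 1)


pairs = list(product(DOW_VALUES, TYPE_VALUES))
codes = [pair_code(d, t) for d, t in pairs]
indices = [dense_index(d, t) for d, t in pairs]

# Both encodings assign a distinct integer to each of the 28 pairs.
print(len(pairs), len(set(codes)), len(set(indices)))
# The dense variant covers exactly 0..27 with no gaps.
print(sorted(indices) == list(range(28)))
```

In PySpark the dense variant could presumably be written the same way as the expression above, for example `(F.col('dow') - 1) * 4 + (F.col('type') - 1)`. Either way the result is still a single integer column, so it would need to be handled as a categorical feature rather than a numeric one.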