python pyspark apache-spark-sql apache-spark-mllib one-hot-encoding

One-hot encoding a list feature in PySpark

I would like to prepare my dataset for use with machine learning algorithms. One feature is a list of the tags associated with each TV series (my records). Is it possible to apply one-hot encoding to it directly, or would it be preferable to first extract all the possible elements of these lists? I plan to use these tags in later analysis.

Here is an example of my dataset and the code applied to it.


from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder

indexer = StringIndexer(inputCol="tags", outputCol="tagsIndex")

df = indexer.fit(df).transform(df)

ohe = OneHotEncoder(inputCol="tagsIndex", outputCol="tagsOHEVector")

df = ohe.fit(df).transform(df)

Solution

I'm not sure whether there is a way to apply one-hot encoding directly to a list column; I would also like to know.

In the meantime, a straightforward way to do it is to explode the tags, collect the distinct values, and create one indicator column per tag.

Example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        {"ID": 1, "tags": ["A", "B", "C"]},
        {"ID": 2, "tags": ["A", "D", "E"]},
        {"ID": 3, "tags": ["A", "C", "F"]},
    ]
)

tags = [
    x[0]
    for x in df.select(F.explode("tags").alias("tags"))
    .distinct()
    .orderBy("tags")
    .collect()
]

df = df.select(
    "*",
    *[
        F.array_contains("tags", tag).cast("integer").alias("tags{}".format(tag))
        for tag in tags
    ]
)

Result:

+---+---------+-----+-----+-----+-----+-----+-----+
|ID |tags     |tagsA|tagsB|tagsC|tagsD|tagsE|tagsF|
+---+---------+-----+-----+-----+-----+-----+-----+
|1  |[A, B, C]|1    |1    |1    |0    |0    |0    |
|2  |[A, D, E]|1    |0    |0    |1    |1    |0    |
|3  |[A, C, F]|1    |0    |1    |0    |0    |1    |
+---+---------+-----+-----+-----+-----+-----+-----+