I would like to prepare my dataset for use with machine learning algorithms. One of my features is a list of the tags associated with each TV series (my records). Is it possible to apply one-hot encoding directly to this column, or would it be preferable to first extract all the distinct elements of these lists? My idea is to use these tags in the subsequent analysis.
Here is an example of my dataset and the code applied to it.
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
indexer = StringIndexer(inputCol="tags", outputCol="tagsIndex")
df = indexer.fit(df).transform(df)
ohe = OneHotEncoder(inputCol="tagsIndex", outputCol="tagsOHEVector")
df = ohe.fit(df).transform(df)
I am not sure whether one-hot encoding can be applied directly to an array column; I would also like to know.
In the meantime, a straightforward approach is to explode the tags, collect the distinct values,
and then create one 0/1 indicator column per tag.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        {"ID": 1, "tags": ["A", "B", "C"]},
        {"ID": 2, "tags": ["A", "D", "E"]},
        {"ID": 3, "tags": ["A", "C", "F"]},
    ]
)
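# Collect the sorted list of distinct tags to the driver.
# This assumes the number of distinct tags is small enough to fit in memory.
tags = [
    x[0]
    for x in df.select(F.explode("tags").alias("tags"))
    .distinct()
    .orderBy("tags")
    .collect()
]
# Add one 0/1 column per tag; cast first, then alias, so the column name is kept.
df = df.select(
    "*",
    *[
        F.array_contains("tags", tag).cast("integer").alias("tags{}".format(tag))
        for tag in tags
    ]
)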
Result:
+---+---------+-----+-----+-----+-----+-----+-----+
|ID |tags     |tagsA|tagsB|tagsC|tagsD|tagsE|tagsF|
+---+---------+-----+-----+-----+-----+-----+-----+
|1  |[A, B, C]|1    |1    |1    |0    |0    |0    |
|2  |[A, D, E]|1    |0    |0    |1    |1    |0    |
|3  |[A, C, F]|1    |0    |1    |0    |0    |1    |
+---+---------+-----+-----+-----+-----+-----+-----+