I'm trying to train a machine learning model on a Dataset that contains categorical (String) columns. Spark models can't be trained on String values directly, so I need to convert or index them into numerical values. The only String transformer I've found in Spark is the StringIndexer, but it seems unreliable for my case: it assigns indices based on how frequent each string is, and there is no guarantee that the frequencies in my test files will match those in the training data, so the same string could end up with a different index.

My idea was to use each String value's hashcode as its index instead. I can easily iterate over the rows, compute the hashcode of the values in a String column, and store them in a List. What I don't know is how to add that List back into the Dataset so I can train my model with it. The List is ordered from top row to bottom row, so I was looking for a way to convert it into a column, but Spark doesn't seem to offer that directly. Any idea how I can create a new Column from a List and append it to my training Dataset?
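For reference, this is roughly how I'm using StringIndexer at the moment (trainingDF is my training Dataset, and the "category" / "categoryIndex" column names are just examples). The fit step is what ties the resulting indices to the label frequencies in whatever data it was fitted on:

import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// StringIndexer assigns indices by label frequency in the data it is fitted on,
// so the string-to-index mapping depends on that data's distribution.
StringIndexerModel indexerModel = new StringIndexer()
        .setInputCol("category")        // example String column name
        .setOutputCol("categoryIndex")
        .fit(trainingDF);

Dataset<Row> indexedTrain = indexerModel.transform(trainingDF);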
So I found out that Spark has a function called hash (in org.apache.spark.sql.functions) that creates an int column containing the hash values of another column. The solution to my problem was the following:
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

// Reference the String column to be hashed.
Column stringCol = new Column("stringValues");
// Append an int column containing the hash of each value in "stringValues".
trainingDF = trainingDF.withColumn("hashString", functions.hash(stringCol));
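Since the hash is computed from the value itself, the same string maps to the same integer in both my training and test data, which was exactly what I was worried about with StringIndexer. In case it's useful, here's a rough sketch of how the hashed column can then be assembled into a feature vector and used to train a model; the column names and the LogisticRegression choice are just placeholders for whatever you're actually using:

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Combine the numeric columns (including the new hash column) into a single feature vector.
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"hashString"})   // plus any other numeric feature columns
        .setOutputCol("features");

Dataset<Row> assembled = assembler.transform(trainingDF);

// Train on the assembled features; "label" stands in for the actual target column.
LogisticRegressionModel model = new LogisticRegression()
        .setLabelCol("label")
        .setFeaturesCol("features")
        .fit(assembled);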