Let us say that I have a Spark DataFrame where one of the column has short phrases. The total number of unique phrases is too large to be useful in a machine learning algorithm, but I was thinking of breaking the phrases into words and then transforming each phrase into a vector of fixed length N
where each dimension indicates if one of the N
most common words is found in this phrase.
Here's an example.
val sentenceDataFrame = spark.createDataFrame(Seq(
(0, "Foo bar dinosaur"),
(1, "bar logistic"),
(2, "foo bar logistic regression")
)).toDF("id", "sentence")
If I let N
be 3, then I expect the following transformed DataFrame
val sentenceDataFrameTransformed = spark.createDataFrame(Seq(
(0, "Foo bar dinosaur", [1, 1, 0]),
(1, "bar logistic", [0, 1, 0]),
(2, "foo bar logistic regression", [1 1 1])
)).toDF("id", "sentence", "sentenceTopWordsHotEncoded")
The top words would be "foo," "bar," and "logistic" in this case (by the popularity).
There are a lot of available transformers that Spark describes here, but I don't see one (or an easy combination of some) that give me what I want.
The reason that I don't want to write this function manually is because I want the transformer ready to be put into a Pipeline so that I can serialize this model and so when I evaluate new rows, the top N
words will be the same.
Use either CountVectorizer
or HashingTF
. If you want binary features you should use the second one and setBinary
.
Both will require tokenization first and optionally StopWordsRemover
. Overall:
(RegexTokenizer|Tokenizer) -> (CountVectorizer|HashingTF)