
Which Spark ML Feature Transformer should I use to convert a column of phrases to vectors of fixed length?


Let us say that I have a Spark DataFrame in which one of the columns contains short phrases. The total number of unique phrases is too large to be useful in a machine learning algorithm, so I was thinking of breaking the phrases into words and then transforming each phrase into a vector of fixed length N, where each dimension indicates whether one of the N most common words appears in that phrase.

Here's an example.

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "Foo bar dinosaur"),
  (1, "bar logistic"),
  (2, "foo bar logistic regression")
)).toDF("id", "sentence")

If I let N be 3, then I expect the following transformed DataFrame

import org.apache.spark.ml.linalg.Vectors

val sentenceDataFrameTransformed = spark.createDataFrame(Seq(
  (0, "Foo bar dinosaur", Vectors.dense(1, 1, 0)),
  (1, "bar logistic", Vectors.dense(0, 1, 0)),
  (2, "foo bar logistic regression", Vectors.dense(1, 1, 1))
)).toDF("id", "sentence", "sentenceTopWordsHotEncoded")

In this case the top words would be "foo," "bar," and "logistic" (ranked by popularity, with "Foo" and "foo" counted as the same word).

There are a lot of available transformers that Spark describes here, but I don't see one (or an easy combination of a few) that gives me what I want.

The reason I don't want to write this function manually is that I want the transformer ready to be put into a Pipeline, so that I can serialize the model and the top N words will be the same when I evaluate new rows.


Solution

  • Use either CountVectorizer or HashingTF. If you want binary (0/1) features rather than counts, call setBinary(true); both transformers support it, and CountVectorizer additionally lets you cap the vocabulary at the N most frequent words with setVocabSize.

    Both require tokenization first, optionally followed by StopWordsRemover. Overall:

     (RegexTokenizer|Tokenizer) -> [StopWordsRemover] -> (CountVectorizer|HashingTF)
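
    For illustration, here is a minimal sketch of such a pipeline, assuming Spark 2.x or later and the sentenceDataFrame from the question, with N = 3; the save path is just a placeholder:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer}

// Split each sentence into lowercase word tokens.
val tokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")

// Keep only the N (= 3) most frequent words across the corpus and
// emit 0/1 indicators instead of raw counts.
val vectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("sentenceTopWordsHotEncoded")
  .setVocabSize(3)
  .setBinary(true)

val pipeline = new Pipeline().setStages(Array(tokenizer, vectorizer))

// fit() fixes the vocabulary, so rows evaluated later are encoded
// against the same top N words.
val model = pipeline.fit(sentenceDataFrame)
model.transform(sentenceDataFrame).show(false)

// The fitted model, including the learned vocabulary, can be serialized.
model.write.overwrite().save("/tmp/top-words-pipeline")

    Note that CountVectorizer emits sparse vectors (e.g. (3,[0,1],[1.0,1.0])) and makes no guarantee about vocabulary index order, so the dimensions may not line up exactly with the hand-written example above. Because the vocabulary is fixed at fit time and stored in the saved PipelineModel, new rows will always be encoded against the same top N words.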