
Which Spark ML Feature Transformer should I use to convert a column of phrases to vectors of fixed length?


Let us say that I have a Spark DataFrame in which one of the columns contains short phrases. The total number of unique phrases is too large to be useful in a machine learning algorithm, so I was thinking of breaking the phrases into words and then transforming each phrase into a vector of fixed length N, where each dimension indicates whether one of the N most common words appears in that phrase.

Here's an example.

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "Foo bar dinosaur"),
  (1, "bar logistic"),
  (2, "foo bar logistic regression")
)).toDF("id", "sentence")

If I let N be 3, then I expect the following transformed DataFrame

import org.apache.spark.ml.linalg.Vectors

val sentenceDataFrameTransformed = spark.createDataFrame(Seq(
  (0, "Foo bar dinosaur", Vectors.dense(1, 1, 0)),
  (1, "bar logistic", Vectors.dense(0, 1, 0)),
  (2, "foo bar logistic regression", Vectors.dense(1, 1, 1))
)).toDF("id", "sentence", "sentenceTopWordsHotEncoded")

In this case the top words would be "foo," "bar," and "logistic" (ranked by popularity, with "Foo" and "foo" counted as the same word).

There are a lot of available transformers that Spark describes here, but I don't see one (or an easy combination of a few) that gives me what I want.

The reason I don't want to write this function manually is that I want the transformer ready to be put into a Pipeline, so that I can serialize the model and the top N words will be the same when I evaluate new rows.


Solution

  • Use either CountVectorizer or HashingTF. If you want binary (0/1) features rather than counts, call setBinary(true); both transformers support it, and CountVectorizer additionally lets you cap the vocabulary at the N most frequent words with setVocabSize.

    Both require tokenization first, optionally followed by StopWordsRemover. Overall:

     (RegexTokenizer|Tokenizer) -> [StopWordsRemover] -> (CountVectorizer|HashingTF)
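
    For illustration, here is a minimal sketch of such a pipeline, assuming Spark 2.x or later and the sentenceDataFrame from the question, with N = 3; the save path is just a placeholder:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer}

// Split each sentence into lowercase word tokens.
val tokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")

// Keep only the N (= 3) most frequent words across the corpus and
// emit 0/1 indicators instead of raw counts.
val vectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("sentenceTopWordsHotEncoded")
  .setVocabSize(3)
  .setBinary(true)

val pipeline = new Pipeline().setStages(Array(tokenizer, vectorizer))

// fit() fixes the vocabulary, so rows evaluated later are encoded
// against the same top N words.
val model = pipeline.fit(sentenceDataFrame)
model.transform(sentenceDataFrame).show(false)

// The fitted model, including the learned vocabulary, can be serialized.
model.write.overwrite().save("/tmp/top-words-pipeline")

    Note that CountVectorizer emits sparse vectors (e.g. (3,[0,1],[1.0,1.0])) and makes no guarantee about vocabulary index order, so the dimensions may not line up exactly with the hand-written example above. Because the vocabulary is fixed at fit time and stored in the saved PipelineModel, new rows will always be encoded against the same top N words.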