scala apache-spark user-defined-functions word2vec apache-spark-ml

Using Word2Vec functions inside of a UDF in Apache Spark (v2.3.1)

I have a dataframe which consists of two columns, one an Int and the other a String:

+-------------+---------------------+
|user_id      |token                |
+-------------+---------------------+
|          419|                 Cake|
|          419|            Chocolate|
|          419|               Cheese|
|          419|                Cream|
|          419|                Bread|
|          419|                Sugar|
|          419|               Butter|
|          419|              Chicken|
|          419|               Baking|
|          419|             Grilling|
+-------------+---------------------+

For each token in the "token" column, I need to find the 250 closest tokens in the Word2Vec vocabulary. I attempted to use the findSynonymsArray method in a udf:

def getSyn( w2v : Word2VecModel ) = udf { (token : String) => w2v.findSynonymsArray(token, 10)}

However, this udf causes a NullPointerException when used with withColumn. The exception occurs even if the token is hard-coded, and regardless of whether the code is run locally or in cluster mode. I wrapped the body of the udf in a try-catch to catch the null pointer, and it is raised on every row.

I have queried the dataframe for null values; there are none in either column.

I also tried extracting the words and vectors from the Word2VecModel with getVectors, running my udf on the words in that dataframe, and then doing an inner join with my dataframe. The same exception is raised.

I would greatly appreciate any help.

Solution

This is the expected outcome. Word2VecModel is a distributed model, and its methods are implemented using RDD operations. Because of that, it cannot be used inside a udf, a map, or any other executor-side code.
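If the number of distinct tokens is small, one workaround that follows from this is to keep every call to the model on the driver. The sketch below is only an illustration under that assumption: the helper name synonymsOnDriver and the output column names are made up for the example, and it filters against the vocabulary first because findSynonymsArray throws for words the model does not know.

```scala
import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: queries the model on the driver, one distinct token at a time.
// Only reasonable when the set of distinct tokens fits comfortably in driver memory.
def synonymsOnDriver(spark: SparkSession, df: DataFrame, w2v: Word2VecModel, num: Int): DataFrame = {
  import spark.implicits._

  // Words known to the model; out-of-vocabulary tokens are skipped.
  val vocabulary = w2v.getVectors.select("word").as[String].collect().toSet

  // Distinct tokens from the input, collected to the driver.
  val tokens = df.select("token").as[String].distinct().collect().filter(vocabulary.contains)

  // findSynonymsArray runs here on the driver, never inside executor-side code.
  val rows = for {
    token <- tokens.toSeq
    (synonym, similarity) <- w2v.findSynonymsArray(token, num).toSeq
  } yield (token, synonym, similarity)

  rows.toDF("token", "synonym", "similarity")
}
```

The result can be joined back to the original dataframe on "token". This does not scale to a large number of distinct tokens, since each lookup is a separate driver-side call.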

If you want to compute synonyms for the whole DataFrame, you can try to do it manually:

Load the model directly as a DataFrame, as shown for example in "Using Word2VecModel.transform() does not work in map function".
Transform the input data.
Join the two using an approximate join, or take the cross product and filter the result.
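
The steps above can be sketched roughly as follows, using the cross-product variant. This is a minimal illustration, not a tuned implementation. It assumes the word vectors are taken from getVectors (columns "word" and "vector") rather than read from the saved model files, it uses cosine similarity as the score, and the names closestTokens, cosine and the output columns are made up for the example.

```scala
import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, udf}

// Cosine similarity between two word vectors; returns 0.0 if either vector has zero norm.
// This udf only touches plain vectors, so it is safe to run on executors.
val cosine = udf { (a: Vector, b: Vector) =>
  val x = a.toArray
  val y = b.toArray
  val dot = x.zip(y).map { case (p, q) => p * q }.sum
  val norms = math.sqrt(x.map(p => p * p).sum) * math.sqrt(y.map(q => q * q).sum)
  if (norms == 0.0) 0.0 else dot / norms
}

// Hypothetical helper: the k most similar vocabulary words for each distinct token in df.
def closestTokens(df: DataFrame, w2v: Word2VecModel, k: Int): DataFrame = {
  // Step 1: the vocabulary as a DataFrame (called on the driver).
  val vectors = w2v.getVectors

  // Step 2: attach a vector to each distinct input token; unknown tokens drop out of the join.
  val tokens = df.select("token").distinct()
    .join(vectors, col("token") === col("word"))
    .select(col("token"), col("vector").as("token_vector"))

  // Step 3: cross product with the vocabulary, score each pair, keep the top k per token.
  val byScore = Window.partitionBy("token").orderBy(col("similarity").desc)
  tokens
    .crossJoin(vectors.select(col("word").as("synonym"), col("vector").as("synonym_vector")))
    .where(col("token") =!= col("synonym"))
    .withColumn("similarity", cosine(col("token_vector"), col("synonym_vector")))
    .withColumn("rank", row_number().over(byScore))
    .where(col("rank") <= k)
    .select("token", "synonym", "similarity")
}
```

With the dataframe from the question this would be called as closestTokens(df, w2v, 250). The cross product grows with the number of distinct tokens times the vocabulary size, so for large inputs an approximate join is likely the better fit.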