I have a dataframe which consists of two columns, one an Int and the other a String:
+-------+---------+
|user_id|    token|
+-------+---------+
|    419|     Cake|
|    419|Chocolate|
|    419|   Cheese|
|    419|    Cream|
|    419|    Bread|
|    419|    Sugar|
|    419|   Butter|
|    419|  Chicken|
|    419|   Baking|
|    419| Grilling|
+-------+---------+
I need to find the 250 closest tokens in the Word2Vec vocabulary for each token in the "token" column. I attempted to use the findSynonymsArray method in a udf:
def getSyn( w2v : Word2VecModel ) = udf { (token : String) => w2v.findSynonymsArray(token, 10)}
However, this udf causes a NullPointerException when used with withColumn. The exception occurs even if the token is hard-coded, and regardless of whether the code is run locally or in cluster mode. I used a try-catch inside the udf to catch the null pointer, and it is raised on every row.
I have queried the dataframe for null values; there are none in either column.
I also tried extracting the words and vectors from the Word2VecModel with getVectors, running my udf on the words in that dataframe, and doing an inner join with my dataframe. The same exception is raised.
I would greatly appreciate any help.
This is an expected outcome. Word2VecModel is a distributed model, and its methods are implemented using RDD operations. Because of that, it cannot be used inside a udf, map, or any other executor-side code.
If you want to compute synonyms for the whole DataFrame, you can try to do it manually, working with the word vectors as a DataFrame, as shown for example in "using Word2VecModel.transform() does not work in map function".
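As a rough illustration of the manual route, here is a minimal sketch in plain Scala that ranks words by cosine similarity over a local map of word vectors. The object name LocalSynonyms, the helper names, and the choice of cosine similarity as the ranking measure are assumptions made for this example; none of it is Spark API, and it is not necessarily how findSynonymsArray ranks results internally.

```scala
// Hypothetical helper (not part of Spark): nearest-word lookup over local vectors.
object LocalSynonyms {

  // Cosine similarity of two equal-length vectors; returns 0.0 if either has zero norm.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    var i = 0
    while (i < a.length) {
      dot += a(i) * b(i)
      normA += a(i) * a(i)
      normB += b(i) * b(i)
      i += 1
    }
    if (normA == 0.0 || normB == 0.0) 0.0
    else dot / (math.sqrt(normA) * math.sqrt(normB))
  }

  // Up to `num` (word, similarity) pairs closest to `token`, excluding the token itself.
  // A token missing from the vocabulary yields an empty result instead of throwing.
  def findSynonyms(
      vectors: Map[String, Array[Double]],
      token: String,
      num: Int): Seq[(String, Double)] =
    vectors.get(token) match {
      case None => Seq.empty
      case Some(query) =>
        vectors.iterator
          .filter { case (word, _) => word != token }
          .map { case (word, vec) => (word, cosine(query, vec)) }
          .toSeq
          .sortBy { case (_, similarity) => -similarity }
          .take(num)
    }
}
```

Because this function only touches local data, it should be safe to call from a udf. One possible way to wire it up, assuming the vocabulary fits in memory, is to collect the rows of getVectors on the driver into a Map, broadcast that map, and call LocalSynonyms.findSynonyms on the broadcast value inside the udf. Note that this scans the whole vocabulary for every token, so it may be slow for large vocabularies.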