python, apache-spark, pyspark, nlp, word2vec

How to free the memory taken by a pyspark model (JavaModel)?


As described in the title, I load a trained word2vec model with pyspark:

word2vec_model = Word2VecModel.load("saving path")

After using it, I want to delete it since it takes a lot of memory on a single node (I used the findSynonyms function, and the docs say it should only be used locally). I tried:

del word2vec_model
gc.collect()

but that doesn't seem to work. And since the model is not an RDD, I can't call .unpersist() on it. I also didn't find anything like an unload() function in the docs.

Could anyone help me or give me some advice?


Solution

  • You can ensure that the JVM-side object is released by detaching it from the py4j gateway with the following statement:

    Given word2vec_model, a pyspark Transformer:

    • Given spark a SparkSession:
    spark.sparkContext._gateway.detach(word2vec_model._java_obj)
    
    • ... or given sc, a SparkContext:
    sc._gateway.detach(word2vec_model._java_obj)
    

    Explanations:

    1. Access the underlying wrapper object: your model is a pyspark Transformer, and each transformer holds an instance of JavaObject in a private _java_obj attribute.
    2. Access the SparkContext's py4j gateway.
    3. Call the gateway's detach method on the wrapper object (the instance of JavaObject), so the JVM is free to garbage-collect it.
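    The steps above can be exercised end to end. A minimal sketch, assuming a local pyspark installation; since the original saving path is not shown, it trains a tiny model in place of loading one from disk:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Word2Vec

    spark = SparkSession.builder.master("local[1]").appName("detach-demo").getOrCreate()

    # Tiny corpus standing in for real training data
    df = spark.createDataFrame([("a b c".split(),), ("b c d".split(),)], ["text"])
    word2vec_model = Word2Vec(vectorSize=4, minCount=1,
                              inputCol="text", outputCol="vec").fit(df)

    # Use the model locally, as in the question
    word2vec_model.findSynonyms("b", 1).show()

    # Release the JVM-side object: detach the wrapped JavaObject from the
    # py4j gateway, then drop the Python reference
    spark.sparkContext._gateway.detach(word2vec_model._java_obj)
    del word2vec_model

    spark.stop()
    ```

    Note that del alone only removes the Python-side reference; without the detach call, the gateway still pins the object on the JVM side, which is why gc.collect() appeared to have no effect.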