python, apache-spark, pyspark, nlp, word2vec

How to free the memory taken by a pyspark model (JavaModel)?


As described in the title, I load a trained word2vec model with pyspark:

word2vec_model = Word2VecModel.load("saving path")

After using it, I want to delete it since it takes a lot of memory on a single node (I used the findSynonyms function, and the docs say it should only be used locally). I tried:

del word2vec_model
gc.collect()

but that doesn't seem to work. And since the model is not an RDD, I can't call .unpersist() on it. I also didn't find anything like an unload() function in the docs.

Could anyone help me or give me some advice?


Solution

  • You can ensure that the JVM-side object is released by detaching it from the py4j gateway with the following statement:

    Given word2vec_model, a pyspark Transformer:

    • Given spark a SparkSession:
    spark.sparkContext._gateway.detach(word2vec_model._java_obj)
    
    • ... or given sc, a SparkContext:
    sc._gateway.detach(word2vec_model._java_obj)
    

    Explanations:

    1. Access the underlying wrapper object: your model is a pyspark Transformer, and each transformer holds an instance of JavaObject in a private _java_obj attribute.
    2. Access the SparkContext's py4j gateway.
    3. Call the gateway's detach method on the wrapper object (the instance of JavaObject), so the JVM is free to garbage-collect it.
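    The steps above can be exercised end to end. A minimal sketch, assuming a local pyspark installation; since the original saving path is not shown, it trains a tiny model in place of loading one from disk:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Word2Vec

    spark = SparkSession.builder.master("local[1]").appName("detach-demo").getOrCreate()

    # Tiny corpus standing in for real training data
    df = spark.createDataFrame([("a b c".split(),), ("b c d".split(),)], ["text"])
    word2vec_model = Word2Vec(vectorSize=4, minCount=1,
                              inputCol="text", outputCol="vec").fit(df)

    # Use the model locally, as in the question
    word2vec_model.findSynonyms("b", 1).show()

    # Release the JVM-side object: detach the wrapped JavaObject from the
    # py4j gateway, then drop the Python reference
    spark.sparkContext._gateway.detach(word2vec_model._java_obj)
    del word2vec_model

    spark.stop()
    ```

    Note that del alone only removes the Python-side reference; without the detach call, the gateway still pins the object on the JVM side, which is why gc.collect() appeared to have no effect.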