Tags: databricks, egg

How can I stop a Databricks notebook referencing old versions of my egg files?


On Databricks on Azure:

I follow these steps:

  • create a library from a Python egg, say simon_1_001.egg, which contains a module simon.

  • attach the library to a cluster and restart the cluster

  • attach a notebook to the cluster and run:

    import simon as s
    print(s.__file__)

  • run the notebook and it correctly gives me a file name including the string 'simon_1_001.egg'

  • then detach and delete the egg file, even emptying the trash.

  • restart the cluster, detach and re-attach the notebook, and run it; instead of complaining that it can't find the module simon, it runs and displays the same string

Similarly, if I upload a newer version of the egg, say simon_1_002.egg, it still displays the same string. If I wait half an hour, clear the state, and rerun a few times, it eventually picks up the new library and displays simon_1_002.egg.
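One thing worth ruling out is Python's own import cache; a minimal sketch that forces a fresh import in the running session (this clears only the in-process sys.modules cache, not any stale copy of the egg on the cluster):

    import sys

    # Drop the cached module (and any submodules) so the next import
    # re-resolves 'simon' from sys.path rather than reusing the cache.
    for name in list(sys.modules):
        if name == "simon" or name.startswith("simon."):
            del sys.modules[name]

    import simon as s
    print(s.__file__)  # shows where the fresh import actually came from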

How can I properly clear down the old egg files?


Solution

  • Simon, this is a bug in the Databricks platform. When a library is created in Databricks from a jar or egg, the file is stored in dbfs:/FileStore and, for Python eggs, installed under /databricks/python2/lib/python2.7/site-packages/ on Py2 clusters and /databricks/python3/lib/python3.5/site-packages/ on Py3 clusters.

    In both the jar and egg cases, the path is recorded when the library is created. When a library is detached and removed from the Trash, it is supposed to remove the copy from DBFS, which it currently does not do.

    To alleviate this inconsistency, check the Environment sub-tab in the Spark UI, or run %sh ls in a cell against the paths above, to confirm whether a library has actually been removed; delete any leftover files with a %sh rm command before restarting the cluster and attaching a newer version of the library, as sketched below.
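    A minimal sketch of such a cell (the site-packages paths come from the explanation above; dbfs:/FileStore is mounted at /dbfs on the driver, and the exact file names on your cluster may differ):

      %sh
      # Look for stale copies of the egg in the locations named above.
      ls /dbfs/FileStore/jars/
      ls /databricks/python3/lib/python3.5/site-packages/ | grep -i simon

      # Once identified, remove the leftovers before restarting the cluster
      # and attaching the newer egg (adjust the names to what ls reported):
      # rm /dbfs/FileStore/jars/<old-simon-egg>
      # rm -rf /databricks/python3/lib/python3.5/site-packages/simon*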