On Databricks on Azure:
I follow these steps:
create a library from a Python egg, say simon_1_001.egg, which contains a module simon.
attach the library to a cluster and restart the cluster
attach a notebook to the cluster and run:
import simon as s
print s.__file__
run the notebook and it correctly gives me a file name including the string 'simon_1_001.egg'
then detach and delete the egg file, even emptying trash.
restart the cluster, detach and reattach the notebook, and run it. Instead of complaining that it can't find module simon, it runs and displays the same string.
Similarly, if I upload a newer version of the egg, say simon_1_002.egg, it still displays the same string. If I wait half an hour, clear the notebook state, and rerun a few times, it eventually picks up the new library and displays simon_1_002.egg.
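To rule out a stale notebook state, I also check which egg Python is actually resolving, along these lines (the module and egg names are just the ones from my example):

import sys
import simon as s

# Where the module was actually loaded from
print(s.__file__)

# Any attached egg paths still visible to the interpreter
print([p for p in sys.path if p.endswith('.egg')])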
How can I properly clear down the old egg files?
Simon, this is a bug in the Databricks platform. When a library is created in Databricks from a jar or an egg, the file is stored in dbfs:/FileStore and, for Python libraries, copied into /databricks/python2/lib/python2.7/site-packages/ on Py2 clusters or /databricks/python3/lib/python3.5/site-packages/ on Py3 clusters. In both the jar and egg cases, the path is recorded when the library is created. When a library is detached and removed from the Trash, it is supposed to remove the copy from DBFS, but it currently does not.
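For example, you can confirm whether copies are still present in both places with cells roughly like these (this assumes a Py3 cluster and the module name simon from the question; the exact DBFS folder may differ in your workspace):

# Python cell: list what is currently stored under dbfs:/FileStore
display(dbutils.fs.ls("dbfs:/FileStore/"))

%sh
# Shell cell: look for leftover copies of the egg in site-packages
ls /databricks/python3/lib/python3.5/site-packages/ | grep -i simon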
To alleviate this inconsistency, check the Environment sub-tab of the Spark UI, or run %sh ls in a notebook cell against the paths above, to verify whether the library was actually removed; if the old files are still there, delete them with a %sh rm command before restarting the cluster and attaching the newer version of the library.
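A minimal cleanup sketch, again assuming a Py3 cluster and the simon egg from the question (adjust the path and pattern to your own library):

%sh
# Remove the stale copies so the next attach installs a clean version;
# restart the cluster and attach the new egg afterwards
rm -rf /databricks/python3/lib/python3.5/site-packages/simon*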