Elephas not loaded in PySpark: No module named elephas.spark_model

I am trying to distribute Keras training on a cluster using Elephas. However, when I run the basic example from the Elephas documentation:

from elephas.utils.rdd_utils import to_simple_rdd
from elephas.spark_model import SparkModel
from elephas import optimizers as elephas_optimizers

# sc is the SparkContext; model, x_train, y_train, epochs and batch_size are defined earlier
rdd = to_simple_rdd(sc, x_train, y_train)  # numpy features and labels -> RDD of pairs
sgd = elephas_optimizers.SGD()
spark_model = SparkModel(sc, model, optimizer=sgd, frequency='epoch', mode='asynchronous', num_workers=2)
spark_model.train(rdd, nb_epoch=epochs, batch_size=batch_size, verbose=1, validation_split=0.1)

I get the following error:

 ImportError: No module named elephas.spark_model

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID 58, xxxx, executor 8): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/xx/xx/hadoop/yarn/local/usercache/xx/appcache/application_151xxx857247_19188/container_1512xxx247_19188_01_000009/", line 163, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/xx/xx/hadoop/yarn/local/usercache/xx/appcache/application_151xxx857247_19188/container_1512xxx247_19188_01_000009/", line 54, in read_command
    command = serializer._read_with_length(file)
  File "/yarn/local/usercache/xx/appcache/application_151xxx857247_19188/container_1512xxx247_19188_01_000009/", line 169, in _read_with_length
    return self.loads(obj)
  File "/yarn//local/usercache/xx/appcache/application_151xxx857247_19188/container_1512xxx247_19188_01_000009/", line 454, in loads
    return pickle.loads(obj)
ImportError: No module named elephas.spark_model

    at org.apache.spark.api.python.PythonRunner$$anon$
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.executor.Executor$
    at java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.util.concurrent.ThreadPoolExecutor$

Note that the model is actually created: print(spark_model) gives <elephas.spark_model.SparkModel object at 0x7efce0abfcd0>. The error only occurs during spark_model.train.
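Since the SparkModel object is created on the driver while the ImportError is raised inside a task on an executor, one way to narrow this down is to check where the module can actually be imported. The following is a minimal sketch, assuming a live SparkContext named sc and that the helpers are defined in the driver script or notebook itself; probe_import and probe_executors are hypothetical helper names, not part of PySpark or Elephas.

```python
import importlib
import socket


def probe_import(module_name):
    """Return (hostname, importable) for the Python process this runs in."""
    try:
        importlib.import_module(module_name)
        return (socket.gethostname(), True)
    except ImportError:
        return (socket.gethostname(), False)


def probe_executors(sc, module_name, num_partitions=8):
    """Run probe_import inside Spark tasks and collect the distinct results.

    sc is an existing SparkContext; num_partitions is an arbitrary choice
    meant to spread the tasks over several executors.
    """
    return (sc.parallelize(range(num_partitions), num_partitions)
              .map(lambda _: probe_import(module_name))
              .distinct()
              .collect())


if __name__ == "__main__":
    # Driver-side check (runs without Spark).
    print(probe_import("elephas.spark_model"))
    # Executor-side check, given a live SparkContext `sc`:
    # print(probe_executors(sc, "elephas.spark_model"))
```

If the driver reports True while some executor hosts report False, the package is missing from the Python environment the workers use, which would match the traceback above.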

I installed Elephas using pip2 install git+, which may be relevant.

I use PySpark 2.1.1, Keras 2.1.4 and Python 2.7. I've tried running it with spark-submit:

PYSPARK_DRIVER_PYTHON=`which python` spark-submit --driver-memory 1G

I also tried running it directly in a Jupyter Notebook. Both result in the same error.

Can anyone give me any pointers? Is this an Elephas issue or a PySpark problem?

EDIT: I also tried uploading a zip file of the virtual environment and loading it within the script:

virtualenv spark_venv --relocatable
cd spark_venv 
zip -qr ../ *

PYSPARK_DRIVER_PYTHON=`which python` spark-submit --driver-memory 1G --py-files

Then in the file I do:


After this keras is imported without any problems, but I still get the elephas error from above.


  • I found a solution on how to properly load a virtual environment to the master and all the workers:

    virtualenv venv --relocatable
    cd venv 
    zip -qr ../ *
    PYSPARK_PYTHON=./SP/bin/python spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./SP/bin/python --driver-memory 4G --archives

    More details in the GitHub Issue:
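The key point in the command above is that PYSPARK_PYTHON points at an interpreter inside the unpacked archive rather than at a system Python. The sketch below assembles such a command to make that relationship explicit. It rests on assumptions: the archive argument is cut off in the command above, so venv.zip and train.py are hypothetical placeholders, and the #SP suffix (the YARN syntax for naming the directory an archive is unpacked into) is inferred from the ./SP/bin/python path rather than confirmed by the post. build_submit_args is a hypothetical helper.

```python
def build_submit_args(archive, alias, script, driver_memory="4G"):
    """Build the environment and argument list for a YARN cluster-mode submit.

    archive: zip of the virtual environment (hypothetical name)
    alias:   directory name the archive is unpacked under in each container
    script:  the PySpark application to run (hypothetical name)
    """
    # Interpreter path relative to the container working directory.
    python_in_archive = "./{}/bin/python".format(alias)

    # Environment variable read by spark-submit for the worker interpreter.
    env = {"PYSPARK_PYTHON": python_in_archive}

    args = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Same interpreter for the application master (the driver in cluster mode).
        "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=" + python_in_archive,
        "--driver-memory", driver_memory,
        # "archive#alias" asks YARN to unpack the archive under ./alias.
        "--archives", "{}#{}".format(archive, alias),
        script,
    ]
    return env, args


if __name__ == "__main__":
    env, args = build_submit_args("venv.zip", "SP", "train.py")
    print(" ".join("{}={}".format(k, v) for k, v in env.items()), " ".join(args))
```

With this layout every container runs the interpreter from the shipped environment, so packages installed in the virtual environment, Elephas included, are importable on the workers.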