pyspark · hadoop-yarn · emr

External dependencies for a Spark job


I am new to big data technologies. I have to run a Spark job in cluster mode on EMR. The job is written in Python and depends on several libraries and some other tools. I have already written the script and run it in local client mode, but it raises a dependency issue when I try to run it on YARN. How do I manage these dependencies?

Log:

"/mnt/yarn/usercache/hadoop/appcache/application_1511680510570_0144/container_1511680510570_0144_01_000002/pyspark.zip/pyspark/cloudpickle.py", line 711, in subimport
    __import__(name)
ImportError: ('No module named boto3', <function subimport at 0x7f8c3c4f9c80>, ('boto3',))

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Solution

  • It seems you have not installed the boto3 library. Install a compatible version with the command below:

    $ pip install boto3
    

    or, for a per-user install: python -m pip install --user boto3

    Hope this helps. For reference, see https://github.com/boto/boto3.

    More specifically, it seems boto3 is not installed on all executors (nodes). Since you are running Spark, the Python code runs partly on the driver and partly on the executors, so when running on YARN you need to install the library on every node.

    To install it on every node, refer to How to bootstrap installation of Python modules on Amazon EMR? A minimal bootstrap-action sketch is shown below.
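
    A minimal sketch of such a bootstrap action, assuming the default pip on the EMR nodes targets the same Python interpreter your job uses; adjust the pip executable (e.g. a versioned pip for Python 3) and the package list to match your cluster:

        #!/bin/bash
        # Hypothetical EMR bootstrap action: runs once on every node at
        # cluster launch, so the listed packages are available to the
        # driver and all executors. Extend the list with your job's
        # other dependencies.
        set -e
        sudo pip install boto3

    Upload the script to S3 and pass it when creating the cluster, e.g. aws emr create-cluster ... --bootstrap-actions Path=s3://your-bucket/install-python-deps.sh (the bucket name and script path here are placeholders).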