Tags: eclipse, python-2.7, apache-spark, pydev, pyspark

PySpark in Eclipse: using PyDev


I am running a local PySpark script from the command line and it works:

/Users/edamame/local-lib/apache-spark/spark-1.5.1/bin/pyspark --jars myJar.jar --driver-class-path myJar.jar --executor-memory 2G --driver-memory 4G --executor-cores 3 /myPath/myProject.py

Is it possible to run this code from Eclipse using PyDev? What arguments are required in the Run Configuration? I tried and got the following error:

Traceback (most recent call last):
  File "/myPath/myProject.py", line 587, in <module>
    main()
  File "/myPath/myProject.py", line 506, in main
    conf = SparkConf()
  File "/Users/edamame/local-lib/apache-spark/spark-1.5.1/python/pyspark/conf.py", line 104, in __init__
    SparkContext._ensure_initialized()
  File "/Users/edamame/local-lib/apache-spark/spark-1.5.1/python/pyspark/context.py", line 234, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway()
  File "/Users/edamame/local-lib/apache-spark/spark-1.5.1/python/pyspark/java_gateway.py", line 76, in launch_gateway
    proc = Popen(command, stdin=PIPE, preexec_fn=preexec_func, env=env)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 1308, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Does anyone have any idea? Thank you very much!


Solution

  • Considering the following prerequisites:

    • Eclipse, PyDev and Spark installed.
    • PyDev with a Python interpreter configured.
    • PyDev with the Spark Python sources configured.

    Here is what you'll need to do:

    • From the Eclipse IDE, check that you are in the PyDev perspective, then open the Preferences window:

      • On Mac: Eclipse > Preferences
      • On Linux: Window > Preferences
    • From the Preferences window, go to PyDev > Interpreters > Python Interpreter:

      • Click on the central [Environment] button.
      • Click on the [New...] button to add a new environment variable.
      • Add the environment variable SPARK_HOME and validate:
      • Name: SPARK_HOME, Value: /path/to/apache-spark/spark-1.5.1/
      • Note: don't reference system environment variables such as $SPARK_HOME in the value; spell out the actual path (a quick sanity-check script to verify the setting follows this list).
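
    Once SPARK_HOME is set for the interpreter, a tiny throwaway script run from PyDev is enough to confirm the setup. Below is a minimal sketch (the file name, app name and master are only placeholders); it fails with a readable message when SPARK_HOME is missing, which is typically what the OSError in the question means: launch_gateway() tries to start $SPARK_HOME/bin/spark-submit and cannot find it.

    # sanity_check.py -- minimal sketch, assuming the PyDev interpreter already
    # has the Spark Python sources (pyspark, py4j) on its path.
    import os

    from pyspark import SparkConf, SparkContext

    if __name__ == "__main__":
        # Fail early with a clear message instead of the opaque OSError raised
        # later by launch_gateway() when spark-submit cannot be found.
        if not os.environ.get("SPARK_HOME"):
            raise RuntimeError("SPARK_HOME is not set in the PyDev environment")

        conf = SparkConf().setAppName("pydev-sanity-check").setMaster("local[2]")
        sc = SparkContext(conf=conf)
        print(sc.parallelize(range(10)).sum())  # should print 45
        sc.stop()

    If you also need the --jars or memory options from your original command line, they can usually be passed through one more environment variable in the same tab, PYSPARK_SUBMIT_ARGS (e.g. --jars myJar.jar --driver-memory 4G pyspark-shell; the trailing pyspark-shell is required), since those arguments are handed to the spark-submit process that gets launched.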

    I also recommend that you maintain your own log4j.properties file in each of your projects.

    To do so, add the environment variable SPARK_CONF_DIR as described previously, for example:

    Name: SPARK_CONF_DIR, Value: ${project_loc}/conf
    

    If you experience problems with the variable ${project_loc} (e.g. on Linux), specify an absolute path instead.

    Or, if you want to keep ${project_loc}, right-click on each Python source and choose Run As > Run Configurations..., then create your SPARK_CONF_DIR variable in the Environment tab as described previously.
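
    If you are not sure whether ${project_loc} was actually expanded, a small guard at the top of the driver script makes the failure explicit instead of letting Spark silently fall back to its default log4j configuration. This is only an illustrative sketch; none of the names below come from the Spark API:

    # Hypothetical guard: check SPARK_CONF_DIR before the SparkContext is created.
    import os

    conf_dir = os.environ.get("SPARK_CONF_DIR", "")
    log4j_file = os.path.join(conf_dir, "log4j.properties")

    if "${project_loc}" in conf_dir or not os.path.isdir(conf_dir):
        raise RuntimeError("SPARK_CONF_DIR is not a usable directory: %r" % conf_dir)
    if not os.path.isfile(log4j_file):
        raise RuntimeError("Missing log4j.properties in %r" % conf_dir)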

    Optionally, you can add other environment variables such as TERM, SPARK_LOCAL_IP, and so on (a small fallback sketch follows this list):

    • Name: TERM, Value on Mac: xterm-256color, Value on Linux: xterm (if you want to use xterm, of course)
    • Name: SPARK_LOCAL_IP, Value: 127.0.0.1 (it’s recommended to specify your real local IP address)
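
    Because these are plain environment variables, and the Popen call shown in the traceback passes the driver's environment on to the spark-submit process it spawns (env=env), they can also be set from the script itself before the SparkContext is created, which avoids editing every Run Configuration. A minimal sketch, with 127.0.0.1 as a placeholder value:

    import os

    # Fallback only if the Run Configuration did not define it already; this
    # must happen before SparkContext() so the spawned JVM inherits the value.
    os.environ.setdefault("SPARK_LOCAL_IP", "127.0.0.1")

    from pyspark import SparkConf, SparkContext
    sc = SparkContext(conf=SparkConf().setAppName("example").setMaster("local[2]"))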

    PS: I don't remember the source of this tutorial, so excuse me for not citing the author. I didn't come up with this by myself.