Tags: python, python-2.7, scala, apache-spark, rdd

Piping Scala RDD to Python code fails


I am trying to execute Python code from inside a Scala program, passing an RDD as data to the Python script. The Spark cluster initializes successfully, converting the data to an RDD works, and running the Python script on its own (outside the Scala code) works. However, executing the same Python script from inside the Scala code fails with:

java.lang.IllegalStateException: Subprocess exited with status 2. Command ran: /{filePath}/{File}.py

Looking deeper, the log shows import: command not found when the Python file is executed. I believe the subprocess launched from Scala does not recognize that this is a Python script.

Note: My environment variables are set correctly and I can access and execute Python scripts from everywhere else. In place of filePath and File I have the actual path to that file and the file name.

Environment: Spark 2.2.1, Scala 2.11.11, Python 2.7.10

Code:

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val conf = new SparkConf().setAppName("Test").setMaster("local[*]")
val sparkContext = new SparkContext(conf)

// Distribute the Python script to every node
val distScript = "/{filePath}/{File}.py"
val distScriptName = "{File}.py"
sparkContext.addFile(distScript)

// Pipe each RDD element to the script and collect its stdout as the result RDD
val ipData = sparkContext.parallelize(List("asd", "xyz", "zxcz", "sdfsfd", "Ssdfd", "Sdfsf"))
val pipeRDD = ipData.pipe(SparkFiles.get(distScriptName))

pipeRDD.foreach(println)

Has anyone tried this before and can help resolve the issue I am getting? Is this the best way to integrate Scala and Python scripts? I am open to other verified recommendations and suggestions to try.


Solution

  • I found where the issue came from. I was missing the following line in the Python file:

    #!/usr/bin/python
    

    After adding it, the java.lang.IllegalStateException disappeared.
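
    For reference, here is a minimal sketch of what the piped script might look like. RDD.pipe sends each element of a partition to the subprocess on stdin, one per line, and reads the subprocess's stdout back as the elements of the resulting RDD, so the script just needs to read stdin and write to stdout. The shebang lets the OS run the file directly (the file may also need to be executable, e.g. via chmod +x). The uppercase transformation is purely illustrative, not part of the original question.

    #!/usr/bin/python
    # Illustrative pipe-compatible script for Python 2.7:
    # reads RDD elements from stdin, one per line, writes results to stdout.
    import sys

    for line in sys.stdin:
        value = line.strip()
        # Hypothetical transformation; replace with the real processing logic.
        print value.upper()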