python, amazon-web-services, apache-spark, pyspark, amazon-emr

pyspark saveAsTextFile works for python 2.7 but not 3.4


I'm running pyspark on an Amazon EMR cluster. I have a very simple test script to see whether I can write data to S3 using spark-submit ...

from pyspark import SparkContext
sc = SparkContext()
numbers = sc.parallelize(range(100))
numbers.saveAsTextFile("s3n://my-bucket/test.txt")
sc.stop()

When I run this script using spark-submit in a Python 2.7 environment, it works just fine. But when I try to run the same script in a Python 3.4 environment, I get the following traceback ...

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File ".../pyspark/worker.py", line 161, in main 
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File ".../pyspark/worker.py", line 54, in read_command
    command = serializer._read_with_length(file)
File ".../pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
File ".../pyspark/serializers.py", line 419, in loads
    return pickle.loads(obj, encoding=encoding)
AttributeError: Can't get attribute 'unicode' on <module 'builtins' (built-in)>

I'm managing my Python environment with conda and by setting the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables.
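
As a sanity check, a minimal sketch like this (not part of my original test) should show whether the driver and the workers even agree on which interpreter they are running ...

import sys
from pyspark import SparkContext

sc = SparkContext()

# Interpreter running the driver
print("driver: ", sys.executable)

# Interpreters the executors launch for their Python workers; the reference
# to sys inside the lambda is resolved on each worker, not shipped over.
print("workers:", sc.parallelize(range(4), 4).map(lambda _: sys.executable).distinct().collect())

sc.stop()

If the environments are badly mismatched, this check may itself die with a pickle error much like the one above, which is just as telling.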

Is there an issue using saveAsTextFile in Python 3? Or am I missing a step in setting up my Python 3 environment?

Thanks!


Solution

  • OK, so it looks like this has nothing to do with Python 3 and everything to do with my conda environment. In short, I set up a conda environment in my bootstrap.sh, but only actually activated it on the master node. So the master node was using the conda Python, while the workers were using the system Python.

    My solution for now is to set PYSPARK_PYTHON=/home/hadoop/miniconda3/envs/myenv/bin/python, so that the workers launch their Python processes from the conda env that bootstrap.sh created (see the sketch below).

    Is there a better way to activate my conda environment on the worker nodes?
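
    For reference, this is roughly what the test script looks like with that fix applied in-process, assuming the same conda env exists at the same path on every node (exporting PYSPARK_PYTHON in the shell before calling spark-submit does the same job and is probably cleaner):

    import os
    from pyspark import SparkContext

    # Interpreter from the conda env that bootstrap.sh creates on every node;
    # it has to exist at this exact path on the workers, not just on the master.
    # This only affects the Python workers the executors launch -- the driver
    # itself is still whatever spark-submit / PYSPARK_DRIVER_PYTHON started.
    os.environ["PYSPARK_PYTHON"] = "/home/hadoop/miniconda3/envs/myenv/bin/python"

    sc = SparkContext()  # PYSPARK_PYTHON must be set before this line
    numbers = sc.parallelize(range(100))
    numbers.saveAsTextFile("s3n://my-bucket/test.txt")
    sc.stop()

    Either way, the important part is that the path points at an interpreter that actually exists on the worker nodes.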