Tags: google-cloud-platform, pyspark, google-cloud-dataproc

How to force python versions to sync in a datalab instance spun from a GCP dataproc cluster?


I've created a Dataproc cluster in GCP using image 1.2. I want to run Spark from a Datalab notebook. This works fine if I keep the Datalab notebook running Python 2.7 as its kernel, but if I want to use Python 3 I run into a minor version mismatch. I demonstrate the mismatch with a Datalab script below:

### Configuration
import sys, os
sys.path.insert(0, '/opt/panera/lib')

# Point the driver and the executors at the same interpreter;
# these must be set before the SparkSession is created.
os.environ['PYSPARK_PYTHON'] = '/opt/conda/bin/python'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/opt/conda/bin/python'

import google.datalab.storage as storage
from io import BytesIO
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .enableHiveSupport() \
  .config("hive.exec.dynamic.partition", "true") \
  .config("hive.exec.dynamic.partition.mode", "nonstrict") \
  .config("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false") \
  .getOrCreate()

sc = spark.sparkContext

### import libraries
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint

### trivial example
data = [ 
  LabeledPoint(0.0, [0.0]),
  LabeledPoint(1.0, [1.0]),
  LabeledPoint(1.0, [2.0]),
  LabeledPoint(1.0, [3.0])
]

toyModel = DecisionTree.trainClassifier(sc.parallelize(data), 2, {})
print(toyModel)

The error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, pan-bdaas-prod-jrl6-w-3.c.big-data-prod.internal, executor 6): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 124, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 3.6 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

Other initialization scripts:

  • gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh
  • gs://dataproc-initialization-actions/datalab/datalab.sh
  • ...and scripts that load some of our necessary libraries and utilities
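
For reference, here is a quick way to confirm which interpreter the driver and the executors actually pick up, run from the same notebook (a minimal sketch that reuses the sc created above):

### version check
import sys

# Interpreter used by the driver (the Datalab kernel)
print("driver: %d.%d.%d" % sys.version_info[:3])

# Interpreter used by the executors, reported through a trivial job
def worker_version(_):
    import sys
    return "%d.%d.%d" % sys.version_info[:3]

print("workers:", sc.parallelize(range(2), 2).map(worker_version).distinct().collect())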


Solution

  • The Python 3 kernel in Datalab uses Python 3.5, while the cluster's workers run Python 3.6, which is exactly the mismatch reported in the traceback

    You could try to set up a Python 3.6 environment inside of Datalab and install a new kernelspec for it, but it is probably easier to configure the Dataproc cluster itself to use Python 3.5 (see the sketch after this list)

    The instructions for setting up your cluster to use 3.5 are here
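
    A minimal sketch of that cluster-side approach, assuming conda lives at /opt/conda on the cluster nodes (as it does when the conda/Datalab initialization actions are used); the script name and bucket path are illustrative only, not part of the documented instructions:

    #!/bin/bash
    # pin-python-35.sh (hypothetical name), run as a Dataproc initialization action:
    # pin the cluster's conda Python to 3.5 so the executors match the
    # Datalab python3 kernel and the driver/worker versions agree.
    /opt/conda/bin/conda install -y python=3.5

    The script would be staged in a GCS bucket and passed along with the existing initialization actions, e.g. gcloud dataproc clusters create ... --initialization-actions gs://YOUR_BUCKET/pin-python-35.sh,gs://dataproc-initialization-actions/datalab/datalab.sh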