I'm trying to use Spark Connect to create a Spark session on a remote Spark cluster with pyspark in Python 3.12:
from pyspark.sql import SparkSession

ingress_ep = "..."
access_token = "..."
conn_string = f"sc://{ingress_ep}/;token={access_token}"
spark = SparkSession.builder.remote(conn_string).getOrCreate()
When running this I get a ModuleNotFoundError:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[13], line 11
9 conn_string = f"sc://{ingress_ep}/;token={access_token}"
10 print(conn_string)
---> 11 spark = SparkSession.builder.remote(conn_string).getOrCreate()
File c:\Users\...\venv2\Lib\site-packages\pyspark\sql\session.py:464, in SparkSession.Builder.getOrCreate(self)
458 if (
459 "SPARK_CONNECT_MODE_ENABLED" in os.environ
460 or "SPARK_REMOTE" in os.environ
461 or "spark.remote" in opts
462 ):
463 with SparkContext._lock:
--> 464 from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
466 if (
467 SparkContext._active_spark_context is None
468 and SparkSession._instantiatedSession is None
469 ):
470 url = opts.get("spark.remote", os.environ.get("SPARK_REMOTE"))
File c:\Users\...\venv2\Lib\site-packages\pyspark\sql\connect\session.py:19
1 #
2 # Licensed to the Apache Software Foundation (ASF) under one or more
3 # contributor license agreements. See the NOTICE file distributed with
...
---> 24 from distutils.version import LooseVersion
26 try:
27 import pandas
ModuleNotFoundError: No module named 'distutils'
I'm aware that the distutils module has been removed from Python 3.12, so I have installed setuptools and set SETUPTOOLS_USE_DISTUTILS='local', as suggested in "Why did I got an error ModuleNotFoundError: No module named 'distutils'?" and "No module named 'distutils' despite setuptools installed", but I'm still getting the error.
Going back to an older version of Python is not an option for me. Am I missing something? How can I get this to work?
You probably need to import setuptools before anything attempts to import distutils.
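A minimal sketch of that import order, assuming setuptools is installed in the same environment. The helper name distutils_importable is made up for illustration; it only checks whether distutils resolves once setuptools has been imported.

```python
import importlib


def distutils_importable() -> bool:
    """Import setuptools first (if installed), then try to import distutils.

    Hypothetical helper for illustration; not part of pyspark or setuptools.
    """
    try:
        # Imported only for its side effect of making distutils resolvable.
        import setuptools  # noqa: F401
    except ImportError:
        # setuptools is missing; distutils may still exist on Python < 3.12.
        pass
    try:
        importlib.import_module("distutils")
    except ImportError:
        return False
    return True
```

In the question's script, the equivalent would be putting `import setuptools` on the very first line, above `from pyspark.sql import SparkSession`. If the helper returns False on Python 3.12, the workaround is not taking effect in that environment.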
The long answer is that setuptools employs a MetaPathFinder to tell Python how to locate distutils. This MetaPathFinder is only added to sys.meta_path when setuptools is imported.
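To see whether the finder is actually registered, you can list what is on sys.meta_path. This is only a sketch: the function name meta_path_finder_names is hypothetical, and I believe recent setuptools releases call the finder class DistutilsMetaFinder, but treat that name as an assumption and check your installed version.

```python
import sys


def meta_path_finder_names() -> list[str]:
    """Return a readable name for every finder currently on sys.meta_path."""
    # Built-in finders are classes (they have __name__); others are instances,
    # so fall back to the name of their type.
    return [getattr(finder, "__name__", type(finder).__name__) for finder in sys.meta_path]


if __name__ == "__main__":
    for name in meta_path_finder_names():
        print(name)
```

Running it once before and once after `import setuptools` should show whether an extra finder appears in the list.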
This might be something to report to the library developers.
If the workaround described above still does not work, there might be another dependency that is trying to explicitly disable this MetaPathFinder.
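One thing worth ruling out is the environment variable itself. As far as I understand the setuptools shim, it is only active when SETUPTOOLS_USE_DISTUTILS is unset or exactly equal to local, so a value such as stdlib, or a value that accidentally includes quote characters, would switch it off. The sketch below mirrors that comparison; the function name shim_enabled_by_env is hypothetical and the comparison rule is my assumption about setuptools' behaviour, not something confirmed in this thread.

```python
import os
from collections.abc import Mapping


def shim_enabled_by_env(environ: Mapping[str, str] = os.environ) -> bool:
    """Guess whether the setuptools distutils shim is enabled by the environment.

    Assumption: the shim is on when SETUPTOOLS_USE_DISTUTILS is unset or
    literally equal to "local" (no surrounding quotes or whitespace).
    """
    return environ.get("SETUPTOOLS_USE_DISTUTILS", "local") == "local"


if __name__ == "__main__":
    print("raw value:", repr(os.environ.get("SETUPTOOLS_USE_DISTUTILS")))
    print("shim enabled by env:", shim_enabled_by_env())
```

Printing the raw value with repr makes stray quotes visible, which is easy to end up with when setting the variable from a Windows shell.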