Tags: python, pyspark, setuptools, spark-connect

How to use Spark Connect with pyspark on Python 3.12?


I'm trying to use Spark Connect to create a Spark session on a remote Spark cluster with pyspark in Python 3.12:

from pyspark.sql import SparkSession

ingress_ep = "..."
access_token = "..."
conn_string = f"sc://{ingress_ep}/;token={access_token}"
spark = SparkSession.builder.remote(conn_string).getOrCreate()

When running this I get a ModuleNotFoundError message:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[13], line 11
      9 conn_string = f"sc://{ingress_ep}/;token={access_token}"
     10 print(conn_string)
---> 11 spark = SparkSession.builder.remote(conn_string).getOrCreate()

File c:\Users\...\venv2\Lib\site-packages\pyspark\sql\session.py:464, in SparkSession.Builder.getOrCreate(self)
    458 if (
    459     "SPARK_CONNECT_MODE_ENABLED" in os.environ
    460     or "SPARK_REMOTE" in os.environ
    461     or "spark.remote" in opts
    462 ):
    463     with SparkContext._lock:
--> 464         from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
    466         if (
    467             SparkContext._active_spark_context is None
    468             and SparkSession._instantiatedSession is None
    469         ):
    470             url = opts.get("spark.remote", os.environ.get("SPARK_REMOTE"))

File c:\Users\...\venv2\Lib\site-packages\pyspark\sql\connect\session.py:19
      1 #
      2 # Licensed to the Apache Software Foundation (ASF) under one or more
      3 # contributor license agreements.  See the NOTICE file distributed with
...
---> 24 from distutils.version import LooseVersion
     26 try:
     27     import pandas

ModuleNotFoundError: No module named 'distutils'

I'm aware that the distutils module has been removed from Python 3.12. So I have installed setuptools and set SETUPTOOLS_USE_DISTUTILS='local', as suggested in "Why did I got an error ModuleNotFoundError: No module named 'distutils'?" and "No module named 'distutils' despite setuptools installed", but I'm still getting the error.

Going back to an older version of Python is not an option for me. Am I missing something? How can I get this to work?


Solution

  • You probably need to import setuptools before anything attempts to import distutils.

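    The long answer is that setuptools uses a MetaPathFinder to tell Python where to find distutils. This MetaPathFinder is only added to sys.meta_path when setuptools is imported, so if pyspark reaches its "from distutils.version import LooseVersion" line before that has happened, the import fails even though setuptools is installed.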

    This might be something to report to the library developers.

    If the workaround described above still does not work, there might be another dependency that is trying to explicitly disable this MetaPathFinder.
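
The check below is a minimal sketch of that workaround, not part of pyspark or setuptools. It imports setuptools first and then tries the same distutils import that fails in the traceback. The helper names distutils_available and meta_path_finder_names are hypothetical and were chosen for illustration; the second one only lists whatever finders are registered, which may help when diagnosing whether another dependency has removed the setuptools finder.

```python
import importlib
import sys


def meta_path_finder_names() -> list[str]:
    """Return the names of the finders currently registered on sys.meta_path."""
    return [getattr(f, "__name__", type(f).__name__) for f in sys.meta_path]


def distutils_available() -> bool:
    """Import setuptools first, then report whether distutils.version imports."""
    try:
        # Per the answer above, importing setuptools should come before
        # any import of distutils.
        importlib.import_module("setuptools")
    except ImportError:
        return False  # setuptools is not installed in this environment
    try:
        # This is the module pyspark tries to import in the traceback.
        importlib.import_module("distutils.version")
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    print("Python version:", sys.version_info[:2])
    print("distutils importable after setuptools:", distutils_available())
    print("meta path finders:", meta_path_finder_names())
```

If the check reports that distutils is importable, the same ordering can be applied to the script from the question by placing "import setuptools" above "from pyspark.sql import SparkSession". If it reports False even though setuptools is installed, the printed finder list is a reasonable starting point for the investigation.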