Tags: apache-spark, spark-submit

Python Spark application doesn't work with spark-submit


The application fails under spark-submit, but it works when I run it directly with python SimpleApp.py from the working directory (C:...\cwd>). The code is the 'Self-Contained Applications' example from https://spark.apache.org/docs/latest/quick-start.html.

I placed setup.py and SimpleApp.py in that directory.

setup.py code:

from setuptools import setup, find_packages

setup(
    name='my-spark-project',
    version='0.1',
    packages=find_packages(),
    install_requires=[
        'pyspark==3.5.1'
        # Add other dependencies here
    ],
)

SimpleApp.py code:

"""SimpleApp.py"""
from pyspark.sql import SparkSession

logFile = "C:\\apache-spark\\README.md"  # Should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()

and I executed:

> pip install .
> spark-submit --master local[*] SimpleApp.py

The result is like:

Python
24/04/05 14:22:35 INFO ShutdownHookManager: Shutdown hook called
24/04/05 14:22:35 INFO ShutdownHookManager: Deleting directory C:\Users\hendr\AppData\Local\Temp\spark-e91e861f-3f9b-4e18-b064-44bee42a2fb0

I did exactly what the documentation says.


Solution

  • I'm not entirely certain, but perhaps you could try two different approaches:

    1. Use findspark: pip install findspark (https://pypi.org/project/findspark/), then call findspark.init("C:\spark") before importing pyspark; see the sketch after this list.

    2. Find the spark-submit path and call it directly: path/to/your/spark-submit --master local[*] SimpleApp.py
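
A minimal sketch of approach 1, assuming Spark is installed at C:\spark (adjust the path to your own installation); findspark.init must run before pyspark is imported:

"""SimpleApp.py"""
import findspark
findspark.init("C:\\spark")  # assumed Spark home; point this at your own installation

from pyspark.sql import SparkSession

logFile = "C:\\apache-spark\\README.md"  # should be some file on your system
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()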
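
For approach 2, you could first check which spark-submit (if any) is on your PATH and where your pyspark package lives; these commands are generic suggestions for a typical Windows/pip setup, not from the original question:

> where spark-submit
> python -c "import pyspark; print(pyspark.__file__)"

The directory printed by the second command typically contains a bin\ subfolder with the spark-submit scripts, which you can then call by full path instead of the bare spark-submit command.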