Tags: python, pyspark, aws-glue, spark-submit, aws-glue-spark

Using arguments with Glue pyspark


Intro

I have a Docker container configured with the Glue ETL PySpark environment, thanks to this AWS Glue tutorial. I used the tutorial's "helloworld.py":

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
# The script below uses `spark`, so get the SparkSession from the GlueContext
spark = glueContext.spark_session

medicare = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load('s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv')
medicare.printSchema()

I cannot run it with spark-submit helloworld.py because I hit the well-known error:

ModuleNotFoundError: No module named 'dynamicframe'

I found a hack: using the redirection operator, pyspark < helloworld.py, which works like a charm.

My problem

However, now I need to pass some arguments to my script.

Before trying Glue ETL, I used to run: spark-submit myScript.py arg1 arg2 arg3

When I naively tried pyspark < myScript.py arg1 arg2 arg3, I got the following error:

Error: pyspark does not support any application options.
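
Since the redirection trick feeds the script through stdin, there is no slot for application arguments, so sys.argv never gets populated. One workaround I can think of (my own assumption, nothing Glue-specific; the variable names ARG1/ARG2/ARG3 are hypothetical) is to pass the values through environment variables instead:

import os
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical variable names; export them when launching, e.g.:
#   ARG1=foo ARG2=bar ARG3=baz pyspark < myScript.py
print(os.environ.get("ARG1", "") + " " + os.environ.get("ARG2", "") + " " + os.environ.get("ARG3", ""))

But ideally I'd like to pass real arguments, as in the minimal script below.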

Minimal myScript.py to reproduce

import sys
from pyspark import SparkContext
from awsglue.context import GlueContext

# Create a GlueContext, then echo the three positional arguments
glueContext = GlueContext(SparkContext.getOrCreate())
print(sys.argv[1] + " " + sys.argv[2] + " " + sys.argv[3])

Is there a way to keep using pyspark instead of spark-submit while still passing arguments?

Or am I on the wrong track entirely: is there a way to make spark-submit work with Glue?


Solution

  • I would advise you to use the PyCharm integration if possible. There you don't get the module error, and you can inject arguments through the parameters field of the PyCharm run configuration.

    The article you linked also explains how to set up the PyCharm integration.

    Edit:

    When I log into the Docker container and just run:

    /home/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/bin/spark-submit myScript.py test1 test2 test3
    

    it prints out test1 test2 test3. I copied the exact content from your script. Could you please try that?
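
    As a side note, once spark-submit works, the Glue-native way to pass named arguments is getResolvedOptions, which the tutorial's helloworld.py already imports. A minimal sketch (the argument names arg1/arg2/arg3 are illustrative):

    import sys
    from awsglue.utils import getResolvedOptions

    # getResolvedOptions expects Glue-style named arguments, e.g.:
    #   spark-submit myScript.py --arg1 test1 --arg2 test2 --arg3 test3
    args = getResolvedOptions(sys.argv, ['arg1', 'arg2', 'arg3'])
    print(args['arg1'] + " " + args['arg2'] + " " + args['arg3'])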