Tags: apache-spark, pyspark, aws-glue

Spark Extension using AWS Glue


I have created a script locally that uses the Spark extension 'uk.co.gresearch.spark:spark-extension_2.12:2.2.0-3.3' to compare different DataFrames in a simple manner.

However, when I tried this out on AWS Glue I ran into some issues and received this error: ModuleNotFoundError: No module named 'gresearch'

I have tried copying the .jar file from my local disk; it was referenced when I initialized the Spark session locally, which printed this message:

... The jars for the packages stored in: /Users/["SOME_NAME"]/.ivy2/jars uk.co.gresearch.spark#spark-extension_2.12 added as a dependency...

In that path I found a file named: uk.co.gresearch.spark_spark-extension_2.12-2.2.0-3.3.jar that I copied to S3 and referenced in the Jar lib path.

But this did not work... How would you go about setting this up in the correct manner?

The example code I've used to test this on AWS Glue looks like this:

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

appName = 'test_gresearch'
spark_conf = SparkConf()
spark_conf.setAll([('spark.jars.packages', 'uk.co.gresearch.spark:spark-extension_2.12:2.2.0-3.3')])
spark = SparkSession.builder.config(conf=spark_conf) \
    .enableHiveSupport().appName(appName).getOrCreate()

from gresearch.spark.diff import *

df1 = spark.createDataFrame([
  [1, "ABC", 5000, "US"],
  [2, "DEF", 4000, "UK"],
  [3, "GHI", 3000, "JPN"],
  [4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])

df2 = spark.createDataFrame([
  [1, "ABC", 5000, "US"],
  [2, "DEF", 4000, "CAN"],
  [3, "GHI", 3500, "JPN"],
  [4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])

df1.show()
df2.show()

options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()

Any tips are more than welcome. Thank you in advance!

Regards


Solution

  • After some investigation with the AWS support team, I was instructed to include the package's .jar file through the Python library path, since the .jar file contains embedded Python packages. Download the correct version of the .jar file (https://mvnrepository.com/artifact/uk.co.gresearch.spark/spark-extension_2.12/2.1.0-3.1 is the version I ended up using), upload it to S3, and reference it under the Glue job setting for Python library path (e.g. s3://bucket-name/spark-extension_2.12-2.1.0-3.1.jar); a programmatic sketch of this setting follows the script below.

    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    ## @params: [JOB_NAME]
    args = getResolvedOptions(sys.argv, ['JOB_NAME'])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args['JOB_NAME'], args)

    # Importable because the .jar on the Python library path embeds
    # the Python bindings for the package.
    from gresearch.spark.diff import *

    left = spark.createDataFrame([(1, "one"), (2, "two"), (3, "three")], ["id", "value"])
    right = spark.createDataFrame([(1, "one"), (2, "Two"), (4, "four")], ["id", "value"])

    left.diff(right, "id").show()

    # Commit once the job's work is done, not right after init.
    job.commit()
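
With the .jar on the Python library path, the options-based call from the question works the same way; for example, to add a column listing which columns changed per row:

    options = DiffOptions().with_change_column('changes')
    left.diff_with_options(right, options, 'id').show()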
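
For reference, the console's Python library path setting corresponds to the --extra-py-files job argument, so the same setup can be scripted. Below is a minimal sketch using boto3; the job name, role, script location, and bucket are hypothetical placeholders, so adjust them to your environment:

    import boto3

    glue = boto3.client('glue')

    # Hypothetical job definition; 'diff-job', the role, the script
    # location, and the bucket name are all placeholders.
    glue.create_job(
        Name='diff-job',
        Role='MyGlueServiceRole',
        Command={
            'Name': 'glueetl',
            'ScriptLocation': 's3://bucket-name/scripts/diff_job.py',
            'PythonVersion': '3',
        },
        GlueVersion='3.0',
        DefaultArguments={
            # 'Python library path' in the console maps to --extra-py-files
            '--extra-py-files': 's3://bucket-name/spark-extension_2.12-2.1.0-3.1.jar',
        },
    )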