python amazon-web-services aws-glue delta-lake

AWS Glue locally: convert pandas df to delta


I run my AWS Glue jobs locally in a Docker container (AWS Glue lib 4.0) and want to write a pandas DataFrame in Delta format.

I added

import sys
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# Register the Delta SQL extension and catalog on the session
spark = SparkSession.builder \
    .appName("YourAppName") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

sys.argv += ['--datalake-formats', 'delta']
args = getResolvedOptions(sys.argv, ['datalake-formats'])

but this line

spark.createDataFrame(pandas_df).write.format('delta').save('myfile.delta')

still gives me the error Failed to find data source: delta.

I don't get what I am missing here.


Solution

  • Found the answer in an AWS blog post:

    "Glue 4.0: Add native data lake libraries AWS Glue 4.0 Docker image supports native data lake libraries; Apache Hudi, Delta Lake, and Apache Iceberg. You can pass the environment variable DATALAKE_FORMATS to load the relevant JAR files.

    -e DATALAKE_FORMATS=hudi,delta,iceberg"

    When you set this environment variable while starting your Docker container, the startup log shows the following:

    Adding delta-2.1.0 libs to Spark Classpath
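
To make the fix concrete, here is a minimal sketch of how the container could be started with the variable set, written in Python so it can sit next to the job code. Only DATALAKE_FORMATS comes from the blog post quoted above. The image tag, the /home/glue_user/workspace mount point, and the job.py script name are assumptions on my part, so replace them with whatever your setup uses. AWS credentials and port mappings are left out.

```python
import shlex

# Assumption: public Glue 4.0 image tag; use the tag you actually pulled.
GLUE_IMAGE = "amazon/aws-glue-libs:glue_libs_4.0.0_image_01"
# Assumption: directory inside the container where the job scripts are mounted.
CONTAINER_WORKSPACE = "/home/glue_user/workspace"


def build_docker_run_args(host_workspace, script_name, formats=("delta",), image=GLUE_IMAGE):
    """Build the argv list for `docker run` with the data lake formats enabled."""
    return [
        "docker", "run", "--rm",
        # Mount the local project so the script is visible inside the container.
        "-v", f"{host_workspace}:{CONTAINER_WORKSPACE}",
        # The variable named in the AWS blog post; comma-separated list of formats.
        "-e", "DATALAKE_FORMATS=" + ",".join(formats),
        image,
        "spark-submit", f"{CONTAINER_WORKSPACE}/{script_name}",
    ]


if __name__ == "__main__":
    # Hypothetical host path and script name, for illustration only.
    args = build_docker_run_args("/path/to/project", "job.py")
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(args))
```

The printed command can be pasted into a shell, or the list can be passed to subprocess.run(args, check=True). If the container starts correctly, the "Adding delta-2.1.0 libs to Spark Classpath" line quoted above should appear in its output before the job runs.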