I run my AWS Glue jobs locally in a Docker container (AWS Glue libs 4.0) and want to write a pandas DataFrame in Delta format.
I added
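import sys
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# Enable the Delta SQL extension and catalog on the session
spark = SparkSession.builder \
    .appName("YourAppName") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
# Pass the Glue job parameter as if it had been given on the command line
sys.argv += ['--datalake-formats', 'delta']
args = getResolvedOptions(sys.argv, ['datalake-formats'])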
but this line
spark.createDataFrame(pandas_df).write.format('delta').save('myfile.delta')
still gives me the error "Failed to find data source: delta".
I don't get what I am missing here.
Found the answer in an AWS blog post:
"Glue 4.0: Add native data lake libraries AWS Glue 4.0 Docker image supports native data lake libraries; Apache Hudi, Delta Lake, and Apache Iceberg. You can pass the environment variable DATALAKE_FORMATS to load the relevant JAR files.
-e DATALAKE_FORMATS=hudi,delta,iceberg
"
When you set this environment variable while starting your Docker container, it will do the following:
Adding delta-2.1.0 libs to Spark Classpath
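Since the fix lives in the container's environment rather than in the job code, it can help to fail early with a readable message when the variable was forgotten. Below is a minimal sketch of such a guard. The helper names enabled_datalake_formats and require_format are hypothetical (they are not part of awsglue or pyspark), and the sketch assumes DATALAKE_FORMATS is visible to the Python process inside the container. It only checks that the variable was set; it does not verify that the JARs actually landed on the Spark classpath.

```python
import os


def enabled_datalake_formats(environ=None):
    """Return the set of formats listed in DATALAKE_FORMATS (empty if unset).

    Hypothetical helper: `environ` defaults to os.environ and can be
    replaced by a plain dict for testing.
    """
    env = os.environ if environ is None else environ
    raw = env.get("DATALAKE_FORMATS", "")
    # The variable is a comma-separated list, e.g. "hudi,delta,iceberg"
    return {name.strip().lower() for name in raw.split(",") if name.strip()}


def require_format(name, environ=None):
    """Raise a readable error if `name` was not requested via DATALAKE_FORMATS."""
    if name.lower() not in enabled_datalake_formats(environ):
        raise RuntimeError(
            f"DATALAKE_FORMATS does not include '{name}'; "
            f"start the container with -e DATALAKE_FORMATS={name}"
        )
```

Calling require_format("delta") before spark.createDataFrame(pandas_df).write.format('delta') would then turn the "Failed to find data source: delta" error into a hint about the missing -e DATALAKE_FORMATS=delta option.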