I am trying to access Delta Lake tables stored on S3 from AWS Glue jobs, but I am getting the error "Module Delta not defined".
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars.packages", "io.delta:delta-core_2.11:0.6.0") \
    .getOrCreate()

from delta.tables import *

data = spark.range(0, 5)
data.write.format("delta").save("s3://databricksblaze/data")
I have also added the necessary JAR (delta-core_2.11-0.6.0.jar) to the dependent JARs of the Glue job. Can anyone help me with this? Thanks.
I have had success using Glue + Delta Lake. I added the Delta Lake dependencies to the "Dependent jars path" section of the Glue job; the key one is io.delta_delta-core_2.11-0.6.1.jar, along with its transitive dependencies (I am using Delta Lake 0.6.1).
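If you create the job through the API rather than the console, the "Dependent jars path" field corresponds to the --extra-jars default argument. Here is a minimal boto3 sketch, assuming the jar has already been uploaded to S3 (the job name, role, and paths below are placeholders):

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="delta-lake-job",         # placeholder job name
    Role="MyGlueServiceRole",      # placeholder IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your_bucket/scripts/delta_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Same as the "Dependent jars path" console field
        "--extra-jars": "s3://your_bucket/jars/io.delta_delta-core_2.11-0.6.1.jar",
    },
    GlueVersion="1.0",             # Spark 2.4, matching the Scala 2.11 build of Delta
)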
Then in your Glue job you can use the following code:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
# The Delta jar also bundles the delta Python module, so registering it with
# addPyFile puts it on the Python path and makes the import below work.
sc.addPyFile("io.delta_delta-core_2.11-0.6.1.jar")

from delta.tables import *

glueContext = GlueContext(sc)
spark = glueContext.spark_session

delta_path = "s3a://your_bucket/folder"

# Write a small DataFrame as a Delta table, then reopen it via the DeltaTable API
data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save(delta_path)

deltaTable = DeltaTable.forPath(spark, delta_path)
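Once the table has been written, you can read it back and modify it through the DeltaTable handle. A quick sketch of follow-up operations (using the same delta_path as above):

# Read the Delta table back as a plain DataFrame
df = spark.read.format("delta").load(delta_path)
df.show()

# Or update it in place through the DeltaTable API
deltaTable.delete("id > 3")    # remove matching rows
deltaTable.history().show()    # inspect the table's transaction log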