I'm encountering an error when I read an S3 parquet file and then overwrite the same path, but it works fine when I use append. The code below is only part of the whole script.
from pyspark.sql.functions import lit, unix_timestamp

df = (spark.read.format("jdbc")
    .option("driver", jdbc_driver_name)
    .option("url", db_url)
    .option("dbtable", table_name)
    .option("user", db_username)
    .option("password", db_password)
    .option("fetchSize", 100000)
    .load())
load_test = spark.read.parquet("s3://s3-raw/test_table")
new_test = df.withColumn("load_timestamp", unix_timestamp(lit(timestamp),'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
new_test.write.partitionBy("load_timestamp").format("parquet").mode("overwrite").save("s3://s3-raw/test_table")
I tried editing the code (see below), and it can now read the S3 parquet file and overwrite the path. However, when I check the S3 path (s3://s3-raw/test_table), the load_timestamp partition folders are there but contain no data. When I crawl it into the Data Catalog and query it in AWS Athena, the expected output does appear.
load_test = spark.read.parquet("s3://s3-raw/" + 'test_table')
new_test = df.withColumn("load_timestamp", unix_timestamp(lit(timestamp),'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
new_test.write.partitionBy("load_timestamp").format("parquet").mode("overwrite").save("s3://s3-raw/" + 'test_table')
The reason for this behaviour is the S3 consistency model. S3 provides read-after-write consistency for new objects created in a bucket, whereas object deletion and modification are eventually consistent, i.e. it might take some time for an update or deletion of an object to be reflected in S3. Since you are overwriting an existing path here, the write is a modification, hence eventual consistency.
You can read more about it in the AWS documentation on the S3 consistency model. A good post for understanding it in practice, and how it can be handled, can be found here: https://medium.com/@dhruvsharma_50981/s3-eventual-data-consistency-model-issues-and-tackling-them-47093365a595
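For illustration, one common way to stay on the read-after-write side of the model is to write each load to a fresh prefix instead of overwriting an existing one, so S3 only ever sees new objects. This is just a minimal sketch reusing df and timestamp from the question; the run_id suffix and output_base name are assumptions, not part of your original code:

from datetime import datetime
from pyspark.sql.functions import lit, unix_timestamp

# Assumed run identifier so every load lands under its own, brand-new prefix
run_id = datetime.utcnow().strftime("%Y%m%d%H%M%S")
output_base = "s3://s3-raw/test_table"   # same base path as in the question

new_test = df.withColumn(
    "load_timestamp",
    unix_timestamp(lit(timestamp), "yyyy-MM-dd HH:mm:ss").cast("timestamp"),
)

# Append-only write to a new prefix, e.g. s3://s3-raw/test_table/run_id=20240101120000/
# Nothing is deleted or modified, so reads are not exposed to eventual consistency.
(new_test.write
    .partitionBy("load_timestamp")
    .format("parquet")
    .mode("append")
    .save(output_base + "/run_id=" + run_id))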