apache-spark amazon-s3 spark-structured-streaming apache-hudi

Writing data from multiple clusters into Hudi tables in S3


For multi-cluster writes to S3, Delta Lake uses DynamoDB to atomically check whether a file is already present before writing it, because S3 does not support a "put-if-absent" consistency guarantee. So to get concurrent writes with Delta Lake, we need DynamoDB, which is an extra cost for us to maintain. I would like to check how this works with Hudi.
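For context, a minimal sketch of how the Delta Lake setup described above is typically configured, assuming the delta-storage-s3-dynamodb module is on the classpath; the table and region names are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-multicluster-s3")
  // Route Delta commits on s3a:// paths through the DynamoDB-backed
  // LogStore, which supplies the put-if-absent semantics S3 lacks.
  .config("spark.delta.logStore.s3a.impl", "io.delta.storage.S3DynamoDBLogStore")
  // DynamoDB table used to coordinate concurrent commits (illustrative names).
  .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
  .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
  .getOrCreate()
```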

Similarly, does Hudi also require DynamoDB for multi-writer support on S3, or can it use something else instead of DynamoDB?

I don't see anything mentioned specifically about Hudi supporting multiple writers to the same table in S3.


Solution

  • Hudi supports several types of lock providers; check the concurrency control docs.

    For the AWS ecosystem, though, DynamoDB seems to be the best choice, as AWS itself suggests (see the first sketch after this list).

    Currently, Hive Metastore locking does not work properly with Glue (check this issue).

    Alternatively, you can try the FileSystem-based lock provider (second sketch below).
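A minimal sketch of the DynamoDB option, assuming Spark with the hudi-aws bundle on the classpath; the table, lock table, record key, region, and bucket names are all illustrative:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-multiwriter-s3")
  .getOrCreate()
import spark.implicits._

// Toy frame standing in for the real workload.
val df = Seq((1, "a", 0L), (2, "b", 0L)).toDF("id", "value", "ts")

df.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "value")
  .option("hoodie.datasource.write.precombine.field", "ts")
  // Turn on optimistic concurrency control for multi-writer scenarios.
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  // Acquire table locks through DynamoDB.
  .option("hoodie.write.lock.provider",
    "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider")
  .option("hoodie.write.lock.dynamodb.table", "hudi_locks")
  .option("hoodie.write.lock.dynamodb.partition_key", "my_table")
  .option("hoodie.write.lock.dynamodb.region", "us-east-1")
  .mode(SaveMode.Append)
  .save("s3a://my-bucket/hudi/my_table")
```

And the FileSystem-based variant, which needs no extra AWS service; reusing the same df, only the lock options change:

```scala
df.write.format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "value")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  // Lock files are kept on the filesystem, by default under the table path.
  .option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider")
  .mode(SaveMode.Append)
  .save("s3a://my-bucket/hudi/my_table")
```

Keep in mind that a filesystem lock on S3 inherits the same weak atomicity guarantees the question describes, so DynamoDB remains the more robust choice for truly concurrent writers.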