I understand Data Lake zones and am looking at establishing three zones in S3: LANDING, STAGING, and CURATED. If I were in an Azure environment, I would create the Data Lake and use multiple folders as the various zones.
How would I do the equivalent in AWS? Would it be a separate bucket for each zone (s3://landing_data/, s3://staging_data/, s3://curated_data/) or a single bucket with multiple folders (e.g. s3://bucket_name/landing/..., s3://bucket_name/staging/...)? I understand that AWS S3 buckets are nothing more than containers.
Also, would I be able to mount multiple S3 buckets in Databricks on AWS? If so, is there any reference documentation?
Is there any best/recommended approach given that we can read and write to S3 in multiple ways?
I looked at this as well: S3 Performance Best Practices.
There is no single solution; the actual implementation depends on the amount of data, the number of consumers/producers, etc. You need to take into account AWS S3 limits, such as the per-prefix request-rate limits (3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix).
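For illustration only, here is a minimal PySpark sketch (the bucket name curated-data, the path, and the column names are made up) showing how partitioning a curated-zone table by date spreads objects across many key prefixes, which helps keep each prefix within those request limits:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample of curated-zone data
df = spark.createDataFrame(
    [("evt-1", "2024-01-01"), ("evt-2", "2024-01-02")],
    ["event_id", "ingest_date"],
)

# Partitioning by date puts each day's objects under its own key prefix
# (e.g. .../events/ingest_date=2024-01-01/), so request traffic is spread
# across many prefixes instead of concentrating on a single one.
(df.write
   .mode("append")
   .partitionBy("ingest_date")
   .parquet("s3://curated-data/events/"))
```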
You can mount each of the buckets, or individual folders within them, into the Databricks workspace as described in the documentation. However, this is not recommended from a security standpoint, because everyone in the workspace gets the same permissions as the role that was used for mounting. Instead, just use full S3 URLs in combination with instance profiles.
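To make the distinction concrete, here is a rough sketch for a Databricks notebook (bucket names, paths, and the mount point are placeholders, and the cluster is assumed to have a suitable instance profile attached):

```python
# Option 1: mount the bucket (or a folder in it); everyone in the workspace
# then shares the permissions of the credentials/role used for the mount.
dbutils.fs.mount("s3a://landing-data", "/mnt/landing")
orders = spark.read.parquet("/mnt/landing/orders/")

# Option 2 (recommended): no mount -- read/write with the full S3 URL, with
# access controlled by the instance profile attached to the cluster.
orders = spark.read.parquet("s3://landing-data/orders/")
orders.write.mode("overwrite").parquet("s3://staging-data/orders/")
```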