amazon-web-services, amazon-s3, databricks, aws-databricks

Data Lakes - S3 and Databricks


I understand data lake zones in S3 and I am looking to establish three zones: LANDING, STAGING, and CURATED. If I were in an Azure environment, I would create a single data lake and use multiple folders for the various zones.

How would I do the equivalent in AWS? Would it be a separate bucket for each zone (s3://landing_data/, s3://staging_data, s3://curated_data) or a single bucket with multiple folders (e.g. s3://bucket_name/landing/..., s3://bucket_name/staging/...)? I understand that AWS S3 buckets are simply flat containers, with "folders" being nothing more than key prefixes.
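For illustration only, the two layouts I am comparing would look roughly like the following boto3 sketch (bucket names and keys here are made up):

```python
import boto3

s3 = boto3.client("s3")

# Option 1: a separate bucket per zone (hypothetical bucket names)
s3.put_object(Bucket="landing-data", Key="sales/2023/01/orders.csv", Body=b"...")
s3.put_object(Bucket="curated-data", Key="sales/orders_clean.parquet", Body=b"...")

# Option 2: a single bucket with one top-level prefix ("folder") per zone
s3.put_object(Bucket="data-lake", Key="landing/sales/2023/01/orders.csv", Body=b"...")
s3.put_object(Bucket="data-lake", Key="curated/sales/orders_clean.parquet", Body=b"...")
```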

Also, would I be able to mount multiple S3 buckets in Databricks on AWS? If so, is there any reference documentation?

Is there any best/recommended approach given that we can read and write to S3 in multiple ways?

I have also looked at the S3 Performance Best Practices.


Solution

  • There is no single solution - the actual implementation depends on the amount of data, the number of consumers/producers, etc. You need to take into account AWS S3 limits, such as:

    • By default you may have only 100 buckets in an account, although this limit can be increased on request.
    • You may issue 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix (directory) in a single bucket, although the number of prefixes is not limited.

  • You can mount each of the buckets, or individual folders within them, into the Databricks workspace as described in the documentation. However, this is not recommended from a security standpoint, because everyone in the workspace will have the same permissions as the role that was used for mounting. Instead, use full S3 URLs in combination with instance profiles, as in the sketch below.
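To make the difference concrete, here is a minimal sketch of both access patterns, written to run in a Databricks notebook (where `spark` and `dbutils` are predefined); the bucket name, mount point, and paths are assumptions for illustration only:

```python
# Option A: mount the bucket (or a prefix within it) into DBFS. Convenient, but every
# user in the workspace inherits the permissions of the role/keys used for the mount.
dbutils.fs.mount(
    source="s3a://data-lake/curated",   # hypothetical bucket and prefix
    mount_point="/mnt/curated",
)
df = spark.read.parquet("/mnt/curated/sales/")

# Option B (preferred): no mount; reference the full S3 URL directly and let the
# cluster's instance profile (IAM role) supply the credentials.
df = spark.read.parquet("s3a://data-lake/curated/sales/")
df.write.mode("overwrite").parquet("s3a://data-lake/staging/sales_cleaned/")
```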