Tags: python, amazon-s3, pyspark, delta-lake, aws-databricks

Error loading data from S3 bucket to Databricks External Table


Using an example I found online, the code below throws an error because it cannot read from the S3 bucket. The problem is that I have to pass in the AWS credentials, which are stored in the variable S3_dir along with the bucket path. I am unable to get this to work.

%sql
DROP TABLE IF EXISTS mydb.bigtable;
CREATE TABLE mydb.bigtable
(
 id BIGINT,
  string1 STRING,
  numbers BIGINT,
) USING delta
LOCATION '/user/hive/warehouse/mydb'
SELECT * from delta.f"{S3_dir}";

Solution

  • In order to use a data source from external storage (such as AWS S3 or Azure ADLS), you need to mount the S3 bucket with Databricks. Here is the step-by-step procedure:

    Step 1: Create AWS Access Key and Secret Key for Databricks

    • Log in to your AWS account
    • Under your profile name (top right corner), select Security credentials
    • Scroll down to Access keys -> create an access key and download the generated CSV file

    Note: This file contains your access key and secret key.

    Step 2: Log in to Databricks and run the following code

    Create a notebook called “mount_s3_dbfs”

    import urllib.parse
    
    # 1. Add the necessary information
    ACCESS_KEY = "<access_key>"
    SECRET_KEY = "<secret_key>"
    S3_BUCKET = "<s3bucket_name>"
    
    # 2. URL-encode your secret key (it may contain characters such as '/')
    ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, safe="")
    
    # 3. Define the mount source URL
    URL = f"s3a://{ACCESS_KEY}:{ENCODED_SECRET_KEY}@{S3_BUCKET}"
    
    # 4. Mount your S3 bucket with DBFS
    dbutils.fs.mount(URL, f"/mnt/{S3_BUCKET}")
    
    # 5. List the mounted folder's content
    display(dbutils.fs.ls(f"/mnt/{S3_BUCKET}/"))
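
    Note that dbutils.fs.mount raises an error if the mount point already exists. A minimal sketch of a guard you could add before step 4, reusing the URL and S3_BUCKET variables defined above:

    # Skip mounting if the bucket is already mounted (sketch)
    already_mounted = any(
        m.mountPoint == f"/mnt/{S3_BUCKET}" for m in dbutils.fs.mounts()
    )
    if not already_mounted:
        dbutils.fs.mount(URL, f"/mnt/{S3_BUCKET}")
    
    # To remove the mount later:
    # dbutils.fs.unmount(f"/mnt/{S3_BUCKET}")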
    

    Note: Providing your credentials directly in a notebook is not recommended; upload them to DBFS instead and read them with Spark's CSV reader, as sketched below.
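
    A minimal sketch of that approach, assuming the CSV downloaded in step 1 was uploaded to a hypothetical DBFS path /FileStore/tables/aws_credentials.csv (the column names match the file the AWS console generates):

    # Sketch: read the uploaded credentials CSV instead of hard-coding keys.
    # "/FileStore/tables/aws_credentials.csv" is a hypothetical upload location.
    creds = (spark.read
             .option("header", "true")
             .csv("/FileStore/tables/aws_credentials.csv"))
    
    row = creds.first()
    ACCESS_KEY = row["Access key ID"]
    SECRET_KEY = row["Secret access key"]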

    Step 3: Use Spark to read and write files from AWS

    %sql
    
    DROP TABLE IF EXISTS mydb.bigtable;
    CREATE TABLE mydb.bigtable
    (
      id BIGINT,
      string1 STRING,
      numbers BIGINT
    ) USING delta
    LOCATION '/user/hive/warehouse/mydb'
    AS SELECT * FROM delta.`/mnt/<s3bucket_name>/path/to/s3file`;
    

    Note: Use backticks (`) to quote your S3 file location, and use forward slashes in the path.
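
    For completeness, since step 3 covers both reading and writing, here is a minimal PySpark sketch under the same assumptions (the mounted path /mnt/<s3bucket_name>/bigtable is hypothetical):

    # Sketch: write a DataFrame as Delta through the mount point from step 2,
    # then read the Delta files back from S3.
    df = spark.table("mydb.bigtable")
    
    df.write.format("delta").mode("overwrite").save("/mnt/<s3bucket_name>/bigtable")
    
    df_back = spark.read.format("delta").load("/mnt/<s3bucket_name>/bigtable")
    df_back.show()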