Using an example I found online, the code below throws an error because it cannot read from the S3 bucket. The problem is that I have to pass in the AWS credentials, which are stored in the variable S3_dir along with the bucket path. I am unable to get this to work.
%sql
DROP TABLE IF EXISTS mydb.bigtable;
CREATE TABLE mydb.bigtable
(
id BIGINT,
string1 STRING,
numbers BIGINT,
) USING delta
LOCATION '/user/hive/warehouse/mydb'
SELECT * from delta.f"{S3_dir}";
In order to use a data source from external storage (such as AWS S3 or Azure ADLS), you need to mount the S3 bucket to Databricks (DBFS). Here is the step-by-step procedure:
Step 1: Create an AWS access key and secret key for Databricks
Note: The credentials file you download in this step contains your access key and secret key; keep it safe.
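If you prefer to script this step rather than use the AWS console, a rough sketch with boto3 could look like the following (the IAM user name "databricks-user" is a hypothetical placeholder, not something from the original post):
import boto3

# Create an access key / secret key pair for a dedicated IAM user.
# "databricks-user" is a hypothetical user; create and scope it yourself.
iam = boto3.client("iam")
resp = iam.create_access_key(UserName="databricks-user")
print(resp["AccessKey"]["AccessKeyId"])
print(resp["AccessKey"]["SecretAccessKey"])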
Step 2: Log in to Databricks and run the following code
Create a notebook called “mount_s3_dbfs”
import urllib.parse

# 1. Add the necessary information
ACCESS_KEY = "<access_key>"
SECRET_KEY = "<secret_key>"
S3_BUCKET = "<s3bucket_name>"

# 2. URL-encode your secret key
ENCODED_SECRET_KEY = urllib.parse.quote(SECRET_KEY, safe="")

# 3. Define the source URL
URL = "s3a://{}:{}@{}".format(ACCESS_KEY, ENCODED_SECRET_KEY, S3_BUCKET)

# 4. Mount your S3 bucket on DBFS
dbutils.fs.mount(URL, f"/mnt/{S3_BUCKET}")

# 5. List the mounted folder's contents
display(dbutils.fs.ls(f"/mnt/{S3_BUCKET}/"))
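If the bucket was already mounted, dbutils.fs.mount will fail; a quick sanity check (a sketch, reusing the same S3_BUCKET variable) is to list the existing mounts and unmount first if needed:
# List all current mount points to verify the new mount shows up
display(dbutils.fs.mounts())

# If a stale mount already exists at this path, remove it before re-mounting
# dbutils.fs.unmount(f"/mnt/{S3_BUCKET}")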
Note: Hard-coding your credentials in the notebook is not recommended; instead, upload the credentials file to DBFS and read it with Spark's CSV reader.
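For example, if you upload the credentials file from Step 1 to DBFS, you can read the keys at runtime instead of hard-coding them. A sketch, where the upload path is hypothetical and the column names assume the standard AWS-generated CSV:
# Read the uploaded AWS credentials CSV from DBFS (hypothetical upload path)
creds = (spark.read
              .option("header", "true")
              .csv("/FileStore/tables/credentials.csv"))

row = creds.first()
ACCESS_KEY = row["Access key ID"]       # column names as generated by the AWS console
SECRET_KEY = row["Secret access key"]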
Step 3: Use Spark to read and write files from AWS S3
%sql
DROP TABLE IF EXISTS mydb.bigtable;

-- The schema is inferred from the source Delta table, so no column list is needed here
CREATE TABLE mydb.bigtable
USING delta
LOCATION '/user/hive/warehouse/mydb'
AS SELECT * FROM delta.`/mnt/<s3bucket_name>/<path_to_delta_table>`;
Note: Use backticks (`) around the S3 file path after the delta. prefix.
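The same read can also be done from PySpark if you prefer to stay in Python. A sketch, assuming the bucket was mounted as in Step 2 and <path_to_delta_table> points at an existing Delta table inside it:
# Load the Delta table from the mounted S3 path
df = spark.read.format("delta").load("/mnt/<s3bucket_name>/<path_to_delta_table>")

# Write it out as an external Delta table registered in the metastore
(df.write
   .format("delta")
   .mode("overwrite")
   .option("path", "/user/hive/warehouse/mydb")
   .saveAsTable("mydb.bigtable"))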