amazon-s3 aws-lambda duckdb

Unable to access AWS S3 parquet file from AWS Lambda using duckdb


I have a parquet file stored in AWS S3. Assume the location is s3://bucket/file.parquet. I defined a function in AWS Lambda to access this parquet file by using the code below.


    import os
    def lambda_handler(event, context):
        import duckdb
        con = duckdb.connect(database=':memory:', read_only=False)
        # DuckDB needs a writable home directory to install extensions;
        # in Lambda that has to live under /tmp.
        home_directory = "/tmp/duckdb/"
        os.makedirs(home_directory, exist_ok=True)

        con.query(f"SET home_directory='{home_directory}';")
        con.query("INSTALL 'httpfs';")
        con.query("LOAD 'httpfs';")
        # Explicit S3 credentials and region (values redacted)
        con.query("SET s3_access_key_id='xxx';")
        con.query("SET s3_secret_access_key='xxx';")
        con.query("SET s3_region='us-east-1';")
        sql = "select * from read_parquet('s3://bucket/file.parquet') limit 5;"
        data = con.execute(sql).fetchall()
        print(data)

When running the above code, I see the following error.

Response
{
  "errorMessage": "HTTP Error: HTTP GET error on 's3://bucket/file.parquet' (HTTP 400)",
  "errorType": "HTTPException",
  "stackTrace": [
    "  File \"/var/task/lambda_function.py\", line 140, in lambda_handler\n    data = con.execute(sql).fetchall()\n"
  ]
}

If I replace the S3 URL with an HTTPS presigned URL, this code works fine. I also tried downloading the file into the Lambda container with the boto3 library and running the code against the local copy, and that works fine too. My question is: why does it fail when I pass the S3 path directly to the read_parquet function, which is supported according to the DuckDB docs?

https://duckdb.org/docs/extensions/httpfs.html#s3
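
For reference, the boto3 workaround mentioned above might look roughly like the sketch below. The helper names `local_copy_path` and `read_via_download` are made up for illustration, and the sketch assumes the function's execution role is allowed to read the object and that `/tmp` has enough free space for the file.

```python
import os


def local_copy_path(s3_key, tmp_dir="/tmp"):
    """Return where the downloaded object is stored (hypothetical helper)."""
    return os.path.join(tmp_dir, os.path.basename(s3_key))


def read_via_download(bucket, key, limit=5):
    # Imported lazily so the module still loads where these packages are absent.
    import boto3
    import duckdb

    local_path = local_copy_path(key)
    # boto3 resolves credentials on its own (in Lambda, from the execution role).
    boto3.client("s3").download_file(bucket, key, local_path)

    # Query the local copy; no httpfs extension or S3 settings are involved.
    con = duckdb.connect(database=":memory:")
    sql = f"select * from read_parquet('{local_path}') limit {int(limit)};"
    return con.execute(sql).fetchall()
```

Calling `read_via_download("bucket", "file.parquet")` inside the handler would correspond to the case described above that works.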


Solution

  • After a lot of struggle, I was able to figure out the issue. When accessing files in S3 from a Lambda function, you do not need to set the following parameters explicitly. The credentials are picked up directly from the IAM role attached to the function. When credentials are also set in the code, they appear to conflict with the role's credentials and the request fails. Removing the lines below got rid of the error.

            con.query("SET s3_access_key_id='xxx';")
            con.query("SET s3_secret_access_key='xxx';")
            con.query("SET s3_region='us-east-1';")
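
Put together, the handler without the explicit credential settings might look like the sketch below. It assumes the Lambda execution role grants read access to the object and, as described in this answer, that DuckDB picks up the role's credentials on its own. `setup_statements` is a hypothetical helper name, and the bucket and file path are the placeholders from the question.

```python
import os

HOME_DIRECTORY = "/tmp/duckdb/"  # /tmp is writable inside Lambda


def setup_statements(home_directory=HOME_DIRECTORY):
    """SQL run before the query; deliberately contains no SET s3_* lines."""
    return [
        f"SET home_directory='{home_directory}';",
        "INSTALL httpfs;",
        "LOAD httpfs;",
    ]


def lambda_handler(event, context):
    import duckdb  # imported lazily, as in the question

    os.makedirs(HOME_DIRECTORY, exist_ok=True)
    con = duckdb.connect(database=":memory:", read_only=False)
    for statement in setup_statements():
        con.execute(statement)

    # Placeholder path from the question; no credentials are set in code.
    sql = "select * from read_parquet('s3://bucket/file.parquet') limit 5;"
    data = con.execute(sql).fetchall()
    print(data)
```

Keeping the setup statements in one list makes it easy to check that no access key, secret, or region setting has crept back in.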