DuckDB read_parquet fails when reading from MinIO

I am trying to query a Parquet file with DuckDB. The file is stored in MinIO, and I am running the code in a Jupyter notebook. The code is below:

import duckdb

def queryduckdb(bucketname, parquetfilepath):
    # Establish a connection
    conn = duckdb.connect()
    try:
        # Load the httpfs extension (needed for s3:// URLs)
        conn.execute("LOAD httpfs")
        # Set MinIO configuration
        conn.execute("SET s3_region = 'ap-south-1'")
        conn.execute("SET s3_access_key_id = 'abcded'")
        conn.execute("SET s3_secret_access_key = 'abcdd'")
        conn.execute("SET s3_endpoint = ''")
        conn.execute("SET s3_use_ssl = false")  # Use true if MinIO uses HTTPS
        # Construct the URL
        url = f's3://{bucketname}/{parquetfilepath}'
        # Construct the query
        query = f"SELECT * FROM read_parquet('{url}')"
        # Execute the query and fetch results
        return conn.execute(query).fetchall()
    except Exception as e:
        # Print or log the exception message
        print(f"Exception: {e}")
    finally:
        # Close the connection
        conn.close()

parquetfile = "sData/MarketDetails/Year=2024/1.parquet"

The constructed URL is s3://bucketname/sData/MarketDetails/Year=2024/1.parquet

But I am getting the error below:

IO Error: Connection Error for HTTP Head to 'http://bucketname.'

Why are there two "http" in the error? My concern is that the error shows bucketname.http//endpoint/parquetfile. Why does the bucket name come first, followed by the endpoint? Why are the bucket name and the Parquet file path separated?

Kindly guide


  • Why there are two http in the error?

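    Because you specified "http" in the endpoint. DuckDB adds the scheme itself, based on s3_use_ssl, so the endpoint should contain only the host and port. It would appear that you should instead be using: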

    conn.execute("SET s3_endpoint = ''")

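    The bucket name comes first because DuckDB defaults to virtual-hosted-style URLs, where the bucket is prepended to the endpoint host. MinIO setups commonly expect path-style URLs, where the bucket is the first path segment, so you may also wish to consider using: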

    conn.execute("SET s3_url_style = 'path'")
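
    To see why the error looks the way it does, here is a rough sketch in plain Python of how an S3 client could turn s3://bucket/key into an HTTP URL. This is only an approximation inferred from the error message, not DuckDB's actual code, and it ignores details such as percent-encoding. The helper names (normalize_endpoint, request_url) and the endpoint localhost:9000 are hypothetical and used purely for illustration.

```python
from urllib.parse import urlsplit


def normalize_endpoint(raw):
    """Split a raw endpoint into (host[:port], use_ssl).

    Accepts 'http://host:9000', 'https://host', or a bare 'host:9000'.
    A bare endpoint is assumed to be plain HTTP in this sketch.
    """
    if "://" in raw:
        parts = urlsplit(raw)
        return parts.netloc, parts.scheme == "https"
    return raw.rstrip("/"), False


def request_url(endpoint, bucket, key, use_ssl=False, url_style="vhost"):
    """Approximate the HTTP URL derived from s3://bucket/key.

    The scheme is always prepended by the client, so an endpoint that
    already contains 'http://' ends up with the scheme twice.
    """
    scheme = "https" if use_ssl else "http"
    if url_style == "path":
        # Path style: the bucket is the first path segment.
        return f"{scheme}://{endpoint}/{bucket}/{key}"
    # Virtual-hosted style: the bucket is prepended to the host.
    return f"{scheme}://{bucket}.{endpoint}/{key}"


key = "sData/MarketDetails/Year=2024/1.parquet"

# Endpoint with a scheme plus vhost style: reproduces the doubled "http".
bad = request_url("http://localhost:9000", "bucketname", key)

# Scheme stripped from the endpoint plus path style: a well-formed URL.
host, use_ssl = normalize_endpoint("http://localhost:9000")
good = request_url(host, "bucketname", key, use_ssl, url_style="path")

print(bad)
print(good)
```

    In DuckDB terms, this corresponds to setting s3_endpoint to the bare host and port (no scheme), controlling HTTP versus HTTPS through s3_use_ssl, and setting s3_url_style to 'path'.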