I am trying to query a Parquet file with DuckDB from a Jupyter notebook. The Parquet file is stored in MinIO. The code is below:
import duckdb

def queryduckdb(bucketname, parquetfilepath):
    try:
        # Establish a connection
        conn = duckdb.connect()
        # Load the httpfs extension
        conn.execute("LOAD httpfs")
        # Set MinIO configuration
        conn.execute("SET s3_region = 'ap-south-1'")
        conn.execute("SET s3_access_key_id = 'abcded'")
        conn.execute("SET s3_secret_access_key = 'abcdd'")
        conn.execute("SET s3_endpoint = 'http://172.20.20.101:9000'")
        conn.execute("SET s3_use_ssl = false")  # Use true if MinIO uses HTTPS
        # Construct the S3 URL
        url = f's3://{bucketname}/{parquetfilepath}'
        # Construct the query
        query = f"SELECT * FROM read_parquet('{url}')"
        # Execute the query and fetch results
        result = conn.execute(query).fetchall()
        return result
    except Exception as e:
        # Print or log the exception message
        print(f"Exception: {e}")
    finally:
        # Close the connection
        conn.close()

bucketname = "bucketname"
parquetfile = "sData/MarketDetails/Year=2024/1.parquet"
queryduckdb(bucketname, parquetfile)
The constructed URL is s3://bucketname/sData/MarketDetails/Year=2024/1.parquet
But I am getting the error below:
IO Error: Connection Error for HTTP Head to 'http://bucketname.http://172.20.20.101%3A9000/sData/MarketDetails/Year=2024/1.parquet'
Why are there two "http" in the error? My main concern is that the error shows bucketname.http://endpoint/parquetfile. Why does the bucket name come first, followed by the endpoint? And why are the bucket name and the Parquet file path separated?
Kindly guide
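Why are there two "http" in the error?
Because you included the "http://" scheme in s3_endpoint. DuckDB adds the scheme itself (chosen by s3_use_ssl), and with the default virtual-hosted URL style it prefixes the bucket name to the endpoint as a subdomain. That is why the bucket name appears first and the file path follows the endpoint. The endpoint should contain only the host and port: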
conn.execute("SET s3_endpoint = '172.20.20.101:9000'")
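If the endpoint is only available as a full URL (for example, from a config file), a small helper can split it into the host:port part for s3_endpoint and a flag for s3_use_ssl. This is a minimal sketch: split_endpoint is a hypothetical name, and treating a missing scheme as plain HTTP is an assumption made here, not DuckDB behaviour.

```python
from urllib.parse import urlparse

def split_endpoint(endpoint_url):
    """Hypothetical helper: split 'http://host:port' into (host_port, use_ssl)."""
    # urlparse only recognises the host when a scheme or a leading '//' is present.
    if "://" not in endpoint_url:
        endpoint_url = f"//{endpoint_url}"
    parsed = urlparse(endpoint_url)
    # Assumption: no scheme means plain HTTP.
    use_ssl = parsed.scheme == "https"
    return parsed.netloc, use_ssl

host_port, use_ssl = split_endpoint("http://172.20.20.101:9000")
print(host_port, use_ssl)  # 172.20.20.101:9000 False
```

The two values can then be passed to the SET s3_endpoint and SET s3_use_ssl statements instead of hard-coding them.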
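You may also need path-style URLs, which put the bucket name in the path instead of the hostname. That is likely to matter here, because the endpoint is an IP address and a hostname such as bucketname.172.20.20.101 would most likely not resolve: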
conn.execute("SET s3_url_style = 'path'")
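To illustrate why the error URL looked the way it did, here is a rough sketch of how the two URL styles map an s3:// path onto an HTTP URL. build_s3_http_url is a hypothetical helper written for this answer, not DuckDB's actual implementation. It only approximates the behaviour suggested by the error message, and it does not reproduce the percent-encoding of the colon (%3A) seen there.

```python
def build_s3_http_url(bucket, key, endpoint, use_ssl=False, url_style="vhost"):
    """Hypothetical helper: approximate mapping from s3://bucket/key to an HTTP URL."""
    scheme = "https" if use_ssl else "http"
    if url_style == "path":
        # Path style: the bucket is the first path segment.
        return f"{scheme}://{endpoint}/{bucket}/{key}"
    # Virtual-hosted style: the bucket becomes a subdomain of the endpoint.
    return f"{scheme}://{bucket}.{endpoint}/{key}"

key = "sData/MarketDetails/Year=2024/1.parquet"

# Endpoint with a scheme, as in the question: the scheme ends up duplicated.
print(build_s3_http_url("bucketname", key, "http://172.20.20.101:9000"))
# http://bucketname.http://172.20.20.101:9000/sData/MarketDetails/Year=2024/1.parquet

# Host and port only, with path-style addressing.
print(build_s3_http_url("bucketname", key, "172.20.20.101:9000", url_style="path"))
# http://172.20.20.101:9000/bucketname/sData/MarketDetails/Year=2024/1.parquet
```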