I’m trying to read a Parquet file using the nodejs-polars library, but I’m encountering a 403 Forbidden response when attempting to load the file from an S3 bucket.
Most of the examples I’ve found are in Python, and I’m looking for guidance on how to achieve this in Node.js. Specifically, I’d like to know:
- How to properly read a Parquet file from S3 using nodejs-polars, including how to handle AWS credentials.
- Whether it’s possible to use partitioning with nodejs-polars, similar to what’s available in the Python implementation.
- If hive partitioning is supported, what a valid URL pattern or example configuration looks like.
This is an example of my code:
import pl from 'nodejs-polars';
const cloudOptions = new Map();
cloudOptions.set('aws_region', 'eu-west-1');
// this line returns the 403 Forbidden error
const df = pl
.scanParquet(
'https://my-bucket-name.s3.eu-west-1.amazonaws.com/test_folder/test_file.parquet',
{
cloudOptions: cloudOptions,
},
)
.collectSync();
Could someone please provide a working example or any pointers to resolve the issue? Additionally, if partitioning support is available, a brief overview of how to implement it would be greatly appreciated.
I was able to read a Parquet file from an S3 bucket using the nodejs-polars library, starting from version 0.16.0.
Here is a working example:
import pl from 'nodejs-polars';
// Define your AWS cloud options
const cloudOptions = {
aws_region: 'relevant-region', // Replace with your AWS region
aws_session_token: 'your-session-token', // Replace with your AWS session token
};
const df = pl
.scanParquet(
's3://your-bucket-name/some-dir/**/**/**/*.parquet', // Update with your actual S3 path
{
cloudOptions: cloudOptions, // Specify AWS options
hivePartitioning: true, // Enable hive partitioning if applicable
}
)
.collectSync();
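Once collectSync() returns, df is a regular eager DataFrame, so you can sanity-check the result directly (the head() call here is just an illustration, assuming the scan above succeeded):

console.log(df.head(5)); // print the first five rows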
Important Notes:
Version Requirement: This code worked for me with nodejs-polars version 0.16.0 (the current version at the time of writing this answer).
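If you are on an older release, upgrading is a one-liner (assuming npm as your package manager):

npm install nodejs-polars@^0.16.0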
AWS Credentials: Make sure to set the correct AWS region and session token in cloudOptions for accessing your S3 bucket. Also note that the options are now passed as a plain Object instead of a Map.
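If you authenticate with long-lived credentials instead of a session token, the same string-keyed object pattern should work. The key names below follow the S3 configuration keys Polars accepts, but double-check them against your nodejs-polars version; reading them from environment variables is just an assumption for this sketch:

const cloudOptions = {
  aws_region: 'eu-west-1',
  aws_access_key_id: process.env.AWS_ACCESS_KEY_ID, // assumed to be set in your environment
  aws_secret_access_key: process.env.AWS_SECRET_ACCESS_KEY,
};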
S3 URL Template: I used ** segments in the S3 URL (e.g., 's3://your-bucket-name/some-dir/**/**/**/*.parquet') as wildcards for directories, one per partition level. This allowed the library to scan through the directories recursively to find the Parquet files.
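To make the mapping concrete, here is a hypothetical hive-style layout (the bucket, directory, and partition names are illustrative) and a pattern that matches it, with one ** per partition level:

// layout in S3:
//   s3://your-bucket-name/some-dir/year=2024/month=01/part-0001.parquet
//   s3://your-bucket-name/some-dir/year=2024/month=02/part-0001.parquet
// matching pattern (two partition levels, then the files):
's3://your-bucket-name/some-dir/**/**/*.parquet'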
Hive Partitioning: I enabled hivePartitioning: true to handle partitioned data, which was useful in my case because my Parquet files were structured that way in S3.
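Putting it together, here is a minimal sketch of how the partition directories become queryable columns. The year and month columns come from the hypothetical layout above, not from the library, and depending on the version the partition values may be parsed as strings rather than numbers:

import pl from 'nodejs-polars';

const df = pl
  .scanParquet('s3://your-bucket-name/some-dir/**/**/*.parquet', {
    cloudOptions: { aws_region: 'relevant-region' },
    hivePartitioning: true, // exposes year/month as regular columns
  })
  .filter(pl.col('year').eq(2024)) // use '2024' if values come back as strings
  .collectSync();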