Tags: node.js, amazon-s3, parquet, nodejs-polars

Reading a Parquet file from an S3 bucket using nodejs-polars


I’m trying to read a Parquet file using the nodejs-polars library, but I’m encountering a 403 Forbidden response when attempting to load the file from an S3 bucket.

Most of the examples I’ve found are in Python, and I’m looking for guidance on how to achieve this using Node.js. Specifically, I’d like to know:

  1. How to properly read a Parquet file from S3 using nodejs-polars, including how to handle AWS credentials.

  2. Whether it’s possible to use partitioning with nodejs-polars similar to what’s available in the Python implementation.

  3. If hive partitioning is supported, could you provide a valid URL pattern or example configuration for it, as described here?

This is an example of my code:

import pl from 'nodejs-polars';


const cloudOptions = new Map();
cloudOptions.set('aws_region', 'eu-west-1');

// This call fails with a 403 Forbidden response
const df = pl
  .scanParquet(
    'https://my-bucket-name.s3.eu-west-1.amazonaws.com/test_folder/test_file.parquet',
    {
      cloudOptions: cloudOptions,
    },
  )
  .collectSync();

Could someone please provide a working example or any pointers to resolve the issue? Additionally, if partitioning support is available, a brief overview of how to implement it would be greatly appreciated.


Solution

  • I was able to read Parquet files from an S3 bucket using the nodejs-polars library, starting from version 0.16.0.

    Here is a working example:

    import pl from 'nodejs-polars';
    
    // Define your AWS cloud options
    const cloudOptions = {
      aws_region: 'relevant-region',           // Replace with your AWS region
      aws_session_token: 'your-session-token', // Replace with your AWS session token
    };
    
    const df = pl
      .scanParquet(
        's3://your-bucket-name/some-dir/**/**/**/*.parquet', // Update with your actual S3 path
        {
          cloudOptions: cloudOptions,  // Specify AWS options
          hivePartitioning: true,      // Enable hive partitioning if applicable
        }
      )
      .collectSync();
    
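The example above differs from the question's code in two inputs: it passes an s3:// URI rather than an https:// URL, and it passes cloudOptions as a plain object. The sketch below shows one way to produce both. `toS3Uri` and `buildCloudOptions` are hypothetical helpers, not part of nodejs-polars. The `aws_access_key_id` and `aws_secret_access_key` key names are an assumption modeled on the two keys used above, and the answer does not confirm that the https:// URL was the cause of the 403.

```javascript
// Hypothetical helper: converts a virtual-hosted-style S3 URL
// (https://<bucket>.s3.<region>.amazonaws.com/<key>) into an s3:// URI.
function toS3Uri(httpsUrl) {
  const url = new URL(httpsUrl);
  const match = url.hostname.match(/^(.+)\.s3[.-][a-z0-9-]+\.amazonaws\.com$/);
  if (!match) {
    throw new Error(`Not a virtual-hosted-style S3 URL: ${httpsUrl}`);
  }
  return `s3://${match[1]}${url.pathname}`;
}

// Hypothetical helper: builds a plain-object cloudOptions from the standard
// AWS environment variables, skipping any that are unset or empty.
// Assumption: the access-key key names follow the same naming scheme as
// aws_region and aws_session_token in the answer.
function buildCloudOptions(env = process.env) {
  const candidates = {
    aws_region: env.AWS_REGION ?? env.AWS_DEFAULT_REGION,
    aws_access_key_id: env.AWS_ACCESS_KEY_ID,
    aws_secret_access_key: env.AWS_SECRET_ACCESS_KEY,
    aws_session_token: env.AWS_SESSION_TOKEN,
  };
  return Object.fromEntries(
    Object.entries(candidates).filter(
      ([, value]) => value !== undefined && value !== '',
    ),
  );
}

// Usage with `pl` imported as in the answer (not executed here):
//   const df = pl
//     .scanParquet(toS3Uri(originalUrl), { cloudOptions: buildCloudOptions() })
//     .collectSync();
```

Reading the values from the environment keeps secrets out of the source file; which of them your bucket actually requires depends on how your credentials were issued.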

    Important Notes:

    • Version Requirement: This code worked for me with nodejs-polars version 0.16.0 (the current version at the time of writing the answer).

    • AWS Credentials: Make sure to set the correct AWS region and session token in cloudOptions for accessing your S3 bucket. Note that cloudOptions is now a plain object rather than a Map.

    • S3 URL Template: I used ** segments in the S3 URL (e.g., 's3://your-bucket-name/some-dir/**/**/**/*.parquet') as wildcards for directories. This allowed the library to scan through the nested directories to find the Parquet files.

    • Hive Partitioning: I enabled hivePartitioning: true to handle partitioned data, which was useful in my case because my Parquet files were structured that way in S3.
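
To make the last two notes concrete, here is a minimal sketch of the hive directory convention, in which each partition directory is named `key=value`. `partitionGlob` and `parseHivePartitions` are hypothetical helpers written for illustration, not nodejs-polars functions, and the `year`/`month`/`day` keys are made-up partition names. The parser only shows which key/value pairs a hive-partitioned path encodes; it is not a description of how the library reads them.

```javascript
// Hypothetical helper: builds a glob with one `**` segment per partition
// level, mirroring the pattern used in the answer.
function partitionGlob(baseUri, levels) {
  const base = baseUri.replace(/\/+$/, ''); // drop trailing slashes
  return [base, ...Array(levels).fill('**'), '*.parquet'].join('/');
}

// Hypothetical helper: extracts key=value partition directories from an
// object key. Values are returned as raw strings.
function parseHivePartitions(objectKey) {
  const partitions = {};
  // The last segment is the file name, not a partition directory.
  for (const segment of objectKey.split('/').slice(0, -1)) {
    const eq = segment.indexOf('=');
    if (eq > 0) {
      partitions[segment.slice(0, eq)] = segment.slice(eq + 1);
    }
  }
  return partitions;
}

// Three partition levels under some-dir:
console.log(partitionGlob('s3://your-bucket-name/some-dir', 3));
// The partition values encoded in one object key:
console.log(
  parseHivePartitions('some-dir/year=2024/month=01/day=15/part-0.parquet'),
);
```

With a layout like this, the number of `**` segments in the URL matches the number of partition directories between the base directory and the files.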