
How to read Parquet file from S3 without spark? Java


Currently, I am using the Apache ParquetReader for reading local parquet files, which looks something like this:

ParquetReader<GenericData.Record> reader = null;
Path path = new Path("userdata1.parquet");
try {
    reader = AvroParquetReader.<GenericData.Record>builder(path).withConf(new Configuration()).build();
    GenericData.Record record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
} finally {
    if (reader != null) {
        reader.close();
    }
}
However, I am trying to access a parquet file in S3 without downloading it. Is there a way to parse an InputStream directly with the Parquet reader?


Solution

  • Yes, recent versions of Hadoop include support for the S3 filesystem. Use the s3a client from the hadoop-aws library to access the S3 filesystem directly.

    The HadoopInputFile Path should be constructed as s3a://bucket-name/prefix/key, with the authentication credentials access_key and secret_key configured using the properties

    • fs.s3a.access.key
    • fs.s3a.secret.key

    Additionally, you will need these dependent libraries:

    • hadoop-common JAR
    • aws-java-sdk-bundle JAR

    Read more: Relevant configuration properties
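    Putting the pieces together, the reader from the question can be pointed at S3 through HadoopInputFile. A minimal sketch, assuming a hypothetical bucket and key (s3a://my-bucket/data/userdata1.parquet) and placeholder credentials; in a real deployment you would normally rely on the AWS credentials provider chain rather than hard-coded keys:

    ```java
    import org.apache.avro.generic.GenericData;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class S3ParquetReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder credentials for illustration only
            conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
            conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

            // Hypothetical bucket and key
            Path path = new Path("s3a://my-bucket/data/userdata1.parquet");

            // HadoopInputFile resolves the s3a filesystem from the configuration,
            // so the file is streamed from S3 rather than downloaded first
            try (ParquetReader<GenericData.Record> reader =
                     AvroParquetReader.<GenericData.Record>builder(
                             HadoopInputFile.fromPath(path, conf))
                         .build()) {
                GenericData.Record record;
                while ((record = reader.read()) != null) {
                    System.out.println(record);
                }
            }
        }
    }
    ```

    To compile and run this, hadoop-common, hadoop-aws, and aws-java-sdk-bundle must be on the classpath alongside parquet-avro.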