
How to read Parquet file from S3 without spark? Java


Currently, I am using the Apache ParquetReader for reading local parquet files, which looks something like this:

ParquetReader<GenericData.Record> reader = null;
Path path = new Path("userdata1.parquet");
try {
    reader = AvroParquetReader.<GenericData.Record>builder(path).withConf(new Configuration()).build();
    GenericData.Record record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
} finally {
    if (reader != null) {
        reader.close();
    }
}
However, I am trying to access a parquet file in S3 without downloading it. Is there a way to parse an InputStream directly with the Parquet reader?


Solution

  • Yes, recent versions of Hadoop include support for the S3 filesystem. Use the s3a client from the hadoop-aws library to access the S3 filesystem directly.

    The HadoopInputFile Path should be constructed as s3a://bucket-name/prefix/key, with the authentication credentials access_key and secret_key configured using the properties

    • fs.s3a.access.key
    • fs.s3a.secret.key

    Additionally, you will need these dependent libraries:

    • hadoop-common JAR
    • aws-java-sdk-bundle JAR

    Read more: Relevant configuration properties
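    Putting the pieces together, the reader from the question can be pointed at S3 through HadoopInputFile. A minimal sketch, assuming a hypothetical bucket and key (s3a://my-bucket/data/userdata1.parquet) and placeholder credentials; in a real deployment you would normally rely on the AWS credentials provider chain rather than hard-coded keys:

    ```java
    import org.apache.avro.generic.GenericData;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    public class S3ParquetReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder credentials for illustration only
            conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY");
            conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY");

            // Hypothetical bucket and key
            Path path = new Path("s3a://my-bucket/data/userdata1.parquet");

            // HadoopInputFile resolves the s3a filesystem from the configuration,
            // so the file is streamed from S3 rather than downloaded first
            try (ParquetReader<GenericData.Record> reader =
                     AvroParquetReader.<GenericData.Record>builder(
                             HadoopInputFile.fromPath(path, conf))
                         .build()) {
                GenericData.Record record;
                while ((record = reader.read()) != null) {
                    System.out.println(record);
                }
            }
        }
    }
    ```

    To compile and run this, hadoop-common, hadoop-aws, and aws-java-sdk-bundle must be on the classpath alongside parquet-avro.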