Currently, I am using the Apache ParquetReader for reading local parquet files, which looks something like this:
ParquetReader<GenericData.Record> reader = null;
Path path = new Path("userdata1.parquet");
try {
    reader = AvroParquetReader.<GenericData.Record>builder(path)
            .withConf(new Configuration()).build();
    GenericData.Record record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
} finally {
    if (reader != null) reader.close();
}
However, I am trying to access a parquet file through S3 without downloading it. Is there a way to parse Inputstream directly with parquet reader?
Yes, recent versions of Hadoop include support for the S3 filesystem. Use the s3a client from the hadoop-aws library to access S3 directly.

The HadoopInputFile Path should be constructed as s3a://bucket-name/prefix/key, with the authentication credentials access_key and secret_key configured via the properties fs.s3a.access.key and fs.s3a.secret.key.
Additionally, you will need these dependent libraries:

hadoop-common JAR
aws-java-sdk-bundle JAR

Read more: Relevant configuration properties
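Putting this together, a minimal sketch of reading straight from S3 (the bucket name, key, and credential values are placeholders you would replace with your own):

```java
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class S3ParquetRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // s3a credentials; placeholder values shown here.
        conf.set("fs.s3a.access.key", "<access_key>");
        conf.set("fs.s3a.secret.key", "<secret_key>");

        // s3a://bucket-name/prefix/key — example bucket and key.
        Path path = new Path("s3a://bucket-name/prefix/userdata1.parquet");

        // HadoopInputFile resolves the s3a filesystem from the configuration,
        // so records are streamed from S3 rather than downloaded first.
        try (ParquetReader<GenericData.Record> reader =
                 AvroParquetReader.<GenericData.Record>builder(
                         HadoopInputFile.fromPath(path, conf))
                     .build()) {
            GenericData.Record record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}
```

Note that the builder(InputFile) overload shown here requires a reasonably recent parquet-avro; the older builder(Path) overload also works once the s3a properties are set on the Configuration passed to withConf.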