Tags: java, amazon-s3, parquet

How do I configure S3 access for org.apache.parquet.avro.AvroParquetReader?


I struggled with this for a while and wanted to share my solution. AvroParquetReader is a fine tool for reading Parquet, but with its default configuration it could not find any credentials for my S3 bucket and failed with:

java.io.InterruptedIOException: doesBucketExist on MY_BUCKET: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.AmazonClientException: Unable to load credentials from service endpoint

I want to use credentials providers akin to those used by com.amazonaws.auth.profile.ProfileCredentialsProvider, which works for accessing my S3 bucket, but it is not clear from AvroParquetReader's class definition or documentation how I would achieve this.


Solution

  • This code worked for me. It allowed AvroParquetReader to access S3 using ProfileCredentialsProvider.

    import com.amazonaws.auth.AWSCredentialsProvider;
    import com.amazonaws.auth.profile.ProfileCredentialsProvider;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.hadoop.fs.Path;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    
    ...
    
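    // bucketName and pathName are defined elsewhere; note the s3a:// scheme.
    final String path = "s3a://" + bucketName + "/" + pathName;
    final Configuration configuration = new Configuration();
    // Tell S3A to obtain credentials from ProfileCredentialsProvider.
    configuration.setClass("fs.s3a.aws.credentials.provider", ProfileCredentialsProvider.class,
            AWSCredentialsProvider.class);
    // ParquetReader is Closeable; close it once you are done reading.
    ParquetReader<GenericRecord> parquetReader =
            AvroParquetReader.<GenericRecord>builder(new Path(path)).withConf(configuration).build();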
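
For context on why this helps: the exception in the question names the providers that were tried, in order, before S3A gave up, and setting fs.s3a.aws.credentials.provider replaces that list with the class you name. Below is a toy model of that try-each-in-order behavior using only the JDK. It is a sketch for illustration only: CredentialChainSketch, Source, resolve, and the source names are made up here and are not AWS SDK or Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

public class CredentialChainSketch {

    /** Hypothetical stand-in for a credentials provider: a name plus a lookup. */
    static final class Source {
        final String name;
        final Supplier<Optional<String>> lookup;

        Source(String name, Supplier<Optional<String>> lookup) {
            this.name = name;
            this.lookup = lookup;
        }
    }

    /** Returns the first credential found, or fails listing every source tried. */
    static String resolve(List<Source> chain) {
        List<String> tried = new ArrayList<>();
        for (Source source : chain) {
            Optional<String> found = source.lookup.get();
            if (found.isPresent()) {
                return found.get();
            }
            tried.add(source.name);
        }
        throw new IllegalStateException(
                "No credentials provided by " + String.join(" ", tried));
    }

    public static void main(String[] args) {
        // A default-style chain in which no source has anything to offer.
        List<Source> defaults = new ArrayList<>();
        defaults.add(new Source("environment", Optional::empty));
        defaults.add(new Source("instance-profile", Optional::empty));
        try {
            resolve(defaults);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // A custom chain with a single source that does have a credential.
        List<Source> custom = new ArrayList<>();
        custom.add(new Source("profile-file", () -> Optional.of("key-from-profile")));
        System.out.println(resolve(custom));
    }
}
```

In the real setup, the configuration.setClass(...) call in the answer plays the role of building the custom list: it swaps the providers that found nothing for one that reads your profile.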