scala, amazon-s3, parquet, akka-stream, alpakka

How to read a Parquet file from S3 using Akka Streams or Alpakka


I'm trying to read a Parquet file from S3 using Akka Streams, following the official docs, but I get this error: java.io.IOException: No FileSystem for scheme: s3a. This is the code that triggers the exception. I would highly appreciate any clue or example of how to do this correctly.

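import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val path = "s3a://bucketName/path/to/foo/part-00000-656418ee-7cc0-42ee-93e-aaa69ee6f916.c000.snappy.parquet"
val conf: Configuration = new Configuration()
conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, true)
// Resolving the s3a:// path here is where the IOException is thrown
val file = HadoopInputFile.fromPath(new Path(path), conf)
val reader: ParquetReader[GenericRecord] =
  AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
// should read the file records here, but not there yet ...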

Solution

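  • You are most likely missing the hadoop-aws library on your classpath. It provides the S3A filesystem implementation (org.apache.hadoop.fs.s3a.S3AFileSystem) that handles the s3a:// scheme.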

    Have a look here: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html

    This Stack Overflow question also gives more detail on how to set up credentials for S3 access: How do I configure S3 access for org.apache.parquet.avro.AvroParquetReader?

    Once AvroParquetReader is correctly initialized, you can create an Akka Streams Source from it as described in the Alpakka Avro Parquet docs (https://doc.akka.io/docs/alpakka/current/avroparquet.html#source-initiation).
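
Putting the pieces together, here is a minimal sketch of how this might look. It is not a definitive implementation and rests on several assumptions: Akka 2.6+ (so the implicit ActorSystem supplies the materializer), hadoop-aws and akka-stream-alpakka-avroparquet on the classpath in versions that match your Hadoop and Akka versions, and static credentials read from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. The object name, the helper names and the shortened file path are hypothetical. Setting fs.s3a.impl explicitly is often unnecessary once hadoop-aws is present, but it is harmless.

```scala
// build.sbt (version values are placeholders you need to define yourself):
// libraryDependencies ++= Seq(
//   "org.apache.hadoop"  %  "hadoop-aws"                      % hadoopVersion,
//   "com.lightbend.akka" %% "akka-stream-alpakka-avroparquet" % alpakkaVersion
// )

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.alpakka.avroparquet.scaladsl.AvroParquetSource
import akka.stream.scaladsl.{Sink, Source}
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}
import org.apache.parquet.hadoop.ParquetReader
import org.apache.parquet.hadoop.util.HadoopInputFile

object ParquetFromS3 {

  // Hypothetical helper: Hadoop configuration that maps s3a:// to the
  // S3A filesystem shipped in hadoop-aws and passes static credentials.
  def s3aConfiguration(accessKey: String, secretKey: String): Configuration = {
    val conf = new Configuration()
    conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, true)
    conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    conf.set("fs.s3a.access.key", accessKey)
    conf.set("fs.s3a.secret.key", secretKey)
    conf
  }

  // Hypothetical helper: wraps an AvroParquetReader in an Akka Streams Source.
  def parquetSource(path: String, conf: Configuration): Source[GenericRecord, NotUsed] = {
    val file = HadoopInputFile.fromPath(new Path(path), conf)
    val reader: ParquetReader[GenericRecord] =
      AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
    AvroParquetSource(reader)
  }

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("parquet-s3")
    import system.dispatcher // execution context for onComplete

    // Assumption: credentials come from these environment variables.
    val conf = s3aConfiguration(
      sys.env("AWS_ACCESS_KEY_ID"),
      sys.env("AWS_SECRET_ACCESS_KEY")
    )

    // Placeholder path; substitute your own bucket and key.
    parquetSource("s3a://bucketName/path/to/file.snappy.parquet", conf)
      .runWith(Sink.foreach(record => println(record)))
      .onComplete(_ => system.terminate())
  }
}
```

If you would rather not put keys into the Hadoop configuration, the hadoop-aws documentation linked above describes the other credential providers that can be selected through fs.s3a.aws.credentials.provider.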