Tags: hadoop, apache-spark, avro, parquet

How to read specific fields from Avro-Parquet file in Java?


How can I read a subset of fields from an Avro-Parquet file in Java?

I thought I could define an Avro schema that is a subset of the stored records' schema and then read them... but I get an exception.

Here is how I tried to solve it.

I have two Avro schemas:

  • ClassA
  • ClassB

The fields of ClassB are a subset of the fields of ClassA.
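
For illustration, the two schemas might look like the following (the field names and namespace are hypothetical; the point is that ClassB simply omits some of ClassA's fields):

```json
{
  "type": "record",
  "name": "ClassA",
  "namespace": "com.example",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "extra", "type": ["null", "string"], "default": null}
  ]
}
```

```json
{
  "type": "record",
  "name": "ClassB",
  "namespace": "com.example",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
```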

        final Builder<ClassB> builder = AvroParquetReader.builder(files[0].getPath());
        final ParquetReader<ClassB> reader = builder.build();
        //AvroParquetReader<ClassA> readerA = new AvroParquetReader<ClassA>(files[0].getPath());
        ClassB record = null;
        final List<ClassB> list = new ArrayList<>();
        while ((record = reader.read()) != null) {
            list.add(record);
        }

But I get a ClassCastException on the line `record = reader.read()`: Cannot convert ClassA to ClassB.

I suppose the reader is reading the schema from the file.

I tried to pass in the model (i.e. builder.withDataModel), but since ClassB extends org.apache.avro.specific.SpecificRecordBase it throws an exception.

I even tried to set the schema in the configuration and pass it through builder.withConf, but no cigar...


Solution

So... a couple of things:

    • AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.SCHEMA$) can be used to set a projection for the columns that are selected.
    • The reader.read method will still return a ClassA object, but it will null out the fields that are not present in ClassB.

    To use the reader directly you can do the following:

    AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.SCHEMA$);
    final Builder<ClassA> builder = AvroParquetReader.builder(files[0].getPath());
    final ParquetReader<ClassA> reader = builder.withConf(hadoopConf).build();
    
    ClassA record = null;
    final List<ClassA> list = new ArrayList<>();
    while ((record = reader.read()) != null) {
        list.add(record);
    }
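
Since the projected read hands back ClassA instances whose non-projected fields are null, you may want to narrow them to the smaller type yourself. A minimal sketch of that step, using hypothetical POJOs standing in for the Avro-generated classes (the real generated classes would have builders and accessors instead of bare fields):

```java
// Hypothetical stand-in for the Avro-generated ClassA (full schema).
// After a projected read, fields outside the projection come back null.
class ClassA {
    Long id;
    String name;
    String extra; // not in the projection, so null after a projected read

    ClassA(Long id, String name, String extra) {
        this.id = id;
        this.name = name;
        this.extra = extra;
    }
}

// Hypothetical stand-in for the generated ClassB (subset schema).
class ClassB {
    Long id;
    String name;

    ClassB(Long id, String name) {
        this.id = id;
        this.name = name;
    }
}

class Narrow {
    // Copy only the fields ClassB declares; this mirrors what you would do
    // with each ClassA record returned by the projected ParquetReader.
    static ClassB toClassB(ClassA a) {
        return new ClassB(a.id, a.name);
    }
}
```
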
    

    Also, if you're planning to use an InputFormat to read the Avro-Parquet file, there is a convenience method. Here is a Spark example:

            final Job job = Job.getInstance(hadoopConf);
            ParquetInputFormat.setInputPaths(job, pathGlob);
            AvroParquetInputFormat.setRequestedProjection(job, ClassB.SCHEMA$);
    
            @SuppressWarnings("unchecked")
            final JavaPairRDD<Void, ClassA> rdd = sc.newAPIHadoopRDD(job.getConfiguration(), AvroParquetInputFormat.class,
                    Void.class, ClassA.class);