Tags: hadoop, apache-spark, avro, parquet

How to read specific fields from Avro-Parquet file in Java?


How can I read a subset of fields from an Avro-Parquet file in Java?

I thought I could define an Avro schema that is a subset of the stored records' schema and then read them... but I get an exception.

Here is how I tried to solve it.

I have two Avro schemas:

  • ClassA
  • ClassB

The fields of ClassB are a subset of the fields of ClassA.
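
For illustration, the two schemas might look like the following (the field names and namespace are hypothetical; the point is that ClassB simply omits some of ClassA's fields):

```json
{
  "type": "record",
  "name": "ClassA",
  "namespace": "com.example",
  "fields": [
    {"name": "id",    "type": "long"},
    {"name": "name",  "type": "string"},
    {"name": "extra", "type": ["null", "string"], "default": null}
  ]
}
```

```json
{
  "type": "record",
  "name": "ClassB",
  "namespace": "com.example",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
```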

        final Builder<ClassB> builder = AvroParquetReader.builder(files[0].getPath());
        final ParquetReader<ClassB> reader = builder.build();
        //AvroParquetReader<ClassA> readerA = new AvroParquetReader<ClassA>(files[0].getPath());
        ClassB record = null;
        final List<ClassB> list = new ArrayList<>();
        while ((record = reader.read()) != null) {
            list.add(record);
        }

But I get a ClassCastException on the line `record = reader.read()`: Cannot convert ClassA to ClassB.

I suppose the reader is reading the schema from the file.

I tried to pass in the model (i.e. builder.withDataModel), but since ClassB extends org.apache.avro.specific.SpecificRecordBase it throws an exception.

I even tried to set the schema in the configuration and pass it through builder.withConf, but no cigar...


Solution

So... a couple of things:

    • AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.SCHEMA$) can be used to set a projection for the columns that are selected.
    • The reader.read method will still return a ClassA object, but it will null out the fields that are not present in ClassB.

    To use the reader directly you can do the following:

    AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.SCHEMA$);
    final Builder<ClassA> builder = AvroParquetReader.builder(files[0].getPath());
    final ParquetReader<ClassA> reader = builder.withConf(hadoopConf).build();
    
    ClassA record = null;
    final List<ClassA> list = new ArrayList<>();
    while ((record = reader.read()) != null) {
        list.add(record);
    }
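
Since the projected read hands back ClassA instances whose non-projected fields are null, you may want to narrow them to the smaller type yourself. A minimal sketch of that step, using hypothetical POJOs standing in for the Avro-generated classes (the real generated classes would have builders and accessors instead of bare fields):

```java
// Hypothetical stand-in for the Avro-generated ClassA (full schema).
// After a projected read, fields outside the projection come back null.
class ClassA {
    Long id;
    String name;
    String extra; // not in the projection, so null after a projected read

    ClassA(Long id, String name, String extra) {
        this.id = id;
        this.name = name;
        this.extra = extra;
    }
}

// Hypothetical stand-in for the generated ClassB (subset schema).
class ClassB {
    Long id;
    String name;

    ClassB(Long id, String name) {
        this.id = id;
        this.name = name;
    }
}

class Narrow {
    // Copy only the fields ClassB declares; this mirrors what you would do
    // with each ClassA record returned by the projected ParquetReader.
    static ClassB toClassB(ClassA a) {
        return new ClassB(a.id, a.name);
    }
}
```
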
    

    Also, if you're planning to use an InputFormat to read the Avro-Parquet file, there is a convenience method. Here is a Spark example:

            final Job job = Job.getInstance(hadoopConf);
            ParquetInputFormat.setInputPaths(job, pathGlob);
            AvroParquetInputFormat.setRequestedProjection(job, ClassB.SCHEMA$);
    
            @SuppressWarnings("unchecked")
            final JavaPairRDD<Void, ClassA> rdd = sc.newAPIHadoopRDD(job.getConfiguration(), AvroParquetInputFormat.class,
                    Void.class, ClassA.class);