Tags: hadoop, avro, parquet

Parquet-MR AvroParquetWriter - how to convert data to Parquet (with Specific Mapping)


I'm working on a tool for converting data from a homegrown format to Parquet and JSON (for use in different settings with Spark, Drill and MongoDB), using Avro with Specific Mapping as the stepping stone. I have to support conversion of new data on a regular basis, and on client machines, which is why I'm writing my own standalone conversion tool with an (Avro|Parquet|JSON) switch instead of using Drill, Spark or other tools as converters, as I probably would if this were a one-time job. I'm basing the whole thing on Avro because it seems like the easiest way to get conversion to Parquet and JSON under one roof.

I used Specific Mapping to profit from static type checking, wrote an IDL, converted that to a schema.avsc, generated classes and set up a sample conversion using the specific constructor, but now I'm stuck configuring the writers. All the Avro-to-Parquet conversion examples I could find [0] use AvroParquetWriter with deprecated signatures (mostly: Path file, Schema schema) and Generic Mapping.

AvroParquetWriter has only one non-deprecated constructor, with this signature:

AvroParquetWriter(
    Path file, 
    WriteSupport<T> writeSupport,
    CompressionCodecName compressionCodecName,
    int blockSize, 
    int pageSize, 
    boolean enableDictionary,
    boolean enableValidation, 
    WriterVersion writerVersion,
    Configuration conf
)

Most of the parameters are not hard to figure out, but WriteSupport<T> writeSupport throws me off. I can't find any further documentation or an example.
Staring at the source of AvroParquetWriter, I see GenericData model pop up a few times, but only one line mentioning SpecificData: GenericData model = SpecificData.get();

So I have a few questions:

1) Does AvroParquetWriter not support Avro Specific Mapping? Or does it support it by means of that SpecificData.get() method? The comment "Utilities for generated Java classes and interfaces." above SpecificData.class seems to suggest that, but how exactly should I proceed?

2) What's going on in the AvroParquetWriter constructor? Is there an example or some documentation to be found somewhere?

3) More specifically: the signature of the WriteSupport method asks for Schema avroSchema and GenericData model. What does GenericData model refer to? Maybe I can't see the forest for all the trees here... (my current guess is sketched below)
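
For what it's worth, my current working hypothesis, which may well be wrong (hence the questions), is that model selects the Avro data model, and that SpecificData qualifies because it extends GenericData. Assuming the three-argument AvroWriteSupport constructor and AvroSchemaConverter from parquet-avro 1.8.x behave the way I'm guessing, the wiring would look roughly like this:

// Hypothesis only: SpecificData extends GenericData, so it should be
// accepted wherever a GenericData "model" is expected; the model tells
// the write support how to access record fields (generic vs. specific).
Schema schema = MyData.getClassSchema();
WriteSupport<MyData> writeSupport = new AvroWriteSupport<>(
        new AvroSchemaConverter().convert(schema), // Parquet MessageType
        schema,                                    // Avro schema
        SpecificData.get());                       // Specific Mapping model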

To give an example of what I'm aiming for, my central piece of Avro conversion code currently looks like this:

DatumWriter<MyData> avroDatumWriter = new SpecificDatumWriter<>(MyData.class);
DataFileWriter<MyData> dataFileWriter = new DataFileWriter<>(avroDatumWriter);
dataFileWriter.create(schema, avroOutput);
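
That writer is then used in the usual append/close cycle (a sketch; myDataInstance is a placeholder for a record built with the generated constructor):

// Sketch: myDataInstance stands in for a record created via the
// generated specific constructor.
dataFileWriter.append(myDataInstance);
dataFileWriter.close();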

The Parquet equivalent currently looks like this:

AvroParquetWriter<SpecificRecord> parquetWriter = new AvroParquetWriter<>(parquetOutput, schema);

but this is no more than a beginning; it's modeled after the examples I found, uses the deprecated constructor, and will have to change anyway.

Thanks,
Thomas

[0] Hadoop: The Definitive Guide, O'Reilly; https://gist.github.com/hammer/76996fb8426a0ada233e; http://www.programcreek.com/java-api-example/index.php?api=parquet.avro.AvroParquetWriter


Solution

  • Try AvroParquetWriter.builder:

    MyData obj = ...; // an instance of the generated Avro class
    ParquetWriter<MyData> pw = AvroParquetWriter.<MyData>builder(file)
            .withSchema(obj.getSchema())
            .build();
    pw.write(obj);
    pw.close();
    

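If you need Specific Mapping explicitly (plus compression and the like), the builder accepts a data model as well. A sketch, assuming the withDataModel and withCompressionCodec options on the builder in parquet-avro 1.8.x:

    // Sketch for parquet-avro 1.8.x: select the Specific data model
    // explicitly and enable Snappy compression; file is a hadoop Path.
    ParquetWriter<MyData> pw = AvroParquetWriter.<MyData>builder(file)
            .withSchema(MyData.getClassSchema())
            .withDataModel(SpecificData.get())
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .build();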