
Is there a simple way to migrate from SequenceFiles to Avro?


I'm currently running Hadoop MapReduce jobs on SequenceFiles of Writables. The same Writable types are also used for serialization in the non-Hadoop parts of the system.

This approach is hard to maintain, mainly because there is no schema and version changes have to be handled manually.

It appears that Apache Avro handles these issues.

The problem is that during the migration I will have data in both formats. Is there a simple way to handle the migration?


Solution

  • Generally, nothing stops you from using Avro data and SequenceFiles side by side. Use whichever InputFormat matches each data set, and for output it makes sense to write Avro whenever practical. If your input comes in different formats, take a look at MultipleInputs. You will still have to implement separate Mappers, but that's to be expected given that the map input key/value types differ.

    Moving to Avro is a wise move. If you have the time and hardware capacity, it might even be worthwhile to explicitly convert your data from SequenceFile to Avro right away. You can do this in any language that supports both Avro and SequenceFiles. Java certainly does, but Pig is also pretty handy for this.

    The user-contributed PiggyBank project has functionality for reading a SequenceFile, and then it is simply a matter of using AvroStorage from the same PiggyBank project with the appropriate Avro schema to get your Avro file.

    If only Pig supported loading Avro schemas from a file! If you use Pig, you will unfortunately have to write scripts that explicitly contain the Avro schema, which can be a bit annoying.
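
To make the MultipleInputs suggestion above more concrete, here is a minimal sketch of a job that reads a legacy SequenceFile directory and an already-migrated Avro directory in one pass. It assumes the newer `org.apache.hadoop.mapreduce` API with the avro-mapred library on the classpath. The class name `MixedFormatJob`, the Text/LongWritable types of the legacy data, and the Avro field names `key` and `value` are all hypothetical stand-ins for your own types.

```java
import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public final class MixedFormatJob {

  /** Reads legacy SequenceFile records; assumes Text keys and LongWritable values. */
  public static class LegacyMapper extends Mapper<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void map(Text key, LongWritable value, Context context)
        throws IOException, InterruptedException {
      context.write(key, value);
    }
  }

  /** Reads Avro records; assumes fields "key" (string) and "value" (long). */
  public static class AvroMapper
      extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, LongWritable> {
    @Override
    protected void map(AvroKey<GenericRecord> key, NullWritable ignored, Context context)
        throws IOException, InterruptedException {
      GenericRecord record = key.datum();
      // Both mappers emit the same intermediate types, so downstream code is shared.
      context.write(new Text(record.get("key").toString()),
          new LongWritable((Long) record.get("value")));
    }
  }

  public static Job buildJob(Configuration conf, Path legacyDir, Path avroDir, Path outDir)
      throws IOException {
    Job job = Job.getInstance(conf, "mixed-format-input");
    job.setJarByClass(MixedFormatJob.class);
    // One InputFormat and one Mapper per input directory.
    MultipleInputs.addInputPath(job, legacyDir, SequenceFileInputFormat.class, LegacyMapper.class);
    MultipleInputs.addInputPath(job, avroDir, AvroKeyInputFormat.class, AvroMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    // Output format is left at the default for brevity; an Avro output format
    // could be configured here instead.
    FileOutputFormat.setOutputPath(job, outDir);
    return job;
  }

  public static void main(String[] args) throws Exception {
    Job job = buildJob(new Configuration(),
        new Path(args[0]), new Path(args[1]), new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because both mappers normalize to the same intermediate key/value types, any reducer stays format-agnostic, and the legacy input path can simply be dropped once the last SequenceFile has been converted.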
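
For the one-off conversion mentioned above, plain Java is also an option when you would rather not embed the schema in a Pig script. The following is a rough sketch only: it assumes the SequenceFile holds Text keys and LongWritable values, and the class name `SeqToAvro` and the `Entry` schema are made up for illustration. Real Writable types would need their own field-by-field mapping.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public final class SeqToAvro {

  // Hypothetical target schema; replace with the schema for your own records.
  static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Entry\",\"fields\":["
      + "{\"name\":\"key\",\"type\":\"string\"},"
      + "{\"name\":\"value\",\"type\":\"long\"}]}");

  /** Maps one legacy key/value pair onto an Avro record. */
  static GenericRecord toRecord(Text key, LongWritable value) {
    GenericRecord record = new GenericData.Record(SCHEMA);
    record.put("key", key.toString());
    record.put("value", value.get());
    return record;
  }

  /** Copies every pair from the SequenceFile into an Avro container file; returns the count. */
  public static long convert(Configuration conf, Path in, Path out) throws IOException {
    FileSystem fs = out.getFileSystem(conf);
    long count = 0;
    try (SequenceFile.Reader reader =
             new SequenceFile.Reader(conf, SequenceFile.Reader.file(in));
         DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
      writer.create(SCHEMA, fs.create(out));
      Text key = new Text();
      LongWritable value = new LongWritable();
      while (reader.next(key, value)) {
        writer.append(toRecord(key, value));
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    long n = convert(new Configuration(), new Path(args[0]), new Path(args[1]));
    System.out.println("Converted " + n + " records");
  }
}
```

This runs in a single process, so it suits modest data volumes or spot checks. For large data sets, the same `toRecord` mapping could be moved into a map-only job.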