Tags: java, python, mapreduce, avro

Read AVRO file using Python


I have an Avro file (created by Java), and it seems to be some kind of compressed file for Hadoop/MapReduce. I want to 'unzip' (deserialize) it to a flat file, one record per row.

I learned that there is an Avro package for Python, and I installed it correctly. I then ran the example to read the Avro file, but it failed with the error below, and I am wondering what goes wrong in even the simplest example. Can anyone help me interpret the error below?

>>> reader = DataFileReader(open("/tmp/Stock_20130812104524.avro", "r"), DatumReader())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python2.7/site-packages/avro/datafile.py", line 240, in __init__
    raise DataFileException('Unknown codec: %s.' % self.codec)
avro.datafile.DataFileException: Unknown codec: snappy.
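The error suggests that the avro package could not find a snappy implementation. As far as I know, the package only enables the snappy codec when `import snappy` succeeds, so a quick first check is whether that module is importable at all. A minimal sketch, assuming Python 3 (the helper name `module_available` is made up for illustration):

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


# "snappy" is the module name that the python-snappy package installs.
print("snappy importable:", module_available("snappy"))
```

If this prints False, the codec error is expected, and installing python-snappy is the thing to try.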

By the way, if I run head on the file or open it in vi, I can see the schema definition in the first few lines, together with some unreadable characters, probably the compressed content. The start of the raw Avro file looks like this:

bj^A^D^Tavro.codec^Lsnappy^Vavro.schemaØ${"type":"record","name":"Stoc...
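The readable part of that dump lines up with the header of an Avro object container file: per the Avro specification, a 4-byte magic (`Obj` followed by byte 0x01), then a map of metadata entries such as `avro.codec` and `avro.schema`, with lengths stored as zigzag-encoded variable-length longs. The control characters between the strings are those length prefixes. A minimal sketch of decoding just the metadata without the avro package, assuming Python 3 (the function names are made up for illustration, and this is not the library's own reader):

```python
MAGIC = b"Obj\x01"  # magic bytes defined by the Avro specification


def read_long(stream):
    """Decode one Avro long: variable-length, zigzag-encoded."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of header")
        value = byte[0]
        accum |= (value & 0x7F) << shift
        if not (value & 0x80):  # high bit clear marks the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    return stream.read(read_long(stream))


def read_header_metadata(stream):
    """Return the header metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a block byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta
```

Usage would look like `read_header_metadata(open(path, "rb"))["avro.codec"]`, which tells you which codec the writer used before you try to read any records.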

I don't know whether the schema is needed to read the Avro file, something like below:

import avro.schema
from avro.datafile import DataFileReader
from avro.io import DatumReader
schema = avro.schema.parse(open("schema").read())
# use the schema somehow...
reader = DataFileReader(open("Stock_20130812104524.avro", "rb"), DatumReader())

Thanks in advance.


Solution

  • The file was written with the snappy codec, and the avro package can only read it once python-snappy is installed. The problem is that on macOS, without the Xcode command line tools installed, you cannot get snappy working. You can check by typing gcc at the command prompt to see whether it is installed. If not, type xcode-select --install to install it. Then installing python-snappy should work. Thanks Bin!
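Once python-snappy is installed, the original goal (one record per row in a flat file) could be sketched as below. This is only an illustration under some assumptions: Python 3, JSON lines as the flat format, and the helper names `write_records_as_lines` and `avro_to_flat_file`, which are made up here. No separate schema file is passed, because the writer's schema is stored in the file header and `DatumReader()` without arguments uses it.

```python
import json


def write_records_as_lines(records, out):
    """Write each record (a dict) as one JSON line; return the row count."""
    count = 0
    for record in records:
        # default=str is a fallback for values json cannot encode (e.g. bytes)
        out.write(json.dumps(record, default=str) + "\n")
        count += 1
    return count


def avro_to_flat_file(avro_path, out_path):
    """Deserialize an Avro container file into a one-record-per-row file."""
    # Imported here so the helper above stays usable without avro installed.
    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    # Avro files are binary, so open in "rb" mode.
    with open(avro_path, "rb") as src, open(out_path, "w") as dst:
        reader = DataFileReader(src, DatumReader())
        try:
            return write_records_as_lines(reader, dst)
        finally:
            reader.close()
```

For the file in the question, a call would look like `avro_to_flat_file("/tmp/Stock_20130812104524.avro", "/tmp/Stock.jsonl")`, where the output path is just an example.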