Tags: python, hadoop-streaming, parquet, outputformat

Write Parquet Output in a Hadoop Streaming job


Is there a way to write text data into a Parquet file with hadoop-streaming using Python?

Basically, I have a string being emitted from my IdentityMapper that I want to store as a Parquet file.

Any input or examples would be really helpful.


Solution

  • I suspect there's no built-in way of doing this with Hadoop Streaming (I couldn't find one); however, depending on your data set you may be able to use a 3rd-party package such as

    https://github.com/whale2/iow-hadoop-streaming

    To generate Parquet from JSON, your streaming code would emit JSON, and together with an Avro schema you could write your Parquet using ParquetAsJsonOutputFormat.

    Please note that at this stage the package above has some limitations (such as only supporting primitive types).

    Depending on the nature of your data, you may also experiment with the Kite SDK, as briefly explained here:

    https://dwbigdata.wordpress.com/2016/01/31/json-to-parquet-conversion/

    and here:

    https://community.cloudera.com/t5/Kite-SDK-includes-Morphlines/JSON-to-Parquet/td-p/20630

    Cheers
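To make the first approach concrete, here is a minimal sketch of a streaming mapper that emits one JSON record per input line, which ParquetAsJsonOutputFormat could then convert to Parquet. The field names (`key`, `value`) and the tab-separated input format are assumptions for illustration; match them to your actual data and Avro schema.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: reads tab-separated lines from stdin
# and writes one JSON object per line to stdout. The "key"/"value" field
# names are illustrative and must agree with your Avro schema.
import json
import sys

def map_line(line):
    """Convert a tab-separated key/value line into a JSON string."""
    key, _, value = line.rstrip("\n").partition("\t")
    return json.dumps({"key": key, "value": value})

if __name__ == "__main__":
    for line in sys.stdin:
        print(map_line(line))
```

You would then run a streaming job with `-libjars` pointing at the iow-hadoop-streaming jar and `-outputformat` set to its ParquetAsJsonOutputFormat class, supplying the Avro schema as the project's README describes; I haven't verified the exact class name or configuration keys, so check the repository for the precise invocation.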