Tags: python, hadoop-streaming, parquet, outputformat

Write Parquet Output in a Hadoop Streaming job


Is there a way to write text data into a Parquet file with hadoop-streaming using Python?

Basically, I have a string being emitted from my IdentityMapper that I want to store as a Parquet file.

Any input or examples would be really helpful.


Solution

  • I suspect there's no built-in way of doing this with Hadoop Streaming (I couldn't find one); however, depending on your data set you may be able to use a 3rd-party package such as

    https://github.com/whale2/iow-hadoop-streaming

    To generate Parquet from JSON, your streaming code would emit JSON, and together with an Avro schema you could write your Parquet using ParquetAsJsonOutputFormat.

    Please note that at this stage the package above has some limitations (such as only supporting primitive types).

    Depending on the nature of your data, you may also experiment with the Kite SDK, as briefly explained here:

    https://dwbigdata.wordpress.com/2016/01/31/json-to-parquet-conversion/

    and here:

    https://community.cloudera.com/t5/Kite-SDK-includes-Morphlines/JSON-to-Parquet/td-p/20630

    Cheers
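To make the first approach concrete, here is a minimal sketch of a streaming mapper that emits one JSON record per input line, which ParquetAsJsonOutputFormat could then convert to Parquet. The field names (`key`, `value`) and the tab-separated input format are assumptions for illustration; match them to your actual data and Avro schema.

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: reads tab-separated lines from stdin
# and writes one JSON object per line to stdout. The "key"/"value" field
# names are illustrative and must agree with your Avro schema.
import json
import sys

def map_line(line):
    """Convert a tab-separated key/value line into a JSON string."""
    key, _, value = line.rstrip("\n").partition("\t")
    return json.dumps({"key": key, "value": value})

if __name__ == "__main__":
    for line in sys.stdin:
        print(map_line(line))
```

You would then run a streaming job with `-libjars` pointing at the iow-hadoop-streaming jar and `-outputformat` set to its ParquetAsJsonOutputFormat class, supplying the Avro schema as the project's README describes; I haven't verified the exact class name or configuration keys, so check the repository for the precise invocation.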