Tags: hadoop, zip, hadoop-streaming

Hadoop streaming with zip input files


I'm trying to run a streaming job where the input files are CSVs inside zip files. I tried using this, but it doesn't seem to work with CDH4 (I get the error: class com.cotdp.hadoop.ZipFileInputFormat not org.apache.hadoop.mapred.InputFormat).

Does anyone know of an input file reader I can use for streaming with zip files? If possible, I'm looking for a multi-file reader (one that can be given the top-level directory).


Solution

  • I ended up writing zipstream.

    Note that it processes only the first file in the zip; I'll probably add support for multiple files later.
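To make the "first file only" limitation concrete, here is a minimal sketch in plain Python using the standard `zipfile` module. It is not zipstream's actual code and it is not a Hadoop input format; the function names `first_entry_lines` and `all_entry_lines` are hypothetical, and the sketch assumes the whole archive fits in memory as bytes. It only illustrates the difference between reading the first entry of an archive and reading every entry.

```python
import io
import zipfile


def _lines_of(archive, member, encoding):
    """Yield decoded lines (without line endings) from one archive member."""
    with archive.open(member) as raw:
        # newline="" keeps the original line endings so we can strip them ourselves.
        for line in io.TextIOWrapper(raw, encoding=encoding, newline=""):
            yield line.rstrip("\r\n")


def first_entry_lines(zip_bytes, encoding="utf-8"):
    """Yield lines from the first regular file in the zip (hypothetical helper)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Directory entries carry no data, so skip them.
        members = [m for m in archive.infolist() if not m.is_dir()]
        if members:
            yield from _lines_of(archive, members[0], encoding)


def all_entry_lines(zip_bytes, encoding="utf-8"):
    """Yield (entry name, line) pairs for every regular file in the zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.infolist():
            if not member.is_dir():
                for line in _lines_of(archive, member, encoding):
                    yield member.filename, line
```

Tagging each line with its entry name in the multi-file variant is one possible design choice: it lets a downstream mapper tell apart rows that came from different CSVs in the same archive.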