Tags: hadoop, hdfs, snappy

Hadoop HDFS compress in place


So, there's a bunch of log files in /var/log/… on HDFS that can be either uncompressed or compressed with snappy.

If they don't end in .snappy I'd like to compress them and rename them with that extension. But I'd like to do this with data locality, and preferably get the names right.

I tried the hadoop streaming approach.

HAD=/usr/lib/hadoop
$HAD/bin/hadoop jar $HAD/hadoop-streaming.jar \
-D mapred.output.compress=true \
-D madred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
-D mapred.reduce.tasks=0 \
-input /var/log/… -output /user/hadoop/working \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper

But that gives a bunch of part files and seems to process things line by line. It's also opting for deflate for some reason, so I get files like part-00000.deflate etc. The inputs were named like app_123456789_0123_1. I would have liked app_123456789_0123_1.snappy, but each part didn't even map cleanly to a whole input file, nor was it Snappy-compressed.

The FAQ says you could generate an input file of file names and then run a task on each name; there's no simple snappy compressor tool to call from such a task, though. The other option it describes looks like it would be even better (though I'd rather not have to build a jar, I suppose I can), but it says the names aren't going to be preserved, which I think makes it unhelpful for me.
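To be concrete, I assume the per-file task would boil down to something like this untested sketch against the standard Hadoop compression API (the CompressOneFile class name and the argument handling are just made up for illustration, and it needs the native snappy libraries wherever it runs):

// Rough sketch: compress one HDFS file with Snappy, keeping its name and adding .snappy.
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressOneFile {
    public static void main(String[] args) throws Exception {
        Path in = new Path(args[0]);            // e.g. /var/log/…/app_123456789_0123_1
        Path out = new Path(args[0] + ".snappy");

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        CompressionCodec codec = ReflectionUtils.newInstance(SnappyCodec.class, conf);

        try (InputStream is = fs.open(in);
             OutputStream os = codec.createOutputStream(fs.create(out, false))) {
            IOUtils.copyBytes(is, os, conf);    // stream the whole file through the codec
        }
        fs.delete(in, false);                   // drop the uncompressed original
    }
}

That still means building a jar and driving one task per file name myself, which is what I was hoping to avoid.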

Is there a way to do this that doesn't involve pulling each file out of HDFS, compressing it locally, and putting it back, and that handles the file names?


Solution

  • Log files are continuously generated, so I'm not sure it makes sense to use Hadoop streaming to read them: it's a one-time action and won't keep track of which files have already been read if you run it again.

    Also, if all you want is the application_1234 files, you can enable YARN log aggregation with compression in the Hadoop configuration, and that will handle uploading just the YARN logs to HDFS for you.
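    Roughly, that's a couple of properties in yarn-site.xml; I'm going from memory here, so check the names against your distribution, and as far as I recall the aggregated-log format supports gz and lzo rather than snappy:

    <!-- yarn-site.xml (sketch): aggregate finished containers' logs into HDFS, compressed -->
    <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
    </property>
    <property>
      <name>yarn.nodemanager.log-aggregation.compression-type</name>
      <value>gz</value> <!-- none | lzo | gz -->
    </property>
    <property>
      <name>yarn.nodemanager.remote-app-log-dir</name>
      <value>/tmp/logs</value> <!-- HDFS directory the aggregated logs land in; this is the default -->
    </property>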

    If you would like those logs to be continuously compressed and uploaded to HDFS, you should consider using at least Flume, which is included in the major Hadoop distributions.
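    As a sketch (the agent name, directories, and roll size are placeholders, and the spooling-directory source assumes the files are already closed), a minimal Flume agent writing snappy-compressed files to HDFS might look like:

    # flume.conf: pick up finished files from a spool directory, write compressed streams to HDFS
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /var/log/myapp
    a1.sources.r1.channels = c1

    a1.channels.c1.type = file

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /user/hadoop/logs
    a1.sinks.k1.hdfs.fileType = CompressedStream
    a1.sinks.k1.hdfs.codeC = snappy
    a1.sinks.k1.hdfs.rollSize = 134217728
    a1.sinks.k1.hdfs.rollCount = 0
    a1.sinks.k1.hdfs.rollInterval = 0

    Keep in mind the HDFS sink names its output files from a prefix plus a timestamp/counter, so you get control over size and codec but not an exact one-to-one mapping to the original file names.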

    If you are comfortable installing other software, look at either Fluentd or Filebeat for log collection, and then NiFi for handling the transfer to HDFS in reasonable file sizes and a compression format of your choice. Kafka can also sit between the log collector and NiFi. With these options you get good control over filenames, and you could alternatively ship the logs to a proper search platform like Solr or Elasticsearch.

    Regarding your comment: it's been a while since I set these tools up, but I believe you can use a filename regex pattern to explicitly capture the files you want to include or exclude.
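    In Filebeat, for example, that would be something along these lines (from memory, so verify the option names in the current docs; the paths are placeholders):

    # filebeat.yml (sketch)
    filebeat.inputs:
      - type: log
        paths:
          - /var/log/myapp/app_*
        exclude_files: ['\.snappy$']   # skip anything already compressed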