In Hadoop, the hdfs dfs -text and hdfs dfs -getmerge commands make it easy to read the contents of compressed files in HDFS from the command line, including piping them to other commands for processing (e.g. wc -l <(hdfs dfs -getmerge /whatever 2>/dev/null)).
Is there a reciprocal of these commands, allowing one to push content to HDFS from the command line while supporting the same compression and format features? hdfs dfs -put seemingly just makes a raw copy of a local file to HDFS, with no compression and no change of container format.
Answers suggesting command-line tools for manipulating such formats and compression algorithms are welcome too. I typically see Snappy-compressed data in CompressedStreams but can't figure out how to convert a plain old text file (one datum per line) into such a file from the command line. I tried snzip (as suggested in this askubuntu question) as well as this snappy command-line tool, but could use neither of them to generate Hadoop-friendly Snappy files (or to read the contents of Snappy files ingested into HDFS using Apache Flume).
There is seemingly no reciprocal to hdfs dfs -text, and WebHDFS has no support for (de)compression whatsoever, so I ended up writing my own command-line tool in Java that compresses standard input to standard output in Hadoop-friendly Snappy.
The code goes like this:
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.SnappyCodec;

/** Compresses standard input to Hadoop-friendly Snappy on standard output. */
public class SnappyCompressor {
    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            // Look up the Snappy codec through Hadoop's codec factory.
            CompressionCodecFactory ccf = new CompressionCodecFactory(conf);
            CompressionCodec codec =
                ccf.getCodecByClassName(SnappyCodec.class.getName());
            Compressor comp = CodecPool.getCompressor(codec);
            // Wrap stdout in a compressing stream backed by the pooled compressor.
            CompressionOutputStream compOut =
                codec.createOutputStream(System.out, comp);

            BufferedReader in =
                new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                compOut.write(line.getBytes());
                compOut.write('\n');
            }
            // Flush pending compressed blocks, then release the compressor.
            compOut.finish();
            compOut.close();
            CodecPool.returnCompressor(comp);
        } catch (Exception e) {
            System.err.print("An exception occurred: ");
            e.printStackTrace(System.err);
        }
    }
}
Run it using hadoop jar <jar path> <class name>.
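For example, assuming the class is packaged in a jar named mytools.jar (the jar name and paths below are hypothetical), the compressed output can be piped straight into HDFS, since hdfs dfs -put reads from standard input when the source is given as -:
hadoop jar mytools.jar SnappyCompressor < data.txt | hdfs dfs -put - /data/data.txt.snappy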
Text data compressed this way can be put into HDFS (e.g. with hdfs dfs -put or via WebHDFS) and then read back with hdfs dfs -text.
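As a quick sanity check (reusing the hypothetical paths from the example above), reading the file back should show the original lines:
hdfs dfs -text /data/data.txt.snappy | head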