Tags: hadoop, command-line, hdfs, hadoop2

Hadoop: reciprocal of hdfs dfs -text


In Hadoop, the hdfs dfs -text and hdfs dfs -getmerge commands make it easy to read the contents of compressed files in HDFS from the command line, including piping to other commands for processing (e.g. wc -l <(hdfs dfs -getmerge /whatever 2>/dev/null)).
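
For instance (the path is illustrative), hdfs dfs -text transparently decompresses, so such files can be inspected directly:

    hdfs dfs -text /data/events/part-00000.snappy | head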

Is there a reciprocal for these commands, allowing one to push content to HDFS from the command line, while supporting the same compression and format features as the aforementioned commands? hdfs dfs -put seemingly just makes a raw copy of a local file to HDFS, without applying any compression or changing the container format.

Answers suggesting command-line tools for manipulating such formats and compression algorithms are welcome too. I typically see Snappy-compressed data in CompressedStream files but can't figure out how to convert a plain old text file (one datum per line) into such a file from the command line. I tried snzip (as suggested in this askubuntu question) as well as this snappy command-line tool, but couldn't use either of them to generate Hadoop-friendly Snappy files (or to read the contents of Snappy files ingested into HDFS using Apache Flume).


Solution

  • There is seemingly no reciprocal to hdfs dfs -text, and WebHDFS has no support for (de)compression whatsoever, so I ended up writing my own command-line tool in Java that compresses standard input to standard output in Hadoop-friendly Snappy.

    Code goes like this:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CodecPool;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.Compressor;
    import org.apache.hadoop.io.compress.SnappyCodec;

    public class SnappyCompressor {
        public static void main(String[] args)
        {
            try {
                Configuration conf = new Configuration();
                CompressionCodecFactory ccf = new CompressionCodecFactory(conf);
                CompressionCodec codec =
                    ccf.getCodecByClassName(SnappyCodec.class.getName());
                // Borrow a compressor from the pool and wrap stdout with it
                Compressor comp = CodecPool.getCompressor(codec);
                CompressionOutputStream compOut =
                    codec.createOutputStream(System.out, comp);
                BufferedReader in =
                    new BufferedReader(new InputStreamReader(System.in));

                // Copy stdin to stdout line by line, compressing on the fly
                String line;
                while( (line = in.readLine()) != null ) {
                    compOut.write( line.getBytes() );
                    compOut.write( '\n' );
                }
                compOut.finish();
                compOut.close();
                CodecPool.returnCompressor(comp);
            }
            catch( Exception e ) {
                System.err.print("An exception occurred: ");
                e.printStackTrace(System.err);
            }
        }
    }
    

    Run using hadoop jar <jar path> <class name>.
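
    For example (the jar and file names here are just placeholders), a plain text file can be piped through the tool like this:

    # hypothetical jar and file names
    hadoop jar snappy-compressor.jar SnappyCompressor < data.txt > data.snappy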

    Text data compressed this way can be put to HDFS (e.g. through hdfs dfs -put or via WebHDFS) and then read back with hdfs dfs -text.
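
    A quick round trip to check the result (paths are illustrative; keeping the .snappy extension should let hdfs dfs -text pick the right codec, since it falls back on file-extension-based codec detection for plain compressed files):

    hdfs dfs -put data.snappy /tmp/data.snappy
    hdfs dfs -text /tmp/data.snappy | head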