
hadoop-streaming.jar adds x'09' at the end of each line


I am trying to merge some *_0 files (part files in HDFS) in an HDFS location using the hadoop-streaming.jar command below.

  hadoop jar $HDPHOME/hadoop-streaming.jar -Dmapred.reduce.tasks=1 -input $INDIR -output $OUTTMP/${OUTFILE}  -mapper cat -reducer cat

This works, except that the output of the above command has x'09' (a tab character) appended to the end of each line.

We have Hive tables defined on top of the part files (which are replaced with the merged file), where the last field is defined as BIGINT. Because the merged file appends x'09' to the last field, the same table definition now shows NULL in the last field in Hue (510408 followed by x'09' is no longer a number).

e.g.

Data in part file.

00000320  7c 35 31 30 34 30 38 0a                           ||510408.|

Data in merged file (result of above command)

00000320  7c 35 31 30 34 30 38 09  0a                       ||510408..|
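
The difference can also be checked without reading a hex dump. The following is a minimal sketch, assuming the merged file has been copied to the local filesystem first; `count_trailing_tabs` is a hypothetical helper name, not part of Hadoop.

```shell
# Hypothetical helper: print how many lines of a local file end with a tab (0x09).
count_trailing_tabs() {
  # printf '\t' produces a literal tab; the trailing $ anchors it at end of line.
  grep -c "$(printf '\t')\$" "$1"
}
```

A result of 0 for a part file and a non-zero count for the merged file would match the two dumps above.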

How do I prevent this? Is there an option I can set in the command to avoid it?

Appreciate your time for any help/pointers.


Solution

  • I found the answer in another post.

    Adding the below option seems to resolve it.

    -D mapred.textoutputformat.separator=<delimiter-of-input-file>
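
    For reference, here is a sketch of how the option might be combined with the command from the question. The wrapper name `merge_parts` and its positional arguments are hypothetical, and `|` is assumed as the delimiter only because the hex dumps show 7c between fields. The -D options are placed before -input and the other streaming options, as in the original command.

```shell
# Hypothetical wrapper around the command from the question.
#   $1 = input directory, $2 = output directory, $3 = field delimiter of the input files
# HDPHOME is the variable used in the question and must already be set.
merge_parts() {
  hadoop jar "$HDPHOME/hadoop-streaming.jar" \
    -D mapred.reduce.tasks=1 \
    -D mapred.textoutputformat.separator="$3" \
    -input "$1" \
    -output "$2" \
    -mapper cat \
    -reducer cat
}
```

    With the variables from the question, it would be called as merge_parts "$INDIR" "$OUTTMP/${OUTFILE}" '|'.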