Specify N in Hadoop Streaming when using NLineInputFormat


If I use NLineInputFormat in Hadoop Streaming, how do I specify N?

hadoop jar /home/Software/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-D stream.non.zero.exit.is.failure=false \
-D mapred.map.tasks=2 \
-D mapred.reduce.tasks=1 \
-files /home/hello.py \
-input /hello.txt \
-output /result \
-mapper "/home/.conda/envs/perimeter-pytorch2/bin/python hello.py" \
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
-????

Which option specifies N?


Solution

  • The newer-API class is org.apache.hadoop.mapreduce.lib.input.NLineInputFormat (the mapred package is the older MapReduce API).

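    Per the Javadoc for that class, you'd pass the configuration option -D mapreduce.input.lineinputformat.linespermap=N. Generic options such as -D must be placed before the streaming options (-input, -output, -mapper, ...), not at the end of the command.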

    If you'd like to use PyTorch with HDFS data, I'd suggest using Spark or Flink over MapReduce.
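
As a rough sketch, the command from the question with the option added might look like the following. N=100 is an arbitrary value chosen only for illustration, and this assumes the mapred.lib.NLineInputFormat class used in the question reads the same property on Hadoop 2.6.0. The script only assembles and prints the arguments, so it can be inspected without a cluster.

```shell
# Illustrative value for N (lines per map split); not from the original post.
N=100

# Assemble the command as positional parameters so quoting is preserved.
# Generic options (-D, -files) come first, streaming options afterwards.
set -- hadoop jar /home/Software/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
  -D stream.non.zero.exit.is.failure=false \
  -D mapred.map.tasks=2 \
  -D mapred.reduce.tasks=1 \
  -D mapreduce.input.lineinputformat.linespermap="$N" \
  -files /home/hello.py \
  -input /hello.txt \
  -output /result \
  -mapper "/home/.conda/envs/perimeter-pytorch2/bin/python hello.py" \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat

# Dry run: print one argument per line.
printf '%s\n' "$@"
```

To actually submit the job, replace the final printf with "$@" on a machine where hadoop is on the PATH.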