Search code examples
javahadoopmapreducehadoop2

Getting unknown integer value in Map-reduce output file


I am working on a hadoop map-reduce program where i am not setting the mapper and reducer and not setting any other parameter to the Job configuration from my program. I did so assuming that the the Job will send the same output as the input to the output file. But what i found that it is printing some dummy integer value in the output file with every line separated by tab(i guess).

Here is my code:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalMapReduce extends Configured implements Tool {

    public int run(String[] args) throws Exception {

        Job job = new Job(getConf());
        job.setJarByClass(getClass());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        String argg[] = {"/Users/***/Documents/hadoop/input/input.txt",
                            "/Users/***/Documents/hadoop/output_MinimalMapReduce"}; 
        try{
            int exitCode = ToolRunner.run(new MinimalMapReduce(), argg);
            System.exit(exitCode);
        }catch(Exception e){
            e.printStackTrace();
        }
    }
}

And here is the input:

2011 22
2011 25
2012 40
2013 35
2013 38
2014 44
2015 43

And here is the output:

0   2011 22
8   2011 25
16  2012 40
24  2013 35
32  2013 38
40  2014 44
48  2015 43

How can i get the same ouput as the input?


Solution

  • I agree with the @philantrovert's answer, but here is the more details i found. According to the Hadoop- The Definitive Guide, it is TextInputFormat which adds the offset to the line numbers. Here is the documentation about the TextInputFormat:

    TextInputFormat is the default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators (e.g., newline or carriage return), and is packaged as a Text object. So, a file containing the following text:

    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.
    

    is divided into one split of four records. The records are interpreted as the following key-value pairs:

    (0, On the top of the Crumpetty Tree)
    (33, The Quangle Wangle sat,)
    (57, But his face you could not see,)
    (89, On account of his Beaver Hat.)
    

    Clearly, the keys are not line numbers. This would be impossible to implement in general, in that a file is broken into splits at byte, not line, boundaries. Splits are processed independently. Line numbers are really a sequential notion. You have to keep a count of lines as you consume them, so knowing the line number within a split would be possible, but not within the file.

    However, the offset within the file of each line is known by each split independently of the other splits, since each split knows the size of the preceding splits and just adds this onto the offsets within the split to produce a global file offset. The offset is usually sufficient for applications that need a unique identifier for each line. Combined with the file’s name, it is unique within the filesystem. Of course, if all the lines are a fixed width, calculating the line number is simply a matter of dividing the offset by the width.