hadoop, elastic-map-reduce

What happens if I add the same input path twice to a Hadoop job?


I am using Elastic MapReduce. I wonder what will happen if I call the exact same line twice in my main method:

FileInputFormat.addInputPath(job, new Path( "s3n://mybucket/data/lolcat/*"));

Will Hadoop process the same files twice? Or will it figure out that they are the same files and skip the duplicates?


Solution

  • Here is the source that adds the input paths:

    
    public static void addInputPath(JobConf conf, Path path ) {
        path = new Path(conf.getWorkingDirectory(), path);
        String dirStr = StringUtils.escapeString(path.toString());
        String dirs = conf.get("mapred.input.dir");
        conf.set("mapred.input.dir", dirs == null ? dirStr :
          dirs + StringUtils.COMMA_STR + dirStr);
    }
    

    So as you can see, it just appends your input to mapred.input.dir without checking what is already there.

    Besides, the getSplits function only uses a List, not a Set, so if you add the same input path N times it will be processed N times. I tested this on a Hadoop streaming job: I get twice the number of mappers if I duplicate the same input path.
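
If you want to avoid the double processing, one option is to deduplicate the paths in your own driver before handing them to Hadoop. The following is a minimal sketch that runs without Hadoop on the classpath. `appendPath` is only a stand-in that imitates the append-only string handling shown above (without the escaping step), and `uniquePaths` is a hypothetical helper. Neither is part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class InputPathDedup {

    // Stand-in for the append-only logic of addInputPath: the new path is
    // joined onto the existing comma-separated value with no duplicate check.
    // (Assumption: escaping of the path string is left out for brevity.)
    static String appendPath(String dirs, String path) {
        return dirs == null ? path : dirs + "," + path;
    }

    // Hypothetical helper: keep only the first occurrence of each path,
    // preserving the order in which the paths were requested.
    static List<String> uniquePaths(List<String> requested) {
        return new ArrayList<>(new LinkedHashSet<>(requested));
    }

    public static void main(String[] args) {
        String p = "s3n://mybucket/data/lolcat/*";

        // Adding the same path twice leaves two entries in the property value.
        String dirs = appendPath(appendPath(null, p), p);
        System.out.println(dirs);

        // Deduplicating first leaves a single entry to register with the job.
        List<String> once = uniquePaths(Arrays.asList(p, p));
        System.out.println(once);
    }
}
```

In a real driver you would then loop over the deduplicated list and call FileInputFormat.addInputPath once per entry. Note that this only catches paths that are textually identical. Two different globs that match the same files would still be processed twice.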