I am using Elastic MapReduce. I wonder what will happen if I use the exact same line twice in my main method:
FileInputFormat.addInputPath(job, new Path("s3n://mybucket/data/lolcat/*"));
Will Hadoop process the same files twice? Or will it figure out that they are the same files and skip the duplicates?
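For context, the driver would look roughly like this; the class names, mapper/reducer setup and the output path below are placeholders, not my real job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class DuplicateInputDriver {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(DuplicateInputDriver.class);
        job.setJobName("duplicate-input-test");

        // The exact same input glob added twice:
        FileInputFormat.addInputPath(job, new Path("s3n://mybucket/data/lolcat/*"));
        FileInputFormat.addInputPath(job, new Path("s3n://mybucket/data/lolcat/*"));

        FileOutputFormat.setOutputPath(job, new Path("s3n://mybucket/output/"));
        JobClient.runJob(job);
    }
}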
Here is the source that adds the input paths:
public static void addInputPath(JobConf conf, Path path) {
    path = new Path(conf.getWorkingDirectory(), path);
    String dirStr = StringUtils.escapeString(path.toString());
    String dirs = conf.get("mapred.input.dir");
    conf.set("mapred.input.dir", dirs == null ? dirStr :
        dirs + StringUtils.COMMA_STR + dirStr);
}
So, as you can see, it just appends your path to mapred.input.dir without checking whether it is already there.
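A quick way to check this (just an illustration, run locally) is to add the same glob twice and print what ends up in mapred.input.dir; you should see the path listed twice, comma-separated:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class DuplicatePathCheck {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        FileInputFormat.addInputPath(conf, new Path("s3n://mybucket/data/lolcat/*"));
        FileInputFormat.addInputPath(conf, new Path("s3n://mybucket/data/lolcat/*"));
        // Expect something like:
        // s3n://mybucket/data/lolcat/*,s3n://mybucket/data/lolcat/*
        System.out.println(conf.get("mapred.input.dir"));
    }
}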
Besides, the getSplits function only uses a List and not a Set, so if you pass the same input path N times it will be processed N times. I tested this on a Hadoop streaming job: I get twice the number of mappers if I duplicate the same input path.
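If you want to be sure each glob is only added once, the simplest guard is to deduplicate on the client side before calling addInputPath. Here is a small sketch (the helper name is mine, not part of any Hadoop API):

import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class DedupedInputs {
    // Adds each distinct glob exactly once, preserving the order in which
    // the globs were first given.
    public static void addInputPathsOnce(JobConf conf, String... globs) {
        Set<String> unique = new LinkedHashSet<String>();
        for (String glob : globs) {
            unique.add(glob);
        }
        for (String glob : unique) {
            FileInputFormat.addInputPath(conf, new Path(glob));
        }
    }
}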