java, multithreading, hadoop, mapreduce, apache-commons

Is Hadoop's ToolRunner thread-safe?


I would like to trigger a few Hadoop jobs simultaneously. I’ve created a pool of threads using Executors.newFixedThreadPool. The idea is that if the pool size is 2, my code will trigger 2 Hadoop jobs at the same time using ‘ToolRunner.run’. In my testing, I noticed that these 2 threads keep stepping on each other.

When I looked under the hood, I noticed that ToolRunner creates a GenericOptionsParser, which in turn calls a static method, ‘buildGeneralOptions’. This method uses ‘OptionBuilder.withArgName’, which stores its value in a static field called ‘argName’. Because that field is shared by every thread, this doesn’t look thread-safe to me, and I believe it is the root cause of the issues I am running into.
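
To make the suspected hazard concrete, here is a minimal sketch of the pattern described above. It does not use Hadoop or Commons CLI: `StaticArgBuilder` is a hypothetical stand-in that, like the builder in question, keeps its pending value in a static field. The interleaving of two threads is replayed by hand on a single thread so the outcome is deterministic.

```java
final class InterleavingDemo {
    // Hypothetical stand-in (not the real OptionBuilder): the pending
    // argument name lives in a static field shared by every caller.
    static final class StaticArgBuilder {
        private static String argName;

        static void withArgName(String name) {
            argName = name;
        }

        static String create() {
            String built = argName;
            argName = null; // reset for the next caller
            return built;
        }
    }

    // Replays one interleaving that two unsynchronized threads could produce.
    static String replayBadInterleaving() {
        StaticArgBuilder.withArgName("jobA-arg"); // "thread A" sets its name
        StaticArgBuilder.withArgName("jobB-arg"); // "thread B" overwrites it
        return StaticArgBuilder.create();         // "thread A" builds with B's name
    }

    public static void main(String[] args) {
        System.out.println("thread A built: " + replayBadInterleaving());
    }
}
```

In this replay, "thread A" ends up with the name that "thread B" set, which is the kind of cross-talk the question describes.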

Any thoughts?


Solution

  • Confirmed that ToolRunner is NOT thread-safe:

    Original code (which runs into problems):

        public static int run(Configuration conf, Tool tool, String[] args)
                throws Exception {
            if (conf == null) {
                conf = new Configuration();
            }
            GenericOptionsParser parser = new GenericOptionsParser(conf, args);
            //set the configuration back, so that Tool can configure itself
            tool.setConf(conf);

            //get the args w/o generic hadoop args
            String[] toolArgs = parser.getRemainingArgs();
            return tool.run(toolArgs);
        }

    New code (which works):

        public static int run(Configuration conf, Tool tool, String[] args)
                throws Exception {
            if (conf == null) {
                conf = new Configuration();
            }
            GenericOptionsParser parser = getParser(conf, args);

            tool.setConf(conf);

            //get the args w/o generic hadoop args
            String[] toolArgs = parser.getRemainingArgs();
            return tool.run(toolArgs);
        }

        // Only one thread at a time may construct a GenericOptionsParser,
        // because its option setup goes through OptionBuilder's shared state.
        private static synchronized GenericOptionsParser getParser(Configuration conf, String[] args) throws Exception {
            return new GenericOptionsParser(conf, args);
        }
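
The effect of the synchronized `getParser` can be sketched without a Hadoop installation. In the example below, `FakeParser` is a hypothetical stand-in whose constructor goes through unsynchronized static state, and `countMismatches` is an illustrative helper; neither is part of Hadoop. In real code the guarded call would be `new GenericOptionsParser(conf, args)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SerializedParserDemo {
    // Hypothetical stand-in: construction writes to, then reads from,
    // a static field, so it is unsafe when called concurrently.
    static final class FakeParser {
        private static String pendingName;
        final String name;

        FakeParser(String requested) {
            pendingName = requested;  // step 1: stash in shared static state
            Thread.yield();           // widen the race window for the demo
            this.name = pendingName;  // step 2: read it back
        }
    }

    // Same idea as the synchronized getParser above: one construction at a time.
    static synchronized FakeParser getParser(String requested) {
        return new FakeParser(requested);
    }

    // Submits `jobs` tasks to a pool of `poolSize` threads and counts
    // how many parsers ended up with another job's name.
    static int countMismatches(int poolSize, int jobs) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < jobs; i++) {
                final String expected = "job-" + i;
                Callable<Boolean> task = () -> getParser(expected).name.equals(expected);
                results.add(pool.submit(task));
            }
            int mismatches = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (!result.get()) {
                        mismatches++;
                    }
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException(e);
                }
            }
            return mismatches;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(2, 10_000));
    }
}
```

Note that the lock only helps if every thread in the JVM constructs its parser through the same synchronized method; a caller that still invokes the constructor directly bypasses it.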