Tags: hadoop, mapreduce, accumulo

Multiple table input for mapreduce


I am thinking of running a MapReduce job using Accumulo tables as input.
Is there a way to have two different tables as input, the same way multiple file inputs can be added with addInputPath?
Or is it possible to have one input from a file and the other from a table with AccumuloInputFormat?


Solution

  • You probably want to take a look at AccumuloMultiTableInputFormat. The Accumulo manual demonstrates how to use it.

    Example Usage:

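    import java.util.HashMap;
    import java.util.Map;
    import org.apache.accumulo.core.client.mapreduce.AccumuloMultiTableInputFormat;
    import org.apache.accumulo.core.client.mapreduce.InputTableConfig;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    // Register the multi-table format itself, not AccumuloInputFormat.
    // Assumes job is an org.apache.hadoop.mapreduce.Job (hence setInputFormatClass).
    job.setInputFormatClass(AccumuloMultiTableInputFormat.class);

    AccumuloMultiTableInputFormat.setConnectorInfo(job, user, new PasswordToken(pass));
    // Mock instance, as in the manual's example; suitable for tests only.
    AccumuloMultiTableInputFormat.setMockInstance(job, INSTANCE_NAME);

    // One InputTableConfig per input table.
    InputTableConfig tableConfig1 = new InputTableConfig();
    InputTableConfig tableConfig2 = new InputTableConfig();

    // Keys are the names of the tables to read from.
    Map<String, InputTableConfig> configMap = new HashMap<String, InputTableConfig>();
    configMap.put(table1, tableConfig1);
    configMap.put(table2, tableConfig2);

    AccumuloMultiTableInputFormat.setInputTableConfigs(job, configMap);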
    

    The unit test for AccumuloMultiTableInputFormat has some additional information.

    Note that, unlike normal multiple inputs, you can't specify a different Mapper to run on each table. That's not a massive problem in this case, though, since the incoming Key/Value types are the same and you can use:

    // c is the Mapper.Context passed to map()
    RangeInputSplit split = (RangeInputSplit)c.getInputSplit();
    String tableName = split.getTableName();
    

    to work out which table the records are coming from in your mapper (taken from the Accumulo manual).
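
    Since a single Mapper has to handle every table, one way to keep the per-table logic separate is to dispatch on the table name. The following is only a minimal sketch of that idea in plain Java, with the Hadoop and Accumulo types left out so it runs on its own. The class names, the "users" and "orders" table names, and the (row, value) handler signature are all made up for illustration; in a real mapper the table name would come from split.getTableName() as shown above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Routes a record to per-table logic, standing in for "one Mapper per table". */
final class TableDispatcher {
    // Table name -> handler taking (rowId, value) and returning an output string.
    private final Map<String, BiFunction<String, String, String>> handlers = new HashMap<>();

    /** Registers the handler for one table and returns this dispatcher for chaining. */
    TableDispatcher register(String tableName, BiFunction<String, String, String> handler) {
        handlers.put(tableName, handler);
        return this;
    }

    /** Applies the handler registered for tableName; fails loudly for an unknown table. */
    String handle(String tableName, String rowId, String value) {
        BiFunction<String, String, String> handler = handlers.get(tableName);
        if (handler == null) {
            throw new IllegalArgumentException("No handler registered for table: " + tableName);
        }
        return handler.apply(rowId, value);
    }
}

final class TableDispatchDemo {
    public static void main(String[] args) {
        // Hypothetical table names and handlers, for illustration only.
        TableDispatcher dispatcher = new TableDispatcher()
            .register("users", (row, value) -> "user:" + row + "=" + value)
            .register("orders", (row, value) -> "order:" + row + "=" + value);

        // In a real mapper, the first argument would be split.getTableName().
        System.out.println(dispatcher.handle("users", "r1", "alice"));  // prints user:r1=alice
        System.out.println(dispatcher.handle("orders", "r9", "42"));    // prints order:r9=42
    }
}
```

    In an actual job the dispatcher would be built once in the mapper's setup() and called from map(), so each table's logic stays in its own handler even though only one Mapper class is registered.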