Search code examples
hadoopmapreducerdbmsavrodatabase-table

MultipleInputs with DBInputFormat in Hadoop


In my database I have multiple tables where each table is a different entity type. I have an Avro schema that I use in hadoop which is a union of all the fields of these different entity types plus it has a entity type field.

What I would like to do is something along the lines of setting up a DBInputFormat with a DBWritable for each entity type that maps the entity type to the combined Avro type. Then give each DBInputFormat to something like MultipleInputs so that I can create a composite input format. The composite input format could then be given to my map reduce job so that all of the data from all the tables could be processed at once by the same mapper class.

Data is constantly added to these database tables so I need to be able to configure the DBInputFormat for each entity type/dbtable to only grab the new data and to do the splits properly.

Basically I need the functionality of DBInputFormat or DataDrivenDBInputFormat but also the ability to make a composite of them similar to what you can do with paths and MultipleInputs.


Solution

  • Create a view from the N input tables and set the view in the DBInputFormat#setInput. According to the Cloudera article. So, I guess data should not be updated in the table for the time the job completes.

    Hadoop may need to execute the same query multiple times. It will need to return the same results each time. So any concurrent updates to your database, etc, should not affect the query being run by your MapReduce job. This can be accomplished by disallowing writes to the table while your MapReduce job runs, restricting your MapReduce’s query via a clause such as “insert_date < yesterday,” or dumping the data to a temporary table in the database before launching your MapReduce process.

    Evaluate frameworks which support real time processing like Storm, HStreaming, S4 and Strembases. Some of these sit on top of Hadoop and some don't, some are FOSS and some are commercial.