hadoop mapreduce hive hcatalog

Can I use HCatInputFormat with MultipleInputs in Hadoop?


I'm trying to join two datasets: one is stored in a Hive table, the other is not. From what I've seen this is unusual; people either define everything as a Hive table or use no Hive tables at all.

There is the MultipleInputs class, but its addInputPath method takes a Job (a JobConf in the old API), a Path, an InputFormat class and a Mapper class.

I could pass HCatInputFormat there and try to put the table name in disguised as a Path, but that sounds like a wild guess at best.

There is a patch for a newer version of Hive (sadly I'm on CDH4, which means Hive 0.10 and HCatalog 0.5). It is not straightforward to backport to my current version, and it also seems to work only with multiple tables, not with a mix of tables and plain files.

https://issues.apache.org/jira/browse/HIVE-4997

Is this possible, or do you have any recommendations?

The only thing I can think of is reading the raw data without going through the table, but that means writing logic around Hive-specific formats, which I'd rather avoid.


Solution

  • The solution here apparently is either to upgrade to 0.14.0 (or patch the old version), or to skip HCatalog, read the metastore directly, and manually add each partition subdirectory to MultipleInputs.

    Personally, since I can't upgrade easily and handling the partition subdirectories is too much work, I just focused on optimising the jobs in other ways and am content with running a sequence of jobs for now.
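
For anyone who does want to try the manual route, here is a rough sketch of the path-enumeration step only. It is not tested against a real cluster. The class and method names (PartitionDirs, partitionDirs) and the example paths are made up for illustration, and it assumes the table uses Hive's default key=value directory layout. Partitions registered with a custom location will not match that layout, so their real locations would have to be read from the metastore instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: builds the directory list for a partitioned Hive table. */
final class PartitionDirs {

    /**
     * Returns one directory per partition, ASSUMING the default layout
     * tableLocation/key1=value1/key2=value2/...
     * Values containing special characters may be escaped by Hive on disk;
     * that case is not handled here.
     */
    static List<String> partitionDirs(String tableLocation,
                                      List<Map<String, String>> partitions) {
        // Drop a trailing slash so the joined paths do not contain "//".
        String base = tableLocation.endsWith("/")
                ? tableLocation.substring(0, tableLocation.length() - 1)
                : tableLocation;
        List<String> dirs = new ArrayList<String>();
        for (Map<String, String> spec : partitions) {
            StringBuilder dir = new StringBuilder(base);
            // Map iteration order must follow the table's partition-column order.
            for (Map.Entry<String, String> kv : spec.entrySet()) {
                dir.append('/').append(kv.getKey()).append('=').append(kv.getValue());
            }
            dirs.add(dir.toString());
        }
        return dirs;
    }

    public static void main(String[] args) {
        Map<String, String> p1 = new LinkedHashMap<String, String>();
        p1.put("dt", "2014-01-01");
        Map<String, String> p2 = new LinkedHashMap<String, String>();
        p2.put("dt", "2014-01-02");

        // Example table location; a real job would take it from the metastore.
        for (String dir : partitionDirs("/user/hive/warehouse/events",
                                        Arrays.asList(p1, p2))) {
            // In the job driver, each directory would be registered here,
            // e.g. via MultipleInputs.addInputPath(job, new Path(dir), ...).
            System.out.println(dir);
        }
    }
}
```

Each directory would then go to MultipleInputs.addInputPath together with an InputFormat that matches how the table is stored and a mapper for the Hive side, while the non-Hive dataset gets its own addInputPath call with its own mapper. Note that this is exactly the "logic over Hive-specific formats" the question wanted to avoid, which is why it only makes sense when upgrading is not an option.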