Search code examples
hadoophivekylin

Hadoop Map Reduce job: Class org.apache.hive.hcatalog.mapreduce.HCatInputFormat not found


I'm using kylin. It is a data warehouse tool and it uses hadoop, hive and hbase. It is shipped with sample data so that we can test the system. I was building this sample. It is a multi-step process many of the steps are map-reduce jobs. Second step is Extract Fact Table Distinct Columns which is a MR job. This job is failing without writing anything in hadoop logs. After digging deeper I find one Exception in logs/userlogs/application_1450941430146_0002/container_1450941430146_0002_01_000004/syslog

2015-12-24 07:31:03,034 WARN [main] org.apache.hadoop.mapred.YarnChild:
Exception running child : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.mapreduce.HCatInputFormat not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.mapreduce.task.JobContextImpl.getInputFormatClass(JobContextImpl.java:174)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:749)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.mapreduce.HCatInputFormat not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
    ... 8 more

2015-12-24 07:31:03,037 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task

My question is should I copy all dependencies jar of mapper class to all hadoop node? This job succeeds if I restarts kylin server and resume cube building job. This behavior is observed again when restart it after cleaning up everything.

I am using 5 node cluster, each node is 8 core and 30GB. NameNode is running on one node. DataNode is running on all 5 nodes. For Hbase; HMaster and HQuorumPeer is running on same node as NameNode and HRegionServer is running on all nodes. Hive and Kylin are deployed on Master Node.

Version information:

Ubuntu 12.04 (64 bit)
Hadoop 2.7.1
Hbase  0.98.16
Hive   0.14.0
Kylin  1.1.1

Solution

  • The issue here is Kylin assumes the same Hive jars on all Hadoop nodes. And when certain node missing the Hive jars (or even in different location), you get the ClassNotFoundException on HCatInputFormat.

    Btw, you should be able to get a clear error message from Yarn job console. This is a met issue.

    Deploying Hive to all cluster nodes can surely fix the problem, like you have tried.

    Or another (cleaner) workaround is manually configure Kylin to submit Hive jars as additional job dependencies. See https://issues.apache.org/jira/browse/KYLIN-1021

    Finally there's also a open JIRA suggests that Kylin should submit Hive jars by default. See https://issues.apache.org/jira/browse/KYLIN-1082