I'm using Kylin, a data warehouse tool that runs on Hadoop, Hive and HBase. It ships with sample data so that the system can be tested, and I was building this sample. The build is a multi-step process, and many of the steps are MapReduce jobs. The second step, Extract Fact Table Distinct Columns, is an MR job.
This job fails without writing anything to the Hadoop logs. After digging deeper, I found one exception in logs/userlogs/application_1450941430146_0002/container_1450941430146_0002_01_000004/syslog:
2015-12-24 07:31:03,034 WARN [main] org.apache.hadoop.mapred.YarnChild:
Exception running child : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.mapreduce.HCatInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getInputFormatClass(JobContextImpl.java:174)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:749)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.mapreduce.HCatInputFormat not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 8 more
2015-12-24 07:31:03,037 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
My question is: should I copy all the dependency jars of the mapper class to every Hadoop node? The job succeeds if I restart the Kylin server and resume the cube build job. I observe the same behavior again when I restart it after cleaning up everything.
I am using a 5-node cluster; each node has 8 cores and 30 GB. The NameNode runs on one node, and a DataNode runs on all 5 nodes. For HBase, HMaster and HQuorumPeer run on the same node as the NameNode, and an HRegionServer runs on all nodes. Hive and Kylin are deployed on the master node.
Ubuntu 12.04 (64 bit)
Hadoop 2.7.1
HBase 0.98.16
Hive 0.14.0
Kylin 1.1.1
The issue here is that Kylin assumes the same Hive jars are present on all Hadoop nodes. When a node is missing the Hive jars (or has them in a different location), you get the ClassNotFoundException on HCatInputFormat.
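If it helps to confirm which nodes are affected, here is a minimal sketch (my own illustration, not part of Kylin or Hive) that scans a directory for jars containing the class named in the stack trace. The function name `jars_providing` is hypothetical, and the directory you pass in is an assumption: point it at wherever the Hive jars live on the node you are checking.

```python
import zipfile
from pathlib import Path

# The class reported missing in the stack trace, written as a path inside a jar.
CLASS_ENTRY = "org/apache/hive/hcatalog/mapreduce/HCatInputFormat.class"


def jars_providing(lib_dir, class_entry=CLASS_ENTRY):
    """Return the sorted file names of jars under lib_dir that contain class_entry."""
    found = []
    for jar in Path(lib_dir).rglob("*.jar"):
        try:
            with zipfile.ZipFile(jar) as zf:
                if class_entry in zf.namelist():
                    found.append(jar.name)
        except (zipfile.BadZipFile, OSError):
            # Skip corrupt or unreadable jars instead of aborting the scan.
            continue
    return sorted(found)
```

Running `jars_providing(...)` against the Hive installation directory on each node should return a non-empty list on nodes that can load the class and an empty list on nodes that cannot. It only inspects the directory you give it, so it says nothing about what is actually on the task's classpath at run time.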
By the way, you should be able to get a clear error message from the YARN job console. This is a known issue.
Deploying Hive to all cluster nodes will certainly fix the problem, as you have tried.
Another (cleaner) workaround is to manually configure Kylin to submit the Hive jars as additional job dependencies. See https://issues.apache.org/jira/browse/KYLIN-1021
Finally, there is also an open JIRA suggesting that Kylin should submit the Hive jars by default. See https://issues.apache.org/jira/browse/KYLIN-1082