apache-pig google-cloud-dataproc google-cloud-bigtable

Submitting Pig job from Google Cloud Dataproc does not add custom jars to Pig classpath


I'm trying to submit a Pig job via Google Cloud Dataproc that includes a custom jar. The jar implements a custom load function used in the Pig script, but I can't work out how to make it visible to Pig.

Adding my custom jar through the UI does NOT appear to add it to the Pig classpath.

Here's the output of the Pig job, showing it fails to find my class:

17/03/29 16:12:21 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
17/03/29 16:12:21 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
17/03/29 16:12:21 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2017-03-29 16:12:21,961 [main] INFO  org.apache.pig.Main - Apache Pig version 0.16.0 (r: unknown) compiled Nov 27 2016, 23:14:51
2017-03-29 16:12:21,961 [main] INFO  org.apache.pig.Main - Logging error messages to: /tmp/cb3b0696-3f30-4db4-a6a7-bb716d2a8a89/pig_1490803941959.log
2017-03-29 16:12:22,379 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2017-03-29 16:12:22,379 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-03-29 16:12:22,379 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://aspen-dp-central-m
2017-03-29 16:12:22,404 [main] INFO  com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase - GHFS version: 1.6.0-hadoop2
2017-03-29 16:12:22,890 [main] INFO  org.apache.pig.PigServer - Pig Script ID for the session: PIG-default-e53a2851-efe5-4e74-bf33-89dfe0733386
2017-03-29 16:12:22,890 [main] WARN  org.apache.pig.PigServer - ATS is disabled since yarn.timeline-service.enabled set to false
2017-03-29 16:12:23,247 [main] ERROR org.apache.pig.PigServer - exception during parsing: Error during parsing. Could not resolve com.turner.pig.load.HBaseMultiScanLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Failed to parse: Pig script failed to parse: 
<line 8, column 13> pig script failed to validate: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve com.turner.pig.load.HBaseMultiScanLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:199)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1819)
    at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1527)
    at org.apache.pig.PigServer.parseAndBuild(PigServer.java:460)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:485)
    at org.apache.pig.PigServer.executeBatch(PigServer.java:471)
    at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:172)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:742)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:231)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
    at org.apache.pig.Main.run(Main.java:532)
    at org.apache.pig.Main.main(Main.java:176)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: 
<line 8, column 13> pig script failed to validate: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve com.turner.pig.load.HBaseMultiScanLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.parser.LogicalPlanBuilder.validateFuncSpec(LogicalPlanBuilder.java:1339)
    at org.apache.pig.parser.LogicalPlanBuilder.buildFuncSpec(LogicalPlanBuilder.java:1324)
    at org.apache.pig.parser.LogicalPlanGenerator.func_clause(LogicalPlanGenerator.java:5184)
    at org.apache.pig.parser.LogicalPlanGenerator.load_clause(LogicalPlanGenerator.java:3515)
    at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1625)
    at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
    at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
    at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
    at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:191)
    ... 19 more
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 1070: Could not resolve com.turner.pig.load.HBaseMultiScanLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    at org.apache.pig.impl.PigContext.resolveClassName(PigContext.java:671)
    at org.apache.pig.parser.LogicalPlanBuilder.validateFuncSpec(LogicalPlanBuilder.java:1336)
    ... 27 more
2017-03-29 16:12:23,251 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve com.turner.pig.load.HBaseMultiScanLoader using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /tmp/cb3b0696-3f30-4db4-a6a7-bb716d2a8a89/pig_1490803941959.log
2017-03-29 16:12:23,269 [main] INFO  org.apache.pig.Main - Pig script completed in 1 second and 477 milliseconds (1477 ms)
Job output is complete

Solution

  • Registering the custom jar inside the Pig script solves the problem. Here is what I did:

    1. Added my jar file to Google Cloud Storage.
    2. Registered the jar inside the Pig script with a register statement.
    3. Submitted the Pig job, either via the UI or with the command below:

    gcloud dataproc jobs submit pig --cluster eduboom-central --file custom.pig --jars=gs://eduboom-dataproc/custom/eduboom.jar

    custom.pig:

    register eduboom.jar;
    raw = LOAD 'hbase://eduboom_table'
       USING com.eduboom.pig.load.HBaseMultiScanLoader('2017-03-30T14:00Z_00', '2017-03-30T14:01Z_25', 'cf:*')
       AS (key:chararray, data);
    DUMP raw;
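
The three steps above can be strung together in a small shell script. This is only a sketch: the bucket, cluster, jar, and script names are the ones used in this answer, while the `run` dry-run wrapper and the `registers_jar` check are hypothetical helpers added for illustration. It prints the commands by default; set `DRY_RUN=0` to actually run them. Recent gcloud releases may also need a `--region` flag, which the original command does not show.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Values copied from the answer above; adjust them for your own project.
JAR=eduboom.jar
JAR_URI=gs://eduboom-dataproc/custom/eduboom.jar
CLUSTER=eduboom-central
SCRIPT=custom.pig

# Hypothetical helper: print commands by default, execute them when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: succeed if the Pig script ($1) contains a
# REGISTER line that mentions the jar name ($2), case-insensitively.
registers_jar() {
  grep -Eiq "^[[:space:]]*register[[:space:]].*$2" "$1"
}

# Step 1: copy the jar to Google Cloud Storage.
run gsutil cp "$JAR" "$JAR_URI"

# Step 2: warn if the script is present but never registers the jar,
# since that is what produced ERROR 1070 in the question.
if [ -f "$SCRIPT" ] && ! registers_jar "$SCRIPT" "$JAR"; then
  echo "warning: $SCRIPT has no REGISTER line for $JAR" >&2
fi

# Step 3: submit the job, passing the jar with --jars as in the answer.
run gcloud dataproc jobs submit pig --cluster "$CLUSTER" \
  --file "$SCRIPT" --jars "$JAR_URI"
```

Running it once with the default dry-run setting is a cheap way to check the jar URI and cluster name before anything is uploaded or submitted.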