Search code examples
hadoopapache-piggoogle-hadoop

Google Cloud connector for Hadoop doesn't work with Pig


I'm using Hadoop with HDFS 2.7.1.2.4 and Pig 0.15.0.2.4 (Hortonworks HDP 2.4) and trying to use Google Cloud Storage Connector for Spark and Hadoop (bigdata-interop on GitHub). It works correctly when I try, say,

hadoop fs -ls gs://bucket-name

But when I try the following in Pig (in mapreduce mode):

data = LOAD 'gs://softline/o365.avro' USING AvroStorage();
data = STORE data INTO 'gs://softline/o366.avro' USING AvroStorage();

Pig fails with the following errors:

org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Wrong FS scheme: hdfs, in path: hdfs://hdp.slweb.ru:8020/user/root, expected scheme: gs
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:279)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob.submit(ControlledJob.java:335)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.pig.backend.hadoop23.PigJobControl.submit(PigJobControl.java:128)
    at org.apache.pig.backend.hadoop23.PigJobControl.run(PigJobControl.java:194)
    at java.lang.Thread.run(Thread.java:745)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher$1.run(MapReduceLauncher.java:276)
Caused by: java.lang.IllegalArgumentException: Wrong FS scheme: hdfs, in path: hdfs://hdp.slweb.ru:8020/user/root, expected scheme: gs
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.checkPath(GoogleHadoopFileSystemBase.java:741)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.checkPath(GoogleHadoopFileSystem.java:90)
    at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:466)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.makeQualified(GoogleHadoopFileSystemBase.java:701)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.getGcsPath(GoogleHadoopFileSystem.java:163)
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.setWorkingDirectory(GoogleHadoopFileSystemBase.java:1094)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:235)
    ... 18 more

If needed I could post the logs of GC connectors.

Hame somebody used Pig with this connectors? Any help would be appeciated.


Solution

  • TL;DR explicitly set workmapreduce.job.working.dir=/user/root/ when starting the pig job

    If a working directory has not been explicitly set during job submission then Hadoop will set the working directory to be the working directory of the default filesystem. When using HDFS as your default FS the working directory will generally be something like 'hdfs://namenode:port/user/<your username>'.

    When PigInputFormat#getSplits is called, it fetches the FileSystem associated with the path of the input that it is operating on. In this case the filesystem is an instance of GoogleHadoopFileSystem. Pig then inspects the path of its input and if the path is non-local calls FileSystem#setWorkingDirectory(job.getWorkingDirectory()). The problem here is that the job's working directory is 'hdfs://namenode:port/user/<your username>' which GoogleHadoopFileSystem will reject as a path to set as its own working directory (as it only supports 'gs://' paths).