Tags: hadoop, google-cloud-storage, google-cloud-dataproc

How do you perform hadoop fs -getmerge on Dataproc from Google Storage?


How do you use getmerge on Dataproc for part files that are dumped to a Google Storage bucket? If I try

    hadoop fs -getmerge gs://my-bucket/temp/part-* gs://my-bucket/temp_merged

I get the error getmerge: /temp_merged (Permission denied).

It works fine with

    hadoop fs -getmerge gs://my-bucket/temp/part-* temp_merged

but that of course writes the merged file to the cluster machine's local disk, not to GCS.


Solution

  • According to the FsShell documentation, the getmerge command fundamentally treats the destination path as a local path, so in gs://my-bucket/temp_merged it ignores the "scheme" and "authority" components and tries to write directly to the local filesystem path /temp_merged. This is not specific to the GCS connector; you'll see the same thing if you try hadoop fs -getmerge gs://my-bucket/temp/part-* hdfs:///temp_merged. Even worse, if you try something like hadoop fs -getmerge gs://my-bucket/temp/part-* hdfs:///tmp/temp_merged, you may think it succeeded when in fact the file did not appear at hdfs:///tmp/temp_merged, but instead on your local filesystem at file:///tmp/temp_merged (a short demonstration appears at the end of this answer).

    You can instead pipe through stdout/stdin to make it happen. Unfortunately, -getmerge doesn't play well with /dev/stdout due to permissions and its use of .crc files, but you can achieve the same effect with hadoop fs -put, which supports reading from stdin:

    # Stream every part file from GCS and write the concatenation back as a single object
    hadoop fs -cat gs://my-bucket/temp/part-* | \
        hadoop fs -put - gs://my-bucket/temp_merged
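
    To confirm the merge worked, you can list and spot-check the resulting object (an illustrative check using the same hypothetical bucket and paths as above):

    hadoop fs -ls gs://my-bucket/temp_merged           # the merged object now exists in GCS
    hadoop fs -cat gs://my-bucket/temp_merged | head   # sample the first lines of the result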
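
    And as a demonstration of the local-destination behavior described earlier, this hypothetical session (bucket and paths are illustrative) shows where getmerge actually writes when handed an hdfs:// destination:

    hadoop fs -getmerge gs://my-bucket/temp/part-* hdfs:///tmp/temp_merged
    hadoop fs -ls hdfs:///tmp/temp_merged   # fails: nothing was written to HDFS
    ls -l /tmp/temp_merged                  # the merged file landed on the local disk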