How do you use getmerge on Dataproc for part files that are dumped to a Google Cloud Storage bucket?
If I try hadoop fs -getmerge gs://my-bucket/temp/part-* gs://my-bucket/temp_merged
I get the error
getmerge: /temp_merged (Permission denied)
It works fine for hadoop fs -getmerge gs://my-bucket/temp/part-* temp_merged
but that of course writes the merged file to the cluster machine's local disk, not to GCS.
According to the FsShell documentation, the getmerge command fundamentally treats the destination path as a local path: given gs://my-bucket/temp_merged, it ignores the "scheme" and "authority" components and tries to write directly to the local filesystem path /temp_merged. This is not specific to the GCS connector; you'll see the same thing if you try hadoop fs -getmerge gs://my-bucket/temp/part-* hdfs:///temp_merged. Even worse, if you try something like hadoop fs -getmerge gs://my-bucket/temp/part-* hdfs:///tmp/temp_merged, you may think it succeeded when in fact the file did not appear at hdfs:///tmp/temp_merged, but instead landed on your local filesystem at file:///tmp/temp_merged.
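As a rough illustration of the destination handling described above (not how FsShell literally parses URIs internally), stripping the scheme and authority from the destination leaves exactly the local path that getmerge ends up writing to:

```shell
# Sketch: drop "gs://" (scheme) and "my-bucket" (authority) from the
# destination URI; what remains is the path getmerge treats as local.
dest='gs://my-bucket/temp_merged'
no_scheme=${dest#*://}        # -> my-bucket/temp_merged
local_path=/${no_scheme#*/}   # -> /temp_merged
echo "$local_path"
```

This is why the error message names /temp_merged: writing to the root of the local filesystem requires permissions a normal user doesn't have.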
You can instead pipe through stdout/stdin to get the same effect. Unfortunately -getmerge doesn't play well with /dev/stdout due to permissions and its use of .crc files, but you can achieve the same result with hadoop fs -put, which supports reading from stdin:
hadoop fs -cat gs://my-bucket/temp/part-* | \
hadoop fs -put - gs://my-bucket/temp_merged
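A purely local sketch of what that pipeline produces (the paths and contents here are made up for illustration): the shell expands part-* in lexicographic order before hadoop fs -cat ever runs, which matches Hadoop's part-00000, part-00001, ... naming, so the merged file preserves the part order:

```shell
# Simulate the cat-then-put merge with local files standing in for the
# gs://my-bucket/temp/part-* objects. Glob expansion is lexicographic,
# so part-00000 is concatenated before part-00001.
demo_dir=$(mktemp -d)
printf 'alpha\n' > "$demo_dir/part-00000"
printf 'beta\n'  > "$demo_dir/part-00001"
cat "$demo_dir"/part-* > "$demo_dir/temp_merged"
cat "$demo_dir/temp_merged"
```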