I was testing my Dataflow Java application in IntelliJ and it worked perfectly fine. But when I ran the Dataflow jar file on a Linux system, I ran into a problem.
These are the options I used for Dataflow:
--project=myproject --stagingLocation=gs://mybucket/staging2 --tempLocation=gs://mybucket/gcp-temp2 --gcpTempLocation=gs://mybucket/gcp-temp2 --bigtableProjectId=myinstance --bigtableInstanceId=user-test --bigtableTableId=test_table1
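For context, this is roughly how I launch it on the Linux server (the jar name here is just a placeholder for my bundled artifact):

    java -jar my-dataflow-pipeline-bundled.jar \
        --project=myproject \
        --stagingLocation=gs://mybucket/staging2 \
        --tempLocation=gs://mybucket/gcp-temp2 \
        --gcpTempLocation=gs://mybucket/gcp-temp2 \
        --bigtableProjectId=myinstance \
        --bigtableInstanceId=user-test \
        --bigtableTableId=test_table1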
So the problem is that the gs:// path is not recognized properly. In fact, it is treated as a local directory on the server where I ran the jar file.
Here is why this directory problem occurs:
I looked into the difference between a [maven assembly jar] and a [maven shade jar] and found out that the FileSystemRegistrar service file was pointing at the wrong implementation.
But using the shade plugin alone is not the remedy; I was just lucky that GcsFileSystemRegistrar was not the one overwritten. The same problem occurs again when I change the dependency order.
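As far as I understand it, Beam discovers filesystems through Java's ServiceLoader: each jar ships a provider file named after the FileSystemRegistrar interface, and when the fat jar is built only one of those files survives. The path and class names below are my understanding of the GCS entry (taken from Beam's GCP extension module, so treat them as an assumption rather than something copied from my build):

    # META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar
    # One registrar per line. If only the entry from beam-sdks-java-core
    # survives the merge, gs:// paths fall back to the local filesystem.
    org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystemRegistrar

You can check which entries actually made it into the bundled jar with 'unzip -p your.jar META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar'.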
To make this work, I have to have both of these libraries in this order:
beam-runners-google-cloud-dataflow-java
beam-sdks-java-core
'beam-sdks-java-core' is already included in 'beam-runners-google-cloud-dataflow-java', but I need to declare it again after 'beam-runners-google-cloud-dataflow-java'. So the dependency hierarchy looks funny, but this is the only way I can get this to work. Here is roughly how it looks in my pom.xml:
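(A sketch only; org.apache.beam is Beam's standard groupId and the version property is a placeholder for whatever Beam version you use.)

    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-runners-google-cloud-dataflow-java</artifactId>
        <version>${beam.version}</version>
    </dependency>
    <!-- Already pulled in transitively by the runner above, but it has to be
         declared again, after the runner, or the gs:// paths break -->
    <dependency>
        <groupId>org.apache.beam</groupId>
        <artifactId>beam-sdks-java-core</artifactId>
        <version>${beam.version}</version>
    </dependency>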
If I exclude 'beam-sdks-java-core' or change the order, the problem occurs again. I also tried excluding it with Maven plugins, but that didn't work.
So my question is: how can I set the FileSystemRegistrar properly? I don't know why it behaves this way.
And I hope anyone who is having this problem can get a hint from this post. I struggled a lot with this :'(
As OGCheeze commented, it was solved by using the Maven Shade plugin with the ServicesResourceTransformer. This post has a more detailed explanation.
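For anyone landing here, this is roughly the shade configuration that fixed it for me (the plugin version is a placeholder; the essential part is the ServicesResourceTransformer, which merges the META-INF/services files from all jars instead of letting one overwrite the others):

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.4</version>
        <executions>
            <execution>
                <phase>package</phase>
                <goals>
                    <goal>shade</goal>
                </goals>
                <configuration>
                    <transformers>
                        <!-- Concatenates META-INF/services files from all jars
                             instead of overwriting them, so every registrar,
                             including GcsFileSystemRegistrar, stays registered -->
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                    </transformers>
                </configuration>
            </execution>
        </executions>
    </plugin>

With the transformer merging the service files, the registration no longer depends on the dependency order.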