apache-spark google-cloud-dataproc google-cloud-bigtable

How do I pass in the google cloud project to the SHC BigTable connector at runtime?

I'm trying to access BigTable from Spark (Dataproc). I tried several different methods and SHC seems to be the cleanest for what I am trying to do and performs well.

https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc

However this approach requires that I put the Google cloud project ID in hbase-site.xml which means I need to build separate versons of the fat jar file with my spark code for each env I run on (prod, staging, etc.) which is something I'd like to avoid.

Is there a way for me to pass in the google cloud project id at runtime?

Solution

As far as I can tell, the SHC library does not let you pass through hbase configs (looking in here).

The easiest thing would be to run an init action that gets the VM's project id from VM metadata and sets it in hbase-site.xml. We are working on an initialization that does that and installs the Hbase client for Bigtable. Check out the in-progress pull request, which would be a good starting point if you needed to write one immediately. Otherwise, I expect the PR to get merged in the next couple weeks.

Alternatively, consider adding an option in SHC for passing through properties to the HBaseConfiguration it creates. That would be a valuable feature for the broader community.