Connecting to BigQuery from RStudio running on a Dataproc cluster


I created a Dataproc cluster and launched RStudio Server successfully by following these instructions: https://cloud.google.com/solutions/running-rstudio-server-on-a-cloud-dataproc-cluster

I also installed sparklyr and created a Spark instance successfully.

sc <- spark_connect(master = "local")

However, I am not sure how to connect to BigQuery. There is a sparkbq library, but I don't know how to pass the BigQuery connector jar at runtime, as described here: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example


Solution

  • You can use Dataproc initialization actions to install the spark-bigquery connector on all the nodes of your cluster: https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors

    Initialization actions only run when a cluster is created, so you may have to recreate the cluster with the updated initialization actions and launch RStudio Server again. If you don't want to do that and your cluster is small, you could instead ssh into each node and download the spark-bigquery connector jar manually.
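Once the connector jar is on the nodes, a minimal sketch of using it from sparklyr could look like the following. Several things here are assumptions rather than part of the answer above: the local jar path is hypothetical (if the initialization action has already placed the jar on Spark's classpath, the jars setting can probably be dropped), the `sparklyr.shell.jars` entry relies on sparklyr forwarding `sparklyr.shell.*` settings to spark-submit as command-line flags, and the public `bigquery-public-data.samples.shakespeare` table is used only as an example. It reads through the generic Spark DataFrameReader rather than through sparkbq.

```r
library(sparklyr)
library(dplyr)

# Hypothetical path: wherever the connector jar was downloaded on the node.
connector_jar <- "/path/to/spark-bigquery-connector.jar"

# Assumption: sparklyr.shell.* entries are passed to spark-submit,
# so this should become `--jars <connector_jar>`.
config <- spark_config()
config[["sparklyr.shell.jars"]] <- connector_jar

# Same master as in the question.
sc <- spark_connect(master = "local", config = config)

# Build a DataFrameReader for the connector's "bigquery" format
# and load an example public table.
shakespeare_jobj <- spark_session(sc) %>%
  invoke("read") %>%
  invoke("format", "bigquery") %>%
  invoke("option", "table", "bigquery-public-data.samples.shakespeare") %>%
  invoke("load")

# Register the result so it can be used as a regular sparklyr table.
shakespeare <- sdf_register(shakespeare_jobj, "shakespeare")

# Inspect the first few rows.
head(shakespeare)
```

If the connector is not found, Spark typically fails at the `load` step with an error saying the `bigquery` data source could not be found, which is a quick way to check whether the jar was actually picked up.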