Tags: python, google-app-engine, apache-spark, google-bigquery, pyspark

Loading Data from Google BigQuery into Spark (on Databricks)


I want to load data into Spark (on Databricks) from Google BigQuery. I notice that Databricks offers a lot of support for Amazon S3, but not for Google.

What is the best way to load data into Spark (on Databricks) from Google BigQuery? Would the BigQuery connector allow me to do this, or is it only valid for files hosted on Google Cloud Storage?


Solution

  • The BigQuery Connector is a client-side library that uses the public BigQuery API: it runs BigQuery export jobs to Google Cloud Storage, and takes advantage of file creation ordering to start Hadoop processing early and increase overall throughput.

    This approach should work wherever you happen to locate your Hadoop cluster (see the sketch after this answer).

    That said, if you are running over large datasets, you may find network bandwidth to be a bottleneck (how good is your network connection to Google?), and since you are reading data out of Google's network, GCS network egress charges will apply.
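
For reference, here is a minimal PySpark sketch of reading a BigQuery table through the Hadoop BigQuery connector. It assumes the connector JAR (and the GCS filesystem connector) is already on the cluster's classpath, that `sc` and `spark` are the pre-defined SparkContext/SparkSession on Databricks, and it uses the public `publicdata:samples.shakespeare` table; the project ID, staging bucket, and temporary GCS path are placeholders to replace with your own values.

```python
import json

# Placeholder values -- replace with your own GCP project and staging bucket.
project_id = "your-gcp-project"
bucket = "your-staging-bucket"

conf = {
    # Project billed for the export job, and the bucket for temporary export files.
    "mapred.bq.project.id": project_id,
    "mapred.bq.gcs.bucket": bucket,
    "mapred.bq.temp.gcs.path": "gs://{}/tmp/bigquery".format(bucket),
    # Table to read: here the public Shakespeare sample table.
    "mapred.bq.input.project.id": "publicdata",
    "mapred.bq.input.dataset.id": "samples",
    "mapred.bq.input.table.id": "shakespeare",
}

# The connector exports the table to GCS and exposes the rows as
# (key, JSON string) pairs via a Hadoop InputFormat.
table_rdd = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)

# Keep only the JSON payload and load it into a DataFrame.
json_rows = table_rdd.map(lambda record: record[1])
df = spark.read.json(json_rows)
df.show(5)
```

The temporary files written under `mapred.bq.temp.gcs.path` are what incur the GCS storage and egress costs mentioned above, so it is worth cleaning them up (or setting a lifecycle rule on the bucket) once the job has finished.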