Tags: java, postgresql, apache-spark, google-bigquery, data-pipeline

Combining data from different sources in Apache Spark


I am exploring Apache Spark for a project where I want to pull data from several sources: database tables (PostgreSQL and BigQuery) and text. The data will be processed and written to another table for analytics. My preferred programming language is Java, but I am exploring Python too. Can someone please let me know whether I can read the data directly into Spark for processing? Do I need some kind of connector between the database tables and the Spark cluster?

Thanks in advance.


Solution

  • For example, to read the contents of a BigQuery table, you can use the following (shown in Python):

    words = spark.read.format('bigquery') \
       .option('table', 'bigquery-public-data:samples.shakespeare') \
       .load()
    

    See the documentation [1], which also includes the instructions for Scala.

    ***I recommend trying the wordcount code first to get used to the usage pattern.***
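As a rough idea of what such a wordcount job could look like, here is a minimal sketch. It assumes the BigQuery connector jar is available to the job and that the public table has `word` and `word_count` columns; the helper names `load_words`, `count_words`, and `main` are purely illustrative and not part of any API.

```python
def load_words(spark, table="bigquery-public-data:samples.shakespeare"):
    """Read a BigQuery table into a DataFrame through the spark-bigquery connector."""
    return spark.read.format("bigquery").option("table", table).load()


def count_words(words):
    """Sum the per-corpus counts so that each word appears once with its total."""
    return (
        words.groupBy("word")
        .sum("word_count")  # Spark names the result column "sum(word_count)"
        .withColumnRenamed("sum(word_count)", "total")
        .orderBy("total", ascending=False)
    )


def main():
    # pyspark is imported here so the helpers above stay importable without it.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bigquery-wordcount").getOrCreate()
    count_words(load_words(spark)).show(10)
```

Calling `main()` at the bottom of `wordcount.py` would turn this into a script that can be submitted as a PySpark job.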

    Once your Spark code is ready, create a new cluster in Google Dataproc [2] and run the job there, linking the BigQuery connector (example with Python):

    # Replace cluster-name and cluster-region (for example, us-central1) with your own values.
    gcloud dataproc jobs submit pyspark wordcount.py \
       --cluster cluster-name \
       --region cluster-region \
       --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar
    

    Here you can find the latest version of the BigQuery connector [3].

    In addition, this GitHub repository has some examples of how to use the BigQuery connector with Spark [4].

    With these instructions you should be able to read from and write to BigQuery.
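For the other sources mentioned in the question, the same read/write pattern applies. The sketch below is only an illustration under some assumptions: PostgreSQL is read through Spark's built-in JDBC data source (which needs the PostgreSQL JDBC driver jar supplied to the job, for instance via `--jars`), text is read with `spark.read.text`, and the BigQuery write stages data through a Cloud Storage bucket set with the connector's `temporaryGcsBucket` option. The function names and all argument values are hypothetical placeholders.

```python
def read_postgres(spark, url, table, user, password):
    """Read a PostgreSQL table through Spark's built-in JDBC data source."""
    return (
        spark.read.format("jdbc")
        .option("url", url)  # e.g. jdbc:postgresql://host:5432/dbname (placeholder)
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .option("driver", "org.postgresql.Driver")
        .load()
    )


def read_text(spark, path):
    """Read plain text; each line becomes one row in a single 'value' column."""
    return spark.read.text(path)


def write_bigquery(df, table, temp_bucket):
    """Append a DataFrame to a BigQuery table through the connector."""
    (
        df.write.format("bigquery")
        .option("table", table)
        .option("temporaryGcsBucket", temp_bucket)  # staging bucket for the write
        .mode("append")
        .save()
    )
```

In a job, the DataFrames returned by `read_postgres` and `read_text` can be joined or transformed like any other DataFrame before the result is passed to `write_bigquery`.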

    [1] https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example#running_the_code

    [2] https://cloud.google.com/dataproc/docs/guides/create-cluster

    [3] gs://spark-lib/bigquery/spark-bigquery-latest.jar

    [4] https://github.com/GoogleCloudDataproc/spark-bigquery-connector