google-cloud-dataproc google-cloud-datalab google-cloud-source-repos

Integration with Dataproc + Datalab + Source Code repos


Has anyone been able to integrate Dataproc, Datalab, and Cloud Source Repositories? As many of us have seen, when you use an init action to install Datalab, it does not create the source code repo. I am trying to build a full end-to-end solution where a user logs in to a Datalab notebook, interacts with Dataproc through PySpark, and checks the notebooks in to the source code repo. I have not been able to do this with the init action, as noted above. I also tried installing Dataproc and then Datalab as a separate install (this time it creates the source repo); however, I can't run any Spark code in that Datalab notebook. Can someone please give me some pointers on how to achieve this? Any and all help is appreciated.

Code in Datalab

from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""show databases""").show()
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
      (SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
      STORED AS PARQUET
      LOCATION 'gs://my-exercise-project-2019016-ds-team/datasets/invoices'""")
hc.sql("""select * from invoices limit 10""").show()

Error

Py4JJavaError: An error occurred while calling o55.sql.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2395)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3208)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3240)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$or

Solution

  • Unfortunately, it takes some pre-work to be able to create the datalab-notebooks repository in Cloud Source Repositories from an init action.

    The reason is that creating the repository requires the VM's service account to have the "source.repos.create" IAM permission on the project, which it does not have by default.

    You can either grant that permission to the service account, and then create the repository via gcloud source repos create datalab-notebooks, or manually create the repository before creating the cluster.
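
    As a rough sketch of the first option, the two steps could look like the script below. The role roles/source.admin is my assumption for a predefined role that carries source.repos.create (a custom role with only that permission should work too), and the function names and their arguments are hypothetical, not part of the canned init action.

```shell
#!/usr/bin/env bash
# Sketch only: function names and arguments are illustrative assumptions.
REPO_NAME="datalab-notebooks"

# Run once by someone who can edit project IAM, not from the init action.
# $1 = project ID, $2 = email of the service account used by the cluster VMs.
# roles/source.admin is assumed to include source.repos.create.
grant_repo_create_role() {
  gcloud projects add-iam-policy-binding "$1" \
    --member="serviceAccount:$2" \
    --role="roles/source.admin"
}

# Safe to call from the init action: create the repo only if it is missing,
# so re-creating the cluster does not fail on an existing repository.
ensure_notebooks_repo() {
  if gcloud source repos describe "${REPO_NAME}" >/dev/null 2>&1; then
    echo "exists"
  else
    gcloud source repos create "${REPO_NAME}" >/dev/null && echo "created"
  fi
}
```

    The idea is to run grant_repo_create_role once up front and call ensure_notebooks_repo from the startup script before any clone step.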

    Then, to clone the repository inside of your startup script, add the following lines:

    mkdir -p ${HOME}/datalab
    gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
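
    If the startup script can run more than once on the same machine, a guarded variant of the two lines above may be safer. This is only a sketch: the function name clone_notebooks_repo is hypothetical, and it assumes the repository already exists and the VM's service account can read it.

```shell
#!/usr/bin/env bash
# Sketch only: clone_notebooks_repo is an illustrative name, not part of Datalab.
REPO_NAME="datalab-notebooks"

# Clone the notebooks repo into ${HOME}/datalab/notebooks, skipping the clone
# when a git checkout is already present at that path.
clone_notebooks_repo() {
  local target="${HOME}/datalab/notebooks"
  mkdir -p "${HOME}/datalab"
  if [ -d "${target}/.git" ]; then
    echo "already cloned"
  else
    gcloud source repos clone "${REPO_NAME}" "${target}" >/dev/null && echo "cloned"
  fi
}
```

    Calling clone_notebooks_repo in place of the plain clone command keeps a repeated run from failing on a non-empty target directory.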
    

    If you are modifying the canned init action for Datalab, then I would suggest adding these lines here