Tags: apache-spark, google-cloud-platform, google-cloud-dataproc, google-cloud-datalab

Installing Datalab/Jupyter on Dataproc cluster


I'm trying to install a Jupyter notebook / Datalab on my Dataproc cluster, but to no avail.

I'm following this tutorial: https://cloud.google.com/dataproc/docs/tutorials/dataproc-datalab

Step by step:

  1. I create a new GCS bucket called datalab-init-bucket-001 and upload the datalab.sh script from GitHub (https://github.com/GoogleCloudPlatform/dataproc-initialization-actions/blob/master/datalab/datalab.sh) to it; see the gsutil sketch after this list.
  2. Then I start the Dataproc cluster via gcloud with --initialization-actions 'gs://datalab-init-bucket-001/datalab.sh'; the full command is:

    gcloud dataproc clusters create cluster-test \
        --subnet default \
        --zone "" \
        --master-machine-type n1-standard-4 \
        --master-boot-disk-size 10 \
        --num-workers 2 \
        --worker-machine-type n1-standard-2 \
        --worker-boot-disk-size 10 \
        --initialization-action-timeout "10h" \
        --initialization-actions 'gs://datalab-init-bucket-001/datalab.sh'
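
For reference, step 1 can also be scripted. A minimal sketch using gsutil (the us-central1 bucket location is an assumption; any location works):

    # Create the staging bucket (location is an assumption)
    gsutil mb -l us-central1 gs://datalab-init-bucket-001

    # Fetch datalab.sh from the repository and upload it to the bucket
    curl -LO https://raw.githubusercontent.com/GoogleCloudPlatform/dataproc-initialization-actions/master/datalab/datalab.sh
    gsutil cp datalab.sh gs://datalab-init-bucket-001/datalab.sh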

Here the first problem arises: cluster creation fails on the initialization action.

Looking at the logs:

    OK > Downloading script [gs://datalab-init-bucket-001/datalab.sh] to [/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0]
    OK > Running script [/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0] and saving output in [/var/log/dataproc-initialization-script-0.log]
    OK > DIR* completeFile: /user/spark/eventlog/.cc2b1d00-4968-4008-87d7-eac090b09e56 is closed by DFSClient_NONMAPREDUCE_1150019196_1
    ERROR > AgentRunner startup failed:
    com.google.cloud.hadoop.services.agent.AgentException: Initialization action failed to start (error=2, No such file or directory). Failed action 'gs://datalab-init-bucket-001/datalab.sh' (TASK_FAILED)
        at com.google.cloud.hadoop.services.agent.AgentException$Builder.build(AgentException.java:83)
        at com.google.cloud.hadoop.services.agent.AgentException$Builder.buildAndThrow(AgentException.java:79)
        at com.google.cloud.hadoop.services.agent.BootstrapActionRunner.throwInitActionFailureException(BootstrapActionRunner.java:236)
        at com.google.cloud.hadoop.services.agent.BootstrapActionRunner.runSingleCustomInitializationScriptWithTimeout(BootstrapActionRunner.java:146)
        at com.google.cloud.hadoop.services.agent.BootstrapActionRunner.runCustomInitializationActions(BootstrapActionRunner.java:126)
        at com.google.cloud.hadoop.services.agent.AbstractAgentRunner.runCustomInitializationActionsIfFirstRun(AbstractAgentRunner.java:150)
        at com.google.cloud.hadoop.services.agent.MasterAgentRunner.initialize(MasterAgentRunner.java:165)
        at com.google.cloud.hadoop.services.agent.AbstractAgentRunner.start(AbstractAgentRunner.java:68)
        at com.google.cloud.hadoop.services.agent.MasterAgentRunner.start(MasterAgentRunner.java:36)
        at com.google.cloud.hadoop.services.agent.AgentMain.lambda$boot$0(AgentMain.java:63)
        at com.google.cloud.hadoop.services.agent.AgentStatusReporter.runWith(AgentStatusReporter.java:52)
        at com.google.cloud.hadoop.services.agent.AgentMain.boot(AgentMain.java:59)
        at com.google.cloud.hadoop.services.agent.AgentMain.main(AgentMain.java:46)
    Caused by: java.io.IOException: Cannot run program "/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0": error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
        at com.google.cloud.hadoop.services.agent.util.NativeAsyncProcessWrapperFactory.startAndWrap(NativeAsyncProcessWrapperFactory.java:33)
        at com.google.cloud.hadoop.services.agent.util.NativeAsyncProcessWrapperFactory.startAndWrap(NativeAsyncProcessWrapperFactory.java:27)
        at com.google.cloud.hadoop.services.agent.BootstrapActionRunner.createRunner(BootstrapActionRunner.java:349)
        at com.google.cloud.hadoop.services.agent.BootstrapActionRunner.runScriptAndPipeOutputToGcs(BootstrapActionRunner.java:301)
        at com.google.cloud.hadoop.services.agent.BootstrapActionRunner.runSingleCustomInitializationScriptWithTimeout(BootstrapActionRunner.java:142)
        ... 9 more
        Suppressed: java.io.IOException: Cannot run program "/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0": error=2, No such file or directory
            ... 15 more
        Caused by: java.io.IOException: error=2, No such file or directory
            at java.lang.UNIXProcess.forkAndExec(Native Method)
            at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
            at java.lang.ProcessImpl.start(ProcessImpl.java:134)
            at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
            ... 14 more
    Caused by: java.io.IOException: error=2, No such file or directory
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
        at java.lang.ProcessImpl.start(ProcessImpl.java:134)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
        ... 14 more
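
The error=2 (ENOENT, "No such file or directory") means the staged copy of the init action could not be executed. One way to dig further is to inspect the staged script and its log directly on the master node; a sketch, assuming the default <cluster>-m master naming and a hypothetical zone:

    # SSH into the master node (Dataproc names it <cluster-name>-m; the zone is an assumption)
    gcloud compute ssh cluster-test-m --zone us-central1-a

    # On the master: check whether the script was actually staged...
    ls -l /etc/google-dataproc/startup-scripts/
    # ...and read whatever output it produced before failing
    sudo cat /var/log/dataproc-initialization-script-0.log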

  1. "Manual" installation on the master node VM fails too: enter image description here

I somehow managed to start Datalab on a single-node cluster, but I was not able to start a (Py)Spark session there.

I'm running the latest Dataproc image version (1.2); version 1.1, for example, didn't work either. I'm on a free-credits account, but I assume that shouldn't pose a problem.

Any idea how to update the datalab.sh script to make this work?


Solution

  • It turned out the failure was caused by insufficient disk space. After switching the boot disk size from 10 GB to 50 GB on each node, everything works; the adjusted command is sketched below.
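
For completeness, a sketch of the working command; it is identical to the one above except for the 50 GB boot disks:

    gcloud dataproc clusters create cluster-test \
        --subnet default \
        --zone "" \
        --master-machine-type n1-standard-4 \
        --master-boot-disk-size 50 \
        --num-workers 2 \
        --worker-machine-type n1-standard-2 \
        --worker-boot-disk-size 50 \
        --initialization-action-timeout "10h" \
        --initialization-actions 'gs://datalab-init-bucket-001/datalab.sh'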