apache-spark, google-hadoop

(bdutil) Unable to get hadoop/spark cluster working with a fresh install


I'm setting up a tiny cluster in GCE to play around with, but although the instances are created, some failures prevent it from working. I'm following the steps in https://cloud.google.com/hadoop/downloads

I'm using the latest versions (as of now) of gcloud (143.0.0) and bdutil (1.3.5), freshly installed.

./bdutil deploy -e extensions/spark/spark_env.sh

using debian-8 as the image (since bdutil still defaults to debian-7-backports).
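For reference, this is roughly how I'm pointing bdutil at debian-8 (a sketch only; GCE_IMAGE is the variable name in my copy of bdutil_env.sh, so verify it matches yours):

# Sketch: override the default image via an extra env var file passed to -e
cat > my_image_env.sh <<'EOF'
GCE_IMAGE='debian-8'
EOF

./bdutil deploy -e extensions/spark/spark_env.sh,my_image_env.sh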

At some point I got

Fri Feb 10 16:19:34 CET 2017: Command failed: wait ${SUBPROC} on line 326.
Fri Feb 10 16:19:34 CET 2017: Exit code of failed command: 1

The full debug output is in https://gist.github.com/jlorper/4299a816fc0b140575ed70fe0da1f272 (project id and bucket names changed).

The instances are created, but Spark is not even installed. Digging a bit, I've managed to run the Spark installation and start Hadoop commands on the master after SSHing in. But it fails badly when starting the spark-shell:

17/02/10 15:53:20 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.5-hadoop1
17/02/10 15:53:20 INFO gcsio.FileSystemBackedDirectoryListCache: Creating '/hadoop_gcs_connector_metadata_cache' with createDirectories()...
java.lang.RuntimeException: java.lang.RuntimeException: java.nio.file.AccessDeniedException: /hadoop_gcs_connector_metadata_cache
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)

and I'm not able to import sparkSQL. From what I've read, everything should be started automatically.
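The AccessDeniedException suggests the metadata cache directory can't be created. As a blunt workaround I tried creating it by hand on the master (sketch only; the hadoop:hadoop owner is a guess based on my setup):

# Create the GCS connector metadata cache directory manually and hand it
# to the user running the Hadoop/Spark daemons (owner is an assumption)
sudo mkdir -p /hadoop_gcs_connector_metadata_cache
sudo chown hadoop:hadoop /hadoop_gcs_connector_metadata_cache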

Up to this point I'm a bit lost and don't know what else to do. Am I missing any step? Are any of the commands faulty? Thanks in advance.

Update: solved

As pointed out in the accepted solution, I cloned the repo and the cluster was created without issues. When trying to start the spark-shell, though, it gave

java.lang.RuntimeException: java.io.IOException: GoogleHadoopFileSystem has been closed or not initialized.

That sounded to me like the connectors were not initialized properly, so after running

 ./bdutil --env_var_files extensions/spark/spark_env.sh,bigquery_env.sh run_command_group install_connectors

it worked as expected.
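As a quick sanity check that the GCS connector was wired up afterwards, listing the configured bucket from the master also worked (bucket name is a placeholder):

hadoop fs -ls gs://<your-bucket>/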


Solution

  • The version of bdutil on https://cloud.google.com/hadoop/downloads is a bit stale, and I'd instead recommend using the version of bdutil at head on GitHub: https://github.com/GoogleCloudPlatform/bdutil (see the sketch below).
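
A rough sketch of deploying from head (bucket, project, and zone are placeholders; flag names may differ slightly between bdutil versions):

# Clone bdutil at head and deploy a Spark cluster with it
git clone https://github.com/GoogleCloudPlatform/bdutil.git
cd bdutil
./bdutil -b <your-bucket> -p <your-project> -z <your-zone> -e extensions/spark/spark_env.sh deploy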