Tags: pyspark, derby, jupyter-notebook, apache-spark-2.0

How to run multiple instances of Spark 2.0 at once (in multiple Jupyter Notebooks)?


I have a script which conveniently allows me to use Spark in a Jupyter Notebook. This works great until I run Spark commands in a second notebook (for instance, to test out some scratch work).
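
For context, each notebook ends up creating its own SparkSession, along these lines (the app name and file path are just placeholders):

    from pyspark.sql import SparkSession

    # Each notebook builds its own session; with the default Derby metastore,
    # only the first one can successfully boot /metastore_db.
    spark = (SparkSession.builder
             .appName("scratch-notebook")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.read.json("some_data.json")  # the call that triggers the error below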

I get a very long error message, the key parts of which seem to be:

Py4JJavaError: An error occurred while calling o31.json. : java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

. . .

Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /metastore_db

The problem seems to be that I can only run one instance of Spark at a time.

How can I set up Spark to run in multiple notebooks at once?


Solution

  • By default Spark runs on top of Hive and Hadoop, and keeps the metadata for its Hive metastore in Derby, a lightweight embedded database. Embedded Derby only allows one process to open the database at a time, so when you start a second notebook and begin running Spark commands, the second session cannot boot /metastore_db and crashes.

    To get around this, you can point Spark's Hive metastore at Postgres instead of Derby.

    Install Postgres with Homebrew (brew install postgresql), if you do not have it installed already.

    Then download postgresql-9.4.1212.jar (assuming you are running Java 1.8, a.k.a. Java 8) from https://jdbc.postgresql.org/download.html

    Move this .jar file into the libexec/jars/ directory of your Spark installation.

    ex: /usr/local/Cellar/apache-spark/2.0.1/libexec/jars/

    (on Mac you can find where Spark is installed by typing brew info apache-spark in the command line)
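
    A quick way to confirm the jar landed on Spark's classpath is to ask the JVM for the driver class from a fresh notebook. This is a minimal sketch that goes through Spark's py4j gateway (spark._jvm is an internal attribute, used here only as a sanity check):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Raises an error if postgresql-9.4.1212.jar is not on Spark's
    # classpath (i.e. not in libexec/jars/).
    spark._jvm.java.lang.Class.forName("org.postgresql.Driver")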

    Next, create hive-site.xml in the libexec/conf directory of your Spark installation.

    ex: /usr/local/Cellar/apache-spark/2.0.1/libexec/conf

    This can be done through a text editor - just save the file with a '.xml' extension.

    hive-site.xml should contain the following text:

    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://localhost:5432/hive_metastore</value>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>mypassword</value>
      </property>
    </configuration>
    

    'hive' and 'mypassword' can be replaced with whatever makes sense to you, but they must match the user and password created in the next step.

    Finally, create the matching user and password in Postgres: in the command line, run the following commands -

    psql
    CREATE USER hive;                                  -- matches ConnectionUserName in hive-site.xml
    ALTER ROLE hive WITH PASSWORD 'mypassword';        -- matches ConnectionPassword
    CREATE DATABASE hive_metastore;                    -- the database named in the JDBC URL
    GRANT ALL PRIVILEGES ON DATABASE hive_metastore TO hive;
    \q
    

    That's it, you're done. Spark should now run in multiple Jupyter Notebooks simultaneously.
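
    To verify, restart any running kernels, then run the same setup in two notebooks at once; both should come up without the XSDB6 error. A minimal sketch (the app name is a placeholder):

    from pyspark.sql import SparkSession

    # Works in notebook 1 and notebook 2 simultaneously, since the
    # metastore now lives in Postgres rather than embedded Derby.
    spark = (SparkSession.builder
             .appName("second-notebook")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("SHOW TABLES").show()  # touches the metastore as a sanity check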