Tags: libraries, amazon-emr, jupyterhub

Adding libraries to the PySpark kernel on Jupyter/JupyterHub on EMR


I'm trying to use Matplotlib with the PySpark3 kernel in JupyterHub (0.9.4), which runs in a Docker container on an AWS EMR (5.20) cluster. Four kernels come preinstalled on that JupyterHub: Python, PySpark, PySpark3, and Spark. Importing Matplotlib with the Python kernel works fine. However, when I try "import matplotlib as plt" with either the PySpark or the PySpark3 kernel, I get back a "matplotlib not found" error. I've been trying to find a guide on this, but with no luck.

Could you please help?

Thanks and regards, Averell


Solution

  • Further reading showed that I was wrong: code run with the PySpark kernels actually executes on the Spark cluster (the EMR cluster itself), while code run with the Python kernel executes on the JupyterHub server (the Docker container); see the first sketch below.

    Matplotlib came preinstalled in the Docker image, not on the EMR cluster. Installing matplotlib on the EMR master node resolves the import error in the PySpark kernels. However, that alone didn't get me any further (at least for now) in plotting graphs from Spark dataframes, because the figures would be generated on the cluster rather than rendered in the notebook.

    I finally got what I wanted by following this guide: transfer the result to "local" (here "local" means the JupyterHub server, i.e. the Docker container) and plot it with matplotlib locally using the %%local magic, as sketched below: https://github.com/jupyter-incubator/sparkmagic/blob/master/examples/Pyspark%20Kernel.ipynb
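
    As a quick sanity check of where each kernel actually runs (a minimal sketch; the hostnames you see will depend on your setup), compare these two cells in a PySpark or PySpark3 notebook:

    ```python
    # Cell 1 - no magic: sparkmagic submits this through Livy, so it executes
    # on the Spark driver on the EMR cluster and prints an EMR node hostname.
    import socket
    print(socket.gethostname())
    ```

    ```python
    %%local
    # Cell 2 - %%local: the same code now runs on the JupyterHub server
    # (the Docker container), where matplotlib is already installed.
    import socket
    print(socket.gethostname())
    ```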
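
    Concretely, the pattern from that notebook looks roughly like the sketch below. The names my_spark_df, my_table and value are placeholders, and the -o flag asks sparkmagic to copy the (small) query result back into the local session as a pandas DataFrame named plot_df:

    ```python
    # Runs on the cluster: expose a Spark DataFrame (placeholder name) as a
    # temp view so it can be queried with %%sql.
    my_spark_df.createOrReplaceTempView("my_table")
    ```

    ```python
    %%sql -o plot_df
    SELECT value FROM my_table LIMIT 1000
    ```

    ```python
    %%local
    # Runs in the JupyterHub container: plot_df is now an ordinary pandas
    # DataFrame, so the locally installed matplotlib can plot it.
    import matplotlib.pyplot as plt
    plot_df["value"].hist()
    plt.show()
    ```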