
"No module named numpy..." error on Google Cloud Dataproc: how do I upgrade numpy on Dataproc?


I keep getting this error when I run my notebook on Google Cloud Dataproc:

import numpy as np
ImportError: ('No module named numpy', <function _parse_datatype_json_string at 0x7fc294e25230>.......

But I don't get the error when running locally with the same Python 2.7.

I found that the version on my local machine (numpy.version.version) is '1.11.1',

but on Dataproc it is the older '1.8.2'.

As suggested in the answers to "ImportError: No module named numpy - Google Cloud Dataproc when using Jupyter Notebook", I tried this to upgrade:

import os
import sys

sys.path.append('/usr/lib/python2.7/dist-packages')

os.system("sudo apt-get install python-pandas -y")
os.system("sudo apt-get install python-numpy -y")
os.system("sudo apt-get install python-scipy -y")
os.system("sudo apt-get install python-sklearn -y")

import pandas
import numpy
import scipy
import sklearn

I still get version 1.8.2.

The pip command doesn't have permission to install on Dataproc, and running pip with sudo didn't work either:

IOError: [Errno 13] Permission denied: '/usr/local/bin/miniconda/lib/python2.7/site-packages/easy-install.pth'
my-user-name@cluster-name-1-m:~$ sudo pip install numpy
sudo: pip: command not found

Solution

  • Edit: We've now added a metadata option JUPYTER_CONDA_PACKAGES to automatically pre-install packages through conda during the Jupyter setup. As now covered by the examples, the preferred way to get your packages installed is with:

    gcloud dataproc clusters create my-cluster \
        --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
        --metadata JUPYTER_CONDA_PACKAGES=numpy:pandas:scikit-learn:scipy
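
If you generate this command from a script, the one detail to get right is that the package names in JUPYTER_CONDA_PACKAGES are joined with colons. A minimal sketch, assuming a hypothetical helper named build_create_command (it is not part of gcloud or of the initialization actions) that only assembles the arguments shown above:

```python
# Hypothetical helper for illustration; it only assembles the command shown above.
INIT_ACTION = "gs://dataproc-initialization-actions/jupyter/jupyter.sh"


def build_create_command(cluster_name, packages):
    """Return the gcloud argument list that pre-installs `packages` through conda."""
    if not packages:
        raise ValueError("at least one package name is required")
    # The metadata value is a colon-separated list of conda package names.
    metadata = "JUPYTER_CONDA_PACKAGES=" + ":".join(packages)
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--initialization-actions", INIT_ACTION,
        "--metadata", metadata,
    ]


command = build_create_command(
    "my-cluster", ["numpy", "pandas", "scikit-learn", "scipy"]
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.check_call if you want the script to create the cluster itself.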
    

    If you are not using this metadata value, the original answer is kept below for posterity and for more detail on the internals:

    Dataproc's jupyter initialization action also installs conda, so on your master node you can just run:

    sudo su
    conda install numpy
    

    Depending on how it's used, you may also need it on your worker nodes. To install it automatically on your deployments, customize the main jupyter.sh script by adding the line conda install numpy anywhere after the /dataproc-initialization-actions/conda/bootstrap-conda.sh line, upload your custom init action to GCS, and specify it instead of gs://dataproc-initialization-actions/jupyter/jupyter.sh. Something like:

    gsutil cp gs://dataproc-initialization-actions/jupyter/jupyter.sh .
    echo "conda install numpy" >> jupyter.sh
    gsutil cp jupyter.sh gs://my-bucket/jupyter_with_numpy.sh
    gcloud dataproc clusters create my-cluster \
        --initialization-actions gs://my-bucket/jupyter_with_numpy.sh
    

    Finally, you can also use the built-in package manager in the Jupyter UI to browse and install conda packages:

    Select Conda Packages menu dropdown from Kernel menu

    Browse Conda packages

    Install Conda packages
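
Whichever route you take, it can help to confirm from inside the notebook which copy of each package the kernel actually imports. Python uses the first match on sys.path, so appending a directory to the end of the path will not override an older copy that appears earlier. A minimal sketch, assuming a hypothetical helper named describe_module, that reports the version and file location of whatever import resolves to:

```python
from __future__ import print_function

import importlib


def describe_module(name):
    """Return name, version and path of the module that `import name` resolves to.

    Returns None when the module cannot be imported in this interpreter.
    """
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return {
        "name": name,
        "version": getattr(module, "__version__", "unknown"),
        "path": getattr(module, "__file__", "built-in"),
    }


# Run this in a notebook cell after installing; the package list is just an example.
for package in ("numpy", "pandas", "scipy", "sklearn"):
    info = describe_module(package)
    if info is None:
        print(package, "is not importable from this kernel")
    else:
        print(package, info["version"], info["path"])
```

If the reported path is not under the conda installation, the kernel is still picking up a different copy of the package than the one conda installed.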