Tags: pyspark, pip, jupyter-notebook, amazon-emr

Jupyter Notebook PySpark Kernel referencing lowered pip version from host machine site-packages


I am using a Jupyter Notebook provided by an AWS managed service called EMR Studio. My understanding of how these notebooks work is that they are hosted on EC2 instances that I provision as part of my EMR cluster, with the PySpark kernel specifically running on the task nodes.

Currently, when I run the command sc.list_packages() I see that pip is at version 9.0.1, whereas if I SSH onto the master node and run pip list I see that pip is at version 20.2.2. I have issues running the command sc.install_pypi_package() due to the older pip version in the notebook.
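
Roughly, the checks I am running in a notebook cell look like this (sc.list_packages() is the EMR notebook helper mentioned above; the pip.__version__ and pip.__file__ lines are just standard introspection added here for illustration):

    # In a PySpark notebook cell on the EMR cluster
    sc.list_packages()        # reports pip 9.0.1 in my session

    import pip
    print(pip.__version__)    # another way to see which pip the kernel resolves
    print(pip.__file__)       # shows which site-packages directory it was loaded from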

In a notebook cell, if I run import pip followed by pip, I see that the module is located at:

<module 'pip' from '/mnt1/yarn/usercache/<LIVY_IMPERSONATION_ROLE>/appcache/application_1652110228490_0001/container_1652110228490_0001_01_000001/tmp/1652113783466-0/lib/python3.7/site-packages/pip/__init__.py'> 

I assume this is most likely inside a virtualenv of some sort running as an application on the task node, but I am unsure of this and have no concrete evidence of how the virtualenv is provisioned, if there is one.
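
For what it is worth, a quick way to check from the notebook whether a virtualenv is active (standard interpreter introspection, shown only as an illustration of how one might confirm this):

    import sys
    print(sys.executable)   # which interpreter the kernel is actually running
    print(sys.prefix)       # presumably points into the YARN appcache path above if a per-application virtualenv is in use
    # legacy virtualenv sets sys.real_prefix; venv-style environments change sys.base_prefix instead
    print(hasattr(sys, "real_prefix") or sys.prefix != getattr(sys, "base_prefix", sys.prefix))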

If I run sc.uninstall_package('pip') followed by sc.list_packages(), I see pip at version 20.2.2, which is what I am looking to start off with. The module path is the same as previously mentioned.
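
In code form, that sequence is simply:

    sc.uninstall_package('pip')   # removes the pip 9.0.1 that was packaged into the session's virtualenv
    sc.list_packages()            # now reports pip 20.2.2, although the module path is unchanged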

How can I get pip 20.2.2 in the virtualenv instead of pip 9.0.1?

If I import a package like numpy, I see that the module is located somewhere different from pip. Any reason for this?

<module 'numpy' from '/usr/local/lib64/python3.7/site-packages/numpy/__init__.py'>
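
Printing the module paths side by side, together with the interpreter's search path, shows the difference (standard introspection again; the paths in the comments are the ones from my session):

    import sys
    import numpy
    import pip

    print(pip.__file__)     # .../appcache/.../lib/python3.7/site-packages/pip/__init__.py (per-application virtualenv)
    print(numpy.__file__)   # /usr/local/lib64/python3.7/site-packages/numpy/__init__.py (node-wide site-packages)

    for entry in sys.path:  # the ordering here determines which location wins for a given package
        print(entry)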

As for pip 9.0.1, the only reference I can find at the moment is /lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl. One directory up from this I see a file called virtualenv-15.1.0-py2.7.egg-info which, if I cat it, states that it upgrades to pip 9.0.1. I tried removing the pip 9.0.1 wheel and replacing it with a pip 20.2.2 wheel, but this caused issues with the PySpark kernel being able to provision properly. There is also a virtualenv.py file which references __version__ = "15.1.0".


Solution

  • I was able to find a solution for updating pip, setuptools, and wheel in the virtualenv that PySpark uses.

    I initially had to determine how pip 9 was being sourced. After SSH'ing to my EMR master node, I changed directories to the root (cd /) and ran sudo find . -name "pip*" to recursively search for where pip files might be located.

    In my scenario there is a pip 9 wheel located at:

    ./usr/lib/python2.7/site-packages/virtualenv_support/pip-9.0.1-py2.py3-none-any.whl
    

    Searching around a bit more in /usr/lib/python2.7/site-packages, there is a virtualenv.py that is invoked to create the virtualenv; this is explained a bit more below.

    Within the PySpark notebook session, running %%info shows that the virtualenv is created from this file path (thanks Parag):

    'spark.pyspark.virtualenv.bin.path': '/usr/bin/virtualenv'
    

    Running cat /usr/bin/virtualenv shows that the virtualenv is being invoked from the following commands:

    #!/usr/bin/python
    import virtualenv
    virtualenv.main()
    

    The version of python in /usr/bin is Python 2.7. At the terminal I ran the following commands in sequence:

    1. /usr/bin/python
    2. import virtualenv
    3. virtualenv

    This outputs:

    <module 'virtualenv' from '/usr/lib/python2.7/site-packages/virtualenv.py'>
    

    I have sometimes seen a virtualenv.pyc file being used here, located in /usr/lib/python2.7/site-packages/, but other users have suggested that .pyc files can be deleted.

    On the EMR master node I ran /usr/bin/virtualenv, which shows some flags that can be used. First I ran /usr/bin/virtualenv --verbose ./myVE, which shows that pip 9.0.1 is packaged into the virtualenv I created. If I instead run /usr/bin/virtualenv --verbose --download ./myVE2, the output shows that updated versions of pip, setuptools, and wheel are downloaded from Artifactory (our private PyPI mirror) into the virtualenv. There is an /etc/pip.conf that we use to set the index-url and trusted-host so that Artifactory is used instead of PyPI.
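
    For context, the /etc/pip.conf we use is along these lines (the URL below is only a placeholder, not our actual Artifactory address):

        [global]
        index-url = https://artifactory.example.com/artifactory/api/pypi/pypi-remote/simple
        trusted-host = artifactory.example.com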

    At this point it seems that, by default, the EMR cluster's virtualenv.py does not download updated wheels from Artifactory/PyPI and instead uses the wheel files located in /usr/lib/python2.7/site-packages/virtualenv_support/*.whl.

    Running cat /usr/lib/python2.7/site-packages/virtualenv.py shows that this version of virtualenv is 15.1.0, which is very outdated (a 2016 release).

    Reading more into virtualenv.py shows that the main() function has a block of code as follows:

    parser.add_option(
        "--download",
        dest="download",
        action="store_true",
        help="Download preinstalled packages from PyPI.",
    )
    

    I compared this virtualenv.py file on my EMR master node to the official release of virtualenv==15.1.0 from PyPI (https://pypi.org/project/virtualenv/15.1.0/). I downloaded the tar.gz, unzipped it on my local machine, and found a virtualenv.py file in the unzipped folder. Comparing the official virtualenv.py with the EMR cluster's virtualenv.py using diff, only a couple of lines differ. The main difference is that the parser.add_option call from the code block above has default=True, in the official virtualenv.py; the EMR cluster's virtualenv.py does not.

    parser.add_option(
        "--download",
        dest="download",
        action="store_true",
        default=True,
        help="Download preinstalled packages from PyPI.",
    )
    

    From here, I copied the EMR cluster's virtualenv.py and updated that line of code to set default=True,. I then used this updated virtualenv.py as part of an EMR bootstrap script so that the file is updated on all node types (master/core/task).

    The bootstrap script does the following:

    1. sudo rm /usr/lib/python2.7/site-packages/virtualenv.pyc
    2. sudo rm /usr/lib/python2.7/site-packages/virtualenv.py
    3. sudo aws s3 cp <UPDATED_VIRTUALENV_S3_PATH> /usr/lib/python2.7/site-packages/

    Ensure that the file copied from S3 is named exactly virtualenv.py, in case issues arise from the filename not being kept the same.

    Now when I start up a PySpark kernel, spark.pyspark.virtualenv.bin.path invokes the updated virtualenv.py, and I can confirm that pip is at a much higher version (20+), which is what I was looking to achieve.
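
    As a final sanity check in a fresh notebook session (the package name below is only an arbitrary example, not something specific to my setup):

        sc.list_packages()                  # pip (along with setuptools and wheel) now shows at the newer versions
        sc.install_pypi_package("boto3")    # arbitrary example package; installs succeed again with the newer pip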