pandas pyspark amazon-emr

Trying to install pandas for Pyspark running on Amazon EMR


This question could really apply to any Python package. I have a bootstrap script that runs before my Spark jobs, and I assume I need to install pandas in that script. I've tried many different things (pip install, easy_install, yum install, etc.), but nothing seems to work. The jobs all fail because pandas cannot be imported in Spark. I'm running EMR v5.12.1 and Python 3.4.


Solution

  • sudo python3 -m pip install pandas
    

    This is what we have written in our bootstrap.sh to install pandas.
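
Since the question notes that this applies to any Python package, here is a minimal sketch of what such a bootstrap script might look like. It assumes the package names are passed to the script as bootstrap action arguments; the function name install_python_packages is hypothetical and not taken from the answer.

```shell
#!/bin/bash
# Sketch of an EMR bootstrap script. The function name and the use of
# script arguments for package names are illustrative assumptions.
set -euo pipefail

install_python_packages() {
  local pkg
  for pkg in "$@"; do
    # Same command as in the answer above, applied to each package name.
    sudo python3 -m pip install "$pkg"
  done
}

# Install every package name passed to the script as an argument.
install_python_packages "$@"
```

With pandas passed as the only argument, this runs exactly the command from the answer. Invoking pip through python3 puts the package into the Python 3 environment, so it only helps if the Spark jobs also run under python3. Running python3 -c "import pandas" on a node after startup is one way to check that the install worked.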