I have installed both Spark 3.1.3 and Anaconda 4.12.0 on Ubuntu 20.04.
I have set PYSPARK_PYTHON to the Python binary of a conda environment called my_env:
export PYSPARK_PYTHON=~/anaconda3/envs/my_env/bin/python
I installed several packages in the conda environment my_env using pip. Here is a portion of the output of the pip freeze command:
numpy==1.22.3
pandas==1.4.1
py4j==0.10.9.3
pyarrow==7.0.0
N.B.: the pyspark package is not installed in the conda environment my_env. I would like to be able to launch a pyspark shell against different conda environments without having to reinstall pyspark in each one (ideally, I would only modify PYSPARK_PYTHON). This would also avoid having different versions of Spark across conda environments (which is sometimes desirable, but not always).
When I launch a pyspark shell using the pyspark command, I can indeed import pandas and numpy, which confirms that PYSPARK_PYTHON is properly set (my_env is the only conda env with pandas and numpy installed; moreover, pandas and numpy are not installed in any other Python installation, even outside conda; and finally, if I change PYSPARK_PYTHON I am no longer able to import pandas or numpy).
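As a quick sanity check (a minimal sketch; the interpreter path is the one from the export above and the version numbers are the ones from the pip freeze excerpt), the interpreter picked up via PYSPARK_PYTHON can also be confirmed from inside the pyspark shell:

# run inside the pyspark shell: confirm which interpreter the driver is using
import sys
print(sys.executable)   # expected to end with anaconda3/envs/my_env/bin/python

import pandas, numpy
print(pandas.__version__, numpy.__version__)   # 1.4.1 and 1.22.3 per the pip freeze above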
Inside the pyspark shell, the following code works fine (creating and showing a toy Spark dataframe):
sc.parallelize([(1,2),(2,4),(3,5)]).toDF(["a", "b"]).show()
However, if I try to convert the above dataframe into a pandas-on-Spark dataframe, it does not work. The command
sc.parallelize([(1,2),(2,4),(3,5)]).toDF(["t", "a"]).to_pandas_on_spark()
returns:
AttributeError: 'DataFrame' object has no attribute 'to_pandas_on_spark'
I tried to first import pandas (which works fine) and then pyspark.pandas before running the above command, but when I run
import pyspark.pandas as ps
I obtain the following error:
ModuleNotFoundError: No module named 'pyspark.pandas'
Any idea why this happens?
Thanks in advance.
It seems that you need Apache Spark 3.2, not 3.1.3: the pyspark.pandas module (the pandas API on Spark, formerly Koalas) was only introduced in Spark 3.2.0. Upgrade to 3.2 and you will have the desired API.
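For reference, a minimal sketch of what should work once you are on Spark 3.2+ (run inside the pyspark shell, so sc is already defined; to_pandas_on_spark is the method name documented for 3.2):

# Spark >= 3.2 ships the pandas API on Spark (the former Koalas project)
import pyspark.pandas as ps   # no longer raises ModuleNotFoundError on 3.2+

sdf = sc.parallelize([(1, 2), (2, 4), (3, 5)]).toDF(["a", "b"])

# convert the Spark DataFrame to a pandas-on-Spark DataFrame
psdf = sdf.to_pandas_on_spark()
print(type(psdf))   # <class 'pyspark.pandas.frame.DataFrame'>
psdf.head()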