I have installed both Spark 3.1.3 and Anaconda 4.12.0 on Ubuntu 20.04.
I have set PYSPARK_PYTHON to the Python binary of a conda environment called my_env:
export PYSPARK_PYTHON=~/anaconda3/envs/my_env/bin/python
I installed several packages in the conda environment my_env using pip. Here is a portion of the output of the pip freeze command:
numpy==1.22.3
pandas==1.4.1
py4j==0.10.9.3
pyarrow==7.0.0
N.B.: the pyspark package is not installed in the conda environment my_env. I would like to be able to launch a pyspark shell against different conda environments without having to reinstall pyspark in each one (ideally, I would only modify PYSPARK_PYTHON). This would also avoid having different versions of Spark across conda environments (which is sometimes desirable, but not always).
When I launch a pyspark shell using the pyspark command, I can indeed import pandas and numpy, which confirms that PYSPARK_PYTHON is properly set (my_env is the only conda env with pandas and numpy installed; moreover, pandas and numpy are not installed in any other Python installation, even outside conda; and finally, if I change PYSPARK_PYTHON I am no longer able to import pandas or numpy).
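As a quick sanity check (a minimal sketch; the interpreter path is the one from the export above and the version numbers are the ones from the pip freeze excerpt), the interpreter picked up via PYSPARK_PYTHON can also be confirmed from inside the pyspark shell:

# run inside the pyspark shell: confirm which interpreter the driver is using
import sys
print(sys.executable)   # expected to end with anaconda3/envs/my_env/bin/python

import pandas, numpy
print(pandas.__version__, numpy.__version__)   # 1.4.1 and 1.22.3 per the pip freeze above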
Inside the pyspark shell, the following code works fine (creating and showing a toy Spark dataframe):
sc.parallelize([(1,2),(2,4),(3,5)]).toDF(["a", "b"]).show()
However, if I try to convert the above dataframe into a pandas-on-Spark dataframe, it does not work. The command
sc.parallelize([(1,2),(2,4),(3,5)]).toDF(["t", "a"]).to_pandas_on_spark()
returns:
AttributeError: 'DataFrame' object has no attribute 'to_pandas_on_spark'
I tried to first import pandas (which works fine) and then pyspark.pandas before running the above command, but when I run
import pyspark.pandas as ps
I obtain the following error:
ModuleNotFoundError: No module named 'pyspark.pandas'
Any idea why this happens?
Thanks in advance.
It seems that you need Apache Spark 3.2, not 3.1.3: the pyspark.pandas module (the pandas API on Spark, formerly Koalas) was only introduced in Spark 3.2.0. Upgrade to 3.2 and you will have the desired API.
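For reference, a minimal sketch of what should work once you are on Spark 3.2+ (run inside the pyspark shell, so sc is already defined; to_pandas_on_spark is the method name documented for 3.2):

# Spark >= 3.2 ships the pandas API on Spark (the former Koalas project)
import pyspark.pandas as ps   # no longer raises ModuleNotFoundError on 3.2+

sdf = sc.parallelize([(1, 2), (2, 4), (3, 5)]).toDF(["a", "b"])

# convert the Spark DataFrame to a pandas-on-Spark DataFrame
psdf = sdf.to_pandas_on_spark()
print(type(psdf))   # <class 'pyspark.pandas.frame.DataFrame'>
psdf.head()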