This recent blog post from Databricks (https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html) says that the only change needed to a pandas program to run it under pyspark.pandas is to change `from pandas import read_csv` to `from pyspark.pandas import read_csv`.
But that does not seem right. What about all the other (non-`read_csv`) references to pandas? Isn't the right approach to change `import pandas as pd` to `import pyspark.pandas as pd`? Then all the other pandas references in your existing program will point to the pyspark version of pandas.
You got that right. The canonical form they suggest, however, is `from pyspark import pandas as ps`.
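As a minimal sketch of why the aliasing approach works: the program below is ordinary pandas, and (assuming pyspark >= 3.2 is installed) only the import line would need to change for it to run on Spark, since every other reference goes through the alias.

```python
# To run this on Spark, swap only the import line:
#   import pyspark.pandas as pd     # aliasing approach from the question
# or use the canonical alias from the answer:
#   from pyspark import pandas as ps
import pandas as pd

# Every pandas reference below goes through the `pd` alias,
# so nothing else in the program needs to change.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
total = df["a"].sum()
print(total)  # prints 6
```

Note that aliasing pyspark.pandas as `pd` works for code you own, but `ps` is the conventional alias in the Spark documentation, which makes it clear at a glance that the DataFrame is Spark-backed rather than in-memory pandas.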