Starting to use PySpark on Databricks, and I see I can import pyspark.pandas alongside pandas. What is the difference? I assume it's not like Koalas, right?
PySpark is an interface for Apache Spark in Python. It allows you to write Spark applications using Python and provides the PySpark shell to analyze data in a distributed environment.
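For example, with the classic PySpark DataFrame API (a minimal sketch; on Databricks a SparkSession named spark already exists, so the builder line is only needed outside a notebook):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks, `spark` is pre-created; this is a no-op there.
spark = SparkSession.builder.getOrCreate()

# Spark's own DataFrame API, with its own method names (groupBy, agg, show, ...)
sdf = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
sdf.groupBy("key").agg(F.sum("value").alias("total")).show()
```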
pyspark.pandas is the pandas API on Spark: it gives you pandas-style DataFrames and functions whose operations are executed on Spark DataFrames under the hood, so familiar pandas code scales out across a cluster.
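Here is a small sketch of the same aggregation in pandas-on-Spark syntax, plus how to cross between the two worlds (assuming Spark 3.2+, where pandas_api() is available; the column names are just illustrative):

```python
import pyspark.pandas as ps

# pandas-style syntax, but distributed execution on Spark
psdf = ps.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
print(psdf.groupby("key")["value"].sum())  # pandas-style groupby

# Convert between the APIs when needed
sdf = psdf.to_spark()       # pandas-on-Spark DataFrame -> Spark DataFrame
psdf2 = sdf.pandas_api()    # Spark DataFrame -> pandas-on-Spark (Spark >= 3.2)
```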
Koalas was the Databricks library that originally provided this pandas-like layer on Spark; as of Spark 3.2 it was merged into PySpark itself as pyspark.pandas. So pyspark.pandas essentially is Koalas, just shipped with Spark, and the standalone Koalas package is now deprecated.
This blog shows some differences between pyspark.pandas and plain PySpark: https://towardsdatascience.com/run-pandas-as-fast-as-spark-f5eefe780c45
The pyspark.pandas documentation is of course the reference: https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.pandas/index.html