
What is the difference between pyspark.pandas and pandas?


I'm starting to use PySpark on Databricks, and I see I can import pyspark.pandas alongside pandas. What is the difference? I assume it's not like Koalas, right?


Solution

  • PySpark is the Python interface for Apache Spark. It lets you write Spark applications in Python and provides the PySpark shell for analyzing data in a distributed environment. pyspark.pandas is an API that lets you use pandas-style functions and operations on Spark DataFrames, so the same familiar syntax runs distributed across a cluster rather than in a single process.

    Koalas was a separate library developed by Databricks that offered pandas-like operations on Spark data. It was merged into Spark itself as pyspark.pandas starting with Spark 3.2, so pyspark.pandas is effectively the successor to Koalas.

    This blog post shows some differences between pyspark.pandas and pyspark: https://towardsdatascience.com/run-pandas-as-fast-as-spark-f5eefe780c45

    The pyspark.pandas documentation is, of course, the primary reference: https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.pandas/index.html