Starting to use PySpark on Databricks, and I see I can import pyspark.pandas alongside pandas. What is the difference? I assume it's not like Koalas, right?
PySpark is an interface for Apache Spark in Python. It allows you to write Spark applications using Python and provides the PySpark shell to analyze data in a distributed environment.
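For example, with the classic PySpark DataFrame API (a minimal sketch; on Databricks a SparkSession named spark already exists, so the builder line is only needed outside a notebook):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks, `spark` is pre-created; this is a no-op there.
spark = SparkSession.builder.getOrCreate()

# Spark's own DataFrame API, with its own method names (groupBy, agg, show, ...)
sdf = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
sdf.groupBy("key").agg(F.sum("value").alias("total")).show()
```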
pyspark.pandas is the pandas API on Spark: it gives you pandas-style DataFrames and functions whose operations are executed on Spark DataFrames under the hood, so familiar pandas code scales out across a cluster.
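Here is a small sketch of the same aggregation in pandas-on-Spark syntax, plus how to cross between the two worlds (assuming Spark 3.2+, where pandas_api() is available; the column names are just illustrative):

```python
import pyspark.pandas as ps

# pandas-style syntax, but distributed execution on Spark
psdf = ps.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
print(psdf.groupby("key")["value"].sum())  # pandas-style groupby

# Convert between the APIs when needed
sdf = psdf.to_spark()       # pandas-on-Spark DataFrame -> Spark DataFrame
psdf2 = sdf.pandas_api()    # Spark DataFrame -> pandas-on-Spark (Spark >= 3.2)
```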
Koalas was the Databricks library that originally provided this pandas-like layer on Spark; as of Spark 3.2 it was merged into PySpark itself as pyspark.pandas. So pyspark.pandas essentially is Koalas, just shipped with Spark, and the standalone Koalas package is now deprecated.
This blog shows some differences between pyspark.pandas and plain PySpark: https://towardsdatascience.com/run-pandas-as-fast-as-spark-f5eefe780c45
The pyspark.pandas documentation is of course the reference: https://spark.apache.org/docs/3.3.0/api/python/reference/pyspark.pandas/index.html