
What is the difference between the .mean() and the .avg() methods?


I am currently using PySpark to analyze some data. I have a CSV file with payroll data in it, and I want to know which job has the best pay. To find out, I need an average pay per job, ideally the median.

The methods available on groupBy in PySpark are: agg, avg, count, max, mean, min, pivot, sum

When I try the .mean() method it looks like this:

mean_pay_data = reduced_data.groupBy("JOB_TITLE").mean("REGULAR_PAY")
mean_pay_data.show(3)

# +--------------------+-----------------+
# |           JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+

Here is what it looks like with the .avg() method:

average_pay_data = reduced_data.groupBy("JOB_TITLE").avg("REGULAR_PAY")
average_pay_data.show(3)

# +--------------------+-----------------+
# |           JOB_TITLE| avg(REGULAR_PAY)|
# +--------------------+-----------------+
# |SENIOR SECURITY O...|59818.79285751433|
# |SENIOR TRAFFIC SU...| 72116.8394540951|
# |AIR CONDITIONING ...|98415.21726190476|
# +--------------------+-----------------+

They return the exact same values. What's the difference between mean() and avg()?

I also want to find the median, so that one person doesn't have too much of an impact. Since there is no median() method in PySpark I don't know what to do here.


Solution

  • The documentation for both avg and mean says:

    mean() is an alias for avg()

    The two functions are identical. Both names exist so that developers coming from different backgrounds feel comfortable.

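    Regarding the median (F is pyspark.sql.functions):

    • Approximate (efficient) median: F.expr('percentile_approx(col_name, .5) over()')

    • Exact (slower) median: F.expr('percentile(col_name, .5) over()')

    With over(), the expression is a window function evaluated across the whole DataFrame. For a median per group, drop over() and pass the expression to groupBy(...).agg(...).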