apache-spark apache-spark-sql apache-spark-dataset

How to return the median of a column in a dataset?

I want to select the median value of one of a dataset column (the median being the value located at the middle of a set of values ranked in an ascending order (for example, alphabetical order for strings)) basically I do it for now like this:

List<Row> listRows = dataset.where(dataset.col(column).isNotNull()).orderBy(column)
                .select(column).collectAsList();
int division = (int) Math.ceil(((double) listRows.size()) / 2);

String medianValue = division % 2 == 0 ? listRows.get(division).getString(0)
                    : listRows.get(division - 1).getString(0);

But i want to avoid using collectAsList() method and listRows var and rather get the value from one single Spark transformation, how to achieve that in Spark?

Example:

+---------+---------+
| name    | address |
+---------+---------+
| Patrick | NYC     |
+---------+---------+
| Mel     | SF      |
+---------+---------+
| John    | TX      |
+---------+---------+

if the target column is "name" then I want the value Mel to be returned since the ordering of "name" column will be like (John then Mel then Patrick (alphabetical order)) then the median is Mel. PS: i work in Java, but a solution with PySpark or Scala are also welcome. Thanks.

Solution

I think it would be acceptable to:

Create a row number column (using monotonically increasing id probably)
Cache the dataset (so you can count and find the median without having to re-evaluate the whole plan)
Count the dataset
Find the median
Filter the dataset where row number equals to your found median

Or in code, something like this:

listRows = listRows.withColumn("id", expr("ROW_NUMBER() OVER (ORDER BY name)"));

val count = listRows.cache().count();
val median = count / 2; // and some more operations if you are interested, like rounding

listRows.where(col("id") === median) // returns the row with median

Another workaround can be:

Create a row number column
Find the maximum row number column
Create another column that calculates your median from step 1 and 2
Filter where row number equals the calculation in step 3