Search code examples
apache-sparkapache-spark-sqlapache-spark-dataset

How to return the median of a column in a dataset?


I want to select the median value of one of a dataset column (the median being the value located at the middle of a set of values ranked in an ascending order (for example, alphabetical order for strings)) basically I do it for now like this:

List<Row> listRows = dataset.where(dataset.col(column).isNotNull()).orderBy(column)
                .select(column).collectAsList();
int division = (int) Math.ceil(((double) listRows.size()) / 2);

String medianValue = division % 2 == 0 ? listRows.get(division).getString(0)
                    : listRows.get(division - 1).getString(0);

But i want to avoid using collectAsList() method and listRows var and rather get the value from one single Spark transformation, how to achieve that in Spark?

Example:

+---------+---------+
| name    | address |
+---------+---------+
| Patrick | NYC     |
+---------+---------+
| Mel     | SF      |
+---------+---------+
| John    | TX      |
+---------+---------+

if the target column is "name" then I want the value Mel to be returned since the ordering of "name" column will be like (John then Mel then Patrick (alphabetical order)) then the median is Mel. PS: i work in Java, but a solution with PySpark or Scala are also welcome. Thanks.


Solution

  • I think it would be acceptable to:

    1. Create a row number column (using monotonically increasing id probably)
    2. Cache the dataset (so you can count and find the median without having to re-evaluate the whole plan)
    3. Count the dataset
    4. Find the median
    5. Filter the dataset where row number equals to your found median

    Or in code, something like this:

    listRows = listRows.withColumn("id", expr("ROW_NUMBER() OVER (ORDER BY name)"));
    
    val count = listRows.cache().count();
    val median = count / 2; // and some more operations if you are interested, like rounding
    
    listRows.where(col("id") === median) // returns the row with median
    

    Another workaround can be:

    1. Create a row number column
    2. Find the maximum row number column
    3. Create another column that calculates your median from step 1 and 2
    4. Filter where row number equals the calculation in step 3