I want to select the median value of one of my dataset's columns (the median being the value located at the middle of a set of values ranked in ascending order, e.g. alphabetical order for strings). For now I basically do it like this:
List<Row> listRows = dataset.where(dataset.col(column).isNotNull()).orderBy(column)
        .select(column).collectAsList();
int division = (int) Math.ceil(((double) listRows.size()) / 2);
// check the parity of the list size, not of `division`, to pick the middle element
String medianValue = listRows.size() % 2 == 0 ? listRows.get(division).getString(0)
        : listRows.get(division - 1).getString(0);
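As a quick sanity check of that index arithmetic outside Spark (the class and method names here are just for illustration, not part of the original code):

```java
import java.util.Arrays;
import java.util.List;

public class MedianCheck {
    // Same arithmetic as above, applied to an already-sorted list:
    // division = ceil(n / 2); for odd n take index division - 1, for even n take index division.
    static String median(List<String> sorted) {
        int division = (int) Math.ceil(((double) sorted.size()) / 2);
        return sorted.size() % 2 == 0 ? sorted.get(division) : sorted.get(division - 1);
    }

    public static void main(String[] args) {
        // Sorted names from the example below
        System.out.println(median(Arrays.asList("John", "Mel", "Patrick"))); // prints Mel
    }
}
```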
But I want to avoid using the collectAsList() method and the listRows variable, and instead get the value from a single Spark transformation. How can I achieve that in Spark?
Example:
+---------+---------+
| name    | address |
+---------+---------+
| Patrick | NYC     |
| Mel     | SF      |
| John    | TX      |
+---------+---------+
If the target column is "name", then I want the value Mel to be returned, since the "name" column sorted in alphabetical order is (John, Mel, Patrick) and the median is Mel.
PS: I work in Java, but solutions in PySpark or Scala are also welcome.
Thanks.
I think it would be acceptable to assign each row a row number ordered by the target column, count the rows, and select the row whose number falls in the middle. Or in code, something like this:
Dataset<Row> withId = dataset.withColumn("id", expr("ROW_NUMBER() OVER (ORDER BY name)"));
long count = withId.cache().count();
long median = (count + 1) / 2; // 1-based middle row number; adjust the rounding for even counts if you prefer the upper median
Dataset<Row> medianRow = withId.where(col("id").equalTo(median)); // returns the row with the median
Another workaround can be: