
How to find the median of a column in PySpark?


I have a Spark DataFrame:

df = 
   a     b     c     d
0  12  12.0   car  bike
1  20  20.5   car  alto
2  15  12.0  bike   car
3  25    25  bike  jeep

I want to find the median of column 'a'. I couldn't find an appropriate PySpark way to compute the median, so I used the normal Python NumPy function instead, but I got the error below:

import numpy as np
median = df['a'].median()

Error:

TypeError: 'Column' object is not callable

Expected output:

17.5

Solution

  • You can use percentile_approx like this (note that the column name must not be quoted inside the SQL expression, otherwise 'a' is treated as a string literal rather than a column reference):

    from pyspark.sql import functions as F

    df.agg(F.expr("percentile_approx(a, 0.5)")).show()
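One caveat: percentile_approx returns an actual value from the column, so on this four-row sample it will return 15 rather than the interpolated 17.5; Spark's exact percentile function, `F.expr("percentile(a, 0.5)")`, interpolates between the two middle values. As a quick sanity check of the expected output, the interpolated median of column a's sample values can be computed on the driver with plain Python (this is just an illustration for small data, not part of the PySpark solution):

```python
import statistics

# Values of column 'a' from the sample DataFrame above
a_values = [12, 20, 15, 25]

# For an even number of values, the median is the mean of the two
# middle values: sorted -> [12, 15, 20, 25], middle pair (15, 20)
print(statistics.median(a_values))  # → 17.5
```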