Search code examples
rapache-sparkapache-spark-sqlsparkr

Get percentiles as a column in SparkR dataframe


I have a Spark data frame as below (dummy data) and would want to add percentiles as a column to it using SparkR:

UserId Values
  U1     5
  U2     50
  U3     76
  .
  .
  .
  . 
  .
  U1000   8

Example:

 UserId Values   Percentile
      U1     5     0.011144
      U2     50    0.212245
      U3     76    0.811986 
      .
      .
      .
      . 
      .
      U1000   8    0.0134567

Solution

  • Well, there is a function, percent_rank, but it only operates on window partitions; adapting the documentation example:

    library(SparkR)
    library(magrittr)
    sparkR.version()
    # [1] "2.2.0"    
    
    df <- createDataFrame(mtcars)
    ws <- orderBy(windowPartitionBy("am"), "hp")
    out <- select(df, over(percent_rank(), ws), df$hp, df$am)
    out <- out %>% withColumnRenamed(colnames(out)[1], 'percentile') # get rid of the ridiculously long column name 
    head(out)
    #   percentile  hp am
    # 1 0.00000000  62  0
    # 2 0.05555556  95  0
    # 3 0.11111111  97  0
    # 4 0.16666667 105  0
    # 5 0.22222222 110  0
    # 6 0.27777778 123  0
    

    If you don't want a window partition but a global percentile, as in your case, the easiest way I can think of is to add first a dummy constant column and group by this, which you can subsequently drop. So, here is the above example, but this time without partitioning by am:

    df_new = df %>% withColumn("dummy", lit("N"))
    ws <- orderBy(windowPartitionBy('dummy'), "hp")
    out <- select(df_new, over(percent_rank(), ws), df_new$hp)
    out <- out %>% withColumnRenamed(colnames(out)[1], 'percentile')
    

    The result of collect(out) is

       percentile  hp
    1  0.00000000  52
    2  0.03225806  62
    3  0.06451613  65
    4  0.09677419  66
    5  0.09677419  66
    6  0.16129032  91
    7  0.19354839  93
    8  0.22580645  95
    9  0.25806452  97
    10 0.29032258 105
    11 0.32258065 109
    12 0.35483871 110
    13 0.35483871 110
    14 0.35483871 110
    15 0.45161290 113
    16 0.48387097 123
    17 0.48387097 123
    18 0.54838710 150
    19 0.54838710 150
    20 0.61290323 175
    21 0.61290323 175
    22 0.61290323 175
    23 0.70967742 180
    24 0.70967742 180
    25 0.70967742 180
    26 0.80645161 205
    27 0.83870968 215
    28 0.87096774 230
    29 0.90322581 245
    30 0.90322581 245
    31 0.96774194 264
    32 1.00000000 335