
How do I calculate the percentile for each value in a PySpark DataFrame?


Let's say I have a PySpark DataFrame with a column "data".

I would like to assign each value in this column a "percentile" value, with a bin width of 5.

Here is a sketch of the Python code and the desired result:

import numpy as np

# Array of data
data = [5,6,9,87,2,3,5,7,2,6,5,2,3,4,69,4]

# Calculate percentiles
list_percentile = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
p = np.percentile(data, list_percentile)

# => Desired result
data_with_percentile = [(5, 60), (6, 65), ..., (4, 55)]

where the pair (5, 60) means that the value 5 lies in the 55-60% percentile bin.

Any help is welcome!


Solution

  • I feel like you could achieve this using approxQuantile.

    Each number in your data can fall into several quantiles. For example, 5 falls into the 50th, 55th, and 60th percentiles. dict(zip()) keeps only the last value seen for a duplicate key, so with list_percentile in ascending order, each quantile value ends up mapped to its highest percentile.

    Also note that this requires your percentiles to be expressed as decimals (e.g. .60 rather than 60).
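
    To see why dict(zip()) picks the highest percentile, here is a minimal standalone demonstration (the values are made up for illustration): when the same key appears more than once, the last pairing wins.

    ```python
    # dict() keeps the LAST value seen for a duplicate key, so zipping
    # ascending percentiles against repeated quantile values leaves each
    # value mapped to its highest percentile.
    quantile_values = [2, 2, 3, 3, 5, 5, 5]   # hypothetical approxQuantile output
    percentiles = [.05, .10, .15, .20, .25, .30, .35]

    m = dict(zip(quantile_values, percentiles))
    print(m)  # {2: 0.1, 3: 0.2, 5: 0.35}
    ```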

    from pyspark.sql.functions import monotonically_increasing_id

    data = [5,6,9,87,2,3,5,7,2,6,5,2,3,4,69,4]

    list_percentile = [.05, .10, .15, .20, .25, .30, .35, .40, .45, .50, .55, .60, .65, .70, .75, .80, .85, .90, .95]

    # Assumes an active SparkSession named `spark` (e.g. from the pyspark shell)
    df = spark.createDataFrame([[x] for x in data], ['i'])

    # Record the original row order so it can be restored after the join
    df = df.withColumn('order', monotonically_increasing_id())

    # approxQuantile returns one data value per requested quantile;
    # dict(zip()) keeps the highest quantile per value (see above)
    m = dict(zip(df.approxQuantile('i', list_percentile, .01), list_percentile))
    m = spark.createDataFrame(list(m.items()), ['i', 'bin'])

    # Join each value to its bin, restore the input order, collect as tuples
    output = df.join(m, on='i').orderBy('order').drop('order').rdd.map(tuple).collect()

    print(output)
    

    Output

    [(5, 0.6), (6, 0.75), (9, 0.85), (87, 0.95), (2, 0.15), (3, 0.3), (5, 0.6), (7, 0.8), (2, 0.15), (6, 0.75), (5, 0.6), (2, 0.15), (3, 0.3), (4, 0.4), (69, 0.9), (4, 0.4)]
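
    As a quick sanity check without Spark, the same binning idea can be sketched in plain NumPy. This is a rough local equivalent, not the answer's exact algorithm: it uses linearly interpolated percentile breakpoints (NumPy's default), while approxQuantile returns actual data points, so some bin assignments can differ between the two.

    ```python
    import numpy as np

    data = [5, 6, 9, 87, 2, 3, 5, 7, 2, 6, 5, 2, 3, 4, 69, 4]
    list_percentile = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50,
                       55, 60, 65, 70, 75, 80, 85, 90, 95]

    # Percentile breakpoints (linear interpolation, NumPy's default)
    p = np.percentile(data, list_percentile)

    # For each value, count how many breakpoints are <= it; the last such
    # breakpoint gives the highest percentile bin the value reaches.
    idx = np.searchsorted(p, data, side='right')
    data_with_percentile = [
        (x, list_percentile[i - 1] if i > 0 else 0)
        for x, i in zip(data, idx)
    ]

    print(data_with_percentile)  # e.g. starts with (5, 60), and 87 maps to 95
    ```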