Let's say I have a PySpark DataFrame with a column "data".
I would like to assign each value in this column a "percentile" value, using bins of width 5 (percentage points).
Here is a sketch of Python code and the desired result:
import numpy as np
# Array of data
data = [5,6,9,87,2,3,5,7,2,6,5,2,3,4,69,4]
# Calculate percentiles
list_percentile = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
p = np.percentile(data, list_percentile)
# => Desired result
data_with_percentile = [(5, 60), (6, 65), ..., (4, 55)]
where the pair (5, 60) means that the value 5 falls in the 55-60% percentile bin.
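For concreteness, here is a runnable NumPy sketch of that mapping; the np.searchsorted bin rule (take the highest listed percentile whose value does not exceed the data point) is my reading of the example, not part of the original sketch.
import numpy as np

data = [5, 6, 9, 87, 2, 3, 5, 7, 2, 6, 5, 2, 3, 4, 69, 4]
list_percentile = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
p = np.percentile(data, list_percentile)
# Index of the highest listed percentile whose value is <= each data point;
# values below the 5th percentile are clipped into the first bin.
idx = np.clip(np.searchsorted(p, data, side='right') - 1, 0, len(list_percentile) - 1)
data_with_percentile = [(v, list_percentile[i]) for v, i in zip(data, idx)]
print(data_with_percentile)  # e.g. (5, 60), as in the desired result above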
Any help is welcome!
I feel like you could achieve this using approxQuantile.
Each number in your data can be the value of several quantiles. For example, 5 is the value of the 50th, 55th, and 60th quantiles. dict(zip()) then keeps the highest quantile per input value, because list_percentile is in ascending order and, for duplicate keys, later entries overwrite earlier ones.
Also, this requires your percentiles to be in decimal form (fractions between 0 and 1), as illustrated below.
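To make both points concrete, here is a tiny plain-Python illustration (the values are the ones from the example above):
quantiles = [5.0, 5.0, 5.0]               # e.g. the 50th, 55th, and 60th quantile values
percentiles = [0.50, 0.55, 0.60]
print(dict(zip(quantiles, percentiles)))  # {5.0: 0.6} - the last (highest) percentile wins
# Whole-number percentiles can be converted to the decimals approxQuantile expects:
print([p / 100 for p in range(5, 100, 5)])  # [0.05, 0.1, ..., 0.95]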
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()

data = [5, 6, 9, 87, 2, 3, 5, 7, 2, 6, 5, 2, 3, 4, 69, 4]
list_percentile = [.05, .10, .15, .20, .25, .30, .35, .40, .45, .50, .55, .60, .65, .70, .75, .80, .85, .90, .95]

df = spark.createDataFrame([[x] for x in data], ['i'])
# Tag each row so the original order can be restored after the join
df = df.withColumn('order', monotonically_increasing_id())
# Map each quantile value to its bin; duplicate values collapse to the highest percentile
quantile_to_bin = dict(zip(df.approxQuantile('i', list_percentile, .01), list_percentile))
bins = spark.createDataFrame(list(quantile_to_bin.items()), ['i', 'bin'])
output = df.join(bins, on='i').orderBy('order').drop('order').rdd.map(tuple).collect()
print(output)
Output
[(5, 0.6), (6, 0.75), (9, 0.85), (87, 0.95), (2, 0.15), (3, 0.3), (5, 0.6), (7, 0.8), (2, 0.15), (6, 0.75), (5, 0.6), (2, 0.15), (3, 0.3), (4, 0.4), (69, 0.9), (4, 0.4)]
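Note that the third argument to approxQuantile is the relative error: .01 trades a little accuracy for speed on large data, while passing 0 computes exact quantiles at a higher cost.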