I have a pandas multi index that looks like this:
I would like to transform the Raw_Score into a percentile, where the comparison is against the day. So the top score of the day gets a 100, and the bottom score of the day gets at zero, with the other scores proportionately aligned. I would like to do this at the "all data", "Sector", or "Region" level. The dataframe would have a new column called "Adjusted_Score". I think I have to use groupby and quantile, but I am lost. Can someone point me in the right direction?
NOTE: Revised to show working code
import pandas as pd
import numpy as np
from datetime import datetime
from numpy import random
# ----------------------------------
# Set up a sample dataframe
# ----------------------------------
def create_df(num):
# create empty data frame in pandas
df = pd.DataFrame()
# add a range of dates
dates = pd.date_range(start="2022-04-01",end="2022-06-05").to_pydatetime().tolist()
df['Dates'] = dates
# generate a random 3 digit Value
#for i in range(len(dates)):
df['Raw_Score'] = np.random.randint(-999, 999, size=len(df))
# generate a random 2 digit Sector between 10 and 15
#for i in range(len(dates)):
df['Sector'] = np.random.randint(10, 20, size=len(df))
# give it a unique value
df['Region'] = num
return df
# make a big df
big_df = []
for num in range(10):
df = create_df(num)
df = pd.DataFrame(df)
while num == 0:
big_df = df.copy()
num = num + 1
else:
big_df = pd.concat((big_df, df), axis=0)
df = big_df.copy()
df.reset_index(inplace=True)
df = df.drop_duplicates()
df.set_index(["Region", "Sector", "Dates"], inplace=True)
# ----------------------------------
# Actual Problem Below
# ----------------------------------
# Step 1: Getting rank with groupby:
df['rank'] = df.groupby('Dates')[['Raw_Score']].apply(lambda g: g.rank(method='first'))
# Step 2: Obtain the max value:
df['max'] = df.groupby('Dates')[['rank']].transform(lambda g: g.max())
# Step 3:Calculate percentile by n/N*100:
df['percentile'] = df['rank'] / df['max'] * 100
df
Thank you!
Sample code:
# Generate sample data:
df = pd.DataFrame({
'var': ['a']*10 + ['b']*10,
'val': np.random.permutation(20)})
# Step 1: Getting rank with groupby:
df['rank'] = df.groupby('var')[['val']].apply(lambda g: g.rank(method='first'))
# Step 2: Obtain the max value:
df['max'] = df.groupby('var')[['rank']].transform(lambda g: g.max())
# Step 3:Calculate percentile by n/N*100:
df['percentile'] = df['rank'] / df['max'] * 100
If your dataset is large and you want to speed up the process: Step 2
simply:
subsample = df.groupby('var')[['rank']].max()
df = pd.merge(df, subsample, on=['var'], how='left')
For some reason, merge
in Pandas works much faster than transform
method
3 level of multi-index follows the same code above