Search code examples
pythonpandasmulti-indexbinning

Pandas Multi Index Dataframe Transform Raw Score into Quantile by Date and add Percentile Column


I have a pandas multi index that looks like this:

enter image description here

I would like to transform the Raw_Score into a percentile, where the comparison is against the day. So the top score of the day gets a 100, and the bottom score of the day gets at zero, with the other scores proportionately aligned. I would like to do this at the "all data", "Sector", or "Region" level. The dataframe would have a new column called "Adjusted_Score". I think I have to use groupby and quantile, but I am lost. Can someone point me in the right direction?

NOTE: Revised to show working code

import pandas as pd
import numpy as np

from datetime import datetime
from numpy import random

# ----------------------------------
# Set up a sample dataframe
# ----------------------------------

def create_df(num):
  # create empty data frame in pandas
  df = pd.DataFrame()

  # add a range of dates
  dates = pd.date_range(start="2022-04-01",end="2022-06-05").to_pydatetime().tolist()
  df['Dates'] = dates

  # generate a random 3 digit Value
  #for i in range(len(dates)):
  df['Raw_Score'] = np.random.randint(-999, 999, size=len(df))

  # generate a random 2 digit Sector between 10 and 15
  #for i in range(len(dates)):
  df['Sector'] = np.random.randint(10, 20, size=len(df))

  # give it a unique value
  df['Region'] = num

  return df

# make a big df
big_df = []

for num in range(10):
  df = create_df(num)
  df = pd.DataFrame(df)

  while num == 0:
    big_df = df.copy()
    num = num + 1
  else:
    big_df = pd.concat((big_df, df), axis=0)


df = big_df.copy()
df.reset_index(inplace=True)
df = df.drop_duplicates()

df.set_index(["Region", "Sector", "Dates"], inplace=True)

# ----------------------------------
# Actual Problem Below
# ----------------------------------

# Step 1: Getting rank with groupby:
df['rank'] = df.groupby('Dates')[['Raw_Score']].apply(lambda g: g.rank(method='first'))

# Step 2: Obtain the max value:
df['max'] = df.groupby('Dates')[['rank']].transform(lambda g: g.max())

# Step 3:Calculate percentile by n/N*100:
df['percentile'] = df['rank'] / df['max'] * 100

df

Thank you!


Solution

  • Sample code:

    # Generate sample data:
    df = pd.DataFrame({
        'var': ['a']*10 + ['b']*10,
        'val': np.random.permutation(20)})
    
    # Step 1: Getting rank with groupby:
    df['rank'] = df.groupby('var')[['val']].apply(lambda g: g.rank(method='first'))
    
    # Step 2: Obtain the max value:
    df['max'] = df.groupby('var')[['rank']].transform(lambda g: g.max())
    
    # Step 3:Calculate percentile by n/N*100:
    df['percentile'] = df['rank'] / df['max'] * 100
    

    If your dataset is large and you want to speed up the process: Step 2 simply:

    subsample = df.groupby('var')[['rank']].max()
    df = pd.merge(df, subsample, on=['var'], how='left')
    

    For some reason, merge in Pandas works much faster than transform method

    3 level of multi-index follows the same code above