Search code examples
pythonpandasdataframepandas-groupbyseries

How to transpose values from top few rows in python dataframe into new columns


I am trying to select the values from the top 3 records of each group in a python sorted dataframe and put them into new columns. I have a function that is processing each group but I am having difficulties finding the right method to extract, rename the series, then combine the result as a single series to return.

Below is a simplified example of an input dataframe (df_in) and the expected output (df_out):

import pandas as pd
data_in = { 'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
        'Price':  [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0], 
        'Qty': [15 , 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame (data_in, columns = ['Product', 'Price', 'Qty'])

dataframe input and transformed output with best 3 records values in new columns

I am reproducing below 2 examples of the functions I've tested and trying to get a more efficient option that works, especially if I have to process many more columns and records. Function best3_prices_v1 works but have to explicitly specify each column or variable, and is especially an issue as I have to add more columns.

def best3_prices_v1(x):     
   d = {}

# get best 3 records if records available, else set volumes as zeroes   
best_price_lv1 = x.iloc[0].copy()

rec_with_zeroes = best_price_lv1.copy()
rec_with_zeroes['Price'] = 0
rec_with_zeroes['Qty'] = 0

recs = len(x) # number of records
if (recs == 1): 
    # 2nd and 3rd records not available
    best_price_lv2 = rec_with_zeroes.copy()
    best_price_lv3 = rec_with_zeroes.copy()
elif (recs == 2):        
    best_price_lv2 = x.iloc[1]
    # 3rd record not available
    best_price_lv3 = rec_with_zeroes.copy()
else:
    best_price_lv2 = x.iloc[1]
    best_price_lv3 = x.iloc[2]    

# 1st best
d['Price_1'] = best_price_lv1['Price'] 
d['Qty_1'] = best_price_lv1['Qty'] 

# 2nd best
d['Price_2'] = best_price_lv2['Price'] 
d['Qty_2'] = best_price_lv2['Qty'] 

# 3rd best
d['Price_3'] = best_price_lv3['Price'] 
d['Qty_3'] = best_price_lv3['Qty'] 

# return combined results as a series
return pd.Series(d, index=['Price_1', 'Qty_1', 'Price_2', 'Qty_2', 'Price_3', 'Qty_3'])

Codes to call function:

# sort dataframe by Product and Price
df_in.sort_values(by=['Product', 'Price'], ascending=True, inplace=True)
# get best 3 prices and qty as new columns
df_out = df_in.groupby(['Product']).apply(best3_prices_v1).reset_index()

Second attempt to improve/reduce codes and explicit names for each variable ... not complete and not working.

def best3_prices_v2(x):     
d = {}

# get best 3 records if records available, else set volumes as zeroes   
best_price_lv1 = x.iloc[0].copy()

rec_with_zeroes = best_price_lv1.copy()
rec_with_zeroes['Price'] = 0
rec_with_zeroes['Qty'] = 0

recs = len(x) # number of records
if (recs == 1): 
    # 2nd and 3rd records not available
    best_price_lv2 = rec_with_zeroes.copy()
    best_price_lv3 = rec_with_zeroes.copy()
elif (recs == 2):        
    best_price_lv2 = x.iloc[1]
    # 3rd record not available
    best_price_lv3 = rec_with_zeroes.copy()
else:
    best_price_lv2 = x.iloc[1]
    best_price_lv3 = x.iloc[2]   


stats_columns = ['Price', 'Qty']

 # get records values for best 3 prices
d_lv1 = best_price_lv1[stats_columns]
d_lv2 = best_price_lv2[stats_columns] 
d_lv3 = best_price_lv3[stats_columns] 

# How to rename (keys?) or combine values to return?
lv1_stats_columns = [c + '_1' for c in stats_columns]
lv2_stats_columns = [c + '_2' for c in stats_columns]
lv3_stats_columns = [c + '_3' for c in stats_columns]
    
# return combined results as a series
return pd.Series(d, index=lv1_stats_columns + lv2_stats_columns + lv3_stats_columns)

Solution

  • Let's unstack():

    df_in=(df_in.set_index([df_in.groupby('Product').cumcount().add(1),'Product'])
                 .unstack(0,fill_value=0))
    df_in.columns=[f"{x}_{y}" for x,y in df_in]
    df_in=df_in.reset_index()
    

    OR via pivot()

    df_in=(df_in.assign(key=df_in.groupby('Product').cumcount().add(1))
          .pivot('Product','key',['Price','Qty'])
          .fillna(0,downcast='infer'))
    df_in.columns=[f"{x}_{y}" for x,y in df_in]
    df_in=df_in.reset_index()