python pandas stack max

Python pandas - wide data - identify earliest and maximum columns in time series


I am working with a data frame in wide format. Each book has sales figures per quarter, but some quarters have null values (0 in the sample below) because the book had not yet been released in that quarter.

import pandas as pd

data = {'Book Title': ['A Court of Thorns and Roses', 'Where the Crawdads Sing', 'Bad Blood', 'Atomic Habits'],
        'Metric': ['Book Sales', 'Book Sales', 'Book Sales', 'Book Sales'],
        'Q1 2022': [100000, 0, 0, 0],
        'Q2 2022': [50000, 75000, 0, 35000],
        'Q3 2022': [25000, 150000, 20000, 45000],
        'Q4 2022': [25000, 20000, 10000, 65000]}

df1 = pd.DataFrame(data)

What I would like to do is create one field that identifies the ID of the first available quarter ("First Quarter ID") and another that identifies the ID of the quarter with maximum sales ("Max Quarter ID"). Then I would like to show two more fields with the sales in the first available quarter and in the quarter with maximum sales.

Any tips on how to go about this? Thank you!


Solution

  • Edit: an updated approach that makes better use of groupby after melting

    # melt the table to long form (one row per book and quarter)
    long_df1 = df1.melt(
        id_vars=['Book Title', 'Metric'],
        value_name='Sales',
        var_name='Quarter',
    )

    # remove rows that have 0 sales (could be dropna if null values are used instead)
    long_df1 = long_df1[long_df1['Sales'].gt(0)]

    # group by book title and find the first/max quarter and sales
    gb = long_df1.groupby('Book Title')

    first_df = gb[['Quarter', 'Sales']].first()
    max_df = long_df1.loc[gb['Sales'].idxmax(), ['Book Title', 'Quarter', 'Sales']].set_index('Book Title')

    # concatenate the first/max dfs side by side
    out_df = pd.concat(
        (first_df.add_prefix('First '), max_df.add_prefix('Max ')),
        axis=1
    ).reset_index()
    

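    Output (out_df for the sample data above)

    Book Title                   First Quarter  First Sales  Max Quarter  Max Sales
    A Court of Thorns and Roses  Q1 2022             100000  Q1 2022         100000
    Atomic Habits                Q2 2022              35000  Q4 2022          65000
    Bad Blood                    Q3 2022              20000  Q3 2022          20000
    Where the Crawdads Sing      Q2 2022              75000  Q3 2022         150000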
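
For comparison, here is a minimal sketch of getting the same four fields without melting, by working on the quarter columns directly. It is not part of the approach above. It assumes the quarter columns are already in chronological order and that unreleased quarters are stored as 0, as in the sample data. The names `quarter_cols`, `sales`, `released` and `wide_out` are made up for illustration.

```python
import pandas as pd

data = {'Book Title': ['A Court of Thorns and Roses', 'Where the Crawdads Sing', 'Bad Blood', 'Atomic Habits'],
        'Metric': ['Book Sales', 'Book Sales', 'Book Sales', 'Book Sales'],
        'Q1 2022': [100000, 0, 0, 0],
        'Q2 2022': [50000, 75000, 0, 35000],
        'Q3 2022': [25000, 150000, 20000, 45000],
        'Q4 2022': [25000, 20000, 10000, 65000]}
df1 = pd.DataFrame(data)

# Assumption: these columns are listed in chronological order.
quarter_cols = ['Q1 2022', 'Q2 2022', 'Q3 2022', 'Q4 2022']
sales = df1[quarter_cols]

# True where the book already has sales; idxmax(axis=1) on a boolean
# frame returns the label of the first True column in each row.
released = sales.gt(0)
first_quarter = released.idxmax(axis=1)

# Label of the column holding the row maximum (first one wins on ties).
max_quarter = sales.idxmax(axis=1)

wide_out = pd.DataFrame({
    'Book Title': df1['Book Title'],
    'First Quarter': first_quarter,
    # Look up each row's sales value in its first released quarter.
    'First Sales': [sales.at[i, q] for i, q in first_quarter.items()],
    'Max Quarter': max_quarter,
    'Max Sales': sales.max(axis=1),
})
print(wide_out)
```

This keeps the rows in the original order of `df1` rather than sorted by title. One caveat: a book with 0 sales in every quarter would still get the first column as its "First Quarter", so such rows would need to be filtered out or masked separately.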