Search code examples
pythonpandasdataframetranspose

transpose pandas df based on multiple header rows


I have the following df from a vendor:

Unnamed: 0  Unnamed: 1  Unnamed: 2  agg metrics 10/20/22    10/20/22    10/21/22    10/21/22
title     content title season      episode     start       hours       start       hours
book      blue          1           3           2           2           5           2
movie     orange        2           4           11          4           7           4

I need the output like this:

title   content title   season  episode date    start   hours
book    blue            1       3       10/20/22    2   2
book    blue            1       3       10/21/22    5   2
movie   orange          2       4       10/20/22    11  4
movie   orange          2       4       10/21/22    7   4
df = pd.read_csv('file')
df = df.drop(labels=0, axis=0)
df1 = df.melt(['Unnamed: 0','Unnamed: 1', 'Unnamed: 2', 'agg metrics'],var_name='Date', value_name='Value')

but this doesn't return the proper output. apologies for not knowing how to represent this properly. hopefully my IP/OP helps.

Essentially, i'm having trouble transposing multiple headers.

Thanks for your help!


Solution

  • You could do this and this is what QuangHoang's thought too I believe:

    # Read csv with top two rows as headers resulting in multiindex, from your code I figure 
    # you are not doing that.
    df = pd.read_csv(
        StringIO(
            """
    Unnamed: 0,Unnamed: 1,Unnamed: 2,agg metrics,10/20/22,10/20/22,10/21/22,10/21/22
    title,content title,season,episode,start,hours,start,hours
    book,blue,1,3,2,2,5,2
    movie,orange,2,4,11,4,7,4
            """
        ),
        header=[0, 1],
    )
    
    # Then filter columns that are date like and stack at level 0 and reset_index
    t = df.filter(regex="\d+/\d+/\d+")
    t1 = t.stack(0).rename_axis(["", "date"]).reset_index(1)
    
    # Then get other columns and reindex to the index of the intermediate output you got above.
    t2 = df[df.columns.difference(t.columns)].droplevel(0, axis=1).reindex(t1.index)
    
    # Then concat both along axis 1
    out = pd.concat([t2, t1], axis=1)
    
    print(out)
    
       title content title  season  episode      date  hours  start
                                                                   
    0   book          blue       1        3  10/20/22      2      2
    0   book          blue       1        3  10/21/22      2      5
    1  movie        orange       2        4  10/20/22      4     11
    1  movie        orange       2        4  10/21/22      4      7