Search code examples
pythonpandaspivot-tablepandas-melt

Melt/unpivot a dataset with multiple groups of values


I'm trying to convert a dataframe in Python, and I am stuck because I don't know exactly how to phrase what I want to do (which makes searching for it difficult). It seems like I need a combination of unstack and pivot. I can explain it with an example though, hopefully. I have a dataframe with this shape:

userid GroupA_measure1 GroupA_measure2 GroupB_measure1 GroupB_measure2
001 65 70 45 50
002 96 89 12 8
003 12 14 38 40

and I want to convert this to this format:

userid measure groupA groupB
001 1 65 45
001 2 70 50
002 1 96 12
002 2 89 8
003 1 12 38
003 2 14 40

I can unstack the whole df using pd.melt(df, id_vars =['userid']) which put all values in separate rows, but I want to keep separate columns for the values of GroupA and GroupB.

Any help would be much appreciated.


Solution

  • Use wide_to_long with extract numbers from measure column by Series.str.extract:

    df1 = pd.wide_to_long(df, 
                          stubnames=['GroupA','GroupB'], 
                          i='userid', 
                          j='measure', sep='_', suffix=r'\w+').reset_index()
    
    df1['measure'] = df1['measure'].str.extract('(\d+)').astype(int)
    

    Or convert not _ columns first, split all columns with _ and reshape by DataFrame.stack, last also extract digits:

    df1 = df.set_index('userid')
    df1.columns = df1.columns.str.split('_', expand=True)
    df1 = df1.rename_axis((None, 'measure'), axis=1).stack().reset_index()
    df1['measure'] = df1['measure'].str.extract('(\d+)').astype(int)
    print (df1)
      userid  measure GroupA GroupB
    0    001        1     65     45
    1    002        1     96     12
    2    003        1     12     38
    3    001        2     70     50
    4    002        2     89      8
    5    003        2     14     40
    

    Last if necessary sorting by DataFrame.sort_values:

    df1 = df1.sort_values('userid', ignore_index=True)
    print (df1)
      userid  measure GroupA GroupB
    0    001        1     65     45
    1    001        2     70     50
    2    002        1     96     12
    3    002        2     89      8
    4    003        1     12     38
    5    003        2     14     40