I'm trying to convert a dataframe in Python, and I am stuck because I don't know exactly how to phrase what I want to do (which makes searching for it difficult). It seems like I need a combination of unstack and pivot. I can explain it with an example though, hopefully. I have a dataframe with this shape:
userid | GroupA_measure1 | GroupA_measure2 | GroupB_measure1 | GroupB_measure2 |
---|---|---|---|---|
001 | 65 | 70 | 45 | 50 |
002 | 96 | 89 | 12 | 8 |
003 | 12 | 14 | 38 | 40 |
and I want to convert this to this format:
userid | measure | groupA | groupB |
---|---|---|---|
001 | 1 | 65 | 45 |
001 | 2 | 70 | 50 |
002 | 1 | 96 | 12 |
002 | 2 | 89 | 8 |
003 | 1 | 12 | 38 |
003 | 2 | 14 | 40 |
I can unstack the whole df using pd.melt(df, id_vars =['userid'])
which put all values in separate rows, but I want to keep separate columns for the values of GroupA and GroupB.
Any help would be much appreciated.
Use wide_to_long
with extract numbers from measure
column by Series.str.extract
:
df1 = pd.wide_to_long(df,
stubnames=['GroupA','GroupB'],
i='userid',
j='measure', sep='_', suffix=r'\w+').reset_index()
df1['measure'] = df1['measure'].str.extract('(\d+)').astype(int)
Or convert not _
columns first, split all columns with _
and reshape by DataFrame.stack
, last also extract digits:
df1 = df.set_index('userid')
df1.columns = df1.columns.str.split('_', expand=True)
df1 = df1.rename_axis((None, 'measure'), axis=1).stack().reset_index()
df1['measure'] = df1['measure'].str.extract('(\d+)').astype(int)
print (df1)
userid measure GroupA GroupB
0 001 1 65 45
1 002 1 96 12
2 003 1 12 38
3 001 2 70 50
4 002 2 89 8
5 003 2 14 40
Last if necessary sorting by DataFrame.sort_values
:
df1 = df1.sort_values('userid', ignore_index=True)
print (df1)
userid measure GroupA GroupB
0 001 1 65 45
1 001 2 70 50
2 002 1 96 12
3 002 2 89 8
4 003 1 12 38
5 003 2 14 40