I have a dataset that I need to parse and manipulate from long to wide. Each row represents a single person and there are multiple columns representing instances of a measure (uk-biobank formatted):
import pandas as pd
# initialize data of lists.
data = {'id': ['1', '2', '3', '4'],
'3-0.0': [20, 21, 19, 18],
'3-1.0': [10, 11, 29, 12],
'3-2.0': [5, 6, 7, 8]}
# Create DataFrame
df = pd.DataFrame(data)
df.set_index('id')
3-0.0, 3-1.0, and 3-2.0 are three different measures of the same event for a given person. What I want is multiple rows for a given person and a column to indicate the event instance (0,1 or 2) and then a column for the associated value.
My inefficient approach is as follows and I know it can be way better. I am new to python so looking for ways of coding more efficiently:
# parsing out each instance
i0 = df.filter(regex="\-0\.")
i1 = df.filter(regex="\-1\.")
i2 = df.filter(regex="\-2\.")
# set index as column and melt each df
i0.reset_index(inplace=True)
i0 = pd.melt(i0, id_vars = "index", ignore_index = True).dropna().drop(columns=['variable']).assign(instance = '0')
i1.reset_index(inplace=True)
i1 = pd.melt(i1, id_vars = "index", ignore_index = True).dropna().drop(columns=['variable']).assign(instance = '1')
i2.reset_index(inplace=True)
i2 = pd.melt(i2, id_vars = "index", ignore_index = True).dropna().drop(columns=['variable']).assign(instance = '2')
# concatenate back together
fin = pd.concat([i0,i1,i2])
data = {'id': ['1', '2', '3', '4'],
'3-0.0': [20, 21, 19, 18],
'3-1.0': [10, 11, 29, 12],
'3-2.0': [5, 6, 7, 8]}
# final dataset looks like this
id, measure, instance
1 20 0
1 10 1
1 5 2
2 21 0
2 11 1
2 6 2
3 19 0
3 29 1
3 7 2
4 18 0
4 12 1
4 8 2
Bonus if you can incorporate the fact there are several measurement columns formatted like this 3-0.0','3-1.0', '3-2.0','4-0.0','4-1.0','4-2.0',...
Given:
id 3-0.0 3-1.0 3-2.0 4-0.0 4-1.0 4-2.0
0 1 20 10 5 10 5 20
1 2 21 11 6 11 6 21
2 3 19 29 7 29 7 19
3 4 18 12 8 12 8 18
Doing:
unique_people = df.filter(regex='\d-').columns.str.split('-').str[0].unique()
out = pd.wide_to_long(df, stubnames=unique_people, i='id', j='instance', sep='-', suffix='.*')
out = out.rename(int, level=1)
print(out.sort_index())
Output:
3 4
id instance
1 0 20 10
1 10 5
2 5 20
2 0 21 11
1 11 6
2 6 21
3 0 19 29
1 29 7
2 7 19
4 0 18 12
1 12 8
2 8 18