Search code examples
pythonpandasdataframedata-manipulation

More efficient way of pandas dataframe manipulation: filtering and melting


I have a dataset that I need to parse and manipulate from long to wide. Each row represents a single person and there are multiple columns representing instances of a measure (uk-biobank formatted):

import pandas as pd
# initialize data of lists.
data = {'id': ['1', '2', '3', '4'],
        '3-0.0': [20, 21, 19, 18],
        '3-1.0': [10, 11, 29, 12],
        '3-2.0': [5, 6, 7, 8]}
# Create DataFrame
df = pd.DataFrame(data)
df.set_index('id')

3-0.0, 3-1.0, and 3-2.0 are three different measures of the same event for a given person. What I want is multiple rows for a given person and a column to indicate the event instance (0,1 or 2) and then a column for the associated value.

My inefficient approach is as follows and I know it can be way better. I am new to python so looking for ways of coding more efficiently:

# parsing out each instance 
i0 = df.filter(regex="\-0\.")
i1 = df.filter(regex="\-1\.")
i2 = df.filter(regex="\-2\.")

# set index as column and melt each df
i0.reset_index(inplace=True)
i0 = pd.melt(i0, id_vars = "index",  ignore_index = True).dropna().drop(columns=['variable']).assign(instance = '0')

i1.reset_index(inplace=True)
i1 = pd.melt(i1, id_vars = "index",  ignore_index = True).dropna().drop(columns=['variable']).assign(instance = '1')

i2.reset_index(inplace=True)
i2 = pd.melt(i2, id_vars = "index",  ignore_index = True).dropna().drop(columns=['variable']).assign(instance = '2')

# concatenate back together
fin = pd.concat([i0,i1,i2])

data = {'id': ['1', '2', '3', '4'],
        '3-0.0': [20, 21, 19, 18],
        '3-1.0': [10, 11, 29, 12],
        '3-2.0': [5, 6, 7, 8]}

# final dataset looks like this
id, measure, instance
1   20   0
1   10   1
1   5    2
2   21   0 
2   11   1
2   6   2
3   19   0
3   29   1
3   7   2
4   18   0 
4   12   1 
4   8   2

Bonus if you can incorporate the fact there are several measurement columns formatted like this 3-0.0','3-1.0', '3-2.0','4-0.0','4-1.0','4-2.0',...


Solution

  • Given:

      id  3-0.0  3-1.0  3-2.0  4-0.0  4-1.0  4-2.0
    0  1     20     10      5     10      5     20
    1  2     21     11      6     11      6     21
    2  3     19     29      7     29      7     19
    3  4     18     12      8     12      8     18
    

    Doing:

    unique_people = df.filter(regex='\d-').columns.str.split('-').str[0].unique()
    out = pd.wide_to_long(df, stubnames=unique_people, i='id', j='instance', sep='-', suffix='.*')
    out = out.rename(int, level=1)
    print(out.sort_index())
    

    Output:

                  3   4
    id instance
    1  0         20  10
       1         10   5
       2          5  20
    2  0         21  11
       1         11   6
       2          6  21
    3  0         19  29
       1         29   7
       2          7  19
    4  0         18  12
       1         12   8
       2          8  18