Search code examples
pandasdataframegroup-by

Find common values within groupby in pandas Dataframe based on two columns


I have following dataframe:

period    symptoms   recovery
1         4          2
1         5          2
1         6          2
2         3          1
2         5          2
2         8          4
2         12         6
3         4          2
3         5          2
3         6          3
3         8          5
4         5          2
4         8          4
4         12         6

I'm trying to find the common values of df['period'] groups (1, 2, 3, 4) based on value of two columns 'symptoms' and 'recovery'

Result should be :

     symptoms   recovery   period
     5          2          [1, 2, 3, 4]
     8          4          [2, 4]

where each same two columns values has the periods occurrence in a list or column.

I'm I approaching the problem in the wrong way ? Appreciate your help.

I tried to turn each period into dict and loop through to find values but didn't work for me. Also tried to use grouby().apply() but I'm not getting a meaningful data frame. Tried sorting values based on 3 columns but couldn't get the common ones between each period section. Last attempt :

df2 = df[['period', 'how_long', 'days_to_ex']].copy()
#s = df.groupby(["period", "symptoms", "recovery"]).size()
 s = df.groupby(["symptoms", "recovery"]).size()

Solution

  • You were almost there:

    from io import StringIO
    import pandas as pd
    
    # setup sample data
    data = StringIO("""
    period;symptoms;recovery
    1;4;2
    1;5;2
    1;6;2
    2;3;1
    2;5;2
    2;8;4
    2;12;6
    3;4;2
    3;5;2
    3;6;3
    3;8;5
    4;5;2
    4;8;4
    4;12;6
    """)
    
    df = pd.read_csv(data, sep=";")
    
    # collect unique periods
    df.groupby(['symptoms','recovery'])[['period']].agg(list).reset_index()
    

    This gives

       symptoms  recovery        period
    0         3         1           [2]
    1         4         2        [1, 3]
    2         5         2  [1, 2, 3, 4]
    3         6         2           [1]
    4         6         3           [3]
    5         8         4        [2, 4]
    6         8         5           [3]
    7        12         6        [2, 4]