Search code examples
pythonpandasdataframeassociations

Associative group in python


After preprocessing i have a final dataframe with columns 'timestamp', 'group', 'person1', 'person2'. I am trying to figure out how to code my requirement or want to know is it possible using python. What I am trying to extract is groups within each group. for example: in group G0, A is meeting with B, B meets with C, A meets with D. It means ABCD forms a group within the group. There can be multiple groups within each group (for example in group G1). How can I do this? what logic or code can I apply to extract this? I searched a lot, but it was not of any help..

The pic of dataframe sample and expected output is: enter image description here

sample data:

df = pd.DataFrame(
    {
        "timestamp": ['25-06-2020 09:29','25-06-2020 09:29','25-06-2020 09:31','25-06-2020 09:32','25-06-2020 09:33','25-06-2020 09:33','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 12:29','25-06-2020 12:29','25-06-2020 12:30','25-06-2020 12:30'],
        "group": ['G0','G0','G0','G0','G0','G0','G1','G1','G1','G1','G1','G2','G2','G2'],
        "person1": ['A','A','B','A','X','Z','A','B','L','X','Y','L','N','O'],
        "person2": ['B','B','C','D','Y','N','B','C','M','Y','Z','M','O','P']
    }
)

Solution

  • You can do it using the networkx library graph theory and connected components:

    import networkx as nx
    import pandas as pd
    
    df = pd.DataFrame(
        {
            "timestamp": ['25-06-2020 09:29','25-06-2020 09:29','25-06-2020 09:31','25-06-2020 09:32','25-06-2020 09:33','25-06-2020 09:33','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 11:17','25-06-2020 12:29','25-06-2020 12:29','25-06-2020 12:30','25-06-2020 12:30'],
            "group": ['G0','G0','G0','G0','G0','G0','G1','G1','G1','G1','G1','G2','G2','G2'],
            "person1": ['A','A','B','A','X','Z','A','B','L','X','Y','L','N','O'],
            "person2": ['B','B','C','D','Y','N','B','C','M','Y','Z','M','O','P']
        }
    )
    
    def f(x):
        G = nx.from_pandas_edgelist(x, 'person1', 'person2')
        l = x.apply(lambda n: ''.join(nx.node_connected_component(G, n['person1'])), axis=1)
        return l
    
    df['subgroup'] = df.groupby('group').apply(f).to_numpy()
    df
    

    Output:

               timestamp group person1 person2 subgroup
    0   25-06-2020 09:29    G0       A       B     DACB
    1   25-06-2020 09:29    G0       A       B     DACB
    2   25-06-2020 09:31    G0       B       C     DACB
    3   25-06-2020 09:32    G0       A       D     DACB
    4   25-06-2020 09:33    G0       X       Y       YX
    5   25-06-2020 09:33    G0       Z       N       NZ
    6   25-06-2020 11:17    G1       A       B      ACB
    7   25-06-2020 11:17    G1       B       C      ACB
    8   25-06-2020 11:17    G1       L       M       ML
    9   25-06-2020 11:17    G1       X       Y      ZYX
    10  25-06-2020 12:29    G1       Y       Z      ZYX
    11  25-06-2020 12:29    G2       L       M       ML
    12  25-06-2020 12:30    G2       N       O      ONP
    13  25-06-2020 12:30    G2       O       P      ONP
    

    Groupby subgroups:

    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df.groupby('subgroup')['timestamp'].agg(['min', 'max'])
    

    Output:

                             min                 max
    subgroup                                        
    ACB      2020-06-25 11:17:00 2020-06-25 11:17:00
    DACB     2020-06-25 09:29:00 2020-06-25 09:32:00
    ML       2020-06-25 11:17:00 2020-06-25 12:29:00
    NZ       2020-06-25 09:33:00 2020-06-25 09:33:00
    ONP      2020-06-25 12:30:00 2020-06-25 12:30:00
    YX       2020-06-25 09:33:00 2020-06-25 09:33:00
    ZYX      2020-06-25 11:17:00 2020-06-25 12:29:00