Tags: python, pandas, multithreading, twitter, tree

How to get threads/conversations from reply ids (Python)?


I'm relatively new to Python and I'm trying to reconstruct conversations/threads from a dataframe that contains a list of IDs.

I currently have a pandas dataframe of tweets/Reddit posts in roughly the following format:

id   text       parent_id  replies
id1  blah blah  _ post _   id2, id3, id4, id5, id6, id7
id2  blah blah  id1        id4, id5, id6, id7
id3  blah blah  id1
id4  blah blah  id2        id6, id7
id5  blah blah  id2
id6  blah blah  id4        id7
id7  blah blah  id6

My goal is to separate the data into threads/conversations based on the ids. This would mean, from the above example, getting the following sequences as the output:

[id1, id2, id4, id6],

[id1, id2, id4, id7],

[id1, id2, id5], and

[id1, id3].

Having these lists would then enable me to look at threads in their entirety. Currently my code is very convoluted and looks something like this:

out_list = []
for i, row in df.iterrows():
    id_ = row["id"]
    # create our output file 
    sequence = [id_]
    replies = list(row['replies'])
    # creates a new dataframe from the replies to the topline comment in question
    reply_df= df.loc[df['id'].isin(replies)]
    reply_df = reply_df[reply_df.Parent_id2 == id_]
    #check if ends at topline
    if reply_df.empty == False:
        
        def turn_recursion(df, reply_df):
            for j, row_ in reply_df.iterrows():
                replies_2 = reply_df.loc[j, 'replies']
                id_2 = row_["id"]

                reply_df2 =  df.loc[df['id'].isin(replies_2)]
                reply_df2 = reply_df2[reply_df2.Parent_id2 == id_2]

                nonlocal sequence
                nonlocal out_list
                            
                if reply_df2.empty == False:
                    sequence.append(id_2)
                    return(turn_recursion(df, reply_df2))
                
                else:
                    sequence.append(id_2)
                    out_list.append(sequence)
        
        turn_recursion(test2, reply_df)
    else:
        out_list.append(sequence)
    

This is currently giving me semi-accurate results but instead of getting: [[id1, id2, id4, id6],[id1, id2, id4, id7]], I get: [id1, id2, id4, id6, id7].

I realise I'm probably being a bit dim and that there is a simple solution, but for the life of me, I can't seem to figure out a way of doing this so that it works properly and for any thread length.

Thank you in advance for any suggestions. :)


Solution

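  • Use networkx: build a directed graph with one edge from each parent to its reply, then enumerate the paths from every post to each leaf it can reach: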

    import pandas as pd
    import networkx as nx
    from collections import defaultdict
    
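    # Paths found from each post, keyed by the starting post's id
    # (df is the dataframe from the question)
    data = defaultdict(list)

    # Build a directed graph from the dataframe: one edge per row, parent -> reply
    G = nx.from_pandas_edgelist(df, source='parent_id', target='id',
                                create_using=nx.DiGraph)

    # Find leaves: posts with no replies (id3, id5, id7)
    leaves = [node for node, degree in G.out_degree() if degree == 0]

    # Enumerate every path from each post to each leaf
    for node in df['id']:
        for leaf in leaves:
            for path in nx.all_simple_paths(G, node, leaf):
                data[node].append(path)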
    

    Output:

    >>> data
    defaultdict(list,
                {'id1': [['id1', 'id3'],
                  ['id1', 'id2', 'id5'],
                  ['id1', 'id2', 'id4', 'id6', 'id7']],
                 'id2': [['id2', 'id5'], ['id2', 'id4', 'id6', 'id7']],
                 'id4': [['id4', 'id6', 'id7']],
                 'id6': [['id6', 'id7']]})
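
If only whole threads are needed (each path starting at a top-level post rather than at every intermediate reply), one option is to restrict the starting nodes to the roots. The following is a minimal sketch, not a definitive implementation: the toy dataframe just mirrors the question's example, the names `roots` and `threads` are made up for illustration, and a "root" is assumed to be any row whose parent_id does not appear in the id column. A post without replies is handled explicitly, because how `all_simple_paths` treats a source that is also a target may differ between NetworkX releases.

```python
import networkx as nx
import pandas as pd

# Toy frame mirroring the question's example; only 'id' and 'parent_id' are used.
df = pd.DataFrame({
    "id": ["id1", "id2", "id3", "id4", "id5", "id6", "id7"],
    "parent_id": ["_ post _", "id1", "id1", "id2", "id2", "id4", "id6"],
})

# One edge per row, pointing from the parent to the reply.
G = nx.from_pandas_edgelist(df, source="parent_id", target="id",
                            create_using=nx.DiGraph)

# Assumption: a top-level post is a row whose parent is not itself a row.
known_ids = set(df["id"])
roots = df.loc[~df["parent_id"].isin(known_ids), "id"].tolist()

# Leaves are nodes that nobody replied to.
leaves = {node for node, degree in G.out_degree() if degree == 0}

threads = []
for root in roots:
    if root in leaves:
        # A post without replies is a one-element thread.
        threads.append([root])
        continue
    # 'target' may be an iterable of nodes, so all leaves are searched at once.
    threads.extend(nx.all_simple_paths(G, root, leaves))

print(threads)
```

With the example data this should give three threads, one per leaf, each starting at id1.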
    

    If you want to merge the dictionary into your dataframe (note that this overwrites the existing replies column):

    df['replies'] = df['id'].map(data)
    print(df)
    
    # Output:
        id       text parent_id                                            replies
    0  id1  blah blah  _ post _  [[id1, id3], [id1, id2, id5], [id1, id2, id4, ...
    1  id2  blah blah       id1                 [[id2, id5], [id2, id4, id6, id7]]
    2  id3  blah blah       id1                                                 []
    3  id4  blah blah       id2                                  [[id4, id6, id7]]
    4  id5  blah blah       id2                                                 []
    5  id6  blah blah       id4                                       [[id6, id7]]
    6  id7  blah blah       id6                                                 []
    

    Now you can explode the dataframe so that each path gets its own row (posts without replies end up with NaN):

    df = df.explode('replies')
    print(df)
    
    # Output:
        id       text parent_id                    replies
    0  id1  blah blah  _ post _                 [id1, id3]
    0  id1  blah blah  _ post _            [id1, id2, id5]
    0  id1  blah blah  _ post _  [id1, id2, id4, id6, id7]
    1  id2  blah blah       id1                 [id2, id5]
    1  id2  blah blah       id1       [id2, id4, id6, id7]
    2  id3  blah blah       id1                        NaN
    3  id4  blah blah       id2            [id4, id6, id7]
    4  id5  blah blah       id2                        NaN
    5  id6  blah blah       id4                 [id6, id7]
    6  id7  blah blah       id6                        NaN
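
If you would rather not depend on networkx, the same root-to-leaf paths can be collected with a plain depth-first walk over a parent-to-replies mapping. This is only a sketch under a few assumptions: `build_threads` is a hypothetical helper name, ids are assumed to be unique, and any post whose parent is not in the data is treated as top-level. The key difference from the code in the question is that every branch gets its own copy of the path (`path + [reply]`) instead of all branches appending to one shared list, which is what merges separate branches into a single sequence.

```python
from collections import defaultdict


def build_threads(rows):
    """Return every root-to-leaf path for an iterable of (post_id, parent_id) pairs."""
    rows = list(rows)
    known_ids = {post_id for post_id, _ in rows}

    children = defaultdict(list)  # parent id -> ids of its direct replies
    roots = []
    for post_id, parent_id in rows:
        if parent_id in known_ids:
            children[parent_id].append(post_id)
        else:
            # Assumption: a parent that is not in the data marks a top-level post.
            roots.append(post_id)

    threads = []
    # Each stack entry is a separate list, so branches never share state.
    stack = [[root] for root in reversed(roots)]
    while stack:
        path = stack.pop()
        replies = children.get(path[-1], [])
        if not replies:
            threads.append(path)  # reached a post nobody replied to
            continue
        for reply in reversed(replies):
            stack.append(path + [reply])  # new list for each branch
    return threads


rows = [
    ("id1", "_ post _"), ("id2", "id1"), ("id3", "id1"), ("id4", "id2"),
    ("id5", "id2"), ("id6", "id4"), ("id7", "id6"),
]
threads = build_threads(rows)
print(threads)
# [['id1', 'id2', 'id4', 'id6', 'id7'], ['id1', 'id2', 'id5'], ['id1', 'id3']]
```

With a dataframe like the one in the question, the pairs could be passed as `build_threads(zip(df['id'], df['parent_id']))`. The walk is iterative rather than recursive, so very long threads do not run into Python's recursion limit.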