python pandas merge distance similarity

How to merge strings that have substrings in common to produce some groups in a data frame in Python


I have some sample data:

import pandas as pd

a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})

What I want to do is merge strings that have substrings (comma-separated items) in common. In this example, 'b,c', 'a', and 'a,c,d,e' should be merged because they can be linked to each other: 'b,c' shares 'c' with 'a,c,d,e', which in turn shares 'a' with 'a'. Likewise, 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope to get something like:

ACTIVITY     group
'b,c'        0
'a'          0
'a,c,d,e'    0
'f,g,h,i'    1
'j,k,l'      2
'k,l,m'      2

So I end up with three groups, and no two groups have a substring in common.

Now I am trying to build a similarity data frame in which 1 means that two strings have at least one substring in common. Here is my code:

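import numpy as np

commonWords=1  # minimum number of shared items for two strings to count as similar

# add one zero-filled column per ACTIVITY string
for i in np.arange(a.shape[0]):
    a.loc[:,a.loc[i,'ACTIVITY']]=0

# set the cell to 1 when row string i and column string j share enough items
for i in a.loc[:,'ACTIVITY']:
    il=i.split(',')
    for j in a.loc[:,'ACTIVITY']:
        jl=j.split(',')
        c=[x in il for x in jl]
        c1=[x for x in c if x==True]
        a.loc[(a.loc[:,'ACTIVITY']==i),j]=1 if len(c1)>=commonWords else 0

a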

The result is:

    ACTIVITY    b,c     a   a,c,d,e     f,g,h,i     j,k,l   k,l,m
0   b,c          1      0       1           0       0       0
1   a            0      1       1           0       0       0
2   a,c,d,e      1      1       1           0       0       0
3   f,g,h,i      0      0       0           1       0       0
4   j,k,l        0      0       0           0       1       1
5   k,l,m        0      0       0           0       1       1

From here you can see that wherever there is a 1, the corresponding row and column should be merged into the same group.


Solution

  • Use networkx with connected_components:

    import pandas as pd
    import networkx as nx
    from itertools import combinations, chain
    
    a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
    
    # split each value on ',' into a list of items
    splitted = a['ACTIVITY'].str.split(',')
    
    # create edges: every pair of items in the same row (an edge connects exactly two nodes)
    L2_nested = [list(combinations(l,2)) for l in splitted]
    L2 = list(chain.from_iterable(L2_nested))
    print (L2)
    # [('b', 'c'), ('a', 'c'), ('a', 'd'), ('a', 'e'), ('c', 'd'), 
    #  ('c', 'e'), ('d', 'e'), ('f', 'g'), ('f', 'h'), ('f', 'i'), 
    #  ('g', 'h'), ('g', 'i'), ('h', 'i'), ('j', 'k'), ('j', 'l'), 
    #  ('k', 'l'), ('k', 'l'), ('k', 'm'), ('l', 'm')]
    

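    # create the graph from the edge list
    G=nx.Graph()
    G.add_edges_from(L2)
    # also add every item as a node: a row with a single item yields no edges,
    # so an item that appears nowhere else would otherwise be missing from the graph
    G.add_nodes_from(chain.from_iterable(splitted))
    connected_comp = nx.connected_components(G)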
    
    # map each node (item) to the id of its connected component
    node2id = {x: cid for cid, c in enumerate(connected_comp) for x in c}
    
    # assign each row the group of its first item
    # (all items in a row are connected, so they share one component)
    a['group'] = [node2id.get(x[0]) for x in splitted]
    print (a)
    #   ACTIVITY  group
    # 0      b,c      0
    # 1        a      0
    # 2  a,c,d,e      0
    # 3  f,g,h,i      1
    # 4    j,k,l      2
    # 5    k,l,m      2
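
If adding networkx as a dependency is not an option, the same grouping can be sketched with a small union-find (disjoint-set) structure using only the standard library. This is an alternative sketch, not the method used in the answer above. The names `group_activities`, `find`, and `union` are made up for illustration, and it assumes group ids should be numbered in order of first appearance.

```python
def group_activities(activities):
    """Return one group id per string; strings that share any item get the same id."""
    parent = {}  # union-find forest over individual items

    def find(x):
        # locate the root of x, shortening the path along the way
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        # merge the sets containing x and y
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    rows = [s.split(',') for s in activities]
    for items in rows:
        # linking every item to the row's first item is enough to connect the whole row
        for item in items:
            union(items[0], item)

    root2id = {}  # component root -> consecutive group id
    return [root2id.setdefault(find(items[0]), len(root2id)) for items in rows]


activities = ['b,c', 'a', 'a,c,d,e', 'f,g,h,i', 'j,k,l', 'k,l,m']
print(group_activities(activities))  # [0, 0, 0, 1, 2, 2]
```

With the data frame from the question, the result can be assigned back with a['group'] = group_activities(a['ACTIVITY']). Linking each item only to the first item of its row creates far fewer links than every pairwise combination, which may matter for rows with many items.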