Search code examples
pythonpandasmergedistancesimilarity

How to merge strings that have certain number of substrings in common to produce some groups in a data frame in Python


I asked a question like this. But that is a simple one. Which has been resolved. how to merge strings that have substrings in common to produce some groups in a data frame in Python.

But here, I have an advanced version of the similar question:

I have a sample data:

a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})

What I want to do is merge some strings if they have sub strings in common. So, in this example, the strings 'b,c','a','a,c,d,e' should be merged together because they can be linked to each other. 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope I can have something like:

               group
'b,c',         0
'a',           0
'a,c,d,e',     0
'f,g,h,i',     1
'j,k,l',       2
'k,l,m'        2

So, I can have three groups and there is no common sub strings between any two groups.

Now, I am trying to build up a similarity data frame, in which 1 means two strings have sub strings in common. Here is my code:

commonWords=1

for i in np.arange(a.shape[0]):
    a.loc[:,a.loc[i,'ACTIVITY']]=0

for i in a.loc[:,'ACTIVITY']:
    il=i.split(',')
    for j in a.loc[:,'ACTIVITY']:
        jl=j.split(',')
        c=[x in il for x in jl]
        c1=[x for x in c if x==True]
        a.loc[(a.loc[:,'ACTIVITY']==i),j]=1 if len(c1)>=commonWords else 0
    
a

The result is:

    ACTIVITY    b,c     a   a,c,d,e     f,g,h,i     j,k,l   k,l,m
0   b,c          1      0       1           0       0       0
1   a            0      1       1           0       0       0
2   a,c,d,e      1      1       1           0       0       0
3   f,g,h,i      0      0       0           1       0       0
4   j,k,l        0      0       0           0       1       1
5   k,l,m        0      0       0           0       1       1

In this code, commonWords means how many sub strings I hope that two strings have in common. For example, if commonWords=2, then two strings will be merged together only if there are two, or more than two sub strings in them. When commonWords=2, the group should be:

               group
'b,c',         0
'a',           1
'a,c,d,e',     2
'f,g,h,i',     3
'j,k,l',       4
'k,l,m'        4

Solution

  • Use:

    a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
    
    
    from itertools import combinations, chain
    from collections import Counter
    
    #split values by , to lists
    splitted = a['ACTIVITY'].str.split(',')
    
    commonWords=2
    #create edges (can only connect two nodes)
    L2_nested = [list(combinations(l,commonWords)) for l in splitted]
    L2 = list(chain.from_iterable(L2_nested))
    

    #convert values to sets
    f1 = [set(k) for k, v in Counter(L2).items() if v >= commonWords]
    f2 = [set(x) for x in splitted]
    
    #create new columns for matched sets
    for val in f1:
        j = ','.join(val)
        a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
    print (a)
    
    #forward filling values of new columns and use factorize for groups
    new = pd.factorize(a[['ACTIVITY']].assign(ACTIVITY = a.index).ffill(axis=1).iloc[:, -1])[0]
    
    a = a[['ACTIVITY']].assign(group = new)
    print (a)
      ACTIVITY  group
    0      b,c      0
    1        a      1
    2  a,c,d,e      2
    3  f,g,h,i      3
    4    j,k,l      4
    5    k,l,m      4