pandas pattern-matching pandas-groupby string-concatenation

Groupby and A)Concate matching strings(and or substring) B)Sum the values

I have df:

row_numbers    ID     code        amount
   1           med    a           1
   2           med    a, b        1
   3           med    b, c        1
   4           med    c           1
   5           med    d           10
   6           cad    a, b        1 
   7           cad    a, b, d     0
   8           cad    e           2

Pasted the above df:

I wanted to do groupby on column-ID and A)Combine the strings if substring/string matches(on column-code) B)sum the values of column-amount.

Expected results:

Explanation:

column-row_numbers has no role here in df. I just took here to explain the output.

A)grouping on column-ID and looking at column-code, row1 string i.e., a is matching with row2's sub string. row2's substring i.e., b is matching with row3's substring. row3's substring i.e., c is matching with string of row4 and Hence combining row1, row2, row3 and row4. row5 string is not matching with any of string/substring so it is separate group. B) Based on this adding row1, row2, row3 and row4 values. and row5 as separate group.

Thanks in advance for your time and thoughts:).

EDIT - 1

Pasting the real time.

Expected output:

Explanation:

have to do on grouping column-id and concatenating the values of column-code and summing the values of column-units and vol. It is color coded the matching(to be contacted) values of column-code. row1 has link with row5 and row9. row9 has inturn link with row3. Hence combining row1, row5, row9, row3. Simliarly row2 and row7 and so on. row8 has no link with any of the values with-in group-med(column-id) and hence will be as separate row.

Thanks!.

Solution

Update: From your latest sample data, this is not a simple data munging. There is no vectorized solution. It relates to graph theory. You need to find connected components within each group of ID and do the calculation on each connected components.

Consider each string as a node of graph. If 2 strings are overlapped, they are connected nodes. For every node, you need to traverse all paths connected to it. Do calculation on all connected nodes through these paths. This traversal can be done by using Depth-first search logic.

However, before processing depth-first search, you need to preprocess strings to set to check overlapping.

Method 1: Recursive

Do the following:

Define a function dfs to recursively run depth-first search
Define a function gfunc to use with groupby apply. This function will traverse elements of each group of ID and return the desired dataframe.
Get rid of any blank spaces in each string and split and convert them to sets using replace, split and map and assign it to a new column new_code to df
Call groupby on ID and apply using function gfunc. Call droplevel and reset_index to get the desired output

Codes as follows:

import numpy as np

def dfs(node, index, glist, g_checked_rows):        
    ret_arr = df.loc[index, ['code', 'amount', 'volume']].values
    g_checked_rows.add(index)   
    for j, s in glist:
        if j not in g_checked_rows and not node.isdisjoint(s):      
            t_arr = dfs(s, j, glist, g_checked_rows)
            ret_arr[0]  += ', ' + t_arr[0]
            ret_arr[1:] += t_arr[1:]

    return ret_arr

def gfunc(x):   
    checked_rows = set()    
    final = []  
    code_list = list(x.new_code.items())
    for i, row in code_list:
        if i not in checked_rows:       
            final.append(dfs(row, i, code_list, checked_rows))

    return pd.DataFrame(final, columns=['code','units','vol'])

df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc).droplevel(1).reset_index()

Out[16]:
    ID                                        code  units  vol
0  med  CO-96, CO-B15, CO-B15, CO-96, OA-18, OA-18      4    4
1  med                        CO-16, CO-B20, CO-16      3    3
2  med                       CO-252, CO-252, CO-45      3    3
3  med                                      OA-258      1    1
4  cad                        PR-96, PR-96, CO-243      4    4
5  cad                        PR-87, OA-258, PR-87      3    3

Note: I assume your pandas version is 0.24+. If it is < 0.24, the last step you need to use reset_index and drop instead of droplevel and reset_index as follows

df_final = df.groupby('ID', sort=False).apply(gfunc).reset_index().drop('level_1', 1)

Method 2: Iterative

To make this complete, I implement a version of gfunc using iterative process instead of recursive. Iterative process requires only one function. However, the function is more complicated. The logic of iterative process as follows

push the first node to deque. Check if deque not empty, pop the top node out.
if a node is not marked checked, process it and mark it as checked
find all its neighbors in the reverse order of list of nodes that haven't been marked, push them to the deque
Check if deque not empty, pop out a node from the top deque and process from step 2

Code as follows:

def gfunc_iter(x):  
    checked_rows = set()    
    final = []  
    q = deque() 
    code_list = list(x.new_code.items())
    code_list_rev = code_list[::-1]
    for i, row in code_list:
        if i not in checked_rows:           
            q.append((i, row))
            ret_arr = np.array(['', 0, 0], dtype='O')
            while (q):
                n, node = q.pop()
                if n in  checked_rows:
                    continue                
                ret_arr_child = df.loc[n, ['code', 'amount', 'volume']].values
                if not ret_arr[0]:
                    ret_arr = ret_arr_child.copy()
                else:
                    ret_arr[0]  += ', ' + ret_arr_child[0]
                    ret_arr[1:] += ret_arr_child[1:]                    
                checked_rows.add(n)         
                #push to `q` all neighbors in the reversed list of nodes    
                for j, s in code_list_rev: 
                    if j not in checked_rows and not node.isdisjoint(s):
                        q.append((j, s))
            final.append(ret_arr)

    return pd.DataFrame(final, columns=['code','units','vol'])

df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc_iter).droplevel(1).reset_index()