Search code examples
pandaspattern-matchingpandas-groupbystring-concatenation

Groupby and A)Concate matching strings(and or substring) B)Sum the values


I have df:

row_numbers    ID     code        amount
   1           med    a           1
   2           med    a, b        1
   3           med    b, c        1
   4           med    c           1
   5           med    d           10
   6           cad    a, b        1 
   7           cad    a, b, d     0
   8           cad    e           2

Pasted the above df:

enter image description here

I wanted to do groupby on column-ID and A)Combine the strings if substring/string matches(on column-code) B)sum the values of column-amount.

Expected results:

enter image description here

Explanation:

column-row_numbers has no role here in df. I just took here to explain the output.

A)grouping on column-ID and looking at column-code, row1 string i.e., a is matching with row2's sub string. row2's substring i.e., b is matching with row3's substring. row3's substring i.e., c is matching with string of row4 and Hence combining row1, row2, row3 and row4. row5 string is not matching with any of string/substring so it is separate group. B) Based on this adding row1, row2, row3 and row4 values. and row5 as separate group.

Thanks in advance for your time and thoughts:).

EDIT - 1

Pasting the real time.

enter image description here

Expected output:

enter image description here

Explanation:

have to do on grouping column-id and concatenating the values of column-code and summing the values of column-units and vol. It is color coded the matching(to be contacted) values of column-code. row1 has link with row5 and row9. row9 has inturn link with row3. Hence combining row1, row5, row9, row3. Simliarly row2 and row7 and so on. row8 has no link with any of the values with-in group-med(column-id) and hence will be as separate row.

Thanks!.


Solution

  • Update: From your latest sample data, this is not a simple data munging. There is no vectorized solution. It relates to graph theory. You need to find connected components within each group of ID and do the calculation on each connected components.

    Consider each string as a node of graph. If 2 strings are overlapped, they are connected nodes. For every node, you need to traverse all paths connected to it. Do calculation on all connected nodes through these paths. This traversal can be done by using Depth-first search logic.

    However, before processing depth-first search, you need to preprocess strings to set to check overlapping.

    Method 1: Recursive

    Do the following:

    • Define a function dfs to recursively run depth-first search
    • Define a function gfunc to use with groupby apply. This function will traverse elements of each group of ID and return the desired dataframe.
    • Get rid of any blank spaces in each string and split and convert them to sets using replace, split and map and assign it to a new column new_code to df
    • Call groupby on ID and apply using function gfunc. Call droplevel and reset_index to get the desired output

    Codes as follows:

    import numpy as np
    
    def dfs(node, index, glist, g_checked_rows):        
        ret_arr = df.loc[index, ['code', 'amount', 'volume']].values
        g_checked_rows.add(index)   
        for j, s in glist:
            if j not in g_checked_rows and not node.isdisjoint(s):      
                t_arr = dfs(s, j, glist, g_checked_rows)
                ret_arr[0]  += ', ' + t_arr[0]
                ret_arr[1:] += t_arr[1:]
    
        return ret_arr
    
    def gfunc(x):   
        checked_rows = set()    
        final = []  
        code_list = list(x.new_code.items())
        for i, row in code_list:
            if i not in checked_rows:       
                final.append(dfs(row, i, code_list, checked_rows))
    
        return pd.DataFrame(final, columns=['code','units','vol'])
    
    df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
    df_final = df.groupby('ID', sort=False).apply(gfunc).droplevel(1).reset_index()
    
    Out[16]:
        ID                                        code  units  vol
    0  med  CO-96, CO-B15, CO-B15, CO-96, OA-18, OA-18      4    4
    1  med                        CO-16, CO-B20, CO-16      3    3
    2  med                       CO-252, CO-252, CO-45      3    3
    3  med                                      OA-258      1    1
    4  cad                        PR-96, PR-96, CO-243      4    4
    5  cad                        PR-87, OA-258, PR-87      3    3
    

    Note: I assume your pandas version is 0.24+. If it is < 0.24, the last step you need to use reset_index and drop instead of droplevel and reset_index as follows

    df_final = df.groupby('ID', sort=False).apply(gfunc).reset_index().drop('level_1', 1)
    

    Method 2: Iterative

    To make this complete, I implement a version of gfunc using iterative process instead of recursive. Iterative process requires only one function. However, the function is more complicated. The logic of iterative process as follows

    1. push the first node to deque. Check if deque not empty, pop the top node out.
    2. if a node is not marked checked, process it and mark it as checked
    3. find all its neighbors in the reverse order of list of nodes that haven't been marked, push them to the deque
    4. Check if deque not empty, pop out a node from the top deque and process from step 2

    Code as follows:

    def gfunc_iter(x):  
        checked_rows = set()    
        final = []  
        q = deque() 
        code_list = list(x.new_code.items())
        code_list_rev = code_list[::-1]
        for i, row in code_list:
            if i not in checked_rows:           
                q.append((i, row))
                ret_arr = np.array(['', 0, 0], dtype='O')
                while (q):
                    n, node = q.pop()
                    if n in  checked_rows:
                        continue                
                    ret_arr_child = df.loc[n, ['code', 'amount', 'volume']].values
                    if not ret_arr[0]:
                        ret_arr = ret_arr_child.copy()
                    else:
                        ret_arr[0]  += ', ' + ret_arr_child[0]
                        ret_arr[1:] += ret_arr_child[1:]                    
                    checked_rows.add(n)         
                    #push to `q` all neighbors in the reversed list of nodes    
                    for j, s in code_list_rev: 
                        if j not in checked_rows and not node.isdisjoint(s):
                            q.append((j, s))
                final.append(ret_arr)
    
        return pd.DataFrame(final, columns=['code','units','vol'])
    
    df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
    df_final = df.groupby('ID', sort=False).apply(gfunc_iter).droplevel(1).reset_index()