Search code examples
pandasdataframematchmultiple-columns

Sorting the dataframe by matching multiple columns and conditions


The input data:

| source_element |   source_text     | source_entity  | target_element | target_text  | target_entity  |
|---------------|-------------------|----------------|---------------|-------------|----------------|
|    vvvvvv      |       ['body']    |      name      |     aaaaa      |     ceo      |      code      |
|    vvvvvv      |       ['body']    |      name      |     bbbbb      |     cap      |      code      |
|    vvvvvv      |  ['body', 'fat']  |      name      |     ccccc      |     fat      |      name      |
|    hhhhhh      | ['man', 'hat']    |      name      |     kkkkk      |     hat      |      name      |
|    kkkkk       |      ['hat']      |      name      |     lllll      |     700      |     price      |
|    kkkkk       |      ['hat']      |      name      |     fffff      |     mud      |      code      |
|    eeeee       |     ['strong']    |      name      |     zzzzz      |     300      |     price      |
|    ssssss      |      ['head']     |      name      |     ddddd      |     zip      |      code      |
|    ssssss      | ['head', 'strong']|      name      |     eeeee      |    strong    |      name      |
|    ssssss      |  ['head', 'part'] |      name      |     uuuuu      |     part     |      name      |
|    uuuuu       |      ['part']     |      name      |     xxxxx      |     190      |     price      |

the output:

|  source_text   |  source_entity    |  target_text   |  target_entity    |
|-----------------|----------------|----------------|-------------------|
| ['body', 'fat']|       name        |     'ceo'      |       code        |
| ['body', 'fat']|       name        |     'cap'      |       code        |
| ['man', 'hat'] |       name        |     '700'      |      price        |
| ['man', 'hat'] |       name        |     'mud'      |       code        |
| ['head', 'strong'] |    name        |     'zip'      |       code        |
| ['head', 'strong'] |    name        |     '300'      |      price        |
| ['head', 'part'] |      name        |     'zip'      |       code        |
| ['head', 'part'] |      name        |     '190'      |      price        |

I need the output that should satisfy these conditions:

1.If source_entity and target_entity are "name", then the source_element or target_element of that particular row matches with source_element of any row in entire dataframe, replace the target_text and text_entity with the matched row of dataframe and remove the rows with source_entity and target_entity are "name".

2.Keep the other dataframe rows which is not effected with above condition


Solution

  • import pandas as pd
    
    cond = (df['source_entity'] == 'name') & (df['target_entity'] == 'name')
    bbb = df[~df.index.isin(df[cond].index)]
    
    def fff(x):
        a = bbb[bbb['source_element'] == df.loc[x, 'source_element']]
        b = bbb[bbb['source_element'] == df.loc[x, 'target_element']]
        if len(a) > 0:
            df.loc[a.index, 'source_text'] = df.loc[x, 'source_text']
            return x
        elif len(b) > 0:
            df.loc[b.index, 'source_text'] = df.loc[x, 'source_text']
            return x
    
    
    aaa = [fff(i) for i in df[cond].index]
    ind = df.index.isin(aaa)
    
    print(df[~ind][['source_text', 'source_entity', 'target_text', 'target_entity']])
    

    Output

         source_text source_entity target_text target_entity
    0    'body  fat'          name       'ceo'          code
    1    'body  fat'          name       'cap'          code
    4      'man hat'          name       '700'         price
    5  'head strong'          name       'zip'          code
    

    if you need to reset the indexes:

    print(df[~ind][['source_text', 'source_entity', 'target_text', 'target_entity']].reset_index(drop=True))
    

    Update

    A completely different approach is used here, in which strings are generated (created).

    The 'source_text' column data is turned into a string df['source_text'].str.join(',').

    Like last time, we get bbb strings that do not have a double 'name'.

    But, in the fff function, the indexes of matches for 'source_element', 'target_element' are obtained, and if they are found, we get 'source_text'. Everything is written to the aaa list. In qqq rows are generated. The source_text column is created on the fly assign(source_text=i[1]). The resulting lists are connected by np.vstack. The dataframe is created.

    import numpy as np
    import pandas as pd
    
    df['source_text'] = df['source_text'].str.join(',')
    
    cond = (df['source_entity'] == 'name') & (df['target_entity'] == 'name')
    bbb = df[~df.index.isin(df[cond].index)]
    
    
    def fff(x):
        a = bbb[bbb['source_element'] == df.loc[x, 'source_element']]
        b = bbb[bbb['source_element'] == df.loc[x, 'target_element']]
        if len(a) > 0 or len(b) > 0:
            tex = df.loc[x, 'source_text']
            return [list(a.index) + list(b.index), tex]
    
    
    aaa = [fff(i) for i in df[cond].index]
    
    qqq = np.array(
        [df.loc[i[0], ['source_entity', 'target_text', 'target_entity']].copy().assign(source_text=i[1]).values for i in aaa])
    
    qqq = np.vstack(qqq)
    
    df1 = pd.DataFrame(qqq, columns=['source_entity', 'target_text', 'target_entity', 'source_text'])
    
    df1['source_text'] = df1['source_text'].str.split(',')
    
    print(df1[['source_text', 'source_entity', 'target_text', 'target_entity']])
    

    Output

       source_text source_entity target_text target_entity
    0     body,fat          name         ceo          code
    1     body,fat          name         cap          code
    2      man,hat          name         700         price
    3      man,hat          name         mud          code
    4  head,strong          name         zip          code
    5  head,strong          name         300         price
    6    head,part          name         zip          code
    7    head,part          name         190         price
    

    If you need to return the lists back:

    df1['source_text'] = df1['source_text'].str.split(',')