Sorting the dataframe by matching multiple columns and conditions

The input data:

| source_element |   source_text     | source_entity  | target_element | target_text  | target_entity  |
|---------------|-------------------|----------------|---------------|-------------|----------------|
|    vvvvvv      |       ['body']    |      name      |     aaaaa      |     ceo      |      code      |
|    vvvvvv      |       ['body']    |      name      |     bbbbb      |     cap      |      code      |
|    vvvvvv      |  ['body', 'fat']  |      name      |     ccccc      |     fat      |      name      |
|    hhhhhh      | ['man', 'hat']    |      name      |     kkkkk      |     hat      |      name      |
|    kkkkk       |      ['hat']      |      name      |     lllll      |     700      |     price      |
|    kkkkk       |      ['hat']      |      name      |     fffff      |     mud      |      code      |
|    eeeee       |     ['strong']    |      name      |     zzzzz      |     300      |     price      |
|    ssssss      |      ['head']     |      name      |     ddddd      |     zip      |      code      |
|    ssssss      | ['head', 'strong']|      name      |     eeeee      |    strong    |      name      |
|    ssssss      |  ['head', 'part'] |      name      |     uuuuu      |     part     |      name      |
|    uuuuu       |      ['part']     |      name      |     xxxxx      |     190      |     price      |

the output:

|  source_text   |  source_entity    |  target_text   |  target_entity    |
|-----------------|----------------|----------------|-------------------|
| ['body', 'fat']|       name        |     'ceo'      |       code        |
| ['body', 'fat']|       name        |     'cap'      |       code        |
| ['man', 'hat'] |       name        |     '700'      |      price        |
| ['man', 'hat'] |       name        |     'mud'      |       code        |
| ['head', 'strong'] |    name        |     'zip'      |       code        |
| ['head', 'strong'] |    name        |     '300'      |      price        |
| ['head', 'part'] |      name        |     'zip'      |       code        |
| ['head', 'part'] |      name        |     '190'      |      price        |

I need the output that should satisfy these conditions:

1.If source_entity and target_entity are "name", then the source_element or target_element of that particular row matches with source_element of any row in entire dataframe, replace the target_text and text_entity with the matched row of dataframe and remove the rows with source_entity and target_entity are "name".

2.Keep the other dataframe rows which is not effected with above condition

Solution

import pandas as pd

cond = (df['source_entity'] == 'name') & (df['target_entity'] == 'name')
bbb = df[~df.index.isin(df[cond].index)]

def fff(x):
    a = bbb[bbb['source_element'] == df.loc[x, 'source_element']]
    b = bbb[bbb['source_element'] == df.loc[x, 'target_element']]
    if len(a) > 0:
        df.loc[a.index, 'source_text'] = df.loc[x, 'source_text']
        return x
    elif len(b) > 0:
        df.loc[b.index, 'source_text'] = df.loc[x, 'source_text']
        return x


aaa = [fff(i) for i in df[cond].index]
ind = df.index.isin(aaa)

print(df[~ind][['source_text', 'source_entity', 'target_text', 'target_entity']])

Output

     source_text source_entity target_text target_entity
0    'body  fat'          name       'ceo'          code
1    'body  fat'          name       'cap'          code
4      'man hat'          name       '700'         price
5  'head strong'          name       'zip'          code

if you need to reset the indexes:

print(df[~ind][['source_text', 'source_entity', 'target_text', 'target_entity']].reset_index(drop=True))

Update

A completely different approach is used here, in which strings are generated (created).

The 'source_text' column data is turned into a string df['source_text'].str.join(',').

Like last time, we get bbb strings that do not have a double 'name'.

But, in the fff function, the indexes of matches for 'source_element', 'target_element' are obtained, and if they are found, we get 'source_text'. Everything is written to the aaa list. In qqq rows are generated. The source_text column is created on the fly assign(source_text=i[1]). The resulting lists are connected by np.vstack. The dataframe is created.

import numpy as np
import pandas as pd

df['source_text'] = df['source_text'].str.join(',')

cond = (df['source_entity'] == 'name') & (df['target_entity'] == 'name')
bbb = df[~df.index.isin(df[cond].index)]


def fff(x):
    a = bbb[bbb['source_element'] == df.loc[x, 'source_element']]
    b = bbb[bbb['source_element'] == df.loc[x, 'target_element']]
    if len(a) > 0 or len(b) > 0:
        tex = df.loc[x, 'source_text']
        return [list(a.index) + list(b.index), tex]


aaa = [fff(i) for i in df[cond].index]

qqq = np.array(
    [df.loc[i[0], ['source_entity', 'target_text', 'target_entity']].copy().assign(source_text=i[1]).values for i in aaa])

qqq = np.vstack(qqq)

df1 = pd.DataFrame(qqq, columns=['source_entity', 'target_text', 'target_entity', 'source_text'])

df1['source_text'] = df1['source_text'].str.split(',')

print(df1[['source_text', 'source_entity', 'target_text', 'target_entity']])

Output

   source_text source_entity target_text target_entity
0     body,fat          name         ceo          code
1     body,fat          name         cap          code
2      man,hat          name         700         price
3      man,hat          name         mud          code
4  head,strong          name         zip          code
5  head,strong          name         300         price
6    head,part          name         zip          code
7    head,part          name         190         price

If you need to return the lists back:

df1['source_text'] = df1['source_text'].str.split(',')