The input data:
| source_element | source_text | source_entity | target_element | target_text | target_entity |
|---------------|-------------------|----------------|---------------|-------------|----------------|
| vvvvvv | ['body'] | name | aaaaa | ceo | code |
| vvvvvv | ['body'] | name | bbbbb | cap | code |
| vvvvvv | ['body', 'fat'] | name | ccccc | fat | name |
| hhhhhh | ['man', 'hat'] | name | kkkkk | hat | name |
| kkkkk | ['hat'] | name | lllll | 700 | price |
| kkkkk | ['hat'] | name | fffff | mud | code |
| eeeee | ['strong'] | name | zzzzz | 300 | price |
| ssssss | ['head'] | name | ddddd | zip | code |
| ssssss | ['head', 'strong']| name | eeeee | strong | name |
| ssssss | ['head', 'part'] | name | uuuuu | part | name |
| uuuuu | ['part'] | name | xxxxx | 190 | price |
the output:
| source_text | source_entity | target_text | target_entity |
|-----------------|----------------|----------------|-------------------|
| ['body', 'fat']| name | 'ceo' | code |
| ['body', 'fat']| name | 'cap' | code |
| ['man', 'hat'] | name | '700' | price |
| ['man', 'hat'] | name | 'mud' | code |
| ['head', 'strong'] | name | 'zip' | code |
| ['head', 'strong'] | name | '300' | price |
| ['head', 'part'] | name | 'zip' | code |
| ['head', 'part'] | name | '190' | price |
I need the output that should satisfy these conditions:
1.If source_entity and target_entity are "name", then the source_element or target_element of that particular row matches with source_element of any row in entire dataframe, replace the target_text and text_entity with the matched row of dataframe and remove the rows with source_entity and target_entity are "name".
2.Keep the other dataframe rows which is not effected with above condition
import pandas as pd
cond = (df['source_entity'] == 'name') & (df['target_entity'] == 'name')
bbb = df[~df.index.isin(df[cond].index)]
def fff(x):
a = bbb[bbb['source_element'] == df.loc[x, 'source_element']]
b = bbb[bbb['source_element'] == df.loc[x, 'target_element']]
if len(a) > 0:
df.loc[a.index, 'source_text'] = df.loc[x, 'source_text']
return x
elif len(b) > 0:
df.loc[b.index, 'source_text'] = df.loc[x, 'source_text']
return x
aaa = [fff(i) for i in df[cond].index]
ind = df.index.isin(aaa)
print(df[~ind][['source_text', 'source_entity', 'target_text', 'target_entity']])
Output
source_text source_entity target_text target_entity
0 'body fat' name 'ceo' code
1 'body fat' name 'cap' code
4 'man hat' name '700' price
5 'head strong' name 'zip' code
if you need to reset the indexes:
print(df[~ind][['source_text', 'source_entity', 'target_text', 'target_entity']].reset_index(drop=True))
Update
A completely different approach is used here, in which strings are generated (created).
The 'source_text' column data is turned into a string df['source_text'].str.join(',')
.
Like last time, we get bbb
strings that do not have a double 'name'.
But, in the fff
function, the indexes of matches for 'source_element', 'target_element' are obtained, and if they are found, we get 'source_text'
. Everything is written to the aaa
list. In qqq
rows are generated. The source_text column is created on the fly assign(source_text=i[1])
. The resulting lists are connected by np.vstack
. The dataframe is created.
import numpy as np
import pandas as pd
df['source_text'] = df['source_text'].str.join(',')
cond = (df['source_entity'] == 'name') & (df['target_entity'] == 'name')
bbb = df[~df.index.isin(df[cond].index)]
def fff(x):
a = bbb[bbb['source_element'] == df.loc[x, 'source_element']]
b = bbb[bbb['source_element'] == df.loc[x, 'target_element']]
if len(a) > 0 or len(b) > 0:
tex = df.loc[x, 'source_text']
return [list(a.index) + list(b.index), tex]
aaa = [fff(i) for i in df[cond].index]
qqq = np.array(
[df.loc[i[0], ['source_entity', 'target_text', 'target_entity']].copy().assign(source_text=i[1]).values for i in aaa])
qqq = np.vstack(qqq)
df1 = pd.DataFrame(qqq, columns=['source_entity', 'target_text', 'target_entity', 'source_text'])
df1['source_text'] = df1['source_text'].str.split(',')
print(df1[['source_text', 'source_entity', 'target_text', 'target_entity']])
Output
source_text source_entity target_text target_entity
0 body,fat name ceo code
1 body,fat name cap code
2 man,hat name 700 price
3 man,hat name mud code
4 head,strong name zip code
5 head,strong name 300 price
6 head,part name zip code
7 head,part name 190 price
If you need to return the lists back:
df1['source_text'] = df1['source_text'].str.split(',')