I want to add values from column "A" to column "B" in a new line, only if the corresponding value from column "A" is not contained in column "B".
My data frame:
df
A B
value1 value4
value1
value1 value4\nvalue1\n
value2 value4\nvalue7\n
value2 value4\n
value3
value3 value4\nvalue3\nvalue6
value3 value2\nvalue1
value3 value3\nvalue1\nvalue6
So, my desired data frame
df
A B
value1 value4\nvalue1
value1 value1
value1 value4\nvalue1
value2 value4\nvalue7\nvalue2
value2 value4\nvalue2
value3 value3
value3 value4\nvalue3\nvalue6
value3 value2\nvalue1\nvalue3
value3 value3\nvalue1\nvalue6
Additionally, the problem is that sometimes at the end of the string in column "B", a \n is present and sometimes is not.
I tried with df.A.str.contains(df["A"])
, but it doesn't work when contains()
has a data frame column. I think that something like this could work: df.B.str.extract('(%s)' % '|'.join(df.A))
, and this is only for the values from column "A" which are present in column "B" - for matches, and I don't know how to adapt to my problem.
Create a Boolean mask with a list comprehension. Replace missing values with the empty string and use Series.where
:
mask = [a in b for a, b in zip(df['A'], df['B'].fillna(''))]
b = df['B'].str.replace(r'\\n$', '')
df['B'] = b.where(mask, b + '\\n' + df['A']).fillna(df['A'])
print (df)
A B
0 value1 value4\nvalue1
1 value1 value1
2 value1 value4\nvalue1
3 value2 value4\nvalue7\nvalue2
4 value2 value4\nvalue2
5 value3 value3
6 value3 value4\nvalue3\nvalue6
7 value3 value2\nvalue1\nvalue3
8 value3 value3\nvalue1\nvalue6