Search code examples
python-3.xpandascalculated-columns

pandas dataframe column transformation


My input df is as below:

ID     item1       item2        item3
1      a,b         b,c          b
2      a,c,f       b,c          b,c,f
3      g,h,i       i            h,i
4      j,k         j,k          l

df datatypes for item1, item2 and item3 are string type.

I would like to add a 4th column and transformation required is as below: pseudo code:

Final_item = item3 - set[col(item1) + col(item2)] Basically, in the last column, is adding item1 and item2, then apply set to remove duplicates then subtract with item3 column.

Desired output as below:

ID     item1       item2        item3       Final_item
1      a,b         b,c          b           a,c
2      a,c,f       b,c          b,c,f       a 
3      g,h,i       i            h,i         g
4      j,k         j,k          l           j,k

Solution

  • First split columns and also joined columns by , and then get difference in list comprehension of zipped Series:

    i3 = df['item3'].str.split(',')
    i12 = (df['item1'] + ',' + df['item2']).str.split(',')
    df['Final_item'] = [','.join(set(b) - set(a)) for a, b in zip(i3, i12)]
    print (df)
       ID  item1 item2  item3 Final_item
    0   1    a,b   b,c      b        c,a
    1   2  a,c,f   b,c  b,c,f          a
    2   3  g,h,i     i    h,i          g
    3   4    j,k   j,k      l        j,k