pandas: output differences between two columns of lists of strings


I have a dataframe with two columns as follows:

import pandas as pd

df = pd.DataFrame({'pos_1':[['VERB', 'PRON', 'DET', 'NOUN', 'ADP'],['NOUN', 'PRON', 'DET', 'NOUN', 'ADV', 'ADV']],
               'pos_2':[['VERB', 'PRON', 'DET', 'NOUN', 'ADP'],['VERB', 'PRON', 'DET', 'NOUN', 'ADV', 'ADV']]})

and I am trying to output the differences between these two columns using apply.

df['diff'] = df.apply(lambda x: [i for i in x['pos_1'] if i not in x['pos_2']], axis=1)

My desired output for the diff column is:

diff
1 []
2 ['NOUN','VERB']

Instead, I get two empty lists in the diff column. I do not know which part I am doing wrong.


Solution

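  • The membership test (i not in x['pos_2']) returns nothing here because every tag in pos_1, including NOUN, occurs somewhere in pos_2. If you need to compare both lists element-wise and return the differences, use zip to compare each pair, then flatten the mismatched pairs with a nested list comprehension: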

    # Keep both items of every position where the two lists disagree
    f = lambda x: [z for i, j in zip(x['pos_1'], x['pos_2']) if i != j for z in [i, j]]
    df['diff'] = df.apply(f, axis=1)
    print(df)
    
                                   pos_1                              pos_2  \
    0       [VERB, PRON, DET, NOUN, ADP]       [VERB, PRON, DET, NOUN, ADP]   
    1  [NOUN, PRON, DET, NOUN, ADV, ADV]  [VERB, PRON, DET, NOUN, ADV, ADV]   
    
               diff  
    0            []  
    1  [NOUN, VERB]
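
The comparison above can also be sketched as a self-contained script. This is only an illustration, not part of the original answer: the helper name positional_diff is made up, zip is swapped for itertools.zip_longest so that lists of unequal length are padded with None instead of being silently truncated, and the two columns are iterated directly rather than through apply(axis=1).

```python
from itertools import zip_longest

import pandas as pd

df = pd.DataFrame({
    'pos_1': [['VERB', 'PRON', 'DET', 'NOUN', 'ADP'],
              ['NOUN', 'PRON', 'DET', 'NOUN', 'ADV', 'ADV']],
    'pos_2': [['VERB', 'PRON', 'DET', 'NOUN', 'ADP'],
              ['VERB', 'PRON', 'DET', 'NOUN', 'ADV', 'ADV']],
})


def positional_diff(a, b):
    """Hypothetical helper: return both items of every position where a and b differ."""
    # Assumption: if one list is shorter, zip_longest pads it with None,
    # so the extra items of the longer list show up as differences.
    return [z for i, j in zip_longest(a, b) if i != j for z in (i, j)]


# Membership test from the question: every tag in pos_1 occurs somewhere
# in pos_2, so nothing is reported for either row.
membership = [[i for i in a if i not in b]
              for a, b in zip(df['pos_1'], df['pos_2'])]

# Position-wise comparison of the two columns, row by row.
df['diff'] = [positional_diff(a, b) for a, b in zip(df['pos_1'], df['pos_2'])]

print(membership)
print(df['diff'].tolist())
```

With lists of equal length, as in the question, zip_longest behaves like zip, so the diff column should match the output shown above.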