Differences between columns containing lists


I have a DataFrame whose column values are lists, and I want to find the differences between two columns.

import pandas as pd

data={'NAME':['JOHN','MARY','CHARLIE'],
      'A':[[1,2,3],[2,3,4],[3,4,5]],
      'B':[[2,3,4],[3,4,5],[4,5,6]]}
df=pd.DataFrame(data)

Why doesn't the following work?

df = df.assign(X1 = lambda x: [y for y in x['A'] if y not in x['B']])

I get this error:

TypeError: unhashable type: 'list'

I don't understand why.


Solution

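  • So, this is where lambdas get interesting. A callable passed to assign receives the entire DataFrame, so in your attempt x['B'] is the whole column (a Series), and "y not in x['B']" tests y against the Series index rather than its values. That lookup has to hash y, which is a list, hence the TypeError. These two lambdas will have the same result:

    df = df.assign(X1 = lambda x: [y for y in x['A']]) # explicit Python loop; x is the entire DataFrame, y is one row's list
    df = df.assign(X1 = lambda x: x['A']) # returns the column directly; x is still the entire DataFrame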
    

    One (lengthy) way to do what you are asking is to iterate over the row positions and compare the two lists in each row (this relies on the default 0..n-1 index):

    df = df.assign(X1 = lambda x: [[y for y in x['A'][i] if y not in x['B'][i]] for i in range(len(x['A']))])
    

    which can be simplified to one of the following:

    df = df.assign(X1 = [[y for y in r.A if y not in r.B] for _, r in df.iterrows()]) # similar structure to your initial attempt
    df = df.assign(X2 = [list(set(r.A).difference(r.B)) for _, r in df.iterrows()]) # set-based; more efficient for larger lists, but drops duplicates and does not guarantee order
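
For completeness, here is a minimal self-contained sketch that pulls the pieces above together. It rebuilds the DataFrame from the question, shows that `in` on a Series looks at the index labels rather than the stored lists, and then computes the per-row difference by pairing the two columns with `zip`. The `zip` form is not from the original answer; it is one possible alternative to `iterrows()`, and the names `label_found` and `value_found` are only illustrative.

```python
import pandas as pd

data = {
    "NAME": ["JOHN", "MARY", "CHARLIE"],
    "A": [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
    "B": [[2, 3, 4], [3, 4, 5], [4, 5, 6]],
}
df = pd.DataFrame(data)

# `in` on a Series checks the index labels (0, 1, 2 here), not the values.
label_found = 0 in df["B"]

# To test against the stored values, compare with a plain Python list instead.
value_found = [2, 3, 4] in df["B"].tolist()

# Pair the two columns row by row; a and b are the lists from the same row.
# This keeps the order of A and does not depend on the index labels.
df["X1"] = [[y for y in a if y not in b] for a, b in zip(df["A"], df["B"])]

print(df)
```

Because `zip` walks both columns in positional order, this version should behave the same even when the DataFrame does not have the default integer index.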