pandas get unique values from column of lists


How do I get the unique values of a column of lists in pandas or NumPy, such that the second column of the DataFrame below

[image: screenshot of the DataFrame, not available]

would result in 'action', 'crime', 'drama'?

The closest (but non-functional) solutions I could come up with were:

 genres = data['Genre'].unique()

But this predictably results in a TypeError, because lists aren't hashable:

TypeError: unhashable type: 'list'

Using a set seemed like a good idea, but

genres = data.apply(set(), columns=['Genre'], axis=1)

also results in a TypeError: set() takes no keyword arguments


Solution

  • If you only want to find the unique values, I'd recommend using itertools.chain.from_iterable to flatten all those lists into a single sequence:

    import itertools
    import numpy as np
    
    >>> np.unique([*itertools.chain.from_iterable(df.Genre)])
    array(['action', 'crime', 'drama'], dtype='<U6')
    

    Or, even faster:

    >>> set(itertools.chain.from_iterable(df.Genre))
    {'action', 'crime', 'drama'}
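
As a self-contained sketch of the same idea, the flattening and de-duplication can be wrapped in a small helper. The name `unique_items` and the sample rows are hypothetical, chosen only for illustration; the assumption is that a pandas Series of lists such as df.Genre can be passed in the same way, since iterating over it yields the inner lists.

```python
import itertools

def unique_items(rows):
    """Return the sorted distinct items found in an iterable of lists."""
    # chain.from_iterable lazily flattens the nested lists,
    # set() drops duplicates, and sorted() gives a deterministic order.
    return sorted(set(itertools.chain.from_iterable(rows)))

# Hypothetical sample data standing in for df.Genre
rows = [['crime', 'drama'], ['action', 'crime', 'drama']]
genres = unique_items(rows)
print(genres)  # ['action', 'crime', 'drama']
```

Sorting is optional here; it is only included because a plain set has no guaranteed display order.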
    

    Timings

    import pandas as pd
    
    df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
    df = pd.concat([df]*10000)
    
    %timeit set(itertools.chain.from_iterable(df.Genre))
    100 loops, best of 3: 2.55 ms per loop
        
    %timeit set([x for y in df['Genre'] for x in y])
    100 loops, best of 3: 4.09 ms per loop
    
    %timeit np.unique([*itertools.chain.from_iterable(df.Genre)])
    100 loops, best of 3: 12.8 ms per loop
    
    %timeit np.unique(df['Genre'].sum())
    1 loop, best of 3: 1.65 s per loop
    
    %timeit set(df['Genre'].sum())
    1 loop, best of 3: 1.66 s per loop
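
One plausible reading of the gap in these timings is that summing a column of lists concatenates them pairwise, copying the growing result at every step, while chain.from_iterable visits each element only once. The sketch below illustrates that difference in plain Python; it uses the built-in sum(rows, []) as a stand-in for df['Genre'].sum(), which is an assumption about how the pandas call behaves rather than a reproduction of it. The function names and the smaller row count are hypothetical.

```python
import itertools
import timeit

# Hypothetical data mirroring the shape used in the timings, but smaller
rows = [['crime', 'drama'], ['action', 'crime', 'drama']] * 2000

def flatten_by_concat(rows):
    # sum with a list start value computes acc = acc + row at each step,
    # building a new, longer list every time (roughly quadratic work).
    return sum(rows, [])

def flatten_by_chain(rows):
    # chain.from_iterable walks each inner list once, with no intermediate copies.
    return list(itertools.chain.from_iterable(rows))

# Both strategies produce the same flattened sequence ...
assert flatten_by_concat(rows) == flatten_by_chain(rows)

# ... but take different amounts of time; exact numbers vary by machine.
t_concat = timeit.timeit(lambda: flatten_by_concat(rows), number=3)
t_chain = timeit.timeit(lambda: flatten_by_chain(rows), number=3)
print(f"concat: {t_concat:.4f}s  chain: {t_chain:.4f}s")
```

No expected timings are shown because they depend on the machine and Python version.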