python pandas list dataframe drop-duplicates

Python Pandas : Drop Duplicates Function - Unusual Behaviour


Calling drop_duplicates on my data frame raises TypeError: unhashable type: 'list', but the error disappears after saving the data frame and loading it again.

Both data frames (the generated one and the saved-and-reloaded one) report the same dtypes.

Reproducible example:

--> import pandas as pd
--> l1 = [[1], [1], [1], [1], [1], [1], [1], [1], [6], [1], [6], [1], [6], [6], [6], [6], [6], [6], [6], [6], [6]]

## len(l1) is 21 ##

--> l2 = ['a']*21
--> l3 = ['c']*10 + ['d']*10 + ['e']
--> df = pd.DataFrame()
--> df['col1'], df['col2'], df['col3'] = l1, l3, l2
--> df
        col1 col2 col3
        0   [1]    c    a
        1   [1]    c    a
        2   [1]    c    a
        3   [1]    c    a
        4   [1]    c    a
        5   [1]    c    a
        6   [1]    c    a
        7   [1]    c    a
        8   [6]    c    a
        9   [1]    c    a
        10  [6]    d    a
        11  [1]    d    a
        12  [6]    d    a
        13  [6]    d    a
        14  [6]    d    a
        15  [6]    d    a
        16  [6]    d    a
        17  [6]    d    a
        18  [6]    d    a
        19  [6]    d    a
        20  [6]    e    a

--> df.dtypes
        col1    object
        col2    object
        col3    object
        dtype: object

--> df.drop_duplicates(subset=['col1', 'col2', 'col3'], keep='last', inplace=True)
        
        ## TypeError: unhashable type: 'list' ##

## If I save it as an Excel file and load it again, the error does not come up ##

--> df.to_excel('test.xlsx')
--> df_ = pd.read_excel('test.xlsx')
--> df_.dtypes
        Unnamed: 0     int64
        col1    object
        col2    object
        col3    object
        dtype: object
--> df_.drop_duplicates(subset=['col1', 'col2', 'col3'], keep='last', inplace=True)
--> df_
         Unnamed: 0 col1 col2 col3
        8       8   [6]    c    a
        9       9   [1]    c    a
        11      11  [1]    d    a
        19      19  [6]    d    a
        20      20  [6]    e    a

Is there an explanation for this behaviour?

Extended Traceback of Issue

Traceback (most recent call last):

  File "", line 1, in

  File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4811, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)

  File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4888, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))

  File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\frame.py", line 4863, in f
    vals, size_hint=min(len(self), _SIZE_HINT_LIMIT)

  File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 636, in factorize
    values, na_sentinel=na_sentinel, size_hint=size_hint, na_value=na_value

  File "C:\Users\Agnij\Anaconda3\lib\site-packages\pandas\core\algorithms.py", line 484, in _factorize_array
    uniques, codes = table.factorize(values, na_sentinel=na_sentinel, na_value=na_value)

  File "pandas\_libs\hashtable_class_helper.pxi", line 1815, in pandas._libs.hashtable.PyObjectHashTable.factorize

  File "pandas\_libs\hashtable_class_helper.pxi", line 1731, in pandas._libs.hashtable.PyObjectHashTable._unique


Solution

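  • Both columns have dtype object, but the items stored in them have different types:

    >>> df.loc[0,'col1']
    [1]


    >>> df_.loc[0, 'col1']
    '[1]'


    The Excel round trip stores each list as its string representation, so the reloaded column holds strings instead of lists. drop_duplicates has to hash the values it compares (the traceback ends in a hash table's factorize call). Strings are hashable and lists are not, so you don't see the error that you had before.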
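
    If the goal is to drop duplicates without going through a file, one option is to compare on a hashable stand-in for the list column. The following is a minimal sketch, not the only way to do it. It rebuilds the frame from the question, uses an in-memory CSV round trip as an assumed stand-in for the Excel file (so no Excel writer is needed) to show the type change, and then deduplicates on a tuple copy of col1. The names key, mask and deduped are made up for this illustration.

```python
import io

import pandas as pd

# Data from the question: col1 holds Python lists.
l1 = [[1], [1], [1], [1], [1], [1], [1], [1], [6], [1], [6], [1],
      [6], [6], [6], [6], [6], [6], [6], [6], [6]]
df = pd.DataFrame({
    "col1": l1,
    "col2": ["c"] * 10 + ["d"] * 10 + ["e"],
    "col3": ["a"] * 21,
})

# A text round trip turns each list into its string representation.
# (CSV in memory is used here in place of the Excel file; this is an assumption
# made to keep the sketch free of extra dependencies.)
df_ = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(type(df.loc[0, "col1"]).__name__)   # element type in the original frame
print(type(df_.loc[0, "col1"]).__name__)  # element type after the round trip

# Build a comparison frame where the lists are replaced by tuples (hashable),
# compute the duplicate mask on it, and apply the mask to the original frame.
key = df.assign(col1=df["col1"].apply(tuple))
mask = key.duplicated(subset=["col1", "col2", "col3"], keep="last")
deduped = df[~mask]

print(deduped.index.tolist())
```

    With keep='last' this keeps rows 8, 9, 11, 19 and 20, the same rows as the reloaded frame in the question, while col1 in deduped still contains real lists rather than strings.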