Tags: python, arrays, dataframe, numpy, typeerror

Python numpy np.setdiff1d giving error "TypeError: Cannot compare structured arrays unless they have a common dtype."


I have been searching for this error but could not find anything relating np.setdiff1d to this TypeError. It would really help if you could explain why the error occurs and how I can resolve it. Below is my sample code snippet:

import pandas as pd
import numpy as np

data1 = {'a' : [32,156], 'b' :[56,177]}

data2 = {'c' : [12,32,12,45,32,45], 'd' :[11,56,76,43,44,45], 'e': [111,156,176,143,144,145], 'f':[411,456,476,443,444,445] }

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

## converting to array
npdf1= df1.to_records(index=False)
npdf2= df2.to_records(index=False)


diff = np.setdiff1d(npdf1,npdf2[['c','e']])
# The line above raises: "TypeError: Cannot compare structured arrays unless they have a common dtype.  I.e. `np.result_type(arr1, arr2)` must be defined."
# npdf1 >> gives below
# rec.array([( 32,  56), (156, 177)],
#          dtype=[('a', '<i8'), ('b', '<i8')])

# npdf2[['c','e']] >> gives below
# rec.array([(12, 111), (32, 156), (12, 176), (45, 143), (32, 144),
#            (45, 145)],
#           dtype={'names': ['c', 'e'], 'formats': ['<i8', '<i8'], 'offsets': [0, 16], 'itemsize': 32})

## The formats above match (both are i8), so I am not sure why the error occurs.
## As a workaround I tried converting the record arrays to plain numpy arrays

npdf1 = np.array(npdf1)

df2a = df2[['c','e']]
npdf2a = df2a.to_records(index=False)
npdf2a = np.array(npdf2a)

diff = np.setdiff1d(npdf1,npdf2a)

# Still get the error "TypeError: Cannot compare structured arrays unless they have a common dtype.  I.e. `np.result_type(arr1, arr2)` must be defined."


Solution
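
A likely cause, going by the error message itself: the two arrays have the same field formats but different field names ('a', 'b' versus 'c', 'e'), and recent NumPy versions only define a common dtype for structured types whose field names match. The sketch below checks this directly with np.result_type. It assumes a recent NumPy release; the helper common_dtype_or_none is a hypothetical name used only for illustration, and the dtypes are written by hand rather than taken from to_records.

```python
import numpy as np

# Same integer formats, different field names (mirrors the question's dtypes).
dt_ab = np.dtype([("a", "i8"), ("b", "i8")])
dt_ce = np.dtype([("c", "i8"), ("e", "i8")])


def common_dtype_or_none(d1, d2):
    """Return np.result_type(d1, d2), or None if NumPy cannot promote them.

    Hypothetical helper for illustration only.
    """
    try:
        return np.result_type(d1, d2)
    except TypeError:
        # Promotion failures are TypeError (or a subclass of it).
        return None


print(common_dtype_or_none(dt_ab, dt_ab))  # a common dtype exists
print(common_dtype_or_none(dt_ab, dt_ce))  # field names differ
```

If the second call finds no common dtype, any elementwise comparison between the two arrays fails in the same way, which would explain why converting the recarrays with np.array did not help: the field names are unchanged.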

  • Your recarray, converted to a list, is a list of tuples, which can be made into a set:

    In [152]: npdf1
    Out[152]: 
    rec.array([( 32,  56), (156, 177)],
              dtype=[('a', '<i8'), ('b', '<i8')])
    In [153]: npdf1.tolist()
    Out[153]: [(32, 56), (156, 177)]
    In [154]: s1=set(npdf1.tolist())
    In [155]: s1
    Out[155]: {(32, 56), (156, 177)}
    

    Do the same for the two fields of the other frame. tolist drops the field names, so the tuples from both arrays are directly comparable:

    In [159]: s2=set(npdf2[['c','e']].tolist())
    

    And then the ordinary set differences:

    In [160]: s1.difference(s2)
    Out[160]: {(32, 56), (156, 177)}
    In [161]: s2.difference(s1)
    Out[161]: {(12, 111), (12, 176), (32, 144), (32, 156), (45, 143), (45, 145)}
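
The steps above can be collected into one script. This is a minimal sketch, not the only way to do it: the arrays are built directly with np.array instead of via DataFrame.to_records (an assumption made to keep the example free of pandas), and converting the result back to a structured array with npdf1's dtype is an optional extra step.

```python
import numpy as np

# Structured arrays equivalent to df1.to_records(index=False) and
# df2.to_records(index=False) from the question (built by hand here).
npdf1 = np.array([(32, 56), (156, 177)], dtype=[("a", "i8"), ("b", "i8")])
npdf2 = np.array(
    [(12, 11, 111, 411), (32, 56, 156, 456), (12, 76, 176, 476),
     (45, 43, 143, 443), (32, 44, 144, 444), (45, 45, 145, 445)],
    dtype=[("c", "i8"), ("d", "i8"), ("e", "i8"), ("f", "i8")],
)

# tolist() yields plain tuples, so field names no longer matter.
s1 = set(npdf1.tolist())
s2 = set(npdf2[["c", "e"]].tolist())

# Rows of npdf1 that do not appear among the (c, e) pairs of npdf2.
# Sets are unordered, so sort for a reproducible result.
only_in_1 = sorted(s1 - s2)

# Optional: go back to a structured array with npdf1's field names.
diff = np.array(only_in_1, dtype=npdf1.dtype)
print(diff)
```

Note that a set-based difference discards duplicates and the original row order, much as np.setdiff1d returns unique, sorted values.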
    

    numpy.lib.recfunctions (commonly imported as rf) has various functions for working with recarrays and structured arrays, including structured_to_unstructured and rename_fields, but I don't think those are needed here.
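
For completeness, here is a sketch of the recfunctions route, in case a pure-NumPy result is preferred over Python sets. It uses rf.structured_to_unstructured to turn both structured arrays into ordinary 2-D integer arrays and then compares rows by broadcasting. The arrays are again built by hand (an assumption, to avoid depending on pandas), and the row-matching step is one possible approach, not something the answer above prescribes.

```python
import numpy as np
import numpy.lib.recfunctions as rf

npdf1 = np.array([(32, 56), (156, 177)], dtype=[("a", "i8"), ("b", "i8")])
npdf2 = np.array(
    [(12, 11, 111, 411), (32, 56, 156, 456), (12, 76, 176, 476),
     (45, 43, 143, 443), (32, 44, 144, 444), (45, 45, 145, 445)],
    dtype=[("c", "i8"), ("d", "i8"), ("e", "i8"), ("f", "i8")],
)

# Drop the field names: each record becomes one row of a plain 2-D array.
rows1 = rf.structured_to_unstructured(npdf1)              # shape (2, 2)
rows2 = rf.structured_to_unstructured(npdf2[["c", "e"]])  # shape (6, 2)

# matches[i, j] is True when row i of rows1 equals row j of rows2.
matches = (rows1[:, None, :] == rows2[None, :, :]).all(axis=2)

# Keep the rows of rows1 that match no row of rows2.
diff_rows = rows1[~matches.any(axis=1)]
print(diff_rows)
```

The broadcast comparison builds a len(rows1) x len(rows2) x 2 boolean array, so it suits small inputs like these; for large arrays the set-based version is likely the lighter option.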