
Numpy unique changes integer to string


I have a data table with string and integer columns, such as:

test_data = [('A',1,2,3),('B',4,5,6),('A',1,2,3)]

I need the unique rows, so I used NumPy's unique function:

summary, repeat = np.unique(test_data, return_counts=True, axis=0)

But afterwards my data types have changed. summary is:

array([['A', '1', '2', '3'],
       ['B', '4', '5', '6']], dtype='<U1')

All values are now strings. How can I prevent this change? (Python 3.7, numpy 1.16.4)


Solution

  • If you have Python objects and you want to keep them as Python objects, use Python's own tools:

    unique_rows = set(test_data)
    

    Or better yet:

    from collections import Counter
    
    rows_and_counts = Counter(test_data)
    

    These solutions do not copy the data: they keep references to the original tuples just as they are. The numpy solution copies the data multiple times: once when converting to a numpy array, at least once when sorting, and possibly more when converting back.

    Both solutions run in O(N) time on average because they use a hash table. np.unique sorts the data, so it takes O(N log N).
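As a minimal sketch of the Counter approach applied to the data from the question: the names summary and repeat are reused from the question only to mirror the np.unique call, and splitting the Counter into two lists is an illustrative choice, not something the answer prescribes.

```python
from collections import Counter

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

# Counter hashes each tuple; equal tuples collapse into a single key.
rows_and_counts = Counter(test_data)

# The keys are the original tuples, so the integers stay integers.
# On Python 3.7+ the keys come back in first-seen order.
summary = list(rows_and_counts.keys())
repeat = list(rows_and_counts.values())

print(summary)  # [('A', 1, 2, 3), ('B', 4, 5, 6)]
print(repeat)   # [2, 1]
```

This only works when every row is hashable, which holds for tuples of strings and integers. Unlike np.unique, the result follows the order of first appearance and is not sorted.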