I have a data table with string and integer columns, such as:
test_data = [('A',1,2,3),('B',4,5,6),('A',1,2,3)]
I need the unique rows, so I used NumPy's unique function:
summary, repeat = np.unique(test_data,return_counts=True, axis=0)
But afterwards my data types are changed. Summary is:
array([['A', '1', '2', '3'],
['B', '4', '5', '6']], dtype='<U1')
All data types are now strings. How can I prevent this change? (Python 3.7, numpy 1.16.4)
If you have Python objects and you want to retain them as Python objects, use Python functions:
unique_rows = set(test_data)
Or better yet:
from collections import Counter
rows_and_counts = Counter(test_data)
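For example, with your data, a Counter gives you both the unique rows and their counts, mirroring the two arrays returned by `np.unique(..., return_counts=True)`:

```python
from collections import Counter

test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

# Counter maps each unique row (a tuple) to its occurrence count
rows_and_counts = Counter(test_data)

# Split into unique rows and matching counts
summary = list(rows_and_counts)          # [('A', 1, 2, 3), ('B', 4, 5, 6)]
repeat = list(rows_and_counts.values())  # [2, 1]

# The integer fields keep their original Python type
print(type(summary[0][1]))  # <class 'int'>
```

(Counter preserves insertion order on Python 3.7+, so the counts line up with the rows.)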
These solutions do not copy the data: they retain references to the original tuples just as they are. The numpy solution copies the data multiple times: once when converting to a numpy array, at least once when sorting, and possibly more when converting back.
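You can verify the no-copy claim directly: the elements of the set are the very same tuple objects you put in, not new ones:

```python
test_data = [('A', 1, 2, 3), ('B', 4, 5, 6), ('A', 1, 2, 3)]

unique_rows = set(test_data)

# Find the set element equal to the first input row and check identity:
# the set stores a reference to one of the original tuples, not a copy.
first = next(r for r in unique_rows if r == test_data[0])
print(first is test_data[0] or first is test_data[2])  # True
```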
These solutions have O(N) algorithmic complexity, because they both use a hash table. The numpy unique solution uses sorting, and is therefore of O(N log N) complexity.