I have a 2-dimensional NumPy array in which some rows are not unique, i.e., when I do:
import numpy as np
data.shape #number of rows X columns in data
# (75000, 8)
np.unique(data, axis=0).shape #number of unique rows is fewer than above
# (74801, 8)
Starting with the first row of data, I would like to remove any row that is a duplicate of a previous row, maintaining the original order of the rows. In the above example, the final shape of the new NumPy array should be (74801, 8).
E.g., given the below data array
data = np.array([[1,2,1],[2,2,3],[3,3,2],[2,2,3],[1,1,2],[0,0,0],[3,3,2]])
print(data)
[[1 2 1]
[2 2 3]
[3 3 2]
[2 2 3]
[1 1 2]
[0 0 0]
[3 3 2]]
I'd like to have the unique rows in their original order, i.e.,
[[1 2 1]
[2 2 3]
[3 3 2]
[1 1 2]
[0 0 0]]
Any tips on an efficient solution would be greatly appreciated!
Try numpy.unique with the "return_index" parameter:
data[np.sort(np.unique(data, axis=0, return_index=True)[1])]
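As its name indicates, return_index makes np.unique return a tuple containing the unique rows and the indices of their first occurrences, in that order (that's why there's a [1] at the end). np.unique returns the unique rows in sorted order, so sorting those indices with np.sort and indexing data with them gives the unique rows in their original order.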
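To make the steps easier to follow, here is the same one-liner unpacked on the example array from the question. The names unique_rows, first_idx, keep and result are only illustrative; this is a sketch of the approach above, not a different method.

```python
import numpy as np

data = np.array([[1, 2, 1], [2, 2, 3], [3, 3, 2], [2, 2, 3],
                 [1, 1, 2], [0, 0, 0], [3, 3, 2]])

# np.unique sorts the unique rows; with return_index=True it also returns,
# for each unique row, the index of its first occurrence in data.
unique_rows, first_idx = np.unique(data, axis=0, return_index=True)

# Sorting the first-occurrence indices puts them back in original row order.
keep = np.sort(first_idx)

# Index the original array to get the de-duplicated rows, order preserved.
result = data[keep]

print(keep)
print(result)
```

For the example array, keep contains the row indices 0, 1, 2, 4, 5, so the second [2 2 3] and the second [3 3 2] are dropped.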
You can also use pandas:
import pandas as pd
pd.DataFrame(data).drop_duplicates().values