python numpy unique

Remove repeated rows in 2D numpy array, maintaining first instance and ordering


I have a 2-dimensional NumPy array where some rows are not unique, i.e., when I do:

import numpy as np

data.shape                        #number of rows X columns in data
# (75000, 8)

np.unique(data, axis=0).shape     #number of unique rows is fewer than above
# (74801, 8)

Starting with the first row of data, I would like to remove any row that is a duplicate of an earlier row, while keeping the remaining rows in their original order. In the above example, the final shape of the new NumPy array should be (74801, 8).

E.g., given the below data array

data = np.array([[1,2,1],[2,2,3],[3,3,2],[2,2,3],[1,1,2],[0,0,0],[3,3,2]])
print(data)
[[1 2 1]
 [2 2 3]
 [3 3 2]
 [2 2 3]
 [1 1 2]
 [0 0 0]
 [3 3 2]]

I'd like to have the unique rows in their original order, i.e.,

[[1 2 1]
 [2 2 3]
 [3 3 2]
 [1 1 2]
 [0 0 0]]

Any tips on an efficient solution would be greatly appreciated!


Solution

  • Try numpy.unique with the "return_index" parameter:

    data[np.sort(np.unique(data, axis=0, return_index=True)[1])]
    

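    As its name indicates, return_index makes np.unique return a tuple: the unique rows (in sorted order) followed by the index of each row's first occurrence in data. The [1] at the end selects those indices, and np.sort puts them back in ascending order so that indexing data with them keeps the first instances in their original order.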
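The one-liner above can be unpacked into separate steps to make each part visible. This is only a sketch using the small example array from the question; the variable names unique_rows, first_idx, keep and result are introduced here for illustration.

```python
import numpy as np

data = np.array([[1, 2, 1], [2, 2, 3], [3, 3, 2], [2, 2, 3],
                 [1, 1, 2], [0, 0, 0], [3, 3, 2]])

# np.unique returns the unique rows in sorted order, so the original
# ordering is lost in unique_rows itself.
unique_rows, first_idx = np.unique(data, axis=0, return_index=True)

# first_idx[i] is the row number in data where unique_rows[i] first appears.
# Sorting those row numbers restores the order of first occurrences.
keep = np.sort(first_idx)

# Index the original array with the sorted row numbers.
result = data[keep]
print(result)
```

For the example array this should reproduce the five rows listed in the question, in the same order.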


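    You can also use pandas, whose drop_duplicates keeps the first occurrence of each row by default and leaves the remaining rows in their original order: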

    import pandas as pd
    pd.DataFrame(data).drop_duplicates().values
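As a rough cross-check, the two approaches can be compared on the example array from the question. The sketch below assumes pandas is installed; it spells out keep="first" (the default) and uses to_numpy(), which the pandas documentation recommends over .values. The name deduped is made up for this example.

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 1], [2, 2, 3], [3, 3, 2], [2, 2, 3],
                 [1, 1, 2], [0, 0, 0], [3, 3, 2]])

# pandas: drop later duplicates, keeping the first occurrence of each row.
deduped = pd.DataFrame(data).drop_duplicates(keep="first").to_numpy()

# NumPy: sorted indices of the first occurrence of each unique row.
_, first_idx = np.unique(data, axis=0, return_index=True)

# Both approaches should select the same rows in the same order.
same = np.array_equal(deduped, data[np.sort(first_idx)])
print(same, deduped.shape)
```

Which of the two is faster will depend on the array's size and dtype, so it is worth timing both on the real (75000, 8) data before choosing.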