
How to reduce the number of row repetitions in a numpy array


I want to clean my data by reducing the number of duplicate rows, but I do not want to delete ALL duplicates.

How can I get a numpy array that keeps only a certain number of duplicates of each row?

Suppose, I have

x = np.array([[1,2,3],[1,2,3],[5,5,5],[1,2,3],[1,2,3]])

and I set number of duplicates as 2.

And the output should be like

x
>>[[1,2,3],[1,2,3],[5,5,5]]

or

x
>>[[5,5,5],[1,2,3],[1,2,3]]

The row order does not matter in my task.


Solution

  • Even though appending to a list as an intermediate step is not always a good idea when you already have numpy arrays, in this case it is by far the cleanest way to do it:

    import numpy as np

    def n_uniques(arr, max_uniques):
        # Count how many times each unique row appears.
        uniq, cnts = np.unique(arr, axis=0, return_counts=True)
        arr_list = []
        for i in range(cnts.size):
            # Keep each unique row at most max_uniques times.
            num = min(cnts[i], max_uniques)
            arr_list.extend([uniq[i]] * num)
        return np.array(arr_list)
    
    x = np.array([[1,2,3],
                  [1,2,3],
                  [1,2,3],
                  [5,5,5],
                  [1,2,3],
                  [1,2,3],])
    
    reduced_arr = n_uniques(x, 2)
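
If you prefer to avoid the Python loop entirely, the same capping can be sketched with `np.minimum` and `np.repeat` (the name `cap_duplicates` is hypothetical, not from the answer above; note that `np.unique` sorts the rows, which is fine since order does not matter here):

```python
import numpy as np

def cap_duplicates(arr, max_dups):
    # Hypothetical vectorized variant: clip each row's count at
    # max_dups, then repeat each unique row that many times.
    uniq, cnts = np.unique(arr, axis=0, return_counts=True)
    return np.repeat(uniq, np.minimum(cnts, max_dups), axis=0)

x = np.array([[1, 2, 3], [1, 2, 3], [5, 5, 5], [1, 2, 3], [1, 2, 3]])
print(cap_duplicates(x, 2))
# [[1 2 3]
#  [1 2 3]
#  [5 5 5]]
```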