I want to clean my data by reducing the number of duplicates, but I do not want to delete all of them. How can I get a numpy array that keeps at most a certain number of duplicates?
Suppose I have
x = np.array([[1,2,3],[1,2,3],[5,5,5],[1,2,3],[1,2,3]])
and I set the maximum number of duplicates to 2.
The output should be like
x
>>[[1,2,3],[1,2,3],[5,5,5]]
or
x
>>[[5,5,5],[1,2,3],[1,2,3]]
The order does not matter in my task.
Even though appending to a list as an intermediate step is not always a good idea when you already have numpy arrays, in this case it is by far the cleanest way to do it:
import numpy as np

def n_uniques(arr, max_uniques):
    # Collapse duplicate rows and count how often each unique row occurs.
    uniq, cnts = np.unique(arr, axis=0, return_counts=True)
    arr_list = []
    for i in range(cnts.size):
        # Keep each unique row at most `max_uniques` times.
        num = cnts[i] if cnts[i] <= max_uniques else max_uniques
        arr_list.extend([uniq[i]] * num)
    return np.array(arr_list)
x = np.array([[1,2,3],
              [1,2,3],
              [1,2,3],
              [5,5,5],
              [1,2,3],
              [1,2,3]])
reduced_arr = n_uniques(x, 2)
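As a side note, the same idea can be expressed without the Python loop by capping the counts with np.minimum and expanding the unique rows with np.repeat. This is a sketch, assuming the sorted row order produced by np.unique is acceptable for your task:

```python
import numpy as np

x = np.array([[1, 2, 3],
              [1, 2, 3],
              [1, 2, 3],
              [5, 5, 5],
              [1, 2, 3],
              [1, 2, 3]])

# np.unique collapses duplicate rows; np.minimum caps each count at 2.
uniq, cnts = np.unique(x, axis=0, return_counts=True)
reduced = np.repeat(uniq, np.minimum(cnts, 2), axis=0)
print(reduced)
```

Here np.repeat with axis=0 repeats each unique row according to its capped count, so [1,2,3] (which occurs five times) is kept twice and [5,5,5] once.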