Let's say I create some data and then create bins of different sizes:
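from __future__ import division
import numpy as np
import pandas as pd

x = np.random.rand(1, 20)
# x has shape (1, 20), so digitize returns a single row; unpack it into a 1-D array
new, = np.digitize(x, np.arange(1, x.shape[1] + 1) / 100)
new_series = pd.Series(new)
print(new_series.value_counts())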
reveals:
20 17
16 1
4 1
2 1
dtype: int64
I basically want to transform the underlying data so that, with a minimum threshold of at least 2 per bin, new_series.value_counts() gives this:
20 17
16 3
dtype: int64
EDITED:
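import numpy as np

x = np.random.rand(1, 100)
bins = np.arange(1, x.shape[1] + 1) / 100
new = np.digitize(x, bins)
n = new.copy()[0]  # this will hold the result
threshold = 2
for i in np.unique(n):
    # a bin holding at most `threshold` items is moved up into the next bin
    if np.sum(n == i) <= threshold:
        n[n == i] += 1
# clip returns a new array, so assign it back; this avoids going beyond the last bin
n = n.clip(0, bins.size)
n = n.reshape(1, -1)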
This can move counts up multiple times, until a bin is filled sufficiently.
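One caveat: np.unique(n) is evaluated once, before the loop starts, so a count that is moved into a previously empty bin is not visited again. A minimal sketch of a variant that walks every bin index instead, so the cascade can cross empty bins, might look like the following. The function name merge_small_bins and its last_bin parameter are made up for illustration, and the input is the example from the question rather than fresh random data.

```python
import numpy as np
import pandas as pd

def merge_small_bins(labels, last_bin, threshold=2):
    """Shift under-filled bins upward, one bin index at a time (hypothetical helper)."""
    out = np.asarray(labels).copy()
    # Visit every bin index below last_bin in ascending order,
    # including indices that start out empty.
    for i in range(last_bin):
        mask = out == i
        count = mask.sum()
        # Same test as the loop above: a non-empty bin holding at most
        # `threshold` items is pushed into the next bin.
        if 0 < count <= threshold:
            out[mask] += 1
    return out

# The bin labels from the question: 17 items in bin 20, one each in bins 16, 4 and 2.
labels = np.array([20] * 17 + [16, 4, 2])
merged = merge_small_bins(labels, last_bin=20)
print(pd.Series(merged).value_counts())
```

On this input the items from bins 2 and 4 travel upward until they join the one in bin 16, which leaves 17 items in bin 20 and 3 in bin 16, as asked for in the question.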
Instead of using np.digitize, it might be simpler to use np.histogram, because it directly gives you the counts, so that we don't need to sum them ourselves.
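As a rough sketch of that idea, the counts returned by np.histogram can be merged upward directly. The 20 equal-width bins, the fixed seed and the variable names here are assumptions chosen for illustration, and this version works on per-bin counts rather than on per-sample labels.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed, only so the sketch is repeatable
x = rng.random(100)                  # 100 samples in [0, 1)
edges = np.linspace(0.0, 1.0, 21)    # assumed: 20 equal-width bins on [0, 1]

# np.histogram returns the per-bin counts and the bin edges.
counts, _ = np.histogram(x, bins=edges)

threshold = 2
merged = counts.copy()
# Carry the contents of each under-filled bin into the next one.
# The last bin is left alone because there is nowhere to move it.
for i in range(merged.size - 1):
    if 0 < merged[i] <= threshold:
        merged[i + 1] += merged[i]
        merged[i] = 0

print(counts)
print(merged)
```

After the loop, every bin except possibly the last one is either empty or holds more than threshold items, and the total count is unchanged.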