I have a dataset that is not really balanced. Hence, I want to drop rows within specific ranges of a column so that each range contains at most a prespecified number of observations, cutoff.
To explain further what I mean, I will show one example (the values in column b are floats):
a b
0 1 0
1 7 0
2 9 5
3 3 9
4 5 6
I want to search column b for specific ranges, e.g. ranges = np.array([0, 2, 4, 6, 8, 10]), given a cutoff = 1
(which could be any specified integer). For example, the value 0
in the first row lies in the interval [0, 2) (2 not included), and the second row also contains a value from this interval. Because the cutoff is equal to one, only one value of b is allowed to lie in this interval. Hence, the second row is dropped (it would be great if the rows were not dropped sequentially but rather non-deterministically), and we obtain the following reduced data frame:
a b
0 1 0
2 9 5
3 3 9
4 5 6
You could use pd.cut to bin the values in b according to ranges, and remove those that are duplicated:
ranges = np.array([0, 2, 4, 6, 8, 10])
df[~(pd.cut(df.b, ranges, include_lowest=True, right=False)).duplicated()]
a b
0 1 0
2 9 5
3 3 9
4 5 6
Where:
pd.cut(df.b, ranges, include_lowest=True, right=False)
0 [0, 2)
1 [0, 2)
2 [4, 6)
3 [8, 10)
4 [6, 8)
Name: b, dtype: category
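Putting the pieces together, here is a minimal, self-contained sketch of this first approach; the dataframe is simply rebuilt from the example in the question:

```python
import numpy as np
import pandas as pd

# Rebuild the example dataframe from the question
df = pd.DataFrame({"a": [1, 7, 9, 3, 5],
                   "b": [0.0, 0.0, 5.0, 9.0, 6.0]})

ranges = np.array([0, 2, 4, 6, 8, 10])

# Bin b into the half-open intervals [0, 2), [2, 4), ... and keep
# only the first row that falls in each interval
bins = pd.cut(df.b, ranges, include_lowest=True, right=False)
reduced = df[~bins.duplicated()]
print(reduced)
```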
Update
If you want to allow a specific cutoff of duplicated values, you can group by the intervals returned by pd.cut and use head to select the first n rows belonging to the same interval.
The following is the same dataframe as yours with an extra row, so the functionality is clearer:
print(df)
a b
0 1 0
1 7 0
2 7 0
3 9 5
4 3 9
5 5 6
cutoff = 2
g = pd.cut(df.b, ranges, include_lowest=True, right=False)
df.groupby(g).head(cutoff)
a b
0 1 0
1 7 0
3 9 5
4 3 9
5 5 6
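Since you mentioned you would prefer the dropped rows to be chosen non-deterministically rather than sequentially, one possible variation (a sketch, not the only way) is to replace head with a per-group random sample. Because a group may contain fewer rows than the cutoff, the lambda samples min(len(group), cutoff) rows; random_state is only set here so the example is reproducible:

```python
import numpy as np
import pandas as pd

# Same dataframe as in the update above
df = pd.DataFrame({"a": [1, 7, 7, 9, 3, 5],
                   "b": [0.0, 0.0, 0.0, 5.0, 9.0, 6.0]})
ranges = np.array([0, 2, 4, 6, 8, 10])
cutoff = 2

g = pd.cut(df.b, ranges, include_lowest=True, right=False)

# Randomly keep at most `cutoff` rows per interval instead of the first ones
reduced = (
    df.groupby(g, group_keys=False, observed=True)
      .apply(lambda grp: grp.sample(min(len(grp), cutoff), random_state=0))
)
print(reduced.sort_index())
```

Dropping random_state (or passing a different seed) makes a different random subset survive on each run.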