I have a DataFrame with a timestamp column and an error column. The error column has 6 types of values (NaN, D, E, F, G, H), and I need to extract periods in time with a high density of error E, but I don't know how to approach this.
My approach would be to create a histogram, calculate a probability density for each error type, then iterate over all the days and select those with the highest probability of E.
Are there any standard approaches for this type of problem? Thanks for your time.
Here is a way:
df.groupby(df.timestamp.dt.date).error.apply(lambda s: s.eq('E').sum() / s.size)
We group by the date of the timestamps and apply a function to the error column
of each group that computes the ratio of 'E' values in it (note that s.size
counts NaN rows too, so missing values dilute the ratio). After this, you can chain idxmax
to get the date with the highest error density, or nlargest(n)
to get the n highest ones, as in the sketch below.
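For example, binding the result to a name first (density here is just an illustrative label, not part of the original snippet):

density = df.groupby(df.timestamp.dt.date).error.apply(lambda s: s.eq('E').sum() / s.size)
density.idxmax()     # the single date with the highest density of error E
density.nlargest(3)  # the 3 dates with the highest density, sorted descending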
With the sample data provided, this gives:
timestamp
2019-11-10 0.4
Name: error, dtype: float64
(since the sample data spans only one day, only that date appears).
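If you want to reproduce this end to end, here is a minimal self-contained sketch with made-up sample data (five rows on a single date, two of them E, so the ratio is 2/5 = 0.4):

import numpy as np
import pandas as pd

# Hypothetical sample data: five timestamps on one day, two 'E' errors
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-11-10 08:00', '2019-11-10 09:30',
                                 '2019-11-10 11:15', '2019-11-10 14:00',
                                 '2019-11-10 16:45']),
    'error': [np.nan, 'D', 'E', 'E', 'F'],
})

# Ratio of 'E' errors per day; NaN rows still count toward s.size
print(df.groupby(df.timestamp.dt.date).error.apply(lambda s: s.eq('E').sum() / s.size))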