python pandas probability-density

Periods in time with high density of error (in a data frame)


I have a data frame with a timestamp column and an error column. The error column takes 6 types of values (NaN, D, E, F, G, H). I need to extract the periods in time with a high density of error E, and I don't know how to approach this.


My idea is to build a histogram, estimate a probability density for each error type, and then iterate over all the days and select those with the highest probability of E.

Is there a standard approach for this type of problem? Thanks for your time.


Solution

  • Here is a way:

    df.groupby(df.timestamp.dt.date).error.apply(lambda s: s.eq(2).sum() / s.size)
    

    We group by the date of each timestamp and apply a function to the error values that computes the fraction of 2's in each group. Rows with NaN still count in the denominator, because s.size includes them. After this, you can chain idxmax() to get the date with the highest error density, or nlargest(n) to get the n dates with the highest density.

    With the sample data provided, this gives:

    timestamp
    2019-11-10    0.4
    Name: error, dtype: float64
    

    (Since the sample data covers only one day, only that date appears.)
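
For anyone who wants to try the idea end to end, here is a minimal, self-contained sketch. The data is invented for illustration, and it assumes the error column holds the literal labels from the question (such as "E") rather than the numeric value 2 used in the one-liner above; swap in whatever value marks error E in your data. It uses the mean of a boolean mask, which gives the same ratio as sum divided by size.

```python
import pandas as pd

# Hypothetical sample data: column names follow the question, values are made up.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-11-10 08:00", "2019-11-10 09:30", "2019-11-10 12:15",
        "2019-11-10 18:40", "2019-11-10 23:05",
        "2019-11-11 07:20", "2019-11-11 10:45", "2019-11-11 13:00",
        "2019-11-11 21:10",
        "2019-11-12 06:00", "2019-11-12 16:30",
    ]),
    "error": ["E", None, "D", "E", "F",
              "E", "E", "E", "G",
              None, "H"],
})

# True where the row is error E; missing values compare as False.
is_e = df["error"].eq("E")

# Fraction of E rows per calendar day (missing rows stay in the denominator).
density = is_e.groupby(df["timestamp"].dt.date).mean()

worst_day = density.idxmax()    # the single date with the highest E density
top_two = density.nlargest(2)   # the two dates with the highest E density

print(density)
print(worst_day)
print(top_two)
```

If a calendar day is too coarse a period, the same pattern should work with a different grouping key, for example df["timestamp"].dt.floor("h") for hourly buckets.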