python, pandas, gaps-in-data

How to fill gaps in anomaly detection data using pandas?


Assume I have a pandas DataFrame that consists only of 0s and 1s, depending on whether an anomaly was detected or not:

input_data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.]},
                          index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))

Now I would like to fill gaps in the detection depending on their size. E.g. I only want to fill gaps that are 2 seconds or shorter. What is the correct way to do something like this?

I found these questions here: 1, 2, 3, but the solutions don't seem very straightforward. It kinda feels like there should be a simpler way to solve an issue like this.

EDIT

Sorry for the imprecise question! In my case, a "gap" is a short time period where no anomaly was detected, inside a larger time range that was detected as an anomaly.

For the example input_data the expected output would be a DataFrame with the following data

[0., 0., 1., 1., 1., 1., 0., 0., 0., 1., 1.]

So in this example the single 0. inside the region of ones was replaced by a one. Obviously all zeros could also be replaced by NaNs, if that would help. I just need to be able to specify the length of the gap that should be filled.


Solution

  • I don't know if I understood you correctly, but to fill gaps in the detection that are 2 seconds or shorter, you can do this:

    import pandas as pd
    
    input_data = pd.DataFrame(data={'my_event': [0., 0., 1., 1., 0., 1., 0., 0., 0., 1., 1.]},
                              index=pd.date_range(start='2023-01-01 00:00:00', end='2023-01-01 00:00:10', freq='s'))
    
    max_gap = 2  # longest gap to fill, in rows (equals seconds for this 1 s index)
    events = input_data['my_event']
    
    # Label each run of identical consecutive values with its own id
    run_id = events.ne(events.shift()).cumsum()
    
    # Length of the run that each row belongs to
    run_len = events.groupby(run_id).transform('size')
    
    # A gap must be enclosed by 1's, so exclude the first and the last run
    enclosed = run_id.ne(run_id.iloc[0]) & run_id.ne(run_id.iloc[-1])
    
    # Fill the enclosed runs of 0's that are short enough with 1's
    gaps = events.eq(0) & run_len.le(max_gap) & enclosed
    input_data.loc[gaps, 'my_event'] = 1
    
    print(input_data)
         my_event
    2023-01-01 00:00:00       0.0
    2023-01-01 00:00:01       0.0
    2023-01-01 00:00:02       1.0
    2023-01-01 00:00:03       1.0
    2023-01-01 00:00:04       1.0
    2023-01-01 00:00:05       1.0
    2023-01-01 00:00:06       0.0
    2023-01-01 00:00:07       0.0
    2023-01-01 00:00:08       0.0
    2023-01-01 00:00:09       1.0
    2023-01-01 00:00:10       1.0
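
If the timestamps are not evenly spaced, counting rows is no longer the same as measuring time. Below is a minimal sketch of a time-based variant. The helper name `fill_short_gaps` and its `max_gap` parameter are made up for illustration, and it assumes that a gap's duration is the time from its first zero to the next detection. Other definitions (for example, time between the two surrounding detections) would give different results for borderline gaps.

```python
import pandas as pd


def fill_short_gaps(events, max_gap):
    """Hypothetical helper: set enclosed runs of 0 lasting at most max_gap to 1.

    Assumption: a gap's duration is the time from its first zero sample
    to the first detection (1) that follows it.
    """
    if events.empty:
        return events.copy()

    max_gap = pd.Timedelta(max_gap)
    is_zero = events.eq(0)

    # Give every run of consecutive zeros / non-zeros its own id (1, 2, 3, ...)
    run_id = is_zero.astype(int).diff().ne(0).cumsum()

    # Timestamp at which each run starts (grouped positionally, not by index)
    times = events.index.to_series()
    starts = times.groupby(run_id.to_numpy()).first()

    # Time until the next run begins; NaT for the last run, which compares False
    duration = starts.shift(-1) - starts
    short = duration <= max_gap

    # Only runs enclosed by detections count as gaps: drop first and last run
    first_run, last_run = run_id.iloc[0], run_id.iloc[-1]
    fillable = short & (starts.index != first_run) & (starts.index != last_run)

    mask = is_zero & run_id.map(fillable).astype(bool)
    return events.mask(mask, 1)


# Irregular index: the single zero at 00:00:01 is followed by a detection
# only at 00:00:05, so this gap lasts 4 seconds.
irregular = pd.Series(
    [1., 0., 1., 1.],
    index=pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 00:00:01',
                          '2023-01-01 00:00:05', '2023-01-01 00:00:06']))

print(fill_short_gaps(irregular, '2s'))  # gap is too long, left unchanged
print(fill_short_gaps(irregular, '5s'))  # gap is filled
```

For the evenly spaced `input_data` from the question, `fill_short_gaps(input_data['my_event'], '2s')` should give the expected output, since the single zero lasts 1 second and the run of three zeros lasts 3 seconds.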