python, signal-processing, measurement

Calculating distance between a sequence of low points in dataset


I have a dataset of 360 measurements stored in a Python dictionary that looks something like this:

data = {137: 0.0, 210: 102.700984375, 162: 0.7173203125, 39: 134.47830729166665, 78: 10.707765625, 107: 0.0, 194: 142.042953125, 316: 2.6041666666666666e-06, 329: 0.0, 240: 46.4257578125, ...}

Each measurement is stored as a key-value pair. Plotted as a scatter plot (key on x, value on y), the data looks like this:

[Image: Scatter plot of data]

As you can see, there are sections in the data where the stored value is (close to) 0. I would now like to write a script that calculates the distance between those sections. You could also call it the 'period' of the data.

What I have come up with feels very crude: I go through all items in sequence and record the first key that has a value of 0. Then I continue through the data until I find a key with a value above 0 and record the key just before it (that key minus 1). (I throw out all sequences shorter than 5 consecutive 0s.) Now I have the start and the end of my first sequence of 0s. I continue doing this until I have all of those sequences. As there are ALWAYS two of these sequences in the data (there is no way for there to be more), I now calculate the midpoint of each sequence and subtract one midpoint from the other.

This gives me the distance.
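
The procedure described above could be sketched roughly as follows. This is a minimal illustration on made-up toy data (not the real dataset); the function names zero_runs and midpoint_distance are hypothetical, and the sketch assumes integer keys and an exact comparison with 0, as in the description.

```python
def zero_runs(data, min_len=5):
    """Return (first_key, last_key) for each run of consecutive zero values."""
    runs = []
    current = []  # keys of the zero run currently being collected
    for key in sorted(data):
        if data[key] == 0:
            current.append(key)
        else:
            # A non-zero value closes the run; keep it only if it is long enough.
            if len(current) >= min_len:
                runs.append((current[0], current[-1]))
            current = []
    # A run that reaches the end of the data is closed here.
    if len(current) >= min_len:
        runs.append((current[0], current[-1]))
    return runs


def midpoint_distance(runs):
    """Distance between the midpoints of exactly two runs."""
    if len(runs) != 2:
        raise ValueError(f"expected exactly 2 runs, got {len(runs)}")
    (start1, end1), (start2, end2) = runs
    return (start2 + end2) / 2 - (start1 + end1) / 2


# Toy data: 40 points with two zero sections (values are invented).
toy = {k: 10.0 for k in range(40)}
for k in list(range(5, 13)) + list(range(25, 33)):
    toy[k] = 0.0

runs = zero_runs(toy)
distance = midpoint_distance(runs)
print(runs, distance)
```

The ValueError branch is where the failure cases show up: a single non-zero artifact inside a zero section splits it in two, and data that starts partway through a zero section produces a third run.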

But this method is very prone to error. Sometimes there are artifacts in the middle of a sequence of 0s (slightly higher values every 2-4 data points). Also, if the data starts partway through a sequence of 0s, I end up with three sequences.

There has to be a more elegant way of doing this.

I already looked into some scipy functions for determining the period of an oscillating signal, but the data seems to be too messy to get good results.

EDIT 1: Here is the full dataset (should be easily importable as a python dictionary). Python dictionary of sample data

EDIT 2: Following Droid's method, I get this nicely structured DataFrame:

(...)
79    79    9.831346  False        1
80    80   10.168792  False        1
81    81   10.354690  False        1
82    82   10.439753  False        1
83    83   10.714523  False        1
84    84   10.859503  False        1
85    85   10.809422  False        1
86    86   10.257599  False        1
87    87    0.159802   True        2
88    88    0.000000   True        2
89    89    0.000000   True        2
90    90    0.000000   True        2
91    91    0.000000   True        2
92    92    0.000000   True        2
93    93    0.000000   True        2
(...)

Solution

  • First of all, do yourself a favour and convert the data into a DataFrame :) with something like pd.DataFrame.from_records(data).T.

    Then, the problem looks to me a lot like finding the lengths of runs of equal values, the "values" being a boolean indicating whether the signal is below some arbitrary threshold (say 0.05, but you can make this exactly zero if you want). You can do that by defining a grouper that identifies all the rows belonging to the same run.

    For example, if df is your DataFrame, with x as the index and y holding the signal values, you can do something like this (after sorting by the index x):

    df['is_less'] = df['y'] < 0.05
    df['grouper'] = df['is_less'].diff().ne(0).cumsum()
    

    The second line takes the discrete difference between consecutive rows, flags the rows where that difference is non-zero (that is, where is_less flips), and then takes a cumulative sum of those flags to get an integer label for each run. This is a grouper that you can now use to count the length of your events, which corresponds to the distance between the start and the end of your "valleys", since you have an integer index (with unit spacing, the count is end - start + 1).

    So you can simply do

    df[df.is_less].groupby('grouper').count()
    

    You can play around with the threshold to get the results exactly the way you want. This method will count all segments made of contiguous values (according to your initial condition); as soon as the condition is false you'll get a new grouper.

    I tested with your data and verified that it is working.
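
The steps above can be put together into an end-to-end sketch that also computes the midpoint distance asked for in the question. This is a minimal illustration on made-up toy data, not the asker's dataset. The column names x and y, the minimum run length of 5 (taken from the question), and the valleys summary table are assumptions for the example. Flips are detected here by comparing each row with its shifted neighbour, which should label the runs the same way as the diff().ne(0) line above.

```python
import pandas as pd

# Toy stand-in for the real dictionary (values are invented for illustration).
data = {k: 10.0 for k in range(40)}
for k in list(range(5, 13)) + list(range(25, 33)):
    data[k] = 0.0

# One row per measurement, sorted by key so adjacent rows are adjacent samples.
df = pd.Series(data, name="y").sort_index().rename_axis("x").reset_index()

threshold = 0.05  # arbitrary threshold, as in the answer
df["is_less"] = df["y"] < threshold

# The label increases by one every time is_less changes from the previous row.
df["grouper"] = df["is_less"].ne(df["is_less"].shift()).cumsum()

# Summarise each below-threshold run: first key, last key, number of points.
valleys = (
    df[df["is_less"]]
    .groupby("grouper")["x"]
    .agg(start="min", end="max", length="count")
)

# Drop very short runs (assumed minimum of 5 points, as in the question).
valleys = valleys[valleys["length"] >= 5]
valleys["midpoint"] = (valleys["start"] + valleys["end"]) / 2

# With exactly two valleys, the distance is the difference of their midpoints.
distance = valleys["midpoint"].iloc[1] - valleys["midpoint"].iloc[0]
print(valleys)
print(distance)
```

This sketch does not merge a valley that is split across the start and end of the data, and an artifact above the threshold will still split a valley in two, so those cases would need extra handling (for example a higher threshold).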