python pandas data-cleaning

Data Cleaning with Time Series


I have a data cleaning question. I ran two experiments back to back without turning off the equipment, and I want all the data from Experiment 1 to go into one CSV and all the data from Experiment 2 into another. The most obvious demarcation between experiments is a longer gap in time, but unfortunately the gap was never a fixed length. Another possibility is to split the data at the peaks in the tension data and then recombine the pieces ... somehow. Does anyone have thoughts on an algorithm that might achieve this? Below is some mock data. The timestamps are in a pandas DatetimeIndex.

# Experiment 1, Trial 1
DateTimeIndex  Tension
7/25/2020 9:32 0
7/25/2020 9:33 0
7/25/2020 9:34 24
7/25/2020 9:35 100
7/25/2020 9:36 50
7/25/2020 9:37 20
7/25/2020 9:38 0
# Noise
7/25/2020 9:39 -25
7/25/2020 9:40 4
7/25/2020 9:41 11
# Experiment 1, Trial 2
7/25/2020 9:43 2
7/25/2020 9:44 3
7/25/2020 9:45 25
7/25/2020 9:46 150
7/25/2020 9:47 60
7/25/2020 9:48 70
7/25/2020 9:49 2
# Lots and lots of noise between experiments
# Experiment 2, Trial 1
7/25/2020 10:06 0
7/25/2020 10:07 0
7/25/2020 10:08 24
7/25/2020 10:09 100
7/25/2020 10:10 50
7/25/2020 10:11 20
7/25/2020 10:12 -3

Solution

  • You can find the peaks of the signal using SciPy's find_peaks function (scipy.signal.find_peaks). It uses a reasonable heuristic for detecting peaks, and you can tune its parameters (for example height, distance, or prominence) to suit your data. After finding the peaks, you can iterate over pairs of adjacent peak indices to access the individual segments. See the example below:

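    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.signal import find_peaks

    # Synthetic signal: four periods of a sine wave (linspace defaults to 50 samples)
    data = np.sin(np.linspace(0, 8*np.pi))
    # find_peaks returns (peak_indices, properties); keep only the indices
    indices = find_peaks(data)[0]
    # Add the first and last sample so the segments cover the whole signal
    indices = np.unique(np.concatenate([[0, data.size-1], indices]))
    # Plot each segment between adjacent boundaries as its own line
    for i in range(len(indices) - 1):
        i0, i1 = indices[i: i+2]
        plt.plot(np.arange(i0, i1 + 1), data[i0:i1 + 1])
    plt.show()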
    

    Output: a plot of the signal in which each segment between adjacent peaks is drawn as a separate line.
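To connect this back to the question, here is a minimal sketch of one way to combine peak finding with the time-gap idea. It is not part of the original answer. It rebuilds the mock data from the question, treats each large tension peak as one trial, and starts a new experiment whenever two consecutive peaks are further apart than a threshold. The function name `split_experiments`, the height threshold of 80, and the 15-minute gap are all assumptions chosen to fit the mock data, so they would need tuning on real measurements.

```python
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# Mock data from the question, rebuilt as a DataFrame with a DatetimeIndex.
block1 = pd.date_range("2020-07-25 09:32", periods=10, freq="min")  # 9:32-9:41
block2 = pd.date_range("2020-07-25 09:43", periods=7, freq="min")   # 9:43-9:49
block3 = pd.date_range("2020-07-25 10:06", periods=7, freq="min")   # 10:06-10:12
tension = [0, 0, 24, 100, 50, 20, 0, -25, 4, 11,
           2, 3, 25, 150, 60, 70, 2,
           0, 0, 24, 100, 50, 20, -3]
df = pd.DataFrame({"Tension": tension}, index=block1.append([block2, block3]))


def split_experiments(df, min_height, max_gap):
    """Split df into one DataFrame per experiment (hypothetical helper).

    Assumes df is sorted by time. min_height and max_gap are tuning
    parameters, not values taken from the original post.
    """
    # One peak per trial: ignore small bumps below min_height.
    peak_idx, _ = find_peaks(df["Tension"].to_numpy(), height=min_height)
    peak_times = pd.Series(df.index[peak_idx])

    # A peak starts a new experiment if it is far from the previous peak.
    is_new = peak_times.diff() > max_gap
    prev_peak = peak_times.shift(1)[is_new]
    first_peak = peak_times[is_new]

    # Cut halfway between the last peak of one experiment and the
    # first peak of the next.
    cuts = prev_peak + (first_peak - prev_peak) / 2

    # Label every row with the number of cut points that precede it.
    labels = np.searchsorted(cuts.to_numpy(), df.index.to_numpy(), side="right")
    return [part for _, part in df.groupby(labels)]


parts = split_experiments(df, min_height=80, max_gap=pd.Timedelta(minutes=15))
for number, part in enumerate(parts, start=1):
    print(f"Experiment {number}: {part.index[0]} to {part.index[-1]}")
```

Each element of `parts` can then be written out with something like `part.to_csv(f"experiment_{number}.csv")`. Cutting at the midpoint between peaks is only one possible choice; with a lot of noise between experiments, you may prefer to trim each part to a window around its peaks instead.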