Search code examples
algorithmtime-seriessignalssignal-processingdata-analysis

Extracting patterns from time-series


I have a time-series, which essentially amounts to some instrument recording the current time whenever it makes a "detection". The sampling rate is therefore not in constant time, however we can treat it as such by "re-sampling", relying on the fact that the detections are made reliably and we can simply insert 0's to "fill in" the gaps. This will be important later...

The instrument should detect the "signals" sent by another, nearby instrument. This second instrument emits a signal at some unknown period, T (e.g. 1 signal per second), with a "jitter" likely on the order of a few tenths of a percent of the period.

My goal is to determine this period (or frequency, if you like) using only the timestamps recorded by the "detecting" instrument. Unfortunately, however, the detector is flooded with noise, and a significant amount (I estimate 97-98%) of "detections" (and therefore "points" in the time-series) are due to noise. Therefore, extracting the period will require more careful analysis.


My first thought was to simply feed the time series into an FFT algorithm (I'm using FFTW/DHT), however this wasn't particularly enlightening. I've also tried my own (admittedly rather crude) algorithm, which simply computed a cross-correlation of the series with "clean" series of increasing period. I didn't get very far with this, either, and there are quite a handful of details to consider (phase, etc).

It occurs to me that something like this must've been done before, and surely there's a "nice" way to accomplish it.


Solution

  • Here's my approach. Given a period, we can score it using a dynamic program to find the subsequence of detection times that includes the first and last detection and maximizes the sum of gap log-likelihoods, where the gap log-likelihood is defined as minus the square of the difference of the gap and the period (Gaussian jitter model).

    If we have approximately the right period, then we can get a very good gap sequence (some weirdness at the beginning and end and wherever there is a missed detection, but this is OK).

    If we have the wrong period, then we end up with basically exponential jitter, which has low log-likelihood.

    The C++ below generates fake detection times with a planted period and then searches over periods. Scores are normalized by a (bad) estimate of the score for Poisson noise, so wrong periods score about 0.4. See the plot below.

    #include <algorithm>
    #include <cmath>
    #include <iostream>
    #include <limits>
    #include <random>
    #include <vector>
    
    namespace {
    
    static constexpr double kFalseNegativeRate = 0.01;
    static constexpr double kCoefficientOfVariation = 0.003;
    static constexpr int kSignals = 6000;
    static constexpr int kNoiseToSignalRatio = 50;
    
    template <class URNG>
    std::vector<double> FakeTimes(URNG &g, const double period) {
      std::vector<double> times;
      std::bernoulli_distribution false_negative(kFalseNegativeRate);
      std::uniform_real_distribution<double> phase(0, period);
      double signal = phase(g);
      std::normal_distribution<double> interval(period,
                                                kCoefficientOfVariation * period);
      std::uniform_real_distribution<double> noise(0, kSignals * period);
      for (int i = 0; i < kSignals; i++) {
        if (!false_negative(g)) {
          times.push_back(signal);
        }
        signal += interval(g);
        for (double j = 0; j < kNoiseToSignalRatio; j++) {
          times.push_back(noise(g));
        }
      }
      std::sort(times.begin(), times.end());
      return times;
    }
    
    constexpr double Square(const double x) { return x * x; }
    
    struct Subsequence {
      double score;
      int previous;
    };
    
    struct Result {
      double score = std::numeric_limits<double>::quiet_NaN();
      double median_interval = std::numeric_limits<double>::quiet_NaN();
    };
    
    Result Score(const std::vector<double> &times, const double period) {
      if (times.empty() || !std::is_sorted(times.begin(), times.end())) {
        return {};
      }
      std::vector<Subsequence> bests;
      bests.reserve(times.size());
      bests.push_back({0, -1});
      for (int i = 1; i < times.size(); i++) {
        Subsequence best = {std::numeric_limits<double>::infinity(), -1};
        for (int j = i - 1; j > -1; j--) {
          const double difference = times[i] - times[j];
          const double penalty = Square(difference - period);
          if (difference >= period && penalty >= best.score) {
            break;
          }
          const Subsequence candidate = {bests[j].score + penalty, j};
          if (candidate.score < best.score) {
            best = candidate;
          }
        }
        bests.push_back(best);
      }
      std::vector<double> intervals;
      int i = bests.size() - 1;
      while (true) {
        int previous_i = bests[i].previous;
        if (previous_i < 0) {
          break;
        }
        intervals.push_back(times[i] - times[previous_i]);
        i = previous_i;
      }
      if (intervals.empty()) {
        return {};
      }
      const double duration = times.back() - times.front();
      // The rate is doubled because we can look for a time in either direction.
      const double rate = 2 * (times.size() - 1) / duration;
      // Mean of the square of an exponential distribution with the given rate.
      const double mean_square = 2 / Square(rate);
      const double score = bests.back().score / (intervals.size() * mean_square);
      const auto median_interval = intervals.begin() + intervals.size() / 2;
      std::nth_element(intervals.begin(), median_interval, intervals.end());
      return {score, *median_interval};
    }
    
    } // namespace
    
    int main() {
      std::default_random_engine g;
      const auto times = FakeTimes(g, std::sqrt(2));
      for (int i = 0; i < 2000; i++) {
        const double period = std::pow(1.001, i) / 3;
        const Result result = Score(times, period);
        std::cout << period << ' ' << result.score << ' ' << result.median_interval
                  << std::endl;
      }
    }
    

    enter image description here