Tags: python-2.7, dictionary, pandas, time-series, real-time

Proper Python data structure for real-time analysis?


Community,

Objective: I'm running a Raspberry Pi project (in Python) that communicates with an Arduino to read data from a load cell once per second. What data structure should I use in Python to log this data and run real-time analysis on it?

I want to be able to do things like:

  1. Slice the data to get the value of the last logged datapoint.
  2. Slice the data to get the mean of the datapoints for the last n seconds.
  3. Perform a regression on the last n data points to get g/s.
  4. Remove data points older than n seconds from the log.

Current Attempts:

Dictionaries: I have been appending entries keyed by a rounded timestamp (see below), but this makes slicing and analysis hard.

import time

log = {}

def log_data():
    # Key each reading by its arrival time, rounded to 4 decimal places
    log[round(time.time(), 4)] = read_data()
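For instance, even the simple queries from my list above require scanning every key:

# Latest datapoint: needs a pass over all keys to find the newest timestamp
latest = log[max(log)]

# Mean over the last n seconds: needs to filter every key
recent = [v for t, v in log.items() if time.time() - t < n]
mean_recent = sum(recent) / float(len(recent))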

Pandas DataFrame: this was the one I was hoping for, because it makes time-series slicing and analysis easy, but this question (How to handle incoming real time data with python pandas) seems to say it's a bad idea. I can't follow its solution (i.e. buffering readings in a dictionary and df.append()-ing them in bulk every few seconds) because I want my rate calculations (regressions) to run in real time.
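For context, the buffered approach from that answer looks roughly like this (a sketch; read_data() is the same helper as above, and the function names here are just for illustration):

import time
import pandas as pd

buffer = {}          # timestamp -> value, filled once per second
df = pd.DataFrame()  # long-term store, appended to in bulk

def log_data_buffered():
    buffer[time.time()] = read_data()

def flush_buffer():
    # One bulk concat every few seconds instead of one append per reading
    global df, buffer
    df = pd.concat([df, pd.DataFrame({'value': pd.Series(buffer)})])
    buffer = {}

The drawback, as noted, is that anything computed from df lags by up to one flush interval.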

This question (ECG Data Analysis on a real-time signal in Python) seems to hit the same problem as mine, but offers no real solutions.

Goal:

So what is the proper way to handle and analyze real-time time-series data in Python? It seems like something many people would need to do, so I imagine there has to be pre-built functionality for this?

Thanks,

Michael


Solution

  • To start, I would question two assumptions:

    1. You mention in your post that the data comes in once per second. If you can rely on that, you don't need the timestamps at all -- finding the last N data points is exactly the same as finding the data points from the last N seconds.
    2. You have a constraint that your summary data needs to be absolutely 100% real time. That may make life more complicated -- is it possible to relax that at all?

    Anyway, here's a very naive approach using a list of (timestamp, value) tuples. It satisfies all four of your needs, though performance may become a problem depending on how many past data points you need to store.

    Also, you may not have thought of this, but do you need the full record of past data, or can you just drop old points? (A deque-based option that drops automatically is sketched after the example below.)

    import time
    
    data = []  # list of (timestamp, value) tuples, oldest first
    
    # new data comes in once per second
    new_observation = (time.time(), read_data())
    data.append(new_observation)
    
    current_time = time.time()
    
    # Slice the data to get the value of the last logged datapoint.
    last_value = data[-1][1]
    
    # Slice the data to get the mean of the datapoints for the last n seconds.
    recent_values = [v for (t, v) in data if current_time - t < n]
    mean_recent = sum(recent_values) / float(len(recent_values))
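    
    # ('regression_function' is not defined in the original answer; here is a
    # minimal least-squares sketch, assuming numpy is installed.)
    import numpy as np
    
    def regression_function(points):
        # Slope of the least-squares fit of value vs. time, i.e. g/s
        t = np.array([p[0] for p in points], dtype=float)
        v = np.array([p[1] for p in points], dtype=float)
        return np.polyfit(t, v, 1)[0]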
    
    # Perform a regression on the last n data points to get g/s.
    rate = regression_function(data[-n:])
    
    # Remove data points older than n seconds from the log.
    data = [(t, v) for (t, v) in data if current_time - t < n]
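    
    If a fixed-size window is acceptable (per the dropping question above), a collections.deque with maxlen discards old observations automatically, so the pruning step disappears. A minimal sketch, assuming the once-per-second cadence holds so that the last n points are roughly the last n seconds:
    
    import time
    from collections import deque
    
    n = 60                    # keep roughly the last 60 seconds of readings
    window = deque(maxlen=n)  # oldest entries fall off automatically
    
    # new data comes in
    window.append((time.time(), read_data()))
    
    last_value = window[-1][1]                      # latest datapoint
    values = [v for (t, v) in window]
    mean_window = sum(values) / float(len(values))  # mean over the window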