Tags: python, pandas, performance, data-science, real-time

Most performant data structure in Python to handle live streaming market data


I am about to handle live streaming stock market data, hundreds of "ticks" (dicts) per second, store them in an in-memory data structure, and analyze them.

I was reading up on pandas and got pretty excited about it, only to learn that pandas' append function is not recommended because it copies the whole data frame on each individual append. So it seems pandas is pretty much unusable for real-time handling and analysis of high-frequency streaming data, e.g. financial or sensor data.

So I'm back to native Python, which is quite OK. To save RAM, I am thinking about storing the last 100,000 data points or so on a rolling basis.

What would be the most performant Python data structure to use?

I was thinking of using a list, inserting data point number 100,001, then deleting the first element, as in del list[0]. That way I can keep a rolling history of the last 100,000 data points, but my indices will get larger and larger. A native "rolling" data structure (as in C with a 16-bit index that increments and wraps without overflow checks) does not seem possible in Python?
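For reference, the list-based approach described above might be sketched as follows (the function and variable names are illustrative, not from any library). Note that del lst[0] shifts every remaining element, so each eviction costs O(n) in the window size:

```python
WINDOW = 100_000  # number of recent ticks to keep


def add_tick(ticks, tick, window=WINDOW):
    """Append one tick dict and evict the oldest once the window is full."""
    ticks.append(tick)
    if len(ticks) > window:
        del ticks[0]  # O(n): shifts all remaining elements left
```

At hundreds of ticks per second and a 100,000-element window, that per-append shift is workable but far from free.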

What would be the best way to implement my real-time data analysis in Python?


Solution

  • The workflow you describe makes me think of a deque: basically a list that allows appending elements on one end (e.g. the right) while popping (fetching/removing) them off the other end (e.g. the left), with O(1) cost at both ends. The collections.deque documentation even has a short list of deque recipes illustrating such common use cases as implementing tail or maintaining a moving average (as a generator).