python timestamp containers finance closest

nearest timestamp price - ready data structure in Python?

Price interpolation. Python data structure for efficient near miss searches?

I have price data

[1427837961000.0, 243.586], [1427962162000.0, 245.674], [1428072262000.0, 254.372], [1428181762000.0, 253.366], ...

with the first dimension a timestamp, and the second a price.

Now I want to know the price which is nearest to a given timestamp e.g. to 1427854534654.

What is the best Python container, data structure, or algorithm to solve this many hundred or thousand times per second? It is a standard problem, and has to be solved in many applications, so there should be a ready and optimized solution.

I have Googled, and found only bits and pieces that I could build upon - but I guess this question is so common, that the whole data structure should be ready as a module?

EDIT: Solved.

I used JuniorCompressor's solution with my bugfix for future dates.

The performance is fantastic:

3000000 calls took 12.82 seconds, so 0.00000427 per call (length of data = 1143).

Thanks a lot! StackOverFlow is great, and you helpers are the best!

Solution

It is very common for this problem to have your data sorted by the timestamp value and then binary search for every possible query. Binary search can be performed using the bisect module:

data = [
    [1427837961000.0, 243.586], 
    [1427962162000.0, 245.674], 
    [1428072262000.0, 254.372], 
    [1428181762000.0, 253.366]
]


data.sort(key=lambda l: l[0]) # Sort by timestamp
timestamps = [l[0] for l in data] # Extract timestamps

import bisect

def find_closest(t):
    idx = bisect.bisect_left(timestamps, t) # Find insertion point

    # Check which timestamp with idx or idx - 1 is closer
    if idx > 0 and abs(timestamps[idx] - t) > abs(timestamps[idx - 1] - t):
         idx -= 1

    return data[idx][1] # Return price

We can test like this:

>>> find_closest(1427854534654)
243.586

If we have n queries and m timestamp values, then each query needs O(log m) time. So the total time needed is O(n * log m).

In the above algorithm we search between two indexes. If we use only the midpoints of the timestamp intervals, we can simplify even more and create a faster search:

midpoints = [(a + b) / 2 for a, b in zip(timestamps, timestamps[1:])]
def find_closest_through_midpoints(t):
    return data[bisect.bisect_left(midpoints, t)][1]