Search code examples
pythontimestampcontainersfinanceclosest

nearest timestamp price - ready data structure in Python?


Price interpolation. Python data structure for efficient near miss searches?

I have price data

[1427837961000.0, 243.586], [1427962162000.0, 245.674], [1428072262000.0, 254.372], [1428181762000.0, 253.366], ...

with the first dimension a timestamp, and the second a price.

Now I want to know the price which is nearest to a given timestamp e.g. to 1427854534654.

What is the best Python container, data structure, or algorithm to solve this many hundred or thousand times per second? It is a standard problem, and has to be solved in many applications, so there should be a ready and optimized solution.

I have Googled, and found only bits and pieces that I could build upon - but I guess this question is so common, that the whole data structure should be ready as a module?


EDIT: Solved.

I used JuniorCompressor's solution with my bugfix for future dates.

The performance is fantastic:

3000000 calls took 12.82 seconds, so 0.00000427 per call (length of data = 1143).

Thanks a lot! StackOverFlow is great, and you helpers are the best!


Solution

  • It is very common for this problem to have your data sorted by the timestamp value and then binary search for every possible query. Binary search can be performed using the bisect module:

    data = [
        [1427837961000.0, 243.586], 
        [1427962162000.0, 245.674], 
        [1428072262000.0, 254.372], 
        [1428181762000.0, 253.366]
    ]
    
    
    data.sort(key=lambda l: l[0]) # Sort by timestamp
    timestamps = [l[0] for l in data] # Extract timestamps
    
    import bisect
    
    def find_closest(t):
        idx = bisect.bisect_left(timestamps, t) # Find insertion point
    
        # Check which timestamp with idx or idx - 1 is closer
        if idx > 0 and abs(timestamps[idx] - t) > abs(timestamps[idx - 1] - t):
             idx -= 1
    
        return data[idx][1] # Return price
    

    We can test like this:

    >>> find_closest(1427854534654)
    243.586
    

    If we have n queries and m timestamp values, then each query needs O(log m) time. So the total time needed is O(n * log m).

    In the above algorithm we search between two indexes. If we use only the midpoints of the timestamp intervals, we can simplify even more and create a faster search:

    midpoints = [(a + b) / 2 for a, b in zip(timestamps, timestamps[1:])]
    def find_closest_through_midpoints(t):
        return data[bisect.bisect_left(midpoints, t)][1]