Search code examples
pythonsearchinterpolationlarge-files

Using interpolation search to find beginning of list in large text file - Python


I need to find the last timestamp in a very large log file with an unknown number of lines before I reach a line with a timestamp. I read the file backwards one line at a time, which is usually very quick except for one case. Sometimes, I will run into a very large block (thousands of lines) with a known repeating pattern (one entry shown below) and no timestamps:

  goal_tolerance[0]: 
    name: joint_b
    position: -1
    velocity: -1
    acceleration: -1

Since this is the only case where I have this kind of problem, I can just throw a piece of code into the loop that checks for it before searching the log line by line.

The number after goal_tolerance is a counter, going up 1 each time the pattern repeats, so what I would like to do is use that number to calculate the beginning of the pattern. What I have now looks something like this:

if '  goal_tolerance' in line:
    gtolnum = line[17:-3]
    print gtolnum
    startFrom = currentPosition - ((long(gtolnum) + 1) * 95)
    break

However, this does not take into account the number of characters in the counter, so I end up running through the search loop several more times than necessary. Is there a fast way to include those characters in the calculation?

EDIT: I do not read the entire file to get to that point, since it is large and I have several hundred timestamps to search for in several hundred log files. My search function seeks to a position in the text file, then finds the beginning of a line near that point and reads it. The calculation is determining a file position I can use with .seek() based on the number of bytes or characters in the pattern.


Solution

  • I did some maths in the meantime and came up with a mathematical solution:

    ...
    n = long(gtolnum)
    q = len(gtolnum)        # I'll refer to this as the number's "level"
    x = n + 1 - 10**(q - 1) # Number of entries in the current level
    c = x * (q - 1)         # Additional digits in the current level
    i = 2
    p = 0
    while i < q:
        p += 9 * (q - i) * (10**(q - i))  # Additional digits in i levels previous
        i += 1
    startFrom = currentPosition - ((n + 1) * 95 + p + c)
    ...
    

    Seems like there should be a much simpler solution, but I'm not seeing it. Perhaps a log function could help?