Tags: python, file, search, duplicates, compare

Improve removing duplicate lines from a file from linear to logarithmic complexity


I have this code that works and does what I want: it removes lines from a big file that are duplicates of lines in a small file. However, it runs in linear time, which is way too slow for the size of my data files, so I want to convert it to a logarithmic search:

import fileinput

with open('small.txt') as fin:
    exclude = set(line.rstrip() for line in fin)
    for line in fileinput.input('big.txt', inplace=True):
        if line.rstrip() not in exclude:
            print(line, end='')
        else:
            print('')

This code is my attempt at conversion to a log function:

def log_search(small, big):
    first = 0
    last = len(big.txt) - 1
    while first <= last:
        mid = (first + last) / 2
        if str(mid) == small.txt:
            return True
        elif small.txt < str(mid):
            last = mid - 1
        else:
            first = mid + 1
    with open('small.txt') as fin:
        exclude = set(line.rstrip() for line in fin)
        for line in fileinput.input('big.txt', inplace=True):
            if line.rstrip() not in exclude:
                print(line, end='')
            else:
                print('')
            return log_search(small, big)   
  1. The big file has millions of lines of integer data.
  2. The small file has hundreds of lines of integer data.
  3. Compare the data and remove duplicated data from the big file, but leave that line blank.

Running the first block of code works, but it takes too long to search through the big file. Maybe I am approaching the problem in the wrong way. My attempt at converting it to a logarithmic search runs without error but does nothing.


Solution

  • I don't think there is a better or faster way to do this than what you are currently doing in your first approach. (Update: There is, see below.) Storing the lines from small.txt in a set and iterating over the lines in big.txt, checking whether they are in that set, has a complexity of O(b), with b being the number of lines in big.txt.

    What you seem to be trying is to reduce this to O(s*log b), with s being the number of lines in small.txt, by using binary search to check for each line in small.txt whether it is in big.txt, and then removing or overwriting it.

    This would work well if all the lines were in a list with random access to any element, but you have just the file, which does not allow random access to any line. It does, however, allow random access to any character with file.seek, which (at least in some cases) seems to be O(1). But then you still have to find the previous line break before that position before you can actually read the line. Also, you cannot just replace lines with empty lines; you have to overwrite the number with the same number of characters, e.g. spaces.
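
    For illustration, here is a minimal sketch of that kind of in-place overwrite (the file name and offset are made up for the example). Opening the file in r+ mode lets you seek to a character position and overwrite characters without changing the file's length:

    with open("numbers.txt", "r+") as f:  # hypothetical file of newline-terminated numbers
        f.seek(42)          # jump to an arbitrary character position
        f.write("   ")      # overwrite three characters with spaces; the file length stays the same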

    So, yes, theoretically it can be done in O(s*log b), if you do the following:

    • implement binary search, searching not on the lines, but on the characters of the big file
      • for each position, backtrack to the last line break, then read the line to get the number
      • try again in the lower/upper half as usual with binary search
    • if the number is found, replace with as many spaces as there are digits in the number
    • repeat with the next number from the small file

    On my system, reading and writing a file with 10 million lines of numbers only took 3 seconds each, or about 8 seconds with fileinput.input and print. Thus, IMHO, this is not really worth the effort, but of course this may depend on how often you have to do this operation.
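
    As a rough way to reproduce that kind of baseline timing, one could run the set-based filter from the question and measure it, writing to a separate output file (big_filtered.txt is a made-up name) instead of in-place:

    import time

    start = time.perf_counter()
    with open('small.txt') as fin:
        exclude = set(line.rstrip() for line in fin)
    with open('big.txt') as fin, open('big_filtered.txt', 'w') as fout:
        for line in fin:
            # keep the line unless it appears in the small file, else write a blank line
            fout.write(line if line.rstrip() not in exclude else '\n')
    print(f"filtered big.txt in {time.perf_counter() - start:.1f} s")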


    Okay, so I got curious myself --and who needs a lunch break anyway?-- so I tried to implement this... and it works surprisingly well. This will find the given number in the file and replace it with the same number of - characters (not just a blank line; that is impossible without rewriting the entire file). Note that I did not thoroughly test the binary-search algorithm for edge cases, off-by-one errors, etc.

    import os

    def getlineat(f, pos):
        # Scan backwards from pos to the preceding newline and return the offset
        # where that line starts (0 if we reach the start of the file). The file
        # pointer ends up at that offset, so the next readline() reads the full line.
        pos = f.seek(pos)
        while pos > 0 and f.read(1) != "\n":
            pos = f.seek(pos-1)
        return pos+1 if pos > 0 else 0

    def bsearch(f, num):
        # Binary search on character positions: pick a midpoint, back up to the
        # start of that line, read the number on it, and narrow the search range.
        lower = 0
        upper = os.stat(f.name).st_size - 1
        while lower <= upper:
            mid = (lower + upper) // 2
            pos = getlineat(f, mid)
            line = f.readline()
            if not line: break # end of file
            val = int(line)
            if val == num:
                # return the line's start offset and the number of characters to overwrite
                return (pos, len(line.strip()))
            elif num < val:
                upper = mid - 1
            elif num > val:
                lower = mid + 1
        return (-1, -1)

    def overwrite(filename, to_remove):
        with open(filename, "r+") as f:
            # First collect all positions, then overwrite, so the search never
            # stumbles over a line that has already been replaced with dashes.
            positions = [bsearch(f, n) for n in to_remove]
            for n, (pos, length) in sorted(zip(to_remove, positions)):
                print(n, pos)  # print each number with the offset found for it (-1 if not present)
                if pos != -1:
                    f.seek(pos)
                    f.write("-" * length)

    import random
    to_remove = [random.randint(-500, 1500) for _ in range(10)]
    overwrite("test.txt", to_remove)
    

    This will first collect all the positions to be overwritten, and then do the actual overwriting in a second step; otherwise the binary search would have problems when it hits one of the previously "removed" lines. I tested this with a file holding all the numbers from 0 to 1,000 in sorted order and a list of random numbers (both in- and out-of-bounds) to be removed, and it worked just fine.
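
    If you want to try it yourself, a test file like the one described above can be generated with a few lines (the name test.txt matches the snippet above; the range is just an example):

    # write the numbers 0..1000 in sorted order, one per line
    with open("test.txt", "w") as f:
        for i in range(1001):
            f.write(f"{i}\n")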

    Update: Also tested it with a file with random numbers from 0 to 100,000,000 in sorted order (944 MB) and overwriting 100 random numbers, and it finished immediately, so this should indeed be O(s*log b), at least on my system (the complexity of file.seek may depend on file system, file type, etc.).

    The bsearch function could also be generalized to accept another parameter value_function instead of hardcoding val = int(line). Then it could be used for binary-searching in arbitrary files, e.g. huge dictionaries, gene databases, CSV files, etc., as long as the lines are sorted by that same value function.
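
    A rough sketch of that generalization, reusing the getlineat helper from above, might look like this (the CSV lookup at the end is just an illustrative example):

    def bsearch_by(f, key, value_function):
        # same character-position binary search as above, but the comparison
        # value is extracted from each line by an arbitrary value_function
        lower = 0
        upper = os.stat(f.name).st_size - 1
        while lower <= upper:
            mid = (lower + upper) // 2
            pos = getlineat(f, mid)
            line = f.readline()
            if not line:
                break  # end of file
            val = value_function(line)
            if val == key:
                return (pos, len(line.strip()))
            elif key < val:
                upper = mid - 1
            else:
                lower = mid + 1
        return (-1, -1)

    # e.g. look up a row in a CSV file that is sorted by its first column:
    # with open("data.csv", "r+") as f:
    #     pos, length = bsearch_by(f, "some_id", lambda line: line.split(",")[0])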