Search code examples
pythonregextrie

Fast method to search for a string in a big text file with python


That's my current situation:

  • I have a 2.5MB text file with around 250k strings, sorted alphabetically
  • Each string is unique
  • I don't need to modify the entries in the text file: once the text file is loaded, it is never edited
  • The text file is loaded at start and then I just need to search for strings through it

The last point is the problem. Actually I need to search for complete match AND for partial match of the string. The algorithm I wrote just involved the use of regular expressions combined with some attempts to make the process faster: for example, I hardcoded into my script the indexes of the dictionary that identified the singular letters of the alphabet, and then split the big text file fictionary into 26 smaller dictionary. That was totally useless, the script is still incredibly slow. Skimming some posts here, I was convinced to try mmap: but it looked useless to find all the partial matches, given a regular expression. Eventually I came to the conclusion that a trie may solve my problem, though i hardly know what is this. Should I go with tries? If so, how should I proceed to the creation of a trie in python? Is marisa-trie module good? Thanks to everybody

EDIT: By "partial match", I mean that I have the prefix of a string. I do not need matches at the end or in the middle, just at the beginning.


Solution

  • Easiest and fastest solution:

    #!/usr/bin/env python
    
    d = {}
    
    # open your file here, i'm using /etc/hosts as an example...
    f = open("/etc/hosts","r")
    for line in f:
        line = line.rstrip()
        l = len(line)+1
        for i in xrange(1,l):
            d[line[:i]] = True
    f.close()
    
    
    while True:
        w = raw_input('> ')
        if not w:
            break
    
        if w in d:
            print "match found", w
    

    Here is slightly more complex, but memory efficient one:

    #!/usr/bin/env python
    
    d = []
    
    def binary_search(a, x, lo=0, hi=None):
        if hi is None:
            hi = len(a)
        while lo < hi:
            mid = (lo+hi)//2
            midval = a[mid]
            if midval < x:
                lo = mid+1
            elif midval > x:
                hi = mid
            else:
                return mid
        return -1
    
    
    f = open("/etc/hosts","r")
    for line in f:
        line=line.rstrip()
        l = len(line)+1
        for i in xrange(1,l):
            x = hash(line[:i])
            d.append(x)
    f.close()
    
    d.sort()
    
    while True:
        w = raw_input('> ')
        if not w:
            break
    
        if binary_search(d, hash(w)) != -1:
            print "match found", w