That's my current situation:
The last point is the problem. Actually I need to search for complete match AND for partial match of the string. The algorithm I wrote just involved the use of regular expressions combined with some attempts to make the process faster: for example, I hardcoded into my script the indexes of the dictionary that identified the singular letters of the alphabet, and then split the big text file fictionary into 26 smaller dictionary. That was totally useless, the script is still incredibly slow. Skimming some posts here, I was convinced to try mmap: but it looked useless to find all the partial matches, given a regular expression. Eventually I came to the conclusion that a trie may solve my problem, though i hardly know what is this. Should I go with tries? If so, how should I proceed to the creation of a trie in python? Is marisa-trie module good? Thanks to everybody
EDIT: By "partial match", I mean that I have the prefix of a string. I do not need matches at the end or in the middle, just at the beginning.
Easiest and fastest solution:
#!/usr/bin/env python
d = {}
# open your file here, i'm using /etc/hosts as an example...
f = open("/etc/hosts","r")
for line in f:
line = line.rstrip()
l = len(line)+1
for i in xrange(1,l):
d[line[:i]] = True
f.close()
while True:
w = raw_input('> ')
if not w:
break
if w in d:
print "match found", w
Here is slightly more complex, but memory efficient one:
#!/usr/bin/env python
d = []
def binary_search(a, x, lo=0, hi=None):
if hi is None:
hi = len(a)
while lo < hi:
mid = (lo+hi)//2
midval = a[mid]
if midval < x:
lo = mid+1
elif midval > x:
hi = mid
else:
return mid
return -1
f = open("/etc/hosts","r")
for line in f:
line=line.rstrip()
l = len(line)+1
for i in xrange(1,l):
x = hash(line[:i])
d.append(x)
f.close()
d.sort()
while True:
w = raw_input('> ')
if not w:
break
if binary_search(d, hash(w)) != -1:
print "match found", w