Search code examples
pythontext-searchsearch

Fastest way to searching word in non-indexed textfile - Python


Considering a text file of 1.5 million lines and around 50-100 words per line.

To find the lines that contains the word, using os.popen('grep -w word infile') seems to be faster than

for line in infile: 
  if word in line:
    print line

How else could one search a word in a textfile in python? What is the fastest way to search through that large unindex textfile?


Solution

  • There are several fast search algorithms (see wikipedia). They require you to compile the word into some structure. Grep is using Aho-Corasick algorithm.

    I haven't seen the source code for python's in but either

    1. word is compiled for each line which takes time (I doubt in compiles anything, obviously it can compile it, cache the results, etc), or
    2. the search is inefficient. Consider searching for "word" in "worword" you first check "worw" and fail, then you check "o", then "r" and fail, etc. But there is no reason to recheck "o" or "r" if you are smart. So for example, Knuth–Morris–Pratt algorithm creates a table based on the searched word that tells it how many characters can be skipped when fail occurs.