Search code examples
pythonwhoosh

Whoosh Functionality


I have over 700 small, locally stored text files (no more that 2Kb large). Each file is located in a folder such as:

  • 2012//05//18.txt
  • 2013//09/22.txt
  • 2014//11//29.txt

(each file represents a day).

The content contains daily "incident" information. I'm interested in being able to locate files containing specific words.

  1. Can files be indexed by whoosh in bulk ? That is, can python/whoosh code be written to index a set of files ?
  2. If so, can a python script be written to search for specific words within each file ?
  3. If this is not the functionality of whoosh, or if whoosh is overkill for such a task, what can better accomplish this ?

Solution

  • Preface: I've never used whoosh. I just read the quickstart.

    Seems like this would be pretty trivial to accomplish. The crux of the work you'll have to do is here:

    writer.add_document(title=u"First document", path=u"/a", content=u"This is the first document we've added!")
    

    This is where the index is actually created. Essentially, what it seems you will have to do is use something like os.walk to crawl your root directory, and fill in the data, and for each file it finds, fill in the writer.add_document params.

    for root, dirs, files in os.walk("/your/path/to/logs"):
        for file in files:
            filepath = os.path.join(root, file)
            with open(filepath, 'r') as content_file:
                file_contents = content_file.read()
                writer.add_document(title=file, content=the_contents, path=filepath)
    

    That's a starting point at least. It seems pretty straightforward to use the Searcher provided by Whoosh to run through your indexed files after that.

    with ix.searcher() as searcher:
        query = QueryParser("content", ix.schema).parse("somesearchstring")
        results = searcher.search(query)
        print(results)
    

    Like I said I've never used whoosh but this is what I would do were I you.