python search match token whoosh

Match a query within a token in Whoosh


I want to search a text with Whoosh. Right now this works only for exact matches of whole tokens (space-delimited). I'd also like to match within a token (e.g. match "add" in the token "added"). I know about stemming and variations, but these are not what I'm looking for. Thank you for your help!

from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser

# Both fields are analyzed into tokens for searching.
schema = Schema(title=TEXT(), content=TEXT())
# The index directory must already exist before calling create_in().
indexpath = (r"C:\Users\rettenma\.jupyter\JupyterWork"+
        r"folder\Python_Repository\bin\index")
ix = create_in(indexpath, schema)
writer = ix.writer()
writer.add_document(title=u"First document",
                content=u"This is the first document we've added!")
writer.commit()

with ix.searcher() as searcher:
    # "add" is parsed into a term query, which only matches the exact token "add".
    query = QueryParser("content", ix.schema).parse("add")
    results = searcher.search(query, terms=True)
    print(results[0])

This raises an error at results[0] because the result set is empty: the index contains the token "added", but not "add".


Solution

  • http://whoosh.readthedocs.io/en/latest/api/query.html#whoosh.query.Regex

    Sounds like you need regular expressions.

    [EDIT BEGIN]

    Hope this helps:

    https://regexr.com/3s2ta

    Above is the first example, which captures the words described by the OP. However, I noticed a problem: that regex also captures any word containing "add", e.g. "Addendum", "Daddy" and so on. Having noticed this, I amended and re-forked the regex example; the link is below:

    https://regexr.com/3sg8q

    [EDIT FINISH]

    That is an example with extra testing to be sure you can catch all variations of the word "add", e.g. "add" / "adds" / "added" / "additional". Essentially, anything containing "add" + the rest of the word.