Creating custom analyzers using Whoosh


I am trying to implement a semantic search engine with a deep NLP pipeline using Whoosh. Currently I only have a stemming analyzer, but I need to add lemmatization and POS tagging to my analyzers.

 schema = Schema(id=ID(stored=True, unique=True), stem_text=TEXT(stored=True, analyzer=StemmingAnalyzer()))

I want to know how to add custom analyzers to my schema.


Solution

  • You can write a custom lemmatization filter and integrate it into an existing Whoosh analyzer. Quoting from the Whoosh docs:

    Whoosh does not include any lemmatization functions, but if you have separate lemmatizing code you could write a custom whoosh.analysis.Filter to integrate it into a Whoosh analyzer.

    You can create an analyzer by combining a tokenizer with filters:

    my_analyzer = RegexTokenizer() | LowercaseFilter() | StopFilter() | LemmatizationFilter()
    

    or by adding a filter to an existing analyzer:

    my_analyzer = StandardAnalyzer() | LemmatizationFilter()
    

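    You can define a filter by subclassing whoosh.analysis.Filter and overriding __call__, which receives the token stream and yields tokens:

    from whoosh.analysis import Filter

    class LemmatizationFilter(Filter):
        def __call__(self, tokens):
            for token in tokens:
                yield token  # set token.text to its lemma before yielding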
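
To make the token-rewriting step concrete, here is a minimal sketch of a filter that actually changes each token's text. The `TOY_LEMMAS` table and the `lemmas` parameter are hypothetical stand-ins for whatever lemmatizer you already have, and the demo uses plain `SimpleNamespace` objects in place of real Whoosh tokens so that it runs even without Whoosh installed.

```python
from types import SimpleNamespace

try:
    from whoosh.analysis import Filter
except ImportError:
    # Fallback so the sketch still runs where Whoosh is not installed.
    Filter = object

# Hypothetical toy lookup table; a real pipeline would call an NLP library here.
TOY_LEMMAS = {"mice": "mouse", "running": "run", "better": "good"}


class LemmatizationFilter(Filter):
    """Replace each token's text with its lemma, if one is known."""

    def __init__(self, lemmas=None):
        self.lemmas = TOY_LEMMAS if lemmas is None else lemmas

    def __call__(self, tokens):
        for token in tokens:
            # Unknown words pass through unchanged.
            token.text = self.lemmas.get(token.text, token.text)
            yield token


# Stand-in token stream; Whoosh tokens likewise expose a `text` attribute.
stream = (SimpleNamespace(text=w) for w in ["mice", "running", "fast"])
lemmatized = [t.text for t in LemmatizationFilter()(stream)]
print(lemmatized)  # ['mouse', 'run', 'fast']
```

With Whoosh installed, this class should chain like any other filter, for example `StandardAnalyzer() | LemmatizationFilter()`, and the resulting analyzer can be passed to a field as `TEXT(stored=True, analyzer=my_analyzer)`, the same way `StemmingAnalyzer()` is passed in the question's schema.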