I am trying to implement a semantic search engine with a deep NLP pipeline using Whoosh. Currently I just have a stemming analyzer, but I need to add lemmatization and POS tagging to my analyzers.
schema = Schema(id=ID(stored=True, unique=True), stem_text=TEXT(stored=True, analyzer=StemmingAnalyzer()))
I want to know how to add custom analyzers to my schema.
You can write a custom lemmatization filter and integrate it into an existing Whoosh analyzer. Quoting the Whoosh docs:
Whoosh does not include any lemmatization functions, but if you have separate lemmatizing code you could write a custom whoosh.analysis.Filter to integrate it into a Whoosh analyzer.
You can create an analyzer by combining a tokenizer with filters:
my_analyzer = RegexTokenizer() | LowercaseFilter() | StopFilter() | LemmatizationFilter()
or by adding a filter to an existing analyzer:
my_analyzer = StandardAnalyzer() | LemmatizationFilter()
You can define the filter by subclassing whoosh.analysis.Filter and overriding __call__, which receives the token stream and yields (possibly modified) tokens:

from whoosh.analysis import Filter

class LemmatizationFilter(Filter):
    def __call__(self, tokens):
        for token in tokens:
            token.text = lemmatize(token.text)  # your separate lemmatizing code
            yield token