python full-text-search whoosh

Whoosh not matching words with accents


I'm using Whoosh to implement a tiny local search engine. The documents contain both French and English text.

As you may know, accents (à è é ...) are common in French, so I handle them with accent folding, as suggested by the Whoosh documentation:

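from whoosh.analysis import CharsetFilter, LowercaseFilter, RegexAnalyzer, StopFilter
from whoosh.fields import ID, TEXT, Schema
from whoosh.support.charset import accent_map

accent_analyzer = RegexAnalyzer(r'\w+') | LowercaseFilter() \
                  | StopFilter() | CharsetFilter(accent_map)

schema = Schema(path=ID(stored=True), content=TEXT(analyzer=accent_analyzer))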
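
To make the goal of accent folding concrete outside of Whoosh, here is a minimal sketch that uses only the standard library. `fold_accents` is a hypothetical helper, not part of Whoosh. It only approximates what `CharsetFilter(accent_map)` is meant to do, by decomposing each character and dropping the combining marks. Whoosh's own map may treat some characters differently.

```python
import unicodedata


def fold_accents(text):
    """Return text with combining accent marks removed (hypothetical helper)."""
    # NFKD splits 'é' into 'e' followed by a combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both spellings reduce to the same term, so an index built on folded
# tokens can match either form of the query.
print(fold_accents("unité logique"))                   # unite logique
print(fold_accents("unité") == fold_accents("unite"))  # True
```

If the analyzer folds tokens this way at both index time and query time, unité and unite should end up as the same indexed term.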

Indexing documents works fine (no errors).

But when I search, I get no results for words that contain accents.

For example:

Let D be a document with content = u'unité logique':

  • Searching using logique hits the document.
  • Searching using unité doesn't.
  • Searching using unite doesn't.

So I suppose the index writer is ignoring words with accents, which would explain why queries for these words return no results whether or not the query itself contains an accent.

To be clear, what I want is for both unité and unite to hit document D.


Solution

  • Whoosh requires all strings to be Unicode.

    Related question: "Does whoosh require all strings to be unicode?"

    For the Unicode code points of accented characters, see http://unicodelookup.com/

    (or https://ss64.com/unicode-accents.html)
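
A minimal sketch of what "all strings in Unicode" can mean in practice, assuming the document text arrives as UTF-8 encoded bytes (an assumption, since the question does not say how the documents are read). `to_text` is a hypothetical helper, not a Whoosh function. The decoded value is what you would pass on as field content and as the query string.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding bytes if needed (hypothetical helper)."""
    if isinstance(value, bytes):
        # Raw bytes (e.g. read from a file opened in binary mode) must be decoded.
        return value.decode(encoding)
    return value


# Simulate content read as raw bytes; UTF-8 is assumed here.
raw = "unité logique".encode("utf-8")
content = to_text(raw)

print(type(content).__name__)  # str
print(content)                 # unité logique
```

The same conversion would apply to the query string, so that the indexed text and the query go through the analyzer in the same form.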