I'm using Whoosh to implement a tiny local search engine. The documents contain both French and English text.
As you may know, accents (à, è, é, ...) are frequently used in French, so I had to deal with them using accent folding, as suggested by the Whoosh documentation:
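from whoosh.analysis import RegexAnalyzer, LowercaseFilter, StopFilter, CharsetFilter
from whoosh.fields import Schema, ID, TEXT
from whoosh.support.charset import accent_map

# Tokenize, lowercase, drop stop words, then fold accents (é -> e)
accent_analyzer = RegexAnalyzer(r'\w+') | LowercaseFilter() \
    | StopFilter() | CharsetFilter(accent_map)
schema = Schema(path=ID(stored=True), content=TEXT(analyzer=accent_analyzer))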
Indexing documents works just fine (no errors).
But when it comes to searching, I get no results for words that contain accents.
For example, let document D have content = u'unité logique':
logique hits the document.
unité doesn't.
unite doesn't.
So I suppose the index writer is ignoring words with accents, which would explain why queries against these words return no results, whether or not the query itself contains an accent.
Just a reminder that what I want to achieve is hitting document D using both words, unité and unite.
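To make the expected behaviour concrete, here is a minimal sketch that uses only the Python standard library rather than Whoosh itself. The helpers fold and tokenize are hypothetical names of my own, not Whoosh APIs, and the folding is done with unicodedata instead of Whoosh's accent_map, so this only illustrates the idea: the same folding is applied at index time and at query time, so both spellings end up as the same term.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata


def fold(text):
    """Strip combining accents, e.g. u'unité' -> u'unite' (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Lowercase and fold accents, then split into word tokens.

    Folding happens before tokenizing so that input using decomposed
    accents (e + combining mark) is not split at the combining mark.
    """
    return re.findall(r'\w+', fold(text.lower()), re.UNICODE)


# Index side: the terms stored for document D are already folded.
indexed_terms = set(tokenize(u'unité logique'))

# Query side: the same folding is applied to every query string.
for query in (u'unité', u'unite', u'logique'):
    matched = all(term in indexed_terms for term in tokenize(query))
    print(query, matched)
```

With this kind of symmetric folding, all three queries match document D, which is the behaviour I am after in Whoosh.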
Whoosh requires all strings to be Unicode.
Does Whoosh require all strings to be Unicode?
For accented characters in Unicode, see http://unicodelookup.com/