Tags: c#, lucene.net, accent-insensitive

Lucene.NET 4.8 unable to search for words with accents


Based on some help here on Stack Overflow I managed to create a custom analyzer, but I still can't get searches to work when a word has an accent.

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Miscellaneous;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;
using System.IO;

public class CustomAnalyzer : Analyzer
{
    private readonly LuceneVersion matchVersion;

    public CustomAnalyzer(LuceneVersion p_matchVersion) : base()
    {
        matchVersion = p_matchVersion;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        Tokenizer tokenizer = new KeywordTokenizer(reader);
        TokenStream result = new StopFilter(matchVersion, tokenizer, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        result = new LowerCaseFilter(matchVersion, result);
        result = new StandardFilter(matchVersion, result);
        result = new ASCIIFoldingFilter(result);
        return new TokenStreamComponents(tokenizer, result);
    }
}

The idea is that searching for "perez" should also find "Pérez". Using that analyzer I recreated the index and searched again, but there are still no results for words with accents.

For LuceneVersion I'm passing LuceneVersion.LUCENE_48.
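
For reference, this is roughly how I wire the analyzer into indexing and searching (a trimmed-down sketch; the field name and document content are just for illustration):

using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

var analyzer = new CustomAnalyzer(LuceneVersion.LUCENE_48);
var dir = new RAMDirectory();

// Index a document containing an accented word, using the custom analyzer.
var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer);
using (var writer = new IndexWriter(dir, config))
{
    var doc = new Document();
    doc.Add(new TextField("name", "Carlos Pérez", Field.Store.YES));
    writer.AddDocument(doc);
}

// Search with the same analyzer; I expect the unaccented query to match.
using (var reader = DirectoryReader.Open(dir))
{
    var searcher = new IndexSearcher(reader);
    var parser = new QueryParser(LuceneVersion.LUCENE_48, "name", analyzer);
    TopDocs hits = searcher.Search(parser.Parse("perez"), 10); // expect 1 hit, get 0
}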

Any help would be greatly appreciated. Thanks!


Solution

  • Answered originally on GitHub, but copying here for context.

    Nope, it isn't valid to use multiple tokenizers in the same Analyzer, as there are strict consuming rules to adhere to.

    It would be great to build code analysis components to ensure developers adhere to these tokenizer rules while typing, such as the rule that ensures TokenStream classes are sealed or use a sealed IncrementToken() method (contributions welcome). It is not likely we will add any additional code analyzers prior to the 4.8.0 release unless they are contributed by the community, though, as these are not blocking the release. For the time being, the best way to ensure custom analyzers adhere to the rules is to test them with Lucene.Net.TestFramework, which also hits them with multiple threads, random cultures, and random strings of text to ensure they are robust.
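
    As a minimal illustration of that sealing rule (a hypothetical filter added for this write-up, not part of the demo linked below), a custom TokenFilter can simply be declared sealed; alternatively the class can stay open and seal just its IncrementToken() override:

    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.TokenAttributes;

    // Sealing the class satisfies the rule; a non-sealed class would instead
    // declare "public sealed override bool IncrementToken()".
    public sealed class PassThroughFilter : TokenFilter
    {
        private readonly ICharTermAttribute termAtt;

        public PassThroughFilter(TokenStream input) : base(input)
        {
            termAtt = AddAttribute<ICharTermAttribute>();
        }

        public override bool IncrementToken()
        {
            // Consume exactly one token from the upstream stream per call.
            if (!m_input.IncrementToken())
                return false;

            // A real filter would inspect or transform termAtt here.
            return true;
        }
    }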

    I built a demo showing how to set up testing on custom analyzers here: https://github.com/NightOwl888/LuceneNetCustomAnalyzerDemo (as well as showing how the above example fails the tests). The functioning analyzer just uses a WhitespaceTokenizer and ICUFoldingFilter. Of course, you may wish to add additional test conditions to ensure your custom analyzer meets your expectations, and then you can experiment with different tokenizers and with adding or rearranging filters until you find a solution that meets all of your requirements (as well as plays by Lucene's rules). And you can later add further conditions as you discover issues.

    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Core;
    using Lucene.Net.Analysis.Icu;
    using Lucene.Net.Util;
    using System.IO;
    
    namespace LuceneExtensions
    {
        public sealed class CustomAnalyzer : Analyzer
        {
            private readonly LuceneVersion matchVersion;
    
            public CustomAnalyzer(LuceneVersion matchVersion)
            {
                this.matchVersion = matchVersion;
            }
    
            protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
            {
                // Tokenize...
                Tokenizer tokenizer = new WhitespaceTokenizer(matchVersion, reader);
                TokenStream result = tokenizer;
    
                // Filter...
                result = new ICUFoldingFilter(result);
    
                // Return result...
                return new TokenStreamComponents(tokenizer, result);
            }
        }
    }
    
    using Lucene.Net.Analysis;
    using NUnit.Framework;
    
    namespace LuceneExtensions.Tests
    {
        public class TestCustomAnalyzer : BaseTokenStreamTestCase
        {
            [Test]
            public virtual void TestRemoveAccents()
            {
                Analyzer a = new CustomAnalyzer(TEST_VERSION_CURRENT);
    
                // removal of latin accents (composed)
                AssertAnalyzesTo(a, "résumé", new string[] { "resume" });
    
                // removal of latin accents (decomposed)
                AssertAnalyzesTo(a, "re\u0301sume\u0301", new string[] { "resume" });
    
                // removal of latin accents (multi-word)
                AssertAnalyzesTo(a, "Carlos Pírez", new string[] { "carlos", "pirez" });
            }
        }
    }
    

    For other ideas about what test conditions you might use, I suggest having a look at Lucene.Net's extensive analyzer tests, including the ICU tests. You may also refer to the tests to see whether you can find a use case similar to yours for building queries (although do note that the tests don't demonstrate .NET best practices for disposing objects).
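
    In particular, outside the test framework you would normally wrap the disposable Lucene objects (Analyzer, Directory, DirectoryReader) in using blocks when building and running queries. A minimal sketch, assuming the CustomAnalyzer above and an illustrative index path and field name:

    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers.Classic;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Lucene.Net.Util;
    using LuceneExtensions;

    // Analyzer, Directory, and DirectoryReader all implement IDisposable.
    using (var analyzer = new CustomAnalyzer(LuceneVersion.LUCENE_48))
    using (var dir = FSDirectory.Open("path/to/index"))   // illustrative path
    using (var reader = DirectoryReader.Open(dir))
    {
        var searcher = new IndexSearcher(reader);          // not disposable itself
        var parser = new QueryParser(LuceneVersion.LUCENE_48, "name", analyzer);
        TopDocs hits = searcher.Search(parser.Parse("pérez"), 10); // folded to "perez" at query time
    }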