Search with Nest not yielding expected result


I am creating an index with the following code:

    var ElasticSettings = new ConnectionSettings(new Uri(ConnectionString))
        .DefaultIndex(_IndexName)
        .DefaultMappingFor<PictureObject>(M => M
            .Ignore(_ => _._id)
            .Ignore(_ => _.Log))
        .DefaultFieldNameInferrer(_ => _);

    _ElasticClient = new ElasticClient(ElasticSettings);

    if (!_ElasticClient.IndexExists(_IndexName).Exists)
    {
        var I = _ElasticClient.CreateIndex(_IndexName, Ci => Ci
            .Settings(S => S
                .Analysis(A => A
                    .CharFilters(Cf => Cf.Mapping("expressions",
                        E => E.Mappings(ExpressionsList))
                    )
                    .TokenFilters(Tf => Tf.Synonym("synonyms",
                        Descriptor => new SynonymTokenFilter
                        {
                            Synonyms = SynonymsList,
                            Tokenizer = "whitespace"
                        })
                    )
                    .Analyzers(Analyzer => Analyzer
                        .Custom("index", C => C
                            .CharFilters("expressions")
                            .Tokenizer("standard")
                            .Filters("synonyms", "standard", "lowercase", "stop")
                        )
                        .Custom("search", C => C
                            .CharFilters("expressions")
                            .Tokenizer("standard")
                            .Filters("synonyms", "standard", "lowercase", "stop")
                        )
                    )
                )
            )
            .Mappings(Mapping => Mapping
                .Map<PictureObject>(Map => Map
                    .AutoMap()
                    .Properties(P => P
                        .Text(T => T
                            .Name(N => N.Title)
                            .Analyzer("index")
                            .SearchAnalyzer("search")
                        )
                        .Text(T => T
                            .Name(N => N.Tags)
                            .Analyzer("index")
                            .SearchAnalyzer("search")
                        )
                    )
                )
            )
        );

The fields I want to search are 'Title' and 'Tags'.

My synonyms are in this format:

[ "big => large, huge", "small => tiny, minuscule", ]

and my expressions are like:

[ "stormy weather => storm", "happy day => joy", ]

I am doing tests with these two methods:

    var Test1 = _ElasticClient.Search<PictureObject>(S => S
        .From(From)
        .Size(Take)
        .Query(_ => _.Fuzzy(Fuzz => Fuzz.Field(F => F.Tags).Field(T => T.Title).Value(Terms).MaxExpansions(2)))).Documents;

    var resTest2 = _ElasticClient.Search<PictureObject>(S => S
        .Query(_ => _.QueryString(F => F.Query(Terms)))
        .From(From)
        .Size(Take));

When trying to match terms exactly as they are in the tags field, the two functions return different results. When trying to use synonyms, results vary again.

(Ultimately, I want to handle misspellings too, but for now I am only testing with verbatim strings.)

What am I missing? (I still have a dodgy understanding of the API, so the mistakes may be very obvious.)

Edit: Here is a full example that compiles.

namespace Test
{
    using System;
    using System.Collections.Generic;
    using Nest;

    public class MyData
    {
        public string Id;
        public string Title;
        public string Tags;
    }

    public static class Program
    {
        public static void Main()
        {
            const string INDEX_NAME = "testindex";

            var ExpressionsList = new[]
            {
                "bad weather => storm",
                "happy day => sun"
            };

            var SynonymsList = new[]
            {
                "big => large, huge",
                "small => tiny, minuscule",
                "sun => sunshine, shiny, sunny"
            };

            // connect
            var ElasticSettings = new ConnectionSettings(new Uri("http://elasticsearch:9200"))
                .DefaultIndex(INDEX_NAME)
                .DefaultFieldNameInferrer(_ => _) // stop the camel case
                .DefaultMappingFor<MyData>(M => M.IdProperty("Id"));

            var Client = new ElasticClient(ElasticSettings);

            // erase the old index, if any
            if (Client.IndexExists(INDEX_NAME).Exists) Client.DeleteIndex(INDEX_NAME);

            // create the index
            var I = Client.CreateIndex(INDEX_NAME, Ci => Ci
                .Settings(S => S
                    .Analysis(A => A
                        .CharFilters(Cf => Cf.Mapping("expressions",
                            E => E.Mappings(ExpressionsList))
                        )
                        .TokenFilters(Tf => Tf.Synonym("synonyms",
                            Descriptor => new SynonymTokenFilter
                            {
                                Synonyms = SynonymsList,
                                Tokenizer = "whitespace"
                            })
                        )
                        .Analyzers(Analyzer => Analyzer
                            .Custom("index", C => C
                                .CharFilters("expressions")
                                .Tokenizer("standard")
                                .Filters("synonyms", "standard", "lowercase", "stop")
                            )
                            .Custom("search", C => C
                                .CharFilters("expressions")
                                .Tokenizer("standard")
                                .Filters("synonyms", "standard", "lowercase", "stop")
                            )
                        )
                    )
                )
                .Mappings(Mapping => Mapping
                    .Map<MyData>(Map => Map
                        .AutoMap()
                        .Properties(P => P
                            .Text(T => T
                                .Name(N => N.Title)
                                .Analyzer("index")
                                .SearchAnalyzer("search")
                            )
                            .Text(T => T
                                .Name(N => N.Tags)
                                .Analyzer("index")
                                .SearchAnalyzer("search")
                            )
                        )
                    )
                )
            );

            // add some data
            var Data = new List<MyData>
            {
                new MyData { Id = "1", Title = "nice stormy weather", Tags = "storm nice" },
                new MyData { Id = "2", Title = "a large storm with sunshine", Tags = "storm large sunshine" },
                new MyData { Id = "3", Title = "a storm during a sunny day", Tags = "sun storm" }
            };

            Client.IndexMany(Data);
            Client.Refresh(INDEX_NAME);


            // do some queries
            var TestA1 = Client.Search<MyData>(S => S.Query(_ => _.Fuzzy(Fuzz => Fuzz.Field(F => F.Tags).Field(T => T.Title).Value("stormy sunny").MaxExpansions(2)))).Documents;
            var TestA2 = Client.Search<MyData>(S => S.Query(_ => _.Fuzzy(Fuzz => Fuzz.Field(F => F.Tags).Field(T => T.Title).Value("stromy sunny").MaxExpansions(2)))).Documents;

            var TestB1 = Client.Search<MyData>(S => S.Query(_ => _.QueryString(F => F.Query("stormy sunny")))).Documents;
            // expected to return documents 1, 2, 3 because of synonyms: sun => sunny, shiny, sunshine

            var TestB2 = Client.Search<MyData>(S => S.Query(_ => _.QueryString(F => F.Query("bad weather")))).Documents;
            var TestB3 = Client.Search<MyData>(S => S.Query(_ => _.QueryString(F => F.Query("a large happy day")))).Documents;

            /*
             * I'm expecting the fuzzy queries to handle misspellings
             * Also, I'm expecting the expressions and synonyms to do the substitutions as they're written
             *
             * Ideally I'd like to handle:
             *  - expressions
             *  - synonyms
             *  - misspellings
             *
             * all together
             *
             * I have tried a lot of string examples while debugging and it's really hit or miss.
             * Unfortunately, I haven't kept the strings, but it was enough to see that there is something
             * wrong with my approach in this code.
             */
        }
    }
}

Solution

  • Here are a few pointers to get you on the right track:

    Character filters

    var ExpressionsList = new[]
    {
        "bad weather => storm",
        "happy day => sun"
    };
    

    Consider whether these ought to be character filters. They may be, but character filters are typically used where the tokenizer would otherwise tokenize the input incorrectly, e.g.

    • Stripping HTML tags before tokenizing
    • The standard tokenizer removing "&", when ideally we want to keep it by replacing it with "and" in a character filter
    • The standard tokenizer tokenizing "c#" as "c", when ideally we want to keep it by replacing it with "csharp" in a character filter

    It may be that you do want a character filter here, but this may be better handled by synonyms, or by a synonym graph in the case of multi-word synonyms.

    Custom analyzers

    The index and search custom analyzers are identical, so you could remove one of them. Similarly, if search_analyzer is not explicitly set on a text datatype field, it defaults to the analyzer configured for that field, which simplifies things a little.

    Synonyms

    var SynonymsList = new[]
    {
        "big => large, huge",
        "small => tiny, minuscule",
        "sun => sunshine, shiny, sunny"
    };
    

    This is a directional synonym map, i.e. a match on the left-hand side is replaced with all of the alternatives on the right-hand side. If all of the terms should be considered synonyms of each other, you likely don't want a directional map, i.e.

    var SynonymsList = new[]
    {
        "big, large, huge",
        "small, tiny, minuscule",
        "sun, sunshine, shiny, sunny"
    };
    

    With the non-directional map, all 3 documents would be returned for:

    var TestB1 = Client.Search<MyData>(S => S.Query(_ => _.QueryString(F => F.Query("stormy sunny")))).Documents;
    // expected to return documents 1, 2, 3 because of synonyms: sun => sunny, shiny, sunshine
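
    The difference between the two rule styles can be illustrated without Elasticsearch. The following is only a toy model of the rule format described above, not the actual synonym token filter; the class and method names (SynonymSketch, Parse, Expand) are made up for illustration, and it assumes each term appears in at most one rule.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class SynonymSketch
    {
        // Builds a lookup: token -> tokens emitted in its place.
        public static Dictionary<string, string[]> Parse(IEnumerable<string> rules)
        {
            var map = new Dictionary<string, string[]>();
            foreach (var rule in rules)
            {
                var sides = rule.Split(new[] { "=>" }, StringSplitOptions.None);
                if (sides.Length == 2)
                {
                    // Directional rule: only the left-hand terms are rewritten.
                    var targets = SplitTerms(sides[1]);
                    foreach (var source in SplitTerms(sides[0])) map[source] = targets;
                }
                else
                {
                    // Equivalent rule: every term expands to the whole group.
                    var group = SplitTerms(rule);
                    foreach (var term in group) map[term] = group;
                }
            }
            return map;
        }

        // Returns the replacement tokens, or the token itself when no rule applies.
        public static string[] Expand(Dictionary<string, string[]> map, string token) =>
            map.TryGetValue(token, out var replacement) ? replacement : new[] { token };

        private static string[] SplitTerms(string terms) =>
            terms.Split(',').Select(t => t.Trim()).Where(t => t.Length > 0).ToArray();
    }
    ```

    In this model, Expand on the directional rule "sun => sunshine, shiny, sunny" leaves "sunny" untouched, so a search for "sunny" never reaches a document that only contains "sunshine". With the equivalent rule "sun, sunshine, shiny, sunny", "sunny" expands to the whole group.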
    

    Token filters

    .Custom("index", C => C
        .CharFilters("expressions")
        .Tokenizer("standard")
        .Filters("synonyms", "standard", "lowercase", "stop")
    )
    .Custom("search", C => C
        .CharFilters("expressions")
        .Tokenizer("standard")
        .Filters("synonyms", "standard", "lowercase", "stop")
    )
    

    The ordering of token filters matters: you want to run the synonyms filter after the lowercase filter, so that tokens are already lowercased when they are compared against the (lowercase) synonym rules.
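
    Why the order matters can be seen in a small stand-alone sketch. It is a toy model, assuming a case-sensitive synonym lookup and a single hard-coded synonym group; FilterOrderSketch and its methods are illustrative names, not part of NEST or Elasticsearch.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class FilterOrderSketch
    {
        // Toy stand-in for the "lowercase" token filter.
        public static IEnumerable<string> Lowercase(IEnumerable<string> tokens) =>
            tokens.Select(t => t.ToLowerInvariant());

        // Toy stand-in for the "synonyms" token filter: rules are written in
        // lowercase and the lookup is case-sensitive.
        public static IEnumerable<string> Synonyms(IEnumerable<string> tokens)
        {
            var group = new[] { "sun", "sunshine", "shiny", "sunny" };
            return tokens.SelectMany(t => group.Contains(t) ? group : new[] { t });
        }

        // The order used in the question: synonyms first, then lowercase.
        public static string[] SynonymsThenLowercase(params string[] tokens) =>
            Lowercase(Synonyms(tokens)).ToArray();

        // The suggested order: lowercase first, then synonyms.
        public static string[] LowercaseThenSynonyms(params string[] tokens) =>
            Synonyms(Lowercase(tokens)).ToArray();
    }
    ```

    In this model, SynonymsThenLowercase("Sunny") yields only "sunny" because the capitalised token never matches a rule, whereas LowercaseThenSynonyms("Sunny") yields the whole synonym group.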

    Fuzzy queries

    Fuzzy queries are term-level queries, so the query input does not undergo analysis. If you run one against a field that is analyzed at index time, the fuzzy query input has to match, within the allowed edit distance, the terms that analysis produced for a document at index time. This is unlikely to yield the correct results when the query input is one that would be tokenized into multiple terms: the fuzzy query treats the input as one complete term, while the indexed value of the target field has been split into several terms.

    Take a look at the Fuzziness section of the Definitive Guide; it's written for Elasticsearch 2.x but is still largely relevant to later versions. You likely want to use a full-text query that supports fuzziness and performs analysis at query time, such as a query_string, match or multi_match query.
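
    To make the single-term problem concrete, here is a minimal sketch of an edit distance in which swapping two adjacent characters counts as one edit (fuzzy matching in Elasticsearch counts transpositions this way by default). It is an illustration only, not how Lucene implements fuzzy matching, and FuzzySketch is a made-up name.

    ```csharp
    using System;

    public static class FuzzySketch
    {
        // Edit distance with insertions, deletions, substitutions and
        // transpositions of adjacent characters (optimal string alignment).
        public static int Distance(string a, string b)
        {
            var d = new int[a.Length + 1, b.Length + 1];
            for (var i = 0; i <= a.Length; i++) d[i, 0] = i;
            for (var j = 0; j <= b.Length; j++) d[0, j] = j;

            for (var i = 1; i <= a.Length; i++)
            {
                for (var j = 1; j <= b.Length; j++)
                {
                    var cost = a[i - 1] == b[j - 1] ? 0 : 1;
                    d[i, j] = Math.Min(
                        Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                        d[i - 1, j - 1] + cost);

                    // Adjacent characters swapped: count as a single edit.
                    if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                        d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1);
                }
            }
            return d[a.Length, b.Length];
        }
    }
    ```

    Distance("stromy", "stormy") is 1, so the misspelt token on its own is within reach of the indexed term. The unanalyzed input "stromy sunny", treated as one term, is more than 2 edits away from "stormy" or any other single indexed term, which is why the term-level fuzzy query finds nothing.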

    An example

    Putting these together, here's an example to work with while developing:

    // Requires: using System; using System.Collections.Generic; using System.Text; using Nest;
    public class MyData
    {
        public string Id;
        public string Title;
        public string Tags;
    }
    
    public static void Main()
    {
        const string INDEX_NAME = "testindex";
    
        var expressions = new[]
        {
            "bad weather => storm",
            "happy day => sun"
        };
    
        var synonyms = new[]
        {
            "big, large, huge",
            "small, tiny, minuscule",
            "sun, sunshine, shiny, sunny"
        };
    
        // connect
        var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
            .DefaultIndex(INDEX_NAME)
            .DefaultFieldNameInferrer(s => s) // stop the camel case
            .DefaultMappingFor<MyData>(m => m.IdProperty("Id"))
            .DisableDirectStreaming()
            .PrettyJson()
            .OnRequestCompleted(callDetails =>
            {
                if (callDetails.RequestBodyInBytes != null)
                {
                    Console.WriteLine(
                        $"{callDetails.HttpMethod} {callDetails.Uri} \n" +
                        $"{Encoding.UTF8.GetString(callDetails.RequestBodyInBytes)}");
                }
                else
                {
                    Console.WriteLine($"{callDetails.HttpMethod} {callDetails.Uri}");
                }
    
                Console.WriteLine();
    
                if (callDetails.ResponseBodyInBytes != null)
                {
                    Console.WriteLine($"Status: {callDetails.HttpStatusCode}\n" +
                             $"{Encoding.UTF8.GetString(callDetails.ResponseBodyInBytes)}\n" +
                             $"{new string('-', 30)}\n");
                }
                else
                {
                    Console.WriteLine($"Status: {callDetails.HttpStatusCode}\n" +
                             $"{new string('-', 30)}\n");
                }
            });
    
        var Client = new ElasticClient(settings);
    
        // erase the old index, if any
        if (Client.IndexExists(INDEX_NAME).Exists) Client.DeleteIndex(INDEX_NAME);
    
        // create the index
        var createIndexResponse = Client.CreateIndex(INDEX_NAME, c => c
            .Settings(s => s
                .Analysis(a => a
                    .CharFilters(cf => cf
                        .Mapping("expressions", E => E
                            .Mappings(expressions)
                        )
                    )
                    .TokenFilters(tf => tf
                        .Synonym("synonyms", sy => sy
                            .Synonyms(synonyms)
                            .Tokenizer("whitespace")
                        )
                    )
                    .Analyzers(an => an
                        .Custom("index", ca => ca
                            .CharFilters("expressions")
                            .Tokenizer("standard")
                            .Filters("standard", "lowercase", "synonyms",  "stop")
                        )
                    )
                )
            )
            .Mappings(m => m
                .Map<MyData>(mm => mm
                    .AutoMap()
                    .Properties(p => p
                        .Text(t => t
                            .Name(n => n.Title)
                            .Analyzer("index")
                        )
                        .Text(t => t
                            .Name(n => n.Tags)
                            .Analyzer("index")
                        )
                    )
                )
            )
        );
    
        // add some data
        var data = new List<MyData>
        {
            new MyData { Id = "1", Title = "nice stormy weather", Tags = "storm nice" },
            new MyData { Id = "2", Title = "a large storm with sunshine", Tags = "storm large sunshine" },
            new MyData { Id = "3", Title = "a storm during a sunny day", Tags = "sun storm" }
        };
    
        Client.IndexMany(data);
        Client.Refresh(INDEX_NAME);
    
        //var query = "stormy sunny";
        var query = "stromy sunny";
        // var query = "bad weather";
        // var query = "a large happy day";
    
        var testA1 = Client.Search<MyData>(s => s
            .Query(q => q
                .MultiMatch(fu => fu
                    .Fields(f => f
                        .Field(ff => ff.Tags)
                        .Field(ff => ff.Title)
                    )           
                    .Query(query)
                    .Fuzziness(Fuzziness.EditDistance(2))
                )
            )
        ).Documents;
    }
    

    I've added .DisableDirectStreaming(), .PrettyJson() and an .OnRequestCompleted(...) handler to the ConnectionSettings so that you can see the requests and responses written to the console. These are useful while developing, but you'll likely want to remove them for production as they add overhead. A small app like LINQPad will help whilst developing here :)
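
    Alongside request logging, it can help to look at the terms an analyzer actually produces. The sketch below assumes a NEST 6.x client, where the analyze API is exposed as Analyze directly on the client (other major versions expose it differently), and a running cluster with the index already created; AnalyzerDebugging and PrintTokens are illustrative names.

    ```csharp
    using System;
    using System.Linq;
    using Nest;

    public static class AnalyzerDebugging
    {
        // Prints the terms that the named analyzer of the given index
        // produces for the supplied text.
        public static void PrintTokens(IElasticClient client, string index, string analyzer, string text)
        {
            var response = client.Analyze(a => a
                .Index(index)
                .Analyzer(analyzer)
                .Text(text));

            var terms = response.Tokens.Select(t => t.Token);
            Console.WriteLine($"{text} -> {string.Join(", ", terms)}");
        }
    }
    ```

    For instance, calling PrintTokens(Client, INDEX_NAME, "index", "a large happy day") after the index has been created shows whether the expressions and synonyms were applied the way you expect.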

    The example uses a multi_match query on the Tags and Title fields, with fuzziness enabled at an edit distance of 2 (you may want to just use auto fuzziness here, as it does a sensible job). All three documents are returned for the (misspelt) query "stromy sunny".
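
    For reference, the auto fuzziness variant would look roughly like the following. It is a sketch that reuses the MyData class from the example above and only swaps Fuzziness.EditDistance(2) for Fuzziness.Auto; AutoFuzzinessSketch is an illustrative name, and a running cluster with the index from the example is assumed.

    ```csharp
    using Nest;

    public class MyData
    {
        public string Id;
        public string Title;
        public string Tags;
    }

    public static class AutoFuzzinessSketch
    {
        // Same multi_match query as in the example, but the allowed edit
        // distance is chosen by Elasticsearch based on each term's length.
        public static ISearchResponse<MyData> Search(IElasticClient client, string query) =>
            client.Search<MyData>(s => s
                .Query(q => q
                    .MultiMatch(mm => mm
                        .Fields(f => f
                            .Field(ff => ff.Tags)
                            .Field(ff => ff.Title)
                        )
                        .Query(query)
                        .Fuzziness(Fuzziness.Auto)
                    )
                )
            );
    }
    ```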