I am creating an index with the following code:
var ElasticSettings = new ConnectionSettings(new Uri(ConnectionString))
.DefaultIndex(_IndexName)
.DefaultMappingFor<PictureObject>(M => M
.Ignore(_ => _._id)
.Ignore(_ => _.Log))
.DefaultFieldNameInferrer(_ => _);
_ElasticClient = new ElasticClient(ElasticSettings);
if (!_ElasticClient.IndexExists(_IndexName).Exists)
{
var I = _ElasticClient.CreateIndex(_IndexName, Ci => Ci
.Settings(S => S
.Analysis(A => A
.CharFilters(Cf => Cf.Mapping("expressions",
E => E.Mappings(ExpressionsList))
)
.TokenFilters(Tf => Tf.Synonym("synonyms",
Descriptor => new SynonymTokenFilter
{
Synonyms = SynonymsList,
Tokenizer = "whitespace"
})
)
.Analyzers(Analyzer => Analyzer
.Custom("index", C => C
.CharFilters("expressions")
.Tokenizer("standard")
.Filters("synonyms", "standard", "lowercase", "stop")
)
.Custom("search", C => C
.CharFilters("expressions")
.Tokenizer("standard")
.Filters("synonyms", "standard", "lowercase", "stop")
)
)
)
)
.Mappings(Mapping => Mapping
.Map<PictureObject>(Map => Map
.AutoMap()
.Properties(P => P
.Text(T => T
.Name(N => N.Title)
.Analyzer("index")
.SearchAnalyzer("search")
)
.Text(T => T
.Name(N => N.Tags)
.Analyzer("index")
.SearchAnalyzer("search")
)
)
)
)
);
The fields I want to search are 'title' and 'tags'.
My synonyms are in this format:
[ "big => large, huge", "small => tiny, minuscule", ]
and my expressions are like:
[ "stormy weather => storm", "happy day => joy", ]
I am doing tests with these two methods:
var Test1 = _ElasticClient.Search<PictureObject>(S => S
.From(From)
.Size(Take)
.Query(_ => _.Fuzzy(Fuzz => Fuzz.Field(F => F.Tags).Field(T => T.Title).Value(Terms).MaxExpansions(2)))).Documents;
var resTest2 = _ElasticClient.Search<PictureObject>(S => S
.Query(_ => _.QueryString(F => F.Query(Terms)))
.From(From)
.Size(Take));
When trying to match terms exactly as they are in the tags field, the two functions return different results. When trying to use synonyms, results vary again.
(Ultimately, I want to handle misspellings too, but for now I just do testing with verbatim strings)
What am I missing? (I still have a dodgy understanding of the API, so the mistakes may be very obvious)
Edit: Here is a full working example that compiles.
namespace Test
{
using System;
using System.Collections.Generic;
using Nest;
public class MyData
{
public string Id;
public string Title;
public string Tags;
}
public static class Program
{
public static void Main()
{
const string INDEX_NAME = "testindex";
var ExpressionsList = new[]
{
"bad weather => storm",
"happy day => sun"
};
var SynonymsList = new[]
{
"big => large, huge",
"small => tiny, minuscule",
"sun => sunshine, shiny, sunny"
};
// connect
var ElasticSettings = new ConnectionSettings(new Uri("http://elasticsearch:9200"))
.DefaultIndex(INDEX_NAME)
.DefaultFieldNameInferrer(_ => _) // stop the camel case
.DefaultMappingFor<MyData>(M => M.IdProperty("Id"));
var Client = new ElasticClient(ElasticSettings);
// erase the old index, if any
if (Client.IndexExists(INDEX_NAME).Exists) Client.DeleteIndex(INDEX_NAME);
// create the index
var I = Client.CreateIndex(INDEX_NAME, Ci => Ci
.Settings(S => S
.Analysis(A => A
.CharFilters(Cf => Cf.Mapping("expressions",
E => E.Mappings(ExpressionsList))
)
.TokenFilters(Tf => Tf.Synonym("synonyms",
Descriptor => new SynonymTokenFilter
{
Synonyms = SynonymsList,
Tokenizer = "whitespace"
})
)
.Analyzers(Analyzer => Analyzer
.Custom("index", C => C
.CharFilters("expressions")
.Tokenizer("standard")
.Filters("synonyms", "standard", "lowercase", "stop")
)
.Custom("search", C => C
.CharFilters("expressions")
.Tokenizer("standard")
.Filters("synonyms", "standard", "lowercase", "stop")
)
)
)
)
.Mappings(Mapping => Mapping
.Map<MyData>(Map => Map
.AutoMap()
.Properties(P => P
.Text(T => T
.Name(N => N.Title)
.Analyzer("index")
.SearchAnalyzer("search")
)
.Text(T => T
.Name(N => N.Tags)
.Analyzer("index")
.SearchAnalyzer("search")
)
)
)
)
);
// add some data
var Data = new List<MyData>
{
new MyData { Id = "1", Title = "nice stormy weather", Tags = "storm nice" },
new MyData { Id = "2", Title = "a large storm with sunshine", Tags = "storm large sunshine" },
new MyData { Id = "3", Title = "a storm during a sunny day", Tags = "sun storm" }
};
Client.IndexMany(Data);
Client.Refresh(INDEX_NAME);
// do some queries
var TestA1 = Client.Search<MyData>(S => S.Query(_ => _.Fuzzy(Fuzz => Fuzz.Field(F => F.Tags).Field(T => T.Title).Value("stormy sunny").MaxExpansions(2)))).Documents;
var TestA2 = Client.Search<MyData>(S => S.Query(_ => _.Fuzzy(Fuzz => Fuzz.Field(F => F.Tags).Field(T => T.Title).Value("stromy sunny").MaxExpansions(2)))).Documents;
var TestB1 = Client.Search<MyData>(S => S.Query(_ => _.QueryString(F => F.Query("stormy sunny")))).Documents;
// expected to return documents 1, 2, 3 because of synonyms: sun => sunny, shiny, sunshine
var TestB2 = Client.Search<MyData>(S => S.Query(_ => _.QueryString(F => F.Query("bad weather")))).Documents;
var TestB3 = Client.Search<MyData>(S => S.Query(_ => _.QueryString(F => F.Query("a large happy day")))).Documents;
/*
* I'm expecting the fuzzy queries to handle misspellings
* Also, I'm expecting the expressions and synonyms to do the substitutions as they're written
*
* Ideally I'd like to handle:
* - expressions
* - synonyms
* - misspellings
*
* all together
*
* I have tried a lot of string examples while debugging and it's really hit or miss.
* Unfortunately, I haven't kept the strings, but it was enough to see that there is something
* wrong with my approach in this code.
*/
}
}
}
Here are a few pointers to get you on the right track
var ExpressionsList = new[] { "bad weather => storm", "happy day => sun" };
Consider whether these ought to be character filters; they may be, but typically character filters are used in places where the tokenizer might tokenize incorrectly, e.g.
- tokenizing '&' into nothing, when we ideally wanted to keep it and replace it with 'and' in a character filter
- tokenizing 'c#' as 'c', when we ideally wanted to keep it and replace it with 'csharp' in a character filter
It may be that you do want a character filter here, but multi-word replacements may be better handled by synonyms, or a synonym graph.
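A mapping character filter of the kind described is configured in raw Elasticsearch JSON roughly as follows (a sketch; the filter name "code_names" and these mappings are made up for illustration):

```json
{
  "settings": {
    "analysis": {
      "char_filter": {
        "code_names": {
          "type": "mapping",
          "mappings": [
            "c# => csharp",
            "& => and"
          ]
        }
      }
    }
  }
}
```

The replacements run over the raw character stream before the tokenizer sees it, which is why they can rescue inputs the tokenizer would otherwise mangle.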
The 'index' and 'search' custom analyzers are the same, so you could remove one. Similarly, if not explicitly set, the search_analyzer for a text datatype field defaults to the configured analyzer, so this simplifies things a little.
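For example, a text field mapping that sets only analyzer (a sketch using the field and analyzer names from the code above; depending on the Elasticsearch version, a mapping type name level may also wrap the properties) will use that analyzer at both index and search time:

```json
{
  "mappings": {
    "properties": {
      "Title": {
        "type": "text",
        "analyzer": "index"
      }
    }
  }
}
```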
var SynonymsList = new[] { "big => large, huge", "small => tiny, minuscule", "sun => sunshine, shiny, sunny" };
This is a directional synonym map, i.e. matches on the left-hand side will be replaced with all of the alternatives on the right-hand side. If all terms should be considered equal synonyms of each other, you likely don't want a directional map, i.e.
var SynonymsList = new[]
{
"big, large, huge",
"small, tiny, minuscule",
"sun, sunshine, shiny, sunny"
};
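Serialized to Elasticsearch settings JSON, this non-directional synonym token filter would look roughly like the following (a sketch; the filter name matches the code above):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "synonyms": {
          "type": "synonym",
          "tokenizer": "whitespace",
          "synonyms": [
            "big, large, huge",
            "small, tiny, minuscule",
            "sun, sunshine, shiny, sunny"
          ]
        }
      }
    }
  }
}
```

With the comma-separated form, a document or query containing any one of the terms in a group will match the others in that group.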
This would return all 3 documents for
var TestB1 = Client.Search<MyData>(S => S.Query(_ => _.QueryString(F => F.Query("stormy sunny")))).Documents; // expected to return documents 1, 2, 3 because of synonyms: sun => sunny, shiny, sunshine
.Custom("index", C => C
    .CharFilters("expressions")
    .Tokenizer("standard")
    .Filters("synonyms", "standard", "lowercase", "stop")
)
.Custom("search", C => C
    .CharFilters("expressions")
    .Tokenizer("standard")
    .Filters("synonyms", "standard", "lowercase", "stop")
)
The ordering of token filters matters, so you want to run the synonyms filter after the lowercase filter.
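In raw Elasticsearch JSON, the corrected analyzer definition (a sketch reusing the names from the code above) runs lowercase before synonyms:

```json
{
  "analysis": {
    "analyzer": {
      "index": {
        "type": "custom",
        "char_filter": ["expressions"],
        "tokenizer": "standard",
        "filter": ["standard", "lowercase", "synonyms", "stop"]
      }
    }
  }
}
```

Token filters run in the order listed, so lowercasing first means the synonym filter sees "Big" and "big" as the same token.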
Fuzzy queries are term-level queries, so the query input does not undergo analysis. This means that if you run one against a field that is analyzed at index time, the fuzzy query input needs to match a term that analysis produced for the document at index time. This is unlikely to yield the expected results when the query input is one that would be tokenized into multiple terms at index time: the fuzzy query input is treated as one complete term, while the index-time value of the target document field may have been split into multiple terms.
Take a look at the Fuzziness section from the Definitive Guide - it's for Elasticsearch 2.x but is largely still relevant for later versions. You likely want to use a full-text query that supports fuzziness and performs analysis at query time, such as the query_string, match or multi_match queries.
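For reference, the request body a fuzzy-enabled multi_match query produces looks roughly like this (a sketch; the query string and field names are taken from the example, and "AUTO" fuzziness is assumed):

```json
{
  "query": {
    "multi_match": {
      "query": "stromy sunny",
      "fields": ["Tags", "Title"],
      "fuzziness": "AUTO"
    }
  }
}
```

Because multi_match is a full-text query, "stromy sunny" is analyzed at query time into separate terms, and fuzziness is then applied per term.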
Putting these together, here's an example to work with while developing
using System;
using System.Collections.Generic;
using System.Text;
using Nest;

public class MyData
{
public string Id;
public string Title;
public string Tags;
}
public static void Main()
{
const string INDEX_NAME = "testindex";
var expressions = new[]
{
"bad weather => storm",
"happy day => sun"
};
var synonyms = new[]
{
"big, large, huge",
"small, tiny, minuscule",
"sun, sunshine, shiny, sunny"
};
// connect
var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
.DefaultIndex(INDEX_NAME)
.DefaultFieldNameInferrer(s => s) // stop the camel case
.DefaultMappingFor<MyData>(m => m.IdProperty("Id"))
.DisableDirectStreaming()
.PrettyJson()
.OnRequestCompleted(callDetails =>
{
if (callDetails.RequestBodyInBytes != null)
{
Console.WriteLine(
$"{callDetails.HttpMethod} {callDetails.Uri} \n" +
$"{Encoding.UTF8.GetString(callDetails.RequestBodyInBytes)}");
}
else
{
Console.WriteLine($"{callDetails.HttpMethod} {callDetails.Uri}");
}
Console.WriteLine();
if (callDetails.ResponseBodyInBytes != null)
{
Console.WriteLine($"Status: {callDetails.HttpStatusCode}\n" +
$"{Encoding.UTF8.GetString(callDetails.ResponseBodyInBytes)}\n" +
$"{new string('-', 30)}\n");
}
else
{
Console.WriteLine($"Status: {callDetails.HttpStatusCode}\n" +
$"{new string('-', 30)}\n");
}
});
var Client = new ElasticClient(settings);
// erase the old index, if any
if (Client.IndexExists(INDEX_NAME).Exists) Client.DeleteIndex(INDEX_NAME);
// create the index
var createIndexResponse = Client.CreateIndex(INDEX_NAME, c => c
.Settings(s => s
.Analysis(a => a
.CharFilters(cf => cf
.Mapping("expressions", e => e
.Mappings(expressions)
)
)
.TokenFilters(tf => tf
.Synonym("synonyms", sy => sy
.Synonyms(synonyms)
.Tokenizer("whitespace")
)
)
.Analyzers(an => an
.Custom("index", ca => ca
.CharFilters("expressions")
.Tokenizer("standard")
.Filters("standard", "lowercase", "synonyms", "stop")
)
)
)
)
.Mappings(m => m
.Map<MyData>(mm => mm
.AutoMap()
.Properties(p => p
.Text(t => t
.Name(n => n.Title)
.Analyzer("index")
)
.Text(t => t
.Name(n => n.Tags)
.Analyzer("index")
)
)
)
)
);
// add some data
var data = new List<MyData>
{
new MyData { Id = "1", Title = "nice stormy weather", Tags = "storm nice" },
new MyData { Id = "2", Title = "a large storm with sunshine", Tags = "storm large sunshine" },
new MyData { Id = "3", Title = "a storm during a sunny day", Tags = "sun storm" }
};
Client.IndexMany(data);
Client.Refresh(INDEX_NAME);
//var query = "stormy sunny";
var query = "stromy sunny";
// var query = "bad weather";
// var query = "a large happy day";
var testA1 = Client.Search<MyData>(s => s
.Query(q => q
.MultiMatch(fu => fu
.Fields(f => f
.Field(ff => ff.Tags)
.Field(ff => ff.Title)
)
.Query(query)
.Fuzziness(Fuzziness.EditDistance(2))
)
)
).Documents;
}
I've added .DisableDirectStreaming(), .PrettyJson() and an .OnRequestCompleted(...) handler to the connection settings so that you can see the requests and responses written to the console. These are useful while developing, but you'll likely want to remove them for production as they add overhead. A small app like LINQPad will help whilst developing here :)
The example uses a multi_match query with fuzziness enabled and an edit distance of 2 (you may want to just use Auto fuzziness here; it does a sensible job), running on the Tags and Title fields. All three documents are returned for the (misspelt) query "stromy sunny".