
How can I solve the Turkish letter issue in Elasticsearch using C# NEST?


In Turkish we have letters like 'ğ', 'ü', 'ş', 'ı', 'ö', 'ç', but when searching we usually type 'g', 'u', 's', 'i', 'o', 'c' instead. This is not a rule, just a common habit. For example, if I type an uppercase "Ş", the search should match both "ş" and "s". Please look at this link; it describes the same problem, but their solution is too long and not perfect. How can I achieve the following?

My goal is this:

  • ProductName or Category.CategoryName may contain Turkish letters ("Eşarp"), or may be mistyped and written with English letters ("Esarp").
  • The query string may contain Turkish letters ("eşarp") or not ("esarp").
  • The query string may have multiple words.
  • Every indexed string field should be searched against the query string (full-text search).

indexing and full text searching in elasticsearch without dialitics using c# client Nest

My code is:


try
{
    var node = new Uri(ConfigurationManager.AppSettings["elasticseachhost"]);
    var settings = new ConnectionSettings(node);
    settings.DefaultIndex("defaultindex").MapDefaultTypeIndices(m => m.Add(typeof(Customer), "myindex"));
    var client = new ElasticClient(settings);

    string command = Resource1.GetAllData;
    using (var ctx = new SearchEntities())
    {
        Console.WriteLine("Oracle db is connected...");
        // Materialize the raw SQL results as Customer entities.
        var customers = ctx.Database.SqlQuery<Customer>(command).ToList();
        Console.WriteLine("Customer count : {0}", customers.Count);
        if (customers.Count > 0)
        {
            // Drop the old index so it is rebuilt from scratch.
            var delete = client.DeleteIndex(new DeleteIndexRequest("myindex"));
            foreach (var customer in customers)
            {
                client.Index(customer, idx => idx.Index("myindex"));
                Console.WriteLine("Data is indexed in elasticSearch engine");
            }
        }
    }
}
catch (Exception ex)
{
    Trace.WriteLine(ex.Message);
    Console.WriteLine(ex.Message);
}

My Entity :


public class Customer
{
    public string Name { get; set; }
    public string SurName { get; set; }
    public string Address { get; set; }
}

I think the solution I need is this one: (Create index with multi field mapping syntax with NEST 2.x)

but I cannot understand it.


Check this out:

[Nest.ElasticsearchType]
public class MyType
{
    // Index this & allow for retrieval.
    [Nest.Number(Store=true)]
    int Id { get; set; }

    // Index this & allow for retrieval.
    // **Also**, in my searching & sorting, I need to sort on this **entire** field, not just individual tokens.
    [Nest.String(Store = true, Index=Nest.FieldIndexOption.Analyzed, TermVector=Nest.TermVectorOption.WithPositionsOffsets)]
    string CompanyName { get; set; }

    // Don't index this for searching, but do store for display.
    [Nest.Date(Store=true, Index=Nest.NonStringIndexOption.No)]
    DateTime CreatedDate { get; set; }

    // Index this for searching BUT NOT for retrieval/displaying.
    [Nest.String(Store=false, Index=Nest.FieldIndexOption.Analyzed)]
    string CompanyDescription { get; set; }

    [Nest.Nested(Store=true, IncludeInAll=true)]
    // Nest this.
    List<MyChildType> Locations { get; set; }
}

[Nest.ElasticsearchType]
public class MyChildType
{
    // Index this & allow for retrieval.
    [Nest.String(Store=true, Index = Nest.FieldIndexOption.Analyzed)]
    string LocationName { get; set; }

    // etc. other properties.
}

After this declaration, to create this mapping in Elasticsearch you need to make a call similar to:

var mappingResponse = elasticClient.Map<MyType>(m => m.AutoMap());

My second attempt at the above challenge fails with an error: Analysis is not recognized. The big problem is version differences. I found lots of samples, but all of them produce an error like this: 'CreateIndexDescriptor' does not contain a definition for "Analysis"...


using Nest;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ElasticSearchTest2
{
    class Program
    {
        public static Uri EsNode;
        public static ConnectionSettings EsConfig;
        public static ElasticClient client;
        static void Main(string[] args)
        {
            EsNode = new Uri("http://localhost:9200/");
            EsConfig = new ConnectionSettings(EsNode);
            client = new ElasticClient(EsConfig);

            var partialName = new CustomAnalyzer
            {
                Filter = new List<string> { "lowercase", "name_ngrams", "standard", "asciifolding" },
                Tokenizer = "standard"
            };

            var fullName = new CustomAnalyzer
            {
                Filter = new List<string> { "standard", "lowercase", "asciifolding" },
                Tokenizer = "standard"
            };

            client.CreateIndex("employeeindex5", c => c
                            .Analysis(descriptor => descriptor
                                .TokenFilters(bases => bases.Add("name_ngrams", new EdgeNGramTokenFilter
                                {
                                    MaxGram = 20,
                                    MinGram = 2,
                                    Side = "front"
                                }))
                                .Analyzers(bases => bases
                                    .Add("partial_name", partialName)
                                    .Add("full_name", fullName))
                            )
                            .AddMapping<Employee>(m => m
                                .Properties(o => o
                                    .String(i => i
                                        .Name(x => x.Name)
                                        .IndexAnalyzer("partial_name")
                                        .SearchAnalyzer("full_name")
                                    ))));

            Employee emp = new Employee() { Name = "yılmaz", SurName = "eşarp" };
            client.Index(emp, idx => idx.Index("employeeindex5"));
            Employee emp2 = new Employee() { Name = "ayşe", SurName = "eşarp" };
            client.Index(emp2, idx => idx.Index("employeeindex5"));
            Employee emp3 = new Employee() { Name = "ömer", SurName = "eşarp" };
            client.Index(emp3, idx => idx.Index("employeeindex5"));
            Employee emp4 = new Employee() { Name = "gazı", SurName = "emir" };
            client.Index(emp4, idx => idx.Index("employeeindex5"));
        }
    }

    public class Employee
    {

        public string Name { set; get; }
        public string SurName { set; get; }


    }
}



Solution

  • What you want is to use the ASCII Folding Token Filter. This is quoted from its official Elasticsearch page:

    A token filter of type asciifolding that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

    This means it can convert characters like ç into their plain Latin counterparts (here, the letter c), since that is the closest match among the standard ASCII characters.

    So you can index a value like çar, and as long as the same token filter is applied at search time, searching for either car or çar will return the result you expect.

    As an example, you can try the following call:

    • Send this POST request to your Elasticsearch instance:

    URL:

    http://YOUR_ELASTIC_SEARCH_INSTANCE_URL/_analyze/

    Request Body: { "tokenizer": "standard", "filter": [ "lowercase", "asciifolding" ], "text": "déja öne ğuess" }

    The result will be as follows:

    {
      "tokens": [
        {
          "token": "deja",
          "start_offset": 0,
          "end_offset": 4,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "one",
          "start_offset": 5,
          "end_offset": 8,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "guess",
          "start_offset": 9,
          "end_offset": 14,
          "type": "<ALPHANUM>",
          "position": 2
        }
      ]
    }
    

    Note how the token property (the text that Elasticsearch will actually index and search against) is an ASCII version of the original text provided.
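
    For intuition about what folding does to Turkish input, here is a rough approximation in plain C#. It is not Elasticsearch's asciifolding implementation, which relies on a much larger mapping table; FoldingSketch and Fold are hypothetical names used only for this sketch. It decomposes each character, drops the combining marks, and lowercases the result, with dotless ı handled by hand because it has no Unicode decomposition.

    ```csharp
    using System;
    using System.Globalization;
    using System.Text;

    public static class FoldingSketch
    {
        // Rough stand-in for "lowercase" + "asciifolding" on Turkish text.
        public static string Fold(string input)
        {
            // Dotless ı (U+0131) does not decompose, so map it explicitly.
            string decomposed = input
                .Replace('ı', 'i')
                .Normalize(NormalizationForm.FormD);

            var sb = new StringBuilder(decomposed.Length);
            foreach (char ch in decomposed)
            {
                // Skip combining marks such as the cedilla, breve and diaeresis.
                if (CharUnicodeInfo.GetUnicodeCategory(ch) != UnicodeCategory.NonSpacingMark)
                {
                    sb.Append(ch);
                }
            }

            return sb.ToString()
                .Normalize(NormalizationForm.FormC)
                .ToLowerInvariant();
        }
    }
    ```

    With this sketch, Fold("Eşarp") and Fold("esarp") both give "esarp", which is why an indexed "Eşarp" and a query for "esarp" end up as the same token once the same filters run on both sides.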

    To learn more about the ASCII Folding Token Filter, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-asciifolding-tokenfilter.html

    Note: to use this technique, you will need to create your own analyzer.

    This is quoted from the official page on custom analyzers:

    When the built-in analyzers do not fulfill your needs, you can create a custom analyzer which uses the appropriate combination of:

    • zero or more character filters

    • a tokenizer

    • zero or more token filters.

    More on creating a custom analyzer can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/current/analyzer-anatomy.html
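
    The 'Analysis' compile error in the question is most likely a NEST version mismatch: the attempted code follows the NEST 1.x shape, while in NEST 2.x analysis settings are normally declared under Settings(...) and mappings under Mappings(...). The following is a minimal sketch assuming NEST 2.x and a local node at http://localhost:9200/. It reuses the question's index, analyzer, and Employee names; IndexSetupSketch is a hypothetical name, and the exact descriptor methods should be checked against your installed NEST version.

    ```csharp
    using System;
    using Nest;

    public class Employee
    {
        public string Name { get; set; }
        public string SurName { get; set; }
    }

    public static class IndexSetupSketch
    {
        public static void Main()
        {
            var settings = new ConnectionSettings(new Uri("http://localhost:9200/"));
            var client = new ElasticClient(settings);

            var response = client.CreateIndex("employeeindex5", c => c
                .Settings(s => s
                    .Analysis(a => a
                        // Edge n-grams enable prefix matching at index time.
                        .TokenFilters(tf => tf
                            .EdgeNGram("name_ngrams", e => e.MinGram(2).MaxGram(20)))
                        .Analyzers(an => an
                            // Index time: lowercase, fold to ASCII, then build n-grams.
                            .Custom("partial_name", ca => ca
                                .Tokenizer("standard")
                                .Filters("lowercase", "asciifolding", "name_ngrams"))
                            // Search time: lowercase and fold only.
                            .Custom("full_name", ca => ca
                                .Tokenizer("standard")
                                .Filters("lowercase", "asciifolding")))))
                .Mappings(m => m
                    .Map<Employee>(mm => mm
                        .Properties(p => p
                            .String(st => st
                                .Name(e => e.Name)
                                .Analyzer("partial_name")
                                .SearchAnalyzer("full_name"))
                            .String(st => st
                                .Name(e => e.SurName)
                                .Analyzer("partial_name")
                                .SearchAnalyzer("full_name"))))));

            Console.WriteLine(response.IsValid ? "Index created" : response.DebugInformation);
        }
    }
    ```

    Folding runs before the n-gram filter here so that the stored prefixes are already ASCII; the search analyzer skips the n-grams so that a query is matched as whole folded words against those prefixes.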

    You can also find a sample of how to create a custom analyzer using NEST in this answer: Create custom token filter with NEST
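
    Once the string fields are mapped with a folding analyzer, the search side needs no special handling, because the query text goes through the field's search analyzer. A minimal sketch of such a query, assuming NEST 2.x, a local node, and the question's employeeindex5 index with Name and SurName already mapped to an asciifolding-based analyzer (SearchSketch is a hypothetical name):

    ```csharp
    using System;
    using Nest;

    public class Employee
    {
        public string Name { get; set; }
        public string SurName { get; set; }
    }

    public static class SearchSketch
    {
        public static void Main()
        {
            var settings = new ConnectionSettings(new Uri("http://localhost:9200/"));
            var client = new ElasticClient(settings);

            // The query text may be typed with or without Turkish letters.
            var result = client.Search<Employee>(s => s
                .Index("employeeindex5")
                .Query(q => q
                    .MultiMatch(mm => mm
                        .Fields(f => f
                            .Field(e => e.Name)
                            .Field(e => e.SurName))
                        .Query("esarp"))));

            foreach (var employee in result.Documents)
            {
                Console.WriteLine("{0} {1}", employee.Name, employee.SurName);
            }
        }
    }
    ```

    Under those assumptions, querying for "esarp" or "eşarp" should be analyzed to the same token and match the same documents. A multi_match query is used so that a multi-word query string is checked against every listed field.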