Tags: python, django, elasticsearch, kibana

How to tokenize a field in ELK?


I want to tokenize a text field in all documents (60k) of an index (post). What is the best approach?

GET /_analyze
{
  "analyzer": "standard",
  "text": ["this is a test"]
}
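
For reference, the standard analyzer lowercases and splits on word boundaries, so _analyze returns one object per token (response abbreviated here):

{
  "tokens": [
    { "token": "this", "position": 0 },
    { "token": "is", "position": 1 },
    { "token": "a", "position": 2 },
    { "token": "test", "position": 3 }
  ]
}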

I need the tokenized text to build a tag cloud in my Django app.


Solution

  • By default, all string data is indexed as both text (analyzed with the standard analyzer) and a keyword sub-field. To create the index mapping explicitly, you can use the following API call.

    PUT my_index
    {
      "mappings": {
        "properties": {
          "my_field_1": {
            "type": "text",
            "analyzer": "standard"
          },
          "my_field_2": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
    

    In that case, all data indexed into my_field_1 and my_field_2 will be eligible for full-text search.
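
    To check how a mapped field will tokenize a given string, you can point _analyze at the index and field (a quick sketch against the my_index mapping above):

    GET my_index/_analyze
    {
      "field": "my_field_1",
      "text": "this is a test"
    }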

    If you already have an index, you can use one of the following approaches:

    1. Use the copy_to feature to copy the values of several fields into one combined field, so they are all searchable in a single place (a sketch follows this list).
    2. Create an ingest pipeline and trigger it with the update by query API call. I'm sharing an example below.
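
    For approach 1, a minimal copy_to mapping sketch (the index and combined_field names are illustrative; note that copy_to only applies to documents indexed after the mapping exists, so existing documents would still need a reindex or an update by query):

    PUT my_index3
    {
      "mappings": {
        "properties": {
          "my_field_1": { "type": "text", "copy_to": "combined_field" },
          "my_field_2": { "type": "text", "copy_to": "combined_field" },
          "combined_field": { "type": "text" }
        }
      }
    }

    The example for approach 2 follows: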

    PUT my_index2/_doc/1
    {
      "my_field_1": "musab dogan",
      "my_field_2": "elasticsearch opensearch"
    }
    
    PUT _ingest/pipeline/all_into_one
    {
      "description": "Copy selected fields to a single new field",
      "processors": [
        {
          "script": {
            "source": """
              // Build a list of "field: value" strings from the document
              def newField = [];
              for (entry in ctx.entrySet()) {
                // Exclude metadata fields starting with underscore
                if (!entry.getKey().startsWith("_")) {
                  newField.add(entry.getKey() + ": " + entry.getValue());
                }
              }
              // Store the combined values in a single searchable field
              ctx['new_field'] = newField;
            """
          }
        }
      ]
    }
    
    POST my_index2/_update_by_query?pipeline=all_into_one
    
    GET my_index2/_search
    {
      "query": {
        "match": {
          "new_field": "musab"
        }
      }
    }
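
    To preview what the pipeline produces without modifying the index, you can also use the simulate endpoint (a sketch; the sample document mirrors the one indexed above):

    POST _ingest/pipeline/all_into_one/_simulate
    {
      "docs": [
        {
          "_source": {
            "my_field_1": "musab dogan",
            "my_field_2": "elasticsearch opensearch"
          }
        }
      ]
    }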
    


    After you run the _update_by_query API call, all existing documents are updated. For new incoming data, you can set the ingest pipeline as the index's default_pipeline:

    PUT my_index2/_settings
    {
      "index.default_pipeline": "all_into_one"
    }