elasticsearch, elasticsearch-plugin

Is it possible to set a new field value while analyzing a document being indexed in Elasticsearch?


For example:

  1. I index a document into Elasticsearch.
  2. I want the description field of the document to be analyzed by the uax_url_email tokenizer/analyzer.
  3. If description contains any URLs, they are put into another field, an array named urls.
  4. Indexing of the document finishes.

I could then check whether the urls field is empty to know whether description contains any URL.

Is this possible? Or does an analyzer only contribute to the inverted index and not to other fields?


Solution

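  • An analyzer only produces tokens for the inverted index; it cannot add or change fields in the document's _source. You can get this behavior at index time with an ingest pipeline that uses a Script processor and a Painless script. The example below reads a field named content (use description in your case) and simulates the pipeline against a sample document.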

    POST _ingest/pipeline/_simulate?verbose
    {
      "pipeline": {
        "processors": [
          {
            "script": {
              "description": "Extract URLs from the 'content' field into 'urls'",
              "lang": "painless",
              "source": """
                // Collect every URL found in the 'content' field.
                ArrayList urls = new ArrayList();
                // Skip matching when the field is missing to avoid a null pointer error.
                if (ctx['content'] != null) {
                  def m = /(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])/.matcher(ctx['content']);
                  while (m.find()) {
                    urls.add(m.group());
                  }
                }
                // Always set 'urls', so it is an empty array when nothing matched.
                ctx['urls'] = urls;
              """
            }
          }
        ]
      },
      "docs": [
        {
          "_source": {
            "content": "My name is Sagar patel and i visit https://apple.com and https://google.com"
          }
        }
      ]
    }
    

    The pipeline above produces the following result:

    {
      "docs": [
        {
          "processor_results": [
            {
              "processor_type": "script",
              "status": "success",
              "description": "Extract URLs from the 'content' field into 'urls'",
              "doc": {
                "_index": "_index",
                "_id": "_id",
                "_source": {
                  "urls": [
                    "https://apple.com",
                    "https://google.com"
                  ],
                  "content": "My name is Sagar patel and i visit https://apple.com and https://google.com"
                },
                "_ingest": {
                  "pipeline": "_simulate_pipeline",
                  "timestamp": "2022-07-13T12:45:00.3655307Z"
                }
              }
            }
          ]
        }
      ]
    }
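
    The simulate request only previews the result and stores nothing. To apply the script to real documents, one option is to register the pipeline and reference it when indexing. The following is a minimal sketch under assumptions: the pipeline name extract-urls and the index name my-index are placeholders chosen for illustration, and the script is the same URL extraction shown in the simulate request.

    # Register the pipeline (the name "extract-urls" is a placeholder)
    PUT _ingest/pipeline/extract-urls
    {
      "description": "Extract URLs from the 'content' field into 'urls'",
      "processors": [
        {
          "script": {
            "lang": "painless",
            "source": """
              ArrayList urls = new ArrayList();
              if (ctx['content'] != null) {
                def m = /(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])/.matcher(ctx['content']);
                while (m.find()) {
                  urls.add(m.group());
                }
              }
              ctx['urls'] = urls;
            """
          }
        }
      ]
    }

    # Index a document through the pipeline (the index name "my-index" is a placeholder)
    PUT my-index/_doc/1?pipeline=extract-urls
    {
      "content": "My name is Sagar patel and i visit https://apple.com and https://google.com"
    }

    # Find documents whose content contained at least one URL
    GET my-index/_search
    {
      "query": {
        "exists": {
          "field": "urls"
        }
      }
    }

    The exists query does not match a field that holds only an empty array, so the search should return only documents in which at least one URL was found. If every indexing request for the index should go through the pipeline, the index setting index.default_pipeline can be set to the pipeline name instead of passing ?pipeline= on each request.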