elasticsearch, spring-data-elasticsearch

Reindexing data in Elasticsearch after an index change


I have a field carName which uses some analyzer:

@Field(type = FieldType.Text, searchAnalyzer = "myAnalyzer", analyzer = "myAnalyzer")
private String carName;

The myAnalyzer analyzer looks like this:

{
  "index": {
    "analysis": {
      "filter": {
        "myStopwords": {
          "ignore_case": "true",
          "type": "stop",
          "stopwords": [
            "word1",
            "word2"
          ]
        } 
      },
      "char_filter": {
        "myTrimmer": {
          "flags": "CASE_INSENSITIVE",
          "pattern": "somepatter",
          "replacement": "somrereplacement",
          "type": "pattern_replace"
        } 
      },
      "analyzer": {
        "myAnalyzer": {
          "filter": [
            "lowercase",
            "unique",
            "myStopwords"
          ],
          "char_filter": [
            "myTrimmer"
          ],
          "type": "custom",
          "tokenizer": "whitespace"
        } 
      }
    }
  }
}

The myStopwords list can grow or shrink over time. In my database I have a CAR entity, and whenever someone adds a new car it is indexed in Elasticsearch as a document. What do I have to do when someone changes the stopwords list? Is it possible to refresh the data purely on the Elasticsearch side, without reading it from my database again? Or could a change to the stopwords list mean that some data in the index for carName was lost during indexing (for example, the words that were in the stopwords list at the time), so that, unfortunately, I need to read the cars from the database and index them again?

As I understand it, the analyzer (myAnalyzer in my case) is used by Elasticsearch during indexing, so at first sight it seems that if I change the stopwords list (which is an analyzer change), I should reindex my cars. But maybe I am wrong? If a car is named 'Ford King Taurus' and King is not in the stopwords list, what happens if I then add King to the list? And if 'King' was in the stopwords list when some documents were indexed and is later removed from the list, what happens to search? Would searching still work correctly after such changes?

I read about the UpdateByQuery method, which I think can be used in similar cases, for example to update part of a document. But could it be used here? That is, how can I tell Elasticsearch, if necessary, to refresh all carName values after a change to the stopwords list?


Solution

  • If you use the same analyzer at index time and at search time and you update your stop words list, both the index-time and the search-time analyzer will use the new list as soon as the updated settings are in effect. However, anything that is already indexed will not be updated: you need to run _update_by_query on your index for the new stop words to be applied to existing documents.

    A quick example:

    If you index Ford King Taurus and the stop words list doesn't contain King, then the following tokens will be indexed: ford, king and taurus (the lowercase filter lowercases them). At search time, you can find the document using any of these three terms.

    Then you close your index, add King to the stop words list, and reopen the index so that the analyzers are refreshed. At this point, the existing document with Ford King Taurus is no longer searchable with King, since the search analyzer now drops King from the query even though the token king is still indexed. You can still find the document by searching for king with the standard search analyzer, though, since the king token is still in the index.

    However, if you index a new document, say, Seat King, then only seat will be indexed, and no search for King will find that document.

    If you want the existing document to pick up the new stop word King, you need to either reindex the document or simply update your index in place using _update_by_query, so that the source documents get reindexed onto themselves, this time with the index-time analyzer that has the new stop words list including King.

    Here is a quick summary of the explanation above:

    # 1. You create your index like normal
    PUT test2
    {
       "settings": {...},
       "mappings": {...}
    }
    
    # 2. You index "Ford King Taurus"
    POST test2/_doc/1 
    {
      "carName": "Ford King Taurus"
    }
    
    # 3. You can find it searching for "king"
    POST test2/_search 
    {
      "query": {
        "match": {
          "carName": "king"
        }
      }
    }
    
    # 4. You close the index, add "king" as a new stop word and reopen the index
    POST test2/_close
    PUT test2/_settings
    {
      "index": {
        "analysis": {
          "filter": {
            "myStopwords": {
              "ignore_case": "true",
              "type": "stop",
              "stopwords": [
                "word1",
                "word2",
                "king"
              ]
            }
          },
          "analyzer": {
            "myAnalyzer": {
              "filter": [
                "lowercase",
                "unique",
                "myStopwords"
              ],
              "type": "custom",
              "tokenizer": "whitespace"
            }
          }
        }
      }
    }
    POST test2/_open
    
    # 5. You cannot find the document searching for "king"
    POST test2/_search
    {
      "query": {
        "match": {
          "carName": {
            "query": "king"
          }
        }
      }
    }
    # => No results
    
    # 6. But you can still find it using the standard search analyzer
    POST test2/_search
    {
      "query": {
        "match": {
          "carName": {
            "query": "king",
            "analyzer": "standard"
          }
        }
      }
    }
    # => 1 result
    
    # 7. You update your index in place
    POST test2/_update_by_query
    
    # 8. None of the search queries will find anything with "king"
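
To make the eight steps easier to reason about without a running cluster, here is a minimal Python sketch of a toy inverted index. It only illustrates the idea and is not how Elasticsearch is implemented. The names ToyIndex, analyze and standard_analyze are hypothetical, the myTrimmer char filter is left out because its pattern is only a placeholder in the question, and the standard analyzer is approximated very roughly as "split on non-word characters and lowercase".

```python
import re


def analyze(text, stopwords):
    """Rough stand-in for myAnalyzer: whitespace tokenizer, then the
    lowercase, unique and stop filters (the char filter is omitted)."""
    tokens = [t.lower() for t in text.split()]
    seen, unique = set(), []
    for t in tokens:
        if t not in seen:  # "unique" filter: keep first occurrence only
            seen.add(t)
            unique.append(t)
    stop = {s.lower() for s in stopwords}  # ignore_case behaviour
    return [t for t in unique if t not in stop]


def standard_analyze(text):
    """Very rough approximation of the standard analyzer (assumption)."""
    return [t.lower() for t in re.findall(r"\w+", text)]


class ToyIndex:
    """Hypothetical toy index: keeps the original text and the indexed tokens."""

    def __init__(self, stopwords):
        self.stopwords = list(stopwords)
        self.source = {}    # doc id -> original text (like _source)
        self.postings = {}  # token -> set of doc ids

    def index(self, doc_id, text):
        # Drop any old tokens of this document, then index with the
        # analyzer as it is configured right now.
        for ids in self.postings.values():
            ids.discard(doc_id)
        self.source[doc_id] = text
        for token in analyze(text, self.stopwords):
            self.postings.setdefault(token, set()).add(doc_id)

    def match(self, query, analyzer=None):
        # By default the query goes through the same analyzer as indexing.
        tokens = analyzer(query) if analyzer else analyze(query, self.stopwords)
        hits = set()
        for t in tokens:
            hits |= self.postings.get(t, set())
        return sorted(hits)

    def update_by_query(self):
        # Reindex every document onto itself from its stored source.
        for doc_id, text in list(self.source.items()):
            self.index(doc_id, text)


idx = ToyIndex(["word1", "word2"])                     # step 1
idx.index(1, "Ford King Taurus")                       # step 2
step3 = idx.match("king")                              # found: [1]
idx.stopwords.append("king")                           # step 4
step5 = idx.match("king")                              # not found: []
step6 = idx.match("king", analyzer=standard_analyze)   # still found: [1]
idx.update_by_query()                                  # step 7
step8 = idx.match("king", analyzer=standard_analyze)   # not found: []
print(step3, step5, step6, step8)
```

The point the sketch tries to capture is that changing the stop words list only changes how new text and queries are analyzed. The tokens already stored for a document stay as they are until that document is indexed again, which can be done from the stored source without going back to the database.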