Removing stopwords from basic Terms aggregation in Elasticsearch?


I'm a little new to Elasticsearch, but basically I have a single index called posts with multiple post documents that take the following form:

"post": {
    "id": 123,
    "message": "Some message"
}

I'm trying to get the most frequently occurring words in the message field across the entire index, with a simple Terms aggregation:

curl -XPOST 'localhost:9200/posts/_search?pretty' -d '
{
    "aggs": {
        "frequent_words": {
            "terms": {
                "field": "message"
            }
        }
    }
}
'

Unfortunately, this aggregation includes stopwords, so I end up with a list of words like "and", "the", "then", etc. instead of more meaningful words.

I've tried applying an analyzer to exclude those stopwords, but to no avail:

curl -XPUT 'localhost:9200/posts/?pretty' -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "standard": {
                    "type": "standard",
                    "stopwords": "_english_"
                }
            }
        }
    }
}'

Am I applying the analyzer correctly, or am I going about this the wrong way? Thanks!


Solution

  • I guess you forgot to assign the analyzer to the message field in the mapping of your type. Elasticsearch runs aggregations on the indexed terms, so if the field is analyzed correctly at index time, the stopwords never reach the index and cannot show up in the aggregation. You can check this link. I used the Sense plugin for Kibana to execute the following requests. Note the request that creates the mapping:

    PUT /posts
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer": {
                        "type": "standard",
                        "stopwords": ["test", "testable"]
                    }
                }
            }
        }
    }
    
    ### Don't forget these lines
    POST /posts/post/_mapping
    {
      "properties": {
        "message": {
          "type": "string", 
          "analyzer": "my_analyzer"
        }
      }
    }
    
    POST posts/post/1
    {
      "id": 1,
      "message": "Some messages"
    }
    
    POST posts/post/2
    {
      "id": 2,
      "message": "Some testable message"
    }
    
    POST posts/post/3
    {
      "id": 3,
      "message": "Some test message"
    }
    
    
    POST /posts/_search
    {
        "aggs": {
            "frequent_words": {
                "terms": {
                    "field": "message"
                }
            }
        }
    }
    

    This is my result set for this search request:

    {
      "hits": {
      ...
      },
      "aggregations": {
        "frequent_words": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "some",
              "doc_count": 3
            },
            {
              "key": "message",
              "doc_count": 2
            },
            {
              "key": "messages",
              "doc_count": 1
            }
          ]
        }
      }
    }
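
To make it clearer why the buckets above look the way they do, here is a rough Python sketch of what happens at index time and at aggregation time. It is only a simulation under stated assumptions, not Elasticsearch code: `analyze` and `terms_doc_counts` are hypothetical helper names, and the regex tokenizer is a crude stand-in for the standard analyzer (split into words, lowercase, drop stopwords).

```python
import re
from collections import Counter

def analyze(text, stopwords):
    """Crude approximation (assumption) of the standard analyzer with a stop list:
    split into word tokens, lowercase them, and drop the stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stopwords]

def terms_doc_counts(messages, stopwords):
    """Count, per indexed term, how many documents contain it.
    This mirrors the doc_count of a terms aggregation: a term is counted
    once per document, no matter how often it occurs in that document."""
    counts = Counter()
    for message in messages:
        counts.update(set(analyze(message, stopwords)))
    return counts

# Same three documents and the same custom stop list as in the requests above.
messages = ["Some messages", "Some testable message", "Some test message"]
stopwords = {"test", "testable"}

buckets = terms_doc_counts(messages, stopwords)
print(buckets.most_common())
# [('some', 3), ('message', 2), ('messages', 1)]
```

Since "test" and "testable" are dropped before anything is indexed, the aggregation never sees them. The same reasoning applies if the stop list is replaced with the English one from the question.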