I'm a little new to Elasticsearch, but basically I have a single index called posts with multiple post documents that take the following form:
"post": {
"id": 123,
"message": "Some message"
}
I'm trying to get the most frequently occurring words in the message
field across the entire index, with a simple Terms aggregation:
curl -XPOST 'localhost:9200/posts/_search?pretty' -d '
{
  "aggs": {
    "frequent_words": {
      "terms": {
        "field": "message"
      }
    }
  }
}
'
Unfortunately, this aggregation includes stopwords, so I end up with a list of words like "and", "the", "then", etc. instead of more meaningful words.
I've tried applying an analyzer to exclude those stopwords, but to no avail:
curl -XPUT 'localhost:9200/posts/?pretty' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}'
Am I applying the analyzer correctly, or am I going about this the wrong way? Thanks!
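I guess you forgot to set the analyzer on the message field in your type's mapping. Defining an analyzer in the index settings is not enough on its own; a field only uses it once the mapping refers to it. Elasticsearch aggregates over the terms it has indexed, so if the field is analyzed with a stopword-removing analyzer, the stopwords never reach the index and cannot show up in the aggregation. You can check this link. I used the Sense plugin for Kibana to execute the following requests. First, create the index with a custom analyzer: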
PUT /posts
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": ["test", "testable"]
        }
      }
    }
  }
}
# Don't forget this request: it assigns my_analyzer to the message field
POST /posts/post/_mapping
{
  "properties": {
    "message": {
      "type": "string",
      "analyzer": "my_analyzer"
    }
  }
}
POST posts/post/1
{
  "id": 1,
  "message": "Some messages"
}
POST posts/post/2
{
  "id": 2,
  "message": "Some testable message"
}
POST posts/post/3
{
  "id": 3,
  "message": "Some test message"
}
POST /posts/_search
{
  "aggs": {
    "frequent_words": {
      "terms": {
        "field": "message"
      }
    }
  }
}
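This is the result set I get for this search request. Note that "test" and "testable" are missing from the buckets because they were removed as stopwords at index time: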
{
  "hits": {
    ...
  },
  "aggregations": {
    "frequent_words": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "some",
          "doc_count": 3
        },
        {
          "key": "message",
          "doc_count": 2
        },
        {
          "key": "messages",
          "doc_count": 1
        }
      ]
    }
  }
}
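If it helps to see why the stopwords drop out of the buckets, here is a minimal Python sketch that mimics the behaviour using the three sample documents and the stopword list from the requests above. It is only an illustration, not how Elasticsearch is implemented: the helper names analyze and terms_aggregation are made up for this example, the regex tokenizer is a rough stand-in for the standard analyzer, and the tie-breaking order is an assumption of the sketch.

```python
import re
from collections import Counter


def analyze(text, stopwords):
    """Rough stand-in for a standard analyzer with a stopword list
    (assumption: lowercase, split on non-alphanumerics, drop stopwords)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]


def terms_aggregation(docs, field, stopwords, size=10):
    """Count, for each indexed term, how many documents contain it."""
    doc_counts = Counter()
    for doc in docs:
        # set(): doc_count counts documents, not occurrences within a document
        doc_counts.update(set(analyze(doc[field], stopwords)))
    # Order by doc_count descending, then by term (tie-break assumed here)
    ordered = sorted(doc_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": k, "doc_count": c} for k, c in ordered[:size]]


docs = [
    {"id": 1, "message": "Some messages"},
    {"id": 2, "message": "Some testable message"},
    {"id": 3, "message": "Some test message"},
]

buckets = terms_aggregation(docs, "message", stopwords={"test", "testable"})
for bucket in buckets:
    print(bucket)
```

Calling terms_aggregation with an empty stopword set brings "test" and "testable" back as buckets, which is the same effect the original query showed with "and" and "the": the aggregation can only return terms that survived analysis when the document was indexed.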