Tags: elasticsearch, aggregation, elasticsearch-aggregation

Terms Aggregation excluding last vowels using Spanish analyzer - Elasticsearch 6.4


I am trying to extract keywords from a set of tweets in Spanish. The problem is that, in the results, the last vowel of most words in the response has been removed. Any idea why this is happening?

The data are cleaned tweets extracted from Twitter, written in Spanish.

Here is the query:

{
    "query": {
        "bool": {
            "must": {
                "terms": {
                    "full_text_sentiment": "positive"
                }
            },
            "filter": {
                "range": {
                    "created_at": {
                        "gte": greaterThanTime,
                        "lte": lessThanTime
                    }
                }
            }
        }
    },
    "aggs": {
        "keywords": {
            "terms": { "field": "full_text_clean", "size": 10 }
        }
    }
}

The mapping is the following for the field:

"full_text_clean": {
    "type": "text",
    "analyzer": "spanish",
    "fielddata": true,
    "fielddata_frequency_filter": {
        "min": 0.1,
        "max": 1.0,
        "min_segment_size": 10
    },
    "fields": {
        "keyword": {
            "type": "keyword",
            "ignore_above": 512
        }
    }
}

And these are the buckets in the response:

[ { key: 'aquí', doc_count: 3 },
  { key: 'deport', doc_count: 3 },
  { key: 'informacion', doc_count: 3 },
  { key: '23', doc_count: 2 },
  { key: 'corazon', doc_count: 2 },
  { key: 'dios', doc_count: 2 },
  { key: 'mexic', doc_count: 2 },
  { key: 'mujer', doc_count: 2 },
  { key: 'quier', doc_count: 2 },
  { key: 'siempr', doc_count: 2 }]

Here "deport" should be "deporte", "mexic" should be "mexico", "quier" should be "quiero", and so on.

Any idea what is happening?

Thank you!


Solution

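  • Hello, the spanish analyzer (reference here) contains a stemming token filter. It is this stemmer that reduces words to their root and therefore usually removes some characters from the end of words. Because the terms aggregation runs on the analyzed tokens of the text field, the buckets contain the stemmed forms ("deport") rather than the original words ("deporte").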

    More information about stemming here

    To avoid this behavior you will need to create a new custom analyzer without stemming.

    You can use the example from the documentation and just remove the spanish_stemmer filter.
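
    As a minimal sketch of that suggestion, the index settings below rebuild the spanish analyzer from its documented parts (standard tokenizer, lowercase filter, Spanish stop words) and leave out the stemmer. The keyword_marker filter from the documentation example is also dropped, since it only exists to protect words from stemming. The analyzer name spanish_no_stem is a hypothetical name chosen for this example, not something from the question or the documentation.

    ```json
    {
      "settings": {
        "analysis": {
          "filter": {
            "spanish_stop": {
              "type": "stop",
              "stopwords": "_spanish_"
            }
          },
          "analyzer": {
            "spanish_no_stem": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "spanish_stop"
              ]
            }
          }
        }
      }
    }
    ```

    With settings like these, the full_text_clean field would declare "analyzer": "spanish_no_stem" instead of "analyzer": "spanish". The analyzer of an existing field cannot be changed in place, so this most likely means creating a new index with the new settings and mapping and reindexing the tweets into it.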