laravel, elasticsearch, laravel-scout

Elasticsearch - how to filter hate words / insults in search analysis


I'm trying to configure an Elasticsearch 7 index to filter out hate words / insults.

I configured some stopwords and thought they would cover those words as well, but that doesn't seem to be the case...

What is the best practice?

My current settings look like this:

'analysis' => [
    'filter' => [
        ...
        'english_stop' => [
            'type' => 'stop',
            'stopwords' => '_english_'
        ],
        'english_stemmer' => [
            'type' => 'stemmer',
            'language' => 'english'
        ],
        'english_possessive_stemmer' => [
            'type' => 'stemmer',
            'language' => 'possessive_english'
        ]
        ...
    ],
    'analyzer' => [
        'rebuilt_english' => [
            'type' => 'custom',
            'tokenizer' => 'standard',
            'filter' => [
                ...
                'english_possessive_stemmer',
                'lowercase',
                'english_stop',
                'english_stemmer'
            ]
        ]
    ]
]

Thanks


Solution

  • A) If you'd like to ELIMINATE results containing bad words, i.e. disregard them completely in the search response, you could add a filtered index alias.

    First create the index as you normally would:

    PUT dirty-index
    {
      "settings": {
        "analysis": {
          "filter": { ... },
          "analyzer": { ... }
        }
      },
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "rebuilt_english"
          }
        }
      }
    }
    

    Add one "safe" and one "unsafe" doc:

    POST dirty-index/_doc
    {
      "content": "some regular text"
    }
    
    POST dirty-index/_doc
    {
      "content": "some taboo text with bad words"
    }
    

    Save a filtered index alias, thus creating a safe-ish "view" of the original index:

    PUT dirty-index/_alias/dirty-index-filtered
    {
      "filter": {
        "bool": {
          "must_not": {
            "terms": {
              "content": ["taboo"]
            }
          }
        }
      }
    }
    

    taboo is just one of many bad words taken from: https://www.cs.cmu.edu/~biglou/resources/bad-words.txt

    And voila — the alias only contains the "safe" doc. Verify via:

    GET dirty-index-filtered/_search
    {
      "query": {
        "match_all": {}
      }
    }
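
    The filtered alias behaves like a regular index, so whatever query your application ends up sending gets the must_not filter applied on top of it. For instance, a plain match query against the alias (reusing the field and docs from above):

    GET dirty-index-filtered/_search
    {
      "query": {
        "match": {
          "content": "text"
        }
      }
    }

    should only return the "safe" doc, even though both docs contain the word "text".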
    

    B) If you'd like to CENSOR select terms before they're indexed, you could do so via an ingest pipeline.

    Store the pipeline:

    PUT _ingest/pipeline/my_data_cleanser
    {
      "description": "Runs a doc thru a censoring replacer...",
      "processors": [
        {
          "script": {
            "source": """
              def bad_words = ['taboo', 'damn'];  // list all of 'em
              def CENSORED = '*CENSORED*';
              def content_copy = ctx.content;
              
              for (word in bad_words) {
                if (content_copy.contains(word)) {
                  content_copy = content_copy.replace(word, CENSORED);
                }
              }
              
              ctx.content = content_copy;
            """
          }
        }
      ]
    }
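
    Before wiring it up, you could dry-run the pipeline with the simulate API to check that the script does what you expect (a quick sketch, reusing the "unsafe" doc from above):

    POST _ingest/pipeline/my_data_cleanser/_simulate
    {
      "docs": [
        { "_source": { "content": "some taboo text with bad words" } }
      ]
    }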
    

    Then reference it via the ?pipeline= URL parameter when indexing the docs:

    POST dirty-index/_doc?pipeline=my_data_cleanser
    {
      "content": "some text with damn bad words"
    }
    

    which'll result in:

    some text with *CENSORED* bad words
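
    If you'd rather not pass the parameter on every request, another option is to make the pipeline the index default via the index.default_pipeline setting, so Elasticsearch runs it for every incoming doc (minimal sketch):

    PUT dirty-index/_settings
    {
      "index.default_pipeline": "my_data_cleanser"
    }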
    

    C) If you'd like to catch & replace select words as part of the ANALYSIS step, you could use a pattern_replace token filter.

    PUT dirty-index
    {
      "settings": {
        "analysis": {
          "filter": {
            "bad_word_replacer": {
              "type": "pattern_replace",
              "pattern": "((taboo)|(damn))",      <--- not sure how this'll scale to potentially hundreds of words
              "replacement": "*CENSORED*"
            }
          },
          "analyzer": {
            "rebuilt_english": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "bad_word_replacer"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "rebuilt_english"
          }
        }
      }
    }
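
    Regarding the comment above about scaling: since your settings live in a Laravel config array anyway, one option could be to build the alternation from a word list instead of hand-writing it. A rough sketch (variable names are illustrative; the word list would come from wherever you keep it, e.g. bad-words.txt):

    <?php

    // Illustrative: generate the pattern_replace regex from a word list
    // instead of spelling out hundreds of alternatives by hand.
    $badWords = ['taboo', 'damn']; // load / extend as needed

    $analysisFilter = [
        'bad_word_replacer' => [
            'type'        => 'pattern_replace',
            'pattern'     => '(' . implode('|', $badWords) . ')',
            'replacement' => '*CENSORED*',
        ],
    ];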
    

    Note that this'll only affect the analyzed fields, but NOT the stored values:

    POST dirty-index/_analyze?filter_path=tokens.token&format=yaml
    {
      "field": "content",
      "text": ["some taboo text"]
    }
    

    The produced tokens would then be:

    tokens:
    - token: "some"
    - token: "*CENSORED*"
    - token: "text"
    

    but they wouldn't be of much use because, if I understood your use case correctly, you don't need to disable searching for hate words; you need to disable their retrieval?