laravel, elasticsearch, laravel-scout

Elasticsearch - how to filter hate words / insults in search analysis


I'm trying to configure an Elasticsearch 7 index to filter out hate words / insults.

I configured some stopwords and thought they would cover those words as well, but that doesn't seem to be the case...

What is the best practice?

My current settings look like this:

'analysis' => [
    'filter' => [
        ...
        'english_stop' => [
            'type' => 'stop',
            'stopwords' => '_english_'
        ],
        'english_stemmer' => [
            'type' => 'stemmer',
            'language' => 'english'
        ],
        'english_possessive_stemmer' => [
            'type' => 'stemmer',
            'language' => 'possessive_english'
        ]
        ...
    ],
    'analyzer' => [
        'rebuilt_english' => [
            'type' => 'custom',
            'tokenizer' => 'standard',
            'filter' => [
                ...
                'english_possessive_stemmer',
                'lowercase',
                'english_stop',
                'english_stemmer'
            ]
        ]
    ]
]

Thanks


Solution

  • A) If you'd like to ELIMINATE results containing bad words, i.e. disregard them completely in the search response, you could add a filtered index alias.

    First create the index as you normally would:

    PUT dirty-index
    {
      "settings": {
        "analysis": {
          "filter": { ... },
          "analyzer": { ... }
        }
      },
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "rebuilt_english"
          }
        }
      }
    }
    

    Add one "safe" and one "unsafe" doc:

    POST dirty-index/_doc
    {
      "content": "some regular text"
    }
    
    POST dirty-index/_doc
    {
      "content": "some taboo text with bad words"
    }
    

    Save a filtered index alias, thus creating a safe-ish "view" of the original index:

    PUT dirty-index/_alias/dirty-index-filtered
    {
      "filter": {
        "bool": {
          "must_not": {
            "terms": {
              "content": ["taboo"]
            }
          }
        }
      }
    }
    

    taboo is just one of many bad words taken from: https://www.cs.cmu.edu/~biglou/resources/bad-words.txt

    And voila — the alias only contains the "safe" doc. Verify via:

    GET dirty-index-filtered/_search
    {
      "query": {
        "match_all": {}
      }
    }
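
    The filtered alias behaves like a regular index, so whatever query your application ends up sending gets the must_not filter applied on top of it. For instance, a plain match query against the alias (reusing the field and docs from above):

    GET dirty-index-filtered/_search
    {
      "query": {
        "match": {
          "content": "text"
        }
      }
    }

    should only return the "safe" doc, even though both docs contain the word "text".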
    

    B) If you'd like to CENSOR select terms before they're indexed, you could do so via an ingest pipeline.

    Store the pipeline:

    PUT _ingest/pipeline/my_data_cleanser
    {
      "description": "Runs a doc thru a censoring replacer...",
      "processors": [
        {
          "script": {
            "source": """
              def bad_words = ['taboo', 'damn'];  // list all of 'em
              def CENSORED = '*CENSORED*';
              def content_copy = ctx.content;
              
              for (word in bad_words) {
                if (content_copy.contains(word)) {
                  content_copy = content_copy.replace(word, CENSORED);
                }
              }
              
              ctx.content = content_copy;
            """
          }
        }
      ]
    }
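
    Before wiring it up, you could dry-run the pipeline with the simulate API to check that the script does what you expect (a quick sketch, reusing the "unsafe" doc from above):

    POST _ingest/pipeline/my_data_cleanser/_simulate
    {
      "docs": [
        { "_source": { "content": "some taboo text with bad words" } }
      ]
    }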
    

    Then reference it via the ?pipeline= URL parameter when indexing the docs:

    POST dirty-index/_doc?pipeline=my_data_cleanser
    {
      "content": "some text with damn bad words"
    }
    

    which'll result in:

    some text with *CENSORED* bad words
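
    If you'd rather not pass the parameter on every request, another option is to make the pipeline the index default via the index.default_pipeline setting, so Elasticsearch runs it for every incoming doc (minimal sketch):

    PUT dirty-index/_settings
    {
      "index.default_pipeline": "my_data_cleanser"
    }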
    

    C) If you'd like to catch & replace select words as part of the ANALYSIS step, you could use a pattern_replace token filter.

    PUT dirty-index
    {
      "settings": {
        "analysis": {
          "filter": {
            "bad_word_replacer": {
              "type": "pattern_replace",
              "pattern": "((taboo)|(damn))",      <--- not sure how this'll scale to potentially hundreds of words
              "replacement": "*CENSORED*"
            }
          },
          "analyzer": {
            "rebuilt_english": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "bad_word_replacer"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "rebuilt_english"
          }
        }
      }
    }
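
    Regarding the comment above about scaling: since your settings live in a Laravel config array anyway, one option could be to build the alternation from a word list instead of hand-writing it. A rough sketch (variable names are illustrative; the word list would come from wherever you keep it, e.g. bad-words.txt):

    <?php

    // Illustrative: generate the pattern_replace regex from a word list
    // instead of spelling out hundreds of alternatives by hand.
    $badWords = ['taboo', 'damn']; // load / extend as needed

    $analysisFilter = [
        'bad_word_replacer' => [
            'type'        => 'pattern_replace',
            'pattern'     => '(' . implode('|', $badWords) . ')',
            'replacement' => '*CENSORED*',
        ],
    ];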
    

    Note that this'll only affect the analyzed fields, but NOT the stored values:

    POST dirty-index/_analyze?filter_path=tokens.token&format=yaml
    {
      "field": "content",
      "text": ["some taboo text"]
    }
    

    The produced tokens would then be:

    tokens:
    - token: "some"
    - token: "*CENSORED*"
    - token: "text"
    

    but they wouldn't be of much use because, if I understood your use case correctly, you don't need to disable searching for hate words; you need to disable their retrieval?