I'm trying to configure an Elasticsearch 7 index.
I configured some stopwords and thought they would cover those words too, but that doesn't seem to be the case...
What is the best practice?
My current settings look like:
'analysis' => [
'filter' => [
...
'english_stop' => [
'type' => 'stop',
'stopwords' => '_english_'
],
'english_stemmer' => [
'type' => 'stemmer',
'language' => 'english'
],
'english_possessive_stemmer' => [
'type' => 'stemmer',
'language' => 'possessive_english'
]
...
],
'analyzer' => [
'rebuilt_english' => [
'type' => 'custom',
'tokenizer' => 'standard',
'filter' => [
...
'english_possessive_stemmer',
'lowercase',
'english_stop',
'english_stemmer'
]
]
]
]
Thanks
First create the index as you normally would:
PUT dirty-index
{
"settings": {
"analysis": {
"filter": { ... },
"analyzer": { ... }
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "rebuilt_english"
}
}
}
}
Add one "safe" and one "unsafe" doc:
POST dirty-index/_doc
{
"content": "some regular text"
}
POST dirty-index/_doc
{
"content": "some taboo text with bad words"
}
Save a filtered index alias, thus creating a safe-ish "view" of the original index:
PUT dirty-index/_alias/dirty-index-filtered
{
"filter": {
"bool": {
"must_not": {
"terms": {
"content": ["taboo"]
}
}
}
}
}
taboo is just one of many bad words taken from: https://www.cs.cmu.edu/~biglou/resources/bad-words.txt
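If you want to exclude the full list rather than a handful of hand-picked terms, the alias filter body could be generated programmatically. A minimal Python sketch (the file name bad-words.txt is an assumption; download the list from the URL above first):

```python
import json

def build_alias_filter(words):
    """Build the alias filter clause that excludes docs containing any bad word."""
    return {
        "filter": {
            "bool": {
                "must_not": {
                    "terms": {"content": words}
                }
            }
        }
    }

# In practice: words = open("bad-words.txt").read().split()
words = ["taboo", "damn"]
print(json.dumps(build_alias_filter(words), indent=2))
```

The resulting JSON can then be sent as the body of the PUT _alias request.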
And voila — the alias only contains the "safe" doc. Verify via:
GET dirty-index-filtered/_search
{
"query": {
"match_all": {}
}
}
Alternatively, you can cleanse the docs before they're indexed with an ingest pipeline. Store the pipeline:
PUT _ingest/pipeline/my_data_cleanser
{
"description": "Runs a doc thru a censoring replacer...",
"processors": [
{
"script": {
"source": """
def bad_words = ['taboo', 'damn']; // list all of 'em
def CENSORED = '*CENSORED*';
def content_copy = ctx.content;
for (word in bad_words) {
if (content_copy.contains(word)) {
content_copy = content_copy.replace(word, CENSORED);
}
}
ctx.content = content_copy;
"""
}
}
]
}
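For clarity, here's what the Painless script above does, sketched in plain Python: every bad word found in the content is replaced with the placeholder.

```python
def censor(content, bad_words, censored="*CENSORED*"):
    """Replace every occurrence of each bad word with the placeholder."""
    for word in bad_words:
        if word in content:
            content = content.replace(word, censored)
    return content

print(censor("some text with damn bad words", ["taboo", "damn"]))
# → some text with *CENSORED* bad words
```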
Then reference it as a URL param when indexing the docs:
POST dirty-index/_doc?pipeline=my_data_cleanser
{
"content": "some text with damn bad words"
}
which'll result in:
some text with *CENSORED* bad words
Alternatively, you could replace the bad words at analysis time with a pattern_replace token filter:
PUT dirty-index
{
"settings": {
"analysis": {
"filter": {
"bad_word_replacer": {
"type": "pattern_replace",
"pattern": "((taboo)|(damn))", <--- not sure how this'll scale to potentially hundreds of words
"replacement": "*CENSORED*"
}
},
"analyzer": {
"rebuilt_english": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"bad_word_replacer"
]
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "rebuilt_english"
}
}
}
}
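Regarding the scaling concern with the pattern: the alternation regex could be generated from an arbitrary word list instead of being written by hand. A quick Python sketch:

```python
import re

def build_pattern(words):
    """Build a pattern_replace-style alternation regex from a word list."""
    return "(" + "|".join("(" + re.escape(w) + ")" for w in words) + ")"

print(build_pattern(["taboo", "damn"]))
# → ((taboo)|(damn))
```

Whether Elasticsearch accepts a regex with hundreds of alternations without performance issues is something you'd want to benchmark on your own data.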
Note that this'll only affect the analyzed fields, but NOT the stored values:
POST dirty-index/_analyze?filter_path=tokens.token&format=yaml
{
"field": "content",
"text": ["some taboo text"]
}
The produced tokens would then be:
tokens:
- token: "some"
- token: "*CENSORED*"
- token: "text"
but that wouldn't help much because, if I understood your use case correctly, you don't need to prevent searching for the bad words; you need to prevent them from being retrieved?