My search application uses Solr to search wiki and forum content. Sometimes vulgar words appear in posts; they are then indexed in Solr and show up in suggestions and search results.
Is there a way for Solr to ignore a set of predefined words considered vulgar?
The use case would be the following. We have:
A) a schema like:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="string" indexed="true" stored="true" />
<field name="body" type="string" indexed="true" stored="true" />
B) a text file containing the vulgar words to ignore: words_to_ignore.txt. For instance it would contain:
badword1
badword2
C) a wiki with the title "my wiki badword1".
If we ran the query:
http://localhost:8983/my_wiki_collection/select?q=title:(wiki+AND+badword1)
We would expect Solr to return the document:
<doc>
<str name="id">abcd-acdf-a1ga</str>
<str name="title">my wiki</str>
<str name="body">This is my amazing wiki</str>
</doc>
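Just add them to a stopword list and apply it with solr.StopFilterFactory in the analyzer of the fields you search. Note that the filter only runs on analyzed text fields (solr.TextField); fields of type string are indexed verbatim and never pass through an analyzer, so title and body would need a text field type.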
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
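As a rough sketch of what that could look like in the schema, assuming words_to_ignore.txt sits in the collection's conf directory with one word per line. The field type name text_clean is made up for illustration, and the tokenizer and lowercase filter are just a common default chain, not something taken from the question:

```xml
<!-- Hypothetical field type; the name "text_clean" is an assumption. -->
<fieldType name="text_clean" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> is used at both index time and query time. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Drops every token listed in words_to_ignore.txt. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="words_to_ignore.txt"/>
  </analyzer>
</fieldType>

<!-- title and body switched from "string" to the analyzed type above. -->
<field name="title" type="text_clean" indexed="true" stored="true"/>
<field name="body" type="text_clean" indexed="true" stored="true"/>
```

After changing the field type you have to reindex the existing documents. With the same analyzer applied at query time, badword1 should be dropped from the query as well, so the example query would effectively search for wiki only. Keep in mind that the filter only affects indexed tokens: the stored value returned in the response still contains the original text, so if the word must not be displayed you would have to clean it before indexing or when rendering. Whether suggestions are affected depends on which field and analyzer your suggester is built from.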