
How do you remove a word completely from an Apache Solr index?


I'm running Apache Solr 6.6.5. When a user searches for "ETCS" (a special technical term), all documents that contain the word "etc" are matched. But I only want to match documents that really contain "ETCS". Solr should never even index "etc", since it is such a common word, and the stemmer should never reduce "ETCS" to "etc" (plural stemming).

I added "etc" to stopwords.txt:

# Contains words which shouldn't be indexed for fulltext fields, e.g., because
# they're too common. For documentation of the format, see
# http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
# (Lines starting with a pound character # are ignored.)
etc

I added "etc" to protwords.txt:

#-----------------------------------------------------------------------
# This file blocks words from being operated on by the stemmer and word delimiter.
&
<
>
'
"
etc

That stops documents containing "etc" from matching, but documents containing "etc.", "etc," or similar still match.

So I could add even more variants to protwords.txt:

&
<
>
'
"
etc
etc.
etc..
etc...
etc,

But that list will always be incomplete. How can I tell the stemmer to treat "etc" as a token regardless of the non-word characters around it?

My schema.xml: https://gist.github.com/klausi/f59ee47a9b14b915f5bb44bd6cf1c945


Solution

  • 1.)

    Regarding your statement that you added "etc" to protwords.txt:

    You should add "etcs" to protwords.txt, to protect the term "etcs" from being stemmed.

    2.)

    Regarding adding even more variants to protwords.txt:

    Add all variations of the words you want to remove from the index to stopwords.txt, not to protwords.txt.

    3.) Check which field type you are using. You may be able to tune its analyzer chain a bit.

    //Edit: Adding a link to your schema.xml does not help unless you explain which field you are using.

    4.) Don't forget to restart Solr and (if needed) reindex.
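
To make the points above concrete, here is a minimal sketch of a field type that combines them. It is not taken from the linked schema.xml: the field type name "text_etcs" is hypothetical, and the choice of tokenizer and the filter order are assumptions. It also assumes stopwords.txt contains "etc" and protwords.txt contains "etcs". The idea is that the tokenizer strips trailing punctuation first, so "etc.", "etc," and "etc..." all reach the stop filter as the plain token "etc", and that "etcs" is marked as a keyword before the stemmer runs.

```xml
<!-- Hypothetical field type; the name "text_etcs" is illustrative only. -->
<fieldType name="text_etcs" class="solr.TextField" positionIncrementGap="100">
  <!-- One analyzer, used at both index and query time. -->
  <analyzer>
    <!-- Drops surrounding punctuation, so "etc." and "etc," become "etc". -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercases tokens, so "ETCS" becomes "etcs". -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumes stopwords.txt contains the line: etc -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Assumes protwords.txt contains the line: etcs -->
    <!-- Tokens listed there are marked as keywords and skipped by the stemmer. -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

With a chain like this, "etc" in any punctuation variant should be dropped before it is indexed, while "etcs" is kept unstemmed and therefore no longer collapses to "etc". The Analysis screen in the Solr admin UI shows what each stage does to a sample value, which is a quick way to verify this against your actual field type before reindexing.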