I'm running Apache Solr 6.6.5. When a user searches for "ETCS" (a special technical term), all documents that contain the word "etc" are matched. But I only want to match documents that really contain "ETCS". Solr should never even index "etc" since it is such a common word, and the stemmer should never reduce "etcs" to "etc" (plural stemming).
I added "etc" to stopwords.txt:
# Contains words which shouldn't be indexed for fulltext fields, e.g., because
# they're too common. For documentation of the format, see
# http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StopFilterFactory
# (Lines starting with a pound character # are ignored.)
etc
I added "etc" to protwords.txt:
#-----------------------------------------------------------------------
# This file blocks words from being operated on by the stemmer and word delimiter.
&
<
>
'
"
etc
That stops documents containing "etc" from matching, but documents containing "etc.", "etc," or similar are still matched.
So I could add even more variants to protwords.txt:
&
<
>
'
"
etc
etc.
etc..
etc...
etc,
But that will always be incomplete. How can I tell the stemmer to treat "etc" as a tokenized word regardless of the non-word characters around it?
My schema.xml: https://gist.github.com/klausi/f59ee47a9b14b915f5bb44bd6cf1c945
1.) Regarding 'I added "etc" to protwords.txt': you should add etcs to protwords.txt instead, to protect the term etcs from being stemmed.
2.) Regarding 'So I could add even more variants to protwords.txt': add all variations of the words you want to remove from the index to stopwords.txt, not protwords.txt.
3.) Check which field type you are using. Maybe you can tune it a bit there.
//Edit: adding a link to your schema.xml will not help as long as you do not explain which field you are using.
4.) Don't forget to restart Solr and (if needed) reindex.
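As a rough illustration of points 1.) to 3.), an analyzer chain along the following lines would probably do what the question asks. This is only a sketch: the field type name text_example and the exact choice and order of filters are assumptions and are not taken from the linked schema.xml. The filter factory classes themselves are standard ones shipped with Solr 6.x.

```xml
<!-- Hypothetical field type: the name and the filter chain are assumptions for illustration. -->
<fieldType name="text_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- The standard tokenizer discards surrounding punctuation, so "etc." and "etc," should both yield the token "etc". -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first, so "ETCS" becomes "etcs" before the word lists are consulted. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords.txt contains the line: etc -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- protwords.txt contains the line: etcs (marks the token as a keyword so the stemmer skips it) -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

With a chain like this, the punctuation variants never need to be listed, because the tokenizer has already stripped them by the time the stop filter runs. If your field type uses a whitespace tokenizer instead, "etc." may reach the stop filter with the dot still attached, which would explain the behaviour you are seeing. The Analysis screen in the Solr admin UI shows the tokens after each step, so you can check this for your actual field.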