Search code examples
solrstanford-nlppos-tagger

How to only remove stopwords when they are not nouns?


I'm using Solr 5 and need to remove stop words to prevent over-matching and avoid bloating the index with high IDF terms. However, the corpus includes a lot part numbers and name initials like "Steve A" and "123-OR-A". In those cases, I don't want "A" and "OR" to get removed by the stopword filter factory as they need to be searchable.

The Stanford POS tagger does a great job detecting that the above examples are nouns, not stop words, but is this the right approach for solving my problem?

Thanks!


Solution

  • Only you can decide whether this is the right approach. If you can integrate POS tagger in and it gives you useful results - that's good.

    But just to give you an alternative, you could look at duplicating your fields and processing them differently. For example, if you see 123-OR-A being split and stopword-cleaned, that probably means you have WordDelimiterFilterFactory in your analyzer stack. That factory has a lot of parameters you could try tweaking. Or, you could copyField your content to another (store=false) field and process it without WordDelimiterFilterFactory all together. Then you search over both copies of your data, possibly with different boost for different fields.