I have a site which is searchable using Lucene. I've noticed from logs that users sometimes don't find what they're looking for because they enter a singular term, but only the plural version of that term is used on the site. I would like the search to find uses of other forms of a word as well. This is a problem that I'm sure has been solved many times over, so what are the best practices for this?
Please note: this site only has English content.
Some approaches I've thought of:
- Look up the word in some kind of thesaurus file to determine alternate forms of a given word.
  - Some examples:
    - Searches for "car", also add "cars" to the query.
    - Searches for "carry", also add "carries" and "carried" to the query.
    - Searches for "small", also add "smaller" and "smallest" to the query.
    - Searches for "can", also add "can't", "cannot", "cans", and "canned" to the query.
    - And it should work in reverse (i.e. search for "carries" should add "carry" and "carried").
  - Drawbacks:
    - Doesn't work for many new technical words unless the dictionary/thesaurus is updated frequently.
    - I'm not sure about the performance of searching the thesaurus file.
- Generate the alternate forms algorithmically, based on some heuristics.
  - Some examples:
    - If the word ends in "s" or "es" or "ed" or "er" or "est", drop the suffix.
    - If the word ends in "ies" or "ied" or "ier" or "iest", convert to "y".
    - If the word ends in "y", convert to "ies", "ied", "ier", and "iest".
    - Try adding "s", "es", "er" and "est" to the word.
  - Drawbacks:
    - Generates lots of non-words for most inputs.
    - Feels like a hack.
    - Looks like something you'd find on TheDailyWTF.com. :)
- Something much more sophisticated?
I'm thinking of doing some kind of combination of the first two approaches, but I'm not sure where to find a thesaurus file (or what it's called, as "thesaurus" isn't quite right, but neither is "dictionary").
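To make the combined idea concrete, here is a rough sketch in plain Java (it does not use Lucene). It generates candidate forms with the suffix heuristics from the list above, then keeps only the candidates that appear in a word list, which is one way to discard the non-words. The names `WordFormExpander`, `expand`, and `knownWords` are hypothetical and chosen only for illustration, and where the word list comes from is left open.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: heuristic expansion filtered by a word list.
final class WordFormExpander {
    private static final String[] Y_SUFFIXES = {"ies", "ied", "ier", "iest"};
    private static final String[] STRIP_SUFFIXES = {"s", "es", "ed", "er", "est"};
    private static final String[] ADD_SUFFIXES = {"s", "es", "er", "est"};

    /** Returns other forms of {@code word} that occur in {@code knownWords} (assumed lowercase). */
    static Set<String> expand(String word, Set<String> knownWords) {
        String w = word.toLowerCase(Locale.ROOT);

        // Step 1: guess possible base forms by stripping or converting suffixes.
        Set<String> bases = new LinkedHashSet<>();
        bases.add(w);
        for (String suffix : Y_SUFFIXES) {
            if (w.endsWith(suffix)) {
                bases.add(w.substring(0, w.length() - suffix.length()) + "y");
            }
        }
        for (String suffix : STRIP_SUFFIXES) {
            if (w.endsWith(suffix)) {
                bases.add(w.substring(0, w.length() - suffix.length()));
            }
        }

        // Step 2: inflect every guessed base form (this produces many non-words).
        Set<String> candidates = new LinkedHashSet<>(bases);
        for (String base : bases) {
            if (base.endsWith("y")) {
                String stem = base.substring(0, base.length() - 1);
                for (String suffix : Y_SUFFIXES) {
                    candidates.add(stem + suffix);
                }
            }
            for (String suffix : ADD_SUFFIXES) {
                candidates.add(base + suffix);
            }
        }

        // Step 3: keep only real words from the list, and drop the original term.
        candidates.retainAll(knownWords);
        candidates.remove(w);
        return candidates;
    }
}
```

With a word list containing "carry", "carries", and "carried", calling `WordFormExpander.expand("carries", knownWords)` yields "carry" and "carried", which covers the reverse case mentioned above. Irregular forms such as "can't" or "cannot" are out of reach for these heuristics and would still need an explicit lookup table.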
Consider including the `PorterStemFilter` in your analysis pipeline. Be sure to run queries through the same analysis that was used to build the index; otherwise the stemmed terms in the index will not match the unstemmed terms in the query.
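The point about using the same analysis on both sides can be illustrated without Lucene. The sketch below is a toy inverted index in plain Java, where one `analyze` method is shared by indexing and searching. The class `TinyStemmedIndex` and its methods are hypothetical, and the "stemmer" is a deliberately crude stand-in that only strips a trailing "s"; it is not the Porter algorithm and not how `PorterStemFilter` is implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: the same analyze() is applied at index time and query time.
final class TinyStemmedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Stand-in analysis chain: lowercase, split on non-letters, strip a trailing "s". */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            boolean plural = token.length() > 3 && token.endsWith("s");
            terms.add(plural ? token.substring(0, token.length() - 1) : token);
        }
        return terms;
    }

    /** Index time: store the analyzed terms, not the raw words. */
    void add(int docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Query time: analyze the query the same way, then look up each term (OR semantics). */
    Set<Integer> search(String query) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : analyze(query)) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return hits;
    }
}
```

If document 1 is "Used cars for sale" and document 2 is "A car wash", both `search("car")` and `search("cars")` return documents 1 and 2, because "cars" is reduced to "car" on both sides. If `search` skipped `analyze`, a query for "cars" would look up a term that was never stored and find nothing.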
I've also used the Lancaster stemming algorithm with good results. Using the `PorterStemFilter` as a guide, it is easy to integrate with Lucene.