Tags: lucene, wildcard, quotes

How does Lucene work with quotes and wildcards


When I search in Lucene for the Dutch word bieten, is there a difference between the following: bieten, "bieten", "*bieten*" and *bieten*, when using the DutchAnalyzer and allowing leading wildcards?

As far as I can tell from the parser syntax, the quotes are there just to handle spaces, and all words are always searched as if there were wildcards around them.

I ask because I found out that the DutchAnalyzer strips the plural from every word before it is entered in the index. In my case that means biet is stored in the index, not bieten. When searching with bieten, "bieten" or "*bieten*", the query is also modified to biet.
But when I use *bieten*, the query doesn't change and stays plural, which doesn't give any results.
So:

  bieten   -->> biet 
 "bieten"  -->> biet
"*bieten*" -->> biet 
 *bieten*  -->> *bieten*

Why is the last search translated to a different query than the others?

Queryparser syntax: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
Screenshot Lucene: http://oi63.tinypic.com/1z5krdg.jpg


Solution

  • Wildcard, regex and fuzzy queries are not analyzed by the query parser; that's why the last query is treated differently.

    Words are definitely not searched with wildcards around them. The query *bieten* would be intended to match things like "xxbietenxx". Finding words within a sentence does not involve wildcards, though. That's what analysis is for. It splits the text into single-word terms.

    To explain each of those queries:

    • bieten - Simple term query. Search for the given word.
    • "bieten" - Phrase query. Analyze and find the given multi-term phrase. In this case the phrase is one term long, and so the same as a term query.
    • "*bieten*" - Again, a phrase query. Not a wildcard query in any way. You can't use wildcards in phrases. The analyzer will remove the punctuation, making this identical to the previous one.
    • *bieten* - Wildcard query. This will match "bietenxx", "xxbieten", and "xxbietenxx", but will not be analyzed, and so won't match the post-analysis term "biet".
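
The four cases above can be imitated with a small stand-alone sketch. This is a toy model in plain Java, not Lucene's API: the class `ToyQueryParser` and its methods `analyze`, `rewrite` and `matches` are hypothetical names, and the "strip a trailing en" rule is only a crude stand-in for whatever stemming the DutchAnalyzer really does. It only illustrates the rule that term and phrase queries go through analysis while wildcard queries do not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Toy model only: none of these names exist in Lucene.
class ToyQueryParser {

    // Stand-in for the analyzer: lowercase, split on non-letters, and strip a
    // trailing "en" as a crude imitation of Dutch plural stemming (assumption).
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) continue;
            if (token.length() > 4 && token.endsWith("en")) {
                token = token.substring(0, token.length() - 2);
            }
            terms.add(token);
        }
        return terms;
    }

    // Decide how a query string is rewritten before it is run against the index.
    static String rewrite(String query) {
        if (query.length() >= 2 && query.startsWith("\"") && query.endsWith("\"")) {
            // Phrase query: the quoted text is analyzed, so '*' is dropped as punctuation.
            return String.join(" ", analyze(query.substring(1, query.length() - 1)));
        }
        if (query.indexOf('*') >= 0 || query.indexOf('?') >= 0) {
            // Wildcard query: passed through without analysis.
            return query;
        }
        // Plain term query: analyzed.
        return String.join(" ", analyze(query));
    }

    // Check a rewritten query against one indexed term ('*' = any run, '?' = one char).
    static boolean matches(String rewritten, String indexedTerm) {
        StringBuilder regex = new StringBuilder();
        for (char c : rewritten.toCharArray()) {
            if (c == '*') regex.append(".*");
            else if (c == '?') regex.append('.');
            else regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.matches(regex.toString(), indexedTerm);
    }

    public static void main(String[] args) {
        String indexed = analyze("bieten").get(0); // the term the toy analyzer would store
        String[] queries = {"bieten", "\"bieten\"", "\"*bieten*\"", "*bieten*", "*biet*"};
        for (String q : queries) {
            String r = rewrite(q);
            System.out.println(q + " -> " + r + " (matches index: " + matches(r, indexed) + ")");
        }
    }
}
```

In this model the first three queries are rewritten to biet and match the indexed term, while *bieten* is left untouched and finds nothing. A wildcard built on the stemmed form, such as *biet*, does match, which suggests that a wildcard pattern has to be written against the terms as they are stored in the index.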