java lucene query-parser

QueryParser with CustomAnalyzer messes order of use of PatternReplaceCharFilter


I am using org.apache.lucene.queryparser.classic.QueryParser in Lucene 6.0.0 to parse queries with a CustomAnalyzer, as shown below:

public static void testFilmAnalyzer() throws IOException, ParseException {
    CustomAnalyzer nameAnalyzer = CustomAnalyzer.builder()
            .addCharFilter("patternreplace",
                    "pattern", "(movie|film|picture).*",
                    "replacement", "")
            .withTokenizer("standard")
            .build();

    QueryParser qp = new QueryParser("name", nameAnalyzer);
    qp.setDefaultOperator(QueryParser.Operator.AND);
    String[] strs = {"avatar film fiction", "avatar-film fiction", "avatar-film-fiction"};

    for (String str : strs) {
        System.out.println("Analyzing \"" + str + "\":");
        showTokens(str, nameAnalyzer);
        Query q = qp.parse(str);
        System.out.println("Parsed query of \"" + str + "\":");
        System.out.println(q + "\n");
    }
}

private static void showTokens(String text, Analyzer analyzer) throws IOException {
    StringReader reader = new StringReader(text);
    TokenStream stream = analyzer.tokenStream("name", reader);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.print("[" + term.toString() + "]");
    }
    stream.end();   // TokenStream workflow: call end() after the last token, before close()
    stream.close();
    System.out.println();
}

I get the following output when I invoke testFilmAnalyzer:

Analyzing "avatar film fiction":
[avatar]
Parsed query of "avatar film fiction":
+name:avatar +name:fiction

Analyzing "avatar-film fiction":
[avatar]
Parsed query of "avatar-film fiction":
+name:avatar +name:fiction

Analyzing "avatar-film-fiction":
[avatar]
Parsed query of "avatar-film-fiction":
name:avatar

It seems that the analyzer applies the PatternReplaceCharFilter in its intended order (i.e. before tokenization), while the QueryParser applies it afterwards. Does anyone have an explanation for that? Isn't that a bug?


Solution

  • No, it's not a bug. CharFilters are always applied before tokenization, whether at query time or index time.

    However, spaces have meaning in QueryParser syntax, which is entirely independent of analysis. Spaces separate the clauses of a query, and each clause is analyzed on its own. This is easier to see if you don't rely on the default field: the query avatar-film fiction would then have to be written as name:avatar-film name:fiction. Each of the two clauses, "avatar-film" and "fiction", is analyzed separately, so the char filter only ever sees one clause at a time. That is what causes the results you are seeing.

    Try using phrase queries:

    String[] strs = {"\"avatar film fiction\"", "\"avatar-film fiction\"", "\"avatar-film-fiction\""};
    

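    and you should see the results you are expecting, because each quoted phrase is handed to the analyzer as a single clause, so the char filter sees the whole string before tokenization.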
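
To make the clause-splitting behaviour concrete, here is a minimal plain-Java sketch that imitates it without using Lucene at all. Everything in it is an assumption made for illustration: `ClauseSplitDemo`, `analyze` and `parse` are hypothetical names, the standard tokenizer is approximated by splitting on anything that is not a letter or digit, and the rendered query strings only mimic the shape of the output shown in the question. It is not how QueryParser is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java simulation for illustration only; none of these names are Lucene APIs.
class ClauseSplitDemo {
    // Same pattern as the question's "patternreplace" char filter.
    private static final Pattern CHAR_FILTER = Pattern.compile("(movie|film|picture).*");
    // A clause is either a double-quoted phrase or a run of non-whitespace characters.
    private static final Pattern CLAUSE = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Char filter first, then a crude stand-in for the standard tokenizer
    // (assumption: split on anything that is not a letter or digit).
    static List<String> analyze(String text) {
        String filtered = CHAR_FILTER.matcher(text).replaceAll("");
        List<String> tokens = new ArrayList<>();
        for (String t : filtered.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Split the query into clauses first, then analyze each clause on its own.
    static String parse(String query) {
        List<String> clauses = new ArrayList<>();
        Matcher m = CLAUSE.matcher(query);
        while (m.find()) {
            String raw = m.group(1) != null ? m.group(1) : m.group(2);
            List<String> tokens = analyze(raw);
            if (tokens.isEmpty()) {
                continue; // the clause analyzed to nothing, so it is dropped
            }
            // Simplification: a multi-token clause is rendered as a phrase.
            clauses.add(tokens.size() == 1
                    ? "name:" + tokens.get(0)
                    : "name:\"" + String.join(" ", tokens) + "\"");
        }
        if (clauses.size() <= 1) {
            return String.join("", clauses);
        }
        return "+" + String.join(" +", clauses); // imitates the AND default operator
    }

    public static void main(String[] args) {
        String[] queries = {
            "avatar film fiction", "avatar-film fiction", "avatar-film-fiction",
            "\"avatar film fiction\""
        };
        for (String q : queries) {
            System.out.println(q + "  analyzed as a whole: " + analyze(q)
                    + "  parsed per clause: " + parse(q));
        }
    }
}
```

Under these assumptions, the unquoted inputs keep "fiction" whenever it is separated from "film" by a space, because the pattern's `.*` can only consume text inside its own clause. The quoted input goes through the filter as one string, so only "avatar" survives.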