lucene xpath jackrabbit jcr jsr170

Problems with hyphen in Jackrabbit XPath query


Firstly, let me just say that I'm very new to JSR-170 and Jackrabbit/Lucene in general.

I have the following XPath query:

//*[@sling:resourceType="users/user-profile" and jcr:contains(*/*/*,'sophie\-a')] order by @jcr:score descending

I have a user named Sophie-Allen and a user named Sophie-Anne. Searching with the above query returns zero results, whereas searching for 'sophie' alone returns both users. I understand that the hyphen means exclude in JSR-170, but I've escaped it (as you can see above).
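
For reference, here is a minimal Java sketch of how a search term might be escaped and spliced into the query above. The class and method names (`ContainsQuerySketch`, `escapeTerm`, `buildQuery`) are hypothetical, and the escaping rules (a backslash before each hyphen, a doubled single quote inside the XPath string literal) are assumptions about what the statement needs, not a call into Jackrabbit's own API.

```java
// Hypothetical helper, not part of Jackrabbit or the JCR API.
final class ContainsQuerySketch {

    // Escape a raw search term for use inside jcr:contains(..., '...').
    // Assumption: a backslash stops '-' from being read as "exclude", and a
    // single quote is doubled so the XPath string literal stays closed.
    static String escapeTerm(String raw) {
        StringBuilder out = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (c == '-') {
                out.append("\\-");
            } else if (c == '\'') {
                out.append("''");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Build the full XPath statement around the escaped term.
    static String buildQuery(String rawTerm) {
        return "//*[@sling:resourceType=\"users/user-profile\" and jcr:contains(*/*/*,'"
                + escapeTerm(rawTerm)
                + "')] order by @jcr:score descending";
    }

    public static void main(String[] args) {
        // Prints the statement for the raw term sophie-a.
        System.out.println(buildQuery("sophie-a"));
    }
}
```

Calling `buildQuery("sophie-a")` reproduces the statement shown above, so the escaping itself is unlikely to be what is going wrong.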

Why is this query not returning both users?

Another strange thing is when I use asterisks (the hyphens are all escaped when executed):

  • Searching for 'sophie-allen' returns Sophie-Allen's record.
  • Searching for 'soph*' returns both Sophie-Allen and Sophie-Anne.
  • Searching for 'sophie-a*' returns nothing.
  • Searching for 'sophie-allen*' returns nothing.

I understand that with jcr:contains, technically you don't need to use asterisks, but looking at the above behaviour, it seems to have some sort of effect.

Is there something else that I'm missing with regards to hyphens and asterisks in XPath queries and searching a JCR? I've googled everything I can think of and read through the spec, but can't seem to find anything that answers my question.

Thanks in advance.

Edit: It looks like a 'phrase query' doesn't work with jcr:contains (anymore?) because the default Lucene Analyzer tokenizes on the hyphen, meaning it splits 'sophie-allen' into sophie and allen.
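
To illustrate how tokenizing on the hyphen could produce the results listed earlier, here is a toy Java model. It is not Lucene or Jackrabbit code. It rests on two assumptions: indexed text is lowercased and split on anything that is not a letter or digit, and a term ending in `*` is not analyzed but compared as a raw prefix against each indexed token. The class name `HyphenTokenSketch` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model only: this is not Lucene or Jackrabbit code.
final class HyphenTokenSketch {

    // Assumption: the analyzer lowercases and splits on anything that is
    // not a letter or digit, so "Sophie-Allen" is indexed as [sophie, allen].
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Assumption: a term ending in '*' is NOT analyzed; it is compared as a
    // raw prefix against each indexed token. Any other term is analyzed and
    // matched as a phrase (the same tokens, consecutively).
    static boolean matches(String indexedText, String term) {
        List<String> index = tokenize(indexedText);
        if (term.endsWith("*")) {
            String prefix = term.substring(0, term.length() - 1).toLowerCase(Locale.ROOT);
            for (String token : index) {
                if (token.startsWith(prefix)) {
                    return true;
                }
            }
            return false;
        }
        List<String> phrase = tokenize(term);
        return !phrase.isEmpty() && Collections.indexOfSubList(index, phrase) >= 0;
    }

    public static void main(String[] args) {
        String[] terms = {"sophie-allen", "soph*", "sophie-a*", "sophie-allen*", "sophie-a"};
        for (String term : terms) {
            System.out.println(term
                    + " -> Sophie-Allen: " + matches("Sophie-Allen", term)
                    + ", Sophie-Anne: " + matches("Sophie-Anne", term));
        }
    }
}
```

Under these assumptions, 'sophie-allen' matches only Sophie-Allen, 'soph*' matches both users, and 'sophie-a*', 'sophie-allen*' and plain 'sophie-a' match neither, because no single indexed token starts with a prefix that contains the hyphen. That is the same pattern as the results above, though it does not show what Jackrabbit actually does internally.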

Edit 2: I've tried using a custom analyzer and tokenizer, as suggested by someone on the Jackrabbit Users list, but that hasn't helped either. Lucene is still splitting on the hyphen and omitting the results I want.


Solution

  • While working on this with a colleague, we discovered this JIRA for ModeShape, incidentally logged by Randall (who answered here too). It turns out that the problem is that Jackrabbit doesn't properly handle a search term that contains both a hyphen and a wildcard.

    Randall had done a fix for ModeShape, but my colleagues and project team decided not to fix our problem at this stage, as the use of Jackrabbit was not 100% certain.

    I'd like to credit Randall with the answer to this question, but his post isn't the actual answer. I'll mark this post as the answer unless Randall comes along and posts something.