Solr query not working as expected when it contains the `@` character


I have a field named email_txt of type text_general that holds a list of emails of type [email protected]. My goal is to create a query that searches only the username and disregards the domain.

  1. My query looks something like this:
email_txt:*abc*@*

This produces 0 results. I expect to receive results where the username contains abc, like [email protected], [email protected], [email protected] and [email protected].

  2. The second attempt at the query looks like this:
email_txt:*abc*

It does find results, including the desired ones from above, but also "false positives", where the domain contains abc, like [email protected], which is not desired.

I have had a look at the Standard Query Parser documentation, and it confirms that @ is not a special character. Even so, I have tried to escape it, with no success:

email_txt:*abc*\@*

  3. To note, the following query also gives 0 results, as if the @ character does not exist in the data:
email_txt:*@*

Now the actual question. Is @ a special character? If so, how can it be escaped? If not, what am I doing wrong in the query?


Note: I am using Solr version 6.3.0; the documentation I linked is for 6.6 (the closest available).


Solution

  • When you're using the StandardTokenizer (which the default field types text_general, text_en, etc. use), the content is split into separate tokens wherever an @ sign occurs. That means that for your example, there are actually two or three tokens being stored: (izz and helpmeabc.com) or (izz, helpmeabc and com).

    A wildcard match is applied against the individual tokens (unless you are using the complex phrase query parser), and no tokenization or filtering takes place on the wildcard term (except for multi-term aware filters such as the lowercase filter).

    The effect is that your query, *abc*@*, attempts to match a token containing @, but since indexing splits on @ and separates the tokens at that character, no token contains @, and thus you get no hits.

    You can use the string field type, or a KeywordTokenizer paired with filters such as the lowercase filter, to keep the original input more or less intact as a single token instead.
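
The behaviour described above can be imitated outside Solr with a small Python sketch. This is not Solr code: `rough_standard_tokens` is a hypothetical, deliberately crude stand-in for the StandardTokenizer plus a lowercase filter (it splits only on `@` and whitespace), `keyword_tokens` imitates a string field or KeywordTokenizer by keeping each value whole, and `fnmatchcase` plays the role of the wildcard match against each token. The addresses other than izz@helpmeabc.com are made up for illustration.

```python
import re
from fnmatch import fnmatchcase


def rough_standard_tokens(text):
    """Crude approximation of StandardTokenizer + lowercase filter.

    Assumption: splitting on '@' and whitespace is enough to show the
    effect; the real tokenizer follows more elaborate rules.
    """
    return [tok for tok in re.split(r"[@\s]+", text.lower()) if tok]


def keyword_tokens(text):
    """Approximation of a string field / KeywordTokenizer + lowercase filter:
    the whole value is kept as a single token."""
    return [text.strip().lower()]


def wildcard_hits(pattern, values, tokenize):
    """Return the values for which at least one token matches the pattern."""
    return [
        value
        for value in values
        if any(fnmatchcase(token, pattern) for token in tokenize(value))
    ]


# Hypothetical sample data; only izz@helpmeabc.com comes from the answer.
emails = ["abc@example.com", "xabcx@example.org", "izz@helpmeabc.com"]

# No token contains '@' after tokenizing, so nothing matches.
print(wildcard_hits("*abc*@*", emails, rough_standard_tokens))

# Every token is checked on its own, so the domain token also matches.
print(wildcard_hits("*abc*", emails, rough_standard_tokens))

# With the value kept whole, the pattern can require 'abc' before the '@'.
print(wildcard_hits("*abc*@*", emails, keyword_tokens))
```

With the value kept as one token, the original `*abc*@*` pattern matches only the addresses whose username contains abc, which is the result the question was after.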