How can one query Solr for a Gmail address, ignoring dots and plus signs?


I'd like to query a Solr index for text containing a given Gmail address. I'd like to search by the canonical Gmail address and get any results that Gmail interprets as the same address.

Example

Searching for [email protected] should match all of the following strings:

But not match:

Is this possible with a regex or some other way?


Note: Info about Gmail's "dots don't matter" behavior and plus sign extensions can be found at https://gmail.googleblog.com/2008/03/2-hidden-ways-to-get-more-from-your.html


Solution

  • If you know in advance that a particular kind of search will be required, handle it at index time; that is more efficient than trying to compensate at query time.

    So, extract the email addresses and copy them into a separate field that normalizes them (remove the dots from the local part and drop the plus sign and everything after it, up to the @). Then search both fields, possibly boosting matches on the email field.

    You may find it easier to use UAX29URLEmailTokenizerFactory together with TypeTokenFilterFactory (configured as a whitelist on the email token type) to keep just the email addresses in the copied field.
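
The normalization described above can also be done on the client before documents are sent to Solr. Here is a minimal Python sketch, not part of the original answer. The function names (canonical_gmail, canonical_gmails_in) and the deliberately simple email pattern are assumptions made for illustration, and only the gmail.com domain is handled.

```python
import re

# Deliberately simple email pattern for illustration only; it is an assumption
# and does not cover every address that the email RFCs allow.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def canonical_gmail(address):
    """Return the canonical form of a Gmail address, or None if it is not one."""
    local, _, domain = address.strip().lower().rpartition("@")
    if not local or domain != "gmail.com":
        return None
    # Drop the "+anything" suffix first, then remove dots from what is left.
    local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@gmail.com" if local else None


def canonical_gmails_in(text):
    """Return the sorted, de-duplicated canonical Gmail addresses found in text."""
    candidates = (canonical_gmail(match) for match in EMAIL_RE.findall(text))
    return sorted({address for address in candidates if address})
```

The list returned by canonical_gmails_in could be stored in the separate email field at index time. Running the query address through canonical_gmail as well keeps both sides in the same form, so an exact match on that field is enough.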