I'd like to query a Solr index for text containing a given Gmail address. I'd like to search by the canonical Gmail address and get back any results that Gmail interprets as the same address.
Example
Searching for [email protected]
should match all of the following strings:
But not match:
Is this possible with a regex or some other way?
Note: Info about Gmail's "dots don't matter" rule and plus-sign extensions can be found at https://gmail.googleblog.com/2008/03/2-hidden-ways-to-get-more-from-your.html
If you know that a particular kind of search will be a requirement, handle it at index time for efficiency.
So, extract those email addresses and put them into a separate field, normalizing them on the way in (remove the dots and strip any +anything suffix from the local part). Then search both fields, possibly boosting the email one.
You may find it easier to use UAX29URLEmailTokenizerFactory together with TypeTokenFilterFactory (as a whitelist on the email token type) to keep just the email addresses in the copied field.
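If you would rather do the normalization in your indexing client than in the Solr analysis chain, the pre-processing step could look roughly like the sketch below. It is only an illustration: the function names, the loose email regex, and the choice to treat only gmail.com as a Gmail domain are my assumptions, not anything Solr provides.

```python
import re

# Assumption: only this domain gets Gmail's dot/plus treatment.
GMAIL_DOMAINS = {"gmail.com"}

# Deliberately loose pattern for illustration; not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def canonical_gmail(address):
    """Return the canonical form of a Gmail address, or None for anything else."""
    local, _, domain = address.strip().lower().rpartition("@")
    if not local or domain not in GMAIL_DOMAINS:
        return None
    local = local.split("+", 1)[0]  # drop the "+anything" extension
    local = local.replace(".", "")  # dots don't matter
    if not local:
        return None
    return f"{local}@{domain}"


def extract_canonical_gmails(text):
    """Collect the distinct canonical Gmail addresses found in text, sorted."""
    found = {canonical_gmail(match) for match in EMAIL_RE.findall(text)}
    found.discard(None)
    return sorted(found)
```

You would store the result of `extract_canonical_gmails(document_text)` in the separate field (say, a hypothetical multi-valued string field called `gmail_canonical`) and run the user's search term through `canonical_gmail` before querying that field.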