solr lucene datastax-enterprise

solr query to target mixed case string


I have some stored email addresses that are incorrectly formatted: they have mixed case in their domain. I need to pull back all resources with mixed-case domains so that I can correct them. This is a special case to fix broken data.

I haven't the first clue how to go about this query, or whether it's even possible.


Solution

  • You can run a regular expression query that tries to match a lowercase character next to an uppercase character. Whether it works depends on exactly how the address is stored: the regex is applied to each indexed token, so if there's a LowerCaseFilterFactory in the analysis chain, I'm guessing it won't find any hits:

    # retrieve all documents that have a lowercase letter followed by an uppercase letter
    q=email:/.*[a-z][A-Z].*/

    # retrieve all documents that have an uppercase letter followed by a lowercase letter
    q=email:/.*[A-Z][a-z].*/

    The two queries can match the same documents, so run them one after the other (correct the hits from the first before running the second) to avoid doing the same work twice.
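As a rough illustration of the workflow above, here is a minimal Python sketch. It only builds the request parameters and post-processes results; it does not contact Solr. The field names `email` and `id`, the `rows` value, and all three helper functions are assumptions made for the example, not something from the question. Note that the regexes above look at the whole address, so a mixed-case local part (the bit before the `@`) would match too; the `has_mixed_case_domain` helper is one possible way to narrow the hits down to the domain on the client side.

```python
from urllib.parse import urlencode

# Lucene regexes must match the whole term, hence the leading and trailing ".*".
MIXED_CASE_PATTERNS = (
    ".*[a-z][A-Z].*",  # lowercase letter followed by an uppercase letter
    ".*[A-Z][a-z].*",  # uppercase letter followed by a lowercase letter
)


def build_params(field, pattern, rows=100):
    """Return URL-encoded parameters for one regex query against `field`.

    The "id" field in `fl` is an assumed unique key; adjust to your schema.
    """
    return urlencode({"q": f"{field}:/{pattern}/", "fl": f"id,{field}", "rows": rows})


def has_mixed_case_domain(email):
    """True if the part after the last '@' has both lower- and uppercase letters.

    An all-uppercase domain is not "mixed" by this definition and returns False.
    """
    _, sep, domain = email.rpartition("@")
    if not sep:  # no '@' at all, so there is no domain to inspect
        return False
    return any(c.islower() for c in domain) and any(c.isupper() for c in domain)


def merge_hits(*batches, key="id"):
    """Merge result batches, keeping the first occurrence of each document."""
    seen = {}
    for batch in batches:
        for doc in batch:
            seen.setdefault(doc[key], doc)
    return list(seen.values())
```

If you would rather fetch both result sets up front instead of running the queries in sequence, `merge_hits` removes the overlap by document id, and filtering the merged list with `has_mixed_case_domain` drops addresses whose only mixed-case part is before the `@`.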