regex ruby string split acronym

Ruby: break string into words by capital letters and acronyms


I need to break a string into several strings at capital letters and acronyms. I could do this:

myString.scan(/[A-Z][a-z]+/)

But this only matches a capital letter followed by lowercase letters. In cases like:

QuickFoxReadingPDF

or

LazyDogASAPSleep

The all-capital acronyms are missing in the result.
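
For illustration, here is a minimal sketch of what the pattern above returns for these two inputs. The constant and method names are made up for this example; only the regex comes from the post.

```ruby
# The pattern from the question: one uppercase ASCII letter followed by
# one or more lowercase ASCII letters.
CAPITALIZED_WORD = /[A-Z][a-z]+/

# Hypothetical helper, named here only for illustration.
def capitalized_words(str)
  str.scan(CAPITALIZED_WORD)
end

p capitalized_words('QuickFoxReadingPDF') # => ["Quick", "Fox", "Reading"]
p capitalized_words('LazyDogASAPSleep')   # => ["Lazy", "Dog", "Sleep"]
```

"PDF" and "ASAP" are dropped because no uppercase letter inside them is followed by a lowercase one, so `[A-Z][a-z]+` never matches there.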

What should I change the RegEx to, or are there any alternatives?

Thanks!

Update:

Later I found that some of my data contains digits, like "RabbitHole3". It would be great if the solution could handle digits too, i.e. ["Rabbit", "Hole3"].


Solution

  • Use

    s.split(/(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})/)
    

    See proof.

    Explanation

    --------------------------------------------------------------------------------
      (?<=                     look behind to see if there is:
    --------------------------------------------------------------------------------
        \p{Ll}                 any lowercase letter
    --------------------------------------------------------------------------------
      )                        end of look-behind
    --------------------------------------------------------------------------------
      (?=                      look ahead to see if there is:
    --------------------------------------------------------------------------------
        \p{Lu}                 any uppercase letter
    --------------------------------------------------------------------------------
      )                        end of look-ahead
    --------------------------------------------------------------------------------
      |                        OR
    --------------------------------------------------------------------------------
      (?<=                     look behind to see if there is:
    --------------------------------------------------------------------------------
        \p{Lu}                 any uppercase letter
    --------------------------------------------------------------------------------
      )                        end of look-behind
    --------------------------------------------------------------------------------
      (?=                      look ahead to see if there is:
    --------------------------------------------------------------------------------
        \p{Lu}\p{Ll}           any uppercase letter, any lowercase letter
    --------------------------------------------------------------------------------
      )                        end of look-ahead
    

    Ruby code:

    str = 'QuickFoxReadingPDF'
    p str.split(/(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})/)
    

    Results: ["Quick", "Fox", "Reading", "PDF"]
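
As a further check, the same pattern can be tried on the other inputs from the question. This is a sketch, not part of the original answer: the names `WORD_BOUNDARY` and `split_words` are hypothetical, and the regex is the one above written in extended (`/x`) mode so each alternative can carry a comment.

```ruby
# Same pattern as in the answer, in extended mode (whitespace and
# comments inside the literal are ignored by the regex engine).
WORD_BOUNDARY = /
  (?<=\p{Ll})(?=\p{Lu})          # after a lowercase letter, before an uppercase one
  |
  (?<=\p{Lu})(?=\p{Lu}\p{Ll})    # after an uppercase letter, before a capitalized word
/x

# Hypothetical helper, named here only for illustration.
def split_words(str)
  str.split(WORD_BOUNDARY)
end

p split_words('QuickFoxReadingPDF') # => ["Quick", "Fox", "Reading", "PDF"]
p split_words('LazyDogASAPSleep')   # => ["Lazy", "Dog", "ASAP", "Sleep"]
p split_words('RabbitHole3')        # => ["Rabbit", "Hole3"]
```

Digits are never treated as boundaries, so a trailing digit stays attached to the word before it, which matches the `["Rabbit", "Hole3"]` output requested in the update. The flip side is that an input such as `'Hole3Rabbit'` is not split at all. If that case matters, a third alternative like `(?<=\d)(?=\p{Lu})` could be added, though that goes beyond what the question asked for.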