I need to break a string into several strings at capital letters and acronyms. I could do this:
myString.scan(/[A-Z][a-z]+/)
But this only works for capitalized words. In cases like:
QuickFoxReadingPDF
or
LazyDogASAPSleep
The all-capital acronyms are missing in the result.
What should I change the RegEx to, or are there any alternatives?
Thanks!
Update:
Later I found that some of my data contains digits, like "RabbitHole3". It would be great if the solution could handle digits too, i.e. ["Rabbit", "Hole3"].
Use
s.split(/(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})/)
See proof.
Explanation
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\p{Ll} any lowercase letter
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
\p{Lu} any uppercase letter
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\p{Lu} any uppercase letter
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
\p{Lu}\p{Ll} any uppercase letter, any lowercase letter
--------------------------------------------------------------------------------
) end of look-ahead
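To see what each alternative contributes on its own, here is a small sketch that applies the two halves of the pattern separately. The constant names are only illustrative and are not part of the original pattern.

```ruby
# Illustrative names; each constant is one half of the alternation.
LOWER_THEN_UPPER  = /(?<=\p{Ll})(?=\p{Lu})/         # e.g. between "y" and "D"
ACRONYM_THEN_WORD = /(?<=\p{Lu})(?=\p{Lu}\p{Ll})/   # e.g. between "P" and "Sl"

s = 'LazyDogASAPSleep'
p s.split(LOWER_THEN_UPPER)   # ["Lazy", "Dog", "ASAPSleep"]
p s.split(ACRONYM_THEN_WORD)  # ["LazyDogASAP", "Sleep"]
```

The first half alone leaves the acronym glued to the following word, and the second half alone only separates an acronym from the word after it, which is why the two are combined with `|`.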
str = 'QuickFoxReadingPDF'
p str.split(/(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})/)
Results: ["Quick", "Fox", "Reading", "PDF"]
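The same pattern can be checked against the other inputs from the question, including the one with a digit. This is only a sketch: the helper name `split_words` is made up here for convenience. A digit is neither `\p{Ll}` nor `\p{Lu}`, so no split happens next to it and a trailing digit stays attached to its word.

```ruby
# split_words is a hypothetical helper wrapping the split shown above.
def split_words(str)
  str.split(/(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})/)
end

p split_words('LazyDogASAPSleep')  # ["Lazy", "Dog", "ASAP", "Sleep"]
p split_words('RabbitHole3')       # ["Rabbit", "Hole3"]
```

For the same reason, a digit followed directly by a new capitalized word would not be split. If the data can contain that case, one possible adjustment (an assumption, not tested against the real data) is to also accept a decimal digit in the first look-behind: `(?<=[\p{Ll}\p{Nd}])(?=\p{Lu})`.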