Regex for matching only capitalized words stuck together (i.e. not separated by whitespace)


I have a long list of strings which are all random words, all of them capitalized, such as 'Pomegranate' and 'Yellow Banana'. However, some of them are stuck together, like so: 'AppleOrange'. There are no special characters or digits.

What I need is a regular expression in Python that matches 'Apple' and 'Orange' separately, but not 'Pomegranate' or 'Yellow'.

I'm very new to this, and so far I've only managed to write r"(?<!\s)([A-Z][a-z]*)", but that still matches 'Yellow' and 'Pomegranate'. How do I do this?


Solution

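  • If every word starts with an uppercase character followed by optional lowercase characters, you can use lookarounds and an alternation to match both cases: a word directly preceded by a lowercase letter, or a word directly followed by an uppercase letter.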

    (?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])
    

    The pattern matches:

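    • (?<=[a-z]) Positive lookbehind: assert a lowercase letter a-z directly to the left
    • [A-Z][a-z]* Match one uppercase letter A-Z followed by optional lowercase letters a-z
    • | Or
    • [A-Z][a-z]* Match one uppercase letter A-Z followed by optional lowercase letters a-z
    • (?=[A-Z]) Positive lookahead: assert an uppercase letter A-Z directly to the right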
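
    To see what each side of the alternation contributes, the two branches can be run on their own. This is only an illustrative sketch; the variable names after_lower and before_upper are made up for this example.

```python
import re

s = "AppleOrange\nPomegranate Yellow Banana"

# Branch 1: a capitalized word directly preceded by a lowercase letter.
after_lower = re.findall(r"(?<=[a-z])[A-Z][a-z]*", s)

# Branch 2: a capitalized word directly followed by an uppercase letter.
before_upper = re.findall(r"[A-Z][a-z]*(?=[A-Z])", s)

print(after_lower)   # ['Orange']
print(before_upper)  # ['Apple']
```

    Neither branch alone finds both words, which is why the pattern combines them with |.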

    Example

    import re
    
    # Match a capitalized word that directly follows or directly precedes another word
    pattern = r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])"
    s = "AppleOrange\nPomegranate Yellow Banana"
    
    print(re.findall(pattern, s))
    

    Output

    ['Apple', 'Orange']
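
    Since the question mentions a long list of strings, the same pattern could be compiled once and applied to each entry. A minimal sketch, assuming the data is a plain Python list; the names STUCK_WORD, words and stuck, and the sample entries, are hypothetical.

```python
import re

# Compile once, since the same pattern is reused for every string.
STUCK_WORD = re.compile(r"(?<=[a-z])[A-Z][a-z]*|[A-Z][a-z]*(?=[A-Z])")

# Hypothetical sample data in the shape described in the question.
words = ["Pomegranate", "Yellow Banana", "AppleOrange", "KiwiMangoLime"]

# Map each original string to the stuck-together words found in it.
stuck = {w: STUCK_WORD.findall(w) for w in words}

for original, parts in stuck.items():
    print(original, "->", parts)
```

    Entries with no stuck-together words map to an empty list, so they are easy to filter out afterwards.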
    

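    Another option is to match what you don't want first, so that it is consumed, and use a capture group for what you want to keep. The first alternative matches a standalone word (one with no non-whitespace character on either side) without capturing it; the second captures every remaining word. Then remove the empty entries from the result: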

    (?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)
    

    import re
    
    # Standalone words are matched but not captured; stuck-together words land in group 1
    pattern = r"(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)"
    s = "AppleOrange\nPomegranate Yellow Banana"
    
    print([x for x in re.findall(pattern, s) if x])
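
    To show why the empty entries need to be filtered out, here is a small sketch that prints the raw re.findall result before filtering. With a single capture group, re.findall returns the contents of that group, which is an empty string whenever the non-capturing alternative matched. The variable names raw and kept are made up for this example.

```python
import re

pattern = r"(?<!\S)[A-Z][a-z]*(?!\S)|([A-Z][a-z]*)"
s = "AppleOrange\nPomegranate Yellow Banana"

# One entry per match; standalone words contribute '' because group 1 did not participate.
raw = re.findall(pattern, s)
print(raw)   # ['Apple', 'Orange', '', '', '']

# Drop the empty strings to keep only the stuck-together words.
kept = [x for x in raw if x]
print(kept)  # ['Apple', 'Orange']
```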