Search code examples
pythonregexpython-2.7python-3.xcapitalize

Python regex: Match ALL consecutive capitalized words


Short question:

I have a string:

title="Announcing Elasticsearch.js For Node.js And The Browser"

I want to find all pairs of words where each word is properly capitalized.

So, expected output should be:

['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']

What I have right now is this:

'[A-Z][a-z]+[\s-][A-Z][a-z.]*'

This gives me the output:

['Announcing Elasticsearch.js', 'For Node.js', 'And The']

How can I change my regex to give desired output?


Solution

  • You can use this:

    #!/usr/bin/python
    import re
    
    title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
    pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*))'
    
    print re.findall(pattern, title)
    

    A "normal" pattern can't match overlapping substrings, all characters are founded once for all. However, a lookahead (?=..) (i.e. "followed by") is only a check and match nothing. It can parse the string several times. Thus if you put a capturing group inside the lookahead, you can obtain overlapping substrings.