Finding title case within sentence using regex


I am trying to use a regex to extract title-cased words and phrases that occur within sentences.

Effort so far:

(?:[A-Z][a-z]+\s?)+  

When applied to the sample sentence below, this regex matches This, Sample Sentence, Real Value, Whether, and Not. However, I need to ignore sentence starters such as This and Whether.

Sample Sentence:

This is a Sample Sentence to check the Real Value of this code. Whether it works or Not depends upon the result.
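
To make the problem concrete, here is a minimal sketch that runs the attempt above against the sample sentence with Python's built-in re module. The variable names are illustrative only and not part of the original question.

```python
import re

text = ('This is a Sample Sentence to check the Real Value of this code. '
        'Whether it works or Not depends upon the result.')

# The attempt from the question: one or more capitalized words, each
# optionally followed by a whitespace character.
attempt = r'(?:[A-Z][a-z]+\s?)+'

# strip() drops the trailing space that \s? pulls into each match.
found = [m.group(0).strip() for m in re.finditer(attempt, text)]
print(found)
# ['This', 'Sample Sentence', 'Real Value', 'Whether', 'Not']
```

The sentence starters This and Whether show up because nothing in the pattern looks at what precedes a capitalized word.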

Expectation:

Only Sample Sentence, Real Value, and Not should be matched.

Useful code:

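# The third-party `regex` module (pip install regex) is used instead of the
# built-in `re`, because the lookbehind below has alternatives of different
# widths, which `re` rejects.
import regex as re

text = 'This is a Sample Sentence to check the Real Value of this code. Whether it works or Not depends upon the result. A State Of The Art Technology is needed to do this work.'
# Skip capitalized words at the start of the text or right after ". ", "! " or "? ".
rex = r'(?<!^|[.!?]\ )\b[A-Z][a-z]+(?:\ [A-Z][a-z]+)*\b'

matches = re.finditer(rex, text)
results = [match[0] for match in matches]
print(results)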

Result:

['Sample Sentence', 'Real Value', 'Not', 'State Of The Art Technology']
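
If installing the third-party regex module is not an option, the same idea can probably be expressed with the built-in re module. The sketch below assumes that splitting the single lookbehind into two consecutive fixed-width lookbehinds is acceptable: "not (start of text or sentence-ending punctuation plus space)" is the same condition as "not start of text, and not sentence-ending punctuation plus space". This rewrite is not part of the original answer.

```python
import re

text = ('This is a Sample Sentence to check the Real Value of this code. '
        'Whether it works or Not depends upon the result. '
        'A State Of The Art Technology is needed to do this work.')

# (?<!^)        -> not at the very start of the text
# (?<![.!?] )   -> not right after ".", "!" or "?" followed by a space
# Each lookbehind has a fixed width (0 and 2), which the built-in re accepts.
rex = r'(?<!^)(?<![.!?] )\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b'

results = [m.group(0) for m in re.finditer(rex, text)]
print(results)
# ['Sample Sentence', 'Real Value', 'Not', 'State Of The Art Technology']
```

The lone "A" is never matched because the pattern requires at least one lowercase letter after the capital.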

Solution

  • Assuming your regex flavor supports Lookbehinds, I would use something like this:

    (?<!^|\.\ )\b[A-Z][a-z]+(?:\ [A-Z][a-z]+)*\b
    


    This matches title-cased words that are preceded by an abbreviation, punctuation, or pretty much anything other than a period followed by a space (the end of the previous sentence). The ^ alternative also skips the first word of the text.


    Edit:

    As per Nick's suggestion in the comments, it's probably better to include ! and ? in the Lookbehind to support sentences ending with either of them, not just the period:

    (?<!^|[.!?]\ )\b[A-Z][a-z]+(?:\ [A-Z][a-z]+)*\b
    

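
The effect of adding ! and ? to the lookbehind can be checked with a short sketch like the one below. The test sentence is made up for illustration, and each lookbehind is split into two fixed-width ones so that Python's built-in re module accepts it; both are assumptions rather than part of the answer.

```python
import re

# Made-up text whose sentences end with "?", "!" and ".".
text = 'Does it work? Maybe the New Pattern helps! Then the Old Pattern is not needed.'

# Title-cased word or phrase, shared by both variants.
phrase = r'\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b'

period_only = r'(?<!^)(?<!\. )' + phrase     # first version of the answer
extended = r'(?<!^)(?<![.!?] )' + phrase     # version after the edit

print(re.findall(period_only, text))
# ['Maybe', 'New Pattern', 'Then', 'Old Pattern']
print(re.findall(extended, text))
# ['New Pattern', 'Old Pattern']
```

With the period-only lookbehind, Maybe and Then are still reported because they follow "? " and "! "; the extended character class filters them out.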