
Regex python find uppercase names


I have a text file of the type:

[...speech...]

NAME_OF_SPEAKER_1: [...speech...]

NAME_OF_SPEAKER_2: [...speech...]

My aim is to isolate the speeches of the various speakers. They are clearly identified because the name of each speaker is always written in uppercase letters (name + surname). However, the speeches can contain nouns (not people's names) that are also in uppercase letters. Only one of those words is long enough to cause me a problem (it has four letters; say it is 'ABCD'). I was thinking of identifying the position of each speaker's name (I assume every name is at least 3 letters long) with something like

re.search('[A-Z^(ABCD)]{3,}',text_to_search)

in order to exclude that specific (constant) word 'ABCD'. However, the command identifies that word instead of excluding it. Any ideas about how to overcome this problem?


Solution

  • In the pattern that you tried, you get partial matches because there are no word boundaries, and [A-Z^(ABCD)]{3,} matches 3 or more consecutive characters taken from the listed set.

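    The range A-Z already covers A, B, C and D, and inside a character class ^ (when it is not the first character), ( and ) are literal characters, so the class could also be written as [A-Z^)(]{3,}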
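
To see this behaviour concretely, here is a minimal sketch. The sample string is made up for illustration, and 'ABCD' stands in for the constant word from the question.

```python
import re

# Hypothetical sample line; "ABCD" is the constant word we would like to skip.
sample = "the ABCD report. JOHN SMITH: hello"

attempt = re.compile(r"[A-Z^(ABCD)]{3,}")

# Inside [...], ^ is literal unless it is the first character, and the
# parentheses are literal too, so this class is just A-Z plus ^, ( and ).
first = attempt.search(sample)
print(first.group())  # ABCD

# The class therefore also matches runs made only of the extra literal characters.
print(attempt.fullmatch("^()") is not None)  # True
```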

    Instead of trying to negate inside the character class, you could use a negative lookahead (?!...) to assert that a word consisting only of the uppercase chars A-Z does not contain ABCD:

    \b(?![A-Z]*ABCD)[A-Z]{3,}\b
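
A quick way to check the pattern in Python is sketched below. The sample sentence and the names in it are invented for the example.

```python
import re

pattern = re.compile(r"\b(?![A-Z]*ABCD)[A-Z]{3,}\b")

# Hypothetical text: "ABCD" and "XABCDY" should be skipped, the names kept.
sample = "We read the ABCD report. JOHN SMITH: I agree. XABCDY too."

found = pattern.findall(sample)
print(found)  # ['JOHN', 'SMITH']

# Variant (not part of the pattern above): skip only the exact word ABCD,
# while still allowing longer uppercase words that merely contain it.
exact_only = re.compile(r"\b(?!ABCD\b)[A-Z]{3,}\b")
print(exact_only.findall(sample))  # ['JOHN', 'SMITH', 'XABCDY']
```

Note that the lookahead (?![A-Z]*ABCD) rejects every uppercase word that contains ABCD anywhere, not only the word ABCD itself. If that is too strict for your data, the exact-word variant may be closer to what you want.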
    

    If the name should start with 3 uppercase chars and can also contain lowercase chars, underscores or digits, you could add \w* after matching the 3 uppercase chars:

    \b(?![A-Z]*ABCD)[A-Z]{3}\w*\b
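
To get from the matches to the speeches themselves, one possible approach is to slice the text between consecutive matches. This is only a sketch: the sample text is invented, and it assumes each speaker label is followed by ": " as in the question's layout.

```python
import re

pattern = re.compile(r"\b(?![A-Z]*ABCD)[A-Z]{3}\w*\b")

# Hypothetical transcript following the layout shown in the question.
text = (
    "Opening remarks about ABCD.\n"
    "NAME_OF_SPEAKER_1: First speech, citing ABCD again.\n"
    "NAME_OF_SPEAKER_2: Second speech.\n"
)

matches = list(pattern.finditer(text))
speeches = []
# Pair each match with the next one; the last speech runs to the end of the text.
for current, following in zip(matches, matches[1:] + [None]):
    end = following.start() if following else len(text)
    # Drop the assumed ": " separator and surrounding whitespace from the slice.
    speech = text[current.end():end].lstrip(": ").strip()
    speeches.append((current.group(), speech))

for name, speech in speeches:
    print(name, "->", speech)
```

Any text before the first match (the opening remarks here) is left out. Keep in mind that every word starting with 3 uppercase letters, other than those containing ABCD, is treated as a speaker label, so this relies on the question's assumption that ABCD is the only such word inside the speeches.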
    
