I need to find all the words in a file that start with an uppercase letter. I tried the code below, but it returns no matches.
import os
import re
matches = []
filename = 'C://Users/Documents/romeo.txt'
with open(filename, 'r') as f:
    for line in f:
        regex = "^[A-Z]\w*$"
        matches.append(re.findall(regex, line))
print(matches)
File:
Hi, How are You?
Expected output:
[Hi,How,You]
You can use a word boundary instead of the anchors ^ and $:
\b[A-Z]\w*
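To see why the anchored pattern finds nothing, here is a small sketch that runs both patterns against the sample line from the question. The variable names `anchored` and `bounded` are only illustrative.

```python
import re

line = "Hi, How are You?\n"

# ^ and $ tie the pattern to the whole line, so it can only match
# when the line consists of a single capitalized word.
anchored = re.findall(r"^[A-Z]\w*$", line)

# \b lets the pattern start at the beginning of any word in the line.
bounded = re.findall(r"\b[A-Z]\w*", line)

print(anchored)  # []
print(bounded)   # ['Hi', 'How', 'You']
```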
Note that matches.append adds a single item to the list, and re.findall returns a list, so you end up with a list of lists. Use += (or matches.extend) to add the matched words themselves.
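The difference is easy to see side by side. This is a minimal sketch; the sample lines in `lines` are made up for illustration and stand in for the lines of the file.

```python
import re

regex = r"\b[A-Z]\w*"
# Hypothetical sample lines, standing in for the contents of the file.
lines = ["Hi, How are You?", "no capitals here", "Romeo and Juliet"]

nested = []
flat = []
for line in lines:
    nested.append(re.findall(regex, line))  # adds one list per line
    flat.extend(re.findall(regex, line))    # adds the words themselves

print(nested)  # [['Hi', 'How', 'You'], [], ['Romeo', 'Juliet']]
print(flat)    # ['Hi', 'How', 'You', 'Romeo', 'Juliet']
```

Using `flat += re.findall(regex, line)` has the same effect as `flat.extend(...)` here.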
import re
matches = []
regex = r"\b[A-Z]\w*"
filename = r'C:\Users\Documents\romeo.txt'
with open(filename, 'r') as f:
    for line in f:
        matches += re.findall(regex, line)
print(matches)
Output
['Hi', 'How', 'You']
If there should be a whitespace boundary to the left, you could also use
(?<!\S)[A-Z]\w*
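As a rough illustration of how the two boundaries differ, the sketch below uses a made-up sample string in which some capitalized words follow punctuation rather than whitespace.

```python
import re

# Hypothetical sample text: "How" follows "(" and "You" follows "-".
text = "Hi, (How) are-You Today?"

# \b matches at the start of any word, whatever precedes it.
word_boundary = re.findall(r"\b[A-Z]\w*", text)

# (?<!\S) requires start of string or whitespace directly to the left.
whitespace_boundary = re.findall(r"(?<!\S)[A-Z]\w*", text)

print(word_boundary)        # ['Hi', 'How', 'You', 'Today']
print(whitespace_boundary)  # ['Hi', 'Today']
```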
If you don't want to match words that consist only of uppercase chars, you could use, for example, a negative lookahead to assert that what follows is not only uppercase chars up to a word boundary:
\b[A-Z](?![A-Z]*\b)\w*
\b  A word boundary to prevent a partial match
[A-Z]  Match an uppercase char A-Z
(?![A-Z]*\b)  Negative lookahead, assert not only uppercase chars followed by a word boundary
\w*  Match optional word chars

To match a word that starts with an uppercase char, and does not contain any more uppercase chars:
\b[A-Z][^\WA-Z]*\b
\b  A word boundary
[A-Z]  Match an uppercase char A-Z
[^\WA-Z]*  Optionally match a word char without chars A-Z
\b  A word boundary
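To compare the patterns, here is a short sketch that runs the basic pattern and the two stricter variants over one made-up sentence. The sentence and the variable names are assumptions chosen to include an all-uppercase word (NASA) and a word with a second uppercase char (McCoy).

```python
import re

# Hypothetical sample sentence for illustration.
text = "NASA sent Apollo to the Moon, said McCoy"

# Any word starting with an uppercase char.
starts_upper = re.findall(r"\b[A-Z]\w*", text)

# Same, but skip words made up of uppercase chars only.
not_all_upper = re.findall(r"\b[A-Z](?![A-Z]*\b)\w*", text)

# Uppercase first char, no further uppercase chars A-Z in the word.
single_upper = re.findall(r"\b[A-Z][^\WA-Z]*\b", text)

print(starts_upper)   # ['NASA', 'Apollo', 'Moon', 'McCoy']
print(not_all_upper)  # ['Apollo', 'Moon', 'McCoy']
print(single_upper)   # ['Apollo', 'Moon']
```

Note that a single uppercase letter on its own, such as "I", also counts as "only uppercase chars" and is skipped by the lookahead variant.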