Newbie here. I have been trying to learn regex for some time, but I often can't understand how regex handles strings: a pattern seems to work when I plan it out, but the implementation doesn't behave as I expect.
Here is my little problem: I have strings that contain one or more names (team names). The problem is that if a string contains more than one name, there is no separator; the names are joined directly.
Some examples:
String | Contains | Names to be extracted
I want to capture each name in each string and use them in a loop later on. But I can't seem to implement the pattern I imagine for it.
The pattern I have in mind for these strings is like this:
I tried some code, in vain: step two captures only one instance and step three gives another.
re.findall('([A-Z0-9].*s)*([A-Z].*)+', 'RangersIslandersMolsDevil')
That returns only two names:
[('RangersIslandersMols', 'Devil')]
whereas I want four:
[Rangers, Islanders, Mols, Devil]
([A-Z0-9].*s)*
is greedy: the .* matches as many characters as it can and only backs off as far as the last 's', so 'RangersIslandersMols' gets stuck together as one match.
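To see the greedy behaviour in isolation, here is a small sketch using the string from the question. The variable names are my own, and the lazy variant is shown only for comparison, not as a suggested fix.

```python
import re

text = 'RangersIslandersMolsDevil'

# Greedy: .* runs to the end of the string, then backtracks to the LAST 's'.
greedy = re.findall(r'[A-Z0-9].*s', text)
print(greedy)  # ['RangersIslandersMols']

# Lazy: .*? stops at the FIRST 's', which cuts 'Islanders' short
# and still misses 'Devil' because it contains no 's' at all.
lazy = re.findall(r'[A-Z0-9].*?s', text)
print(lazy)  # ['Rangers', 'Is', 'Mols']
```

Either way, anchoring on a literal 's' is fragile, because not every name ends in one.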
It sounds like the boundary between team names is defined as a lowercase letter (not necessarily an 's', as in 'Avalanche') followed immediately by an uppercase letter or digit, so our pattern should look for one or more uppercase letters or digits followed by one or more lowercase letters.
Because a team name can have multiple words, we'll also look for a space followed by the same pattern as above, for any possible number of words.
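As a rough illustration of why the space-separated part matters, the sketch below compares a one-word pattern with the same pattern extended by an optional, repeatable " word" group. The sample string and the variable names are made up for this example.

```python
import re

text = 'DevilsWashington Football TeamAvalanche'

# One word only: uppercase letters/digits followed by lowercase letters.
single_word = r'[A-Z0-9]+[a-z]+'
print(re.findall(single_word, text))
# ['Devils', 'Washington', 'Football', 'Team', 'Avalanche']

# Same piece, plus a non-capturing group: a space and another word,
# repeated zero or more times.
multi_word = single_word + r'(?: [A-Z0-9]+[a-z]+)*'
print(re.findall(multi_word, text))
# ['Devils', 'Washington Football Team', 'Avalanche']
```

The group is non-capturing, (?: ... ), so that findall returns whole matches rather than only the contents of the group.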
Try this pattern:
>>> import re
>>> pattern = r'[A-Z0-9]+[a-z]+(?: [A-Z0-9]+[a-z]+)*'
>>> re.findall(pattern, "RangersIslandersDevils49ersWashington Football TeamAvalancheWarriors")
['Rangers', 'Islanders', 'Devils', '49ers', 'Washington Football Team', 'Avalanche', 'Warriors']
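Since the goal was to use the names in a loop, here is a minimal sketch of how that might look. The helper name split_team_names and the constant TEAM_NAME are hypothetical; only the pattern itself comes from the answer.

```python
import re

# Hypothetical helper; the pattern is the one suggested above,
# compiled once so it can be reused across many strings.
TEAM_NAME = re.compile(r'[A-Z0-9]+[a-z]+(?: [A-Z0-9]+[a-z]+)*')

def split_team_names(joined):
    """Return the team names found in a string that has no separators."""
    return TEAM_NAME.findall(joined)

# The string from the question now yields all four names.
for name in split_team_names('RangersIslandersMolsDevil'):
    print(name)
```

One limitation to keep in mind: every word has to contain at least one lowercase letter after its leading capitals or digits, so an all-caps word such as 'FC' would not be matched by this pattern.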