I am new to regular expressions in Python and have the following question. Given this list:
myTry=['a bb Aas','aa 1 Aasdf','aa bb (cc) AA','aaa ASD','aa . ASD','aaaa 1 bb Aas']
I want to find the substring that comes before the capitalized word (the one starting with A in this example). The substring may contain multiple words and parentheses, but it must not contain numbers or a period (.). So, in this example, the following strings in myTry should be detected:
'a bb Aas'
'aa bb (cc) AA'
'aaa ASD'
The result should be:
'a bb'
'aa bb (cc)'
'aaa'
I have no idea how to use regex to define a pattern that includes some things and excludes others at the same time, especially for the first and last strings, 'a bb Aas' and 'aaaa 1 bb Aas': I want the first one but not the second. I do not know in advance how many words there will be or how many of them will be numbers. But if there is any number or period before the capital, I do not want that string.
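You can use two regex operations. The first filters out invalid strings by keeping only those that consist entirely of letters, whitespace, and parentheses: ^[a-zA-Z\s\(\)]*$
The second collects the desired substrings using a lazy match followed by a positive lookahead for a space and a capital letter: .*?(?= [A-Z])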
import re
my_try = ['a bb Aas','aa 1 Aasdf','aa bb (cc) AA','aaa ASD','aa . ASD','aaaa 1 bb Aas']
filtered = [x for x in my_try if re.match(r'^[a-zA-Z\s\(\)]*$', x)]
result = [re.match(r'.*?(?= [A-Z])', x).group(0) for x in filtered]
print(result) # => ['a bb', 'aa bb (cc)', 'aaa']
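To see what each of the two steps contributes, here is a minimal sketch that records the outcome for every input string instead of silently dropping the rejected ones. The names `allowed`, `prefix`, and `report` are introduced here for illustration only; the patterns are the same two as above.

```python
import re

my_try = ['a bb Aas', 'aa 1 Aasdf', 'aa bb (cc) AA', 'aaa ASD', 'aa . ASD', 'aaaa 1 bb Aas']

# Step 1: the whole string may contain only letters, whitespace and parentheses.
allowed = re.compile(r'^[a-zA-Z\s\(\)]*$')
# Step 2: lazily match up to (but not including) a space followed by a capital.
prefix = re.compile(r'.*?(?= [A-Z])')

report = {}
for s in my_try:
    if not allowed.match(s):
        report[s] = None  # rejected by the filter (contains a digit or a period)
        continue
    m = prefix.match(s)
    report[s] = m.group(0) if m else None

for s, found in report.items():
    print(repr(s), '->', repr(found))
```

The strings containing `1` or `.` map to `None` because they never reach the second step; the lookahead is what keeps the space and the capitalized word out of the returned substring.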
If you anticipate that some strings might pass the filter (that is, contain only alphabetical characters, parentheses, or whitespace) but might not match the lookahead, you'll need to filter the intermediate result, because re.match returns None for those strings and calling .group(0) on None raises an AttributeError:
import re
# The trailing '' passes the filter but has no match for the lookahead
my_try = ['a bb Aas','aaa ASD','aa . ASD','aaaa 1 bb Aas', '']
filtered = [x for x in my_try if re.match(r'^[a-zA-Z\s\(\)]*$', x)]
matches = [re.match(r'.*?(?= [A-Z])', x) for x in filtered]
result = [x.group(0) for x in matches if x]
print(result) # => ['a bb', 'aaa']
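As an alternative sketch (a different technique from the two-step approach above, not a drop-in replacement), the filter can be folded into the extraction by restricting which characters are allowed before the capital. The function name `extract_prefixes` and the pattern below are assumptions made for illustration. Note the difference in behavior: this version only checks the part before the first space-plus-capital, so digits or periods that appear after the capital do not cause a rejection.

```python
import re

# Only lowercase letters, whitespace and parentheses may precede " <Capital>".
# re.match anchors at the start, so any digit or period before the capital
# makes the match fail for that string.
PREFIX = re.compile(r'[a-z\s()]*?(?= [A-Z])')

def extract_prefixes(strings):
    """Return the part before the first ' <Capital>' for each string that qualifies."""
    matches = (PREFIX.match(s) for s in strings)
    return [m.group(0) for m in matches if m]

my_try = ['a bb Aas', 'aa 1 Aasdf', 'aa bb (cc) AA', 'aaa ASD', 'aa . ASD', 'aaaa 1 bb Aas', '']
print(extract_prefixes(my_try))
```

Whether tolerating digits after the capital is desirable depends on the data; if the whole string must be clean, the two-step version remains the safer choice.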