I have the following list of strings in Python:
LIST1=["AR BR_18_0138249", "AR R_16_01382649", "BR 16 0138264", "R 16 01382679" ]
In the list above, the codes are alphanumeric, but in the last two strings the parts are separated by spaces instead of underscores. I expect the following output:
"AR BR_18_0138249"
"AR R_16_01382649"
"BR 16 0138264"
"R 16 01382679"
I have tried the following code:
import regex as re
pattern = r"(\bB?R_\w+)(?!.*\1)|(\bB?R \w+)(?!.*\1)|(\bR?^sd \w+)(?!.*\1)"
for i in LIST1:
    rest = re.search(pattern, i)
    if rest:
        print(rest.group(1))
I have obtained the following result:
BR_18_0138249
R_16_01382649
None
None
I am unable to get the sequences that contain spaces. Could someone guide me on this?
You can use
\b(B?R(?=([\s_]))(?:\2\d+)+)\b(?!.*\b\1\b)
See the regex demo.
Details
\b - a word boundary
(B?R(?=([\s_]))(?:\2\d+)+) - Group 1: an optional B, then R, then one or more sequences of a whitespace or underscore followed with one or more digits (if you need to support letters here, replace \d+ with [^\W_]+). The lookahead captures the separator that follows R into Group 2, and \2 requires every sequence to start with that same separator.
\b - a word boundary
(?!.*\b\1\b) - a negative lookahead that fails the match if there are
    .* - any zero or more chars other than line break chars, as many as possible
    \b\1\b - the same value as in Group 1 matched as a whole word (not enclosed with letters, digits or underscores).
See a Python re demo (you do not need the PyPi regex module here):
import re
LIST1=["AR BR_18_0138249", "AR R_16_01382649", "BR 16 0138264", "R 16 01382679" ]
pattern = r"\b(B?R(?=([\s_]))(?:\2\d+)+)\b(?!.*\b\1\b)"
for i in LIST1:
    rest = re.search(pattern, i)
    if rest:
        print(rest.group(1))
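The demo above never exercises the negative lookahead, because no code repeats within a single string. As a rough sketch of that part, the same pattern can be wrapped in a small helper that collects every match. The name `extract_codes` and the third test string are made up for illustration and are not part of the original question.

```python
import re

# Same pattern as in the demo above, compiled once for reuse.
PATTERN = re.compile(r"\b(B?R(?=([\s_]))(?:\2\d+)+)\b(?!.*\b\1\b)")

def extract_codes(text):
    """Return all Group 1 matches found in text (hypothetical helper)."""
    # finditer is used rather than findall: the pattern has two capturing
    # groups, so findall would return (group 1, group 2) tuples.
    return [m.group(1) for m in PATTERN.finditer(text)]

LIST1 = ["AR BR_18_0138249", "AR R_16_01382649", "BR 16 0138264", "R 16 01382679"]
for s in LIST1:
    print(extract_codes(s))

# Illustrative input (an assumption): the first code appears twice.
# The lookahead rejects the earlier copy, so only the last one is kept.
print(extract_codes("BR_18_0138249 and R 16 01382679 and BR_18_0138249"))
```

With the repeated input, the helper should return the space-separated code first and the repeated code once, from its last position, since the earlier occurrence is the one the lookahead filters out.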