I am using the regex pattern pattern = r'cig[\s:.]*(\w{10})' to extract the 10 characters after 'cig' in each line of my dataframe. This pattern handles every case except those where the 10-character substring itself contains spaces.
For example, I am trying to extract Z9F27D2198 from the string
/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031
Stack Overflow seems to have collapsed the whitespace in the string above: there should be 17 whitespace characters between F and 2, after CIG.
Could you help me edit the regex pattern to account for whitespace inside that 10-character substring? I am also using flags=re.I to ignore case in my re.findall calls.
Here is an example string for which this pattern works:
CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E
It outputs what I want: 7826328A2B.
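The difference between the two cases can be reproduced with a short sketch. It assumes the gap really is 17 spaces, rebuilt here with " " * 17, and the variable names are only illustrative:

```python
import re

# Original pattern from the question: 10 consecutive word characters after "cig".
old_pattern = r'cig[\s:.]*(\w{10})'

# String where the code has no inner whitespace.
works = "CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E"
# String where the code is split by whitespace (17 spaces assumed, as described above).
fails = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 17 + "27D2198 01762-0000031"

found_ok = re.findall(old_pattern, works, flags=re.I)
found_bad = re.findall(old_pattern, fails, flags=re.I)

print(found_ok)   # ['7826328A2B']
print(found_bad)  # []
```

The second call returns an empty list because \w{10} cannot cross the whitespace between Z9F and 27D2198.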
Thanks in advance.
You can use
r'(?i)cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
See the regex demo. Details:
cig - a cig string
[\s:.]* - zero or more whitespaces, : or .
(\S(?:\s*\S){9}) - Group 1: a non-whitespace char and then nine occurrences of zero or more whitespaces followed by a non-whitespace char
(?!\S) - immediately to the right, there must be a whitespace or end of string.
In Python, you can use
import re
text = "/BENEF/FORNITURA GAS FEB-20 CIG Z9F 27D2198 01762-0000031"
pattern = r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)'
matches = re.finditer(pattern, text, re.I)
for match in matches:
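    # Strip the inner whitespace from the captured group before printing it
    print(re.sub(r'\s+', '', match.group(1)), ' found at ', match.span(1))
# => Z9F27D2198  found at  (32, 43) for the text as shown above; a wider gap between F and 2 moves the end offset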
See the Python demo.
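Since the question uses re.findall on many lines, the same idea could be wrapped in a small helper. This is only a sketch: the names CIG_RE and extract_cigs are hypothetical, and the 17-space gap is rebuilt from the description in the question.

```python
import re

# Compiled once; re.I makes "cig" match "CIG" as well.
CIG_RE = re.compile(r'cig[\s:.]*(\S(?:\s*\S){9})(?!\S)', re.I)

def extract_cigs(line):
    """Return every code found after 'cig' in line, with inner whitespace removed."""
    # findall returns the Group 1 text of each match because the pattern has one group.
    return [re.sub(r'\s+', '', code) for code in CIG_RE.findall(line)]

lines = [
    "/BENEF/FORNITURA GAS FEB-20 CIG Z9F" + " " * 17 + "27D2198 01762-0000031",
    "CIG7826328A2B FORNITURA ENERGIA ELETTRICA U TENZE COMUNALI CONVENZIONE CONSIP E",
]

results = [extract_cigs(line) for line in lines]
print(results)  # [['Z9F27D2198'], ['7826328A2B']]
```

With a pandas dataframe, something like df['col'].apply(extract_cigs) should give one list per row, where 'col' stands in for whatever the text column is actually called.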