This is the input string:
text Expedien N0 18-00232995
$cat input_file
some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.:
tramite 1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content
This code works for the other strings, but I also want to handle the case where a keyword itself ends in a number (the keyword-with-number is present as an entry in the list) and fetch the number that follows it. In that case:
I am getting the output ('Expedien', 'N0'), but the expected output is ('Expedien N0', '18-00232995').
The code that fetches other entities is as follows:
import re
s="""your_text_here"""
my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien', 'Expedien N0']
rx = r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format('|'.join(map(re.escape,my_list)))
print(re.findall(rx, s))
Output:
[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('tramite', '1234567'), ('Expedien N°', '18-00777'), ('Expedien N°', '18-0022995')]
How do I get the output ('Expedien N0', '18-00232995') by changing the above regex?
Only a small change is needed to get your desired output. In your list,
my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien', 'Expedien N0']
you have placed Expedien before Expedien N0. Regex alternation tries the alternatives from left to right and takes the first one that lets the overall match succeed. So in text Expedien N0 18-00232995, the first group matches Expedien, \W* consumes the space, and N0 is captured by the second group (because [A-Z]*\d+ accepts a letter followed by a digit). The later alternative Expedien N0 is never tried. If you change the order in your list and place Expedien N0 before Expedien, then Expedien N0 matches the first group and 18-00232995 is captured in the second group, which gives you the desired result. Check the modified Python code below,
import re
s="""text Expedien N0 18-00232995
$cat input_file
some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.:
tramite 1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content"""
my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien N0', 'Expedien']
rx = r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format('|'.join(map(re.escape,my_list)))
print(rx)
print(re.findall(rx, s))
After the pattern itself, the findall call prints,
[('Expedien N0', '18-00232995'), ('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('tramite', '1234567'), ('Expedien N°', '18-00777'), ('Expedien N°', '18-0022995')]
The first tuple in the findall results, ('Expedien N0', '18-00232995'), is your intended one.
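If you would rather not depend on the order in which the list is written, one option is to sort the keywords by length (longest first) before building the alternation, so that a longer keyword such as Expedien N0 is always tried before its shorter prefix Expedien. This is only a sketch under that assumption; the helper name build_pattern is made up for illustration and is not part of the original code.

```python
import re

def build_pattern(keywords):
    # Hypothetical helper: put longer keywords first so that a keyword
    # is always tried before any shorter keyword that is its prefix.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = '|'.join(map(re.escape, ordered))
    # Same regex as above, only the order of the alternatives changes.
    return re.compile(
        r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format(alternation)
    )

# Original (unsorted) list from the question, with Expedien before Expedien N0.
my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No',
           'Expedien N°', 'Exp.No', 'Expedien', 'Expedien N0']

pattern = build_pattern(my_list)
result = pattern.findall('text Expedien N0 18-00232995')
print(result)  # [('Expedien N0', '18-00232995')]
```

With this, new keywords can be appended to my_list in any order without reintroducing the problem.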