Tags: python, regex, string-matching

Get the number that follows a matching keyword when the keyword itself consists of a word and a number


This is the input string:

text Expedien N0 18-00232995
$cat input_file
some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.: 
tramite  1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content

The code below works for the other strings. However, one entry in my keyword list is itself made up of a word and a number (Expedien N0), and I want to capture the number that follows that whole entry. For the first line of the input:

I get ('Expedien', 'N0'), but the expected output is ('Expedien N0', '18-00232995').

The code that fetches other entities is as follows:

import re
s="""your_text_here"""
my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien', 'Expedien N0']
rx = r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format('|'.join(map(re.escape,my_list)))
print(re.findall(rx, s))

Output:

[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('tramite', '1234567'), ('Expedien N°', '18-00777'), ('Expedien N°', '18-0022995')]

How do I get ('Expedien N0', '18-00232995') in the output by modifying the regex above?


Solution

  • Only a small change is needed to get the desired output. In your list,

    my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien', 'Expedien N0']
    

    Expedien appears before Expedien N0. Regex alternation tries the alternatives from left to right and keeps the first one that lets the overall pattern succeed. So in text Expedien N0 18-00232995, Expedien matches the first group, N0 is captured by the second group, and Expedien N0 is never tried because it comes later in the alternation. If you move Expedien N0 ahead of Expedien in the list, Expedien N0 matches the first group and 18-00232995 is captured by the second group, which gives the desired result. The modified Python code:

    import re
    s="""text Expedien N0 18-00232995
    $cat input_file
    some text before Expedien: 1-21-212-16-26 some random text
    Reference RE9833 of all sentences.
    abc
    123
    456
    something blah blah Ref.: 
    tramite  1234567
    Ref.:
    some junk Expedien N° 18-00777 # some new content
    some text Expedien N°18-0022995 # some garbled content"""
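    # 'Expedien N0' now comes before 'Expedien', so the longer keyword is tried first
    my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien N0', 'Expedien']
    
    # Group 1 is the keyword; group 2 is the number (optional capital letters, hyphen-separated parts)
    rx = r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format('|'.join(map(re.escape,my_list)))
    print(re.findall(rx, s))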
    

    This prints:

    [('Expedien N0', '18-00232995'), ('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('tramite', '1234567'), ('Expedien N°', '18-00777'), ('Expedien N°', '18-0022995')]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You have your intended tuple here in your findall results
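
Reordering by hand works, but it is easy to get wrong again as the list grows. One possible way to avoid that, sketched below, is to sort the keywords by length (longest first) before building the alternation, so a keyword is always tried before any shorter keyword it starts with. The helper name build_pattern and the shortened sample text are assumptions made for this illustration; the keyword list and the regex are the ones from the question, with the list left in its original order.

```python
import re

# Shortened sample taken from the question's input (illustrative subset).
s = """text Expedien N0 18-00232995
some text before Expedien: 1-21-212-16-26 some random text
some junk Expedien N° 18-00777 # some new content"""

# Keywords exactly as in the question, in the original (problematic) order.
my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No',
           'Expedien N°', 'Exp.No', 'Expedien', 'Expedien N0']


def build_pattern(keywords):
    """Hypothetical helper: compile the question's regex with the
    keywords sorted longest-first, so 'Expedien N0' precedes 'Expedien'."""
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = '|'.join(map(re.escape, ordered))
    return re.compile(
        r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format(alternation)
    )


results = build_pattern(my_list).findall(s)
print(results)
```

With this ordering the first tuple should be ('Expedien N0', '18-00232995') no matter how the entries are arranged in my_list.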