python regex regex-lookarounds regex-negation regex-group

Fetching a certain text from multi-line file


I would like to extract a certain piece of text from a file using Python's re module, taking into account that the file contains multiple newlines and spaces. The file may have several data blocks, but the only one required is the block with specific keywords. In my case it should be the block that contains the "Route-Details" keyword.

Let us say that the file (sample.txt) looks like this:

.
.
.
 Host1<-->Host2 Con. ID:         0x0fc2f0d9  (abc123)
  Con. Information:
     [Gw]  Route-Details 
        R-Code:      0xaaaa (1a2) Route-Details
        Router-ID:     0x21       (a4)  [Gw] 
        Path-Code:    0x00e   (15)
        Data: 123-abcd.djsjdkks www.somesite. port 11

Coded info
                   aa aa aa aa aa aa aa aa   1111-aaa
                   aa aa aa aa aa aa aa aa   1111-aaa
.
.
.

This is what I have written:

import re
with open("sample.txt", "r") as fl:
    in_file = fl.read()

print(re.search('(?<=Route-Details).* Data:', in_file, re.DOTALL).group())

I expect to obtain this.

123-abcd.djsjdkks www.somesite. port 11

However, I got this.

R-Code:      0xaaaa (1a2) Route-Details
        Router-ID:     0x21       (a4)  [Gw] 
        Path-Code:    0x00e   (15)
        Data:

I would appreciate a simple solution together with an explanation. Thanks so much for your help.


Solution

  • You can use a positive look-behind and a capturing group (text here holds the file contents, i.e. in_file in the question):

    re.findall(r'(?<=Data: )(.*?)\n', text)
    

    Yields:

    ['123-abcd.djsjdkks www.somesite. port 11']
    

    To also require that Route-Details appears before the Data: line, as you specified, you can try the following:

    re.findall(r'(?<=Route-Details).*?(?<=Data: )(.*?)\n', text, re.DOTALL)
    

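    re.DOTALL makes the . character match every character, including newlines, so .*? can span the lines between Route-Details and Data:. Your original pattern ended at Data:, so the match stopped there and never included the value that follows it.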
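
The two patterns above can be checked end to end with a small script. This is only a sketch: the sample string is copied from the question with the elided lines left out, the variable names are illustrative, and the third pattern (a plain capturing group with re.search, no look-behinds) is an alternative that is not part of the original answer.

```python
import re

# Sample block copied from the question; in practice this would be the
# result of open("sample.txt").read().
text = """ Host1<-->Host2 Con. ID:         0x0fc2f0d9  (abc123)
  Con. Information:
     [Gw]  Route-Details
        R-Code:      0xaaaa (1a2) Route-Details
        Router-ID:     0x21       (a4)  [Gw]
        Path-Code:    0x00e   (15)
        Data: 123-abcd.djsjdkks www.somesite. port 11

Coded info
                   aa aa aa aa aa aa aa aa   1111-aaa
"""

# Look-behind only: capture whatever follows "Data: " up to the end of the line.
all_data = re.findall(r'(?<=Data: )(.*?)\n', text)

# Same capture, but only when "Route-Details" occurs somewhere before it.
# re.DOTALL lets the lazy .*? cross line boundaries.
route_data = re.findall(
    r'(?<=Route-Details).*?(?<=Data: )(.*?)\n', text, re.DOTALL
)

# Alternative sketch (not from the answer): match the keywords directly and
# capture the rest of the Data: line, returning None when nothing matches.
match = re.search(r'Route-Details.*?Data:[ \t]*([^\n]*)', text, re.DOTALL)
data_value = match.group(1) if match else None

print(all_data)
print(route_data)
print(data_value)
```

The re.search variant returns a single string rather than a list, which may be more convenient when only one Route-Details block is expected; checking for None avoids an AttributeError when the file has no such block.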