python regex text-extraction

Extract lines between specific start/end pattern from text file


I want to extract the lines between a specified start pattern (inclusive) and end pattern (exclusive).

My code below extracts some lines, but not the first line, the one that matches the start pattern. The desired output should include that first matching line as well.

Code Attempt

import re
import xlsxwriter

linenum = 0
myline = []
pattern_start = re.compile(r"^vsi ipcbb")
pattern_stop = re.compile(r"^vsi ipcbb-ipran")
with open(r'readline.txt', 'rt') as myfile :
    for row in myfile :
      if pattern_start.search(row) != None :
        for line in myfile :
            linenum += 1
            if pattern_stop.search(line) != None:
                break
            myline.append((linenum, line.rstrip('\n')))

with xlsxwriter.Workbook('readline.xlsx') as workbook:
    worksheet = workbook.add_worksheet('VSI')
    for row_num,data in enumerate(myline):
        worksheet.write_row(row_num + 0, 0, data)

Given Input as text file

!Last configuration was updated at 2021-04-22 05:52:21 UTC by 
!Last configuration was saved at 2021-04-22 19:00:49 UTC by 
!PdtPrivateInfo = System current forwarding-mode: compatible
!MKHash 0000000000000000
vsi ipcbb-RAC_YBPNM01H-00 static
 description *** M-ipcbb-RAC_YBPNM01H(via RAG_MBSPM01H&RAG_YBPNM01H) ***
 tnl-policy TE
 diffserv-mode pipe af1 green
#
vsi ipcbb-ipran-RSG_NKY2M-00 static
 description *** IPCBB-IPRAN VLAN61 Inherit(RAG_NKY2M01H-RAG_NKY2M02H) ***
 tnl-policy TE
 diffserv-mode pipe af1 green
#

Actual Output (lines extracted)

 description *** M-ipcbb-RAC_YBPNM01H(via RAG_MBSPM01H&RAG_YBPNM01H) ***
 tnl-policy TE
 diffserv-mode pipe af1 green
#

Wanted Output (lines extracted)

vsi ipcbb-RAC_YBPNM01H-00 static
 description *** M-ipcbb-RAC_YBPNM01H(via RAG_MBSPM01H&RAG_YBPNM01H) ***
 tnl-policy TE
 diffserv-mode pipe af1 green
#

Solution

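  • You can work with a boolean mode flag such as extract_on, which signals whether the current line lies between the start and stop patterns and should be extracted. Your attempt drops the start line because the outer loop has already consumed it as row by the time the inner loop starts reading the remaining lines. Line matching can be done with the compiled pattern's match (or search) method, each of which returns either a match object or None.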

    import re
    
    pattern_start = re.compile(r"^vsi ipcbb")
    pattern_stop = re.compile(r"^vsi ipcbb-ipran")
    
    extract_on = False
    extracts = []
    with open(r'readline.txt', 'rt') as myfile:
        for i, line in enumerate(myfile, start=1):  # line counting starts with 1
            if pattern_start.match(line):
                extract_on = True
            # The stop line also matches the start pattern, so this check must come second.
            if pattern_stop.match(line):
                extract_on = False
            if extract_on:
                extracts.append((i, line.rstrip('\n')))
    
    for line in extracts:
        print(line)
    

    Given your input, it ignores the first 4 lines, extracts the middle 5, and ignores the last 5. The printout of the extracted lines, each with its position in the file, is:

    (5, 'vsi ipcbb-RAC_YBPNM01H-00 static')
    (6, ' description *** M-ipcbb-RAC_YBPNM01H(via RAG_MBSPM01H&RAG_YBPNM01H) ***')
    (7, ' tnl-policy TE')
    (8, ' diffserv-mode pipe af1 green')
    (9, '#')
    

    The XLSX writing is left out here, as it is assumed to work as expected.
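
The flag-based approach can also be wrapped in a small function so it can be tried without a file on disk. The following is only a sketch: the name extract_between and the shortened in-memory sample are assumptions made for illustration, not part of the original answer. It checks the stop pattern first, because in this input the stop line also matches the start pattern.

```python
import re
from typing import Iterable, List, Tuple


def extract_between(lines: Iterable[str],
                    start: "re.Pattern",
                    stop: "re.Pattern") -> List[Tuple[int, str]]:
    """Collect (line_number, text) pairs from a start match (inclusive)
    up to the next stop match (exclusive). Hypothetical helper."""
    extracts = []
    extract_on = False
    for number, line in enumerate(lines, start=1):
        if stop.match(line):
            # Stop wins over start, since the stop line also matches the start pattern.
            extract_on = False
        elif start.match(line):
            extract_on = True
        if extract_on:
            extracts.append((number, line.rstrip("\n")))
    return extracts


# Shortened stand-in for readline.txt (assumed sample, not the full input).
sample = [
    "!MKHash 0000000000000000\n",
    "vsi ipcbb-RAC_YBPNM01H-00 static\n",
    " tnl-policy TE\n",
    "#\n",
    "vsi ipcbb-ipran-RSG_NKY2M-00 static\n",
    " tnl-policy TE\n",
    "#\n",
]

result = extract_between(
    sample,
    re.compile(r"^vsi ipcbb"),
    re.compile(r"^vsi ipcbb-ipran"),
)
for item in result:
    print(item)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of the list, and the returned tuples have the same (line number, text) shape that the question's write_row loop expects.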