
Regex Expression not spanning newlines


I have a string of data as a result of another working regex:

pattern = re.compile(r'CALL\|\d*\|.*\n*((?:\n.*)+?)(?=\nCALL|\Z)',re.MULTILINE)
matches = pattern.finditer(text)
list_dataframe = []
audit_data = []
counter = 0
for match in matches:
    while counter < 1:
        small_string = match.group(0)
        print(small_string)
        print('end small string \n')
        pattern1 = re.compile(r'BENCHMARK\|Assigned\|(?:.*)(?=\|Assigned\||\Z)',re.MULTILINE)
        assignments_iterator = pattern1.finditer(small_string)

small_string looks like this:

CALL|2197040|77-MOTOR VEHICLE COLLISION|11/30/2022|18:22:31.0|28 ST S/I275-UNDER|439111.88686|1246532.42713|False||False|2022-11-30 18:21:24||
CUSTOM-DATA|2197040|1/1/0001 12:00:00 AM|1/1/0001 12:00:00 AM|1/1/0001 12:00:00 AM|1/1/0001 12:00:00 AM||SP3||||SP3|SP3|||||||0|0|0|0|
BENCHMARK|Assigned|E10|0|2197040|Assigned|||
BENCHMARK|TurnoutTime|E10|1.11666666666667|2197040|TurnoutTime|Engine-ALS||
BENCHMARK|OnScene|E10|1.11666666666667|2197040|OnScene|||
BENCHMARK|UserDefCategory|E10|0|2197040|UserDefCategory|UnitType=R||
BENCHMARK|InService|E10|3.08333333333333|2197040|InService|||
BENCHMARK|Assigned|E4|0|2197040|Assigned|||
BENCHMARK|TurnoutTime|E4|0.733333333333333|2197040|TurnoutTime|Engine-ALS||
BENCHMARK|OnScene|E4|0.733333333333333|2197040|OnScene|||
BENCHMARK|UserDefCategory|E4|0|2197040|UserDefCategory|UnitType=E||
BENCHMARK|InService|E4|4.43333333333333|2197040|InService|||
BENCHMARK|Assigned|T1|0|2197040|Assigned|||
BENCHMARK|TurnoutTime|T1|1.31666666666667|2197040|TurnoutTime|Truck||
BENCHMARK|OnScene|T1|7.93333333333333|2197040|OnScene|||
BENCHMARK|UserDefCategory|T1|0|2197040|UserDefCategory|UnitType=T||
BENCHMARK|InService|T1|66.0333333333333|2197040|InService|||
BENCHMARK|Assigned|R3|3.71666666666667|2197040|Assigned|||

I am trying to write a compiled regex that captures everything from one "BENCHMARK|Assigned" line up to (but not including) the next one, so that each iteration yields a block like this:

iter1

BENCHMARK|Assigned|E10|0|2197040|Assigned|||
BENCHMARK|TurnoutTime|E10|1.11666666666667|2197040|TurnoutTime|Engine-ALS||
BENCHMARK|OnScene|E10|1.11666666666667|2197040|OnScene|||
BENCHMARK|UserDefCategory|E10|0|2197040|UserDefCategory|UnitType=R||
BENCHMARK|InService|E10|3.08333333333333|2197040|InService|||

iter2

BENCHMARK|Assigned|E4|0|2197040|Assigned|||
BENCHMARK|TurnoutTime|E4|0.733333333333333|2197040|TurnoutTime|Engine-ALS||
BENCHMARK|OnScene|E4|0.733333333333333|2197040|OnScene|||
BENCHMARK|UserDefCategory|E4|0|2197040|UserDefCategory|UnitType=E||
BENCHMARK|InService|E4|4.43333333333333|2197040|InService|||

I tried:

    pattern1 = re.compile(r'BENCHMARK\|Assigned\|(?:.*)(?=BENCHMARK\|Assigned\||\Z)',re.MULTILINE)
    assignments_iterator = pattern1.finditer(small_string)

but I only get single lines when I iterate through the matches. I cannot get the match to span newlines the way my first regex does.

BENCHMARK|Assigned|E10|0|2197040
BENCHMARK|Assigned|E4|0|2197040
BENCHMARK|Assigned|T1|0|2197040
BENCHMARK|Assigned|R3|3.71666666666667|2197040
BENCHMARK|Assigned|E3|3.73333333333333|2197040
BENCHMARK|Assigned|ME3|6|2197040

Solution

  • Use the re.S (re.DOTALL) flag in combination with re.M. Without re.S, . does not match a newline, so a match can never extend past the end of a line:

    text = '''\
    CALL|2197040|77-MOTOR VEHICLE COLLISION|11/30/2022|18:22:31.0|28 ST S/I275-UNDER|439111.88686|1246532.42713|False||False|2022-11-30 18:21:24||
    CUSTOM-DATA|2197040|1/1/0001 12:00:00 AM|1/1/0001 12:00:00 AM|1/1/0001 12:00:00 AM|1/1/0001 12:00:00 AM||SP3||||SP3|SP3|||||||0|0|0|0|
    BENCHMARK|Assigned|E10|0|2197040|Assigned|||
    BENCHMARK|TurnoutTime|E10|1.11666666666667|2197040|TurnoutTime|Engine-ALS||
    BENCHMARK|OnScene|E10|1.11666666666667|2197040|OnScene|||
    BENCHMARK|UserDefCategory|E10|0|2197040|UserDefCategory|UnitType=R||
    BENCHMARK|InService|E10|3.08333333333333|2197040|InService|||
    BENCHMARK|Assigned|E4|0|2197040|Assigned|||
    BENCHMARK|TurnoutTime|E4|0.733333333333333|2197040|TurnoutTime|Engine-ALS||
    BENCHMARK|OnScene|E4|0.733333333333333|2197040|OnScene|||
    BENCHMARK|UserDefCategory|E4|0|2197040|UserDefCategory|UnitType=E||
    BENCHMARK|InService|E4|4.43333333333333|2197040|InService|||
    BENCHMARK|Assigned|T1|0|2197040|Assigned|||
    BENCHMARK|TurnoutTime|T1|1.31666666666667|2197040|TurnoutTime|Truck||
    BENCHMARK|OnScene|T1|7.93333333333333|2197040|OnScene|||
    BENCHMARK|UserDefCategory|T1|0|2197040|UserDefCategory|UnitType=T||
    BENCHMARK|InService|T1|66.0333333333333|2197040|InService|||
    BENCHMARK|Assigned|R3|3.71666666666667|2197040|Assigned|||'''
    
    import re
    
    for group in re.findall(r'^(BENCHMARK\|Assigned.*?)\s*(?=^BENCHMARK\|Assigned|\Z)', text, flags=re.M|re.S):
        print(group)
        print('-'*80)
    

    Prints:

    BENCHMARK|Assigned|E10|0|2197040|Assigned|||
    BENCHMARK|TurnoutTime|E10|1.11666666666667|2197040|TurnoutTime|Engine-ALS||
    BENCHMARK|OnScene|E10|1.11666666666667|2197040|OnScene|||
    BENCHMARK|UserDefCategory|E10|0|2197040|UserDefCategory|UnitType=R||
    BENCHMARK|InService|E10|3.08333333333333|2197040|InService|||
    --------------------------------------------------------------------------------
    BENCHMARK|Assigned|E4|0|2197040|Assigned|||
    BENCHMARK|TurnoutTime|E4|0.733333333333333|2197040|TurnoutTime|Engine-ALS||
    BENCHMARK|OnScene|E4|0.733333333333333|2197040|OnScene|||
    BENCHMARK|UserDefCategory|E4|0|2197040|UserDefCategory|UnitType=E||
    BENCHMARK|InService|E4|4.43333333333333|2197040|InService|||
    --------------------------------------------------------------------------------
    BENCHMARK|Assigned|T1|0|2197040|Assigned|||
    BENCHMARK|TurnoutTime|T1|1.31666666666667|2197040|TurnoutTime|Truck||
    BENCHMARK|OnScene|T1|7.93333333333333|2197040|OnScene|||
    BENCHMARK|UserDefCategory|T1|0|2197040|UserDefCategory|UnitType=T||
    BENCHMARK|InService|T1|66.0333333333333|2197040|InService|||
    --------------------------------------------------------------------------------
    BENCHMARK|Assigned|R3|3.71666666666667|2197040|Assigned|||
    --------------------------------------------------------------------------------
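
To fit this back into the loop from the question, the same pattern can be compiled once and applied to small_string with finditer. The following is only a minimal sketch, not the asker's actual code: the helper name split_assignments is made up for illustration, and the sample is an abbreviated subset of the lines shown in the question. It also runs the question's first pattern to show that, with re.M alone, every match stays on one line.

```python
import re

# Abbreviated sample in the same layout as small_string in the question.
small_string = """\
CALL|2197040|77-MOTOR VEHICLE COLLISION
BENCHMARK|Assigned|E10|0|2197040|Assigned|||
BENCHMARK|OnScene|E10|1.11666666666667|2197040|OnScene|||
BENCHMARK|Assigned|E4|0|2197040|Assigned|||
BENCHMARK|InService|E4|4.43333333333333|2197040|InService|||
BENCHMARK|Assigned|R3|3.71666666666667|2197040|Assigned|||"""

# Pattern from the question: only re.M is set, so "." cannot match "\n".
old_pattern = re.compile(r'BENCHMARK\|Assigned\|(?:.*)(?=\|Assigned\||\Z)', re.M)

# Pattern from the answer: re.S lets "." match "\n", and re.M lets "^" match
# at the start of every line, which anchors both the match and the lookahead.
block_pattern = re.compile(
    r'^(BENCHMARK\|Assigned.*?)\s*(?=^BENCHMARK\|Assigned|\Z)',
    re.M | re.S,
)


def split_assignments(chunk):
    """Return one multi-line block per 'BENCHMARK|Assigned' line (hypothetical helper)."""
    return [m.group(1) for m in block_pattern.finditer(chunk)]


old_matches = [m.group(0) for m in old_pattern.finditer(small_string)]
blocks = split_assignments(small_string)

print(old_matches[0])  # stays within the first Assigned line
for block in blocks:
    print(block)
    print('-' * 80)
```

In the question's loop, assignments_iterator = block_pattern.finditer(small_string) would then yield one match per unit, with the block text available as match.group(1).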