python python-3.x regex parsing text

Capture all characters in single string between regex matches


I have a log file with the following format:

00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%
FOO BAR FOO FOO FOO BAR

00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%
BAR BAR BAR' BAR. FOO.BAR

00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%
BOO.BOO. FARFAR.FAR

I want to capture the text beneath the header line of each entry, and end up with a list like this:

['FOO BAR FOO FOO FOO BAR', "BAR BAR BAR' BAR. FOO.BAR", 'BOO.BOO. FARFAR.FAR']

I have written the following regular expression and tested that it properly matches the log data:

"\d\d:\d\d:\d\d.\d\d\d ;; \d\d:\d\d:\d\d.\d\d\d side:start top:\d\d% bottom:\d\d% sound:\d\d%"

What I actually want is the text between these matches, and I am not sure whether a regex is the best way to get it. The alternative would be to iterate through the 123,378-line text file, skipping blank lines and any line that matches the expression above.

What is the most efficient way to return a list of the text after each log entry?


Solution

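  • You can use re.findall with a pattern that matches the header line and then captures every following line, using a negative lookahead to stop before the next header: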

    ^\d\d:\d\d:\d\d.\d\d\d ;; \d\d:\d\d:\d\d.\d\d\d .*((?:\n(?!\d\d:\d\d:\d\d.\d\d\d ;; \d\d:\d\d:\d\d.\d\d\d).*)*)
    


    import re
    
    # The capture group collects every line after a header, up to the next header.
    # The unescaped "." matches any character, which includes the dot in the timestamps.
    pattern = r"^\d\d:\d\d:\d\d.\d\d\d ;; \d\d:\d\d:\d\d.\d\d\d .*((?:\n(?!\d\d:\d\d:\d\d.\d\d\d ;; \d\d:\d\d:\d\d.\d\d\d).*)*)"
    
    s = ("00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%\n"
         "FOO BAR FOO FOO FOO BAR\n\n"
         "00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%\n"
         "BAR BAR BAR' BAR. FOO.BAR\n\n"
         "00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%\n"
         "BOO.BOO. FARFAR.FAR")
    
    # re.M lets ^ match at the start of every line; strip() drops the surrounding newlines.
    res = [x.strip() for x in re.findall(pattern, s, re.M)]
    print(res)
    

    Output

    ['FOO BAR FOO FOO FOO BAR', "BAR BAR BAR' BAR. FOO.BAR", 'BOO.BOO. FARFAR.FAR']
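
The pattern above is dense, so here is a sketch of the same idea written with re.VERBOSE to show what each part does. It is not a replacement for the answer's pattern: the names TIMESTAMP and ENTRY are made up for illustration, and the dots are escaped on the assumption that the timestamps always contain a literal dot. In verbose mode plain spaces are ignored, so the literal spaces are written as `[ ]`.

```python
import re

# Hypothetical helper name; assumes timestamps look like HH:MM:SS.mmm with a literal dot.
TIMESTAMP = r"\d\d:\d\d:\d\d\.\d\d\d"

ENTRY = re.compile(
    rf"""
    ^{TIMESTAMP}[ ];;[ ]{TIMESTAMP}[ ].*   # header line: two timestamps, then the rest of the line
    (                                      # group 1: the text that follows the header
        (?:
            \n                                  # step onto the next line
            (?!{TIMESTAMP}[ ];;[ ]{TIMESTAMP})  # stop if that line is another header
            .*                                  # otherwise take the whole line
        )*
    )
    """,
    re.MULTILINE | re.VERBOSE,
)

sample = (
    "00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%\n"
    "FOO BAR FOO FOO FOO BAR\n\n"
    "00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%\n"
    "BAR BAR BAR' BAR. FOO.BAR\n\n"
    "00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%\n"
    "BOO.BOO. FARFAR.FAR"
)

# findall returns group 1 for each match; strip() removes the surrounding newlines.
result = [text.strip() for text in ENTRY.findall(sample)]
print(result)
```

Each repetition of the inner group consumes one newline plus one line, so the capture ends at the last line before the next header, or at the end of the string.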
    

    Or, if the first timestamp followed by ;; is enough to identify a header line, the pattern can be shortened to:

    ^\d\d:\d\d:\d\d.\d{3} ;; .*((?:\n(?!\d\d:\d\d:\d\d.\d{3} ;;).*)*)
    

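The question also mentions iterating through the file line by line. A minimal sketch of that alternative is below, in case reading the whole file into one string is undesirable. It assumes that every header line starts with a timestamp followed by ` ;; `, and the function name iter_entry_texts is hypothetical. Unlike the findall version, it drops blank lines inside an entry and joins multi-line text with a single newline.

```python
import re
from typing import Iterable, Iterator, List, Optional

# Assumption: a header line always starts with "HH:MM:SS.mmm ;; ".
HEADER = re.compile(r"\d\d:\d\d:\d\d\.\d{3} ;; ")


def iter_entry_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text found under each header line, one string per entry."""
    body: Optional[List[str]] = None  # None until the first header is seen
    for line in lines:
        if HEADER.match(line):
            if body is not None:
                yield "\n".join(body)  # finish the previous entry
            body = []
        elif body is not None and line.strip():
            body.append(line.strip())  # keep non-blank text lines only
    if body is not None:
        yield "\n".join(body)  # finish the last entry


sample_lines = [
    "00:00:09.476 ;; 00:00:11.111 side:start top:15% bottom:10% sound:80%",
    "FOO BAR FOO FOO FOO BAR",
    "",
    "00:00:11.111 ;; 00:00:12.278 side:start top:15% bottom:10% sound:78%",
    "BAR BAR BAR' BAR. FOO.BAR",
    "",
    "00:00:12.278 ;; 00:00:14.447 side:start top:15% bottom:10% sound:43%",
    "BOO.BOO. FARFAR.FAR",
]

result = list(iter_entry_texts(sample_lines))
print(result)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, which keeps only one entry in memory at a time.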