
Regular expression to match closest tag above specific word (HLS media playlist)


Given an HLS media playlist as follows:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-ALLOW-CACHE:NO
#EXT-X-TARGETDURATION:7
#EXT-X-MEDIA-SEQUENCE:0

#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00
#EXTINF:6.666666667,
seg1.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00
#EXTINF:6.666666667,
seg2.ts
#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00
#EXTINF:6.666666666,
seg3.ts

I want to create a regular expression to match the datetime following the EXT-X-PROGRAM-DATE-TIME tag closest to a specified .ts file name. For example, I want to be able to retrieve the datetime 2022-09-12T10:03:29.637+02:00, by specifying that the match should end with seg2.ts. It should work even if new tags are added in between the file name and the EXT-X-PROGRAM-DATE-TIME tag in the future.

This pattern (EXT-X-PROGRAM-DATE-TIME:(.*)[\s\S]*?seg2.ts) is my best effort so far, but I can't figure out how to make the match start at the last possible EXT-X-PROGRAM-DATE-TIME tag. The lazy quantifier did not help. The group that is currently captured is the datetime following the first EXT-X-PROGRAM-DATE-TIME, i.e. 2022-09-12T10:03:22.621+02:00.

I also looked at using negative lookahead, but I can't figure out how to combine that with matching a variable number of characters and whitespaces before the seg2.ts.

I'm sure this has been answered before in another context, but I just can't find the right search terms.


Solution

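  • We can use re.search here along with the tempered dot trick (a dot guarded by a negative lookahead), so that the match cannot run across another EXT-X-PROGRAM-DATE-TIME tag: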

    #Python 2.7.17
    
    import re
    
    inp = """#EXTM3U
    #EXT-X-VERSION:3
    #EXT-X-ALLOW-CACHE:NO
    #EXT-X-TARGETDURATION:7
    #EXT-X-MEDIA-SEQUENCE:0
    
    #EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00
    #EXTINF:6.666666667,
    seg1.ts
    #EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00
    #EXTINF:6.666666667,
    seg2.ts
    #EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00
    #EXTINF:6.666666666,
    seg3.ts"""
    
    match = re.search(r'#EXT-X-PROGRAM-DATE-TIME:(\S+)(?:(?!EXT-X-PROGRAM-DATE-TIME).)*\bseg2\.ts', inp, flags=re.S)
    if match:
        print(match.group(1))  # 2022-09-12T10:03:29.637+02:00
    

    Here is an explanation of the regex pattern:

    • #EXT-X-PROGRAM-DATE-TIME: match the literal tag name and colon
    • (\S+) match and capture the timestamp
    • (?:(?!EXT-X-PROGRAM-DATE-TIME).)* match any content (including newlines, because of re.S) WITHOUT crossing another EXT-X-PROGRAM-DATE-TIME tag
    • \bseg2\.ts match the filename
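
    To reuse the pattern for any segment, the file name can be escaped and spliced in. The following is a minimal sketch, not part of the original answer: the function name program_date_time_for and the PLAYLIST constant are hypothetical, and it assumes each segment URI sits alone on its own line (hence ^ and $ with re.M instead of \b).

```python
import re

# Sample playlist from the question, joined line by line.
PLAYLIST = "\n".join([
    "#EXTM3U",
    "#EXT-X-VERSION:3",
    "#EXT-X-ALLOW-CACHE:NO",
    "#EXT-X-TARGETDURATION:7",
    "#EXT-X-MEDIA-SEQUENCE:0",
    "",
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:22.621+02:00",
    "#EXTINF:6.666666667,",
    "seg1.ts",
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:29.637+02:00",
    "#EXTINF:6.666666667,",
    "seg2.ts",
    "#EXT-X-PROGRAM-DATE-TIME:2022-09-12T10:03:36.583+02:00",
    "#EXTINF:6.666666666,",
    "seg3.ts",
])


def program_date_time_for(playlist, segment_name):
    """Return the timestamp of the closest EXT-X-PROGRAM-DATE-TIME tag
    above segment_name, or None if no such tag/segment pair exists."""
    pattern = (
        r'#EXT-X-PROGRAM-DATE-TIME:(\S+)'         # capture the timestamp
        r'(?:(?!#EXT-X-PROGRAM-DATE-TIME).)*?'    # anything except another tag
        r'^' + re.escape(segment_name) + r'$'     # the segment on its own line
    )
    # re.S lets the dot cross newlines; re.M makes ^ and $ match per line.
    match = re.search(pattern, playlist, flags=re.S | re.M)
    return match.group(1) if match else None


print(program_date_time_for(PLAYLIST, "seg2.ts"))  # 2022-09-12T10:03:29.637+02:00
```

    re.escape keeps the dot in the file name literal, and the function returns None when the segment is not present or has no EXT-X-PROGRAM-DATE-TIME tag above it.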