python regex srt

parsing a .srt file with regex


I am writing a small script in Python, but since I am quite new to it I got stuck on one part: I need to get the timing and the text from a .srt file. For example, from

1
00:00:01,000 --> 00:00:04,074
Subtitles downloaded from www.OpenSubtitles.org

I need to get:

00:00:01,000 --> 00:00:04,074

and

Subtitles downloaded from www.OpenSubtitles.org.

I have already managed to write the regex for the timing, but I am stuck on the text. I've tried to use a lookbehind that contains my timing regex:

( ?<=(\d+):(\d+):(\d+)(?:\,)(\d+) --> (\d+):(\d+):(\d+)(?:\,)(\d+) )\w+

but with no effect. Personally, I think that using a lookbehind is the right way to solve this, but I am not sure how to write it correctly. Can anyone help me? Thanks.


Solution

  • Honestly, I don't see any reason to throw regex at this problem. .srt files are highly structured. The structure goes like:

    • an integer starting at 1, monotonically increasing
    • start --> stop timing
    • one or more lines of subtitle content
    • a blank line

    ... and repeat. Note the "one or more" in the third item: you might have to capture 1, 2, or 20 lines of subtitle content after the time code.

    So, just take advantage of the structure. That way you can parse everything in a single pass, reading the file line by line instead of loading it whole, while still keeping all the information for each subtitle together.

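    from itertools import groupby

    # "chunk" our input file, delimited by blank lines
    # (filename is the path to your .srt file)
    with open(filename) as f:
        # groupby yields runs of consecutive blank / non-blank lines; keep only the non-blank runs
        res = [list(g) for b,g in groupby(f, lambda x: bool(x.strip())) if b]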
    

    For example, using the example on the SRT doc page, I get:

    res
    Out[60]: 
    [['1\n',
      '00:02:17,440 --> 00:02:20,375\n',
      "Senator, we're making\n",
      'our final approach into Coruscant.\n'],
     ['2\n', '00:02:20,476 --> 00:02:22,501\n', 'Very good, Lieutenant.\n']]
    

    And I could further transform that into a list of meaningful objects:

    from collections import namedtuple
    
    Subtitle = namedtuple('Subtitle', 'number start end content')
    
    subs = []
    
    for sub in res:
        if len(sub) >= 3: # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            number, start_end, *content = sub # py3 syntax
            start, end = start_end.split(' --> ')
            subs.append(Subtitle(number, start, end, content))
    
    subs
    Out[65]: 
    [Subtitle(number='1', start='00:02:17,440', end='00:02:20,375', content=["Senator, we're making", 'our final approach into Coruscant.']),
     Subtitle(number='2', start='00:02:20,476', end='00:02:22,501', content=['Very good, Lieutenant.'])]
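
    The start and end fields above are still plain strings. If you need to do arithmetic on them (durations, shifting subtitles), one option is to convert them to datetime.timedelta objects. This is only a minimal sketch: the helper name parse_srt_time is made up for illustration, and it assumes every timestamp has exactly the HH:MM:SS,mmm shape shown above.

```python
from datetime import timedelta

def parse_srt_time(stamp):
    """Convert an SRT timestamp such as '00:02:17,440' to a timedelta.

    Hypothetical helper; assumes the 'HH:MM:SS,mmm' layout and raises
    ValueError on anything else.
    """
    clock, millis = stamp.strip().split(',')
    hours, minutes, seconds = clock.split(':')
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(seconds), milliseconds=int(millis))

# Timestamps taken from the first subtitle in the example output
start = parse_srt_time('00:02:17,440')
end = parse_srt_time('00:02:20,375')
duration = end - start
print(duration.total_seconds())  # 2.935
```

    Because the subtraction is done on timedelta objects, you don't have to handle the minute and hour carries yourself.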