Tags: python, regex, caption

Problem formatting text from captions using regular expressions


I'm trying to extract the text from a captions file so I can analyze it, but I'm stuck getting the subtitle text into a readable form. I'm using regular expressions to get the caption numbers, the caption times and the caption speech. When I get to the speech, I get a lot of blank lines because of how the subtitles are laid out (see the image). I just want to create a list that contains only the speech, not the blank lines. The list I'm currently getting is shown in an image too.

Here's a sample from the captions too:

1
00:00:00,030 --> 00:00:05,370
so here we are at the offices of my

2
00:00:02,240 --> 00:00:05,370



3
00:00:02,250 --> 00:00:07,319
accountants of your Eric Biddle mr.

4
00:00:05,360 --> 00:00:07,319



5

MY LIST: (image not available)

CAPTIONS: (image not available)

import re

filename = r'test_subtitle.srt'
pattern_number = re.compile('^\d+$')
pattern_time = re.compile('^[\d]+:[\d]+:[\d]+,[\d]+ --> [\d]+:[\d]+:[\d]+,[\d]+$')
pattern_speech = re.compile("^[A-Za-z,;'\"\\s]+[.?!]*$")

for i, line in enumerate(open(filename)):
    for match in re.findall(pattern_number, line):
        print(match)

for i, line in enumerate(open(filename)):
    for match in re.findall(pattern_time, line):
        print(match)

speech = []

for i, line in enumerate(open(filename)):
    for match in re.findall(pattern_speech, line):
        speech.append(match)

print(speech)

Solution

  • I recommend scanning the text as a whole rather than line by line. You can also use groups in your pattern to capture each piece of data. I would read the file like this:

    with open('test_subtitle.srt', 'r') as f:
        subtitles = f.read()
    

    Then I would match each subtitle entry and extract its data with the following code:

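    import re
    
    # One capture group per field: entry number, start/end time, and the text line.
    num_pat = r'(\d+)'
    time_pat = r'(\d{2,}:\d{2}:\d{2},\d{3}) --> (\d{2,}:\d{2}:\d{2},\d{3})'
    sentence_pat = r'([^\n]*)\n'
    
    # The three parts sit on consecutive lines, so join them with a regex newline (\n).
    data_pattern = re.compile(r'\n'.join([num_pat, time_pat, sentence_pat]))
    print('data_pattern:', data_pattern)
    
    for match in data_pattern.finditer(subtitles):
        print('-' * 20)
        print(match.group(1))
        print(f'time: {match.group(2)} --> {match.group(3)}')
        print('text:', repr(match.group(4)))  # '' when the entry has no text
        print()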
    

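    Another thing I noticed in your code: most of your patterns are normal strings with unescaped backslashes (for example '^\d+$'). This happens to work because Python leaves unrecognized escapes such as \d unchanged, but recent Python versions emit a warning for it, and it breaks for sequences like \b, which a normal string turns into a backspace character. If you want to write backslashes without escaping them, use a raw string (r'^\d+$'). Hope this helps.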
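
To end up with a list that contains only the speech, one option is to keep the whole-file pattern and skip entries whose text line is empty. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation: the inline `subtitles` string is copied from the sample in the question so the example runs on its own, and the names `entry_pattern`, `original_speech_pattern` and `blank_matches` are my own. It also checks why the original `pattern_speech` produced blank entries: its character class contains `\s`, so a line consisting only of a newline matches.

```python
import re

# Sample copied from the question; in practice this would come from f.read().
subtitles = """1
00:00:00,030 --> 00:00:05,370
so here we are at the offices of my

2
00:00:02,240 --> 00:00:05,370



3
00:00:02,250 --> 00:00:07,319
accountants of your Eric Biddle mr.

4
00:00:05,360 --> 00:00:07,319



"""

# Entry number, time range, then a single (possibly empty) line of text.
entry_pattern = re.compile(
    r'(\d+)\n'
    r'(\d{2,}:\d{2}:\d{2},\d{3}) --> (\d{2,}:\d{2}:\d{2},\d{3})\n'
    r'([^\n]*)\n'
)

speech = []
for match in entry_pattern.finditer(subtitles):
    text = match.group(4).strip()
    if text:  # skip entries whose text line is empty
        speech.append(text)

print(speech)

# The pattern from the question, written exactly as it was there.
original_speech_pattern = re.compile("^[A-Za-z,;'\"\\s]+[.?!]*$")

# A blank line read from the file is just "\n"; \s in the class matches it.
blank_matches = original_speech_pattern.findall("\n")
print(blank_matches)
```

With the sample above, `speech` holds the two spoken lines, and `blank_matches` shows that the original pattern accepts a bare newline. This assumes each entry has at most one line of text, as in the sample; multi-line captions would need a different text pattern.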