Split long sentences of a text file around the middle on comma (multiple commas)


I have a .srt file whose long lines I'd like to split so I can watch it with mpv. It's a whole book turned into .srt for language learning, with an audiobook to go along. My problem is that it's in Japanese, which doesn't put spaces between words, so mpv doesn't break long sentences; instead it shrinks them until they fit on a single line.

I tried Subtitle Edit, but it's not working for Japanese.

So I'm trying to write my own script, although I don't know much about scripting. I'm stuck on how to break a sentence that has multiple commas: how would I choose one around the middle?

Here's what I got so far:


with open("test.txt", encoding="utf8") as file:
    for line in file:
        #print(line)
        size = len(line)
        if size > 45:
            #break sentence in half, using Japanese comma 、
            pass  # placeholder: the splitting logic goes here

Here's the text file I'm using for testing:

10
00:00:55,640 --> 00:01:09,580
クラスで一番、明るくて、優しくて、運動神経がよくて、しかも、頭もよくて、みんなその子と友達になりたがる。

11
00:01:11,090 --> 00:01:24,500
だけどその子は、たくさんいるクラスメートの中に私がいることに気づいて、その顔にお日様みたいな眩しく、優しい微笑みをふわーっと浮かべる。

12
00:01:24,730 --> 00:01:32,250
私に近づき、「こころちゃん、ひさしぶり!」

13
00:01:32,910 --> 00:01:35,180
と挨拶をする。

14
00:01:37,450 --> 00:01:41,730
周りの子がみんな息を吞む中、「前から知ってるの。

15
00:01:42,000 --> 00:01:42,820
ね?」

16
00:01:43,820 --> 00:01:46,550
と私に目配せをする。

Solution

  • Things were behaving oddly when I tried to open the file only once, so my solution opens it twice and does the following: read every line into a list, go through the list to find every line longer than 45 characters, find a comma near the middle of that line, then replace the line in the list with the text before the comma and insert the text after the comma as a new line. Once done, write the list back to the file.

    MAX_LENGTH = 45
    
    def findCommaNearMiddle(line):
        length = len(line)
        middle = length // 2
        # check positions on either side of the middle until a comma is found
        distance = 0
        while distance <= middle:
            # the bounds check avoids an IndexError when middle+distance runs past the end
            if middle+distance < length and line[middle+distance] == '、':
                return middle+distance
            elif line[middle-distance] == '、':
                return middle-distance
            distance += 1
        return -1 # no comma in this line
    
    with open("test.txt", "r", encoding="utf8") as file:
        fileLines = file.read().splitlines()
    
    i = 0
    while i < len(fileLines): # a while loop, because the list grows as lines are split
        line = fileLines[i]
        if len(line) > MAX_LENGTH:
            middleComma = findCommaNearMiddle(line)
            # skip lines with no comma, or where a split would leave an empty line
            if 0 < middleComma < len(line) - 1:
                fileLines[i] = line[:middleComma]
                fileLines.insert(i+1, line[middleComma+1:]) # +1 to get rid of comma
        i += 1
    
    with open("test.txt", "w", encoding="utf8") as file:
        for line in fileLines:
            file.write(line + '\n')
    

    If you want to be able to split on characters other than '、', replace each == '、' comparison in findCommaNearMiddle with a membership test, something like line[middle+distance] in '、。'
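
As a rough sketch of that idea, the splitting logic can also be pulled out into small functions that work on a single string, which makes it easier to try on one sentence before touching the file. The names find_split_index and split_line, the delimiters parameter, and the choice to keep the comma at the end of the first half (the script above drops it) are all assumptions made for illustration, not part of the original script.

```python
def find_split_index(line, delimiters="、。"):
    """Return the index of the delimiter closest to the middle, or -1.

    The first and last characters are skipped so neither half ends up empty.
    """
    middle = len(line) // 2
    candidates = [i for i in range(1, len(line) - 1) if line[i] in delimiters]
    if not candidates:
        return -1
    # min() returns the first minimum, so ties go to the earlier delimiter
    return min(candidates, key=lambda i: abs(i - middle))


def split_line(line, max_length=45, delimiters="、。"):
    """Split a too-long line in two after the delimiter nearest the middle."""
    if len(line) <= max_length:
        return [line]
    index = find_split_index(line, delimiters)
    if index == -1:
        return [line]  # nothing to split on, leave the line as it is
    # keep the delimiter at the end of the first half
    return [line[:index + 1], line[index + 1:]]


sample = "クラスで一番、明るくて、優しくて、運動神経がよくて、しかも、頭もよくて、みんなその子と友達になりたがる。"
for part in split_line(sample):
    print(part)
```

For the first sample sentence from the question (52 characters), this should break after 運動神経がよくて、 and start the second line at しかも. Short lines such as the timestamps, and lines that contain no delimiter, come back unchanged.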