awk, grep, line-breaks

Removing SOME line breaks from srt/txt file


I have a text file made up of numbered entries, each with a timecode and a transcript. I want to remove the line breaks inside the transcript text and leave all other line breaks as they are. I'm trying to do this with grep or awk.

The file looks like this:

1
00:00:27,160 --> 00:00:29,054
Sometimes there's not much dialogue.

2
00:00:30,100 --> 00:00:31,090
But other times there is quite a bit,
and it's formatted into two lines

3
00:00:31,500 --> 00:00:33,700
I want to remove the line breaks only on
these long lines, leaving all other formatting.

4
00:00:33,805 --> 00:00:37,285
So that all dialogue ends up being on a single
line no matter how long that line.

The desired output would look like this:

1
00:00:27,160 --> 00:00:29,054
Sometimes there's not much dialogue.

2
00:00:30,100 --> 00:00:31,090
But other times there is quite a bit, and it's formatted into two lines

3
00:00:31,500 --> 00:00:33,700
I want to remove the line breaks only on these long lines, leaving all other formatting.

4
00:00:33,805 --> 00:00:37,285
So that all dialogue ends up being on a single line no matter how long that line.

Thanks to all who have provided help.


Solution

  • Don't rely on lines starting (or not) with any specific characters. Instead, read each blank-line-separated entry as one record, with one field per line, and attach the 4th and subsequent lines in each record to the end of the 3rd line of that record:

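    $ awk '
    BEGIN { RS=ORS=""; FS=OFS="\n" }   # records split on blank lines, fields on newlines
    {
        print $1,$2,$3                 # entry number, timecode, first dialogue line
        for (i=4;i<=NF;i++)            # any further dialogue lines
            printf " %s", $i           # are appended to the 3rd line after a space
        print "\n\n"                   # end the line and restore the blank separator
    }
    ' file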
    1
    00:00:27,160 --> 00:00:29,054
    Sometimes there's not much dialogue.
    
    2
    00:00:30,100 --> 00:00:31,090
    But other times there is quite a bit, and it's formatted into two lines
    
    3
    00:00:31,500 --> 00:00:33,700
    I want to remove the line breaks only on these long lines, leaving all other formatting.
    
    4
    00:00:33,805 --> 00:00:37,285
    So that all dialogue ends up being on a single line no matter how long that line.
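
Because the script above prints "\n\n" after every record, including the last one, its output ends with an extra blank line. If that matters, a variant along the following lines might help: it uses the same blank-line record splitting but writes the separator before every entry except the first. This is only a sketch; the function name `join_dialogue` is made up for illustration and is not part of the original answer, and it assumes every entry has at least three lines (number, timecode, dialogue).

```shell
# join_dialogue: hypothetical helper name chosen for this sketch.
# Reads the named files (or standard input) and joins the dialogue
# lines of each entry into a single line.
join_dialogue() {
    awk '
    BEGIN { RS = ""; FS = "\n" }         # one record per entry, one field per line
    {
        text = $3                        # first dialogue line
        for (i = 4; i <= NF; i++)        # append any continuation lines
            text = text " " $i
        if (NR > 1) print ""             # blank line between entries only
        print $1                         # entry number
        print $2                         # timecode
        print text                       # joined dialogue
    }
    ' "$@"
}
```

It could be used as `join_dialogue file > joined.txt`, or at the end of a pipeline such as `cat file | join_dialogue`.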