python string text youtube

Remove transcript timestamps and join the lines to make a paragraph


  • File: Plain Text Document
  • Content: YouTube timestamped transcript


I can separately remove each line's timestamp:

for count, line in enumerate(content, start=1):
    if count % 2 == 0:
        s = line.replace('\n', '')
        print(s)

I can also join the sentences if I don't remove the timestamps:

with open('file.txt') as f:
    print(" ".join(line.strip() for line in f))

But when I tried to do both together (removing the timestamps and joining the lines), none of my attempts produced the right output:

with open('Russell Brand Script.txt') as m:
    for count, line in enumerate(m, start=1):
        if count % 2 == 0:
            sentence=line.replace('\n',' ')
            print(" ".join(sentence.rstrip('\n'))) 

I also tried various forms of print(" ".join(sentence.rstrip('\n'))) and print(" ".join(sentence.strip())), but the result is always one of the two below:

[image: screenshot of the output]

How can I remove the timestamps and join the sentences to create a paragraph at once?


Solution

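  • When you pass a single string to .join(), it is treated as a sequence of characters, so the separator is inserted between every character of that string. Note also that print(), by default, adds a newline after whatever it prints, so calling it once per line puts each sentence on its own line.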

    To get around this, you can save each cleaned-up sentence to a list, and then output the entire paragraph in one go at the end using " ".join(). This avoids the newline issue described above, and lets you do additional processing on the paragraph afterwards, if desired.

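    with open('put_your_filename_here.txt') as m:
        sentences = []
        for count, line in enumerate(m, start=1):
            # Even-numbered lines hold the text; odd-numbered lines are timestamps.
            if count % 2 == 0:
                sentence = line.replace('\n', '')  # drop the trailing newline
                sentences.append(sentence)
        # Join once at the end so the separator goes between sentences, not characters.
        print(' '.join(sentences))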
    

    (Made a small edit to the code -- the old version of the code produced a trailing space after the paragraph.)
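
As a rough illustration of the two points above, here is a minimal sketch that shows the .join() pitfall on a single string and wraps the same idea in a small function. The function name join_transcript and the sample lines are made up for this example, and it assumes the layout from the question: timestamps on odd-numbered lines, text on even-numbered lines.

```python
from itertools import islice


def join_transcript(lines):
    """Join every second line into one paragraph.

    Hypothetical helper: assumes timestamps sit on odd-numbered lines
    and transcript text on even-numbered lines, as in the question.
    """
    # Start at index 1 (the second line) and take every other line.
    text_lines = islice(lines, 1, None, 2)
    return " ".join(line.strip() for line in text_lines)


# Made-up sample standing in for the lines of the transcript file.
sample = ["0:00\n", "hello and welcome\n", "0:03\n", "to the show\n"]
paragraph = join_transcript(sample)
print(paragraph)

# The pitfall: a single string is iterated character by character.
spaced = " ".join("abc")
print(spaced)
```

Because islice() accepts any iterable, the same function should also work on an open file object, e.g. calling join_transcript(f) inside a with open(...) as f: block.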