
I need some ideas on text processing for SRT subtitles


Title says what I really need ATM.

Basically, I've created an OCR toolchain based on Tesseract and ImageMagick, and I've gotten it to the point where the output text is very consistent. I'm using it to OCR some old hardsubbed videos and turn them into softsubbed SRT subtitles. To take the screenshots used as image input, I'm using a modified version of an old shell script I found and rewrote ages ago. Those get fed into a second script that processes them into a form readable by Tesseract. At this point I could easily do the remainder of the work by hand, but I'd like to automate everything except the final proofreading pass if possible.

Example Text (From current project)

03:04.418  Their parents have always written    letters thanking us. =  
03:05.018  Their parents have always written    letters thanking us. =  
03:05.619  Their parents have always written    letters thanking us. =  
03:06.219  Their parents have always written    letters thanking us. =  
03:06.820  Their parents have always written    letters thanking us. =  
03:07.421  Their parents have always written    letters thanking us. =  
03:08.021  Their parents have always written    letters thanking us. =  
03:08.622  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:09.222  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:09.823  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:10.424  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:11.024  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:11.625  This seminary was highly reeemmended.    | am relieved te leave her in your care. =  
03:12.225  In additien te all the previeus requests se far..."  
03:12.826  In additien te all the previeus requests se far..."  
03:13.427  In additien te all the previeus requests se far..."  
03:14.027  In additien te all the previeus requests se far..."  
03:14.628  In additien te all the previeus requests se far..."

Basically, I want to group consecutive lines with matching text, pull the timestamps from the first and last line of each group, and write them out in SRT format:

1
00:03:04,418 --> 00:03:08,021
Their parents have always written
letters thanking us. =  

2
00:03:08,622 --> 00:03:11,625
This seminary was highly reeemmended.
| am relieved te leave her in your care. = 

3
00:03:12,225 --> 00:03:14,628
In additien te all the previeus requests se far..."
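
The raw timestamps look like MM:SS.mmm, while the target format above is HH:MM:SS,mmm with a comma before the milliseconds. A minimal sketch of that conversion, assuming every timestamp falls within the first hour of the video (the helper name to_srt_time is made up for illustration):

```bash
#!/bin/bash
# to_srt_time is a hypothetical helper, not part of any script in this post.
# Assumption: input is "MM:SS.mmm" and the video is shorter than one hour.
to_srt_time() {
    local t=$1
    # swap the first "." for "," and prepend the hour field
    printf '00:%s\n' "${t/./,}"
}

to_srt_time "03:04.418"   # prints 00:03:04,418
```

For videos longer than an hour the hour field would have to come from the screenshot file name instead of being hard-coded.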

At this point I'm fine with it being a separate script.

Basically sub.txt in, sub.srt out, then a proofreading pass. There is a bit of variability in the detected text, but it's minimal: I is occasionally detected as | or [, and o and e sometimes get mixed up in some odd corner cases.
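
One possible way to keep that variability from splitting a subtitle into several entries is to compare lines on a normalized key and still print the original text. The sketch below is only an idea, not something from the toolchain: match_key is a hypothetical name, and folding o onto e is an assumption that can also merge lines that really differ (for example "not" and "net").

```bash
#!/bin/bash
# match_key is a hypothetical helper: it builds a comparison key, not display text.
match_key() {
    local s=$1 words
    s=${s//|/I}      # | misread for I
    s=${s//\[/I}     # [ misread for I
    s=${s//o/e}      # fold o onto e so "to" and "te" compare equal
    read -r -a words <<< "$s"   # split on whitespace to squeeze runs of spaces
    printf '%s\n' "${words[*]}"
}

a="| am relieved te leave her"
b="I am relieved  to leave her"
[[ "$(match_key "$a")" == "$(match_key "$b")" ]] && echo "same cue"
```

Only the key is normalized, so the o/e mistakes still show up in the output and can be fixed during the proofreading pass.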

Edit February 2 2020:

I've made some changes and tweaks to both my shell script and Ivan's to get closer to what I wanted. I've eliminated the blank sub lines produced by Ivan's script and by mine.

Updated processing and OCR script, BTW:

#!/bin/bash -x
 
cd "$1" || exit 1   # stop if the screenshot directory is missing
mkdir -p ocr        # don't fail when ocr/ already exists

for f in *.png ;
do
    # keep the first two dot-separated fields of the name (drops ".png")
    base="$(basename "$f" | cut -d "." -f 1,2)"
    echo "$base"
    if [[ -z "$2" ]] ;
    then
        # no second argument: default processing
        convert "$f"  -separate -average  -crop +0+720 -threshold 11% -fill black -draw 'color 700,10 floodfill' +repage ocr/"$base".png
    else
        # any second argument: negate the image before thresholding
        convert "$f"  -separate -average  -crop +0+720 -negate -threshold 15% -fill white -draw 'color 700,10 floodfill' +repage ocr/"$base".png
    fi
    cd ocr
    magick mogrify -pointsize 50 -fill blue -draw 'text 1400,310 "L" ' +repage "$base".png
    cd ..
done
cd ocr
for i in *.png ;
do
    base2="$(basename "$i" | cut -d "." -f 1,2 | cut -d ":" -f 2,3)"
    # put each frame's OCR text on one line (the second tr gets no input: the first reads the pipe to EOF)
    tesseract "$i" stdout -c page_separator='' --psm 6 --oem 1 --dpi 300 | { tr '\n' ' '; tr -s '[:space:]' ' ';  echo; } >> text.txt
    echo "$base2""  " >> time.txt
done
awk '{printf ("%s", $0); getline < "text.txt"; print $0 }' time.txt >> out.txt
sed -i 's/|/I/g' out.txt
sed -i 's/\[/I/g' out.txt
#sed -i 's/L//g' out.txt
#sed -i 's/=//g' out.txt
sed -i 's/.$//' out.txt
sed -i 's/.$//' out.txt

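# keep only lines that contain at least one letter
# (the old "while read" wrapper consumed the first line before sed ever saw it)
sed '/[[:alpha:]]/!d' out.txt >> sub.txt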
exit

The part that draws the blue L is there to ensure every line has something in it for timestamp matching.

UPDATED IVAN SRT SCRIPT

#!/bin/bash -x

sub="$1"            # path to sub file
OLD=$IFS            # remember current delimiter
IFS=$'\n'           # set delimiter to the new line
raw=( $(cat "$sub") ) # load sub into raw array
IFS=$OLD            # set default delimiter back

reset () {
    unset raw[0]        # remove 1-st item from array
    raw=( "${raw[@]}" ) # rearange array
}

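output () {
    # pass the values as arguments so a "%" or "\" in the subtitle text is printed literally
    printf '00:%s --> 00:%s\n%s\n\n' "$time1" "$time3" "$text1"
}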

speen () {
    time3=$time2
    reset
    test=( "${raw[@]::2}" ) # get two more items
    test2=( ${test[0]} )    # split 2-nd item
    time2=${test2[0]}       # get 2-nd timing
    text2=${test2[@]:1}     # get 2-nd text
    
    # if only one item in test then this is the end, return
    [[ "${test[1]}" ]] || { printf '00:%s --> 00:%s\n%s\n\n' "$time1" "$time2" "$text1"; raw=; return; }
    # compare: speen more if match, print and go further if not
    [[ "$text1" == "$text2" ]] && speen || output
}

N=1 # set counter
while [[ "${raw[@]}" ]]; do # loop through data
    echo $((N++))       # print and inc counter
    test1=( $raw )      # get 1-st item
    time1=${test1[0]}   # get 1-st timing
    text1=${test1[@]:1} # get 1-st text
    speen
done

I just added a third time variable, time3, to save the old time2 value. Basically, eliminating the blank timestamp line broke his matching. I realized that time2 was the first non-matching timestamp, so I needed to save the one before it from the previous loop. Thus time3=$time2, then reset the time2 value, then use the old time2 (now time3) to print the sub string.
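
The same bookkeeping (remember the last timestamp that still matched, and print it once the text changes) can also be written as a single loop without recursion. This is a rough sketch rather than a drop-in replacement: group_cues and emit_cue are hypothetical names, the output mirrors the "00:" prefix used by the printf above, and the text comparison is exact, including internal spacing.

```bash
#!/bin/bash
# Sketch only: group_cues and emit_cue are hypothetical names, not part of the script above.
emit_cue() {
    # $1 = cue number, $2 = first timestamp, $3 = last matching timestamp, $4 = text
    printf '%d\n00:%s --> 00:%s\n%s\n\n' "$1" "$2" "$3" "$4"
}

group_cues() {
    local time text start="" last="" current="" n=0
    # "|| [[ -n $time ]]" keeps a final line that lacks a trailing newline
    while read -r time text || [[ -n $time ]]; do
        [[ -z $time ]] && continue          # skip blank lines
        if [[ -n $start && $text == "$current" ]]; then
            last=$time                      # same text: extend the current cue
            continue
        fi
        if [[ -n $start ]]; then            # text changed: flush the previous cue
            emit_cue "$((++n))" "$start" "$last" "$current"
        fi
        start=$time last=$time current=$text
    done
    if [[ -n $start ]]; then                # flush the final cue
        emit_cue "$((++n))" "$start" "$last" "$current"
    fi
}

# Usage: group_cues < sub.txt > sub.srt
```

Because the end time is always the last line that matched, a cue that appears on only one line gets the same start and end timestamp.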


Solution

  • Ended with this

    #!/bin/bash
    
    sub=file            # path to sub file
    OLD=$IFS            # remember current delimiter
    IFS=$'\n'           # set delimiter to the new line
    raw=( $(cat $sub) ) # load sub into raw array
    IFS=$OLD            # set default delimiter back
    
    reset () {
        unset raw[0]        # remove 1-st item from array
        raw=( "${raw[@]}" ) # rearange array
    }
    
    output () {
        text1=${text1//|/I} # change | to I in text
        text1=${text1//[/I} # change [ to I in text
        # pass the values as arguments so a "%" in the text is printed literally
        printf '%s --> %s\n%s\n\n' "$time1" "$time2" "$text1"
    }
    
    speen () {
        reset
        test=( "${raw[@]::2}" ) # get two more items
        test2=( ${test[0]} )    # split 2-nd item
        time2=${test2[0]}       # get 2-nd timing
        text2=${test2[@]:1}     # get 2-nd text
        # if only one item in test then this is the end, return
        [[ "${test[1]}" ]] || { printf '%s --> %s\n%s\n\n' "$time1" "$time2" "$text1"; raw=; return; }
        # compare: speen more if match, print and go further if not
        [[ "$text1" == "$text2" ]] && speen || output
    }
    
    N=1 # set counter
    while [[ "${raw[@]}" ]]; do # loop through data
        echo $((N++))       # print and inc counter
        test1=( $raw )      # get 1-st item
        time1=${test1[0]}   # get 1-st timing
        text1=${test1[@]:1} # get 1-st text
        speen
    done