Title says what I really need ATM.
Basically, I've created an OCR toolchain based on Tesseract and ImageMagick, and I've got it to the point where the output text is very consistent. I'm using it to OCR some old hardsubbed videos and turn them into softsubbed SRT subtitles. To take the screenshots for the image input, I'm using a modified version of an old shell script I found and rewrote ages ago. Those get fed into a second script that processes them into a form readable by Tesseract. At this point I could easily do the remainder of the work by hand, but I'd like to automate everything except the final proofread pass if possible.
Example Text (From current project)
03:04.418 Their parents have always written letters thanking us. =
03:05.018 Their parents have always written letters thanking us. =
03:05.619 Their parents have always written letters thanking us. =
03:06.219 Their parents have always written letters thanking us. =
03:06.820 Their parents have always written letters thanking us. =
03:07.421 Their parents have always written letters thanking us. =
03:08.021 Their parents have always written letters thanking us. =
03:08.622 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:09.222 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:09.823 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:10.424 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:11.024 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:11.625 This seminary was highly reeemmended. | am relieved te leave her in your care. =
03:12.225 In additien te all the previeus requests se far..."
03:12.826 In additien te all the previeus requests se far..."
03:13.427 In additien te all the previeus requests se far..."
03:14.027 In additien te all the previeus requests se far..."
03:14.628 In additien te all the previeus requests se far..."
Basically, I want to group consecutive lines with matching text, pull the timestamps from the first and last line of each group, and write the result in SRT format:
1
00:03:04,418 --> 00:03:08,021
Their parents have always written
letters thanking us. =
2
00:03:08,622 --> 00:03:11,625
This seminary was highly reeemmended
| am relieved te leave her in your care. =
3
00:03:12,225 --> 00:03:14,628
In additien te all the previeus requests se far..."
At this point I'm fine with it being a separate script.
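The grouping described above could be sketched in awk roughly as follows. This is only a sketch under a few assumptions: every input line starts with an MM:SS.mmm timestamp followed by the text, the video is shorter than an hour, and the function name subs_to_srt is made up for illustration. It also leaves each subtitle on a single line rather than wrapping it as in the sample.

```bash
#!/bin/bash
# Sketch: collapse runs of identical OCR lines into numbered SRT cues.
# subs_to_srt is a hypothetical name; input format is assumed to be
# "MM:SS.mmm text..." with one screenshot per line.
subs_to_srt() {
    awk '
    # Print the cue collected so far (nothing to print for blank text).
    function flush() {
        if (text == "") return
        s = first; e = last
        sub(/\./, ",", s)            # SRT uses a comma before the milliseconds
        sub(/\./, ",", e)
        printf "%d\n00:%s --> 00:%s\n%s\n\n", ++n, s, e, text
    }
    {
        ts = $1
        line = (NF > 1) ? $0 : ""    # a bare timestamp counts as blank text
        sub(/^[^ ]+ +/, "", line)    # drop the timestamp field
        if (line != text) {          # text changed: close the previous cue
            flush()
            first = ts
            text = line
        }
        last = ts                    # last timestamp seen with this text
    }
    END { flush() }
    '
}

# Usage (file names as in the question):
#   subs_to_srt < sub.txt > sub.srt
```

The end time of each cue is simply the timestamp of the last screenshot that still showed the same text, so a subtitle seen in only one screenshot gets identical start and end times.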
Basically sub.txt in, sub.srt out, then a proofread pass. There is a bit of variability in the detected text, but it's minimal: I is occasionally detected as | or [, and o and e sometimes get mixed up in some odd corner cases.
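One way to keep that variability from splitting a subtitle into several cues might be to compare a normalized key instead of the raw text. The sketch below is an assumption rather than part of the toolchain: ocr_key and same_sub are hypothetical names, and folding every o into e is only safe for comparing neighbouring lines, never for the text that ends up in the SRT file.

```bash
#!/bin/bash
# Sketch: build a comparison key that ignores the known OCR confusions.
# ocr_key and same_sub are hypothetical helper names.
ocr_key() {
    # | and [ become I, then o/O are folded into e/E (comparison only).
    sed -e 's/|/I/g' -e 's/\[/I/g' -e 'y/oO/eE/'
}

# Succeeds when two subtitle texts are the same after normalization.
same_sub() {
    [ "$(printf '%s\n' "$1" | ocr_key)" = "$(printf '%s\n' "$2" | ocr_key)" ]
}
```

With this, "| am relieved te leave" and "I am relieved to leave" would count as the same subtitle, while the proofread pass still sees whichever raw spelling was kept.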
Edit February 2 2020:
I've made some changes and tweaks to both my shell script and Ivan's to get closer to what I wanted. I've eliminated the blank sub lines produced by both scripts.
Updated processing and OCR script, BTW:
#!/bin/bash -x
cd "$1"
mkdir ocr
for f in *.png ;
do
base="$(basename "$f" | cut -d "." -f 1,2)"
echo "$base"
# a second argument selects the negated variant of the conversion
if [[ -z "$2" ]] ;
then
convert "$f" -separate -average -crop +0+720 -threshold 11% -fill black -draw 'color 700,10 floodfill' +repage ocr/"$base".png
else
convert "$f" -separate -average -crop +0+720 -negate -threshold 15% -fill white -draw 'color 700,10 floodfill' +repage ocr/"$base".png
fi
cd ocr
magick mogrify -pointsize 50 -fill blue -draw 'text 1400,310 "L" ' +repage "$base".png
cd ..
done
cd ocr
for i in *.png ;
do base2="$(basename "$i" | cut -d "." -f 1,2 | cut -d ":" -f 2,3)"
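# join the OCR output onto one line and squeeze runs of whitespace in a single tr
# (a second tr in the same group would receive no input)
tesseract "$i" stdout -c page_separator='' --psm 6 --oem 1 --dpi 300 | { tr -s '[:space:]' ' '; echo; } >> text.txt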
echo "$base2"" " >> time.txt
done
awk '{printf ("%s", $0); getline < "text.txt"; print $0 }' time.txt >> out.txt
sed -i 's/|/I/g' out.txt
sed -i 's/\[/I/g' out.txt
#sed -i 's/L//g' out.txt
#sed -i 's/=//g' out.txt
sed -i 's/.$//' out.txt
sed -i 's/.$//' out.txt
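# keep only lines that contain letters; running sed on the file directly
# avoids the read loop swallowing the first line before sed sees it
sed '/[[:alpha:]]/!d' out.txt >> sub.txt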
exit
The part that draws the blue L ensures every line has something in it for timestamp matching.
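Since the marker has to be removed again afterwards, a more targeted removal than dropping the last two characters could look like the sketch below. It assumes Tesseract reads the marker as a standalone capital L at the very end of the line, which the script above suggests but does not guarantee; strip_marker is a hypothetical name.

```bash
#!/bin/bash
# Sketch: remove a trailing standalone "L" marker plus surrounding whitespace.
# strip_marker is a hypothetical helper; it leaves words that merely end
# in L (for example "LOL") untouched.
strip_marker() {
    sed -E 's/(^|[[:space:]]+)L[[:space:]]*$//'
}

# Usage: strip_marker < out.txt > out.stripped.txt
```

Lines where the marker was misread as some other character would pass through unchanged, so they would still show up in the proofread pass.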
UPDATED IVAN SRT SCRIPT
#!/bin/bash -x
sub="$1" # path to sub file
OLD=$IFS # remember current delimiter
IFS=$'\n' # set delimiter to the new line
raw=( $(cat "$sub") ) # load sub into raw array (path quoted so spaces are safe)
IFS=$OLD # set default delimiter back
reset () {
unset 'raw[0]' # remove 1-st item from array (quoted to avoid globbing)
raw=( "${raw[@]}" ) # re-index array
}
output () {
# pass the values as arguments so a % in the subtitle text cannot break printf
printf '00:%s --> 00:%s\n%s\n\n' "$time1" "$time3" "$text1"
}
speen () {
time3=$time2
reset
test=( "${raw[@]::2}" ) # get two more items
test2=( ${test[0]} ) # split 2-nd item
time2=${test2[0]} # get 2-nd timing
text2=${test2[@]:1} # get 2-nd text
# if there is only one item in test then this is the end, return
[[ "${test[1]}" ]] || { printf '00:%s --> 00:%s\n%s\n\n' "$time1" "$time2" "$text1"; raw=; return; }
# compare: speen again if they match, otherwise print and move on
[[ "$text1" == "$text2" ]] && speen || output
}
N=1 # set counter
while [[ "${raw[@]}" ]]; do # loop through data
echo $((N++)) # print and inc counter
test1=( $raw ) # get 1-st item
time1=${test1[0]} # get 1-st timing
text1=${test1[@]:1} # get 1-st text
speen
done
I just added a third time variable, time3, to save the old time2 value. Eliminating the blank timestamp lines broke his matching: I realized that time2 is the first non-matching timestamp, so I needed to save the one before it from the previous loop, hence time3=$time2. The script then resets time2 and uses the old time2 (now time3) to print the sub string.
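Note that the cue line printed this way still has a period before the milliseconds, while the SRT layout shown in the question uses a comma. A small helper along these lines could do the conversion; it is a sketch that assumes MM:SS.mmm input with exactly one period and a video shorter than an hour, and to_srt_time is a hypothetical name.

```bash
#!/bin/bash
# Sketch: turn MM:SS.mmm into the HH:MM:SS,mmm form used in SRT files.
# to_srt_time is a hypothetical helper; the hour is hard-coded to 00,
# matching the "00:" prefix the script already prints.
to_srt_time() {
    # ${1%.*} is everything before the period, ${1##*.} everything after it
    printf '00:%s,%s\n' "${1%.*}" "${1##*.}"
}

# Usage inside output():
#   printf '%s --> %s\n' "$(to_srt_time "$time1")" "$(to_srt_time "$time3")"
```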
Ended with this
#!/bin/bash
sub=file # path to sub file
OLD=$IFS # remember current delimiter
IFS=$'\n' # set delimiter to the new line
raw=( $(cat "$sub") ) # load sub into raw array (path quoted so spaces are safe)
IFS=$OLD # set default delimiter back
reset () {
unset 'raw[0]' # remove 1-st item from array (quoted to avoid globbing)
raw=( "${raw[@]}" ) # re-index array
}
output () {
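text1=${text1//|/I} # change | to I in text
text1=${text1//\[/I} # change [ to I in text (bracket escaped so it is read literally)
printf '%s --> %s\n%s\n\n' "$time1" "$time2" "$text1" # values as arguments so % in the text is safe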
}
speen () {
reset
test=( "${raw[@]::2}" ) # get two more items
test2=( ${test[0]} ) # split 2-nd item
time2=${test2[0]} # get 2-nd timing
text2=${test2[@]:1} # get 2-nd text
# if there is only one item in test then this is the end, return
[[ "${test[1]}" ]] || { printf '%s --> %s\n%s\n\n' "$time1" "$time2" "$text1"; raw=; return; }
# compare: speen again if they match, otherwise print and move on
[[ "$text1" == "$text2" ]] && speen || output
}
N=1 # set counter
while [[ "${raw[@]}" ]]; do # loop through data
echo $((N++)) # print and inc counter
test1=( $raw ) # get 1-st item
time1=${test1[0]} # get 1-st timing
text1=${test1[@]:1} # get 1-st text
speen
done