I need to extract a captcha image from a URL and recognise it with Tesseract. My code is:
#!/usr/bin/perl -X
###
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
###
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
#Get HTML code!
$html = `GET "$url"`;
###Add code here!
#Grab img from HTML code
if ($html =~ m%img[^>]*src="(/[^"]*)"%s)
{
$img = $1;
}
###
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$home$img' > ocr_me.img\n";
system "GET '$home$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
###
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = 'cat ocr_result.txt';
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;
The image URL is parsed correctly. The image contains a captcha and looks like this: [captcha image]
My output is:
GET 'http://perltest.adavice.com/captcha/1533110309.png' > ocr_me.img
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
GET '' > ocr_result.txt
Captcha text not specified
As you can see, the script parses the image URL correctly, but Tesseract doesn't see anything in that PNG file. I have tried passing additional parameters such as -psm and -l to the tesseract shell command, but that gave nothing either.
UPDATE: After reading @Dave Cross's answer, I tried his suggestion.
The output I got:
http://perltest.adavice.com/captcha/1533141024.png
ocr_me.img
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[]
200Captcha text not specified
Original image file not specified
Captcha text not specified
Why do I need the text from the .PNG image? Maybe this additional information will help.
Look at this: [screenshot of the page]
This is how $url looks in a browser. My goal is to build a request for this page in vim using Perl. To do that I need to fill in the form above with my $user, $pass and $txt (the text Tesseract recognised from the image), and send it with POST 'url' (the last line of the code).
Several strange things are going on here. Any one of them could be causing your problems.
- -X on your shebang line is a terrible idea. It explicitly turns off warnings. I suggest you remove it, add use warnings to your code and fix all the problems that reveals (I'd suggest adding use strict too, but you'd need to declare all of your variables).
- You shell out to the external GET program to fetch the HTML, rather than doing it from Perl.
- You then call GET again, using a variable called $txt that doesn't have a value. That's not going to work!
- $txt = 'cat ocr_result.txt' doesn't do what you think it does. You want backticks, not single quotes.

Update: Obviously, I don't have access to your username or password, so I can't reconstruct all of your code. But this seems to work fine for accessing the image in your example and extracting the text from it.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use LWP::Simple;
my $img_url = 'http://perltest.adavice.com/captcha/1533110309.png';
my $img_file = 'ocr_me.img';
getstore($img_url, $img_file);
my $txt = `tesseract $img_file stdout`;
say $txt;
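To make two of the points above concrete (the hidden warnings and the single quotes versus backticks), here is a minimal sketch. It does not touch the network or tesseract: echo hello stands in for cat ocr_result.txt, and the @warnings array and the $SIG{__WARN__} handler are illustrative additions, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect warnings so we can see what -X was hiding (illustrative only).
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

# 1. An unset $txt interpolated into a command string, as in the question.
#    The command is only built here, never executed.
my $txt;
my $cmd = "GET '$txt' > ocr_result.txt";
print "command: $cmd\n";
print "warning: $_" for @warnings;

# 2. Single quotes give a plain string; backticks run the command
#    and return whatever it printed.
my $literal = 'echo hello';
my $output  = `echo hello`;
chomp $output;
print "single quotes: [$literal]\n";
print "backticks:     [$output]\n";
```

The built command comes out as GET '' > ocr_result.txt, which matches the line in your output, and with warnings enabled Perl complains about the uninitialized $txt instead of staying silent.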
Here's your actual error:
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
You ask tesseract to write its output to ocr_result.txt, but two lines later, you overwrite that file with the output of a failed call to GET. I'm not sure what you think that's going to do, but it will trash whatever output tesseract has already stored in that file.
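One way to avoid clobbering the file is to drop the second GET entirely and just read what tesseract wrote. The sketch below assumes tesseract is on your PATH when you actually run it; read_ocr_result is a hypothetical helper name, and the character class is the one from your own cleanup line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: slurp the text file tesseract wrote and tidy it up.
sub read_ocr_result {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    local $/;                          # slurp mode: read the whole file at once
    my $txt = <$fh>;
    close $fh;
    $txt = '' unless defined $txt;
    $txt =~ s/[^A-Za-z0-9\-_.]+//g;    # same character class as in the question
    return $txt;
}

# Intended usage (assumes tesseract is installed and ocr_me.img exists):
# system('tesseract', 'ocr_me.img', 'ocr_result') == 0
#     or die "tesseract failed\n";
# my $txt = read_ocr_result('ocr_result.txt');
```

Nothing touches ocr_result.txt between the tesseract call and the read, so the OCR output survives.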
Updated Update:
Here's my current version of the code:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use LWP::Simple qw[$ua get getstore];
use File::Basename;
###
my $user = 'xxxx'; #Enter your username here
my $pass = 'xxxx'; #Enter your password here
###
#Server settings
my $home = "http://perltest.adavice.com";
my $url = "$home/c/test.cgi?u=$user&p=$pass";
#Get HTML code!
my $html = get($url);
my $img;
###Add code here!
#Grab img from HTML code
if ($html =~ m%img[^>]*src="(/[^"]*)"%s)
{
    $img = $1;
}
my $img_url = $home . $img;
my $img_file = 'ocr_me.img';
getstore($img_url, $img_file);
say $img_url;
say $img_file;
# Looks like tesseract adds two newlines to its output -
# so chomp() it twice!
chomp(my $txt = `tesseract ocr_me.img stdout`);
chomp($txt);
say "[$txt]";
$txt =~ s/\W+//g;
my $resp = $ua->post($url, {
    u    => $user,
    p    => $pass,
    file => basename($img),
    text => $txt,
});
print $resp->code;
print $resp->content;
I've changed a few things.
- Changed $img_url from $url . $img to $home . $img (this is what was stopping it from getting the correct image).
- chomped (twice!) the output from tesseract to remove newlines.
- Used $ua->post instead of shelling out to POST.
- Stripped non-word characters from $txt before POSTing it.

It still doesn't quite work. It seems to hang waiting for a response from the server. But I'm afraid I've run out of time to help you.
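As a possible tidy-up of the chomp and cleanup steps listed above, the two chomps can be replaced by a single substitution that removes trailing whitespace no matter how many newlines tesseract emits. This is only a sketch: clean_ocr_text is a hypothetical helper, and the sample string is made up rather than real OCR output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the trimmed text (handy for printing)
# and the sanitised text (what would go into the POST).
sub clean_ocr_text {
    my ($raw) = @_;
    return ('', '') unless defined $raw;
    (my $trimmed = $raw) =~ s/\s+\z//;    # drop all trailing whitespace
    (my $clean = $trimmed) =~ s/\W+//g;   # keep only letters, digits and _
    return ($trimmed, $clean);
}

# Made-up sample with an embedded space and two trailing newlines.
my ($trimmed, $clean) = clean_ocr_text("x7 Kq9\n\n");
print "[$trimmed]\n";   # prints [x7 Kq9]
print "[$clean]\n";     # prints [x7Kq9]
```

Since s/\W+//g also strips newlines, the trimming only matters for the value you print for debugging; the POSTed text comes out the same either way.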