Tesseract doesn't recognize a captcha in a PNG file that contains numbers and letters of the English alphabet


I need to extract a captcha image from a URL and recognize it with Tesseract. My code is:

#!/usr/bin/perl -X
###
$user = 'user'; #Enter your username here
$pass = 'pass'; #Enter your password here
###
#Server settings
$home = "http://perltest.adavice.com";
$url = "$home/c/test.cgi?u=$user&p=$pass";
#Get HTML code!
$html = `GET "$url"`;
###Add code here!
#Grab img from HTML code
if ($html =~ m%img[^>]*src="(/[^"]*)"%s)
{
    $img = $1;
}
###
die "<img> not found\n" if (!$img);
#Download image to server (save as: ocr_me.img)
print "GET '$home$img' > ocr_me.img\n";
system "GET '$home$img' > ocr_me.img";
###Add code here!
#Run OCR (using shell command tesseract) on img and save text as ocr_result.txt
system("tesseract ocr_me.img ocr_result");
print "GET '$txt' > ocr_result.txt\n";
system "GET '$txt' > ocr_result.txt";
###
die "ocr_result.txt not found\n" if (!-e "ocr_result.txt");
# check OCR results:
$txt = 'cat ocr_result.txt';
$txt =~ s/[^A-Za-z0-9\-_\.]+//sg;
$img =~ s/^.*\///;
print `echo -n "file=$img&text=$txt" | POST "$url"`;

The image is extracted correctly. It contains a captcha and looks like this:

[Image: PNG file containing the captcha]

My output is:

GET 'http://perltest.adavice.com/captcha/1533110309.png' > ocr_me.img
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
GET '' > ocr_result.txt
Captcha text not specified

As you can see, the script extracts the image correctly, but Tesseract didn't recognize anything in that PNG file. I tried passing additional parameters such as -psm and -l to the tesseract shell command, but that didn't help either.

UPDATE: After reading @Dave Cross's answer, I tried his suggestion.

In output I got:

http://perltest.adavice.com/captcha/1533141024.png
ocr_me.img
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
[]
200Captcha text not specified
Original image file not specified
Captcha text not specified

Why do I need the text from the PNG image? Maybe this additional information will help. Take a look at this: [image]

This is how $url looks in a browser. My goal is to create a query for this page in wim using Perl. To do that, I need to fill in the form above with my $user, $pass and $txt (the text Tesseract recognized in the image), and send it with POST 'url' (the last line of the code).


Solution

  • There are several strange things going on here. Any one of them could be causing your problems.

    1. Having -X on your shebang line is a terrible idea. It explicitly turns off warnings. I suggest you remove it, add use warnings to your code and fix all the problems that reveals (I'd suggest adding use strict too, but you'd need to declare all of your variables).
    2. I'd recommend using LWP::Simple instead of shelling out to GET.
    3. Please don't use regexes to parse HTML. Use a real HTML parser instead. Web::Query is my current favourite.
    4. You then run GET again, using a variable called $txt that doesn't have a value. That's not going to work!
    5. $txt = 'cat ocr_result.txt' doesn't do what you think it does. You want backticks, not single quotes.
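
    Point 5 is easy to see in isolation. The sketch below uses echo as a stand-in command (an assumption, so that it runs without tesseract installed) and a hypothetical helper, slurp_file, that reads the file in pure Perl instead of shelling out to cat.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Single quotes build a plain string; nothing is executed.
my $literal = 'echo hello';

# Backticks run the command through the shell and capture its stdout.
my $output = `echo hello`;
chomp $output;

# Hypothetical helper: read a whole file without shelling out to cat.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    local $/;                 # no record separator: read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

print "single quotes: [$literal]\n";
print "backticks:     [$output]\n";
```

    In your script, something like slurp_file('ocr_result.txt') would replace the cat call entirely.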

    Update: Obviously, I don't have access to your username or password, so I can't reconstruct all of your code. But this seems to work fine for accessing the image in your example and extracting the text from it.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use feature 'say';
    
    use LWP::Simple;
    
    my $img_url  = 'http://perltest.adavice.com/captcha/1533110309.png';
    my $img_file = 'ocr_me.img';
    
    getstore($img_url, $img_file);
    
    my $txt = `tesseract $img_file stdout`;
    
    say $txt;
    

    Here's your actual error:

    system("tesseract ocr_me.img ocr_result");
    print "GET '$txt' > ocr_result.txt\n";
    system "GET '$txt' > ocr_result.txt";
    

    You ask tesseract to write its output to ocr_result.txt, but two lines later, you overwrite that file with the output of a failed call to GET. I'm not sure what you think that's going to do, but it will trash whatever output tesseract has already stored in that file.
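
    A minimal sketch of that step with the stray GET removed. The helper name ocr_image is hypothetical, and its optional extra arguments (a replacement command, there only so the sketch can be tried without tesseract installed) are an assumption of this example, not part of your script. It relies on tesseract writing its result to <outputbase>.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run the OCR command on $img, then read "$base.txt".
# @cmd defaults to tesseract; it is a parameter only to allow a stand-in.
sub ocr_image {
    my ($img, $base, @cmd) = @_;
    @cmd = ('tesseract') unless @cmd;

    # List form of system() bypasses the shell, so file names need no quoting.
    system(@cmd, $img, $base) == 0
        or die "OCR command failed (status $?)\n";

    my $result = "$base.txt";
    die "$result not found\n" unless -e $result;

    open my $fh, '<', $result or die "Cannot open $result: $!\n";
    local $/;                      # slurp mode: read the whole file
    my $txt = <$fh>;
    close $fh;

    $txt =~ s/\s+\z//;             # drop trailing newlines
    return $txt;
}

# Usage (needs tesseract and the downloaded image):
# my $txt = ocr_image('ocr_me.img', 'ocr_result');
```

    Nothing touches ocr_result.txt between tesseract writing it and the script reading it, which is the part your version got wrong.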

    Updated Update:

    Here's my current version of the code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use feature 'say';
    use LWP::Simple qw[$ua get getstore];
    use File::Basename;
    ###
    my $user = 'xxxx'; #Enter your username here
    my $pass = 'xxxx'; #Enter your password here
    ###
    #Server settings
    my $home = "http://perltest.adavice.com";
    my $url = "$home/c/test.cgi?u=$user&p=$pass";
    #Get HTML code!
    my $html = get($url);
    my $img;
    ###Add code here!
    #Grab img from HTML code
    if ($html =~ m%img[^>]*src="(/[^"]*)"%s)
    {
        $img = $1;
    }
    die "<img> not found\n" unless defined $img;
    my $img_url = $home . $img;
    my $img_file = 'ocr_me.img';
    
    getstore($img_url, $img_file);
    
    say $img_url;
    say $img_file;
    
    # Looks like tesseract adds two newlines to its output -
    # so chomp() it twice!
    chomp(my $txt = `tesseract ocr_me.img stdout`);
    chomp($txt);
    
    say "[$txt]";
    
    $txt =~ s/\W+//g;
    
    my $resp = $ua->post($url, {
      u    => $user,
      p    => $pass,
      file => basename($img),
      text => $txt,
    });
    
    print $resp->code;
    print $resp->content;
    

    I've changed a few things.

    1. Corrected $img_url from $url . $img to $home . $img (this is what was stopping it from getting the correct image).
    2. Switched to using LWP::Simple throughout (it's just easier).
    3. chomped (twice!) the output from tesseract to remove newlines.
    4. Used File::Basename to get the correct filename to pass in the final POST.
    5. Removed any non-word characters from $txt before POSTing it.
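
    Steps 3 and 5 can be folded into one small function, since removing non-word characters also removes the trailing newlines. This is only a sketch: the name clean_captcha_text is hypothetical, and the sample string is made up to stand in for real tesseract output. Note that \W also strips '-' and '.', which the character class in your original script kept.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove everything that is not a word character
# (letters, digits, underscore), including tesseract's trailing newlines.
sub clean_captcha_text {
    my ($raw) = @_;
    (my $clean = $raw) =~ s/\W+//g;
    return $clean;
}

# Made-up sample standing in for tesseract output.
my $sample = "aB3 x-9\n\n";
print '[', clean_captcha_text($sample), "]\n";   # prints [aB3x9]
```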

    It still doesn't quite work. It seems to hang waiting for a response from the server. But I'm afraid I've run out of time to help you.