Tags: opencv, imagemagick, ocr, tesseract, python-tesseract

Tesseract / OCR / OpenCV : Need to read captcha


I am trying to read the following captcha images with ImageMagick, with no success so far. I am fine with using either ImageMagick or OpenCV to solve this captcha.

Captcha images:

[four sample captcha images]

So far I have tried erode, Gaussian blur and the paint function, but I still cannot get the whole word clean enough for Tesseract to process the image. I have also tried Tesseract's character whitelist, but I guess the image needs more preprocessing before the whitelist can help.

The best result I have reached is this image:

[image: best preprocessing result so far]

Command used : magick.exe c:\e793df3c-b831-11e6-88e4-544635854505.jpg -negate -morphology erode rectangle:1 -negate -threshold 25% -paint 1 c:\ofdbmf-2.jpg

Is it impossible?


Solution

  • For those who are interested:

    There are two ways to accomplish this:

    Method #1: If you have the captcha source available

    If you have the source code, you can check which fonts it uses. More importantly, you can modify it to save a large number of CAPTCHA images (probably more than 10,000) along with the expected answer for each image.

    You can use a simple ‘for’ loop and save every picture with the correct answer as its filename.

    This will be your training data.
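
    As a minimal sketch of that naming scheme: the two helper names, the `.png` extension and the numeric suffix (added so that repeated answers do not overwrite each other) are assumptions made for illustration, not part of any particular captcha source.

```python
from pathlib import Path


def save_labelled(image_bytes: bytes, answer: str, out_dir: Path) -> Path:
    """Write one CAPTCHA image so that its filename carries the answer."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: answers are plain alphanumeric strings, safe to use in a glob.
    index = len(list(out_dir.glob(f"{answer}_*.png")))
    path = out_dir / f"{answer}_{index}.png"
    path.write_bytes(image_bytes)
    return path


def label_from_filename(path: Path) -> str:
    """Recover the expected answer from a file named '<answer>_<n>.png'."""
    return path.stem.rsplit("_", 1)[0]
```

    The ‘for’ loop would call whatever function in the captcha source renders an image, then pass the image bytes and the answer to `save_labelled`.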

    Then split each image into individual letters and match each one back to the corresponding letter in the filename. That way you collect many images of the same letter at different angles and in different shapes. You can use OpenCV blob detection here: threshold the image, then find the contours.

    One problem you might face is overlapping letters. A simple hack is to say that if a single contour is a lot wider than it is tall, we probably have two letters squished together. In that case, we can split the conjoined region in half down the middle and treat it as two separate letters.
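
    That hack can be written in a few lines on top of `(x, y, w, h)` bounding boxes. The cut-off ratio of 1.25 is an assumed value that would need tuning per captcha font.

```python
def split_wide_boxes(boxes, max_ratio=1.25):
    """Split any (x, y, w, h) box that is much wider than tall into two halves."""
    letters = []
    for x, y, w, h in boxes:
        if w / h > max_ratio:
            # Probably two letters touching: cut the box down the middle.
            half = w // 2
            letters.append((x, y, half, h))
            letters.append((x + half, y, w - half, h))
        else:
            letters.append((x, y, w, h))
    return letters
```

    Cutting in the middle is crude, since the two letters rarely have equal widths, but it is often good enough when there are thousands of training samples.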

    Now that we have a way to extract individual letters, you can run it across all the CAPTCHA images. The goal is to collect different variations of each letter. We can save each letter in its own folder to keep things organized.
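
    One possible way to organize the per-letter folders is sketched below. The helper names and the zero-padded file numbering are assumptions; images where the number of extracted boxes does not match the answer length are skipped rather than mislabelled.

```python
from pathlib import Path


def pair_letters(answer: str, boxes: list) -> list:
    """Pair each answer character with its box; return [] if the counts disagree."""
    if len(answer) != len(boxes):
        return []  # extraction went wrong, so skip this image
    return list(zip(answer, boxes))


def letter_path(out_dir: Path, letter: str, counts: dict) -> Path:
    """Return the next path '<out_dir>/<letter>/<n>.png' and bump the letter's counter."""
    folder = out_dir / letter
    folder.mkdir(parents=True, exist_ok=True)
    counts[letter] = counts.get(letter, 0) + 1
    return folder / f"{counts[letter]:06d}.png"
```

    The cropped letter image would then be written to the returned path, for example with `cv2.imwrite(str(path), crop)`.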

    Finally, you can use a simple convolutional neural network architecture with two convolutional layers and two fully connected layers.
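
    A network of that shape might be defined in Keras as follows. The answer only fixes the layer count, so the 20x20 grayscale input, the filter counts, the hidden layer width and the function name `build_model` are all assumptions chosen for illustration.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_model(num_classes: int, size: int = 20) -> keras.Model:
    """Two convolutional layers followed by two fully connected layers."""
    model = keras.Sequential([
        keras.Input(shape=(size, size, 1)),  # one grayscale letter image
        layers.Conv2D(20, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(50, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(500, activation="relu"),
        # One output per possible character, as class probabilities.
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
```

    Training would then call `model.fit` on the letter images (resized to `size` x `size`) with one-hot encoded labels.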

    With enough training data, this approach can get close to a 100% success rate in identifying the captcha letters and numbers.

    Method #2: If you don't have the source

    Pretty much, you have to do a lot more work now. To start with, make sure you have a background in:

    1) Python 2) Keras 3) TensorFlow 4) OpenCV

    If you do, the first step is to download as many captcha images as you can. I usually look at the Network tab in the Google Chrome developer tools, find the path to the captchas, and then put that in a loop to start downloading them.
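
    A download loop along those lines might look like the sketch below. The URL is whatever the Network tab shows for the captcha image, and the function names, the `.jpg` extension and the `delay` parameter are assumptions. The `fetch` argument exists only so the loop can be tried without network access.

```python
import time
import urllib.request
from pathlib import Path


def fetch_bytes(url: str) -> bytes:
    """Download one resource and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def download_captchas(url: str, count: int, out_dir: Path, fetch=fetch_bytes, delay: float = 0.0) -> list:
    """Request the captcha URL `count` times and save each response as a numbered file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for i in range(count):
        path = out_dir / f"captcha_{i:05d}.jpg"
        path.write_bytes(fetch(url))
        saved.append(path)
        time.sleep(delay)  # optional pause between requests
    return saved
```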

    Then use OpenCV to process the downloaded images by creating blobs, thresholding and finding contours.

    Finally comes the training part, followed by testing and validation.

    For more info : https://mathematica.stackexchange.com/questions/143691/crack-captcha-using-deep-learning?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa