csv image-processing ocr tesseract python-tesseract

How to programmatically read and convert email in an image to text?


I have a CSV file with 1900+ entries of GIF image links.

Each image contains an email address.

I would like to programmatically read every entry and convert each image to its corresponding text, preferably writing the results to another CSV file or back into the same one. I use Mac OS and would prefer to do this in Python or Java.

Any ideas on how to do this using OCR or some other method? Example code would be greatly appreciated.

I've tried tesseract on a sample entry, but the result wasn't accurate. Here's what I tried:

 $ tesseract email.gif out

email.gif looks like:

[email protected]

The output generated in out.txt is:

gveen|L7uvs2fl1fl@yahLm cum

The CSV file looks as shown below (first 2 entries):

This is my first question on SO. Apologies if I have left out any relevant information; I will be happy to provide more.


Solution

  • Updated Answer

    Your images are rather small and blocky for tesseract...

    You may get better results by enlarging and sharpening them with ImageMagick, like this:

    convert email.gif -resize 600x -unsharp 0x8 -threshold 95% x.png     # Enlarge and sharpen
    tesseract x.png text                                                 # OCR
    

    Result

    [email protected]
    

    If your CSV file looks like your example and is called file.csv:

    http://d1hnc0v5nyu4l2.cloudfront.net/kh/communications/original/1417577580/C2AFA720-7A9C-11E4-9201-22000AA51306?1417577580
    http://d306v9rz034cgu.cloudfront.net/kh/communications/original/1367212416/55BE4627-B463-4523-8332-4046835D3D79?1367212416

    you might write

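    #!/bin/bash
    # Read one image URL per line from file.csv
    while IFS= read -r f; do
       convert "$f" -resize 600x -unsharp 0x8 -threshold 95% image.png   # Enlarge and sharpen
       tesseract image.png text                                          # OCR; writes text.txt
       grep "[a-z0-9]" text.txt >> results.txt                           # Keep only lines containing text
    done < file.csv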
    

    And your file results.txt will have

    [email protected]
    cambodia][email protected]
    

    If you do plan to use ImageMagick or tesseract on OS X, please consider installing them with Homebrew. It will make your life easier. Ask if you don't know how.
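
    Since the question mentions Python and asks for the results in a CSV file, the shell loop above could also be sketched in Python. This is only a rough sketch: it assumes convert and tesseract are on your PATH, that file.csv holds one image URL in the first column of each row, and that the images are single-frame GIFs. The function names (convert_cmd, tesseract_cmd, first_text_line, ocr_url, ocr_csv) and the temporary file names are made up for illustration.

```python
import csv
import subprocess
import tempfile
import urllib.request
from pathlib import Path


def convert_cmd(src, dst):
    """ImageMagick command from the answer: enlarge, sharpen, threshold."""
    return ["convert", str(src), "-resize", "600x", "-unsharp", "0x8",
            "-threshold", "95%", str(dst)]


def tesseract_cmd(image, out_base):
    """tesseract writes its result to <out_base>.txt."""
    return ["tesseract", str(image), str(out_base)]


def first_text_line(text):
    """Return the first non-blank line of OCR output, or '' if there is none."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


def ocr_url(url, workdir):
    """Download one image, preprocess it, OCR it and return the recognised text."""
    workdir = Path(workdir)
    raw = workdir / "raw.gif"      # assumption: the images are GIFs, as in the question
    clean = workdir / "clean.png"
    out_base = workdir / "text"
    urllib.request.urlretrieve(url, raw)
    subprocess.run(convert_cmd(raw, clean), check=True)
    subprocess.run(tesseract_cmd(clean, out_base), check=True)
    result = out_base.with_suffix(".txt").read_text(encoding="utf-8", errors="replace")
    return first_text_line(result)


def ocr_csv(in_path, out_path):
    """Read one URL per row from in_path and write (url, text) rows to out_path."""
    with open(in_path, newline="") as src, \
         open(out_path, "w", newline="") as dst, \
         tempfile.TemporaryDirectory() as workdir:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row or not row[0].strip():
                continue  # skip blank rows
            url = row[0].strip()
            writer.writerow([url, ocr_url(url, workdir)])
```

    Calling ocr_csv("file.csv", "results.csv") would then write each URL next to whatever text tesseract recognised for it, so inaccurate rows are easy to spot and recheck by hand.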

    Original Answer

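    Well, tesseract may be a good place to start. Basically, you pass it the name of an input image file (email.png in my example) and the base name of an output text file. The -psm 7 option tells tesseract to treat the image as a single line of text (tesseract 4 and later spell it --psm):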

    tesseract email.png text -psm 7
    

    Then you will get some text in the file text.txt, like this:

    lmAV@chwL7v\d1vave\z:um
    

    You can try all sorts of different parameters and strategies for cleaning up your input file, probably using ImageMagick.
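
    One way to try different parameters systematically is to generate several ImageMagick commands and keep the first OCR result that at least looks like an email address. The sketch below only builds the commands and checks the results; running them is left out. The widths and thresholds are arbitrary starting points, the pattern is deliberately loose, and all of the names (EMAIL_RE, looks_like_email, candidate_commands, first_plausible) are invented for this example.

```python
import itertools
import re

# Deliberately loose pattern: something@something.tld
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def looks_like_email(text):
    """True if the OCR output, with all whitespace removed, resembles an email address."""
    return bool(EMAIL_RE.match("".join(text.split())))


def candidate_commands(src, dst, widths=(400, 600, 800), thresholds=(90, 95)):
    """Yield one ImageMagick command per (width, threshold) combination."""
    for width, threshold in itertools.product(widths, thresholds):
        yield ["convert", src, "-resize", f"{width}x", "-unsharp", "0x8",
               "-threshold", f"{threshold}%", dst]


def first_plausible(results):
    """Return the first OCR result that looks like an email address, else None."""
    for text in results:
        if looks_like_email(text):
            return "".join(text.split())
    return None
```

    Neither of the garbled outputs shown on this page passes looks_like_email, so a check like this can at least flag images that need different settings or a manual look. It cannot tell you that a plausible-looking address was read correctly.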

    As you don't say what OS you use, or what your CSV file looks like, it is hard to help any further at the moment.