I have a CSV file with 1900+ entries of GIF image links.
Each image contains an email address.
I would like to programmatically read every entry and convert each image to its corresponding text, preferably in another or the same CSV file. I use Mac OS and prefer using Python or Java to accomplish this.
Any ideas on how to do this using OCR or any other method? Example code would be greatly appreciated.
I've tried tesseract for a sample entry but the result wasn't accurate. Here's what I tried:
$ tesseract email.gif out
email.gif looks like:
[email protected]
The output generated in out.txt is:
gveen|L7uvs2fl1fl@yahLm cum
The CSV file looks as shown below (first 2 entries):
This is my first question on SO. Apologies if I left out any relevant information. I will be happy to provide more.
Updated Answer
Your images are rather small and blocky for tesseract...
You may get better results by enlarging and sharpening them with ImageMagick, like this:
convert email.gif -resize 600x -unsharp 0x8 -threshold 95% x.png # Enlarge and sharpen
tesseract x.png text # OCR
Result
[email protected]
If your CSV file looks like your example and is called file.csv:
http://d1hnc0v5nyu4l2.cloudfront.net/kh/communications/original/1417577580/C2AFA720-7A9C-11E4-9201-22000AA51306?1417577580
http://d306v9rz034cgu.cloudfront.net/kh/communications/original/1367212416/55BE4627-B463-4523-8332-4046835D3D79?1367212416
you might write
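#!/bin/bash
# Read one image URL per line from file.csv
while read -r f; do
   # Enlarge, sharpen and threshold the image
   convert "$f" -resize 600x -unsharp 0x8 -threshold 95% image.png
   # OCR the cleaned-up image into text.txt
   tesseract image.png text
   # Keep only the lines that contain some text
   grep "[a-z0-9]" text.txt >> results.txt
done < file.csv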
And your file results.txt will have
[email protected]
cambodia][email protected]
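Since you mentioned preferring Python, the same loop could be sketched roughly as below. This is only a sketch under some assumptions: that `convert` and `tesseract` are on your PATH, that the CSV has one URL per row in its first column, and that keeping the first non-empty OCR line is good enough (the shell loop above keeps every matching line instead). The function names and the two-column output layout are my own invention, not anything standard.

```python
#!/usr/bin/env python3
"""Sketch: OCR every image URL in a one-column CSV into a two-column CSV."""
import csv
import re
import subprocess
import sys
import tempfile
import urllib.request
from pathlib import Path


def convert_command(src, dst):
    """Same ImageMagick call as in the shell loop: enlarge, sharpen, threshold."""
    return ["convert", str(src), "-resize", "600x", "-unsharp", "0x8",
            "-threshold", "95%", str(dst)]


def first_text_line(ocr_output):
    """Return the first line containing a lowercase letter or digit, or ''."""
    for line in ocr_output.splitlines():
        if re.search(r"[a-z0-9]", line):
            return line.strip()
    return ""


def ocr_url(url, workdir):
    """Download one image, clean it up, run tesseract, return the text."""
    gif = Path(workdir) / "email.gif"
    png = Path(workdir) / "image.png"
    base = Path(workdir) / "text"  # tesseract appends ".txt" itself
    urllib.request.urlretrieve(url, str(gif))
    subprocess.run(convert_command(gif, png), check=True)
    subprocess.run(["tesseract", str(png), str(base)], check=True)
    return first_text_line(Path(str(base) + ".txt").read_text())


def main(in_csv, out_csv):
    """Write one row per input URL: the URL and the text read from its image."""
    with open(in_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        writer = csv.writer(dst)
        with tempfile.TemporaryDirectory() as workdir:
            for row in csv.reader(src):
                if row:  # skip blank lines
                    writer.writerow([row[0], ocr_url(row[0], workdir)])


# Only runs when called with exactly two arguments: input CSV and output CSV.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], sys.argv[2])
```

If you saved this as, say, ocr_emails.py (a made-up name), you would run it as `python3 ocr_emails.py file.csv results.csv`. Keeping the URL next to the OCR text makes it easier to spot-check the rows that look wrong.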
If you do indeed plan to use ImageMagick or tesseract on OSX, please consider installing it with homebrew. It will make your life easier. Ask if you don't know how.
Original Answer
Well, it may be a start to use tesseract. Basically, you pass it the name of an input image file (email.png in my example) and the base of an output text file, like this:
tesseract email.png text -psm 7
Then you will get some text in file text.txt like this:
lmAV@chwL7v\d1vave\z:um
You can try all sorts of different parameters and strategies for cleaning up your input file, probably using ImageMagick.
As you don't say what OS you use, or what your CSV file looks like, it is hard to help any further at the moment.
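One way to try different parameters without checking every result by eye is to run tesseract with a few page segmentation modes and keep the first output that at least looks like an email address. The sketch below assumes a newer tesseract that takes `--psm` (older 3.x builds use `-psm` as in my command above) and that accepts `stdout` as the output base. The regular expression is deliberately loose and the helper names are just for illustration.

```python
import re
import subprocess

# Loose sanity check: something@something.tld with no spaces. Not a full validator.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def looks_like_email(text):
    """True if the OCR output is at least shaped like an email address."""
    return bool(EMAIL_RE.match(text.strip()))


def ocr_with_psm(image, psm):
    """Run tesseract on one image with the given page segmentation mode."""
    result = subprocess.run(
        ["tesseract", image, "stdout", "--psm", str(psm)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def best_guess(image, modes=(7, 8, 6)):
    """Try each mode in turn; return the first plausible address, else None."""
    for psm in modes:
        text = ocr_with_psm(image, psm)
        if looks_like_email(text):
            return text
    return None
```

Anything that comes back as None is a candidate for different ImageMagick settings or a manual look.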