Search code examples
pythonocrtesseract

Extracting code from photograph of T-shirt via OCR


I recently saw someone with a T-shirt with some Perl code on the back. I took a photograph of it and cropped out the code:

alt text

Next I tried to extract the code from the image via OCR, so I installed Tesseract OCR and the Python bindings for it, pytesser.

Pytesser only works on TIFF images, so I converted the image in Gimp and entered the following code (Ubuntu 9.10):

>>> from pytesser import *
>>> image = Image.open('code.tif')
>>> print image_to_string(image)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pytesser.py", line 30, in image_to_string
    util.image_to_scratch(im, scratch_image_name)
  File "util.py", line 7, in image_to_scratch
    im.save(scratch_image_name, dpi=(200,200))
  File "/usr/lib/python2.6/dist-packages/PIL/Image.py", line 1406, in save
    save_handler(self, fp, filename)
  File "/usr/lib/python2.6/dist-packages/PIL/BmpImagePlugin.py", line 197, in _save
    raise IOError("cannot write mode %s as BMP" % im.mode)
IOError: cannot write mode RGBA as BMP
>>> r,g,b,a = image.split()
>>> img = Image.merge("RGB", (r,g,b))
>>> print image_to_string(img)
Tesseract Open Source OCR Engine

     éi     _   l_` _ t  
  ’   ‘" fY`  
  {  W       IKQW
  ·  __·_  ‘ ·-»·      
       :W   Z  
  ··  I  A n   1   
           ;f        
     `    `      
`T     .' V   _ ‘  
I  {Z.; » ;,. , ;  y i-   4 : %:,,    
      `· »    V; ` ?    
‘,—·.    
H***li¥v·•·}I§¢   ` _  »¢is5#__·¤G$++}§;“»‘7·
  71   ’    Q  {  NH IQ
  ytéggygi {     ;g¤qg;gm·;,g(g,,3) {3;;+-
   § {Jf**$d$ }‘$p•¢L#d¤ Sc}
  »   i `  i A1:

That's clearly gibberish that comes out of the OCR engine. So, my question is:

  • What do I have to do to get better OCR results out of Tesseract?
  • Or, does anybody else have better luck extracting the code from the above image in another way?

Solution

  • You can probably type faster than you can clean up images and install OCR engines:

    #!/usr/bin/perl
    (my$d=q[AA                GTCAGTTCCT
      CGCTATGTA                 ACACACACCA
        TTTGTGAGT                ATGTAACATA
          CTCGCTGGC              TATGTCAGAC
            AGATTGATC          GATCGATAGA
              ATGATAGATC     GAACGAGTGA
                TAGATAGAGT GATAGATAGA
                  GAGAGA GATAGAACGA
                    TC GATAGAGAGA
                     TAGATAGACA G
                   ATCGAGAGAC AGATA
                 GAACGACAGA TAGATAGAT
               TGAGTGATAG    ACTGAGAGAT
             AGATAGATTG        ATAGATAGAT
           AGATAGATAG           ACTGATAGAT
         AGAGTGATAG             ATAGAATGAG
       AGATAGACAG               ACAGACAGAT
      AGATAGACAG               AGAGACAGAT
      TGATAGATAG             ATAGATAGAT
      TGATAGATAG           AATGATAGAT
       AGATTGAGTG        ACAGATCGAT
         AGAACCTTTCT   CAGTAACAGT
           CTTTCTCGC TGGCTTGCTT
             TCTAA CAACCTTACT
               G ACTGCCTTTC
               TGAGATAGAT CGA
             TAGATAGATA GACAGAC
           AGATAGATAG  ATAGAATGAC
         AGACAGAGAG      ACAGAATGAT
       CGAGAGACAG          ATAGATAGAT
      AGAATGATAG             ACAGATAGAC
      AGATAGATAG               ACAGACAGAT
      AGACAGACTG                 ATAGATAGAT
       AGATAGATAG                 AATGACAGAT
         CGATTGAATG               ACAGATAGAT
           CGACAGATAG             ATAGACAGAT
             AGAGTGATAG          ATTGATCGAC
               TGATTGATAG      ACTGATTGAT
                 AGACAGATAG  AGTGACAGAT
                   CGACAGA TAGATAGATA
                     GATA GATAGATAG
                        ATAGACAGA G
                      AGATAGATAG ACA
                    GTCGCAAGTTC GCTCACA
    ])=~s/\s+//g;%a=map{chr $_=>$i++}65,84,67,
    71;$p=join$;,keys%a;while($d=~/([$p]{4})/g
    ){next if$j++%96>=16;$c=0;for$d(0..3){$c+=
    $a{substr($1,$d,1)}*(4**$d)}$perl.=chr $c}
                 eval $perl;
    

    Edit: typo.