
Using Tesseract on Heroku with Django


I would like to add OCR capabilities to my Django app running on Heroku. I suspect the easiest way is to use Tesseract. There are several Python wrappers for Tesseract's API, but what is the best way to get Tesseract itself installed and running on Heroku? Via a custom buildpack such as heroku-buildpack-tesseract, maybe?


Solution

  • Here are some notes on the solution I arrived at.

    My .buildpacks file:

    https://github.com/heroku/heroku-buildpack-python
    https://github.com/clearideas/heroku-buildpack-ghostscript
    https://github.com/marcolinux/heroku-buildpack-libraries
    

    My .buildpacks_bin_download file:

    tesseract-ocr https://s3.amazonaws.com/tesseract-ocr/heroku/tesseract-ocr-3.02.02.tar.gz 3.02 eng,spa
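
Once the buildpacks have run, it can help to confirm that the binaries actually ended up on the dyno's PATH before wiring OCR into the app. The following is only a minimal sketch: the helper names are made up for illustration, and the executable names `tesseract` and `gs` are assumptions about what the buildpacks above install.

```python
import shutil


def locate_binaries(names=("tesseract", "gs")):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_binaries(names=("tesseract", "gs")):
    """Return the names that could not be found on PATH, in sorted order."""
    found = locate_binaries(names)
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    # Executable names are assumptions; adjust them to what your buildpacks provide.
    missing = missing_binaries()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All OCR binaries found.")
```

Running this from a one-off dyno (or from a Django management command) gives a quick yes/no answer without touching any documents.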
    

    Here is the key piece of Python that OCRs PDF files:

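            # Additional processing
            # Fragment from inside a larger function; `document` and `name` are defined earlier.
            # Assumed imports (not shown here): Path from unipath, the python-ghostscript
            # module as `ghostscript`, and ReadBot (a Tesseract wrapper).
            document_path = Path(str(document.attachment_file))
    
            if document_path.ext == '.pdf':
                # Scratch directory for the rendered page images
                working_path = Path('temp', document.directory)
                working_path.mkdir(parents=True)
    
                # Save the uploaded PDF to disk; 'wb' because PDF data is binary
                input_path = Path(working_path, name)
                input_path.write_file(document.attachment_file.read(), 'wb')
    
                rb = ReadBot()
    
                # Render each page to a 600 dpi grayscale PNG
                args = [
                    'VBEZ',  # placeholder program name (argv[0]); Ghostscript ignores it
                    # '-sDEVICE=tiffg4',
                    '-sDEVICE=pnggray',
                    '-dNOPAUSE',
                    '-r600x600',
                    # Zero-padded page numbers so the file names sort in page order
                    '-sOutputFile=' + str(working_path) + '/page-%03d.png',
                    str(input_path)
                ]
    
                ghostscript.Ghostscript(*args)
                image_paths = sorted(working_path.listdir(pattern='*.png'))
                txt = ''
    
                # OCR each page image and concatenate the text
                for image_path in image_paths:
                    ocrtext = rb.interpret(str(image_path))
                    txt = txt + ocrtext
    
                document.notes = txt
                document.save()
                working_path.rmtree()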
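
For comparison, the same render-then-OCR pipeline can be sketched with only the standard library by calling the command-line tools directly. This is not the code I run in production, just an illustration under some assumptions: the `gs` and `tesseract` executables are on PATH, the function names are hypothetical, and the 600 dpi default simply mirrors the snippet above. It uses a temporary directory so the page images are cleaned up even if OCR fails partway through.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def gs_command(pdf_path, out_dir, dpi=600):
    """Build a Ghostscript command that renders each PDF page to a grayscale PNG."""
    return [
        "gs",
        "-q",
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=pnggray",
        "-r{0}".format(dpi),
        "-sOutputFile={0}".format(Path(out_dir) / "page-%03d.png"),
        str(pdf_path),
    ]


def tesseract_command(image_path, output_base, lang="eng"):
    """Build a Tesseract command; the CLI writes its text to <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def page_number(path):
    """Extract the first run of digits from a file name such as page-012.png."""
    match = re.search(r"(\d+)", Path(path).stem)
    return int(match.group(1)) if match else 0


def ocr_pdf(pdf_path, lang="eng", dpi=600):
    """Render a PDF to page images, OCR each page, and return the combined text."""
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        subprocess.check_call(gs_command(pdf_path, work_dir, dpi))
        chunks = []
        # Sort numerically so page 10 comes after page 9, not after page 1.
        for image in sorted(work_dir.glob("page-*.png"), key=page_number):
            base = image.with_suffix("")
            subprocess.check_call(tesseract_command(image, base, lang))
            chunks.append(base.with_suffix(".txt").read_text(encoding="utf-8"))
        return "\n".join(chunks)
```

In a Django view or background task, something like `document.notes = ocr_pdf(path_to_saved_pdf)` followed by `document.save()` would take the place of the loop above. At 600 dpi, rendering and OCR can be slow for long documents, so running this outside the request cycle is probably the safer choice.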