tesseract, python-tesseract

How to detect language or script from an input image using Python or Tesseract OCR?


Given an input image whose text can be in any language or writing system, how do I detect which script the text in the picture uses?

Any Python-based or Tesseract-OCR based solution would be appreciated.


Note that script here means writing systems like Latin, Cyrillic, Devanagari, etc., for corresponding languages like English, Russian, Hindi, etc. (respectively)


Solution

  • Pre-requisites:

    • Install Tesseract: sudo apt install tesseract-ocr tesseract-ocr-all
    • Install pytesseract: pip install pytesseract

    Script-Detection:

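    import re

    import pytesseract

    def detect_image_lang(img_path):
        """Return (script_name, confidence), or (None, 0.0) on failure."""
        try:
            # OSD = orientation and script detection; the result is plain text.
            osd = pytesseract.image_to_osd(img_path)
            script = re.search(r"Script: ([a-zA-Z]+)\n", osd).group(1)
            conf = re.search(r"Script confidence: (\d+\.?(\d+)?)", osd).group(1)
            return script, float(conf)
        except Exception:
            # Tesseract raises when the image has too little text, and
            # re.search returns None if the OSD output is not as expected.
            return None, 0.0

    script_name, confidence = detect_image_lang("image.png")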
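
The OSD result is plain text made of "Key: value" lines, so it can also be parsed without regular expressions. The following is a minimal sketch, not part of the original answer: the helper names `parse_osd` and `script_from_osd`, the `min_confidence` parameter, and the sample values are all assumptions made for illustration.

```python
def parse_osd(osd_text):
    """Parse Tesseract OSD text ("Key: value" lines) into a dict of strings."""
    fields = {}
    for line in osd_text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not "Key: value"
            fields[key.strip()] = value.strip()
    return fields


def script_from_osd(osd_text, min_confidence=0.0):
    """Return (script, confidence); script is None if missing or below threshold."""
    fields = parse_osd(osd_text)
    try:
        script = fields["Script"]
        confidence = float(fields["Script confidence"])
    except (KeyError, ValueError):
        return None, 0.0
    if confidence < min_confidence:
        # min_confidence is a hypothetical cut-off; choose it empirically.
        return None, confidence
    return script, confidence


# Illustrative sample of the OSD text format (the values are made up).
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 0
Rotate: 0
Orientation confidence: 2.01
Script: Latin
Script confidence: 1.11
"""

print(script_from_osd(SAMPLE_OSD))  # ('Latin', 1.11)
```

In real use, the string passed to `script_from_osd` would be the return value of `pytesseract.image_to_osd(img_path)`.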
    

    Language-Detection:

    After performing OCR with Tesseract, pass the extracted text to the langdetect library (or any other language-detection library).
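
A rough sketch of that two-step flow is shown below, assuming `langdetect` is installed (`pip install langdetect`). The script-to-language table and the function names are hypothetical and only cover the three examples from the question. The OCR language chosen for Tesseract is a guess based on the detected script, so results will depend on which language models are installed.

```python
# Illustrative mapping from OSD script names to Tesseract language codes.
# It is an assumption for this sketch and far from exhaustive.
SCRIPT_TO_OCR_LANG = {
    "Latin": "eng",
    "Cyrillic": "rus",
    "Devanagari": "hin",
}


def ocr_lang_for_script(script, default="eng"):
    """Pick a Tesseract language code for a detected script name."""
    return SCRIPT_TO_OCR_LANG.get(script, default)


def detect_text_language(img_path, script=None):
    """OCR the image, then guess the language of the extracted text."""
    # Imported here so the mapping helper works without the OCR stack.
    import pytesseract
    from langdetect import DetectorFactory, detect
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is randomized; fix the seed
    text = pytesseract.image_to_string(img_path, lang=ocr_lang_for_script(script))
    try:
        return detect(text)  # language code such as "en" or "ru"
    except LangDetectException:
        # Raised when the text has no usable features (e.g. empty OCR output).
        return None
```

A call such as `detect_text_language("image.png", script="Cyrillic")` would run OCR with the Russian model and then return langdetect's guess for the recognised text.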