pdf cmd ocr tesseract urdu

Text changes when I copy it from a searchable PDF (created with the tesseract command) and paste it into Notepad


I created a searchable PDF file by running the following command on one of my images.

tesseract page.jpg test pdf --oem 1 --psm 5 -l urd

This is the image that I converted to a searchable PDF.

The image contains Urdu text, but when I copy the text from the newly created PDF file and paste it into any text editor, this is what I get:

GehbFie”

Can anyone with Tesseract OCR and encoding experience help me solve this? Any help will be highly appreciated. Thanks in advance.


Solution

  • pdf is a config file name, so it needs to come last in the command, after options such as --oem, --psm and -l.

    The correct form of the command is as follows.

    tesseract page.jpg test --oem 1 --psm 5 -l urd pdf
    

    This resolved my issue.
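
To make the ordering harder to get wrong, the rule above can be wrapped in a small shell function. This is only a sketch: build_tesseract_cmd is a hypothetical helper, not part of tesseract, and the --oem 1 --psm 5 values are simply the ones from the question. It prints the command rather than running it, so the order can be checked first.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): prints a tesseract command
# with all options placed before the trailing config file name(s).
build_tesseract_cmd() {
    image=$1      # input image, e.g. page.jpg
    outbase=$2    # output base name, e.g. test (the pdf config writes test.pdf)
    lang=$3       # language code, e.g. urd
    shift 3       # whatever remains is treated as config names, e.g. pdf
    # Options first, config names last.
    echo "tesseract $image $outbase --oem 1 --psm 5 -l $lang $*"
}

build_tesseract_cmd page.jpg test urd pdf
```

Once the printed command looks right, it can be copied and run as is.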