Tags: pdf, encoding, cp1252, cp1251

Changing PDF text encoding


I have a PDF document (my schoolbook). The text displays and prints normally, but when copied it comes out as seemingly random glyphs. I found that this is because the text is encoded in cp1251 but decoded as cp1252 (or vice versa, I am not sure, but the copied glyphs belong to cp1252). If I paste the copied text into a decoder that converts from cp1252 to cp1251, I get the original text back (see the picture).

[Image: decoder converting the copied text from cp1252 to cp1251]
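The round trip that such a decoder performs can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool from the picture: the function name `fix_mojibake` and the sample word are made up for the example, and it assumes the copied text really is cp1251 bytes that were read as cp1252.

```python
def fix_mojibake(garbled: str) -> str:
    """Re-interpret text that was decoded as cp1252 but is really cp1251."""
    # Step 1: turn the wrongly decoded characters back into the original bytes.
    # errors="replace" substitutes "?" for characters that cp1252 cannot encode.
    raw = garbled.encode("cp1252", errors="replace")
    # Step 2: decode those bytes with the code page they were written in.
    # errors="replace" covers the one byte (0x98) that cp1251 leaves unassigned.
    return raw.decode("cp1251", errors="replace")


# Hypothetical sample: the Russian word for "hello" as it looks when copied.
print(fix_mojibake("Ïðèâåò"))  # Привет
```

Plain ASCII text passes through unchanged, because both code pages agree on the first 128 codes.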

To solve my problem with searching and copying text I just used OCR, but maybe there is a way to change the encoding in some PDF header? I also need to copy some of the illustrations for school seminars, but Inkscape and Adobe Illustrator still output these cp1252 glyphs.

When I opened the file in Adobe Acrobat DC, it complained about the font 1251 Times. In Notepad++ I found these objects:

1146 0 obj
<<
/Ascent 756
/CapHeight 750
/Descent -195
/Flags 32
/FontBBox [-91 -224 1237 943]
/FontFamily (1251 Times)
/FontFile2 1147 0 R
/FontName /OGAHOK+1251Times
/FontStretch /Normal
/FontWeight 400
/ItalicAngle 0
/StemV 90
/Type /FontDescriptor
>>
endobj
1145 0 obj
<<
/BaseFont /OGAHOK+1251Times
/Encoding /WinAnsiEncoding
/FirstChar 32
/FontDescriptor 1146 0 R
/LastChar 255
/Subtype /TrueType
/Type /Font
/Widths [351 0 0 0 0 0 828 0 392 392 0 0 326 448 288 455 531 533 532 532 532 532 532 531 531 532 288 0 0 0 0 0 864 724 714 776 0 706 0 0 875 417 0 0 0 0 882 0 661 0 770 599 678 0 0 983 0 0 0 0 0 0 0 0 0 495 539 499 565 489 322 491 583 294 0 532 287 887 590 566 563 0 376 385 332 568 486 729 0 503 476 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 554 554 0 952 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 896 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 699 714 0 747 0 0 597 886 0 812 0 1034 875 0 877 0 776 678 729 0 0 858 0 0 0 0 0 0 759 0 0 495 559 523 434 539 489 757 449 622 622 577 550 715 636 566 622 563 499 468 503 764 500 621 553 880 880 0 760 501 517 820 546]
>>
endobj
1150 0 obj
<<
/Filter /FlateDecode
/Length1 32416
/Length 24094
>>
stream

Replacing all occurrences of 1251 with 1252 achieved nothing. What is the right way to do this? And is there a right way at all?
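For reference, the entries of a font dictionary that matter for text extraction can be pulled out with a small script. The sketch below is only an illustration: `describe_font` is a hypothetical helper, the sample is the font object from the dump above with the /Widths array left out, and a regular expression like this only works on objects that are stored uncompressed in the file.

```python
import re

# Font object 1145 from the dump above; the /Widths array is omitted for brevity.
FONT_OBJECT = """\
<<
/BaseFont /OGAHOK+1251Times
/Encoding /WinAnsiEncoding
/FirstChar 32
/FontDescriptor 1146 0 R
/LastChar 255
/Subtype /TrueType
/Type /Font
>>
"""


def describe_font(obj_text: str) -> dict:
    """Extract the name, the declared encoding, and the presence of /ToUnicode."""

    def name_after(key: str):
        # Matches "/Key /SomeName" and returns "SomeName", or None if absent.
        match = re.search(r"/" + key + r"\s*/([^\s/<>\[\]()]+)", obj_text)
        return match.group(1) if match else None

    return {
        "base_font": name_after("BaseFont"),
        "encoding": name_after("Encoding"),
        "has_to_unicode": re.search(r"/ToUnicode\b", obj_text) is not None,
    }


print(describe_font(FONT_OBJECT))
```

For this object the declared encoding is WinAnsiEncoding and there is no /ToUnicode entry. The digits 1251 appear only in the font name, which is a label.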


Solution

  • OGAHOK+1251Times (or a similar name: six random letters, a plus sign, and a font name)

    The six-letter prefix marks an embedded font subset. A file full of such subsets very often indicates that the source went through OCR (optical character recognition), so each letter, line of letters, or page of letters can have its own font. Here the font looks like Times Roman with, as you discovered, 1251-style lettering.

    So changing the name to 1252 would be like declaring that Times is Verdana: it cannot change the raw data.

    I am surprised, but pleased for you, that you can get readable text by converting between 1251 and 1252. However, a reasonable conversion inside the PDF itself would be nigh on impossible: you would have to replace one symbol at a time within potentially corrupted font metrics and still maintain the string shape (see the varying /Widths).

    However, without your original PDF file, this is based on experience rather than on a failed attempt with your source.

    [Update]

    Wow! That file has 600 fonts! Something has processed those badly.

    The problem seems to stem from the fonts being declared with /Encoding /WinAnsiEncoding (that is, cp1252) and no Unicode mapping, while the embedded glyphs are laid out as cp1251. I am looking to see whether there is any way to modify this, but I am not sure if it would help or make things worse. I can try editing the settings, but as this screenshot from Tracker PDF-XChange Editor shows, making changes there does not help unless the text is cut, converted, and pasted back.

    [Image: screenshot from Tracker PDF-XChange Editor]
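One mechanism PDF offers for telling a viewer what each character code means is a /ToUnicode CMap attached to the font. The sketch below only generates the text of such a CMap, assuming the codes in these fonts follow cp1251; the function name `build_to_unicode_cmap` is made up for this example. Attaching the result as a stream to each of the 600 fonts would need a PDF library and is not shown, and it has not been tested against this file, so treat it as an idea to try rather than a confirmed fix.

```python
def build_to_unicode_cmap(codec: str = "cp1251", first: int = 32, last: int = 255) -> str:
    """Return ToUnicode CMap text mapping single-byte codes to Unicode."""
    pairs = []
    for code in range(first, last + 1):
        try:
            char = bytes([code]).decode(codec)
        except UnicodeDecodeError:
            continue  # skip unassigned codes, e.g. 0x98 in cp1251
        pairs.append(f"<{code:02X}> <{ord(char):04X}>")

    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]
    # Entries are emitted in sections of at most 100 mappings each.
    for start in range(0, len(pairs), 100):
        chunk = pairs[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        lines.extend(chunk)
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


cmap_text = build_to_unicode_cmap()
```

With the defaults, code C0 is mapped to U+0410 (Cyrillic capital A) and so on through the cp1251 table. The glyphs and the /Widths stay untouched, since only the meaning reported for each code changes.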