python unicode character remap

remap scrambled PDF characters to readable text


I have a problem with cups-PDF creating PDF documents in which characters are mapped to strange symbols (on Ubuntu Linux 14.04 and 16.04). I think it's some kind of Unicode, even though Python tells me it is a string type: type(object) returns "string".

It makes no difference whether I grab the text out of the PDF via mouse copy and paste from evince / Firefox or via the Python PDFMiner module. So it's true: the PDF contains broken text information, even though the text is rendered correctly in the PDF document itself. I did not know this, but the text and its graphical representation in a PDF document do not seem to be bound together very tightly.

When I copy text from a PDF document created this way, the name "Raphael", for example, turns into "✡✍✑✒✍☛✓", so each single character maps as "✡=R ✍=a ✑=p ✒=h ✍=a ☛=e ✓=l".

Another example is: "Devel" turns into "✭☛✮☛✓"

How can I write a function in Python that maps this "wrong" information back to the correct text? In the PDF document everything is perfectly readable.

This has something to do with cups-PDF using PostScript to create the PDF without adding the correct font/character information to the document.

The letter 'l' is always the symbol '✓', which is the check mark Unicode character.

How can I remap the characters in this strange representation to the correct representation in Python? For example, how can I remap the symbol '✓' to the letter 'l'? Any ideas?

Why do I need this? I need to search for a text value in these documents.


Solution

  • The PDF appears to be using a specialised font to prevent copying. The text is scrambled, but so are the letters in the font. So where the letter a was once mapped to Unicode codepoint U+0061, the PDF has replaced all those a's with U+270D instead, and the special font has replaced the normal "WRITING HAND" glyph with the letter a.

    In other words, it's using a substitution cypher.

    You'll have to unscramble this like any other substitution cypher: you need to create a reverse mapping from encrypted codepoint to un-encrypted codepoint. You can use the PDF as a guide; as a human you can easily read the actual text, and you can also see how it relates to the copied Unicode codepoints.

    For example, we know that U+270D maps to U+0061:

    >>> hex(ord('✍'))
    '0x270d'
    >>> hex(ord('a'))
    '0x61'
    

    because when you copy an a from the PDF, you get the 270d codepoint instead. Simply build up a table for the rest of the alphabet. That may sound like a lot of manual work, but you already have the plaintext. Imagine not knowing what the text contains (e.g. you only had the symbols that copying the text produces); then you'd have to do a full cryptanalysis first. For a substitution cypher that means assuming a specific language and counting symbols: each language has a typical frequency distribution for its letters, and that distribution can often be matched against an encrypted body of text to map the symbols back to the original letters.
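    Building the table from text you can already read can be partly automated. The following is a minimal sketch, assuming you type in pairs of (copied symbols, text as read on the page) that line up character for character; build_mapping and known_pairs are hypothetical names, and the two pairs are the samples from the question.

```python
# Known (scrambled, readable) pairs; these two come from the question.
known_pairs = [
    ("✡✍✑✒✍☛✓", "Raphael"),
    ("✭☛✮☛✓", "Devel"),
]


def build_mapping(pairs):
    """Return a {scrambled codepoint: readable character} table."""
    mapping = {}
    for scrambled, readable in pairs:
        # The pairs must align one symbol to one letter.
        if len(scrambled) != len(readable):
            raise ValueError(f"length mismatch: {scrambled!r} vs {readable!r}")
        for s, r in zip(scrambled, readable):
            # setdefault keeps an existing entry, so a conflict surfaces here.
            if mapping.setdefault(ord(s), r) != r:
                raise ValueError(
                    f"U+{ord(s):04X} maps to both {mapping[ord(s)]!r} and {r!r}"
                )
    return mapping


mapping = build_mapping(known_pairs)
print(sorted((hex(code), letter) for code, letter in mapping.items()))
```

    The conflict check is there because a typo in a hand-typed pair would otherwise silently overwrite a correct entry.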

    Theoretically, you should be able to extract the specialised font, then analyse that to produce a translation table. This would require some form of computer vision, however; the computer won't easily know that a raster of pixels or a series of vector lines forms a specific letter. For roughly 70 codepoints (uppercase, lowercase, digits, some punctuation) it'll probably be easier to just create the table by hand.

    Once you have a table, Python can do the translation for you; I've taken your clues and created a partial table for just those letters:

    mapping = {
        0x270d: 'a',
        0x261b: 'e',
        0x2712: 'h',
        0x2713: 'l',
        0x2711: 'p',
        0x272e: 'v',
    
        0x272d: 'D',
        0x2721: 'R',
    }
    
    encrypted = "✡✍✑✒✍☛✓"  # text as copied from the PDF
    print(encrypted.translate(mapping))
    

    All you need to do is fill in the remaining mappings; the str.translate() method will then take care of the rest.
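    To see which mappings are still missing, you can count the symbols in the copied text that the table does not yet cover. This is only a sketch: unmapped_symbols is a hypothetical helper, and it assumes the scrambled characters all lie outside ASCII, which holds for the samples in the question but may not hold for the whole document.

```python
from collections import Counter


def unmapped_symbols(text, mapping):
    """Count characters that are neither in the table nor plain ASCII."""
    return Counter(
        ch for ch in text
        if ord(ch) not in mapping and ord(ch) > 0x7F
    )


# A deliberately incomplete table, to show what gets reported.
partial = {0x270d: 'a', 0x261b: 'e', 0x2713: 'l'}
sample = "✡✍✑✒✍☛✓ ✭☛✮☛✓"

# Most frequent unmapped symbols first: these are the ones worth filling in
# first, and the counts are also the raw input for a frequency analysis.
for ch, count in unmapped_symbols(sample, partial).most_common():
    print(f"U+{ord(ch):04X} {ch} x{count}")
```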

    Demo using the above partial table on your sample encrypted text samples:

    >>> print("✡✍✑✒✍☛✓".translate(mapping))
    Raphael
    >>> print("✭☛✮☛✓".translate(mapping))
    Devel
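
    Since the goal is to search these documents, here is a rough sketch of two ways to do that with the partial table: unscramble the extracted text and search it, or scramble the search term and look for it in the raw text. The names find_in_scrambled, scramble and page_text are hypothetical, and the reverse table is only valid as long as no two symbols decode to the same letter.

```python
# Partial table from the answer: scrambled codepoint -> readable letter.
mapping = {
    0x270d: 'a', 0x261b: 'e', 0x2712: 'h', 0x2713: 'l',
    0x2711: 'p', 0x272e: 'v', 0x272d: 'D', 0x2721: 'R',
}


def find_in_scrambled(text, needle, mapping):
    """Unscramble text, then return the index of needle (or -1)."""
    # The table maps one character to one character, so indexes in the
    # unscrambled text match indexes in the original text.
    return text.translate(mapping).find(needle)


# Reverse table: readable codepoint -> scrambled character.
reverse = {ord(plain): chr(code) for code, plain in mapping.items()}


def scramble(term, reverse):
    """Turn a readable search term into its scrambled form."""
    return term.translate(reverse)


page_text = "✭☛✮☛✓ ✡✍✑✒✍☛✓"  # hypothetical text copied from the PDF
print(find_in_scrambled(page_text, "Raphael", mapping))
print(scramble("Devel", reverse) in page_text)
```

    Scrambling the search term avoids translating every document, but any letter missing from the table passes through unchanged and the search then silently fails, so unscrambling the text first is the safer default.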