
Extract text from corrupt (?) pdf document


In a project I'm working on we scrape legal documents from various government sites and then make them searchable online.

Every now and then we encounter a PDF that seems to be corrupt. Here's an example of one.

If you open it in a PDF reader, it looks fine, but:

  • If you try to copy and paste it, you get corrupted text
  • If you run it through a tool like pdftotext, you get corrupted text
  • If you do just about anything else to it -- you guessed it -- you get corrupted text

Yet, if you open it in a reader, it looks fine! So I know the text is there, but something is wrong, wrong, wrong! The result is that the text looks really bad on my site.

Is there anything I can do?

Update: I did more research today. Thanks to @Andrew Cash's observation that this is essentially a Caesar cipher, I realized I could search for the documents. This link will show you about 200 of these in my system. Looking through the larger sample set, it looks like these are all created by the same software, pdffactory v. 3.51! So I blame a bug, not deliberate obfuscation.
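
For anyone who wants to flag such documents automatically instead of searching for them, here is a minimal sketch. It assumes the garbled text is dominated by characters in the Latin-1 Supplement range (U+00A1 to U+00FF), which matches what these files look like after extraction. The function name `looks_ciphered` and the 0.5 threshold are illustrative assumptions, not what was used for the search mentioned above.

```python
def looks_ciphered(text, threshold=0.5):
    """Heuristically flag text dominated by Latin-1 Supplement characters.

    Hypothetical helper: the range and the threshold are assumptions.
    """
    # Ignore whitespace so that spacing does not dilute the ratio.
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return False
    suspicious = sum(1 for ch in visible if 0xA1 <= ord(ch) <= 0xFF)
    return suspicious / len(visible) >= threshold
```

Ordinary English text with the occasional accented character should stay well under the threshold, but the cut-off would need tuning against real documents.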

Update 2: The link above won't provide any results anymore. These are purged from my system using my solution below.


Solution

  • Weary of this issue and not wanting to deal with OCR, I manually sorted out the cipher. Here she be, as a Python dict, along with some rudimentary code that I was using to test it. I'm sure this could be improved, but it does work for all letters except uppercase Q and uppercase X, which I haven't yet been able to find.

    It's missing a fair bit of punctuation too, at least for now (all of these are missing, for example: <>?{}\|!~`@#$%^_=+).

    # -*- coding: utf-8 -*-
    
    import re
    import sys
    
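    # Keys written as 'regex' are placeholders: those letters (b, c, d, h, l, q, x)
    # show up as multi-character sequences and are handled by the re.sub calls
    # after the single-character pass.
    letter_map = {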
     u'¿':'a',
     u'regex':'b',
     u'regex':'c',
     u'regex':'d',
     u'»':'e',
     u'o':'f',
     u'1':'g',
     u'regex':'h',
     u'·':'i',
     u'¶':'j',
     u'μ':'k',
     u'regex':'l',
     u'3':'m',
     u'2':'n',
     u'±':'o',
     u'°':'p',
     u'regex':'q',
     u'®':'r',
     u'-':'s',
     u'¬':'t',
     u'«':'u',
     u'a':'v',
     u'©':'w',
     u'regex':'x',
     u'§':'y',
     u'¦':'z',
     u'ß':'A',
     u'Þ':'B',
     u'Ý':'C',
     u'Ü':'D',
     u'Û':'E',
     u'Ú':'F',
     u'Ù':'G',
     u'Ø':'H',
     u'×':'I',
     u'Ö':'J',
     u'Õ':'K',
     u'Ô':'L',
     u'Ó':'M',
     u'Ò':'N',
     u'Ñ':'O',
     u'Ð':'P',
     u'':'Q', # Missing
     u'Î':'R',
     u'Í':'S',
     u'Ì':'T',
     u'Ë':'U',
     u'Ê':'V',
     u'É':'W',
     u'':'X', # Missing
     u'Ç':'Y',
     u'Æ':'Z',
     u'ð':'0',
     u'ï':'1',
     u'î':'2',
     u'í':'3',
     u'ì':'4',
     u'ë':'5',
     u'ê':'6',
     u'é':'7',
     u'è':'8',
     u'ç':'9',
     u'ò':'.',
     u'ô':',',
     u'æ':':',
     u'å':';',
     u'Ž':"'",
     u'•':"'",
     u'•':"'", # s/b double quote, but identical to single.
     u'Œ':"'", # s/b double quote, but identical to single.
     u'ó':'-', # dash
     u'Š':'-', # n-dash
     u'‰':'--', # em-dash
     u'ú':'&',
     u'ö':'*',
     u'ñ':'/',
     u'÷':')',
     u'ø':'(',
     u'Å':'[',
     u'Ã':']',
     u'‹':'•',
     }
    
    ciphertext = u'''YOUR STUFF HERE'''
    
    # Single-character pass: anything without a table entry is kept unchanged.
    plaintext = u''.join(letter_map.get(letter, letter) for letter in ciphertext)
    
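    # These are multi-length replacements. The patterns match the output of the
    # single-character pass (for example, '3' has already become 'm'), so they
    # must run after it.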
    plaintext = re.sub(u'm⁄4', 'b', plaintext)
    plaintext = re.sub(u'g⁄n', 'c', plaintext)
    plaintext = re.sub(u'g⁄4', 'd', plaintext)
    plaintext = re.sub(u' ́', 'l', plaintext)
    plaintext = re.sub(u' ̧', 'h', plaintext)
    plaintext = re.sub(u' ̈', 'x', plaintext)
    plaintext = re.sub(u' ̄u', 'qu', plaintext)
    
    for letter in plaintext:
        try:
            sys.stdout.write(letter)
        except UnicodeEncodeError:
            continue
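
One pattern in the table may help with the gaps: for most single-character entries, the code point of the cipher character and the code point of the plain character add up to 0x120 (for example `¿` is 0xBF and `a` is 0x61). The sketch below is based only on that observation. The names `OFFSET`, `predicted_cipher_char` and `decode_raw` are hypothetical. It assumes the text still holds the raw Latin-1 characters, so it would not apply to output where a tool has already expanded `¾` into `3⁄4` and the like; for that, the table above is still needed. The predictions for Q and X are untested guesses.

```python
# Hypothetical helpers based on a pattern observed in the table above.
OFFSET = 0x120  # observed: ord(cipher) + ord(plain) == 0x120


def predicted_cipher_char(plain):
    """Return the cipher character the pattern predicts for an ASCII character."""
    return chr(OFFSET - ord(plain))


def decode_raw(text):
    """Decode text whose cipher characters are still raw Latin-1 code points."""
    out = []
    for ch in text:
        code = OFFSET - ord(ch)
        # Translate only when the result is printable ASCII (no space);
        # everything else, including spaces and plain ASCII, is left alone.
        out.append(chr(code) if 0x21 <= code <= 0x7E else ch)
    return ''.join(out)


if __name__ == '__main__':
    # Candidates for the two letters missing from the table (unverified).
    print(predicted_cipher_char('Q'), predicted_cipher_char('X'))
```

If the guesses hold, the same rule would also predict the missing punctuation, but each prediction should be checked against a real document before it goes into the table.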