
Reading a doc file in memory


I have a JSON file that stores various file types (e.g., PDF, DOCX, DOC) in base64 format. I have been able to convert the PDF and DOCX files and read their content by handling them in memory, rather than converting them into physical files first and then reading those. However, I am unable to do this with DOC files.

Can someone point me in the right direction? I'm on Windows and have tried textract, but I cannot get the library to work. I am open to other solutions.

# This works using a docx file
import base64
from io import BytesIO
import docx2txt

resume = df.iloc[180]['Candidate_Resume_Attachment_Base64_Image']
resume_bytes = resume.encode('ascii')
decoded = base64.decodebytes(resume_bytes)
result = BytesIO()
result.write(decoded)
docxReader = docx2txt.process(result)

# This does not work using a doc file
import win32com.client as win32

message = df.iloc[361]['Candidate_Resume_Attachment_Base64_Image']
resume_bytes = message.encode('ascii')
decoded = base64.decodebytes(resume_bytes)
result = BytesIO()
result.write(decoded)
word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(result)  # fails here: Open expects a file path, not a file-like object

#error:
    ret = self._oleobj_.InvokeTypes(19, LCID, 1, (13, 0), ((16396, 1), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17), (16396, 17)),FileName

com_error: (-2147352571, 'Type mismatch.', None, 16)
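
I gather the type mismatch is because Documents.Open expects a FileName path on disk rather than a file-like object. Writing the bytes to a temporary file does work (rough sketch below; the temp-file handling is only illustrative), but creating a physical file is exactly what I'm trying to avoid:

import base64
import os
import tempfile
import win32com.client as win32

message = df.iloc[361]['Candidate_Resume_Attachment_Base64_Image']
decoded = base64.decodebytes(message.encode('ascii'))

# Word COM needs a real path, so spill the bytes to a temporary .doc file
with tempfile.NamedTemporaryFile(suffix='.doc', delete=False) as tmp:
    tmp.write(decoded)
    tmp_path = tmp.name

word = win32.gencache.EnsureDispatch('Word.Application')
word.Visible = False
try:
    doc = word.Documents.Open(tmp_path)
    text = doc.Content.Text   # full document text
    doc.Close(False)          # close without saving
finally:
    word.Quit()
    os.remove(tmp_path)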

Solution

  • In case anyone else needs to read doc files in memory, this is my hacky solution until I find a better one.

    1) Read the .doc file using the olefile library, which yields the text mixed in with other Unicode characters. 2) Use regex to capture the text.

            import base64
            import re
            from io import BytesIO
            import olefile

            # Retrieve the base64 string and decode it into bytes, in this case from a df
            message = row['text']
            text_bytes = message.encode('ascii')
            decoded = base64.decodebytes(text_bytes)

            # Write the bytes in memory and rewind before reading
            result = BytesIO()
            result.write(decoded)
            result.seek(0)

            # Open the OLE container and read the raw WordDocument stream
            ole = olefile.OleFileIO(result)
            y = ole.openstream('WordDocument').read()
            y = y.decode('latin-1', errors='ignore')

            # Replace every character that is not a space or in the Unicode ranges
            # below (all Latin characters) with an asterisk. The commas inside the
            # character class are literal, so commas in the text also survive. This
            # can probably be shortened using a pattern similar to the next step.
            y = re.sub(r'[^\x0A,\u00c0-\u00d6,\u00d8-\u00f6,\u00f8-\u02af,\u1d00-\u1d25,\u1d62-\u1d65,\u1d6b-\u1d77,\u1d79-\u1d9a,\u1e00-\u1eff,\u2090-\u2094,\u2184-\u2184,\u2488-\u2490,\u271d-\u271d,\u2c60-\u2c7c,\u2c7e-\u2c7f,\ua722-\ua76f,\ua771-\ua787,\ua78b-\ua78c,\ua7fb-\ua7ff,\ufb00-\ufb06,\x20-\x7E]', r'*', y)

            # Isolate the body of the text from the rest of the gibberish
            p = re.compile(r'\*{300,433}((?:[^*]|\*(?!\*{14}))+?)\*{15,}')
            result = re.findall(p, y)

            # Remove any '*' left inside the capture group
            result = result[0].replace('*', '')
    

    For me, it was important that accented characters were not lost during decoding, and since my documents are in English, Spanish, and Portuguese, I opted to decode using latin-1. From there I used regex patterns to identify the text I needed. After decoding, I found that in all of my documents the capture group is preceded by roughly 400 '*' characters and a ':'. I'm unsure whether this is the norm for all doc documents decoded this way, but I used it as a starting point for a regex pattern that isolates the needed text from the rest of the gibberish.
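
    To make this reusable, here is the same pipeline wrapped in a helper (a sketch of my approach; the function name and DataFrame column are placeholders for your own data):

            import base64
            import re
            from io import BytesIO
            import olefile

            # Unicode ranges kept by the filter step above; everything else becomes '*'
            KEEP = re.compile(r'[^\x0A,\u00c0-\u00d6,\u00d8-\u00f6,\u00f8-\u02af,\u1d00-\u1d25,\u1d62-\u1d65,\u1d6b-\u1d77,\u1d79-\u1d9a,\u1e00-\u1eff,\u2090-\u2094,\u2184-\u2184,\u2488-\u2490,\u271d-\u271d,\u2c60-\u2c7c,\u2c7e-\u2c7f,\ua722-\ua76f,\ua771-\ua787,\ua78b-\ua78c,\ua7fb-\ua7ff,\ufb00-\ufb06,\x20-\x7E]')
            # The body text sits between a long run of '*' and the next run of 15+
            BODY = re.compile(r'\*{300,433}((?:[^*]|\*(?!\*{14}))+?)\*{15,}')

            def doc_base64_to_text(b64_string):
                """Decode a base64 .doc payload and return the body text, or None."""
                stream = BytesIO(base64.decodebytes(b64_string.encode('ascii')))
                ole = olefile.OleFileIO(stream)
                raw = ole.openstream('WordDocument').read().decode('latin-1', errors='ignore')
                masked = KEEP.sub('*', raw)
                matches = BODY.findall(masked)
                return matches[0].replace('*', '') if matches else None

            # e.g. applied to a DataFrame column (column name is illustrative)
            # df['resume_text'] = df['Candidate_Resume_Attachment_Base64_Image'].apply(doc_base64_to_text)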