Extract text from .ppt file in Python on Serverless Platform

I'm trying to extract the text from both .ppt and .pptx files using Python on a serverless platform. I currently handle .pptx files with python-pptx, but that package doesn't support .ppt files. I know the usual approach is to use win32com to open the PowerPoint application and convert them, but that isn't available on the serverless platform. Are there any non-API solutions I could use to extract the text from the file?

Solution

I managed to find a solution to this. It's certainly not optimal, but it did the job. I hope it helps anyone who has been stuck on the same issue.

import re

import olefile

def remove_non_printable_characters(input_string):
    # Strip everything outside the printable ASCII range (space through tilde).
    return re.sub('[^\x20-\x7E]', '', input_string)

ppt_file_path = 'path/to/file.ppt'
# Read the raw bytes of the main presentation stream from the OLE container.
with olefile.OleFileIO(ppt_file_path) as ole:
    bin_data = ole.openstream("PowerPoint Document").read()

text_data = bin_data.decode('utf-8', errors='replace')
# Candidate strings start with an alphanumeric character and sit between pairs of NUL characters.
all_text = re.findall('\x00\x00[a-zA-Z0-9].*?\x00\x00', text_data)
all_text = [x.replace('\x00\x00', '') for x in all_text if x != '\x00\x00\x00\x00']
# Keep only candidates that contain no non-printable characters.
all_text = [x for x in all_text if len(x) <= len(remove_non_printable_characters(x))]
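
A quick way to see what this snippet keeps and what it drops is to run the same logic on synthetic bytes. The sketch below wraps it in a hypothetical helper, `extract_between_nul_pairs`, and the two byte strings are made-up stand-ins rather than real .ppt records. The second one is encoded as UTF-16LE, which the .ppt format can use for slide text, so it is worth checking how the approach treats it.

```python
import re

def extract_between_nul_pairs(bin_data):
    """Hypothetical helper: the same logic as the snippet above, applied to raw bytes."""
    text_data = bin_data.decode('utf-8', errors='replace')
    candidates = re.findall('\x00\x00[a-zA-Z0-9].*?\x00\x00', text_data)
    candidates = [c.replace('\x00\x00', '') for c in candidates]
    # Keep only candidates made up entirely of printable ASCII.
    return [c for c in candidates if not re.search('[^\x20-\x7E]', c)]

# Synthetic stand-ins for stream content (NOT a real .ppt record layout).
single_byte = b'\x00\x00' + b'Quarterly results' + b'\x00\x00'
utf16_text = b'\x00\x00' + 'Hello'.encode('utf-16-le') + b'\x00\x00'

print(extract_between_nul_pairs(single_byte))  # ['Quarterly results']
print(extract_between_nul_pairs(utf16_text))   # []
```

The UTF-16LE string is discarded because a NUL byte follows every character, and the final printable-only filter rejects any candidate that still contains NULs.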

EDIT: It turns out this doesn't get all the text within a .ppt file, so I repurposed code from oledump instead. Once you have the text, you can use text-cleaning methods to get it into the form you want :)


import olefile
import re

# Tab plus printable ASCII (space through tilde).
REGEX_STANDARD = b'[\x09\x20-\x7E]'

def extract_unicode(data):
    # UTF-16LE strings: runs of 4+ printable characters, each followed by a NUL byte.
    regex = b'((' + REGEX_STANDARD + b'\x00){%d,})'
    return [found.replace(b'\x00', b'') for found, _ in re.findall(regex % 4, data)]

def extract_ascii(data):
    # Single-byte strings: runs of 4+ printable characters.
    regex = REGEX_STANDARD + b'{%d,}'
    return re.findall(regex % 4, data)

# Read the main presentation stream, split into chunks on newline bytes.
with olefile.OleFileIO('/path/to/ppt.ppt') as ole:
    read = ole.openstream('PowerPoint Document').readlines()

all_strings = []
for r in read:
    # Collect both the single-byte and the UTF-16LE strings found in this chunk.
    extracted = extract_ascii(r) + extract_unicode(r)
    all_strings.append(' '.join(x.decode('utf-8', errors='replace') for x in extracted))

my_str = ' '.join(all_strings)
print(my_str)
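
As for the text cleaning mentioned in the edit, here is a minimal sketch of one way to do it. The helper `clean_extracted_strings` and its `min_letters` threshold are hypothetical and not part of oledump or olefile. It assumes the extracted strings are ASCII, which holds for the regexes above, and that anything with fewer than three letters is noise. Tune that to your files.

```python
import re

def clean_extracted_strings(strings, min_letters=3):
    """Trim, filter, and de-duplicate candidate strings, preserving their order."""
    seen = set()
    cleaned = []
    for s in strings:
        # Collapse tabs and repeated spaces, then trim the ends.
        s = re.sub(r'\s+', ' ', s).strip()
        # Require a few letters so runs like '____' or '0001' are dropped.
        if len(re.findall(r'[A-Za-z]', s)) < min_letters:
            continue
        # Skip strings that have already been kept.
        if s in seen:
            continue
        seen.add(s)
        cleaned.append(s)
    return cleaned

sample = ['Title  Slide', '____', 'Title Slide', '\tAgenda ', '0001']
print(clean_extracted_strings(sample))  # ['Title Slide', 'Agenda']
```

To use it with the code above, clean the individual decoded strings before joining them, since they can no longer be filtered one by one after the join.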