Extract text from .ppt file in Python on Serverless Platform


I'm trying to extract the text from both .ppt and .pptx files using Python on a serverless platform. The .pptx files are currently handled with python-pptx, but that package doesn't support .ppt files. I'm aware that you can usually use win32com to drive the PowerPoint application and convert them, but that isn't available on the serverless platform. Are there any non-API solutions I could use to extract the text from the file?
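
Since both formats have to be handled, one way to route a file to the right extractor without trusting its extension is to look at its first bytes. The sketch below is only an illustration: the function name `sniff_container` and the return labels are made up for this example. It identifies the container type only (an OLE2 compound file could also be a .doc or .xls, and a ZIP could be any archive), so treat the result as a hint.

```python
# OLE2 compound files (legacy .ppt/.doc/.xls) start with this 8-byte signature.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# ZIP archives (which .pptx files are) start with a local file header signature.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(header: bytes) -> str:
    """Guess the container type from the first bytes of a file.

    Hypothetical helper: returns "ole", "zip" or "unknown".
    """
    if header.startswith(OLE_MAGIC):
        return "ole"   # candidate for olefile-based extraction
    if header.startswith(ZIP_MAGIC):
        return "zip"   # candidate for python-pptx
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of the file to classify it."""
    with open(path, "rb") as f:
        return sniff_container(f.read(8))
```

Calling `sniff_file(path)` before choosing between python-pptx and a raw OLE approach avoids relying on a possibly wrong file extension.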


Solution

  • I managed to find a solution to this. It's certainly not optimal, but it did the job. Hope it helps anyone who has been stuck on the same issue.

    import olefile
    import re

    ppt_file_path = 'path/to/file.ppt'

    # The slide content lives in the "PowerPoint Document" stream of the OLE container.
    with olefile.OleFileIO(ppt_file_path) as ole:
        bin_data = ole.openstream("PowerPoint Document").read()

    def remove_non_printable_characters(input_string):
        # Drop everything outside printable ASCII (0x20-0x7E).
        return re.sub('[^\x20-\x7E]', '', input_string)

    text_data = bin_data.decode('utf-8', errors='replace')
    # Candidate strings: runs that start with an alphanumeric and sit between double NULs.
    all_text = re.findall('\x00\x00[a-zA-Z0-9].*?\x00\x00', text_data)
    all_text = [x.replace('\x00\x00', '') for x in all_text if x != '\x00\x00\x00\x00']
    # Keep only candidates made up entirely of printable characters.
    all_text = [x for x in all_text if len(x) <= len(remove_non_printable_characters(x))]
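
For context on what a byte-level scan like this is up against: as far as I can tell, text in the stream can show up either as one byte per character or as UTF-16LE, where every ASCII character is followed by a `\x00` byte. The sketch below uses synthetic bytes (not a real .ppt stream) to show how the two layouts look to a regex. The patterns are illustrative assumptions, with a minimum run length of 4 chosen arbitrarily.

```python
import re

# Synthetic bytes, NOT a real .ppt stream: binary junk around one
# single-byte string and one UTF-16LE string.
blob = (
    b"\x01\x02"
    + b"Title text"                      # one byte per character
    + b"\xff\xfe\x03"
    + "Body text".encode("utf-16-le")    # each character followed by \x00
    + b"\x04\x05"
)

# Runs of 4+ printable single-byte characters (tab or 0x20-0x7E).
ascii_runs = re.findall(rb"[\x09\x20-\x7E]{4,}", blob)

# Runs of 4+ printable characters that are each followed by a NUL byte.
wide_matches = re.findall(rb"(?:[\x09\x20-\x7E]\x00){4,}", blob)
wide_runs = [m.decode("utf-16-le") for m in wide_matches]

print(ascii_runs)  # the single-byte string only
print(wide_runs)   # the UTF-16LE string only
```

Each pattern finds only one of the two strings, so a scan that looks for just one layout will miss whatever text is stored in the other.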
    

    EDIT: It turns out this doesn't get all the text within a .ppt file, so I repurposed some code from oledump instead. Once you have the text, you can apply whatever text cleaning you need :)

    
    import olefile
    import re

    # Tab plus printable ASCII (0x20-0x7E).
    REGEX_STANDARD = b'[\x09\x20-\x7E]'

    def extract_unicode(data):
        # UTF-16LE-style runs: at least 4 printable characters, each followed by a NUL byte.
        regex = b'((' + REGEX_STANDARD + b'\x00){%d,})'
        return [found.replace(b'\x00', b'') for found, _ in re.findall(regex % 4, data)]

    def extract_ascii(data):
        # Single-byte runs of at least 4 printable characters.
        regex = REGEX_STANDARD + b'{%d,}'
        return re.findall(regex % 4, data)

    with olefile.OleFileIO('/path/to/ppt.ppt') as ole:
        lines = ole.openstream('PowerPoint Document').readlines()

    all_text = []
    for line in lines:
        # The original referenced an undefined name here; use one variable for both result lists.
        extracted = extract_ascii(line) + extract_unicode(line)
        all_text.append(' '.join(x.decode('utf-8', errors='replace') for x in extracted))

    my_str = ' '.join(all_text)
    print(my_str)
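
As for the text cleaning mentioned above, here is a minimal sketch of one possible pass over the extracted strings. It assumes you keep the strings as a list rather than only the joined `my_str`. The helper name `clean_fragments` and the `min_letters` threshold are hypothetical choices for illustration, not part of olefile or oledump.

```python
import re


def clean_fragments(fragments, min_letters=3):
    """Normalise whitespace, drop fragments with few letters, de-duplicate in order.

    Hypothetical helper; min_letters is an arbitrary threshold to tune.
    """
    seen = set()
    cleaned = []
    for fragment in fragments:
        # Collapse tabs and repeated spaces into single spaces.
        text = re.sub(r"\s+", " ", fragment).strip()
        # Binary noise that happens to be printable rarely contains many letters.
        if sum(ch.isalpha() for ch in text) < min_letters:
            continue
        # The same string can be matched more than once; keep the first.
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


sample = ["  Quarterly   results ", "Quarterly results", "##1", "\tAgenda"]
print(clean_fragments(sample))
```

A filter like this will not remove every false positive (font names and other metadata also look like text), so the threshold and any extra rules depend on what your files contain.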